Identifying “problematic” record linkage clusters using graph measures

Tony Stone, University College London / University of Sheffield

5 Sep 2024

Context

Record linkage

The process of identifying records belonging to the same individual (or entity) from a set or sets of textual records.

  • Also known as entity resolution / de-duplication

  • Often involves manual review to identify false links / missed links

Wider aims

  • Desire to reduce (resource intensive) manual review:
    • automate identification of false links / missed links

Definitions for this presentation (1/2)

  • record linkage: pairwise comparison of records \(\rightarrow\) transitive closure \(\rightarrow\) clusters

  • transitive closure:

G A A B B A--B C C A--C ?? B--C

  • cluster: set of records connected by comparisons yielding match score above a linkage threshold.

Definitions for this presentation (2/2)

  • “problematic” cluster: a cluster with one or more false positive links

  • match score: Ratio of products of conditional probabilities of observing in/equality of an attribute in two records given true matches (\(M\)) over true non-matches (\(U\)).

\[ R_i = \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid U)} = \frac{\prod_j P(\gamma_{ij} \mid M)}{\prod_j P(\gamma_{ij} \mid U)} \]

We assume conditional independence amongst compared attributes.

Graph representation of pairwise record linkage outputs

Each cluster an undirected simple graph, where:

  • vertices \(\leftrightarrow\) records
  • edges \(\leftrightarrow\) comparison of records

G A Record 1 B Record 2 A--B Comparison of Records 1 & 2

Graph conceptualisations

  1. No weighted graph measures
  2. Weighted graph: match score (above linkage threshold)
  3. Multiple weighted graphs:
    • match score
    • each compared attributes’ contribution

G1 A A B B A--B C C B--C

G2 A A B B A--B 0.5 C C B--C 0.9

G3 A A B B A--B (0.5, 0.6, 0.4) C C B--C (0.9, 0.9, 0.8)

Graphs measures investigated

  • 5 measures: Diameter; Global clustering coefficient; Averaged Local Clustering Coefficient; Assortativity (degree); Density
  • Measures of vertex: connectedness, clustering, mixing
  • Weights only used for diameter

A binary classification problem

Identify clusters with false positive links

  • Units: clusters (of three or more records)
  • Features: the graph measures from the outputs of a record linkage process
  • Outcomes:
    1. No false positive links
    2. One or more false positive links

Underlying Data

Wikidata: Information on individuals appearing in Wikipedia

  • Date of death between: 1953 - 2000
  • 5 fields:
    • Unique entity identifier (assigned by Wikidata)
    • Given name
    • Family name
    • Date of birth
    • Gender
  • 311,915 records
  • no missing values

Test Datasets

  • 5 test datasets per configuration
  • Probabilistically created using all ~312k records

Configurations

  • Duplication of records
    • Power law distribution
    • \(\{\underline{49}, 99\}\) maximum duplicates
  • Record corruption:
    • Base case
    • \(\{0.5, 2, 5\} \times\) base case corruption
attribute corruption mechanism Base case probability of corruption (%)
given name QWERTY typographical error 5.0
family name QWERTY typographical error 5.0
date of birth Number pad typographical error 1.0
gender Swapped 0.5

Analysis datasets generation

  • Record linkage (de-duplication) and clustering on each test dataset
    • Linkage threshold chosen to give highest F-measure
  • Excluded clusters with:
    • fewer than 3 records
    • more than 50/100 records
  • 0.7% - 1.5% clusters had one or more false positive link
  • Calculated graph measures

Processing summary

G data underlying data corrupt Duplicate and corrupt x configurations x 5 data->corrupt test test datasets x configurations x 5 link De-duplicate and cluster x configurations x 5 test->link analysis analysis datasets x configurations x 5 measure Calculate graph measures x configurations x 5 analysis->measure eval results corrupt->test link->analysis classify Classifier (train/evaluate) measure->classify classify->eval classify->classify Train: base case only

Example clusters (graphs)

A cluster of 1 entity

A “problematic” cluster of 2 entities

Internal cross-validation

Quasi-external validation

Feature importance

TL; DL

Graph measure based classifiers appear…

  • Promising
  • Insensitive to number of duplications

But:

  • Require training data
  • Sensitive to data corruption rates

Ample room for further research…

  • Especially, contribution to match score from individual attributes