Identifying “problematic” record linkage clusters using graph measures

Tony Stone, University College London / University of Sheffield

5 Sep 2024

Context

Record linkage: the process of identifying records that belong to the same individual (or entity) within one or more sets of textual records.

Desire to reduce (resource-intensive) manual review:
- automate identification of false links / ~~missed links~~

record linkage: pairwise comparison of records \(\rightarrow\) transitive closure \(\rightarrow\) clusters
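transitive closure: if record A links to B and B links to C, then A, B and C are placed in the same cluster, even if A and C were never linked directly.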

cluster: a set of records connected by comparisons yielding a match score above the linkage threshold.
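The step from pairwise links to clusters can be sketched with a union-find structure. This is a minimal illustration only, not the linkage software used in the study; the function name `cluster_records`, the record identifiers and the scores are hypothetical.

```python
from collections import defaultdict


def cluster_records(scored_pairs, threshold):
    """Group records into clusters: transitive closure of above-threshold links.

    scored_pairs: iterable of (record_a, record_b, match_score) tuples.
    Returns a sorted list of clusters, each a sorted list of record ids.
    """
    parent = {}

    def find(x):
        # Register unseen records as their own root, then follow parents.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        root_a, root_b = find(a), find(b)
        # Only comparisons scoring above the linkage threshold form a link.
        if score > threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for record in list(parent):
        clusters[find(record)].add(record)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical comparisons: r3-r4 falls below the threshold, so no link.
example_pairs = [
    ("r1", "r2", 12.0),
    ("r2", "r3", 9.5),
    ("r3", "r4", 1.2),
    ("r4", "r5", 15.0),
]
example_clusters = cluster_records(example_pairs, threshold=5.0)
```

Here r1 and r3 end up in the same cluster through r2 although they were never linked directly, which is how a single false link can merge two individuals.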

“problematic” cluster: a cluster with one or more false positive links
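match score: the ratio of two products of conditional probabilities, one factor per compared attribute: the probability of observing the agreement or disagreement (in/equality) of that attribute in two records given a true match (\(M\)), over the same probability given a true non-match (\(U\)).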

\[ R_i = \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid U)} = \frac{\prod_j P(\gamma_{ij} \mid M)}{\prod_j P(\gamma_{ij} \mid U)} \]

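Here \(\gamma_{ij}\) is the comparison outcome (agreement or disagreement) on attribute \(j\) for record pair \(i\). The second equality assumes conditional independence amongst compared attributes.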
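The ratio above can be computed attribute by attribute. The sketch below assumes binary agree/disagree comparisons; the function name `match_score`, the attribute names and the probabilities are illustrative assumptions, not values from the study.

```python
def match_score(agreements, m_probs, u_probs):
    """Likelihood ratio R for one record pair under conditional independence.

    agreements: dict attribute -> True if the two records agree on it.
    m_probs[attr]: P(agree on attr | true match, M).
    u_probs[attr]: P(agree on attr | true non-match, U).
    """
    ratio = 1.0
    for attr, agrees in agreements.items():
        m, u = m_probs[attr], u_probs[attr]
        if agrees:
            ratio *= m / u
        else:
            # Disagreement uses the complementary probabilities.
            ratio *= (1.0 - m) / (1.0 - u)
    return ratio


# Hypothetical m- and u-probabilities for two attributes.
m_example = {"given_name": 0.9, "family_name": 0.9}
u_example = {"given_name": 0.1, "family_name": 0.1}
score_both_agree = match_score(
    {"given_name": True, "family_name": True}, m_example, u_example
)
score_one_disagrees = match_score(
    {"given_name": True, "family_name": False}, m_example, u_example
)
```

Each attribute contributes one factor to the product, so agreement raises the score and disagreement lowers it.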

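Each cluster is an undirected simple graph, where:
- vertices are records
- edges are pairwise comparisons with a match score above the linkage threshold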

No weighted graph measures
Weighted graph: match score (above linkage threshold)
Multiple weighted graphs:
- match score
- each compared attribute’s contribution

5 measures: diameter; global clustering coefficient; average local clustering coefficient; assortativity (degree); density
Measures of vertex connectedness, clustering and mixing
Weights only used for diameter
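The five measures can be sketched in plain Python for a single cluster. This is an illustration under assumptions: the function name `graph_measures` is hypothetical, the cluster is assumed to be connected with at least two records, and the diameter is computed unweighted because the slides do not state how match scores enter the distance.

```python
from collections import deque
from itertools import combinations


def graph_measures(edges):
    """Unweighted measures for one connected cluster given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2

    # Density: observed edges over possible edges.
    density = 2 * m / (n * (n - 1))

    # Clustering: closed versus all connected triples centred on each vertex.
    closed = total = 0
    local = []
    for nb in adj.values():
        k = len(nb)
        triples = k * (k - 1) // 2
        linked = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
        closed += linked
        total += triples
        local.append(linked / triples if triples else 0.0)
    global_cc = closed / total if total else 0.0
    avg_local_cc = sum(local) / n

    # Diameter: longest shortest path, via breadth-first search from each vertex.
    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        return max(dist.values())

    diameter = max(eccentricity(v) for v in adj)

    # Degree assortativity: Pearson correlation of the degrees at the two
    # ends of each edge, with every edge counted in both directions.
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    assortativity = cov / var if var else float("nan")  # undefined if all degrees equal

    return {
        "density": density,
        "global_clustering": global_cc,
        "avg_local_clustering": avg_local_cc,
        "diameter": diameter,
        "degree_assortativity": assortativity,
    }


# Hypothetical cluster: a triangle (a, b, c) with one pendant record d.
example_measures = graph_measures([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

A graph library would normally be used for this; the hand-written version only shows what each measure counts.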

Identify clusters with false positive links
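On test data with known truth, the target label for each cluster follows directly from the definition of a “problematic” cluster. A minimal sketch, assuming a hypothetical mapping `true_entity` from record id to the true individual:

```python
def is_problematic(cluster_edges, true_entity):
    """True if any link in the cluster joins records of different individuals.

    cluster_edges: iterable of (record_a, record_b) links within one cluster.
    true_entity: dict record id -> true individual id (ground truth).
    """
    return any(true_entity[a] != true_entity[b] for a, b in cluster_edges)


# Hypothetical ground truth: r3 belongs to a different individual.
truth = {"r1": "Q1", "r2": "Q1", "r3": "Q2"}
clean_cluster = [("r1", "r2")]
merged_cluster = [("r1", "r2"), ("r2", "r3")]
labels = [is_problematic(clean_cluster, truth), is_problematic(merged_cluster, truth)]
```

Because a cluster is connected, it contains a false positive link exactly when it contains records of more than one true individual.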

Wikidata: Information on individuals appearing in Wikipedia

Duplication of records
- Power law distribution
- \(\{\underline{49}, 99\}\) maximum duplicates

Attribute	Corruption mechanism	Base case probability of corruption (%)
given name	QWERTY typographical error	5.0
family name	QWERTY typographical error	5.0
date of birth	number pad typographical error	1.0
gender	swapped	0.5

Record linkage (de-duplication) and clustering on each test dataset
- Linkage threshold chosen to give highest F-measure
Excluded clusters with:
- fewer than 3 records
- more than 50 or 100 records (for datasets with a maximum of 49 or 99 duplicates, respectively)
0.7% to 1.5% of clusters had one or more false positive links
Calculated graph measures
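The threshold choice in the steps above can be sketched as a sweep over candidate values. This assumes a pairwise F-measure computed on scored record pairs with known match status; the slides do not say how the F-measure was evaluated, and the names `f_measure` and `best_threshold` and the data are hypothetical.

```python
def f_measure(scored_pairs, threshold):
    """Pairwise F-measure (harmonic mean of precision and recall).

    scored_pairs: iterable of (match_score, is_true_match) tuples.
    A pair is linked when its score is above the threshold.
    """
    pairs = list(scored_pairs)
    tp = sum(1 for s, truth in pairs if s > threshold and truth)
    fp = sum(1 for s, truth in pairs if s > threshold and not truth)
    fn = sum(1 for s, truth in pairs if s <= threshold and truth)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs, candidates):
    """Return the candidate threshold giving the highest F-measure."""
    pairs = list(scored_pairs)
    return max(candidates, key=lambda t: f_measure(pairs, t))


# Hypothetical scored pairs with known match status.
labelled_pairs = [(10.0, True), (8.0, True), (6.0, False), (4.0, True), (2.0, False)]
chosen = best_threshold(labelled_pairs, candidates=[1, 3, 5, 7, 9])
```

Even at the best threshold some false links remain (here the pair scoring 6.0), which is why clusters still need checking afterwards.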

But:

Ample room for further research…