5 Sep 2024
Record linkage: the process of identifying records that belong to the same individual (or entity) within one or more sets of textual records.
Also known as entity resolution / de-duplication
Often involves manual review to identify false links / missed links
record linkage: pairwise comparison of records \(\rightarrow\) transitive closure \(\rightarrow\) clusters
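transitive closure: if record A is linked to B and B is linked to C, then A, B and C all end up in the same cluster, even if A and C were never linked directly.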
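A minimal sketch of how pairwise links can be turned into clusters by transitive closure, using a union-find structure. The function name `cluster_records` and the example record IDs are hypothetical, and this is one common way to do it rather than necessarily the method used in this work.

```python
from collections import defaultdict


def cluster_records(record_ids, links):
    """Group records into clusters via the transitive closure of pairwise links."""
    # Every record starts as the root of its own one-element cluster.
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each accepted link merges the two clusters its endpoints belong to.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for r in record_ids:
        clusters[find(r)].add(r)
    return sorted(sorted(c) for c in clusters.values())


# "a" and "c" are never compared directly, but both link to "b".
example_clusters = cluster_records(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
print(example_clusters)  # [['a', 'b', 'c'], ['d', 'e']]
```

Note that a single false positive link is enough to merge two otherwise unrelated clusters, which is why transitive closure can produce the "problematic" clusters discussed here.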
“problematic” cluster: a cluster with one or more false positive links
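match score: the ratio of two products of conditional probabilities, one factor per compared attribute, of observing equality or inequality of that attribute in two records, given that the pair is a true match (\(M\)) in the numerator and a true non-match (\(U\)) in the denominator.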
\[ R_i = \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid U)} = \frac{\prod_j P(\gamma_{ij} \mid M)}{\prod_j P(\gamma_{ij} \mid U)} \]
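Here \(\gamma_{ij}\) is the comparison outcome for attribute \(j\) in record pair \(i\). The product form assumes conditional independence amongst the compared attributes.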
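As a rough illustration of the match score, the sketch below computes \(R_i\) for one record pair from per-attribute probabilities. The function name `match_score`, the dictionaries `m_probs` and `u_probs` (the probability of agreement given \(M\) and given \(U\)), and all numeric values are hypothetical and chosen only for the example.

```python
import math


def match_score(agreements, m_probs, u_probs):
    """Likelihood ratio R for one record pair, assuming conditional independence.

    agreements: attribute -> True if the two records agree on it.
    m_probs:    attribute -> P(agree | M); P(disagree | M) is 1 minus that.
    u_probs:    attribute -> P(agree | U); P(disagree | U) is 1 minus that.
    """
    log_r = 0.0
    for attr, agrees in agreements.items():
        m, u = m_probs[attr], u_probs[attr]
        if agrees:
            log_r += math.log(m / u)
        else:
            log_r += math.log((1 - m) / (1 - u))
    # Summing logs is equivalent to multiplying the per-attribute ratios.
    return math.exp(log_r)


# Hypothetical probabilities, for illustration only.
m_probs = {"given_name": 0.95, "family_name": 0.95, "date_of_birth": 0.99}
u_probs = {"given_name": 0.01, "family_name": 0.005, "date_of_birth": 0.001}
pair = {"given_name": True, "family_name": True, "date_of_birth": False}

print(match_score(pair, m_probs, u_probs))
```

Working in log space is a design choice that avoids underflow when many attributes are compared; the result is the same ratio of products as in the formula.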
Each cluster is an undirected simple graph, where:
Identify clusters with false positive links
Wikidata: Information on individuals appearing in Wikipedia
attribute | corruption mechanism | base-case probability of corruption (%) |
---|---|---|
given name | QWERTY typographical error | 5.0 |
family name | QWERTY typographical error | 5.0 |
date of birth | Number pad typographical error | 1.0 |
gender | Swapped | 0.5 |
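The corruption scheme in the table could be simulated along the following lines. This is only a sketch under stated assumptions: the field names, the function names `grid_neighbours`, `typo` and `corrupt`, the simplified keyboard grids (keys count as adjacent when they sit next to each other in an unstaggered grid), one substituted character per corrupted value, and a binary `"M"`/`"F"` coding for gender are all hypothetical and not taken from the original work.

```python
import random

# Base-case probabilities from the table, converted from % to fractions.
BASE_CASE = {"given_name": 0.05, "family_name": 0.05,
             "date_of_birth": 0.01, "gender": 0.005}


def grid_neighbours(rows):
    """Map each key to the keys directly left, right, above and below it."""
    neighbours = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            adjacent = []
            for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < len(rows) and 0 <= cc < len(rows[rr]):
                    adjacent.append(rows[rr][cc])
            neighbours[key] = adjacent
    return neighbours


# Simplified layouts (assumption): real keyboards are staggered.
QWERTY = grid_neighbours(["qwertyuiop", "asdfghjkl", "zxcvbnm"])
NUMPAD = grid_neighbours(["789", "456", "123", "0"])


def typo(value, neighbours, rng):
    """Replace one randomly chosen character with an adjacent key."""
    positions = [i for i, ch in enumerate(value) if ch.lower() in neighbours]
    if not positions:
        return value
    i = rng.choice(positions)
    replacement = rng.choice(neighbours[value[i].lower()])
    if value[i].isupper():
        replacement = replacement.upper()
    return value[:i] + replacement + value[i + 1:]


def corrupt(record, rng, probs=BASE_CASE):
    """Return a copy of record with each attribute corrupted independently."""
    out = dict(record)
    for attr in ("given_name", "family_name"):
        if rng.random() < probs[attr]:
            out[attr] = typo(out[attr], QWERTY, rng)
    if rng.random() < probs["date_of_birth"]:
        out["date_of_birth"] = typo(out["date_of_birth"], NUMPAD, rng)
    if rng.random() < probs["gender"]:
        # Assumes a binary M/F coding for the swap.
        out["gender"] = "F" if out["gender"] == "M" else "M"
    return out


# Made-up record, for illustration only.
record = {"given_name": "Alice", "family_name": "Smith",
          "date_of_birth": "1980-02-29", "gender": "F"}
print(corrupt(record, random.Random(2024)))
```

Passing a seeded `random.Random` keeps a simulation run reproducible, and the `probs` argument makes it easy to vary the corruption rates away from the base case.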
But:
Ample room for further research…