Previously it has been difficult to evaluate the relative effectiveness of different linkage
approaches, due to limited availability of ground truth data. Here we report on an
evaluation of linkage using a large Swedish dataset which has been manipulated to create a
superset of the Scottish vital event records being collected as part of the Digitising Scotland
project. Critically, this dataset is large (373,000 records) and includes a high proportion of
ground truth links, captured within the original clerical process. The Scottish data comprises
birth, marriage and death records collected by the General Register Office for Scotland
during 1856-1973. All records include the names of the parties involved (child and parents
on birth records; spouses and their parents on marriage records; and the deceased, their
parents and their spouse on death records), the date and location of the relevant event and,
in some cases, additional information such as occupation and marital status. The extract of
the Swedish data, from CEDAR, contains 227,888 birth records, 42,207 marriage records and
102,655 death records, plus high quality ground truth. To link records we use a probabilistic
distance-based approach, in which two records are deemed to be linked if the difference
between the relevant information on the two records falls below some pre-determined
threshold. We perform two different forms of linkage on records: identity linkage and
indirect linkage. Our system currently includes 22 different linkers, each
of which creates inter-node relationships that are stored in a graph database. Each
linker encodes the links it finds between records as relationships between the stored nodes.
Each relationship has attributes that encode the name of the builder used to form
the link (and thus all the provenance in the source code), the distance between the nodes,
and the type of relationship (e.g. mother, father, sibling, or identity, meaning the same
actor on two records). In some cases additional information is stored in the relationship
attributes (for example, for identity links, which actors on the records are involved). Using
the ground truth available in the Swedish data, it is possible to determine the efficacy of
different approaches to linkage with a high degree of confidence. It is also possible to
determine the effect of different linkage metrics and different thresholds. We will briefly
report on our experimental results in this regard. During these experiments we have
observed the high impact of missing data on linkage quality and we will discuss these
observations and approaches to mitigating the effects of missing data. The graphs
created by the different builders may contain a number of different classes of error. In
general, errors fall into one of four categories: errors of omission, errors of inclusion,
uniqueness constraint errors and semantic errors. Many of these errors are only readily manifested
when the outputs from the different building processes are combined into a single graph. In
this paper we will also fully describe these errors, how they may be detected and how they
may be, at least partially, corrected.
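
As an illustration only, the threshold-based linkage and ground-truth evaluation described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the system reported here: the record fields, the summed edit distance used as the metric, the builder name, the attribute names on the relationship and the toy records are all hypothetical, and a plain list stands in for the graph database.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Record:
    record_id: str
    forename: str
    surname: str


@dataclass(frozen=True)
class Relationship:
    # Attribute names are hypothetical; they mirror the attributes named in the text.
    source_id: str
    target_id: str
    builder: str    # name of the builder that formed the link (provenance)
    distance: int   # distance between the two records
    rel_type: str   # e.g. "identity", "mother", "father", "sibling"


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def record_distance(r1: Record, r2: Record) -> int:
    # Assumption: the record distance is the sum of per-field edit distances.
    return (levenshtein(r1.forename, r2.forename)
            + levenshtein(r1.surname, r2.surname))


def link(births, deaths, threshold, builder="BirthDeathIdentity"):
    """Link every birth/death pair whose distance falls below the threshold."""
    links = []
    for birth, death in product(births, deaths):
        distance = record_distance(birth, death)
        if distance < threshold:
            links.append(Relationship(birth.record_id, death.record_id,
                                      builder, distance, "identity"))
    return links


def precision_recall(links, truth):
    """Compare the linked pairs with the ground-truth pairs."""
    found = {(rel.source_id, rel.target_id) for rel in links}
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Invented toy records, not taken from the Swedish or Scottish data.
births = [Record("b1", "Anna", "Svensson"), Record("b2", "Karl", "Nilsson")]
deaths = [Record("d1", "Ana", "Svensson"), Record("d2", "Carl", "Nilson"),
          Record("d3", "Karin", "Nilsson")]
truth = {("b1", "d1"), ("b2", "d2")}

for threshold in (2, 3):
    precision, recall = precision_recall(link(births, deaths, threshold), truth)
    print(f"threshold={threshold} precision={precision:.2f} recall={recall:.2f}")
```

On these toy records, raising the threshold from 2 to 3 recovers the second true link but also admits one false link, so recall rises from 0.5 to 1.0 while precision falls from 1.0 to about 0.67. This is the kind of trade-off between metrics and thresholds that ground truth makes measurable.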