50,000 names! (Well, nearly…)

Posted in data, identifiers by Amanda Hill on 12 July, 2013

The Names team have just finished processing data from the Open University’s Open Research Online repository. When the researchers’ names from Open Research Online were added to the existing Names data, there were 50,002 individuals identified in the Names system.

The matching algorithm developed by the Names project’s Dan Needham does a good job of comparing new names to those already in the system, matching up individuals based on their names, affiliations and the titles of their papers. The algorithm errs on the side of caution, however, to avoid wrongly matching people. This means that some individuals who are already in the system are not matched automatically to their existing records and are flagged as potential matches instead.
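For readers who like to see this kind of thing in code, here is a minimal sketch of what a cautious matcher could look like. It is not the Names project's actual algorithm: the `Person` record, the `classify` function, the 0.85 name-similarity threshold and the three outcomes ("match", "review", "new") are all assumptions made up for illustration. The idea is simply that a record is only merged automatically when name, affiliation and at least one paper title all agree, and anything less certain is passed to a human.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Person:
    """Hypothetical record: a name, an affiliation and a set of paper titles."""
    name: str
    affiliation: str
    titles: set = field(default_factory=set)


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def name_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the normalised names are identical."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def classify(new: Person, existing: Person, name_threshold: float = 0.85) -> str:
    """Return 'match', 'review' or 'new' (threshold is an illustrative guess)."""
    # Names too different: treat the incoming record as a new individual.
    if name_similarity(new.name, existing.name) < name_threshold:
        return "new"

    same_affiliation = normalise(new.affiliation) == normalise(existing.affiliation)
    shared_titles = {normalise(t) for t in new.titles} & {
        normalise(t) for t in existing.titles
    }

    # Err on the side of caution: merge automatically only when all the
    # evidence agrees; otherwise hand the pair to a person to check.
    if same_affiliation and shared_titles:
        return "match"
    return "review"


incoming = Person("Jane Smith", "Open University", {"A Study of Names"})
known = Person("jane  smith", "open university", {"a study of names", "Other"})
print(classify(incoming, known))
```

Under this sketch, a pair with a similar name but a different affiliation, or no paper title in common, lands in the "review" pile rather than being merged.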

As a result, with the OU data, we had some 850 names (out of 2,243) to check against potential matches. We began by checking a 10% sample, which showed that around 12% of the potential matches were actual matches, so most were not. To ensure the quality of the data, we decided to check the whole batch, and this manual process determined that 108 of the possible matches did indeed match existing individuals in the Names system. Human intervention is the best way of ensuring the quality of data in these cases: automation can achieve a fair degree of accuracy in matching individuals, but in some cases it’s essential to have a person look at potential matches to determine whether they really are a match. Sometimes it is obvious, but for several in this batch some additional research was needed to be absolutely sure.

The matching of those individuals left us with a total of 49,894 uniquely identified individuals in the Names database. It would have been nice to have been over the 50,000 mark – but the data would have been poorer quality if we’d left it as it was…
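The sums behind those figures are easy to check. The snippet below uses only the numbers quoted in this post; the variable names are just for illustration.

```python
flagged = 850       # OU names flagged as potential matches
confirmed = 108     # potential matches confirmed by manual checking
before = 50_002     # individuals in Names before the confirmed matches were merged

match_rate = confirmed / flagged   # share of flagged names that were real matches
after = before - confirmed         # individuals left once duplicates are merged

print(f"{match_rate:.1%}")  # 12.7%
print(after)                # 49894
```

The full check gives a match rate of about 12.7%, in line with the roughly 12% suggested by the 10% sample.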

P.S. Come to think of it, if we include the identifiers for the 158 research institutions in Names, then we are over the 50,000 (50,042 to be precise). Yay!