Last week new data was added to Names from the Research Repository of the University of the West of England. This repository runs on the EPrints platform and we extracted information from its RDF output as we did for the University of Huddersfield’s repository earlier this year.
With the help of the quality assurance team at the British Library, 786 Names records were either created or enhanced with information from the UWE repository. For many existing records we have been able to add detail such as first names where we previously held only initials.
In total there are around 821 individuals with an affiliation to UWE who now have a unique identifier within the Names system.
The next data source we’ll be investigating is the aggregated data in the Institutional Repository Search service – but we’re always keen to work with individual repositories, so if you’d like to get your contributors included in Names, please get in touch.
One of the main avenues through which we hope to build up the Names core record set is harvesting information about researchers at the repository level. There are currently two methods by which a repository can make its data available for use within the Names project. The first is to produce an extract of its researcher information that conforms to our Data Format Specification and submit it to us. The second requires the institution to be running EPrints 3.2.1 or above as its repository software, and was recently explored with The University of Huddersfield.
EPrints 3.2.1 and above provides semantic web support, including data export in RDF+XML. By developing specific classes to read this output using Jena, we are able to harvest data from the source and pass it to our matching and disambiguation algorithms for comparison against the existing Names records. To test this out we recently collaborated with The University of Huddersfield to extract and disambiguate the creators from their EPrints repository.
The first, and simplest, step was to export Huddersfield’s EPrints data as RDF from their repository (http://eprints.hud.ac.uk/id/dump). Once we had done this we could easily process the resulting RDF+XML file, using our disambiguation algorithms to try to match creators identified in the document against existing individuals identified within the Names Service. Two types of creator were defined in the RDF dump: internal (belonging to the institution) and external (not belonging to the institution). Because the amount of disambiguating data for external individuals was limited, we decided to process only internal creators, both to increase the accuracy of the results and to reduce the noise of creating many files with sparse information.
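To give a flavour of this filtering step, here is a minimal sketch in Python (our own classes use Jena, so this is an illustration rather than the project code). It assumes creators appear as foaf:Person elements and that internal creators can be recognised by a marker in their URI. Both are assumptions made for the example, as are the sample data, the function name internal_creators and the internal_marker parameter; the vocabulary of a real EPrints dump may well differ.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hand-written sample for illustration only; not taken from a real dump.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://eprints.example.org/id/person/int-1">
    <foaf:familyName>Smith</foaf:familyName>
    <foaf:givenName>Jane</foaf:givenName>
  </foaf:Person>
  <foaf:Person rdf:about="http://eprints.example.org/id/person/ext-2">
    <foaf:familyName>Jones</foaf:familyName>
    <foaf:givenName>A.</foaf:givenName>
  </foaf:Person>
</rdf:RDF>
"""


def internal_creators(rdf_xml, internal_marker="/person/int-"):
    """Return a dict per creator whose URI contains the (assumed) internal marker."""
    root = ET.fromstring(rdf_xml)
    creators = []
    for person in root.iter(f"{{{FOAF}}}Person"):
        uri = person.get(f"{{{RDF}}}about", "")
        if internal_marker not in uri:
            continue  # skip external creators: too little data to disambiguate
        creators.append({
            "uri": uri,
            "family": person.findtext(f"{{{FOAF}}}familyName", default=""),
            "given": person.findtext(f"{{{FOAF}}}givenName", default=""),
        })
    return creators


creators = internal_creators(SAMPLE)
```

In this sample only the first person is kept, and the resulting dictionaries are the sort of input a matching step could then compare against existing records.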
Processing and testing of the Huddersfield data has been an iterative process, and we used the exercise both to contribute to our records and to help improve the accuracy of our disambiguation algorithms. After an initial run we managed to identify ~550 unique individuals, but we needed to quality assure these results to ascertain how accurate the matching was. To do this, two reports were produced: one containing potentially mis-matched records (records which contained information from two or more individuals), and one containing potentially non-matched records (separate records which contain information about the same individual). We discovered ~300 potential mis-matches and ~200 potential non-matches.
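The two reports can be pictured with a small sketch. The heuristics here are assumptions chosen purely for illustration and are not the rules our software applies: a record is flagged as a potential mis-match when its name variants disagree on the first initial, and two separate records are flagged as a potential non-match when they share a surname and have compatible initials. The record layout and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def initials(forenames):
    """Reduce 'Jane Anne' or 'J. A.' to 'JA'."""
    return "".join(part[0].upper() for part in forenames.replace(".", " ").split())


def potential_mismatches(records):
    """Ids of records whose name variants disagree on the first initial."""
    flagged = []
    for rec in records:
        firsts = {initials(f)[:1] for f in rec["forename_variants"]}
        firsts.discard("")  # ignore empty variants
        if len(firsts) > 1:
            flagged.append(rec["id"])
    return flagged


def potential_nonmatches(records):
    """Pairs of separate records sharing a surname and compatible initials."""
    by_surname = defaultdict(list)
    for rec in records:
        by_surname[rec["surname"].lower()].append(rec)
    pairs = []
    for group in by_surname.values():
        for a, b in combinations(group, 2):
            ia = {initials(f) for f in a["forename_variants"]}
            ib = {initials(f) for f in b["forename_variants"]}
            # Compatible: one set of initials is a prefix of the other.
            if any(x and y and (x.startswith(y) or y.startswith(x))
                   for x in ia for y in ib):
                pairs.append((a["id"], b["id"]))
    return pairs


# Invented records for illustration.
records = [
    {"id": 1, "surname": "Smith", "forename_variants": ["Jane", "J."]},
    {"id": 2, "surname": "Smith", "forename_variants": ["J. A."]},
    {"id": 3, "surname": "Jones", "forename_variants": ["Alan", "Brian"]},
]
mismatch_report = potential_mismatches(records)
nonmatch_report = potential_nonmatches(records)
```

Both reports deliberately over-flag: the point is to hand a reviewer a list of candidates to check, not to make the final decision automatically.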
A team at the British Library with specialist skills was made available to quality assure the results, analysing each of the potential mis-matches to see whether an actual mis-match had occurred, and analysing a sample of the potential non-matches to see whether a match should have occurred and why. The results for the mis-matches were encouraging, with 0 mis-matches found. However, the results for the non-matches indicated that around 80% of the potential non-matches were actual non-matches, that is, separate records that did describe the same individual.
Using this information we were able to fix a software bug and make further tweaks to the disambiguation algorithms to reduce the level of non-matches. After a further round of quality assurance by the British Library we found that we had reduced the number of non-matches to around 50%, and the remaining cases were deemed impossible to match by automated means. These final records were merged manually.
Once all identified individuals had either been matched with an existing Names identifier or assigned a new one, we were able to return a list of assigned Names identifiers to Huddersfield.
Some of the identifiers and records created or added to as part of this exercise are listed here:
1. http://names.mimas.ac.uk/individual/46934 (a record created purely from Huddersfield data)
2. http://names.mimas.ac.uk/individual/6831 (a record which already existed, but had Huddersfield data merged into it).
The MERIT data has given us a good corpus of UK researchers’ names to use as the basis of the Names prototype. There are around 45,000 names, and most have institutional affiliations associated with them too, which makes them a rich data set. What they generally lack is full names: they’re usually just surnames and initials.
This is where we need help from UK institutions to improve the data, and we’ve recently been testing this process with some information supplied by Robert Gordon University (RGU) in Aberdeen. Researchers at the university were contacted and the aim of the process was explained. Having cleared things with their researchers, staff at RGU then extracted information from the OpenAIR institutional repository for staff who were willing to be involved and sent it to the Names team in table form, listing surname, forename(s), publication title, date and publication type.
This data was then matched with existing Names data. 17 names were found to match with individuals already in the database, based on names and article titles. The other names were not in the database: new Names records and persistent identifiers were created for these individuals. Quality assurance on the results of the matching process was carried out by colleagues at the British Library.
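As a rough illustration of matching on names and article titles, the sketch below treats an incoming row as a match when its surname and first initial agree with an existing record and the two share a publication title after light normalisation. This is a simplified assumption for the example, not a description of the Names matching algorithm; the field names, functions and sample data are all invented.

```python
import re


def norm_title(title):
    """Lower-case a title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def name_key(surname, forenames):
    """Surname plus first initial, so 'R. A.' and 'Rachel Anne' compare equal."""
    return (surname.strip().lower(), forenames.strip()[:1].upper())


def match_row(row, existing):
    """Return the id of the first existing record sharing name key and a title, else None."""
    key = name_key(row["surname"], row["forenames"])
    title = norm_title(row["title"])
    for rec in existing:
        same_name = name_key(rec["surname"], rec["initials"]) == key
        if same_name and title in {norm_title(t) for t in rec["titles"]}:
            return rec["id"]
    return None


# Invented data: one existing record holding only initials, two incoming rows.
existing = [
    {"id": 101, "surname": "Example", "initials": "R. A.",
     "titles": ["A Study of Things"]},
]
matched = match_row(
    {"surname": "Example", "forenames": "Rachel Anne", "title": "A study of things."},
    existing,
)
unmatched = match_row(
    {"surname": "Example", "forenames": "Rachel Anne", "title": "Another paper"},
    existing,
)
```

A row that finds no match, like the second one here, corresponds to the case where a new record and identifier would be created, and anything matched automatically would still benefit from the kind of quality assurance described above.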
In this example, the record for R. A. Laing has been enhanced with the researcher’s full forenames and with additional publications (only the first listed publication was supplied by the MERIT database). This additional information will assist in the establishment of future matches with further sources of names data that become available to the Names team.
If your institution is interested in providing similar data to improve the Names records for your researchers, then we’d love to hear from you. You can contact Dan Needham, the project’s lead developer, at firstname.lastname@example.org, or if you have any questions, please email project manager Amanda Hill. We can supply you with a sample email which will introduce the project to researchers in your institution if, like RGU, you want to tell them about it.