In the past few weeks the Names team have been working with colleagues at the London School of Economics to uniquely identify individuals who have been involved in research at their institution. As with our previous work with the University of Huddersfield, this involved analysing the contents of LSE’s institutional repository, LSE Research Online.
By processing the RDF data which is automatically provided by the repository’s EPrints software, we were able to compare the information within it against the existing information in Names about LSE authors. Where individuals had already been identified from the Merit 2008 Research Assessment Exercise data, the repository information usually provided additional details to augment the Names records, including first names and titles of other papers that individuals had worked on. For individuals who were not already in Names, we created new records and assigned identifiers to them.
The Names disambiguation algorithm does a good job of automatically matching information from repository data with existing Names records, but it is configured to err on the side of caution, to avoid making false connections between individuals who have similar names but are not the same person. This creates some extra work for the quality assurance process (which is undertaken by our colleagues at the British Library), as it generates a list of potential matches which have to be checked manually. This is worth doing, however, as it ensures that the resulting data is more reliable than it would be with just an automated check. The more data is added to Names, the smoother the matching process becomes, as there is more information in the system to compare against each new source of data.
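The cautious behaviour described above can be pictured as a three-way triage. The following is a minimal sketch, not the Names algorithm itself: the similarity score, the two thresholds and the names `Candidate` and `triage` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    names_id: int   # identifier of an existing record (hypothetical values)
    score: float    # assumed similarity to the incoming record, 0.0 to 1.0


# Hypothetical thresholds: a high bar for automatic matches keeps false
# connections rare, at the cost of a longer manual-review list.
AUTO_MATCH = 0.90
REVIEW = 0.50


def triage(candidates):
    """Return ('match', id), ('review', [ids]) or ('new', None)."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if ranked and ranked[0].score >= AUTO_MATCH:
        # Accept automatically only when no other candidate is also plausible.
        if len(ranked) == 1 or ranked[1].score < REVIEW:
            return ("match", ranked[0].names_id)
    plausible = [c.names_id for c in ranked if c.score >= REVIEW]
    if plausible:
        # Ambiguous or middling evidence goes to the manual checking list.
        return ("review", plausible)
    return ("new", None)
```

In this sketch, raising `AUTO_MATCH` sends more cases to manual review, which is the trade-off the paragraph above describes.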
In the record below, the original Merit record has been enhanced with information about the individual’s first name and with the identifier from the LSE repository. Already this person has four separate identifiers assigned to him: the local LSE one, the national Merit one derived from the 2008 Research Assessment Exercise, the Names identifier (15711) and the international ISNI identifier. We’re also currently investigating the best way of linking this data up with the other big international initiative, ORCID.
Colleagues at LSE plan to add the Names identifiers to their local name authority file for use within the institution. I’d like to note here that working in collaboration with the LSE staff helped to improve data both in Names and at the repository. The experience has also helped us to speed up and fine-tune the quality assurance process at the Names end.
In total there are now 1,005 individuals identified in Names who are affiliated with LSE. 463 of these were new identities created from information in LSE Research Online and 413 were existing Names records which have been improved with additional information from the LSE repository.
Last week new data was added to Names from the Research Repository of the University of the West of England. This repository runs on the EPrints platform and we extracted information from its RDF output as we did for the University of Huddersfield’s repository earlier this year.
With the help of the quality assurance team at the British Library, 786 Names records were either created or enhanced with information from the UWE repository. For many existing records, for example, we have been able to add first names where we previously held only initials.
In total there are around 821 individuals with an affiliation to UWE who now have a unique identifier within the Names system.
The next data source we’ll be investigating is the aggregated data in the Institutional Repository Search service – but we’re always keen to work with individual repositories, so if you’d like to get your contributors included in Names, please get in touch.
One of the main avenues through which we hope to build up the Names core record set is harvesting information about researchers at the repository level. There are currently two methods by which a repository can make their data accessible for use within the Names project. The first method is to submit their data to us by producing a data extract of their researcher information that conforms to our Data Format Specification. The second method requires the institution to be running EPrints 3.2.1 or above as their repository software, and was recently explored with The University of Huddersfield.
EPrints 3.2.1 and above provides semantic web support, including data export in RDF+XML. By developing specific classes to read the data output using Jena, we are able to harvest data from the source to be used by our matching and disambiguation algorithms against the existing Names records. To test this out, we recently collaborated with The University of Huddersfield to try to extract and disambiguate the creators from their EPrints repository.
The first, and simplest, step was to export Huddersfield’s EPrints data as RDF from their repository (http://eprints.hud.ac.uk/id/dump). Once we had done this we could easily process the resulting RDF+XML file, using our disambiguation algorithms to try to match creators identified in the document against existing individuals identified within the Names Service. Two types of creator were defined in the RDF dump: those that were internal (belonging to the institution) and those that were external (not belonging to the institution). Because the amount of disambiguating data pertaining to the external individuals was limited, we decided to process only internal creators, to help increase the accuracy of the results and reduce the noise of creating many files with sparse information.
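To illustrate the filtering step, here is a minimal sketch that reads a small RDF+XML sample and keeps only internal creators. It is not the project's code, which used Jena, and it does not reproduce the vocabulary of the real EPrints dump: the `ex:internal` flag, the `EX` namespace, the sample data and the function name `internal_creators` are all assumptions. Only the RDF and FOAF namespaces are real.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/terms#"  # hypothetical namespace for the internal flag

# Invented sample data standing in for a repository dump.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}" xmlns:ex="{EX}">
  <foaf:Person rdf:about="http://example.org/person/1">
    <foaf:familyName>Smith</foaf:familyName>
    <foaf:givenName>Jane</foaf:givenName>
    <ex:internal>true</ex:internal>
  </foaf:Person>
  <foaf:Person rdf:about="http://example.org/person/2">
    <foaf:familyName>Jones</foaf:familyName>
    <foaf:givenName>A.</foaf:givenName>
    <ex:internal>false</ex:internal>
  </foaf:Person>
</rdf:RDF>"""


def internal_creators(rdf_xml):
    """Return name details for creators flagged as internal."""
    root = ET.fromstring(rdf_xml)
    people = []
    for person in root.iter(f"{{{FOAF}}}Person"):
        # Skip external creators and any creator with no flag at all.
        if person.findtext(f"{{{EX}}}internal") != "true":
            continue
        people.append({
            "uri": person.get(f"{{{RDF}}}about"),
            "family": person.findtext(f"{{{FOAF}}}familyName"),
            "given": person.findtext(f"{{{FOAF}}}givenName"),
        })
    return people
```

Calling `internal_creators(SAMPLE)` returns a single entry, for the creator flagged as internal. A real harvest would more likely use an RDF library than element-level XML parsing, since RDF+XML allows several equivalent serialisations of the same data.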
Processing and testing of the Huddersfield data has been an iterative process, and we used the exercise both to contribute to our records and to help improve the accuracy of our disambiguation algorithms. After an initial run we managed to identify ~550 unique individuals, but we needed to quality assure these results in order to ascertain how accurate the matching was. In order to do this, two reports were produced: one containing potentially mis-matched records (records which contained information from two or more individuals), and one containing potentially non-matched records (separate records which contain information about the same individual). We discovered around 300 potential mis-matches and around 200 potential non-matches.
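The two reports can be sketched as two simple checks over the merged records. This is an illustration only: the record structure, the use of first initials as the signal and the function names are assumptions, not the rules the project actually applied.

```python
from itertools import combinations


def initial(given):
    """First letter of a given name or initial, upper-cased."""
    return given[:1].upper()


def potential_mismatches(records):
    """Records whose merged name forms disagree on the first initial."""
    flagged = []
    for rec_id, forms in records.items():
        initials = {initial(given) for _, given in forms}
        if len(initials) > 1:
            flagged.append(rec_id)
    return flagged


def potential_nonmatches(records):
    """Pairs of separate records sharing a family name and first initial."""
    keys = {
        rec_id: {(family.lower(), initial(given)) for family, given in forms}
        for rec_id, forms in records.items()
    }
    return [(a, b) for a, b in combinations(sorted(keys), 2) if keys[a] & keys[b]]


# Invented records: identifier -> list of (family name, given name) forms.
RECORDS = {
    1: [("Smith", "J."), ("Smith", "Jane")],
    2: [("Smith", "John")],
    3: [("Jones", "A."), ("Jones", "Brian")],
    4: [("Patel", "R.")],
}
```

With this sample, record 3 is flagged as a potential mis-match and records 1 and 2 as a potential non-match. Both flags are only prompts for a human check: "Smith, J." and "Smith, John" may or may not be the same person.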
A team at the British Library with specialist skills were made available to quality assure the results, analysing each of the potential mis-matches to see whether an actual mis-match occurred, and analysing a sample of the potential non-matches to see whether a match should have occurred and why. The results for the mis-matches were encouraging, with 0 mis-matches found. However, the results for the non-matches indicated that around 80% of the potential non-matches were actual non-matches.
Using this information we were able to fix a software bug, and also make further tweaks to the disambiguation algorithms to reduce the level of non-matches. After a further round of quality assurance by the British Library we discovered that we had reduced the proportion of non-matches to around 50%, and the remaining cases were deemed impossible to match by automated means. These final records were merged manually.
Once all identified individuals had either been matched with an existing Names identifier or assigned a new one, we were able to return a list of assigned Names identifiers to Huddersfield.
Some of the identifiers and records created or added to as part of this exercise are listed here:
1. http://names.mimas.ac.uk/individual/46934 (a record created purely from Huddersfield data)
2. http://names.mimas.ac.uk/individual/6831 (a record which already existed, but had Huddersfield data merged into it).
This summer, JISC funded the Names Project to build a plugin for the EPrints software. In this post, developer Phil Cross describes his work on this.
The EPrints software has been designed to ease the process of creating add-ons and customisations for a repository. We wished to provide an automatic search of the Names API when users type author or editor details into an eprint creation form. We also wanted to be able to present disambiguating information to allow the selection of the correct author and to have the Names-assigned person URI added to the eprint metadata.
We discovered that EPrints already has a built-in autocomplete function that searches over existing repository authors and that there is also an existing creator identifier field that allows the system to identify authors of multiple eprints. We therefore created an augmented version of the existing name autocomplete script that searches the Names API. The search pulls back affiliations, fields of interest and publication details as well as name details and the person URI. When this script is inserted into the code for a specific repository, it overrides the global script, adding the new functionality. Simply removing the script returns the repository to its default behaviour.
The name details are displayed in a drop-down list together with the Names URI. Moving the cursor down the list opens a box next to each entry that contains the disambiguation information. Selecting a name adds the name details and URI to the form.
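The drop-down behaviour can be sketched as a small transformation from search results to list entries. This is not the plugin's code, which runs inside EPrints, and the field names below (`family`, `given`, `uri`, `affiliations`, `fields`, `publications`) are invented for illustration and do not describe the real Names API response.

```python
def to_suggestions(results, typed):
    """Turn search results into drop-down entries for the name field."""
    prefix = typed.strip().lower()
    suggestions = []
    for person in results:
        full_name = f'{person["family"]}, {person["given"]}'
        if not full_name.lower().startswith(prefix):
            continue
        # Build the disambiguation text shown next to the highlighted entry.
        hover = []
        for label, key in (("Affiliations", "affiliations"),
                           ("Fields of interest", "fields"),
                           ("Publications", "publications")):
            if person.get(key):
                hover.append(f"{label}: " + "; ".join(person[key]))
        suggestions.append({
            "label": f'{full_name} ({person["uri"]})',
            "hover": "\n".join(hover),
            # Values copied into the form when the entry is selected.
            "fill": {"family": person["family"], "given": person["given"],
                     "id": person["uri"]},
        })
    return suggestions


# Invented results with placeholder URIs.
RESULTS = [
    {"family": "Smith", "given": "Jane", "uri": "http://example.org/individual/1",
     "affiliations": ["Example University"], "fields": ["Economics"],
     "publications": []},
    {"family": "Smith", "given": "John", "uri": "http://example.org/individual/2",
     "affiliations": [], "fields": ["Chemistry"],
     "publications": ["A sample paper"]},
    {"family": "Jones", "given": "Alan", "uri": "http://example.org/individual/3"},
]
```

Typing "smith" yields two entries whose hover text differs, which is what lets a depositor pick the right person before the URI is written into the form.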
We were also able to make changes to the context-sensitive help for the author and editor fields. The altered help text contains information about the Names API search and provides a link for authors who wish to add their own details to the Names database.
The Extension Package produced can be used with any EPrints 3.x installation by unzipping the compressed package into the top directory of the chosen repository (a single EPrints installation can run multiple repositories). To disable the Names functionality, the administrator simply needs to delete the three files added. The package and instructions on how to install it are available on the Names site.
The newest version of EPrints, version 3.3, provides access to a new development called the Bazaar Store. This is an application store for the EPrints platform that enables repository administrators to install EPrints Plugins and Extensions with a single click. We have created a Bazaar Package that is a version of the Names Extension Package and this is now available in the Bazaar store for users of version 3.3.
Comments on this work are welcome and we are also interested in working with you if you would like to include details of researchers from your institution in the Names system, to help improve the data which is returned from the API and the autocompletion plugin.
The Names Project has recently started a new project-within-a-project to build some Names functionality into the EPrints repository software. The survey of UK repository managers undertaken by the project in 2010* showed that the largest group of respondents (41%) were using EPrints to run their institutional repositories. This finding is supported by the OpenDOAR directory of repositories, which shows EPrints in use by 45% of the 193 repositories it lists for the country.
Incorporating Names into EPrints could be useful, therefore, for a large number of repositories in the UK. Our initial plans include developing the following features:
- Adapting the existing auto-completion field within EPrints to make a call to the Names API to bring back potential matches from the Names data (and distinguishing them from internal matches in some way)
- Associating the Names identifier with creator metadata within EPrints
- Allowing the update of existing materials within the repository to associate creator names with their Names identifier and to allow for searching by that identifier, so that all records for an individual can be retrieved, regardless of the form of the name
- Modifying output records from EPrints to include the Names identifier
- Packaging the resulting plug-in so that it can be made available through the EPrints Bazaar.
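The benefit of searching by identifier, as planned in the list above, can be shown with a small index. This is a sketch under assumed data: the eprint structure, the `names_id` key and the placeholder identifiers are hypothetical and are not EPrints field names.

```python
from collections import defaultdict


def index_by_identifier(eprints):
    """Map each creator identifier to the titles it is attached to."""
    index = defaultdict(list)
    for eprint in eprints:
        for creator in eprint["creators"]:
            # Creators without an identifier cannot be grouped reliably.
            if creator.get("names_id"):
                index[creator["names_id"]].append(eprint["title"])
    return dict(index)


# Invented eprints: the same person appears under two forms of their name.
EPRINTS = [
    {"title": "Paper A",
     "creators": [{"name": "Smith, J.", "names_id": "N1"}]},
    {"title": "Paper B",
     "creators": [{"name": "Smith, Jane", "names_id": "N1"},
                  {"name": "Jones, A.", "names_id": None}]},
    {"title": "Paper C",
     "creators": [{"name": "Jones, Alan", "names_id": "N2"}]},
]
```

Looking up `"N1"` in the resulting index returns both "Paper A" and "Paper B", even though the creator is recorded as "Smith, J." on one and "Smith, Jane" on the other.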
Any comments, suggestions, pitfalls you can foresee with this approach?
*Report on the survey [PDF]