Names Project Blog

Importing The University of Huddersfield’s researcher information.

Posted in data, EPrints, institutions by Daniel Needham on 17 February, 2012

One of the main avenues through which we hope to build up the Names core record set is harvesting information about researchers at the repository level. There are currently two methods by which a repository can make its data accessible for use within the Names project. The first is for the repository to send us a data extract of its researcher information that conforms to our Data Format Specification. The second requires the institution to be running EPrints 3.2.1 or above as its repository software, and was recently explored with The University of Huddersfield.

EPrints 3.2.1 and above provide semantic web support, including data export in RDF+XML. By developing specific classes that read this output using Jena, we are able to harvest data from the source for use by our matching and disambiguation algorithms against the existing Names records. To test this out we recently collaborated with The University of Huddersfield to try to extract and disambiguate the creators from their EPrints repository.
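To give a rough idea of what reading creators out of an RDF+XML export involves, here is a minimal sketch. It is not our harvester: that is built on Jena, whereas this uses only Python's standard library. The sample document, the example.org URIs and the assumption that creators appear as `foaf:Person` elements carrying a `foaf:name` are all illustrative, and the real EPrints export may be structured differently.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"

# Hypothetical sample, NOT taken from the Huddersfield dump.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/person/1">
    <foaf:name>A. Researcher</foaf:name>
  </foaf:Person>
  <foaf:Person rdf:about="http://example.org/person/2">
    <foaf:name>B. Scholar</foaf:name>
  </foaf:Person>
</rdf:RDF>
"""


def extract_creators(rdf_xml):
    """Return one dict per foaf:Person element found in the RDF+XML text."""
    root = ET.fromstring(rdf_xml)
    creators = []
    # ElementTree addresses namespaced tags as "{namespace}localname".
    for person in root.iter(f"{{{FOAF_NS}}}Person"):
        creators.append({
            "uri": person.get(f"{{{RDF_NS}}}about"),
            "name": person.findtext(f"{{{FOAF_NS}}}name"),
        })
    return creators


creators = extract_creators(SAMPLE)
for creator in creators:
    print(creator["uri"], creator["name"])
```

Treating RDF+XML as plain XML only works when the export uses one predictable serialisation; a proper RDF library such as Jena handles the equivalent forms that a plain XML parser would miss.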

The first, and simplest, step was to export Huddersfield’s EPrints data as RDF from their repository (http://eprints.hud.ac.uk/id/dump). Once we had done this we could easily process the resulting RDF+XML file, using our disambiguation algorithms to try to match creators identified in the document against existing individuals identified within the Names Service. Two types of creator were defined in the RDF dump: those that were internal (belonging to the institution) and those that were external (not belonging to the institution). Because the amount of disambiguating data pertaining to the external individuals was limited, we decided to process only internal creators, both to increase the accuracy of the results and to reduce the noise of creating many files with sparse information.
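The decision to keep only internal creators amounts to a simple filter applied before matching. The sketch below assumes each harvested creator already carries an `internal` flag; that field, the `Creator` class and the sample names are hypothetical, and this is not how the RDF dump itself marks the distinction.

```python
from dataclasses import dataclass


@dataclass
class Creator:
    name: str
    internal: bool  # hypothetical flag: True if the creator belongs to the institution


def internal_only(creators):
    """Keep only the creators flagged as belonging to the institution."""
    return [c for c in creators if c.internal]


# Made-up sample data for illustration.
harvested = [
    Creator("Researcher, A.", internal=True),
    Creator("Collaborator, B.", internal=False),
    Creator("Scholar, C.", internal=True),
]

to_process = internal_only(harvested)
print([c.name for c in to_process])
```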

Processing and testing the Huddersfield data has been an iterative process, and we used the exercise both to contribute to our records and to help improve the accuracy of our disambiguation algorithms. After an initial run we identified ~550 unique individuals, but we needed to quality-assure these results to ascertain how accurate the matching was. To do this, we produced two reports: one containing potentially mis-matched records (records that contained information from two or more individuals), and one containing potentially non-matched records (separate records that contain information about the same individual). We discovered ~300 potential mis-matches and ~200 potential non-matches.
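The two reports can be pictured with a small sketch. The rules used here are assumptions chosen for illustration, not the checks we actually ran: a record is flagged as a potential mis-match when its name variants disagree on surname or first initial, and two records are flagged as a potential non-match when they share a surname and first initial. The record identifiers and names are made up.

```python
from collections import defaultdict
from itertools import combinations


def name_key(name):
    """Reduce 'Surname, Forename' to a lower-cased (surname, first initial) pair."""
    surname, _, forenames = name.partition(",")
    return (surname.strip().lower(), forenames.strip()[:1].lower())


def potential_mismatches(records):
    """Record ids whose name variants disagree on surname or first initial."""
    return [rid for rid, names in records.items()
            if len({name_key(n) for n in names}) > 1]


def potential_nonmatches(records):
    """Pairs of separate record ids that share a (surname, initial) key."""
    by_key = defaultdict(set)
    for rid, names in records.items():
        for n in names:
            by_key[name_key(n)].add(rid)
    pairs = set()
    for rids in by_key.values():
        pairs.update(combinations(sorted(rids), 2))
    return sorted(pairs)


# Hypothetical records: id -> name variants merged into that record.
records = {
    "r1": ["Smith, John", "Smith, J."],
    "r2": ["Smith, Jane"],
    "r3": ["Jones, Ann", "Brown, Ann"],
}

mismatch_report = potential_mismatches(records)
nonmatch_report = potential_nonmatches(records)
print(mismatch_report)
print(nonmatch_report)
```

Both reports deliberately over-flag. They list candidates for a person to check, which is why the figures above are described as potential mis-matches and potential non-matches.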

A team at the British Library with specialist skills was made available to quality-assure the results, analysing each of the potential mis-matches to see whether an actual mis-match had occurred, and analysing a sample of the potential non-matches to see whether a match should have occurred and why. The results for the mis-matches were encouraging, with 0 mis-matches found. However, the results for the non-matches indicated that around 80% of the potential non-matches in the sample were actual non-matches.

Using this information we were able to fix a software bug and make further tweaks to the disambiguation algorithms to reduce the level of non-matches. After a further round of quality assurance by the British Library, we found that we had reduced the level of non-matches to around 50%, and the remaining cases were deemed impossible to match by automated means. These final records were merged manually.

Once all identified individuals had either been matched with an existing Names identifier or assigned a new one, we were able to return a list of assigned Names identifiers to Huddersfield.

Some of the identifiers and records created or added to as part of this exercise are listed here:

1. http://names.mimas.ac.uk/individual/46934 (a record created purely from Huddersfield data)
2. http://names.mimas.ac.uk/individual/6831 (a record which already existed, but had Huddersfield data merged into it)

Recent JISC-sponsored reports on researcher identifiers

Posted in identifiers, reports by Amanda Hill on 9 February, 2012

Late 2011 saw a small flurry of reports commissioned by JISC in the area of researcher identifiers, to support the work of the JISC Researcher Identifier Task and Finish Group. These reports are available from the JISC Information Environment Repository.

They are:

Researcher Identifiers Data sources report [PDF, 669Kb] by Cottage Labs

This report provides an overview of sources of data relevant to the task of creating profiles for academic researchers in the UK.

Researcher Identifiers Technical interoperability report [PDF, 506Kb] by Cottage Labs

This report discusses some of the technical aspects of implementing an identifier and profile system for researchers.

Stakeholder use cases and identifier needs: Report One [PDF, 204Kb] by Clax Limited

This report analyses UK research organisations’ use cases, needs, requirements and roles for an identifier system for researchers.

Stakeholder use cases and identifier needs: Report Two [PDF, 378Kb] by Clax Limited

This report investigates which technical systems would need to interoperate with any identifier infrastructure and examines the question of at what point an individual becomes a ‘researcher’.

Report on National Approaches to Researcher Identification Systems [PDF, 463Kb] by Hillbraith Limited

The remit of the report was to examine the approaches taken in other countries to the creation and maintenance of researcher identifiers.