Names Project Blog

Importing The University of Huddersfield’s researcher information.

Posted in data, EPrints, institutions by Daniel Needham on 17 February, 2012

One of the main avenues through which we hope to build up the Names core record set is harvesting information about researchers at the repository level. There are currently two methods by which a repository can make its data accessible for use within the Names project. The first is to submit a data extract of its researcher information that conforms to our Data Format Specification. The second requires the institution to be running EPrints 3.2.1 or above as its repository software, and was recently explored with The University of Huddersfield.

EPrints 3.2.1 and above provide semantic web support, including data export in RDF+XML. By developing specific classes that read this output using Jena, we are able to harvest data from the source and run it through our matching and disambiguation algorithms against the existing Names records. To test this out we recently collaborated with The University of Huddersfield to extract and disambiguate the creators from their EPrints repository.

The first, and simplest, step was to export Huddersfield’s EPrints data as RDF from their repository (http://eprints.hud.ac.uk/id/dump). Once we had done this we could easily process the resulting RDF+XML file, using our disambiguation algorithms to try to match creators identified in the document against existing individuals identified within the Names Service. Two types of creator were defined in the RDF dump: those that were internal (belonging to the institution) and those that were external (not belonging to the institution). Because the amount of disambiguating data pertaining to the external individuals was limited, we decided to process only internal creators, both to increase the accuracy of the results and to reduce the noise of creating many files with sparse information.
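As a rough illustration of the filtering step described above, here is a minimal sketch in Python using only the standard library. It is not the Jena-based code used by the project. The sample RDF, the `ex:affiliation` property and the `internal_creators` function are assumptions made up for this example; the post does not say how the Huddersfield dump actually marks internal and external creators, and a real RDF parser copes with serialisations that this simple XML walk would miss.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"
# Hypothetical namespace, invented for this sketch only.
EX_NS = "http://example.org/names-sketch#"

# Made-up sample data standing in for a small piece of an RDF+XML dump.
SAMPLE_DUMP = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:ex="http://example.org/names-sketch#">
  <foaf:Person rdf:about="http://example.org/person/1">
    <foaf:name>A. Researcher</foaf:name>
    <ex:affiliation>internal</ex:affiliation>
  </foaf:Person>
  <foaf:Person rdf:about="http://example.org/person/2">
    <foaf:name>B. Collaborator</foaf:name>
    <ex:affiliation>external</ex:affiliation>
  </foaf:Person>
</rdf:RDF>
"""


def internal_creators(rdf_xml):
    """Return the URI and name of every creator marked as internal."""
    root = ET.fromstring(rdf_xml)
    creators = []
    # Only top-level foaf:Person nodes are considered in this sketch.
    for person in root.findall(f"{{{FOAF_NS}}}Person"):
        affiliation = person.findtext(f"{{{EX_NS}}}affiliation", default="")
        if affiliation.strip() != "internal":
            continue  # skip external creators, which carry too little data
        creators.append({
            "uri": person.get(f"{{{RDF_NS}}}about"),
            "name": person.findtext(f"{{{FOAF_NS}}}name", default="").strip(),
        })
    return creators


if __name__ == "__main__":
    for creator in internal_creators(SAMPLE_DUMP):
        print(creator["uri"], creator["name"])
```

Only the internal creators returned by a step like this would then be passed on to matching and disambiguation.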

Processing and testing of the Huddersfield data has been an iterative process, and we used the exercise both to contribute to our records and to help improve the accuracy of our disambiguation algorithms. After an initial run we managed to identify ~550 unique individuals, but we needed to quality assure these results in order to ascertain how accurate the matching was. To do this, two reports were produced: one containing potentially mis-matched records (records which contained information from two or more individuals), and one containing potentially non-matched records (separate records which contain information about the same individual). We discovered around 300 potential mis-matches and 200 potential non-matches.
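The post does not describe how the two reports were generated, so the following is only a sketch of one plausible heuristic, written in Python. The record layout, the function names and the name-based rules are all assumptions for illustration, and the sample names are invented. Names are assumed to be in "Forename Surname" order.

```python
from collections import defaultdict
from itertools import combinations


def normalise(name):
    """Lower-case a name and collapse punctuation and whitespace."""
    return " ".join(name.replace(".", " ").replace(",", " ").lower().split())


def potential_mismatches(records):
    """Flag records whose name variants disagree on surname.

    Such a record may contain information from two or more individuals.
    """
    flagged = []
    for rec in records:
        surnames = set()
        for name in rec["names"]:
            tokens = normalise(name).split()
            if tokens:
                surnames.add(tokens[-1])  # assumes the surname comes last
        if len(surnames) > 1:
            flagged.append(rec["id"])
    return flagged


def potential_nonmatches(records):
    """Flag pairs of separate records that share a normalised name.

    Such a pair may describe the same individual.
    """
    ids_by_name = defaultdict(set)
    for rec in records:
        for name in rec["names"]:
            key = normalise(name)
            if key:
                ids_by_name[key].add(rec["id"])
    pairs = set()
    for ids in ids_by_name.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


# Invented records, purely for demonstration.
RECORDS = [
    {"id": 1, "names": ["J. Smith", "J. Jones"]},
    {"id": 2, "names": ["Ann Taylor"]},
    {"id": 3, "names": ["ann taylor", "A. Taylor"]},
]

if __name__ == "__main__":
    print("Potential mis-matches:", potential_mismatches(RECORDS))
    print("Potential non-matches:", potential_nonmatches(RECORDS))
```

Heuristics like these only produce candidates. As the figures above suggest, each flagged record or pair still needs checking by a person before it is treated as a real error.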

A team at the British Library with specialist skills was made available to quality assure the results, analysing each of the potential mis-matches to see whether an actual mis-match had occurred, and analysing a sample of the potential non-matches to see whether a match should have occurred and why. The results for the mis-matches were encouraging, with 0 mis-matches found. The results for the non-matches, however, indicated that around 80% of the potential non-matches were actual non-matches.

Using this information we were able to fix a software bug, and also make further tweaks to the disambiguation algorithms to reduce the level of non-matches. After a further round of quality assurance by the British Library, we discovered that we had reduced the number of non-matches to around 50%, and the remaining cases were deemed impossible to match by automated means. These final records were merged manually.

Once all identified individuals had either been matched with an existing Names identifier or assigned a new one, we were able to return a list of assigned Names identifiers to Huddersfield.

Some of the identifiers and records created or added to as part of this exercise are listed here:

1. http://names.mimas.ac.uk/individual/46934 (a record created purely from Huddersfield data)
2. http://names.mimas.ac.uk/individual/6831 (a record which already existed, but had Huddersfield data merged into it)

2 Responses

  1. Nick said, on 23 February, 2012 at 9:45 am

    Hi Daniel

    Might there also be an opportunity to work with commercial research management systems like Atira PURE and Symplectic?

    At Leeds Metropolitan, for example, we are in the process of implementing Symplectic alongside our intraLibrary repository; HR data is fed into Symplectic and can easily be exposed via an API.

    Speaking also as Technical Officer for UKCoRR – http://ukcorr.org/ – I think this may be an approach that our membership would be interested in exploring – especially as many are in the process of implementing these types of systems alongside their repository.

    For reference I’ve created a Google doc “CRIS + Repositories at UK Universities” at: https://docs.google.com/document/d/1BQadMoMXbKHucuzlBpJajQVuJTrxWzJ-gR_K2nhyniQ/edit (a little old so possibly needs updating)

    Regards

    Nick

    • Daniel Needham said, on 23 February, 2012 at 11:34 am

      Hi Nick,

      We’d certainly be interested in looking at how data might be acquired from commercial research management systems. EPrints provided a good test case for interacting with external stores containing researcher information as it has been widely adopted, and more recent versions come with the ability to provide data as RDF. Similarly we’ve been able to work on a number of plugins for EPrints that enable users to search for Names records from within EPrints, and also use Names Identifiers to identify themselves.
      Having said that, we’ve designed our back-end (the bit that attempts to automatically identify and disambiguate individuals within datasets) to be as flexible as possible, as we are trying to draw information from a wide variety of different data sources and we need to be able to acquire that data through a variety of means (APIs, triple stores, raw data dumps). So if there is an API available that exposes your researcher information then we’d probably be able to tailor our system to access it.

      We’re also planning to look at working with other repository systems in due course such as DSpace and Fedora.

      And finally we’re hoping to implement a CERIF output of our data at some point, which hopefully will be of use in the ‘CRIS-sphere’.

      If you’d be interested in providing us with access to your researcher information then feel free to get in touch. My email address is on the help page of our main site.

      Phew…

      Thanks for getting in touch,

      Dan

