One of the main ways we hope to build up the Names core record set is by harvesting information about researchers at the repository level. There are currently two methods by which a repository can make its data accessible for use within the Names project. The first is to submit a data extract of its researcher information that conforms to our Data Format Specification. The second requires the institution to be running EPrints 3.2.1 or above as its repository software, and was recently explored with The University of Huddersfield.
EPrints 3.2.1 and above provides semantic web support, including data export in RDF+XML. By developing specific classes to read the data output using Jena we are able to harvest data from the source to be used by our matching and disambiguation algorithms against the existing Names records. To test this out we recently collaborated with The University of Huddersfield to try and extract and disambiguate the creators from their EPrints repository.
The first, and simplest, step was to export Huddersfield’s EPrints data as RDF from their repository (http://eprints.hud.ac.uk/id/dump). Once we had done this we could easily process the resulting RDF+XML file, using our disambiguation algorithms to try to match creators identified in the document against existing individuals identified within the Names Service. Two types of creator were defined in the RDF dump: those that were internal (belonging to the institution) and those that were external (not belonging to the institution). Because the amount of disambiguating data pertaining to the external individuals was limited, we decided to process only internal creators, to help increase the accuracy of the results and to reduce the noise of creating many records with sparse information.
Processing and testing of the Huddersfield data has been an iterative process, and we used the exercise both to contribute to our records and to help improve the accuracy of our disambiguation algorithms. After an initial run we managed to identify ~550 unique individuals, but we needed to quality assure these results in order to ascertain how accurate the matching was. To do this, two reports were produced: one containing potentially mis-matched records (records which contained information from two or more individuals), and one containing potentially non-matched records (separate records which contain information about the same individual). We discovered around 300 potential mis-matches and around 200 potential non-matches.
A team at the British Library with specialist skills was made available to quality assure the results, analysing each of the potential mis-matches to see whether an actual mis-match had occurred, and analysing a sample of the potential non-matches to see whether a match should have occurred and why. The results for the mis-matches were encouraging, with 0 mis-matches found. The results for the non-matches, however, indicated that around 80% of the potential non-matches were actual non-matches.
Using this information we were able to fix a software bug, and also make further tweaks to the disambiguation algorithms to reduce the level of non-matches. After a further round of quality assurance by the British Library we discovered that we had reduced the number of non-matches to around 50%, and the remaining cases were deemed impossible to match by automated means. These final records were merged manually.
Once all identified individuals had either been matched with an existing names identifier or assigned a new one we were able to return a list of assigned names identifiers to Huddersfield.
Some of the identifiers and records created or added to as part of this exercise are listed here:
1. http://names.mimas.ac.uk/individual/46934 (a record created purely from Huddersfield data)
2. http://names.mimas.ac.uk/individual/6831 (a record which already existed, but had Huddersfield data merged into it).
Late 2011 saw a small flurry of reports commissioned by JISC in the area of researcher identifiers, to support the work of the JISC Researcher Identifier Task and Finish Group. These reports are available from the JISC Information Environment Repository.
Researcher Identifiers Data sources report [PDF, 669Kb] by Cottage Labs
This report provides an overview of sources of data relevant to the task of creating profiles for academic researchers in the UK.
Researcher Identifiers Technical interoperability report [PDF, 506Kb] by Cottage Labs
This report discusses some of the technical aspects of implementing an identifier and profile system for researchers.
Stakeholder use cases and identifier needs: Report One [PDF, 204Kb] by Clax Limited
This report analyses UK research organisations’ use cases, needs, requirements and roles for an identifier system for researchers.
Stakeholder use cases and identifier needs: Report Two [PDF, 378Kb] by Clax Limited
This report investigates which technical systems would need to interoperate with any identifier infrastructure and examines the question of at what point an individual becomes a ‘researcher’.
Report on National Approaches to Researcher Identification Systems [PDF, 463Kb] by Hillbraith Limited
The remit of the report was to examine the approaches taken in other countries to the creation and maintenance of researcher identifiers.
A report on identifiers for digital objects and authors has been made available on the website of the EU-funded DIGOIDUNA study team. The project team state that:
The final report of the study is focused on three key objectives:
1. analyzing the fundamental role of identifiers as enablers of value in e-infrastructures and presenting forward looking scenarios as examples of the benefits of a systematic usage of identifiers for digital objects and authors to locate and integrate information from multiple sources;
2. reporting the results of the analysis of the Strengths, Weaknesses, Opportunities and Threats (SWOT) associated with establishing in Europe an open, dynamic and sustainable governance of e-infrastructure using identifiers for digital objects and authors;
3. presenting the main challenges and recommendations which European Commission and other relevant stakeholders should address to develop an open and sustainable e-infrastructure for locators of digital objects and identifiers of authors supporting scientific information access, curation and preservation.
The report provides a good analysis of the requirements for establishment of an infrastructure for digital identifiers and maintains that Europe is in a good position to set up initiatives in this area. Some of the issues identified by the report (and familiar to the Names Project team) include: fragmented current approaches, lack of financial sustainability, lack of consensus and resistance to change.
As the authors point out:
…technology is not the main driver in leading this process. Any identifier solution is always used within cultural, geographical, disciplinary and organizational boundaries through a technical system and the process of reaching an agreement between parties over possibly conflicting purposes and objectives is a process which is played out at the interfaces of these boundaries.
This summer, JISC funded the Names Project to build a plugin for the EPrints software. In this post, developer Phil Cross describes his work on this.
The EPrints software has been designed to ease the process for creating add-ons and customisations for a repository. We wished to provide an automatic search of the Names API when users type author or editor details into an eprint creation form. We also wanted to be able to present disambiguating information to allow the selection of the correct author and to have the Names-assigned person URI added to the eprint metadata.
We discovered that EPrints already has a built-in autocomplete function that searches over existing repository authors and that there is also an existing creator identifier field that allows the system to identify authors of multiple eprints. We therefore created an augmented version of the existing name autocomplete script that searches the Names API. The search pulls back affiliations, fields of interest and publication details as well as name details and the person URI. When this script is inserted into the code for a specific repository, it overrides the global script, adding the new functionality. Simply removing the script returns the repository to its default behaviour.
The name details are displayed in a drop-down list together with the Names URI. Moving the cursor down the list opens a box next to each entry that contains the disambiguation information. Selecting a name adds the name details and URI to the form.
We were also able to make changes to the context-sensitive help for the author and editor fields. The altered help text contains information about the Names API search and provides a link for authors who wish to add their own details to the Names database.
The Extension Package produced can be used with any EPrints 3.x installation by unzipping the compressed package into the top directory of the chosen repository (a single EPrints installation can run multiple repositories). To disable the Names functionality, the administrator simply needs to delete the three files added. The package and instructions on how to install it are available on the Names site.
The newest version of EPrints, version 3.3, contains access to a new development called the Bazaar Store. This is an application store for the EPrints platform that enables repository administrators to install EPrints Plugins and Extensions with a single click. We have created a Bazaar Package that is a version of the Names Extension Package and this is now available in the Bazaar store for users of version 3.3.
Comments on this work are welcome and we are also interested in working with you if you would like to include details of researchers from your institution in the Names system, to help improve the data which is returned from the API and the autocompletion plugin.
The latest issue of Information Standards Quarterly (ISQ) is devoted to the topic of identifiers for people and organisations. There are featured articles on ISNI and ORCID, an update from the NISO I2 group and one from the Names Project.
We’re getting a good response from repositories and institutions who would like to provide information about their researchers to improve the data in the Names system (see previous post for details). As a consequence, a data submission specification has been drawn up by Dan Needham. This lists the mandatory and optional fields that Names needs in order to create or match records for institutional staff. It also explains the best way to format your records.
The table below shows the information we’d like to receive from institutions:
|Field|Format / notes|Status|
|---|---|---|
|Primary author family names||Mandatory|
|Primary author given names||Mandatory|
|Primary author title|Name prefix / salutation, e.g. Mrs, Dr, Sir …|Optional|
|Primary author date of birth|YYYY-MM-DD|Optional|
|Primary author date of death|YYYY-MM-DD|Optional|
|Primary author fields of interest|Semi-colon delimited list of strings describing fields of interest associated with the individual. Preferably values taken from a controlled list of terms, although this is not required.|Optional|
|Primary author home page|URL of a web page that contains information that helps identify the individual, e.g. personal homepage, institutional page, LinkedIn page|Optional|
|Primary author internal identifier|Your internally used identifier.|Optional|
|Primary author external identifiers|A list of identifiers from other providers assigned to an individual. The list should be semi-colon delimited and contain alternating values for the identifier provider and the identifier itself, i.e. <source>;<identifier>;<source>;<identifier>|Optional|
|Result publication title|Please generate a complete new row for each publication given for each author.|Mandatory|
|Year of publication|YYYY|Optional|
|Subject area|Classified subject area, may be different from author’s field of activity.|Optional|
|Co-authors|Semi-colon delimited list of co-author names. May include the primary author name if necessary. Preferably in the format <family name(s)>, <given name(s)>; however, if necessary, another format is acceptable as long as it is used consistently.|Mandatory|
There’s also an example record in the required format to illustrate this.
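For anyone preparing or checking an extract, the two delimited fields in the table above (external identifiers and co-authors) can be parsed along the following lines. This is only a minimal sketch: the function names are hypothetical, the sample values are invented, and it assumes the preferred `<family name(s)>, <given name(s)>` layout for co-authors. It is not the code used by the Names Project.

```python
def parse_external_identifiers(value: str) -> list[tuple[str, str]]:
    """Split '<source>;<identifier>;<source>;<identifier>' into (source, identifier) pairs."""
    if not value.strip():
        return []
    parts = [part.strip() for part in value.split(";")]
    # The specification asks for alternating provider/identifier values,
    # so an odd number of parts means the field is malformed.
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating <source>;<identifier> values")
    return list(zip(parts[0::2], parts[1::2]))


def parse_co_authors(value: str) -> list[tuple[str, str]]:
    """Split 'Family, Given; Family, Given' into (family, given) tuples."""
    authors = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semi-colon
        # Assumes the preferred '<family name(s)>, <given name(s)>' layout;
        # an entry without a comma is kept whole as the family name.
        family, _, given = entry.partition(",")
        authors.append((family.strip(), given.strip()))
    return authors


# Invented sample values, for illustration only.
identifiers = parse_external_identifiers("providerA;12345;providerB;abc-678")
co_authors = parse_co_authors("Smith, Jane A.; Jones, Robert")
```

Rejecting an odd-length identifier list, rather than silently dropping the last value, makes a malformed row visible before it reaches any matching step.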
The MERIT data has given us a good corpus of UK researchers’ names to use as the basis of the Names prototype. There are around 45,000 and most have institutional affiliations associated with them, too, which makes them a rich data set. What they don’t have, generally, are full names: they’re usually just surnames and initials.
This is where we need help from UK institutions to improve the data, and we’ve recently been testing this process with some information supplied by Robert Gordon University (RGU) in Aberdeen. Researchers at the university were contacted and the aim of the process explained. Having cleared things with their researchers, staff at RGU then extracted information from the OpenAIR institutional repository for staff who were willing to be involved and sent it to the Names team in table form, listing surname, forename(s), publication title, date and publication type.
This data was then matched with existing Names data. 17 names were found to match with individuals already in the database, based on names and article titles. The other names were not in the database: new Names records and persistent identifiers were created for these individuals. Quality assurance on the results of the matching process was carried out by colleagues at the British Library.
In this example, the record for R. A. Laing has been enhanced with the researcher’s full forenames and with additional publications (only the first listed publication was supplied by the MERIT database). This additional information will assist in the establishment of future matches with further sources of names data that become available to the Names team.
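To give a feel for how matching "based on names and article titles" can work when one source holds only surnames and initials, here is a rough sketch. It is an illustration under stated assumptions, not the Names disambiguation algorithm: the record layout (dictionaries with `family`, `given` and `titles` keys), the function names and the sample records are all invented for the example.

```python
import re


def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation/whitespace so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def given_names_compatible(a: str, b: str) -> bool:
    """True if two given-name strings could belong to one person.

    A single-letter token is treated as an initial and matches any name
    starting with that letter; full names must be equal.
    """
    tokens_a = re.findall(r"[a-z]+", a.lower())
    tokens_b = re.findall(r"[a-z]+", b.lower())
    if not tokens_a or not tokens_b:
        return False
    for x, y in zip(tokens_a, tokens_b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True


def is_candidate_match(existing: dict, submitted: dict) -> bool:
    """Candidate match: same family name, compatible given names, and at least one shared title."""
    same_family = existing["family"].strip().lower() == submitted["family"].strip().lower()
    shared_titles = {normalise_title(t) for t in existing["titles"]} & {
        normalise_title(t) for t in submitted["titles"]
    }
    return (
        same_family
        and given_names_compatible(existing["given"], submitted["given"])
        and bool(shared_titles)
    )


# Invented records: an initials-only entry and a fuller institutional entry.
existing = {"family": "Smith", "given": "J. A.", "titles": ["Widgets: A Survey"]}
submitted = {"family": "Smith", "given": "Jane Alice", "titles": ["Widgets - a survey", "Another Paper"]}
matched = is_candidate_match(existing, submitted)
```

A rule this simple would still need human quality assurance of the kind described above: initials written without separators (for example "JA") are not handled, and two people with the same name who co-author the same paper would be indistinguishable.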
If your institution is interested in providing similar data to improve the Names records for your researchers, then we’d love to hear from you. You can contact Dan Needham, the project’s lead developer at firstname.lastname@example.org, or if you have any questions, please email project manager Amanda Hill. We can supply you with a sample email which will introduce the project to researchers in your institution if, like RGU, you want to tell them about it.
The Names Project has recently started a new project-within-a-project to build some Names functionality into the EPrints repository software. The survey of UK repository managers undertaken by the project in 2010* showed that the largest share of respondents (41%) were using EPrints to run their institutional repositories. This finding is consistent with the OpenDOAR directory of repositories, which shows EPrints in use by 45% of the 193 repositories it lists for the country.
Incorporating Names into EPrints could be useful, therefore, for a large number of repositories in the UK. Our initial plans include developing the following features:
- Adapting the existing auto-completion field within EPrints to make a call to the Names API to bring back potential matches from the Names data (and distinguishing them from internal matches in some way)
- Associating the Names identifier with creator metadata within EPrints
- Allowing the update of existing materials within the repository to associate creator names with their Names identifier and to allow for searching by that identifier, so that all records for an individual can be retrieved, regardless of the form of the name
- Modifying output records from EPrints to include the Names identifier
- Packaging the resulting plug-in so that it can be made available through the EPrints Bazaar.
Any comments, suggestions, pitfalls you can foresee with this approach?
*Report on the survey [PDF]
How vain, without the merit, is the name!¹
At the end of October 2010, the Merit project made its cleaned-up version of the 2008 Research Assessment Exercise (RAE) data available through the project’s website. This data set includes names of the top researchers in the UK (Stephen Hawking, for example, Monica Grady or Brian Cox), with the titles of the materials that were submitted for assessment by their institutions. It seemed to be an ideal set of information for the Names Project to use, as the information includes institutional affiliations, which is not easy to track down from other data sources we’ve been investigating, such as the Zetoc table of contents data from the British Library.
Names records have now been generated for all of the individuals represented in the Merit data. This creates a core of nearly 47,000 disambiguated names of UK researchers for the project, associated with 158 institutions. As a result of the earlier work of the Merit project, the quality of the data was good. There were occasions where individuals had more than one identifier in the Merit data (when their work had been submitted by more than one institution), but these were successfully identified and merged in the disambiguation process.
Our British Library colleagues’ quality assurance process identified only one case where the system wrongly suggested a match. There were two D. J. Siveters listed in the data, one at the University of Leicester, the other at the University of Oxford, both writing on the subject of palaeontology. A little investigation revealed that these were in fact two distinct individuals: twin brothers (Derek and David) working in the same field who often co-author papers. Perhaps these two form the ultimate test of any disambiguation mechanism?
The project team are intending to share the records generated by this process with colleagues working on the ISNI (International Standard Name Identifier) to see if they can identify matches with records in the ISNI data.
¹ Homer’s Iliad, Book XVII, translation by Alexander Pope, 1715