On the second day of the Digital Author Identifier Summit, the participants spent time divided into separate groups, looking at issues of governance, interoperability and added value. I was in the Interoperability group which was concerned with identifying barriers to the interchange of digital author identifier information and recommending ‘next steps’ for the international scene.
It was a lively discussion, eventually focusing on the need for a canonical identifier for individuals at the international level. Paolo Bouquet advanced the idea that the canonical ID should be a light-weight service with a minimal set of metadata which would be sufficient to distinguish one entity from another. The first step is to identify who should provide this thin layer: both ORCID and ISNI were seen as candidate services, but ideally they should co-operate in this area. Once the ‘thin’ identifier layer is agreed upon, other identifier services would be able to map information found in their systems to the canonical ID. These lower-level systems would be able to provide various value-added services, tailored for their particular constituencies, and would have to agree standard ways of sharing data between them. (For an example, see the Names Project’s API documentation.)
Paolo demonstrated the sig.ma Semantic Information Mashup as an example of a service which could then aggregate information from other services about an individual (Paolo himself, in this case). Sig.ma illustrates part of what Cliff Lynch was talking about on Day 1: the ability to create new biography services with data from author identifier systems. Paolo’s vision gained a fair degree of support from the group, although the issue of collaboration between ISNI and ORCID was seen as a possible problem area: the two approaches have very different business models and ways of obtaining information.
The feedback from the Added Value group was that the practical steps for existing systems would be to develop local IDs for authors/contributors and to make those available to other systems. The Governance group agreed that ISNI and ORCID are part of the solution and complementary but were concerned that if they did not agree on a way of collaborating, the landscape would become fragmented. They saw the importance of aligning business models with available funding sources and thought that the data should be open and trustworthy. In the summing-up of the two days, Cliff Lynch noted that both ORCID and ISNI are relatively young services and that there is still time to provide feedback at a high level to help ensure that they evolve in the most useful direction for the communities which need them.
Brian Kelly has pulled together the tweets from the workshop and there are overall summaries of the event on the Knowledge Exchange site and by Talat Chaudhri at the JISC Innovation Support Centre blog. It was an interesting and stimulating two days (it’s not often that I get to talk for two solid days about digital author identifiers!) and I’d like to take this opportunity to thank the organisers of the event for the chance to take part.
UPDATED 11 April 2012: just to note that the Knowledge Exchange team have now published a report [PDF, 440KB] on the event.
For a meeting held in the grounds of the former Royal Mint near the Tower of London, it was probably appropriate that a lot of the discussion on the first day of the Digital Author Identifier Summit should focus on the financial aspects of building identifier systems for researchers and/or authors. An international group representing digital infrastructure specialists and people involved in building identifier systems are looking at the requirements of researchers, institutions, funders and publishers in this rapidly-evolving field. The meeting has been convened by the Knowledge Exchange, a Danish/Dutch/German/British grouping of institutions interested in using technology to improve access to research materials.
It is interesting to see how the discussion has moved on since March 2009, when many of the participants in this meeting met in Amsterdam to begin discussions in this area. Existing systems have matured since then, and back in March 2009 no-one had heard of ORCID.
Several points came up yesterday which I think are worth mentioning here. One was the notion that different users of author identifier systems have different requirements in terms of the quality and completeness of the data in those systems. So a service which covers 80% of researchers might have enough to be useful for a range of other services, even though it is not complete.
Participants were asked to imagine that they had a magic wand and could grant three wishes in relation to DAIs. Common themes quickly appeared: openness of the data was an oft-mentioned priority – the information needs to be freely available in order to build other useful services on top of it. Other popular choices were the importance of having a single identifier for an author at an international level and an agreed way of aligning national identifier services with international ones. It was agreed that the benefits to the individuals being identified should be easily demonstrated to ensure their engagement.
Group discussions in the afternoon focused on the role of DAIs from the viewpoint of suppliers of information, those needing the data and those in charge of working out how the systems should be overseen. One interesting point from the reporting of these groups was the general acceptance that digital author identifier systems are ‘resistant to traditional business models’ (Cliff Lynch) and ideally should be funded as elements of infrastructure. This is mainly because the data held in the systems needs to be freely available for re-use to make the most of having them (and to create the ‘frictionless sharing’ and ‘bridges of trust’ which were mentioned in the meeting), and because no-one is expecting individual researchers or authors to pay a fee in order to register their identifier.
Today the discussion will move on to analysing opportunities and challenges in issues of governance, interoperability and added value and maybe come up with some actions for members of this international group to take on.
One of the main avenues through which we hope to build up the Names core record set is through harvesting information about researchers at the repository level. There are currently two methods by which a repository can make their data accessible for use within the Names project. The first method is to submit their data to us by producing a data extract of their researcher information that conforms to our Data Format Specification. The second method requires the institution to be running EPrints 3.2.1 or above as their repository software, and was recently explored with The University of Huddersfield.
EPrints 3.2.1 and above provides semantic web support, including data export in RDF+XML. By developing specific classes to read the data output using Jena we are able to harvest data from the source to be used by our matching and disambiguation algorithms against the existing Names records. To test this out we recently collaborated with The University of Huddersfield to try and extract and disambiguate the creators from their EPrints repository.
The first, and simplest, step was to export Huddersfield’s EPrints data as RDF from their repository (http://eprints.hud.ac.uk/id/dump). Once we had done this we could easily process the resulting RDF+XML file, using our disambiguation algorithms to try and match creators identified in the document against existing individuals identified within the Names Service. Two types of creator were defined in the RDF dump: those that were internal (belonging to the institution) and those that were external (not belonging to the institution). Because the amount of disambiguating data pertaining to the external individuals was limited, we decided to process only internal creators, to help increase the accuracy of the results and to reduce the noise of creating many records with sparse information.
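As a rough illustration of the harvesting step described above, here is a minimal sketch in Python. It is not the project's code: the project used Jena, whereas this uses the standard library's `xml.etree.ElementTree`, which only copes with the simple typed-node form of RDF+XML shown in the sample. The sample document, the `repo.example.org` URIs and the idea that external creators can be recognised by an `ext-` prefix in their URI are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical, hand-written sample; a real dump would be far larger and
# may describe creators with different properties.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://repo.example.org/id/person/int-101">
    <foaf:familyName>Smith</foaf:familyName>
    <foaf:givenName>Jane A.</foaf:givenName>
  </foaf:Person>
  <foaf:Person rdf:about="http://repo.example.org/id/person/ext-9f3c">
    <foaf:familyName>Jones</foaf:familyName>
    <foaf:givenName>R.</foaf:givenName>
  </foaf:Person>
</rdf:RDF>
"""


def extract_creators(rdf_xml):
    """Return one dict per foaf:Person element found in the document."""
    root = ET.fromstring(rdf_xml)
    creators = []
    for person in root.iter(f"{{{FOAF}}}Person"):
        uri = person.get(f"{{{RDF}}}about", "")
        creators.append({
            "uri": uri,
            "family": person.findtext(f"{{{FOAF}}}familyName", default=""),
            "given": person.findtext(f"{{{FOAF}}}givenName", default=""),
            # Assumption for this sketch: external creators carry "ext-" in their URI.
            "internal": "/id/person/ext-" not in uri,
        })
    return creators


# Keep only internal creators, as in the exercise described above.
internal_creators = [c for c in extract_creators(SAMPLE) if c["internal"]]
```

For anything beyond a toy file, a proper RDF parser is the safer choice, since RDF+XML allows the same data to be serialised in several different shapes.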
Processing and testing of the Huddersfield data has been an iterative process, and we used the exercise both to contribute to our records and to help improve the accuracy of our disambiguation algorithms. After an initial run we managed to identify ~550 unique individuals, but we needed to quality assure these results in order to ascertain how accurate the matching was. In order to do this, two reports were produced: one containing potentially mis-matched records (records which contained information from two or more individuals), and one containing potentially non-matched records (separate records which contain information about the same individual). We discovered ~300 potential mis-matches and ~200 potential non-matches.
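To make the two reports more concrete, here is a small sketch of how such lists might be generated. The heuristics are assumptions for illustration only (a record is flagged as a potential mis-match if it holds two different full forenames; two records are flagged as a potential non-match if they share a surname and first initial); the project's actual rules are not described in this post, and the record structure is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical merged records: an id plus the (family, given) name forms merged into it.
records = [
    {"id": "A", "names": [("Smith", "Jane"), ("Smith", "J.")]},
    {"id": "B", "names": [("Smith", "John"), ("Smith", "Jane")]},
    {"id": "C", "names": [("Smith", "J. A.")]},
    {"id": "D", "names": [("Patel", "Ravi")]},
]


def is_initials(given):
    """True if every part of the given name is a single letter, e.g. 'J. A.'."""
    return all(len(part.rstrip(".")) == 1 for part in given.split())


def potential_mismatches(records):
    """Records holding more than one distinct full forename."""
    flagged = []
    for rec in records:
        full = {given.lower() for _, given in rec["names"] if not is_initials(given)}
        if len(full) > 1:
            flagged.append(rec["id"])
    return flagged


def potential_nonmatches(records):
    """Pairs of separate records that share a surname and first initial."""
    by_key = defaultdict(set)
    for rec in records:
        for family, given in rec["names"]:
            if given:
                by_key[(family.lower(), given[0].lower())].add(rec["id"])
    pairs = set()
    for ids in by_key.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


mismatch_report = potential_mismatches(records)
nonmatch_report = potential_nonmatches(records)
```

Both lists are deliberately over-inclusive: they are candidates for a human to check, not verdicts.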
A team at the British Library with specialist skills were made available to quality assure the results, analysing each of the potential mis-matches to see whether an actual mis-match occurred, and analysing a sample of the potential non-matches to see whether a match should have occurred and why. The results for the mis-matches were encouraging, with 0 mis-matches found; however, the results for the non-matches indicated that around 80% of the potential non-matches were actual non-matches, that is, separate records which should have been matched.
Using this information we were able to fix a software bug, and also make further tweaks to the disambiguation algorithms to reduce the level of non-matches. After a further round of quality assurance by the British Library we discovered that we had reduced the number of non-matches to around 50% and the remaining cases were deemed impossible to match by automated means. These final records were merged manually.
Once all identified individuals had either been matched with an existing names identifier or assigned a new one we were able to return a list of assigned names identifiers to Huddersfield.
Some of the identifiers and records created or added to as part of this exercise are listed here:
1. http://names.mimas.ac.uk/individual/46934 (a record created purely from Huddersfield data)
2. http://names.mimas.ac.uk/individual/6831 (a record which already existed, but had Huddersfield data merged into it).
Late 2011 saw a small flurry of reports commissioned by JISC in the area of researcher identifiers, to support the work of the JISC Researcher Identifier Task and Finish Group. These reports are available from the JISC Information Environment Repository.
Researcher Identifiers Data sources report [PDF, 669Kb] by Cottage Labs
This report provides an overview of sources of data relevant to the task of creating profiles for academic researchers in the UK.
Researcher Identifiers Technical interoperability report [PDF, 506Kb] by Cottage Labs
This report discusses some of the technical aspects of implementing an identifier and profile system for researchers.
Stakeholder use cases and identifier needs: Report One [PDF, 204Kb] by Clax Limited
This report analyses UK research organisations’ use cases, needs, requirements and roles for an identifier system for researchers.
Stakeholder use cases and identifier needs: Report Two [PDF, 378Kb] by Clax Limited
This report investigates which technical systems would need to interoperate with any identifier infrastructure and examines the question of at what point an individual becomes a ‘researcher’.
Report on National Approaches to Researcher Identification Systems [PDF, 463Kb] by Hillbraith Limited
The remit of the report was to examine the approaches taken in other countries to the creation and maintenance of researcher identifiers.
A report on identifiers for digital objects and authors has been made available on the website of the EU-funded DIGOIDUNA study team. The project team state that:
The final report of the study is focused on three key objectives:
1. analyzing the fundamental role of identifiers as enablers of value in e-infrastructures and presenting forward looking scenarios as examples of the benefits of a systematic usage of identifiers for digital objects and authors to locate and integrate information from multiple sources;
2. reporting the results of the analysis of the Strengths, Weaknesses, Opportunities and Threats (SWOT) associated with establishing in Europe an open, dynamic and sustainable governance of e-infrastructure using identifiers for digital objects and authors;
3. presenting the main challenges and recommendations which European Commission and other relevant stakeholders should address to develop an open and sustainable e-infrastructure for locators of digital objects and identifiers of authors supporting scientific information access, curation and preservation.
The report provides a good analysis of the requirements for establishment of an infrastructure for digital identifiers and maintains that Europe is in a good position to set up initiatives in this area. Some of the issues identified by the report (and familiar to the Names Project team) include: fragmented current approaches, lack of financial sustainability, lack of consensus and resistance to change.
As the authors point out:
…technology is not the main driver in leading this process. Any identifier solution is always used within cultural, geographical, disciplinary and organizational boundaries through a technical system and the process of reaching an agreement between parties over possibly conflicting purposes and objectives is a process which is played out at the interfaces of these boundaries.
This summer, JISC funded the Names Project to build a plugin for the EPrints software. In this post, developer Phil Cross describes his work on this.
The EPrints software has been designed to ease the process of creating add-ons and customisations for a repository. We wished to provide an automatic search of the Names API when users type author or editor details into an eprint creation form. We also wanted to be able to present disambiguating information to allow the selection of the correct author and to have the Names-assigned person URI added to the eprint metadata.
We discovered that EPrints already has a built-in autocomplete function that searches over existing repository authors and that there is also an existing creator identifier field that allows the system to identify authors of multiple eprints. We therefore created an augmented version of the existing name autocomplete script that searches the Names API. The search pulls back affiliations, fields of interest and publication details as well as name details and the person URI. When this script is inserted into the code for a specific repository, it overrides the global script, adding the new functionality. Simply removing the script returns the repository to its default behaviour.
The name details are displayed in a drop-down list together with the Names URI. Moving the cursor down the list opens a box next to each entry that contains the disambiguation information. Selecting a name adds the name details and URI to the form.
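The behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the plugin itself (which is an EPrints autocomplete script); the shape of the search results, the field names and the `names.example.org` URIs are all hypothetical.

```python
def build_suggestions(results, typed):
    """Turn search results into drop-down entries for the text typed so far."""
    typed = typed.strip().lower()
    suggestions = []
    for person in results:
        if not person["family"].lower().startswith(typed):
            continue
        suggestions.append({
            # Shown in the drop-down list: name details plus the person URI.
            "label": f'{person["family"]}, {person["given"]} ({person["uri"]})',
            # Shown in the box beside the entry to help pick the right person.
            "details": "; ".join(
                person.get("affiliations", [])
                + person.get("fields", [])
                + person.get("publications", [])
            ),
            # Values copied into the form when the entry is selected.
            "fill": {
                "family": person["family"],
                "given": person["given"],
                "id": person["uri"],
            },
        })
    return suggestions


# Hypothetical results, standing in for what a Names API search might return.
sample_results = [
    {"family": "Smith", "given": "Jane A.",
     "uri": "http://names.example.org/individual/1",
     "affiliations": ["University of Example"],
     "fields": ["Hydrology"],
     "publications": ["Rivers in flood"]},
    {"family": "Smithson", "given": "R.",
     "uri": "http://names.example.org/individual/2"},
    {"family": "Jones", "given": "P.",
     "uri": "http://names.example.org/individual/3"},
]

suggestions = build_suggestions(sample_results, "smith")
```

Keeping the display label, the disambiguation details and the values written to the form as separate parts of each entry mirrors the three things the user sees: the list, the pop-out box and the completed fields.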
We were also able to make changes to the context-sensitive help for the author and editor fields. The altered help text contains information about the Names API search and provides a link for authors who wish to add their own details to the Names database.
The Extension Package produced can be used with any EPrints 3.x installation by unzipping the compressed package into the top directory of the chosen repository (a single EPrints installation can run multiple repositories). To disable the Names functionality, the administrator simply needs to delete the three files added. The package and instructions on how to install it are available on the Names site.
The newest version of EPrints, version 3.3, contains access to a new development called the Bazaar Store. This is an application store for the EPrints platform that enables repository administrators to install EPrints Plugins and Extensions with a single click. We have created a Bazaar Package that is a version of the Names Extension Package and this is now available in the Bazaar store for users of version 3.3.
Comments on this work are welcome and we are also interested in working with you if you would like to include details of researchers from your institution in the Names system, to help improve the data which is returned from the API and the autocompletion plugin.
The latest issue of Information Standards Quarterly (ISQ) is devoted to the topic of identifiers for people and organisations. There are featured articles on ISNI and ORCID, an update from the NISO I2 group and one from the Names Project.
We’re getting a good response from repositories and institutions who would like to provide information about their researchers to improve the data in the Names system (see previous post for details). As a consequence, a data submission specification has been drawn up by Dan Needham. This lists the mandatory and optional fields that Names needs in order to create or match records for institutional staff. It also explains the best way to format your records.
The table below shows the information we’d like to receive from institutions:
|Field|Description|Status|
|---|---|---|
|Primary author family names||Mandatory|
|Primary author given names||Mandatory|
|Primary author title|Name prefix / salutation e.g. Mrs, Dr, Sir …|Optional|
|Primary author date of birth|YYYY-MM-DD|Optional|
|Primary author date of death|YYYY-MM-DD|Optional|
|Primary author fields of interest|Semi-colon delimited list of strings describing fields of interest associated with the individual. Preferably values taken from a controlled list of terms, although this is not required.|Optional|
|Primary author home page|URL of web page that contains information that helps identify the individual e.g. personal homepage, institutional page, LinkedIn page|Optional|
|Primary author internal identifier|Your internally used identifier.|Optional|
|Primary author external identifiers|A list of identifiers from other providers assigned to an individual. The list should be semi-colon delimited and contain alternating values for the identifier provider and the identifier itself, i.e. <source>;<identifier>;<source>;<identifier>|Optional|
|Result publication title|Please generate a complete new row for each publication given for each author.|Mandatory|
|Year of publication|YYYY|Optional|
|Subject area|Classified subject area, may be different from author’s field of activity.|Optional|
|Co-authors|Semi-colon delimited list of co-author names. May include the primary author name if necessary. Preferably in the format <family name(s)>, <given name(s)>; if necessary, another format is acceptable as long as it is used consistently.|Mandatory|
There’s also an example record in the required format to illustrate this.
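If you are preparing an extract, a quick check along these lines may help before sending it. This is a minimal sketch in Python, not an official validator: it assumes each row has already been read into a dictionary keyed by the field names in the table, and the source names and identifiers in the example are made up.

```python
# The four fields marked Mandatory in the table above.
MANDATORY = [
    "Primary author family names",
    "Primary author given names",
    "Result publication title",
    "Co-authors",
]


def parse_external_identifiers(value):
    """Split '<source>;<identifier>;<source>;<identifier>' into (source, identifier) pairs."""
    parts = [p.strip() for p in value.split(";") if p.strip()]
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating <source>;<identifier> values")
    return list(zip(parts[0::2], parts[1::2]))


def missing_mandatory(row):
    """Return the names of mandatory fields that are absent or blank in a row."""
    return [field for field in MANDATORY if not row.get(field, "").strip()]


# Hypothetical row: one publication for one author.
row = {
    "Primary author family names": "Smith",
    "Primary author given names": "Jane Alice",
    "Primary author external identifiers": "SourceA;12345;SourceB;abc-678",
    "Result publication title": "Rivers in flood",
    "Year of publication": "2010",
    "Co-authors": "",
}

identifiers = parse_external_identifiers(row["Primary author external identifiers"])
problems = missing_mandatory(row)
```

Here `problems` flags the empty co-author list, which the specification treats as mandatory.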
The MERIT data has given us a good corpus of UK researchers’ names to use as the basis of the Names prototype. There are around 45,000 and most have institutional affiliations associated with them, too, which makes them a rich data set. What they don’t have, generally, are full names: they’re usually just surnames and initials.
This is where we need help from UK institutions to improve the data and we’ve recently been testing this process with some information supplied by Robert Gordon University (RGU) in Aberdeen. Researchers at the university were contacted and the aim of the process explained. Having cleared things with their researchers, staff at RGU then extracted information from the OpenAIR institutional repository for staff that were willing to be involved and sent them to the Names team in table form, listing surname, forename(s), publication title, date and publication type.
This data was then matched with existing Names data. 17 names were found to match with individuals already in the database, based on names and article titles. The other names were not in the database: new Names records and persistent identifiers were created for these individuals. Quality assurance on the results of the matching process was carried out by colleagues at the British Library.
In this example, the record for R. A. Laing has been enhanced with the researcher’s full forenames and with additional publications (only the first listed publication was supplied by the MERIT database). This additional information will assist in the establishment of future matches with further sources of names data that become available to the Names team.
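The matching on names and article titles described above might be sketched like this. It is an illustration under stated assumptions, not the Names matching algorithm: the rule used here (same surname, compatible initials and at least one shared title after normalisation) and the record structure are hypothetical, as are the sample names and titles.

```python
import re


def norm_title(title):
    """Lower-case a title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def initials(given):
    """'R. A.' -> ['r', 'a']; 'Rosalind Anne' -> ['r', 'a']."""
    return [part[0].lower() for part in re.split(r"[\s.]+", given) if part]


def names_compatible(a, b):
    """Same surname, and the shorter list of initials is a prefix of the longer."""
    if a["family"].lower() != b["family"].lower():
        return False
    ia, ib = initials(a["given"]), initials(b["given"])
    n = min(len(ia), len(ib))
    return n > 0 and ia[:n] == ib[:n]


def is_match(existing, incoming):
    """Match when the names are compatible and at least one title is shared."""
    shared = ({norm_title(t) for t in existing["titles"]}
              & {norm_title(t) for t in incoming["titles"]})
    return names_compatible(existing, incoming) and bool(shared)


# Hypothetical records: one with surname and initials only, one from a repository.
existing = {"family": "Bloggs", "given": "R. A.",
            "titles": ["A study of widgets"]}
incoming = {"family": "Bloggs", "given": "Rosalind Anne",
            "titles": ["A Study of Widgets.", "Widgets revisited"]}
unrelated = {"family": "Bloggs", "given": "Robert",
             "titles": ["Gadgets in practice"]}
```

With these samples, `is_match(existing, incoming)` holds, so the existing record could be enhanced with the full forenames and the extra publication, while `unrelated` would become a new record because no title is shared.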
If your institution is interested in providing similar data to improve the Names records for your researchers, then we’d love to hear from you. You can contact Dan Needham, the project’s lead developer at firstname.lastname@example.org, or if you have any questions, please email project manager Amanda Hill. We can supply you with a sample email which will introduce the project to researchers in your institution if, like RGU, you want to tell them about it.
The Names Project has recently started a new project-within-a-project to build some Names functionality into the EPrints repository software. The survey of UK repository managers undertaken by the project in 2010* showed that the largest share of respondents (41%) were using EPrints to run their institutional repositories. This finding is consistent with the OpenDOAR directory of repositories, which shows EPrints in use by 45% of the 193 repositories it lists for the country.
Incorporating Names into EPrints could be useful, therefore, for a large number of repositories in the UK. Our initial plans include developing the following features:
- Adapting the existing auto-completion field within EPrints to make a call to the Names API to bring back potential matches from the Names data (and distinguishing them from internal matches in some way)
- Associating the Names identifier with creator metadata within EPrints
- Allowing the update of existing materials within the repository to associate creator names with their Names identifier and to allow for searching by that identifier, so that all records for an individual can be retrieved, regardless of the form of the name
- Modifying output records from EPrints to include the Names identifier
- Packaging the resulting plug-in so that it can be made available through the EPrints Bazaar.
Any comments, suggestions, pitfalls you can foresee with this approach?
*Report on the survey [PDF]