The source code for the Names project’s disambiguation service and user front end is now available from the Bitbucket code-sharing service. The various components are:
The user interface to the Names data has been updated to use the code available through Bitbucket. Here’s an example of a Names record in the new web view of the data:
The Names project came to an end in July 2013 – the final report [PDF, 525KB] is available here. The conclusions of the report were:
- The Names project has demonstrated that automated or semi-automated solutions can be applied to bulk-process complex authority control tasks traditionally undertaken by cataloguers on an item by item basis. This approach offers the potential to extend authority control to types of resource, such as journal articles, which have previously been neglected on grounds of cost.
- The quality of the outcome is directly affected by the range and quality of the metadata available. Publishing conventions, such as use of initials rather than full names, hinder accurate identification and comprehensive disambiguation of individuals. Human intervention is still necessary, but filtering enables the human intervention to be focused on ambiguous and anomalous identities.
- Retrospective author disambiguation is complex and costly, even when partially automated and should be regarded as the solution to a legacy problem rather than the preferred way forward. The Names database and the components of the Names system are resources which can be used by other services to improve their own efficiency.
- Integration between national systems such as Names and international services like ISNI is possible, with the national system offering the opportunity of liaising with institutions to feed data into the international level and with the potential for saving the research community the fees for institutional membership for ORCID and registration agency costs for ISNI. Further investment in Names would be required to establish an automatic updating mechanism between the Names system and ISNI and/or ORCID.
- The major achievements of the Names project have been the development of the disambiguation algorithm and the quality assurance process for the resulting data. These have enabled the creation of a useful set of information in the Names database which offers free and flexible access to its contents. By making the database structure, the data, and the disambiguation algorithm available through a code-hosting service, it will be possible for other services to make use of these elements in the future. It should be noted that the quality assurance expertise provided by the Names project team is not something that can be made available externally.
As we wind up the project, I would like to acknowledge the huge amount of work that Dan Needham at Mimas has put into developing this code and into sharing it so that others can benefit from his expertise in this area. Also, many thanks to our colleagues in the British Library: Alan Danskin, Stephen Andrews, Michael Docherty, Alison Wood, Richard Moore, Susan Skaife, Jasper Jackson and Andrew MacEwan whose time and efforts contributed to the success of Names, particularly in the development of a data model and in the quality assurance of data. They also helped to ensure that the results of the project live on in the form of ISNI identifiers for many UK researchers.
The Names team have just finished processing data from the Open University’s Open Research Online repository. When the researchers’ names from Open Research Online were added to the existing Names data, there were 50,002 individuals identified in the Names system.
The matching algorithm developed by the Names project’s Dan Needham does a good job of comparing new names to those already in the system, matching up individuals based on their names, affiliations and the titles of their papers. The algorithm errs on the side of caution, however, to avoid wrongly matching people. This means that some individuals who are already in the system might not be matched up correctly.
As a result, with the OU data, we had some 850 names (out of 2,243) to check against potential matches. We first checked a 10% sample, which showed that around 12% of the potential matches were actual matches, so most of them were not. To ensure the quality of the data, we decided to check the whole batch, and this manual process determined that 108 of the possible matches did indeed match existing individuals in the Names system. Human intervention is the best way of ensuring the quality of data in these cases – automation can achieve a fair degree of accuracy in matching individuals, but in some cases it’s essential to have a person looking at potential matches to determine whether they really are a match or not. Sometimes it is obvious, but there were several in this batch where some additional research was needed to be absolutely sure.
The matching of those individuals left us with a total of 49,894 uniquely identified individuals in the Names database. It would have been nice to have been over the 50,000 mark – but the data would have been poorer quality if we’d left it as it was…
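For anyone who wants to follow the numbers, here is a minimal sketch of the arithmetic behind the OU batch. All figures come from this post; the variable names are ours, chosen only for illustration.

```python
# Figures quoted in this post (variable names are illustrative only).
potential_matches = 850      # OU names flagged for manual checking
total_before_merge = 50_002  # individuals in Names after loading the OU data
confirmed_matches = 108      # duplicates confirmed by the manual check

# A 10% sample of the flagged names, and what a 12% hit rate would
# predict for the whole batch.
sample_size = potential_matches * 10 // 100
estimated_matches = potential_matches * 12 // 100

# Hit rate actually found once the whole batch had been checked.
actual_rate = confirmed_matches / potential_matches

# Each confirmed match merges two records into one.
total_after_merge = total_before_merge - confirmed_matches

print(f"sample size: {sample_size}")
print(f"estimated matches from sample: {estimated_matches}")
print(f"actual match rate: {actual_rate:.1%}")
print(f"individuals after merging: {total_after_merge:,}")
```

The full check found slightly more matches than the sample predicted, which is one more argument for checking the whole batch rather than relying on an estimate.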
P.S. Come to think of it, if we include the identifiers for the 158 research institutions in Names, then we are over the 50,000 (50,042 to be precise). Yay!
A press release from Ringgold this week and a blog post from ORCID tell us that ORCID will be using Ringgold’s institutional identifiers (soon to be converted into ISNI institutional identifiers) as a means of recording institutional affiliations of researchers identified in the ORCID system. This is a promising step towards interoperability of identifiers at the institutional level, at least (although departments and research groups are a whole other problem!).
Last month the work of the NISO I2 (Institutional Identifier) group culminated in the publication of a NISO Recommended Practice document entitled Institutional Identification: Identifying Organizations in the Information Supply Chain [PDF]. The I2 group was established in 2008 with the task of looking at the issue of uniquely identifying institutions and other organizations.
From the report:
As the digital information landscape grows increasingly crowded and customized, and as institutions achieve economies of scale through increased collaboration, the need to unambiguously identify organizations engaged in any aspect of information acquisition, supply, archiving, and discovery becomes a critical enabler for efficient and trustworthy information practices.
The use of the International Standard Name Identifier (ISNI) (ISO 27729) for institutional identification is recommended to achieve both of these goals.
Of the 157 UK research institutions currently identified within Names, 93 also already have ISNIs (in other words they were already identified as creators or publishers in library systems which have contributed data to ISNI). We have now added those ISNIs to the Names records of those institutions and will be requesting identifiers for the remaining 64 institutions in the coming weeks.
Our aim by the end of the current phase of the project (July 2013) is to have ISNIs assigned to all of the individuals and organisations identified within Names. ISNI disambiguates and assigns unique identifiers to institutions and individuals internationally. Where ORCID provides a service for individuals to identify themselves, ISNI relies on data from third parties and combines it to create merged records. This means that, in contrast to ORCID, it can include records for organisations and for individuals who may be unable (or unwilling) to manage an online profile.
In the past few weeks the Names team have been working with colleagues at the London School of Economics to uniquely identify individuals who have been involved in research at their institution. As with our previous work with the University of Huddersfield, this involved analysing the contents of LSE’s institutional repository, LSE Research Online.
By processing the RDF data which is automatically provided by the repository’s EPrints software, we were able to compare the information within it against the existing information in Names about LSE authors. Where individuals had already been identified from the Merit 2008 Research Assessment Exercise data, the repository information usually provided additional details to augment the Names records, including first names and other titles of papers that individuals had worked on. For individuals who were not already in Names, we created new records and assigned identifiers to them.
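To give a flavour of this step, here is a rough sketch of pulling author names and paper titles out of RDF/XML using only the Python standard library. It is not the Names project’s code: the sample record, the example.org URIs and the choice of `dct:creator`, `dct:title` and `foaf:name` as the relevant properties are all assumptions made for illustration, and a real repository export will be richer than this.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes assumed for this sketch.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dct": "http://purl.org/dc/terms/",
}

# Hypothetical sample data, not taken from any real repository.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dct="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://repository.example.org/id/eprint/1">
    <dct:title>An example paper</dct:title>
    <dct:creator rdf:resource="http://repository.example.org/id/person/a1"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://repository.example.org/id/eprint/2">
    <dct:title>A second example paper</dct:title>
    <dct:creator rdf:resource="http://repository.example.org/id/person/a1"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://repository.example.org/id/person/a1">
    <foaf:name>Jane Example</foaf:name>
  </rdf:Description>
</rdf:RDF>"""


def titles_by_author(rdf_xml: str) -> dict:
    """Map each author's name to the titles of the papers that cite them as creator."""
    root = ET.fromstring(rdf_xml)
    about = f"{{{NS['rdf']}}}about"
    resource = f"{{{NS['rdf']}}}resource"

    # First pass: collect the name attached to each person URI.
    names = {}
    for node in root.findall("rdf:Description", NS):
        name = node.find("foaf:name", NS)
        if name is not None and name.text:
            names[node.get(about)] = name.text.strip()

    # Second pass: attach each paper title to its creators.
    result = {}
    for node in root.findall("rdf:Description", NS):
        title = node.find("dct:title", NS)
        if title is None or not title.text:
            continue
        for creator in node.findall("dct:creator", NS):
            person = names.get(creator.get(resource))
            if person:
                result.setdefault(person, []).append(title.text.strip())
    return result


print(titles_by_author(SAMPLE))
```

A list of titles per author is the kind of evidence that can then be compared against the titles already held in an existing record.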
The Names disambiguation algorithm does a good job of automatically matching information from repository data with existing Names records, but it is configured to err on the side of caution in making matches to avoid making false connections between individuals who may have similar names but are not the same. This creates some extra work for the quality assurance process (which is undertaken by our colleagues at the British Library), as it generates a list of potential matches which have to be checked manually. This is worth doing, however, as it ensures that the resulting data is more reliable than it would be with just an automated check. The more data is added to Names, the smoother the matching process becomes, as there is more information in the system to compare against each new source of data.
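The idea of erring on the side of caution can be illustrated with a small sketch. This is not the Names algorithm itself: the `Person` record, the string-similarity measure and both thresholds are hypothetical, chosen only to show how a cautious matcher can send borderline cases to a human checker instead of merging them automatically.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Person:
    """Hypothetical record: a name, an affiliation and some paper titles."""
    name: str
    affiliation: str
    titles: set = field(default_factory=set)


def name_similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; a stand-in for real name comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(new: Person, existing: Person,
             auto_threshold: float = 0.95,
             review_threshold: float = 0.70) -> str:
    """Return 'match', 'review' or 'new' for a candidate pair.

    The thresholds are illustrative assumptions, not values used by Names.
    """
    score = name_similarity(new.name, existing.name)
    same_affiliation = new.affiliation.lower() == existing.affiliation.lower()
    shared_title = bool({t.lower() for t in new.titles}
                        & {t.lower() for t in existing.titles})

    # Merge automatically only when the names agree closely AND both the
    # affiliation and at least one paper title corroborate the match.
    if score >= auto_threshold and same_affiliation and shared_title:
        return "match"
    # Plausible but unconfirmed: leave the decision to a human checker.
    if score >= review_threshold and (same_affiliation or shared_title):
        return "review"
    return "new"


existing = Person("John Smith", "LSE", {"Paper A"})
print(classify(Person("John Smith", "LSE", {"Paper A", "Paper B"}), existing))
print(classify(Person("J. Smith", "LSE", {"Paper C"}), existing))
print(classify(Person("Mary Jones", "OU"), existing))
```

Raising the automatic threshold, or demanding more corroborating evidence, lengthens the review list but lowers the risk of wrongly merging two different people, which is the trade-off described above.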
In the record below, the original Merit record has been enhanced with information about the individual’s first name and with the identifier from the LSE repository. Already this person has four separate identifiers assigned to him: the local LSE one, the national Merit one derived from the 2008 Research Assessment Exercise, the Names identifier (15711) and the international ISNI identifier. We’re also currently investigating the best way of linking this data up with the other big international initiative, ORCID.
Colleagues at LSE plan to add the Names identifiers to their local name authority file for use within the institution. I’d like to note here that working in collaboration with the LSE staff helped to improve data both in Names and at the repository. The experience has also helped us to speed up and fine-tune the quality assurance process at the Names end.
In total there are now 1,005 individuals identified in Names who are affiliated with LSE. 463 of these were new identities created from information in LSE Research Online and 413 were existing Names records which have been improved with additional information from the LSE repository.
Yesterday saw an important milestone in the progress of the ORCID researcher and contributor identifier initiative, as the service was launched to the public. You can now register for an ORCID and use the built-in CrossRef search to choose publications that you have written to link to your identifier.
Following on from the launch, the ORCID team hosted a meeting today in Berlin to celebrate the launch and to share news about recent developments in the work of the ORCID team and the broader ORCID stakeholder community.
Howard Ratner gave an overview of the history of ORCID to date to kick off the meeting. The initiative started back in 2009 (we reported on the first stakeholder meeting in this blog post). The first phase of ORCID is aimed squarely at individual researchers and by lunchtime today over 850 people had signed up for an identifier in the system. Future phases will look at options for importing records from other trusted sources and conversations with ORCID representatives at the meeting today confirmed our feeling that the disambiguated data within Names would be a good set for ORCID to work with, especially as the records already contain ISNIs (International Standard Name Identifiers) which are in the same form as ORCIDs. Watch this space for updates on that!
A number of systems which had implemented some kind of ORCID integration were demonstrated at the meeting, including ImpactStory, a site which measures the bookmarking, sharing and saving of publications, slideshows and data sets from a range of different websites, including ORCID, SlideShare, Dryad, GitHub and Google Scholar.
After a keynote from JISC’s Josh Brown, a panel session in the afternoon discussed the relationship between international initiatives such as ORCID and ISNI and specialist or national services operating in the same space. Magchiel Bijsterbosch of the SURF foundation in the Netherlands talked about the situation there, where there is a fairly mature author identification system. Magchiel raised a number of challenges faced by national systems but concluded that there would probably still be a role for such systems in the area of disambiguation. There was some consensus in the meeting that ORCID might want to delegate disambiguation to the wider community, particularly to local experts, rather than attempt to take on this role itself for the whole world.
The Names Project team have been collaborating with colleagues at OCLC and the British Library who are part of an international partnership developing the International Standard Name Identifier (ISNI) system. ISNI identifies a wide range of entities, including organisations and individuals. The identifier is based on an International Organization for Standardization (ISO) standard which was published in March of this year (press release).
ISNIs take the form of 16-digit numbers which are associated with a name and some basic identifying information such as (for an individual) the name of a publication. Over 40,000 individuals who have been uniquely identified in the Names system have now been assigned ISNIs and in the last week those identifiers were imported into the Names database and are now available.
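One detail worth knowing if you are handling these identifiers in code: under ISO 27729 the sixteenth position is a check character, calculated with the ISO 7064 MOD 11-2 scheme, so it can be the letter X as well as a digit. The sketch below shows one way to test whether an identifier is well formed. The function names are ours, and the sample value is the example identifier used in ORCID’s documentation (ORCID uses the same scheme) rather than a real ISNI from Names.

```python
import re


def check_character(first15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in first15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed(identifier: str) -> bool:
    """True if the identifier has 16 characters and a correct check character.

    Spaces and hyphens used for display are ignored. This checks the form
    only; it does not tell us whether the identifier has been assigned.
    """
    compact = identifier.replace(" ", "").replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return compact[15] == check_character(compact[:15])


print(is_well_formed("0000 0002 1825 0097"))  # sample identifier from ORCID's documentation
print(is_well_formed("0000 0002 1825 0098"))  # same digits with a wrong check character
```

A check like this catches most typing and copying errors before an identifier is stored against a record.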
The record below shows how Names records now bring together information on a variety of identifiers. This individual, John Albarran, has three external identifiers associated with him. The MERIT identifier is the one derived from the 2008 Research Assessment Exercise data processed by Names; an identifier within a national system. The UWE identifier is a local identifier from the University of the West of England’s Research Repository, while the ISNI is the first international identifier for this individual. These are all associated with John Albarran’s Names identifier (http://names.mimas.ac.uk/individual/885.html).
The ISNI database is designed to be relatively lightweight, so the information available there is less comprehensive than that in the Names system, as can be seen in the screenshot below:
As ISNI is interested in organisations as well as individuals, there is also an identifier for the University of the West of England in the ISNI database:
The ISNI database holds records derived from OCLC’s Virtual International Authority File, which brings together data on individuals identified in national library authority files. The Names records, and information from ProQuest’s Scholar Universe system, extend the ISNI data into the realm of identifying authors of articles as well as those individuals who have been involved in writing books.
We have to accept that there will be many identifiers associated with individuals during the course of their careers. For UK researchers, Names provides a place where institutional identifiers can be linked to a national one. On an international level, ISNI provides an equivalent service for linking together national identifiers.
The Names Project was represented at an event in Barcelona today which looked at the role of author identifiers and ways of integrating them into the procedures of institutions, and institutional repositories in particular. A number of different perspectives were presented at the event, including publishers, funders and identifier providers. There are videos of the talks available:
Martin Fenner on ORCID:
Me on the Names Project:
Gerry Lawson on the funder’s perspective:
Some interesting statistics emerged from Gerry Lawson’s talk concerning the number of researchers in Europe. All European governments have to report on the numbers of full-time-equivalent researchers in higher education, business, government and non-profit sectors. These figures are available from the Eurostat site and cover the years 1999 to 2010. The figures for the UK between 2005 and 2010 are fairly static, ranging from a high of 254,009 in 2006 to a low of 235,373 in 2010. Germany has the highest number of researchers, at 327,500, with a noticeable increase in numbers each year. For the EU as a whole, the figure for 2010 is over 1.5 million researchers.
The number of individuals represented by these figures will be higher than the total of FTEs, of course, but at least this gives us an idea of the number of people who may ultimately need to be covered by services like Names (and some confidence in the figure of 20% of UK researchers that we’ve been estimating that Names currently holds). Philip Purnell’s presentation also gave us some figures for the number of researchers who have registered with the ResearcherID service from Thomson Reuters. For the UK, this is currently 14,033.
It will be interesting to see how many researchers sign up for an ORCID identifier when the service launches on 15th October. One of the options that will be offered is the ability to transfer information from an existing ResearcherID into an ORCID, which will be useful for those researchers who are already registered in that service. ResearcherID also offers institutions the option of assigning IDs in batches, free of charge, to their researchers. This differs from the ORCID model, which will allow institutions to submit ORCIDs in bulk only if they are ORCID members (at an annual cost of $5,000 for small institutions). Martin Fenner suggested that small institutions might want to encourage their researchers to register themselves, as this process is free of charge.
Last week new data was added to Names from the Research Repository of the University of the West of England. This repository runs on the EPrints platform and we extracted information from its RDF output as we did for the University of Huddersfield’s repository earlier this year.
With the help of the quality assurance team at the British Library, 786 Names records were either created or enhanced with information from the UWE repository. In many cases for existing records we have been able to add first names where we previously only held initials, for example.
In total there are around 821 individuals with an affiliation to UWE who now have a unique identifier within the Names system.
The next data source we’ll be investigating is the aggregated data in the Institutional Repository Search service – but we’re always keen to work with individual repositories, so if you’d like to get your contributors included in Names, please get in touch.
Three members of the Names Project team attended the Open Repositories conference, OR2012, in Edinburgh this week. It’s a really packed conference, with fascinating sessions, a Developers’ Challenge and this year a side helping of Repository Fringe with its challenging format of Pecha Kucha presentations.
The conference was very ably live-blogged by Nicola Osborne and Zack O’Leary. I won’t attempt to compete with their thorough work of describing the sessions but instead will mention some of the presentations I attended which touched on the name authority space, of which there were quite a few.
Most of the national name identifier systems presented on or mentioned during the conference were familiar to me as ones we covered in the report commissioned by JISC last year. One that we somehow missed in that report, though, was the Portuguese Curriculum DeGóis researcher CV service, which is integrated with the national repository service, RCAAP. You can see José Carvalho’s paper on RCAAP here [PDF].
Kei Kurakawa presented on the Japanese Researcher Name Resolver in a Pecha Kucha session, which was also the format for Natasha Simons’ talk on the researcher identifier activity in Australia. Natasha noted that institutions needed both sticks and carrots to engage with national researcher identifier systems.
Simeon Warner set the scene for the need for researcher identifiers in his presentation on progress and plans for ORCID (liveblogged here), which was followed by the talk I gave on Names (that link is to the YouTube video – you can also get the slides from Slideshare), which referenced both ORCID and ISNI and in which I attempted to characterise the different national and international researcher identifier services currently operating or in development. It is a rapidly-evolving area in which we’re all working and linking between the various systems is going to be essential.