The Names team have just finished processing data from the Open University’s Open Research Online repository. When the researchers’ names from Open Research Online were added to the existing Names data, there were 50,002 individuals identified in the Names system.
The matching algorithm developed by the Names project’s Dan Needham does a good job of comparing new names to those already in the system, matching up individuals based on their names, affiliations and the titles of their papers. The algorithm errs on the side of caution, however, to avoid wrongly matching people. This means that some individuals who are already in the system might not be matched up correctly.
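To make the idea of a cautious matcher concrete, here is a minimal sketch. It is not the Names algorithm: the `Person` class, the `classify` function and both thresholds are assumptions made purely for illustration. It compares names, affiliations and paper titles, only declares a match automatically when all three agree, and sends anything weaker to a person for checking.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Person:
    """Hypothetical, simplified record: a name, an affiliation and paper titles."""
    name: str
    affiliation: str
    titles: set = field(default_factory=set)


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(new: Person, existing: Person, auto: float = 0.9, review: float = 0.6) -> str:
    """Return 'match', 'review' or 'no match' (thresholds are illustrative guesses)."""
    name_score = similarity(new.name, existing.name)
    same_affiliation = new.affiliation.lower() == existing.affiliation.lower()
    shared_title = bool(
        {t.lower() for t in new.titles} & {t.lower() for t in existing.titles}
    )
    # Erring on the side of caution: automatic matches need all three signals.
    if name_score >= auto and same_affiliation and shared_title:
        return "match"
    # Weaker evidence is queued for a human to check.
    if name_score >= review and (same_affiliation or shared_title):
        return "review"
    return "no match"


existing = Person("Smith, John", "Open University", {"A study of repositories"})
incoming = Person("Smith, J.", "Open University", {"Another paper"})
print(classify(incoming, existing))  # review
```

Setting the automatic threshold high is what produces a long list of potential matches for manual checking, which is the trade-off described above.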
As a result, with the OU data, we had some 850 names (out of 2,243) to check against potential matches. Most of these were not actually matches: when we checked a 10% sample, around 12% of the potential matches turned out to be actual matches. To ensure the quality of the data, we decided to check the whole batch, and this manual process determined that 108 of the possible matches did indeed match existing individuals in the Names system. Human intervention is the best way of ensuring the quality of data in these cases – automation can achieve a fair degree of accuracy in matching individuals, but in some cases it’s essential to have a person looking at potential matches to determine whether they really are a match or not. Sometimes it is obvious, but there were several in this batch where some additional research was needed to be absolutely sure.
The matching of those individuals left us with a total of 49,894 uniquely identified individuals in the Names database. It would have been nice to have been over the 50,000 mark – but the data would have been poorer quality if we’d left it as it was…
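For anyone who wants to check the sums, the figures quoted above work out as follows. The variable names are just labels chosen for this sketch.

```python
# Figures quoted in this post.
total_before_review = 50_002   # individuals after the automated load
potential_matches = 850        # names flagged for manual checking
confirmed_matches = 108        # flagged names that matched existing records

# Each confirmed match merges two records into one.
unique_individuals = total_before_review - confirmed_matches
match_rate = confirmed_matches / potential_matches

print(unique_individuals)    # 49894
print(f"{match_rate:.1%}")   # 12.7%
```

The final rate of 12.7% is in line with the estimate of around 12% from the 10% sample.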
P.S. Come to think of it, if we include the identifiers for the 158 research institutions in Names, then we are over the 50,000 (50,052 to be precise). Yay!
A press release from Ringgold this week and a blog post from ORCID tell us that ORCID will be using Ringgold’s institutional identifiers (soon to be converted into ISNI institutional identifiers) as a means of recording institutional affiliations of researchers identified in the ORCID system. This is a promising step towards interoperability of identifiers at the institutional level, at least (although departments and research groups are a whole other problem!).
Last month the work of the NISO I2 (Institutional Identifier) group culminated in the publication of a NISO Recommended Practice document entitled Institutional Identification: Identifying Organizations in the Information Supply Chain [PDF]. The I2 group was established in 2008 with the task of looking at the issue of uniquely identifying institutions and other organizations.
From the report:
As the digital information landscape grows increasingly crowded and customized, and as institutions achieve economies of scale through increased collaboration, the need to unambiguously identify organizations engaged in any aspect of information acquisition, supply, archiving, and discovery becomes a critical enabler for efficient and trustworthy information practices.
The use of the International Standard Name Identifier (ISNI) (ISO 27729) for institutional identification is recommended to achieve both of these goals.
Of the 157 UK research institutions currently identified within Names, 93 also already have ISNIs (in other words they were already identified as creators or publishers in library systems which have contributed data to ISNI). We have now added those ISNIs to the Names records of those institutions and will be requesting identifiers for the remaining 64 institutions in the coming weeks.
Our aim by the end of the current phase of the project (July 2013) is to have ISNIs assigned to all of the individuals and organisations identified within Names. ISNI disambiguates and assigns unique identifiers to institutions and individuals internationally. Where ORCID provides a service for individuals to identify themselves, ISNI relies on data from third parties and combines it to create merged records. This means that, in contrast to ORCID, it can include records for organisations and for individuals who may be unable (or unwilling) to manage an online profile.
In the past few weeks the Names team have been working with colleagues at the London School of Economics to uniquely identify individuals who have been involved in research at their institution. As with our previous work with the University of Huddersfield, this involved analysing the contents of LSE’s institutional repository, LSE Research Online.
By processing the RDF data which is automatically provided by the repository’s EPrints software, we were able to compare the information within it against the existing information in Names about LSE authors. Where individuals had already been identified from the Merit 2008 Research Assessment Exercise data, the repository information usually provided additional details to augment the Names records, including first names and other titles of papers that individuals had worked on. For individuals who were not already in Names, we created new records and assigned identifiers to them.
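As a rough illustration of this kind of processing, the sketch below groups paper titles by author name from a small RDF/XML document. The sample data, the `repository.example` URLs and the use of plain Dublin Core `dc:creator` and `dc:title` elements are all assumptions made for the example; the RDF that EPrints actually exports is richer, and this is not the code used by the Names project.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample data, not taken from any real repository.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://repository.example/id/eprint/1">
    <dc:title>First example paper</dc:title>
    <dc:creator>Example, Alice</dc:creator>
    <dc:creator>Sample, Bob</dc:creator>
  </rdf:Description>
  <rdf:Description rdf:about="http://repository.example/id/eprint/2">
    <dc:title>Second example paper</dc:title>
    <dc:creator>Example, Alice</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def titles_by_author(rdf_xml: str) -> dict:
    """Map each creator name to the list of titles it appears on."""
    root = ET.fromstring(rdf_xml)
    result = defaultdict(list)
    for item in root.findall("rdf:Description", NS):
        title = item.findtext("dc:title", default="", namespaces=NS)
        for creator in item.findall("dc:creator", NS):
            result[creator.text.strip()].append(title)
    return dict(result)


for author, titles in titles_by_author(SAMPLE).items():
    print(author, titles)
```

A name-to-titles mapping like this is the sort of evidence that can then be compared against existing records.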
The Names disambiguation algorithm does a good job of automatically matching information from repository data with existing Names records, but it is configured to err on the side of caution in making matches to avoid making false connections between individuals who may have similar names but are not the same. This creates some extra work for the quality assurance process (which is undertaken by our colleagues at the British Library), as it generates a list of potential matches which have to be checked manually. This is worth doing, however, as it ensures that the resulting data is more reliable than it would be with just an automated check. The more data is added to Names, the smoother the matching process becomes, as there is more information in the system to compare against each new source of data.
In the record below, the original Merit record has been enhanced with information about the individual’s first name and with the identifier from the LSE repository. Already this person has four separate identifiers assigned to him: the local LSE one, the national Merit one derived from the 2008 Research Assessment Exercise, the Names identifier (15711) and the international ISNI identifier. We’re also currently investigating the best way of linking this data up with the other big international initiative, ORCID.
Colleagues at LSE plan to add the Names identifiers to their local name authority file for use within the institution. I’d like to note here that working in collaboration with the LSE staff helped to improve data both in Names and at the repository. The experience has also helped us to speed up and fine-tune the quality assurance process at the Names end.
In total there are now 1,005 individuals identified in Names who are affiliated with LSE. Of these, 463 were new identities created from information in LSE Research Online and 413 were existing Names records which have been improved with additional information from the LSE repository.
Yesterday saw an important milestone in the progress of the ORCID researcher and contributor identifier initiative, as the service was launched to the public. You can now register for an ORCID and use the built-in CrossRef search to choose publications that you have written to link to your identifier.
Following on from the launch, the ORCID team hosted a meeting today in Berlin to celebrate and to share news about recent developments in its own work and that of the broader ORCID stakeholder community.
Howard Ratner gave an overview of the history of ORCID to date to kick off the meeting. The initiative started back in 2009 (we reported on the first stakeholder meeting in this blog post). The first phase of ORCID is aimed squarely at individual researchers and by lunchtime today over 850 people had signed up for an identifier in the system. Future phases will look at options for importing records from other trusted sources. Conversations with ORCID representatives at the meeting today confirmed our feeling that the disambiguated data within Names would be a good set for ORCID to work with, especially as the records already contain ISNIs (International Standard Name Identifiers), which are in the same form as ORCIDs. Watch this space for updates on that!
A number of systems which had implemented some kind of ORCID integration were demonstrated at the meeting, including ImpactStory, a site which measures the bookmarking, sharing and saving of publications, slideshows and data sets from a range of different websites, including ORCID, SlideShare, Dryad, GitHub and Google Scholar.
After a keynote from JISC’s Josh Brown, a panel session in the afternoon discussed the relationship between international initiatives such as ORCID and ISNI and specialist or national services operating in the same space. Magchiel Bijsterbosch of the SURF foundation in the Netherlands talked about the situation there, where there is a fairly mature author identification system. Magchiel raised a number of challenges faced by national systems but concluded that there would probably still be a role for such systems in the area of disambiguation. There was some consensus in the meeting that ORCID might want to delegate disambiguation to the wider community, particularly to local experts, rather than attempt to take on this role itself for the whole world.
The Names Project team have been collaborating with colleagues at OCLC and the British Library who are part of an international partnership developing the International Standard Name Identifier (ISNI) system. ISNI identifies a wide range of entities, including organisations and individuals. The identifier is based on an International Organization for Standardization (ISO) standard which was published in March of this year (press release).
ISNIs take the form of 16-digit numbers which are associated with a name and some basic identifying information such as (for an individual) the name of a publication. Over 40,000 individuals who have been uniquely identified in the Names system have now been assigned ISNIs and in the last week those identifiers were imported into the Names database and are now available.
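Strictly speaking, the last of the 16 characters is a check character, which can be the letter X as well as a digit. The sketch below checks that an identifier is well formed, on the assumption that the check character is calculated with the ISO 7064 MOD 11-2 scheme, which is the one ORCID documents for its identifiers. The function names are my own, and the sample value is ORCID's published example identifier rather than a real ISNI from Names.

```python
def mod11_2_check_character(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in first_15_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed(identifier: str) -> bool:
    """True if the identifier has 16 characters and a valid check character."""
    compact = identifier.replace(" ", "").replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return mod11_2_check_character(compact[:15]) == compact[15]


print(is_well_formed("0000-0002-1825-0097"))  # True
print(is_well_formed("0000 0002 1825 0098"))  # False
```

A check like this only catches typing and transcription errors; it says nothing about whether the identifier has actually been assigned to anyone.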
The record below shows how Names records now bring together information on a variety of identifiers. This individual, John Albarran, has three external identifiers associated with him. The MERIT identifier is the one derived from the 2008 Research Assessment Exercise data processed by Names; an identifier within a national system. The UWE identifier is a local identifier from the University of the West of England’s Research Repository, while the ISNI is the first international identifier for this individual. These are all associated with John Albarran’s Names identifier (http://names.mimas.ac.uk/individual/885.html).
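In data terms, a record like this is a single Names identifier with a set of external identifiers hanging off it. The sketch below is an assumed, much simplified model and not the actual Names schema; the class and method names are invented, and the identifier values are placeholders because the real ones are not quoted in this post.

```python
from dataclasses import dataclass, field


@dataclass
class NamesRecord:
    """Hypothetical model of a Names record and its linked identifiers."""
    names_id: int
    display_name: str
    external_ids: dict = field(default_factory=dict)

    def uri(self) -> str:
        # Follows the URL pattern of the Names identifier quoted above.
        return f"http://names.mimas.ac.uk/individual/{self.names_id}.html"

    def add_identifier(self, scheme: str, value: str) -> None:
        self.external_ids[scheme] = value


record = NamesRecord(885, "John Albarran")
# Placeholder values only: the real identifiers are not reproduced here.
record.add_identifier("MERIT", "<national identifier>")
record.add_identifier("UWE", "<local repository identifier>")
record.add_identifier("ISNI", "<international identifier>")

print(record.uri())
print(sorted(record.external_ids))
```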
The ISNI database is designed to be relatively lightweight, so the information available there is less comprehensive than that in the Names system, as can be seen in the screenshot below:
As ISNI is interested in organisations as well as individuals, there is also an identifier for the University of the West of England in the ISNI database:
The ISNI database holds records derived from OCLC’s Virtual International Authority File, which brings together data on individuals identified in national library authority files. The Names records, and information from ProQuest’s Scholar Universe system, extend the ISNI data into the realm of identifying authors of articles as well as those individuals who have been involved in writing books.
We have to accept that there will be many identifiers associated with individuals during the course of their careers. For UK researchers, Names provides a place where institutional identifiers can be linked to a national one. On an international level, ISNI provides an equivalent service for linking together national identifiers.
The Names Project was represented at an event in Barcelona today which looked at the role of author identifiers and ways of integrating them into the procedures of institutions, and institutional repositories in particular. A number of different perspectives were presented at the event, including publishers, funders and identifier providers. There are videos of the talks available:
Martin Fenner on ORCID:
Me on the Names Project:
Gerry Lawson on the funder’s perspective:
Some interesting statistics emerged from Gerry Lawson’s talk concerning the number of researchers in Europe. All European governments have to report on the numbers of full-time-equivalent researchers in higher education, business, government and non-profit sectors. These figures are available from the Eurostat site and cover the years 1999 to 2010. The figures for the UK between 2005 and 2010 are fairly static, ranging from a high of 254,009 in 2006 to a low of 235,373 in 2010. Germany has the highest number of researchers, at 327,500, with a noticeable increase in numbers each year. For the EU as a whole, the figure for 2010 is over 1.5 million researchers.
The number of individuals represented by these figures will be higher than the total of FTEs, of course, but at least this gives us an idea of the number of people who may ultimately need to be covered by services like Names (and some confidence in the figure of 20% of UK researchers that we’ve been estimating that Names currently holds). Philip Purnell’s presentation also gave us some figures for the number of researchers who have registered with the ResearcherID service from Thomson Reuters. For the UK, this is currently 14,033.
It will be interesting to see how many researchers sign up for an ORCID identifier when the service launches on 15th October. One of the options that will be offered is the ability to transfer information from an existing ResearcherID into an ORCID, which will be useful for those researchers who are already registered in that service. ResearcherID also offers institutions the option of assigning IDs in batches, free of charge, to their researchers. This differs from the ORCID model, which will allow institutions to submit ORCIDs in bulk only if they are ORCID members (at an annual cost of $5,000 for small institutions). Martin Fenner suggested that small institutions might want to encourage their researchers to register themselves, as this process is free of charge.
Three members of the Names Project team attended the Open Repositories conference, OR2012, in Edinburgh this week. It’s a really packed conference, with fascinating sessions, a Developers’ Challenge and this year a side helping of Repository Fringe with its challenging format of Pecha Kucha presentations.
The conference was very ably live-blogged by Nicola Osborne and Zack O’Leary. I won’t attempt to compete with their thorough work of describing the sessions but instead will mention some of the presentations I attended which touched on the name authority space, of which there were quite a few.
Most of the national name identifier systems presented upon or mentioned during the conference were familiar to me as ones we covered in the report commissioned by JISC last year. One that we somehow missed in that report, though, was the Portuguese Curriculum DeGóis researcher CV service, which is integrated with the national repository service, RCAAP. You can see José Carvalho’s paper on RCAAP here [PDF].
Kei Kurakawa presented on the Japanese Researcher Name Resolver in a Pecha Kucha session, which was also the format for Natasha Simons’ talk on the researcher identifier activity in Australia. Natasha noted that institutions needed both sticks and carrots to engage with national researcher identifier systems.
Simeon Warner set the scene for the need for researcher identifiers in his presentation on progress and plans for ORCID (liveblogged here). This was followed by the talk I gave on Names (that link is to the YouTube video – you can also get the slides from Slideshare), which referenced both ORCID and ISNI and in which I attempted to characterise the different national and international researcher identifier services currently operating or in development. It is a rapidly evolving area in which we’re all working, and linking between the various systems is going to be essential.
The Researcher Identifier Task and Finish Group convened by JISC is seeking responses to a questionnaire about the recommendations of the group. The text of the questionnaire is available in PDF format, if you want to read the whole thing before starting to answer the questions.
The purpose of this questionnaire is to consult within the UK research community about the feasibility and general acceptability of the task and finish group’s recommendations, of which those relating to ORCID and its implementation are central. The data gathered through this survey, along with that from interviews, will inform a report for JISC that will help prepare the ground for the UK-wide use of a common researcher ID which can be used to uniquely identify anyone involved in research.
Time is short – the deadline for completing the questionnaire is 4th June.
On the second day of the Digital Author Identifier Summit, the participants spent time divided into separate groups, looking at issues of governance, interoperability and added value. I was in the Interoperability group which was concerned with identifying barriers to the interchange of digital author identifier information and recommending ‘next steps’ for the international scene.
It was a lively discussion, eventually focusing on the need for a canonical identifier for individuals at the international level. Paolo Bouquet advanced the idea that the canonical ID should be a light-weight service with a minimal set of metadata which would be sufficient to distinguish one entity from another. The first step is to identify who should provide this thin layer: both ORCID and ISNI were seen as candidate services, but ideally they should co-operate in this area. Once the ‘thin’ identifier layer is agreed upon, other identifier services would be able to map information found in their systems to the canonical ID. These lower-level systems would be able to provide various value-added services, tailored for their particular constituencies, and would have to agree standard ways of sharing data between them. (For an example, see the Names Project’s API documentation.)
Paolo demonstrated the sig.ma Semantic Information Mashup as an example of a service which could then aggregate information from other services about an individual (Paolo himself, in this case). Sig.ma illustrates part of what Cliff Lynch was talking about on Day 1: the ability to create new biography services with data from author identifier systems. Paolo’s vision gained a fair degree of support from the group, although the issue of collaboration between ISNI and ORCID was seen as a possible problem area: the two approaches have very different business models and ways of obtaining information.
The feedback from the Added Value group was that the practical steps for existing systems would be to develop local IDs for authors/contributors and to make those available to other systems. The Governance group agreed that ISNI and ORCID are part of the solution and complementary but were concerned that if they did not agree on a way of collaborating, the landscape would become fragmented. They saw the importance of aligning business models with available funding sources and thought that the data should be open and trustworthy. In the summing-up of the two days, Cliff Lynch noted that both ORCID and ISNI are relatively young services and that there is still time to provide feedback at a high level to help ensure that they evolve in the most useful direction for the communities which need them.
Brian Kelly has pulled together the tweets from the workshop and there are overall summaries of the event on the Knowledge Exchange site and by Talat Chaudhri at the JISC Innovation Support Centre blog. It was an interesting and stimulating two days (it’s not often that I get to talk for two solid days about digital author identifiers!) and I’d like to take this opportunity to thank the organisers of the event for the chance of taking part.
UPDATED 11 April 2012: just to note that the Knowledge Exchange team have now published a report [PDF, 440KB] on the event.