Text Mining for Scholarly Communications and Repositories Joint Workshop
Two members of the Names Project team attended the NaCTeM/UKOLN text mining workshop in Manchester on 28-29 October. The event was an opportunity for us to find out how text mining tools have been used within the academic community and to understand their relevance to repositories and publishers, which are important stakeholders for the Names Project.
The Director of the National Centre for Text Mining (NaCTeM), Sophia Ananiadou, gave a good introduction to the event, explaining that text mining provides annotations to unstructured textual materials which allow semantic enrichment of the text, making implicit knowledge within the materials explicit. A range of perspectives on text mining were then represented, from the academic (linguistics, biology, chemistry and social science) to publishers (Elsevier and the Nature Publishing Group) and service providers (Mimas, EDINA and Microsoft Research).
A theme mentioned by Tony Hey of Microsoft was that if tools like text mining are to be taken up widely by the scientific community (and I presume, by extension, the wider academic world), then they need to be as simple to use as the Web 2.0 tools that are being widely used by general web users. This was echoed in two subsequent talks: Rafael Sidi of Elsevier (who got through an eye-boggling 180 slides in 30 minutes!) emphasised the importance of openness in encouraging innovation, and Paul Walk of UKOLN gave us the developers’ point of view, pointing out that access to data without unnecessary obstacles was essential to get the developer community to make use of services.
The closing session allowed a panel of six experts to give their view of the future of text mining, particularly in the context of institutional repositories. The areas seen as important were: involving end-users in evaluating the effectiveness of text-mining tools (comparing results to those that can be obtained using manual methods); improving repository metadata by using automatic classification of full-text materials such as theses and papers; searching across multiple repositories; developing standards for semantically annotating materials and recording the provenance of those annotations; and capturing work-in-progress information generated by researchers that does not get formally published (e.g. laboratory workbooks recording unsuccessful experiments). One issue that (inevitably) generated a lot of discussion was the problem of getting permission to use full-text materials for text-mining purposes, given the restrictions imposed by copyright laws and by publishers who put limits on annotation of their articles.
Thanks to UKOLN and NaCTeM for organising an interesting event which gave all the attendees plenty to think about and to discuss.
Names Project recruiting
The British Library are recruiting an Analyst for the Names Project. The information below is taken from the job details on their recruitment site.
Ref: O&S00184
Location: Boston Spa, Yorkshire
Position Type: Fixed Term
Specialism: Cataloguers
Salary: £22,063 – £23,896
Fixed term appointment for 2 years
Closing date: 18 October 2009
A. Rose by any other name might be “Alex” to her friends, “Dr. Alexandra Rose” to her students, “Dr. Alexandra N. Rose” to her funders and “A.N. Rose, PhD” to her publishers; and she is not the only A. Rose. For the higher education and research communities, identification of researchers and authors is difficult. The Names 2 Project aims to develop innovative and scalable solutions to problems of identification, attribution and affiliation.
We are recruiting an Analyst to help turn this project from a concept to a service. This is a full time, fixed term post, funded for 2 years by JISC (Joint Information Systems Committee). Names 2 is led by Mimas, based at the University of Manchester.
The successful candidate will have excellent communications skills and work effectively to deadlines. Experience of cataloguing at a professional level, using internationally recognised standards is essential. First hand knowledge or experience of institutional repositories or authority control will be an advantage. The post holder will work as part of a distributed project team.
For an informal discussion about this role please contact Alan Danskin on 01937 546669.
Looking forwards, looking back
Just a brief note to say that the final report from the first phase of the Names project and the project plan for the second phase are both now available from the project website.
Name authority for dead people
A JISC-funded project on the possibilities of using automatically generated metadata in the context of UK higher education has recently been co-ordinated by Intrallect Ltd. The project commissioned a series of reports on different aspects of metadata that might be obtained automatically. These reports are now available on the project’s wiki. They include one on ‘Person Metadata’, which was written by me, based on the experiences we’ve had with the Names Project. The wiki allows for the reports to be annotated with comments, so please chip in if you have any observations.
One area in which I am keen to see progress is the building of a name authority file that would be a shared resource for the cultural heritage sector. This formed one of the recommendations in my report. It might seem a bit off-topic, but I do worry that the needs of institutional repositories have somewhat eclipsed the requirements of archives, museums and galleries in this area. I’ve been peripherally involved in some discussions with the Archives Hub team and others about this. The National Archives (TNA) maintains the kernel of a national archival name authority file as part of the UK’s National Register of Archives (NRA), but staff at other institutions cannot easily add to it. From my perspective, anyway, there seems to be little will at TNA to develop this resource further in ways that would make it more useful for the cultural heritage sector and for the users of electronic resources provided by museums, galleries, archives and other organisations with a more historical view of the world.
As is the case with repositories, people mentioned in archives (or creators and owners of archival and museum materials) may not be represented in library authority files. An archival standard for authority files allows for rich description of individuals, families and organisations but as yet there is no easy way for institutions to share this information or to pool these descriptions together. A set of rules developed within the UK archival community in the 1990s gives guidance on creating an authoritative form of a name, but this has not solved the problem, as this screenshot of name index terms in the Archives Hub illustrates:
A way of associating the different forms of a name with a unique identifier would be more useful than ensuring that Alice Green’s name is always written in exactly the same way. That identifier could then be used to group all records relating to Alice together. The National Register of Archives’ page for Alice Green attempts to do just that, but is not open for additions by anyone outside TNA. The NRA’s identifier for Alice is GB/NNAF/P125310 but the number that retrieves her page within the system is an earlier version of this (GB/NNAF/P11998), which isn’t ideal.
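The idea of keying on an identifier rather than on one agreed spelling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variant spellings and the catalogue records are invented for the example, the function names are hypothetical, and the lookup table stands in for whatever a shared authority service might provide. Only the identifier GB/NNAF/P125310 comes from the NRA.

```python
from collections import defaultdict

# Hypothetical authority table: each variant form of a name points at one
# identifier. Only the identifier itself is taken from the NRA; the variant
# spellings are invented for illustration.
NAME_VARIANTS = {
    "Green, Alice": "GB/NNAF/P125310",
    "Green, Alice Stopford": "GB/NNAF/P125310",
    "Alice S. Green": "GB/NNAF/P125310",
}


def normalise(name):
    """Collapse case and extra whitespace before lookup."""
    return " ".join(name.split()).casefold()


# Index built once so lookups ignore trivial differences in case and spacing.
_INDEX = {normalise(form): ident for form, ident in NAME_VARIANTS.items()}


def identifier_for(name):
    """Return the identifier for a name form, or None if it is unknown."""
    return _INDEX.get(normalise(name))


def group_by_identifier(records):
    """Group (name, title) records under the identifier their name maps to.

    Records whose name is not in the table are collected under None so
    they can be reviewed by hand rather than silently dropped.
    """
    groups = defaultdict(list)
    for name, title in records:
        groups[identifier_for(name)].append(title)
    return dict(groups)


# Hypothetical catalogue records from different institutions.
records = [
    ("Green, Alice Stopford", "Letters"),
    ("alice s. green", "Notebooks"),
    ("Green,  Alice", "Photographs"),
    ("Greene, Alicia", "Diary"),
]
grouped = group_by_identifier(records)
```

The point of the design is that cataloguers can keep whatever form of name suits their own records; the grouping depends only on the identifier, and unmatched forms are set aside for a person to check.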
There seems to me to be an opportunity here to build a collaborative service that would be of enormous benefit to those documenting our heritage and those seeking to find out about it. The current information in the National Register of Archives could be the core of this, in a service that is open to other institutions to edit and that is made available to both web users and to other systems. Lukas Koster’s overview of ‘Linked Data for Libraries’ describes the principles and the end result I have in mind for such information. Tim Berners-Lee’s TED talk in February this year is a great introduction to this area, too.
Actually, now I’ve written all that, this sounds a lot like what we’re trying to do for the repository sector with the Names project. It’s just that there isn’t a big overlap between the people currently active in UK research and those that the cultural heritage community care about…
Institutional identifiers for repositories
The Names Project is represented on the NISO I2 group that is looking into the requirements for unique identifiers for organisations. A subgroup is focusing on the needs of digital repositories, and it is currently asking repository managers and other interested parties to complete a questionnaire about current and future practice in repositories in relation to uniquely identifying organisations and their constituent parts.
If you are interested in this area, please take the survey.
Tweeting
Dan Needham is now sharing updates about his work developing the Names prototype on Twitter.
Web Services and Repositories
Dan Needham, the developer working on the Names Project, attended the Web Services and Repositories workshop that was organised by the EThOS project and held at the British Library on 2nd June.
He gave a presentation [PowerPoint format, 205KB] on the project and the aims behind the web services for the Names prototype that he’s been working on and recently testing with colleagues from Cranfield University.
UPDATE: the audio from Dan’s presentation and all the other materials from the day are now available on the EThOS site.
Double-barrelled Names
Just a brief update to say that the Names project is entering a second phase, thanks to continuing funding from the JISC. In this next period we will be further developing the prototype name authority system into a pilot. This continuation will extend the project for a further two years, building the prototype into a form that will be useful for repository services, and working with new sources of information to improve the quality of the data within the system.
Managing identities

Edinburgh Castle at dusk
In the afternoon I attended the session on e-theses, which was chaired by Owen Stephens and also thoroughly blogged by him (which is quite an impressive feat). Author identities were only touched upon in passing here, but the Entry to EThOS (E2E) project at King’s College is using student record systems to populate name (and other) metadata associated with electronic theses, which sounded interesting. The overlap between the people involved in the creation of theses and those who are producing research outputs is clearly high, meaning that there will be good reasons in the near future for the Names Project to work together with those involved in managing e-theses and digitising the paper versions.





