Lisboa paper

Status

This is the Lisboa paper. It deals with matters of author identification and institutional identification in scholarly digital libraries.

I, Thomas Krichel, started work on this paper in Novosibirsk, Russia on 2008-03-07. This is the version of 2008-03-17. An older version is from 2008-03-16.

Background

The issue

The identification of authors is a generic issue in bibliographic datasets. Ideally, for each item in the dataset, all of its authors would be identified.

No such bibliographic dataset exists. Instead, author names are used. Author names are not identifiers.

For most authors, there is a plethora of options for writing their names.
For any given name expression, it is not known which person is meant.
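
To illustrate both points, here is a minimal Python sketch. The function `name_expressions` and the two example people are hypothetical, and the list of variants is far from complete. It only shows that one person yields several name expressions, and that one expression can belong to more than one person.

```python
def name_expressions(given_names: list[str], family_name: str) -> set[str]:
    """Return a few common ways of writing one person's name (not exhaustive)."""
    full = " ".join(given_names)
    initials = " ".join(name[0] + "." for name in given_names)
    return {
        f"{full} {family_name}",                # all given names, then family name
        f"{family_name}, {full}",               # family name first
        f"{initials} {family_name}",            # initials only
        f"{family_name}, {initials}",           # family name first, initials only
        f"{given_names[0]} {family_name}",      # first given name only
        f"{given_names[0][0]}. {family_name}",  # first initial only
    }


# Two hypothetical, different people.
jane = name_expressions(["Jane", "Quinn"], "Doe")
john = name_expressions(["John"], "Doe")

# Name expressions that could refer to either person.
shared = jane & john
```

Any expression in `shared` cannot be resolved to a person by looking at the name alone.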

Since computers don't know who people are, a precise identification by computers is not possible.

While the issue is ancient, no good solution has been found. In principle, a central agency could maintain author records. But the task is large, so running such an agency remains a difficult proposition.

Is the issue of author identification a problem?

The problem

At this time, scholarly communication is supposedly going through a transformation. The technological stimulus is that modern communication devices allow for open access to documents. The presumed social consequence is that authors will provide open access to maximise visibility of their works.

Overall, this is not happening, or only very slowly. Thus the benefits of an important technical change are lost on academia.

Here is the link between the open access issue and author registration. It is not possible at this point to demonstrate the link between open access and visibility of works. We need to work towards a situation where

academic documents are registered in digital libraries
authors are registered and linked to their documents

Then performance data of the documents (citations, downloads) can be given to authors. Authors will then have an incentive to improve performance by providing open access.

The approach

There are efforts, in most areas, to build systematic collections of descriptions of academic documents. These efforts vary in scale and scope. Some of these collections are freely available, others are not. Sometimes they include links to freely available full text, sometimes to restricted-access full text, sometimes to no full text at all. Such collections are referred to as document metadata collections.

The key requirement for document metadata collections is that they keep a persistent identifier for each document. The metadata may be quite trivial: a title, a list of author name expressions and some further indication of the status of the paper are quite sufficient.

Freely (as in $0) available document metadata collections cover a majority of disciplines, but not all. Coverage in the broader social sciences, humanities, mathematics and chemistry remains thin. However, such collections are not copyrightable and may therefore leak out more easily.

To enable author identification, we need to build a system that uses document metadata collections as input, and produces author records as output. Authors contact such a system, tell it something about themselves, and choose, among the document records, what documents they claim authorship over.
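
The flow just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code of any existing system: the class names, field names, the `demo:` and `author:` identifiers and the matching rule (exact match on a name expression) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DocumentRecord:
    doc_id: str          # persistent identifier kept by the metadata collection
    title: str
    author_names: tuple  # author name expressions, as found in the source


@dataclass
class AuthorRecord:
    author_id: str
    name_variants: set   # what the author tells the system about their name
    claimed_doc_ids: set = field(default_factory=set)


def suggest_documents(author, documents):
    """Return documents carrying at least one of the author's name variants."""
    return [d for d in documents
            if author.name_variants.intersection(d.author_names)]


def claim(author, document):
    """Record that the author claims authorship of the document."""
    author.claimed_doc_ids.add(document.doc_id)


documents = [
    DocumentRecord("demo:1", "A paper on registries", ("J. Doe", "A. Smith")),
    DocumentRecord("demo:2", "A paper by a namesake", ("J. Doe",)),
    DocumentRecord("demo:3", "Unrelated work", ("B. Jones",)),
]
author = AuthorRecord("author:jane", {"Jane Doe", "J. Doe"})

# The system can only suggest; the author decides what is really theirs.
suggested = suggest_documents(author, documents)
claim(author, suggested[0])
```

The point of the design is that the system never decides authorship on its own. It narrows down the candidates, and the human claim is what ends up in the author record.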

The author records are then exported to the document metadata collections. There is some problem with this stage as far as the transfer of email addresses is concerned.

If a document record interface service obtains only the author records, without email addresses, it can still provide value-added services such as grouping papers by the same author together, linking to the author's homepage, opening mailing lists for new additions from a particular author, etc.
If email addresses are available, the document collection can provide push services such as usage analysis.
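
As a rough sketch of the simplest of these services, grouping papers by author needs nothing more than the claims themselves. The pair layout `(author_id, doc_id)` and the identifiers below are assumptions made for the example.

```python
from collections import defaultdict


def group_by_author(claims):
    """Group claimed document identifiers under each author identifier.

    `claims` is an iterable of (author_id, doc_id) pairs.
    """
    grouped = defaultdict(list)
    for author_id, doc_id in claims:
        grouped[author_id].append(doc_id)
    return dict(grouped)


# Hypothetical claims exported from an author registration service.
claims = [
    ("author:a", "demo:1"),
    ("author:b", "demo:2"),
    ("author:a", "demo:3"),
]
papers = group_by_author(claims)
```

No email address is involved at any point, which is why this kind of service remains possible when only the author records are exported.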

Surely, every digital collection can build its own author identification system. The problem with this approach is that it creates identification islands. There would be no way for a computer system to reliably put these pieces together unless authors can claim the same documents across different author registries. As with documents, it is in the authors' best interest that registration records are used as widely as possible.

Building a central author registration system for a set of digital libraries will allow participating digital libraries to concentrate on other matters. For authors, a centralised system will be easier to handle because they deal simultaneously with a number of collections where they have their documents listed.

The experience

I started author registration for the RePEc digital library in 1999. It later became known as the RePEc Author Service (RAS). It is maintained by Christian Zimmermann.

Markus J.R. Klink wrote most of the code for the initial system. In 2000, Ivan V. Kurmanov took over the maintenance of the code.

Since then, RAS has over time become the crucial enabling service of RePEc. Monthly mailings inform registrants of the success of their documents through RePEc services.

In 2002, the Open Society Institute gave Long Island University a grant of $50k to develop a general system. This is called the Academic Contribution Information System (ACIS). Ivan Kurmanov worked on this system between 2001 and 2007.

ACIS embodies almost 10 years of experience with author registration service management. Building a web application for author registration remains difficult because there is no precedent for such a service.

The pioneer

I set up the NetEc service for academic economists as early as 1993. As part of that effort, I published the first electronic academic paper in economics. Since 1994, I have been privileged to partner with José Manuel Barrueco Cruz. We worked for 3 years essentially on our own. Only in 1997 was RePEc created, as others finally started to join the effort. RePEc now runs pretty much without me.

The situation today is similar to the time when I started with what was to become RePEc. I can build a service where authors can create profiles, but they will not sign up until the profiles are used. Usage of the profiles is limited as long as few authors have signed up.

I still believe that the principles of author registration are applicable to all types of digital collections where authorship is an important factor. Author registration is an important helper function in the building of these collections.

I also believe that it is possible to find a small group of volunteers, within the technically minded library community, to provide such a service, and maintain it in perpetuity without external subsidy.

Therefore I will work on this for the remainder of my career, or until such time as the service can run without my intervention.

Current state

A few complications

In fact, things are slightly more complicated than suggested above.

An ACIS system does not only care about registering authors. It also cares about registered institutions. Thus we really have two sets of data to care about, one set for the institutions and one set for the authors.

In principle we should cleanly distinguish between the data and the service implementing the data. The data will be in two datasets, “whoarewe” for institutions and “whoami” for authors. The service currently implementing work on the datasets is “ariw” for institutions. The data are held in the namespaces “info:3lib:we” and “info:3lib:am”. This info scheme is not registered, but I hold the domains “3lib.org”, “3lib.net”, and “3lib.info”.

There needs to be a metadata format that encodes the data. This is the Academic Metadata Format (AMF), which I built with help from others. It provides a simple but flexible approach to metadata that has proven successful in other areas of my work.

An institutional service

ariw stands for academic & research institutions in the world. It runs on a server and on bandwidth that have been donated to me. It is a small service that is not at risk. I hold the domains “ariw.org” and “ariw.info”.

Ariw is a model service. It is strictly XHTML 1.0 compliant. All its internals are publicly available. To reproduce the site, a single tarball is required. It includes a single maintenance script that builds the whole site out of the library files, the documentation files, and the style files.

Ariw is not quite properly separated from whoarewe. At this stage, this is not much of an issue since ariw is in fact the only independent source and service for whoarewe data. The service could be restructured by putting the source data onto a separate domain. At this point, source data are of two forms.

The main institution registration data, available on ariw but also on http://whoarewe.3lib.org.
The contributed data about authors working at institutions, which is calculated on the ariw site on a mirrored copy of the whoami author registration data. This is available on http://ariw.org/internal/var/opt/index/opt_lib.html

Institution registration at this stage is mostly conducted centrally by me. It is hoped that volunteers will join the work; some have already done so. There is a mailing list.

Institutional registration is not supposed to go down to the level of individual departments. But how precisely to deal with hierarchical organizations is a problem.

Currently the data contains 8,000 records imported from univ.cc, partially corrected. We are in the process of merging the data with a source from Cindoc. This will add records for research institutions, in total about 20,000.

The export of whoarewe data into authorclaim.org is automated.

An author registration service

AuthorClaim is the second ACIS implementation service. It is built using the ariw institution data and comprises a set of document data from the following sources:

PubMed, over 10,000,000, still loading more
DBLP records, over 950,000, updated weekly
CiteSeer records, partly overlapping with DBLP
E-LIS records, over 8,000

The document record collection contains records that are reduced to the bare essentials listed below. These data are in principle not copyrightable because they are factual data.

title
author name expressions
status, such as published in a journal etc
link to original provider site for full information
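
A minimal Python sketch of such a reduction is given below. The field names `title`, `author_names`, `status` and `source_url` are hypothetical labels for the four items above, and the input record is invented for the example. This is not the actual loading code.

```python
# Hypothetical field names for the four essential items listed above.
ESSENTIAL_FIELDS = ("title", "author_names", "status", "source_url")


def reduce_record(full_record: dict) -> dict:
    """Keep only the essential fields of a fuller source record."""
    missing = [name for name in ESSENTIAL_FIELDS if name not in full_record]
    if missing:
        raise ValueError(f"record lacks essential fields: {missing}")
    return {name: full_record[name] for name in ESSENTIAL_FIELDS}


# An invented source record with more detail than is needed.
full = {
    "title": "An example paper",
    "author_names": ["J. Doe", "A. Smith"],
    "status": "published in a journal",
    "source_url": "https://example.org/records/1",
    "abstract": "A longer description that is not carried over.",
    "keywords": ["example"],
}
reduced = reduce_record(full)
```

Anything beyond the four fields, such as the abstract, stays with the original provider and is reached through the link.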

The collection runs on a PowerEdge SC1420, with an Intel Xeon Processor at 2.8GHz, 1MB Cache, 800MHz Front Side Bus, and 3 GB of RAM. I have ordered a second processor. The I/O required just from feeding the data into the machine and backing up the result is considerable. In addition, remote backup requires quite a bit of bandwidth. A local backup is in place, but I need backup on an external machine in case there is a catastrophic event on the local site, such as a fire or flood.

I have money in the bank that could buy a bigger machine for around $10k, but I'd rather have a sponsor come forward to buy a new machine. The hardware sponsor would be acknowledged on every page that is on the web server and on every email that the system sends.

Similarly, I need a solution to the issue of the network. The machine sits on a friend's network in New York. It should not stay there, but move to the premises of a sponsor. The network sponsor would be acknowledged on every page that is on the web server and on every email that the system sends. Ideally, they would be based not too far from NYC, since this is where I am about 7 months a year.