Cardiff paper

by Thomas Krichel

Status

This is the Cardiff paper. It describes a work program for building a new interface for selecting documents in the ACIS document selection screen.

This is the version of 2008–05–08.

Other versions include

2008–04–30

I am grateful for comments by Ivan V. Kurmanov.

Introduction

A key function of the ACIS software is the selection of documents whose author or editor names match the registrant’s name variations profile.

To settle terminology, let us call such documents “nominal documents”.

ACIS was first implemented for the RePEc document data collection, which is medium sized, say roughly 500k records. The set of nominal documents remains within reasonable limits so that the manual selection of documents is not too onerous.

AuthorClaim's document collection stands at 34 times the size of RePEc. Much of that size comes from the PubMed dataset, which does not have first names, just initials.

Joanna P. Davies found out that there are in excess of 400 nominal documents for her variations. And she is lucky because she is called Davies, rather than Davis which is the more common form.

Debates between Joanna P. Davies and Thomas Krichel have focussed on whether the addition of a keyword search will be sufficient, or whether a machine learning approach is required. This proposal seeks to integrate keywords with machine learning.

The interface at initial registration

This section has a basic discussion of the interface.

A nominal document can be in one of three states.

unstated [u]
claimed [c]
refused [r]

A nominal document can only pass

from [u] to [c]
from [u] to [r]
from [c] to [r]
from [r] to [c]
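
The states and permitted transitions listed above can be sketched as a small lookup table. This is a minimal illustration in Python, not ACIS code. The names State, ALLOWED and move are hypothetical and chosen only for this sketch.

```python
from enum import Enum


class State(Enum):
    UNSTATED = "u"
    CLAIMED = "c"
    REFUSED = "r"


# The four transitions listed in the text; anything else is rejected.
ALLOWED = {
    (State.UNSTATED, State.CLAIMED),
    (State.UNSTATED, State.REFUSED),
    (State.CLAIMED, State.REFUSED),
    (State.REFUSED, State.CLAIMED),
}


def move(current: State, target: State) -> State:
    """Return the new state, or raise if the transition is not permitted."""
    if (current, target) not in ALLOWED:
        raise ValueError(
            f"cannot move from [{current.value}] to [{target.value}]"
        )
    return target
```

One consequence of the table is that a document that has left [u] never returns to it.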

On initial registration, nothing is known about the user. The user is shown all nominal documents, and all have state [u]. Each carries one button to move it to state [c] and one button to move it to state [r].

There are three buttons

"select all with this keyword" (textbox) [cb]
"deselect all with this keyword" (textbox) [rb]
"order" [ob]

[cb] will move all nominal documents that contain the keyword from [u] to [c]. If the keyword is blank, all nominal documents in [u] will be selected. If there are documents in [r] that would have been moved to [c] had they been in state [u], a warning will be issued. The warning will tell the user to press [cb] again to move those documents as well. Analogous reasoning applies to [rb]. Both [cb] and [rb] use only JavaScript and make no use of the server.
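
The behaviour of [cb] can be sketched as follows. The sketch is written in Python purely to show the logic; the interface itself would do this in JavaScript in the browser. Several details are assumptions not settled in this paper: matching is taken to be a case-insensitive substring match on the title, and the "press again" step is modelled by a hypothetical confirmed flag. The names Doc, matches and press_cb are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    title: str
    state: str = "u"  # one of "u", "c", "r"


def matches(doc: Doc, keyword: str) -> bool:
    # Assumption: case-insensitive substring match on the title.
    # A blank keyword matches every document.
    return keyword.strip().lower() in doc.title.lower()


def press_cb(docs: List[Doc], keyword: str,
             confirmed: bool = False) -> Optional[str]:
    """Simulate one press of [cb]; return a warning text or None."""
    hits = [d for d in docs if matches(d, keyword)]
    # Matching documents in [u] are always claimed.
    for d in hits:
        if d.state == "u":
            d.state = "c"
    # Matching documents in [r] are only claimed on a confirmed press.
    refused = [d for d in hits if d.state == "r"]
    if not refused:
        return None
    if confirmed:
        for d in refused:
            d.state = "c"
        return None
    return (f"{len(refused)} refused document(s) also match; "
            "press again to claim them.")
```

A sketch of [rb] would mirror this, with "c" and "r" swapped.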

[ob] will result in an error message unless there is at least one [c] document and at least one [r] document. Otherwise, [ob] will push the [c] documents to the top and the [r] documents to the bottom, and order the [u] documents so that the document most likely to go to [c] is at the top of the [u] documents and the document least likely to go to [c] is at the bottom of the [u] documents, just above the first [r] document.
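
The resulting layout can be sketched as below, again in Python for illustration only. The scoring function is passed in as a parameter and stands in for whatever the learner returns; a higher score is assumed to mean "more likely to go to [c]". The name press_ob and the (title, state) tuple representation are hypothetical.

```python
from typing import Callable, List, Tuple

Item = Tuple[str, str]  # (title, state), where state is "u", "c" or "r"


def press_ob(docs: List[Item], score: Callable[[str], float]) -> List[Item]:
    """Return the documents in the order [c], then ranked [u], then [r]."""
    states = {state for _, state in docs}
    if "c" not in states or "r" not in states:
        # Without both kinds of example there is nothing to learn from.
        raise ValueError("need at least one [c] and one [r] document")
    claimed = [d for d in docs if d[1] == "c"]
    refused = [d for d in docs if d[1] == "r"]
    unstated = [d for d in docs if d[1] == "u"]
    # Most likely candidates for [c] first, least likely last.
    unstated.sort(key=lambda d: score(d[0]), reverse=True)
    return claimed + unstated + refused
```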

Ordering uses statistical learning via SVM_light. The titles are sent to the server and ACIS sorts them.
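
SVM_light reads one example per line: a target value followed by feature:value pairs in increasing feature order. How titles are turned into features is not specified in this paper, so the following is only a sketch under assumptions: titles are split into lowercase word tokens, each distinct word becomes one feature, and the value is the word count. [c] titles are written as +1, [r] titles as -1, and [u] titles with target 0 as unlabelled. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Assumption: 0 marks a title whose label is not yet known.
LABEL = {"c": "+1", "r": "-1", "u": "0"}


def tokens(title: str) -> List[str]:
    # Assumption: lowercase alphanumeric runs are the words of a title.
    return re.findall(r"[a-z0-9]+", title.lower())


def build_vocab(titles: List[str]) -> Dict[str, int]:
    """Assign each distinct word a feature number, starting at 1."""
    vocab: Dict[str, int] = {}
    for title in titles:
        for tok in tokens(title):
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def to_svmlight(docs: List[Tuple[str, str]],
                vocab: Dict[str, int]) -> List[str]:
    """Encode (title, state) pairs as SVM_light example lines."""
    lines = []
    for title, state in docs:
        counts = Counter(vocab[t] for t in tokens(title) if t in vocab)
        # Feature numbers must appear in increasing order.
        feats = " ".join(f"{i}:{n}" for i, n in sorted(counts.items()))
        lines.append(f"{LABEL[state]} {feats}".rstrip())
    return lines
```

In such a setup the [c] and [r] lines would form the training file, and the scores returned for the [u] lines would drive the ordering.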

Observability

In a lab environment, we use random titles from RePEc, DBLP, and PubMed. We try to learn one of them.
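
One possible reading of this setup is a one-against-the-rest task: titles from the collection to be learned are positive examples, and titles from the other two are negative. The sketch below builds such a sample. The reading itself, the function name lab_sample and its parameters are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple


def lab_sample(collections: Dict[str, List[str]], target: str,
               per_collection: int, seed: int = 0) -> List[Tuple[str, int]]:
    """Draw random titles and label them +1 (target) or -1 (others)."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    examples: List[Tuple[str, int]] = []
    for name, titles in collections.items():
        picked = rng.sample(titles, min(per_collection, len(titles)))
        label = 1 if name == target else -1
        examples.extend((title, label) for title in picked)
    rng.shuffle(examples)
    return examples
```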

The lab environment can also be used to develop and test a simple reporting language, and to develop database storage for expressions in that language.

Student dissertation

The title will be “Item selection with machine learning in a web interface”.

The dissertation reports on the results of the lab environment first and sets out the reporting language.

It gives some overview of the running of the system, but no serious statistics, just a few graphs with results gathered from actual running.