Cardiff paper

by Thomas Krichel

Status

This is the Cardiff paper. It describes a work program for the construction of a new interface to select documents in the ACIS documents selection screen.

This is the version of 2008-07-25.

Other, deprecated versions include

I am grateful for comments by Ivan V. Kurmanov.

Introduction

A key function of the ACIS software is the selection of documents whose authorship or editorship corresponds to the registrant’s name variations profile.

To settle terminology, let us call such documents “nominal documents”. A document that is nominal but not currently suggested is a free document.

ACIS was first implemented for the RePEc document data collection, which is medium-sized, at roughly 500k records. There, the set of nominal documents remains within reasonable limits, so that the manual selection of documents is not too onerous.

AuthorClaim's document collection stands at 34 times the size of RePEc. Much of that size comes from the PubMed dataset, which has no first names, just initials.

Joanna P. Davies found out that there are in excess of 400 nominal documents for her variations. And she is lucky because she is called Davies, rather than Davis which is the more common form.

Debates between Joanna P. Davies and Thomas Krichel have focussed on whether the addition of a keyword search will be sufficient, or whether a machine learning approach is required. This proposal seeks to integrate keywords with machine learning.

The interface after initial registration

This section has a basic discussion of the interface.

A nominal document can be in one of four states.

  1. free [f]
  2. suggested [s]
  3. claimed [c]
  4. refused [r]

Write |f| for the number of free documents, |s| for the number of suggested documents, and so on.

ACIS provides facilities to transport documents between these states. These are essentially unchanged under the proposal.

The only addition is a tab in the autosuggestion screen, labeled “all with learning (experimental)”. This subscreen is organised differently from the all or chunk subscreens. It has the document description on the left, and two radio buttons on the right. The "accept" radio button requests transport from [s] to [c], the "refuse" radio button from [s] to [r]. When the form is generated, no radio button is checked. Once a radio button for one paper is checked, it is not possible to bring it back to an undecided state, but I don't think that's an issue.

There are two types of processing buttons; let us call them "save_and_continue" and "save". These buttons may be repeated on the form, and only one type of button may be available at one time.

"save_and_continue" is a new button. It is an intermediate step in the process of finding the claimable papers. When the user presses it, all papers that have been accepted are moved from [s] to [c], and all papers that have been refused from [s] to [r]. If |s|=0, the "save_and_continue" button behaves like "save": it moves the user to the main research screen. Otherwise, it moves the user back to the suggestions screen with the learning tab. If |c|>0 and |r|>0, and |s|>1, the remaining papers will be sorted by learning, following work done by Ilya. Otherwise no sorting will be performed; the order will remain as before, but the user will no longer see the papers that he has already made a decision on.

"save" is basically the submit button of the existing tabs. The user leaves the suggestions screen and returns to the main research screen. If |c|>0 and |r|>0, and |s|>1, the remaining papers could be sorted, but the sorting would slow the move to the next screen and its results would not be observed. The sorting could be placed in the background, but I am not sure it is worth the effort. Therefore I propose that no sorting be done, unless I get a better idea.

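The behaviour of the two buttons described above can be sketched as follows. This is only an illustration under stated assumptions, not the ACIS code: the function names, the $state hash of sids per state, the $decisions hash and the returned screen labels are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical data model: $state->{s}, {c} and {r} are hashes of sids in the
# suggested, claimed and refused states; $decisions maps sid => 'accept' or
# 'refuse' for the radio buttons that were checked.
sub apply_decisions {
    my ($state, $decisions) = @_;
    for my $sid (sort keys %$decisions) {
        next unless delete $state->{s}{$sid};    # only suggested papers move
        my $target = $decisions->{$sid} eq 'accept' ? 'c' : 'r';
        $state->{$target}{$sid} = 1;
    }
    return;
}

# "save": store the decisions and leave for the main research screen.
sub save {
    my ($state, $decisions) = @_;
    apply_decisions($state, $decisions);
    return { screen => 'main research', sort => 0 };
}

# "save_and_continue": store the decisions, then either finish or reload the
# learning tab, sorting only when both kinds of example exist.
sub save_and_continue {
    my ($state, $decisions) = @_;
    apply_decisions($state, $decisions);
    my %n = map { ($_ => scalar keys %{ $state->{$_} }) } qw(s c r);
    return { screen => 'main research', sort => 0 } if $n{s} == 0;
    my $sort = ($n{c} > 0 && $n{r} > 0 && $n{s} > 1) ? 1 : 0;
    return { screen => 'suggestions, learning tab', sort => $sort };
}

my $state = { s => { map { ($_ => 1) } qw(dcat3 ddis136 dwop1 dwri7) }, c => {}, r => {} };
my $next  = save_and_continue($state, { dcat3 => 'accept', ddis136 => 'refuse' });
print "$next->{screen}, sort=$next->{sort}\n";
```

In the usage lines, one paper is accepted and one refused while two remain suggested, so the sketch returns to the learning tab and asks for sorting.
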
Changes to ACIS

lib/ACIS/Web/Contributions.pm

Suggestions on which the user has made no decision are cleared only when the save_and_continue button has not been pressed.

home/configuration.xml

We introduce a relevance column into the suggestions table to hold the computed relevance.

Learning

The main learning module is ACIS/Resources/Learn.pm. This prepares a $suggestions variable that needs to be returned by a function that implements a specific learning module. ACIS/Resources/Learn/LibSVM.pm does it for LibSVM.
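
For orientation, LibSVM's command-line tools read sparse examples as text lines of the form "label index:value ...", with indices in ascending order. The sketch below renders one such line. It assumes each document has already been reduced to a hash of token weights, and that claimed papers are labelled 1 and refused papers -1. The function libsvm_line and the vocabulary hash are hypothetical, and nothing here says whether LibSVM.pm writes such lines or calls a binding instead.

```perl
use strict;
use warnings;

# Hypothetical sketch: render one example in LibSVM's sparse text format.
# %$vocab assigns a 1-based index to each token and grows as tokens appear.
sub libsvm_line {
    my ($label, $weights, $vocab) = @_;
    my %by_index;
    for my $token (sort keys %$weights) {
        unless (exists $vocab->{$token}) {
            my $next = scalar(keys %$vocab) + 1;
            $vocab->{$token} = $next;
        }
        $by_index{ $vocab->{$token} } = $weights->{$token};
    }
    my @pairs;
    push @pairs, "$_:$by_index{$_}" for sort { $a <=> $b } keys %by_index;
    return join ' ', $label, @pairs;
}

my %vocab;
print libsvm_line( 1, { krichel => 0.6, levine => 0.8 }, \%vocab), "\n";
print libsvm_line(-1, { levine  => 1 },                  \%vocab), "\n";
```

Sharing one vocabulary hash across calls keeps the index of a token stable between the training examples and the suggestions to be scored.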

The learning module gets all that it needs to know from $app. Here is an extract from a Data::Dumper dump of $app, taken at the very end of ACIS/Contributions/main_screen(). It shows a part of $app->{'variables'}->{'contributions'}:

'suggest' => [
              {
                'status' => 1,
                'reason' => 'exact-name-variation-match',
                'list' => [
                            {
                              'sid' => 'dcat3',
                              'authors' => 'Jos Manuel, & Barrueco Cruz, & Thomas Krichel,',
                              'url-about' => 'http://citeseer.ist.psu.edu/251821.html',
                              'id' => 'info:3lib:citeseerpsu:251821',
                              'type' => 'text',
                              'title' => 'Cataloging Economics preprints: an introduction to the RePEc project',
                              'role' => 'author'
                            },
                            {
                              'sid' => 'ddis136',
                              'authors' => 'Jos Manuel, & Barrueco Cruz, & Thomas Krichel,',
                              'url-about' => 'http://citeseer.ist.psu.edu/360116.html',
                              'id' => 'info:3lib:citeseerpsu:360116',
                              'type' => 'text',
                              'title' => 'Distributed Cataloging on the Internet: the RePEc project',
                              'role' => 'author'
                            },
                            {
                              'sid' => 'dwop1',
                              'authors' => 'Jos Manuel, & Barrueco Cruz, & Thomas Krichel,',
                              'url-about' => 'http://citeseer.ist.psu.edu/623255.html',
                              'id' => 'info:3lib:citeseerpsu:623255',
                              'type' => 'text',
                              'title' => 'WoPEc usage in 1999AD',
                              'role' => 'author'
                            },
                            {
                              'sid' => 'dwri7',
                              'authors' => 'Thomas Krichel,',
                              'url-about' => 'http://citeseer.ist.psu.edu/465189.html',
                              'id' => 'info:3lib:citeseerpsu:465189',
                              'type' => 'text',
                              'title' => 'Written submission to the "AG Neue Medien und Bibliotheken" of the Wissenschaftsrat',
                              'role' => 'author'
                            }
                          ]
              }
            ],
'already-accepted' => {
                       'info:3lib:citeseerpsu:226647' => 'author',
                       'info:3lib:citeseerpsu:605918' => 'author',
                       'info:3lib:citeseerpsu:230941' => 'author',
                       'info:3lib:citeseerpsu:477049' => 'author',
                       'info:3lib:citeseerpsu:452118' => 'author',
                       'info:3lib:citeseerpsu:471465' => 'author',
                       'info:3lib:citeseerpsu:412176' => 'author',
                       'info:3lib:citeseerpsu:561147' => 'author',
                       'info:3lib:citeseerpsu:448236' => 'author',
                       'info:3lib:citeseerpsu:606494' => 'author',
                       'info:3lib:citeseerpsu:474526' => 'author'
                     },
'refused' => [
            {
              'authors' => 'Thomas Krichel,',
              'sid' => 'drep43',
              'url-about' => 'http://citeseer.ist.psu.edu/445063.html',
              'title' => 'RePEc, an Open Library for Economics',
              'type' => 'text',
              'id' => 'info:3lib:citeseerpsu:445063',
              'role' => 'author'
            },
            {
              'authors' => 'Thomas Krichel, & Sergei I. Parinov,',
              'sid' => 'dthe592',
              'url-about' => 'http://citeseer.ist.psu.edu/498365.html',
              'title' => 'The RePEc database and its Russian partner Socionet',
              'type' => 'text',
              'id' => 'info:3lib:citeseerpsu:498365',
              'role' => 'author'
            },
            {
              'authors' => 'Thomas Krichel, & Paul Levine,',
              'sid' => 'dthe939',
              'url-about' => 'http://citeseer.ist.psu.edu/612949.html',
              'title' => 'The Economic Impact of Labour Mobility in an Enlarged European Union',
              'type' => 'text',
              'id' => 'info:3lib:citeseerpsu:612949',
              'role' => 'author'
            },
            {
              'authors' => 'Thomas Krichel, & Simeon M. Warner,',
              'sid' => 'daca3',
              'url-about' => 'http://citeseer.ist.psu.edu/496446.html',
              'title' => 'Academic Self-Documentation: Which Way Forward for Computing, Library and Information Science?',
              'type' => 'text',
              'id' => 'info:3lib:citeseerpsu:496446',
              'role' => 'author'
            },
            {
              'authors' => 'Thomas Krichel, & Steve Lawrence,',
              'sid' => 'dcit1',
              'url-about' => 'http://citeseer.ist.psu.edu/324151.html',
              'title' => 'CitEc: an Autonomous Citation Index for Economics',
              'type' => 'text',
              'id' => 'info:3lib:citeseerpsu:324151',
              'role' => 'author'
            }
          ],
'already-suggested' => {
                        'ddoe12' => 'exact-name-variation-match',
                        'dunk195' => 'exact-name-variation-match',
                        'ddis136' => 'exact-name-variation-match',
                        'dcit1' => 'exact-name-variation-match',
                        'dsyn33' => 'exact-name-variation-match',
                        'dwri7' => 'exact-name-variation-match',
                        'dwop1' => 'exact-name-variation-match',
                        'dsyn11' => 'exact-name-variation-match',
                        'dabo6' => 'exact-name-variation-match',
                        'dame40' => 'exact-name-variation-match',
                        'dthe939' => 'exact-name-variation-match',
                        'dthe592' => 'exact-name-variation-match',
                        'dunk114' => 'exact-name-variation-match',
                        'daca23' => 'exact-name-variation-match',
                        'dcat3' => 'exact-name-variation-match',
                        'daca3' => 'exact-name-variation-match',
                        'dalo10' => 'exact-name-variation-match',
                        'drep43' => 'exact-name-variation-match',
                        'dwri4' => 'exact-name-variation-match',
                        'dwor3' => 'exact-name-variation-match'
                      },
'laststatus' => 'auto-search-not-needed',
'reloaded' => 1,
'autosearch' => {
               'names-list-nice' => [
                                      'Thomas Krichel',
                                      'Krichel, Thomas',
                                      'T. Krichel',
                                      'Krichel, T.',
                                      'Thomas Krichel'
                                    ],
               'for-names-last-changed' => '1247381317',
               'names-list' => [
                                 'thomas krichel',
                                 'krichel thomas',
                                 't krichel',
                                 'krichel t'
                               ]
             },
'accepted' => [
             {
               'sid' => 'dame40',
               'authors' => 'Thomas Krichel, & Simeon M. Warner,',
               'url-about' => 'http://citeseer.ist.psu.edu/605918.html',
               'id' => 'info:3lib:citeseerpsu:605918',
               'type' => 'text',
               'title' => 'A Metadata Framework to Support Scholarly Communication',
               'role' => 'author'
             },
             {
               'sid' => 'dalo10',
               'authors' => 'Thomas Krichel,',
               'url-about' => 'http://citeseer.ist.psu.edu/230941.html',
               'id' => 'info:3lib:citeseerpsu:230941',
               'type' => 'text',
               'title' => 'A long application for continuation funding for WoPEc',
               'role' => 'author'
             },
             {
               'sid' => 'dabo6',
               'authors' => 'Thomas Krichel,',
               'url-about' => 'http://citeseer.ist.psu.edu/226647.html',
               'id' => 'info:3lib:citeseerpsu:226647',
               'type' => 'text',
               'title' => 'About NetEc, with special Reference to WoPEc',
               'role' => 'author'
             },
             {
               'sid' => 'daca23',
               'authors' => 'Thomas Krichel, & Simeon M. Warner,',
               'url-about' => 'http://citeseer.ist.psu.edu/477049.html',
               'id' => 'info:3lib:citeseerpsu:477049',
               'type' => 'text',
               'title' => 'Academic Self-Documentation: Which Way Forward for Computing, Library and Information Science?',
               'role' => 'author'
             },
             {
               'sid' => 'dsyn11',
               'authors' => 'Tim D. Brody, & Zhuoan Jiao, & Thomas Krichel, & Simeon M. Warner,',
               'url-about' => 'http://citeseer.ist.psu.edu/452118.html',
               'id' => 'info:3lib:citeseerpsu:452118',
               'type' => 'text',
               'title' => 'Syntax and Vocabulary of the Academic Metadata Format',
               'role' => 'author'
             },
             {
               'authors' => 'Thomas Krichel, & Guildford Gu Xh,',
               'sid' => 'dwor3',
               'url-about' => 'http://citeseer.ist.psu.edu/448236.html',
               'title' => 'Working towards an Open Library for Economics: The RePEc project',
               'type' => 'text',
               'id' => 'info:3lib:citeseerpsu:448236',
               'role' => 'author'
             },
             {
               'authors' => 'Herbert Van De Sompel, & Thomas Krichel, & Michael L. Nelson, & Patrick Hochstenbach, & Victor M. Lyapunov, & Kurt Maly, & Mohammad Zubair, & Mohamed Kholief, & Xiaoming Liu,',
               'sid' => 'dunk195',
               'url-about' => 'http://citeseer.ist.psu.edu/606494.html',
               'title' => 'Unknown',
               'type' => 'text',
               'id' => 'info:3lib:citeseerpsu:606494',
               'role' => 'author'
             },
             {
               'authors' => 'Tim D. Brody, & Zhuoan Jiao, & Thomas Krichel, & Simeon M. Warner,',
               'sid' => 'dsyn33',
               'url-about' => 'http://citeseer.ist.psu.edu/474526.html',
               'title' => 'Syntax and Vocabulary of the Academic Metadata Format',
               'type' => 'text',
               'id' => 'info:3lib:citeseerpsu:474526',
               'role' => 'author'
             },
             {
               'authors' => 'Herbert Van De Sompel, & Thomas Krichel, & Michael L. Nelson, & Patrick Hochstenbach, & Victor M. Lyapunov, & Kurt Maly, & Mohammad Zubair, & Mohamed Kholief, & Xiaoming Liu,',
               'sid' => 'dunk114',
               'url-about' => 'http://citeseer.ist.psu.edu/561147.html',
               'title' => 'Unknown',
               'type' => 'text',
               'id' => 'info:3lib:citeseerpsu:561147',
               'role' => 'author'
             },
             {
               'authors' => 'Thomas Krichel,',
               'sid' => 'dwri4',
               'url-about' => 'http://citeseer.ist.psu.edu/471465.html',
               'title' => 'Written submission to the "AG Neue Medien und Bibliotheken" of the Wissenschaftsrat',
               'type' => 'text',
               'id' => 'info:3lib:citeseerpsu:471465',
               'role' => 'author'
             },
             {
               'authors' => 'Thomas Krichel, & Paul Levine,',
               'sid' => 'ddoe12',
               'url-about' => 'http://citeseer.ist.psu.edu/412176.html',
               'title' => 'Does Precommitment Raise Growth? The Dynamics of Growth and Fiscal Policy',
               'type' => 'text',
               'id' => 'info:3lib:citeseerpsu:412176',
               'role' => 'author'
             }
           ],
'previousstatus' => 'auto-search-not-needed',
'already-refused' => {
                      'info:3lib:citeseerpsu:498365' => 1,
                      'info:3lib:citeseerpsu:324151' => 1,
                      'info:3lib:citeseerpsu:612949' => 1,
                      'info:3lib:citeseerpsu:445063' => 1,
                      'info:3lib:citeseerpsu:496446' => 1
                    }
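
A structure shaped like the extract above could be split into labelled examples and undecided candidates along these lines. This is a sketch only: the function training_sets is hypothetical, and treating 'accepted' records as positive and 'refused' records as negative examples is my assumption about how the learner would be fed.

```perl
use strict;
use warnings;

# Hypothetical sketch: split a contributions structure into labelled
# training records and unlabelled candidates.
sub training_sets {
    my ($contributions) = @_;

    # Claimed papers are positive examples, refused papers negative ones.
    my @labelled = (
        (map { +{ label =>  1, record => $_ } } @{ $contributions->{accepted} || [] }),
        (map { +{ label => -1, record => $_ } } @{ $contributions->{refused}  || [] }),
    );

    # Every group under 'suggest' carries its records in a 'list' array.
    my @candidates = map { @{ $_->{list} || [] } } @{ $contributions->{suggest} || [] };

    return (\@labelled, \@candidates);
}

# A shortened stand-in for the dump, keeping only the sid of each record.
my $contributions = {
    suggest  => [ { reason => 'exact-name-variation-match',
                    list   => [ { sid => 'dcat3' }, { sid => 'dwri7' } ] } ],
    accepted => [ { sid => 'dame40' }, { sid => 'dalo10' } ],
    refused  => [ { sid => 'drep43' } ],
};
my ($labelled, $candidates) = training_sets($contributions);
printf "%d labelled, %d candidates\n", scalar @$labelled, scalar @$candidates;
```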

The feature extraction proceeds as follows. The fields authors, url-about, title and id are concatenated and lowercased. Consecutive whitespace is collapsed. If a word contains any of the characters - / . : it is split at them, and they are removed. Other punctuation is removed. Thus

 'authors' => 'Thomas Krichel, & Paul Levine,',
 'sid' => 'ddoe12',
 'url-about' => 'http://citeseer.ist.psu.edu/412176.html',
 'title' => 'Does Precommitment Raise Growth? The Dynamics of Growth and Fiscal Policy',
 'type' => 'text',
 'id' => 'info:3lib:citeseerpsu:412176',
 'role' => 'author'

generates the whitespace-separated feature list

 thomas krichel & paul levine http citeseer ist psu edu 412176 html does precommitment raise growth the dynamics of growth and fiscal policy info 3lib citeseerpsu 412176

if I got it right! Feature weights are scaled so that each feature vector has unit Euclidean norm.
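
The extraction rules can be tried out with a short sketch. The function names features and unit_weights are hypothetical. Three assumptions go beyond the text: "&" is kept as a token, only because the example list above retains it; the raw weight of a token is its number of occurrences; and the normalisation divides by the Euclidean length of the vector.

```perl
use strict;
use warnings;

# Hypothetical sketch of the tokenisation described above.
sub features {
    my ($record) = @_;
    my $text = lc(join(' ', map { $record->{$_} // '' } qw(authors url-about title id)));
    $text =~ s{[-/.:]}{ }g;     # split words at - / . : and drop those characters
    $text =~ s/[^\w\s&]//g;     # remove other punctuation (assumption: keep "&")
    return split ' ', $text;    # splitting on ' ' also collapses whitespace
}

# Assumption: raw weights are token counts, scaled to unit Euclidean length.
sub unit_weights {
    my @tokens = @_;
    my %weight;
    $weight{$_}++ for @tokens;
    my $sumsq = 0;
    $sumsq += $_ ** 2 for values %weight;
    my $norm = sqrt($sumsq);
    if ($norm > 0) {
        $weight{$_} /= $norm for keys %weight;
    }
    return \%weight;
}

my %record = (
    'authors'   => 'Thomas Krichel, & Paul Levine,',
    'sid'       => 'ddoe12',
    'url-about' => 'http://citeseer.ist.psu.edu/412176.html',
    'title'     => 'Does Precommitment Raise Growth? The Dynamics of Growth and Fiscal Policy',
    'type'      => 'text',
    'id'        => 'info:3lib:citeseerpsu:412176',
    'role'      => 'author',
);
my @tokens = features(\%record);
print join(' ', @tokens), "\n";
my $weights = unit_weights(@tokens);
```

Under these assumptions the printed token list matches the feature list given above, and tokens that occur twice, such as growth and 412176, get twice the weight of the others.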

Learn.pm then computes the relevance of each suggestion and stores it in the 'relevance' column of the suggestions table. The column is added using ALTER TABLE `rp_suggestions` ADD `relevance` FLOAT NULL;. ACIS/Resources/Suggestions.pm has helper functions to manipulate the table.
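
Once relevance values are stored, ordering the remaining suggestions could look like the sketch below. It works on plain hashes rather than on the database, and the function by_relevance and the row layout are hypothetical. Because the column is nullable, rows without a relevance are assumed to go last, in their previous order.

```perl
use strict;
use warnings;

# Hypothetical sketch: order suggestion rows by relevance, highest first.
# Rows whose relevance is NULL (undef in Perl) go last; ties and undefined
# rows keep their original relative order.
sub by_relevance {
    my @rows  = @_;
    my @index = sort {
        my ($ra, $rb) = ($rows[$a]{relevance}, $rows[$b]{relevance});
        ((defined($rb) ? 1 : 0) <=> (defined($ra) ? 1 : 0))
            or ((defined($ra) && defined($rb)) ? $rb <=> $ra : 0)
            or ($a <=> $b);
    } 0 .. $#rows;
    return @rows[@index];
}

my @sorted = by_relevance(
    { sid => 'dcat3' },
    { sid => 'ddis136', relevance => 0.2 },
    { sid => 'dwop1',   relevance => 0.9 },
);
print join(' ', map { $_->{sid} } @sorted), "\n";
```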
