Bodes proposal for RAS user input to CitEc
Background
This document relates to CitEc. CitEc's current system
for automatic identification of references and citations has at
least two limitations:
-
We can only process those documents that are available in
electronic format and that have no access restrictions. That
limits our stock of processed documents to less than half of
the available documents in RePEc.
-
Even when a document is available for processing, it can be very
hard to get the text out of it. The software we are using for
this purpose (from Vividata Inc) does a good job overall. But it
can't deal with all types of PDF files. As a result we have about
100k unprocessed papers. And even when we have been able to
extract some text, the quality of the text may be quite weak. The
quality of the text is very important for the identification of
citations.
These limitations are the cause of frequent complaints from
authors. A typical complaint is "Why has this citation to one of
my papers not been included in IDEAS/EconPapers?" The answer
can be "because the citing paper has not been processed", due to
the problems described earlier, or "because the parsing of the
references failed".
At the technical level we are limited by external
applications. We can do little to improve our own process.
For this reason, a way forward would be to develop procedures to
allow input from authors. Of course, we will rely on those authors
that are registered with the RePEc Author Service,
henceforth RAS.
Proposal
In this document we present a proposal to extend the RePEc Author
Service in order to allow input of references and citations from
documents which have not been automatically processed. Such
contributions from authors have two issues to be resolved: (1) input and
(2) validation of data in order to prevent fraud.
Typically, we have to deal with two people. The "requester" is
the author of a cited paper. Requesters want us to know that
their papers have been cited. However, they can't furnish us
with a proof. Only the second person, the "author" of the citing
paper, can do it. In the following, we try to design something
a little broader than that, but this is the basic scenario.
The input process is described later. For the validation of data
we will rely on the collaboration of registered authors in
RAS. We will develop a communication system between requesters,
authors, and CitEc management, using email or web forms. The aim
is that each input action from a requester should be validated
by the corresponding author before the citation is included in
CitEc. That is, the request to add a citation to a paper should
be validated by one of the authors of the citing paper.
The plan is agnostic about what server the new service runs on. It
takes account of the fact that the Bodes server may not be
completely in sync with the RAS and the CitEc server. The Bodes
server is assumed to hold complete profile information, from the
ACIS userdata directory. Therefore it is able to authenticate
RAS users. The Bodes server is also assumed to hold a complete
RePEc dataset, and to perform lookups on it, as described below.
Basic process
There are three actions to be performed, labeled add_ref,
add_single_ref and add_cit.
There is a request to be performed by a requester. A request is an
action which requires confirmation by an author. The request is
called: Add_ref_request.
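The vocabulary above (two roles, three actions, one request type) can be
written down as a small data model. The following is only a minimal
sketch. The names Role, Action, User, role_for and needs_confirmation,
and the idea of deriving the role from the papers listed in the user's
profile, are assumptions made for illustration and are not part of the
proposal.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    AUTHOR = "author"        # wrote the document being edited
    REQUESTER = "requester"  # anybody else; input needs confirmation


class Action(Enum):
    ADD_REF = "add_ref"
    ADD_SINGLE_REF = "add_single_ref"
    ADD_CIT = "add_cit"


@dataclass(frozen=True)
class User:
    ras_id: str               # hypothetical RAS identifier
    paper_handles: frozenset  # handles of the papers in the user's profile


def role_for(user: User, refdoc_handle: str) -> Role:
    """Assumed rule: a user is 'author' only for papers in her own profile."""
    if refdoc_handle in user.paper_handles:
        return Role.AUTHOR
    return Role.REQUESTER


def needs_confirmation(role: Role) -> bool:
    """Input from a requester must be confirmed by an author first."""
    return role is Role.REQUESTER
```

With such a model, every script can decide in one place whether an input
goes straight to the database or becomes an Add_ref_request.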
action Add_ref
This action adds references for a document not processed yet.
-
If there is no session for a RAS user, we authenticate the
user and start a session. Before we authenticate a user, we
check that the user is not blocked for entry to the system.
-
We propose an "author" form with a select menu of papers that
the user has written, and invite her to select one to add
references to. This is called the refdoc handle. If she
chooses to do so, she makes an in-profile request. She is given
the status of author with respect to that request.
-
We also propose a separate "requester" form using a text input
box with AJAX completion for a valid RePEc doc handle. If the user
furnishes us with a handle, she is given the status of
"requester" with respect to that request. The completion to a valid
refdoc handle is important here. When a refdoc is proposed,
it becomes "owner blocked" for the user.
-
The script checks that the refdoc has not been processed yet.
Otherwise, it calls the action add_single_ref.
-
The script writes basic information about the refdoc and a
textarea to introduce the references.
-
We check that the reference listing format of
the text introduced by the user corresponds to a spec to be
determined. This is called the syntax check. The syntax
check will be performed by a module residing on the CitEc
server and copied to the Bodes server.
-
If the user has role author, we add the references
to the database and set the document status to "ready" to be
processed.
-
If the user has role requester, we generate an add_ref_request
action.
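The last steps of this action (processed check, syntax check, and the
split between author and requester) could be sketched as follows. This
is an illustration under assumptions, not the planned implementation.
The reference format spec is still to be determined, so syntax_check
below is a placeholder rule invented for the example (one reference per
non-empty line, each containing a four-digit year). The dictionaries
statuses and references and the list requests are hypothetical stand-ins
for the database.

```python
import re

# Placeholder rule only: the real spec is still to be determined.
YEAR = re.compile(r"\b(19|20)\d{2}\b")


def syntax_check(text):
    """Return (ok, refs): one reference per non-empty line, each with a year."""
    refs = [line.strip() for line in text.splitlines() if line.strip()]
    ok = bool(refs) and all(YEAR.search(ref) for ref in refs)
    return ok, refs


def add_ref(role, refdoc, text, statuses, references, requests):
    """Handle one add_ref submission and return the name of the outcome."""
    if statuses.get(refdoc) == "linked":
        # The document has been processed already: hand over.
        return "add_single_ref"
    ok, refs = syntax_check(text)
    if not ok:
        return "syntax_error"
    if role == "author":
        references[refdoc] = refs
        statuses[refdoc] = "ready"  # ready to be processed
        return "stored"
    # Requester: nothing is stored until an author confirms.
    requests.append((refdoc, refs))
    return "add_ref_request"
```

Returning the name of the outcome, rather than calling the next action
directly, keeps the sketch independent of how the scripts are chained.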
Add_single_ref
Adds a reference to a document that has
been processed already. The document status should be "linked".
-
A cgi-bin script located at the RAS server is called. This
script requires as option the handle of the document to add
references to.
-
The script checks that the user is registered. If not, it
returns to the referrer page.
-
The script checks the identity of the user and sets her
role as requester or author.
-
The script checks that the document has been processed already;
otherwise it calls the action Add_ref.
- The script writes basic information about the document.
- If the user has role author:
  - The script writes a list of references already processed,
    with a button for each reference to allow the user to delete
    it.
  - Adds a text area to include the new reference text.
  - Adds a button "process changes".
  - The script deletes the references, and citations if any,
    checked by the author.
  - The script checks that the format of the text introduced
    by the user is correct.
  - The script checks if the reference is a citation.
    - If this is the case, it adds the reference and the citation
      to the database.
    - If not, it asks the user if she wants to add a citation for
      this reference. If not, exit, else start an action Add_cit.
- If the user has role requester:
  - The script writes a list of references already processed.
  - Adds a text area to include the new reference text.
  - Adds a button "process changes".
  - The script checks that the format of the text introduced
    by the user is correct.
  - The script generates an Add_ref_request.
  - Exit
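The "process changes" step differs by role: an author's deletions and
additions are applied at once, while a requester's new reference only
produces a request. A minimal sketch, assuming references are held as a
plain list and the checked boxes arrive as a set of list positions (both
are assumptions, as are the function name and the shape of the request):

```python
def process_changes(role, references, checked, new_text):
    """Return (references, request) after the "process changes" button.

    references: list of reference strings already processed
    checked:    set of positions the author ticked for deletion
    new_text:   content of the text area, possibly empty
    """
    new_text = new_text.strip()
    if role == "author":
        # Authors edit the list directly.
        kept = [ref for i, ref in enumerate(references) if i not in checked]
        if new_text:
            kept.append(new_text)
        return kept, None
    # Requesters cannot delete; their addition waits for confirmation.
    request = None
    if new_text:
        request = {"type": "add_ref_request", "text": new_text}
    return list(references), request
```

The format check and the citation check described in the list are left
out here; they would run on new_text before anything is stored.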
Add_cit
Adds a citation to a processed document.
- A cgi-bin script located at the RAS server is called. This
script requires as option the handle of the document which
has been cited.
- The script checks that the user is registered. If not,
return to the referrer page.
- The script checks the identity of the user and sets its
role as requester or author.
- Checks that the citing handle exists, otherwise returns to the
referrer page.
- Checks that the citing document has status "linked",
otherwise starts action Add_ref.
- Shows the list of references and for each reference a button to select the corresponding one.
- If the user chooses one of them, the system matches the reference against the data of the cited document.
- If the test is successful, adds the citation, else it writes an error message and returns to the references list.
- If the user has not selected any reference, that means the cited document is not in the list of references, so an action Add_single_ref is launched.
- End of process
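The proposal does not say how a selected reference is matched against
the data of the cited document. As one possible reading, the sketch
below accepts a reference when it contains the normalized title, the
year, and at least one author surname of the cited document. The rule
itself, the function names, and the keys "title", "year" and "surnames"
are all assumptions made for this example.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation by spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def contains(haystack, phrase):
    """True if phrase occurs in haystack on whole-word boundaries."""
    return f" {phrase} " in f" {haystack} "


def reference_matches(reference, cited):
    """Assumed test: title, year and one surname all occur in the reference."""
    ref = normalize(reference)
    title_ok = contains(ref, normalize(cited["title"]))
    year_ok = contains(ref, str(cited["year"]))
    surname_ok = any(contains(ref, normalize(s)) for s in cited["surnames"])
    return title_ok and year_ok and surname_ok


cited = {
    "title": "Citation Analysis in Economics",  # made-up example record
    "year": 2004,
    "surnames": ["Doe"],
}
good = 'Doe, J. (2004): "Citation analysis in economics", Working Paper 12.'
bad = "Doe, J. (2003): Something else entirely."
print(reference_matches(good, cited), reference_matches(bad, cited))
```

A strict rule like this favours fraud prevention over recall: a
reference with a misspelled title fails the test and the user is sent
back to the references list.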
Add_ref_request
- A cgi-bin script is called with the handle on which the request is being made and the contents of the request (to be defined).
- The script selects the authors of the document who are registered with RAS.
- If no authors are selected, it writes an error message and ends.
- Otherwise it adds a request to the profile of each of them.
A new option to manage requests should be added to the profile of all
registered authors.
Using this manager, each author should have
information about pending requests, with the possibility to confirm or
refuse them, and also a list of submitted requests with the date and
result of each.
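The request step and the request manager could fit together as in the
sketch below. It is an illustration under assumptions: the Request
fields, the in-memory inbox dictionary standing in for author profiles,
and the function names are hypothetical. One shared Request object is
placed in every registered author's inbox, so that confirmation by any
one of them settles the request, in line with the rule that one of the
authors of the citing paper validates.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Request:
    handle: str      # document the request is about
    content: str     # contents of the request (format to be defined)
    requester: str   # hypothetical RAS identifier of the requester
    submitted: date
    result: str = "pending"  # "pending", "confirmed" or "refused"


def add_ref_request(handle, content, requester, authors_of, registered,
                    inbox, today):
    """File a request with every RAS-registered author of the document."""
    targets = [a for a in authors_of.get(handle, []) if a in registered]
    if not targets:
        return None  # no registered author: the caller writes an error
    request = Request(handle, content, requester, today)
    for author in targets:
        inbox.setdefault(author, []).append(request)
    return request


def decide(request, author, inbox, confirm):
    """Let one of the addressed authors confirm or refuse a request."""
    if not any(r is request for r in inbox.get(author, [])):
        raise PermissionError("request was not addressed to this author")
    if request.result != "pending":
        raise ValueError("request has already been decided")
    request.result = "confirmed" if confirm else "refused"
    return request.result
```

The list of submitted requests for a requester can then be obtained by
filtering on the requester field, which also carries the date and the
result.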
Remaining issues and random thoughts