This is the Lima paper. First draft by Thomas Krichel on 2006-06-31.
It discusses three additional pieces of work to be accomplished. These are document to document associations, full-text file recognition and fuzzy searching.
We need to remember that ACIS is being funded as a tool to, ultimately, make more open access academic papers available. As such
The discussion here is much informed by the perspective of academics publishing today. There is a multitude of venues, some formal, some informal. While only formal channels have prestige, informal channels can hold documents whose contents are very close to the formal contents but much more available, often
Up until now, searches in the author name data of documents only look for exact occurrences of elements of the name variations profile. No fuzzy searching is conducted. The main reason is the computational expense of such searches.
We should do a limited amount of fuzzy searching. With a good strategy, we can limit the amount of computation while still achieving results that help to further raise the completeness of the dataset.
Fuzzy searching makes sense if there are misspellings in the author name. Misspellings are not variants of a person's name. They reflect errors of input, and such errors are not systematic. Say a person is called Müller. Then Muller and Mueller are variants, since they reflect two ways that the name is commonly written out in situations where umlaut input is difficult. But Müler would be a mistake and, because it is an error, it would most likely appear only rarely in the dataset.
Therefore it appears important to apply fuzzy searching to author name expressions that appear only once and are currently not covered by any name variation of any author.
For each author, and for each name variation of the author, ACIS could search all name expressions in the database, calculate the Levenshtein distance between the two, and, if the distance is below a critical level, suggest the name expression to the author.
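The search described above can be sketched as follows. This is only an illustration in Python, not ACIS code: the function names, the `max_distance` parameter and the sample names are assumptions made for the example. A candidate is suggested when its distance to a name variation is small (at most `max_distance`), which is one reading of the critical level.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def suggest_name_expressions(variations, expressions, max_distance=1):
    """Return (expression, variation, distance) triples that are near misses.

    Distance 0 is skipped because the exact search already covers it.
    max_distance is a hypothetical configuration parameter.
    """
    suggestions = []
    for variation in variations:
        for expression in expressions:
            distance = levenshtein(variation.lower(), expression.lower())
            if 0 < distance <= max_distance:
                suggestions.append((expression, variation, distance))
    return suggestions


# Sample data, invented for illustration.
variations = ["Müller", "Muller", "Mueller"]
expressions = ["Müler", "Smith"]
for hit in suggest_name_expressions(variations, expressions):
    print(hit)
```

The nested loop makes the cost visible: it grows with the number of name variations times the number of name expressions, which is why restricting the candidate expressions matters.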
As part of the operation, the amount of searching should be configurable. Some parameters are
Finally, the relationship that has to hold between the name expression, the name variation, and the Levenshtein measure before a name expression is suggested to the author as a possible match to her name could also be made configurable, with the administrator having the possibility to write a simple Perl function that relates the three parameters.
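To make the idea of such an administrator-supplied function concrete, here is a minimal sketch. It is written in Python for illustration, whereas the text envisages Perl. The signature, the names `default_rule` and `should_suggest`, and the particular rule (at most one edit per five characters of the shorter name) are all assumptions, not part of ACIS.

```python
from typing import Callable

# Assumed signature: (name_expression, name_variation, distance) -> suggest?
MatchRule = Callable[[str, str, int], bool]


def default_rule(expression: str, variation: str, distance: int) -> bool:
    """Hypothetical default: tolerate one edit per five characters."""
    if distance == 0:
        return False  # exact matches are found by the exact search
    shorter = min(len(expression), len(variation))
    return distance * 5 <= shorter


def should_suggest(expression: str, variation: str, distance: int,
                   rule: MatchRule = default_rule) -> bool:
    """Apply the configured rule to one candidate pair."""
    return rule(expression, variation, distance)


# An administrator could swap in a stricter rule, for example one that
# also requires the first letters to agree.
def strict_rule(expression: str, variation: str, distance: int) -> bool:
    return default_rule(expression, variation, distance) and \
        expression[:1].lower() == variation[:1].lower()
```

Making the rule depend on name length, as this default does, keeps very short names such as "Li" and "Lu" from being suggested to each other on the strength of a single edit.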
One way to populate digital libraries is to look out for metadata on papers, and find the corresponding files on the Internet using automated means. However, such identification is not reliable.
Even in the more formal collections, there is often confusion about what the full text is and how it should be treated. As a result, a lot of metadata collections point to intermediate pages, which usually hold a link to the full text, as the purported full text.
While everything should be done to avoid such misidentification, we should allow authors to recognize the full text themselves.
An author may have an archival profile. The author cannot directly edit it; it remains read-only. However, the author may express opinions about it that may be valuable for the maintainers of archives.
The first opinion is to recognize the full text. The user is presented with a set of papers for which an automated system has found some full-text files. The user can select or de-select relevant full-text files.
There are three indication options associated with each full-text file.
There are three archival options associated with each full-text file.
Since any co-author may make such choices, they are not shown for papers where a co-author has already made a decision.
When a user chooses to use the archival profile, she sees a list of all metadata for papers for which an archival file is available. She has to make a manual choice for each of them, or leave it at the default.
The default is that the full-text file is attached to the correct version, and that storage of this file is permitted. When a user has accepted the default, the corresponding files are no longer shown.
In an ACIS setting, a document is a metadata record plus any associated full text. The full text is optional.
In many circumstances, the same paper is available in different versions. Often these are versions published in different channels. Such versions may be linked. This is an administrative link. The author may wish to make this link explicit.
In other circumstances, an author may write a paper on a subject, and later write a second paper that is a development of the first paper. The author may wish to make this explicit. This is a thematic link.
In the context of this work, we would like to consider some more generic links between papers.
Each author can only create links between papers that she has authored, i.e. that are in her contributions profile. Thus, if author A1 and author A2 have collaborated on paper P1, author A2 and author A3 have collaborated on paper P2, and P2 is a development of paper P1, then only author A2 can make the link. As a result, some authors will have the opportunity to make a lot of links, while others will have little.
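The rule can be checked mechanically. The following Python sketch reproduces the A1, A2, A3 example. The mapping from authors to paper identifiers and the function names are assumptions for the example, not ACIS data structures.

```python
def can_link(author_papers: set, start: str, end: str) -> bool:
    """An author may link two distinct papers only if both are in her profile."""
    return start != end and start in author_papers and end in author_papers


def authors_able_to_link(contributions: dict, start: str, end: str) -> list:
    """List the authors whose contributions profile contains both papers."""
    return sorted(author for author, papers in contributions.items()
                  if can_link(papers, start, end))


# Contributions profiles from the example in the text.
contributions = {
    "A1": {"P1"},
    "A2": {"P1", "P2"},
    "A3": {"P2"},
}
print(authors_able_to_link(contributions, "P2", "P1"))
```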
Links can be symmetric or directed. Directed links may express a reverse property but the expression is left up to user services. Symmetric links can group the space of documents into overlapping sets, but expressing these sets is left to user services.
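One possible way to store both kinds of links is sketched below in Python. The class names, the link type names "version" and "development", and the storage as a set of tuples are assumptions for illustration. A symmetric link is stored once in a normalized order so that it can be found from either end, while a directed link can only be followed from its start document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkType:
    name: str
    symmetric: bool


class LinkStore:
    def __init__(self):
        self._links = set()  # (type name, start document, end document)

    def add(self, link_type: LinkType, start: str, end: str) -> None:
        if link_type.symmetric:
            # Normalize so that (P1, P2) and (P2, P1) are the same link.
            start, end = sorted((start, end))
        self._links.add((link_type.name, start, end))

    def linked_from(self, link_type: LinkType, doc: str) -> set:
        """Documents reached from doc; symmetric links count from both ends."""
        found = set()
        for name, start, end in self._links:
            if name != link_type.name:
                continue
            if start == doc:
                found.add(end)
            elif link_type.symmetric and end == doc:
                found.add(start)
        return found


# Hypothetical link types: an administrative and a thematic one.
version = LinkType("version", symmetric=True)
development = LinkType("development", symmetric=False)

store = LinkStore()
store.add(version, "P2", "P1")      # P1 and P2 are versions of each other
store.add(development, "P3", "P1")  # P3 is a development of P1
```

The reverse reading of a directed link (here, "P1 was developed into P3") is not stored. A user service would derive it by scanning for links that end at the document.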
The types of links will not be built into the software, but rather will be configurable.
The profile name is a first configuration parameter. This gives the name of the profile that is made explicit by the link collection.
When a user chooses to work on the profile, she will see a set of start documents. These are the documents to choose from. The documents may be sorted by a sorting function. This sorting function takes an XPath reference as a parameter. The default sorting order is by reverse date, i.e. most recent first. This parameter is the start sorting function.
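A start sorting function of this kind could look like the following Python sketch. The record layout (`document`, `title`, `date` elements), the function name and the ISO date format are assumptions for the example. The path is evaluated with the limited XPath subset of the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def start_sort(documents, xpath="date", reverse=True):
    """Sort records by the text found at xpath; the default is newest first.

    ISO dates (YYYY-MM-DD) sort correctly as plain strings. Records
    without a value at xpath sort as the empty string.
    """
    return sorted(documents,
                  key=lambda doc: doc.findtext(xpath, default=""),
                  reverse=reverse)


# Sample records, invented for illustration.
records = [ET.fromstring(xml) for xml in (
    "<document><title>Alpha</title><date>2004-05-01</date></document>",
    "<document><title>Beta</title><date>2006-03-15</date></document>",
    "<document><title>Gamma</title><date>2005-11-30</date></document>",
)]

for doc in start_sort(records):
    print(doc.findtext("date"), doc.findtext("title"))
```

Passing a different path, for example `start_sort(records, xpath="title", reverse=False)`, gives an alphabetical listing instead.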
When a user has found a document to start with, she is presented with the list of the other documents. Here there is an end sorting function that takes the XPath objects of the starting document and of the end documents as parameters. The default is a comparison of the Levenshtein distance of the titles.
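The default end sorting could be sketched as follows, again in Python with an assumed record layout (`document` and `title` elements) and assumed function names. Candidate documents are ordered by the Levenshtein distance between their title and the title of the starting document, closest first.

```python
import xml.etree.ElementTree as ET


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def end_sort(start, candidates, xpath="title"):
    """Order candidates by title distance to the start document, closest first."""
    reference = start.findtext(xpath, default="").lower()
    return sorted(
        candidates,
        key=lambda doc: levenshtein(reference,
                                    doc.findtext(xpath, default="").lower()))


# Sample records, invented for illustration.
start = ET.fromstring("<document><title>Fuzzy name matching</title></document>")
candidates = [ET.fromstring(xml) for xml in (
    "<document><title>Open access archives</title></document>",
    "<document><title>Fuzzy name matching II</title></document>",
)]

for doc in end_sort(start, candidates):
    print(doc.findtext("title"))
```

Each call computes one distance per candidate, and each distance costs time proportional to the product of the two title lengths, so long lists of end documents are where the expense would show.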
These functions may be expensive in computing time but it is up to the administrator of the ACIS installation to figure this out.
Several types of links may be grouped together in a profile. Thus the same profile may manage different types of links, which can be put on together or one by one. This device will spare the user from having to look at the same couples of documents several times for relationships that are similar, say one being a refinement of the other.