\documentclass{elsart}

\usepackage{natbib}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{longtable}

\bibliographystyle{elsart-harv}

\usepackage{url}
\def\UrlFont{\rm}
\def\UrlLeft#1{#1%
   \if -\noexpand#1\else
     \penalty\UrlBreakPenalty
   \fi
   \UrlLeft
}
\def\UrlRight #1\UrlLeft{}


% local definitions
\newcommand{\best}{_{\rm b}}
\newcommand{\worst}{_{\rm w}}
\newcommand{\last}{_{\rm l}}
\newcommand{\OutS}{\mathbb{O}}
\newcommand{\Expe}{{\rm E}}


\begin{document}


\begin{frontmatter}

\title{Information retrieval performance measures for 
a current awareness report composition aid}

\thanks{I am grateful for comments by Christian Calm\`es, William
S. Cooper, Robert M. Losee, Amanda Z. Xu and two referees of
``Information Processing and Management''.  I am also grateful to all
the volunteers of NEP for the work they have been putting into 
running the service. The hospitality of Tatyana I. Yakovleva provided
a congenial setting for the work on this paper.}


\author{Thomas Krichel}

\address{College of Information and Computer Science\\ Long Island
University \\ 720 Northern Boulevard \\ Brookville NY 11548--1300,
U.S.A.}

\address{Faculty of Information Technology\\ Novosibirsk State
University\\ 2, Pirogova Street\\ 630090 Novosibirsk, Russia }

\address{http://openlib.org/home/krichel \\ krichel@openlib.org}


\begin{abstract}

This paper studies a special ``small'' information retrieval problem
where user satisfaction only depends on the ordering of documents.  We
look for a retrieval performance measure applicable for this setting.
We define some requirements for such a measure.  We develop a
theoretical ordering of all outcomes. We look at some standard and
purpose-built measures and assess them against the requirements.
We conclude that a linear combination of two such measures is
adequate. 

\end{abstract}


\end{frontmatter}

\section{Introduction}


The classic measures of information retrieval performance are
precision and recall. Precision is the number of retrieved and
relevant documents divided by the number of retrieved
documents. Recall is the number of retrieved and relevant documents
divided by the number of relevant documents.  Both ratios are
used jointly because they capture two complementary aspects of the
retrieval process. Precision tells us how good the system is at
filtering out non-relevant documents.  Recall tells us how good the
system is at finding relevant documents.

Other measures have been proposed that aim to summarize information
retrieval performance in a single number.  These include the average
precision at seen document, the R-precision, the E-measure, van
Rijsbergen's F and the average precision over all documents.  All of
these numbers are directly based on the concepts of precision and
recall. In fact, they are precise mathematical functions of precision
and recall ratios.

Despite numerous critiques of these measures, they remain the most
widely deployed in ``large'' information retrieval problems.  In such
``large'' problems there is a large set of documents, typically so
large that the one-by-one examination of each document is not
realistic. Then, an information retrieval system has the task of
retrieving a set of documents that corresponds to the information
need. The user only sees the set of retrieved documents.

This paper is motivated by my concern for a special information
retrieval problem. We can call this problem a ``small'' information
retrieval problem. The service helps a user to find the relevant
documents out of a small collection of documents. The collection is
small enough that the user can examine each document one by one. The
purpose of the information retrieval system is to make it easier for
users to reach decisions.

One example of such a ``small'' information retrieval system has been
the motivation for this paper. I created the ``NEP: New Economics Papers''
service at \url{http://nep.repec.org} and I am involved in running
it. NEP is a current awareness service of the RePEc digital library,
see \url{http://repec.org}. It filters new additions to RePEc into
weekly subject-specific reports. Each report is edited by a volunteer
editor. Each week editors are given a list of new additions to RePEc.
From that list, they select the documents that are relevant to the
subject of the report.  These ``relevant'' documents form an issue of
the report. Report issues are circulated via email.

The NEP service has been running since 1998.  During that time RePEc
has grown. So has the list of new additions that appear each week. The
median number of new documents per week, over the entire life of the
service, is 300. But in recent months, bumper crops of over 600 new
documents are not uncommon. With numbers of that sort, we cannot
expect volunteer editors to ponder each document for much more
than a few seconds. At present, most editors, when composing the
report issue, look first at document titles.  If a title looks
appealing, they may look at the abstract. But sometimes a
unappealing title may hide a relevant document. This would become
clear if the editor read the abstract. However, with large new
additions lists it is not realistic to expect the editor to read the
entire set of abstracts. A pre-selection by the title is inevitable.
It inevitably leads to editorial mistakes.

%%
%% start of change requested by R1, fix #1
%%
% To make life easier for the editors, it would be useful to sort the
% list of new additions such as to show editors up-front those documents
% that are most likely to be included. 
  To make life easier for the editors, it would be useful to sort the
  list of new additions in order to show editors up-front those documents
  that are most likely to be included. 
%%
%% start of change requested
%%
Such a sorting system can bring two benefits to editors. First, they do not
have to labor through the whole of the new additions list. If the
algorithm works well, they can skip the tail end. Second, they can
increase attention to the documents that the computer has put to the
top of the list.  This has the potential to eliminate oversight of
documents with imaginative titles.

To evaluate such a sorting system, we need some measures.  Within
the specific context, precision and recall do not appear to be
useful. There are three basic ways in which one can interpret
precision and recall within the NEP context.

One approach is to say that they remain constant.  Recall is always
100\%, and precision is always equal to the number of documents that
are relevant to the subject report, divided by the size of the new
additions list. If we take that view, then precision and recall do
not depend on the sorting process.

%%
%% change requested by R1, fix #2
%%
% Another approach looks at precision and recall at the level of the
% last included document.
  Another approach looks at precision and recall at the level of the
  last relevant document.
%%
%% end of change requested by R1
%%
Thus, if the information retrieval system sorts all the relevant
documents to the front, then precision and recall are 100\%. If there
are some non-relevant documents that are found before the last
document that is relevant, precision is the number of relevant
documents divided by the position of the last relevant document. But
recall is still 100\%, therefore it is not useful. The measure of
precision that we reach with this view is isomorphic to the Nosel measure
that I discuss in Subsection \ref{sec:nosel}.

A third approach is to think of the output of the information
retrieval system as sets of documents. We distinguish the documents
that the information retrieval system has predicted as relevant versus
those that it has predicted as non-relevant. We can then calculate
precision and recall figures as intended by comparing the sets
returned by the information retrieval system with the sets assembled
by the editor. The latter are assumed to be correct. There are two
problems with this approach. First, all computer-based information
retrieval systems rank documents by the likelihood that they are
relevant.  It is in the very nature of the type of calculations done
that such a ranking is produced.
%%
%% start of change requested by  R1, fix #3
%%
% Thus, just looking at sets of documents implies that one
% deliberately ignores information which is available.
%
  Therefore, if we conceive the information retrieval system output as a
  couple of sets, we exclude additional information that the system has
  produced. 
%%
%% end of change at the request of R1
%%
This is inconsistent with an assumption of rational behavior of
users. Second, it should matter a lot in what order the documents
appear. This second problem is best illustrated by an example. For
the sake of illustration, let numbers denote documents that, according
to the editor, are relevant, and letters denote documents that are not
relevant.  Assume that the information retrieval system finds that
documents 1 and 2 are relevant. It will sort those to the front. Then
$[1,2,3,4,\text{a},\text{b},\text{c},\text{d}]$ and
$[1,2,\text{a},\text{b},\text{c},\text{d},3,4]$
%%
%% start of change requested by  R1, fix #4
%%
%  have the have the
have the
%%.
%% end of change at the request of R1
%%
same precision (100\%) and recall figures (50\%). But the former is
perfect for the editor, while the latter is basically useless.  Half
of the relevant documents are at the front, the other half is at the
rear. The editor still has to work through the entire list of new
additions to find all the relevant documents.

We hope to have convinced the reader of the need for alternative
measures. 
%%
%% start of change requested by  R2, fix #30
%%
The remainder of the analysis therefore does not start with
the analysis of precision and recall values, as 
\citet{keirij74foundation} or \citet{wilsha86on} have done.
Instead we use an informal utility maximization approach,
as, for example, \citet{wilcoo73on}. In addition, we will
consider rationality arguments, an approach that is more often found
in economics than in information science. 
%%
%% end of change at the request of R2
%%
Before we review criteria, we devote two sections to
further thinking about the problem. In Section \ref{sec:general} we
set out the general framework. In Section \ref{sec:natural} we look
more closely at the problem that editors face. We try to establish
what order a rational editor may place on the outcomes. In Section
\ref{sec:measures} we present alternative measures. In Section
\ref{sec:test} we test these measures on the NEP report data using
support vector machines. In Section \ref{sec:conclusions} we offer
conclusions.


\section{The problem}\label{sec:general}

% % There are basically two types of evaluation exercises one can do. We
% % can call them individual and collective exercises An individual
% % exercise associates a value with the individual outcome of a single
% % sorting trials. A collective exercise associates a value to a method,
% % given that the same method has been used on a large number of sorting
% % trials. To calculate the precise of the evaluation for such a method,
% % a large number of sorting processes will have to be run.  In this
% % section, as well as is the remainder of this paper, we only cover
% % individual exercises. That is, we require that the evaluation that
% % we conduct associates a value with an individual outcome. We can
% % then use standard statistical aggregates---such as the mean---to
% % aggregate individual trials to evaluate a method.
% %
% % To formally address the evaluation measures, let us fix some notation.

Let there be a vector $x$ called the outcome vector or simply the
outcome. It has $n$ elements, $r$ of which take the value $1$,
i.e.~they represent relevant documents, and $n-r$ take the value $0$,
i.e.~they represent non-relevant documents. The ratio $r/n$ is the
generality of the information need. Let us call $\OutS(r,n)$ the
outcome set.  This is the set of all possible outcomes. For a fixed
$r$ and a fixed $n$, this is a set with $n!/(r!\,(n-r)!)$ members.
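
As a small illustration, and not part of the formal development, take
$r=2$ and $n=5$, the case used in the tables of Section
\ref{sec:measures}. The outcome set $\OutS(2,5)$ then has
$$
\frac{5!}{2!\,3!}=10
$$
members, ranging from $x\best=[1,1,0,0,0]$ to $x\worst=[0,0,0,1,1]$.
By contrast, a weekly list of $n=300$ documents with, say, $r=30$
relevant ones (hypothetical figures, chosen only for illustration)
already has more than $10^{40}$ possible outcomes.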

Loosely speaking, we are looking for a measure of how much the 1s
are at the front of the vector and the 0s at the end. The best outcome
is
$$
x\best=[1,\ldots,1,0,\ldots,0]
$$
and the worst outcome is 
$$
x\worst=[0,\ldots,0,1,\ldots,1].
$$
Let $f(x)$ denote a measure of the quality of the outcome.  There are
many measures that one may define. Each of these measures is
subjective in a way. It captures a desired property of the outcome. We
can find four requirements that hopefully most people can agree with.

Requirement 0 is the most subjective. It says that $f(x)$ must not be
too complicated. It should be fairly easy to explain to people what
the measure is.

Requirement 1 is in some ways a corollary of requirement 0. 
%%
%% start of change requested by R1, fix #5
%%
% People are are used to reason that a better outcome mean a 
% higher $f(\cdot)$.
  People are used to reasoning that a better outcome means
  a higher $f(\cdot)$.
%%
%% end of change requested by R1
%%
Thus
$$
f(x)>f(x')\quad\Longleftrightarrow\quad x\mbox{ is better than }x'.
$$
Requirement 2 again is a corollary of requirement 0. People are used
to reasoning in terms of percentages. Therefore it seems adequate to
require
\begin{equation}\label{eq:best}
f(x\best)=1.
\end{equation} 
Apart from the best outcome another benchmark is important.  That
is the case where $x$ is picked randomly out of $\OutS(r,n)$.
%%
%% change requested by R1, fix #6
%%
% In that case, the measure should be 0 in order for it to convey the
% idea that the information retrieval system that achieves a positive
% value of $f(\cdot)$ outperforms a random allocation. 
  In that case, the outcomes are of no use.
%%
%% end of change requested by R1
%%
This leads to Requirement 3
\begin{equation}\label{eq:exp}
\Expe\,f(x)=0,
\end{equation}
where $\Expe$ stands for the expected value operator.  One technical
constraint that comes out of this requirement is that there has to be a
closed form for the expected value. Of course, the expected value of
any measure $f(x)$ over the finite set $\OutS(r,n)$ can be calculated
by computer as the average over all potential outcomes. 
However, for large $n$ and $r$, the number
of members of $\OutS(r,n)$ becomes too large for this to be practical.
Current computer technology is simply not powerful enough to
accomplish the calculations in reasonable time.

Finally, we have an additional desired feature. We refer to this
as the scaling property.  With a constant generality $r/n$, if $r$ and
$n$ are both multiplied by a scale $t\in\Nset$, with $t>1$, we would
like to get the same values of $f(\cdot)$ if we construct a scaled
outcome by repeating each element in the original vector $t$ times.
For an example, assume $n=3$, $r=1$ and $x=[0,1,0]$. If
$t=2$ we can construct $x'=[0,0,1,1,0,0]$, by simply repeating each
element in $x$ $t$ times.  It is obvious that we can always make
such a scaling transformation and that this transformation is
unique. Let $t\otimes x$ denote the scaled outcome vector.  If
we make such a transformation, it appears natural to wish that
$f(x)$ does not change, i.e.
$$
\forall\,x\in\OutS(r,n),
\quad\forall\,t\in\Nset:\qquad f(x)=f(t\otimes x).
$$
If $f(\cdot)$ satisfies that property, we will say that it 
scales.

\section{Subject editor behavior modeling}\label{sec:natural}

The proposed system helps editors of a current awareness service such
as NEP.  The ultimate judge of performance is therefore the subject
report editor. To build general criteria, it is useful to build a
model of subject editor behavior. While a full mathematical model
would be outside the scope of this paper, we hope to establish some
general principles using simple deductive reasoning based on a highly
simplified view of the editorial process.

An editor faces a list of documents. A document is metadata about a
paper plus a link to the full text of that paper. An editor may spend
a lot or only a little time on the document. This decision on the
level of effort per document is difficult to model. Therefore we will
not look at it here. In other words, we assume that the examination of
a document is a discrete process, i.e., the document is examined or it
is not examined.

%%
%% start of change requested by R1 fix #7
%%
% After examining a document, the editor knows whether that document
% is relevant or or not. 
  After examining a document, the editor knows whether that 
  document is relevant or not. 
%%
%% end of change requested by R1
%%
We further assume that the decision to include a document or to
exclude it only depends on the contents of that document.  It is
independent from the contents of other documents.  This reasoning
assumes away any learning that may take place while the list of
documents is examined.

As the editor works through the list of documents, she faces two types
of costs. First there is the cost $c_1$ of examining the next
document. Without losing much generality we can assume that $c_1$
remains constant over the report issue composition process.  Second,
there is the cost of missing relevant documents. Let us loosely call
this $c_2$, though it is clear that $c_2$ somehow depends on the actual
number of documents missed.  As the editor moves along the list, she
faces an optimal stopping problem. If she stops examining documents,
she no longer suffers the penalty $c_1$ for each of the remaining
documents. But she faces the penalty $c_2$ of missing relevant papers.  If
$c_1\gg c_2$, the editor will not examine any documents at all, and if
$c_2\gg c_1$, the editor will examine all documents.  In less extreme
cases, there is a balancing act.  This balancing act is 
complicated because of the uncertainty surrounding $c_2$.

To make further progress in our reasoning, we need to simplify the
problem. Let us assume that a magical interface could be built that
would remove the uncertainty regarding $c_2$. We could imagine a
traffic light sign in the editor's interface. It would show green
while there is at least one more relevant document, and it would show
red if there are no more relevant documents.  Such a scenario is
unrealistic of course, but for the moment just imagine it could be
achieved. Clearly when the traffic light turns red, the editor
will stop examining new documents. Now let us in addition assume that
the editor is a conscientious person. By this we mean that while the
traffic light is green, she will continue to examine new documents,
until the traffic light is red.

However unrealistic the traffic light scenario is, it
can teach us one insight. With a traffic light, the editor will, when
presented with two outcomes $x$ and $x'$, prefer the one in which the
position of the last relevant document, $i^*=\max\,\{i: x_i=1\}$, is
lower. For a conscientious editor examining
either $x$ or $x'$, $c_2=0$. However, the examination cost $c_1$ will
be lower the lower $i^*$ is.  This reasoning establishes a weak
ordering over all outcomes. Comparing two outcomes, an editor will
prefer the one with the last relevant document at an earlier
position. The editor will be indifferent between two outcomes that
have the last relevant documents at the same position. This reasoning
requires certainty, and that the editor is conscientious. We
can hope that it will also hold under some uncertainty, provided that
the uncertainty is not too large, and for less than fully
conscientious editors, provided they are not reckless. 

Now imagine that the traffic signal cannot be completely trusted. It
would get it right most of the time but not always. Let us assume, to
simplify, that the uncertainty would only hold between the
second-to-last and the last document. Assume that the editor would
still follow the rule to stop if the traffic light turns to red,
simply because the uncertainty is marginally small.
%%
%% start of change requested by R1, fix #8
%%
Let $i^*$ be the position of the last relevant document.  Compare two
outcomes that are identical, save the fact that positions $i^*-1$ and
$i^*-2$ are exchanged
\begin{align*}
x_1&=[\dots,0,1,1,0,\ldots,0]\\
x_2&=[\dots,1,0,1,0,\ldots,0].
\end{align*}
In the notation above $\ldots$ represents positions that are identical
for both outcomes. The second $1$ is the last, i.e.~it is
followed by zeros only.
%%
%% end of change requested by R1
%%

I claim that when the editor compares the two outcomes, she will
prefer $x_2$ over $x_1$. The reasoning goes as follows. If a wrong red
signal is perceived at position $i^*-1$, the editor loses the last
relevant document under $x_2$, but two relevant documents under
$x_1$. If a wrong signal is received at any earlier position than
$i^*-1$ the loss of the number of relevant documents is the same.
Therefore, when presented with two outcomes that have the same
position for the last relevant document, the editor will prefer the
one with the second-to-last relevant document at the earlier position. This
reasoning can be repeated for the third-to-last relevant document etc.  We
obtain a complete order over the outcomes. Let us call it the natural
order.
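
One way to state the natural order compactly, offered here as a sketch
rather than as part of the argument above, is in terms of the
positions of the relevant documents. The symbols $p_j$ and $p'_j$ are
notation introduced only for this illustration. Let
$p_1<p_2<\dots<p_r$ be the positions of the 1s in $x$, and
$p'_1<p'_2<\dots<p'_r$ those in $x'$. Then $x$ precedes $x'$ in the
natural order if, for some $j$,
$$
p_j<p'_j\qquad\text{and}\qquad p_k=p'_k\quad\text{for all }k>j.
$$
For $r=2$ and $n=5$, the outcome $[0,0,1,1,0]$ has positions $(3,4)$
and $[1,0,0,0,1]$ has positions $(1,5)$. The last positions differ,
with $4<5$, so the former is preferred even though the latter has a
relevant document at the very top.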

We can  conjecture that it is possible to find a class of functional
specifications for the loss function, and a class of distribution
functions of documents in the included/excluded domain such that, when
editors minimize total loss knowing the distribution function, they
prefer outcomes in natural order.  Developing such a generic class
would, however, go beyond the scope of this paper.


\section{Various measures}\label{sec:measures}

In this section we address different measures, either taken from the
literature or purpose-built for this paper.  
%%
%% start of change requested by R2, fix #21
%%
% For the sake of clear labeling, each
% measure that we will be using for testing in Section \ref{sec:test} is
% named after a river in Siberia. The river and the measure are
% unrelated, of course.
%%
%% end of change requested by R2
%%


\subsection{The Swets/Brookes measure}

One interesting measure was proposed by
\citet{jonswe63information}. He assumes that ``when a search query is
submitted to a retrieval system, the system assigns an index value
(call it $z$) to each item in the store.'' This assumption holds true
for all computerized text classification and information retrieval
systems. Let $z|\text{r}$ and $z|\text{n}$ be the values of $z$ that the
system assigns to an item given that the item is relevant and
non-relevant, respectively.
\citet{jonswe63information} proposes to evaluate
$$
\frac{\Expe(z|\text{r})-\Expe(z|\text{n})}{\sigma^2(z|\text{n})},
$$
where $\sigma^2$ denotes the variance.
%%
%% start of change requested by R1, fix #9
%%
% According to \citet{jonswe63information} proposes to evaluate, as long
% as both conditional distributions of $z$ are normal, the measure has
% desirable properties.
%
  According to \citet{jonswe63information}, as long
  as both conditional distributions of $z$ are normal, the measure has
  desirable properties.
%%
%% end of change requested by R1
%%
But 
\citet{berbroo68the} suggests that a better measure would be 
\begin{equation}\label{eq:brooks}
\frac{\Expe(z|\text{r})-\Expe(z|\text{n})}{
\sqrt{\sigma^2(z|\text{r})+\sigma^2(z|\text{n})}}.
\end{equation}
This measure truly expresses the discriminating power of the
underlying information system. The $z$ data that it uses is much
richer than the positional data used by the other measures discussed
below. On the other hand, its precise values are dependent on
the technical characteristics of the information retrieval
system. Therefore, its best use is as a technical tool to compare
different parameters within the same information retrieval
methodology. 

An important problem is that the measure violates requirement
2.\footnote{It is not trivial
%%
%% start of change requested by R1 fix #10
%%
% to to
to
%%
%% end of change requested by R1 
%%
normalize the value such that it is 1 in the best case.  One approach
is that, faced with an outcome $x$ and its associated $z$ vector, we
can construct a perfect outcome that would rearrange $x$ to have all
relevant documents at the top, while keeping the $z$ values
constant. Unfortunately, while the constructed best outcome certainly
has a better value in terms of the numerator in expression
(\ref{eq:brooks}), we cannot be sure about which way the denominator is
going.  Therefore we can not construct an artificial best outcome in
this intuitive way.}  The measure has one more problem. It is a
measure that is essentially linear. The numerator is linear, and the
sign of the expression only depends on the numerator. This causes a
problem that we discuss in the next subsection.

\subsection{The Aselt measure}

\begin{table}
\caption{The Aselt measure for $r=2$ and $n=5$}
\label{table:aselt}
\begin{center}
\begin{tabular}{@{}ccrr}
$x$&$\alpha(x)$&$a(x)$\\
1,1,0,0,0&1.5&$1.0$\\
1,0,1,0,0&2.0&$2/3$\\
0,1,1,0,0&2.5&$1/3$\\
1,0,0,1,0&2.5&$1/3$\\
0,1,0,1,0&3.0&$  0$\\
0,0,1,1,0&3.5&$-1/3$\\
1,0,0,0,1&3.0&$  0$\\
0,1,0,0,1&3.5&$-1/3$\\
0,0,1,0,1&4.0&$-2/3$\\
0,0,0,1,1&4.5&$-1.0$\\
\end{tabular}
\end{center}
\end{table}

Although the Swets/Brookes measure is rarely used at present, linear
measures have not disappeared. For example,
\citet{roblos98text} considers the ``average search length'' as a
candidate measure for information retrieval performance. The average
search length is the average position of the 1s within the vector
$x$. That is, it is the sum of the positions $i$ where $x_i=1$,
divided by $r$. If $\alpha(x)$ is the average search length, we have
$$
\alpha(x\best)=\frac{1+r}{2}\qquad\text{and}
\qquad\alpha(x\worst)=\frac{2\,n-r+1}{2}.
$$
The expected value for the average search length\footnote{This could
be labelled the average average search length or expected average
search length. Our use of 
%%
%% start of change requested by R2, fix #26
%%
% geographical names
  acronyms
%%
%% start of change requested by R2
%%
avoids confusion.} can be readily found as
$$
\Expe\,\alpha(x)=\frac{n+1}{2},
$$ which, interestingly, does not depend on $r$. Having found the
expected value, we can construct a measure that satisfies requirements
2 and 3. 
%%
%% start of change requested by R2, fix #23
%%
% We define the Aselt measure of an outcome $x$, $a(x)$, by
  We define the Aselt---an acronym for ``\underline{a}verage
  \underline{se}arch \underline{l}ength 
  \underline{t}ransformed''---measure of an outcome $x$, $a(x)$, by
%%
%% end of change requested by R2
%%
$$
a(x)=\frac{n+1-2\,\alpha(x)}{n-r}.
$$
The Aselt measure is nicely
bounded, with $a(x\best)=1$ and $a(x\worst)=-1$.
It satisfies requirements 1--3. And it scales, because
it is based on an average, which itself is a linear function. 
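
A quick numerical check of these claims, using only the definitions
above and taking $\alpha(x)$ as the average position of the relevant
documents: for $r=2$ and $n=5$ the outcome $x=[1,0,1,0,0]$ has its
relevant documents at positions 1 and 3, so
$$
\alpha(x)=\frac{1+3}{2}=2\qquad\text{and}\qquad
a(x)=\frac{5+1-2\cdot 2}{5-2}=\frac{2}{3}.
$$
Scaling with $t=2$ gives $2\otimes x=[1,1,0,0,1,1,0,0,0,0]$ with
$n=10$, $r=4$ and relevant positions 1, 2, 5 and 6, hence
$$
\alpha(2\otimes x)=\frac{1+2+5+6}{4}=\frac{7}{2}\qquad\text{and}\qquad
a(2\otimes x)=\frac{10+1-7}{10-4}=\frac{2}{3},
$$
which is the same value, as the scaling property requires.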

Table \ref{table:aselt} carries a numeric illustration of the Aselt
measure. Looking at it we have two remarks. First, there seems to be
a concentration of points towards the middle of the distribution.
Second, not every different outcome has a different Aselt measure.  In
particular, the measure does not penalize two relevant documents in
the middle of the vector any differently from one relevant document
at the end. From a managerial point of view, this is a problem if we
are much concerned about having a measure that penalizes heavily a
single document that is left out at the end.

In Table \ref{table:aselt} outcomes are sorted by the natural order.
The table illustrates that the Aselt measure violates the natural
order.  That is, there are some pairs of outcomes $x$ and $x'$,
where $x$ is better than $x'$ according to the natural order but
$a(x)<a(x')$.  We refer to such a violation of the natural order as
``natural order reversion'' in the following. It is a problem that is
generic to all linear measures, and therefore generic to all scaling
measures. It therefore also affects the Swets and Brookes measures.
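
To make the reversion concrete with the values in Table
\ref{table:aselt}: the outcome $[0,0,1,1,0]$ has its last relevant
document at position 4 and therefore precedes $[1,0,0,0,1]$, whose last
relevant document sits at position 5, in the natural order. Yet
$$
a(0,0,1,1,0)=\frac{6-2\cdot 3.5}{3}=-\frac{1}{3}
\;<\;
a(1,0,0,0,1)=\frac{6-2\cdot 3}{3}=0,
$$
so the Aselt measure ranks the two outcomes the other way round.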


\subsection{Lofop measure}

\begin{table}
\caption{Lofop measure for $r=2$ and $n=5$}
\label{table:lofop}
\begin{center}
\begin{tabular}{@{}cr}
$x$&$m(x)$\\
1,1,0,0,0&$100.00\%$\\
1,0,1,0,0&$ 73.38\%$\\
0,1,1,0,0&$ 52.73\%$\\
1,0,0,1,0&$ 35.86\%$\\
0,1,0,1,0&$ 15.22\%$\\
0,0,1,1,0&$-11.40\%$\\
1,0,0,0,1&$-28.27\%$\\
0,1,0,0,1&$-48.92\%$\\
0,0,1,0,1&$-75.54\%$\\
0,0,0,1,1&$-113.06\%$\\
\end{tabular}
\end{center}
\end{table}
One idea to combat natural order reversion is to use the natural
logarithm, in order to reduce the high numbers that occur with the
findings of relevant documents at the top. If a document in position
$i$ is relevant, let it increment a total quality indicator of the
outcome by $\ln(n+1-i)$, so that a relevant document at the top adds
$\ln n$ and one in the last position adds $\ln 1=0$. Let $\mu(x)$ be
this measure.  Let $x_i=1$ if a
relevant document is at position $i$, or 0 otherwise. We get
$$
\mu(x)=\sum_{i=1}^{n} x_{i}\,\ln(n+1-i),
$$
which is well defined. 
%%
%% start of change requested by R2, fix #28
%%
  A simple calculation shows that
%%
%% end of change requested by R2, 
%%
$$
\Expe\,\mu(x)=\ln(n!)\,\frac{r}{n}.
$$
%%
%% start of change requested by R2, fix #24
%%
% This motivates the definition of the Lofop measure $m(x)$ 
  This motivates the definition of the Lofop---an acronym 
  for ``\underline{lo}garithm of \underline{fo}und 
  \underline{p}osition''---measure $m(x)$
%%
%% end of change requested by R2
%%
as 
$$
m(x)=\frac{\mu(x)-\Expe\,\mu(x)}{\mu(x\best)-\Expe(\mu(x))}.
$$ 
Note that $\mu(x)$ could also be defined with a logarithm to
another base. Normalization leaves the actual $m(x)$ unchanged for any
base. Table \ref{table:lofop} suggests that the Lofop measure does
have desirable properties. It spreads outcome values more evenly than
the Aselt measure and, according to the table, it obeys the natural
order.  Unfortunately, the respect for the natural order of the Lofop
measure in the $r=2$, $n=5$ case is not a general rule. We do not
have to look far before reversion on the natural order raises its ugly
head again. We will leave it as an exercise for the reader to show
that $m(1,1,0,0,0,0,0,1)=2.64\%$ but that
$m(0,0,1,0,0,1,1,0)=-21.37\%$. This is a clear violation of the
natural order.
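
The first of these values can be reproduced as follows. This is a
sketch that assumes a relevant document at position $i$ contributes
$\ln(n+1-i)$, the weighting that reproduces the figures in Table
\ref{table:lofop}. With $n=8$ and $r=3$,
\begin{align*}
\mu(1,1,0,0,0,0,0,1)&=\ln 8+\ln 7+\ln 1=\ln 56\approx 4.02535,\\
\Expe\,\mu(x)&=\tfrac{3}{8}\,\ln(8!)\approx 3.97673,\\
\mu(x\best)&=\ln 8+\ln 7+\ln 6=\ln 336\approx 5.81711,
\end{align*}
so that
$$
m(1,1,0,0,0,0,0,1)\approx\frac{4.02535-3.97673}{5.81711-3.97673}
\approx 2.64\%.
$$
The outcome is rewarded for the two documents at the top, even though
the editor must read to position 8 to find the third.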

\subsection{The Nosel measure}\label{sec:nosel}

\begin{table}
\caption{The Nosel measure for $r=2$ and $n=5$}
\label{table:nosel}
\begin{center}
\begin{tabular}{@{}ccrr}
$x$&$\lambda(x)$&$l(x)$\\
1,1,0,0,0&0&$1$\\
1,0,1,0,0&1&$1/2$\\
0,1,1,0,0&1&$1/2$\\
1,0,0,1,0&2&$0$\\
0,1,0,1,0&2&$0$\\
0,0,1,1,0&2&$0$\\
1,0,0,0,1&3&$-1/2 $\\
0,1,0,0,1&3&$-1/2$\\
0,0,1,0,1&3&$-1/2$\\
0,0,0,1,1&3&$-1/2$\\
\end{tabular}
\end{center}
\end{table}

\citet{wilcoo68expected} looked at a more general model than we do
here. He assumed that the information system would result in a weak
ordering of documents. That is, some documents would be ranked exactly
as relevant as some others. In that situation, he assumes that the
user will look at these equally ranked documents in a random order.
He calls the measure that he proposes the ``expected search
length''. It is the number of non-relevant documents that a user
would find until she finds a target number of relevant documents.
% % Since the paths of the user in a weakly ordered document set would
% % have some degree of randomness to them, Cooper considers the expected
% % value of the search length. 
In our setting, his ``expected search length'' reduces to the search
length, because there is no uncertainty about the order in which
the documents are examined. A further simplification in our case is
that we can consider that the target number of relevant documents is
the true number of relevant documents $r$.  Let $\lambda$ denote the
search length, then
$$
\lambda(x\best)=0\qquad\text{and}\qquad
\lambda(x\worst)=n-r.
$$
The expected value of the search length\footnote{This
is not the same thing as the expected search length. Our
use of acronyms avoids the potential
confusion.} over all outcomes 
in the outcome set is
$$
\Expe\,\lambda(x)=\frac{(n-r)\,r}{r+1}.
$$
The expected value is smaller than the worst case, but only
a little bit smaller, especially  if $r$ is large. This 
suggests that the distribution of values is highly skewed. 
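
As a check on this formula, consider the case $r=2$ and $n=5$ of
Table \ref{table:nosel}. The ten outcomes have search lengths
$0,1,1,2,2,2,3,3,3,3$, so that
$$
\Expe\,\lambda(x)=\frac{0+2\cdot1+3\cdot2+4\cdot3}{10}=2
=\frac{(5-2)\cdot2}{2+1},
$$
against a worst case of $n-r=3$.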

%%
%% start of change requested by R2, fix #22
%%
% In order to comply with requirements 1--3, we 
% define the Nosel measure of the outcome $l(x)$ as
  In order to comply with requirements 1--3, we
  define the Nosel (an acronym for ``\underline{no}rmalized
  \underline{se}arch \underline{l}ength'') measure of the
  outcome $l(x)$ as
%%
%% end of change requested by R2
%%
$$
l(x)%%=\frac{\lambda(x)-\lambda\best}{\Expe \lambda(x)-\lambda\best}
=1-\frac{\lambda(x)\,(r+1)}{r\,(n-r)}.
$$ 
As seen in Table \ref{table:nosel}, many different outcomes receive
the same Nosel measure. The Nosel measure essentially expresses how low
the last document found was. It is not interested in the position of
any other document. This can be a strength of the measure, because it
makes it easier to explain to people what has been measured. On the
other hand, it can also appear as a weakness of the measure. While the
position of the last relevant document should be, according to our reasoning
in Section \ref{sec:natural}, the most important aspect
of editor satisfaction, it is doubtful that it should be the
only one. In particular, an outcome that has all the relevant
documents next to each other but away from the top should
be counted as worse than an outcome that has all the relevant
documents at the top bar one at a late position, even if the
position of the last relevant item is the same in both scenarios. This
idea is, of course, embodied in the natural order.
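
To illustrate with $r=2$ and $n=5$, the outcomes $(1,0,0,1,0)$ and
$(0,0,1,1,0)$ both have the last relevant document at position 4, so
that $\lambda=2$ for both and
$$
l(1,0,0,1,0)=l(0,0,1,1,0)=1-\frac{2\cdot3}{2\cdot3}=0,
$$
although the first outcome places a relevant document at the very
top and the second does not.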

The Nosel measure does not scale.  To see this, it is sufficient to
look at a counterexample: $l(0,0,0,1,1)=-.5$, but
$l(0,0,0,0,0,0,1,1,1,1)=-.25$, which is not even close.
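
In detail, the first outcome has $n=5$, $r=2$ and $\lambda=3$, whereas
the second, in which every element of the first is repeated once,
has $n=10$, $r=4$ and $\lambda=6$. The definition of the Nosel measure
$l(x)$ then gives
$$
1-\frac{3\cdot3}{2\cdot3}=-\frac{1}{2}
\qquad\text{and}\qquad
1-\frac{6\cdot5}{4\cdot6}=-\frac{1}{4}.
$$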


\subsection{The Ponori measure}

Instead of looking at various measures and examining whether
they fit the natural order, a more fruitful approach
may be to build measures that directly impose the 
natural order by construction. 

One idea is that we can consider the sequence of 1s and 0s in the
outcome vector as a binary number.  This binary number can be
converted to decimal in order to capture its position in the natural
order. Converting $x\best$ to a decimal form leads to the highest
number, and converting $x_{{\rm w}}$ to decimal leads to the lowest
possible number. However, consider
$$
x\best=[1,\ldots,1,1,0,\ldots,0,0]
$$
versus
$$
x'=[1,\ldots,1,0,0,\ldots,0,1].
$$
The difference in the decimal measure between $x\best$ and $x'$ does not
appear to be as significant as one would like, if one wishes to
penalize late-occurring relevant documents significantly. Therefore,
rather than measuring the quality of the result by assigning high
powers to the first outcomes, we invert the outcome vector.
Thus we give high powers to the lower outcomes and call the resulting
number a loss. Of course, we are not limited to considering 
powers of 2 as the binary-number interpretation suggests. 
Any power of $y>1$ will be able to accomplish the
purpose of implementing the natural order.  This motivates the
following definition.

\begin{table}
\caption{Ponori measure for $r=2$ and $n=5$, $y=2$ and $y=\infty$}
\begin{center}
\begin{tabular}{@{}ccrr}\label{table:ponori}
$x$&$\omega(x,2)$&$o(x,2)$&$o(x,\infty)$\\
1,1,0,0,0&3&$    1.0$&$1$\\
1,0,1,0,0&5&$  37/47$&$1$\\
0,1,1,0,0&6&$  32/47$&$1$\\
1,0,0,1,0&9&$  17/47$&$1$\\
0,1,0,1,0&10&$ 12/47$&$1$\\
0,0,1,1,0&12&$  2/47$&$1$\\
1,0,0,0,1&17&$-23/47$&$-1.5$\\
0,1,0,0,1&18&$-28/47$&$-1.5$\\
0,0,1,0,1&20&$-38/47$&$-1.5$\\
0,0,0,1,1&24&$-58/47$&$-1.5$\\
\end{tabular}
\end{center}
\end{table}
Let $x=[x_1,\ldots,x_n]\in\OutS(r,n)$. 
%%
%% start of change requested by R2, fix #24
%%
%  Then the $y$ Ponori penalty of $x$, $\omega(x,y)$ is 
   Then the $y$-Ponori---an acronym for ``\underline{po}lynomial
   \underline{n}atural \underline{or}der 
   \underline{i}mposition''---penalty
   of $x$, $\omega(x,y)$ is
$$
\omega(x,y)=\sum_{i=1}^{n} y^{i-1}\,x_{i}.
$$
We find
\begin{equation}\label{eq:ponoriest}
\omega(x\best,y)=\frac{y^{r}-1}{y-1}
\qquad\text{and}\qquad
\omega(x\worst,y)=y^{n-r}\,\omega(x\best,y).
\end{equation}
The expected value is
\begin{equation}\label{eq:oexp}
\Expe\,\omega(x)=\frac{y^{n}-1}{y-1}\,\frac{r}{n}.
\end{equation}
%%
%% start of change requested by R1, fix #12
%%
  We can substitute for (\ref{eq:ponoriest}) and (\ref{eq:oexp}) to
  obtain a measure that satisfies (\ref{eq:best}) and (\ref{eq:exp}).
%%
%% end of change requested by R1
%%
This motivates the following definition. The Ponori measure 
of an outcome $x$ at the power $y$ is 
$$
o(x,y)=\frac{(y^n-1)\,r-(y-1)\,n\,\omega(x,y)}{(y^n-1)\,r-n\,(y^r-1)}.
$$
As $y\to\infty$, the Ponori measure becomes an indicator of whether
the last relevant document is at the last position. As
$y\to1$, $o(x,y)\to a(x)$. Thus, the Aselt measure is nothing
but a limiting case of the Ponori measure. In this limiting
case, the Ponori measure scales. But with $y>1$ it does not
scale. 
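
As a worked illustration for $r=2$, $n=5$ and $y=2$, equation
(\ref{eq:oexp}) gives $\Expe\,\omega(x)=31\cdot2/5=12.4$, and the
definition reduces to
$$
o(x,2)=\frac{62-5\,\omega(x,2)}{47}.
$$
For $x=(1,0,0,1,0)$ we have $\omega(x,2)=1+8=9$ and therefore
$o(x,2)=17/47$, as in Table \ref{table:ponori}. For the limit
$y\to\infty$, dividing the numerator and the denominator of $o(x,y)$
by $y^n$ shows that $o(x,y)\to1$ if $x_n=0$ and $o(x,y)\to1-n/r$ if
$x_n=1$, which is $-1.5$ in this example.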


\subsection{The Copnori measure}

The dependency of the Ponori measure on $y$ is inconvenient. It is not
clear what $y$ to choose. While the ordering of outcomes is not
sensitive to the choice of any $y>1$, the numbers coming out of the
evaluation definitely are. Thus a more fundamental measure is called
for.

One very simple way to achieve this is to count through the elements
in the outcome set in the natural order, assigning each worse outcome
an incremental penalty of 1.  Computing the sequence is reasonably
straightforward\footnote{Our computational
implementation is recursive.  For any outcome vector, we first
remove the trailing non-relevant outcomes. They do not affect the
result.  Thus we have a shortened outcome vector with $n_1$
elements, say, $r$ of which are relevant. We can then calculate a
minimum value for $\kappa(x)$ as
$\kappa\best(1)=(n_1-1)!/((n_1-1-r)!\,r!)$. We remove the last relevant
outcome from the vector.  This completes the first step.  We have a
new vector of $n_1-1$ elements, $r-1$ of which are relevant. Again,
we remove any non-relevant outcomes at the end of that new
vector. We find the next relevant outcome at $n_2$. We have a new
minimum value, $\kappa\best(2)$, which we add to the value found in
the previous step, $\kappa\best(1)$, etc.  We continue
until we arrive at a vector that has only relevant outcomes. The
result is the sum of all $\kappa\best(\iota)$, where $\iota$ is the
step number.}  If we start counting at 0, we get the counts
$\kappa(x)$ as
\begin{equation}\label{eq:kbest}
\kappa(x\best)=0\qquad\text{and}\qquad
\kappa(x\worst)=\frac{n!}{r!\,(n-r)!}-1.
\end{equation}
\begin{table}\label{table:nisa}
\caption{The Copnori measure for $r=2$ and $n=5$}
\begin{center}
\begin{tabular}{@{}ccrr}
$x$&$\kappa(x)$&$k(x)$\\
1,1,0,0,0&0&$1.0$\\
1,0,1,0,0&1&$7/9$\\
0,1,1,0,0&2&$5/9$\\
1,0,0,1,0&3&$1/3$\\
0,1,0,1,0&4&$1/9$\\
0,0,1,1,0&5&$-1/9$\\
1,0,0,0,1&6&$-1/3$\\
0,1,0,0,1&7&$-5/9$\\
0,0,1,0,1&8&$-7/9$\\
0,0,0,1,1&9&$-1.0$\\
\end{tabular}
\end{center}
\end{table}
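
The recursion described in the footnote can also be summarized in
closed form. This is offered only as a sketch: the positions $p_j$
are notation introduced here for illustration, and the formula
assumes that the step that removes the $j$th relevant document, found
at position $p_j$, contributes the binomial coefficient
$\binom{p_j-1}{j}$, with $\binom{a}{b}=0$ when $a<b$. With
$p_1<p_2<\cdots<p_r$ the positions of the relevant documents in $x$,
$$
\kappa(x)=\sum_{j=1}^{r}\binom{p_j-1}{j}.
$$
For example, $x=(0,1,0,1,0)$ has $p_1=2$ and $p_2=4$, so that
$\kappa(x)=\binom{1}{1}+\binom{3}{2}=1+3=4$, which is the count
listed for this outcome in the table for $r=2$ and $n=5$.
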
%%
%% start of change requested by R2, fix #27
%%
% To find the expected value is really easy
  The expected value is readily found as
%%
%% start of change requested by R2, fix #27
%%
\begin{equation}\label{eq:kexp}
\Expe\,\kappa(x)=\frac{\kappa\worst+\kappa\best}{2}.
\end{equation}
%%
%% start of change requested by R1 fix #13
%%
  We can substitute for (\ref{eq:kbest}) and (\ref{eq:kexp}) to
  obtain a measure that satisfies (\ref{eq:best}) and (\ref{eq:exp}).
%%
%% end of change requested by R1
%%
This motivates a definition.  
%%
%% start of change requested by R2 fix #25
%%
% The Copnori measure $k(x)$ of an outcome $x\in\OutS(r,n)$ is 
  The Copnori---an acronym for ``\underline{co}nstant
  \underline{p}enalty \underline{n}atural \underline{or}der
  \underline{i}mposition''---measure $k(x)$ of an outcome
  $x\in\OutS(r,n)$ is
%%
%% end of change requested by R2 
%%
$$
k(x)=1-\frac{2\,\kappa(x)}{\frac{n!}{r!\,(n-r)!}-1}.
$$
Unfortunately, the Copnori measure does not scale.  
%%
%% start of change requested by R2, fix #28
%%
% This is easily seen with an example.  
  This is seen with an example.  
%%
%% end of change requested by R2 
%%
$$
k(0,1,0,1,0)=1/9, \qquad k(0,0,1,1,0,0,1,1,0,0)=89/209.
$$
%%
%% start of change requested by R2, fix #29
%%
% which is not even close.
  The two numbers are not even close to each other. 
%%
%% end of change requested by R2
%%
But the measure has three strong
points. First, if one understands the natural order, it is a very
intuitive measure. Second, it does not depend on an arbitrary
parameter. Third, given the algorithm that we developed, the Copnori
measure is easy to compute even for large $n$ and $r$.
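
To illustrate the computation on the second outcome of the example
above, which has $n=10$ and $r=4$, the relevant documents are at
positions 3, 4, 7 and 8. Following the recursion of the footnote from
the last relevant document backward, and assuming that each step
contributes the binomial coefficient described there, we obtain
$$
\kappa(x)=\binom{7}{4}+\binom{6}{3}+\binom{3}{2}+\binom{2}{1}
=35+20+3+2=60,
$$
so that $k(x)=1-2\cdot60/(210-1)=89/209$.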
 

\section{Test}\label{sec:test}


Our aim is to develop a sorted list of new additions to RePEc for
editors of NEP. In NEP each subject issue has a code nep-{\em xxx}
where {\em xxx} is a sequence of three letters. A special report
nep-all contains the list of all new additions to RePEc. Thus sorting
the list of new additions is like sorting the nep-all report issue.
Each time a new nep-all issue is produced, it is sorted for the use of
the editors. For each subject report, the result of the sorting is
different, of course\footnote{In NEP documentation the term
``pre-sorting'' rather than ``sorting'' is used.  In NEP ``sorting''
is a different process from pre-sorting.  Sorting occurs when a
subject issue is being produced.  After an editor has discarded
non-relevant documents, (s)he may decide to sort the documents in an
order such as to put the most interesting document right to the top of
the issue. This is an optional step of the process of creating a new
subject issue.}.  To sort the nep-all for a subject report we look at
the past subject report issue data.  For each subject report, we have
two sets. The first is the set of documents that have been included in
the subject report.  The second is the set of documents that have not
been included in the report. The membership of the latter set is
somewhat more difficult to determine than the former. Sometimes, an
editor may not have looked at an entire list of new documents.  This
can happen, for example, if the editorship of a report is vacant.
Therefore we restrict membership of the second set to the documents
in those nep-all issues from which at least one document has been
included in the subject report. Thus documents in nep-all issues
from which no document appeared in the subject issue are ignored.
While this may discard some negative learning examples, there
are still plenty of negative examples left, because the generality of
subject reports is small.

We treat the occurrence of documents in different reports as
independent events. Data in \citet{barrueco03organizing} suggests that
this is not the case. However, in a practical application, it would be
cumbersome to rerank a nep-all report for a certain subject whenever it
becomes known that the editor of another subject report has included
a document in her report issue. Editors make their decisions
independently of each other, but typically within a short time frame
after the nep-all report has been issued. Thus, by ignoring
co-occurrence of documents altogether, we
are working under realistic operating conditions.

We keep feature extraction very simple. From each document, we use the
author names, title, abstract, classification codes, and the serial in
which the paper has appeared. We concatenate the resulting strings.  We
remove all punctuation, convert to lowercase and collapse
whitespace. Each whitespace-separated component of the resulting
string is a feature.
%%%
%%% start of change requested by R1, fix #15
%%%
% We weigh feature by the straight frequency counts first. Then we norm
% the features such that the Euclidean norm of the feature vector is
% unit. To form the ranking,
  For each document, we count the occurrence of the feature $f$ as
  $t_f$. The weight of the feature $f$ in the document, $w_f$ is then
  given as
  $$
  w_f=\frac{t_f}{\sum_{\forall f} t_f^2}.
  $$
  Note that it is not necessary to take document frequency into
  account here, because to form the ranking, 
%%%
%%% end of change requested by R1
%%%
we use support vector machines (SVM). This technique goes back to
\citet{servap95nature}. It is now a widely used text classification
technique. \citet{paugin04mapping} provide one example in a similar
context to ours. The svm\_light software of \citet{torjoa99making}
runs all the calculations. According to \citet{tomkri05developing} the
median nep-all report has 300 documents.
%%%
%%% start of change requested by R1, fix #16
%%%
% Thus each run, we use 300 documents for testing,
% the rest for training. We conduct 100 runs. 
  Therefore we set aside 300 randomly selected documents for testing.
  The rest we use for training the SVM. We 
  conduct at least 10 runs for
  each report. For some reports, where the generality is low, a
  randomly selected testing dataset may contain no relevant document.
  In that case, we repeat the run until at least one among the 300
  randomly selected testing documents is relevant.
%%%
%%% end of change requested by R1, fix #17
%%%
The results are so bulky that we have confined them to
Appendix~\ref{sec:results}.
%%
%% start of change requested by R1 fix #18
%%
% While the Ponori measure has desirable theoretical properties, there is
% a bad problem in tests when the set of outcomes to be ordered is large,
% say more that 100. 
  The Aselt measure gives a reasonable range of results, and shows, by
  its numerical values, that the performance of the SVM is really quite
  good.  But we have rejected the measure on theoretical grounds. The
  same holds for the Lofop measure. We still include them in 
  Table \ref{table:result} for the sake of completeness. 

While the Ponori measure has desirable theoretical properties, there is a
serious problem in tests where the set of outcomes to be ordered is large,
say more than 100.
%%
%% end of change requested by R1
%%
In that case, if $y>1$, any outcome that lifts the last relevant
document to, say, before the last third or last quarter of
all the documents will get a measure that is close to 1, or even equal
to 1 after rounding.  As we increase the value of $y$, we
converge toward a situation where the measure is $100.00\%$ as soon
as the last document is not relevant, and a negative number if it
is. This clearly is not what we want.
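
A rough bound makes this concrete. If the last relevant document of
$x$ is at position $p$, then $(y-1)\,\omega(x,y)\le y^{p}-1$, and the
definition of the Ponori measure gives
$$
1-o(x,y)\le\frac{n\,(y^{p}-y^{r})}{(y^n-1)\,r-n\,(y^r-1)}.
$$
With the purely hypothetical values $y=2$, $n=300$, $r=10$ and
$p=200$, which are chosen for illustration and are not taken from our
data, the right-hand side is approximately $30\cdot2^{-100}$, or
about $2\times10^{-29}$. Such a difference from $100\%$ disappears
in any rounded table.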

A similar problem affects the Copnori measure. Recall that the Copnori
measure gives each outcome its own integer.  Outcomes where a
relevant document appears late are penalized very heavily. 
%%
%% start of change requested by R1 #14 
%%
% Therefore, as soon as the retrieval system is able lift all relevant
% documents from low positions, it does very well.  Even if the last
% relevant document is is the middle, the measure shows a result that is
% close to 100%.
  Therefore, as soon as the retrieval system is able to lift all
  relevant documents from low positions, it does very well. Even if the
  last relevant document is in the middle, the measure shows a result
  that is close to 100\%.
%%
%% end of change requested by R1
%%
In fact, it does so even more than the Ponori measure. The alert
reader will note that a
number of maxima in the table are 100\% for the
Copnori measure but less than 100\% for the others. In the theoretical
framework that we use, 100\% is the value reserved for the optimal
outcome $x\best$. It is therefore impossible for the same outcome to
be evaluated at 100\% by one measure and at less than 100\% by another
measure. The explanation for this apparent error in the table is
rounding. The computer reports 100\% when the real value is a tiny bit
below 100\% but can no longer be distinguished from it.
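
The size of this effect can be bounded as follows. By the recursion
used to compute $\kappa(x)$, the minimum count for an outcome whose
last relevant document is at position $p+1$ is $\binom{p}{r}$. An
outcome $x$ whose last relevant document is at position $p$ therefore
has $\kappa(x)\le\binom{p}{r}-1$, and
$$
1-k(x)\le\frac{2\,\bigl(\binom{p}{r}-1\bigr)}{\binom{n}{r}-1}
\le2\,\Bigl(\frac{p}{n}\Bigr)^{r}.
$$
With the hypothetical values $r=10$ and $p=n/2$, chosen only for
illustration, this gives $1-k(x)\le2^{-9}$, that is, a measure above
$99.8\%$.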

The Nosel measure has, in a sense, the opposite problem. It only looks at
the position of the last relevant document; it takes no account of the
ability of the information retrieval system to put relevant documents
to the front. However, as we noted in the introduction, this feature is
also important, because it allows the editor to spend extra effort on
the documents, reading more than the title.

Thus we can say that in principle, the Nosel measure
``underestimates'' the success of the system, whereas the Copnori
measure ``overestimates'' the success of the system. However, this
general statement only holds when the system is a success. When the
result is poor, the Copnori measure exaggerates the bad performance.  This is
simply the reverse of the fact that the Copnori measure gives very
good results for a large range of outcomes near the top. Since both the
Copnori and Nosel measures have an expected value of zero for random
results, the Copnori measure compensates with lower values at the tail
end. Related to
this, the variance of the Copnori measure is higher than the variance
of other measures. However, when the results are quite good, the
Nosel measure generally has the higher variance.  This comes as no surprise
since it only looks at one single element of the outcome vector, the
one that comes lowest.


\section{Conclusions}\label{sec:conclusions}

We think of precision and recall as set-based measures. Indeed, they
are based on the idea that the total set of documents contains two
complementary subsets, the subset of relevant documents and the subset
of non-relevant documents. A query creates two other complementary
subsets, the subset of retrieved and the subset of non-retrieved
documents.

In this paper, we discuss a different class of information retrieval
performance measures we call vector-based measures.  Vector-based
measures start with a different way of thinking about what is
happening at query time. We think of the set of documents as a
vector. Indeed, from a computational point of view, it is a vector,
because all documents are in some order in the information retrieval
system. The task of the information retrieval system is to sort all
the documents that are relevant to the beginning of the vector, and
sort the non-relevant documents to the end of the vector.

From our setup we have a theoretical ordering of outcomes we call the
natural order. Therefore to evaluate the information system, we prefer
measures that respect the natural order.
%%
%% start of change requested by R1 fix #19
%%
% The Nosel measure only weakly enforces the natural order. It
% undervalues the performance of the system. The Copnori measure strictly
% imposes the natural order. But since it attaches a constant penalty
% with each step, and therefore does not properly take account of the
% extra effort to examine the next document. Thus it appears best to
% take a linear combination of the two measures $\nu\,l+(1-\nu)\,k$.  As
% long as $0<\nu\le1$, the measure strictly respects the natural
% order. We suggest $\nu=10\%$, but other values are just as
% acceptable. All they change is the numeric value of the measure. They
% have no impact on the actual ordering of outcomes.
%
It turns out that the Nosel and Copnori measures complement each other to
provide a reasonable approximation of what the editors should want. 

The Nosel measure only weakly enforces the natural order. A very large
number of outcomes receive an identical Nosel measure despite the fact
that they arrive at different positions in the natural order.  From the
point of view of the editors, the Nosel measure only penalizes an
outcome when the editor has to look at an additional document.  It
does not take into account that, for a given value of the position of
the last relevant document, the editor may abandon the search for
relevant documents before the last relevant document is reached, and
may, as a consequence of this action, have a varying number of
documents lost in different outcomes that receive the same Nosel
penalty.

Such differentiation is provided by the Copnori measure. It strictly
enforces the natural order. Each outcome is assigned a different
number and the next worse outcome has a constant additional penalty of
one.  Moving the last relevant document one position further imposes no special
additional penalty. But from the point of view of the editors,
examining a new document does carry something additional to just one
step down in the natural order. From their point of view it has to be
a special penalty. Such an extra penalty is provided by the Nosel
measure.
%%
%% end of change requested by R1 
%%
Therefore it appears best to take a linear combination of the two
measures, such as $\nu\,l(x)+(1-\nu)\,k(x)$.  As long as $0<\nu<1$,
the measure strictly respects the natural order, and gives an extra
penalty at each extra document that has to be examined in order to find
all the relevant documents.  We suggest $\nu=10\%$, but other values
are just as acceptable. All they change is the numeric value of the
measure. They have no impact on the actual ordering of outcomes.
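
As an illustration, take $r=2$, $n=5$ and $\nu=10\%$, with the values
of $l(x)$ and $k(x)$ from the tables for this case. Three consecutive
outcomes in the natural order receive
$$
\begin{array}{ll}
(0,1,0,1,0):&0.1\cdot0+0.9\cdot\frac{1}{9}=0.10,\\[2pt]
(0,0,1,1,0):&0.1\cdot0+0.9\cdot\bigl(-\frac{1}{9}\bigr)=-0.10,\\[2pt]
(1,0,0,0,1):&0.1\cdot\bigl(-\frac{1}{2}\bigr)
+0.9\cdot\bigl(-\frac{1}{3}\bigr)=-0.35.
\end{array}
$$
The first step down, which leaves the search length unchanged, costs
$0.20$. The second, which forces the editor to examine one more
document, costs $0.25$.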


\appendix


\section{Test results}\label{sec:results}

In this table, we report, for a selection of NEP reports, the summary
statistics for each measure. $a$ is the Aselt measure, $m$ the Lofop
measure, $l$ the Nosel measure, $k$ the Copnori measure, and $o$ the
Ponori measure. Reports are ordered by generality. To reduce the size
of the table, we omitted three out of four subject reports. For each
report and each measure, we see the mean in the line ``mean'', the
minimum in the line ``min'', the maximum in the line ``max'', and the
standard deviation in the line ``dev''.


{\small
\begin{center}
\setlength{\parskip}{0pt}
\begin{longtable}{@{}rrrrrrr}\label{table:result}
report\phantom{ mean}&$a$&$m$&$l$&$k$&$o(1.01)$\\
{\tt nep-mac} mean&80.21&84.11&21.73&93.19&84.74\\
          min&67.38&65.58&\phantom{0}0.85&39.17&67.29\\
          max&90.24&93.66&50.55&99.99&94.41\\
          dev&\phantom{0}7.69&\phantom{0}8.81&17.43&19.00&\phantom{0}9.07\\
%{\tt nep-fin} mean&77.26&82.95&23.93&96.45&83.83\\
%          min&65.23&76.13&\phantom{0}4.04&76.30&76.00\\
%          max&89.94&94.34&66.04&99.99&95.84\\
%          dev&\phantom{0}7.09&\phantom{0}6.18&18.68&\phantom{0}7.23&\phantom{0}6.51\\
%{\tt nep-bec} mean&55.87&65.48&25.19&77.41&68.58\\
%          min&33.68&17.31&$-$7.95&$-$84.68&29.27\\
%          max&72.40&79.67&49.04&99.97&81.51\\
%          dev&12.54&18.66&17.03&57.25&16.25\\
%{\tt nep-pbe} mean&82.19&88.75&51.53&99.60&90.63\\
%          min&71.29&79.29&17.31&96.15&80.46\\
%          max&89.49&94.23&80.37&99.99&96.13\\
%          dev&\phantom{0}6.83&\phantom{0}5.06&18.74&\phantom{0}1.21&\phantom{0}5.02\\
{\tt nep-lab} mean&81.08&86.16&34.23&98.45&86.96\\
          min&73.16&73.07&\phantom{0}6.96&86.68&71.20\\
          max&88.88&93.89&79.31&99.99&95.79\\
          dev&\phantom{0}5.70&\phantom{0}6.40&20.98&\phantom{0}4.16&\phantom{0}7.49\\
%{\tt nep-geo} mean&92.48&95.01&71.12&97.53&95.88\\
%          min&82.89&83.29&\phantom{0}6.04&75.35&82.90\\
%          max&98.67&99.31&95.88&100.00&99.57\\
%          dev&\phantom{0}5.09&\phantom{0}4.72&26.78&\phantom{0}7.79&\phantom{0}4.92\\
%{\tt nep-sea} mean&88.25&89.46&59.28&78.54&91.45\\
%          min&79.49&62.33&$-$6.29&$-$80.46&73.67\\
%          max&98.15&99.02&91.36&99.99&99.38\\
%          dev&\phantom{0}8.05&11.47&34.78&56.87&\phantom{0}9.19\\
%{\tt nep-dev} mean&81.02&85.30&41.39&87.15&86.49\\
%          min&66.10&71.56&$-$3.18&$-$11.08&72.86\\
%          max&92.31&95.49&83.91&99.99&97.06\\
%          dev&\phantom{0}9.16&\phantom{0}9.13&28.79&34.83&\phantom{0}9.44\\
{\tt nep-ure} mean&89.62&93.59&64.09&99.86&94.79\\
          min&82.63&88.15&36.99&98.71&88.97\\
          max&97.51&98.72&96.53&99.99&99.25\\
          dev&\phantom{0}5.31&\phantom{0}3.70&18.88&\phantom{0}0.40&\phantom{0}3.50\\
%{\tt nep-mon} mean&92.06&94.50&55.50&99.75&94.95\\
%          min&85.12&88.48&19.27&98.60&88.39\\
%          max&98.65&99.29&95.12&100.00&99.56\\
%          dev&\phantom{0}4.39&\phantom{0}3.65&28.07&\phantom{0}0.46&\phantom{0}3.97\\
%{\tt nep-ecm} mean&96.16&97.83&84.30&99.99&98.46\\
%          min&92.33&95.51&54.72&99.99&96.21\\
%          max&98.68&99.32&97.37&100.00&99.59\\
%          dev&\phantom{0}2.18&\phantom{0}1.37&13.84&\phantom{0}5.80e$-$05&\phantom{0}1.15\\
{\tt nep-eec} mean&74.72&79.83&26.12&83.03&80.68\\
          min&55.68&57.14&$-$1.38&\phantom{0}8.50&57.12\\
          max&84.59&90.94&60.13&99.99&93.03\\
          dev&\phantom{0}9.94&11.30&21.65&30.53&12.08\\
%{\tt nep-mic} mean&61.71&72.02&25.41&96.75&73.35\\
%          min&45.92&55.02&10&87.55&53.66\\
%          max&82.23&89.54&54.33&99.99&92.00\\
%          dev&10.04&\phantom{0}9.68&16.51&\phantom{0}4.43&11.08\\
%{\tt nep-com} mean&78.69&84.79&42.86&98.58&85.66\\
%          min&64.68&69.72&14.29&91.85&67.53\\
%          max&96.44&98.04&80.89&99.99&98.65\\
%          dev&10.99&\phantom{0}8.88&24.23&\phantom{0}2.69&\phantom{0}9.45\\
{\tt nep-tra} mean&92.52&93.94&65.77&93.28&94.22\\
          min&75.76&73.46&\phantom{0}0.81&34.45&72.68\\
          max&99.79&99.89&99.24&100.00&99.93\\
          dev&\phantom{0}7.70&\phantom{0}8.12&36.01&20.67&\phantom{0}8.49\\
%{\tt nep-int} mean&95.67&97.25&79.02&99.71&97.69\\
%          min&85.65&88.55&28.31&97.10&87.95\\
%          max&98.50&99.23&97.96&99.99&99.56\\
%          dev&\phantom{0}3.73&\phantom{0}3.15&20.30&\phantom{0}0.91&\phantom{0}3.48\\
{\tt nep-ino} mean&88.04&91.31&57.95&91.44&92.40\\
          min&66.90&66.70&$-$1.16&14.74&67.77\\
          max&98.34&99.12&93.36&99.99&99.46\\
          dev&\phantom{0}9.35&\phantom{0}9.47&30.28&26.95&\phantom{0}9.37\\
%{\tt nep-his} mean&68.34&75.18&22.43&90.64&75.37\\
%          min&54.97&61.56&\phantom{0}7.58&75.45&60.00\\
%          max&87.34&92.07&49.71&99.99&93.42\\
%          dev&\phantom{0}9.86&\phantom{0}9.47&15.96&\phantom{0}8.82&10.59\\
%{\tt nep-env} mean&91.26&94.07&62.27&99.27&94.68\\
%          min&80.26&84.78&19.08&95.86&84.87\\
%          max&98.73&99.33&96.60&99.99&99.59\\
%          dev&\phantom{0}5.38&\phantom{0}4.47&27.05&\phantom{0}1.48&\phantom{0}4.85\\
%{\tt nep-acc} mean&73.97&80.13&35.02&89.41&80.80\\
%          min&53.58&61.06&\phantom{0}6.68&61.67&59.28\\
%          max&89.16&94.03&80.14&99.99&95.94\\
%          dev&11.13&11.30&26.64&13.26&13.06\\
{\tt nep-fmk} mean&90.45&94.29&71.04&99.96&95.58\\
          min&79.51&87.71&32.19&99.62&90.48\\
          max&97.29&98.60&94.71&99.99&99.15\\
          dev&\phantom{0}5.93&\phantom{0}3.76&17.39&\phantom{0}0.11&\phantom{0}3.25\\
%{\tt nep-ets} mean&91.82&92.99&75.30&89.40&93.22\\
%          min&72.11&64.15&$-$4.14&$-$2.62&61.05\\
%          max&99.57&99.78&99.62&100.00&99.87\\
%          dev&10.48&11.66&34.54&32.35&12.66\\
%{\tt nep-pol} mean&82.23&86.14&52.24&83.44&87.00\\
%          min&68.25&64.67&$-$3.47&$-$3.12&64.43\\
%          max&96.88&98.40&96.74&99.99&99.07\\
%          dev&10.78&11.12&35.18&32.35&11.97\\
%{\tt nep-cfn} mean&81.74&86.39&48.50&89.30&87.12\\
%          min&55.02&59.81&\phantom{0}3.78&45.72&58.11\\
%          max&94.26&96.98&90.36&99.99&98.14\\
%          dev&12.03&11.95&28.69&20.35&13.28\\
{\tt nep-ifn} mean&92.14&94.13&57.22&98.07&94.45\\
          min&83.91&86.67&\phantom{0}5.45&81.24&85.69\\
          max&97.69&98.78&92.02&100.00&99.21\\
          dev&\phantom{0}5.07&\phantom{0}4.79&31.32&\phantom{0}5.91&\phantom{0}5.29\\
%{\tt nep-reg} mean&67.18&76.94&36.85&87.05&79.55\\
%          min&28.70&36.79&$-$3.06&$-$6.86&40.02\\
%          max&86.96&92.85&74.96&99.99&95.21\\
%          dev&15.98&15.50&20.01&33.35&15.31\\
%{\tt nep-hpe} mean&87.95&91.39&68.45&89.24&93.00\\
%          min&78.48&69.96&$-$3.55&$-$6.67&69.51\\
%          max&96.20&98.00&91.05&99.99&98.78\\
%          dev&\phantom{0}6.40&\phantom{0}8.23&28.59&33.70&\phantom{0}8.67\\
%{\tt nep-cmp} mean&85.08&90.75&67.50&98.31&92.49\\
%          min&70.16&78.37&28&84.36&79.02\\
%          max&96.09&97.96&92.97&99.99&98.77\\
%          dev&10.01&\phantom{0}6.98&23.32&\phantom{0}4.91&\phantom{0}6.64\\
%{\tt nep-ene} mean&89.30&93.41&74.81&99.50&94.71\\
%          min&69.61&80.67&36.42&97.82&83.84\\
%          max&99.45&99.72&99.59&99.99&99.84\\
%          dev&\phantom{0}9.60&\phantom{0}6.32&22.45&\phantom{0}0.82&\phantom{0}5.57\\
%{\tt nep-edu} mean&85.21&89.84&53.97&95.91&91.07\\
%          min&75.67&78.04&\phantom{0}4.93&64.72&78.40\\
%          max&97.09&98.50&95.79&99.99&99.10\\
%          dev&\phantom{0}7.26&\phantom{0}6.26&28.57&11.04&\phantom{0}6.46\\
{\tt nep-cba} mean&87.59&92.49&60.88&99.92&94.05\\
          min&78.92&86.56&40.33&99.50&88.66\\
          max&95.28&97.55&93.62&99.99&98.51\\
          dev&\phantom{0}5.34&\phantom{0}3.52&17.13&\phantom{0}0.15&\phantom{0}3.18\\
%{\tt nep-dge} mean&89.35&93.53&65.70&99.68&94.89\\
%          min&77.60&83.50&25.63&96.89&83.87\\
%          max&93.69&96.53&84.97&99.99&97.80\\
%          dev&\phantom{0}5.04&\phantom{0}3.96&16.32&\phantom{0}0.97&\phantom{0}4.15\\
%{\tt nep-ent} mean&88.87&93.26&66.22&99.98&94.67\\
%          min&78.06&85.78&44.25&99.84&87.74\\
%          max&95.79&97.75&90.75&99.99&98.57\\
%          dev&\phantom{0}6.32&\phantom{0}4.18&17.10&\phantom{0}0.04&\phantom{0}3.73\\
%{\tt nep-afr} mean&67.01&71.43&39.11&66.47&74.01\\
%          min&22.28&21.86&$-$11.72&$-$84.41&30.75\\
%          max&97.63&98.79&97.88&99.99&99.31\\
%          dev&23.31&25.79&37.97&57.53&23.70\\
%{\tt nep-hea} mean&97.97&98.88&93.36&99.93&99.23\\
%          min&92.48&95.38&62.82&99.33&96.31\\
%          max&99.51&99.75&99.33&100.00&99.85\\
%          dev&\phantom{0}2.17&\phantom{0}1.34&11.38&\phantom{0}0.21&\phantom{0}1.08\\
%{\tt nep-agr} mean&85.80&88.62&60.99&86.04&89.53\\
%          min&55.05&49.16&$-$5.56&$-$17.96&47.92\\
%          max&95.72&97.76&91.27&99.99&98.62\\
%          dev&12.76&14.94&32.22&37.16&15.81\\
{\tt nep-tid} mean&76.93&84.14&47.17&90.77&86.33\\
          min&53.25&69.58&$-$1.56&10.17&71.03\\
          max&93.39&96.49&87.16&99.99&97.77\\
          dev&11.42&\phantom{0}8.95&24.42&28.32&\phantom{0}8.64\\
%{\tt nep-rmg} mean&77.38&83.83&46.47&88.69&85.79\\
%          min&59.34&60.90&$-$2.54&$-$6.79&64.97\\
%          max&93.54&96.53&81.28&99.99&97.73\\
%          dev&11.08&10.97&26.33&33.59&10.93\\
%{\tt nep-cbe} mean&88.97&93.33&76.06&99.70&94.77\\
%          min&68.92&78.94&39.15&97.48&80.85\\
%          max&96.11&97.98&95.93&99.99&98.79\\
%         dev&\phantom{0}8.24&\phantom{0}5.92&21.13&\phantom{0}0.78&\phantom{0}5.77\\
%{\tt nep-law} mean&88.20&90.82&57.39&89.41&91.50\\
%          min&74.64&75.24&$-$0.77&17.55&75.74\\
%          max&98.11&99.02&95.37&99.99&99.41\\
%          dev&\phantom{0}8.17&\phantom{0}8.45&31.92&26.11&\phantom{0}8.99\\
{\tt nep-gth} mean&94.48&96.76&79.96&99.99&97.56\\
          min&86.33&91.27&49.47&99.96&92.56\\
          max&99.55&99.76&98.12&100.00&99.85\\
          dev&\phantom{0}3.99&\phantom{0}2.62&17.33&\phantom{0}0.01&\phantom{0}2.34\\
%{\tt nep-exp} mean&99.66&99.83&99.48&99.99&99.90\\
%          min&99.14&99.56&98.73&99.99&99.74\\
%          max&100.00&100.00&100.00&100.00&100.00\\
%          dev&\phantom{0}0.33&\phantom{0}0.17&\phantom{0}0.38&\phantom{0}3.24e$-$06&\phantom{0}0.09\\
%{\tt nep-ltv} mean&76.20&83.16&45.58&89.39&85.10\\
%          min&61.72&69.44&$-$0.93&16.09&71.10\\
%          max&92.06&95.80&86.48&99.99&97.41\\
%          dev&\phantom{0}9.87&\phantom{0}9.07&31.40&25.99&\phantom{0}9.52\\
%{\tt nep-dcm} mean&96.20&97.76&89.05&99.81&98.33\\
%          min&85.13&91.26&55.55&98.77&93.15\\
%          max&100.00&100.00&100.00&100.00&100.00\\
%          dev&\phantom{0}5.39&\phantom{0}3.31&15.62&\phantom{0}0.41&\phantom{0}2.66\\
{\tt nep-cwa} mean&65.75&68.63&25.00&54.36&70.57\\
          min&23.64&27.54&$-$13.11&$-$81.88&25.32\\
          max&93.19&96.38&88.88&99.99&97.73\\
          dev&18.24&21.93&30.73&61.38&20.82\\
%{\tt nep-net} mean&80.68&87.40&62.42&95.01&89.17\\
%          min&50.84&64.45&19.64&70.70&65.74\\
%          max&98.75&99.36&98.41&99.99&99.63\\
%          dev&14.83&10.94&27.53&10.05&10.86\\
%{\tt nep-eff} mean&88.16&92.58&78.04&96.43&94.06\\
%          min&61.84&78.19&12.15&72.38&77.44\\
%          max&98.65&99.31&97.75&99.99&99.60\\
%          dev&11.79&\phantom{0}7.89&26.49&\phantom{0}8.60&\phantom{0}7.52\\
%{\tt nep-pub} mean&65.02&71.70&34.40&76.83&72.19\\
%          min&\phantom{0}9.01&20.01&$-$1.38&\phantom{0}8.72&15.49\\
%          max&94.83&97.24&85.95&99.99&98.25\\
%          dev&24.89&23.32&33.23&36.36&25.48\\
%{\tt nep-pke} mean&57.95&67.55&25.88&77.64&68.59\\
%          min&38.59&50.60&\phantom{0}0.59&26.56&46.34\\
%          max&77.27&86.68&67.62&99.99&90.11\\
%          dev&11.80&13.47&23.64&28.00&16.24\\
{\tt nep-evo} mean&89.92&93.97&71.19&99.49&95.35\\
          min&78.71&85.75&43.05&95.08&87.29\\
          max&98.10&99.03&97.15&99.99&99.43\\
          dev&\phantom{0}5.94&\phantom{0}4.01&16.45&\phantom{0}1.54&\phantom{0}3.65\\
%{\tt nep-ind} mean&68.11&76.09&34.91&92.02&77.02\\
%          min&43.53&47.80&\phantom{0}4.18&59.82&44.62\\
%          max&89.96&94.68&84.13&99.99&96.69\\
%          dev&14.42&14.36&26.19&13.70&16.15\\
%{\tt nep-lam} mean&60.64&64.96&28.68&58.10&68.17\\
%          min&0&0&$-$13.11&$-$81.88&0\\
%          max&88.69&93.11&59.54&99.93&94.55\\
%          dev&24.62&29.07&27.24&59.12&28.51\\
%{\tt nep-cis} mean&97.08&98.30&92.25&99.50&98.75\\
%          min&88.06&92.80&64.20&95.98&94.44\\
%          max&100.00&100.00&100.00&100.00&100.00\\
%          dev&\phantom{0}4.80&\phantom{0}2.88&14.51&\phantom{0}1.25&\phantom{0}2.22\\
{\tt nep-cdm} mean&69.30&74.01&35.95&80.52&75.88\\
          min&20.29&$-$4.61&$-$14.28&$-$74.47&$-$0.08\\
          max&95.25&97.47&87.38&99.99&98.41\\
          dev&21.32&29.15&26.40&54.54&28.23\\
%{\tt nep-res} mean&89.09&92.42&82.71&91.14&93.12\\
%          min&47.02&53.36&\phantom{0}2.13&21.40&48.59\\
%          max&97.46&98.69&95.35&99.99&99.23\\
%          dev&14.91&13.76&28.40&24.64&15.65\\
%{\tt nep-cul} mean&90.02&93.87&83.55&96.46&95.12\\
%          min&65.65&78.45&39.84&81.59&81.78\\
%          max&100.00&100.00&100.00&100.00&100.00\\
%          dev&14.09&\phantom{0}8.92&23.75&\phantom{0}6.86&\phantom{0}7.48\\
%{\tt nep-ias} mean&78.09&80.20&53.60&68.62&79.55\\
%          min&52.80&49.79&$-$7.68&$-$10.92&42.59\\
%          max&100.00&100.00&100.00&100.00&100.00\\
%          dev&20.07&21.05&46.36&45.36&23.56\\
{\tt nep-mfd} mean&74.58&79.43&32.29&79.67&80.05\\
          min&57.55&65.30&$-$4.19&$-$21.39&67.39\\
          max&92.74&95.96&82.27&99.99&97.24\\
          dev&11.08&10.66&28.94&36.99&11.03\\
%{\tt nep-spo} mean&84.26&84.56&77.21&79.75&84.14\\
%          min&0&0&$-$4.72&0&0\\
%          max&100.00&100.00&100.00&100.00&100.00\\
%          dev&32.52&32.98&42.57&41.43&33.86\\
\end{longtable}
\end{center}
}


\bibliography{sendai}


% % , for the reasons that we have outlined above.  The position of all
% % the documents should matter in the assessment of performance.
% % 
% % For another illustration, consider a response from a web search
% % engine.  Most people never go to the end of the list of responses from
% % a search engine. What matters to them is that the first few results
% % are relevant. As far as they are concerned, the idea of a set of
% % retrieved documents that they can examine is pointless. What counts is
% % the position of these responses. Responses beyond the 10th results page
% % have little chance of being used. Beyond a very small number of pages,
% % the user will not look at any result. Thus, the result set may be
% % very large; in fact, we could think of it as an ordered list
% % of the pages of the Web.  But the ordering is crucial. The
% % set-theoretic measures, by discarding the ordering, leave out
% % valuable information that can be used to get a different idea of
% % information retrieval performance.
% % 
% % There is one common feature between these illustrations of the interest
% % in vector-based measures. Vector-based measures are useful when recall
% % is of little or no importance. In the NEP example recall is not
% % important because a conscientious editor can always go through every
% % document. In the search engine example, recall is so difficult to
% % estimate, and potentially so large, that with the best of all
% % intentions, it does not make much sense to examine it any closer.
% % 
% % 
% % An example may illustrate the conceptual change. Let numbers denote
% % relevant documents and letters denote non-relevant documents.  In a
% % set-based approach, let the information retrieval system retrieve
% % $\{1,a,2,b,3\}$. Let it not retrieve $\{c,d,4,e\}$.  Precision is $3/5$
% % and recall is $3/4$. In a vector-based view, the same result could be
% % represented as $[1,a,2,b,3,c,d,e,4]$ or as $[1,a,2,b,3,4,c,d,e]$. In
% % the former case a user will have to go to document \#9 to find the
% % last relevant document; in the latter, she will find it at position
% % 6. Clearly, the latter is preferable to the former. But precision
% % and recall do not convey that view.


\end{document}

%  LocalWords:  Krichel Brookville NY Mishchenko Sobolev Koptyuga Riijsbergen's
% LocalWords:  NEP RePEc Nisa Bakkalbasi Losee metadata pre Swets Brookes ccrr
% LocalWords:  jonswe berbroo roblos Lofop cr wilcoo nep xxx barrueco rerank co
% LocalWords:  servap paugin svm torjoa tomkri min dev rrrrrrr mac bec pbe geo
% LocalWords:  ure mon ecm eec mic com tra int ino env acc fmk ets cfn ifn reg
% LocalWords:  hpe cmp ene edu cba dge ent afr hea agr tid rmg cbe gth exp ltv
% LocalWords:  dcm cwa eff pke evo ind cis cdm cul ias mfd spo


%%%> Ms. Ref. No.:  IPM2414
%%%> Title: Information retrieval performance measures for a current
%%%>  awareness report composition aid
%%%> Information Processing & Management
%%%> 
%%%> Dear Dr. Krichel,
%%%> 
%%%> I am happy to inform you that your paper ms, "Information retrieval
%%%> performance measures for a current awareness report composition aid"
%%%> has been provisionally accepted for publication.  However, the
%%%> referee recommended minor revisions. We will consider the paper for
%%%> final acceptance after the revision.
%%%
%%%> For your guidance, reviewers' comments are appended below.
%%%> 
%%%> 
%%%> Reviewer #1: Page3 Para4 Line1: "included" or "relevant"
%%%
%%%  fix #2
%%%
%%%> Page3 Para5: not clear what the sentence "Thus, just looking at ?.."
%%%>  means
%%%
%%%  fix #3
%%%
%%%> Page5 Para2: "the measure should be 0" ? expected value of the
%%%>  measure should be 0"
%%%
%%%  fix #6
%%%
%%%> Page7 Para3: not clear what the sentence "thus, all documents?."
%%%> Means
%%%
%%%  fix #8, reformulation
%%%
%%%> Page8 Para2: Rephrase the sentence "According to?.."
%%%
%%%  fix #9, removed words that should not have been there
%%%
%%%> Page9 Para1: divided by "n" or "r"?
%%%
%%%  No. If we look at the best outcome, it has the relevant
%%%  documents at the very top. So we count 1,2,...,r, and
%%%  need to find the average. The average of consecutive
%%%  natural numbers is the sum of the first and the last, divided by 2. 
%%%  For the worst outcome, a similar reasoning applies. 
%%%
%%%> Page10 : move Table 2 below the title 4.3
%%%
%%%  done
%%%
%%%> Page11: move Table 3 below the title 4.4
%%%
%%%  done
%%%
%%%> Page12 Footmark3 : Not sure if the geographical names help
%%%> avoid this confusion!
%%%
%%%  The other referee raised this too, so I have prepared this version
%%%  without the geographical names.
%%%
%%%> Page13: move Table4 below the title 4.5
%%%
%%%  done
%%%
%%%> Page14: better if explain how and why the equation for Ob's 
%%%> measure was arrived at
%%%  
%%%  fix #12
%%%  
%%%> Page15: better if explain how and why the equation for Kolyma
%%%> measure was arrive at
%%%  
%%%  fix #13
%%%  
%%%> Page17: better if an equation is added to show how the
%%%> features are weighted/normalised
%%%  
%%%  oh well, I would not have thought this was necessary but I
%%%  have added it as fix #15
%%%
%%%> Page17 Para1: not clear the selection of documents for the test
%%%> and train sets in each iteration. Are they different in each run?
%%%
%%%  They are.
%%%
%%%> Describe.
%%%
%%%  I have added wording there and added one technical detail that  
%%%  I previously thought was not worth including. Since the editor
%%%  does not seem to be concerned about containing the length of 
%%%  the paper, I may as well write more. fix #16
%%%
%%%> Page18 Last Para in Conclusions: How/Why do you say linear
%%%> combination of two measures are the best. Where are the supporting
%%%> evidence/data? Why/How did you arrive at vl+(i-v)k?
%%%
%%%  ok. It's not conventional to have the main conclusion in the
%%%  actual conclusion, and here it is buried in verbiage. fix #19
%%%
%%%> Page18 results table: Label the leftmost column to indicate that it
%%%> is about the "NEP report"
%%%>
%%%>
%%%> Some Typos/English errors:
%%%> Page3 Line1 : "such as" -> "in order to"
%%%  
%%%  fix #1 
%%%  
%%%> Page3 one before the last line: "have the" repeated
%%%  
%%%  fix #4
%%%  
%%%> Page4 Line3 from the bottom: "are" is repeated
%%%  
%%%  fix #5
%%%  
%%%> Page4 Line2 from the bottom "means"
%%%  
%%%  fix #5
%%%  
%%%> Page6 Para1 Line5: "or" repeated
%%%  
%%%  fix #7 
%%%  
%%%> Page8 Footmark Line1: "to" is repeated
%%%  
%%%  fix #10
%%%  
%%%> Page17 Para2 Line3: "that" -> "then"
%%%  
%%%  fix #11
%%%  
%%%> Page17 Para3 Line4: "able lift" -> "able to lift"
%%%  
%%%  fix #14
%%%  
%%%> Page17 Para3 Line5: "is" repeated
%%%  
%%%  fix #14
%%%  
%%%> Reviewer #2: This is an interesting manuscript.  I recommend it be
%%%> published after several issues have been addressed.
%%%>
%%%> There are no citations in sections 2 and 3.  The author needs to
%%%> address how some of the points here are addressed by others.  In
%%%> particular, the "requirements" in section 2 should be linked to
%%%> other lists of requirements for retrieval measures.  Rigorous metric
%%%> based articles include Van Rijsbergen, JDoc, 30 (4) 365-373 and
%%%> Shaw, JASIS 37 (5) 346-348.
%%%
%%%  The work in both of these references is based on precision
%%%  and recall values. By the time section 2 starts, I have already
%%%  rejected this approach, and neither of the two referees
%%%  found any problem with this. However, I have included 
%%%  references to these two papers at the point in section
%%%  1 where I say that I will not pursue this approach further.
%%%  fix #30
%%%
%%%> Works by Bollman and (more recently)
%%%> Dominich might be useful.
%%%
%%%  I found Dominich's work; there is nothing in there that would 
%%%  be interesting for my problem. I did not see Bollmann-Sdorra's
%%%  work, but the way it is cited in Dominich does suggest that it
%%%  follows the same general approach as Van Rijsbergen and Shaw,
%%%  i.e. a set-based approach. In this paper, I use a general
%%%  approach that was used by Cooper and seems to have been largely
%%%  forgotten, that is, to evaluate the information system by 
%%%  the utility that it brings to its users. In most cases the 
%%%  utility of the system is too vague to serve as a basic starting
%%%  point, but in my case it is reasonably straightforward. 
%%%
%%%> Better names for the models would help - using the names of rivers
%%%> or girlfriends or puppies is silly.  Use explanatory names; do you
%%%> find "Edith" or "Expected Search Length" to be easier to understand?
%%%
%%%  It's an easy suggestion for a reviewer to make. It's also easy 
%%%  to find descriptive names, but they tend to be uncomfortably
%%%  long. I have therefore resorted to long names but use acronyms
%%%  in the discussion of the measures. If the referee can come up
%%%  with names that are short enough to be used fully in the text,
%%%  I will be rather pleased. fixes #21 to #26
%%% 
%%%> When discussing precision and recall, as well as other measures such
%%%> as ASL and ESL, remember that one can compute any of these measures
%%%> for all or just the first part of a dataset, and one may compute the
%%%> chosen measure at any given point in the set of documents being
%%%> examined, such as computing the precision and recall after the first
%%%> two documents have been retrieved.
%%%
%%%  I am actually aware of this. I could mention this in the 
%%%  paper, but doing so would only either state that I know the
%%%  very basics of information retrieval, or educate the
%%%  reader about them. I don't see how this alleviates
%%%  the problems with precision and recall in this setting. 
%%%  Maybe the reviewer knows; I would be interested. 
%%%  What I propose in this paper is to move from set-based to 
%%%  vector-based measures. Surely, you can vary the sets along
%%%  your vector of results. Presumably, if either precision or
%%%  recall is completely known at every cutoff level of the outcome,
%%%  the outcome vector can be determined. But a sequence of sets 
%%%  strikes me as a cumbersome kludge for a vector. 
%%%
%%%> Avoid colloquialisms ("To find the expected value is *really* easy")
%%%
%%%  fix #27
%%%
%%%> as well as very subjective statements ("easy").  Be precise when
%%%> possible ("which is not even close" [to what?]).
%%%  
%%%  the two numbers are not close to each other, fix #29
%%%
%%%> Remember to end each sentence with a period, which may occur after
%%%> a formula (put differently, the author often leaves out periods
%%%> after formulas).
%%%
%%%  I personally find periods in displayed math an aberration, but
%%%  I don't want to make an issue of my personal preferences. Therefore
%%%  I added 23 periods and 4 commas to the manuscript. These changes
%%%  are not marked up in the source code.
%%%
%%%> More interpretation in the "Test" section of specific numbers and
%%%> related comparisons and discussions would help. 
%%%
%%%  The results did raise a few eyebrows in the user community, and
%%%  interesting things could be written about them, but that is somewhat
%%%  out of scope for this paper. The paper is about evaluation methods,
%%%  not about the results of the evaluation in a particular operational
%%%  context. Much more work could be done on the latter, because it is
%%%  a very interesting dataset and service.
%%%  But I did write a few more sentences about the results, limiting
%%%  myself, however, to concerns that fall strictly within the 
%%%  remit of the paper. 
%%%
%%%> With improved nomenclature and discussions of specific results,
%%%> readers will find the article far more useful.
%%%
%%%  I absolutely agree that the paper as it is now has improved
%%%  a lot over the previous version.
%%%