The matching algorithm was inspired by the Vector Space Model
[29, 30], a mathematical model of an information retrieval system.
The vector-space model assumes that an available term set is used to identify
stored records and queries, and, based on this, it calculates the similarity
between query and document. Documents and queries can be imagined as vectors,
where the components are coefficients $a_{rk}$ (for a document $D_r$) and
$q_{sk}$ (for a query $Q_s$). These coefficients normally assume the value 1
if the term k appears in the document $D_r$ (or query $Q_s$) and 0 otherwise,
or another value if it is necessary to give different weighting to each term
in the respective document or query.
So we could write:
$$D_r = (a_{r1}, a_{r2}, \ldots, a_{rt}), \qquad Q_s = (q_{s1}, q_{s2}, \ldots, q_{st})$$
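Suppose that t terms are used to define the documents (queries). Each term (keyword) can be identified with a term vector T. If the T vectors are linearly independent, each document (query) vector can be represented as a linear combination of the t term vectors. So we can write [31]:
$$D_r = \sum_{i=1}^{t} a_{ri} T_i, \qquad Q_s = \sum_{j=1}^{t} q_{sj} T_j$$
where $a_{ri}$ is the coefficient of term i in document $D_r$ and $q_{sj}$ is the coefficient of term j in query $Q_s$.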
In a vector space, similarity between two vectors k and z is measured by the
scalar product between them: $k \cdot z = |k|\,|z| \cos\theta$, where |k| is
the length of vector k, |z| is the length of vector z and $\theta$ is the
angle between vector k and vector z.
So similarity between a document and a query can be measured in the same way.
As we assume that the terms (keywords) are not correlated, we can say that
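the vectors $T_i$ and $T_j$ are orthogonal, so for i different from j,
$T_i \cdot T_j$ is equal to 0 and for i equal to j $T_i \cdot T_j$ is equal
to 1. The scalar product of a document and a query then reduces to the sum,
over all terms, of the products of their corresponding coefficients.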
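As a minimal sketch of the measure just described, assuming orthonormal term vectors, the scalar product of a document and a query can be computed as the sum of the products of their coefficients. The function names and the sample keywords below are illustrative assumptions, not part of the original system.

```python
import math


def to_vector(terms, vocabulary):
    """Binary coefficients: 1 if the term appears, 0 otherwise."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


def scalar_product(k, z):
    """Scalar product of two vectors, assuming orthonormal term vectors."""
    return sum(a * b for a, b in zip(k, z))


def cosine(k, z):
    """Cosine of the angle between k and z (0.0 if either has length 0)."""
    length_k = math.sqrt(scalar_product(k, k))
    length_z = math.sqrt(scalar_product(z, z))
    if length_k == 0 or length_z == 0:
        return 0.0
    return scalar_product(k, z) / (length_k * length_z)


# Hypothetical vocabulary and keyword sets, chosen only for illustration.
vocabulary = ["hypertext", "link", "anchor", "retrieval"]
document = to_vector(["hypertext", "link", "retrieval"], vocabulary)
query = to_vector(["link", "retrieval"], vocabulary)

similarity = scalar_product(document, query)
```

With binary coefficients the scalar product simply counts the terms shared by the document and the query. Dividing by the two lengths, as in `cosine`, removes the effect of document size.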
Choosing the right terms is the greatest problem in formulating any query. One advantage of the vector-space model is the ease with which vectors can be modified, which allows the method to be extended with the ideas of relevance feedback [32, 33, 34], a well-known method for generating improved queries. The principle of the relevance-feedback process is the reformulation of queries based on the results obtained. Documents that are relevant to a given query are similar in the sense that their vectors resemble each other. So, if a document is retrieved successfully, we can try to reformulate the initial query to make it more similar to the retrieved document. Theoretically, the new query will retrieve additional relevant documents that are similar to the originally identified relevant item. The approximate expression for this method is:
$$Q' = Q + \alpha \sum_{D_r \in R'} D_r - \beta \sum_{D_r \in N'} D_r$$
where $Q$ is the initial query, $Q'$ is the reformulated query, $\alpha$ and
$\beta$ are suitable coefficients, R' is the set of
documents relevant to a query and N' the set of non-relevant
documents.
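The reformulation step can be sketched as follows: the new query is the old query plus a weighted sum of the relevant document vectors, minus a weighted sum of the non-relevant ones. The function name `reformulate`, the default values of `alpha` and `beta`, and the sample vectors are assumptions made for illustration. The source gives no concrete values for the coefficients.

```python
def reformulate(query, relevant, non_relevant, alpha=1.0, beta=0.5):
    """Return query + alpha * sum(relevant) - beta * sum(non_relevant).

    alpha and beta are the "suitable coefficients"; the defaults here
    are arbitrary placeholders, not values taken from the source.
    """
    new_query = list(query)
    for doc in relevant:
        for i, coefficient in enumerate(doc):
            new_query[i] += alpha * coefficient
    for doc in non_relevant:
        for i, coefficient in enumerate(doc):
            new_query[i] -= beta * coefficient
    return new_query


# Hypothetical vectors over a four-term vocabulary.
query = [1, 1, 0, 0]
relevant = [[1, 1, 1, 0]]       # R': documents judged relevant
non_relevant = [[0, 0, 0, 1]]   # N': documents judged non-relevant
new_query = reformulate(query, relevant, non_relevant)
```

The third term, absent from the initial query, gains a positive weight because it appears in a relevant document, while the fourth term is pushed negative because it appears only in a non-relevant one.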
We used the ideas of these two methods to formulate our matching algorithm. Our basic assumption was that when following a link we wish to find documents that are similar with respect to certain terms (keywords), so the vector-space model is an appropriate approach. At the same time, when creating a link for a specific anchor in a document, we are not just looking for documents that have similar keywords attached to them, but for documents that are similar to the combination of terms attached to the anchor and to the node. Although we do not use the interaction and feedback proposed by the relevance-feedback method, we adopted its idea of giving more importance to terms that appear in both the anchor and the node (relevant terms) and reducing the importance of keywords that appear only for the node, while leaving intact the terms that appear only for the anchor. In this way, we moved the query closer not just to the anchor or the node in isolation, but to the combination of the two.
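One possible reading of this weighting scheme is sketched below. The text does not state the actual weights, so the function name `build_link_query` and the values of `boost` and `damp` are hypothetical. The only properties taken from the description are that shared terms are weighted up, node-only terms are weighted down, and anchor-only terms are left unchanged.

```python
def build_link_query(anchor_terms, node_terms, boost=2.0, damp=0.5):
    """Build a weighted query from anchor and node keywords.

    boost and damp are illustrative assumptions, not the weights
    used by the original system.
    """
    anchor = set(anchor_terms)
    node = set(node_terms)
    weights = {}
    for term in anchor | node:
        if term in anchor and term in node:
            weights[term] = boost   # appears in both: more important
        elif term in anchor:
            weights[term] = 1.0     # anchor only: kept intact
        else:
            weights[term] = damp    # node only: less important
    return weights


# Hypothetical keyword sets for one anchor and its containing node.
weights = build_link_query(["link", "anchor"], ["link", "hypertext"])
```

The resulting weights can serve as the query coefficients when computing the scalar product against each candidate document.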