Next: Comparison between Information Retrieval Up: Information retrieval Previous: The user interface

The matching algorithm

Our matching algorithm was inspired by the vector-space model [29, 30], a mathematical model of an information retrieval system. The vector-space model assumes that a single available term set is used to identify both stored records and queries and, on that basis, calculates the similarity between a query and a document. Documents and queries can be viewed as vectors whose components are coefficients $a_{ik}$ (for documents) and $q_{jk}$ (for queries). A coefficient normally takes the value 1 if term $k$ appears in document $D_i$ (or query $Q_j$) and 0 otherwise; other values may be used when terms need to be weighted differently within the respective document or query.

So we can write:

\[ D_i = (a_{i1}, a_{i2}, \ldots, a_{it}) \]

\[ Q_j = (q_{j1}, q_{j2}, \ldots, q_{jt}) \]
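As an illustration only, the following Python sketch builds such coefficient vectors over a fixed, ordered term set. The function name `term_vector`, the sample vocabulary, and the weight of 2.0 are assumptions made for this example; they do not come from the system described here.

```python
def term_vector(terms, vocabulary, weights=None):
    """Return the coefficients of a document or query over `vocabulary`.

    A coefficient is 1.0 when the term is present and 0.0 otherwise,
    unless `weights` supplies a different value for a present term.
    """
    weights = weights or {}
    present = set(terms)
    return [weights.get(t, 1.0) if t in present else 0.0 for t in vocabulary]


# Hypothetical term set and keyword lists, for illustration only.
vocabulary = ["hypertext", "link", "retrieval", "vector"]
d_i = term_vector(["link", "vector"], vocabulary)
q_j = term_vector(["retrieval", "vector"], vocabulary, weights={"vector": 2.0})

print(d_i)  # [0.0, 1.0, 0.0, 1.0]
print(q_j)  # [0.0, 0.0, 1.0, 2.0]
```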

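Suppose that $t$ terms are used to define the documents and queries. Each term (keyword) $i$ can be identified with a term vector $T_i$. If the $T$ vectors are linearly independent, they span a $t$-dimensional space, and each document (or query) vector can be represented as a linear combination of the $t$ term vectors. So, for a document $D_r$ with coefficients $a_{ri}$, we can write [31]:

\[ D_r = \sum_{i=1}^{t} a_{ri} \, T_i \]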

In a vector space, the similarity between two vectors $k$ and $z$ is measured by their scalar product, $k \cdot z = |k| \, |z| \cos\alpha$, where $|k|$ is the length of vector $k$, $|z|$ is the length of vector $z$, and $\alpha$ is the angle between the two vectors. The similarity between a document $D_r$ and a query $Q_s$ can be measured in the same way:

\[ D_r \cdot Q_s = \sum_{i,j=1}^{t} a_{ri} \, q_{sj} \, T_i \cdot T_j \]

where $a_{ri}$ and $q_{sj}$ are the term coefficients of $D_r$ and $Q_s$.

Since we assume that the terms (keywords) are not correlated, the term vectors $T_i$ and $T_j$ are orthogonal: $T_i \cdot T_j$ is equal to 0 for $i \neq j$ and equal to 1 for $i = j$. The similarity between a document $D_r$ and a query $Q_s$ therefore reduces to a sum of products of matching coefficients:

\[ \mathrm{sim}(D_r, Q_s) = \sum_{i=1}^{t} a_{ri} \, q_{si} \]
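Under the orthogonality assumption, the similarity is simply the sum of products of matching coefficients. The sketch below shows this in Python, together with a length-normalised (cosine) variant that follows from the scalar-product definition. The function names and the sample vectors are hypothetical and chosen only for illustration.

```python
import math


def similarity(d, q):
    """Inner product of two coefficient vectors over the same term set."""
    if len(d) != len(q):
        raise ValueError("vectors must be defined over the same term set")
    return sum(a * b for a, b in zip(d, q))


def cosine(d, q):
    """Scalar product divided by both vector lengths (cosine of the angle)."""
    norm = math.sqrt(similarity(d, d)) * math.sqrt(similarity(q, q))
    return similarity(d, q) / norm if norm else 0.0


# Hypothetical binary vectors over four terms.
d_r = [0.0, 1.0, 0.0, 1.0]
q_s = [0.0, 0.0, 1.0, 1.0]

print(similarity(d_r, q_s))  # 1.0 (one shared term)
print(cosine(d_r, q_s))      # approximately 0.5
```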

Choosing the right terms is the greatest problem in formulating any query. One advantage of the vector-space model is the ease with which vectors can be modified, which allows the method to be extended to incorporate the ideas of relevance feedback [32, 33, 34], a well-known method for generating improved queries. The principle of the relevance-feedback process is the reformulation of queries based on the results obtained. Documents that are relevant to a given query are similar in the sense that their vectors resemble each other. So, if a relevant document is retrieved successfully, we can reformulate the initial query to make it more similar to the retrieved document. In theory, the new query will then retrieve additional relevant documents that are similar to the originally identified relevant item. The approximate expression for this method is:

\[ Q' = Q + \alpha \sum_{D_i \in R'} D_i - \beta \sum_{D_i \in N'} D_i \]

where $Q$ is the initial query, $Q'$ is the reformulated query, $\alpha$ and $\beta$ are suitable coefficients, $R'$ is the set of documents relevant to the query, and $N'$ is the set of non-relevant documents.
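The reformulation step can be sketched in Python as follows: the weighted vectors of the relevant documents are added to the query, and the weighted vectors of the non-relevant documents are subtracted from it. The function name `reformulate`, the default coefficient values, and the sample vectors are assumptions made for this example only.

```python
def reformulate(query, relevant, non_relevant, alpha=1.0, beta=0.5):
    """Move `query` towards the relevant and away from the non-relevant vectors.

    `alpha` and `beta` stand for the "suitable coefficients"; the default
    values used here are arbitrary and purely illustrative.
    """
    new_query = list(query)
    for doc in relevant:
        for k, a in enumerate(doc):
            new_query[k] += alpha * a
    for doc in non_relevant:
        for k, a in enumerate(doc):
            new_query[k] -= beta * a
    return new_query


# Hypothetical query and document vectors over three terms.
q = [1.0, 0.0, 0.0]
relevant_docs = [[1.0, 1.0, 0.0]]
non_relevant_docs = [[0.0, 0.0, 1.0]]

print(reformulate(q, relevant_docs, non_relevant_docs))  # [2.0, 1.0, -0.5]
```

In this example the term shared with the relevant document gains weight, a new term from the relevant document enters the query, and the term found only in the non-relevant document receives a negative weight.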

We used ideas from both methods to formulate our matching algorithm. Our basic assumption was that, when following a link, we wish to find documents that are similar with respect to certain terms (keywords), so the vector-space model is an appropriate approach. At the same time, when creating a link for a specific anchor in a document, we are not just looking for documents that have similar keywords attached to them, but for documents that are similar to the combination of terms attached to the anchor and to the node. Although we do not use the interactions and feedback proposed by the relevance-feedback method, we borrowed the idea of giving more importance to terms that appear in both the anchor and the node (the relevant terms), reducing the importance of keywords that appear only for the node, and leaving unchanged the terms that appear only for the anchor. In this way, the query is moved closer not just to the anchor or to the node in isolation, but to the combination of the two.
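One possible reading of this weighting scheme is sketched below in Python. The text does not state the actual weights, so the `boost` and `damp` factors, the function name `link_query`, and the sample keywords are all hypothetical; the sketch shows only the three cases (terms on both anchor and node, anchor only, node only).

```python
def link_query(anchor_terms, node_terms, boost=2.0, damp=0.5):
    """Build query weights from the keywords of an anchor and of its node.

    The numeric factors are illustrative assumptions, not the values
    used by the system described in the text.
    """
    anchor, node = set(anchor_terms), set(node_terms)
    weights = {}
    for term in anchor | node:
        if term in anchor and term in node:
            weights[term] = boost   # shared term: more importance
        elif term in anchor:
            weights[term] = 1.0     # anchor-only term: left unchanged
        else:
            weights[term] = damp    # node-only term: less importance
    return weights


# Hypothetical keyword sets for an anchor and the node that contains it.
query = link_query(["vector", "model"], ["vector", "retrieval"])
print(sorted(query.items()))
# [('model', 1.0), ('retrieval', 0.5), ('vector', 2.0)]
```

The resulting weights could then serve as the query coefficients when computing the similarity with candidate documents.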






Fri Dec 8 14:41:14 GMT 1995