SEREBIF



Web search engines use increasingly complex and elaborate ranking functions to deliver suitable results for a given query, ordered by relevance. Often enough, though, results that are not relevant to the user also show up in the result list. SEREBIF is an approach that tries to incorporate information gathered from the users into the results to increase their quality.

SEREBIF stands for Search Engine Result Enhancement By Implicit Feedback. The overall goal is to analyze the preferences, and more generally the behavior, of users of a search engine in order to learn about the real relevance of the results. Observing the users' actions can yield valuable information about how important a result seems to be for the user. For example, if users tend to click on the second result for a certain query and usually skip the first one, the ranking does not seem to be appropriate and could be changed accordingly [1]. It should therefore be possible to increase the quality of the results of a given search engine over time, just by taking the implicitly collected information about the clicked results into account. One advantage of this approach is that the users do not need to invest any extra time to give (explicit) feedback on what they find relevant. They simply do what they need to do in order to find the information they are looking for.
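The click example above can be turned into pairwise preferences in the spirit of [1]: a clicked result is taken to be preferred over every unclicked result ranked above it. The following is only a minimal sketch of that idea; the function name and the document identifiers are hypothetical and not part of SEREBIF.

```python
def click_preferences(ranking, clicked):
    """Return (preferred, skipped) pairs derived from one result list.

    ranking: document ids in the order they were shown (assumed input format).
    clicked: document ids the user clicked on.
    A clicked document is treated as preferred over each unclicked
    document that was ranked above it.
    """
    clicked = set(clicked)
    pairs = []
    for position, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:position]:
                if above not in clicked:
                    pairs.append((doc, above))
    return pairs


# The user skips the first result and clicks the second one.
print(click_preferences(["d1", "d2", "d3"], clicked=["d2"]))
```

If many users produce the same pair for the same query, that is a hint that the two results should swap places.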

We keep track of entered queries and clicked links, and we try to estimate the time that users stay on the clicked result pages. So that users do not have to fear being tracked continuously and possibly identified through their queries (as happened with the large set of search queries released by AOL in 2006), we do not link the queries of different sessions to each other. In this way, we collect information from all users with a low chance of re-identification, so that privacy concerns are respected (of course, someone entering their own name as a query could still be identified, but we do not expect this to happen often). We do, however, try to detect whether the same user entered multiple queries shortly after one another (in the same session, i.e. within a short time period and without closing the browser). This is important because such a sequence might indicate a connection between two different queries. See e.g. [2] for a more detailed description of so-called "Query Chains".
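Grouping the queries of one session into candidate query chains could look roughly as follows. This is a sketch under stated assumptions: the text only says "a short time period", so the 300-second threshold, the `(timestamp, query)` event format, and both function names are hypothetical.

```python
def split_into_chains(events, max_gap_seconds=300):
    """Split one session's (timestamp_seconds, query) events into chains.

    A new chain starts whenever the pause since the previous query
    exceeds max_gap_seconds (an assumed threshold, not from the source).
    """
    chains = []
    current = []
    last_time = None
    for timestamp, query in sorted(events):
        if last_time is not None and timestamp - last_time > max_gap_seconds:
            chains.append(current)
            current = []
        current.append(query)
        last_time = timestamp
    if current:
        chains.append(current)
    return chains


def chain_links(chain):
    """Return pairs of consecutive queries within one chain."""
    return list(zip(chain, chain[1:]))


session = [(0, "jaguar"), (40, "jaguar car"), (1000, "weather")]
for chain in split_into_chains(session):
    print(chain, chain_links(chain))
```

Only events from a single session are passed in, so nothing here links queries across sessions.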

The whole approach requires an existing search engine to build upon. We chose to implement a kind of proxy between the users and an existing search engine (for our first tests, we use Google via the SOAP API it provides, which allows a limited number of queries per day). Our system therefore looks like a search engine to the users, but in fact just forwards queries to the underlying search engine. The results returned by that engine are then shown to the user.
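The proxy idea can be sketched independently of any particular search API. In the sketch below the underlying engine is just a callable passed in by the caller; the class, its methods, the log format, and the random per-session identifier are all assumptions made for illustration and do not describe the actual SEREBIF implementation or the Google API.

```python
import uuid


class SearchProxy:
    """Forwards queries to a backend engine and logs queries and clicks."""

    def __init__(self, backend):
        # backend: any callable mapping a query string to a list of URLs
        # (a stand-in for the real underlying search engine).
        self.backend = backend
        self.log = []

    def new_session(self):
        # A random id per browser session; sessions are never linked.
        return uuid.uuid4().hex

    def search(self, session_id, query):
        results = list(self.backend(query))
        self.log.append(("query", session_id, query, tuple(results)))
        return results

    def click(self, session_id, query, url):
        self.log.append(("click", session_id, query, url))
        return url


def fake_engine(query):
    # Placeholder backend used only for this example.
    return ["http://example.org/" + query + "/" + str(i) for i in (1, 2)]


proxy = SearchProxy(fake_engine)
sid = proxy.new_session()
hits = proxy.search(sid, "serebif")
proxy.click(sid, "serebif", hits[1])
print(len(proxy.log))
```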

The gathered information is preprocessed and then merged into a storage system based on the ANIMA system [3], which has also been developed in our MINE research group. The following figure shows what such a storage can basically look like.


We use three different types of nodes for this network, denoted by different colors in the picture:
  • Query terms (T) - These are single terms which have been used in user queries
  • Queries (Q) - These are the full queries that have been entered into SEREBIF
  • Documents (D) - These are the resulting documents that have been provided by the underlying search engine for one or more queries
There can be different kinds of links between these nodes. Each link represents a certain connection and may (depending on its type) carry a weight indicating the strength of the relationship. In particular, we want to employ the following connections, which are also denoted by lines of different styles in the picture:
  • TxT - Connections between different terms can indicate a relationship between single terms. This can simply be co-occurrence (terms which tend to occur together in queries), but it can also be a semantic relationship between terms (e.g., synonyms). To realize the latter, however, we have to take additional knowledge (such as the online thesaurus WordNet) into account.
  • TxQ - The connection between queries and query terms is a rather trivial one - it just indicates that a term was contained in that very query.
  • QxQ - Connections between different queries can be established if users are found to have searched for some information with multiple consecutive queries, e.g. because they could not find what they were looking for with the first query and then reformulated it (query chains, [2]).
  • QxD - Connections between queries and documents indicate that documents seem to be relevant for given queries. These connections are initially given just by the information of the underlying search engine (i.e., there are connections between an entered query and the results returned for that query). However, the weights of these connections are subject to change and may cause the importance of certain documents to shift over time. Initially, the weights are determined by the ranking of the results, but they are then adjusted according to which documents users actually looked at. The visiting time [4] of documents is also taken into account here.
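The node and link types above could be stored as a simple weighted graph. The sketch below is not the ANIMA-based storage; it only illustrates the four link types. The concrete formulas (initial QxD weight of 1/rank, a visit bonus proportional to display time capped at 60 seconds, a gain of 0.1) are assumptions chosen for the example, as are all names.

```python
from collections import defaultdict


class FeedbackNetwork:
    """Weighted links between terms (T), queries (Q) and documents (D)."""

    def __init__(self):
        # Keys are (link_type, node_a, node_b); values are weights.
        self.weights = defaultdict(float)

    def add_query(self, query, ranked_docs):
        terms = query.lower().split()
        for term in terms:
            self.weights[("TxQ", term, query)] = 1.0
        # TxT: count co-occurrence of distinct terms within the query.
        for i, a in enumerate(terms):
            for b in terms[i + 1:]:
                if a != b:
                    first, second = sorted((a, b))
                    self.weights[("TxT", first, second)] += 1.0
        # QxD: initial weight from the engine's rank (assumed 1/rank).
        for rank, doc in enumerate(ranked_docs, start=1):
            self.weights.setdefault(("QxD", query, doc), 1.0 / rank)

    def link_queries(self, earlier, later):
        # QxQ: the later query followed the earlier one in a query chain.
        self.weights[("QxQ", earlier, later)] += 1.0

    def record_visit(self, query, doc, seconds, gain=0.1, cap=60.0):
        # Strengthen QxD by the (capped) display time; assumed formula.
        self.weights[("QxD", query, doc)] += gain * min(seconds, cap) / cap


net = FeedbackNetwork()
net.add_query("jaguar speed", ["d1", "d2"])
net.record_visit("jaguar speed", "d2", seconds=60)
print(net.weights[("QxD", "jaguar speed", "d2")])
```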
To see the current state of this project, visit the prototype implementation at http://mine.uni.lu/~weires/serebif/search.php. Note that we are currently at the stage of collecting queries and the associated feedback in order to obtain enough test data, so the displayed search results are not yet being changed according to the gathered information.

Once we have enough test data, we can use the information in the network in different ways. Examples are:
  • Re-ranking the resulting documents for queries according to the results actually visited by previous users
  • Merging results of queries that are connected (e.g., query chains)
  • Suggesting additional query terms to the users, i.e. terms which are strongly connected to the given query but not already contained in it
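Two of these uses, re-ranking and term suggestion, can be illustrated with plain dictionaries of weights. This is a hedged sketch only: the input formats (a document-to-weight mapping for one query, and a mapping from term pairs to co-occurrence weights) and the function names are hypothetical, and the scoring is deliberately simplistic.

```python
def rerank(doc_weights):
    """Order documents by descending QxD weight (ties broken by id)."""
    return sorted(doc_weights, key=lambda doc: (-doc_weights[doc], doc))


def suggest_terms(query, cooccurrence, limit=3):
    """Suggest terms linked to the query's terms but not contained in it.

    cooccurrence: {(term_a, term_b): weight}, an assumed TxT representation.
    """
    present = set(query.lower().split())
    scores = {}
    for (a, b), weight in cooccurrence.items():
        if a in present and b not in present:
            scores[b] = scores.get(b, 0.0) + weight
        elif b in present and a not in present:
            scores[a] = scores.get(a, 0.0) + weight
    return sorted(scores, key=lambda term: (-scores[term], term))[:limit]


# The second-ranked document gained weight through user visits.
print(rerank({"d1": 1.0, "d2": 1.2, "d3": 0.33}))
print(suggest_terms("jaguar", {("jaguar", "speed"): 3, ("car", "jaguar"): 5}))
```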
Apart from realizing and testing the SEREBIF prototype, there are some open questions regarding this approach. These include:
  • How to merge the feedback information with the (unknown) ranking function of the underlying search engine in an appropriate way. We plan to develop our own search engine, for which we know how each result is ranked, so this problem would not arise there.
  • How to prevent exploitation, i.e. automatically generated (pseudo-)feedback that could push certain sites up in the ranking when our approach is used.

[1] - T. Joachims - Optimizing Search Engines using Clickthrough Data, KDD 2002
[2] - F. Radlinski, T. Joachims - Query Chains: Learning to Rank from Implicit Feedback, KDD 2005
[3] - C. Schommer, B. Schroeder - ANIMA: Associate Memories for Categorical Data Streams, ICCSA 2005
[4] - D. Kelly, N.J. Belkin - Display Time as Implicit Feedback: Understanding Task Effects, SIGIR 2004


Modified: 2007-03-13 12:27:13