Seminar Series
October 25, 1999
Topic: Document Resemblance and Related Issues
Speaker: Andrei Broder, Chief Technology Officer at AltaVista
ABSTRACT:
People often claim that two web pages are "the same" or "roughly the
same", even though classic distances on strings (Hamming, Levenshtein,
etc.) might indicate that the two pages are far apart.
To formalize these intuitive ideas we defined the mathematical concept
of document resemblance. The resemblance can be estimated using a fixed
size "sketch" for each document. For a large collection of documents
(say 200 million) the size of this sketch is of the order of a few hundred
bytes per document. However, for efficient large scale web indexing
it is not necessary to determine the actual resemblance value: it suffices
to determine whether newly encountered documents are duplicates or near-duplicates
of documents already indexed; In other words, it suffices to determine
whether the resemblance is above a certain threshold. We show how this
determination can be made using a "sample" of less than 50 bytes per
document.
The basic approach for computing resemblance has two aspects: first,
resemblance is expressed as a set intersection problem, and second,
the relative size of intersections is evaluated by a process of random
sampling that can be done independently for each document. The process
of estimating the relative size of intersection of sets and the threshold
test discussed above can be applied to arbitrary sets, and thus might
be of independent interest.
The ideas for filtering near-duplicate documents discussed here have
been successfully implemented and are in current use in the context
of the AltaVista search engine. This talk is tilted towards "algorithm
engineering" rather than "algorithm analysis" and very little mathematical
background is required.
BIOGRAPHICAL SKETCH
Andrei Broder is chief technology officer of the AltaVista
Search division in the AltaVista Company. Previously he was a senior
member of the research staff at Compaq's Systems Research Center in
Palo Alto, California. He graduated from Technion, Israel's Institute
of Technology, and did his Ph.D. in Computer Science at Stanford University
under Don Knuth. He has written and co-authored more than 60 scientific
papers and numerous patents. His main research interests are the design,
analysis, and implementation of probabilistic algorithms and supporting
data structures, in particular in the context of web-scale applications.