A Comparison of Techniques for Sampling Web Pages

Baykan, Eda; Henzinger, Monika; Keller, Stefan F.; de Castelberg, Sebastian; Kinzler, Markus

doi:10.4230/LIPIcs.STACS.2009.1809

Volume: 26th International Symposium on Theoretical Aspects of Computer Science (STACS 2009)
Series: Leibniz International Proceedings in Informatics (LIPIcs)
Conference: Symposium on Theoretical Aspects of Computer Science (STACS)
License: Creative Commons Attribution-NoDerivs 3.0 Unported license
Publication Date: 2009-02-19

File

LIPIcs.STACS.2009.1809.pdf

Filesize: 245 kB
18 pages

Document Identifiers

DOI: 10.4230/LIPIcs.STACS.2009.1809
URN: urn:nbn:de:0030-drops-18091

Subject Classification

Keywords

Random walks
Sampling web pages

Abstract

As the World Wide Web grows rapidly, gathering representative information about it becomes increasingly challenging. Instead of crawling the web exhaustively, one has to resort to other techniques, such as sampling, to determine its properties. A uniform random sample of the web would be useful for determining the percentage of web pages in a specific language, on a given topic, or in a particular top-level domain. Unfortunately, no approach has been shown to sample web pages in an unbiased way. Three promising web sampling algorithms are based on random walks. Each has been evaluated individually, but because those evaluations used different data sets, their results cannot be compared. In this paper we compare these algorithms directly. We performed three random walks on the web under the same conditions and analyzed their outcomes in detail. We discuss the strengths and weaknesses of each algorithm and propose improvements based on the experimental results.
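
To illustrate why random-walk samples are not automatically uniform, here is a minimal Python sketch. It is not one of the three algorithms compared in the paper, whose details are not given in this abstract. It assumes a tiny, hypothetical undirected graph (`GRAPH`) in place of the web and uses a generic degree re-weighting correction. All names (`GRAPH`, `random_walk`, `naive_estimate`, `reweighted_estimate`) are illustrative assumptions.

```python
import random

# Hypothetical toy graph (undirected, connected, contains a triangle).
# It stands in for a link graph and is not data from the paper.
GRAPH = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
    "e": ["a"],
}


def random_walk(graph, start, steps, rng):
    """Follow a uniformly chosen neighbor at each step; return visited nodes."""
    node = start
    visits = []
    for _ in range(steps):
        node = rng.choice(graph[node])
        visits.append(node)
    return visits


def naive_estimate(visits, predicate):
    """Fraction of visited nodes satisfying predicate (biased toward hubs)."""
    return sum(1 for v in visits if predicate(v)) / len(visits)


def reweighted_estimate(graph, visits, predicate):
    """Weight each visit by 1/degree to offset the walk's degree bias."""
    weights = [1.0 / len(graph[v]) for v in visits]
    hit = sum(w for v, w in zip(visits, weights) if predicate(v))
    return hit / sum(weights)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    visits = random_walk(GRAPH, "a", 100_000, rng)
    is_hub = lambda v: v == "a"  # property of interest: "is the hub node"
    print("true fraction:", 1 / len(GRAPH))
    print("naive estimate:", naive_estimate(visits, is_hub))
    print("re-weighted estimate:", reweighted_estimate(GRAPH, visits, is_hub))
```

On an undirected graph, a simple random walk visits each node in proportion to its degree. In this toy graph the hub `a` holds 4 of the 10 edge endpoints, so the naive estimate should drift toward 0.4, although `a` is only one of five nodes. The re-weighted estimate should come out near 0.2. The real web graph is directed and cannot be enumerated, so this correction does not carry over directly.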

Cite As

Eda Baykan, Monika Henzinger, Stefan F. Keller, Sebastian de Castelberg, and Markus Kinzler. A Comparison of Techniques for Sampling Web Pages. In 26th International Symposium on Theoretical Aspects of Computer Science. Leibniz International Proceedings in Informatics (LIPIcs), Volume 3, pp. 13-30, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2009) https://doi.org/10.4230/LIPIcs.STACS.2009.1809
