IMPORTANT: Due to data size considerations, CrawlEval has been moved to HuggingFace for better collaboration and usability.
Resources and tools for evaluating the performance and behavior of web crawling systems.
CrawlEval provides a comprehensive suite of tools and datasets for evaluating web crawling systems, with a particular focus on HTML pattern extraction and content analysis. The project includes:
- A curated dataset of web pages with ground truth patterns
- Tools for fetching and analyzing web content
- Evaluation metrics and benchmarking capabilities
The dataset is designed to test and benchmark web crawling systems' ability to extract structured data from HTML. It includes:
- Raw HTML files with various structures and complexities
- Ground truth PagePattern JSON files
- Metadata about each example (query, complexity, etc.)
See the dataset documentation for detailed information about the dataset structure and usage.
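If you want to experiment with the HuggingFace-hosted dataset programmatically, a minimal sketch with the `datasets` library might look like the following. The repository id and field names here are illustrative assumptions, not the published schema; consult the dataset card and documentation for the actual values.

```python
from datasets import load_dataset

# Hypothetical repository id -- replace with the real one from the dataset card.
ds = load_dataset("crawleval/crawleval", split="test")

for example in ds:
    html = example["html"]          # assumed field: raw page HTML
    pattern = example["pattern"]    # assumed field: ground truth PagePattern JSON
    metadata = example["metadata"]  # assumed field: query, complexity, etc.
    # ... run your extraction system on `html` and compare against `pattern`
```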
A command-line tool for collecting and analyzing web pages for evaluation purposes.
Key features:
- Fetches web pages with proper JavaScript rendering using Selenium
- Extracts and analyzes metadata (DOM structure, nesting levels, etc.)
- Content deduplication using SHA-256 hashing
- URL deduplication with normalization
- Parallel processing of multiple URLs
- Progress tracking and detailed logging
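To illustrate the fetching and deduplication ideas listed above, here is a short sketch combining a headless Selenium fetch with SHA-256 content hashing and URL normalization. This is not the tool's actual implementation; the function names and normalization rules are assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

from selenium import webdriver


def fetch_rendered_html(url: str) -> str:
    """Fetch a page with headless Chrome so JavaScript runs before capture."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def normalize_url(url: str) -> str:
    """Normalize a URL: lowercase scheme/host, drop fragment, trim trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_hash(html: str) -> str:
    """SHA-256 hash of the rendered HTML, used to detect duplicate content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


seen_urls: set[str] = set()
seen_hashes: set[str] = set()


def fetch_if_new(url: str) -> str | None:
    """Fetch a page unless its normalized URL or content hash was already seen."""
    u = normalize_url(url)
    if u in seen_urls:
        return None
    html = fetch_rendered_html(u)
    h = content_hash(html)
    if h in seen_hashes:
        return None
    seen_urls.add(u)
    seen_hashes.add(h)
    return html
```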
Usage:
python -m crawleval.fetch_webpage --batch urls.txt [options]

Options:
- --dir DIR: Base directory for storing data
- --list-hashes: Display the content hash index
- --list-urls: Display the URL index
- --save-results FILE: Save batch processing results to a JSON file
- --workers N: Number of parallel workers (default: 4)
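For example, a typical batch run using eight workers and saving the results to a JSON file might look like:

python -m crawleval.fetch_webpage --batch urls.txt --workers 8 --save-results results.json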
We welcome contributions to improve the dataset and tools. Please see the dataset documentation for guidelines on adding new examples.