IMPORTANT: Due to data size considerations, CrawlEval has been moved to HuggingFace for better collaboration and usability.
Resources and tools for evaluating the performance and behavior of web crawling systems.
CrawlEval provides a comprehensive suite of tools and datasets for evaluating web crawling systems, with a particular focus on HTML pattern extraction and content analysis. The project includes:
- A curated dataset of web pages with ground truth patterns
- Tools for fetching and analyzing web content
- Evaluation metrics and benchmarking capabilities
The dataset is designed to test and benchmark web crawling systems' ability to extract structured data from HTML. It includes:
- Raw HTML files with various structures and complexities
- Ground truth PagePattern JSON files
- Metadata about each example (query, complexity, etc.)
See the dataset documentation for detailed information about the dataset structure and usage.
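If you want to experiment with the HuggingFace-hosted dataset programmatically, a minimal sketch with the `datasets` library might look like the following. The repository id and field names here are illustrative assumptions, not the published schema; consult the dataset card and documentation for the actual values.

```python
from datasets import load_dataset

# Hypothetical repository id -- replace with the real one from the dataset card.
ds = load_dataset("crawleval/crawleval", split="test")

for example in ds:
    html = example["html"]          # assumed field: raw page HTML
    pattern = example["pattern"]    # assumed field: ground truth PagePattern JSON
    metadata = example["metadata"]  # assumed field: query, complexity, etc.
    # ... run your extraction system on `html` and compare against `pattern`
```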
A command-line tool for collecting and analyzing web pages for evaluation purposes.
Key features:
- Fetches web pages with proper JavaScript rendering using Selenium
- Extracts and analyzes metadata (DOM structure, nesting levels, etc.)
- Content deduplication using SHA-256 hashing
- URL deduplication with normalization
- Parallel processing of multiple URLs
- Progress tracking and detailed logging
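To illustrate the fetching and deduplication ideas listed above, here is a short sketch combining a headless Selenium fetch with SHA-256 content hashing and URL normalization. This is not the tool's actual implementation; the function names and normalization rules are assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

from selenium import webdriver


def fetch_rendered_html(url: str) -> str:
    """Fetch a page with headless Chrome so JavaScript runs before capture."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def normalize_url(url: str) -> str:
    """Normalize a URL: lowercase scheme/host, drop fragment, trim trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_hash(html: str) -> str:
    """SHA-256 hash of the rendered HTML, used to detect duplicate content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


seen_urls: set[str] = set()
seen_hashes: set[str] = set()


def fetch_if_new(url: str) -> str | None:
    """Fetch a page unless its normalized URL or content hash was already seen."""
    u = normalize_url(url)
    if u in seen_urls:
        return None
    html = fetch_rendered_html(u)
    h = content_hash(html)
    if h in seen_hashes:
        return None
    seen_urls.add(u)
    seen_hashes.add(h)
    return html
```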
Usage:
python -m crawleval.fetch_webpage --batch urls.txt [options]

Options:
- --dir DIR: Base directory for storing data
- --list-hashes: Display the content hash index
- --list-urls: Display the URL index
- --save-results FILE: Save batch processing results to a JSON file
- --workers N: Number of parallel workers (default: 4)
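For example, a typical batch run using eight workers and saving the results to a JSON file might look like:

python -m crawleval.fetch_webpage --batch urls.txt --workers 8 --save-results results.json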
We welcome contributions to improve the dataset and tools. Please see the dataset documentation for guidelines on adding new examples.