meysam81/sitemap-harvester

🗺️ Sitemap Harvester

🚀 A blazingly fast Python tool to harvest URLs and metadata from website sitemaps like a digital archaeologist!

🚀 Quick Start

Installation

pip install sitemap-harvester

Basic Usage

# Harvest a website's sitemap
sitemap-harvester --url https://example.com

# Custom output file and timeout
sitemap-harvester --url https://example.com --output my_data.csv --timeout 15
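If you would rather drive the CLI from a Python script, a minimal sketch could look like the following. It only uses the `--url`, `--output`, and `--timeout` flags shown above; the helper names `build_command` and `harvest` are hypothetical and not part of the package.

```python
import subprocess


def build_command(url, output=None, timeout=None):
    """Build the argument list for the sitemap-harvester CLI.

    Hypothetical helper: only the flags shown in Basic Usage are used.
    """
    cmd = ["sitemap-harvester", "--url", url]
    if output is not None:
        cmd += ["--output", output]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


def harvest(url, output=None, timeout=None):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(url, output, timeout), check=True)
```

Calling `harvest("https://example.com", output="my_data.csv", timeout=15)` would then run the second command from Basic Usage, assuming `sitemap-harvester` is installed and on your PATH.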

🎯 What Gets Extracted?

  • 📝 Page Title - The main title of each page
  • 📄 Meta Description - SEO descriptions
  • 🏷️ Keywords - Meta keywords (if present)
  • 👤 Author - Page author information
  • 🔗 Canonical URL - Canonical link references
  • 🖼️ Open Graph Data - Social media metadata
  • 🌐 Custom Meta Tags - Any additional meta information
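To give a feel for where these fields live in a page, here is a rough sketch that pulls them out of raw HTML with the standard library's `html.parser`. It is an illustration only, not the tool's actual implementation; the `MetaExtractor` class and `extract_metadata` function are made-up names.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title>, <meta> tags, and canonical <link> of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}  # maps meta name/property -> content
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Plain meta tags use "name"; Open Graph tags use "property".
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a flat dict of the metadata found in an HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    return {**parser.meta, "title": parser.title.strip(), "canonical": parser.canonical}
```

With this sketch, description, keywords, and author appear under their meta names, and Open Graph values appear under keys such as `og:title`.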

💡 Pro Tips

  • Raise --timeout (for example, --timeout 15) for slower websites or large sitemaps
  • The tool automatically deduplicates URLs for you
  • Check the console output for real-time progress updates
  • Large sitemaps? Grab a coffee ☕ and let it work its magic!
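Curious what URL deduplication looks like in practice? The sketch below reads the `<loc>` entries from a sitemap document and keeps the first occurrence of each URL. It assumes the standard sitemaps.org namespace and is not taken from the tool's source; `unique_urls` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def unique_urls(sitemap_xml):
    """Return the <loc> URLs of a sitemap in order, without duplicates."""
    root = ET.fromstring(sitemap_xml)
    seen = set()
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Because it walks every `<loc>` element, the same function also lists the child sitemaps of a sitemap index file.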

🤝 Contributing

Found a bug? Have a feature request? Contributions are welcome! Feel free to open an issue or submit a pull request.

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


Happy harvesting! 🌾