
Apache StormCrawler is an open-source SDK for building scalable, low-latency web crawlers using Apache Storm. It provides reusable components for stream-based and recursive crawling, making it suitable for real-time and large-scale web data extraction.

Vendor

The Apache Software Foundation

Company Website

Product details

Apache StormCrawler

Apache StormCrawler is an open-source SDK designed for building scalable, low-latency web crawlers using Apache Storm. It provides a modular and extensible framework with reusable components, enabling developers to create efficient and resilient crawlers tailored to real-time or large-scale data extraction needs.

Features

  • Built on Apache Storm for distributed stream processing.
  • Core modules for fetching, parsing, and indexing web content.
  • Integration with external tools like Apache Tika for document parsing.
  • Spouts and bolts for OpenSearch and other indexing backends.
  • Support for sitemap parsing and URL frontier management.
  • Maven archetype for quick project setup.
  • Politeness enforcement via hostname partitioning.
  • Local and cluster deployment modes.
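To illustrate the politeness feature above, here is a minimal, dependency-free sketch of hostname-based partitioning: URLs are grouped by host so that each fetcher task owns a fixed set of hosts and can enforce per-host crawl delays. The class and method names are hypothetical, not StormCrawler's own API (in StormCrawler this role is played by its URL partitioning component feeding the fetcher).

```java
import java.net.URI;
import java.util.List;

// Illustration only: assigns each URL to a partition by hostname, so all
// URLs from one host are handled by the same fetcher task. Names are
// hypothetical; StormCrawler implements this idea in its own components.
public class HostPartitioner {
    private final int numPartitions;

    public HostPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Map a URL to a partition based on its hostname. */
    public int partition(String url) {
        String host = URI.create(url).getHost();
        // Math.floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(host.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        HostPartitioner p = new HostPartitioner(4);
        // URLs from the same host always land in the same partition,
        // so a per-host crawl delay can be enforced by one fetcher task.
        List<String> urls = List.of(
                "https://example.org/a",
                "https://example.org/b",
                "https://example.com/c");
        for (String u : urls) {
            System.out.println(u + " -> partition " + p.partition(u));
        }
    }
}
```

Because partitioning is deterministic, two crawler workers never fetch from the same host concurrently, which is what makes politeness enforceable in a distributed setting.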

Capabilities

  • Real-time stream-based crawling and processing.
  • Recursive crawling with low latency.
  • Fault-tolerant and scalable architecture.
  • Customizable topology for different crawling strategies.
  • Efficient handling of redirects, errors, and content updates.
  • Modular design for easy extension and integration.
  • Suitable for both small-scale and enterprise-grade deployments.
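The modular, customizable-topology idea above can be sketched without any Storm dependency: a pipeline of independent stages (fetch, parse, index), each of which can be swapped or extended. This is a conceptual sketch only; all names below are hypothetical. In an actual StormCrawler deployment these stages are Storm bolts wired together into a topology, with tuples flowing between them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a modular crawl pipeline. Each stage mimics a
// bolt that transforms a tuple (here just a String) and emits it downstream.
public class MiniPipeline {
    private final List<UnaryOperator<String>> stages = new ArrayList<>();

    /** Add a processing stage; stages run in insertion order. */
    public MiniPipeline addStage(UnaryOperator<String> stage) {
        stages.add(stage);
        return this;
    }

    /** Push one input through every stage, like a tuple through a topology. */
    public String process(String input) {
        String current = input;
        for (UnaryOperator<String> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        String result = new MiniPipeline()
                .addStage(url -> "fetched(" + url + ")")  // fetch stage
                .addStage(doc -> "parsed(" + doc + ")")   // parse stage
                .addStage(doc -> "indexed(" + doc + ")")  // index stage
                .process("https://example.org/");
        System.out.println(result);
        // prints: indexed(parsed(fetched(https://example.org/)))
    }
}
```

Swapping a stage (for example, a different parser) changes the crawl strategy without touching the rest of the pipeline, which is the property the modular design above provides at scale.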

Benefits

  • Rapid development with ready-to-use components.
  • High performance for time-sensitive data extraction.
  • Flexibility to adapt to various crawling scenarios.
  • Open-source and community-supported.
  • Easy integration with existing data pipelines.
  • Maintained and used in production by multiple organizations.
  • Reduces complexity of building distributed crawlers from scratch.