
Apache Hive is a distributed data warehouse system built on Apache Hadoop. It enables reading, writing, and managing large datasets stored in distributed systems using SQL. Hive supports data warehousing tasks like ETL, reporting, and analysis, and integrates with various storage formats and engines.

Vendor

The Apache Software Foundation

Product details

Apache Hive

Apache Hive is a distributed, fault-tolerant data warehouse system built on top of Apache Hadoop. It enables scalable analytics by allowing users to read, write, and manage large datasets stored in distributed systems using SQL. Hive is designed for traditional data warehousing tasks such as ETL, reporting, and data analysis, and integrates with various storage formats and engines.

Features

  • SQL-based query language with support for SQL:2003, SQL:2011, and SQL:2016 features
  • Extensibility through user-defined functions (UDFs), aggregates (UDAFs), and table functions (UDTFs)
  • Support for multiple file formats including CSV, TSV, Parquet, ORC, and custom formats
  • Execution on Apache Tez or MapReduce, with LLAP (Live Long and Process) daemons for low-latency queries
  • HiveServer2 for multi-client concurrency and authentication
  • Hive Metastore (HMS) for centralized metadata management
  • Integration with Apache Iceberg for cloud-native table formats
  • Built-in support for ACID transactions on ORC tables
  • Data compaction and replication capabilities
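To make the ACID and compaction features above concrete, here is a minimal Python sketch that renders the HiveQL involved. The helper names (`acid_orc_table_ddl`, `major_compaction`) and the `web_events` table are hypothetical and used only for illustration. The sketch only builds statement strings; running them would require a Hive client and a cluster configured for transactions, which is assumed rather than shown.

```python
def acid_orc_table_ddl(table, columns):
    """Render HiveQL for a transactional (ACID) table stored as ORC.

    `columns` is a list of (name, hive_type) pairs. Hypothetical helper:
    it only formats text and does not talk to Hive.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        "STORED AS ORC\n"
        "TBLPROPERTIES ('transactional'='true')"
    )


def major_compaction(table):
    """Render the HiveQL that requests a major compaction of a table."""
    return f"ALTER TABLE {table} COMPACT 'major'"


if __name__ == "__main__":
    # Example table and columns are made up for this sketch.
    print(acid_orc_table_ddl("web_events",
                             [("event_id", "BIGINT"), ("payload", "STRING")]))
    print(major_compaction("web_events"))
```

The generated statements could then be submitted through HiveServer2 with whichever client a deployment already uses.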

Capabilities

  • Scalable query processing over petabytes of data
  • Structured access to semi-structured and unstructured data
  • Metadata-driven architecture for interoperability with tools like Spark, Impala, and Presto
  • Interactive SQL queries via LLAP, which can deliver sub-second response times for suitable workloads
  • Cost-based query optimization using Apache Calcite
  • Secure access with Kerberos authentication and integration with Apache Ranger and Atlas
  • REST-style interface for metadata operations via WebHCat
  • Streaming data ingestion, plus replication for backup and recovery
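As an illustration of the WebHCat REST interface mentioned above, the following Python sketch builds the URL for listing the tables in a database. The host name is hypothetical, and port 50111 is assumed here as the commonly used WebHCat default; the function name `webhcat_list_tables_url` is also made up for this example. Passing `user.name` as a query parameter is assumed to apply to unsecured clusters, while Kerberos-secured deployments authenticate differently.

```python
from urllib.parse import quote, urlencode


def webhcat_list_tables_url(host, database, user, port=50111):
    """Build a WebHCat URL that lists the tables in `database`.

    Hypothetical helper: it only formats the URL and sends no request.
    """
    path = f"/templeton/v1/ddl/database/{quote(database, safe='')}/table"
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}{path}?{query}"


if __name__ == "__main__":
    # Host and user are placeholders for this sketch.
    print(webhcat_list_tables_url("hive.example.internal", "default", "analyst"))
```

An HTTP GET against a URL of this shape is expected to return a JSON document naming the tables, which makes the metastore reachable from tools that cannot speak JDBC or Thrift.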

Benefits

  • Simplifies big data analytics with familiar SQL syntax
  • Enables efficient data warehousing on Hadoop-based infrastructure
  • Reduces latency for interactive queries
  • Enhances data governance and security
  • Facilitates integration with modern data lake architectures
  • Offers flexibility in data format and storage options
  • Scales out as nodes are added to the cluster
  • Open source, with community-driven development