Stackable Operator for Apache Hive
This is a Kubernetes operator that manages Apache Hive metastores. The Apache Hive metastore (HMS) stores information on the location of tables and partitions in file and blob storage such as HDFS and S3.
Only the metastore is supported, not Hive itself. There are several reasons why running Hive on Kubernetes may not be an optimal solution. The most obvious one is that Hive requires YARN as an execution framework, and YARN assumes much of the same role as Kubernetes, i.e. assigning resources. For this reason the Stackable Data Platform provides Trino as a query engine instead of Hive. Trino still uses the Hive metastore, hence the inclusion of this operator.

There are multiple tools that can use the HMS:
- HiveServer2
  - This is the "original" tool using the HMS.
  - It offers an endpoint where you can submit HiveQL (similar to SQL) queries.
  - It needs an execution engine, e.g. YARN or Spark.
    - This operator does not support running the Hive server because of the complexity needed to operate YARN on Kubernetes. YARN is a resource manager that is not meant to run on Kubernetes, as Kubernetes already manages its own resources.
    - We offer Trino as an (often drop-in) replacement (see below).
- Trino
  - Takes SQL queries and executes them against the tables whose metadata is stored in the HMS.
  - It should offer all the capabilities Hive offers, plus a lot of additional functionality, such as connections to other data sources.
- Spark
  - Takes SQL or programmatic jobs and executes them against the tables whose metadata is stored in the HMS.
- And others
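
As a rough illustration of how the tools above relate to the HMS: clients such as Trino and Spark typically reach the metastore through its Thrift endpoint. The Python sketch below only renders the settings usually involved. The host name `hive-metastore.example.svc` and the helper function names are hypothetical, and port 9083 is the upstream metastore default, which may differ in your deployment. It shows plain Trino and Spark configuration and is not a description of how the Stackable operators connect these components.

```python
DEFAULT_HMS_PORT = 9083  # upstream Hive metastore default Thrift port (assumption for your setup)


def metastore_uri(host: str, port: int = DEFAULT_HMS_PORT) -> str:
    """Build the Thrift URI that HMS clients connect to."""
    return f"thrift://{host}:{port}"


def trino_hive_catalog(uri: str) -> str:
    """Render a minimal Trino catalog properties file for the Hive connector."""
    return f"connector.name=hive\nhive.metastore.uri={uri}\n"


def spark_hive_conf(uri: str) -> dict:
    """Return the Spark setting that points Spark SQL at an external metastore."""
    return {"spark.hadoop.hive.metastore.uris": uri}


if __name__ == "__main__":
    # Hypothetical service name; replace with the address of your metastore.
    uri = metastore_uri("hive-metastore.example.svc")
    print(trino_hive_catalog(uri), end="")
    print(spark_hive_conf(uri))
```

Both clients end up pointing at the same URI, which is why a single metastore can serve several query engines at once.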
Required external component: An SQL database
The Hive metastore requires a database to store metadata. Consult the required external components page for an overview of the supported databases and minimum supported versions.