Overview of Apache Hive on Amazon EMR
What is Amazon EMR?
Amazon EMR is a cloud big data platform for large-scale data processing, interactive SQL queries, and machine learning (ML) applications using widely used open-source frameworks such as Apache Spark, Presto, Hadoop, Hive, Trino, HBase, and Flink.
To find out more about Amazon EMR, check out our post: http://www.cloudinfonow.com/amazon-emr-eks-serverless/
EMR Hive
Hive is an open-source data warehouse and analytics package that runs on top of a Hadoop cluster. Hive uses an SQL-like language called HiveQL, which is executed either as DAG-based Tez jobs or as MapReduce programs.
EMR supports both Hive 2.x and Hive 3.x. You can configure the Hive metastore as a local MySQL database on the master node of the cluster or on a supported external database. EMR also supports Hive ACID transactions and Hive LLAP in the latest 6.x releases.
Hive is a widely used SQL engine for big data frameworks and is predominantly used for batch analytics. Over the last few years, Hive usage has been gradually declining because of the demand for interactive and near-real-time analytics. It is being replaced by Presto/Trino for interactive queries and by Spark for batch analytics.
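As a rough sketch of the external-metastore option, the snippet below builds an EMR configuration object that uses the `hive-site` classification and the standard Hive JDO connection properties. The helper name `external_metastore_config`, the host, the database name, and the credentials are placeholders assumed for illustration, and the MariaDB JDBC driver class shown here may differ for your database.

```python
import json


def external_metastore_config(host, port, user, password, db_name="hive"):
    """Build an EMR `hive-site` configuration for an external metastore database.

    Hypothetical helper: only the classification name and the javax.jdo.*
    property keys are standard Hive settings; all values are placeholders.
    """
    jdbc_url = f"jdbc:mysql://{host}:{port}/{db_name}?createDatabaseIfNotExist=true"
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": jdbc_url,
                # Assumed driver class; adjust to match your database.
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


if __name__ == "__main__":
    config = external_metastore_config(
        host="metastore.example.internal",  # placeholder host
        port=3306,
        user="hive_user",  # placeholder user
        password="change-me",  # placeholder password
    )
    # The JSON printed here can be supplied as the cluster configuration.
    print(json.dumps(config, indent=2))
```

The resulting JSON would typically be passed in when the cluster is created, so that Hive on the master node connects to the external database instead of the local MySQL instance.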
The following are some of the key features of Hive:
- Hive allows applications to be written in SQL without the need to develop MapReduce jobs.
- Hive integration is available with the majority of BI tools on the market, including Tableau, Alteryx, BusinessObjects, and Looker.
- Hive supports external tables. This feature is widely used on EMR: data is stored in S3 and logical tables are defined in Hive, which eliminates the need to load data into a Hive database.
- Hive supports partitioning, bucketing, complex nested data structures (such as ARRAY, MAP, and STRUCT), and UDFs.
- Hive supports LDAP-based user authentication.
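To illustrate the external-table feature, here is a minimal sketch that generates a HiveQL `CREATE EXTERNAL TABLE` statement over data already sitting in S3. The helper name `external_table_ddl`, the table and column names, and the bucket path are all hypothetical; only the HiveQL keywords are standard.

```python
def external_table_ddl(table, columns, partition_cols, s3_location, file_format="PARQUET"):
    """Return a HiveQL CREATE EXTERNAL TABLE statement for data stored in S3.

    `columns` and `partition_cols` are lists of (name, hive_type) tuples.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    parts = ", ".join(f"{name} {hive_type}" for name, hive_type in partition_cols)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{s3_location}'"
    )


if __name__ == "__main__":
    ddl = external_table_ddl(
        table="sales.orders",  # hypothetical table
        columns=[
            ("order_id", "BIGINT"),
            ("items", "ARRAY<STRING>"),  # complex nested type
        ],
        partition_cols=[("order_date", "STRING")],
        s3_location="s3://example-bucket/orders/",  # placeholder bucket
    )
    print(ddl)
```

Because the table is external, dropping it removes only the metadata and leaves the files in S3 untouched. After creating a partitioned external table, existing partitions still need to be registered, for example with `MSCK REPAIR TABLE`.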
The following are some of the limitations of Hive:
- Limited support for ACID transactions. ACID is supported only for the ORC data format.
- Although Hive 2.x supports indexes (the feature was removed in Hive 3.0), their use is very limited because of latency. Partitioning and bucketing are recommended instead.
- No support for triggers or geospatial analytics.
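The first two limitations can be made concrete with a short sketch: a transactional table has to be stored as ORC, and partitioning plus bucketing takes the place of indexes. The helper name `acid_bucketed_table_ddl` and all table and column names below are hypothetical; the generated statement uses standard HiveQL clauses.

```python
def acid_bucketed_table_ddl(table, columns, partition_col, bucket_col, num_buckets):
    """Return HiveQL for a partitioned, bucketed, transactional ORC table.

    `columns` is a list of (name, hive_type) tuples; `partition_col` is a
    single (name, hive_type) tuple; `bucket_col` must be one of the columns.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be a positive integer")
    if bucket_col not in [name for name, _ in columns]:
        raise ValueError("bucket_col must be a regular (non-partition) column")
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    part_name, part_type = partition_col
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({part_name} {part_type})\n"
        f"CLUSTERED BY ({bucket_col}) INTO {num_buckets} BUCKETS\n"
        f"STORED AS ORC\n"  # ACID tables require the ORC format
        f"TBLPROPERTIES ('transactional'='true')"
    )


if __name__ == "__main__":
    print(
        acid_bucketed_table_ddl(
            table="sales.customers",  # hypothetical table
            columns=[("customer_id", "BIGINT"), ("email", "STRING")],
            partition_col=("country", "STRING"),
            bucket_col="customer_id",
            num_buckets=8,
        )
    )
```

Partitioning prunes whole directories at query time and bucketing groups rows by a hash of the bucket column, which is why the two together are the usual substitute for indexes. The bucket count of 8 is an arbitrary illustrative value.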
EMR Hive Architecture
The following is the high-level master-slave architecture of EMR Hive.
Amazon EMR now includes the EMR runtime for Hive, a performance-optimized runtime environment that includes custom performance improvements. With the EMR runtime for Hive, your queries run up to 1.25 times faster. The EMR runtime for Hive is 100% API-compatible with open-source Hive. The highlights are given below.
EMR Hive Pricing
Hive is one of the frameworks available on the Amazon EMR service. Hence, EMR Hive pricing depends on the usage of EMR clusters. More details on EMR pricing are available in our post: http://www.cloudinfonow.com/aws-emr-pricing/