Will Drill leverage Hive indexing?

If we index a table in Hive, will Drill leverage that index when querying the Hive table through Drill's Hive storage plugin?
We ask because we have a partitioned table in Hive, and our analytics query has both a partitioned and a non-partitioned column in its WHERE clause, so we want to index the non-partitioned column in Hive.

Drill does not currently leverage Hive indexes, but the Drill team is working on adding index support for some storage layers. Please get in touch with us on the mailing list if you are still interested in having index support added for Hive: http://drill.apache.org/mailinglists/
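For reference, the Hive-side index the question refers to would be created roughly as below. This is a minimal sketch with hypothetical table and column names, it applies only to Hive 1.x/2.x (CREATE INDEX was removed in Hive 3.0), and, as noted above, Drill will not use it today.
-- Hypothetical Hive 1.x/2.x index on the non-partitioned filter column
CREATE INDEX idx_non_part_col ON TABLE my_partitioned_table (non_part_col)
AS 'COMPACT' WITH DEFERRED REBUILD;
-- Build (and later refresh) the index data
ALTER INDEX idx_non_part_col ON my_partitioned_table REBUILD;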

Related

Optimize Temporary Table on Presto/Hive SQL

I would like to optimize the computation time of queries run on Presto/Hive SQL. One of the techniques I used on Redshift was to improve the efficiency of temporary tables, as in the following:
BEGIN;
CREATE TEMPORARY TABLE my_temp_table(
column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b) -- Assuming you are sorting or grouping by column_b
;
INSERT INTO my_temp_table SELECT column_a, column_b FROM my_table;
COMMIT;
I have tried that on Presto/Hive SQL but it is not supported. Do you know the equivalent of this technique on Presto/Hive SQL?
Many thanks!
Redshift is a relational database; Presto is a distributed SQL query engine. Presto currently supports neither temporary tables nor index creation, but you can create tables from a SQL statement via CREATE TABLE AS - see the Presto documentation.
You can optimize the performance of Presto in two ways:
Optimizing the query itself
Optimizing how the underlying data is stored
One of the best articles around is Top 10 Performance Tuning Tips for Amazon Athena - Athena is an AWS service based on Presto 0.172, so the tips should also work for Presto.
I am not a Redshift expert, but it seems you want to precompute a data set, distribute it, and sort it by selected columns so that it is faster to query.
This corresponds to the Presto Hive connector's ability to (see the sketch after this list):
partition data -- rows with the same value in the partitioning column(s) form a single partition, which is a folder on storage; do not partition on high-cardinality columns. This is defined using the partitioned_by table property.
bucket data -- rows are grouped into files using a hash of the bucketing column(s); this is similar to partitioning to a certain extent. This is defined using the bucketed_by and bucket_count table properties.
sort data -- within each data file, rows are sorted by the given column(s). This is defined using the sorted_by table property.
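For illustration, here is a minimal sketch of such a CREATE TABLE AS in Presto. The catalog, schema, table, and column names are hypothetical; the bucketed_by/sorted_by choices mirror the Redshift distkey/sortkey above, and a partitioned_by property could be added for a low-cardinality column.
-- Precompute the data set with the Hive connector, bucketed and sorted
CREATE TABLE hive.default.my_precomputed_table
WITH (
    format = 'ORC',
    bucketed_by = ARRAY['column_a'],  -- analogous to distkey(column_a)
    bucket_count = 32,                -- hypothetical bucket count
    sorted_by = ARRAY['column_b']     -- analogous to sortkey(column_b)
)
AS
SELECT column_a, column_b
FROM hive.default.my_table;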
See examples in the Trino (formerly Presto SQL) Hive connector documentation.
Note: while I realize the documentation is scarce at the moment, I filed an issue to improve it. In the meantime, you can get additional information on the Trino (formerly Presto SQL) community Slack.

Are two tables (native, external) always required in Hive for querying a DynamoDB table from AWS EMR?

Are two Hive tables (native and external) always required for querying a DynamoDB table from AWS EMR?
I have created a native Hive table (via CTAS, create table as select) from a Hive external table that was mapped to a DynamoDB table. My (read) queries against the external table are slow and consume DynamoDB read throughput, whereas queries against the native table are fast and consume no read throughput.
My questions:
Is it standard/best practice to create an external table mapped to a DynamoDB table, then create a CTAS table from it, and run all read queries against the CTAS table?
Where or how do DynamoDB GSIs come into the picture on the Hive side of things? Out of curiosity I tried to map an external Hive table column to a DynamoDB GSI and, somewhat expectedly, saw NULLs.
So, back to question #2: how are GSIs used with a native or external Hive table?
Thanks,
The answer is no, two tables are not always required.
However, from my observation, if a native Hive table is populated (via CTAS) from a Hive external table that references a DynamoDB table, then DynamoDB read capacity is not consumed when you query the native table from EMR. You do, however, have to account for the periodic refresh of the native table's data, which does read from DynamoDB.
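As a minimal sketch of that two-table pattern (table, column, and DynamoDB attribute names here are hypothetical), the EMR Hive DynamoDB storage handler maps the external table, and the CTAS materializes a native copy:
-- External table mapped to the DynamoDB table; queries here consume read throughput
CREATE EXTERNAL TABLE ddb_orders_ext (
    order_id    string,
    customer_id string,
    total       double
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "Orders",
    "dynamodb.column.mapping" = "order_id:OrderId,customer_id:CustomerId,total:Total"
);
-- Native table materialized from the external one; subsequent reads hit HDFS/S3,
-- not DynamoDB, until you re-run the refresh
CREATE TABLE orders_native STORED AS ORC
AS SELECT * FROM ddb_orders_ext;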

AWS S3 - Inserting into bucketed ORC table

I'm looking at storing data in S3 in ORC format for querying with Athena.
I want to partition the data like so ...
.../year=2019/month=7/
... and bucket the data further by id (each id will have multiple records for each month, and there are lots of ids).
I want to be able to insert new data into this structure daily... I understand that I can't use the INSERT INTO statement from Athena because bucketed tables are not supported.
What would be the best way to insert data daily into a table of this structure? Is it even possible to do with bucketed data?
Cheers
Presto has allowed inserts into existing partitions of bucketed partitioned tables since Presto 312. If Athena does not support this, you can quite easily run a Presto cluster yourself, e.g. using the Starburst Presto AWS integration (I can recommend this for other reasons too, as it can be far cheaper than Athena if you run more than just a few queries; disclaimer: I'm from Starburst).
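As a minimal sketch (schema, table, and column names are hypothetical, and this assumes a Presto >= 312 cluster whose Hive connector points at the same metastore and S3 data), the daily load could look like:
-- Bucketed, partitioned ORC table; partition columns must come last
CREATE TABLE hive.default.events (
    id      varchar,
    payload varchar,
    year    integer,
    month   integer
)
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['year', 'month'],
    bucketed_by = ARRAY['id'],
    bucket_count = 64   -- hypothetical bucket count
);
-- Daily insert: Presto routes rows into the right year/month partition and buckets
INSERT INTO hive.default.events
SELECT id, payload, year, month
FROM hive.default.staging_events;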

Pushing Hive query to database level

I have tabular data of 100 Million records, each record having 15 columns.
I need to query 3 columns of this data and filter out the records to be used in further processing.
Currently I'm deciding between two approaches
Approach 1
Store the data as a csv or parquet in HDFS. When I need to query read the whole data and query using Spark SQL.
Approach 2
Create a Hive table using HiveContext and persist the table and Hive-metadata. Query this table when needed using HiveContext.
Doubts:
In Approach 2, is the query pushed down to the storage level (HDFS) so that only the records which satisfy the criteria are read and returned? Or is the entire data set read into memory (as is the case with most Spark jobs) and the query then run using the metadata?
Runtime: Of the two approaches, which one will be faster?
Please note that the Hive setup isn't Hive over Spark, it's HiveContext provided with Spark.
Spark Version: 2.2.0
In Approach 2, you should have the Hive table structured and stored in the proper way.
Spark doesn't load all the data if the Hive table is partitioned and stored in a file format that supports indexing (like ORC).
Spark's optimizer will use partition pruning and predicate pushdown and load only the relevant data for further processing (transformations/actions).
Partition pruning:
Choose an appropriate column (one that distributes data evenly across partitions) to partition the Hive table.
Spark partition pruning works efficiently with the Hive metastore. It will look only into the relevant partitions, based on the partition column used in the WHERE clause of your query.
Predicate pushdown:
ORC files have min/max indexes and Bloom filters. These also work for string columns in ORC (I'm not sure about the latest Parquet string support), but they are more efficient on numerical columns.
Spark will read only the rows that match the filters, as it pushes the filters down to the underlying storage (the ORC files).
Below is a sample Spark snippet to create such a Hive table (assuming raw_df is the DataFrame created from your raw data):
sorted_df = raw_df.sort("column2")
sorted_df.write.mode("append").format("orc").partitionBy("column1").saveAsTable("hive_table_name")
This will partition the data by the values of column1, save the ORC files in HDFS, and update the Hive metastore.
We sort the table by column2, assuming we are going to use column2 in our query's WHERE clause (sorting is needed for an efficient ORC index).
You can then query Hive and load a Spark DataFrame with only the relevant data. Below is a sample:
filtered_df = spark.sql('SELECT column1,column2,column3 FROM hive_table_name WHERE column1= "some_value1" AND column2= "some_value2"')
In the above sample, Spark will look only into the some_value1 partition, since column1 is the partition column of the Hive table we created.
Spark will then push the predicate (i.e. the filter) column2 = "some_value2" down to the ORC files under the some_value1 partition only.
Here Spark will load only the values of column1, column2, and column3, ignoring all other columns in the table.
Unless you combine the second approach with a more advanced storage layout (bucketBy / DISTRIBUTE BY), which can be used to optimize the query, there should be no difference between the two as long as you don't use schema inference in Approach 1 (you'll have to provide a schema for the DataFrameReader).
Bucketing can be used to optimize execution plans for joins, aggregations, and filters on the bucketing column, but everything is still executed by Spark. In general, Spark will use Hive only as a metastore, not as an execution engine.

PXF Hive plugin: selecting only the columns used in the query

Is there a way for PXF to select only the columns used in the query, apart from Hive partition filtering?
I have data stored in Hive-ORC format and using pxf external table to execute queries in HAWQ. The biggest tables are stored in Hive and we cannot make another copy of data in HAWQ.
Thanks--
P.S - Does the query optimizer collect stats on external tables in HAWQ 2.0?
You can always run a select foo from bar type query on external tables in HAWQ. However, if your question is whether PXF actually does column projection to avoid reading all the columns, then the answer is no. Currently PXF reads all columns from an ORC file and returns the records to HAWQ, which then does the projection filtering on its end. However, https://issues.apache.org/jira/browse/HAWQ-583 is actively being worked on and should land in an upcoming version of HAWQ; it will push column projection down to ORC to improve read performance of ORC files.
Yes, the query optimizer does collect statistics on external tables; this is also handled by PXF. However, this only works for some data sources: https://issues.apache.org/jira/browse/HAWQ-44
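For context, here is a minimal sketch of how such a PXF external table is defined and analyzed in HAWQ. The host, database/table/column names are hypothetical, and the default PXF port and the Hive profile are assumptions.
-- External table served by PXF over the Hive/ORC data
CREATE EXTERNAL TABLE hive_orc_ext (
    id   integer,
    name text
)
LOCATION ('pxf://namenode-host:51200/default.my_hive_table?PROFILE=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
-- Collect statistics for the optimizer (supported for some PXF data sources, per HAWQ-44)
ANALYZE hive_orc_ext;
-- Today all columns are read and projection happens on the HAWQ side;
-- HAWQ-583 will push this projection down to ORC
SELECT id FROM hive_orc_ext WHERE name = 'foo';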