AWS S3 - Inserting into bucketed ORC table

I'm looking at storing data in S3 in ORC format for querying with Athena.
I want to partition the data like so ...
.../year=2019/month=7/
... and bucket the data further by id (each id will have multiple records for each month, and there are lots of ids)
I want to be able to insert new data into this structure daily... I understand that I can't use the INSERT INTO statement from Athena because bucketed tables are not supported.
What would be the best way to insert data daily into a table of this structure? Is it even possible to do with bucketed data?
Cheers

Since version 312, Presto allows inserts into existing partitions of bucketed, partitioned tables. If Athena does not support this, you can fairly easily run a Presto cluster yourself, e.g. using the Starburst Presto AWS integration (I can recommend this for other reasons too, as it can be way cheaper than using Athena if you run more than just a few queries. Disclaimer: I'm from Starburst)
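If you go the self-managed Presto route, the table and the daily load could look roughly like this. This is only a sketch: the catalog, schema, table, and column names are hypothetical, and the bucket count is an arbitrary choice.
-- hypothetical bucketed, partitioned table in the Hive catalog
CREATE TABLE hive.mydb.events (
    id      varchar,
    payload varchar,
    year    integer,   -- partition columns must come last
    month   integer
)
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['year', 'month'],
    bucketed_by = ARRAY['id'],
    bucket_count = 32
);
-- daily insert into the existing year/month partition
INSERT INTO hive.mydb.events
SELECT id, payload, 2019, 7
FROM hive.mydb.daily_staging;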

Related

How to createOrReplaceTempView in Delta Lake?

I want to use Delta Lake tables in my Hive Metastore on Azure Data Lake Gen2 as basis for my company's lakehouse.
Previously, I used "regular" Hive catalog tables. I would load data from Parquet into a Spark DataFrame and create a temp view using df.createOrReplaceTempView("TableName"), so I could use Spark SQL or the %%sql magic to do ETL. After doing this, I could use spark.sql or %%sql on TableName. When I was done, I would write my tables to the Hive metastore.
However, what if I don't want to perform this saveAsTable operation and write to my Data Lake? What would be the best way to perform ETL with SQL?
I know I can persist Delta tables in the Hive metastore in a number of ways, for instance by creating a managed catalog table through df.write.format("delta").saveAsTable("LakeHouseDB.TableName")
I also know that I can create a DeltaTable object through DeltaTable(spark, table_path_data_lake), but then I can only use the Python API and not SQL.
Does there exist some equivalent of createOrReplaceTempView(), or is there a better way to achieve ETL with SQL without 'writing' to the data lake first?
However, what if I don't want to perform this saveAsTable operation and write to my Data Lake? What would be the best way to perform ETL with SQL?
That is not possible with Delta Lake, since it relies heavily on a transaction log (_delta_log) stored under the data directory of a Delta table.

Hive table in Power BI

I want to create a hive table which will store data with orc format and snappy compression. Will power bi be able to read from that table? Also do you suggest any other format/compression for my table?
ORC is a file format that primarily works with Hive and is highly optimized for HDFS read operations. Power BI can connect to Hive using the Hive ODBC data connection. So, I think if you use Hive all the time, you can use this format to store the data. But if you want the flexibility of both Hive and Impala and use the Cloudera-provided Impala ODBC driver, you can consider using Parquet.
Both ORC and Parquet have their own advantages and disadvantages. The main deciding factors are the tools that access the data, how nested the data is, and how many columns there are.
If you have many columns with nested data and want to use both Hive and Impala to access it, go with Parquet. If you have few columns, a flat data structure, and a huge amount of data, go with ORC.
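For reference, a Hive table stored as ORC with Snappy compression can be declared roughly like this (a sketch; the table and column names are hypothetical):
-- hypothetical table; Snappy compression is set via the ORC table property
CREATE TABLE sales_orc (
    order_id string,
    amount   double
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
Power BI would then read it through the Hive ODBC connection mentioned above.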

Optimize Temporary Table on Presto/Hive SQL

I would like to optimize the computation time for queries run on Presto/Hive SQL. One of the techniques I used on Redshift was to improve the efficiency of temporary tables, as in the following:
BEGIN;
CREATE TEMPORARY TABLE my_temp_table(
column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b) -- Assuming you are sorting or grouping by column_b
;
INSERT INTO my_temp_table SELECT column_a, column_b FROM my_table;
COMMIT;
I have tried that on Presto/Hive SQL, but it is not supported. Do you know the equivalent of this technique on Presto/Hive SQL?
Many thanks!
Redshift is a relational database, while Presto is a distributed SQL query engine. Presto currently doesn't support the creation of temporary tables, nor the creation of indexes. But you can create tables based on a SQL statement via CREATE TABLE AS (see the Presto documentation).
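A minimal sketch of that, reusing the column names from the question:
CREATE TABLE my_work_table AS
SELECT column_a, column_b
FROM my_table;
The resulting table is not temporary, so you would drop it yourself once the computation is done.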
You optimize the performance of Presto in two ways:
Optimizing the query itself
Optimizing how the underlying data is stored
One of the best articles around is Top 10 Performance Tuning Tips for Amazon Athena - Athena is an AWS service based on Presto 0.172, and therefore the tips should also work for Presto.
I am not a Redshift expert, but it seems you want to precompute a data set, distributing it and sorting it by selected columns, so that it is faster to query.
This corresponds to Presto Hive connector ability to:
partition data -- data with the same value in the partitioning column(s) forms a single partition, which is a folder on storage; do not use partitioning on high-cardinality columns. This is defined using the partitioned_by table property
bucket data -- data is grouped in files using a hash of the bucketing column(s); this is similar to partitioning to a certain extent. This is defined using the bucketed_by and bucket_count table properties.
sort data -- within each data file, data is sorted by the given column(s). This is defined using the sorted_by table property.
See the examples in the Trino (formerly Presto SQL) Hive connector documentation.
Note: while I realize the documentation is scarce at the moment, I filed an issue to improve it. In the meantime, you can get additional information on the Trino (formerly Presto SQL) community Slack.
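Putting those properties together, the precomputed table could be expressed as a CREATE TABLE AS along these lines. This is only a sketch reusing the question's column names; the format and bucket count are arbitrary choices, and partitioned_by could be added in the same WITH clause.
CREATE TABLE my_precomputed_table
WITH (
    format = 'ORC',
    bucketed_by = ARRAY['column_a'],  -- roughly analogous to the Redshift distkey
    bucket_count = 32,
    sorted_by = ARRAY['column_b']     -- roughly analogous to the Redshift sortkey
)
AS
SELECT column_a, column_b
FROM my_table;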

Are two tables (native, external) always required in Hive for querying a DynamoDB table from AWS EMR?

Are two hive tables (native, external) always required for querying a DynamoDB table from an AWS EMR?
I have created a native Hive table (via CTAS, create table as select) from a Hive external table that was mapped to a DynamoDB table. My (read) query times against the external table are slow and consume read throughput, whereas queries against the native table are fast and do not consume read throughput.
My questions:
Is this a standard/best practice, i.e., create an external table mapped to a DynamoDB table, then create a CTAS table and query against the CTAS table for all read query use cases?
Where or how do GSIs on DynamoDB come into the picture on the Hive side of things? Out of curiosity, I tried to map an external Hive table column to a DynamoDB GSI and, somewhat expectedly, saw NULLs.
So, back to question #2: how are GSIs used with a native or external Hive table?
Thanks,
The answer is no.
However, from my observation, if a native Hive table is populated (via CTAS) from a Hive external table that references a DynamoDB table, read throughput is not consumed when you query the native Hive table from EMR. You do have to take into account the periodic updates (data refreshes) of the native Hive table, though.
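For reference, the two-table pattern described above typically looks something like this in Hive on EMR. This is a sketch only: the DynamoDB table name, attribute names, and column mapping are hypothetical.
-- external table mapped to the DynamoDB table (reads consume DynamoDB throughput)
CREATE EXTERNAL TABLE ddb_orders (
  order_id    string,
  customer_id string,
  total       double
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,customer_id:CustomerId,total:Total"
);
-- native copy created via CTAS; queries against it do not touch DynamoDB
CREATE TABLE orders_local STORED AS ORC AS
SELECT * FROM ddb_orders;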

Pushing Hive query to database level

I have tabular data of 100 Million records, each record having 15 columns.
I need to query 3 columns of this data and filter out the records to be used in further processing.
Currently I'm deciding between two approaches
Approach 1
Store the data as CSV or Parquet in HDFS. When I need to query, read all the data and query it using Spark SQL.
Approach 2
Create a Hive table using HiveContext and persist the table and Hive-metadata. Query this table when needed using HiveContext.
Doubts:
In Approach 2, is the query pushed down to the storage level (HDFS), so that only the records which satisfy the criteria are read and returned? Or is all the data read into memory (as is the case with most Spark jobs) and the query then run using the metadata?
Runtime: Of the two approaches, which one will be faster?
Please note that the Hive setup isn't Hive over Spark, it's HiveContext provided with Spark.
Spark Version: 2.2.0
In Approach 2, you should have the Hive table structured and stored in the proper way.
Spark doesn't load all the data if the Hive table is partitioned and stored in a file format that supports indexing (like ORC).
Spark's optimizer will use partition pruning and predicate pushdown and load only the relevant data for further processing (transformations/actions).
Partition Pruning:
Choose an appropriate column (one which distributes data across partitions evenly) to partition the Hive table.
Spark partition pruning works efficiently with the Hive metastore. It will look only into the relevant partitions, based on the partition column used in the WHERE clause of your query.
Predicate Pushdown:
ORC files have min/max indexes and bloom filters. They work for string columns as well in ORC (not sure about the latest Parquet string support), but are more efficient on numerical columns.
Spark will read only the rows that match the filters, as it pushes the filters down to the underlying storage (ORC files).
Below is a sample Spark snippet to create such a Hive table (assuming raw_df is the DataFrame created from your raw data):
sorted_df = raw_df.sort("column2")
sorted_df.write.mode("append").format("orc").partitionBy("column1").saveAsTable("hive_table_name")
This will partition the data by column1 values, save the ORC files in HDFS, and update the Hive metastore.
We sort the table by column2, assuming we are going to use column2 in our query's WHERE clause (the sort is needed for an efficient ORC index).
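For comparison, the table produced by that snippet could also be declared explicitly in HiveQL, trimmed here to the three queried columns. The bloom filter property is optional and corresponds to the ORC indexes mentioned above; the column types are assumptions.
CREATE TABLE hive_table_name (
  column2 string,
  column3 string
)
PARTITIONED BY (column1 string)
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns' = 'column2');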
Then you can query Hive and load a Spark DataFrame with the relevant data. Below is a sample:
filtered_df = spark.sql('SELECT column1,column2,column3 FROM hive_table_name WHERE column1= "some_value1" AND column2= "some_value2"')
In the above sample, Spark will look only into the some_value1 partition, as column1 is the partition column of the Hive table we created.
Then Spark will push the predicate (i.e. the filter) "some_value2" on column2 down to the ORC files, only under the "some_value1" partition.
Here Spark will load only the values of column1, column2, and column3, ignoring the other columns in the table.
Unless you combine the second approach with a more advanced storage layout (bucketBy / DISTRIBUTE BY), which can be used to optimize the query, there should be no difference between these two as long as you don't use schema inference in Approach 1 (you'll have to provide a schema for the DataFrameReader).
Bucketing can be used to optimize execution plans for joins, aggregations, and filters on the bucketing column, but everything is still executed with Spark. In general, Spark will use Hive only as a metastore, not as an execution engine.
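As a sketch of that more advanced layout, a bucketed table corresponding to the DataFrameWriter bucketBy option can also be declared in Spark SQL roughly as follows. The table name, column names, and bucket count are hypothetical, and the syntax assumes a reasonably recent Spark version.
-- hypothetical bucketed table created with CTAS in Spark SQL
CREATE TABLE records_bucketed
USING ORC
CLUSTERED BY (column1) INTO 16 BUCKETS
AS SELECT column1, column2, column3 FROM records_raw;
Joins and aggregations on column1 can then take advantage of the bucketing, while execution still happens entirely in Spark.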