PXF Hive plugin: select only the columns used in the query

Is there a way to make PXF read only the columns used in the query, apart from Hive partition filtering?
I have data stored in Hive in ORC format and I am using a PXF external table to run queries in HAWQ. The biggest tables are stored in Hive and we cannot make another copy of the data in HAWQ.
Thanks--
P.S - Does the query optimizer collect stats on external tables in HAWQ 2.0?

You can always run a SELECT foo FROM bar style query on external tables in HAWQ. However, if your question is whether PXF actually performs column projection to avoid reading all the columns, then the answer is no. Currently PXF reads all columns from an ORC file and returns the records to HAWQ, which then does the projection on its end. However, https://issues.apache.org/jira/browse/HAWQ-583 is actively being worked on and should be released in an upcoming version of HAWQ; it will push column projections down to ORC to improve read performance of ORC files.
Yes, the query optimizer does collect statistics on external tables; this is also handled by PXF. However, this is only available for some data sources: https://issues.apache.org/jira/browse/HAWQ-44

Add a new partition scheme to an existing table in Athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads the data onto its own disks or into memory, Athena works by scanning data in S3. This is what gives you the scale and low cost of the service.
This means that you have to keep your data in different folders with a meaningful structure, such as year=2019 and year=2020, and make sure that each folder contains all of the data for that year and nothing else.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
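As a rough sketch of the CTAS approach, the helper below only assembles the statement text; it does not talk to Athena. The target table name table1_partitioned, the columns col_a and col_b, and the S3 location are hypothetical placeholders. The table properties format, parquet_compression, partitioned_by and external_location are assumed to be the right ones for your Athena engine version, so check them against the Athena documentation before running the statement. Partition columns are placed last in the SELECT list, which is what Athena expects for CTAS.

```python
def build_ctas(source, target, columns, partition_cols, location,
               fmt="PARQUET", compression="SNAPPY"):
    """Render an Athena CTAS statement that rewrites `source` as a partitioned table."""
    # Partition columns have to come last in the SELECT list.
    select_cols = [c for c in columns if c not in partition_cols] + list(partition_cols)
    partitions = ", ".join(f"'{c}'" for c in partition_cols)
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (\n"
        f"  format = '{fmt}',\n"
        f"  parquet_compression = '{compression}',\n"
        f"  partitioned_by = ARRAY[{partitions}],\n"
        f"  external_location = '{location}'\n"
        f")\n"
        f"AS SELECT {', '.join(select_cols)}\n"
        f"FROM {source}"
    )


ctas_sql = build_ctas(
    source="table1",
    target="table1_partitioned",                         # hypothetical name
    columns=["ourDateStringCol", "col_a", "col_b"],      # col_a/col_b are placeholders
    partition_cols=["ourDateStringCol"],
    location="s3://example-bucket/table1_partitioned/",  # hypothetical location
)
print(ctas_sql)
```

The original table1 stays untouched; queries would then be pointed at the new, partitioned table.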

Optimize Temporary Table on Presto/Hive SQL

I would like to optimize the computation time of queries run on Presto/Hive SQL. One of the techniques I used on Redshift was to improve the efficiency of temporary tables, as in the following:
BEGIN;
CREATE TEMPORARY TABLE my_temp_table(
column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b) -- Assuming you are sorting or grouping by column_b
;
INSERT INTO my_temp_table SELECT column_a, column_b FROM my_table;
COMMIT;
I have tried this on Presto/Hive SQL but it is not supported. Do you know the equivalent of this technique in Presto/Hive SQL?
Many thanks!
Redshift is a relational database, while Presto is a distributed SQL query engine. Presto currently supports neither temporary tables nor indexes, but you can create tables from a SQL statement via CREATE TABLE AS (see the Presto documentation).
You can optimize the performance of Presto in two ways:
Optimizing the query itself
Optimizing how the underlying data is stored
One of the best articles around is Top 10 Performance Tuning Tips for Amazon Athena. Athena is an AWS service based on Presto 0.172, so the tips should also work for Presto.
I am not a Redshift expert but it seems you want to precompute a data set, distributing it and sorting by selected columns, so that it is faster to query.
This corresponds to the Presto Hive connector's ability to:
partition data -- data with the same value in the partitioning column(s) forms a single partition, which is a folder on storage; do not partition on high-cardinality columns. This is defined using the partitioned_by table property.
bucket data -- data is grouped into files using a hash of the bucketing column(s); this is similar to partitioning to a certain extent. This is defined using the bucketed_by and bucket_count table properties.
sort data -- within each data file, data is sorted by the given column(s). This is defined using the sorted_by table property.
See the examples in the Trino (formerly Presto SQL) Hive connector documentation.
Note: I realize documentation is scarce at the moment, so I filed an issue to improve it. In the meantime, you can get additional information on the Trino (formerly Presto SQL) community Slack.
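To make the mapping from the Redshift example more concrete, here is a minimal sketch that renders a CREATE TABLE AS statement using the table properties named above. It only builds the SQL text. The target table name and the bucket count are made-up values, and treating bucketed_by as the counterpart of distkey and sorted_by as the counterpart of sortkey is a loose analogy rather than an exact equivalence.

```python
def build_presto_ctas(target, source, columns, bucketed_by, bucket_count,
                      sorted_by=(), partitioned_by=(), fmt="ORC"):
    """Render a Hive-connector CTAS with bucketing, sorting and optional partitioning."""
    def array(cols):
        return "ARRAY[" + ", ".join(f"'{c}'" for c in cols) + "]"

    props = [f"format = '{fmt}'"]
    if partitioned_by:
        props.append(f"partitioned_by = {array(partitioned_by)}")
    props.append(f"bucketed_by = {array(bucketed_by)}")
    props.append(f"bucket_count = {bucket_count}")
    if sorted_by:
        props.append(f"sorted_by = {array(sorted_by)}")

    # Partition columns have to be the last columns of the new table.
    select_cols = [c for c in columns if c not in partitioned_by] + list(partitioned_by)
    return (
        f"CREATE TABLE {target}\n"
        f"WITH ({', '.join(props)})\n"
        f"AS SELECT {', '.join(select_cols)} FROM {source}"
    )


presto_sql = build_presto_ctas(
    target="my_precomputed_table",   # hypothetical name
    source="my_table",
    columns=["column_a", "column_b"],
    bucketed_by=["column_a"],        # roughly the role of Redshift's distkey
    bucket_count=32,                 # arbitrary illustrative value
    sorted_by=["column_b"],          # roughly the role of Redshift's sortkey
)
print(presto_sql)
```

The resulting table is a regular table rather than a temporary one, so it has to be dropped explicitly once it is no longer needed.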

Deleting records in a table with billion records using spark or scala

We have a table in Azure Data Warehouse with 17 billion records. We now have a scenario where we have to delete records from this table based on a WHERE condition. We are writing Spark code in Scala in Azure Databricks notebooks.
We searched for different options to do this in Spark, but all of them suggested first reading the entire table, deleting the records from it, and then overwriting the entire table in the Data Warehouse. However, this approach will not work in our case because of the huge number of records in our table.
Can you please suggest how we can achieve this using Spark/Scala?
1) We checked whether we can call a stored procedure from Spark/Scala code in Azure Databricks, but Spark does not support stored procedures.
2) We tried reading the entire table first to delete the records, but it goes into a never-ending loop.
It is possible to create a view with a SELECT clause that matches your requirement and then use that view.

Pushing Hive query to database level

I have tabular data of 100 million records, each record having 15 columns.
I need to query 3 columns of this data and filter out the records to be used in further processing.
Currently I'm deciding between two approaches:
Approach 1
Store the data as CSV or Parquet in HDFS. When I need to query, read the whole dataset and query it using Spark SQL.
Approach 2
Create a Hive table using HiveContext and persist the table and the Hive metadata. Query this table when needed using HiveContext.
Doubts:
In Approach 2, is the query pushed down to the database level (HDFS) so that only the records which satisfy the criteria are read and returned? Or is the entire dataset read into memory (as is the case with most Spark jobs) and the query then run using the metadata?
Runtime: Of the two approaches, which one will be faster?
Please note that the Hive setup isn't Hive over Spark, it's HiveContext provided with Spark.
Spark Version: 2.2.0
In Approach 2, the Hive table should be structured and stored in a proper way.
Spark doesn't load all the data if the Hive table is partitioned and stored in a file format that supports indexing (like ORC).
Spark's optimized engine will use partition pruning and predicate pushdown and load only the relevant data for further processing (transformations/actions).
Partition Pruning:
Choose an appropriate column (one that distributes data evenly across partitions) to partition the Hive table.
Spark partition pruning works efficiently with the Hive metastore. It will look only into the relevant partitions, based on the partition column used in the WHERE clause of your query.
Predicate PushDown:
ORC files have min/max indexes and bloom filters. These also work for string columns in ORC (not sure about the latest Parquet string support), but they are more efficient on numerical columns.
Spark pushes the filter down to the underlying storage (the ORC files), so it reads only the data that can match the filter.
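The effect of min/max statistics can be illustrated with a small pure-Python simulation. This is not how ORC is implemented; it is a toy model, assuming fixed-size "stripes" that each record the minimum and maximum of one column, and the numbers are invented. It shows why a filter can skip most stripes when the data is sorted on the filter column and almost none when it is not.

```python
def make_stripes(values, stripe_size):
    """Split values into fixed-size stripes and record min/max statistics for each."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes


def scan_equals(stripes, target):
    """Return matching rows plus the number of stripes that actually had to be read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        # Skip the stripe when its statistics prove the value cannot be inside.
        if target < stripe["min"] or target > stripe["max"]:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe["rows"] if v == target)
    return matches, stripes_read


# 1000 distinct values, interleaved so that every stripe spans a wide range.
unsorted_values = [(i * 37) % 1000 for i in range(1000)]
sorted_values = sorted(unsorted_values)

found_unsorted, read_unsorted = scan_equals(make_stripes(unsorted_values, 100), 420)
found_sorted, read_sorted = scan_equals(make_stripes(sorted_values, 100), 420)
print("stripes read (unsorted):", read_unsorted)
print("stripes read (sorted):", read_sorted)
```

Both scans find the same row, but the sorted layout lets the reader rule out every stripe whose value range does not cover the filter value.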
Below is a sample Spark snippet to create such a Hive table (assuming raw_df is the DataFrame created from your raw data).
sorted_df = raw_df.sort("column2")
sorted_df.write.mode("append").format("orc").partitionBy("column1").saveAsTable("hive_table_name")
This will partition the data by the column1 values, save the ORC files in HDFS, and update the Hive metastore.
The table is sorted by column2 on the assumption that column2 will be used in the WHERE clause of the query (sorting is needed for the ORC index to be efficient).
You can then query Hive and load a Spark DataFrame with only the relevant data. Below is a sample.
filtered_df = spark.sql('SELECT column1, column2, column3 FROM hive_table_name WHERE column1 = "some_value1" AND column2 = "some_value2"')
In the above sample, Spark will look only into the some_value1 partition, as column1 is the partition column of the Hive table we created.
Spark will then push the predicate (i.e. the filter) "some_value2" for column2 down to the ORC files under the "some_value1" partition only.
Here Spark will load only the values of column1, column2, and column3, ignoring the other columns in the table.
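The three steps described here (partition pruning, the pushed-down filter, and column projection) can be mimicked with a toy in-memory table. This is only a mental model and not Spark code: the nested dictionary stands in for partition folders holding columnar files, and column4 and all the values are invented for illustration.

```python
# Toy "table": partition value -> column name -> list of values (columnar layout).
table = {
    "some_value1": {"column2": ["a", "some_value2"], "column3": [1, 2], "column4": ["x", "y"]},
    "other_value": {"column2": ["some_value2", "b"], "column3": [3, 4], "column4": ["z", "w"]},
}


def query(table, partition_value, filter_col, filter_value, wanted_cols):
    """Mimic partition pruning, a pushed-down filter and column projection."""
    touched = {"partitions": [], "columns": set()}
    rows = []
    # Partition pruning: only the partition named in the WHERE clause is opened.
    partition = table.get(partition_value, {})
    if partition:
        touched["partitions"].append(partition_value)
    # Column projection: only the filter column and the selected columns are read.
    needed = set(wanted_cols) | {filter_col}
    columns = {name: partition[name] for name in needed if name in partition}
    touched["columns"] = set(columns)
    for i, value in enumerate(columns.get(filter_col, [])):
        if value == filter_value:  # the pushed-down predicate
            row = {c: columns[c][i] for c in wanted_cols if c in columns}
            row["column1"] = partition_value  # partition column comes from the folder name
            rows.append(row)
    return rows, touched


rows, touched = query(table, "some_value1", "column2", "some_value2", ["column2", "column3"])
print(rows)
print(touched)
```

The other_value partition and column4 are never touched, even though other_value also contains a row with column2 equal to "some_value2".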
Unless you combine the second approach with a more advanced storage layout (bucketBy / DISTRIBUTE BY), which can be used to optimize the query, there should be no difference between the two, as long as you don't use schema inference in Approach 1 (you'll have to provide a schema to the DataFrameReader).
Bucketing can be used to optimize execution plans for joins, aggregations and filters on the bucketing column, but everything is still executed by Spark. In general, Spark uses Hive only as a metastore, not as an execution engine.

Schema Evolution in Parquet Hive table

I have a lot of data in a Parquet based Hive table (Hive version 0.10). I have to add a few new columns to the table. I want the new columns to have data going forward. If the value is NULL for already loaded data, that is fine with me.
If I add the new columns and do not update the old Parquet files, I get an error, which looks strange as I am adding only String columns.
Error getting row data with exception java.lang.UnsupportedOperationException: Cannot inspect java.util.ArrayList
Can you please tell me how to add new fields to a Parquet Hive table without affecting the data already in the table?
I use Hive version 0.10.
Thanks.
1)
Hive starting with version 0.13 has Parquet schema evolution built in.
https://issues.apache.org/jira/browse/HIVE-6456
https://github.com/Parquet/parquet-mr/pull/297
P.S. Note that out-of-the-box support for schema evolution might take a toll on performance. For example, Spark has a knob to turn Parquet schema evolution on and off. After one of the recent Spark releases, it is now off by default because of the performance hit (especially when there are a lot of Parquet files). Not sure if Hive 0.13+ has such a setting too.
2)
I also wanted to suggest creating views in Hive on top of Parquet tables where you expect frequent schema changes, and using the views everywhere instead of the tables directly.
For example, if you have two tables, A and B, with compatible schemas, but table B has two more columns, you could work around this with
CREATE VIEW view_1 AS
SELECT col1,col2,col3,null as col4,null as col5 FROM tableA
UNION ALL
SELECT col1,col2,col3,col4,col5 FROM tableB
;
So you don't actually have to recreate any tables as @miljanm has suggested; you can just recreate the view. It'll help with the agility of your projects.
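The view above can be tried out without a Hive cluster. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in so that it runs anywhere; the sample rows and the TEXT column types are made up, and Hive itself may behave differently in details such as the type it assigns to the NULL columns.

```python
import sqlite3

view_conn = sqlite3.connect(":memory:")
view_conn.executescript("""
    CREATE TABLE tableA (col1 TEXT, col2 TEXT, col3 TEXT);
    CREATE TABLE tableB (col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT, col5 TEXT);
    INSERT INTO tableA VALUES ('a1', 'a2', 'a3');
    INSERT INTO tableB VALUES ('b1', 'b2', 'b3', 'b4', 'b5');

    -- Old rows expose NULL for the columns they never had.
    CREATE VIEW view_1 AS
    SELECT col1, col2, col3, NULL AS col4, NULL AS col5 FROM tableA
    UNION ALL
    SELECT col1, col2, col3, col4, col5 FROM tableB;
""")

view_rows = view_conn.execute("SELECT * FROM view_1 ORDER BY col1").fetchall()
print(view_rows)
```

When the schema changes again, only the view definition needs to be dropped and recreated; the underlying tables stay as they are.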
Create a new table with the two new columns. Insert data by issuing:
insert into new_table select old_table.col1, old_table.col2,...,null,null from old_table;
The last two nulls are for the two new columns. That's it.
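The same backfill can be sketched end to end with SQLite standing in for Hive, only so that the example is runnable anywhere. The two-column old_table, the names col3 and col4 for the new columns, and the sample rows are assumptions for illustration.

```python
import sqlite3

copy_conn = sqlite3.connect(":memory:")
copy_conn.executescript("""
    CREATE TABLE old_table (col1 TEXT, col2 TEXT);
    CREATE TABLE new_table (col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT);
    INSERT INTO old_table VALUES ('x1', 'x2'), ('y1', 'y2');

    -- Existing rows get NULL for the two new columns.
    INSERT INTO new_table
    SELECT old_table.col1, old_table.col2, NULL, NULL FROM old_table;
""")

new_rows = copy_conn.execute("SELECT * FROM new_table ORDER BY col1").fetchall()
print(new_rows)
```

Unlike the view-based workaround, this copies the data once, so it costs extra storage and time on a large table.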
If you have too many columns, it may be easier to write a program that reads the old files and writes the new ones.
Hive 0.10 does not support schema evolution in Parquet as far as I know. Hive 0.13 does, so you may want to try upgrading Hive.