I have a table in hive which is partitioned on date basis. Sometimes I don't get data for some dates but still it creates empty partitions in hive warehouse dir. Is it possible to avoid creating empty partitions in hive?
Thanks!!
Related
I have an BigQuery date partitioned table that I want to convert to an ingestion time partitioned table (partitioned on _PARTITIONTIME), using the current date partitioning to feed into _PARTITIONTIME. How can I do this?
WHY? Because only ingestion partitioned tables can be incrementally loaded to using BigQuery's scheduled query functionality (by using the #rundate parameter as partition decorator)
One option is to disable the scheduled query first and copy the column-based partitioned table to a ingestion-time partitioned table. Then re-enable the scheduled query. Please follow steps:
Disable the scheduled query through the BigQuery UI: disable option on scheduled query
Create a new ingestion-time partitioned table (called ingestion_time_partitioned) and copy the column-based partitioned table (called table_column_partitioned) to the new table (ingestion_time_partitioned).
Edit the scheduled query to write to the new ingestion-time partitioned table (ingestion_time_partitioned). Please remember to re-enable the scheduled query and remove the partition field (which is used for column-based partition).
Copying from column-based partitioned table to a ingestion-time partitioned table will correctly map the column-based partition to the ingestion-time-based partition. And copy job on BigQuery is free. For more information about copying partitioned tables, please see https://cloud.google.com/bigquery/docs/managing-partitioned-tables#copying_partitioned_tables
As an example consider I have a data of all the major sports events happened.Schema given below
EventName,Date,Month,Year,City
This data that is physically structured in HDFS on year,date,month.
Now I want to create virtual partitions on that based on some other column value, eg. City.The data will be stored physically in HDFS in year,date,month structure only but my metadata keeps track of the virtual partition.
Can hive metastore do it for me?
I don't think so it will happen. Actually partitioning in Hive means creates different dir for different partition. And metastore only contains metadata of table. It won't control the actual data. Technically when ever we query based on that partitioned column in Hive table, the query will execute on that exact partitioned dir only. So virtual partitioning with out changing hdfs structure in the sense the real data will be in one dir so the query has to be execute on entire data. So technically optimisation is not at all happening.
I have tabular data of 100 Million records, each record having 15 columns.
I need to query 3 columns of this data and filter out the records to be used in further processing.
Currently I'm deciding between two approaches
Approach 1
Store the data as a csv or parquet in HDFS. When I need to query read the whole data and query using Spark SQL.
Approach 2
Create a Hive table using HiveContext and persist the table and Hive-metadata. Query this table when needed using HiveContext.
Doubts:
In Approach 2, is the query pushed to database level(HDFS) and only the records which satisfy the criteria are read and returned? Or the entire data is read into memory(as is the case with most spark jobs) and then query is run using the metadata?
Runtime: Of the two approaches, which one will be faster?
Please note that the Hive setup isn't Hive over Spark, it's HiveContext provided with Spark.
Spark Version: 2.2.0
In approach2, You should have hive table structured and stored in proper way.
Spark doesn't load all the data if hive table is partitioned and stored in file format that supports indexing(like ORC).
Spark optimized engine will use partition pruning and predicate push down and load only relevant data for further processing(transformation/action).
Partition Pruning:
choose appropriate column(which distribute data across partition evenly) to partition the hive table.
Spark partition pruning works efficiently with hive meta store. It will look into only relevant partition as per partition_column used in WHERE clause of your query.
Predicate PushDown:
ORC file has min/max index and bloom filters . Will work for string columns also in ORC(not sure about latest parquet string support), but more efficient on numerical column.
Spark will read only rows that are matching the filters as it pushed the filter down to underlying storage (orc files).
Below is a sample spark snippet to create such hive table. (assuming raw_df is the dataframe created from your raw data)
sorted_df = raw_df .sort("column2")
sorted_df.write.mode("append").format("orc").partitionBy("column1").saveAsTable("hive_table_name")
This will partition the data as per column1 values save orc files in hdfs and update hive metastore.
Sorting the table using column2 assuming that we are going to use column2 in our query WHERE clause.(sort is needed for efficient orc index)
Then you can query hive and load spark dataframe with relevant data . below is the sample.
filtered_df = spark.sql('SELECT column1,column2,column3 FROM hive_table_name WHERE column1= "some_value1" AND column2= "some_value2"')
In above sample spark will look into only some_value1 partition as column1 is the partition column in hive table created .
Then Spark will push the predicate(i,e filter) "some_value2" for column2 in orc files only under "some_value1" partition.
Here Spark will load only values of column1,column2,column3 , ignoring even other columns in the table.
Unless you combine the second approach with more advanced storage layout (bucketBy / DISTRIBUTE BY) which can be used to optimize the query there shoulde be no difference between between these two as long as you don't use schema inference in the approach 1 (you'll have to provide schema for the DataFrameReader).
Bucketing can be used to optimize execution plans for joins, aggregations and filters on bucketing column, but everything is still executed with Spark. In general Spark will use Hive only as metastore, not as execution engine.
After insertion of orc files into the folder of a table with hdfs copy, how to update that hive table's data to see those data when querying with hive.
Best Regards.
If the table is not partitioned then once the files are in HDFS in the folder that is specified in the LOCATION clause, then the data should be available for querying.
If the table is partitioned then u first need to run an ADD PARTITION statement.
As mentioned in upper answer by belostoky. if the table is not partitioned then you can directly query your table with the updated data
But in case if you table is partitioned you need to add partitions first in hive table that you can do using
You can use alter table statement to add partitions like shown below
ALTER TABLE table1
ADD PARTITION (dt='<date>')
location '<hdfs file path>'
once partitions are added hive metastore should be aware of changes so you need to run
msck repair table table1
to add partitions in metastore.
Once done you can query your data
I have created a hive table with dynamic partitioning on a column. Is there a way to directly load the data from files using "LOAD DATA" statement? Or do we have to only depend on creating a non-partitioned intermediate table and load file data to it and then inserting data from this intermediate table to partitioned table as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files based on partitions, you could directly copy the files to table location in HDFS under their partition directory manually created by you (OR just point to their current location in case of EXTERNAL table) and use the following ALTER command to ADD the partition. This way you could skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
No other go, if we need to insert directly, we'll need to specify partitions manually.
For dynamic partitioning, we need staging table and then insert from there.