spark-sql overwrite of a Hive table: why do duplicate records occur? - apache-spark-sql

Duplicate records appear when spark-sql overwrites a Hive table and the Spark job has failed stages, even though the DataFrame itself contains no duplicates. When I run the job again, the result is correct. This confuses me. Why does it happen?
e.g.:
dataFrame.write().mode(SaveMode.Overwrite).insertInto("outputTable");
There are no duplicate records in dataFrame,
but duplicate records exist in the Hive table outputTable.

Related

Drop part of partitions in Hive SQL

I have an external Hive table, partitioned on the date column. Each date holds data for multiple A/B test experiments.
I need to create a job that drops experiments which ended more than 6 months ago.
Dropping data in an external, partitioned Hive table removes the entire partition, in this case the data for one whole date. Is there a way to drop only part of a date?
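For illustration, a rough HiveQL sketch of the granularity problem described above; the table and column names (ab_test_results, dt, experiment_id) and the date value are placeholders, not taken from the question:
-- Partition-level operations can only address a whole date at a time:
ALTER TABLE ab_test_results DROP IF EXISTS PARTITION (dt='2020-01-01');
-- For an external table this only detaches the partition from the metastore
-- (the files stay on HDFS), and there is no partition-level handle on a single
-- experiment_id inside that date, which is the limitation described above.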

Response too large to return. Consider specifying a destination table in your job configuration

I am getting the following error when running a SQL script with a two-table join.
table_a has 603388514 records and is left joined with table_b, which has 11147 records.
The select statement returns 147 columns in total.
"Response too large to return. Consider specifying a destination table in your job configuration."
Any suggestion / help on how to overcome this error message?

Overwrite and append a BigQuery table from a CSV file

I have an existing BigQuery table with date, stock id and stock price columns. Using bq load, I can either overwrite or append data from a CSV file. From the CSV file, I
want to overwrite rows in the BigQuery table if the date and stock id already exist (updating the price),
else want to append new rows to the BigQuery table if the date and stock id do not exist there yet.
In that scenario you would want to do a two-step process:
Load the data into a staging table.
Issue a MERGE statement, and define the WHEN MATCHED and WHEN NOT MATCHED criteria as you need them; a sketch follows below.
Documentation on the MERGE statement can be found here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement
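As a rough sketch of step 2, assuming the target table is mydataset.stock_prices with columns date, stock_id and price, and the CSV has already been loaded into a staging table mydataset.stock_prices_staging with the same schema (all names here are placeholders):
MERGE `mydataset.stock_prices` AS target
USING `mydataset.stock_prices_staging` AS staging
ON target.date = staging.date AND target.stock_id = staging.stock_id
-- a row already exists for this date and stock: overwrite the price
WHEN MATCHED THEN
  UPDATE SET price = staging.price
-- no row yet for this date and stock: append it
WHEN NOT MATCHED THEN
  INSERT (date, stock_id, price)
  VALUES (staging.date, staging.stock_id, staging.price);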

Is there a way to load a partial table from Hive into a Pig relation?

I am currently loading a Hive table into a Pig relation using the code below.
a = LOAD 'hive_db.hive_table' using org.apache.hive.hcatalog.pig.HCatLoader();
This step gets all the records from the Hive table into Pig, but for my current scenario I don't need the whole table in Pig. Is there a way to filter out the unwanted records while getting the data from Hive?
No, you can't load a partial table. However, you can filter it after the LOAD statement. You can use FILTER on specific partitions, or filter out records based on column values in the loaded table.
Examples here
If your Hive table is partitioned, you can load only certain partitions by doing a FILTER statement immediately after your LOAD statement.
From the documentation:
If only some partitions of the specified table are needed, include a
partition filter statement immediately following the load statement in
the data flow. (In the script, however, a filter statement might not
immediately follow its load statement.) The filter statement can
include conditions on partition as well as non-partition columns.
A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- date is a partition column; age is not
B = filter A by date == '20100819' and age < 30;
The above will load only the partition where date == '20100819'. This kind of partition pruning only works for partition columns; conditions on non-partition columns (like age) are applied after the data is read.

show hive partitions in a nested sub query

I have a hive table that is partitioned by day (e.g. 20151001, 20151002,....).
Is there a hive query to list these partitions in a way that it is possible to be used in a nested sub query?
That is can I do something along the line of:
SELECT * FROM (SHOW PARTITIONS test) a where ...
The query
SELECT ptn FROM test
returns as many rows as the number of rows in the test table. I want it to return only as many rows as the number of partitions (without using the DISTINCT function)
A potential solution is to find the partitions from the HDFS location of the table of interest, using either a shell script or Python.
The data that corresponds to the Hive table is stored in HDFS, e.g.
/hive/database/table/partition/datafiles
in your case,
/hive/database/table/20151001/datafiles
If the table is bucketed, each partition directory contains one file per bucket.
Once you have the above path, create a shell script that loops through the partition folders (in this case 20151001 etc.),
captures each folder name in a shell variable, and passes it as a parameter to the Hive query, as sketched below.
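For that last step, a rough sketch of the parameterized Hive query, assuming the partition column is ptn as in the question and that the shell wrapper passes each folder name it finds via -hiveconf (the variable name ptn_value is a placeholder):
-- invoked by the wrapper, e.g.: hive -hiveconf ptn_value=20151001 -f query.hql
SELECT *
FROM test
WHERE ptn = '${hiveconf:ptn_value}';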