I have a Hive table partitioned on snapshot_dt.
I have to delete records from each partition that have turned 180 days old or 7 years old, depending on a condition.
I am iterating over the partitions one by one inside a for loop:
selecting all the records of the partition into the whole_data_df DataFrame,
selecting the records to be deleted into the save_to_data DataFrame,
and using except to get the records I want to keep:
var cons_rec_to_retain_df = whole_data_df.except(save_to_data)
Now I want to overwrite that particular Hive partition with the records in the cons_rec_to_retain_df DataFrame.
Please find my code below:
var query5=s"""select distinct snapshot_dt from workspace_x215579.ess_consldt_cust_prospect_ord_orc_test"""
var partition_load_dates = createDataframeFromSql(hc,query5)
var output = partition_load_dates.collect().toList
output.foreach(println)
for (d1 <- output) {
var partition_date: String = d1(0).toString
println("=====================this is the partition date========================")
println(partition_date)
var whole_data = s"""select * From workspace_x215579.ess_consldt_cust_prospect_ord_orc_test where snapshot_dt='$partition_date'"""
println("=====================this is whole partition data df========================")
var whole_data_df = createDataframeFromSql(hc,whole_data)
whole_data_df.show(5,false)
var select_data = s"""select * from workspace_x215579.ess_consldt_cust_prospect_ord_orc_test
where snapshot_dt='$partition_date'
and bus_bacct_num_src_id = 130
and (
  (bacct_status_cd = 'T' and datediff(to_date(current_timestamp()), snapshot_dt) > 180)
  or
  (bacct_status_cd <> 'T' and cast(datediff(to_date(current_timestamp()), snapshot_dt)/365.25 as int) > 7)
)"""
var save_to_data = createDataframeFromSql(hc,select_data)
println("=====================this is df to be removed========================")
save_to_data.show(5,false)
var cons_rec_to_retain_df = whole_data_df.except(save_to_data)
println("=====================this is df REcord to be inserted=======================")
cons_rec_to_retain_df.show(5,false)
println("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
cons_rec_to_retain_df.show(5)
cons_rec_to_retain_df.write.mode("append").format("orc").insertInto("workspace_x215579.ess_consldt_cust_prospect_ord_orc_test")
}//end of for loop
At the start I am setting:
hc.sql("""set hive.exec.dynamic.partition=true""")
hc.sql("""set hive.exec.dynamic.partition.mode=nonstrict""")
After running the code, the record count is the same as it was before. Can someone please suggest a proper way of doing this?
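mode("append") only adds rows and never removes the existing ones, which is why the count stays the same. Below is a minimal sketch of one way to replace the partition contents instead, in place of the write at the end of the loop. It assumes hc is a HiveContext (or a SparkSession with Hive support), that snapshot_dt comes back as the last column of the DataFrame (as it does with select *), and rec_to_retain is just an illustrative temp view name:

// Sketch only: materialise the retained rows first, so the plan is not still lazily
// reading the partition that is about to be overwritten.
cons_rec_to_retain_df.persist()
cons_rec_to_retain_df.count()

// Register the retained rows; on Spark 2.x use createOrReplaceTempView instead.
cons_rec_to_retain_df.registerTempTable("rec_to_retain")

// Dynamic-partition overwrite of just this partition; this relies on the
// hive.exec.dynamic.partition settings above and on snapshot_dt being the last column.
hc.sql("""
INSERT OVERWRITE TABLE workspace_x215579.ess_consldt_cust_prospect_ord_orc_test
PARTITION (snapshot_dt)
SELECT * FROM rec_to_retain
""")

On Spark 2.3+ another option is to set spark.sql.sources.partitionOverwriteMode=dynamic and use cons_rec_to_retain_df.write.mode("overwrite").insertInto(...), which replaces only the partitions present in the DataFrame.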
I have an old table with legacy data and approx 10,000 rows and a new table with about 500 rows. The columns are the same in both tables. I need to compare a few columns in the new table with the old one and report on data that is duplicated in the new table.
I've researched articles with similar issues, attempted table joins and where exists / where not exists clauses but I just can't get the SQL right. I have included my latest version.
One issue causing trouble for me, I think, is that there is no key as such, like a userid or similar unique identifier, in either table.
What I want to do is find the data in the "new" table where all columns except the "reference_number" (it doesn't matter whether that one matches or not) are duplicated, i.e. already exist in the "old" table.
I have this so far...
select
    old.reference_number,
    new.reference_number,
    new.component,
    new.privileges,
    new.protocol,
    new.authority,
    new.score,
    new.means,
    new.difficulty,
    new.hierarchy,
    new.interaction,
    new.scope,
    new.conf,
    new.integrity,
    new.availability,
    new.version
from old, new
where
    old.component = new.component
    and old.privileges = new.privileges
    and old.protocol = new.protocol
    and old.authority = new.authority
    and old.score = new.score
    and old.means = new.means
    and old.difficulty = new.difficulty
    and old.hierarchy = new.hierarchy
    and old.interaction = new.interaction
    and old.scope = new.scope
    and old.conf = new.conf
    and old.integrity = new.integrity
    and old.availability = new.availability
    and old.version = new.version
I have tried this, but it doesn't seem to pull out ALL of the data for some reason.
It is evident that there are actually MORE rows in the old table that are duplicated in the new table, but I'm only getting a small number of rows returned from the query.
Can anyone spot why that might be, is there another way I should be approaching this?
If it matters, this is Postgresql.
Thanks for any help given.
The following should do what you want:
select distinct o.reference_number,
n.reference_number,
n.component,
n.privileges,
n.protocol,
n.authority,
n.score,
n.means,
n.difficulty,
n.hierarchy,
n.interaction,
n.scope,
n.conf,
n.integrity,
n.availability,
n.version
from new n
inner join old o
on o.component = n.component and
o.privileges = n.privileges and
o.protocol = n.protocol and
o.authority = n.authority and
o.score = n.score and
o.means = n.means and
o.difficulty = n.difficulty and
o.hierarchy = n.hierarchy and
o.interaction = n.interaction and
o.scope = n.scope and
o.conf = n.conf and
o.integrity = n.integrity and
o.availability = n.availability and
o.version = n.version
You should use a left join and then select only the rows where the new values are null. The SQL should be something like this:
select
    old.reference_number,
    new.reference_number,
    new.component,
    new.privileges,
    new.protocol,
    new.authority,
    new.score,
    new.means,
    new.difficulty,
    new.hierarchy,
    new.interaction,
    new.scope,
    new.conf,
    new.integrity,
    new.availability,
    new.version
from old
left join new
    on old.component = new.component
    and old.privileges = new.privileges
    and old.protocol = new.protocol
    and old.authority = new.authority
    and old.score = new.score
    and old.means = new.means
    and old.difficulty = new.difficulty
    and old.hierarchy = new.hierarchy
    and old.interaction = new.interaction
    and old.scope = new.scope
    and old.conf = new.conf
    and old.integrity = new.integrity
    and old.availability = new.availability
    and old.version = new.version
where new.component is null
The table named student_master in BigQuery has 70,000 rows and I would like to retrieve rows using this query. I get no error when doing this; however, it only retrieves 52,226 rows (that is, not all of them). I tried using row_number() over a partition, as in the code below, but still didn't get all the data. What should I do?
I also tried the idea of using two queries ordered by id_student with limit 35000, one ascending (query1) and one descending (query2), but that will not work if the data increases (say, to 200,000 rows).
data = []
sql = ("SELECT id_student, class, name, "
       "ROW_NUMBER() OVER (PARTITION BY class ORDER BY class ASC) row_num "
       "FROM [project_name.dataset.student_master] "
       "WHERE not class = " + str(element['class']))
query = client.run_sync_query(sql)
query.timeout_ms = 20000
query.run()
for row in query.rows:
    data.append(row)
return data
I was able to gather 200,000+ rows by querying a public dataset, verified by using a counter variable:
query_job = client.query("""
SELECT ROW_NUMBER() OVER (PARTITION BY token_address ORDER BY token_address ASC) as row_number,token_address
FROM `bigquery-public-data.ethereum_blockchain.token_transfers`
WHERE token_address = '0x001575786dfa7b9d9d1324ec308785738f80a951'
ORDER BY 1
""")
contador = 0
for row in query_job:
    contador += 1
    print(contador, row)
In general, for big exports you should run an export job which will place your data into files in GCS.
https://cloud.google.com/bigquery/docs/exporting-data
But in this case you might just need to go through more pages of results:
If the rows returned by the query do not fit into the initial response, then we need to fetch the remaining rows via fetch_data():
query = client.run_sync_query(LIMITED)
query.timeout_ms = TIMEOUT_MS
query.max_results = PAGE_SIZE
query.run() # API request
assert query.complete
assert query.page_token is not None
assert len(query.rows) == PAGE_SIZE
assert [field.name for field in query.schema] == ['name']
iterator = query.fetch_data() # API request(s) during iteration
for row in iterator:
    do_something_with(row)
https://gcloud-python.readthedocs.io/en/latest/bigquery/usage.html
I have a members table with region values in it. This table is first populated at sign-up.
The states are sorted into an array for each region:
$region[1] = array("FL","GA","NC","SC","USVI","PR");
$region[2] = array("DC","KY","MD","OH","VA","WV");
$region[3] = array("DE","NJ","NY","PA");
$region[4] = array("CT","MA","ME","NH","RI","VT");
$region[5] = array("IA","IL","IN","MO","MI");
$region[6] = array("AL","AR","LA","MS","TN");
$region[7] = array("AZ","NM","OK","TX");
$region[8] = array("CO","KS","MN","MT","NE","ND","SD","UT","WI","WY");
I then do a recursive search to populate the table based on the region for that state.
$member_region = (recur_search($_SESSION['session_table_membersignup_mailing_state'], $region)) ? recur_search($_SESSION['session_table_membersignup_mailing_state'], $region) : 9;
I added additional regions, so that when new users sign up they will be sorted based on the new arrays:
$region[1] = array("FL","GA","NC","SC","USVI","PR");
$region[2] = array("DC","KY","MD","OH","VA","WV");
$region[3] = array("DE","NJ","NY","PA");
$region[4] = array("CT","MA","ME","NH","RI","VT");
$region[5] = array("IA","IL","IN","MO","MI");
$region[6] = array("AL","AR","LA","MS","TN");
$region[7] = array("AZ","NM","OK","TX");
$region[8] = array("ND","NE","MI","MN","SD","WI");
$region[9] = array("AK","CA","NV","OR","WA","HI");
$region[10] = array("CO","ID","MT","UT","WY");
I'm running into issues with how to retroactively update the SQL table for existing members using the new arrays.
Basically: I need to loop through the members table and update the regions value based on the mailing_state presence in the new arrays.
I'm struggling with writing a working SQL statement to do this. I figured that since this is a one-time update, I could just run it directly from the database.
I wanted to calculate the average on a grouped field, similar to the SQL queries below:
select count(*) as total_count
from tbl1
where col2 is NULL;
select col3, count(*)/total_count as avg_count
from tbl1
where col2 is NULL
group by col3;
Please find below the Spark statements that I ran. I already have the total_count.
val df = sqlContext.read.parquet("/user/hive/warehouse/xxx.db/fff")
val badDF = df.filter("col2 = ' '").withColumn("INVALID_COL_NAME", lit("XXX"))
val badGrp1 = df.groupBy("col3").count()
val badGrp2 = badGrp1.select(col("col3"),col("count").as("CNT"))
Now to find avg CNT/total_count, how to proceed?
I tried map and Row, but it didn't work:
val badGrp3 = badGrp2.map(row => Row(row._1, row._2/20)) // for now I am assuming 20 as total_count
Could someone please suggest how to proceed?
Thank you.
I don't know much about Scala, but from your code I think you've treated a Row as a Scala Tuple in this line of code:
val badGrp3 = badGrp2.map(row => Row(row._1, row._2/20))
To get data from a Row in Spark, you can use the accessor methods of Row, like this:
// suppose you are getting the 1st and 2nd value of row
// where the 2nd value (count) is a Long type value
row => Row(row.get(0), row.getLong(1)/20)
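As a side note, a DataFrame-level alternative that avoids handling Row at all is to add the average as a column with lit(). This is just a sketch, keeping the placeholder total of 20 from the question (the real value would come from the separate total_count query):

import org.apache.spark.sql.functions.{col, lit}

// Placeholder total; in practice this would be the collected total_count.
val totalCount = 20L

// Divide each group's count by the overall total to get avg_count.
val badGrp3 = badGrp2.withColumn("avg_count", col("CNT") / lit(totalCount))
badGrp3.show()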
I need to perform some numerical operations (using a UDF) on every column of my table, and for every column I get 2 values (mean and standard deviation). But the final result comes out like (mean_1, sd_1, mean_2, sd_2, mean_3, sd_3, ...), where 1, 2, ... are column indexes. I need the output for every column on a separate row, like:
mean_1, sd_1   // for col1
mean_2, sd_2   // for col2
...
Here is the Pig script I'm using:
data = LOAD 'input_file.csv' USING PigStorage(',') AS (C0,C1,C2);
grouped_data = GROUP data ALL;
res = FOREACH grouped_data GENERATE FLATTEN(data), AVG(data.$1) as mean, COUNT(data.$1) as count;
tmp = FOREACH res {
diff = (C1-mean)*(C1-mean);
GENERATE *,diff as diff;
};
grouped_diff = GROUP tmp all;
sq_tmp = FOREACH grouped_diff GENERATE flatten(tmp), SUM(tmp.diff) as sq_sum;
stat_tmp = FOREACH sq_tmp GENERATE mean as mean, sq_sum/count as variance, SQRT(sq_sum/count) as sd;
stats = LIMIT stat_tmp 1;
Could anybody please guide me on how to achieve this?
Thanks. I got the required output by creating tuples of mean and sd values for respective columns and then storing all such tuples in a bag. Then in the next step I flattened the bag.
tupled_stats = FOREACH raw_stats generate TOTUPLE(mean_0, var_0, sd_0) as T0, TOTUPLE(mean_1, var_1, sd_1) as T1, TOTUPLE(mean_2, var_2, sd_2) as T2;
bagged_stats = FOREACH tupled_stats generate TOBAG(T0, T1, T2) as B;
stats = foreach bagged_stats generate flatten(B);