Best practice to add time partitions to a table - hive

I have an event table partitioned by time (year, month, day, hour).
I want to join a few events in a Hive script that receives the year, month, day and hour as variables.
How can I also add, for example, the events from all 6 hours prior to my time, without running 'recover all partitions'?
Thanks

So basically what I needed was a way to take a date that the Hive script receives as a parameter and add all partitions from 3 hours before to 3 hours after that date, without recovering all partitions and without spelling out the specific hours in every WHERE clause.
I didn't find a way to do it inside the Hive script itself, so I wrote a quick Python script that gets a date and a table name, along with how many hours to add before/after.
When trying to run it inside the Hive script with:
!python script.py tablename ${hiveconf:my.date} 3
I was surprised to find that variable substitution does not take place in a line that starts with !.
My workaround was to get the date that the Hive script received from the step's log file on the machine, using something like:
cat /mnt/var/log/hadoop/steps/$(ls /mnt/var/log/hadoop/steps/ | sort -r | head -n 1)/stdout
and from there you can parse each Hive parameter in the Python code without passing it via Hive.
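
For reference, a minimal sketch of such a script (my own illustration, not the actual code; it assumes the partition columns are named year/month/day/hour as in the question and that the date argument looks like YYYY-MM-DD-HH):

# add_partitions.py -- hypothetical sketch, not the poster's actual script
# Usage: python add_partitions.py <table> <YYYY-MM-DD-HH> <hours_before_and_after>
import subprocess
import sys
from datetime import datetime, timedelta

table, date_str, hours = sys.argv[1], sys.argv[2], int(sys.argv[3])
center = datetime.strptime(date_str, "%Y-%m-%d-%H")

for offset in range(-hours, hours + 1):
    t = center + timedelta(hours=offset)
    ddl = (
        "ALTER TABLE {tbl} ADD IF NOT EXISTS PARTITION "
        "(year='{y:04d}', month='{m:02d}', day='{d:02d}', hour='{h:02d}')"
    ).format(tbl=table, y=t.year, m=t.month, d=t.day, h=t.hour)
    # One hive -e call per partition, as described above
    subprocess.check_call(["hive", "-e", ddl])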

Related

Query from a just-expired view in BQ

Is it possible to query a recently expired view in BigQuery and save a snapshot? (It expired 2 hours ago.)
You can try Managing tables.
In the documentation there are some examples of how to do that, in the section Restoring deleted tables.
You can undelete a table within seven days of deletion, including explicit deletions and implicit deletions due to table expiration. After seven days, it is not possible to undelete a table using any method, including opening a support ticket.
You can restore a deleted table by:
Using the @ snapshot decorator in the bq command-line tool
Using the client libraries
To restore a table, use a table copy operation with the @ snapshot decorator. First, determine a UNIX timestamp (in milliseconds) of a time when the table existed. Then, use the bq cp command with the snapshot decorator.
For example, enter the following command to copy mydataset.mytable at the time 1418864998000 into a new table mydataset.newtable.
bq cp mydataset.mytable@1418864998000 mydataset.newtable
(Optional) Supply the --location flag and set the value to your location.
You can also specify a relative offset. The following example copies the version of a table from one hour ago:
bq cp mydataset.mytable@-3600000 mydataset.newtable
For more information, see Restore a table from a point in time.
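As a small illustration (my own sketch, with placeholder table names), the millisecond timestamp for, say, two hours ago can be computed and plugged into the copy command like this:

# Sketch: UNIX timestamp in milliseconds for "2 hours ago",
# used as the snapshot decorator in a bq table copy.
import subprocess
import time

snapshot_ms = int((time.time() - 2 * 60 * 60) * 1000)
subprocess.check_call([
    "bq", "cp",
    "mydataset.mytable@{}".format(snapshot_ms),  # placeholder table names
    "mydataset.newtable",
])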

pyspark write overwrite is partitioned but is still overwriting the previous load

I am running a pyspark script where I'm saving off some data to a s3 bucket each time the script is run and I have this code:
data.repartition(1).write.mode("overwrite").format("parquet").partitionBy('time_key').save( "s3://path/to/directory")
It is partitioned by time_key, but at each run the latest data dump overwrites the previous data instead of being added as a new partition. The time_key is unique to each run.
Is this the correct code if I want to write the data to s3 and partition by time key at each run?
If you are on Spark 2.3+, this issue has been fixed via https://issues.apache.org/jira/browse/SPARK-20236
You have to set spark.sql.sources.partitionOverwriteMode to "dynamic" so that only the partitions present in the incoming data are overwritten.
Also, since you say time_key is unique for each run, you could probably just use append mode instead.
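For example, a minimal sketch (the session setup and output path are illustrative, and data is the DataFrame from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Replace only the partitions present in the incoming DataFrame;
# other existing partitions under the output path are kept.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# `data` is the DataFrame from the question
(data.repartition(1)
     .write
     .mode("overwrite")
     .format("parquet")
     .partitionBy("time_key")
     .save("s3://path/to/directory"))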

Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud Storage to store data, and several times throughout the day CSVs are uploaded there. I use BigQuery to run jobs that populate BigQuery tables with that data.
For reasons beyond my control, the CSVs contain duplicate data. So what I want to do is basically create a daily ETL job to append all new data to the existing tables, perhaps running at 1 am every day:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to convert them to a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Insert the de-duped temp table into the BigQuery table.
Delete the temp table
So basically I'm stuck at square 1 - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, and there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer so if there's a more simplistic way to accomplish this that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline to read from GCS (source) and write append to BigQuery (sink).
Your pipeline can remove duplicates directly itself; see the Apache Beam documentation on deduplication.
Create a cloud function to monitor your GCS bucket.
When a new file arrives, your cloud function is triggered automatically, which calls your Dataflow pipeline to start reading the new file, deduping it and writing the results to BigQuery.
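As a rough illustration of the Cloud Function piece only (not Graham's actual code; the project, region and template path are placeholders, and it assumes a Dataflow template that takes an inputFile parameter has already been built and staged):

# main.py for a GCS-triggered background Cloud Function (Python runtime).
# Launches a pre-staged Dataflow template for each new file.
from googleapiclient.discovery import build

PROJECT = "my-project"                               # placeholder
REGION = "us-central1"                               # placeholder
TEMPLATE = "gs://my-bucket/templates/dedupe_to_bq"   # placeholder staged template

def on_new_file(event, context):
    """Triggered by a new object arriving in the GCS bucket."""
    input_file = "gs://{}/{}".format(event["bucket"], event["name"])
    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            "jobName": "dedupe-load-{}".format(context.event_id),
            "parameters": {"inputFile": input_file},  # parameter name depends on your template
        },
    ).execute()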
So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar=<C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load
--skip_leading_rows=1
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VBScript called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable, which is used to find yesterday's data files in GCS and load them into a table, from which a de-duped (via grouping by all columns) table is created. That table is then appended to the main table, and the two intermediate tables are deleted.
The percent signs are doubled because this is saved as a .CMD file and run through Windows Task Scheduler.

How to add today's date into BigQuery destination table name

I am new to Google Cloud BigQuery. I am trying to schedule a job which runs a query periodically. In each run, I would like to create a destination table whose name contains today's date. I need something like:
bq query --destination=[project]:[dataset].[table name_date]
Is it possible to do that automatically? Any help is greatly appreciated.
This example uses shell scripting.
YEAR=$(date -d "$d" '+%Y')
MONTH=$(date -d "$d" '+%m')
DAY=$(date -d "$d" '+%d')
day_partition=$YEAR$MONTH$DAY
bq_partitioned_table="${bq_table}"'_'"${day_partition}"
bq query --destination_table=$bq_partitioned_table
See if it helps.
Where do you put your periodic query?
I always put it in a Datalab notebook, and then use the datetime module to get today's date and assign it to the destination table name.
Then I set the notebook to run every day at a certain time. Works great.
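The same idea in plain Python (a sketch using the google-cloud-bigquery client rather than Datalab's own helpers; the project, dataset and query are placeholders):

from datetime import date
from google.cloud import bigquery

client = bigquery.Client()

# e.g. my-project.mydataset.mytable_20240131 (all names here are placeholders)
destination = "my-project.mydataset.mytable_{}".format(date.today().strftime("%Y%m%d"))

# Write the query results into the date-suffixed destination table
job_config = bigquery.QueryJobConfig(destination=destination)
client.query("SELECT * FROM `my-project.mydataset.source_table`",
             job_config=job_config).result()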

run shell command inside Hive that uses Hive's variables value as input

I have a Python script that receives a Hive table name and 2 dates, and adds all partitions between those dates (it runs a bunch of hive -e 'alter table add partition (date=...)' commands).
What I would like to do is, when running a Hive script that has a hiveconf:date variable,
pass that variable to the Python script as input.
Something like:
!python addpartitions.py mytable date=${hiveconf:date}
but of course the variable substitution does not take place...
Any way to achieve this?