I am new to Google Cloud BigQuery. I am trying to schedule a job which runs a query periodically. In each run, I would like to create a destination table whose name contains today's date. I need something like:
bq query --destination=[project]:[dataset].[table name_date]
Is it possible to do that automatically? Any help is greatly appreciated.
This example is using shell scripting.
YEAR=$(date -d "$d" '+%Y')
MONTH=$(date -d "$d" '+%m')
DAY=$(date -d "$d" '+%d')
day_partition=$YEAR$MONTH$DAY
bq_partitioned_table="${bq_table}"'_'"${day_partition}"
bq query --destination=$bq_partitioned_table
See if it helps.
Where do you put your periodic query?
I always put in datalab notebook, and then use module datetime to get today's date and assign to the destination table name.
then set the notebook to run every day at certain time. Works great.
Related
In our BQ export schema, we have one table for each day as per the screenshot below.
I want to copy the tables before a certain date (2021-feb-07). I know how to copy one day at a time via the UI, but is there not a way to use the cloud console to write a code for copying the selected date range, all at once? Or maybe an sql command directly from a query window?
I think you should transform your sharding tables into a partitioned table. So you can handled your tables with just a single query. As mention in the official documentation, partitioned tables perform better.
To make the conversion, you can just execute the following commands in the console.
bq partition \
--time_partitioning_type=DAY \
--time_partitioning_expiration 259200 \
mydataset.sourcetable_ \
mydataset.mytable_partitioned
This will make your sharded tables sourcetable_(xxx) into a single partitioned table mytable_partitioned which can be query with just a single query trough your entire set of data entries.
SELECT
*
FROM
`myprojectid.mydataset.mytable_partitioned`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2022-01-01') AND TIMESTAMP('2022-01-03')
For more details about the conversion commands you can check this link. Also, I recommend to check the links about querying partionated tables and partiotioned tables for more details.
I'm working with hive/impala and I often run into the need of having to query the results of a show partition to get specific partition. Let's suppose I have a table tbl1 partitioned by fields country and date. So, show partitions tbl1 would result in something like this
country=c1/date=d1
country=c1/date=d3
country=c2/date=d2
I want to do something like select * from (show partitions tbl1) a where a.country='c1' and I want to do this in Hue or shell (hive and impala).
Is this possible?
I don't think what you are trying is possible inside impala/hive directly.
I can suggest an alternative way:
Use bash in combination impala/hive
so instead of entering into interactive modes in hive and impala , use the command line option to pass the query from bash shell itself so that the result comes back to bash shell and then use grep or other text processing commands to process it.
so it would seem like
impala -k -i <> --ssl -ca_cert <> -B -q "show partitions tbl1" | grep "country=c1"
here u need to put required values in place of <>
so in this way you can use grep/sed or other tools to get the desired output.
Obviously it depends on your use case what you exactly want..but I hope this can help
If someone ever finds this useful, this is what I ended up doing. Assuming you have spark-shell or spark2-shell, you can store the output of show partitions in a dataframe and then transform such dataframe. This is what I did (inside spark2-shell:
val df = spark.sql("show partitions tbl1").map(row => {
val arrayValues = row.getString(0).split("/")
(arrayValues.head.split("=")(1), arrayValues(1).split("=")(1))
}).toDF("country", "date")
this takes the list of partitions (a DataFrame[String]) and splits the dataframe by / and then for each piece, splits for = and takes the value
I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud to store data, and several times throughout the day data is uploaded into CSVs there. I use BigQuery to run jobs to populate BigQuery tables with the data there.
Because of reasons beyond my control, the CSVs have duplicate data. So what I want to do is basically create a daily ETL to append all new data to the existing tables, perhaps running at 1 am every day:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to convert them to a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Insert the de-duped temp table into the BigQuery table.
Delete the temp table
So basically I'm stuck at square 1 - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, and there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer so if there's a more simplistic way to accomplish this that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline to read from GCS (source) and write append to BigQuery (sink).
Your pipeline can remove duplicates directly itself. See here and here.
Create a cloud function to monitor your GCS bucket.
When a new file arrives, your cloud function is triggered automatically, which calls your Dataflow pipeline to start reading the new file, deduping it and writing the results to BigQuery.
So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar =< C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load
--skip_leading_rows=1
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%2 "SELECT * FROM data.data%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VB script called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable which is used to search for yesterday's data files in GCS and output to a table, from which a de-duped (via grouping by all columns) table is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are shown because it's saved as .CMD file and run through Windows Task Scheduler.
I'm using Hive to aggregate stats, and I want to do a breakdown by the industry our customers fall under. Ideally, I'd like to write the stats for each industry to a separate output file per industry (e.g. industry1_stats, industry2_stats, etc.). I have a list of various industries our customers are in, but that list isn't pre-set.
So far, everything I've seen from Hive documentation indicates that I need to know what tables I'd want beforehand and hard-code those into my Hive script. Is there a way to do this dynamically, either in the Hive script itself (preferable) or through some external code before kicking off the Hive script?
I would suggest go for a shell script..
Get the list of columns
hive -e 'select distinct industry_name from [dbname].[table_name];' > list
Iterate over every line... passing every line(industry names) of list as argument to the do while loop
tail -n +1 list | while IFS=' ' read -r industry_name
do
hive -hiveconf MY_VAR=$industry_name -f my_script.hql
done
save the shell script as test.sh
and in my_script.hql
use uvtest;
create table ${hiveconf:MY_VAR} (id INT, name CHAR(10));
you'll have to place both the test.sh and my_script.hql in the same folder.
Below command should create all the tables from list of column names.
sh test.sh
Follow this link for using hive in shell scripts:
https://www.mapr.com/blog/quick-tips-using-hive-shell-inside-scripts
I wound up achieving this using Hive's dynamic partitioning (each partition writes to a separate directory on disk, so I can just iterate through that file). The official Hive documentation on partitioning and this blog post were particularly helpful for me.
having an event tables, partitioned by time (year,month,day,hour)
Wanna join a few events in hive script that gets the year,month,day,hour as variables,
how can you add for example also events from all 6 hours prior to my time
without 'recover all...'
10x
So basically what i needed was a way to use a date that the Hive script receives as parameter
and add all partitions 3 hour before and 3 hours after that date, without recovering all partitions and add the specific hours in every Where clause.
Didn't find a way to do it inside the hive script, so i wrote a quick python code that gets a date and table name, along with how many hours to add from before/after.
When trying to run it inside the Hive script with:
!python script.py tablename ${hivecond:my.date} 3
i was surprised that the variable substition does not take place in a line that starts with !
my workaround was to get the date that the hive script recieved from the log file in the machine using something like:
'cat /mnt/var/log/hadoop/steps/ls /mnt/var/log/hadoop/steps/ |sort -r|head -n 1/stdout'
and from there you can parse each hive parameter in the python code without passing it via Hive.