How to get the job id of the last job run in BigQuery command line tool? - google-bigquery

I am running some commands in bq to extract data from BigQuery to GCS, and I am able to achieve the target result: I can query the data and write it to GCS in the desired formats. I was just wondering if there's any way to get the job id of the last job and its state. I know I can list all jobs with the bq ls -j command, but that gives me an entire result set; I'm only looking for the state of that one job.
bq --location=US extract --destination_format CSV --compression GZIP dataset_id.table_name gs://bucket_name/table.csv
bq ls -j -n 1
jobId     Job Type   State     Start Time        Duration
job_id    extract    FAILURE   30 Mar 13:36:54   0:00:29
I want only the last job id and its state.

I want only the last job id and its state.
You can pipe it to awk, skipping the two header lines and printing the jobId and State columns:
bq ls -j -n 1 | awk 'NR > 2 {print $1, $3}'
bquxjob_69ed4f1_169ba1f5665 SUCCESS

Looking at the docs, bq offers the global flag --job_id, which allows you to set the ID for the job you're launching (in this case, via the extract command). They even have a section about best practices around generating a job id.
Once you've created the job, you can get details for that specific job using bq show --job MY_JOB_ID_HERE.
If you don't want to have to generate a job ID yourself, a more hacky approach would be to have bq print out the API calls using the global --apilog stdout option, and you could potentially parse the job ID from that.
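Putting the two pieces together, a minimal sketch of that flow (dataset, table and bucket names are the placeholders from the question; requires the bq CLI and an authenticated project):

```shell
# Generate a predictable, unique job id up front (per the best-practice docs).
JOB_ID="extract_$(date +%Y%m%d_%H%M%S)_$$"

# Launch the extract under that id.
bq --location=US extract \
  --job_id="$JOB_ID" \
  --destination_format CSV --compression GZIP \
  dataset_id.table_name gs://bucket_name/table.csv

# Since the id is known in advance, the state can be read back directly.
bq show --format=prettyjson --job "$JOB_ID" | grep '"state"'
```

Appending the PID ($$) plus a timestamp keeps the id unique across repeated runs from the same scheduler.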

Related

Filter for failed BigQuery jobs in the web interface

I have a lot of jobs that have run (over 1000) and some of them failed. I can grep for them on the command line with
bq ls -j --max_results=10000 | grep FAILURE and then check each job id with bq show --format=prettyjson -j JOB_ID.
I can filter for jobs in the web interface, but the filter does not allow me to filter for failed jobs only. At least I cannot see how:
Is it possible to filter for failed jobs in the web interface?
I'm with GCP Support. At the moment, it is not possible to filter like that.
As I think this is a good request, I have filed a feature request for this in the Public Issue Tracker, where you can follow updates on it.
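Until such a filter exists in the web interface, the grep approach from the question can be tightened up so only failed jobs come back; a sketch, assuming the default bq ls -j column layout (jobId first, State third, two header lines):

```shell
# List only failed jobs: skip the two header lines, keep rows whose
# State column (third field) is FAILURE, and print jobId plus state.
bq ls -j --max_results=10000 | awk 'NR > 2 && $3 == "FAILURE" {print $1, $3}'
```

Each printed jobId can then be fed to bq show --format=prettyjson -j as before.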

Autosys job history for last year

I need to find the job history for a job for the last 6 months. I have tried autorep -J <job> -r -n, but this just gives me data for the nth previous execution, while I want data for all the days. Is there a way to do this through the command line?
There is no way with that command to get that information.
I would query $AUTOUSER/archive,
or look in the DB if you have access there.
Dave
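If neither the archive nor DB access is available, a hedged workaround is to loop the -r flag backwards over previous runs, as described in the question; the job name and run count below are placeholders, and whether enough runs are still retained depends on your Autosys retention settings:

```shell
# Walk back through previous executions one at a time.
JOB=my_job_name                 # placeholder job name
n=1
while [ "$n" -le 200 ]; do      # how many runs back to report; adjust as needed
  autorep -J "$JOB" -r -"$n"    # -r -N reports the Nth previous execution
  n=$((n + 1))
done
```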

Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud to store data, and several times throughout the day data is uploaded into CSVs there. I use BigQuery to run jobs to populate BigQuery tables with the data there.
Because of reasons beyond my control, the CSVs have duplicate data. So what I want to do is basically create a daily ETL to append all new data to the existing tables, perhaps running at 1 am every day:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to convert them to a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Insert the de-duped temp table into the BigQuery table.
Delete the temp table
So basically I'm stuck at square 1 - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, and there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer so if there's a more simplistic way to accomplish this that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline to read from GCS (source) and append to BigQuery (sink).
Your pipeline can remove duplicates directly itself. See here and here.
Create a cloud function to monitor your GCS bucket.
When a new file arrives, your cloud function is triggered automatically, which calls your Dataflow pipeline to start reading the new file, deduping it and writing the results to BigQuery.
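Once the pipeline is staged as a template, kicking it off (from a Cloud Function or any scheduler) boils down to a single gcloud call; the template path, job name and parameter name below are all hypothetical:

```shell
# Run a templated Dataflow job against a newly-arrived file.
# Replace the bucket, template and parameter names with your own.
gcloud dataflow jobs run "dedupe-load-$(date +%s)" \
  --gcs-location gs://my-bucket/templates/dedupe-to-bq \
  --parameters inputFile=gs://my-bucket/data/new-file.csv
```

The epoch timestamp in the job name keeps repeated launches distinct.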
So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar=<C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load ^
--skip_leading_rows=1 ^
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 ^
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz ^
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 GROUP BY 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VB script called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable which is used to search for yesterday's data files in GCS and output to a table, from which a de-duped (via grouping by all columns) table is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are shown because it's saved as .CMD file and run through Windows Task Scheduler.
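For anyone running on Linux/macOS instead of Windows, the same daily flow can be sketched in bash, where yesterday's date is a one-liner (GNU date; dataset, bucket and schema names are the ones from the .CMD above):

```shell
# Yesterday as YYYYMMDD, replacing the VBScript + temp-file dance.
ZVAR=$(date -d "yesterday" +%Y%m%d)

# Load yesterday's files into a staging table.
bq load --skip_leading_rows=1 \
  "data.data_${ZVAR}_1" \
  "gs://mybucket/data/${ZVAR:0:4}-${ZVAR:4:2}-${ZVAR:6:2}*.csv.gz" \
  Timestamp:TIMESTAMP,TransactionID:STRING

# De-dupe by grouping on all columns, append, then clean up.
bq query --destination_table="data.data_${ZVAR}_2" \
  "SELECT * FROM data.data_${ZVAR}_1 GROUP BY 1,2"
bq cp -a "data.data_${ZVAR}_2" data.data
bq rm -f "data.data_${ZVAR}_1"
bq rm -f "data.data_${ZVAR}_2"
```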

How to save the results of an impala query

I've loaded a large set of data from S3 into hdfs, and then inserted the data to a table in impala.
I then ran a query against this data, and I'm looking to get these results back into S3.
I'm using Amazon EMR, with Impala 1.2.4. If it's not possible to get the results of the query back to S3 directly, are there options to get the data back to HDFS and then somehow send it back to S3 from there?
I have messed around with the impala-shell -o filename options, but that appears to only work on the local linux file system.
I thought this would have been a common scenario, but having trouble finding any information about saving the results of a query anywhere.
Any pointers appreciated.
To add to the knowledge above, I am including the command that writes the query results to a file with a delimiter declared via the --output_delimiter option, together with the --delimited option, which switches off the default tab delimiter:
impala-shell -q "query " --delimited --output_delimiter='\001' --print_header -o 'filename'
What I usually do if it's a smallish result set is run the script from the command line then upload to s3 using the AWS command line tool:
impala-shell -e "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename
An alternative is use Hive as the last step in your data pipeline after you've run your query in Impala:
1. Impala step:
create table processed_data
as
select blah
--do whatever else you need to do in here
from raw_data1
join raw_data2 on a=b
2. Hive step:
create external table export
like processed_data
location 's3://mybucket/export/';
insert into table export
select * from processed_data;
If you have the AWS CLI installed, you can use a Unix pipe and stream (-) straight from the Impala shell's standard output:
impala-shell -B -q "query" | aws s3 cp - s3://mybucket/outputfilename

Best practice to add time partitions to a table

I have an event table, partitioned by time (year, month, day, hour).
I want to join a few events in a Hive script that receives the year, month, day and hour as variables.
How can I also add, for example, events from all 6 hours prior to my time,
without running 'recover partitions' on everything?
Thanks.
So basically what I needed was a way to use a date that the Hive script receives as a parameter,
and add all partitions 3 hours before and 3 hours after that date, without recovering all partitions and adding the specific hours to every WHERE clause.
I didn't find a way to do it inside the Hive script, so I wrote a quick Python script that gets a date and a table name, along with how many hours to add before/after.
When trying to run it inside the Hive script with:
!python script.py tablename ${hivecond:my.date} 3
I was surprised that the variable substitution does not take place in a line that starts with !.
My workaround was to get the date that the Hive script received from the log file on the machine, using something like:
cat /mnt/var/log/hadoop/steps/$(ls /mnt/var/log/hadoop/steps/ | sort -r | head -n 1)/stdout
From there you can parse each Hive parameter in the Python code without passing it via Hive.
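For reference, the partition-emitting core of such a helper can be sketched in shell (GNU date; the table and partition column names are hypothetical, and the original script was Python):

```shell
# Emit ADD PARTITION statements for 3 hours before and after a base time,
# using epoch arithmetic to avoid date-parsing pitfalls with negative offsets.
BASE="2019-03-30 13:00"
base_epoch=$(date -d "$BASE" +%s)
for off in -3 -2 -1 0 1 2 3; do
  ts=$(date -d "@$((base_epoch + off * 3600))" "+%Y %m %d %H")
  set -- $ts   # split into year, month, day, hour
  echo "ALTER TABLE events ADD IF NOT EXISTS PARTITION (year=$1, month=$2, day=$3, hour=$4);"
done
```

Piping that output into hive -f (or hive -e) adds exactly the hour window needed, with no full partition recovery.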