Dataprep is leaving Datasets/Tables behind in BigQuery - google-bigquery

I am using Google Cloud Dataprep for processing data stored in BigQuery. I am having an issue where Dataprep/Dataflow creates a new dataset with a name starting with "temp_dataset_beam_job_".
It seems to create the temporary dataset for both failed and successful Dataflow jobs that Dataprep launches. This is an issue because BigQuery becomes messy very quickly with all these temporary datasets.
This was not an issue in the past.
A similar issue is described in this GitHub thread: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609
Is there any way to avoid creating the temporary datasets, or to have them created in a Cloud Storage folder instead?

I wrote a cleanup script that I am running in Cloud Run (see this article) using Cloud Scheduler.
Below is the script:
#!/bin/bash
PROJECT={PROJECT_NAME}
# get list of datasets with temp_dataset_beam
# optional: write the list of datasets to Cloud Storage
obj="gs://{BUCKET_NAME}/maintenance-report-$(date +%s).txt"
bq ls --max_results=100 | grep "temp_dataset_beam" | gsutil -q cp -J - "${obj}"
datasets=$(bq ls --max_results=100 | grep "temp_dataset_beam")
for dataset in $datasets
do
echo "$PROJECT:$dataset"
# WARNING: Uncomment the line below to remove datasets
# (--recursive is needed because the temporary datasets usually contain tables)
# bq rm --recursive=true --dataset=true --force=true "$PROJECT:$dataset"
done
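If you would rather run the same cleanup from Python (for example inside the Cloud Run job) instead of shelling out to bq, a minimal sketch using the google-cloud-bigquery client could look like the code below; the prefix constant and the dry-run flag are assumptions for illustration, not part of the original script.
from google.cloud import bigquery

PREFIX = "temp_dataset_beam"  # datasets left behind by Dataprep/Dataflow jobs
DRY_RUN = True                # set to False to actually delete

client = bigquery.Client()
for dataset in client.list_datasets():
    if dataset.dataset_id.startswith(PREFIX):
        print(f"{client.project}:{dataset.dataset_id}")
        if not DRY_RUN:
            # delete_contents=True also removes any tables left inside the dataset
            client.delete_dataset(dataset.reference, delete_contents=True, not_found_ok=True)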

I solved this directly in Dataprep by running a SQL script after data publish, which executes after each job. You can set this in Dataprep under the output's Manual settings.
FOR drop_statement IN
  (SELECT CONCAT("drop table `<project_id>.", table_schema, ".", table_name, "`;") AS value
   FROM <dataset>.INFORMATION_SCHEMA.TABLES -- or region.INFORMATION_SCHEMA.TABLES
   WHERE table_name LIKE "Dataprep_%"
   ORDER BY table_name DESC)
DO
  EXECUTE IMMEDIATE(drop_statement.value); -- here the table is dropped
END FOR;

Related

Is it possible to export all Bigquery scheduled queries listed into a csv?

I need a programmatic way to get the list of BigQuery scheduled queries into a CSV (or a Google Sheet / BigQuery table if possible), but I cannot find any documentation for that. For now I can only select all the text manually from the BigQuery scheduled queries page.
Below is the information I need:
Display name and its URL
Schedule (UTC)
Next scheduled
Author
Destination dataset and destination table
And with new scheduled queries still being created, it is getting more complicated to keep track as the list keeps growing.
To list the scheduled queries with the bq CLI:
bq ls --transfer_config --transfer_location=US --format=prettyjson
To view the details of a scheduled query:
bq show --transfer_config [RESOURCE_NAME]
# [RESOURCE_NAME] is the value from the above bq ls command
In Python, you can use the code below to list the transfer configurations in a project.
from google.cloud import bigquery_datatransfer
transfer_client = bigquery_datatransfer.DataTransferServiceClient()
project_id = "my-project"
parent = transfer_client.common_project_path(project_id)
configs = transfer_client.list_transfer_configs(parent=parent)
print("Got the following configs:")
for config in configs:
    print(f"\tID: {config.name}, Schedule: {config.schedule}")
For more information you can refer to link1 and link2.

most reliable format for large bigquery load jobs

I have a 100 GB table that I'm trying to load into Google BigQuery. It is stored as a single 100 GB Avro file on GCS.
Currently my bq load job is failing with an unhelpful error message:
UDF worker timed out during execution.; Unexpected abort triggered for
worker avro-worker-156907: request_timeout
I'm thinking of trying a different format. I understand that BigQuery supports several formats (Avro, JSON, CSV, Parquet, etc.) and that, in principle, one can load large datasets in any of these formats.
However, I was wondering whether anyone here has experience with which of these formats is most reliable / least prone to quirks in practice when loading into BigQuery?
I would probably solve this with the following steps:
Create a lot of small files in CSV format.
Send the files to GCS.
Command to copy the files to GCS:
gsutil -m cp <local folder>/* gs://<bucket name>
The gsutil -m option performs the copy in parallel (multi-threaded/multi-processing).
After that, I would move the data from GCS to BigQuery using the default Cloud Dataflow template (link). (Remember that with a default template you don't need to write any code.)
Here is an example of invoking the Dataflow template (link):
gcloud dataflow jobs run JOB_NAME \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--parameters \
javascriptTextTransformFunctionName=YOUR_JAVASCRIPT_FUNCTION,\
JSONPath=PATH_TO_BIGQUERY_SCHEMA_JSON,\
javascriptTextTransformGcsPath=PATH_TO_JAVASCRIPT_UDF_FILE,\
inputFilePattern=PATH_TO_YOUR_TEXT_DATA,\
outputTable=BIGQUERY_TABLE,\
bigQueryLoadingTemporaryDirectory=PATH_TO_TEMP_DIR_ON_GCS
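As an aside, once the small files are in GCS you could also load them directly with the BigQuery Python client and choose the source format explicitly, without going through Dataflow. This is only a sketch; the bucket, destination table and autodetected schema are placeholder assumptions.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,  # or AVRO, PARQUET, NEWLINE_DELIMITED_JSON
    autodetect=True,                          # assumption: let BigQuery infer the schema
)

# The wildcard picks up all the small files produced in the first step.
load_job = client.load_table_from_uri(
    "gs://my-bucket/export/part-*.csv",       # placeholder bucket/path
    "my-project.my_dataset.my_table",         # placeholder destination table
    job_config=job_config,
)
load_job.result()  # waits for completion and raises on failure
print(f"Loaded {load_job.output_rows} rows.")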

How does BigQuery use data stored in Google Cloud?

A very basic question, but I'm not able to figure it out. Please help me out.
Q1: When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
Q2: Let's say my data directory for customer files is gs://sp2040/raw/cards/cust/. The table structure is defined as:
bq mk --time_partitioning_type=DAY market.cust \
custid:string,grp:integer,odate:string
Every day I create a new directory in the bucket, such as 20170101, 20170102, ..., to load a new dataset. After the data is loaded into the bucket, do I need to run the commands below?
D1:
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
D2:
bq load --source_format=CSV 'market.cust$20170102' \
gs://sp2040/raw/cards/cust/20170102/20170102_cust.csv
When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
Nope! BigQuery does not use Cloud Storage for storing data (unless it is a federated table linked to Cloud Storage).
Check out BigQuery Under the Hood with Tino Tereshko and Jordan Tigani - you will like it.
Do I need to run the commands below?
Yes, you need to load those files into BigQuery so you can query the data.
Yes, you would need to load the data into BigQuery using those commands.
However, there are a couple of alternatives
Pub/Sub and Dataflow: You could configure Cloud Storage to send a Pub/Sub notification when files are added, as described here. You could then have a Dataflow job that imports the file into BigQuery. (Dataflow documentation)
BigQuery external tables: BigQuery can query CSV files stored in Cloud Storage without importing the data, as described here. There is wildcard support for filenames, so it only needs to be configured once. Performance might not be as good as storing the data directly in BigQuery, though. A minimal example of this approach is sketched below.
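The sketch below sets up such an external table with the Python client, using the wildcard URI over the daily folders from the question; the project ID, table name and schema autodetection are placeholder assumptions.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://sp2040/raw/cards/cust/*"]  # wildcard over the daily folders
external_config.autodetect = True                               # assumption: infer the schema

table = bigquery.Table("my-project.market.cust_external")       # placeholder table ID
table.external_data_configuration = external_config
client.create_table(table)                                      # queries now read the CSVs in place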

Google BigQuery: How to use gsutil to either remove or overwrite a table?

I have a program which downloads some data from the web and saves it as a CSV, and then uploads that data to a Google Cloud Storage bucket. Next, that program uses gsutil to create a new Google BigQuery table by concatenating all the files in the bucket. To do the concatenation, I run this command at the command prompt:
bq load --project_id=ib-17 da.hi gs://ib/hi/* da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT
The issue is that for some reason this command appends to the existing table, so I get a lot of duplicate data. The question is: how can I either use gsutil to delete the table first, or use gsutil to overwrite the table?
If I understood your question correctly, you should delete and recreate the table with:
bq rm -f -t da.hi
bq mk --schema da:TIMESTAMP,bol:STRING,bp:FLOAT,bg:FLOAT,bi:FLOAT,lo:FLOAT,en:FLOAT,kh:FLOAT,ow:FLOAT,ls:FLOAT -t da.hi
Another possibility is to use the --replace flag, such as:
bq load --replace --project_id=ib-17 da.hi gs://ib/hi/*
I think this flag was once called WRITE_DISPOSITION, but it looks like the CLI has renamed it to --replace.
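For reference, the same overwrite behaviour can be expressed with the BigQuery Python client, where the write disposition is explicit (WRITE_TRUNCATE). This is only a sketch based on the command in the question; the schema is abbreviated to the first two columns.
from google.cloud import bigquery

client = bigquery.Client(project="ib-17")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrite instead of append
    schema=[
        bigquery.SchemaField("da", "TIMESTAMP"),
        bigquery.SchemaField("bol", "STRING"),
        # ... remaining FLOAT columns from the question
    ],
)

load_job = client.load_table_from_uri("gs://ib/hi/*", "ib-17.da.hi", job_config=job_config)
load_job.result()  # waits for completion and raises on failure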

How to save the results of an impala query

I've loaded a large set of data from S3 into hdfs, and then inserted the data to a table in impala.
I then ran a query against this data, and I'm looking to get these results back into S3.
I'm using Amazon EMR with Impala 1.2.4. If it's not possible to get the results of the query back to S3 directly, are there options to get the data back to HDFS and then somehow send it back to S3 from there?
I have messed around with the impala-shell -o filename options, but that appears to only work on the local linux file system.
I thought this would be a common scenario, but I'm having trouble finding any information about saving the results of a query anywhere.
Any pointers appreciated.
To add to the above, I am including the command that writes the query results to a file with a custom delimiter, using the --output_delimiter option together with --delimited, which switches off the default tab delimiter:
impala-shell -q "query" --delimited --output_delimiter='\001' --print_header -o 'filename'
What I usually do if it's a smallish result set is run the script from the command line then upload to s3 using the AWS command line tool:
impala-shell -e "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename
An alternative is to use Hive as the last step in your data pipeline, after you've run your query in Impala:
1. Impala step:
create table processed_data
as
select blah
--do whatever else you need to do in here
from raw_data1
join raw_data2 on a=b
2. Hive step:
create external table export
like processed_data
location 's3://mybucket/export/';
insert into table export
select * from processed_data;
If you have the AWS CLI installed, you can pipe the standard output of impala-shell straight to it, for example: impala-shell -q "query" | aws s3 cp - s3://mybucket/outputfilename
You can use a Unix pipe and stream (-).