Can we use 'bg' to get get only the changes n data of two CSV in Google Cloud storage bucket for Optimization? - sql

cat compare_data_gse_query.sql | bq query --use_legacy_sql=False
(1) Cost optimization in GSP. lots of the SQL query do need the changes.
(2) How to compare CSV data (today | yesterday) changes in GSC - Google Storage
(3) SQL * is to to be avoided. is there any utility in 'bq' CLI could handle ?

Related

Moving bigquery data to Redshift

I'm in need to move my bigquery table to redshift.
Currently I have a python job that is fetching data from redshift, and it is incremental loading my data on the redshift.
This python job is reading bigquery data, creating a csv file in the server, drops the same on s3 and the readshift table reads the data from the file on s3. But now the time size would be very big so the server won't be able to handle it.
Do you guys happen to know anything better than this ?
The new 7 tables on bigquery I would need to move, is around 1 TB each, with repeated column set. (I am doing an unnest join to flattening it)
You could actually move the data from Big Query to a Cloud Storage Bucket by following the instructions here. After that, you can easily move the data from the Cloud Storage bucket to the Amazon s3 bucket by running:
gsutil rsync -d -r gs://your-gs-bucket s3://your-s3-bucket
The documentation for this can be found here

Result of Bigquery job running on a table in which data is loaded via streamingAPI

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streamingAPI.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
As described here the query engine will look at both the Columnar Storage and the streaming buffer, so potentially the query should see the streamed data.
It depends what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job then all data in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.

How Bigquery use data stored in google cloud?

Guys a very basic question but not able to decipher ,Please help me out.
Q1: When we create bigquery table using below command , the data resides in same Cloud Storage?
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
Q2: let's say my data director is gs://sp2040/raw/cards/cust/ for customer file Table structure defined is:
bq mk --time_partitioning_type=DAY market.cust \
custid:string,grp:integer,odate:string
Everyday I create new dir in the bucket such as 20170101,20170102..to load new dataset. So after the data loaded in this bucket Do I need to fire below queries.
D1:
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
D2:
bq load --source_format=CSV 'market.cust$20170102' \
gs://sp2040/raw/cards/cust/20170102/20170102_cust.csv
When we create bigquery table using below command , the data resides in same Cloud Storage?
Nope! BigQuery is not using Cloud Storage for storing data (unless it is federated Table linked to Cloud Storage)
Check BigQuery Under the Hood with Tino Tereshko and Jordan Tigani - you will like it
Do I need to fire below queries
Yes. you need to load those files into BigQuery, so you can query the data
Yes you would need load the data into BigQuery using those commands.
However, there are a couple of alternatives
PubSub and Dataflow: You could configure PubSub to watch your cloud storage and create notification when files are added, described here. You could then have Dataflow job that imported the file into BigQuery. DataFlow Documentation
BigQuery external tables: BigQuery can query cvs files that are stored in Cloud Storage without importing the data, as described here. There is wildcard support for filenames so it could be configured once. Performance might not be as good as directly storing items in BigQuery

Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud to store data, and several times throughout the day data is uploaded into CSVs there. I use BigQuery to run jobs to populate BigQuery tables with the data there.
Because of reasons beyond my control, the CSVs have duplicate data. So what I want to do is basically create a daily ETL to append all new data to the existing tables, perhaps running at 1 am every day:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to convert them to a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Insert the de-duped temp table into the BigQuery table.
Delete the temp table
So basically I'm stuck at square 1 - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, and there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer so if there's a more simplistic way to accomplish this that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline to read from GCS (source) and write append to BigQuery (sink).
Your pipeline can remove duplicates directly itself. See here and here.
Create a cloud function to monitor your GCS bucket.
When a new file arrives, your cloud function is triggered automatically, which calls your Dataflow pipeline to start reading the new file, deduping it and writing the results to BigQuery.
So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar =< C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load
--skip_leading_rows=1
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%2 "SELECT * FROM data.data%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VB script called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable which is used to search for yesterday's data files in GCS and output to a table, from which a de-duped (via grouping by all columns) table is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are shown because it's saved as .CMD file and run through Windows Task Scheduler.

Is the cost incurred when partitioning date tables in BigQuery?

BigQuery quotes this command for creating a partition from existing tables:
bq partition mydataset.sharded_ mydataset.partitioned
(see partitioned tables)
But when I run this, I see that the data is actually getting moved. Since selecting data from raw large tables is very expensive, I wonder how Google applies billing for this situation.
The bq partition CLI command leverages copy jobs rather than queries, which don't incur execution costs (but you do still get charged for the persisted storage that it may generate).
If you're using the CLI, copy jobs can be specified using the bq cp command.