Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud Storage to store data, and several times throughout the day CSVs are uploaded there. I use BigQuery jobs to populate BigQuery tables with that data.
For reasons beyond my control, the CSVs contain duplicate data. So what I want to do is basically create a daily ETL to append all new data to the existing tables, perhaps running at 1 am every day:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to load them into a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Append the de-duped temp table to the main BigQuery table.
Delete the temp table
So basically I'm stuck at square one - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer, so if there's a simpler way to accomplish this, that would be easier for me to run with.
Thanks for any help anyone can provide!

There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline to read from GCS (source) and append to BigQuery (sink).
Your pipeline can remove the duplicates itself. See here and here.
Create a cloud function to monitor your GCS bucket.
When a new file arrives, your cloud function is triggered automatically, which calls your Dataflow pipeline to start reading the new file, deduping it and writing the results to BigQuery.
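
To make the trigger half of this more concrete, here is a rough sketch (in Python, for a background Cloud Function) that launches a templated Dataflow job whenever a new object lands in the bucket. The project, region, template path and the inputFile parameter name are assumptions about your setup, not something prescribed by Dataflow.

from googleapiclient.discovery import build

PROJECT = "my-project"                                   # placeholder
REGION = "us-central1"                                   # placeholder
TEMPLATE = "gs://my-bucket/templates/csv_to_bq_dedupe"   # placeholder template path

def on_new_file(event, context):
    """Triggered by google.storage.object.finalize when a file is uploaded."""
    new_file = "gs://{}/{}".format(event["bucket"], event["name"])
    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            # Dataflow job names must be lowercase letters, digits and hyphens.
            "jobName": "load-" + event["name"].lower().replace("/", "-").replace("_", "-").replace(".", "-"),
            "parameters": {"inputFile": new_file},  # parameter name depends on your template
        },
    ).execute()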

So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar=< C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load
--skip_leading_rows=1
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VB script called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable which is used to search for yesterday's data files in GCS and output to a table, from which a de-duped (via grouping by all columns) table is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are there because it's saved as a .CMD file and run through Windows Task Scheduler.
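
For anyone who would rather avoid batch files, roughly the same daily flow can be sketched with the google-cloud-bigquery Python client. This is only a sketch: the bucket, dataset and two-column schema mirror the script above, the project comes from the environment, and the GROUP BY 1,2 trick is swapped for SELECT DISTINCT, which does the same thing for a two-column table.

import datetime
from google.cloud import bigquery

client = bigquery.Client()
yday = (datetime.date.today() - datetime.timedelta(days=1)).strftime("%Y%m%d")
staging_table = "data.data_{}_1".format(yday)

# 1. Load yesterday's CSVs from GCS into a staging table.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("Timestamp", "TIMESTAMP"),
        bigquery.SchemaField("TransactionID", "STRING"),
    ],
)
uri = "gs://mybucket/data/{}-{}-{}*.csv.gz".format(yday[:4], yday[4:6], yday[6:8])
client.load_table_from_uri(uri, staging_table, job_config=load_config).result()

# 2. Append only the de-duped rows to the main table.
client.query(
    "INSERT INTO data.data SELECT DISTINCT * FROM {}".format(staging_table)
).result()

# 3. Drop the staging table.
client.delete_table(staging_table, not_found_ok=True)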

Related

BigQuery Scheduled Query won't run

I've got data buckets set up in GCS and I'm using BigQuery to load all the .csv files from a bucket into a table. That works flawlessly. I made a simple deduplication query that, when run manually, selects only distinct rows and creates a new table with "DeDupe" appended (code below). That runs flawlessly.
CREATE OR REPLACE TABLE
`project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
SELECT
DISTINCT *
FROM
`project-name-123456.dataset_2022.dataset 2022`
The issue I am having is with scheduling that query. Every time it tries to run I get the error "Error status: Not found: Dataset project-name-123456:dataset_2022 was not found in location US; JobID: project-name-123456:628d7766-0000-2d36-a82f-94eb2c0a664a"
The only thing I can figure is that it's because I set the dataset's data location to "us-central1", since it has a free tier. And when I go to my scheduled query, whether I select the same data location or "Default", it always changes to "US Multiple".
Is there a way to fix this?
Or do I need to create my dataset in "US Multiple"?
I'm trying to cut down on costs as much as possible by keeping it in us-central1.
EDIT: Seems like I just needed to delete and recreate the scheduled query again. Chatted with Google Support and they sorted it. Sorry all!
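
For anyone recreating the schedule programmatically instead of through the UI, here is a hedged sketch using the BigQuery Data Transfer Python client, with the parent pinned to us-central1 so the scheduled query runs in the same location as the dataset. The display name and schedule are placeholders.

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
# Pin the transfer config to the dataset's region instead of the US multi-region.
parent = client.common_location_path("project-name-123456", "us-central1")

config = bigquery_datatransfer.TransferConfig(
    display_name="daily dedupe",        # placeholder
    data_source_id="scheduled_query",
    schedule="every 24 hours",          # placeholder
    params={
        "query": """
            CREATE OR REPLACE TABLE `project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
            SELECT DISTINCT * FROM `project-name-123456.dataset_2022.dataset 2022`
        """,
    },
)
client.create_transfer_config(parent=parent, transfer_config=config)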

Create a BigQuery Table everyday from a CSV file stored in Cloud Storage

On GCP I have a new CSV added every day at 2 am to a Cloud Storage folder (the name is something like mydata_yyyymmdd.csv).
I am trying to schedule an upload of this file into a BigQuery table at 2:30 am every day.
I succeeded in creating a Dataflow job that constantly screens my Cloud Storage folder and updates my BigQuery table whenever a new file is added, but I don't find this optimal because:
my Dataflow job runs all day long while I only need it once a day (increasing costs)
this solution doesn't create a new BigQuery table every day, it just appends every CSV to my BigQuery table
Can you help me with the tools I should use in GCP to achieve this ?
Thanks a lot for your help, it is very much appreciated
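
One lightweight pattern to sketch here, assuming a Cloud Scheduler job that calls a small HTTP Cloud Function at 2:30 am: the function loads the dated file into a new dated table, so nothing runs the rest of the day. The bucket, project, dataset and the autodetect schema are assumptions.

import datetime
from google.cloud import bigquery

def load_daily_csv(request):
    """HTTP Cloud Function invoked once a day by Cloud Scheduler."""
    client = bigquery.Client()
    today = datetime.date.today().strftime("%Y%m%d")
    uri = "gs://my-bucket/mydata_{}.csv".format(today)          # placeholder bucket
    table_id = "my-project.my_dataset.mydata_{}".format(today)  # a new dated table each day

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # or supply an explicit schema
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()
    return "loaded {} into {}".format(uri, table_id)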

BigQuery - Transfers automation from Google Cloud Storage - Overwrite table

Here's the case:
Our client uploads CSVs daily (overwriting the previous ones) to a bucket in Google Cloud Storage (each table in a different file).
We use BigQuery as DataSource in DataStudio
We want to automatically transfer the CSVs to BigQuery.
The thing is, even though we've:
Declared the tables in BigQuery with "Overwrite table" write preference option
Configured the daily transfers via the UI (BigQuery > Transfers) to automatically load the CSVs from Google Cloud Storage one hour after the files are uploaded, as stated by the limitations.
The automated transfer/load is by default in "WRITE_APPEND", thus the tables are appended instead of overwritten in BigQuery.
Hence the question: How/where can we change the
configuration.load.writeDisposition = WRITE_TRUNCATE
as stated here in order to overwrite the tables when the CSVs are automatically loaded?
I think that's what we're missing.
Cheers.
None of the above worked for us, so I'm posting this in case anyone has the same issue.
We scheduled a query to erase the table content just before the automatic importation process starts:
DELETE FROM project.tableName WHERE true
And then the new data will be imported into an empty table, so the default "WRITE_APPEND" doesn't affect us.
1) One way to do this is to use DDL to CREATE OR REPLACE your table before running the query which imports the data.
This is an example of how to create a table
#standardSQL
CREATE TABLE mydataset.top_words
OPTIONS(
description="Top ten words per Shakespeare corpus"
) AS
SELECT
corpus,
ARRAY_AGG(STRUCT(word, word_count) ORDER BY word_count DESC LIMIT 10) AS top_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
Now that it's created you can import your data.
2) Another way is to use BigQuery scheduled queries
3) If you write Python you can find an even better solution here
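
For reference, the Python route boils down to a load job with the write disposition set to WRITE_TRUNCATE, so each daily run overwrites the table instead of appending. A minimal sketch, with placeholder bucket, project, dataset and table names:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # This is the configuration.load.writeDisposition from the question.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.load_table_from_uri(
    "gs://client-bucket/table_name.csv",   # placeholder source file
    "my-project.my_dataset.table_name",    # placeholder destination table
    job_config=job_config,
)
job.result()  # wait for the load to finish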

Exporting query results as JSON via Google BigQuery API

I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as json in a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that would have lots of users hitting it and would rather not be in the business of managing all these temporary tables.
1) As you mention, the steps are good. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; check also the variants for different path syntax.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the files from GCS to your local machine.
With this approach you first need to export to GCS, then transfer to your local machine. If you have a message queue system (like Beanstalkd) in place to drive all of this, it's easy to chain the operations: submit the job, monitor its state, initiate the export to GCS when it's done, then delete the temp table.
Please also know that you can update the table via the API and set the expirationTime property; with this approach you don't need to delete it.
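
A hedged sketch of option 1 with the Python client, assuming a scratch dataset you control: run the query into a destination table that expires on its own, then export it to GCS as newline-delimited JSON.

import datetime
import uuid
from google.cloud import bigquery

client = bigquery.Client()
tmp_table = "my-project.scratch.export_{}".format(uuid.uuid4().hex)  # placeholder scratch dataset

# 1. Run the query into a named destination table (handles large results).
query_config = bigquery.QueryJobConfig(destination=tmp_table)
client.query("SELECT * FROM `bigquery-public-data.samples.shakespeare`", job_config=query_config).result()

# 2. Give the table an expiration so it cleans itself up.
table = client.get_table(tmp_table)
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
client.update_table(table, ["expires"])

# 3. Export it to a storage bucket as newline-delimited JSON.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
client.extract_table(tmp_table, "gs://my-bucket/exports/results-*.json", job_config=extract_config).result()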
2) If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can achieve a local export, but it has certain limits.
This exports the first 1000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json

Hive: create table and write it locally at the same time

Is it possible in Hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track possible mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I do usually is using hive -e "select * from db.table" > filename.tsv to get the data locally; however when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it is worth asking.
Honestly, doing it the way you are is the better of the two options, but it is worth noting that you can perform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store the results in a local directory (as long as there is enough space and you have the correct privileges).
A disadvantage to this is that with a pipe you get the data stored nicely with '|' delimiters and newline separation, but this method will store the values with Hive's default delimiter ('\001', Ctrl-A).
A workaround is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this is only in Hive 0.11 or higher