Create a BigQuery Table every day from a CSV file stored in Cloud Storage - google-bigquery

On GCP, a new CSV is added every day at 2 am to a Cloud Storage folder (the name format is something like mydata_yyyymmdd.csv).
I am trying to schedule a load of this file into a BigQuery table at 2:30 am every day.
I succeeded in creating a Dataflow job that constantly watches my Cloud Storage folder and updates my BigQuery table whenever a new file is added, but I don't find this optimal because:
the Dataflow job runs all day long while I only need to run it once a day (increasing costs);
this solution doesn't create a new BigQuery table every day, it just appends every CSV to the same BigQuery table.
Can you help me with the tools I should use in GCP to achieve this?
Thanks a lot for your help, it is very much appreciated.
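In case it helps, one lightweight option is Cloud Scheduler invoking a Cloud Function at 2:30 am that runs a BigQuery load job into a date-suffixed table. A minimal sketch, assuming schema autodetection and placeholder bucket, project and dataset names:

from datetime import date
from google.cloud import bigquery

def load_daily_csv(request):
    """HTTP Cloud Function invoked daily by Cloud Scheduler at 2:30 am."""
    client = bigquery.Client()
    suffix = date.today().strftime("%Y%m%d")               # e.g. 20200101
    uri = f"gs://my-bucket/mydata_{suffix}.csv"            # the file dropped at 2 am (placeholder bucket)
    table_id = f"my-project.my_dataset.mydata_{suffix}"    # a new table per day (placeholder names)

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                                   # assumption: no explicit schema
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    # Load the CSV straight from Cloud Storage and wait for the job to finish.
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()
    return "done"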

Related

Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
Since ORC files are read-only and we want to update the file contents every 20 minutes, we
need to download the ORC files from S3,
read the file,
append to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly (about 2 GB every day), and it has become a very costly process to download roughly 10 GB of files, read them, append to them and upload them again.
Question :
Is there another file format that supports appends/inserts and can also be queried by Athena?
From this article it seems that Avro is such a format, but I'm not sure whether Athena can be used to query it, or whether there are any other issues.
Note: my skills in big data technologies are at a beginner level.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the table's target S3 path and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy the new files to the paths corresponding to your specific partitions. After copying the new files to a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, you need to run a query like this to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
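If you want to automate both the copy and the partition registration, a rough sketch with boto3 might look like this (the bucket, database, table names and the Athena results location are placeholders):

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

partition = "20200101"  # date of the new data, YYYYMMDD (placeholder)

# 1) Drop the new ORC file under the partition's prefix.
s3.upload_file(
    "new_traffic.orc",
    "your-bucket",
    f"path_to_table/date={partition}/new_traffic.orc",
)

# 2) Tell Athena's metastore about the (possibly new) partition.
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE dataset.tablename ADD IF NOT EXISTS "
        f"PARTITION (date = {partition}) "
        f"LOCATION 's3://your-bucket/path_to_table/date={partition}/'"
    ),
    QueryExecutionContext={"Database": "dataset"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)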

Moving bigquery data to Redshift

I need to move my BigQuery table to Redshift.
Currently I have a Python job that fetches data from BigQuery and incrementally loads it into Redshift.
This Python job reads the BigQuery data, creates a CSV file on the server, drops it on S3, and the Redshift table reads the data from the file on S3. But now the data size is going to be very big, so the server won't be able to handle it.
Do you guys happen to know anything better than this?
The 7 new tables on BigQuery that I need to move are around 1 TB each, with repeated columns. (I am doing an UNNEST join to flatten them.)
You could actually move the data from BigQuery to a Cloud Storage bucket by following the instructions here. After that, you can easily move the data from the Cloud Storage bucket to the Amazon S3 bucket by running:
gsutil rsync -d -r gs://your-gs-bucket s3://your-s3-bucket
The documentation for this can be found here.
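For reference, a minimal sketch of both steps with the google-cloud-bigquery client and gsutil (project, dataset and bucket names are placeholders; the wildcard URI lets BigQuery write many shards, which is needed for tables this large):

import subprocess
from google.cloud import bigquery

client = bigquery.Client()

# 1) Export the (flattened) BigQuery table to GCS as sharded gzip CSVs.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(
    "my-project.my_dataset.my_flat_table",                   # placeholder table
    "gs://your-gs-bucket/my_flat_table/part-*.csv.gz",       # wildcard => many shards
    job_config=job_config,
).result()

# 2) Sync the exported shards to S3 for the Redshift COPY to pick up.
subprocess.run(
    ["gsutil", "rsync", "-d", "-r",
     "gs://your-gs-bucket/my_flat_table/", "s3://your-s3-bucket/my_flat_table/"],
    check=True,
)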

BigQuery - Transfers automation from Google Cloud Storage - Overwrite table

Here's the case:
Our client uploads CSVs daily (overwritten) to a bucket in Google Cloud Storage (each table in a different file).
We use BigQuery as the data source in Data Studio.
We want to automatically transfer the CSVs to BigQuery.
The thing is, even though we've:
declared the tables in BigQuery with the "Overwrite table" write preference option, and
configured the daily transfers via the UI (BigQuery > Transfers) to automatically load the CSVs from Google Cloud Storage one hour after the files are uploaded, as stated by the limitations,
the automated transfer/load defaults to "WRITE_APPEND", thus the tables are appended to instead of overwritten in BigQuery.
Hence the question: How/where can we change the
configuration.load.writeDisposition = WRITE_TRUNCATE
as stated here in order to overwrite the tables when the CSVs are automatically loaded?
I think that's what we're missing.
Cheers.
None of the above worked for us, so I'm posting this in case anyone has the same issue.
We scheduled a query to erase the table contents just before the automatic import process starts:
DELETE FROM project.tableName WHERE true
New data will then be imported into an empty table, so the default "WRITE_APPEND" doesn't affect us.
1) One way to do this is to use DDL to CREATE OR REPLACE your table before running the query that imports the data.
This is an example of how to create a table
#standardSQL
CREATE TABLE mydataset.top_words
OPTIONS(
  description="Top ten words per Shakespeare corpus"
) AS
SELECT
  corpus,
  ARRAY_AGG(STRUCT(word, word_count) ORDER BY word_count DESC LIMIT 10) AS top_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
Now that it's created you can import your data.
2) Another way is to use BigQuery scheduled queries.
3) If you write Python, you can find an even better solution here.
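Along the lines of option 3, a minimal sketch of running the load yourself with the google-cloud-bigquery client, which lets you set the write disposition that the Transfer Service doesn't let you change (bucket, project and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # assumption: let BigQuery infer the schema
    # The setting the automated transfer keeps at WRITE_APPEND:
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    "gs://client-bucket/daily_export.csv",        # placeholder CSV dropped by the client
    "my-project.my_dataset.daily_table",          # placeholder destination table
    job_config=job_config,
).result()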

Simplest Way to Automate Appending De-Duped Data to BigQuery from Google Cloud

I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud Storage to store data, and several times throughout the day CSVs are uploaded there. I use BigQuery to run jobs that populate BigQuery tables with that data.
For reasons beyond my control, the CSVs contain duplicate data. So what I want to do is basically create a daily ETL job, perhaps running at 1 am every day, that appends all new data to the existing tables:
Identify new files that have not been added (something like date = today - 1)
Run a job on all the CSVs from step 1 to load them into a temporary BigQuery table
De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
Insert the de-duped temp table into the BigQuery table.
Delete the temp table
So basically I'm stuck at square one - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else might have had experience with this and could give me some hints. Like I said, I'm not a developer, so if there's a simpler way to accomplish this, that would be easier for me to run with.
Thanks for any help anyone can provide!
There are a few ways to solve this, but I'd recommend something like this:
Create a templated Dataflow pipeline that reads from GCS (source) and appends to BigQuery (sink).
Your pipeline can remove duplicates directly itself. See here and here.
Create a Cloud Function to monitor your GCS bucket.
When a new file arrives, your Cloud Function is triggered automatically; it calls your Dataflow pipeline, which starts reading the new file, deduping it and writing the results to BigQuery.
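A rough sketch of the Cloud Function piece, assuming a templated pipeline already staged in GCS that accepts an inputFile parameter (the template path, parameter name and project details are all placeholders):

from googleapiclient.discovery import build

PROJECT = "my-project"                                      # placeholder
REGION = "us-central1"                                      # placeholder
TEMPLATE_PATH = "gs://my-bucket/templates/gcs_to_bq_dedupe" # placeholder staged template

def on_new_file(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize."""
    new_file = f"gs://{event['bucket']}/{event['name']}"
    dataflow = build("dataflow", "v1b3")
    # Launch the templated pipeline, pointing it at the file that just arrived.
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE_PATH,
        body={
            "jobName": "dedupe-load",
            "parameters": {"inputFile": new_file},  # parameter name is an assumption
        },
    ).execute()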
So no offense to Graham Polley but I ended up using a different approach. Thanks to these pages (and a TON of random Batch file Google searching and trial and error):
how to get yesterday's date in a batch file
https://cloud.google.com/bigquery/bq-command-line-tool
cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&
set /p zvar=<C:\Desktop\tempvar.txt &&
del C:\Desktop\tempvar.txt &&
bq load
--skip_leading_rows=1
data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1
gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz
Timestamp:TIMESTAMP,TransactionID:STRING &&
bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 group by 1,2" &&
bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&
bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2
A VB script called yester.vbs prints out yesterday's date in YYYYMMDD format. This is saved as a variable, which is used to search for yesterday's data files in GCS and load them into a table, from which a de-duped (via grouping by all columns) table is created. This is then appended to the main table, and the two intermediate tables are deleted.
The double percent signs are there because the script is saved as a .CMD file and run through Windows Task Scheduler.
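For anyone who would rather not depend on a Windows batch file, here is a rough Python equivalent of the same load, de-dupe and append flow using the BigQuery client library (the dataset, bucket and schema follow the script above; the rest is a sketch, not a tested job):

from datetime import date, timedelta
from google.cloud import bigquery

client = bigquery.Client()
yday = (date.today() - timedelta(days=1)).strftime("%Y%m%d")
staging = f"data.data_{yday}_1"
deduped = f"data.data_{yday}_2"

# 1) Load yesterday's CSVs into a staging table.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("Timestamp", "TIMESTAMP"),
        bigquery.SchemaField("TransactionID", "STRING"),
    ],
)
client.load_table_from_uri(
    f"gs://mybucket/data/{yday[:4]}-{yday[4:6]}-{yday[6:]}*.csv.gz",
    staging,
    job_config=load_config,
).result()

# 2) De-dupe by grouping on every column into a second table.
client.query(
    f"CREATE TABLE `{deduped}` AS "
    f"SELECT Timestamp, TransactionID FROM `{staging}` GROUP BY 1, 2"
).result()

# 3) Append the de-duped rows to the main table, then drop the intermediates.
copy_config = bigquery.CopyJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND
)
client.copy_table(deduped, "data.data", job_config=copy_config).result()
client.delete_table(staging)
client.delete_table(deduped)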

Issues creating table from bucket file

I have a big table (about 10 million rows) that I'm trying to pull into BigQuery. I had to upload the CSV into the bucket due to the size constraints when creating the table. When I try to create the table using the Datastore option, the job fails with the error:
Error Reason:invalid. Get more information about this error at Troubleshooting Errors: invalid.
Errors:
gs://es_main/provider.csv does not contain valid backup metadata.
Job ID: liquid-cumulus:job_KXxmLZI0Ulch5WmkIthqZ4boGgM
Start Time: Dec 16, 2015, 3:00:51 PM
End Time: Dec 16, 2015, 3:00:51 PM
Destination Table: liquid-cumulus:ES_Main.providercloudtest
Source URI: gs://es_main/provider.csv
Source Format: Datastore Backup
I've troubleshot by using a small sample file of rows from the same table and just uploading it using the CSV feature in the table creation, without any errors, and I can view the data just fine.
I'm just wondering what the metadata should be set to with the "Edit metadata" option within the bucket, or if there is some other workaround I'm missing. Thanks
The error message for the job that you posted is telling you that the file you're providing is not a Datastore Backup file. Note that "Datastore" here means Google Cloud Datastore, which is another storage solution that it sounds like you aren't using. A Cloud Datastore Backup is a specific file type from that storage product which is different from CSV or JSON.
Setting the file metadata within the Google Cloud Storage browser, which is where the "Edit metadata" option you're talking about lives, should have no impact on how BigQuery imports your file. It might be important if you were doing something more involved with your file from Cloud Storage, but it isn't important to BigQuery as far as I know.
To upload a CSV file from Google Cloud Storage to BigQuery, make sure to select CSV as the source format and Google Cloud Storage as the load source when creating the table.
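For the programmatic equivalent, a short sketch with the google-cloud-bigquery client; the key point is that the source format must be CSV rather than DATASTORE_BACKUP (schema autodetection here is an assumption):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    # The failing job used bigquery.SourceFormat.DATASTORE_BACKUP;
    # for a plain CSV file the source format has to be CSV.
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://es_main/provider.csv",
    "liquid-cumulus.ES_Main.providercloudtest",
    job_config=job_config,
).result()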