Upload multiple CSV files from Google Cloud Storage to BigQuery

I need to upload multiple CSV files from my Google Cloud Storage bucket. I tried pointing to the bucket when creating the dataset, but I received an error. I also tried
gsutil load <projectID:dataset.table> gs://mybucket
but it didn't work.
I need to upload multiple files at a time, as my total data is 2-3 TB and there is a large number of files.

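You're close. gsutil is the command-line tool for Google Cloud Storage; BigQuery's command-line tool is "bq". The command you're looking for is bq load <table> gs://mybucket/file.csv. To load many files in one job, put a wildcard in the URI, for example gs://mybucket/*.csv.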
bq's documentation is over here: https://developers.google.com/bigquery/bq-command-line-tool
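As a rough sketch of the answer above, the following Python helper assembles a bq load command that covers every CSV object in a bucket through a wildcard URI. The function name, the dataset and table names, and the choice of the --autodetect and --skip_leading_rows flags are assumptions for illustration; adjust them to your schema and files.

```python
import shlex


def build_bq_load_command(table, bucket, pattern="*.csv", skip_header=True):
    """Return the argument list for a `bq load` of all matching CSV objects."""
    args = ["bq", "load", "--source_format=CSV", "--autodetect"]
    if skip_header:
        # Assumes each CSV file starts with one header line.
        args.append("--skip_leading_rows=1")
    # A Cloud Storage URI passed to BigQuery may contain one `*` wildcard.
    args += [table, f"gs://{bucket}/{pattern}"]
    return args


if __name__ == "__main__":
    # "mydataset.mytable" and "mybucket" are placeholder names.
    cmd = build_bq_load_command("mydataset.mytable", "mybucket")
    # Print the command; to execute it, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the exact command before starting a multi-terabyte load.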

Related

Unzip files from S3 before putting them into Snowflake

I have data available in an S3 bucket we don't own, with a zipped folder containing files for each date.
We are using Snowflake as our data warehouse. Snowflake accepts gzip'd files, but does not ingest zip'd folders.
Is there a way to directly ingest the files into Snowflake that will be more efficient than copying them all into our own S3 bucket and unzipping them there, then pointing e.g. Snowpipe to that bucket? The data is on the order of 10GB per day, so copying is very doable, but would introduce (potentially) unnecessary latency and cost. We also don't have access to their IAM policies, so can't do something like S3 Sync.
I would be happy to write something myself, or use a product/platform like Meltano or Airbyte, but I can't find a suitable solution.
How about using SnowSQL to load the data into Snowflake, with a Snowflake table stage, user stage, or named stage to hold the files?
https://docs.snowflake.com/en/user-guide/data-load-local-file-system-create-stage.html
I had a similar use case. I use an event-based trigger that runs a Lambda function every time there is a new zipped file in my S3 folder. The Lambda function opens the zipped file, gzips each individual file, and re-uploads them to a different S3 folder. Here's the full working code: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
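The core of that approach, re-compressing each member of a zip archive as its own gzip file, can be sketched with the Python standard library alone. This is not the code from the linked article; the function name is made up, and the S3 download and upload steps are left out, so it only shows the in-memory conversion a Lambda handler might perform.

```python
import gzip
import io
import zipfile


def regzip_members(zip_bytes):
    """Map each file inside a zip archive to its gzip-compressed bytes."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # Key each result by the member path plus a .gz suffix.
            out[info.filename + ".gz"] = gzip.compress(archive.read(info))
    return out
```

Each value in the returned dictionary could then be uploaded under its key to the bucket that Snowflake reads from. This version holds the whole archive in memory, which may not suit very large zip files.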

BigQuery: Unloading Large Data to a Single GZIP File

I'm using the BigQuery console and was planning to extract a table and put the results into Google Cloud Storage as a GZIP file, but I encountered an error asking me to use a wildcard in the filename. According to the Google docs, this is a limitation for large volumes of data: the extract needs to be split across multiple files.
https://cloud.google.com/bigquery/docs/exporting-data#console
By any chance is there a workaround so I could have a single compressed file loaded to Google Cloud Storage instead of multiple files? I was using Redshift previously and this wasn't an issue.

Inserting realtime data into BigQuery with a file on Compute Engine?

I'm downloading realtime data into a CSV file on a Google Compute Engine instance and want to load this file into BigQuery for realtime analysis.
Is there a way for me to do this without first uploading the file to Cloud Storage?
I tried this: https://cloud.google.com/bigquery/streaming-data-into-bigquery but since my file isn't in JSON, this fails.
Have you tried the command line tool? You can upload CSVs from it.
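If streaming is still preferred over the command line tool, one option is to convert the CSV rows to JSON-style dictionaries first. The sketch below assumes the file has a header line whose column names match the table's field names. The helper name is hypothetical, and every value is left as a string, so you may need to convert types for your schema.

```python
import csv
import io


def csv_to_rows(csv_text):
    """Turn CSV text with a header line into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


if __name__ == "__main__":
    rows = csv_to_rows("ts,price\n1,9.5\n2,9.7\n")
    print(rows)
    # With the google-cloud-bigquery package installed, rows like these could be
    # streamed with client.insert_rows_json(table_id, rows); that call is not
    # made here because it needs credentials and an existing table.
```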

How to download all data in a Google BigQuery dataset?

Is there an easy way to directly download all the data contained in a certain dataset on Google BigQuery? I'm currently downloading "as csv", making one query after another, but it doesn't allow me to get more than 15k rows, and I need to download over 5M rows.
Thank you
You can run BigQuery extraction jobs using the Web UI, the command line tool, or the BigQuery API. The data can be extracted to Google Cloud Storage.
For example, using the command line tool:
First, install and authenticate using these instructions:
https://developers.google.com/bigquery/bq-command-line-tool-quickstart
Then make sure you have an available Google Cloud Storage bucket (see Google Cloud Console for this purpose).
Then, run the following command:
bq extract my_dataset.my_table gs://mybucket/myfilename.csv
More on extracting data via API here:
https://developers.google.com/bigquery/exporting-data-from-bigquery
Detailed step-by-step to download large query output
1. Enable billing.
   You have to give your credit card number to Google to export the output, and you might have to pay.
   But the free quota (1TB of processed data) should suffice for many hobby projects.
2. Create a project.
3. Associate billing with the project.
4. Run your query.
5. Create a new dataset.
6. Click "Show options" and enable "Allow Large Results" if the output is very large.
7. Export the query result to a table in the dataset.
8. Create a bucket on Cloud Storage.
9. Export the table to the created bucket on Cloud Storage.
   Make sure to select GZIP compression.
   Use a name like <bucket>/prefix.gz.
   If the output is very large, the file name must contain an asterisk * and the output will be split into multiple files.
10. Download the table from Cloud Storage to your computer.
    It does not seem possible to download multiple files from the web interface if the large file got split up, but you could install gsutil and run:
    gsutil -m cp -r 'gs://<bucket>/prefix_*' .
    See also: Download files and folders from Google Storage bucket to a local folder
    There is a gsutil in Ubuntu 16.04 but it is an unrelated package.
    You must install and set it up as documented at: https://cloud.google.com/storage/docs/gsutil
11. Unzip locally:
    for f in *.gz; do gunzip "$f"; done
Here is a sample project I needed this for which motivated this answer.
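As an alternative to the final gunzip loop, the downloaded shards can be decompressed and joined into one CSV file in Python. This is only a sketch: the function name is hypothetical, and it assumes every shard starts with the same single-line header, so the header is kept from the first shard only. Pass has_header=False if your export has no header.

```python
import glob
import gzip


def merge_gzip_shards(pattern, dest_path, has_header=True):
    """Decompress every shard matching `pattern` into one file at `dest_path`."""
    shards = sorted(glob.glob(pattern))
    with open(dest_path, "wb") as dest:
        for index, shard in enumerate(shards):
            with gzip.open(shard, "rb") as src:
                if has_header and index > 0:
                    src.readline()  # drop the repeated header line
                # Copy in 1 MiB chunks so large shards never sit fully in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    dest.write(chunk)
    return len(shards)
```

For example, merge_gzip_shards("prefix_*.gz", "table.csv") would write the combined rows to table.csv and return the number of shards it read.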
For Python, you can use the following code; it downloads the query result as a DataFrame.
# Requires the google-cloud-bigquery and pandas packages.
from google.cloud import bigquery

def read_from_bqtable(bq_projectname, bq_query):
    # Create a client bound to the given project.
    client = bigquery.Client(project=bq_projectname)
    # Run the query and load the full result into a pandas DataFrame.
    bq_data = client.query(bq_query).to_dataframe()
    return bq_data

bigQueryTableData_df = read_from_bqtable('gcp-project-id', 'SELECT * FROM `gcp-project-id.dataset-name.table-name`')
Yes, the steps suggested by Michael Manoochehri are a correct and easy way to export data from Google BigQuery.
I have written a bash script so that you do not need to repeat these steps every time.
The GitHub URL is:
https://github.com/rajnish4dba/GoogleBigQuery_Scripts
Scope:
1. Export data based on your BigQuery SQL.
2. Export data based on your table name.
3. Transfer your export file to an SFTP server.
Try it and let me know your feedback.
For help, run ExportDataFromBigQuery.sh -h

How to upload multiple files to google cloud storage bucket as a transaction

Use Case:
Upload multiple files into a Cloud Storage bucket, and then use that data as the source for a BigQuery import. Use the name of the bucket as the metadata that determines which sharded table the data should go into.
Question:
In order to prevent a partial import into the BigQuery table, ideally I would like to do the following:
1. Upload the files into a staging bucket
2. Verify all files have been uploaded correctly
3. Rename the staging bucket to its final name (for example, gs://20130112)
4. Trigger the BigQuery import to load the bucket into a sharded table
Since gsutil does not seem to support bucket rename, what are the alternative ways to accomplish this?
Google Cloud Storage does not support renaming buckets, or more generally an atomic way to operate on more than one object at a time.
If your main concern is that all objects were uploaded correctly (as opposed to needing to ensure the bucket content is only visible once all objects are uploaded), gsutil cp supports that -- if any object fails to upload, it will report the number that failed to upload and exit with a non-zero status.
So, a possible implementation would be a script that runs gsutil cp to upload all your files, and then checks the gsutil exit status before creating the BigQuery table load job.
Mike Schwartz, Google Cloud Storage team
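The script described in this answer might look like the following sketch. The function name is hypothetical and the commands in the comment are placeholders. The only behaviour it relies on is the one stated above: the upload command exits with a non-zero status if any object fails.

```python
import subprocess


def upload_then_load(upload_cmd, load_cmd, run=subprocess.run):
    """Run the upload; start the load only if the upload exited with status 0."""
    upload = run(upload_cmd)
    if upload.returncode != 0:
        return False  # at least one object failed to upload; skip the load
    # check=True raises CalledProcessError if the load command itself fails.
    run(load_cmd, check=True)
    return True


# Example call with placeholder names (not executed here):
# upload_then_load(
#     ["gsutil", "-m", "cp", "data/*.csv", "gs://20130112/"],
#     ["bq", "load", "mydataset.mytable_20130112", "gs://20130112/*.csv"],
# )
```

Passing the runner as a parameter keeps the function easy to exercise with harmless commands before pointing it at real buckets.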
Object names are actually flat in Google Cloud Storage; from the service's perspective, '/' is just another character in the name. The folder abstraction is provided by clients, like gsutil and various GUI tools. Renaming a folder requires clients to request a sequence of copy and delete operations on each object in the folder. There is no atomic way to rename a folder.
Mike Schwartz, Google Cloud Storage team
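To make the flat namespace concrete, here is a small sketch of what a client has to do to "rename" a folder: work out a new name for every object under the old prefix, then copy each object and delete the original. The function name and object names are made up, and only the planning step is shown, not the copy and delete requests.

```python
def plan_folder_rename(object_names, old_prefix, new_prefix):
    """List the (source, destination) copies needed to 'rename' a folder."""
    # Normalize both prefixes so "staging" does not also match "staging2/...".
    if not old_prefix.endswith("/"):
        old_prefix += "/"
    if not new_prefix.endswith("/"):
        new_prefix += "/"
    return [
        (name, new_prefix + name[len(old_prefix):])
        for name in object_names
        if name.startswith(old_prefix)
    ]
```

Because each pair is a separate copy followed by a delete, a reader of the bucket can observe a partially renamed folder while the sequence is in progress.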