How to load data into BigQuery from buckets partitioned with Year/Month/Day - google-bigquery

We have data stored in a GCS bucket in the following format:
gs://gcptest/Year=2020/Month=06/day=18/test1.parquet, with many more files under the day=18 folder.
I want to create a table in BigQuery with the columns present in the files, partitioned by the Year, Month, and Day values that appear in the file path.
That way, when I load the data into the table, I can just select the path in the GCS bucket and the data will be partitioned by the Year/Month/Day values present in the path.

BigQuery supports loading externally partitioned data in Avro, Parquet, ORC, CSV, and JSON formats stored on Cloud Storage using a default hive partitioning layout.
Support is currently limited to the BigQuery web UI, command-line tool, and REST API.
You can see more in the Loading externally partitioned data documentation.
Also see how to Query externally partitioned data.
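For example, a minimal load with the bq command-line tool might look like the sketch below; the mydataset.events destination table is a placeholder, and the bucket prefix is the one from the question. With --hive_partitioning_mode=AUTO, BigQuery infers the Year/Month/day columns and their types from the path, and the single wildcard covers every Year=/Month=/day= subfolder under the prefix.
bq load \
  --source_format=PARQUET \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://gcptest/ \
  mydataset.events \
  "gs://gcptest/*"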

Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
ORC files are read-only, and we want to update the file contents constantly, every 20 minutes,
which implies we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by ~2 GB every day, so it is a very costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there any way to use another file format that also offers appends/inserts and can be queried by Athena?
From this article it seems Avro is such a file format, but I am not sure:
Can Athena be used for querying it?
Are there any other issues?
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
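A hedged sketch of both steps with the AWS CLI is below; the new_traffic.orc file, the 20200618 date, and the query-results location are placeholders, and whether the partition value needs quoting depends on the type of the date partition column.
# 1) Copy the new ORC file into the S3 prefix for its partition.
aws s3 cp new_traffic.orc s3://your-bucket/path_to_table/date=20200618/
# 2) Register (or refresh) that partition in Athena's metastore.
aws athena start-query-execution \
  --query-string "alter table dataset.tablename add if not exists partition (date = 20200618) location 's3://your-bucket/path_to_table/date=20200618/'" \
  --query-execution-context Database=dataset \
  --result-configuration OutputLocation=s3://your-bucket/athena-query-results/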

Automatic ETL of data before loading into BigQuery

I have CSV files added to a GCS bucket daily or weekly, and each file name contains (date + specific parameter).
The files contain the (id + name) columns, and we need to automatically load/ingest these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't pass the date and specific parameter from the file name into the Dataflow job.
We also tried a Cloud Function (we can get the date and specific parameter values from the file name), but couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article for this kind of problem using Cloud Workflows, for when you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file, we will then use parts of the filename in the generated BigQuery table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
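If you would rather not adopt Workflows, the same flow can be sketched as a shell script with gsutil and bq; the bucket name, the <date>_<parameter>.csv filename pattern, and the staging/final table names below are assumptions for illustration, not part of the linked article.
# Sketch: list the CSVs, pull date + parameter out of each filename,
# load into a staging table, then append to the final 4-column table.
for uri in $(gsutil ls "gs://my-ingest-bucket/daily/*.csv"); do
  fname=$(basename "$uri" .csv)   # assumed pattern, e.g. 20200618_paramX
  file_date=${fname%%_*}          # 20200618
  param=${fname#*_}               # paramX
  # Load the raw (id, name) columns into a staging table, overwriting it.
  bq load --source_format=CSV --skip_leading_rows=1 --autodetect --replace \
    mydataset.staging "$uri"
  # Append to the final table, filling in the filename-derived columns.
  bq query --use_legacy_sql=false \
    "INSERT INTO mydataset.final (id, name, date, specific_parameter)
     SELECT id, name, PARSE_DATE('%Y%m%d', '$file_date'), '$param'
     FROM mydataset.staging"
done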

BigQuery Table creation options

When we create a table under a particular dataset, we have 5 source options, such as empty table, Google Cloud Storage, upload, etc. My question is: if the source is Cloud Storage, where does this table get created, in BigQuery or in Cloud Storage? My intention is to dump the data into Cloud Storage and then load it into BigQuery. The same goes for the empty table option: as we explicitly define the schema, I understand the table will reside in BQ.
I have loaded the data with the script below:
bq load --source_format=CSV --skip_leading_rows=1 --autodetect --ignore_unknown_values \
commerce.balltoball gs://balltoballbucket/head_usa_names.csv
I suppose balltoballbucket refers to the storage bucket, whereas commerce.balltoball is the BigQuery reference.
Apologies for the newbie question, and thanks for your help.
If your bq load works, then the UI should work for you too. The documentation is here:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table (then pick Console tab)
Select file from GCS bucket: gs://balltoballbucket/head_usa_names.csv
File Format: CSV
Dataset Name: commerce
Table Name: balltoball
For other options, see the same page:
(Optional) Click Advanced options.
As to where the table is stored: if you pick Native table as the Table type, it is stored inside BigQuery storage; pick External table to let the data stay on GCS and have it read from GCS only when a query hits the table.
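As a command-line illustration of the External table option (the commerce.balltoball_external table name and the autodetect choice are assumptions on my part), you could define a table over the same GCS file instead of loading it:
# Build an external table definition pointing at the CSV on GCS...
bq mkdef --source_format=CSV --autodetect \
  "gs://balltoballbucket/head_usa_names.csv" > table_def.json
# ...and create a table that reads from GCS at query time instead of
# copying the data into BigQuery storage.
bq mk --external_table_definition=table_def.json commerce.balltoball_external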

Indexing and partitioning Parquet in S3

Is it possible to both index and partition a Parquet file in S3, or is this functionality only available on file-storage types of volumes?
I'm looking for a way to let researchers access the same data in S3 via EMR notebooks for (a) generic R and Python scripts, and (b) Spark-enabled querying. But the proprietary solution and query language we have right now provide indexing and partitioning on an NFS store, so I want to preserve this functionality. I see that Delta Lake provides this, but I'm wondering if it can be achieved with simpler tools like Arrow.
You could use Delta Lake to partition a Parquet file. Delta tables are also indexed by default.
You can do it like this:
%sql
CREATE TABLE UsableTable_unpartitioned
USING DELTA
LOCATION 'Location of the Parquet file on S3';
CREATE TABLE UsableTable
USING DELTA
PARTITIONED BY (my_partitioned_column)
LOCATION 'MyS3Location'
AS SELECT * FROM UsableTable_unpartitioned;
DROP TABLE UsableTable_unpartitioned;
Verify that your partitions and all the required information were created:
%sql
describe detail UsableTable
You could then expose this table via JDBC.

Upload CSV files as a partitioned table

I have a folder of CSV files separated by date in Google Cloud Storage. How can I upload it directly to BigQuery as a partitioned table?
You can do the following:
Create a partitioned table (for example: T)
Run multiple load jobs to load each day's data into the corresponding partition. For example, you can load the data for May 15th, 2016 by specifying the destination table of the load as 'T$20160515'.
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#restating_data_in_a_partition
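A minimal sketch of step 2 with the bq tool is below; the bucket path and the mydataset dataset are placeholders, and the single quotes keep the shell from interpreting the $ in the partition decorator.
# Load one day's files into the matching partition of table T.
bq load --source_format=CSV --skip_leading_rows=1 \
  'mydataset.T$20160515' \
  "gs://my-bucket/csv/2016-05-15/*.csv"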