BigQuery wildcard matching data transfer - google-bigquery

I am trying to set up a BQ Data Transfer job that copies files from GCS into BQ.
I want to be able to transfer files with the following name format:
BQ_MONETARY_20210101.csv
BQ_MONETARY_20210102.csv
AND skip files like
BQ_MONETARY_20210101_extra.csv
So far, what I have seen is that you can only specify the * character for wildcard matching. I tried using regular expressions, but they don't seem to work.
Does anyone have recommendations on what I can try?
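For what it's worth, the wanted and unwanted names above differ only by a suffix, so the pattern is easy to express as a regex; since the transfer service itself only accepts *, a sketch like this would have to run outside it (for example, in a script that stages the matching files before the transfer runs):

    import re

    # Matches BQ_MONETARY_<8-digit date>.csv exactly, so suffixed
    # variants such as BQ_MONETARY_20210101_extra.csv are rejected.
    PATTERN = re.compile(r"^BQ_MONETARY_\d{8}\.csv$")

    for name in [
        "BQ_MONETARY_20210101.csv",
        "BQ_MONETARY_20210102.csv",
        "BQ_MONETARY_20210101_extra.csv",
    ]:
        print(name, bool(PATTERN.match(name)))  # True, True, False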

Related

Connecting Tranco Google BigQuery with Metabase

I am trying to connect a third-party ranking management system (https://tranco-list.eu/) with Metabase. Tranco gives us the option to see the records on Google BigQuery, but when I try to connect Tranco with Metabase, it asks for a dataset from my Google Cloud console project. Since Tranco is an external data source, I don't have access to its dataset ID.
If you want to get the Tranco results in Google BigQuery, run the query below.
select * from `tranco.daily.daily` where domain = 'google.com' limit 10
When I search for Tranco in the public datasets, I can't find it there either. Is anyone aware of how to add a third-party dataset to a Google Cloud project?
Thanks in advance.
Unfortunately, you can't read the Tranco dataset directly from BigQuery; what you can do instead is load the CSV data from Tranco into a Cloud Storage bucket and then read the bucket from BigQuery.
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
Note that it has the following limitations:
- CSV files do not support nested or repeated data.
- Remove byte order mark (BOM) characters. They might cause unexpected issues.
- If you use gzip compression, BigQuery cannot read the data in parallel. Loading compressed CSV data into BigQuery is slower than loading uncompressed data.
- You cannot include both compressed and uncompressed files in the same load job.
- The maximum size for a gzip file is 4 GB.
- When you load CSV or JSON data, values in DATE columns must use the dash (-) separator and the date must be in the following format: YYYY-MM-DD (year-month-day).
- When you load JSON or CSV data, values in TIMESTAMP columns must use a dash (-) separator for the date portion of the timestamp, and the date must be in the following format: YYYY-MM-DD (year-month-day). The hh:mm:ss (hour-minute-second) portion of the timestamp must use a colon (:) separator.
Also, you can see this documentation if you don't know how to upload and read your CSV data.
The next link is a step-by-step guide on how to create/select the bucket you will use.
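For illustration, a minimal load sketch with the google-cloud-bigquery Python client (the bucket, file, and table names here are placeholders, not from the thread); remember that the bucket and the dataset must be co-located, per the note above:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # let BigQuery infer the schema
    )

    # Hypothetical bucket and table names, for illustration only.
    uri = "gs://my-tranco-bucket/tranco-daily.csv"
    load_job = client.load_table_from_uri(
        uri, "my-project.my_dataset.tranco_daily", job_config=job_config
    )
    load_job.result()  # wait for the load to finish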

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the data type TIME.
Somehow, while writing the data to the staging area, the data is changed to something like 0:08:00:00.0000000,0:17:00:00.0000000, which causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas as to why 08:00:00 becomes 0:08:00:00.0000000, and how to avoid it?
Finally, I was able to recreate your case in my environment.
I get the same error; a leading zero appears ahead of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any handling of special characters (double quotes, escape characters, etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it with an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write SELECT using the CAST / CONVERT function.
Recommended solution:
Use the Copy data activity to write your data to BlobStorage / ADLS (the direct copy did this anyway), preferably in the Parquet file format and a self-designed structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and load the data from those files into a table; you can use a regular query or write a stored procedure and call it (the stage creation and the load are sketched after this list).
Thanks to this, you will have more control over what is happening and you will build a DataLake solution for your organization.
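As a rough sketch of the stage creation and the COPY INTO load from the recommended solution (the account, container, and table names here are all assumptions, and the answer itself drives this from an ADF Lookup activity rather than Python):

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="my_wh", database="my_db", schema="public",
    )
    cur = conn.cursor()

    # One-time: a permanent stage pointing at the BlobStorage container
    # (a SAS token is assumed for the credentials).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS adf_stage
        URL = 'azure://myaccount.blob.core.windows.net/mycontainer/'
        CREDENTIALS = (AZURE_SAS_TOKEN = '...')
        FILE_FORMAT = (TYPE = PARQUET)
    """)

    # Per run: load the parquet files from the stage into the target table.
    cur.execute("""
        COPY INTO my_table
        FROM @adf_stage/sla/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
    conn.close()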
My own solution is pretty close to the accepted answer, but I still believe there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created by a direct-to-Snowflake copy, I ended up writing a plain file into blob storage and reading it again to load into Snowflake.
So instead of having it all in one step, I manually split it up into two actions:
one action that takes the data from Azure SQL and saves it as a plain text file on blob storage,
and a second action that reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does; hence the bug assumption.

Automatic ETL of data before loading to BigQuery

I have CSV files added to a GCS bucket daily or weekly, and each file name contains (date + specific parameter).
The files contain the schema columns (id + name), and we need to auto-load/ingest these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't get the date and the specific parameter from the file name into the Dataflow job.
We also tried a Cloud Function (we can get the date and the specific parameter value from the file name), but we couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article on this kind of problem using Cloud Workflows, for when you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file, we will then use parts of the filename in the generated BigQuery table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
The full article, with lots of code samples, is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
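The article implements these steps in Cloud Workflows YAML; purely as an illustration of the same three steps, a Python equivalent could look like this (the bucket, prefix, and file naming scheme are assumptions):

    from google.cloud import bigquery, storage

    gcs = storage.Client()
    bq = bigquery.Client()

    # (1) List the files in the bucket, filtered by a folder prefix.
    for blob in gcs.list_blobs("my-bucket", prefix="incoming/"):
        if not blob.name.endswith(".csv"):
            continue
        # (2) Derive the table name from parts of the filename,
        #     e.g. incoming/sales_20210101.csv -> sales_20210101
        table_id = ("my-project.my_dataset."
                    + blob.name.rsplit("/", 1)[-1].removesuffix(".csv"))
        # (3) Load the file into that table.
        job = bq.load_table_from_uri(
            f"gs://my-bucket/{blob.name}",
            table_id,
            job_config=bigquery.LoadJobConfig(
                source_format=bigquery.SourceFormat.CSV,
                skip_leading_rows=1,
                autodetect=True,
            ),
        )
        job.result()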

Glue create_dynamic_frame.from_catalog returns empty data

I'm debugging an issue where create_dynamic_frame.from_catalog returns no data, even though I'm able to view the data through Athena.
The Data Catalog points to an S3 folder containing multiple files with the same structure. The file type is CSV, the delimiter is a space (" "), and each file consists of two columns (a string and a JSON string) with no header.
This is the CSV format file.
This is the Athena query, using the crawler-generated table.
No results are returned from the DynamicFrame when debugging. Any thoughts?
Check whether you have enabled the bookmark for this job. If you are running it multiple times, you need to reset the bookmark or disable it.
The other thing to check is the logs. You might find an AccessDenied error; the role running the job might have no access to this bucket.
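For reference, the bookmark state is keyed on the transformation_ctx of each read; a minimal job script (database and table names assumed) looks like this, and with bookmarks enabled a re-run only sees data that is new since the last commit, which can be an empty frame:

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # transformation_ctx is what the bookmark tracks between runs.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name="my_table",
        transformation_ctx="dyf_source",
    )
    print(dyf.count())  # 0 can mean the bookmark already consumed the files

    job.commit()  # advances the bookmark

To re-read everything, reset the bookmark (aws glue reset-job-bookmark --job-name my-job) or disable it in the job properties.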

Wildcard in BigQuery load from GCS

I want to load many parquet files from Google Storage into BigQuery.
The file path format is
gs://abc/date=2018-01-01/*.parquet
where every date folder has one file, but I have many date folders.
When I try to use
gs://abc/date=2018-*/*.parquet
I get an error about files not found.
I am doing this via the UI.
You can use only one wildcard character.
If the filename is the same everywhere, you can use
gs://abc/date=2018-*/<filename>.parquet
More here: wildcard characters
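For illustration, the same single-wildcard URI also works from the Python client (the bucket, filename, and table here are placeholders; the question did not give the actual filename):

    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.load_table_from_uri(
        # Only one '*' is allowed: it spans the date folders, with a
        # fixed (hypothetical) filename inside each.
        "gs://abc/date=2018-*/data.parquet",
        "my-project.my_dataset.my_table",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
        ),
    )
    job.result()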