Wildcard in BigQuery load from GCS - google-bigquery

I want to load many Parquet files from Google Cloud Storage into BigQuery.
The file layout is
gs://abc/date=2018-01-01/*.parquet
where every date folder has one file, but I have many date folders.
When I try to use
gs://abc/date=2018-*/*.parquet
I get an error about files not found.
I am doing this via the UI.

You can use only one wildcard character.
If the filename is the same everywhere, you can use
gs://abc/date=2018-*/<filename>.parquet
More here: wildcard characters (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#load-wildcards)
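If you prefer doing the load programmatically instead of through the UI, the same single-wildcard URI works with the google-cloud-bigquery Python client. A minimal sketch, assuming every date folder holds a file named data.parquet (an assumption) and using a placeholder project/dataset/table:
from google.cloud import bigquery

client = bigquery.Client()

# Single wildcard in the object name; bucket "abc" is from the question,
# "data.parquet" and the table id are placeholders.
uri = "gs://abc/date=2018-*/data.parquet"
table_id = "my_project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish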

Related

Pyspark write a DataFrame to csv files in S3 with a custom name

I am writing files to an S3 bucket with code such as the following:
df.write.format('csv').option('header','true').mode("append").save("s3://filepath")
This outputs to the S3 bucket as several files as desired, but each part has a long file name such as:
part-00019-tid-5505901395380134908-d8fa632e-bae4-4c7b-9f29-c34e9a344680-236-1-c000.csv
Is there a way to write this as a custom file name, preferably in the PySpark write function? Such as:
part-00019-my-output.csv
You can't do that with Spark alone. The long random suffixes are there to guarantee uniqueness, so nothing gets overwritten when many executors write files to the same location at the same time.
You'd have to use the AWS SDK to rename those files.
P.S. If you want a single CSV file, you can use coalesce, but the file name still cannot be chosen.
df.coalesce(1).write.format('csv')...
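A minimal sketch of that rename step with boto3, assuming coalesce(1) produced a single part file under a temporary prefix; the bucket name and prefixes are placeholders:
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"    # placeholder
prefix = "output/tmp/"  # placeholder: the folder Spark wrote to

# Find the single part file that coalesce(1) produced.
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
part_key = next(o["Key"] for o in objects if o["Key"].endswith(".csv"))

# "Rename" = copy to the desired key, then delete the original part file.
s3.copy_object(Bucket=bucket,
               CopySource={"Bucket": bucket, "Key": part_key},
               Key="output/part-00019-my-output.csv")
s3.delete_object(Bucket=bucket, Key=part_key)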

Bigquery Wildcard Matching data transfer

I am trying to set up a BigQuery Data Transfer job that copies files from GCS into BigQuery.
I want to be able to transfer files with the following name format:
BQ_MONETARY_20210101.csv
BQ_MONETARY_20210102.csv
and skip files like
BQ_MONETARY_20210101_extra.csv
So far, from what I have seen, you can only use the * character for wildcard matching. I tried regular expressions, but they don't seem to work.
Does anyone have recommendations on what I can try?

How to filter s3 path while reading data from s3 using pyspark

I have an S3 folder structure like this:
bucketname/20211127123456/*.parquet files
bucketname/20211127456789/*.parquet files
bucketname/20211126123455/*.parquet files
bucketname/20211126746352/*.parquet files
bucketname/20211124123455/*.parquet files
bucketname/20211124746352/*.parquet files
Basically for each day there are two folders and inside that I have multiple parquet files which I want to read.
Let's say I want to read all files from the folders for 27th and 26th Nov.
Right now I have a boto3 function that gives me a Python list of the complete S3 paths of every parquet file with 20211126 or 20211127 in the path, and I pass that list to spark.read. Is there a better way to achieve this?
Yes, you should be partitioning your data by date. Then your Spark queries would only need to include a date parameter, and only the files related to that date would be read for the query.
Here's an example of how that works with Athena; it will work with Glue and Spark too.
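A rough sketch of that partition-based layout in PySpark, assuming a one-time rewrite of the existing folders into a date-partitioned prefix (the paths are placeholders based on the question):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One-time rewrite: tag each existing folder with its date and write it
# back out partitioned by that date column.
df = spark.read.parquet("s3://bucketname/20211127123456/",
                        "s3://bucketname/20211127456789/")
(df.withColumn("date", F.lit("20211127"))
   .write.mode("append")
   .partitionBy("date")
   .parquet("s3://bucketname/partitioned/"))

# Later reads only list and scan the requested date partitions.
two_days = (spark.read.parquet("s3://bucketname/partitioned/")
            .where(F.col("date").isin("20211126", "20211127")))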

How to load multiple files (same schema) into a table in BigQuery?

I have a folder of CSV files with the same schema that I want to load into a BigQuery table.
Is there an option to give a folder path as the input to the bq command to load into a BigQuery table? I'd like to know whether it can be done without iterating over the files or merging the input files at the source.
If using Cloud Storage is an option, you can put them all under a common prefix in a bucket and use a wildcard, e.g. gs://my_bucket/some/path/files*, to specify a single load job with multiple inputs quickly.
Note that
You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
so something like gs://my_bucket/some/*/files* is not supported.
Source: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#load-wildcards
The files can also be in subdirectories; if you want to recursively include all CSV files:
bq load --source_format=CSV \
dataset_name.table_name \
"gs://my_bucket/folder/*.csv"
This single wildcard covers both the intermediate path and the filename (e.g. * expands to subfolder/folder2/filename).

Cannot load backup data from GCS to BigQuery

My backup has 3 items: 2 files ending with .backup_info and one folder, which contains another folder with 10 CSV files. What would be the format of the URL that specifies the backup file location?
I'm trying the path below, and every time I get a file-not-found error.
gs://bucket_name/name_of_the_file_which_ended_with_backup_info.info
When you go to look at the files from your backup, the bucket should have a structure like this:
Buckets/app-id-999999999-backups
And the filenames should look like:
2017-08-20T02:05:19Z_app-id-999999999_data.json.gz
Therefore the path will be:
gs://app-id-9999999999-backups/2017-08-20T02:05:19Z_app-id-9999999999_data.json.gz
Make sure you do not include the word "Buckets"; I am guessing that is the source of the confusion.
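If you still get file-not-found errors, it can help to list what is actually in the bucket and copy the exact object name from the output. A small sketch with the google-cloud-storage Python client; the bucket name is a placeholder following the pattern above:
from google.cloud import storage

client = storage.Client()

# Print every object so the exact backup file name can be copied verbatim.
bucket_name = "app-id-999999999-backups"  # placeholder
for blob in client.list_blobs(bucket_name):
    print(f"gs://{bucket_name}/{blob.name}")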