I have a folder of CSV files with the same schema that I want to load into a BigQuery table.
Is there an option to give a folder path as the input to the bq load command? I'd like to know whether this can be done without iterating over the files or merging them at the source.
If using Cloud Storage is an option, you can put them all under a common prefix in a bucket and use a wildcard, e.g. gs://my_bucket/some/path/files*, to specify multiple inputs in a single load job.
Note that
You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
so something like gs://my_bucket/some/*/files* is not supported.
Source: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#load-wildcards
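For example, a minimal sketch of such a load job (my_dataset.my_table is a placeholder, and --autodetect / --skip_leading_rows=1 assume the CSVs have a header row; adjust to your schema):

# One load job over every object matching the wildcard.
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --autodetect \
  my_dataset.my_table \
  "gs://my_bucket/some/path/files*"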
The files can also be in subdirectories; if you want to recursively include all CSV files:
bq load --source_format=CSV \
dataset_name.table_name \
"gs://my_bucket/folder/*.csv"
This puts a single wildcard across the intermediate path and the filename (e.g. * can expand to subfolder/folder2/filename).
This statement exports the query results to GCS:
EXPORT DATA OPTIONS(
  uri='gs://<bucket>/<file_name>.*.csv',
  format='CSV',
  overwrite=true,
  header=true
) AS
SELECT * FROM dataset.table
It splits large amounts of data into multiple files, and sometimes it also produces empty files. I can't find anything in the BigQuery docs on how to control this. Can I configure export into a single file? Or into N files up to 1M rows each? Or N files up to 50MB each?
I have tested different scenarios (using public datasets) and discovered that the exported data gets split into multiple files when your table is partitioned and is less than 1 GB. This happens when using the wildcard operator in the export URI.
BigQuery supports a single wildcard operator (*) in each URI. The wildcard can appear anywhere in the URI except as part of the bucket name. Using the wildcard operator instructs BigQuery to create multiple sharded files based on the supplied pattern.
Unfortunately, a wildcard is a requirement of the EXPORT DATA syntax; without one, the query fails with an error.
Can I configure export into a single file? Or into N files up to 1M rows each? Or N files up to 50MB each?
As mentioned above, exporting a partitioned table into a single file is not possible with the EXPORT DATA syntax. A workaround is to export using the UI or the bq command-line tool.
Using UI export:
Open table > Export > Export to GCS > Fill in GCS location and filename
Using bq tool:
bq extract --destination_format CSV \
bigquery-public-data:covid19_geotab_mobility_impact.us_border_wait_times \
gs://bucket_name/900k_rows_using_bq_extract.csv
Testing was done with the public partitioned table bigquery-public-data.covid19_geotab_mobility_impact.us_border_wait_times, comparing the CSV files exported to the GCS bucket by these three different methods (EXPORT DATA, the UI export, and bq extract).
I have a requirement to create an Athena table from multiple compressed (.gz) files spread across multiple folders in S3.
The folder structure in S3 is as follows: S3 bucket ==> Clients folder ==> one folder per country (US, JAPAN, UK, ... about 50 countries) ==> 10 to 50 .gz files in each country folder.
I need to combine all the .gz files from all the country folders into a single table. I used Glue crawlers and classifiers, but the files are not getting merged into one table.
Please suggest other ways to create a table companies_all_regions in Athena from all the files.
You could create an Amazon Athena external table at the top-level of the bucket. All files at that level, and in sub-folders, will be included in the table. All files will need to be in the same format.
If your CSV files contain commas within a column, then the values for the column would need to be placed "inside double quotes".
If you are able to change the way the files are created, you could choose an alternate column separator, such as the pipe (|) character. That will avoid problems with commas inside field values. You can then configure the table to use the pipe as the separator character.
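As an illustrative sketch (the column list and the s3://my-bucket/Clients/ location are assumptions; replace them with your own), a pipe-delimited external table over the whole Clients prefix could look like:

-- Hypothetical columns and location; Athena reads gzip-compressed objects
-- transparently and scans every object under the LOCATION prefix,
-- including the per-country sub-folders.
CREATE EXTERNAL TABLE companies_all_regions (
  company_id   string,
  company_name string,
  country      string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/Clients/';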
Is there a way to extract a complete BigQuery partitioned table with one command, so that the data for each partition is extracted into a separate folder of the format part_col=date_yyyy-mm-dd?
Since BigQuery partitioned tables can read files from Hive-style partitioned directories, is there a way to extract the data into that same layout? I can extract each partition separately, but that is very cumbersome when I am extracting many partitions.
You could do this programmatically. For instance, you can export partitioned data by using the partition decorator, such as table$20190801, and in the bq extract command you can use URI patterns (see the workers pattern example in the docs) for the GCS objects.
Since all objects live within the same bucket, the folders are just a hierarchical illusion, so you can specify URI patterns on the folders as well, but not on the bucket.
So you would write a script that loops over the DATE value, with something like:
bq extract \
  --destination_format [CSV, NEWLINE_DELIMITED_JSON, AVRO] \
  --compression [GZIP; AVRO also supports DEFLATE and SNAPPY] \
  --field_delimiter [DELIMITER] \
  --print_header [true, false] \
  [PROJECT_ID]:[DATASET].[TABLE]$[DATE] \
  gs://[BUCKET]/part_col=[DATE]/[FILENAME]-*.[csv, json, avro]
You can't do it automatically with just a bq command. For this it would be better to raise a feature request as suggested by Felipe.
Set the project to test_dataset using gcloud init before running the command below.
bq extract --destination_format=CSV 'test_partitiontime$20210716' gs://testbucket/20210716/test*.csv
This will create a folder with the name 20210716 inside testbucket and write the file there.
I want to load many Parquet files from Google Cloud Storage into BigQuery.
The file paths look like
gs://abc/date=2018-01-01/*.parquet
where every date folder has one file, but there are many date folders.
When I try to use
gs://abc/date=2018-*/*.parquet
I get an error about files not found.
I am doing this via the UI.
You can use only one wildcard character.
If the filename is the same everywhere, you can use
gs://abc/date=2018-*/<filename>.parquet
More on wildcard characters here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#load-wildcards
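For reference, the equivalent bq command would be something like the sketch below (my_dataset.my_table and the data.parquet filename are placeholders; the single * matches across the date folders):

# Single wildcard inside the object name; it expands across the
# date=2018-* "folders" because folders are just part of the object name.
bq load \
  --source_format=PARQUET \
  my_dataset.my_table \
  "gs://abc/date=2018-*/data.parquet"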
I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "TableDynamoDB",
  "dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt), but the way it works currently is that the above script creates a folder named 'MyData.txt' and then generates a file with a random name under that folder.
Is it at all possible to specify a file name in S3 using HIVE?
Thank you!
A few things:
There are two different ways Hadoop can write data to S3 (the legacy block-based "s3" filesystem and the native "s3n" filesystem); the Hadoop wiki describes the differences in more detail. Since you are using the "s3" scheme, you are probably seeing a block number rather than the filename you expect.
In general, M/R jobs (and Hive queries) want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands and APIs in Hadoop handle directories seamlessly, so you shouldn't let it bother you too much. You can also use something like hadoop fs -getmerge on a directory to read all of its files as a single stream (see the sketch after this answer).
AFAIK, the LOCATION argument in the DDL for an external hive table is always treated as a directory for the reasons above.
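As a rough sketch of the -getmerge approach mentioned above (the source path is the directory Hive created from the question's script; the final destination for the merged file is an assumption):

# Hive treated s3://myBucket/DataFiles/MyData.txt as a directory of part files;
# concatenate them into one local file, then copy that file wherever you need it.
hadoop fs -getmerge s3://myBucket/DataFiles/MyData.txt ./MyData.txt
hadoop fs -put ./MyData.txt s3://myBucket/DataFiles/merged/MyData.txt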