We have a CSV file with 300 columns. The size is approx 250 MB. I'm trying to upload it to BQ through the Web UI, but specifying the schema is hard work. I was anticipating that BQ would identify the file headers, but it doesn't seem to be recognising them unless I am missing something. Is there a way forward?
Yes, you have to write the schema yourself. BigQuery is not able to auto-infer it. If you have 300 columns, I suggest writing a script to automatically create the schema (see the sketch below).
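A minimal sketch of such a script, assuming the first row of the CSV holds the column names and typing every column as STRING (adjust the types afterwards where needed):

import csv
import json
import sys

# Read only the header row of the CSV and emit a BigQuery schema JSON.
# Usage: python make_schema.py myfile.csv > schema.json
with open(sys.argv[1], newline="") as f:
    header = next(csv.reader(f))

# Every column is assumed to be a nullable STRING; edit the output where needed.
schema = [{"name": col.strip(), "type": "STRING", "mode": "NULLABLE"} for col in header]
print(json.dumps(schema, indent=2))

The generated schema.json can then be passed as the <table_schema> argument of bq load.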
You can then load the file with the command-line tool (cf. here). If some lines have a wrong/different schema, you can use the following option to continue loading the other records:
--max_bad_records : the maximum number of bad rows to skip before the load job fails
In your case, if you also want to skip the first line of headers, the command can look like the following:
bq load --skip_leading_rows=1 --max_bad_records=10000 <destination_table> <data_source_uri> [<table_schema>]
I have a SQL notebook that transforms data and inserts it into another table.
I'm in a situation where I'm trying to change the stored block size in Blob Storage; I want fewer and bigger files. I have tried changing a lot of parameters.
I found the following behaviour:
When I run the notebook, the command creates files of almost 10 MB each.
If I create the table internally in Databricks and run another command:
create table external_table as
select * from internal_table
the files are almost 40 MB each...
So my questions are:
Is there a way to set the minimal block size for external Databricks tables?
When transforming data in a SQL notebook, are there best practices? For example, transform all the data and store it locally, and only after that move the data to the external source?
Thanks!
Spark doesn't have a straightforward way to control the size of output files. One method people use is to call repartition or coalesce with the number of desired files. To use this to control the size of the output files, you need an idea of how many files you want to create; e.g. to create 10 MB files, if your output data is 100 MB you could call repartition(10) before the write command (see the sketch below).
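A minimal PySpark sketch of that approach, assuming a Databricks notebook where spark is available; the table and path names are hypothetical:

# Read the source table, then repartition to control the number of output files.
df = spark.table("internal_table")  # hypothetical source table

# Assuming ~100 MB of output and a ~10 MB target per file, ask for 10 partitions;
# each partition is written out as one file.
(df.repartition(10)
   .write
   .mode("overwrite")
   .parquet("/mnt/blobstorage/external_table"))  # hypothetical output path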
It sounds like you are using Databricks, in which case you can use the OPTIMIZE command for Delta tables. Delta's OPTIMIZE will take your underlying files and compact them for you into approximately 1GB files, which is an optimal size for the JVM in large data use cases.
https://docs.databricks.com/spark/latest/spark-sql/language-manual/optimize.html
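If the external table is a Delta table, a minimal sketch of running it from a Python notebook cell, with a hypothetical table name:

# Compact the underlying files of a Delta table into larger ones (~1 GB by default).
spark.sql("OPTIMIZE my_external_delta_table")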
I have a simple delimited file on GCS. I need to load that file as is (without transformation) into a BigQuery table. We can use either Dataflow or the BigQuery command-line utility to load that file into a BigQuery table. I need to understand which one is the better option: Dataflow or the bq command-line utility. Please consider factors like cost, performance, etc. before providing your valuable inputs.
Running a BigQuery load using Dataflow and running it using the bq command line cost the same. Using bq load directly should be easier if you don't need to process the data.
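For reference, the same plain load job can also be submitted with the BigQuery Python client; a minimal sketch, with hypothetical bucket, file and table names:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumption: the file has a header row
    autodetect=True,       # or pass an explicit schema instead
)  # field_delimiter can also be set here if the file is not comma-delimited

load_job = client.load_table_from_uri(
    "gs://my-bucket/my-file.csv",      # hypothetical source file
    "my-project.my_dataset.my_table",  # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish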
My requirement is to pull data from different sources (Facebook, YouTube, DoubleClick Search, etc.) and load it into BigQuery. When I pull the data, some of the sources return "NULL" when a column is empty.
I tried to load the same data into BigQuery, and BigQuery treats it as a string instead of a NULL (empty) value.
Right now I am replacing "NULL" with "" (empty string) before loading into BigQuery. Instead of doing this, is there any way to load the file directly without any manipulation (replacing)?
Thanks,
What is the file format of the source file, e.g. CSV, newline-delimited JSON, Avro, etc.?
The reason I ask is that CSV treats an empty string as a null, while "NULL" is a string value. So, if you don't want to manipulate the data before loading, you should save the files in NLD JSON format.
As you mentioned that you are pulling data from social media platforms, I assume you are using their REST APIs, and as a result it will be possible for you to save that data as NLD JSON instead of CSV.
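A minimal sketch of writing such NLD JSON, assuming the API responses are already available as Python dicts (the records below are made up):

import json

# Hypothetical records returned by a social-media REST API; missing metrics are None.
records = [
    {"campaign": "summer", "clicks": 120, "cost": 35.5},
    {"campaign": "winter", "clicks": None, "cost": None},
]

# Write newline-delimited JSON: json.dumps turns None into JSON null,
# which BigQuery loads as a NULL value instead of the string "NULL".
with open("records.ndjson", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")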
Answer to your question "Is there a way we can load this from the web console?":
Yes. Go to your BigQuery project console https://bigquery.cloud.google.com/ and create a table in a dataset; there you can specify the source file and the table schema details.
From the comment section (for the convenience of other viewers):
Is there any option in bq commands for this?
Try this:
bq load --source_format=CSV --skip_leading_rows=1 --null_marker="NULL" yourProject:yourDataset.yourTable ~/path/to/file/x.csv Col1:string,Col2:string,Col3:integer,Col4:string
You may consider running a command similar to:
bq load --field_delimiter="\t" --null_marker="\N" --quote="" \
PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json
More details can be gathered from the replies to the "Best Practice to migrate data from MySQL to BigQuery" question.
I have a set of Avro files with slightly varying schemas which I'd like to load into one BQ table.
Is there a way to do that with one command? Any automatic way to handle the schema differences would be fine for me.
Here is what I tried so far.
0) If I try to do it in the straightforward way, bq fails with an error:
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*
Waiting on bqjob_r4e484dc546c68744_0000015bcaa30f59_1 ... (4s) Current status: DONE
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r4e484dc546c68744_0000015bcaa30f59_1': The Apache Avro library failed to read data with the follwing error: EOF reached
1) Quick googling shows that there is a --schema_update_option=ALLOW_FIELD_ADDITION option which, when added to the bq load job, changes nothing. ALLOW_FIELD_RELAXATION does not change anything either.
2) Actually, the schema id is mentioned in the file name, so the files look like:
gs://mybucket/logs/*_schemaA_*
gs://mybucket/logs/*_schemaB_*
Unfortunately, bq load does not allow more than one asterisk (as is written in the bq manual too):
bq load --source_format=AVRO myproject:mydataset.logs gs://mybucket/logs/*_schemaA_*
BigQuery error in load operation: Error processing job 'iow-rnd:bqjob_r5e14bb6f3c7b6ec3_0000015bcaa641f3_1': Not found: Uris gs://otishutin-eu/imp/2016-06-27/*_schemaA_*
3) When I try to list the files explicitly, the list turns out to be too long, so bq load does not work either:
bq load --source_format=AVRO myproject:mydataset.logs $(gsutil ls gs://mybucket/logs/*_schemaA_* | xargs | tr ' ' ',')
Too many positional args, still have ['gs://mybucket/logs/log_schemaA_2658.avro,gs://mybucket/logs/log_schemaA_2659.avro,gs://mybucket/logs/log_schemaA_2660.avro,...
4) When I try to use the files as an external table and list them explicitly in the external table definition, I also get a "too many files" error:
BigQuery error in query operation: Table definition may not have more than 500 source_uris
I understand that I could first copy the files to different folders and then process them folder by folder, and this is what I'm doing now as a last resort, but this is only a small part of the data processing pipeline, and copying is not acceptable as a production solution.
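For what it's worth, here is a sketch of that last-resort folder-by-folder workaround using the Python clients; the bucket name, prefixes, schema ids and table name are hypothetical, and the load options mirror the ones tried above:

from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()
bucket = storage_client.bucket("mybucket")

for schema_id in ["schemaA", "schemaB"]:  # hypothetical schema ids
    # Copy each matching file under its own prefix so one trailing wildcard covers it.
    for blob in bucket.list_blobs(prefix="logs/"):
        if f"_{schema_id}_" in blob.name:
            bucket.copy_blob(blob, bucket, f"logs_by_schema/{schema_id}/{blob.name.split('/')[-1]}")

    # One load job per schema folder; bq load allows a single trailing asterisk.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    )
    bq_client.load_table_from_uri(
        f"gs://mybucket/logs_by_schema/{schema_id}/*",
        "myproject.mydataset.logs",
        job_config=job_config,
    ).result()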
I'm using the bq command line and trying to insert a large amount of JSON files, with one table per day.
My approach:
list all files to be pushed (date-named YYYYMMDDHHMM.meta1.meta2.json)
concatenate files of the same day => YYYYMMDD.ndjson
split the YYYYMMDD.ndjson file (into 500-line files): YYYYMMDD.ndjson_splittedij
loop over the YYYYMMDD.ndjson_splittedij files and run
bq insert --template_suffix=20160331 --dataset_id=MYDATASET TEMPLATE YYYMMDD.ndjson_splittedij
This approach works. I just wonder if it is possible to improve it.
Again you are confusing streaming inserts and job loads.
You don't need to split each file into 500 rows (that applies to streaming inserts).
You can have very large files for load jobs; see the Command line tab examples listed here: https://cloud.google.com/bigquery/loading-data#loading_csv_files
You only have to run:
bq load --source_format=NEWLINE_DELIMITED_JSON --schema=personsDataSchema.json mydataset.persons_data personsData.json
A compressed JSON file must be under 4 GB; uncompressed, it must be under 5 TB, so larger files are better. Always try with a 10-line sample file until you get the command working.