I'm trying to bulk load some records to BigQuery, but it takes a long time to upload even a few thousand records.
I'm using the following command to load a gzipped JSON file. The file has ~2k rows with ~200 columns each:
./bin/bq load --project_id=my-project-id --source_format=NEWLINE_DELIMITED_JSON dataset.table /tmp/file.json.gz
Waiting on bqjob_r3a269dd7388c7b8e_000001579a6e064f_1 ... (50s)
Current status: DONE
This command takes ~50 seconds to load the records. Since I want to load at least 1 million records, that would take ~7 hours, which seems like too much for a tool that is supposed to handle petabytes of data.
Is it possible to speed up the process?
Try using the --nosync flag. This starts an asynchronous job in BigQuery, which I have found performs much better.
Ideally, I would also suggest storing file.json.gz in Google Cloud Storage and loading it from there.
./bin/bq load --nosync --project_id=my-project-id --source_format=NEWLINE_DELIMITED_JSON dataset.table gs://<your-bucket>/file.json.gz
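For comparison, here is a minimal sketch of the same load through the Python client library, assuming the file has already been copied to a hypothetical gs://your-bucket path and that dataset.table already exists (so its schema is reused):

from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# Loading from GCS keeps the upload out of the critical path;
# the load job itself runs asynchronously inside BigQuery.
load_job = client.load_table_from_uri(
    "gs://your-bucket/file.json.gz",   # hypothetical bucket/path
    "my-project-id.dataset.table",
    job_config=job_config,
)
# load_job.result() would block until completion; skip it to mimic --nosync.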
I have CSV files in GCS that I want to load into BigQuery. I'm using pandas to ingest the files, but they are large (10 GB), and I use Cloud Run to execute the job:
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.read_csv(uri, sep=delimiter, dtype=str)  # reads the whole file into memory
# Run the load job
load_job = client.load_table_from_dataframe(df, table)
I always get this error:
Memory limit of 512M exceeded with 519M used. Consider increasing the memory limit
How do I choose the right memory size for my Cloud Run service? Can I load the data to BigQuery in chunks of the dataframe instead?
Thanks
The bad idea is to increase the Cloud Run memory: it is not scalable.
The good idea is to use the BigQuery CSV import feature and load the files directly from GCS. If you have transforms to perform on your data, you can run a query just after the load to apply them in SQL.
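A rough sketch of that approach, reusing the uri, delimiter, and table names from the question and assuming the CSV has a header row. BigQuery reads the file straight from GCS, so the dataframe (and the memory it needs) disappears entirely:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=delimiter,   # same delimiter the pandas code used
    skip_leading_rows=1,         # assumes the file has a header row
    autodetect=True,             # or pass an explicit schema instead
)

# BigQuery pulls the 10 GB file directly from GCS, so Cloud Run memory
# stays flat regardless of file size.
load_job = client.load_table_from_uri(uri, table, job_config=job_config)
load_job.result()  # wait for the load to finish

Any transforms the pandas code was doing can then be expressed as a SQL query on the loaded table.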
Say I have a very large CSV file that I'm loading into a BigQuery table. Will this data be available for querying only after the whole file has been uploaded and the job is finished, or will it be available for querying as the file is being uploaded?
BigQuery will commit data from a load job in an all-or-none fashion. Partial results will not appear in the destination table while a job is progressing; results are committed at the end of the load job.
A load job that terminates with an error will commit no rows. However, for use cases where you have poorly sanitized data, you can optionally choose to configure your load job to allow bad/malformed data to be ignored through configuration values like MaxBadRecords. In such cases, a job may have warnings and still commit the successfully processed data, but the commit semantics remain the same (all at the end, or none if the defined threshold for bad data is exceeded).
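For instance, with the bq CLI this threshold is the --max_bad_records flag; in the Python client it is the max_bad_records field of the job configuration. A minimal sketch with placeholder bucket and table names:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    max_bad_records=100,  # tolerate up to 100 malformed rows before failing the job
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/data.csv",          # placeholder source
    "your-project.dataset.destination",   # placeholder destination
    job_config=job_config,
)
load_job.result()
# If more than 100 rows are bad, the job fails and nothing is committed;
# otherwise all successfully parsed rows are committed together at the end.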
I have been trying to import data from DB2 into an HBase table using Sqoop, and it is taking a very long time to even initiate the map and reduce tasks. I can see only Map 0% and Reduce 0% the whole time.
I can run the same query directly in DB2 and the results come back faster than I expected, but when I import the same data into HBase it takes a very long time (10 hours). I created a sample data set in DB2 (150 records) and tried to import it to HBase, and it still takes the same amount of time.
sqoop import --connect jdbc:db2://{hostname}:50001/databasename --username user --password pass --hbase-create-table --hbase-table new_tbl --column-family abc --hbase-row-key=same --query "select a,b,c,d,e,concat(a,e) from table_name where \$CONDITIONS AND a>='2018-08-01 00:00:01' and b<='2018-08-01 00:00:02'" -m 1
I tried adjusting all of these configurations:
yarn.nodemanager.resource.memory-mb=116800
yarn.scheduler.minimum-allocation-mb=4096
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=8192
mapreduce.map.java.opts=-Xmx3072m
mapreduce.reduce.java.opts=-Xmx6144m
yarn.nodemanager.vmem-pmem-ratio=2.1
On the Sqoop side I have also tried tweaking the query and a few configurations:
- -m 4 creates some inconsistency in the records
- removing the filter (the timestamps on a and b) still takes a long time (10 hours)
The HBase performance test results are pretty good:
HBase Performance Evaluation
Elapsed time in milliseconds=705914
Row count=1048550
File Input Format Counters
Bytes Read=778810
File Output Format Counters
Bytes Written=618
real 1m29.968s
user 0m10.523s
sys 0m1.140s
It is hard to suggest anything unless you show sample data and the data types. Extra mappers will work correctly and efficiently only when you have a fair distribution of records among them. If a primary key is available in the table, you can give it as the split column, and the mappers will distribute the workload equally and start fetching slices in a balanced way. While the job is running, you can also see the split-key distribution and record counts in the log itself.
If your cluster does not have enough memory for the requested resources, the job may take a long time, and sometimes it stays in the submitted state for a long time because YARN cannot allocate memory to run it.
Instead of going straight to HBase, you can first try the import with HDFS as the storage location, check the performance, and look at the job details to understand the MapReduce behavior.
I'm having trouble loading huge data into BigQuery.
In GCS, I have many huge files like this:
gs://bucket/many_folders/yyyy/mm/dd/many_files.gz
I want to load it to BigQuery, so first, I tried:
bq load --source_format=NEWLINE_DELIMITED_JSON \
--ignore_unknown_values \
--max_bad_records=2100000000 \
--nosync \
project:dataset.table \
gs://bucket/* \
schema.txt
which failed because it exceeded the "max_bad_records" limit (the files are an aggregation of many types of logs, so they cause many errors).
Then I calculated and found that I need to narrow the scope and load folder by folder, using "*" like:
bq load --source_format=NEWLINE_DELIMITED_JSON \
--ignore_unknown_values \
--max_bad_records=2100000000 \
--nosync \
project:dataset.table \
gs://bucket/many_folders/yyyy/mm/dd/* \
schema.txt
because of the max_bad_records limitation.
But I found this is very slow (because of the parallel-run limitation in BigQuery), and it also exceeds the daily load job limit. I would prefer to avoid this option.
Any idea for solving this situation? I want to load this data as fast as I can.
Thank you for reading.
I solved it by loading GCS data as one column.
Then as a next step I parsed the data.
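The answer does not show the exact commands, but a common way to do this (a sketch under that assumption, with placeholder table names) is to load each line as a single STRING column by choosing a field delimiter that never appears in the data, and then parse the rows with SQL afterwards:

from google.cloud import bigquery

client = bigquery.Client()

# Load every line as one raw STRING column by using a delimiter that
# never occurs in the data, so no row can be rejected as malformed.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\u00fe",  # assumption: this byte never appears in the logs
    quote_character="",        # disable quoting so embedded quotes are kept as-is
    schema=[bigquery.SchemaField("raw", "STRING")],
)

load_job = client.load_table_from_uri(
    "gs://bucket/*",                   # path from the question
    "project.dataset.raw_table",       # placeholder staging table
    job_config=job_config,
)
load_job.result()

# Parse afterwards in SQL, e.g. (field name is just an example):
#   SELECT JSON_EXTRACT_SCALAR(raw, '$.field_a') AS field_a
#   FROM dataset.raw_table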
I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streamingAPI.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
As described here, the query engine will look at both the columnar storage and the streaming buffer, so the query should potentially see the streamed data.
It depends on what you mean by a runtime of 20+ minutes. If the query is run 20 minutes after you create the job, then all data that is in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.