BigQuery CLI: load command stays pending - google-bigquery

I have a csv file on my computer. I would like to load this CSV file into a BigQuery table.
I'm using the following command from a terminal:
bq load --apilog=./logs --field_delimiter=$(printf ';') --skip_leading_rows=1 --autodetect dataset1.table1 mycsvfile.csv myschema.json
The command in my terminal doesn't give any output. In the GCP interface, I see no job being created, which makes me think the request doesn't even reach GCP.
In the log file (from the --apilog parameter) I get information about the request being made, and it ends with this:
INFO:googleapiclient.discovery:URL being requested: POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/myproject/jobs?uploadType=resumable&alt=json
and that's it. No matter how long I wait, nothing happens.

You are mixing --autodetect with myschema.json; something like the following should work:
bq load --apilog=logs \
--source_format=CSV \
--field_delimiter=';' \
--skip_leading_rows=1 \
--autodetect \
dataset.table \
mycsvfile.csv
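If you prefer to use the schema file instead of autodetection, a variant like the following should also work (drop --autodetect and pass the schema as the last argument; file names taken from your question):
bq load --apilog=logs \
--source_format=CSV \
--field_delimiter=';' \
--skip_leading_rows=1 \
dataset.table \
mycsvfile.csv \
myschema.json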
If you continue having issues, please post the content of the apilog; the line you shared doesn't look like an error. The log should contain more than one line and normally includes the error in a JSON structure, for instance:
"reason": "invalid",
"message": "Provided Schema does not match Table project:dataset.table. Field users is missing in new schema"

I'm not sure why you are using
--apilog=./logs
I did not find this in the bq load documentation; please clarify.
Based on that, maybe the bq load command itself could be the issue. You can try something like:
bq load \
--autodetect \
--source_format=CSV \
--skip_leading_rows=1 \
--field_delimiter=';' \
dataset1.table1 \
gs://mybucket/mycsvfile.csv \
./myschema.json
If it fails, check your job list to find the job that was created, then use bq show to view the information about that job; there you should find an error message which can help you determine the cause of the issue.
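As a sketch of that last step (the job ID below is hypothetical; the flags are the standard bq ones):
# List the most recent jobs in the project
bq ls -j -n 10
# Show the full details (including any errorResult) of one job
bq show -j --format=prettyjson bqjob_r1234567890_0001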

Related

transfer files from S3 bucket to BigQuery every minute using runtime parameter

I'd like to transfer data from an S3 bucket to BQ every minute, using the runtime parameter to define which folder to take the data from, but I get: Missing argument for parameter runtime.
The parameter is defined under --params with "data_path":
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_tranfer \
--target_dataset=$ds \
--schedule=None \
--params='{"destination_table_name_template":$ds,
"data_path":"s3://bucket/test/${runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
Apparently you have to add the run_time in the destination_table_name_template,
so the command line works like this:
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_transfer \
--target_dataset=demo \
--schedule=None \
--params='{"destination_table_name_template":"demo_${run_time|\"%Y%m%d%H\"}",
"data_path":"s3://bucket/test/{runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
The runtime has to be the same as the partition_id. Above, the partitioning is hourly. The records in the files have to belong to that partition_id or the jobs will fail. To see your partition IDs, use:
SELECT table_name, partition_id, total_rows
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL
But it's important to mention that it's not a good idea to rely on this service for every-minute ingestion into BigQuery, since your jobs get queued and can take several minutes to run. The service seems to be designed to run only once every 24 hours.
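To see how the runs actually behave (including how long they sit queued), you can list the run history of the transfer config; a minimal sketch, assuming the resource name returned by bq mk and a US location:
# List transfer configs to get the resource name
bq ls --transfer_config --transfer_location=us
# List the runs and their states for one config (resource name is a placeholder)
bq ls --transfer_run --run_attempt='LATEST' \
projects/1234567890/locations/us/transferConfigs/abcd-1234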

How to execute a one off S3 data transfer using BigQuery command line tool?

Transfers from S3 to BigQuery work properly if I use the console. On the command line I also have everything working; there is just one parameter that I can't find how to configure.
On the console UI you have "Schedule Options" and you could set the repeat as "on demand":
However, on the command line I can't find a way to set the transfer as "on demand". Do you know which parameter I need to pass to set it as on demand? Otherwise it automatically sets a schedule of every 24 hours.
Example run:
bq mk --transfer_config \
--target_dataset=my_dataset \
--display_name="my_transfer" \
--params='{"data_path":"s3://my_bucket/my_path*",
"destination_table_name_template":"testing",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":";",
"skip_leading_rows":"0",
"allow_quoted_newlines":"false",
"allow_jagged_rows":"false",
"access_key_id": "",
"secret_access_key": ""}' \
--data_source=amazon_s3
# How can I set up the schedule options as on demand?
You need to set the disableAutoScheduling parameter to true under scheduleOptions in the DTS API.
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs#TransferConfig.ScheduleOptions
For example:
{
  "dataSourceId": "google_cloud_storage",
  "displayName": "bar",
  "params": {
    "destination_table_name_template": "bart",
    "data_path_template": "gs://fuzzy-wuzzy/wiki_1M.csv",
    "write_disposition": "APPEND",
    "file_format": "CSV",
    "max_bad_records": "0",
    "field_delimiter": ",",
    "skip_leading_rows": "0"
  },
  "emailPreferences": {
    "enableFailureEmail": false
  },
  "notificationPubsubTopic": null,
  "destinationDatasetId": "another_test",
  "schedule": "",
  "scheduleOptions": {
    "disableAutoScheduling": true
  }
}
To do this via the BigQuery CLI tool, you need to use the no_auto_scheduling flag.
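Applied to the command from the question, that would look roughly like this (a sketch, not tested):
bq mk --transfer_config \
--target_dataset=my_dataset \
--display_name="my_transfer" \
--no_auto_scheduling \
--params='{"data_path":"s3://my_bucket/my_path*",
"destination_table_name_template":"testing",
"file_format":"CSV",
"access_key_id": "",
"secret_access_key": ""}' \
--data_source=amazon_s3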

How to export from BigQuery to Datastore?

I have tables in BigQuery which I want to export and import in Datastore.
How to achieve that?
A table from BigQuery can be exported and imported into your Datastore.
Download the jar file from https://github.com/yu-iskw/bigquery-to-datastore/releases
Then run the command
java -cp bigquery-to-datastore-bundled-0.5.1.jar \
com.github.yuiskw.beam.BigQuery2Datastore \
--project=yourprojectId \
--runner=DataflowRunner \
--inputBigQueryDataset=datastore \
--inputBigQueryTable=metainfo_internal_2 \
--outputDatastoreNamespace=default \
--outputDatastoreKind=meta_internal \
--keyColumn=key \
--indexedColumns=column1,column2 \
--tempLocation=gs://gsheetbackup_live/temp \
--gcpTempLocation=gs://gsheetlogfile_live/temp
--tempLocation and --gcpTempLocation must be valid Cloud Storage bucket URLs.
--keyColumn=key - the key here is the unique field in your BigQuery table.
2020 answer:
use the BigQueryToDatastore template from GoogleCloudPlatform/DataflowTemplates.
# Builds the Java project and uploads an artifact to GCS
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.BigQueryToDatastore \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=<project-id> \
--region=<region-name> \
--stagingLocation=gs://<bucket-name>/staging \
--tempLocation=gs://<bucket-name>/temp \
--templateLocation=gs://<bucket-name>/templates/<template-name>.json \
--runner=DataflowRunner"
# Uses the GCS artifact to run the transfer job
gcloud dataflow jobs run <job-name> \
--gcs-location=<template-location> \
--zone=<zone> \
--parameters "\
readQuery=SELECT * FROM <dataset>.<table>,readIdColumn=<id>,\
invalidOutputPath=gs://your-bucket/path/to/error.txt,\
datastoreWriteProjectId=<project-id>,\
datastoreWriteNamespace=<namespace>,\
datastoreWriteEntityKind=<kind>,\
errorWritePath=gs://your-bucket/path/to/errors.txt"
I hope this will get a proper user interface in the GCP Console one day! (This is already possible for Pub/Sub to BigQuery using Dataflow SQL.)
You may export BigQuery data to CSV, then import the CSV into Datastore. The first step is easy and well documented: https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in_bigquery. For the second step, there are many resources that help you achieve that. For example:
https://groups.google.com/forum/#!topic/google-appengine/L64wByP7GAY
Import CSV into google cloud datastore
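For the first step, a minimal bq extract sketch (dataset, table and bucket names are placeholders):
bq extract --destination_format=CSV \
'mydataset.mytable' \
gs://mybucket/export/mytable-*.csv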

Pentaho Spoon Job Executes Fine, Endless Loop in Kitchen

Without getting too much into the weeds, I have a Pentaho PDI job with multiple sub-transformations and sub-jobs (ETL from MySQL to Postgres). This job runs exactly as expected from Spoon, with no errors, but when I run the job with the following command, I am met with an endless loop error at the first step where a parameter would need to be defined and passed from within the job (the named params from the command seem to integrate fine). The command I am using is as follows:
sudo /bin/sh kitchen.sh \
-rep=KettleFileRepo \
-dir=M2P \
-job=ETL-M2P \
-level=Rowlevel \
-param:MY.PAR.LOADTYPE=full \
-param:MY.PAR.TABLELIST=table1 \
-param:MY.PAR.TENANTS=tenant1 \
/
Has anyone run into this type of issue with a discrepancy between Spoon and Kitchen? Is there some sort of config or command line option that I am missing? I am running version 6.0.1.0-386 on OS X 10.11.4.
If you think more details would be beneficial please let me know and I can provide whatever is necessary.
I am not aware of any discrepancy between Spoon and Kitchen. Are you sure it's not something in the ETL that's causing the loop? I would suggest going through your ETL in detail.
Another thing you can try, to debug, is to run only part of the job in Kitchen and keep adding more as you see it succeed.
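For example, you could run one of the sub-transformations on its own with Pan, using the same repository and parameters, to see whether the loop happens there too (the transformation name below is a placeholder):
sudo /bin/sh pan.sh \
-rep=KettleFileRepo \
-dir=M2P \
-trans=my-sub-transformation \
-level=Rowlevel \
-param:MY.PAR.LOADTYPE=full \
-param:MY.PAR.TABLELIST=table1 \
-param:MY.PAR.TENANTS=tenant1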

bigquery pass property in commandline

I am getting the following error in BigQuery:
Error: Response too large to return.
After a couple of Google searches I found that the workaround is to set configuration.query.allowLargeResults=true.
But I'm not sure how to pass this property value in the bq command line tool.
Any help?
Thanks
$ bq help query
USAGE: bq.py [--global_flags] <command> [--command_flags] [args]
[...]
--[no]allow_large_results: Enables larger destination table sizes
--destination_table: Name of destination table for query results.
So:
$ bq query --allow_large_results --destination_table "dataset.table" "SELECT 1"
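Note that allow_large_results only matters for legacy SQL. If you run the query with standard SQL (--nouse_legacy_sql), large results are allowed by default, so something like this should also work:
$ bq query --nouse_legacy_sql --destination_table "dataset.table" "SELECT 1"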