transfer files from S3 bucket to BigQuery every minute using runtime parameter - amazon-s3

i'd like to transfer data from an S3 bucket to BQ every minute using the runtime parameter to define which folder to take the data from but i get : Missing argument for parameter runtime.
the parameter is defined under the --params with "data_path"
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_tranfer \
--target_dataset=$ds \
--schedule=None \
--params='{"destination_table_name_template":$ds,
"data_path":"s3://bucket/test/${runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'

Apparently you have to add the run_time in the destination_table_name_template
so the cmd line works like this:
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_transfer \
--target_dataset=demo \
--schedule=None \
--params='{"destination_table_name_template":"demo_${run_time|\"%Y%m%d%H\"}",
"data_path":"s3://bucket/test/{runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
the runtime has to be the same as the partition_id. above the partition is hourly. the records in the files have to belong to the that partition_id or the jobs will fail. to see your partition ids use:
SELECT table_name, partition_id, total_rows
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL
but, important to mention. it's not a good idea to rely on this service for an every minute ingestion into BigQuery since your jobs get queued and can take several minutes. the service seems to be designed to run only once every 24H.

Related

How to execute a one off S3 data transfer using BigQuery command line tool?

Transfers from S3 to BigQuery works properly if you use the console. On the command line I have also everything working just there is one parameter that I can't find how to configure.
On the console UI you have "Schedule Options" and you could set the repeat as "on demand":
However on the command line I can't find a way to set the transfer as "on demand". Do you know which parameter do I need to pass to set it as on demand? it automatically set a schedule of every 24 hours.
Example run:
bq mk --transfer_config \
--target_dataset=my_dataset \
--display_name="my_transfer" \
--params='{"data_path":"s3://my_bucket/my_path*",
"destination_table_name_template":"testing",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":";",
"skip_leading_rows":"0",
"allow_quoted_newlines":"false",
"allow_jagged_rows":"false",
"access_key_id": "",
"secret_access_key": ""}' \
--data_source=amazon_s3
#how can I setup the schedule options as on demand?
You need to set the disableAutoScheduling parameter to false in the DTS API.
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs#TransferConfig.ScheduleOptions
For example:
{
"dataSourceId":"google_cloud_storage",
"displayName":"bar",
"params":{
"destination_table_name_template":"bart",
"data_path_template":"gs://fuzzy-wuzzy/wiki_1M.csv",
"write_disposition":"APPEND",
"file_format":"CSV",
"max_bad_records":"0",
"field_delimiter":",",
"skip_leading_rows":"0"
},
"emailPreferences":{
"enableFailureEmail":false
},
"notificationPubsubTopic":null,
"destinationDatasetId":"another_test",
"schedule":"",
"scheduleOptions":{
"disableAutoScheduling":true
}
}
To do this via the BigQuery CLI tool, you need to use the no_auto_scheduling flag.

BigQuery CLI: load commands stays pending

I have a csv file on my computer. I would like to load this CSV file into a BigQuery table.
I'm using the following command from a terminal:
bq load --apilog=./logs --field_delimiter=$(printf ';') --skip_leading_rows=1 --autodetect dataset1.table1 mycsvfile.csv myschema.json
The command in my terminal doesn't give any output. In the GCP interface, I see no job being created, which makes me think the request doesn't even reach GCP.
In the logs file (from the --apilog parameter) I get informations about the request being made, and it ends with this:
INFO:googleapiclient.discovery:URL being requested: POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/myproject/jobs?uploadType=resumable&alt=json
and that's it. No matter how long I wait, nothing happens.
You are mixing --autodetect with myschema.json, something like the following shoud work:
bq load --apilog=logs \
--source_format=CSV \
--field_delimiter=';' \
--skip_leading_rows=1 \
--autodetect \
dataset.table \
mycsvfile.csv
If you continue having issues, please post the content of the apilog, the line you shared doesn't seem to be an error. There should be more than one line and normally contains the error in a json structure, for instance:
"reason": "invalid",
"message": "Provided Schema does not match Table project:dataset.table. Field users is missing in new schema"
I'm not sure why you are using
--apilog=./logs
I did not find this in the bq load documentation, please clarify.
Based on that, maybe the bq load command could be the issue, you can try with something like:
bq load \
--autodetect \
--source_format=CSV \
--skip_leading_rows= 1 \
--field_delimiter=';'
dataset1.table1 \
gs://mybucket/mycsvfile.csv \
./myschema.json
If it fails, please check your job list to get the job created, then use bq show to view the information about that job, there you should find an error messag which can help you to determine the cause of the issue.

Copy table structure alone in Bigquery

In Google's Big query, is there a way to clone (copy the structure alone) a table without data?
bq cp doesn't seem to have an option to copy structure without data.
And Create table as Select (CTAS) with filter such as "1=2" does create the table without data. But, it doesn't copy the partitioning/clustering properties.
BigQuery now supports CREATE TABLE LIKE explicitly for this purpose.
See documentation linked below:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_like
You can use DDL and limit 0, but you need to express partitioning and clustering in the query as well
#standardSQL
CREATE TABLE mydataset.myclusteredtable
PARTITION BY DATE(timestamp)
CLUSTER BY
customer_id
AS SELECT * FROM mydataset.myothertable LIMIT 0
If you want to clone structure of table along with partitioning/clustering properties w/o having need in knowing what exactly those partitioning/clustering properties - follow below steps:
Step 1: just copy your_table to new table - let's say your_table_copy. This will obviously copy whole table including all properties (including such like descriptions, partition's expiration etc. - which is very simple to miss if you will try to set them manually) and data. Note: copy is cost free operation
Step 2: To get rid of data in newly created table - run below query statement
SELECT * FROM `project.dataset.your_table_copy` LIMIT 0
while running above make sure you set project.dataset.your_table_copy as destination table with 'Overwrite Table' as 'Write Preference'. Note: this is also cost free step (because of LIMIT 0)
You can easily do both above steps from within Web UI or Command Line or API or any client of your choice - whatever you are most comfortable with
This is possible with the BQ CLI.
First download the schema of the existing table:
bq show --format=prettyjson project:dataset.table | jq '.schema.fields' > table.json
Then, create a new table with the provided schema and required partitioning:
bq mk \
--time_partitioning_type=DAY \
--time_partitioning_field date_field \
--require_partition_filter \
--table dataset.tablename \
table.json
See more info on bq mk options: https://cloud.google.com/bigquery/docs/tables
Install jq with: npm install node-jq
You can use BigQuery API to run a select, as you suggested, which will return an empty result and set the partition and cluster fields.
This is an example (Only partition but cluster works as well)
curl --request POST \
'https://www.googleapis.com/bigquery/v2/projects/myProject/jobs' \
--header 'Authorization: Bearer [YOUR_BEARER_TOKEN]' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '{"configuration":{"query":{"query":"SELECT * FROM `Project.dataset.audit` WHERE 1 = 2","timePartitioning":{"type":"DAY"},"destinationTable":{"datasetId":"datasetId","projectId":"projectId","tableId":"test"},"useLegacySql":false}}}' \
--compressed
Result
Finally, I went with below python script to detect the schema/partitioning/clustering properties to re-create(clone) the clustered table without data. I hope we get an out of the box feature from bigquery to clone a table structure without the need for a script such as this.
import commands
import json
BQ_EXPORT_SCHEMA = "bq show --schema --format=prettyjson %project%:%dataset%.%table% > %path_to_schema%"
BQ_SHOW_TABLE_DEF="bq show --format=prettyjson %project%:%dataset%.%table%"
BQ_MK_TABLE = "bq mk --table --time_partitioning_type=%partition_type% %optional_time_partition_field% --clustering_fields %clustering_fields% %project%:%dataset%.%table% ./%cluster_json_file%"
def create_table_with_cluster(bq_project, bq_dataset, source_table, target_table):
cmd = BQ_EXPORT_SCHEMA.replace('%project%', bq_project)\
.replace('%dataset%', bq_dataset)\
.replace('%table%', source_table)\
.replace('%path_to_schema%', source_table)
commands.getstatusoutput(cmd)
cmd = BQ_SHOW_TABLE_DEF.replace('%project%', bq_project)\
.replace('%dataset%', bq_dataset)\
.replace('%table%', source_table)
(return_value, output) = commands.getstatusoutput(cmd)
bq_result = json.loads(output)
clustering_fields = bq_result["clustering"]["fields"]
time_partitioning = bq_result["timePartitioning"]
time_partitioning_type = time_partitioning["type"]
time_partitioning_field = ""
if "field" in time_partitioning:
time_partitioning_field = "--time_partitioning_field " + time_partitioning["field"]
clustering_fields_list = ",".join(str(x) for x in clustering_fields)
cmd = BQ_MK_TABLE.replace('%project%', bq_project)\
.replace('%dataset%', bq_dataset)\
.replace('%table%', target_table)\
.replace('%cluster_json_file%', source_table)\
.replace('%clustering_fields%', clustering_fields_list)\
.replace('%partition_type%', time_partitioning_type)\
.replace('%optional_time_partition_field%', time_partitioning_field)
commands.getstatusoutput(cmd)
create_table_with_cluster('test_project', 'test_dataset', 'source_table', 'target_table')

sqoop import staging table issue

I am trying to import the data from teradata into HDFS location.
I have access to view for that database. So I created a staging table in another database. But when I try to run the code it says error
Error: Running Sqoop version: 1.4.6.2.6.5.0-292 18/12/23 21:49:41 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 18/12/23 21:49:41 ERROR tool.BaseSqoopTool: Error parsing arguments for import:staging-table, t_hit_data_01_staging, –clear-staging-table, --query, select * from table1 where cast(date1 as Date) <= date '2017-09-02' and $CONDITIONS, --target-dir, <>, --split-by, date1, -m, 25
I have given the staging table details in the code and ran it. but throws error.
(Error parsing arguments from import and as un-recognized arguments from staging table)
sqoop import \
--connect jdbc:teradata://<server_link>/Database=db01 \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username <UN> \
--password <PWD> \
–-staging-table db02.table1_staging –clear-staging-table \
--query "select * from table1 where cast(date1 as Date) <= date '2017-09-02' and \$CONDITIONS " \
--target-dir '<hdfs location>' \
--split-by date1 -m 25`
The data should be loaded into the HDFS location, using the staging table in another database in Teradata.Then later on changing the where clause it sqoop should create another file under the same folder in HDFS location. Example: part-0000, next file as part -0001 etc.,
I dont think there is a staging option available for import command.
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html

How to export from BigQuery to Datastore?

I have tables in BigQuery which I want to export and import in Datastore.
How to achieve that?
Table from BigQuery can be exported and imported to your datastore.
Download the jar file from https://github.com/yu-iskw/bigquery-to-datastore/releases
Then run the command
java -cp bigquery-to-datastore-bundled-0.5.1.jar com.github.yuiskw.beam.BigQuery2Datastore --project=yourprojectId --runner=DataflowRunner --inputBigQueryDataset=datastore --inputBigQueryTable=metainfo_internal_2 --outputDatastoreNamespace=default --outputDatastoreKind=meta_internal --keyColumn=key --indexedColumns=column1,column2 --tempLocation=gs://gsheetbackup_live/temp --gcpTempLocation=gs://gsheetlogfile_live/temp
--tempLocation and --gcpTempLocation are valid cloud storage bucket urls.
--keyColumn=key - the key here is the unique field on your big query table
2020 anwer,
use GoogleCloudPlatform/DataflowTemplates, BigQueryToDatastore
# Builds the Java project and uploads an artifact to GCS
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.BigQueryToDatastore \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=<project-id> \
--region=<region-name> \
--stagingLocation=gs://<bucket-name>/staging \
--tempLocation=gs://<bucket-name>/temp \
--templateLocation=gs://<bucket-name>/templates/<template-name>.json \
--runner=DataflowRunner"
# Uses the GCS artifact to run the transfer job
gcloud dataflow jobs run <job-name> \
--gcs-location=<template-location> \
--zone=<zone> \
--parameters "\
readQuery=SELECT * FROM <dataset>.<table>,readIdColumn=<id>,\
invalidOutputPath=gs://your-bucket/path/to/error.txt,\
datastoreWriteProjectId=<project-id>,\
datastoreWriteNamespace=<namespace>,\
datastoreWriteEntityKind=<kind>,\
errorWritePath=gs://your-bucket/path/to/errors.txt"
I hope this will get a proper user interface in GCP Console on day! (as this is already possible for Pub/Sub to BigQuery using Dataflow SQL)
You may export BigQuery data to CSV, then import CSV into Datastore. The first step is easy and well documented https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in_bigquery. For the second step, there are many resources that help you achieve that. For example,
https://groups.google.com/forum/#!topic/google-appengine/L64wByP7GAY
Import CSV into google cloud datastore