I have tables in BigQuery that I want to export and import into Datastore.
How can I achieve that?
A BigQuery table can be exported and imported into your Datastore.
Download the jar file from https://github.com/yu-iskw/bigquery-to-datastore/releases
Then run the command:
java -cp bigquery-to-datastore-bundled-0.5.1.jar com.github.yuiskw.beam.BigQuery2Datastore \
  --project=yourprojectId \
  --runner=DataflowRunner \
  --inputBigQueryDataset=datastore \
  --inputBigQueryTable=metainfo_internal_2 \
  --outputDatastoreNamespace=default \
  --outputDatastoreKind=meta_internal \
  --keyColumn=key \
  --indexedColumns=column1,column2 \
  --tempLocation=gs://gsheetbackup_live/temp \
  --gcpTempLocation=gs://gsheetlogfile_live/temp
--tempLocation and --gcpTempLocation must be valid Cloud Storage bucket URLs.
--keyColumn=key: here, key is the unique field in your BigQuery table.
2020 answer:
Use the BigQueryToDatastore template from GoogleCloudPlatform/DataflowTemplates.
# Builds the Java project and uploads an artifact to GCS
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.BigQueryToDatastore \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=<project-id> \
--region=<region-name> \
--stagingLocation=gs://<bucket-name>/staging \
--tempLocation=gs://<bucket-name>/temp \
--templateLocation=gs://<bucket-name>/templates/<template-name>.json \
--runner=DataflowRunner"
# Uses the GCS artifact to run the transfer job
gcloud dataflow jobs run <job-name> \
--gcs-location=<template-location> \
--zone=<zone> \
--parameters "\
readQuery=SELECT * FROM <dataset>.<table>,readIdColumn=<id>,\
invalidOutputPath=gs://your-bucket/path/to/error.txt,\
datastoreWriteProjectId=<project-id>,\
datastoreWriteNamespace=<namespace>,\
datastoreWriteEntityKind=<kind>,\
errorWritePath=gs://your-bucket/path/to/errors.txt"
I hope this will get a proper user interface in the GCP Console one day! (This is already possible for Pub/Sub to BigQuery using Dataflow SQL.)
You may export BigQuery data to CSV, then import the CSV into Datastore. The first step is easy and well documented: https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in_bigquery. For the second step, there are many resources to help you achieve that. For example:
https://groups.google.com/forum/#!topic/google-appengine/L64wByP7GAY
Import CSV into google cloud datastore
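As an illustration of the second step, a very small sketch using the Datastore Python client (the kind, key column, and file name below are placeholders, not values from the question; for anything non-trivial you would batch the writes):

import csv

from google.cloud import datastore

client = datastore.Client()

with open("my_table.csv") as f:  # placeholder file exported from BigQuery
    for row in csv.DictReader(f):
        # Placeholder kind and key column; adjust to your schema.
        entity = datastore.Entity(key=client.key("MyKind", row["key"]))
        entity.update(row)
        client.put(entity)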
I'd like to transfer data from an S3 bucket to BigQuery every minute, using the runtime parameter to define which folder to take the data from, but I get: Missing argument for parameter runtime.
The parameter is defined under --params with "data_path":
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_tranfer \
--target_dataset=$ds \
--schedule=None \
--params='{"destination_table_name_template":$ds,
"data_path":"s3://bucket/test/${runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
Apparently you have to add run_time in destination_table_name_template, so the command line works like this:
bq mk \
--transfer_config \
--project_id=$project_id \
--data_source=amazon_s3 \
--display_name=s3_transfer \
--target_dataset=demo \
--schedule=None \
--params='{"destination_table_name_template":"demo_${run_time|\"%Y%m%d%H\"}",
"data_path":"s3://bucket/test/{runtime|\"%M\"}/*",
"access_key_id":"***","secret_access_key":"***","file_format":"JSON"}'
The runtime has to be the same as the partition_id. Above, the partition is hourly. The records in the files have to belong to that partition_id or the jobs will fail. To see your partition IDs, use:
SELECT table_name, partition_id, total_rows
FROM `mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL
But, important to mention: it's not a good idea to rely on this service for every-minute ingestion into BigQuery, since your jobs get queued and can take several minutes. The service seems to be designed to run only once every 24 hours.
I wanted to check on a similar error to the one mentioned in the post https://stackoverflow.com/questions/37298504/google-dataflow-job-and-bigquery-failing-on-different-regions?rq=1.
I am facing a similar issue in my Dataflow job, where I am getting the error below:
2021-03-10T06:02:26.115216545ZWorkflow failed. Causes: S01:Read File from GCS/Read+String To BigQuery Row+Write to BigQuery/NativeWrite failed., BigQuery import job "dataflow_job_15712075439082970546-B" failed., BigQuery job "dataflow_job_15712075439082970546-B" in project "whr-asia-datalake-prod" finished with error(s): errorResult: Cannot read and write in different locations: source: US, destination: asia-south1, error: Cannot read and write in different locations: source: US, destination: asia-south1
This is the error I get when I try to run the code using a Cloud Function trigger. Please find the Cloud Function code below. Both my source data and my target BigQuery dataset reside in asia-south1.
"""
Google Cloud Function used for executing Dataflow jobs.
"""
from googleapiclient.discovery import build
import time

def df_load_function(file, context):
    filesnames = [
        '5667788_OPTOUT_',
        'WHR_AD_EMAIL_CNSNT_RESP_'
    ]
    # Check the uploaded file and run related dataflow jobs.
    for i in filesnames:
        if 'inbound/{}'.format(i) in file['name']:
            print("Processing file: {filename}".format(filename=file['name']))
            project = '<my project>'
            inputfile = 'gs://<my bucket>/inbound/' + file['name']
            job = 'df_load_wave1_{}'.format(i)
            template = 'gs://<my bucket>/template/df_load_wave1_{}'.format(i)
            location = 'us-central1'
            dataflow = build('dataflow', 'v1b3', cache_discovery=False)
            request = dataflow.projects().locations().templates().launch(
                projectId=project,
                gcsPath=template,
                location=location,
                body={
                    'jobName': job,
                    "environment": {
                        "workerZone": "us-central1-a"
                    }
                }
            )
            # Execute the dataflow job
            response = request.execute()
            job_id = response["job"]["id"]
I have kept the location and workerZone as us-central1 and us-central1-a respectively. I need to run my Dataflow job in us-central1 due to some resource issues, but read and write data from asia-south1. What else do I need to add in the Cloud Function so that the region and zone are both us-central1, but data is read from and written to asia-south1?
However, when I run the job manually from Cloud Shell using the commands below, it works fine and the data is loaded. Here both the region and zone are us-central1:
python -m <python script where the data is read from bucket and load big query> \
--project <my_project> \
--region us-central1 \
--runner DataflowRunner \
--staging_location gs://<bucket_name>/staging \
--temp_location gs://<bucket_name>/temp \
--subnetwork https://www.googleapis.com/compute/v1/projects/whr-ios-network/regions/us-central1/subnetworks/<subnetwork> \
--network projects/whr-ios-network/global/networks/<name> \
--zone us-central1-a \
--save_main_session
Please help, anyone. I have been struggling with this issue.
I was able to fix the error below:
"2021-03-10T06:02:26.115216545ZWorkflow failed. Causes: S01:Read File from GCS/Read+String To BigQuery Row+Write to BigQuery/NativeWrite failed., BigQuery import job "dataflow_job_15712075439082970546-B" failed., BigQuery job "dataflow_job_15712075439082970546-B" in project "whr-asia-datalake-prod" finished with error(s): errorResult: Cannot read and write in different locations: source: US, destination: asia-south1, error: Cannot read and write in different locations: source: US, destination: asia-south1"
I just changed my Cloud Function to add the temp location of my asia-south1 bucket, because although I was providing an asia-south1 temp location while creating the template, the BigQuery IO in my Dataflow job was trying to use a temp location in us-central1 rather than asia-south1, hence the above error.
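For reference, a rough sketch of what that change looks like in the Cloud Function above (the asia-south1 bucket name is a placeholder; the rest of the function stays the same):

request = dataflow.projects().locations().templates().launch(
    projectId=project,
    gcsPath=template,
    location='us-central1',  # the Dataflow job itself still runs in us-central1
    body={
        'jobName': job,
        'environment': {
            'workerZone': 'us-central1-a',
            # Temp files go to an asia-south1 bucket (placeholder name), so the
            # BigQuery load jobs run in the same location as the dataset.
            'tempLocation': 'gs://<asia-south1 bucket>/temp'
        }
    }
)
response = request.execute()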
Transfers from S3 to BigQuery work properly if you use the console. On the command line I also have everything working, except for one parameter that I can't find how to configure.
In the console UI you have "Schedule options", where you can set the repeat frequency to "on-demand".
However, on the command line I can't find a way to set the transfer as on-demand. Do you know which parameter I need to pass to set it as on-demand? It automatically sets a schedule of every 24 hours.
Example run:
bq mk --transfer_config \
--target_dataset=my_dataset \
--display_name="my_transfer" \
--params='{"data_path":"s3://my_bucket/my_path*",
"destination_table_name_template":"testing",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":";",
"skip_leading_rows":"0",
"allow_quoted_newlines":"false",
"allow_jagged_rows":"false",
"access_key_id": "",
"secret_access_key": ""}' \
--data_source=amazon_s3
# How can I set up the schedule options as on-demand?
You need to set the disableAutoScheduling parameter to true in the DTS API.
https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs#TransferConfig.ScheduleOptions
For example:
{
  "dataSourceId": "google_cloud_storage",
  "displayName": "bar",
  "params": {
    "destination_table_name_template": "bart",
    "data_path_template": "gs://fuzzy-wuzzy/wiki_1M.csv",
    "write_disposition": "APPEND",
    "file_format": "CSV",
    "max_bad_records": "0",
    "field_delimiter": ",",
    "skip_leading_rows": "0"
  },
  "emailPreferences": {
    "enableFailureEmail": false
  },
  "notificationPubsubTopic": null,
  "destinationDatasetId": "another_test",
  "schedule": "",
  "scheduleOptions": {
    "disableAutoScheduling": true
  }
}
To do this via the bq command-line tool, you need to use the --no_auto_scheduling flag.
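If you are creating the transfer config from code rather than the console or CLI, a rough sketch with the Python Data Transfer client would look something like this (the project, dataset, table, and S3 path are placeholders, not values from the question):

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",      # placeholder dataset
    display_name="my_transfer",
    data_source_id="amazon_s3",
    params={
        "data_path": "s3://my_bucket/my_path*",          # placeholder path
        "destination_table_name_template": "testing",    # placeholder table
        "file_format": "CSV",
        "access_key_id": "***",
        "secret_access_key": "***",
    },
    # Leaving the schedule empty and disabling auto-scheduling makes the
    # transfer on-demand only, matching the JSON example above.
    schedule_options=bigquery_datatransfer.ScheduleOptions(
        disable_auto_scheduling=True
    ),
)

client.create_transfer_config(
    parent=client.common_project_path("my-project"),  # placeholder project
    transfer_config=transfer_config,
)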
I have a csv file on my computer. I would like to load this CSV file into a BigQuery table.
I'm using the following command from a terminal:
bq load --apilog=./logs --field_delimiter=$(printf ';') --skip_leading_rows=1 --autodetect dataset1.table1 mycsvfile.csv myschema.json
The command in my terminal doesn't give any output. In the GCP interface, I see no job being created, which makes me think the request doesn't even reach GCP.
In the logs file (from the --apilog parameter) I get information about the request being made, and it ends with this:
INFO:googleapiclient.discovery:URL being requested: POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/myproject/jobs?uploadType=resumable&alt=json
and that's it. No matter how long I wait, nothing happens.
You are mixing --autodetect with myschema.json; something like the following should work:
bq load --apilog=logs \
--source_format=CSV \
--field_delimiter=';' \
--skip_leading_rows=1 \
--autodetect \
dataset.table \
mycsvfile.csv
If you continue having issues, please post the content of the apilog; the line you shared doesn't seem to be an error. There should be more than one line, and it normally contains the error in a JSON structure, for instance:
"reason": "invalid",
"message": "Provided Schema does not match Table project:dataset.table. Field users is missing in new schema"
I'm not sure why you are using
--apilog=./logs
I did not find this in the bq load documentation; please clarify.
Based on that, maybe the bq load command itself is the issue. You can try with something like:
bq load \
--autodetect \
--source_format=CSV \
--skip_leading_rows=1 \
--field_delimiter=';' \
dataset1.table1 \
gs://mybucket/mycsvfile.csv \
./myschema.json
If it fails, please check your job list to find the job that was created, then use bq show to view the information about that job; there you should find an error message which can help you determine the cause of the issue.
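If you prefer inspecting jobs from Python instead of bq show, a small sketch with the BigQuery client (the project id is a placeholder) is:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

# Print state and error details for the most recent load jobs.
for job in client.list_jobs(max_results=10):
    if job.job_type == "load":
        print(job.job_id, job.state, job.error_result)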
I've built an application using DynamoDB Local and now I'm at the point where I want to setup on AWS. I've gone through numerous tools but have had no success finding a way to take my local DB and setup the schema and migrate data into AWS.
For example, I can get the data into a CSV format but AWS has no way to recognize that. It seems that I'm forced to create a Data Pipeline... Does anyone have a better way to do this?
Thanks in advance
As was mentioned earlier, DynamoDB Local is there for testing purposes. However, you can still migrate your data if you need to. One approach would be to save the data into some format, like JSON or CSV, store it in S3, and then use something like Lambdas or your own server to read from S3 and save into your new DynamoDB. As for setting up the schema, you can use the same code you used to create your local table to create the remote table via the AWS SDK.
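As a rough illustration, here is a minimal boto3 sketch that copies items straight from the local table into the AWS one (the table name and local endpoint are assumptions; for large tables you would go through S3 as described above):

import boto3

# DynamoDB Local (assumed at the default endpoint) and the real AWS target.
local = boto3.resource("dynamodb", endpoint_url="http://localhost:8000", region_name="us-east-1")
remote = boto3.resource("dynamodb")

src = local.Table("my_table")   # placeholder table name
dst = remote.Table("my_table")

# Paginate through the local table.
response = src.scan()
items = response["Items"]
while "LastEvaluatedKey" in response:
    response = src.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

# Batch-write everything into the AWS table.
with dst.batch_writer() as batch:
    for item in items:
        batch.put_item(Item=item)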
You can create a standalone application to get the list of tables from the local DynamoDB and create them in your AWS account; after that you can get all the data for each table and save it.
I'm not sure which language you are familiar with, but I will explain some APIs in Java that might help you.
DynamoDB.listTables();
DynamoDB.createTable(CreateTableRequest);
An example of how to create a table using the above API:
ProvisionedThroughput provisionedThroughput = new ProvisionedThroughput(1L, 1L);
try {
    CreateTableRequest groupTableRequest = mapper.generateCreateTableRequest(Group.class); // 1
    groupTableRequest.setProvisionedThroughput(provisionedThroughput); // 2
    // groupTableRequest.getGlobalSecondaryIndexes().forEach(index -> index.setProvisionedThroughput(provisionedThroughput)); // 3
    Table groupTable = client.createTable(groupTableRequest); // 4
    groupTable.waitForActive(); // 5
} catch (ResourceInUseException e) {
    log.debug("Group table already exist");
}
1 - you create the CreateTableRequest from the mapping
2 - set the provisioned throughput; this will vary depending on your requirements
3 - if the table has a global secondary index you can use this line (optional)
4 - the actual table is created here
5 - the thread blocks until the table becomes active
I didn't mention the APIs related to data access (insert, etc.); I assume you're familiar with them since you already use them with the local DynamoDB.
I did a little work setting up my local dev environment. I use SAM to create the DynamoDB tables in AWS. I didn't want to do the work twice, so I ended up copying the schema from AWS to my local instance. The same approach can work the other way around.
aws dynamodb describe-table --table-name chess_lobby \
| jq '.Table' \
| jq 'del(.TableArn)' \
| jq 'del(.TableSizeBytes)' \
| jq 'del(.TableStatus)' \
| jq 'del(.TableId)' \
| jq 'del(.ItemCount)' \
| jq 'del(.CreationDateTime)' \
| jq 'del(.GlobalSecondaryIndexes[].IndexSizeBytes)' \
| jq 'del(.ProvisionedThroughput.NumberOfDecreasesToday)' \
| jq 'del(.GlobalSecondaryIndexes[].IndexStatus)' \
| jq 'del(.GlobalSecondaryIndexes[].IndexArn)' \
| jq 'del(.GlobalSecondaryIndexes[].ItemCount)' \
| jq 'del(.GlobalSecondaryIndexes[].ProvisionedThroughput.NumberOfDecreasesToday)' > chess_lobby.json
aws dynamodb create-table \
--cli-input-json file://chess_lobby.json \
--endpoint-url http://localhost:8000
The top command uses the AWS CLI's describe-table capability to get the schema JSON. Then I use jq to delete all unneeded keys, since create-table is strict with its parameter validation. Then I can use create-table to create the table in the local environment by passing the --endpoint-url parameter.
You can use the --endpoint-url parameter on the top command instead to fetch your local schema and then use the create-table without the --endpoint-url parameter to create it directly in AWS.
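The same round trip can also be done from Python with boto3 if you prefer; a sketch, reusing the table name from the example above and assuming DynamoDB Local on port 8000 (add GlobalSecondaryIndexes handling if you use them):

import boto3

local = boto3.client("dynamodb", endpoint_url="http://localhost:8000", region_name="us-east-1")
remote = boto3.client("dynamodb")  # uses your real AWS credentials/region

# Fetch the schema from the local instance.
desc = local.describe_table(TableName="chess_lobby")["Table"]

# create-table only accepts a subset of describe-table's output, so pass just
# the schema-related keys instead of deleting the extras with jq.
remote.create_table(
    TableName=desc["TableName"],
    AttributeDefinitions=desc["AttributeDefinitions"],
    KeySchema=desc["KeySchema"],
    BillingMode="PAY_PER_REQUEST",  # assumption: on-demand capacity
)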