I have just working in a cloud database project involving some big data processing and need to quickly migrate data directly from bigquery to snowflake. I was thinking if there is any direct way to push data from bigquery to snowflake without any intermediate storage.
Please let me know there is a utility or api or anything other way bigquery provides to push data to snowflake. Appreciate your help.
If this is a one-time need, you can perform it in two steps, exporting the table data to Google's Cloud Storage and then bulk loading from the Cloud Storage onto Snowflake.
For the export stage, use BigQuery's inbuilt export functionality or the BigQuery Storage API if the data is very large. All of the export formats supported by BigQuery (CSV, JSON or AVRO) are readily supported by Snowflake, so a data transformation step may not be required.
Once the export's ready on the target cloud storage address, use Snowflake's COPY INTO <table> with the external location option (or named external stages) to copy them into a Snowflake-managed table.
If you want to move all your data to snowflake from bigquery, so I have few assumption
1. You have all your data in gcs itself.
2. You can connect the snowflake cluster from gcp.
Now your problem shrinks down to moving your data from gcs bucket to snowflake directly.
SO now you can run copy command like
COPY INTO [<namespace>.]<table_name>
FROM { internalStage | externalStage | externalLocation }
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
[ copyOptions ]
[ VALIDATION_MODE = RETURN_<n>_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS ]
where external location is
externalLocation ::=
'gcs://<bucket>[/<path>]'
[ STORAGE_INTEGRATION = <integration_name> ]
[ ENCRYPTION = ( [ TYPE = 'GCS_SSE_KMS' ] [ KMS_KEY_ID = '<string>' ] | [ TYPE = NONE ] ) ]
The documentation for the same can be found here and here
If the Snowflake instance is hosted on a different vendor (e.g. AWS or Azure), there is a tool that I wrote that could be of use called sling. There is a blog entry about this in detail, but essentially you run one command after setting up your credentials:
$ sling run --src-conn BIGQUERY --src-stream segment.team_activity --tgt-conn SNOWFLAKE --tgt-object public.activity_teams --mode full-refresh
11:37AM INF connecting to source database (bigquery)
11:37AM INF connecting to target database (snowflake)
11:37AM INF reading from source database
11:37AM INF writing to target database [mode: full-refresh]
11:37AM INF streaming data
11:37AM INF dropped table public.activity_teams
11:38AM INF created table public.activity_teams
11:38AM INF inserted 77668 rows
11:38AM INF execution succeeded
Related
I am writing a beam job that is a simple 1:1 ETL from a binary protobuf file stored in GCS into BigQuery. The table schema is quite large, and generated automatically from a representative protobuf.
I am encountering behavior where the BigQuery table is created successfully, but no records are inserted. I have confirmed that records are being generated by the earlier stage, and when I use a normal file sink I can confirm that records are written.
Does anyone know why this is happening?
Logs:
WARNING:root:Inferring Schema...
WARNING:root:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
WARNING:root:Defining Beam Pipeline...
<PATH REDACTED>/venv/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py:1145: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
experiments = p.options.view_as(DebugOptions).experiments or []
WARNING:root:Running Beam Pipeline...
WARNING:root:extracted {'counters': [MetricResult(key=MetricKey(step=extract_games, metric=MetricName(namespace=__main__.ExtractGameProtobuf, name=extracted_games), labels={}), committed=8, attempted=8)], 'distributions': [], 'gauges': []} games
Pipeline Source:
def main(args):
DEFAULT_REPLAY_IDS_PATH = "./replay_ids.txt"
DEFAULT_BQ_TABLE_OUT = "<PROJECT REDACTED>:<DATASET REDACTED>.games"
# configure logging
logging.basicConfig(level=logging.WARNING)
# set up replay source
replay_source = ETLReplayRemoteSource.default()
# TODO: load the example replay and parse schema
logging.warning("Inferring Schema...")
sample_replay = replay_source.load_replay(DEFAULT_REPLAY_IDS[0])
game_schema = ProtobufToBigQuerySchemaGenerator(
sample_replay.analysis.DESCRIPTOR).schema()
# print("GAME SCHEMA:\n{}".format(game_schema)) # DEBUG
# submit beam job that reads replays into bigquery
def count_ones(word_ones):
(word, ones) = word_ones
return (word, sum(ones))
with beam.Pipeline(options=PipelineOptions()) as p:
logging.warning("Defining Beam Pipeline...")
# replay_ids = p | "create_replay_ids" >> beam.Create(DEFAULT_REPLAY_IDS)
(p | "read_replay_ids" >> beam.io.ReadFromText(DEFAULT_REPLAY_IDS_PATH)
| "extract_games" >> beam.ParDo(ExtractGameProtobuf())
| "write_out_bq" >> WriteToBigQuery(
DEFAULT_BQ_TABLE_OUT,
schema=game_schema,
write_disposition=BigQueryDisposition.WRITE_APPEND,
create_disposition=BigQueryDisposition.CREATE_IF_NEEDED)
)
logging.warning("Running Beam Pipeline...")
result = p.run()
result.wait_until_finish()
n_extracted = result.metrics().query(
MetricsFilter().with_name('extracted_games'))
logging.warning("extracted {} games".format(n_extracted))
I am trying to write a table into my data warehouse using the RPostgreSQL package
library(DBI)
library(RPostgreSQL)
pano = dbConnect(dbDriver("PostgreSQL"),
host = 'db.panoply.io',
port = '5439',
user = panoply_user,
password = panoply_pw,
dbname = mydb)
RPostgreSQL::dbWriteTable(pano, "mtcars", mtcars[1:5, ])
I am getting this error:
Error in postgresqlpqExec(new.con, sql4) :
RS-DBI driver: (could not Retrieve the result : ERROR: syntax error at or near "STDIN"
LINE 1: ..."hp","drat","wt","qsec","vs","am","gear","carb" ) FROM STDIN
^
)
The above code writes into Panoply as a 0 row, 0 byte table. Columns seem to be properly entered into Panoply but nothing else appears.
Fiest and most important redshift <> postgresql.
Redshift does not use the Postgres bulk loader. (so stdin is NOT allowed).
There are many options available which you should choose depending on your need, especially consider the volume of data.
For high volume of data you should write to s3 first and then use redshift copy command.
There are many options take a look at
https://github.com/sicarul/redshiftTools
for low volume see
inserting multiple records at once into Redshift with R
Let's say I have a table with one single field named "version", which is a string. When I try to load data into the table using autodetect with values like "1.1" or "1", the autodetect feature infers these values as float or integer type respectively.
data1.json example:
{ "version": "1.11.0" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data1.json
Upload complete.
Waiting on bqjob_ZZZ ... (1s) Current status: DONE
data2.json example:
{ "version": "1.11" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data2.json
Upload complete.
Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to FLOAT
data3.json example:
{ "version": "1" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data3.json
Upload complete.
Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to INTEGER
The scenario where this problem doesn't happen is when you have, in the same file, another JSON where the value is inferred correctly as string (as seen in Bigquery autoconverting fields in data question):
{ "version": "1.12" }
{ "version": "1.12.0" }
In the question listed above, there's an answer stating that a fix was pushed to production, but it looks like the bug is back again. Is there a way/workaround to prevent this?
Looks like the confusing part here is whether "1.12" should be detected as string or float. BigQuery chose to detect as float. Before autodetect is introduced in BigQuery, BigQuery allows users to load float values in string format. This is very common in CSV/JSON format. So when autodetect is introduced, it kept this behavior. Autodetect will scan up to 100 rows to detect the type. If for all 100 rows, the data is like "1.12", then very likely this field is a float value. If one of the row has value "1.12.0", then BigQuery will detect the type is string, as you have observed.
I am using the confluent to import data from kafka to hive, trying to do the same thing as this: Bucket records based on time(kafka-hdfs-connector)
my sink config is like this:
{
"name":"yangfeiran_hive_sink_9",
"config":{
"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector",
"topics":"peoplet_people_1000",
"name":"yangfeiran_hive_sink_9",
"tasks.max":"1",
"hdfs.url":"hdfs://master:8020",
"flush.size":"3",
"partitioner.class":"io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
"partition.duration.ms":"300000",
"path.format":"'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/",
"locale":"en",
"logs.dir":"/tmp/yangfeiran",
"topics.dir":"/tmp/yangfeiran",
"hive.integration":"true",
"hive.metastore.uris":"thrift://master:9083",
"schema.compatibility":"BACKWARD",
"hive.database":"yangfeiran",
"timezone": "UTC",
}
}
Everything works fine, I can see that data is in the hdfs, the table is created in the hive, except when I am using "select * from yang" to check if the data is already in hive.
It prints the error:
FAILED: SemanticException Unable to determine if hdfs://master:8020/tmp/yangfeiran/peoplet_people_1000 is encrypted: java.lang.IllegalArgumentException: Wrong FS: hdfs://master:8020/tmp/yangfeiran/peoplet_people_1000, expected: hdfs://nsstargate
How to solve this problem?
Feiran
My current workflow in BigQuery is as follows:
(1) query data in a public repository (stored in the US), (2) write it to a table in my repository, (3) export a csv to a cloud bucket and (4) download the csv on the server I work on and (5) work with that on the server.
The problem I have now, is that the server I work on is located in EU. Thus, I have to pay quite some fees for transfering data between my US bucket and my EU server. I could now go ahead and locate my bucket in EU, but then I still have the problem that I would transfer data from the US (BigQuery) to EU (bucket). So I could also set my dataset in bq to be located in the EU, but then I cant do any queries anylonger, because the data in the public repository is located in the US, and queries between different locations are not allowed.
Does anyone have an idea of how to approach this?
One way to copy a BigQuery dataset from one region to another is to take advantage of the Storage Data Transfer Service. It doesn't get around the fact that you still have to pay for bucket-to-bucket network traffic, but might save you some CPU time on copying data to a server in the EU.
The flow would be to:
Extract all the BigQuery tables into a bucket in the same region as the tables. (Recommend Avro format for best fidelity in data types and fastest loading speed.)
Run a storage transfer job to copy the extracted files from the starting location bucket to a bucket in the destination location.
Load all the files into a BigQuery dataset located in the destination location.
Python example:
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import sys
import time
import googleapiclient.discovery
from google.cloud import bigquery
import json
import pytz
PROJECT_ID = 'swast-scratch' # TODO: set this to your project name
FROM_LOCATION = 'US' # TODO: set this to the BigQuery location
FROM_DATASET = 'workflow_test_us' # TODO: set to BQ dataset name
FROM_BUCKET = 'swast-scratch-us' # TODO: set to bucket name in same location
TO_LOCATION = 'EU' # TODO: set this to the destination BigQuery location
TO_DATASET = 'workflow_test_eu' # TODO: set to destination dataset name
TO_BUCKET = 'swast-scratch-eu' # TODO: set to bucket name in destination loc
# Construct API clients.
bq_client = bigquery.Client(project=PROJECT_ID)
transfer_client = googleapiclient.discovery.build('storagetransfer', 'v1')
def extract_tables():
# Extract all tables in a dataset to a Cloud Storage bucket.
print('Extracting {}:{} to bucket {}'.format(
PROJECT_ID, FROM_DATASET, FROM_BUCKET))
tables = list(bq_client.list_tables(bq_client.dataset(FROM_DATASET)))
extract_jobs = []
for table in tables:
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.AVRO
extract_job = bq_client.extract_table(
table.reference,
['gs://{}/{}.avro'.format(FROM_BUCKET, table.table_id)],
location=FROM_LOCATION, # Available in 0.32.0 library.
job_config=job_config) # Starts the extract job.
extract_jobs.append(extract_job)
for job in extract_jobs:
job.result()
return tables
def transfer_buckets():
# Transfer files from one region to another using storage transfer service.
print('Transferring bucket {} to {}'.format(FROM_BUCKET, TO_BUCKET))
now = datetime.datetime.now(pytz.utc)
transfer_job = {
'description': '{}-{}-{}_once'.format(
PROJECT_ID, FROM_BUCKET, TO_BUCKET),
'status': 'ENABLED',
'projectId': PROJECT_ID,
'transferSpec': {
'transferOptions': {
'overwriteObjectsAlreadyExistingInSink': True,
},
'gcsDataSource': {
'bucketName': FROM_BUCKET,
},
'gcsDataSink': {
'bucketName': TO_BUCKET,
},
},
# Set start and end date to today (UTC) without a time part to start
# the job immediately.
'schedule': {
'scheduleStartDate': {
'year': now.year,
'month': now.month,
'day': now.day,
},
'scheduleEndDate': {
'year': now.year,
'month': now.month,
'day': now.day,
},
},
}
transfer_job = transfer_client.transferJobs().create(
body=transfer_job).execute()
print('Returned transferJob: {}'.format(
json.dumps(transfer_job, indent=4)))
# Find the operation created for the job.
job_filter = {
'project_id': PROJECT_ID,
'job_names': [transfer_job['name']],
}
# Wait until the operation has started.
response = {}
while ('operations' not in response) or (not response['operations']):
time.sleep(1)
response = transfer_client.transferOperations().list(
name='transferOperations', filter=json.dumps(job_filter)).execute()
operation = response['operations'][0]
print('Returned transferOperation: {}'.format(
json.dumps(operation, indent=4)))
# Wait for the transfer to complete.
print('Waiting ', end='')
while operation['metadata']['status'] == 'IN_PROGRESS':
print('.', end='')
sys.stdout.flush()
time.sleep(5)
operation = transfer_client.transferOperations().get(
name=operation['name']).execute()
print()
print('Finished transferOperation: {}'.format(
json.dumps(operation, indent=4)))
def load_tables(tables):
# Load all tables into the new dataset.
print('Loading tables from bucket {} to {}:{}'.format(
TO_BUCKET, PROJECT_ID, TO_DATASET))
load_jobs = []
for table in tables:
dest_table = bq_client.dataset(TO_DATASET).table(table.table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO
load_job = bq_client.load_table_from_uri(
['gs://{}/{}.avro'.format(TO_BUCKET, table.table_id)],
dest_table,
location=TO_LOCATION, # Available in 0.32.0 library.
job_config=job_config) # Starts the load job.
load_jobs.append(load_job)
for job in load_jobs:
job.result()
# Actually run the script.
tables = extract_tables()
transfer_buckets()
load_tables(tables)
The preceding sample uses google-cloud-bigquery library for BigQuery API and google-api-python-client for Storage Data Transfer API.
Note that this sample does not account for partitioned tables.
No matter what, you have data in the US that you need in the EU, so I think you have two options:
You could continue to pay many smaller fees to move your reduced datasets from the US to the EU as you're doing today.
You could pay the one-off fee to transfer the original public BQ dataset from the US to your own dataset in the EU. From then on, all queries you run stay in the same region, and you have no more trans-continental transfers.
It really depends on how many queries you plan to do. If it's not a lot, then the way you're doing things today seems like it'd be the most efficient. If it's a lot, then moving the data once (paying the up-front fee) might work out cheaper.
Maybe Google has some magical way to make this better, but as far as I can tell, you're dealing with lots of data on one side of the Atlantic that you need on the other side, and moving it across that wire costs money.