BigQuery: Export data into hierarchical folders: YYYY/MM/DD

I have a date-partitioned table in BigQuery that I'd like to export. I would like to export it such that the data from each day ends up in a different file. For example, to a GS bucket with a nested folder structure like gs://my-bucket/YYYY/MM/DD/. Is this possible?
Please don't tell me I need to run a separate export job for each day of data: I know this is possible, but it is painful when exporting many years' worth of data, as you need to run thousands of export jobs.
On the import side, this is possible with the parquet format.
If this is not possible with BigQuery directly, is there a GCP tool like Dataproc or Dataflow that would make this easy (bonus points for linking to a script that actually performs this export)?

Would a bash script with bq extract work?
#!/bin/bash
# Stop on first error
set -e

# Used for BigQuery partition decorators (to distinguish from bash variable references)
DOLLAR="\$"

# -I: output an ISO 8601 date
# -d: parse the date from a string
start=$(date -I -d 2019-06-01) || exit 1
end=$(date -I -d 2019-06-15) || exit 1

d=${start}
# Loop while string(d) <= string(end)
while [[ ! "$d" > "$end" ]]; do
    YYYYMMDD=$(date -d "${d}" +"%Y%m%d")
    YYYY=$(date -d "${d}" +"%Y")
    MM=$(date -d "${d}" +"%m")
    DD=$(date -d "${d}" +"%d")

    # Print the current date
    echo "${d}"

    cmd="bq extract --destination_format=AVRO \
        'project:dataset.table${DOLLAR}${YYYYMMDD}' \
        'gs://my-bucket/${YYYY}/${MM}/${DD}/part*.avro'"

    # Execute
    eval "${cmd}"

    # d++
    d=$(date -I -d "$d + 1 day")
done
Maybe you should request a new feature at https://issuetracker.google.com/savedsearches/559654.
I'm not a bash ninja, so I'm sure there is a cooler way to compare dates.

At @Ben P's request, here's the solution (a Python script) I've used previously to run lots of export jobs in parallel. This is pretty rough code and should be improved by checking the status of each export job after it runs to see whether it succeeded.
I won't accept this as an answer because the question is looking for a BigQuery-native way of performing this task.
Note that this script was for exporting a versioned dataset, so there's a bit of extra logic around that which many users may not need. It assumes that the input table and output folder names both use the version. This should be easy to strip out.
import argparse
import datetime as dt
import random
import time
from multiprocessing import Pool

from google.cloud import bigquery

GCS_EXPORT_BUCKET = "YOUR_BUCKET_HERE"
VERSION = "dataset_v1"


def export_date(export_dt, bucket=GCS_EXPORT_BUCKET, version=VERSION):
    table_id = '{}${:%Y%m%d}'.format(version, export_dt)
    gcs_filename = '{}/{:%Y/%m/%d}/{}-*.jsonlines.gz'.format(version, export_dt, table_id)
    gcs_path = 'gs://{}/{}'.format(bucket, gcs_filename)
    job_id = export_data_to_gcs(table_id, gcs_path, 'currents')
    return (export_dt, job_id)


def export_data_to_gcs(table_id, destination_gcs_path, dataset):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset)
    table_ref = dataset_ref.table(table_id)
    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = 'NEWLINE_DELIMITED_JSON'
    job_config.compression = 'GZIP'
    job_id = 'export-{}-{:%Y%m%d%H%M%S}'.format(table_id.replace('$', '--'),
                                                dt.datetime.utcnow())
    # Add a bit of jitter
    time.sleep(5 * random.random())
    job = bigquery_client.extract_table(table_ref,
                                        destination_gcs_path,
                                        job_config=job_config,
                                        job_id=job_id)
    print(f'Now running job_id {job_id}')
    time.sleep(50)
    job.reload()
    while job.running():
        time.sleep(10)
        job.reload()
    return job_id


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', "--startdate",
                        help="The Start Date - format YYYY-MM-DD (Inclusive)",
                        required=True,
                        type=dt.date.fromisoformat)
    parser.add_argument('-e', "--enddate",
                        help="The End Date - format YYYY-MM-DD (Exclusive)",
                        required=True,
                        type=dt.date.fromisoformat)
    args = parser.parse_args()

    start_date = args.startdate
    end_date = args.enddate

    dates = []
    while start_date < end_date:
        dates.append(start_date)
        start_date += dt.timedelta(days=1)

    with Pool(processes=30) as pool:
        jobs = pool.map(export_date, dates, chunksize=1)
To run this code, put it into a file called bq_exporter.py and then run python bq_exporter.py -s 2019-01-01 -e 2019-02-01. That'll export January 2019 and print each export job's ID. You can check on the status of a job using the BigQuery CLI via bq show -j JOB_ID.
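As a rough sketch of the status check suggested above (it reuses the same google-cloud-bigquery client; check_export_jobs is a hypothetical helper, not part of the script):
from google.cloud import bigquery

def check_export_jobs(jobs):
    """Report the final state of each (date, job_id) pair returned by pool.map above."""
    client = bigquery.Client()
    for export_dt, job_id in jobs:
        job = client.get_job(job_id)  # fetch the job's current metadata
        if job.error_result:
            print('{}: job {} FAILED: {}'.format(export_dt, job_id, job.error_result))
        else:
            print('{}: job {} is {}'.format(export_dt, job_id, job.state))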


BigQuery, is there a way to search the whole db for a string?

I have a datum that I need to find within the database, for example 'dsfsdfsads'. But there are 100+ tables and views to search through. I have found numerous queries written for other databases that can find a specific string within the database; example posts are below. However, I don't see the same for BigQuery. I found this question: Is it possible to do a full text search in all tables in BigQuery?, but that post feels incomplete and the 2 links provided in the answer do not answer my question.
Examples of other database queries capable of finding specific string:
Find a string by searching all tables in SQL Server Management Studio 2008
Search all tables, all columns for a specific value SQL Server
How do I search an SQL Server database for a string?
I am not sure why it doesn't suit you to search through your database using a wildcard table, as in the post you mentioned, because I have run this sample query to search through a public dataset and it works just fine:
SELECT *
FROM `bigquery-public-data.baseball.*` b
WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r'Cubs')
I guess it is because one of the limitations of wildcard tables is that they do not support views.
Do you have a lot of them?
In that case you can use the wildcard only for your tables and filter out the views with _TABLE_SUFFIX or with a less general wildcard (it depends on the names of your views).
In general, with wildcard tables, filtering on _TABLE_SUFFIX can greatly reduce the number of bytes scanned, which reduces the cost of running your queries (see the sketch below), so use it also if you suspect some tables are more likely to contain the string than others.
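For instance, here is a minimal sketch using the Python client against the same public dataset (the games_* prefix and the suffix values are illustrative; substitute your own table names):
from google.cloud import bigquery

client = bigquery.Client()

# Filtering on _TABLE_SUFFIX limits the wildcard scan to the listed tables,
# which reduces the bytes scanned (and therefore the cost of the query).
query = """
    SELECT *
    FROM `bigquery-public-data.baseball.games_*` b
    WHERE _TABLE_SUFFIX IN ('wide', 'post_wide')
      AND REGEXP_CONTAINS(TO_JSON_STRING(b), r'Cubs')
    LIMIT 10
"""
for row in client.query(query).result():
    print(row)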
For the views (or the whole dataset), you could:
• Iterate by calling the BigQuery API using one of the client libraries together with a multiprocessing module, such as multiprocessing in Python.
• Iterate by calling the REST API from a bash script.
• Iterate by using the bq command from a bash script.
If you get stuck with the programmatic part, post a new question and add the link here.
EDIT:
Here are two examples for you (Python and bash). I tried them both and they work, but any comments to help improve them are of course welcome.
Python:
Install the BigQuery client library (the multiprocessing module used below is part of the Python standard library, so it needs no extra package):
pip install --upgrade google-cloud-bigquery
Create filename.py. Change YOUR_PROJECT_ID and YOUR_DATASET.
from google.cloud import bigquery
import multiprocessing


def search(dataset_id):
    """
    Lists the tables in your dataset and keeps only the views.
    """
    client = bigquery.Client()
    tables = client.list_tables(dataset_id)
    views = []
    for table in tables:
        if table.table_type == 'VIEW':
            views.append(table.table_id)
    return views


def query(dataset_id, view):
    """
    Searches for the string in one view and prints the first row it finds.
    You can change or remove 'LIMIT 1' if needed.
    """
    client = bigquery.Client()
    query_job = client.query(
        """
        SELECT *
        FROM `{}.{}` b
        WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r"true")
        LIMIT 1
        """.format(dataset_id, view)
    )
    results = query_job.result()  # Waits for the job to complete.
    for row in results:
        print(row)


if __name__ == '__main__':
    # TODO: Set dataset_id to the ID of the dataset that contains the views you are searching.
    dataset_id = 'YOUR_PROJECT_ID.YOUR_DATASET'
    views = search(dataset_id)
    processes = []
    for i in views:
        p = multiprocessing.Process(target=query, args=(dataset_id, i))
        p.start()
        processes.append(p)
    for process in processes:
        process.join()
Run python filename.py
Bash:
Install jq (json parser) and test it
sudo apt-get install jq
Test
echo '{ "name":"John", "age":31, "city":"New York" }' | jq .
Output:
{
"name": "John",
"age": 31,
"city": "New York"
}
Create filename.sh. Change YOUR_PROJECT_ID and YOUR_DATASET.
#!/bin/bash
FILES="bq ls --format prettyjson YOUR_DATASET"
RESULTS=$(eval $FILES)
DETAILS=$(echo "${RESULTS}" | jq -c '.[]')

for d in $DETAILS
do
    ID=$(echo $d | jq -r .tableReference.tableId)
    table_type=$(echo $d | jq -r '.type')
    if [[ $table_type == "VIEW" ]]
    then
        bq query --use_legacy_sql=false \
        'SELECT *
         FROM
         `YOUR_PROJECT_ID`.YOUR_DATASET.'$ID' b
         WHERE REGEXP_CONTAINS(TO_JSON_STRING(b), r"true")
         LIMIT 1'
    fi
done
Run bash filename.sh

Scheduling Query by using Script

I'd like to ask if it's possible to set up query scheduling by using a script.
For creating a table, we can use a script:
CREATE TABLE dataset.xxx AS
...
Is there any way to do this, but to CREATE A SCHEDULER instead of clicking the 'Schedule Query' button?
As per the documentation, in order to schedule a query you can use one of the following methods:
• the BigQuery console, clicking "Schedule Query" as you mentioned in your question
• the bq command
• the Python API
I will share some examples with you. First, using the bq command: from the Cloud Shell environment you can execute the following command:
bq query \
    --use_legacy_sql=false \
    --destination_table=mydataset.mytable \
    --display_name='My Scheduled Query' \
    --schedule='every 24 hours' \
    --replace=true \
    'SELECT
        1
    FROM
        mydataset.test'
In addition, when using the bq command you can also use other flags, described here.
Second, using the Python API, you can configure your scheduled query using the DataTransferServiceClient, which allows you to pass the whole query configuration as a JSON dictionary, as in this example from the documentation, reproduced below:
from google.cloud import bigquery_datatransfer_v1
import google.protobuf.json_format

client = bigquery_datatransfer_v1.DataTransferServiceClient()

# TODO(developer): Set the project_id to the project that contains the
# destination dataset.
# project_id = "your-project-id"

# TODO(developer): Set the destination dataset. The authorized user must
# have owner permissions on the dataset.
# dataset_id = "your_dataset_id"

# TODO(developer): The first time you run this sample, set the
# authorization code to a value from the URL:
# https://www.gstatic.com/bigquerydatatransfer/oauthz/auth?client_id=433065040935-hav5fqnc9p9cht3rqneus9115ias2kn1.apps.googleusercontent.com&scope=https://www.googleapis.com/auth/bigquery%20https://www.googleapis.com/auth/drive&redirect_uri=urn:ietf:wg:oauth:2.0:oob
#
# authorization_code = "_4/ABCD-EFGHIJKLMNOP-QRSTUVWXYZ"
#
# You can use an empty string for authorization_code in subsequent runs of
# this code sample with the same credentials.
#
# authorization_code = ""

# Use standard SQL syntax for the query.
query_string = """
SELECT
  CURRENT_TIMESTAMP() as current_time,
  @run_time as intended_run_time,
  @run_date as intended_run_date,
  17 as some_integer
"""

parent = client.project_path(project_id)

transfer_config = google.protobuf.json_format.ParseDict(
    {
        "destination_dataset_id": dataset_id,
        "display_name": "Your Scheduled Query Name",
        "data_source_id": "scheduled_query",
        "params": {
            "query": query_string,
            "destination_table_name_template": "your_table_{run_date}",
            "write_disposition": "WRITE_TRUNCATE",
            "partitioning_field": "",
        },
        "schedule": "every 24 hours",
    },
    bigquery_datatransfer_v1.types.TransferConfig(),
)

response = client.create_transfer_config(
    parent, transfer_config, authorization_code=authorization_code
)

print("Created scheduled query '{}'".format(response.name))

liquibase : generate changelogs from existing database

Is it possible with liquibase to generate changelogs from an existing database?
I would like to generate one xml changelog per table (not every create table statements in one single changelog).
If you look into the documentation, it looks like it generates only one changelog with many changesets (one for each table), so by default there is no option to generate one changelog per table.
While liquibase generate-changelog still doesn't support splitting up the generated changelog, you can split it yourself.
If you're using JSON changelogs, you can do this with jq.
I created a jq filter to group the related changesets, and combined it with a Bash script to split out the contents. See this blog post
jq filter, split_liquibase_changelog.jq:
# Define a function for mapping a change onto its destination file name
# createTable and createIndex use the tableName field
# addForeignKeyConstraint uses baseTableName
# Default to using the name of the change, e.g. createSequence
def get_change_group: map(.tableName // .baseTableName)[0] // keys[0];

# Select the main changelog object
.databaseChangeLog
# Collect the changes from each changeSet into an array
| map(.changeSet.changes | .[])
# Group changes according to the grouping function
| group_by(get_change_group)
# Select the grouped objects from the array
| .[]
# Get the group name from each group
| (.[0] | get_change_group) as $group_name
# Select both the group name...
| $group_name,
# and the group, wrapped in a changeSet that uses the group name in the ID and
# the current user as the author
  { databaseChangeLog: {
      changeSet: {
        id: ("table_" + $group_name),
        author: env.USER,
        changes: . } } }
Bash:
#!/usr/bin/env bash
# Example: ./split_liquibase_changelog.sh schema < changelog.json
set -e -o noclobber

OUTPUT_DIRECTORY="${1:-schema}"
OUTPUT_FILE="${2:-schema.json}"

# Create the output directory
mkdir --parents "$OUTPUT_DIRECTORY"

# --raw-output: don't quote the strings for the group names
# --compact-output: output one JSON object per line
jq \
    --raw-output \
    --compact-output \
    --from-file split_liquibase_changelog.jq \
    | while read -r group; do # Read the group name line
        # Read the JSON object line
        read -r json
        # Process with jq again to pretty-print the object, then redirect it to the
        # new file
        (jq '.' <<< "$json") \
            > "$OUTPUT_DIRECTORY"/"$group".json
    done

# List all the files in the input directory
# Run jq with --raw-input, so input is parsed as strings
# Create a changelog that includes everything in the input path
# Save the output to the desired output file
(jq \
    --raw-input \
    '{ databaseChangeLog: [
        { includeAll:
            { path: . }
        }
    ] }' \
    <<< "$OUTPUT_DIRECTORY"/) \
    > "$OUTPUT_FILE"
If you need to use XML changesets, you can try adapting this solution using an XML tool like XQuery instead.
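For example, here is a rough Python sketch of the same grouping idea applied to an XML changelog, using the standard library's ElementTree instead of XQuery. The file names, the one-change-per-changeSet assumption, and the grouping rule are mine, mirroring the jq filter above rather than anything Liquibase provides:
"""Rough sketch: split a Liquibase XML changelog into one file per table."""
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

NS = "http://www.liquibase.org/xml/ns/dbchangelog"
ET.register_namespace("", NS)  # keep the default Liquibase namespace on output


def change_group(change):
    """createTable/createIndex carry tableName; addForeignKeyConstraint uses
    baseTableName; otherwise fall back to the change's own name."""
    local_name = change.tag.split("}")[-1]
    return change.get("tableName") or change.get("baseTableName") or local_name


def split_changelog(changelog_path="changelog.xml", out_dir="schema"):
    tree = ET.parse(changelog_path)
    groups = defaultdict(list)
    for change_set in tree.getroot().findall(f"{{{NS}}}changeSet"):
        first_change = next(iter(change_set), None)
        if first_change is not None:
            groups[change_group(first_change)].append(change_set)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for name, change_sets in groups.items():
        root = ET.Element(f"{{{NS}}}databaseChangeLog")
        root.extend(change_sets)
        ET.ElementTree(root).write(f"{out_dir}/{name}.xml",
                                   xml_declaration=True, encoding="UTF-8")


if __name__ == "__main__":
    split_changelog()
You would still need a root changelog that uses include or includeAll to pull in the generated files, analogous to the includeAll step in the Bash script above.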

mysql workbench migration select data

I use MySQL Workbench to copy a table from MS SQL Server to a MySQL server. Is it possible to select only the data between 2 dates?
Today this export takes 3 hours for more than 150K rows, and I would like to speed up the process.
Thanks
You must run the migration wizard once again, but in the 'Data Transfer Setup' step choose the 'Create a shell script to copy the data from outside of Workbench' option. After that, Workbench generates a shell script for you, which may look similar to this:
#!/bin/sh
# Workbench Table Data copy script
# Workbench Version: 6.3.10
#
# Execute this to copy table data from a source RDBMS to MySQL.
# Edit the options below to customize it. You will need to provide passwords, at least.
#
# Source DB: Mysql@localhost:8000 (MySQL)
# Target DB: Mysql@localhost:8000

# Source and target DB passwords
arg_source_password=
arg_target_password=

if [ -z "$arg_source_password" ] && [ -z "$arg_target_password" ] ; then
    echo WARNING: Both source and target RDBMSes passwords are empty. You should edit this file to set them.
fi

arg_worker_count=2

# Uncomment the following options according to your needs

# Whether target tables should be truncated before copy
# arg_truncate_target=--truncate-target
# Enable debugging output
# arg_debug_output=--log-level=debug3

/home/milosz/Projects/Oracle/workbench/master/wb_run/usr/local/bin/wbcopytables \
    --mysql-source="root@localhost:8000" \
    --target="root@localhost:8000" \
    --source-password="$arg_source_password" \
    --target-password="$arg_target_password" \
    --thread-count=$arg_worker_count \
    $arg_truncate_target \
    $arg_debug_output \
    --table '`test`' '`t1`' '`test_target`' '`t1`' '`id`' '`id`' '`id`, `name`, `date`'
First of all, you need to put in your passwords for the source and target databases. Then change the last argument of the wbcopytables command from --table to --table-where and add the condition to the end of the line.
Side note: you can run the wbcopytables command with the --help argument to see all options.
After that, you should get a script that looks similar to this:
#<...>
# Source and target DB passwords
arg_source_password=your_source_password
arg_target_password=your_target_password

if [ -z "$arg_source_password" ] && [ -z "$arg_target_password" ] ; then
    echo WARNING: Both source and target RDBMSes passwords are empty. You should edit this file to set them.
fi

arg_worker_count=2

# Uncomment the following options according to your needs

# Whether target tables should be truncated before copy
# arg_truncate_target=--truncate-target
# Enable debugging output
# arg_debug_output=--log-level=debug3

/home/milosz/Projects/Oracle/workbench/master/wb_run/usr/local/bin/wbcopytables \
    --mysql-source="root@localhost:8000" \
    --target="root@localhost:8000" \
    --source-password="$arg_source_password" \
    --target-password="$arg_target_password" \
    --thread-count=$arg_worker_count \
    $arg_truncate_target \
    $arg_debug_output \
    --table-where '`test`' '`t1`' '`test_target`' '`t1`' '`id`' '`id`' '`id`, `name`, `date`' '`date` >= "2017-01-02" and `date` <= "2017-01-03"'
I hope that is helpful for you.

Using the BigQuery Connector with Spark

I can't get the Google example to work:
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example
PySpark
There are a few mistakes in the code, I think, like:
# Output Parameters
'mapred.bq.project.id': '',
Should be: 'mapred.bq.output.project.id': '',
and
# Write data back into new BigQuery table.
# BigQueryOutputFormat discards keys, so set key to None.
(word_counts
.map(lambda pair: None, json.dumps(pair))
.saveAsNewAPIHadoopDataset(conf))
will give an error message. If I change it to:
(word_counts
.map(lambda pair: (None, json.dumps(pair)))
.saveAsNewAPIHadoopDataset(conf))
I get the error message:
org.apache.hadoop.io.Text cannot be cast to com.google.gson.JsonObject
And whatever I try, I cannot make this work.
A dataset is created in BigQuery with the name I gave it in the 'conf', with '_hadoop_temporary_job_201512081419_0008' appended.
And a table is created with '_attempt_201512081419_0008_r_000000_0' on the end, but these tables are always empty.
Can anybody help me with this?
Thanks
We are working to update the documentation because, as you noted, the docs are incorrect in this case. Sorry about that! While we're working to update the docs, I wanted to get you a reply ASAP.
Casting problem
The most important problem you mention is the casting issue. Unfortunately, PySpark cannot use the BigQueryOutputFormat to create Java GSON objects. The solution (workaround) is to save the output data into Google Cloud Storage (GCS) and then load it manually with the bq command.
Code example
Here is a code sample which exports to GCS and then loads the data into BigQuery. You could also use subprocess and Python to execute the bq command programmatically; a rough sketch of that follows the example.
#!/usr/bin/python
"""BigQuery I/O PySpark example."""
import json
import pprint
import pyspark

sc = pyspark.SparkContext()

# Use the Google Cloud Storage bucket for temporary BigQuery export data used
# by the InputFormat. This assumes the Google Cloud Storage connector for
# Hadoop is configured.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Input Parameters
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

# Perform word count.
word_counts = (
    table_data
    .map(lambda (_, record): json.loads(record))
    .map(lambda x: (x['word'].lower(), int(x['word_count'])))
    .reduceByKey(lambda x, y: x + y))

# Display 10 results.
pprint.pprint(word_counts.take(10))

# Stage data formatted as newline delimited json in Google Cloud Storage.
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)
partitions = range(word_counts.getNumPartitions())
output_files = [output_directory + '/part-{:05}'.format(i) for i in partitions]

(word_counts
 .map(lambda (w, c): json.dumps({'word': w, 'word_count': c}))
 .saveAsTextFile(output_directory))

# Manually clean up the input_directory, otherwise there will be BigQuery export
# files left over indefinitely.
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)

print """
###########################################################################
# Finish uploading data to BigQuery using a client e.g.
bq load --source_format NEWLINE_DELIMITED_JSON \
    --schema 'word:STRING,word_count:INTEGER' \
    wordcount_dataset.wordcount_table {files}

# Clean up the output
gsutil -m rm -r {output_directory}
###########################################################################
""".format(
    files=','.join(output_files),
    output_directory=output_directory)
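As a rough sketch of the subprocess approach mentioned above (it assumes the bq CLI is installed where the driver runs; the dataset and table names are the same placeholders as in the printed command), the final load step could be scripted like this:
import subprocess

def bq_load(files, table='wordcount_dataset.wordcount_table'):
    # Run the final `bq load` step from the driver instead of typing it by hand.
    subprocess.check_call([
        'bq', 'load',
        '--source_format', 'NEWLINE_DELIMITED_JSON',
        '--schema', 'word:STRING,word_count:INTEGER',
        table,
        ','.join(files),  # comma-separated list of gs:// URIs
    ])

bq_load(output_files)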