MySQL Workbench migration: select data between two dates during migration

I use MySQL Workbench to copy a table from MS SQL Server to a MySQL server. Is it possible to select only the data between two dates?
Today this export takes 3 hours for more than 150K rows, and I would like to speed up the process.
Thanks

You must run the migration wizard once again, but in the 'Data Transfer Setup' step choose the 'Create a shell script to copy the data from outside of Workbench' option. After that, Workbench generates a shell script for you, which may look similar to this:
#!/bin/sh
# Workbench Table Data copy script
# Workbench Version: 6.3.10
#
# Execute this to copy table data from a source RDBMS to MySQL.
# Edit the options below to customize it. You will need to provide passwords, at least.
#
# Source DB: Mysql@localhost:8000 (MySQL)
# Target DB: Mysql@localhost:8000
# Source and target DB passwords
arg_source_password=
arg_target_password=
if [ -z "$arg_source_password" ] && [ -z "$arg_target_password" ] ; then
echo WARNING: Both source and target RDBMSes passwords are empty. You should edit this file to set them.
fi
arg_worker_count=2
# Uncomment the following options according to your needs
# Whether target tables should be truncated before copy
# arg_truncate_target=--truncate-target
# Enable debugging output
# arg_debug_output=--log-level=debug3
/home/milosz/Projects/Oracle/workbench/master/wb_run/usr/local/bin/wbcopytables \
--mysql-source="root#localhost:8000" \
--target="root#localhost:8000" \
--source-password="$arg_source_password" \
--target-password="$arg_target_password" \
--thread-count=$arg_worker_count \
$arg_truncate_target \
$arg_debug_output \
--table '`test`' '`t1`' '`test_target`' '`t1`' '`id`' '`id`' '`id`, `name`, `date`'
First of all, you need to fill in your passwords for the source and target databases. Then change the --table argument of the wbcopytables command to --table-where and add a condition at the end of the line.
Side note: you can run the wbcopytables command with the --help argument to see all the options.
In the end you should get a script that looks similar to this:
#<...>
# Source and target DB passwords
arg_source_password=your_source_password
arg_target_password=your_target_password
if [ -z "$arg_source_password" ] && [ -z "$arg_target_password" ] ; then
echo WARNING: Both source and target RDBMSes passwords are empty. You should edit this file to set them.
fi
arg_worker_count=2
# Uncomment the following options according to your needs
# Whether target tables should be truncated before copy
# arg_truncate_target=--truncate-target
# Enable debugging output
# arg_debug_output=--log-level=debug3
/home/milosz/Projects/Oracle/workbench/master/wb_run/usr/local/bin/wbcopytables \
--mysql-source="root#localhost:8000" \
--target="root#localhost:8000" \
--source-password="$arg_source_password" \
--target-password="$arg_target_password" \
--thread-count=$arg_worker_count \
$arg_truncate_target \
$arg_debug_output \
--table-where '`test`' '`t1`' '`test_target`' '`t1`' '`id`' '`id`' '`id`, `name`, `date`' '`date` >= "2017-01-02" and `date` <= "2017-01-03"'
I hope that is helpful for you.

Related

How can I run a .sql file that includes multiple .sql files using SnowSQL?

I want to figure out how to run multiple SQL files in one go. Suppose I have this test.sql file, which references file1.sql, file2.sql, file3.sql and so on, along with some DML/DDL.
use database &{db};
use schema &{sc};
file1.sql;
file2.sql;
file3.sql;
create table snow_test1
(
name varchar
,add1 varchar
,id number
)
comment = 'this is snowsql testing table' ;
desc table snow_test1;
insert into snow_test1
values('prachi', 'testing', 1);
select * from snow_test1;
Here is what I run in PowerShell:
snowsql -c pp_conn -f ...\test.sql -D db=tbc -D sc=testing;
Is there any way to do this? I know it is possible in Oracle, but I want to do this using SnowSQL. Please guide me. Thanks in advance!
You can run multiple files in a single call:
snowsql -c pp_conn -f file1.sql -f file2.sql -f file3.sql -D db=tbc -D sc=testing;
You might need to put the additional DML statements in a file.
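For example (a rough sketch; the file name ddl_and_dml.sql is hypothetical), the create/insert/select statements from test.sql could be moved into their own file and appended to the same call:
snowsql -c pp_conn -f file1.sql -f file2.sql -f file3.sql -f ddl_and_dml.sql -D db=tbc -D sc=testing;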
I have tried referencing the .sql files with !source inside my test.sql file and it is working:
!source file1.sql;
!source file2.sql;
!source file3.sql;
....
Also, I ran the same command in PowerShell using the single .sql file and it works.

BigQuery: Export data into hierarchical folders: YYYY/MM/DD

I have a date-partitioned table in BigQuery that I'd like to export. I would like to export it such that the data from each day ends up in a different file. For example, to a GS bucket with a nested folder structure like gs://my-bucket/YYYY/MM/DD/. Is this possible?
Please don't tell me I need to run a separate export job for each day of data: I know this is possible but it is painful when exporting many years worth of data, as you need to run thousands of export jobs.
On the import side, this is possible with the parquet format.
If this is not possible with BigQuery directly, is there a GCP tool like Dataproc or Dataflow that would make this easy (bonus points for linking to a script that actually performs this export)?
Would a bash script with bq extract work?
#!/bin/bash
# Stop on first error
set -e;
# Used for Bigquery partitioning (to distinguish from bash variable reference)
DOLLAR="\$"
# -I ISO DATE
# -d FROM STRING
start=$(date -I -d 2019-06-01) || exit -1
end=$(date -I -d 2019-06-15) || exit -1
d=${start}
# string(d) <= string(end)
while [[ ! "$d" > "$end" ]]; do
YYYYMMDD=$(date -d ${d} +"%Y%m%d")
YYYY=$(date -d ${d} +"%Y")
MM=$(date -d ${d} +"%m")
DD=$(date -d ${d} +"%d")
# print current date
echo ${d}
cmd="bq extract --destination_format=AVRO \
'project:dataset.table${DOLLAR}${YYYYMMDD}' \
'gs://my-bucket/${YYYY}/${MM}/${DD}/part*.avro'
"
# execute
eval ${cmd}
# d++
d=$(date -I -d "$d + 1 day")
done
Maybe you should request a new feature at https://issuetracker.google.com/savedsearches/559654.
I'm not a bash ninja, so I'm sure there is a cooler way to compare dates.
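For what it's worth, one alternative (a minimal sketch, assuming GNU date as in the script above) is to compare epoch seconds instead of ISO strings:
#!/bin/bash
# Compare dates numerically via epoch seconds rather than as strings
start=2019-06-01
end=2019-06-15
d=$start
while [ "$(date -d "$d" +%s)" -le "$(date -d "$end" +%s)" ]; do
  echo "$d"
  d=$(date -I -d "$d + 1 day")
done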
At @Ben P's request, here's the solution (a Python script) I've used previously to run lots of export jobs in parallel. This is pretty rough code and should be improved by checking the status of each export job after it runs to see whether it succeeded.
I won't accept this as an answer because the question is looking for a BigQuery-native way of performing this task.
Note that this script was for exporting a versioned dataset, so there's a bit of extra logic around that which many users may not need. It assumes that the input table and output folder names both use the version. This should be easy to strip out.
import argparse
import datetime as dt
from google.cloud import bigquery
from multiprocessing import Pool
import random
import time

GCS_EXPORT_BUCKET = "YOUR_BUCKET_HERE"
VERSION = "dataset_v1"

def export_date(export_dt, bucket=GCS_EXPORT_BUCKET, version=VERSION):
    table_id = '{}${:%Y%m%d}'.format(version, export_dt)
    gcs_filename = '{}/{:%Y/%m/%d}/{}-*.jsonlines.gz'.format(version, export_dt, table_id)
    gcs_path = 'gs://{}/{}'.format(bucket, gcs_filename)
    job_id = export_data_to_gcs(table_id, gcs_path, 'currents')
    return (export_dt, job_id)

def export_data_to_gcs(table_id, destination_gcs_path, dataset):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset)
    table_ref = dataset_ref.table(table_id)
    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = 'NEWLINE_DELIMITED_JSON'
    job_config.compression = 'GZIP'
    job_id = 'export-{}-{:%Y%m%d%H%M%S}'.format(table_id.replace('$', '--'),
                                                dt.datetime.utcnow())
    # Add a bit of jitter
    time.sleep(5 * random.random())
    job = bigquery_client.extract_table(table_ref,
                                        destination_gcs_path,
                                        job_config=job_config,
                                        job_id=job_id)
    print(f'Now running job_id {job_id}')
    time.sleep(50)
    job.reload()
    while job.running():
        time.sleep(10)
        job.reload()
    return job_id

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', "--startdate",
                        help="The Start Date - format YYYY-MM-DD (Inclusive)",
                        required=True,
                        type=dt.date.fromisoformat)
    parser.add_argument('-e', "--enddate",
                        help="The End Date format YYYY-MM-DD (Exclusive)",
                        required=True,
                        type=dt.date.fromisoformat)
    args = parser.parse_args()

    start_date = args.startdate
    end_date = args.enddate

    dates = []
    while start_date < end_date:
        dates.append(start_date)
        start_date += dt.timedelta(days=1)

    with Pool(processes=30) as pool:
        jobs = pool.map(export_date, dates, chunksize=1)
To run this code, put it into a file called bq_exporter.py and then run python bq_exporter.py -s 2019-01-01 -e 2019-02-01. That'll export January 2019 and print each export job's ID. You can check on the status of a job via the BigQuery CLI with bq show -j JOB_ID.
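If you collect the printed job IDs into a file (the name job_ids.txt below is just an assumption for illustration), a small loop can check them all:
#!/bin/bash
# Check the status of each export job listed (one job ID per line) in job_ids.txt
while read -r job_id; do
  bq show -j "$job_id"
done < job_ids.txt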

liquibase : generate changelogs from existing database

Is it possible with liquibase to generate changelogs from an existing database?
I would like to generate one XML changelog per table (not all the create table statements in one single changelog).
If you look into the documentation, it looks like it generates only one changelog with many changesets (one for each table). So by default there is no option to generate one changelog per table.
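For reference, generating that single combined changelog looks roughly like this (a sketch only; the kebab-case flag assumes Liquibase 4.x, and the connection settings are assumed to live in liquibase.properties):
liquibase generate-changelog --changelog-file=db.changelog.json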
While liquibase generate-changelog still doesn't support splitting up the generated changelog, you can split it yourself.
If you're using JSON changelogs, you can do this with jq.
I created a jq filter to group the related changesets, and combined it with a Bash script to split out the contents. See this blog post.
jq filter, split_liquibase_changelog.jq:
# Define a function for mapping a changes onto its destination file name
# createTable and createIndex use the tableName field
# addForeignKeyConstraint uses baseTableName
# Default to using the name of the change, e.g. createSequence
def get_change_group: map(.tableName // .baseTableName)[0] // keys[0];
# Select the main changelog object
.databaseChangeLog
# Collect the changes from each changeSet into an array
| map(.changeSet.changes | .[])
# Group changes according to the grouping function
| group_by(get_change_group)
# Select the grouped objects from the array
| .[]
# Get the group name from each group
| (.[0] | get_change_group) as $group_name
# Select both the group name...
| $group_name,
# and the group, wrapped in a changeSet that uses the group name in the ID and
# the current user as the author
{ databaseChangeLog: {
    changeSet: {
      id: ("table_" + $group_name),
      author: env.USER,
      changes: . } } }
Bash:
#!/usr/bin/env bash
# Example: ./split_liquibase_changelog.sh schema < changelog.json
set -e -o noclobber
OUTPUT_DIRECTORY="${1:-schema}"
OUTPUT_FILE="${2:-schema.json}"
# Create the output directory
mkdir --parents "$OUTPUT_DIRECTORY"
# --raw-output: don't quote the strings for the group names
# --compact-output: output one JSON object per line
jq \
  --raw-output \
  --compact-output \
  --from-file split_liquibase_changelog.jq \
  | while read -r group; do # Read the group name line
      # Read the JSON object line
      read -r json
      # Process with jq again to pretty-print the object, then redirect it to the
      # new file
      (jq '.' <<< "$json") \
        > "$OUTPUT_DIRECTORY"/"$group".json
    done
# List all the files in the input directory
# Run jq with --raw-input, so input is parsed as strings
# Create a changelog that includes everything in the input path
# Save the output to the desired output file
(jq \
  --raw-input \
  '{ databaseChangeLog: [
      { includeAll:
        { path: . }
      }
    ] }' \
  <<< "$OUTPUT_DIRECTORY"/) \
  > "$OUTPUT_FILE"
If you need to use XML changesets, you can try adapting this solution using an XML tool like XQuery instead.

Using documentParser function in Teradata Aster

I'm working with Teradata's Aster and am trying to parse a PDF (or HTML) file so that it is inserted into a table in the Beehive database in Aster. The entire PDF should correspond to a single row of data in the table.
This is to be done using one of Aster's SQL-MR functions called documentParser. It produces a text file (.rtf) containing a single row, built by parsing all the chapters of the PDF file, which is then loaded into the table in Beehive.
I have been given this script that shows the use of documentParser and the other steps involved in the parsing process:
# SHELL INSTRUCTIONS
# transform the file to base64 (change file names to your relevant file)
base64 pp.pdf > pp.b64
# prepare a load file
rm my_load_file.txt
# get the content of the file
var=$(cat pp.b64)
# put it in the load file
echo \""pp.b64"\"","\""$var"\" >> "my_load_file.txt"
# create staging table
act -U db_superuser -w db_superuser -d beehive -c "drop table if exists public.cf_load_file;"
act -U db_superuser -w db_superuser -d beehive -c "create dimension table public.cf_load_file(file_name varchar, content varchar);"
# load into staging table
ncluster_loader -U db_superuser -w db_superuser -d beehive --csv --verbose public.cf_load_file my_load_file.txt
# use documentParser to load the clean text (you will need to create the target table beforehand)
act -U db_superuser -w db_superuser -d beehive -c "INSERT INTO got_data.cf_got_text_data (file_name, content) SELECT * FROM documentParser (ON public.cf_load_file documentCol ('content') mode ('text'));"
# done
However, I am stuck on the last step of the script because it looks like there is no function called documentParser in the list of functions that are available in Aster. This is the error I get:
ERROR: function "documentparser" does not exist
I tried to search for this function several times with the \dF command, but did not get any match.
I've attached a picture (SQL-MR Document Parser) that presents the gist of what I'm trying to do.
I would appreciate any help from anyone who has experience with this.
What happened is that someone told you about this function documentParser but never gave you the function archive file (documentParser.zip) to install in Aster. This function does exist, but it is not part of the official Aster Analytics Foundation (AAF). Please contact the person who gave you this information for help.
documentParser belongs to the so-called field functions that are developed and used by the Aster field team only. Not that you can't use it, but don't expect support to help you; only whoever gave you access to it can.
If you don't have any contacts, then the next course of action I'd suggest is to go to the Aster Community Network and ask about it there.
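If you do obtain the archive, SQL-MR function files are normally installed from an ACT session; something along these lines (a sketch only, and the file name and exact syntax are assumptions to confirm with whoever supplies the file):
\install documentParser.zip
\dF
After installing, documentparser should appear in the \dF listing and the INSERT ... SELECT FROM documentParser(...) step should find it.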

How can I view all comments posted by users in a Bitbucket repository

On the repository home page, I can see comments posted under recent activity at the bottom, but it only shows 10 comments.
I want to see all the comments posted since the beginning.
Is there any way?
Comments of pull requests, issues and commits can be retrieved using bitbucket’s REST API.
However it seems that there is no way to list all of them at one place, so the only way to get them would be to query the API for each PR, issue or commit of the repository.
Note that this takes a long time, since Bitbucket has seemingly set a limit on the number of API accesses to repository data: I got Rate limit for this resource has been exceeded errors after retrieving around a thousand results, and after that I could retrieve only about one entry per second elapsed since the last rate limit error.
Finding the API URL to the repository
The first step is to find the URL to the repo. For private repositories, it is necessary to get authenticated by providing username and password (using curl’s -u switch). The URL is of the form:
https://api.bitbucket.org/2.0/repositories/{repoOwnerName}/{repoName}
Running git remote -v from the local git repository should provide the missing values. Check the constructed URL (below referred to as $url) by verifying that repository information is correctly retrieved from it as JSON data: curl -u username $url.
Fetching comments of commits
Comments of a commit can be accessed at $url/commit/{commitHash}/comments.
The resulting JSON data can be processed by a script. Beware that the results are paginated.
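The script below only counts comments, so it does not need to page through results, but if you want the full comment bodies you have to follow the next links; a rough sketch (assuming jq is available, with $url, $pw and $id as defined elsewhere in this answer):
next="$url/commit/$id/comments"
while [ -n "$next" ] && [ "$next" != "null" ]; do
  page=$(curl -s -u username:"$pw" "$next")
  printf '%s\n' "$page" | jq '.values[]'
  next=$(printf '%s\n' "$page" | jq -r '.next')
done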
Below I simply extract the number of comments per commit. It is indicated by the value of the member size of the retrieved JSON object; I also request a partial response by adding the GET parameter fields=size.
My script getNComments.sh:
#!/bin/sh
pw=$1
id=$2
json=$(curl -s -u username:"$pw" \
https://api.bitbucket.org/2.0/repositories/{repoOwnerName}/{repoName}/commit/$id/comments'?fields=size')
printf '%s' "$json" | grep -q '"type": "error"' \
&& printf "ERROR $id\n" && exit 0
nComments=$(printf '%s' "$json" | grep -o '"size": [0-9]*' | cut -d' ' -f2)
: ${nComments:=EMPTY}
checkNumeric=$(printf '%s' "$nComments" | tr -dc 0-9)
[ "$nComments" != "$checkNumeric" ] \
&& printf >&2 "!ERROR! $id:\n%s\n" "$json" && exit 1
printf "$nComments $id\n"
To use it, taking into account the possibility of the rate limit error mentioned above:
A) Prepare the input data. From the local repository, generate the list of commits as wanted (run git fetch -a beforehand to update the local git repo if needed); check out git help rev-list for how it can be customised.
git rev-list --all | sort > sorted-all.id
cp sorted-all.id remaining.id
B) Run the script. Note that the password is passed here as a parameter – so first assign it to a variable safely using stty -echo; IFS= read -r passwd; stty echo, in one line; also see security considerations below. The processing is parallelised onto 15 processes here, using the option -P.
< remaining.id xargs -P 15 -L 1 ./getNComments.sh "$passwd" > commits.temp
C) When the rate limit is reached, that is when getNComments.sh prints !ERROR!, kill the above command (Ctrl-C) and execute the two commands below to update the input and output files. Wait a while for the request limit to increase, then re-execute the command from step B and repeat until all the data is processed (that is, until wc -l remaining.id returns 0).
cat commits.temp >> commits.result
cut -d' ' -f2 commits.result | sort | comm -13 - sorted-all.id > remaining.id
D) Finally, you can get the commits which received comments with:
grep '^[1-9]' commits.result
Fetching comments of pull requests and issues
The procedure is the same as for fetching commits’ comments, but for the following two adjustments:
Edit the script to replace in the URL commit by pullrequests or by issues, as appropriate;
Let $n be the number of issues/PRs to search. The git rev-list command above becomes: seq 1 $n > sorted-all.id
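For instance, after the first adjustment the curl line of getNComments.sh would look something like this for pull requests (the endpoint path is my reading of the Bitbucket 2.0 API, so treat it as an assumption):
json=$(curl -s -u username:"$pw" \
  https://api.bitbucket.org/2.0/repositories/{repoOwnerName}/{repoName}/pullrequests/$id/comments'?fields=size')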
The total number of PRs in the repository can be obtained with:
curl -su username $url/pullrequests'?state=&fields=size'
and, if the issue tracker is set up, the number of issues with:
curl -su username $url/issues'?fields=size'
Hopefully, the repository has few enough PRs and issues so that all data can be fetched in one go.
Viewing comments
They can be viewed normally via the web interface on their commit/PR/issue page at:
https://bitbucket.org/{repoOwnerName}/{repoName}/commits/{commitHash}
https://bitbucket.org/{repoOwnerName}/{repoName}/pull-requests/{prId}
https://bitbucket.org/{repoOwnerName}/{repoName}/issues/{issueId}
For example, to open all PRs with comments in firefox:
awk '/^[1-9]/{print "https://bitbucket.org/{repoOwnerName}/{repoName}/pull-requests/"$2}' PRs.result | xargs firefox
Security considerations
Arguments passed on the command line are visible to all users of the system, via ps ax (or /proc/$PID/cmdline). Therefore the bitbucket password will be exposed, which could be a concern if the system is shared by multiple users.
There are three commands getting the password from the command line: xargs, the script, and curl.
It appears that curl tries to hide the password by overwriting its memory, but it is not guaranteed to work, and even if it does, it leaves it visible for a (very short) time after the process starts. On my system, the parameters to curl are not hidden.
A better option could be to pass the sensitive information through environment variables. They should be visible only to the current user and root via ps axe (or /proc/$PID/environ); although it seems that there are systems that let all users access this information (do a ls -l /proc/*/environ to check the environment files’ permissions).
In the script simply replace the lines pw=$1 id=$2 with id=$1, then pass pw="$passwd" before xargs in the command line invocation. It will make the environment variable pw visible to xargs and all of its descendent processes, that is the script and its children (curl, grep, cut, etc), which may or may not read the variable. curl does not read the password from the environment, but if its password hiding trick mentioned above works then it might be good enough.
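Concretely, the invocation from step B would become something like this (a sketch of the variant just described):
< remaining.id pw="$passwd" xargs -P 15 -L 1 ./getNComments.sh > commits.temp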
There are ways to avoid passing the password to curl via the command line, notably via standard input using the option -K -. In the script, replace curl -s -u username:"$pw" with printf -- '-s\n-u "%s"\n' "$authinfo" | curl -K - and define the variable authinfo to contain the data in the format username:password. Note that this method needs printf to be a shell built-in to be safe (check with type printf), otherwise the password will show up in its process arguments. If it is not a built-in, try with print or echo instead.
A simple alternative to an environment variable that will not appear in ps output in any case is via a file. Create a file with read/write permissions restricted to the current user (chmod 600), and edit it so that it contains username:password as its first line. In the script, replace pw=$1 with IFS= read -r authinfo < "$1", and edit it to use curl’s -K option as in the paragraph above. In the command line invocation replace $passwd with the filename.
The file approach has the drawback that the password will be written to disk (note that files in /proc are not on the disk). If this too is undesirable, it is possible to pass a named pipe instead of a regular file:
mkfifo pipe
chmod 600 pipe
# make sure printf is a builtin, or use an equivalent instead
(while :; do printf -- '%s\n' "username:$passwd"; done) > pipe&
pid=$!
exec 3<pipe
Then invoke the script passing pipe instead of the file. Finally, to clean up do:
kill $pid
exec 3<&-
This will ensure the authentication info is passed directly from the shell to the script (through the kernel), is not written to disk and is not exposed to other users via ps.
You can go to Commits and see the top line for each commit; you will need to click on each one to see further information.
If I find a way to see all of them without drilling into each commit, I will update this answer.