Liquibase: generate changelogs from existing database

Is it possible with liquibase to generate changelogs from an existing database?
I would like to generate one XML changelog per table (rather than every create table statement in a single changelog).

If you look into the documentation, it looks like it generates only one changelog with many changesets (one for each table). So by default there is no option to generate one changelog per table.

While liquibase generate-changelog still doesn't support splitting up the generated changelog, you can split it yourself.
If you're using JSON changelogs, you can do this with jq.
I created a jq filter to group the related changesets and combined it with a Bash script to split out the contents; see this blog post.
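If you don't already have a JSON changelog to feed into the split, a first step (sketched here; the flag spellings vary between Liquibase versions and the connection settings are placeholders, so check them against your CLI) is to generate one from the existing database:
# Liquibase 4.x-style invocation; older releases use the generateChangeLog command with --changeLogFile=changelog.json
# The JDBC URL and credentials below are placeholders for your own database.
liquibase generate-changelog \
    --changelog-file=changelog.json \
    --url="jdbc:postgresql://localhost:5432/mydb" \
    --username=me \
    --password=secret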
jq filter, split_liquibase_changelog.jq:
# Define a function for mapping a change onto its destination file name
# createTable and createIndex use the tableName field
# addForeignKeyConstraint uses baseTableName
# Default to using the name of the change, e.g. createSequence
def get_change_group: map(.tableName // .baseTableName)[0] // keys[0];
# Select the main changelog object
.databaseChangeLog
# Collect the changes from each changeSet into an array
| map(.changeSet.changes | .[])
# Group changes according to the grouping function
| group_by(get_change_group)
# Select the grouped objects from the array
| .[]
# Get the group name from each group
| (.[0] | get_change_group) as $group_name
# Select both the group name...
| $group_name,
# and the group, wrapped in a changeSet that uses the group name in the ID and
# the current user as the author
  { databaseChangeLog: {
      changeSet: {
        id: ("table_" + $group_name),
        author: env.USER,
        changes: . } } }
Bash:
#!/usr/bin/env bash
# Example: ./split_liquibase_changelog.sh schema < changelog.json
set -e -o noclobber
OUTPUT_DIRECTORY="${1:-schema}"
OUTPUT_FILE="${2:-schema.json}"
# Create the output directory
mkdir --parents "$OUTPUT_DIRECTORY"
# --raw-output: don't quote the strings for the group names
# --compact-output: output one JSON object per line
jq \
    --raw-output \
    --compact-output \
    --from-file split_liquibase_changelog.jq \
    | while read -r group; do # Read the group name line
        # Read the JSON object line
        read -r json
        # Process with jq again to pretty-print the object, then redirect it to the
        # new file
        (jq '.' <<< "$json") \
            > "$OUTPUT_DIRECTORY"/"$group".json
    done
# List all the files in the input directory
# Run jq with --raw-input, so input is parsed as strings
# Create a changelog that includes everything in the input path
# Save the output to the desired output file
(jq \
    --raw-input \
    '{ databaseChangeLog: [
        { includeAll:
            { path: . }
        }
    ] }' \
    <<< "$OUTPUT_DIRECTORY"/) \
    > "$OUTPUT_FILE"
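For reference, a hypothetical end-to-end run with the default arguments (the table name in the listing is made up):
# Split the changelog generated earlier into one file per table
./split_liquibase_changelog.sh schema < changelog.json
ls schema/        # one changelog per group, e.g. schema/my_table.json
cat schema.json   # wrapper changelog whose includeAll pulls in everything under schema/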
If you need to use XML changesets, you can try adapting this solution using an XML tool like XQuery instead.

Related

How to save the multiple outputs of a single process in publishDir in Nextflow

I have the process create_parallel_params, whose output is a parallel_params folder containing JSON files.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.spectra = "$baseDir/data/spectra/"
params.library = "$baseDir/data/library/"
params.workflow_parameter="$baseDir/data/workflowParameters.xml"
TOOL_FOLDERS="$baseDir/bin"
process create_parallel_params {
    publishDir "$baseDir/nf_output", mode: 'copy'

    output:
    path "parallel_params/*.json"

    script:
    """
    mkdir parallel_params | python $TOOL_FOLDERS/parallel_paramgen.py \
        parallel_params \
        10
    """
}
The output of the above process is passed into the process searchlibrarysearch_molecularv2_parallelstep1, which processes each JSON file.
process searchlibrarysearch_molecularv2_parallelstep1 {
    publishDir "$baseDir/nf_output", mode: 'copy'

    input:
    path json_file
    //path params.spectra
    //path params.library

    output:
    path "result_folder" emit:"result_folder/*.tsv"

    script:
    """
    mkdir result_folder convert_binary librarysearch_binary | \
    python $TOOL_FOLDERS/searchlibrarysearch_molecularv2_parallelstep1.py \
        $params.spectra \
        $json_file \
        $params.workflow_parameter \
        $params.library \
        result_folder \
        convert_binary \
        librarysearch_binary \
    """
}

workflow {
    ch_parallel_params=create_parallel_params()
    ch_searchlibrarysearch=searchlibrarysearch_molecularv2_parallelstep1(create_parallel_params.out.flatten())
    ch_searchlibrarysearch.view()
}
I want the output of these files in the publishDir (nf_output) in a single folder. How can I do that? Please provide an example.
The emit option can be used to assign a name identifier to an output channel. This is helpful if your output declaration defines more than one output channel, but isn't usually necessary if you make only a single declaration. Providing a glob pattern as an identifier doesn't make much sense: if you need only the output TSV files (and not the whole folder), you can just use the following, and the output TSV files will be published to the publishDir:
output:
    path "result_folder/*.tsv"
If you want to declare the folder itself, usually you can just update your publishDir to include a subdirectory with a unique name. You could use something like:
publishDir "$baseDir/nf_output/${json_file.baseName}", mode: 'copy'
But this will give you a 'result_folder' in every subdirectory. If that's not desirable, it might be preferable to change your output declaration to:
output:
    path "result_folder/*"

How to merge multiple markdown files with pandoc while retaining cross document links?

I am trying to merge multiple markdown documents in a single folder together into a PDF with pandoc.
The documents may contain links to each other which should be browseable in the markdown format, e.g. through IntelliJ or within GitLab.
Simple example documents:
0001-file-a.md
---
id: 0001
---
# File a
This is a simple file with just an [external link](www.stackoverflow.com).
0002-file-b.md
---
id: 0002
---
# File b
This file links to [another file](0001-file-a.md).
By default pandoc does not handle this case out of the box, e.g. when running the following command:
pandoc -s -f markdown -t pdf *.md -V linkcolor=blue -o test.pdf
It merges the files, creates a PDF and highlights the links correctly, but when clicking the second link it wants to open the file instead of jumping to the right location in the document.
This problem has been experienced by many before me but none of the solutions I found so far have solved it. The closest I came was with the help of this answer: https://stackoverflow.com/a/61908457/6628753
It defines a filter that is first applied to each file and then the resulting JSON files are merged.
I modified this filter to fit my needs:
Add the number of the file to the label of the top-level header
Prepend the top-level header to all other header labels
Remove .md from internal links
Here is the filter:
#!/usr/bin/env python3
from pandocfilters import toJSONFilter, Header, Link
import re
import sys

"""
Pandoc filter to convert internal links for multifile documents
"""

headerL1 = []

def fix_links(key, value, format, meta):
    global headerL1
    # Store level 1 headers
    if key == "Header":
        [level, [label, t1, t2], header] = value
        if level == 1:
            id = meta.get("id")
            newlabel = f"{id['c'][0]['c']}-{label}"
            headerL1 = [newlabel]
            sys.stderr.write(f"\nGlobal header: {headerL1}\n")
            return Header(level, [newlabel, t1, t2], header)
        # Prepend level 1 header label to all other header labels
        if level > 1:
            prefix = headerL1[0]
            newlabel = prefix + "-" + label
            sys.stderr.write(f"Header label: {label} -> {newlabel}\n")
            return Header(level, [newlabel, t1, t2], header)
    if key == "Link":
        [t1, linktext, [linkref, t4]] = value
        if ".md" in linkref:
            newlinkref = re.sub(r'.md', r'', linkref)
            sys.stderr.write(f'Link: {linkref} -> {newlinkref}\n')
            return Link(t1, linktext, [newlinkref, t4])
        else:
            sys.stderr.write(f'External link: {linkref}\n')

if __name__ == "__main__":
    toJSONFilter(fix_links)
And here is a script that executes the whole thing:
#!/bin/bash
MD_INPUT=$(find . -type f | grep md | sort)
# Pass the markdown through the gitlab filters into Pandoc JSON files
echo "Filtering Gitlab markdown"
for file in $MD_INPUT
do
    echo "Filtering $file"
    pandoc \
        --filter fix-links.py \
        "$file" \
        -t json \
        -o "${file%.md}.json"
done
JSON_INPUT=$(find . -type f | grep json | sort)
echo "Generating LaTeX"
pandoc -s -f json -t latex $JSON_INPUT -V linkcolor=blue -o test.tex
echo "Generating PDF"
pandoc -s -f json -t pdf $JSON_INPUT -V linkcolor=blue -o test.pdf
Applying this script generates a PDF where the second link does not work at all.
Looking at the LaTeX code the problem can be solved by replacing the generated \href directive with \hyperlink.
Once this is done the linking works as expected.
The problem now is that this isn't done automatically by pandoc, which almost seems like a bug.
Is there a way to tell pandoc a link is internal from within the filter?
After running the filter it is non-trivial to fix the issue since there is no good way to differentiate internal and external links.
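As a stopgap, one rough post-processing sketch of the generated LaTeX (the sed pattern is a heuristic of mine that treats any \href target without a dot or colon as internal; verify it against your own test.tex before relying on it):
# Rewrite \href{target}{...} to \hyperlink{target}{...} when the target has no scheme or dot,
# then build the PDF from the patched LaTeX; file names follow the script above.
sed -E 's/\\href\{([^}:.]+)\}/\\hyperlink{\1}/g' test.tex > test_fixed.tex
pdflatex test_fixed.tex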

BigQuery: Export data into hierarchical folders: YYYY/MM/DD

I have a date-partitioned table in BigQuery that I'd like to export. I would like to export it such that the data from each day ends up in a different file. For example, to a GS bucket with a nested folder structure like gs://my-bucket/YYYY/MM/DD/. Is this possible?
Please don't tell me I need to run a separate export job for each day of data: I know this is possible but it is painful when exporting many years worth of data, as you need to run thousands of export jobs.
On the import side, this is possible with the parquet format.
If this is not possible with BigQuery directly, is there a GCP tool like Dataproc or Dataflow that would make this easy (bonus points for linking to a script that actually performs this export)?
Would a bash script with bq extract work?
#!/bin/bash
# Stop on first error
set -e;
# Used for Bigquery partitioning (to distinguish from bash variable reference)
DOLLAR="\$"
# -I ISO DATE
# -d FROM STRING
start=$(date -I -d 2019-06-01) || exit -1
end=$(date -I -d 2019-06-15) || exit -1
d=${start}
# string(d) <= string(end)
while [[ ! "$d" > "$end" ]]; do
    YYYYMMDD=$(date -d ${d} +"%Y%m%d")
    YYYY=$(date -d ${d} +"%Y")
    MM=$(date -d ${d} +"%m")
    DD=$(date -d ${d} +"%d")
    # print current date
    echo ${d}
    cmd="bq extract --destination_format=AVRO \
        'project:dataset.table${DOLLAR}${YYYYMMDD}' \
        'gs://my-bucket/${YYYY}/${MM}/${DD}/part*.avro'
        "
    # execute
    eval ${cmd}
    # d++
    d=$(date -I -d "$d + 1 day")
done
Maybe you should request a new feature at https://issuetracker.google.com/savedsearches/559654.
I'm not a bash ninja, so I'm sure there is a cooler way to compare dates.
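To spot-check the resulting layout after a run, listing one day's prefix is enough (bucket and dates as in the script above):
# List the exported shards for a single day
gsutil ls gs://my-bucket/2019/06/01/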
At @Ben P's request, here's the solution (a Python script) I've used previously to run lots of export jobs in parallel. This is pretty rough code and should be improved by checking the status of each export job after it runs to see whether it succeeded.
I won't accept this as an answer because the question is looking for a BigQuery-native way of performing this task.
Note that this script was for exporting a versioned dataset, so there's a bit of extra logic around that which many users may not need. It assumes that the input table and output folder names both use the version. This should be easy to strip out.
import argparse
import datetime as dt
from google.cloud import bigquery
from multiprocessing import Pool
import random
import time

GCS_EXPORT_BUCKET = "YOUR_BUCKET_HERE"
VERSION = "dataset_v1"

def export_date(export_dt, bucket=GCS_EXPORT_BUCKET, version=VERSION):
    table_id = '{}${:%Y%m%d}'.format(version, export_dt)
    gcs_filename = '{}/{:%Y/%m/%d}/{}-*.jsonlines.gz'.format(version, export_dt, table_id)
    gcs_path = 'gs://{}/{}'.format(bucket, gcs_filename)
    job_id = export_data_to_gcs(table_id, gcs_path, 'currents')
    return (export_dt, job_id)

def export_data_to_gcs(table_id, destination_gcs_path, dataset):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset)
    table_ref = dataset_ref.table(table_id)
    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = 'NEWLINE_DELIMITED_JSON'
    job_config.compression = 'GZIP'
    job_id = 'export-{}-{:%Y%m%d%H%M%S}'.format(table_id.replace('$', '--'),
                                                dt.datetime.utcnow())
    # Add a bit of jitter
    time.sleep(5 * random.random())
    job = bigquery_client.extract_table(table_ref,
                                        destination_gcs_path,
                                        job_config=job_config,
                                        job_id=job_id)
    print(f'Now running job_id {job_id}')
    time.sleep(50)
    job.reload()
    while job.running():
        time.sleep(10)
        job.reload()
    return job_id

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', "--startdate",
                        help="The Start Date - format YYYY-MM-DD (Inclusive)",
                        required=True,
                        type=dt.date.fromisoformat)
    parser.add_argument('-e', "--enddate",
                        help="The End Date format YYYY-MM-DD (Exclusive)",
                        required=True,
                        type=dt.date.fromisoformat)
    args = parser.parse_args()
    start_date = args.startdate
    end_date = args.enddate
    dates = []
    while start_date < end_date:
        dates.append(start_date)
        start_date += dt.timedelta(days=1)
    with Pool(processes=30) as pool:
        jobs = pool.map(export_date, dates, chunksize=1)
To run this code, put it into a file called bq_exporter.py and then run python bq_exporter.py -s 2019-01-01 -e 2019-02-01. That'll export January 2019 and print each export job's ID. You can check on the status of a job using the BigQuery CLI via bq show -j JOB_ID.

Append text line after multiline match and stop on first multiline match

I would need a sed or awk command, not script, that:
1) matches 2 sequential lines in the file:
# This configuration option has an automatic default value.
# filter = [ "a|.*/|" ]
This is required because any single line can occur in the file more than once, but two such sequential lines are unique enough to match on.
2) inserts/appends this text line after the matched lines:
filter = ["a|sd.*|", "a|drbd.*|", "r|.*|"]
3) stops processing after the first match and append
So, the text file looks like this:
...
# filter = [ "a|^/dev/hda8$|", "r|.*/|" ]
#
# This configuration option has an automatic default value.
# filter = [ "a|.*/|" ]
# Configuration option devices/global_filter.
# Limit the block devices that are used by LVM system components.
# Because devices/filter may be overridden from the command line, it is
# not suitable for system-wide device filtering, e.g. udev and lvmetad.
# Use global_filter to hide devices from these LVM system components.
# The syntax is the same as devices/filter. Devices rejected by
# global_filter are not opened by LVM.
# This configuration option has an automatic default value.
# global_filter = [ "a|.*/|" ]
# Configuration option devices/types.
# List of additional acceptable block device types.
# These are of device type names from /proc/devices, followed by the
...
I would need to have output like this:
...
# filter = [ "a|^/dev/hda8$|", "r|.*/|" ]
#
# This configuration option has an automatic default value.
# filter = [ "a|.*/|" ]
filter = ["a|sd.*|", "a|drbd.*|", "r|.*|"]
# Configuration option devices/global_filter.
# Limit the block devices that are used by LVM system components.
# Because devices/filter may be overridden from the command line, it is
# not suitable for system-wide device filtering, e.g. udev and lvmetad.
# Use global_filter to hide devices from these LVM system components.
# The syntax is the same as devices/filter. Devices rejected by
# global_filter are not opened by LVM.
# This configuration option has an automatic default value.
# global_filter = [ "a|.*/|" ]
# Configuration option devices/types.
# List of additional acceptable block device types.
# These are of device type names from /proc/devices, followed by the
...
None of the multiline sed examples I found on Stack Overflow worked for me.
I tried F. Hauri's example from this topic: Append a string after a multiple line match in bash
sed -e $'/^admin:/,/^$/{/users:/a\ NewUser\n}'
It works fine when matching unique words, but it did not work for matching sequential text lines like this:
# This configuration option has an automatic default value.
# filter = [ "a|.*/|" ]
and also adding '0,' to the sed expression to stop on the first match did not work in that case.
I updated the description to better describe the goal.
awk '
/^\s+\# This configuration option has an automatic default value\./{
    found=1
}
found && !flag && /\s+\# filter = \[ \"a\|\.\*\/\|\" \]/{
    flag=1
    $0=$0 ORS ORS " filter = [\"a|sd.*|\", \"a|drbd.*|\", \"r|.*|\"]"
}
1
' test.conf > test.tmp && cp test.conf test.conf.bak && mv -f test.tmp test.conf
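If a pure sed variant is still preferred, here is a GNU-sed-only sketch (an untested assumption on my part: -z slurps the whole file as one record, so the un-flagged s/// naturally stops after the first occurrence; double-check the whitespace handling and escaping against your real lvm.conf):
# -z: treat the file as a single record; the captured group reuses the indentation of the matched comment line
sed -z 's/\([[:blank:]]*\)# This configuration option has an automatic default value\.\n[[:blank:]]*# filter = \[ "a|\.\*\/|" ]\n/&\1filter = ["a|sd.*|", "a|drbd.*|", "r|.*|"]\n/' test.conf > test.tmp && mv -f test.tmp test.conf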

MySQL Workbench migration: select data

I use MySQL Workbench to copy a table from MS SQL to a MySQL server. Is it possible to select the data between 2 dates?
Today this export takes 3 hours for more than 150K rows, and I would like to speed up the process.
Thanks
You must run the migration wizard once again, but in the 'Data Transfer Setup' step choose the 'Create a shell script to copy the data from outside of Workbench' option. After that, Workbench generates a shell script for you, which may look similar to this:
#!/bin/sh
# Workbench Table Data copy script
# Workbench Version: 6.3.10
#
# Execute this to copy table data from a source RDBMS to MySQL.
# Edit the options below to customize it. You will need to provide passwords, at least.
#
# Source DB: Mysql@localhost:8000 (MySQL)
# Target DB: Mysql@localhost:8000

# Source and target DB passwords
arg_source_password=
arg_target_password=

if [ -z "$arg_source_password" ] && [ -z "$arg_target_password" ] ; then
    echo WARNING: Both source and target RDBMSes passwords are empty. You should edit this file to set them.
fi

arg_worker_count=2

# Uncomment the following options according to your needs
# Whether target tables should be truncated before copy
# arg_truncate_target=--truncate-target
# Enable debugging output
# arg_debug_output=--log-level=debug3

/home/milosz/Projects/Oracle/workbench/master/wb_run/usr/local/bin/wbcopytables \
    --mysql-source="root@localhost:8000" \
    --target="root@localhost:8000" \
    --source-password="$arg_source_password" \
    --target-password="$arg_target_password" \
    --thread-count=$arg_worker_count \
    $arg_truncate_target \
    $arg_debug_output \
    --table '`test`' '`t1`' '`test_target`' '`t1`' '`id`' '`id`' '`id`, `name`, `date`'
First of all, you need to put in your passwords for the source and target databases. Then change the last argument of the wbcopytables command from --table to --table-where and add the condition to the end of the line.
Side note: you can run the wbcopytables command with the --help argument to see all options.
After that, you should get a script that looks similar to:
#<...>
# Source and target DB passwords
arg_source_password=your_source_password
arg_target_password=your_target_password

if [ -z "$arg_source_password" ] && [ -z "$arg_target_password" ] ; then
    echo WARNING: Both source and target RDBMSes passwords are empty. You should edit this file to set them.
fi

arg_worker_count=2

# Uncomment the following options according to your needs
# Whether target tables should be truncated before copy
# arg_truncate_target=--truncate-target
# Enable debugging output
# arg_debug_output=--log-level=debug3

/home/milosz/Projects/Oracle/workbench/master/wb_run/usr/local/bin/wbcopytables \
    --mysql-source="root@localhost:8000" \
    --target="root@localhost:8000" \
    --source-password="$arg_source_password" \
    --target-password="$arg_target_password" \
    --thread-count=$arg_worker_count \
    $arg_truncate_target \
    $arg_debug_output \
    --table-where '`test`' '`t1`' '`test_target`' '`t1`' '`id`' '`id`' '`id`, `name`, `date`' '`date` >= "2017-01-02" and `date` <= "2017-01-03"'
I hope that is helpful for you.
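The only remaining step is to run the generated script; the filename below is hypothetical, since Workbench asks you where to save it:
# Hypothetical filename chosen when saving the script from the wizard
chmod +x copy_table_data.sh
./copy_table_data.sh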