I want to read the last (latest) records written to Bigtable with cbt. However, the docs don't mention which order cbt read returns records in, and I don't know what the row key(s) would be.
Is there a way to read records with cbt by insertion time?
Update:
Here is what I see when reading a table:
cbt read table_name count=10 | grep processedTime
2021/12/17 09:20:42 -creds flag unset, will use gcloud credential
general:processedTime @ 2021/06/29-14:40:04.028000
general:processedTime @ 2021/06/17-12:32:04.055000
general:processedTime @ 2021/06/17-12:32:40.032000
general:processedTime @ 2021/06/17-12:32:43.047000
general:processedTime @ 2021/06/10-18:45:53.495000
general:processedTime @ 2021/06/17-12:31:28.772000
general:processedTime @ 2021/06/17-12:30:41.205000
general:processedTime @ 2021/06/17-12:30:33.960000
general:processedTime @ 2021/06/29-14:40:17.811000
general:processedTime @ 2021/06/17-12:32:06.795000
general:processedTime @ 2021/06/17-12:31:49.202000
cbt read does not return the results ordered by time.
Is there a way to get cbt read to order the results by time?
Currently, the cbt tool does not guarantee that data is returned in sorted order, especially when there is a lot of data; it is possible that Bigtable takes some time to organize it. However, based on the documentation, Bigtable serves read requests in the order in which the data is stored, so the latest record should be at the top.
Below are the two test scenarios I ran: first, inserting data without specifying the timestamp; second, inserting data with an explicit timestamp. Either way, I got the data back in descending order by timestamp.
Timestamp not specified:
@cloudshell:~ $ cbt set my-table r1 cf1:c1=val5
2021/12/15 10:51:52 -creds flag unset, will use gcloud credential
@cloudshell:~ $ cbt read my-table
2021/12/15 10:52:07 -creds flag unset, will use gcloud credential
----------------------------------------
r1
  cf1:c1 @ 2021/12/15-10:51:59.760000
    "val5"
  cf1:c1 @ 2021/12/15-10:26:00.471000
    "val4"
  cf1:c1 @ 2021/12/15-10:25:26.863000
    "val3"
  cf1:c1 @ 2021/12/15-10:24:58.021000
    "val2"
  cf1:c1 @ 2021/12/15-10:24:52.259000
    "val1"
@cloudshell:~ $ cbt read my-table cells-per-column=1
2021/12/15 10:52:17 -creds flag unset, will use gcloud credential
----------------------------------------
r1
  cf1:c1 @ 2021/12/15-10:51:59.760000
    "val5"
Timestamp specified:
Your use case might not be entirely covered by the cbt tool. My suggestion is to file a feature request in its GitHub repo, although there is no guarantee as to when it would be implemented.
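In the meantime, if all you need is the newest values of a timestamp column such as general:processedTime, a rough client-side workaround is to sort the cbt output yourself. This is only a sketch: it orders the extracted timestamp lines rather than whole rows, the table/column names and the count value are taken from the question above, and it assumes GNU grep/sort/head:

# Untested sketch: pull the processedTime cells and sort them newest-first.
cbt read table_name count=100 \
  | grep 'general:processedTime' \
  | sort -t'@' -k2,2 -r \
  | head -n 10

The YYYY/MM/DD-HH:MM:SS timestamp format sorts correctly as plain text, which is why a lexicographic sort on the field after the @ works here.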
I'm trying to create a new BigQuery destination on Airbyte with the Octavia CLI.
When launching:
octavia apply
I receive:
Error: {"message":"The provided configuration does not fulfill the specification. Errors: json schema validation failed when comparing the data to the json schema. \nErrors:
$.loading_method.method: must be a constant value Standard
Here is my conf:
# Configuration for airbyte/destination-bigquery
# Documentation about this connector can be found at https://docs.airbyte.com/integrations/destinations/bigquery
resource_name: "BigQueryFromOctavia"
definition_type: destination
definition_id: 22f6c74f-5699-40ff-833c-4a879ea40133
definition_image: airbyte/destination-bigquery
definition_version: 1.2.12
# EDIT THE CONFIGURATION BELOW!
configuration:
  dataset_id: "airbyte_octavia_thibaut" # REQUIRED | string | The default BigQuery Dataset ID that tables are replicated to if the source does not specify a namespace. Read more here.
  project_id: "data-airbyte-poc" # REQUIRED | string | The GCP project ID for the project containing the target BigQuery dataset. Read more here.
  loading_method:
    ## -------- Pick one valid structure among the examples below: --------
    # method: "Standard" # REQUIRED | string
    ## -------- Another valid structure for loading_method: --------
    method: "GCS Staging" # REQUIRED | string
    credential:
      ## -------- Pick one valid structure among the examples below: --------
      credential_type: "HMAC_KEY" # REQUIRED | string
      hmac_key_secret: ${AIRBYTE_BQ1_HMAC_KEY_SECRET} # SECRET (please store in environment variables) | REQUIRED | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. | Example: 1234567890abcdefghij1234567890ABCDEFGHIJ
      hmac_key_access_id: ${AIRBYTE_BQ1_HMAC_KEY_ACCESS_ID} # SECRET (please store in environment variables) | REQUIRED | string | HMAC key access ID. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. | Example: 1234567890abcdefghij1234
      gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
      gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test
    # keep_files_in_gcs-bucket: "Delete all tmp files from GCS" # OPTIONAL | string | This upload method is supposed to temporary store records in GCS bucket. By this select you can chose if these records should be removed from GCS when migration has finished. The default "Delete all tmp files from GCS" value is used if not set explicitly.
  credentials_json: ${AIRBYTE_BQ1_CREDENTIALS_JSON} # SECRET (please store in environment variables) | OPTIONAL | string | The contents of the JSON service account key. Check out the docs if you need help generating this key. Default credentials will be used if this field is left empty.
  dataset_location: "europe-west1" # REQUIRED | string | The location of the dataset. Warning: Changes made after creation will not be applied. Read more here.
  transformation_priority: "interactive" # OPTIONAL | string | Interactive run type means that the query is executed as soon as possible, and these queries count towards concurrent rate limit and daily limit. Read more about interactive run type here. Batch queries are queued and started as soon as idle resources are available in the BigQuery shared resource pool, which usually occurs within a few minutes. Batch queries don’t count towards your concurrent rate limit. Read more about batch queries here. The default "interactive" value is used if not set explicitly.
  big_query_client_buffer_size_mb: 15 # OPTIONAL | integer | Google BigQuery client's chunk (buffer) size (MIN=1, MAX = 15) for each table. The size that will be written by a single RPC. Written data will be buffered and only flushed upon reaching this size or closing the channel. The default 15MB value is used if not set explicitly. Read more here. | Example: 15
It was an indentation issue on my side: the lines

    gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
    gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test

should be one level up, i.e. children of loading_method rather than of credential. This wasn't clear in the commented template, hence the error, and other people may well make the same mistake.
Here is the full final conf:
# Configuration for airbyte/destination-bigquery
# Documentation about this connector can be found at https://docs.airbyte.com/integrations/destinations/bigquery
resource_name: "BigQueryFromOctavia"
definition_type: destination
definition_id: 22f6c74f-5699-40ff-833c-4a879ea40133
definition_image: airbyte/destination-bigquery
definition_version: 1.2.12
# EDIT THE CONFIGURATION BELOW!
configuration:
  dataset_id: "airbyte_octavia_thibaut" # REQUIRED | string | The default BigQuery Dataset ID that tables are replicated to if the source does not specify a namespace. Read more here.
  project_id: "data-airbyte-poc" # REQUIRED | string | The GCP project ID for the project containing the target BigQuery dataset. Read more here.
  loading_method:
    ## -------- Pick one valid structure among the examples below: --------
    # method: "Standard" # REQUIRED | string
    ## -------- Another valid structure for loading_method: --------
    method: "GCS Staging" # REQUIRED | string
    credential:
      ## -------- Pick one valid structure among the examples below: --------
      credential_type: "HMAC_KEY" # REQUIRED | string
      hmac_key_secret: ${AIRBYTE_BQ1_HMAC_KEY_SECRET} # SECRET (please store in environment variables) | REQUIRED | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. | Example: 1234567890abcdefghij1234567890ABCDEFGHIJ
      hmac_key_access_id: ${AIRBYTE_BQ1_HMAC_KEY_ACCESS_ID} # SECRET (please store in environment variables) | REQUIRED | string | HMAC key access ID. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. | Example: 1234567890abcdefghij1234
    gcs_bucket_name: "airbyte-octavia-thibaut-gcs" # REQUIRED | string | The name of the GCS bucket. Read more here. | Example: airbyte_sync
    gcs_bucket_path: "gcs" # REQUIRED | string | Directory under the GCS bucket where data will be written. | Example: data_sync/test
    # keep_files_in_gcs-bucket: "Delete all tmp files from GCS" # OPTIONAL | string | This upload method is supposed to temporary store records in GCS bucket. By this select you can chose if these records should be removed from GCS when migration has finished. The default "Delete all tmp files from GCS" value is used if not set explicitly.
  credentials_json: ${AIRBYTE_BQ1_CREDENTIALS_JSON} # SECRET (please store in environment variables) | OPTIONAL | string | The contents of the JSON service account key. Check out the docs if you need help generating this key. Default credentials will be used if this field is left empty.
  dataset_location: "europe-west1" # REQUIRED | string | The location of the dataset. Warning: Changes made after creation will not be applied. Read more here.
  transformation_priority: "interactive" # OPTIONAL | string | Interactive run type means that the query is executed as soon as possible, and these queries count towards concurrent rate limit and daily limit. Read more about interactive run type here. Batch queries are queued and started as soon as idle resources are available in the BigQuery shared resource pool, which usually occurs within a few minutes. Batch queries don’t count towards your concurrent rate limit. Read more about batch queries here. The default "interactive" value is used if not set explicitly.
  big_query_client_buffer_size_mb: 15 # OPTIONAL | integer | Google BigQuery client's chunk (buffer) size (MIN=1, MAX = 15) for each table. The size that will be written by a single RPC. Written data will be buffered and only flushed upon reaching this size or closing the channel. The default 15MB value is used if not set explicitly. Read more here. | Example: 15
I need a sed or awk command (not a script) that:
1) matches 2 sequential lines in a file:
# This configuration option has an automatic default value.
# filter = [ "a|.*/|" ]
This is required because each single line can occur in the file more than once, but two such sequential lines are unique enough to match on.
2) inserts/appends this text line after the matched lines:
filter = ["a|sd.*|", "a|drbd.*|", "r|.*|"]
3) stops processing after the first match and append
So, the text file looks like this:
...
# filter = [ "a|^/dev/hda8$|", "r|.*/|" ]
#
# This configuration option has an automatic default value.
# filter = [ "a|.*/|" ]
# Configuration option devices/global_filter.
# Limit the block devices that are used by LVM system components.
# Because devices/filter may be overridden from the command line, it is
# not suitable for system-wide device filtering, e.g. udev and lvmetad.
# Use global_filter to hide devices from these LVM system components.
# The syntax is the same as devices/filter. Devices rejected by
# global_filter are not opened by LVM.
# This configuration option has an automatic default value.
# global_filter = [ "a|.*/|" ]
# Configuration option devices/types.
# List of additional acceptable block device types.
# These are of device type names from /proc/devices, followed by the
...
I need the output to look like this:
...
# filter = [ "a|^/dev/hda8$|", "r|.*/|" ]
#
# This configuration option has an automatic default value.
# filter = [ "a|.*/|" ]
filter = ["a|sd.*|", "a|drbd.*|", "r|.*|"]
# Configuration option devices/global_filter.
# Limit the block devices that are used by LVM system components.
# Because devices/filter may be overridden from the command line, it is
# not suitable for system-wide device filtering, e.g. udev and lvmetad.
# Use global_filter to hide devices from these LVM system components.
# The syntax is the same as devices/filter. Devices rejected by
# global_filter are not opened by LVM.
# This configuration option has an automatic default value.
# global_filter = [ "a|.*/|" ]
# Configuration option devices/types.
# List of additional acceptable block device types.
# These are of device type names from /proc/devices, followed by the
...
None of the multiline sed examples I found on Stack Overflow works for me.
I tried F. Hauri's example from this topic: Append a string after a multiple line match in bash
sed -e $'/^admin:/,/^$/{/users:/a\ NewUser\n}'
It works fine when matching unique words, but it did not work for matching sequential text lines like these:
# This configuration option has an automatic default value.
# filter = [ "a|.*/|" ]
and adding '0,' to the sed expression to stop on the first match did not work in that case either.
Updated the description to better describe the goal.
awk '
# Remember that we have seen the "automatic default value" comment line.
/^\s*# This configuration option has an automatic default value\./{
  found=1
}
# On the first "# filter = ..." line seen after that, append the new filter
# line right after it, and only do it once (flag).
found && !flag && /# filter = \[ "a\|\.\*\/\|" \]/{
  flag=1
  $0 = $0 ORS "filter = [\"a|sd.*|\", \"a|drbd.*|\", \"r|.*|\"]"
}
1
' test.conf > test.tmp && cp test.conf test.conf.bak && mv -f test.tmp test.conf
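The rule above only requires the # filter line to appear somewhere after the marker line. If you want to insist that the two lines are strictly consecutive, a small variant along these lines should also work; this is an untested sketch in the same spirit, keeping the file names from the answer above:

awk '
{ print }                         # always print the current line first
# Insert only when the "# filter = ..." line directly follows the marker line,
# and only once (done flag).
!done && prev ~ /# This configuration option has an automatic default value\./ &&
         $0 ~ /# filter = \[ "a\|\.\*\/\|" \]/ {
  print "filter = [\"a|sd.*|\", \"a|drbd.*|\", \"r|.*|\"]"
  done = 1
}
{ prev = $0 }                     # remember the previous line
' test.conf > test.tmp && mv -f test.tmp test.conf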
Is it possible to generate changelogs from an existing database with Liquibase?
I would like to generate one XML changelog per table (not all the create table statements in one single changelog).
If you look into the documentation, it looks like it generates only one changelog with many changesets (one for each table). So by default there is no option to generate one changelog per table.
While liquibase generate-changelog still doesn't support splitting up the generated changelog, you can split it yourself.
If you're using JSON changelogs, you can do this with jq.
I created a jq filter to group the related changesets, and combined it with a Bash script to split out the contents; see this blog post.
jq filter, split_liquibase_changelog.jq:
# Define a function for mapping a change onto its destination file name
# createTable and createIndex use the tableName field
# addForeignKeyConstraint uses baseTableName
# Default to using the name of the change, e.g. createSequence
def get_change_group: map(.tableName // .baseTableName)[0] // keys[0];
# Select the main changelog object
.databaseChangeLog
# Collect the changes from each changeSet into an array
| map(.changeSet.changes | .[])
# Group changes according to the grouping function
| group_by(get_change_group)
# Select the grouped objects from the array
| .[]
# Get the group name from each group
| (.[0] | get_change_group) as $group_name
# Select both the group name...
| $group_name,
# and the group, wrapped in a changeSet that uses the group name in the ID and
# the current user as the author
{ databaseChangeLog: {
    changeSet: {
      id: ("table_" + $group_name),
      author: env.USER,
      changes: . } } }
Bash:
#!/usr/bin/env bash
# Example: ./split_liquibase_changelog.sh schema < changelog.json
set -e -o noclobber
OUTPUT_DIRECTORY="${1:-schema}"
OUTPUT_FILE="${2:-schema.json}"
# Create the output directory
mkdir --parents "$OUTPUT_DIRECTORY"
# --raw-output: don't quote the strings for the group names
# --compact-output: output one JSON object per line
jq \
  --raw-output \
  --compact-output \
  --from-file split_liquibase_changelog.jq \
  | while read -r group; do # Read the group name line
      # Read the JSON object line
      read -r json
      # Process with jq again to pretty-print the object, then redirect it to
      # the new file
      (jq '.' <<< "$json") \
        > "$OUTPUT_DIRECTORY"/"$group".json
    done
# List all the files in the input directory
# Run jq with --raw-input, so input is parsed as strings
# Create a changelog that includes everything in the input path
# Save the output to the desired output file
(jq \
  --raw-input \
  '{ databaseChangeLog: [
       { includeAll:
           { path: . }
       }
     ] }' \
  <<< "$OUTPUT_DIRECTORY"/) \
  > "$OUTPUT_FILE"
If you need to use XML changesets, you can try adapting this solution using an XML tool like XQuery instead.
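For reference, a possible end-to-end run might look like the sketch below. The generate-changelog flag names vary a little between Liquibase versions, and you still need your usual connection options (--url, --username, --password), so treat this as illustrative only:

# Generate a single JSON changelog from the existing database (illustrative flags):
liquibase --changelog-file=changelog.json generate-changelog
# Split it into one changeset file per table under schema/, plus schema.json
# containing an includeAll entry pointing at that directory:
./split_liquibase_changelog.sh schema schema.json < changelog.json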
I use MySQL Workbench to copy a table from MS SQL Server to a MySQL server. Is it possible to select only the data between two dates?
Today this export takes 3 hours for more than 150K rows, and I would like to speed up the process.
Thanks
You must run the migration wizard once again, but in the 'Data Transfer Setup' step choose the 'Create a shell script to copy the data from outside of Workbench' option. After that, Workbench generates a shell script for you, which may look similar to this:
#!/bin/sh
# Workbench Table Data copy script
# Workbench Version: 6.3.10
#
# Execute this to copy table data from a source RDBMS to MySQL.
# Edit the options below to customize it. You will need to provide passwords, at least.
#
# Source DB: Mysql@localhost:8000 (MySQL)
# Target DB: Mysql@localhost:8000
# Source and target DB passwords
arg_source_password=
arg_target_password=
if [ -z "$arg_source_password" ] && [ -z "$arg_target_password" ] ; then
echo WARNING: Both source and target RDBMSes passwords are empty. You should edit this file to set them.
fi
arg_worker_count=2
# Uncomment the following options according to your needs
# Whether target tables should be truncated before copy
# arg_truncate_target=--truncate-target
# Enable debugging output
# arg_debug_output=--log-level=debug3
/home/milosz/Projects/Oracle/workbench/master/wb_run/usr/local/bin/wbcopytables \
--mysql-source="root#localhost:8000" \
--target="root#localhost:8000" \
--source-password="$arg_source_password" \
--target-password="$arg_target_password" \
--thread-count=$arg_worker_count \
$arg_truncate_target \
$arg_debug_output \
--table '`test`' '`t1`' '`test_target`' '`t1`' '`id`' '`id`' '`id`, `name`, `date`'
First of all, you need to fill in your passwords for the source and target databases. Then change the last argument of the wbcopytables command from --table to --table-where and add the condition at the end of the line.
Side note: you can run the wbcopytables command with the --help argument to see all the options.
In the end you should get a script similar to this:
#<...>
# Source and target DB passwords
arg_source_password=your_source_password
arg_target_password=your_target_password
if [ -z "$arg_source_password" ] && [ -z "$arg_target_password" ] ; then
echo WARNING: Both source and target RDBMSes passwords are empty. You should edit this file to set them.
fi
arg_worker_count=2
# Uncomment the following options according to your needs
# Whether target tables should be truncated before copy
# arg_truncate_target=--truncate-target
# Enable debugging output
# arg_debug_output=--log-level=debug3
/home/milosz/Projects/Oracle/workbench/master/wb_run/usr/local/bin/wbcopytables \
--mysql-source="root#localhost:8000" \
--target="root#localhost:8000" \
--source-password="$arg_source_password" \
--target-password="$arg_target_password" \
--thread-count=$arg_worker_count \
$arg_truncate_target \
$arg_debug_output \
--table-where '`test`' '`t1`' '`test_target`' '`t1`' '`id`' '`id`' '`id`, `name`, `date`' '`date` >= "2017-01-02" and `date` <= "2017-01-03"'
I hope this is helpful for you.
I use Redis.
I want the DB to be persistent, but when I kill my process, I notice that the data is not recovered.
For example, I have 100 keys and values, and my process runs with PID 26060. When I do:
kill -9 26060
and run redis-server again, all the keys are lost.
I checked the relevant settings in redis.conf but didn't find anything.
How can I make it persistent?
Regarding your test: with the default configuration you should wait at least 5 minutes before killing the process if you want the data to be snapshotted, because the save 300 10 rule only triggers a snapshot after 300 seconds once at least 10 keys have changed.
This is the default config for Redis (2.8 - 3.0):
################################ SNAPSHOTTING ################################
#
# Save the DB on disk:
#
# save <seconds> <changes>
#
# Will save the DB if both the given number of seconds and the given
# number of write operations against the DB occurred.
#
# In the example below the behaviour will be to save:
# after 900 sec (15 min) if at least 1 key changed
# after 300 sec (5 min) if at least 10 keys changed
# after 60 sec if at least 10000 keys changed
#
# Note: you can disable saving completely by commenting out all "save" lines.
#
# It is also possible to remove all the previously configured save
# points by adding a save directive with a single empty string argument
# like in the following example:
#
# save ""
save 900 1
save 300 10
save 60 10000
Everything about persistence is explained in the documentation.
The file where the data will be saved is defined by the following configuration options:
# The filename where to dump the DB
dbfilename dump.rdb
# For default save/load DB in/from the working directory
# Note that you must specify a directory not a file name.
dir /var/lib/redis/
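If you don't want to rely on the save rules at all, two common options are to force a snapshot before stopping the server, or to enable the append-only file for per-write durability. A rough sketch (adjust to your own setup):

# Force an RDB snapshot right now (runs in the background):
redis-cli BGSAVE

# Or shut down cleanly instead of kill -9; with save points configured,
# a clean shutdown writes the RDB file before exiting:
redis-cli SHUTDOWN SAVE

# For durability of (almost) every write, enable AOF in redis.conf:
# appendonly yes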