I'm trying to dump the schema of a Cassandra keyspace in a form that can be imported into another keyspace, for example to back up a Cassandra keyspace and restore it into a different keyspace with the same schema.
I'm using cqlsh:
[cqlsh 2.3.0 | Cassandra 1.2.2 | CQL spec 3.0.1 | Thrift protocol 19.35.0]
I initially create the schema with a CLUSTERING ORDER BY clause on the timestamp column in descending order:
DROP KEYSPACE mailbox;
CREATE KEYSPACE mailbox
WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': '1' };
USE mailbox;
CREATE TABLE messages (
id uuid,
user_id uuid,
contents varchar,
created timestamp,
PRIMARY KEY (id, created)
)
WITH CLUSTERING ORDER BY (created DESC);
Then I use DESCRIBE to dump what I thought would be valid CQL3:
cqlsh:mailbox> describe keyspace mailbox;
CREATE KEYSPACE mailbox WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
};
USE mailbox;
CREATE TABLE messages (
id uuid,
created 'org.apache.cassandra.db.marshal.ReversedType'<timestamp>,
contents text,
user_id uuid,
PRIMARY KEY (id, created)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
When I try to import that back into cqlsh I get the following error:
Bad Request: line 3:56 mismatched input '<' expecting ')'
text could not be lexed at line 16, char 14
I believe it's failing to parse the created column definition (which was originally created with a CLUSTERING ORDER BY):
created 'org.apache.cassandra.db.marshal.ReversedType'<timestamp>
Is there some other method for dumping the schema of a given keyspace?
Hm.. I think this issue is resolved in Cassandra 1.2.5 and you are using Cassandra 1.2.2: https://issues.apache.org/jira/browse/CASSANDRA-5528. Better yet, upgrade to Cassandra 2.0, because I tested here and it works fine in Cassandra 2.0.2.
As a side note, you may also want to keep an eye on the CQL version: http://cassandra.apache.org/doc/cql3/CQL.html#Versioning
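Until you can upgrade, one workaround (just a sketch, not an official tool) is to hand-edit the dumped DDL before replaying it into the new keyspace: change the reversed-type column back to a plain timestamp and re-add the clustering order clause from the original definition, keeping the rest of the dumped options:
CREATE TABLE messages (
  id uuid,
  created timestamp,
  contents text,
  user_id uuid,
  PRIMARY KEY (id, created)
) WITH CLUSTERING ORDER BY (created DESC)
  AND bloom_filter_fp_chance=0.010000
  AND caching='KEYS_ONLY';
Append the remaining options from the DESCRIBE output unchanged; the key changes are the plain timestamp type and the restored CLUSTERING ORDER BY clause.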
I'm relatively new to Flink, and today I encountered a problem while using Flink SQL on a Flink 1.11.3 session cluster.
Problem
I registered a source table which uses the JDBC Postgres driver. I am trying to move some data from this online DB to AWS S3 in Parquet format. The table is huge (~43 GB). The job failed after around 1 minute and the task manager crashed without any warning, but my best guess is that the task manager ran out of memory.
My Observation
I found that when I do tableEnv.executeSql("select ... from huge_table limit 1000"), Flink attempts to scan the entire source table into memory and only applies the limit afterwards.
Question
Since I only care about the most recent several days of data, is there any way to limit how many rows a job would scan by timestamp?
Appendix
Here is a minimal setup that can reproduce the issue (lots of noise removed)
Env setup code
var blinkSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
var tableEnv = TableEnvironment.create(blinkSettings);
Source table DDL in Flink SQL
CREATE TABLE source_transactions (
txid STRING,
username STRING,
amount BIGINT,
ts TIMESTAMP,
PRIMARY KEY (txid) NOT ENFORCED
) WITH (
'connector'='jdbc',
'url'='jdbc:postgresql://my.bank',
'table-name'='transactions',
'driver'='org.postgresql.Driver',
'username'='username',
'password'='password',
'scan.fetch-size'='2000'
)
Sink table DDL in Flink SQL
CREATE TABLE sink_transactions (
create_time TIMESTAMP,
username STRING,
delta_amount DOUBLE,
dt STRING
) PARTITIONED BY (dt) WITH (
'connector'='filesystem',
'path'='s3a://s3/path/to/transactions',
'format'='parquet'
)
Insert query in Flink SQL
INSERT INTO sink_transactions
SELECT ts, username, CAST(t.amount AS DOUBLE) / 100, DATE_FORMAT(ts, 'yyyy-MM-dd')
FROM source_transactions t
Your observation is right: Flink doesn't support limit pushdown optimization for the JDBC connector yet. There is a nearly merged PR to support this feature; it will be available in Flink 1.13, and you can cherry-pick the patch into your code if you need the feature urgently. A possible interim workaround is sketched after the links below.
1. JIRA: FLINK-19650 Support the limit push down for the Jdbc
2. PR: https://github.com/apache/flink/pull/13800
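As for scanning only recent data: one option (a sketch, assuming you can create objects in the source Postgres database; the view name is made up) is to point 'table-name' at a view that pre-filters by timestamp, so the filter runs inside Postgres and the connector never reads the older rows:
-- run in the source Postgres database
CREATE VIEW recent_transactions AS
SELECT txid, username, amount, ts
FROM transactions
WHERE ts > now() - interval '7 days';
Then use 'table-name'='recent_transactions' in the Flink source DDL instead of 'transactions'.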
I'm trying to create a search index on my table in DSE 6.8. This is my table in the test keyspace:
CREATE TABLE users (
username text,
first_name text,
last_name text,
password text,
email text,
last_access timeuuid,
PRIMARY KEY(username));
I tried this query:
CREATE SEARCH INDEX ON test.users;
and this is the response:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Search statements are not supported on this node"
I think there must be something I should change in some file for DSE to support search statements. I've already set SOLR_ENABLED to 1 in /etc/default/dse. I'm totally new to this and I don't know if there's something wrong with my table or anything else.
Can anyone suggest what might be causing this error? Thanks in advance.
As the error message suggests, you can only create a Search index on DSE nodes running in Search mode.
Check the node's workload by running the command below. It will tell you if the node is running in pure Cassandra mode or Search mode.
$ dsetool status
If you have installed DSE using the binary tarball, it doesn't use /etc/default/dse. Instead, start DSE as a standalone process with the -s flag to run it in Search mode:
$ dse cassandra -s
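If DSE was installed from packages (so /etc/default/dse is in play), the service needs a restart after changing SOLR_ENABLED, and you can re-check the workload afterwards; a sketch, assuming the standard package service name:
$ sudo service dse restart
$ dsetool status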
Cheers!
Initially, I created an empty table with partitioning and clustering. After that, I wanted to configure the Data Transfer Service to fill my table from Google Cloud Storage. But when I configured the transfer, I didn't see a parameter field that allows choosing the clustering field.
I tried to do the same thing without clustering and I could fill my table easily.
BigQuery error when I ran the transfer:
Failed to start job for table matable$20190701 with error INVALID_ARGUMENT: Incompatible table partitioning specification. Destination table exists with partitioning specification interval(type:DAY,field:) clustering(string_field_15), but transfer target partitioning specification is interval(type:DAY,field:). Please retry after updating either the destination table or the transfer partitioning specification.
When you define the table, you specify the partitioning and clustering columns. That's everything you need to do.
When you load the data from GCS (from the CLI or the UI), BigQuery automatically partitions and clusters the data.
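For example, a CLI load into the partition from your error message would look roughly like this (a sketch; the dataset name, bucket path and source format are placeholders for your setup):
bq load --source_format=CSV 'mydataset.matable$20190701' gs://my-bucket/path/file.csv
As long as the destination table was created with the clustering field, the loaded rows should end up partitioned and clustered accordingly.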
If you can give more detail on how you created the table and set up the transfer, it would help me provide a more detailed explanation.
Thanks for your time.
Of course:
empty table configuration
transfer configuration
I succeed in transferring data without clustering, but when I add a clustering field to my empty table, the transfer fails.
Situation:
I have a PostgreSQL-database that is logging data from sensors in a field-deployed unit (let's call this the source database). The unit has a very limited hard-disk space, meaning that if left untouched, the data-logging will cause the disk where the database is residing to fill up within a week. I have a (very limited) network link to the database (so I want to compress the dump-file), and on the other side of said link I have another PostgreSQL database (let's call that the destination database) that has a lot of free space (let's just, for argument's sake, say that the source is very limited with regard to space, and the destination is unlimited with regard to space).
I need to take incremental backups of the source database, append the rows that have been added since last backup to the destination database, and then clean out the added rows from the source database.
Now the source database might or might not have been cleaned since a backup was last taken, so the destination database needs to be able to import only the new rows in an automated (scripted) process, but pg_restore fails miserably when trying to restore from a dump that has the same primary key numbers as the destination database.
So the question is:
What is the best way to restore only the rows from a source that are not already in the destination database?
The only solution I've come up with so far is to pg_dump the database and restore the dump to a new secondary database on the destination side with pg_restore, then use simple SQL to sort out which rows already exist in my main destination database. But it seems like there should be a better way...
(extra question: Am I completely wrong in using PostgreSQL in such an application? I'm open to suggestions for other data-collection alternatives...)
A good way to start would probably be to use the --inserts option to pg_dump. From the documentation (emphasis mine):
Dump data as INSERT commands (rather than COPY). This will make
restoration very slow; it is mainly useful for making dumps that can
be loaded into non-PostgreSQL databases. However, since this option
generates a separate command for each row, an error in reloading a row
causes only that row to be lost rather than the entire table contents.
Note that the restore might fail altogether if you have rearranged
column order. The --column-inserts option is safe against column order
changes, though even slower.
I don't have the means to test it right now with pg_restore, but this might be enough for your case.
You could also use the fact that from version 9.5, PostgreSQL provides ON CONFLICT DO ... for INSERT. Use a simple scripting language to add these to the dump and you should be fine. I haven't found an option for pg_dump to add those automatically, unfortunately.
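That post-processing could be as simple as a sed one-liner; here is a rough sketch (the database names are placeholders, and it assumes each dumped INSERT sits on a single line, i.e. no newlines embedded in text values):
# dump data only, one INSERT per row, and make each INSERT a no-op on duplicate keys
pg_dump --data-only --inserts sourcedb \
  | sed '/^INSERT/s/;$/ ON CONFLICT DO NOTHING;/' \
  | psql destinationdb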
You might google "sporadically connected database synchronization" to see related solutions.
It's not a neatly solved problem as far as I know - there are some common work-arounds, but I am not aware of a database-centric out-of-the-box solution.
The most common way of dealing with this is to use a message bus to move events between your machines. For instance, if your "source database" is just a data store, with no other logic, you might get rid of it, and use a message bus to say "event x has occurred", and point the endpoint of that message bus at your "destination machine", which then writes that to your database.
You might consider Apache ActiveMQ or read "Enterprise Integration Patterns".
#!/bin/sh
PSQL=/opt/postgres-9.5/bin/psql
TARGET_HOST=localhost
TARGET_DB=mystuff
TARGET_SCHEMA_IMPORT=copied
TARGET_SCHEMA_FINAL=final
SOURCE_HOST=192.168.0.101
SOURCE_DB=slurpert
SOURCE_SCHEMA=public
########
create_local_stuff()
{
${PSQL} -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG0
CREATE SCHEMA IF NOT EXISTS ${TARGET_SCHEMA_IMPORT};
CREATE SCHEMA IF NOT EXISTS ${TARGET_SCHEMA_FINAL};
CREATE TABLE IF NOT EXISTS ${TARGET_SCHEMA_FINAL}.topic
( topic_id INTEGER NOT NULL PRIMARY KEY
, topic_date TIMESTAMP WITH TIME ZONE
, topic_body text
);
CREATE TABLE IF NOT EXISTS ${TARGET_SCHEMA_IMPORT}.tmp_topic
( topic_id INTEGER NOT NULL PRIMARY KEY
, topic_date TIMESTAMP WITH TIME ZONE
, topic_body text
);
OMG0
}
########
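# find_highest(): the staging table still holds the rows from the previous run,
# so its MAX(topic_id) is the watermark that has already been merged into the final table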
find_highest()
{
${PSQL} -q -t -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG1
SELECT MAX(topic_id) FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic;
OMG1
}
########
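# fetch_new_data(): a client-side \COPY against the source database writes every row
# above the watermark to a local file (/tmp/topic.dat)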
fetch_new_data()
{
watermark=${1-0}
echo ${watermark}
${PSQL} -h ${SOURCE_HOST} -U postgres ${SOURCE_DB} <<OMG2
\COPY (SELECT topic_id, topic_date, topic_body FROM ${SOURCE_SCHEMA}.topic WHERE topic_id >${watermark}) TO '/tmp/topic.dat';
OMG2
}
########
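# insert_new_data(): empty the staging table, reload it from /tmp/topic.dat with a
# server-side COPY (fine here because TARGET_HOST is localhost), then merge only the
# rows not yet present into the final table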
insert_new_data()
{
${PSQL} -h ${TARGET_HOST} -U postgres ${TARGET_DB} <<OMG3
DELETE FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic WHERE 1=1;
COPY ${TARGET_SCHEMA_IMPORT}.tmp_topic(topic_id, topic_date, topic_body) FROM '/tmp/topic.dat';
INSERT INTO ${TARGET_SCHEMA_FINAL}.topic(topic_id, topic_date, topic_body)
SELECT topic_id, topic_date, topic_body
FROM ${TARGET_SCHEMA_IMPORT}.tmp_topic src
WHERE NOT EXISTS (
SELECT *
FROM ${TARGET_SCHEMA_FINAL}.topic nx
WHERE nx.topic_id = src.topic_id
);
OMG3
}
########
delete_below_watermark()
{
watermark=${1-0}
echo ${watermark}
${PSQL} -h ${SOURCE_HOST} -U postgres ${SOURCE_DB} <<OMG4
-- delete not yet activated; COUNT(*) instead
-- DELETE
SELECT COUNT(*)
FROM ${SOURCE_SCHEMA}.topic WHERE topic_id <= ${watermark}
;
OMG4
}
######## Main
#create_local_stuff
watermark="`find_highest`"
echo 'Highest:' ${watermark}
fetch_new_data ${watermark}
insert_new_data
echo 'Delete below:' ${watermark}
delete_below_watermark ${watermark}
# Eof
This is just an example. Some notes:
I assume a non-decreasing serial PK for the table; in most cases it could also be a timestamp
for simplicity, all the queries are run as user postgres, you might need to change this
the watermark method will guarantee that only new records will be transmitted, minimising bandwidth usage
the method is atomic: if the script crashes, nothing is lost
only one table is fetched here, but you could add more
because I'm paranoid, I use a different name for the staging table and put it into a separate schema
The whole script does two queries on the remote machine (one for the fetch, one for the delete); you could combine these. But there is only one script (executing from the local = target machine) involved.
The DELETE is not yet active; it only does a count(*)
Can anyone explain what the relationship between HCatalog and HBase is, please?
I've found these definitions:
Apache HCatalog
HCatalog is a metadata abstraction layer for referencing data without using the underlying filenames or formats. It insulates users and scripts from how and where the data is physically stored.
Apache HBase
HBase (Hadoop DataBase) is a distributed, column oriented database. HBase uses HDFS for the underlying storage. It supports both batch style computations using MapReduce and point queries (random reads).
When we use CREATE TABLE in Hive, it creates the table in HCatalog. I just don't get it. Why not in a real database, which is HBase?
HCatalog seems to be some kind of metadata repository for all data stores. Does it mean it also keeps information about databases and tables in HBase?
I'll be grateful for an explanation.
Regards
Pawel
When you CREATE TABLE in Hive, it registers the table in HCatalog. A table in Hive may be an HBase table, but it can also be an abstraction above HDFS files and directories.
You can find a nice explanation of HCatalog on HortonWorks' site
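To illustrate the point about a Hive table possibly being an HBase table, here is a sketch using Hive's HBase storage handler (the table name, column family and HBase table name are made up):
CREATE TABLE hbase_backed_table (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "hbase_backed_table");
Hive (and therefore HCatalog) holds the metadata for this table, while the rows themselves are stored in HBase.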
Because I've noticed the question is quite popular, I've decided to answer it, as I've come to understand it quite well since asking it.
So, first of all, since Hadoop 2.0 HCatalog and Hive are treated as one product. Hive creates tables in HCatalog by default, which means the natural interface for HCatalog is Hive. So you can use all the SQL-92-style DML (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML) and DDL (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL), starting from CREATE/ALTER/DROP DATABASE, through CREATE/ALTER/DROP TABLE, and ending with SELECT, INSERT INTO, etc. The only exception is that INSERT works only as INSERT INTO ... SELECT ... FROM.
For typical insert we have to use:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Tables can have partitions and indexes (though from my experience indexes don't work well), but it is not a relational database, so you cannot use foreign keys.
With HBase it is quite different. It is one of the NoSQL databases (but, as answered in the previous post, Hive can serve as an SQL interface to HBase).
It has key -> value organized tables.
Let's compare a few commands (create table, insert into table, select from table, drop table):
Hive:
create table table_name (
id int,
value1 string,
value2 string
)
partitioned by (date string)
LOAD DATA INPATH 'filepath' INTO TABLE table_name [PARTITION (partcol1=val1, partcol2=val2 ...)]
INSERT INTO TABLE table_name SELECT * FROM othertable
SELECT * FROM table_name
DROP TABLE table_name
HBase:
hbase> create 'test', 'cf'
hbase> put 'test', 'row1', 'cf:a', 'value1'
hbase> get 'test', 'row1'
hbase> disable 'test'
hbase> drop 'test'
As you can see, the syntax is completely different. For SQL users, working with HCatalog is natural; those working with NoSQL databases will feel comfortable with HBase.