Cassandra 1.1 composite keys, columns, and filtering in CQL 3 - indexing

I wish to have a table something like the following:
CREATE TABLE ProductFamilies (
ID varchar,
PriceLow int,
PriceHigh int,
MassLow int,
MassHigh int,
MnfGeo int,
MnfID bigint,
Data varchar,
PRIMARY KEY (ID)
);
There are 13 fields in total. Most of these represent buckets. Data is a JSON of product family IDs, which are then used in a subsequent query. Given how Cassandra works, the column names under the hood will be the values. I wish to filter these.
I wish to run queries as follows:
SELECT Data FROM MyApp.ProductFamilies WHERE ID IN (?, ?, ?) AND PriceLow >= ?
AND PriceHigh <= ? AND MassLow >= ? AND MassHigh <= ? and MnfGeo >= ? AND
MnfGeo <= ?
I read that Cassandra can only execute WHERE predicates against composite row keys or indexed columns. Is this still true? If so, I would have to make the columns other than Data part of the PK.
Is it still the case that one has to include all columns from left to right and cannot skip any?
Are there any non-optimal points in my design?
I would like to add a column "Materials", which is an array of possible materials in a product family. Think pizza toppings, and querying "WHERE Materials IN ('Pineapple')". Without creating a separate inverted index of materials and then performing a manual intersection against the above query, is there any other [more elegant] way of handling this in Cassandra?

If you specify the exact PK you are looking up, as you propose here (id IN ...), you can use whatever expressions you like in the remaining predicates. There are no restrictions.
List collections are supported starting in 1.2.0, which is scheduled for release at the end of October. Indexed querying of collection contents may or may not be supported.
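For reference, a minimal sketch of what a collection-backed Materials column could look like once 1.2 is out (CQL 3 on Cassandra 1.2+, not valid in 1.1; the table name ProductFamilies2 is hypothetical, and filtering on collection contents may still require the inverted-index approach discussed further down):
-- Hypothetical sketch: declaring and populating a list collection (Cassandra 1.2+).
CREATE TABLE MyApp.ProductFamilies2 (
    ID varchar,
    Data varchar,
    Materials list<varchar>,   -- e.g. ['Pineapple', 'Ham']
    PRIMARY KEY (ID)
);
-- Appending a value to the list:
UPDATE MyApp.ProductFamilies2 SET Materials = Materials + ['Pineapple'] WHERE ID = 'pizza-family';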

Basically, to support your queries you need to have
create column family ProductFamilies with
comparator='CompositeType(UTF8Type, Int32Type, Int32Type, Int32Type, Int32Type, Int32Type, LongType, UTF8Type)'
and key_validation_class='UTF8Type'
or
CREATE TABLE ProductFamilies (
ID varchar,
PriceLow int,
PriceHigh int,
MassLow int,
MassHigh int,
MnfGeo int,
MnfID bigint,
Data varchar,
PRIMARY KEY (ID, PriceLow, PriceHigh, MassLow, MassHigh, MnfGeo, MnfID, Data)
);
Now you can query
SELECT Data FROM MyApp.ProductFamilies WHERE ID IN (?, ?, ?) AND PriceLow >= ?
AND PriceHigh <= ? AND MassLow >= ? AND MassHigh <= ? and MnfGeo >= ? AND
MnfGeo <= ?
Provided you don't skip any column from left to right (even if a column is not used as a filter, it must at least be covered with a *), and all your values are encoded in the column names rather than in the column value.
One more thing you should understand about composite columns: a column slice must be contiguous. So PriceLow >= 10 AND PriceLow <= 40 will return a contiguous slice, but further filtering that result set on MassLow and the other columns will not work, as it does not result in a contiguous slice. However, PriceLow = 10 AND MassLow >= 10 AND MassLow <= 20 should work (tested with phpcassa), as that once again results in a contiguous slice.
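To make the contiguous-slice rule concrete, here is a hedged pair of examples against the table above (placeholder values, assuming the clustering layout from the CREATE TABLE):
-- Works: equality on the leading clustering column, then a range on the next one,
-- which maps to one contiguous slice of composite columns.
SELECT Data FROM MyApp.ProductFamilies
WHERE ID = 'bucket-1' AND PriceLow = 10 AND PriceHigh >= 10 AND PriceHigh <= 20;
-- Does not map to one contiguous slice: a range on PriceLow followed by a further
-- restriction on MassLow, so Cassandra 1.1 rejects it.
SELECT Data FROM MyApp.ProductFamilies
WHERE ID = 'bucket-1' AND PriceLow >= 10 AND PriceLow <= 40 AND MassLow >= 10;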
Alternatively, create one or more secondary indexes on your columns. Then you can query on column values, provided the query always includes at least one indexed column.
http://www.datastax.com/docs/1.1/ddl/indexes
Regarding your Materials question, as far as I know there is no alternative to maintaining an inverted index if it is going to be a multi-valued column (a minimal sketch follows below).
It would be great if @jbellis could verify this.
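A minimal sketch of such an inverted index (the table name MaterialsIndex is made up; the application has to keep it in sync and intersect its results with the bucket query manually):
-- Hypothetical inverted index: one row per (material, product family).
CREATE TABLE MyApp.MaterialsIndex (
    Material varchar,
    FamilyID varchar,
    PRIMARY KEY (Material, FamilyID)
);
-- Look up all families containing a material, then intersect with the bucket query above.
SELECT FamilyID FROM MyApp.MaterialsIndex WHERE Material = 'Pineapple';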

Related

MariaDB Optimizing Big Table Selects With Table Partitioning

I have the following table that contains millions of rows (2M-30M):
rowId (int), intElementId (int), timestamp (int), price (float), volume (float)
This table gets truncated and refilled every time it is used. The primary operation is a SELECT using intElementId (an integer) and the timestamp.
For example
SELECT * from myTable where intElementId = x and timestamp = y
Currently I'm using B-tree indexes on intElementId and timestamp. Together these operations take quite a lot of time once there is more data in my table.
I have thought of dynamically generating multiple tables based on my intElementId (e.g. testTable_xxx, testTable_xxx), which usually stays below a thousand distinct values, so I can query by timestamp more effectively within a smaller sample. While searching about this in other questions I was discouraged from doing it, because as I found out MariaDB is not optimized for something like this.
An alternative solution I have found is partitioning my table. I have seen a couple of applications of this, but not many examples of it actually being used.
Partitioning by key seems to be what I want, but on the MariaDB wiki it is once again discouraged, and partitioning by range is suggested instead.
My questions are:
Is partitioning by key okay?
Is having ~1000 partitions okay?
Is having fewer partitions than the number of distinct intElementIds bad, and how does it affect me?
To test my solution I have used the following, and it does seem to improve performance by about ~48%:
testData = generateTestData()
traditionalInsert(testData)
time = timeit.timeit(benchmarkDbTable, number=20)
partitionedInsert(testData)
time = timeit.timeit(benchmarkDbTable, number=20)
Where my tables are generated like so
Traditional
CREATE TABLE testTable (`id` INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (`id`), intElementId INT, timestamp INT, price FLOAT, volume FLOAT)
ALTER TABLE testTable ADD INDEX `intElementId` (`intElementId`)
Partitioned, where the number of partitions equals the number of unique intElementId values (<1000):
CREATE TABLE testTable (`id` INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (`id`, `intElementId`), intElementId INT, timestamp INT, price FLOAT, volume INT) PARTITION BY KEY (intElementId) PARTITIONS ?;
Testing like so:
cur.execute("SELECT COUNT(*) FROM testTable ")
cur.execute("SELECT DISTINCT intElementId FROM testTable ")
intElementId = cur.fetchall()
for element in intElementId:
cur.execute(f"SELECT COUNT(*) FROM testTable WHERE intElementId= {element[0]}")
for element in intElementId:
cur.execute(f"SELECT MIN(price), MAX(price) FROM testTable WHERE intElementId= {intElementId[0]}")
for elements in intElementId:
cur.execute(f"SELECT price FROM testTable WHERE intElementId= {intElementId[0]} ORDER BY price DESC LIMIT 10")

Do not Update the Values in Merge statement if old values do not change while update in Merge

MERGE PFM_EventPerformance_MetaData AS TARGET
USING
(
SELECT
[InheritanceMeterID] = @InheritanceMeterPointID
,[SubHourlyScenarioResourceID] = @SubHourlyScenarioResourceID
,[MeterID] = @MeterID --internal ID
,[BaselineID] = @BaselineID --internal ID
,[UpdateUtc] = GETUTCDATE()
)
AS SOURCE ON
TARGET.[SubHourlyScenarioResourceID] = SOURCE.[SubHourlyScenarioResourceID]
AND TARGET.[MeterID] = SOURCE.[MeterID]--internal ID
AND TARGET.[BaselineID] = SOURCE.[BaselineID]--internal ID
WHEN MATCHED THEN UPDATE SET
@MetaDataID = TARGET.ID --get preexisting ID when exists (must populate one row at a time)
,InheritanceMeterID = SOURCE.InheritanceMeterID
,[UpdateUtc] = SOURCE.[UpdateUtc]
WHEN NOT MATCHED
THEN INSERT
(
[InheritanceMeterID]
,[SubHourlyScenarioResourceID]
,[MeterID]--internal ID
,[BaselineID]--internal ID
)
VALUES
(
SOURCE.[InheritanceMeterID]
,SOURCE.[SubHourlyScenarioResourceID]
,SOURCE.[MeterID]--internal ID
,SOURCE.[BaselineID]--internal ID
);
In the above query I do not want to update the values in the target table if there is no change from the old values. I am not sure how to achieve this, as I have rarely used the MERGE statement. Please help me with the solution. Thanks in advance.
This is done best in two stages.
Stage 1: Merge Update on condition
SO Answer from before (Thanks to @Laurence!)
Stage 2: hash key condition to compare
Limits: max 4000 characters, including column separator characters
A rather simple way to compare multiple columns in one condition is to use a computed column on both sides, generated with HASHBYTES(<algorithm>, <column(s)>).
This moves writing lots of code from the merge statement to the table generation.
Quick example:
CREATE TABLE dbo.Test
(
id_column int NOT NULL,
dsc_name1 varchar(100),
dsc_name2 varchar(100),
num_age tinyint,
flg_hash AS HashBytes( 'SHA1',
Cast( dsc_name1 AS nvarchar(4000) )
+ N'•' + dsc_name2 + N'•' + Cast( num_age AS nvarchar(3) )
) PERSISTED
)
;
Comparing the flg_hash columns between source and destination makes the comparison quick, as it is just a comparison between two 20-byte varbinary values.
A couple of caveats for working with HashBytes:
The function only works on a total of 4000 nvarchar characters.
The trade-off for the short comparison code is having to generate the hash with the columns in a consistent order across views and tables.
There is a duplicate collision chance of around one in 2^50+ for SHA1; as a security mechanism it is now considered insecure, and a few years ago MS tried to drop SHA1 as an algorithm.
Columns added to tables and views later can be silently left out of the comparison if the HashBytes expression is not amended to include them.
Overall I have found that comparing many columns individually can overload my server engines, but I have never had an issue with hash key comparisons.
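Putting the two stages together, the MERGE condition could look roughly like this (a sketch only: the @-prefixed variables are hypothetical, flg_hash is the computed column from the example above on the target side and is recomputed in the source projection, and NULLs in the hashed columns would still need extra handling):
-- Hypothetical sketch: skip the UPDATE when the hash of the compared columns is unchanged.
MERGE dbo.Test AS TARGET
USING (
    SELECT @id_column AS id_column,
           @dsc_name1 AS dsc_name1,
           @dsc_name2 AS dsc_name2,
           @num_age   AS num_age,
           HASHBYTES('SHA1',
               CAST(@dsc_name1 AS nvarchar(4000))
               + N'•' + @dsc_name2
               + N'•' + CAST(@num_age AS nvarchar(3))) AS flg_hash
) AS SOURCE
    ON TARGET.id_column = SOURCE.id_column
WHEN MATCHED AND TARGET.flg_hash <> SOURCE.flg_hash THEN
    UPDATE SET dsc_name1 = SOURCE.dsc_name1,
               dsc_name2 = SOURCE.dsc_name2,
               num_age   = SOURCE.num_age
WHEN NOT MATCHED THEN
    INSERT (id_column, dsc_name1, dsc_name2, num_age)
    VALUES (SOURCE.id_column, SOURCE.dsc_name1, SOURCE.dsc_name2, SOURCE.num_age);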

Sybase Check Constraint Evaluation

I'm trying to formulate some check constraints in SQL Anywhere 9.0.
Basically I have schema like this:
CREATE TABLE limits (
id INT IDENTITY PRIMARY KEY,
count INT NOT NULL
);
CREATE TABLE sum (
user INT,
limit INT,
my_number INT NOT NULL CHECK(my_number > 0),
PRIMARY KEY (user, limit)
);
I'm trying to enforce a constraint that the sum of my_number for each limit is at most count in the limits table.
I've tried
CHECK ((SELECT sum(my_number) FROM sum WHERE limit = limit) <= (SELECT count FROM limits WHERE id = limit))
and
CHECK (((SELECT sum(my_number) FROM sum WHERE limit = limit) + my_number) <= (SELECT count FROM limits WHERE id = limit))
and they both seem not to do the correct thing. They are both off by one (meaning insertion only fails once the total would go negative, not before that).
So my question is, with what version of the table are these subqueries being executed against? Is it the table before the insertion happens, or does the subquery check for consistency after the insert happens, and rolls back if it finds it invalid?
I do not really understand what you are trying to enforce here, but based on this help topic:
Using CHECK constraints on columns
Once a CHECK condition is in place, future values are evaluated
against the condition before a row is modified.
I would go for a BEFORE INSERT trigger instead. You have more options and can produce a better error message.
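A minimal sketch of what such a trigger might look like in SQL Anywhere (syntax not verified against 9.0 specifically; "sum", "limit" and "count" are quoted because they clash with keywords/functions, and the error number and message are arbitrary):
-- Hypothetical sketch: reject inserts that would push SUM(my_number) past limits.count.
CREATE TRIGGER trg_check_limit BEFORE INSERT ON "sum"
REFERENCING NEW AS new_row
FOR EACH ROW
BEGIN
    DECLARE current_total INT;
    DECLARE max_count INT;
    SELECT COALESCE(SUM(my_number), 0) INTO current_total
      FROM "sum" WHERE "limit" = new_row."limit";
    SELECT "count" INTO max_count
      FROM limits WHERE id = new_row."limit";
    IF current_total + new_row.my_number > max_count THEN
        RAISERROR 99999 'my_number total exceeds the configured limit';
    END IF;
END;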

Find changed rows (composite key with nulls)

I'm trying to build a query that will fetch all changed rows from a source table, comparing it to a target table.
The primary key (it's not really defined as a primary key, just what we know identifies a unique row) is a composite that consists of lots of foreign keys, approximately 15, most of which can have NULL values. For simplicity, let's say the primary key consists of these three key columns, and there are two value fields that need to be compared:
CREATE TABLE SourceTable
(
Key1 int NOT NULL,
Key2 nvarchar(10),
Key3 int,
Value1 nvarchar(255),
Value2 int
)
If Key1 = 1, Key2 = NULL and Key3 = 4, then I would like to compare it to the row in the target that has exactly the same values in the key fields, including the NULL in Key2.
The value fields can also have NULL values.
So what's the best approach to use when designing queries like this, where NULL values should be considered real values and compared?
ISNULL? COALESCE? Intersect?
Any suggestions?
ANSI SQL has the IS [NOT] DISTINCT FROM construct that has not been implemented in SQL Server yet (Connect request).
It is possible to simulate this functionality in SQL Server using EXCEPT/INTERSECT, however. Both of these treat NULLs as equal in comparisons. You want to find rows where the key columns are the same but the value columns are different, so this should do it:
SELECT *
FROM SourceTable S
JOIN DestinationTable D
ON S.Key1 = D.Key1
/*Join the key columns on equality*/
AND NOT EXISTS (SELECT S.Key2,
S.Key3
EXCEPT
SELECT D.Key2,
D.Key3)
/*and the value columns on inequality*/
AND NOT EXISTS (SELECT S.Value1,
S.Value2
INTERSECT
SELECT D.Value1,
D.Value2)
Nulls don't play nice with foreign keys: changing a null to a value will not (in SQL Server) cause it to cascade when updated.
Best to avoid the NULL value (and for many other reasons too!). Instead, get the DBA to nominate some other 'magic' value of the same data type but outside the domain. Examples: DATE: a far-distant or far-future date value. INTEGER: zero or a negative value. VARCHAR: a value in double curly braces to denote a metadata value, e.g. '{{NONE}}', '{{UNKNOWN}}', '{{NA}}', etc., then a CHECK constraint to ensure ordinary values cannot start/end with double curly braces (a sketch follows below).
Alternatively, model missing information by absence of a tuple in a relvar (closed world assumption) ;)
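Going back to the magic-value suggestion, a hedged sketch of such a CHECK constraint (the constraint name and tokens are made up, and the nominated tokens themselves are exempted so they can still be stored; note Key2 is only nvarchar(10), so the tokens must be short):
-- Hypothetical sketch: allow only the nominated '{{...}}' tokens, and keep ordinary
-- data from looking like them.
ALTER TABLE SourceTable ADD CONSTRAINT chk_Key2_magic CHECK (
    Key2 IN ('{{NONE}}', '{{NA}}')
    OR (Key2 NOT LIKE '{{%' AND Key2 NOT LIKE '%}}')
);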

Create a unique primary key (hash) from database columns

I have this table which doesn't have a primary key.
I'm going to insert some records into a new table to analyze them, and I'm thinking of creating a new primary key with the values from all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3*31
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as the PK, because what I need to do is compare them with data from another DB table.
My table has 3 numbers and a date.
EDIT What my problem is
I think a bit more of background is needed. I'm sorry for not providing it before.
I have a database (dm) that is updated every day from another DB (original source). It has records from the past two years.
Last month (July) the update process broke, and for a month no data was updated into the dm.
I manually created a table with the same structure in my Oracle XE, and I copied the records from the original source into my DB (myxe). I copied only records from July, to create a report needed by the end of the month.
Finally, on Aug 8 the update process was fixed and the records that had been waiting to be migrated by this automatic process were copied into the database (from originalsource to dm).
This process cleans the data out of the original source once it has been copied (into dm).
Everything looked fine, but we have just realized that a portion of the records was lost (about 25% of July).
So what I want to do is use my backup (myxe) and insert into the database (dm) all those missing records.
The problems here are:
They don't have a well-defined PK.
They are in separate databases.
So I thought that if I could create a unique PK from both tables which gave the same number, I could tell which rows were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table@PRODUCTION a, the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
This returns all the rows that are present in both databases (the intersection?). I've got 2,000 records.
What I'm going to do next is delete them all from the target DB and then just insert all the rows from my DB into the target table.
I hope I don't get into something worse :-S
The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.
Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT
UPDATE mytable
SET pk_col = rownum
ALTER TABLE mytable MODIFY pk_col INT NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
or this:
ALTER TABLE mytable ADD pk_col RAW(16)
UPDATE mytable
SET pk_col = SYS_GUID()
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
The latter uses GUIDs, which are unique across databases, but they consume more space and are much slower to generate (your INSERTs will be slow).
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
MERGE
INTO mytable v
USING (
SELECT rowid AS rid, rownum AS rn
FROM mytable
ORDER BY
col1, col2, col3
)
ON (v.rowid = rid)
WHEN MATCHED THEN
UPDATE
SET pk_col = rn
Note that tables should be identical up to a single row (i. e. have same number of rows with same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on dm side)
SELECT *
FROM mytable@myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable@myxe but not in mytable@dm.
Note that it will shrink all duplicates if any.
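And if the goal is to actually insert the missing July rows into dm rather than just list them, the same MINUS result can feed an INSERT (a sketch, assuming the database link is named myxe and both tables have identical column lists):
-- Hypothetical sketch: copy the rows present in the backup but missing from dm.
INSERT INTO mytable
SELECT * FROM mytable@myxe
MINUS
SELECT * FROM mytable;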
Assuming that you have ensured uniqueness...you can do almost the same thing in SQL. The only problem will be the conversion of the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.
If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns are required to ensure uniqueness (e.g. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data, could you use that?
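Oracle does ship a hash function that could serve as such a checksum, for example ORA_HASH (a sketch only; the column names num1, num2, num3 and date_col are placeholders, and as noted above a hash is not guaranteed to be collision-free, so it should not be trusted as a primary key on its own):
-- Hypothetical sketch: derive a comparison hash from the three numbers and the date.
SELECT t.*,
       ORA_HASH(num1 || '|' || num2 || '|' || num3 || '|' ||
                TO_CHAR(date_col, 'YYYYMMDDHH24MISS')) AS row_hash
  FROM mytable t;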