Cassandra modeling with key and indexes

I have a table of "Users". Each user has many "Projects" and each project has many "Clients", so the relationship is many-to-many, and I keep track of client events in a separate table.
The problem is that I can't figure out how to choose the key and the indexes so that the queries perform well.
The table with Key:
CREATE TABLE project_clients_events(
id timeuuid,
user_id int,
project_id int,
client_id text,
event text,
PRIMARY KEY ((user_id, project_id), id, client_id)
);
Now there will be more than 100K events per (user_id, project_id), so I need to be able to paginate through the results:
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
How can I group the results and paginate?
Thanks!

Let me answer your question in two parts: first the pagination, then the partition key.
The Cassandra CQL driver now supports automatic paging, so you need not worry about designing a complex WHERE clause.
Statement stmt = new SimpleStatement("SELECT * FROM images");
stmt.setFetchSize(100); // rows per page; further pages are fetched on demand
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    // Process each row here; the driver pages transparently during iteration
}
This link will be helpful:
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
Deciding the partition key depends on the queries you will run. For example, if most of your queries filter on user_id and project_id (i.e. they fetch results based only on those two columns), then it’s better to have them as the partition key, because all those rows will be stored in the same Cassandra partition (on the same node) and fetched together.
Hence I would advise you to first decide on the queries and select your partition keys accordingly, as your performance will depend on how well the queries match the way the columns are stored in Cassandra.
This could help you http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (slides 45-70)
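To make the paging idea concrete, here is a toy Python model of a single partition whose rows are kept sorted by clustering key. It is only a sketch and not driver code: the real driver tracks its position with an opaque paging state, and the names fetch_page and iterate_pages are hypothetical.

```python
from bisect import bisect_right

def fetch_page(rows, page_size, last_seen=None):
    """Return up to page_size rows whose clustering key is greater than last_seen.

    rows is a list of (clustering_key, value) tuples sorted by clustering_key,
    standing in for one partition, e.g. one (user_id, project_id) pair.
    """
    keys = [key for key, _ in rows]
    start = 0 if last_seen is None else bisect_right(keys, last_seen)
    return rows[start:start + page_size]

def iterate_pages(rows, page_size):
    """Yield successive pages until the partition is exhausted."""
    last_seen = None
    while True:
        page = fetch_page(rows, page_size, last_seen)
        if not page:
            return
        yield page
        last_seen = page[-1][0]  # resume after the last key already returned

# A partition with 250 events, read 100 rows at a time.
partition = [(i, f"event-{i}") for i in range(1, 251)]
pages = list(iterate_pages(partition, 100))
page_sizes = [len(page) for page in pages]
```

Because every page resumes from the last clustering key seen, no page has to skip over rows that were already returned, which is why paging within one partition stays cheap.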

Related

Inherits from one basetable? Good idea?

I would like to create a base table as follows:
CREATE TABLE IF NOT EXISTS basetable
(
id BIGSERIAL NOT NULL PRIMARY KEY,
createdon TIMESTAMP NOT NULL DEFAULT(NOW()),
updatedon TIMESTAMP NULL
);
All other tables would then inherit from this table, so it would contain the ids of all records. Will this cause performance problems with more than 20 billion records (distributed over ~10 tables)?

Having one table from which "all other tables will inherit" sounds like a strange idea but you might have a use case that is unclear to us.
Regarding your question specifically, a table with 20B rows is going to work but, as @Gordon mentioned, you will have performance challenges. Querying a row by ID will be perfectly fine, but searching rows by timestamp range will be slower, even with indexes (how slow depends on how fast your server is).
For large tables, a good solution is to use table partitioning (see https://wiki.postgresql.org/wiki/Table_partitioning). Based on what you query the most in your WHERE clause (id, createdon or updatedon) you can create partitions for that column and PostgreSQL will be able to read only the partition it needs instead of the entire table.

Cassandra - secondary index and query performance

My schema for the table is (A):
CREATE TABLE friend_list (
userId uuid,
friendId uuid,
accepted boolean,
ts_accepted timestamp,
PRIMARY KEY ((userId ,accepted), ts_accepted)
) with clustering order by (ts_accepted desc);
Here I am able to perform queries like:
1. SELECT * FROM friend_list WHERE userId="---" AND accepted=true;
2. SELECT * FROM friend_list WHERE userId="---" AND accepted=false;
3. SELECT * FROM friend_list WHERE userId="---" AND accepted IN (true,false);
But the 3rd query involves more reads, so I tried changing the schema like this (B):
CREATE TABLE friend_list (
userId uuid,
friendId uuid,
accepted boolean,
ts_accepted timestamp,
PRIMARY KEY (userId , ts_accepted)
) with clustering order by (ts_accepted desc);
CREATE INDEX ON friend_list (accepted);
With this type B schema, the 1st and 2nd queries still work, and I can simplify the third query to:
3. SELECT * FROM friend_list WHERE userId="---";
I believe that the second schema gives much better performance for the third query, as it won't check the condition on every row.
Cassandra experts, please tell me which schema is better for achieving this: A or B?

First of all, are you aware that your second schema does not behave like the first one at all? In the first one the 'accepted' field was part of the key, but in the second it is not. You no longer have the same uniqueness constraint, so you should check that this is not a problem for your model.
Second, if all you want is to avoid including the 'accepted' field in every request, you have two possibilities:
1 - You can use 'accepted' as a clustering column:
PRIMARY KEY ((userId), accepted, ts_accepted)
This way your 3rd request can be:
SELECT * FROM friend_list WHERE userId="---";
And you will get the same result more efficiently.
But this approach has a drawback: it creates larger partitions, which is not ideal for performance.
2 - Create two separate tables
This approach is much more in keeping with the Cassandra spirit. With Cassandra it is not unusual to duplicate data if that improves the efficiency of the requests.
So in your case you would keep your first schema for the first table and use it for the first and second requests,
and you would create another table with the same data but a slightly different schema: either with the secondary index, if 'accepted' does not need to be part of the primary key (as in your second schema), or with a primary key like this:
PRIMARY KEY ((userId), accepted, ts_accepted)
I would definitely prefer the secondary index for the second table if possible, because the accepted column has a low cardinality (2) and is thus well suited to a secondary index.
EDIT:
Also, you used a timestamp in your primary key. Be aware that this may be a problem if the same user can create two rows in this table, because a timestamp does not guarantee uniqueness: what happens if two rows are created in the same millisecond? (The second write would silently overwrite the first.)
You should probably use a TimeUUID. This type, very commonly used in Cassandra, is a version 1 UUID: it guarantees uniqueness by combining a timestamp with node and clock-sequence bits.
Furthermore, a timestamp in a primary key can create temporary hotspots on a Cassandra node, so it is definitely better to avoid it.
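As a rough illustration of why a time-based UUID is safer than a bare timestamp, the sketch below uses Python's standard uuid.uuid1(), which produces version 1 UUIDs (the same family as Cassandra's timeuuid). It runs outside Cassandra, so it only demonstrates the uniqueness property, not how the server stores or orders the values.

```python
import time
import uuid

# Generate many identifiers in a tight loop, far faster than one per millisecond.
ids = [uuid.uuid1() for _ in range(1000)]
millis = [int(time.time() * 1000) for _ in range(1000)]

all_unique = len(set(ids)) == len(ids)      # version 1 UUIDs do not collide here
versions = {u.version for u in ids}         # every value is a version 1 UUID
distinct_millis = len(set(millis))          # plain millisecond timestamps repeat

# Each version 1 UUID still embeds its creation time (100 ns units since 1582-10-15),
# so time ordering information is not lost.
embedded_times = [u.time for u in ids]
```

In a tight loop like this the millisecond timestamps collapse to a handful of distinct values, while every UUID stays distinct, which is the property a clustering column needs.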

Oracle 11g Partitioning Strategy for this table

Database: Oracle 11g
I am working on a greenfield project and designing a database schema. I have an audit table which, as the name suggests,
will eventually grow to hold a huge number of records. The following is the table definition (after pruning the extraneous columns).
create table ClientAudit (
id number(19,0) primary key,
clientId number(19,0) not null,
createdOn timestamp with time zone default systimestamp not null
);
id is a natural number to be populated by an Oracle sequence.
clientId is a unique client identifier.
For ease of querying by reporting, I am also creating the following view, which gives the latest record for each client, based on createdOn:
create or replace view ClientAuditView
as
select * from (
select ca.*,max(ca.createdOn) keep (dense_rank last order by ca.createdOn)
over (partition by ca.clientId) maxCreatedOn
from ClientAudit ca
)
where createdOn=maxCreatedOn;
/
I am not sure what should be the partitioning key here if I were to partition ClientAudit table.
Should it be ClientId, or CreatedOn?
What should be the partitioning strategy?

Since the selection is on createdOn, I would suggest a range partition on that column; the query should also reference the correct partition based on the date passed in.

You will not benefit from partition pruning this way. If you plan to store data for a very long time, the view will become very slow.
I recommend storing "lastAuditTimestamp" or "lastAuditId" in the Clients table or another entity, and redoing the view as follows:
create or replace view ClientAuditView
as
select ca.* from ClientAudit ca
where (clientId,createdOn) in (select clientId,lastAuditTimestamp from Clients c)
;
/
At a later stage you can optimize it further by adding a range condition on the minimum/maximum lastAuditTimestamp, in case the number of clients grows too high and a HASH SEMI JOIN is used.

Selecting the most optimal query

I have a table in an Oracle database, called my_table for example. It is a kind of log table. It has an incrementing column named "id" and a "registration_number" column that identifies each registered user. I want to get the latest change for each registered user, so I wrote the queries below:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is which of the above queries is recommended, and why? The second is: occasionally about 70,000 rows are inserted into this table at once, but mostly the number of inserted rows varies between 0 and 2,000. Is it reasonable to add an index to this table?

An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
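The three formulations (correlated subquery, join on a grouped subquery, and ROW_NUMBER) should return the same rows, which is easy to check on a small data set. The sketch below uses Python's built-in sqlite3 as a stand-in for Oracle, assuming an SQLite build with window-function support (3.25 or later); the sample rows and the note column are made up for illustration, and it says nothing about which form is fastest on Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, registration_number INTEGER, note TEXT)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [(1, 100, "created"), (2, 200, "created"), (3, 100, "updated"),
     (4, 300, "created"), (5, 200, "updated"), (6, 100, "closed")],
)

# First version from the question: correlated subquery.
correlated = """
    SELECT t.id FROM my_table t
    WHERE t.id = (SELECT MAX(id) FROM my_table t_m
                  WHERE t_m.registration_number = t.registration_number)
    ORDER BY t.id
"""

# Second version from the question: join against MAX(id) per registration_number.
grouped = """
    SELECT t.id FROM my_table t
    INNER JOIN (SELECT MAX(id) AS m_id FROM my_table
                GROUP BY registration_number) t_m ON t.id = t_m.m_id
    ORDER BY t.id
"""

# Analytic version from the answer: rank rows per user, keep the newest.
analytic = """
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY registration_number
                                  ORDER BY id DESC) AS id_rank_by_user
        FROM my_table
    ) AS ranked
    WHERE id_rank_by_user = 1
    ORDER BY id
"""

results = {
    name: [row[0] for row in conn.execute(sql)]
    for name, sql in [("correlated", correlated), ("grouped", grouped), ("analytic", analytic)]
}
```

Once the results agree, comparing execution plans on the real database is the way to pick between them.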

To find out which query is faster, check the execution plan and cost; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions often make this kind of query run much faster.
If you feel this table is going to grow very big, then I would suggest partitioning the table and using local indexes. They can help you write faster queries.
When you insert lots of rows, indexes slow down the insertion, because the index also has to be updated with each insert (I would not recommend an index on ID). There are two solutions I can think of for this:
You can drop the index before the insertion and then recreate it afterwards.
Use reverse key indexes. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can impact your queries a bit (for example, it cannot serve index range scans), so there is a trade-off.

If you are looking for a faster solution and there really is a need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with the unique registration_number values and the rowid of the last record created in the log table.
E.g. (demo only, not checked for syntax validity; sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
With such a schema you can simply query the last action of every user with good performance:
select lua.registration_number, l.*
from
last_user_action lua,
my_log l
where
l.rowid (+) = lua.last_action
A rowid is a physical storage identity that directly addresses a storage block, so you can't rely on it after moving to another server, restoring from backups, etc. But if you need that, it is simple to add the id column from the my_log table to last_user_action as well, and use one or the other depending on requirements.
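The same pattern can be tried out with Python's built-in sqlite3. This is a sketch under a few assumptions, not a port of the Oracle code: it stores the log id instead of a rowid (the portable variant mentioned above), the column name last_action_id is hypothetical, and it uses INSERT OR REPLACE so that a user's first action also creates the pointer row, a case a plain UPDATE would miss.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        registration_number INTEGER,
        action_id TEXT
    );
    CREATE TABLE last_user_action (
        registration_number INTEGER PRIMARY KEY,
        last_action_id INTEGER
    );
""")

def write_log(conn, reg_num, action_id):
    """Append a log row and point last_user_action at it, in one transaction."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO my_log (registration_number, action_id) VALUES (?, ?)",
            (reg_num, action_id),
        )
        conn.execute(
            "INSERT OR REPLACE INTO last_user_action "
            "(registration_number, last_action_id) VALUES (?, ?)",
            (reg_num, cur.lastrowid),
        )

write_log(conn, 100, "login")
write_log(conn, 200, "login")
write_log(conn, 100, "logout")

# One primary-key lookup per user instead of a MAX(id) search over the whole log.
last_actions = conn.execute("""
    SELECT lua.registration_number, l.action_id
    FROM last_user_action lua
    LEFT JOIN my_log l ON l.id = lua.last_action_id
    ORDER BY lua.registration_number
""").fetchall()
```

The cost is one extra write per logged action, in exchange for a read that no longer depends on the size of the log table.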

Delete sql query is very very slow

I have a small problem.
I have 2 tables: events and multimedia.
The events table has the following fields:
id
device_id
created_at
The primary key is id, and there is an index on device_id and created_at.
The multimedia table has the following fields:
id
device_id
created_at
data (a blob field that contains a 20k string)
The primary key is id, and there is an index on device_id and created_at.
The problem occurs when I want to delete the records with created_at before a given date.
The query:
DELETE FROM events WHERE device_id = #{dev[0]}
AND created_at <= '#{mm_critical_time.to_s}'
is OK: it deletes the records in 5 or 6 seconds.
The query:
DELETE FROM multimedia WHERE device_id = #{dev[0]}
AND created_at <= '#{mm_critical_time.to_s}'
gives me problems: the execution starts and never finishes.
What's the problem?

You probably need to create an index for the columns you are searching.
CREATE INDEX device_created_index
ON multimedia (device_id, created_at);
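Whether such an index is actually picked up can be checked from the query plan. The sketch below does this with Python's built-in sqlite3 and EXPLAIN QUERY PLAN, purely as an illustration: the question does not say which server is in use, plan output differs between engines, and on MySQL the equivalent check would be EXPLAIN on a SELECT with the same WHERE clause. The parameter values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE multimedia ("
    "id INTEGER PRIMARY KEY, device_id INTEGER, created_at TEXT, data BLOB)"
)

# Same WHERE clause as the slow DELETE, expressed as a SELECT for inspection.
query = "SELECT id FROM multimedia WHERE device_id = ? AND created_at <= ?"
params = (7, "2013-01-01 00:00:00")

def plan(conn):
    """Return the plan's human-readable detail column(s) as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
    return " | ".join(row[-1] for row in rows)

plan_before = plan(conn)  # no usable index yet: a full table scan
conn.execute("CREATE INDEX device_created_index ON multimedia (device_id, created_at)")
plan_after = plan(conn)   # the composite index now serves both conditions
```

If the plan still shows a full scan after the index exists, the index is not the bottleneck and the large blob column or locking is a more likely suspect.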
If you want to learn more about optimizing your queries, refer to the answer I gave here about using EXPLAIN SELECT: is there better way to do these mysql queries?

The order of the conditions can matter on some database servers (you haven't told us which one you use; most modern cost-based optimizers ignore it, but it is cheap to test), so try reversing them:
DELETE FROM multimedia WHERE
created_at <= '#{mm_critical_time.to_s}'
AND device_id = #{dev[0]}
or use an inner query on the fastest part:
DELETE FROM multimedia WHERE
created_at <= '#{mm_critical_time.to_s}'
AND device_id in (select device_id from multimedia where device_id = #{dev[0]})
Also, I always break slow queries up and time the parts separately, so that I know where the bottleneck is.
Some programs show you how long a query took, and in Ruby you could use Benchmark. While testing, you can replace the delete with a select.
So test:
select * FROM multimedia WHERE created_at <= '#{mm_critical_time.to_s}'
and
select * from multimedia WHERE device_id = #{dev[0]}
Good luck.

It is quite naive to offer solutions to performance problems in relational databases without knowing the whole story, since there are many variables involved.
Based on the data you provided, though, I would suggest dropping the primary keys and indexes and running:
CREATE UNIQUE CLUSTERED INDEX uc ON events (device_id, created_at);
CREATE UNIQUE CLUSTERED INDEX uc ON multimedia (device_id, created_at);
If you really need to enforce the uniqueness of the id field, create a unique nonclustered index on that column in each table (but this will make the delete command take longer):
CREATE UNIQUE INDEX ix_id ON events (id);
CREATE UNIQUE INDEX ix_id ON multimedia (id);
