Cassandra - secondary index and query performance - sql

My schema for the table is:
A)
CREATE TABLE friend_list (
    userId uuid,
    friendId uuid,
    accepted boolean,
    ts_accepted timestamp,
    PRIMARY KEY ((userId, accepted), ts_accepted)
) WITH CLUSTERING ORDER BY (ts_accepted DESC);
Here I am able to perform queries like:
1. SELECT * FROM friend_list WHERE userId="---" AND accepted=true;
2. SELECT * FROM friend_list WHERE userId="---" AND accepted=false;
3. SELECT * FROM friend_list WHERE userId="---" AND accepted IN (true,false);
But the 3rd query involves more reads, so I tried to change the schema like this:
B)
CREATE TABLE friend_list (
    userId uuid,
    friendId uuid,
    accepted boolean,
    ts_accepted timestamp,
    PRIMARY KEY (userId, ts_accepted)
) WITH CLUSTERING ORDER BY (ts_accepted DESC);
CREATE INDEX ON friend_list (accepted);
With this type B schema, the 1st and 2nd queries work, and I can simplify the third query to:
3. SELECT * FROM friend_list WHERE userId="---";
I believe the second schema gives much better performance for the third query, as it won't do a condition check on every row.
Cassandra experts, please suggest which schema is best for achieving this: A or B.

First of all, are you aware that your second schema does not work at all like the first one? In the first one the 'accepted' field was part of the primary key, but in the second it is not. You no longer have the same uniqueness constraint; you should check that this is not a problem for your model.
Second, if you just want to avoid having to include the 'accepted' field in every request, you have two possibilities:
1 - You can use 'accepted' as a clustering column:
PRIMARY KEY ((userId), accepted, ts_accepted)
This way your 3rd request can be:
SELECT * FROM friend_list WHERE userId="---";
And you will get the same result more efficiently.
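For reference, here is a sketch of the full table definition under this option (same columns as schema A, only the key changes; note that accepted must be listed before ts_accepted in the clustering order):
CREATE TABLE friend_list (
    userId uuid,
    friendId uuid,
    accepted boolean,
    ts_accepted timestamp,
    PRIMARY KEY ((userId), accepted, ts_accepted)
) WITH CLUSTERING ORDER BY (accepted ASC, ts_accepted DESC);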
But this approach has a downside: it creates larger partitions, which is not ideal for performance.
2 - Create two separate tables
This approach is much more in the spirit of Cassandra. With Cassandra it is not unusual to duplicate data if doing so improves the efficiency of your requests.
So in your case you would keep your first schema for the first table, serving the first and second requests,
and you would create another table with the same data but a slightly different schema: either with the secondary index, if 'accepted' does not need to be part of the primary key (as you did for your second schema), or with a primary key like this:
PRIMARY KEY ((userId), accepted, ts_accepted)
I would definitely prefer the secondary index for the second table if possible, because the accepted column has a low cardinality (2) and is thus well suited for a secondary index.
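If you go the secondary-index route, the duplicated table could look like this (a sketch; the table name friend_list_by_user is hypothetical):
CREATE TABLE friend_list_by_user (
    userId uuid,
    friendId uuid,
    accepted boolean,
    ts_accepted timestamp,
    PRIMARY KEY (userId, ts_accepted)
) WITH CLUSTERING ORDER BY (ts_accepted DESC);
CREATE INDEX ON friend_list_by_user (accepted);
Your application then writes every change to both tables.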
EDIT:
Also, you used a timestamp in your primary key. Be aware that this may be a problem if the same user can create two rows in this table, because a timestamp does not guarantee uniqueness: what happens if two rows are created in the same millisecond?
You should probably use a TimeUUID. This type, very commonly used in Cassandra, guarantees uniqueness by combining a timestamp with additional unique bits.
Furthermore, a timestamp in a primary key can create temporary hotspots on a Cassandra node, which is definitely better to avoid.
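A minimal sketch of schema A reworked with a timeuuid clustering column; now() generates a fresh TimeUUID at write time, and toTimestamp() recovers the wall-clock time on read (toTimestamp() assumes Cassandra 2.2+; older versions use dateOf()):
CREATE TABLE friend_list (
    userId uuid,
    friendId uuid,
    accepted boolean,
    ts_accepted timeuuid,
    PRIMARY KEY ((userId, accepted), ts_accepted)
) WITH CLUSTERING ORDER BY (ts_accepted DESC);

-- write: the cluster generates the TimeUUID
INSERT INTO friend_list (userId, accepted, ts_accepted, friendId)
VALUES (?, true, now(), ?);

-- read: newest accepted friends first, recovering the wall-clock time
SELECT friendId, toTimestamp(ts_accepted)
FROM friend_list
WHERE userId = ? AND accepted = true;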


best way to store time series data with an identifier in sqlite3

Let's say there are a number of different sensors, all of which save data to a database as they measure it, and each sensor can have many entries. I'm looking for the best way to save this data so that select queries can be as fast as possible later. Something like
"CREATE TABLE IF NOT EXISTS DataTable (sensor_id TEXT, measured_value REAL, time_of_measuring REAL)"
could basically work, but I imagine this wouldn't be very fast for selects. I know about primary keys, but they prevent duplicates, so I can't just make sensor_id a primary key. I'm basically looking for the sqlite equivalent of saving data like this, but in a single table and with one measurement per row:
data = {"sensor1":[x1,x2,x3], "sensor2":[z1,z2,z3]...}
I imagine something like the following would work for inserting more than a single value for each sensor, but would that help at all with selecting?
"CREATE TABLE IF NOT EXISTS DataTable (sensor_id TEXT NOT NULL, measured_value REAL, time_of_measuring REAL NOT NULL, PRIMARY KEY(sensor_id, time_of_measuring ))"
For this time-series data, the relevant primary (or unique) key is probably (time_of_measuring, sensor_id). This is close to what you suggested at the end of your question, but the columns are in reverse order.
Technically, this prevents a sensor from logging two measures at the same point in time, which seems like a relevant business rule for your data.
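As a sketch, the table definition with that key order would be:
CREATE TABLE IF NOT EXISTS data_table (
    sensor_id TEXT NOT NULL,
    measured_val REAL,
    time_of_measuring REAL NOT NULL,
    PRIMARY KEY (time_of_measuring, sensor_id)
);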
When it comes to query speed: it depends highly on the queries themselves. Say you have a query like:
select sensor_id, measured_val, time_of_measuring
from data_table
where
sensor_id = ?
and time_of_measuring >= ?
and time_of_measuring < ?
order by sensor_id, time_of_measuring
This query would take advantage of the primary key index, since the columns are the same as those of the where and order by clauses. You could add measured_val to the index to make the query even more efficient (making it a covering index):
create index data_table_idx1
on data_table(sensor_id, time_of_measuring, measured_val);
As another example, consider this where clause:
where time_of_measuring >= ? and time_of_measuring < ?
No predicate on sensor_id here, but time_of_measuring is the first column of the primary key, so the primary key index can still be used.
As typical counter-examples, the following where clauses would not benefit from the primary key index:
where sensor_id = ?                         -- needs an index where sensor_id is first (data_table_idx1 above qualifies)
where sensor_id = ? and measured_val >= ?   -- needs an index on (sensor_id, measured_val)
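If that second access path matters, a sketch of the index its comment describes:
create index data_table_idx2
on data_table(sensor_id, measured_val);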

Inherits from one basetable? Good idea?

I would like to create a base table as follows:
CREATE TABLE IF NOT EXISTS basetable
(
    id        BIGSERIAL NOT NULL PRIMARY KEY,
    createdon TIMESTAMP NOT NULL DEFAULT(NOW()),
    updatedon TIMESTAMP NULL
);
Then all other tables will inherit from this table, so it contains the ids of all records. Will there be performance problems with more than 20 billion records (distributed across ~10 tables)?
Having one table from which "all other tables will inherit" sounds like a strange idea but you might have a use case that is unclear to us.
Regarding your question specifically: having 20B rows is going to work, but as Gordon mentioned, you will have performance challenges. If you query a row by ID it will be perfectly fine, but if you search rows by timestamp ranges, even with indexes, it will be slower (how slow depends on how fast your server is).
For large tables, a good solution is table partitioning (see https://wiki.postgresql.org/wiki/Table_partitioning). Based on what you query the most in your WHERE clause (id, createdon or updatedon), you can partition on that column, and PostgreSQL will be able to read only the partitions it needs instead of the entire table.
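As a minimal sketch, using declarative range partitioning (PostgreSQL 10+; a primary key on a partitioned table needs PostgreSQL 11+ and must include the partition column) and assuming createdon is the column you filter on most:
CREATE TABLE basetable
(
    id        BIGSERIAL NOT NULL,
    createdon TIMESTAMP NOT NULL DEFAULT(NOW()),
    updatedon TIMESTAMP NULL,
    PRIMARY KEY (id, createdon)   -- must include the partition column
) PARTITION BY RANGE (createdon);

-- one partition per year, for example
CREATE TABLE basetable_2023 PARTITION OF basetable
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
Queries constrained on createdon will then scan only the matching partitions instead of the whole table.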

Cassandra modeling with key and indexes

I have a table of "Users"; each user has many "Projects" and each project has many "Clients", so it's many-to-many, and I keep track of client events in a separate table.
The problem is that I can't figure out how to choose the key and the index so that the queries perform best.
The table with Key:
CREATE TABLE project_clients_events (
    id timeuuid,
    user_id int,
    project_id int,
    client_id text,
    event text,
    PRIMARY KEY ((user_id, project_id), id, client_id)
);
Now there will be more than 100K events per (user_id, project_id), so I need to be able to paginate through the results:
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
How can I group the results and paginate?
Thanks!
Let me answer your question in two parts: first the pagination, then the partition key.
The Cassandra CQL drivers support automatic paging now, so you need not worry about designing a complex where clause.
Statement stmt = new SimpleStatement("SELECT * FROM images");
stmt.setFetchSize(100);
ResultSet rs = session.execute(stmt);
// Iterate over the ResultSet here
This link will be helpful:
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
Deciding the partition key depends on the queries you have. For example, if most of your queries filter only on user_id and project_id, then it's better to have them as part of the partition key, as all matching results will be placed in the same partition (on the same node) and fetched together.
Hence I would advise you to first decide on the queries and then select your partition keys accordingly, as your performance will depend on what the queries are vs. how the data is laid out in Cassandra.
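For example, with the partition key above, the automatic paging shown earlier applies transparently to a query like this (a sketch; bind the two values from your application):
SELECT * FROM project_clients_events
WHERE user_id = ? AND project_id = ?;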
This could help you http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (slides 45-70)

Which SQL Update is faster / more efficient

I need to update a table every time a certain action is taken.
MemberTable
Name varchar 60
Phone varchar 20
Title varchar 20
Credits int <-- the one that needs constant updates
etc., with all the relevant member columns, 10-15 total
Should I update this table with:
UPDATE Members
SET Credits = Credits - 1
WHERE Id = 1
or should I create another table called Account, holding just the credit data, like:
Account table
Id int
MemberId int <-- foreign key to members table
Credits int
and update it with:
UPDATE Accounts
SET Credits = Credits - 1
WHERE MemberId = 1
Which one would be faster and more efficient?
I have read that SQL Server must read the whole row in order to update it. I'm not sure if that's true. Any help would be greatly appreciated.
I know that this doesn't directly answer the question but I'm going to throw this out there as an alternative solution.
Are you bothered about historic transactions? Not everyone will be, but in case you or other future readers are, here's how I would approach the problem:
CREATE TABLE credit_transactions (
    member_id int NOT NULL
  , transaction_date datetime NOT NULL
        CONSTRAINT df_credit_transactions_date DEFAULT Current_Timestamp
  , credit_amount int NOT NULL
  , CONSTRAINT pk_credit_transactions PRIMARY KEY (member_id, transaction_date)
  , CONSTRAINT fk_credit_transactions_member_id FOREIGN KEY (member_id)
        REFERENCES member (id)
  , CONSTRAINT ck_credit_transaction_amount_not_zero CHECK (credit_amount <> 0)
);
In terms of write performance...
INSERT INTO credit_transactions (member_id, credit_amount)
VALUES (937, -1)
;
Pretty simple, eh! No row locks required.
The downside to this method is that to work out a member's "balance", you have to perform a bit of a calculation.
CREATE VIEW member_credit
AS
SELECT member_id
     , Sum(credit_amount) As credit_balance
     , Max(transaction_date) As latest_transaction
  FROM credit_transactions
 GROUP
    BY member_id
;
However using a view makes things nice and simple and can be optimized appropriately.
Heck, you might want to throw in a NOLOCK (read up about this before making your decision) on that view to reduce locking impact.
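Reading a balance then stays a one-liner, e.g. for the member inserted above:
SELECT credit_balance, latest_transaction
FROM member_credit
WHERE member_id = 937;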
TL;DR:
Pros: quick write speed, transaction history available
Cons: slower read speed
Actually the latter way would be faster.
If your transaction volume is very high, to the point where every millisecond counts, it's better to do it this way. And if some members will never have credits, you might save some space as well.
However, if that's not the case, it's good to keep your table structure normalized. If every member will always have a credit balance, it's better to include it as a column in the Member table.
Try not to add an unnecessary intermediate table, which will consume more space (with all those foreign keys and additional IDs) and make your schema a little more complex.
In the end, it depends on your requirements.
As the ID is the primary key, all the DBMS has to do is look up the key in the index, get the record, and update it. There should not be much of a performance problem.
Using an account table leads to exactly the same access method. But you are right: as there is less data per record, you might more often find the record already in the memory cache and thus save a physical read. However, I wouldn't expect that to happen too often. And you probably work more with your member table than with the account table, which makes it more likely that a member record is already in cache; then it's just the other way around, and your account table access is slower.
Cache access vs. physical reads is the only difference, because with the primary key you will walk the same way through the ID index and then access one particular record directly.
I don't recommend using the account table. It somewhat blurs the data structure with a 1:1 relation between the two tables that may not be immediately recognized by other users. And it is not likely you will gain much from it. (As mentioned, you might even lose performance.)

SQL get last rows in table WITHOUT primary ID

I have a table with 800,000 entries without a primary key. I am not allowed to add a primary key, and I can't use TOP 1 ... ORDER BY ... DESC because it takes hours to complete. So I tried this workaround:
DECLARE @ROWCOUNT int, @OFFSET int
SELECT @ROWCOUNT = (SELECT COUNT(field) FROM TABLE)
SET @OFFSET = @ROWCOUNT - 1
select TOP 1 FROM TABLE WHERE=?????NO PRIMARY KEY??? BETWEEN @OFFSET AND @ROWCOUNT
Of course this doesn't work.
Is there any way to use this code, or better code, to retrieve the last row in the table?
If your table has no primary key, or your primary key is not ordered, you can try the code below. If you want to see more than one last record, change the number in the code.
Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(1)) * From table
I assume that when you say 'last rows', you mean 'last created rows'.
Even if you had a primary key, it would still not be the best option for determining row creation order:
there is no guarantee that the row with the bigger primary key value was created after the row with a smaller primary key value.
Even if the primary key is on an identity column, you can still always override identity values on insert by using SET IDENTITY_INSERT ... ON.
It is a better idea to have a timestamp column, for example CreatedDateTime, with a default constraint.
You would have an index on this field. Then your query would be simple, efficient and correct:
select top 1 *
from MyTable
order by CreatedDateTime desc
If you don't have a timestamp column, you can't reliably determine the 'last rows'.
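For completeness, a sketch of the column and index this answer assumes (names are hypothetical, and this only works if you are allowed to alter the table at all):
ALTER TABLE MyTable
ADD CreatedDateTime datetime2 NOT NULL
CONSTRAINT df_MyTable_CreatedDateTime DEFAULT SYSUTCDATETIME();

CREATE INDEX ix_MyTable_CreatedDateTime ON MyTable (CreatedDateTime);
Note that existing rows are backfilled with the default, so they will all share one value; only rows inserted afterwards get a meaningful creation time.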
If you need to select 1 column from a table of 800,000 rows where that column is the min or max possible value, and that column is not indexed, then the unassailable fact is that SQL will have to read every row in the table in order to identify that min or max value.
(An aside: on the face of it, reading all the rows of an 800,000-row table shouldn't take all that long. How wide is the column? How often is the query run? Are there concurrency, locking, blocking, or deadlocking issues? These may be pain points that could be addressed. End of aside.)
There are any number of workarounds (indexes, views, indexed views, periodically indexed copies of the table, running the query once and storing the result for a period of time before refreshing, etc.), but virtually all of them require making permanent modifications to the database. It sounds like you are not permitted to do this, and I don't think there's much you can do here without some such permanent change to the database; call it an improvement when you discuss it with your project manager.
You need to add an index; can you?
Even if you don't have a primary key, an index will speed up the query considerably.
You say you don't have a primary key, but from your question I assume you have some type of timestamp or something similar on the table. If you create an index on that column, you will be able to execute a query like:
SELECT *
FROM table_name
WHERE timestamp_column_name=(
SELECT max(timestamp_column_name)
FROM table_name
)
If you're not allowed to edit this table, have you considered creating a view, or replicating the data in the table and moving it into one that has a primary key?
Sounds hacky, but then, your 800k row table doesn't have a primary key, so hacky seems to be the order of the day. :)
I believe you could write it simply as
SELECT * FROM table ORDER BY rowid DESC LIMIT 1;
(note that rowid and LIMIT are SQLite features, so this only applies if you are on SQLite).
Hope it helps.