Most efficient way to query Cassandra in small time-based chunks - optimization

My Cassandra-based application needs to read the rows changed since last read.
For this purpose, we are planning to have a table changed_rows that will contain two columns:
ID - the ID of the changed row, and
Updated_Time - the timestamp when it was changed.
What is the best way to read such a table so that it reads small groups of rows ordered by time?
Example: if the table is:
ID Updated_Time
foo 1000
bar 1200
abc 2000
pqr 2500
zyx 2900
...
xyz 901000
...
I have shown IDs as simple 3-letter keys; in reality they are UUIDs.
Also, the time above is shown as an integer for the sake of simplicity, but it's an actual Cassandra timestamp (or Java Date). The Updated_Time column is monotonically increasing.
If I query this data with:
SELECT * FROM changed_rows WHERE Updated_Time < toTimestamp(now())
I get the following error:
Cannot execute this query as it might involve data filtering and
thus may have unpredictable performance... Use Allow Filtering
But I think ALLOW FILTERING in this case would kill the performance.
The Cassandra index documentation warns against indexes on high-cardinality columns, and Updated_Time above certainly looks like a high-cardinality column.
I do not know the ID column beforehand, because the purpose of the query is to find the IDs updated in a given time interval.
What is the best way to query Cassandra in this case then?
Can I change my table somehow to run the time-chunk query more efficiently?
Note: This may sound similar to the Cassandra CDC feature, but we cannot use it because our solution has to work across all Cassandra versions.

Assuming you know the time intervals you want to query, you need to create another table like the following:
CREATE TABLE modified_records (
timeslot timestamp,
updatedtime timestamp,
recordid timeuuid,
PRIMARY KEY (timeslot, updatedtime)
);
Now you can split your "updated record log" into time slices, eg 1 hour, and fill the table like this:
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 09:00:00', '2017-02-27 09:36:00', 50554d6e-fcd1-11e6-bc64-92361f002671);
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 09:00:00', '2017-02-27 09:56:00', 6ab4f2c0-fcd1-11e6-bc64-92361f002671);
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 10:00:00', '2017-02-27 10:00:13', 7c21a8e0-fcd4-11e6-bc64-92361f002671);
where you use part of your updatedtime timestamp as the partition key, eg in this case you round down to the whole hour. You then query by specifying the time slot only, eg:
SELECT * FROM modified_records WHERE timeslot = '2017-02-27 09:00:00';
SELECT * FROM modified_records WHERE timeslot = '2017-02-27 10:00:00';
Depending on how often your records get updated, you can go with smaller or bigger time slices, eg every 6 hours, or 1 day, or every 15 minutes. This structure is very flexible. You only need to know the timeslot you want to query. If you need to span multiple timeslots you'll need to perform multiple queries.
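If you only need part of a slot, the updatedtime clustering column also supports range predicates within a single partition. A minimal sketch, reusing the table above (the bound values are illustrative):
SELECT recordid, updatedtime
FROM modified_records
WHERE timeslot = '2017-02-27 09:00:00'
  AND updatedtime > '2017-02-27 09:30:00';
This reads only the rows changed after 09:30 inside the 09:00 slot; because the partition key is fully specified and the range is on the first clustering column, it does not require ALLOW FILTERING.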

Related

SQL query slow performance with LIKE 'CNTR%04052021%'

We have a database that is growing every day, with roughly 40M records as of today.
This table/database is located in Azure.
The table has a primary key 'ClassifierID', and the query is running on this primary key.
The primary key is in the format ID + timestamp (mmddyyyy HHMMSS), for example 'CNTR00220200 04052021 073000'.
Here is the query to get all the IDs by date
SELECT DISTINCT ScanID
FROM ClassifierResults
WHERE ClassifierID LIKE 'CNTR%04052021%'
Very simple and straightforward, but it sometimes takes over a minute to complete. Do you have any suggestions on how we can optimize the query? Thanks much.
The best thing here would be to fix your design so that a) you are not storing the ID and timestamp in the same text field, and b) you are storing the timestamp in a proper date/timestamp column. Using your single point of data, I would suggest the following table design:
ID | timestamp
CNTR00220200 | timestamp '2021-04-05 07:30:00'
Then, create an index on (ID, timestamp), and use this query:
SELECT *
FROM yourTable
WHERE ID LIKE 'CNTR%' AND
timestamp >= '2021-04-05' AND timestamp < '2021-04-06';
The above query searches for records having an ID starting with CNTR and falling exactly on the date 2021-04-05. Your SQL database should be able to use the composite index I suggested above on this query.
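A minimal sketch of that composite index (yourTable and the column names are the placeholders used in the answer; exact index syntax varies slightly by database):
CREATE INDEX idx_yourtable_id_ts ON yourTable (ID, timestamp);
With the timestamp stored in a real date/time column, the range predicate above can be satisfied from this index instead of scanning every row to evaluate the LIKE pattern.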

Is Hive partitioning hierarchical in nature?

Say we have a table partitioned as:-
CREATE EXTERNAL TABLE MyTable (
col1 string,
col2 string,
col3 string
)
PARTITIONED BY(year INT, month INT, day INT, hour INT, combination_id BIGINT);
Now obviously year is going to store the year value (e.g. 2016), month will store the month value (e.g. 7), day will store the day (e.g. 18), and hour will store the hour in 24-hour format (e.g. 13). And combination_id is going to be a combination of the zero-padded values of all of these (if a value is a single digit, pad it with 0 on the left). So in this case, for example, the combination_id is 2016071813.
So we fire query (lets call it Query A):-
select * from mytable where combination_id = 2016071813
Now Hive doesn't know that combination_id is actually combination of year,month,day and hour. So will this query not take proper advantage of partitioning?
In other words, if I have another query, call it Query B, will this be more optimal than Query A, or is there no difference?:-
select * from mytable where year=2016 and month=7 and day=18 and hour=13
If the Hive partitioning scheme is really hierarchical in nature, then Query B should be better from a performance point of view, is what I am thinking. Actually I want to decide whether to get rid of combination_id altogether from the partitioning scheme if it is not contributing to better performance at all.
The only real advantage of using combination_id is being able to use the BETWEEN operator in a select:-
select * from mytable where combination_id between 2016071813 and 2016071823
But if this is not going to take advantage of the partitioning scheme, it is going to hamper performance.
Yes. Hive partitioning is hierarchical.
You can simply check this by printing the partitions of the table using below query.
show partitions MyTable;
Output:
year=2016/month=5/day=5/hour=5/combination_id=2016050505
year=2016/month=5/day=5/hour=6/combination_id=2016050506
year=2016/month=5/day=5/hour=7/combination_id=2016050507
In your scenario, you don't need to specify combination_id as a partition column if you are not using it for querying.
You can partition either by
Year, month, day, hour columns
or
combination_id only
Partitioning by multiple columns helps performance for grouping operations.
Say you want to find the maximum of col1 for the month of March in the years 2016 and 2015:
Hive can fetch the records by going straight to the specific year partitions (year=2016, year=2015) and the month partition (month=3).
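For illustration, a sketch of the simplified scheme without combination_id and a query that prunes on the hierarchical partitions (the table and column names are taken from the question; dropping combination_id is the suggestion above, not something the question already did):
CREATE EXTERNAL TABLE MyTable (
  col1 string,
  col2 string,
  col3 string
)
PARTITIONED BY (year INT, month INT, day INT, hour INT);

-- Each partition column in the WHERE clause narrows the directories Hive reads:
SELECT *
FROM MyTable
WHERE year = 2016 AND month = 7 AND day = 18 AND hour = 13;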

VACUUM on Redshift (AWS) after DELETE and INSERT

I have a table as below (simplified example, we have over 60 fields):
CREATE TABLE "fact_table" (
"pk_a" bigint NOT NULL ENCODE lzo,
"pk_b" bigint NOT NULL ENCODE delta,
"d_1" bigint NOT NULL ENCODE runlength,
"d_2" bigint NOT NULL ENCODE lzo,
"d_3" character varying(255) NOT NULL ENCODE lzo,
"f_1" bigint NOT NULL ENCODE bytedict,
"f_2" bigint NULL ENCODE delta32k
)
DISTSTYLE KEY
DISTKEY ( d_1 )
SORTKEY ( pk_a, pk_b );
The table is distributed by a high-cardinality dimension.
The table is sorted by a pair of fields that increment in time order.
The table contains over 2 billion rows, and uses ~350GB of disk space, both "per node".
Our hourly house-keeping involves updating some recent records (within the last 0.1% of the table, based on the sort order) and inserting another 100k rows.
Whatever mechanism we choose, VACUUMing the table becomes overly burdensome:
- The sort step takes seconds
- The merge step takes over 6 hours
We can see from SELECT * FROM svv_vacuum_progress; that all 2 billion rows are being merged, even though the first 99.9% are completely unaffected.
Our understanding was that the merge should only affect:
1. Deleted records
2. Inserted records
3. And all the records from (1) or (2) up to the end of the table
We have tried DELETE and INSERT rather than UPDATE, and that DML step is now significantly quicker. But the VACUUM still merges all 2 billion rows.
DELETE FROM fact_table WHERE pk_a > X;
-- 42 seconds
INSERT INTO fact_table SELECT <blah> FROM <query> WHERE pk_a > X ORDER BY pk_a, pk_b;
-- 90 seconds
VACUUM fact_table;
-- 23645 seconds
In fact, the VACUUM merges all 2 billion records even if we just trim the last 746 rows off the end of the table.
The Question
Does anyone have any advice on how to avoid this immense VACUUM overhead, and only MERGE on the last 0.1% of the table?
How often are you VACUUMing the table? How does the long duration affect you? Our load processing continues to run during VACUUM and we've never experienced any performance problems with doing that. Basically it doesn't matter how long it takes because we just keep running BAU.
I've also found that we don't need to VACUUM our big tables very often. Once a week is more than enough. Your use case may be very performance sensitive but we find the query times to be within normal variations until the table is more than, say, 90% unsorted.
If you find that there's a meaningful performance difference, have you considered using recent and history tables (inside a UNION view if needed)? That way you can VACUUM the small "recent" table quickly.
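A minimal sketch of that split, with hypothetical table names (fact_table_recent holds only the hot tail of the data, fact_table_history the rest, both with the same sort and distribution keys):
CREATE VIEW fact_table_all AS
SELECT * FROM fact_table_recent
UNION ALL
SELECT * FROM fact_table_history;
The hourly house-keeping then only touches fact_table_recent, which stays small enough for VACUUM to finish quickly; rows are moved into fact_table_history in a periodic batch.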
Couldn't fit it in the comments section, so posting it as an answer.
If the sort keys are the same across the time-series tables and you already have a UNION ALL view over them but performance is still bad, then you may want a time-series view structure with explicit filters, like this:
create or replace view schemaname.table_name as
select * from table_20140901 where sort_key_date = '2014-09-01' union all
select * from table_20140902 where sort_key_date = '2014-09-02' union all .......
select * from table_20140925 where sort_key_date = '2014-09-25';
Also make sure to collect stats on the sort keys of all these tables after every load, and try running queries against it; the optimizer should be able to push any filter values down into the view. At the end of the day, after the load, just run a VACUUM SORT ONLY or a full VACUUM on the current day's table, which should be much faster.
Let me know if you are still facing any issues after the above test.
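For example, the end-of-day house-keeping on the current day's table might look like this (the table name follows the naming pattern used in the view above):
ANALYZE table_20140925;
VACUUM SORT ONLY table_20140925;
Because only that one day's table is touched, the sort/merge covers a small slice of the data rather than all 2 billion rows.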

Mysql improve SELECT speed

I'm currently trying to improve the speed of SELECTs for a MySQL table and would appreciate any suggestions on ways to improve it.
We have over 300 million records in the table, with the structure tag, date, value. The primary key is a composite key of tag and date. The table contains information for about 600 unique tags, most containing an average of about 400,000 rows, but individual tags can range from 2,000 to over 11 million rows.
The queries run against the table are:
SELECT date,
value
FROM table
WHERE tag = "a"
AND date BETWEEN 'x' and 'y'
ORDER BY date
....and there are very few if any INSERTS.
I have tried partitioning the data by tag into various number of partitions but this seems to have little increase in speed.
take time to read my answer here (it has similar volumes to yours):
500 million rows, 15 million row range scan in 0.02 seconds.
MySQL and NoSQL: Help me to choose the right one
then amend your table engine to InnoDB as follows:
create table tag_date_value
(
tag_id smallint unsigned not null, -- i prefer ints to chars
tag_date datetime not null, -- can we make this date vs datetime ?
value int unsigned not null default 0, -- or whatever datatype you require
primary key (tag_id, tag_date) -- clustered composite PK
)
engine=innodb;
you might consider the following as the primary key instead:
primary key (tag_id, tag_date, value) -- added value save some I/O
but only if value isn't some LARGE varchar type!
query as before:
select
tag_date,
value
from
tag_date_value
where
tag_id = 1 and
tag_date between 'x' and 'y'
order by
tag_date;
hope this helps :)
EDIT
oh, forgot to mention - don't use ALTER TABLE to change the engine type from MyISAM to InnoDB, but rather dump the data out into csv files and re-import into a newly created and empty InnoDB table.
note I'm ordering the data during the export process - clustered indexes are the KEY!
Export
select * into outfile 'tag_dat_value_001.dat'
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
tag_date_value
where
tag_id between 1 and 50
order by
tag_id, tag_date;
select * into outfile 'tag_dat_value_002.dat'
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
tag_date_value
where
tag_id between 51 and 100
order by
tag_id, tag_date;
-- etc...
Import
import back into the table in correct order !
start transaction;
load data infile 'tag_dat_value_001.dat'
into table tag_date_value
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
(
tag_id,
tag_date,
value
);
commit;
-- etc...
What is the cardinality of the date field (that is, how many different values appear in that field)? If the date BETWEEN 'x' AND 'y' is more limiting than the tag = 'a' part of the WHERE clause, try making your primary key (date, tag) instead of (tag, date), allowing date to be used as an indexed value.
Also, be careful how you specify 'x' and 'y' in your WHERE clause. There are some circumstances in which MySQL will cast each date field to match the non-date implied type of the values you compare to.
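For instance, a minimal sketch of the safer form (table is the question's placeholder name, backticked because it is a reserved word in MySQL; the literal dates are made up):
SELECT date, value
FROM `table`
WHERE tag = 'a'
  AND date BETWEEN '2010-01-01' AND '2010-03-31'
ORDER BY date;
Using full, unambiguous date literals keeps the comparison on the date column itself and avoids the implicit conversions mentioned above.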
I would do two things - first throw some indexes on there around tag and date as suggested above:
alter table table add index (tag, date);
Next break your query into a main query and sub-select in which you are narrowing your results down when you get into your main query:
SELECT date, value
FROM table
WHERE date BETWEEN 'x' and 'y'
AND tag IN ( SELECT tag FROM table WHERE tag = 'a' )
ORDER BY date
Your query is asking for a few things - and with that high # of rows, the look of the data can change what the best approach is.
SELECT date, value
FROM table
WHERE tag = "a"
AND date BETWEEN 'x' and 'y'
ORDER BY date
There are a few things that can slow down this select query.
A very large result set that has to be sorted (order by).
A very large result set. If tag and date are in the index (and let's assume that's as good as it gets) every result row will have to leave the index to lookup the value field. Think of this like needing the first sentence of each chapter of a book. If you only needed to know the chapter names, easy - you can get it from the table of contents, but since you need the first sentence you have to go to the actual chapter. In certain cases, the optimizer may choose just to flip through the entire book (table scan in query plan lingo) to get those first sentences.
Filtering by the wrong where clause first. If the index is in the order tag, date... then tag should (for a majority of your queries) be the more stringent of the two columns. So basically, unless you have more tags than dates (or maybe than dates in a typical date range), then dates should be the first of the two columns in your index.
A couple of recommendations:
Consider if it's possible to truncate some of that data if it's too old to care about most of the time.
Try playing with your current index - i.e. change the order of the items in it.
Do away with your current index and replace it with a covering index (has all 3 fields in it)
Run some EXPLAINs and make sure it's using your index at all.
Switch to some other data store (mongo db?) or otherwise ensure this monster table is kept as much in memory as possible.
I'd say your only chance to further improve it is a covering index with all three columns (tag, date, value). That avoids the table access.
I don't think that partitioning can help with that.
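A minimal sketch of such a covering index (again, table is the question's placeholder name, backticked because it is a reserved word in MySQL):
ALTER TABLE `table` ADD INDEX idx_tag_date_value (tag, date, value);
With all three columns in the index, the query can be answered from the index alone, so no lookups into the row data are needed.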
I would guess that adding an index on (tag, date) would help:
alter table table add index (tag, date);
Please post the result of an explain on this query (EXPLAIN SELECT date, value FROM ......)
I think that the value column is at the bottom of your performance issues. It is not part of the index, so there will be table access. Further, I think the ORDER BY is unlikely to impact performance so severely, since it is part of your index and should already be ordered.
I will back up my suspicion about the value column with the fact that partitioning does not really reduce the execution time of the query. Could you execute the query without value and share the results, as well as the EXPLAIN? Do you really need it for each row, and what kind of column is it?
Cheers!
Try inserting just the needed dates into a temporary table and then finishing with a select on the temporary table for the tag and ordering.
CREATE TEMPORARY TABLE foo
SELECT tag, date, value
FROM table
WHERE date BETWEEN 'x' AND 'y';
ALTER TABLE foo ADD INDEX idx_tag (tag);
SELECT date, value
FROM foo
WHERE tag = "a"
ORDER BY date;
if that doesn't work try creating foo off the tag selection instead.
CREATE TEMPORARY TABLE foo
SELECT date, value
FROM table
WHERE tag = "a";
ALTER TABLE foo ADD INDEX idx_date (date);
SELECT date, value
FROM foo
WHERE date BETWEEN 'x' and 'y'
ORDER BY date;

Database Duplicate Value Issue ( Filtering Based on Previous Value)

Earlier this week I asked a question about filtering out duplicate values in sequence at run time. I got some good answers, but the amount of data I was going over made it too slow and not feasible.
Currently, in our database, event values are not filtered, resulting in duplicate data values (with varying timestamps). We need to process that data at run time, and doing it at the database level is too time-costly (and we cannot pull it into code because it's used a lot in stored procs), resulting in high query times. We need a data structure we can query that already has this data filtered, so that no additional filtering is needed at runtime.
Currently in our DB
'F07331E4-26EC-41B6-BEC5-002AACA58337', '1', '2008-05-08 04:03:47.000'
'F07331E4-26EC-41B6-BEC5-002AACA58337', '0', '2008-05-08 10:02:08.000'
'F07331E4-26EC-41B6-BEC5-002AACA58337', '0', '2008-05-09 10:03:24.000' (need to delete this)
'F07331E4-26EC-41B6-BEC5-002AACA58337', '1', '2008-05-10 04:05:05.000'
What we need
'F07331E4-26EC-41B6-BEC5-002AACA58337', '1', '2008-05-08 04:03:47.000'
'F07331E4-26EC-41B6-BEC5-002AACA58337', '0', '2008-05-08 10:02:08.000'
'F07331E4-26EC-41B6-BEC5-002AACA58337', '1', '2008-05-10 04:51:05.000'
This seems trivial, but our issue is that we get this data from wireless devices, resulting in out-of-sequence packets, and our gateway is multithreaded, so we cannot guarantee the values we get are in order. Something may come in like a '1' for 4 seconds ago and a '0' for 2 seconds ago, but we have already processed the '1' because it arrived first.
We have been spinning our heads on how to implement this. We cannot simply compare incoming data to the latest value in the database, because the latest may actually not have arrived yet; if we threw data out on that basis, our sequence could be completely off. So currently we store every value that comes in, and the database orders itself based on time. But the units can send 1,1,1,0 and that is valid because the event is still active; we only want to store the on and off transitions (the first occurrence of each state: 1,0,1,0,1,0). We thought about a trigger, but we'd have to shuffle the data around every time a new value came in, because it might be earlier than the last message and could change the entire sequence (inserts would be slow).
Any Ideas?
Ask if you need any further information.
[EDIT] A PK won't work - the issue is that our units actually send in different timestamps, so the PK wouldn't help because 1,1,1 are the same value but have different timestamps. It's like: the event went on at time1, the event is still on at time2, and it sends us back both - same value, different time.
If I understand correctly, what you want to do is simply prevent the dupes from even getting into the database. If that is the case, why not have a PK (or unique index) defined on the first two columns and have the database do the heavy lifting for you? Dupe inserts would fail based on the PK or AK you've defined. Your code (or stored proc) would then just have to gracefully handle that exception.
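A minimal sketch of that constraint, with hypothetical table and column names (the question does not give the real schema):
ALTER TABLE DeviceEvents
ADD CONSTRAINT UQ_DeviceEvents_Device_State UNIQUE (DeviceID, OnOff);
Note this allows at most one row per device per state, which is what blocks the duplicates; as the [EDIT] above points out, it also means repeated on/off cycles cannot be stored, so it only fits if you truly need just the latest on and off rows.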
Here's an update solution. Performance will vary depending on indexes.
CREATE TABLE #MyTable
(
DeviceName varchar(100),
EventTime DateTime,
OnOff int,
GoodForRead int
)
INSERT INTO #MyTable(DeviceName, OnOff, EventTime)
SELECT 'F07331E4-26EC-41B6-BEC5-002AACA58337', 1, '2008-05-08 04:03:47.000'
INSERT INTO #MyTable(DeviceName, OnOff, EventTime)
SELECT 'F07331E4-26EC-41B6-BEC5-002AACA58337', 0, '2008-05-08 10:02:08.000'
INSERT INTO #MyTable(DeviceName, OnOff, EventTime)
SELECT 'F07331E4-26EC-41B6-BEC5-002AACA58337', 0, '2008-05-09 10:03:24.000'
INSERT INTO #MyTable(DeviceName, OnOff, EventTime)
SELECT 'F07331E4-26EC-41B6-BEC5-002AACA58337', 1, '2008-05-10 04:05:05.000'
UPDATE mt
SET GoodForRead =
CASE
(SELECT top 1 OnOff
FROM #MyTable mt2
WHERE mt2.DeviceName = mt.DeviceName
and mt2.EventTime < mt.EventTime
ORDER BY mt2.EventTime desc
)
WHEN mt.OnOff THEN 0 -- same state as the previous reading for this device: a duplicate
ELSE 1               -- state changed, or there is no earlier reading (the subquery returns NULL, which never matches), so keep the row
END
FROM #MyTable mt
-- Limit the update to recent data
--WHERE EventTime >= DateAdd(dd, -1, GetDate())
SELECT *
FROM #MyTable
It isn't hard to imagine a filtering solution based on this. It just depends on how often you want to look up the previous record for each record (every query or once in a while).
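For example, a read-time filter built on the same previous-row lookup might look like this (a sketch only, reusing the illustration table above and assuming SQL Server 2005 or later for OUTER APPLY):
SELECT mt.DeviceName, mt.OnOff, mt.EventTime
FROM #MyTable mt
OUTER APPLY (SELECT TOP 1 OnOff
             FROM #MyTable mt2
             WHERE mt2.DeviceName = mt.DeviceName
               AND mt2.EventTime < mt.EventTime
             ORDER BY mt2.EventTime DESC) prev
WHERE prev.OnOff IS NULL          -- first reading for this device
   OR prev.OnOff <> mt.OnOff      -- state actually changed
ORDER BY mt.DeviceName, mt.EventTime;
Running this on every query repeats the previous-row lookup each time, which is why precomputing GoodForRead (as in the UPDATE above) is attractive when reads are frequent.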