we have a database that is growing every day. roughly 40M records as of today.
This table/database is located in Azure.
The table has a primary key 'ClassifierID', and the query is running on this primary key.
The primary key is in the format of ID + timestamp (mmddyyy HHMMSS), for example 'CNTR00220200 04052021 073000'
Here is the query to get all the IDs by date
**Select distinct ScanID
From ClassifierResults
Where ClassifierID LIKE 'CNTR%04052020%**
Very simple and straightforward, but it sometimes takes over a min to complete. Do you have any suggestion how we can optimize the query? Thanks much.
The best thing here would be to fix your design so that a) you are not storing the ID and timestamp in the same text field, and b) you are storing the timestamp in a proper date/timestamp column. Using your single point of data, I would suggest the following table design:
ID | timestamp
CNTR00220200 | timestamp '2021-04-05 07:30:00'
Then, create an index on (ID, timestamp), and use this query:
SELECT *
FROM yourTable
WHERE ID LIKE 'CNTR%' AND
timestamp >= '2021-04-05' AND timestamp < '2021-04-06';
The above query searches for records having an ID starting with CNTR and falling exactly on the date 2021-04-05. Your SQL database should be able to use the composite index I suggested above on this query.
Related
Currently I using following query:
SELECT
ID,
Key
FROM
mydataset.mytable
where ID = 100077113
and Key='06019'
My data has 100 million rows:
ID - unique
Key - can have ~10,000 keys
If I know the key looking for ID can be done on ~10,000 rows and work much faster and process much less data.
How can I use the new clustering capabilites in BigQuery to partition on the field Key?
(I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)
I have a table with 12M rows and 76 GB of data. This table has no timestamp column.
This is how to cluster said table - while creating a fake date column for fake partitioning:
CREATE TABLE `fh-bigquery.public_dump.github_java_clustered`
(id STRING, size INT64, content STRING, binary BOOL
, copies INT64, sample_repo_name STRING, sample_path STRING
, fake_date DATE)
PARTITION BY fake_date
CLUSTER BY id AS (
SELECT *, DATE('1980-01-01') fake_date
FROM `fh-bigquery.github_extracts.contents_java`
)
Did it work?
# original table
SELECT *
FROM `fh-bigquery.github_extracts.contents_java`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(3.3s elapsed, 72.1 GB processed)
# clustered table
SELECT *
FROM `fh-bigquery.public_dump.github_java_clustered2`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(2.4s elapsed, 232 MB processed)
What I learned here:
Clustering can work with unique ids, even for tables without a date to partition by.
Prefer using a fake date instead of a null date (but only for now - this should be improved).
Clustering made my query 99.6% cheaper when looking for rows by id!
Read more: https://medium.com/#hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b
you can have one filed of type DATE with NULL value, so you will be able partition by that field and since the table partitioned you will be able to enjoy clustering
You need to recreate your table with an additional date column with all rows having NULL values. And then you set partition to the date column. This way your table is partitioned.
After you've done with this, you will add clustering, based on the columns you identified in your query. Clustering will improve processing time and query costs will be reduced.
Now you can partition table on an integer column so this might be a good solution, remember there is a limit of 4,000 partitions for each table. So because you have ~10,000 keys I will suggest to create a sort of group_key that bundles ids together or maybe you have another column that you can leverage as integer with a cardinality < 4,000.
Recently BigQuery introduced support for clustering table even if they are not partitioned. So you can simply cluster on your integer field and don't use partitioning all together. Although, this solution will not be most effective for data scan optimisation.
My Cassandra-based application needs to read the rows changed since last read.
For this purpose, we are planning to have a table changed_rows that will contain two columns -
ID - The ID of the changed row and
Updated_Time - The timestamp when it was changed.
What is the best way to read such a table such that it reads small group of rows ordered by time.
Example: if the table is:
ID Updated_Time
foo 1000
bar 1200
abc 2000
pqr 2500
zyx 2900
...
xyz 901000
...
I have shown IDs to be simple 3-letter keys, in reality they are UUIDs.
Also, time shown above is shown as an integer for the sake of simplicity, but its an actual Cassandra timestamp (Or Java Date). The Updated_Time column is a monotonically increasing one.
If I query this data with:
SELECT * FROM changed_rows WHERE Updated_Time < toTimestamp(now())
I get the following error:
Cannot execute this query as it might involve data filtering and
thus may have unpredictable performance... Use Allow Filtering
But I think Allow Filtering in this case would kill the performance.
The Cassandra index page warns to avoid indexes for high cardinality columns and the Updated_Time above sure seems like high cardinality.
I do not know the ID column before-hand because the purpose of the query is to know the IDs updated between given time intervals.
What is the best way to query Cassandra in this case then?
Can I change my table somehow to run the time-chunk query more efficiently?
Note: This should sound somewhat similar to Cassandra-CDC feature but we cannot use the same because our solution should work for all the Cassandra versions
Assuming you know the time intervals you want to query, you need to create another table like the following:
CREATE TABLE modified_records (
timeslot timestamp,
updatedtime timestamp,
recordid timeuuid,
PRIMARY KEY (timeslot, updatedtime)
);
Now you can split your "updated record log" into time slices, eg 1 hour, and fill the table like this:
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 09:00:00', '2017-02-27 09:36:00', 123);
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 09:00:00', '2017-02-27 09:56:00', 456);
INSERT INTO modified_records (timeslot, updatedtime, recordid) VALUES ( '2017-02-27 10:00:00', '2017-02-27 10:00:13', 789);
where you use a part of your updatedtime timestamp as a partition key, eg in this case you round to the integral hour. You then query by specifying the time slot only, eg:
SELECT * FROM modified_records WHERE timeslot = '2017-02-27 09:00:00';
SELECT * FROM modified_records WHERE timeslot = '2017-02-27 10:00:00';
Depending on how often your records get updated, you can go with smaller or bigger time slices, eg every 6 hours, or 1 day, or every 15 minutes. This structure is very flexible. You only need to know the timeslot you want to query. If you need to span multiple timeslots you'll need to perform multiple queries.
I have a table records having three fields:
id - the row id
value - the row value
source - the source of the value
timestamp - the time when the row was inserted (should this be a unix timestamp or a datetime?)
And I want to perform a query like this:
SELECT timestamp, value FROM records WHERE timestamp >= a AND timestamp <= b
However in a table with millions of records this query is super inefficient!
I am using Azure SQL Server as the DBMS. Can this be optimised?
If so can you provide a step-by-step guide to do it (please don't skip "small" steps)? Be it creating indexes, redesigning the query statement, redesigning the table (partitioning?)...
Thanks!
After creating an index on the field you want to search, you can use a between operator so it is a single operation, which is most efficient for sql.
SELECT XXX FROM ABC WHERE DateField BETWEEN '1/1/2015' AND '12/31/2015'
Also, in SQL Server 2016 you can create range indexes for use on things like time-stamps using memory optimized tables. That's really the way to do it.
I would recommend using the datetime, or even better the datetime2 data type to store the date data (datetime2 being better as it has a higher level of precision, and with lower precision levels will use less storage).
As for your query, based upon the statement you posted you would want the timestamp to be the key column, and then include the value. This is because you are using the timestamp as your predicate, and returning the value along with it.
CREATE NONCLUSTERED INDEX IX_Records_Timestamp on Records (Timestamp) INCLUDE (Value)
This being said, be careful of your column names. I would highly recommend not using reserved keywords for columns names as they can be a lot more difficult to work with.
I'm currently trying to improve the speed of SELECTS for a MySQL table and would appreciate any suggestions on ways to improve it.
We have over 300 million records in the table and the table has the structure tag, date, value. The primary key is a combined key of tag and date. The table contains information for about 600 unique tags most containing an average of about 400,000 rows but can range from 2000 to over 11 million rows.
The queries run against the table are:
SELECT date,
value
FROM table
WHERE tag = "a"
AND date BETWEEN 'x' and 'y'
ORDER BY date
....and there are very few if any INSERTS.
I have tried partitioning the data by tag into various number of partitions but this seems to have little increase in speed.
take time to read my answer here: (has similar volumes to yours)
500 millions rows, 15 million row range scan in 0.02 seconds.
MySQL and NoSQL: Help me to choose the right one
then amend your table engine to innodb as follows:
create table tag_date_value
(
tag_id smallint unsigned not null, -- i prefer ints to chars
tag_date datetime not null, -- can we make this date vs datetime ?
value int unsigned not null default 0, -- or whatever datatype you require
primary key (tag_id, tag_date) -- clustered composite PK
)
engine=innodb;
you might consider the following as the primary key instead:
primary key (tag_id, tag_date, value) -- added value save some I/O
but only if value isnt some LARGE varchar type !
query as before:
select
tag_date,
value
from
tag_date_value
where
tag_id = 1 and
tag_date between 'x' and 'y'
order by
tag_date;
hope this helps :)
EDIT
oh forgot to mention - dont use alter table to change engine type from mysiam to innodb but rather dump the data out into csv files and re-import into a newly created and empty innodb table.
note i'm ordering the data during the export process - clustered indexes are the KEY !
Export
select * into outfile 'tag_dat_value_001.dat'
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
tag_date_value
where
tag_id between 1 and 50
order by
tag_id, tag_date;
select * into outfile 'tag_dat_value_002.dat'
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
tag_date_value
where
tag_id between 51 and 100
order by
tag_id, tag_date;
-- etc...
Import
import back into the table in correct order !
start transaction;
load data infile 'tag_dat_value_001.dat'
into table tag_date_value
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
(
tag_id,
tag_date,
value
);
commit;
-- etc...
What is the cardinality of the date field (that is, how many different values appear in that field)? If the date BETWEEN 'x' AND 'y' is more limiting than the tag = 'a' part of the WHERE clause, try making your primary key (date, tag) instead of (tag, date), allowing date to be used as an indexed value.
Also, be careful how you specify 'x' and 'y' in your WHERE clause. There are some circumstances in which MySQL will cast each date field to match the non-date implied type of the values you compare to.
I would do two things - first throw some indexes on there around tag and date as suggested above:
alter table table add index (tag, date);
Next break your query into a main query and sub-select in which you are narrowing your results down when you get into your main query:
SELECT date, value
FROM table
WHERE date BETWEEN 'x' and 'y'
AND tag IN ( SELECT tag FROM table WHERE tag = 'a' )
ORDER BY date
Your query is asking for a few things - and with that high # of rows, the look of the data can change what the best approach is.
SELECT date, value
FROM table
WHERE tag = "a"
AND date BETWEEN 'x' and 'y'
ORDER BY date
There are a few things that can slow down this select query.
A very large result set that has to be sorted (order by).
A very large result set. If tag and date are in the index (and let's assume that's as good as it gets) every result row will have to leave the index to lookup the value field. Think of this like needing the first sentence of each chapter of a book. If you only needed to know the chapter names, easy - you can get it from the table of contents, but since you need the first sentence you have to go to the actual chapter. In certain cases, the optimizer may choose just to flip through the entire book (table scan in query plan lingo) to get those first sentences.
Filtering by the wrong where clause first. If the index is in the order tag, date... then tag should (for a majority of your queries) be the more stringent of the two columns. So basically, unless you have more tags than dates (or maybe than dates in a typical date range), then dates should be the first of the two columns in your index.
A couple of recommendations:
Consider if it's possible to truncate some of that data if it's too old to care about most of the time.
Try playing with your current index - i.e. change the order of the items in it.
Do away with your current index and replace it with a covering index (has all 3 fields in it)
Run some EXPLAIN's and make sure it's using your index at all.
Switch to some other data store (mongo db?) or otherwise ensure this monster table is kept as much in memory as possible.
I'd say your only chance to further improve it is a covering index with all three columns (tag, data, value). That avoids the table access.
I don't think that partitioning can help with that.
I would guess that adding an index on (tag, date) would help:
alter table table add index (tag, date);
Please post the result of an explain on this query (EXPLAIN SELECT date, value FROM ......)
I think that the value column is at the bottom of your performance issues. It is not part of the index so we will have table access. Further I think that the ORDER BY is unlikely to impact the performance so severely since it is part of your index and should be ordered.
I will argument my suspicions for the value column by the fact that the partitioning does not really reduce the execution time of the query. May you execute the query without value and further give us some results as well as the EXPLAIN? Do you really need it for each row and what kind of column is it?
Cheers!
Try inserting just the needed dates into a temporary table and the finishing with a select on the temporary table for the tags and ordering.
CREATE temporary table foo
SELECT date, value
FROM table
WHERE date BETWEEN 'x' and 'y' ;
ALTER TABLE foo ADD INDEX index( tag );
SELECT date, value
FROM foo
WHERE tag = "a"
ORDER BY date;
if that doesn't work try creating foo off the tag selection instead.
CREATE temporary table foo
SELECT date, value
FROM table
WHERE tag = "a";
ALTER TABLE foo ADD INDEX index( date );
SELECT date, value
FROM foo
WHERE date BETWEEN 'x' and 'y'
ORDER BY date;
i have a table in database that is having 7k plus records. I have a query that searches for a particular id in that table.(id is auto-incremented)
query is like this->
select id,date_time from table order by date_time DESC;
this query will do all search on that 7k + data.......isn't there anyways with which i can optimize this query so that search is made only on 500 or 1000 records....as these records will increase day by day and my query will become heavier and heavier.Any suggestions?????
I don't know if im missing something here but what's wrong with:
select id,date_time from table where id=?id order by date_time DESC;
and ?id is the number of the id you are searching for...
And of course id should be a primary index.
If id is unique (possibly your primary key), then you don't need to search by date_time and you're guaranteed to only get back at most one row.
SELECT id, date_time FROM table WHERE id=<value>;
If id is not unique, then you still use the same query but need to look at indexes, other contraints, and/or caching outside the database, if the query becomes too slow.