select most recent values in very large table

select most recent values in very large table - sql

I am an operations guy tasked with pulling data from a very large table. I'm not a DBA and cannot partition it or change the indexing. Table has nearly a billion entries, is not partitioned, and could probably be indexed "better". I need two fields, which we'll call mod_date and obj_id (mod_date is indexed). EDIT: I also add a filter for 'client' which I've blurred out in my screenshot of the explain plan.
My data:
Within the group of almost a billion rows, we have fewer than 10,000 obj_id values to query across several years (a few might even be NULL). Some of the <10k obj_ids -- probably between 1,000-2,500 -- have more than 10 million mod_date values each. When the obj_ids have over a few million mod_dates, each obj_id takes several minutes to scan and sort using MAX(mod_date). The full result set takes over 12 hours to query and no one has made it to completion without some "issue" (locked out, unplugged laptop, etc.). Even if we got the first 50 rows returned we'd still need to export to Excel ... it's only going to be about 8,000 rows with 2 columns but we can never make it to the end.
So here is a simplified query I'd use if it were a small table:
select MAX(trunc(mod_date,'dd')) as last_modified_date, obj_id
from my_table
where client = 'client_name'
and obj_type_id = 12
group by obj_id;
Cardinality is 317917582, "Cost" is 12783449
The issue:
The issue is the speed of the query with such a large unpartitioned table, given the current indexes. All the other answers I've seen about "most recent date" tend to use MAX, possibly in combination with FIRST_VALUE, which seem to require a full scan of all rows in order to sort them and then determine which is the most recent.
I am hoping there is a way to avoid that, to speed up the results. It seems that Oracle (I am using Oracle SQL developer) should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?
Even with such a large table, obj_ids having fewer than 10,000 mod_dates can return the MAX(mod_date) very quickly (seconds or less). The issue we are having is the obj_ids having the most mod_dates (over 10 million) take the longest to scan and sort, when they "should" be the quickest if I could get Oracle to start looking at the most recent first … because it would find a recent date quickly and move on!

First, I'd say its a common misconception that in order to make a query run faster, you need an index (or better indexes). Full table scan makes sense when you're pulling more than 10% of the data (rough estimate, depends on multiblock read count, block size, etc).
My advice is to setup a materialized view (MY_MV or whatever) that simply does the group by query (across all ids). If you need to limit the ids to a 10k subset, just make sure you full scan the table (check explain plan). You can add a full hint if needed (select /*+ full(t) */ .. from big_table t ...)
Then do:
dbms_mview.refresh('MY_MV','C',atomic_refresh=>false);
Thats it. No issues with a client only returning the first x rows and when you go to pull everything it re-runs the entire query (ugh). Full scans are also easier to track in long opts (harder to tell what progress you've made if you are doing nested loops on an index for example).
Once its done, dump entire MV table to a file or whatever you need.

tbone has it right I think. Or, if you do not have authority to create a materialized view, as he suggests, you might create a shell script on the database server to run your query via SQL*Plus and spool the output to a file. Then, run that script using nohup and you shouldn't need to worry about laptops getting turned off, etc.
But I wanted to explain something about your comment:
Oracle should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?
That would be a horrible way for Oracle to run your query, given the indexes you have listed. Let's step through it...
There is no index on obj_id, so Oracle needs to do a full table scan to make sure it gets all the distinct obj_id values.
So, it starts the FTS and finds obj_id 101. It then says "I need max(mod_date) for 101... ah ha! I have an index!" So, it does a reverse index scan. For each entry in the index, it looks up the row from table and checks it to see if it is obj_id 101. If the obj_id was recently updated, we're good because we find it and stop early. But if the obj_id has not been updated in a long time, we have to read many index entries and, for each, access the table row(s) to perform the check.
In the worst case -- if the obj_id is one of those few you mentioned where max(mod_date) will be NULL, we would use the index to look up EVERY SINGLE ROW in your table that has a non-null mod_date.
Doing so many index lookups would be an awful plan if it did that just once, but you're talking about doing it for several old or never-updated obj_id values.
Anyway, it's all academic. There is no Oracle query plan that will run the query that way. It's for good reason.
Without better indexing, you're just not going to improve upon a single full table scan.

Related

Stop query from checking further with "between dates" condition after all dates are found in sorted table

I have a very large table with a column that contains dates. Due to the table being so huge, I want to request the data for e.g. each day, which I am trying to do with the following statement:
SELECT *
FROM [my_db].[dbo].[my_data] where date between '2019-03-25' and '2019-03-26'
So far so good, when I run this query, the relevant data (about 10,000 rows) are returned. However, the query does not stop, it keeps executing for a very long time (couldn't see how long, I always stopped it after about 30 minutes).
I assume it's checking for more fitting dates. However, the table is sorted, so I know there won't be any further dates.
What is the best approach to handle this here? Is there a way to set some kind of timeout after no futher result was found? Or should I just use a normal timeout and hope the transaction was done in time? Thanks!

So it sounds like your query is performing a table scan to retrieve your data.
We don't know anything about the performance of your hardware but for a large table, possibly highly fragmented, this could be a time consuming operation on a slow drive or if IO is a bottleneck.
You can quickly get the approximate count of rows in several ways. Reading the comments you mention you are doing this on a laptop, so likely you are the only user in which case the approximate count is likely bang-on.
The easiest is to run
exec sp_spaceused 'tablename'
You can query a list of indexes on a table
select * from sys.indexes where object_id=Object_Id('tablename')
You can also see a list of all tables and their stats including rows using Object Explorer Details in SSMS. Connect to your server and expand the database from the list in Object Explorer. Open the Details panel (F7) and click on Tables, the list will be populated and rowcounts retrieved.
You can also expand Tables in Object Explorer, expand your specific table and then expand Indexes to view what is currently defined.
Because you (probably) have no index on your Date column, even though you know you have received all the qualifying results, SQL Server doesn't because it is having to scan the table. without an index, nothing guarantees a range of rows will all reside sequentially.
This means it jumps right in at one end and starts reading through page-by-page until it gets to the end, checking each row to see if it fits your filtering criteria. If the data you are expecting happens to reside on the first pages it reads then great - but SQL Server has no way of knowing it's found every possible qualifying row - many factors such as page fragmentation could mean some rows might exist further along the list of pages making up the table's data.
An index on the date column would help dramatically because then SQL server can seek directly to the start of the first qualifying date and read the values in order until it has reached the last qualifying row, where because the data is sorted it knows it has reach the end.
An index will also help with a query such as select count(*). Every index (except filtered indexes) includes every row, but not every column - therefore to get a row count SQL Server will scan the narrowest index, which means it will have the least possible IO.
In addition, doing select * if you don't actuall need every column will be an impact on performance.
If your query is highly selective, and you have an index on date, SQL Server will seek to the required rows in the index and then do a bookmark lookup to retrieve the remaining columns.
This is an expensive operation however, so there is a threshold where the trade-off is not worth it and SQL Server will opt to scan the table instead to avoid the lookup operation.

Index not used Postgres

Tracking indexes and analyzing the tables on which index add, we encounter some situations:
some of our tables have index, but when I execute a query with a clause where on index field, doesn't account in your idx_scan field respective. Same relname and schemaname, so, I couldn't be wrong.
Testing more, I deleted and create the table again, after that the query returned to account the idx_scan.
That occurred with another tables too, we executed some queries with indexes and didn't account idx_scan field, only in seq_scan and even if I create another field in the same table with index, this new field doesn't count idx_scan.
Whats the problem with these tables? What do we do wrong? Only if I create a new table with indexes that account in idx_scan, just in an old table that has wrong.
We did migration sometimes with this database, maybe it can be the problem? Happened on localhost and server online.
Another event that we saw, some indexes were accounted, idx_scan > 0, and when execute query select, does not increase idx_scan again, the number was fixed and just increase seq_scan.
I believe those problems can be related.
I appreciate some help, it's a big mystery prowling our DB and have no idea what the problem can be.

A couple suggestions (and what to add to your question).
The first is that index scans are not always favored to to sequential scans. For example, if your table is small or the planner estimates that most pages will need to be fetched, an index scan will be omitted in favor of a sequential scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly if you have to retrieve, say, 50% of the pages of a relation, doing an index scan is going to trade somewhat less disk/IO total for a great deal more random disk/IO. It might be a win if you use SSD's but certainly not with conventional hard drives. After all you don't really want to be waiting for platters to turn. If you are using SSD's you can tweak planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
tl;dr: It's complicated. You want to make sure this is really a bad plan before jumping to conclusions and then if it is a bad plan see what you can do about it depending on your configuration.

Organizing lots of timestamped values in a DB (sql / nosql)

I have a device I'm polling for lots of different fields, every x milliseconds
the device returns a list of ids and values which I need to store with a time stamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or query the last timestamp for each value.
A simple approach would be to define a MySQL table with
id,value_id,timestamp,value
and let users select
Select value form t where value_id=x order by timestamp desc limit 1
and just push everything there with index on timestamp and id, But my question is what's the best approach performance / size wise for designing the schema? or using nosql? can anyone comment on possible design trade offs. Will such a design scale with millions of records?

When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have [timestamp] column in the WHERE clause, you do not need an index on [value] column if there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
EDIT:
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
If you rarely need to sort multiple products but are most often selecting based on a single product-code, then a non-traditional DBMS that uses a hashed-key sparse-table rather than a b-tree might be a very viable even a superior alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
YET ANOTHER EDIT SINCE THE SITE IS NOT LETTING ME ADD ANOTHER ANSWER:
Simplest way is to have two tables, one with the ongoing history, which is always having new values inserted, and the other, containing only 250 records, one per part, where the latest value overwrites/replaces the previous one.
Update latest
set value = x
where id = ?

You have a choice of
indexes (composite; covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes; composite and non-composite, also be aware that there are quite a few significantly different ways to get 'max per group' (search so, especially mysql version with variables)
triggers - you might use triggers to maintain max row values in another table (best performance of further selects; this is redundant and could be kept in memory)
lazy statistics/triggers, since your database is updated quite often you can save cycles if you update your statistics periodically (if you can allow the stats to be y seconds old and if you poll 1000 / x times a second, then you potentially save y * 100 / x potential updates; and this can be noticeable, especially in terms of scalability)
The above is true if you are looking for last bit of performance, if not keep it simple.

Appropriate query and indexes for a logging table in SQL

Assume a table named 'log', there are huge records in it.
The application usually retrieves data by simple SQL:
SELECT *
FROM log
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)
logLevel and creationData have indexes, but the number of records makes it take longer to retrieve data.
How do we fix this?

Look at your execution plan / "EXPLAIN PLAN" result - if you are retrieving large amounts of data then there is very little that you can do to improve performance - you could try changing your SELECT statement to only include columns you are interested in, however it won't change the number of logical reads that you are doing and so I suspect it will only have a neglible effect on performance.
If you are only retrieving small numbers of records then an index of LogLevel and an index on CreationDate should do the trick.
UPDATE: SQL server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). Its not really geared up for returning truly large data sets. If the amount of data that you are returning is genuinely large then there is only a certain amount that you will be able to do and so I'd have to ask:
What is it that you are actually trying to achieve?
If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, and so you might also want to look into efficient methods of paging SQL data - if you are only returning even say 500 or so records at a time it should still be very fast.
If you are trying to do some sort of statistical analysis then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what however, that isn't my area of expertise)

1: Never use Select *
2: make sure your indexes are correct, and your statistics are up-to-date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it) set up a job to archive that to some back-up, and then remove unused records. That will keep the table size down reducing the amount of time it takes search the table.

Depending on what kinda of SQL database you're using, you might look into Horizaontal Partitioning. Oftentimes, this can be done entirely on the database side of things so you won't need to change your code.

Do you need all columns? First step should be to select only those you actually need to retrieve.
Another aspect is what you do with the data after it arrives to your application (populate a data set/read it sequentially/?).
There can be some potential for improvement on the side of the processing application.
You should answer yourself these questions:
Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?

A couple of things
do you need all the columns, people usually do SELECT * because they are too lazy to list 5 columns of the 15 that the table has.
Get more RAM, themore RAM you have the more data can live in cache which is 1000 times faster than reading from disk

For me there are two things that you can do,
Partition the table horizontally based on the date column
Use the concept of pre-aggregation.
Pre-aggregation:
In preagg you would have a "logs" table, "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of logs and logs_temp table is identical. The flow of application would be in this way, all logs are logged in the logs table, then every hour a cron job runs that does the following things:
a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.
b. Aggregate the logs for that particular hour from the logs_temp table
c. Save the aggregated results in the summary table
d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
This way results are pre-aggregated in the summary table.
Whenever you wish to select the result, you would select it from the summary table.
This way the selects are very fast, because the number of records are far less as the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.
Now the inserts would be fast too, because the amount of data is not much in the logs table as it holds the data only for the last hour, so index regeneration on inserts would take very less time as compared to very large data-set hence making the inserts fast.
You can read more about Shadow Table trick here
I employed the pre-aggregation method in a news website built on wordpress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items, and there are like 100K hits per day, and this pre-aggregation thing has really helped us a lot. The query time came down from more than 2 secs to under a second. I intend on making the plugin publically available soon.

As per other answers, do not use 'select *' unless you really need all the fields.
logLevel and creationData have indexes
You need a single index with both values, what order you put them in will affect performance, but assuming you have a small number of possible loglevel values (and the data is not skewed) you'll get better performance putting creationData first.
Note that optimally an index will reduce the cost of a query to log(N) i.e. it will still get slower as the number of records increases.
C.

I really hope that by creationData you mean creationDate.
First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:
CREATE INDEX i_log_1 ON log (creationData, logLevel);
Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just date seems more likely scenario that on just log level).
Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.
If the table is large (at least few hundred thousand rows), use the following code to refresh the statistics:
DECLARE
l_ownname VARCHAR2(255) := 'owner'; -- Owner (schema) of table to analyze
l_tabname VARCHAR2(255) := 'log'; -- Table to analyze
l_estimate_percent NUMBER(3) := 5; -- Percentage of rows to estimate (NULL means compute)
BEGIN
dbms_stats.gather_table_stats (
ownname => l_ownname ,
tabname => l_tabname,
estimate_percent => l_estimate_percent,
method_opt => 'FOR ALL INDEXED COLUMNS',
cascade => TRUE
);
END;
Otherwise, if the table is small, use
ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;
Additionally, if the table grows large, you shoud consider to partition it by range on creationDate column. See these links for the details:
Oracle Documentation: Range Partitioning
OraFAQ: Range partitions
How to Create and Manage Partition Tables in Oracle

SQL Query Slow? Should it be?

Using SQLite, Got a table with ~10 columns. Theres ~25million rows.
That table has an INDEX on 'sid, uid, area, type'.
I run a select like so:
SELECT sid from actions where uid=1234 and area=1 and type=2
That returns me 1571 results, and takes 4 minutes to complete.
Is that sane?
I'm far from an SQL expert, so hopefully someone can fill me in on what I'm missing. Why could this possibly take 4+ minutes with everything indexed?
Any recommended resources to learn about achieving high SQL performance? I feel like a lot of the Google results just give me opinions or anecdotes, I wouldn't mind a solid book.

Create uid+area+type index instead, or uid+area+type+sid

Since the index starts with the sid column, it must do a scan (start at the beginning, read to the end) of either the index or the table to find your data matching the other 3 columns. This means it has to read all 25 million rows to find the answer. Even if it's reading just the rows of the index rather than the table, that's a lot of work.
Imagine a phone book of the greater New York metropolitan area, organized by (with an 'index' on) Last Name, First Name.
You submit SELECT [Last Name] FROM NewYorkPhoneBook WHERE [First Name] = 'Thelma'
It has to read all 25 million entries to find all those Thelmas. Unless you either specify the last name and can then turn directly to the page where that last name first appers (a seek), or have an index organized by First Name (a seek on the index followed by a seek on the table, aka a "bookmark lookup"), there's no way around it.
The index you would create to make your query faster is on uid, area, type. You could include sid, though leave it out if sid is part of the primary key.
Note: Tables often do have multiple indexes. Just note that the more indexes, the slower the write performance. Unnecessary indexes can slow overall performance, sometimes radically so. Testing and eventually experience will help guide you in this. Also, reasoning it out as a real-world problem (like my phone book examples) can really help. If it wouldn't make sense with phone books (and separate phone book indexes) then it probably won't make sense in the database.
One more thing: even if you put an index on those columns, if your query is going to end up pulling a great percentage of the rows in the main table, it will still be cheaper to scan the table rather than do the bookmark lookup (seek the index then seek the table for each row found). The exact "tipping point" of whether to do a bookmark lookup with a seek, or to do a table scan isn't something I can tell you off the top of my head, but it is based on solid math.

The index is not really usefull as it does start with the wrong field... which means a table scan.
Looks like you have a normal computer there, not something made for databases. I run table scans over 650 million rows in about a minute on my lower end db server, but that means reading about a gigabyte per second from the discs, which are a RAID of 10k RM discs - RAID 10. Just to say that basically... that databases love IO, and that in a degree that you have never seen before. Basically larger db servers have many discs to satisfy the IOPS (IO per second) requirement. I have seen a server with 190 discs.
So, you ahve two choices: beed up your IOPS capability (means spending money), or set up indices that get used because they are "proper".
Proper means: an index only is usefull if the fields it contains are used from left to right. Not necessarily in the same order... but if a field is missed there is a chance the SQL System decides it is not worth pursuing the index and instead goes table scan (as in your case).

When you create your new index on uid, area and type, you should also do a select distinct on each one to determine which has the fewest distinct entries, then create your index such that the fewer the differences the earlier they show up in the index definition.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas