I have a market data table with more than 100,000,000 rows, and I need to find and remove duplicates where the symbol and totvol columns match but the serial_no differs.
I have tried the query below both on a single table and using a copy of the table for reference, but unfortunately it takes up an enormous amount of heap space (>100G and counting, sometimes filling the hard drive to the brim and crashing my database) and time (>30 mins), and it brings my server to a crawl (60-95% CPU usage on 32 cores!), which is unacceptable. Is there an efficient way to write this query so the SQL execution is optimized, given that I want to run something like this regularly?
Normally I would partition the table somehow, since duplicates are for the most part inserted adjacent or near each other, but since MonetDB is a column-store database, partitioning this way also takes up a lot of heap space. The only helpful thing I have found to reduce the heap is creating an entirely new table with a subset of the data (i.e. split alphabetically by symbol), which results in smaller column BAT files, and then running the operation on the small table. Is there any way I can keep the large table intact and write a query which operates on maybe 1,000,000 rows at a time (see the sketch after the example data below)?
The query:
delete from print_11_11
where serial_no in (select a.serial_no
                    from print_11_11 as a, print_11_11 as b
                    where a.symbol = b.symbol
                      and a.totvol = b.totvol
                      and a.serial_no > b.serial_no)
Some example data: rows 2 and 3 are duplicates of one another, and rows 5 through 7 are all duplicates of one another by my criteria. Note that the exseq may be the same or different; it does not matter which exseq value we keep when removing duplicates:
serial_no | ttime | symbol | vol  | totvol | exseq
0         | 80017 | T      | 200  | 200    | 133813
855       | 80017 | T      | 42   | 242    | 133813
867       | 80017 | T      | 42   | 242    | 136690
868       | 80210 | T      | 100  | 342    | 136690
876       | 80211 | T      | 100  | 442    | 136690
877       | 80211 | T      | 100  | 442    | 136696
882       | 80211 | T      | 100  | 442    | 136737
883       | 80213 | T      | 2928 | 3370   | 136737
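As asked above, one way to keep the large table intact is to bound each delete to a window of rows. This is only a hedged sketch, not MonetDB-specific advice: the 1,000,000-wide serial_no window is an assumption, and you would advance it pass by pass (duplicates that straddle a window boundary are still caught because the inner b side is left unrestricted):

delete from print_11_11
where serial_no between 0 and 999999          -- advance this window each pass
  and serial_no in (select a.serial_no
                    from print_11_11 as a, print_11_11 as b
                    where a.symbol = b.symbol
                      and a.totvol = b.totvol
                      and a.serial_no > b.serial_no
                      and a.serial_no between 0 and 999999);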
Related
According to the documented limits, Postgres supports up to 1,600 columns per table.
https://www.postgresql.org/docs/current/limits.html
I understand that it's bad practice to have so many columns but what are the consequences of approaching this limit?
For example, will a table with 200 columns perform fine in an application? How can you tell when you're approaching too many columns for a given table?
The hard limit is that a table row has to fit inside a single 8kB block.
The "soft limits" you encounter with many columns are
writing the SELECT list becomes more and more annoying (never, ever use SELECT *)
each UPDATE has to write a large row version, so lots of data churn
extracting the 603rd column from a row requires skipping the previous 602 columns, which is a performance hit
it is plain annoying if the output of \d is 50 pages long
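If you want a rough sense of how close your rows are to that 8kB ceiling, you can measure row sizes directly; a minimal sketch, where my_wide_table is a placeholder name:

-- widest and average row size in bytes for the table
select max(pg_column_size(t.*)) as widest_row_bytes,
       avg(pg_column_size(t.*))::int as avg_row_bytes
from my_wide_table as t;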
I have a very large table with a column that contains dates. Due to the table being so huge, I want to request the data for e.g. each day, which I am trying to do with the following statement:
SELECT *
FROM [my_db].[dbo].[my_data] where date between '2019-03-25' and '2019-03-26'
So far so good: when I run this query, the relevant data (about 10,000 rows) are returned. However, the query does not stop; it keeps executing for a very long time (I couldn't see how long, as I always stopped it after about 30 minutes).
I assume it's checking for more fitting dates. However, the table is sorted, so I know there won't be any further dates.
What is the best approach to handle this? Is there a way to set some kind of timeout after no further result was found? Or should I just use a normal timeout and hope the transaction was done in time? Thanks!
So it sounds like your query is performing a table scan to retrieve your data.
We don't know anything about the performance of your hardware, but for a large table, possibly highly fragmented, this could be a time-consuming operation on a slow drive or where IO is a bottleneck.
You can quickly get the approximate count of rows in several ways. Reading the comments you mention you are doing this on a laptop, so likely you are the only user in which case the approximate count is likely bang-on.
The easiest is to run
exec sp_spaceused 'tablename'
You can query a list of indexes on a table
select * from sys.indexes where object_id=Object_Id('tablename')
You can also see a list of all tables and their stats including rows using Object Explorer Details in SSMS. Connect to your server and expand the database from the list in Object Explorer. Open the Details panel (F7) and click on Tables, the list will be populated and rowcounts retrieved.
You can also expand Tables in Object Explorer, expand your specific table and then expand Indexes to view what is currently defined.
Because you (probably) have no index on your date column, even though you know you have received all the qualifying results, SQL Server doesn't know that, because it has to scan the table. Without an index, nothing guarantees a range of rows will all reside sequentially.
This means it jumps right in at one end and starts reading through page-by-page until it gets to the end, checking each row to see if it fits your filtering criteria. If the data you are expecting happens to reside on the first pages it reads then great - but SQL Server has no way of knowing it's found every possible qualifying row - many factors such as page fragmentation could mean some rows might exist further along the list of pages making up the table's data.
An index on the date column would help dramatically, because then SQL Server can seek directly to the first qualifying date and read the entries in order until it has reached the last qualifying row; because the index is sorted, it knows it has reached the end.
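A minimal sketch of such an index, reusing the table and column names from the query above (the index name itself is just an assumption):

CREATE INDEX IX_my_data_date ON [my_db].[dbo].[my_data] ([date]);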
An index will also help with a query such as select count(*). Every index (except filtered indexes) includes every row, but not every column - therefore to get a row count SQL Server will scan the narrowest index, which means it will have the least possible IO.
In addition, doing select * when you don't actually need every column will hurt performance.
If your query is highly selective, and you have an index on date, SQL Server will seek to the required rows in the index and then do a bookmark lookup to retrieve the remaining columns.
This is an expensive operation however, so there is a threshold where the trade-off is not worth it and SQL Server will opt to scan the table instead to avoid the lookup operation.
I am an operations guy tasked with pulling data from a very large table. I'm not a DBA and cannot partition it or change the indexing. The table has nearly a billion entries, is not partitioned, and could probably be indexed "better". I need two fields, which we'll call mod_date and obj_id (mod_date is indexed). EDIT: I also added a filter for 'client', which I've blurred out in my screenshot of the explain plan.
My data:
Within the group of almost a billion rows, we have fewer than 10,000 obj_id values to query across several years (a few might even be NULL). Some of the <10k obj_ids -- probably between 1,000-2,500 -- have more than 10 million mod_date values each. When the obj_ids have over a few million mod_dates, each obj_id takes several minutes to scan and sort using MAX(mod_date). The full result set takes over 12 hours to query and no one has made it to completion without some "issue" (locked out, unplugged laptop, etc.). Even if we got the first 50 rows returned we'd still need to export to Excel ... it's only going to be about 8,000 rows with 2 columns but we can never make it to the end.
So here is a simplified query I'd use if it were a small table:
select MAX(trunc(mod_date,'dd')) as last_modified_date, obj_id
from my_table
where client = 'client_name'
and obj_type_id = 12
group by obj_id;
Cardinality is 317917582, "Cost" is 12783449
The issue:
The issue is the speed of the query with such a large unpartitioned table, given the current indexes. All the other answers I've seen about "most recent date" tend to use MAX, possibly in combination with FIRST_VALUE, which seem to require a full scan of all rows in order to sort them and then determine which is the most recent.
I am hoping there is a way to avoid that, to speed up the results. It seems that Oracle (I am using Oracle SQL developer) should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?
Even with such a large table, obj_ids having fewer than 10,000 mod_dates can return the MAX(mod_date) very quickly (seconds or less). The issue we are having is the obj_ids having the most mod_dates (over 10 million) take the longest to scan and sort, when they "should" be the quickest if I could get Oracle to start looking at the most recent first … because it would find a recent date quickly and move on!
First, I'd say it's a common misconception that in order to make a query run faster, you need an index (or better indexes). A full table scan makes sense when you're pulling more than roughly 10% of the data (a rough estimate that depends on multiblock read count, block size, etc.).
My advice is to set up a materialized view (MY_MV or whatever) that simply does the group-by query (across all ids). If you need to limit the ids to a 10k subset, just make sure you full scan the table (check the explain plan). You can add a full hint if needed (select /*+ full(t) */ ... from big_table t ...).
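A minimal sketch of such a materialized view, reusing the table and filter names from the question (BUILD DEFERRED so the actual work happens in the refresh call below):

create materialized view my_mv
  build deferred
  refresh complete on demand
as
select /*+ full(t) */ obj_id,
       max(trunc(mod_date, 'dd')) as last_modified_date
from   my_table t
where  client = 'client_name'
and    obj_type_id = 12
group  by obj_id;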
Then do:
dbms_mview.refresh('MY_MV','C',atomic_refresh=>false);
That's it. No issues with a client only returning the first x rows and then re-running the entire query when you go to pull everything (ugh). Full scans are also easier to track in long ops (it's harder to tell what progress you've made if you are doing nested loops on an index, for example).
Once it's done, dump the entire MV table to a file or whatever you need.
tbone has it right I think. Or, if you do not have authority to create a materialized view, as he suggests, you might create a shell script on the database server to run your query via SQL*Plus and spool the output to a file. Then, run that script using nohup and you shouldn't need to worry about laptops getting turned off, etc.
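A minimal sketch of that spool approach, reusing the query from earlier (the file name and SQL*Plus formatting settings are assumptions); save it as something like extract.sql and launch it with nohup so it survives a dropped session:

-- extract.sql: spool the two columns straight to a flat file on the server
set pagesize 0
set linesize 200
set trimspool on
set feedback off
spool /tmp/obj_last_modified.csv
select obj_id || ',' || to_char(max(trunc(mod_date, 'dd')), 'YYYY-MM-DD')
from   my_table
where  client = 'client_name'
and    obj_type_id = 12
group  by obj_id;
spool off
exit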
But I wanted to explain something about your comment:
Oracle should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?
That would be a horrible way for Oracle to run your query, given the indexes you have listed. Let's step through it...
There is no index on obj_id, so Oracle needs to do a full table scan to make sure it gets all the distinct obj_id values.
So, it starts the FTS and finds obj_id 101. It then says "I need max(mod_date) for 101... ah ha! I have an index!" So, it does a reverse index scan. For each entry in the index, it looks up the row from the table and checks whether it is obj_id 101. If the obj_id was recently updated, we're good because we find it and stop early. But if the obj_id has not been updated in a long time, we have to read many index entries and, for each, access the table row(s) to perform the check.
In the worst case, if the obj_id is one of those few you mentioned where max(mod_date) will be NULL, we would use the index to look up EVERY SINGLE ROW in your table that has a non-null mod_date.
Doing so many index lookups would be an awful plan if it did that just once, but you're talking about doing it for several old or never-updated obj_id values.
Anyway, it's all academic. There is no Oracle query plan that will run the query that way. It's for good reason.
Without better indexing, you're just not going to improve upon a single full table scan.
I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many), and all the tables used in the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what I should look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivably involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default distribution style is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source tables were DISTSTYLE ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
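If that redistribution is a concern, one hedged option is to state the distribution and sort settings explicitly in the CTAS itself; a sketch, where the sort column is a placeholder and the SELECT stands in for query A:

create table tbl
diststyle all
sortkey (some_sort_column)
as
select ...;  -- query A goes here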
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of query A. It is possible that Redshift was able to optimize the count query by reading very little data from disk, because it realized that very few columns (or perhaps no columns) needed to be read to produce the count. This really depends upon the contents of that query.
It could simply be that you've got a very complex query that takes a long time to process (work that wasn't needed as part of the count). If the query involves many JOIN and WHERE clauses, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE writes all the data returned by the query to disk; the count query does not, which explains the difference. Writing all the rows is a more expensive operation than just computing a row count.
Apparently there is a memory leak in BigQuery's UDFs. We run a simple UDF over a small table (3,000 rows, 5 MB) and it fails. If we run the same UDF over the first half of the table concatenated with the second half of the table (in the same query), then it works! Namely: SELECT blah FROM myUDF(SELECT id, data FROM table) fails, while SELECT blah FROM myUDF(SELECT id, data FROM table ORDER BY id LIMIT 1500), myUDF(SELECT id, data FROM table ORDER BY id DESC LIMIT 1500) succeeds.
The question is: how do we work around this issue? Is there a way to dynamically split a table in multiple parts, each of equal size and of predefined number of rows? Say 1000 rows at a time? (the sample table has 3000 rows, but we want this to succeed in larger tables, and if we split a 6000 row table in half, the UDF will be failing again on each half). In any solution, it is important to (a) NOT use ORDER BY, because it has a 65000 row limitation; (b) use a single combined query (otherwise the solution may be too slow, plus every combined table is charged at a minimum of 10MB, so if we have to split a 1,000,000 row table into 1,000 rows at a time we will automatically be charged for 10 GB. Times 1,000 tables = 10TB. This stuff adds up quickly)
Any ideas?
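One possible workaround, sketched here under the assumption that this is legacy BigQuery SQL (as the syntax above suggests): bucket the rows by hashing id instead of ordering them. The buckets are only roughly equal in size, not exact row counts, but it avoids ORDER BY and stays in a single query:

-- split the UDF input into three roughly equal buckets via a hash of id
SELECT blah
FROM myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 0),
     myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 1),
     myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 2)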
This issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generates a data segment that was bigger than our limits, but this was only happening when the UDF ran over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.
In the meantime, we've rolled out a fix to update the max data segment size. I've verified that this fixes several other queries that were failing for the same reason.
Could you please retry your query? I'm afraid I can't easily get to your code resources in GCS.