Speed up removal of duplicates in Oracle with indexing - sql

How to remove duplicate entries from a large Oracle table (200M rows, 20 columns)?
The query below, from a 2014 answer, is slow: it took 2 minutes to delete one duplicate entry for one specific combination of columns (i.e. where col1 = 1 and ... col20 = 'Z').
DELETE sch.table1
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM sch.table1
    GROUP BY col1, col2, col3, col4, ..., col20
)
Any way to speed it up, e.g. with indexing?

Rather than using NOT IN (finding the ROWIDs to keep and then deleting everything else), you can use the ROW_NUMBER analytic function to directly find the ROWIDs to delete:
DELETE FROM sch.table1 t
WHERE EXISTS (
    SELECT 1
    FROM (
        SELECT ROWID AS rid,
               ROW_NUMBER() OVER (
                   PARTITION BY col1, col2, col3, col4, ..., col20
                   ORDER BY ROWID
               ) AS rn
        FROM sch.table1
    ) x
    WHERE x.rid = t.ROWID
      AND x.rn > 1
);
or:
DELETE FROM sch.table1
WHERE ROWID IN (
    SELECT rid
    FROM (
        SELECT ROWID AS rid,
               ROW_NUMBER() OVER (
                   PARTITION BY col1, col2, col3, col4, ..., col20
                   ORDER BY ROWID
               ) AS rn
        FROM sch.table1
    )
    WHERE rn > 1
);

As a person who has spent 20 years as a professional in data warehousing doing this kind of operation, the best answer I can give you is that investing time in EXPLAIN PLAN will return enormous time savings in the long run. Interpreting the execution plan, as detailed in the Oracle Database Performance Tuning Guide, is difficult at first but will be worth the time invested.
In this case, I can tell you that NOT IN queries are rarely optimized efficiently by the database engine, but you don't have to believe me; verify it from the explain plan. The reason is that the execution engine has to materialize the entire result of the subquery, all 200 million rows. Even worse, unless Oracle has advanced light years since I last used it, it does not index intermediate result sets, so every row checked for NOT IN is a full scan of the intermediate set. So it is potentially doing 200 million x 200 million comparisons (there may be some partitioning tricks that Oracle uses to reduce that a bit). A database that can do that in just a few minutes is pretty capable.
So knowing that, you know what to do: find a subquery that locates just the rows to delete instead of one that gives you every row that you don't want to delete. @MTO's answers are along that line. Personally I try to avoid WHERE EXISTS for similar reasons, but databases these days might well do a decent job with them.
As a refinement, I would make it a two-step process: create a separate table holding the rows found to be candidates for removal, then delete the matching rows from the base table. This way you have a record of the rows deleted in case somebody asks you the next day, and you can run sanity checks (like counts) before running the actual deletes, which might some day prevent accidents. A sketch of that follows below.
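A minimal sketch of that two-step approach, reusing the ROW_NUMBER form above (the table name sch.table1_dups is a placeholder, and the column list is elided just as in the question):

-- Step 1: record the ROWIDs of the duplicate rows (this is your audit trail).
CREATE TABLE sch.table1_dups AS
SELECT rid
FROM (
    SELECT ROWID AS rid,
           ROW_NUMBER() OVER (
               PARTITION BY col1, col2, col3, col4, ..., col20  -- all 20 columns
               ORDER BY ROWID
           ) AS rn
    FROM sch.table1
)
WHERE rn > 1;

-- Step 2: sanity-check the count before touching the base table.
SELECT COUNT(*) FROM sch.table1_dups;

-- Step 3: delete only the recorded rows.
DELETE FROM sch.table1
WHERE ROWID IN (SELECT rid FROM sch.table1_dups);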

Related

SQL Query to Return Number of Distinct Values for Each Column

For background, I am querying a large database with hundreds of attributes per table and millions of rows. I'm trying to figure out which columns have no values at all.
I can do the following:
SELECT
COUNT(DISTINCT col1) as col1,
COUNT(DISTINCT col2) as col2,
...
COUNT(DISTINCT coln) as coln
FROM table;
Whenever a count of 1 is returned, I know that column has no values, great. The issue is that this is incredibly tedious to retype hundreds of times. How can I do this more efficiently? I only have a fundamental understanding of SQL, and the limited capabilities of Athena make this more difficult for me. Thank you
Edit: Just to clarify, the reason the count needs to be 1 is that this database uses empty strings rather than NULLs.
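One common way around the retyping (a sketch, assuming the table is visible through Athena's information_schema; the schema and table names are placeholders) is to generate the query text from the catalog and paste the assembled result back in:

-- Emits one "COUNT(DISTINCT ...) AS ..." line per column; remove the
-- trailing comma from the last line before running the assembled query.
SELECT 'COUNT(DISTINCT ' || column_name || ') AS ' || column_name || ','
FROM information_schema.columns
WHERE table_schema = 'your_schema'
  AND table_name = 'your_table';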

Select all columns then sample, or select IDs only then join and sample?

Problem: Assuming we're looking at 10 billion rows of numerical data, where the WHERE conditions exclude 99% of entries, which method would you expect to perform better, and why?
I could argue either way, but then again, I have maybe 6 months SQL experience and no formal compsci education. Problem is formatted in ANSI Snowflake SQL.
Method 1: Sample all columns (with conditions).
SELECT col1, col2, col3.... coln
FROM table1
WHERE cond1 and cond2 and cond3... condn
SAMPLE (1000000 rows)
Method 2: Sample IDs only (with conditions) then join.
SELECT *
FROM
(SELECT IDcol
FROM table1
WHERE cond1 and cond2 and cond3... condn
SAMPLE (1000000 rows)
) as t1sampled
INNER JOIN
(SELECT col1, col2, col3.... coln
FROM table1
) as t1
ON t1sampled.IDcol = t1.IDcol
Similar run times!
I modified the above methods to sample 10,000 rows (not 1,000,000), because this new warehouse had less migrated data than I first thought.
I used our extra small/light (XS) Snowflake warehouse.
Method 1: 6 minutes; 75 GB read
Method 2: 6 minutes 2 seconds; 90 GB read
The first approach is better, I think. There is no need to join the table back onto itself; it's added complexity that isn't required, and it produces the same result either way. Mike already mentioned that the query compiler may even create the same plan for both queries anyway...
Also, FYI: block sampling is significantly faster than row sampling, but it may bias your results if you have small tables or if your micro-partitions contain similar data (biased towards the ingestion pattern if the table isn't clustered?).
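For reference, a minimal sketch of Snowflake's block-sampling syntax (note that BLOCK sampling takes a percentage of micro-partitions, not a fixed row count, so it is not a drop-in replacement for SAMPLE (10000 ROWS)):

-- Reads roughly 1% of micro-partitions; fast, but potentially biased
-- toward the ingestion pattern on unclustered tables.
SELECT col1, col2, col3
FROM table1 SAMPLE BLOCK (1);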

Oracle SQL Function in WHERE clause takes longer to run than in SELECT clause

Really this is two related questions
Question: Why would a query run faster with a function in the SELECT clause than in the WHERE clause?
Question: Why would an inline view take longer when moving the WHERE clause from the inline view to the outer query?
I'm not going to dump the entire query since it has columns and tables related to my work, but this is basically it. If you need a working example I will write a SQLFiddle that is similar to what I'm doing.
Run Time: 117s,
Returns: 93 records
SELECT COL1, COL2, COL3
FROM my_table
WHERE [CONDITIONS...]
and my_package.my_function(:bind_var, COL1) = 'Y';
If I were to run the function by itself, passing what would be the bind var and one of the COL1 values, it would take 0.06s:
SELECT my_package.my_function(VAL1, VAL2) FROM DUAL;
So I rewrote the query like this:
SELECT *
FROM (
SELECT COL1, COL2, COL3
FROM my_table
WHERE [CONDITIONS...]
) temp_tbl
WHERE my_package.my_function(:bind_var, COL1) = 'Y';
Run Time: 116s,
Returns: 93 records
The query without the function takes ~3 seconds to run, and it doesn't make sense that a function taking 0.06s per call, over 93 records, would push that to ~116s.
I tried seeing what happens if I moved the function to the SELECT clause.
SELECT *
FROM (
SELECT COL1, COL2, COL3, my_package.my_function(:bind_var, COL1) as fn_indc
FROM my_table
WHERE [CONDITIONS...]
) temp_tbl
WHERE fn_indc = 'Y';
When I run the inline view query by itself it takes ~3 seconds. When I add the WHERE fn_indc = 'Y' it takes ~116 seconds. Why would moving the function from the WHERE clause to the SELECT clause matter? Comparing CHAR values does not take that long. Also, if I made an inline view that retrieved the value from the function and performed my WHERE conditions in the outer query, what would cause this to run longer?
How many times is the function being executed in each case?
Without seeing query plans, I would wager that the query runs quickly when the other predicates are evaluated first, paring the result set down as much as possible before the function is called. When the function is only called 93 times (plus however many additional executions are required for the rows that aren't eliminated by any other predicate) the query runs quickly. On the other hand, if the function is called earlier in the query plan, it will be called many more times--potentially once for every row in the table and the query will return much more slowly. You could validate this by looking at the query plans or using some instrumentation to measure exactly how many times the function is called in the different cases.
The Oracle optimizer is free to evaluate predicates in whatever order it deems appropriate based on statistics. It is possible that rewriting a query will cause the optimizer to choose a different plan that is better or worse. But tomorrow, the optimizer is perfectly free to change its mind and to use the slower plan for any of the variants that you posted. Of course, Murphy being the law of the land, the optimizer is likely to wait for the worst possible time to decide to flip the query plan on you when it will cause you the most pain and suffering.
If the optimizer thinks that both the fast plan and the slow plan are roughly equally costly, that probably implies that it thinks that the function is either much less expensive to evaluate than it actually is or much more selective than it actually is. The best way to correct that mistaken belief is to associate statistics with the function. This lets you tell the optimizer how expensive the query is and how selective it is. That, in turn, lets the optimizer make better estimates and makes it likely that it will pick the more efficient plan regardless of how you write the query (and makes it much less likely that the plan will change for the worse in the future).
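A minimal sketch of that association (the cost and selectivity figures here are purely illustrative; for a packaged function the statistics are associated at the package level):

-- Tell the optimizer the function is expensive (CPU, I/O, network cost)
-- and selective (matches ~0.1% of rows), so it prefers to evaluate the
-- cheaper predicates first.
ASSOCIATE STATISTICS WITH PACKAGES my_package
    DEFAULT COST (10000, 5, 0), DEFAULT SELECTIVITY 0.1;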
Now, you can also cheat a bit by writing the query in a way that prevents the optimizer from merging the predicate, either by using hints or by putting something in the inline view that prevents the predicate from being pushed. One old trick is to throw a rownum into the inline view. Of course, it is possible that some future version of the optimizer will be smart enough to figure out that rownum isn't doing anything here and can safely be removed. And you'd need to leave a nice long comment for the next person who comes along and wonders why you put a rownum in a query when you're not doing anything with it.
SELECT *
FROM (
SELECT COL1, COL2, COL3, rownum
FROM my_table
WHERE [CONDITIONS...]
) temp_tbl
WHERE my_package.my_function(:bind_var, COL1) = 'Y';
You didn't give us much information, so I will guess...
The following query most probably makes use of some indexes, so it runs faster compared to a FTS (Full Table Scan):
SELECT *
FROM (
SELECT COL1, COL2, COL3, my_package.my_function(:bind_var, COL1) as fn_indc
FROM my_table
WHERE [CONDITIONS...]
) temp_tbl
WHERE fn_indc = 'Y';
So it would access my_table by the corresponding index(es), then apply the my_package.my_function(:bind_var, COL1) function only to rows belonging to the result set (i.e. those that got through the filtering).
If you didn't define a function-based index, Oracle is not able to use indexes for queries like:
SELECT *
FROM (
SELECT COL1, COL2, COL3
FROM my_table
WHERE [CONDITIONS...]
) temp_tbl
WHERE my_package.my_function(:bind_var, COL1) = 'Y';
So it does the following:
1. FTS (Full Table Scan) for my_table
2. apply filter on each row: my_package.my_function(:bind_var, COL1) = 'Y'
PS: if you changed your function so that it returned the value instead of expecting :bind_var as a parameter, then you could build a function-based index and make use of it as follows:
SELECT COL1, COL2, COL3
FROM my_table
WHERE [CONDITIONS...]
and my_package.my_function(COL1) = :bind_var;
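The corresponding index might look like this (a sketch; Oracle requires a PL/SQL function used in a function-based index to be declared DETERMINISTIC, and the index name is illustrative):

-- Index the function result so the predicate above can use an index scan.
CREATE INDEX my_table_fn_idx
    ON my_table (my_package.my_function(COL1));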
Answering part of my question:
Question: Why would an inline view take longer when moving the WHERE clause from the inline view to the outer query?
This is because the optimizer merges the inline view into the outer query, so what I was trying to accomplish by having the other conditions run before the function was undone by the optimizer. If I want to force the optimizer to run my inline view first, I can add rownum >= 1:
SELECT *
FROM (
SELECT COL1, COL2, COL3
FROM my_table
WHERE [CONDITIONS...]
and rownum >= 1
) temp_tbl
WHERE my_package.my_function(:bind_var, COL1) = 'Y';
https://blogs.oracle.com/optimizer/entry/optimizer_transformations_subquery_unesting_part_1

postgres indexed query running slower than expected

I have a table with ~250 columns and 10m rows in it. I am selecting 3 columns with the where clause on an indexed column with an IN query. The number of ids in the IN clause is 2500 and the output is limited by 1000 rows, here's the rough query:
select col1, col2, col3 from table1 where col4 in (1, 2, 3, 4, etc) limit 1000;
This query takes much longer than I expected, ~1s. On an indexed integer column with only 2500 items to match, it seems like this should go faster? Maybe my assumption there is incorrect. Here is the explain:
http://explain.depesz.com/s/HpL9
For simplicity I did not paste all 2500 ids into the EXPLAIN, so ignore the fact that there are only three in it. Is there anything I am missing here?
It looks like you're pushing the limits of SELECT ... WHERE y IN (...) type queries. You basically have a very large table with a large set of conditions to search on.
Depending on the type of index (I'm guessing you have a B+Tree), this kind of query is inefficient. These types of indexes do well with general-purpose range matching and inserts, while performing worse on single-value lookups. Your query is doing ~2500 single-value lookups on this index.
You have a few options to deal with this...
Use Hash indexes (these perform much better on single value lookups)
Help out the query optimizer by adding a few range-based constraints: take the 2500 values, find the min and max, and add those to the query, basically appending where x_id > min_val and x_id < max_val
Run the query in parallel if you have multiple db backends: simply break the 2500 constraints into, say, 100 groups, run all the queries at once, and collect the results. It will be better if you group the constraints based on their value.
The first option is certainly easier, but it comes at the price of making your inserts/deletes slower.
The second does not suffer from this, and you don't even need to limit it to one min/max group: you could create N groups, each with its own min and max constraints. Test it out with different groupings and see what works.
The last option is by far the best performing, of course; a sketch of the first two follows below.
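Both are sketches using the question's placeholder names; the index name and the min/max bounds are illustrative:

-- Option 1: a hash index; much cheaper for single-value equality lookups,
-- but it cannot serve range scans.
CREATE INDEX table1_col4_hash ON table1 USING hash (col4);

-- Option 2: bracket the IN-list with its min and max so the planner can
-- narrow the index scan before checking the individual values.
SELECT col1, col2, col3
FROM table1
WHERE col4 > 0 AND col4 < 2501   -- min/max of the 2500 ids
  AND col4 IN (1, 2, 3, 4 /* ... */)
LIMIT 1000;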
Your query is equivalent to:
select col1, col2, col3
from table1
where
col4 = 1
OR col4 = 2
OR col4 = 3
OR col4 = 4
... repeat 2500 times ...
which is equivalent to:
select col1, col2, col3
from table1
where col4 = 1
UNION
select col1, col2, col3
from table1
where col4 = 2
UNION
select col1, col2, col3
from table1
where col4 = 3
... repeat 2500 times ...
Basically, it means that the index on a table with 10M rows is searched 2500 times. On top of that, if col4 is not unique, then each search is a scan, which may potentially return many rows. Then 2500 intermediate result sets are combined together.
The server doesn't know that the 2500 IDs that are listed in the IN clause do not repeat. It doesn't know that they are already sorted. So, it has little choice, but do 2500 independent index seeks, remember intermediate results somewhere (like in an implicit temp table) and then combine them together.
If you had a separate table table_with_ids with the list of 2500 IDs, which had a primary or unique key on ID, then the server would know that they are unique and they are sorted.
Your query would be something like this:
select col1, col2, col3
from
table_with_ids
inner join table1 on table_with_ids.id = table1.col4
The server may be able to perform such join more efficiently.
I would test the performance using pre-populated (temp) table of 2500 IDs and compare it to the original. If the difference is significant, you can investigate further.
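A sketch of that test (how you stage the 2500 ids is up to you; the temp table here is illustrative):

-- Stage the ids with a primary key so the server knows they are unique,
-- then refresh the statistics before joining.
CREATE TEMP TABLE table_with_ids (id integer PRIMARY KEY);
INSERT INTO table_with_ids VALUES (1), (2), (3), (4); -- ... all 2500 ids
ANALYZE table_with_ids;

SELECT t1.col1, t1.col2, t1.col3
FROM table_with_ids ids
INNER JOIN table1 t1 ON t1.col4 = ids.id
LIMIT 1000;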
Actually, I'd start with running this simple query:
select col1, col2, col3
from table1
where
col4 = 1
and measure the time it takes to run. You can't get better than this. So, you'll have a lower bound and a clear indication of what you can and can't achieve. Then, maybe change it to where col4 in (1,2) and see how things change.
One more way to somewhat improve performance is to have an index not just on col4, but on (col4, col1, col2, col3). It would still be one index, but on several columns. (In SQL Server I would have col1, col2, col3 "included" in the index on col4 rather than part of the key itself, to keep it smaller; PostgreSQL added the same feature, INCLUDE, in version 11.) In this case the server should be able to retrieve all the data it needs from the index itself, without doing additional look-ups in the main table, making it a so-called "covering" index.
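A sketch of both variants (index names are illustrative; the INCLUDE form requires PostgreSQL 11 or later):

-- All four columns in the index key:
CREATE INDEX table1_col4_covering ON table1 (col4, col1, col2, col3);

-- Or keep the key narrow and carry the other columns as payload (PG 11+):
CREATE INDEX table1_col4_inc ON table1 (col4) INCLUDE (col1, col2, col3);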

Selecting data effectively sql

I have a very large table with over 1000 records and 200 columns. When I try to retrieve records matching some criteria in the WHERE clause using a SELECT statement it takes a lot of time. But most of the time I just want to select a single record that matches the criteria in the WHERE clause rather than all the records.
I guess there should be a way to select just a single record and exit, which would minimize the retrieval time. I tried ROWNUM = 1 in the WHERE clause, but it didn't really help because I guess the engine still checks all the records even after finding the first record matching the WHERE criteria. Is there a way to optimize the case where I want to select just a few records?
Thanks in advance.
Edit:
I am using oracle 10g.
The Query looks like,
Select *
from Really_Big_table
where column1 is NOT NULL
and column2 is NOT NULL
and rownum=1;
This seems to run slower than the version without rownum=1.
rownum is what you want, but you need to perform your main query as a subquery.
For example, if your original query is:
SELECT col1, col2
FROM table
WHERE condition
then you should try
SELECT *
FROM (
SELECT col1, col2
FROM table
WHERE condition
) WHERE rownum <= 1
See http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html for details on how rownum works in Oracle.
1,000 records isn't a lot of data in a table. 200 columns is a reasonably wide table. For this reason, I'd suggest you aren't dealing with a really big table - I've performed queries against millions of rows with no problems.
Here is a little experiment... how long does it take to run this compared to the "SELECT *" query?
SELECT
Really_Big_table.Id
FROM
Really_Big_table
WHERE
column1 IS NOT NULL
AND
column2 IS NOT NULL
AND
rownum=1;
An example using an analytic function:
SELECT ename, sal
FROM ( SELECT ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
FROM emp )
WHERE sal_rank <= 1;
You should also index the columns used in the WHERE clause.
In SQL, most of the optimization comes in the form of indexes on the table (as a rough guide, index the columns that appear in the WHERE and ORDER BY clauses).
As for queries, you should always specify the columns you are returning and not use a blanket *.
It shouldn't take a lot of time to query a 1000-row table. There are exceptions, however; check whether you are in one of the following cases:
1. Lots of rows were deleted
The table had a massive number of rows in the past. Since the High Water Mark (HWM) is still high (DELETE won't lower it) and a FULL TABLE SCAN reads all the data up to the high-water mark, it may take a long time to return results even if the table is now nearly empty.
Analyse your table (dbms_stats.gather_table_stats('<owner>','<table>')) and compare the effective space taken by the row data with the space actually used on disk, for example:
SELECT t.avg_row_len * t.num_rows data_bytes,
(t.blocks - t.empty_blocks) * ts.block_size bytes_used
FROM user_tables t
JOIN user_tablespaces ts ON t.tablespace_name = ts.tablespace_name
WHERE t.table_name = '<your_table>';
You will have to take into account the overhead of the rows and blocks, as well as the space reserved for updates (PCTFREE). If you see that you use a lot more space than required (typical overhead is below 30%, YMMV), you may want to reset the HWM, either:
- ALTER TABLE <your_table> MOVE; and then rebuild the indexes (ALTER INDEX <index> REBUILD), not forgetting to collect stats afterwards, or
- use DBMS_REDEFINITION.
2. The table has very large columns
Check if you have columns of datatype LOB, CLOB, LONG (irk), etc. Data over 4000 bytes in any of these columns is stored out of line (in a separate segment), which means that if you don't select these columns you will only query the other smaller columns.
If you are in this case, don't use SELECT *. Either you don't need the data in the large columns, or use SELECT ROWID first and then do a second query: SELECT * FROM <your_table> WHERE ROWID = <rowid>.
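A sketch of that two-query pattern against the question's table:

-- First query: read only the small inline columns, plus the row's address.
SELECT ROWID AS rid, column1, column2
FROM Really_Big_table
WHERE column1 IS NOT NULL
  AND column2 IS NOT NULL
  AND ROWNUM = 1;

-- Second query: fetch the large out-of-line columns for that one row only.
SELECT *
FROM Really_Big_table
WHERE ROWID = :rid;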