SQL Server query: UNION vs DISTINCT + UNION ALL performance

Is there a difference in performance in SQL Server between these two statements?
SELECT distinct 'A' as TableName, Col1, Col2, Col3 FROM A
UNION ALL
SELECT distinct 'B' as TableName, Col1, Col2, Col3 from B
versus
SELECT 'A' as TableName, Col1, Col2, Col3 FROM A
UNION
SELECT 'B' as TableName, Col1, Col2, Col3 from B
The difference between this and similar questions such as UNION vs DISTINCT in performance is that I can confirm ahead of time that the individual tables I'm using won't have any duplicate records between them, only within the individual tables.
The execution plans look the same to me, in that it sorts the individual tables before concatenating. However, if I remove the scalar from them both, the plan of the UNION ALL stays basically the same but the UNION changes to concatenating before the distinct. I'll be concatenating about 20 tables together, and it's not clear whether doing 20 individual DISTINCTs is faster than doing one large DISTINCT at the end, since I can still confirm that the tables would not share any duplicates between them (only within the same table).

DISTINCT is not necessarily implemented by sort, it can also be implemented by hashing.
Both of these are memory-consuming operations, and reducing the size of the data being de-duplicated can reduce the amount of memory required, which is good for concurrency.
The algorithmic complexity of sorting is n log n, meaning that the work required grows linearithmically as n grows. On that basis, sorting 10 smaller sets of size s should generally be faster than sorting one larger set of size 10*s.
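As a rough back-of-the-envelope comparison (ignoring constant factors, and assuming each of the 10 inputs has s rows):
10 separate sorts: 10 * (s log s) = 10s log s
one combined sort: (10s) * log(10s) = 10s log s + (10s) log 10
So the single large sort does on the order of 10s log 10 extra work, and its memory requirement is driven by the full 10s rows rather than by one s-row input at a time.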

Let's not talk about SQL for a minute.
Case 1: Say, there is a list of 100 numbers.
List: 1, 2, 3, 4, ..., 60, and then 61 repeated 40 times.
The list is not in any order and you don't know its contents beforehand. Now you are trying to find the unique values in the list of 100 numbers and then sort them.
Case 2: As you said, there are two lists with no duplicate records between them.
List 1: 1,2,3,4,....60
List 2: 61,61,61,61... 40 times
It satisfies the condition you mentioned. List 1, as before, has its numbers in random order. But now you are searching for the unique values in a list of 60 numbers rather than a larger set of 100, plus a second list from which you will only ever get 61.
Coming back to SQL, it all depends on the size of the data you have in each individual table, and maybe some other factors.
I accept this is not a complete answer, but I still hope it helps.

Speed up removal of duplicates in Oracle with indexing

How to remove duplicate entries from a large Oracle table (200M rows, 20 columns)?
The below query from 2014 is slow. It took 2 minutes to delete 1 duplicate entry for one specific combination of columns (i.e. where col1 = 1 and .. col20 = 'Z').
DELETE sch.table1
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM sch.table1
GROUP BY col1, col2, col3, col4, ..., col20)
Any way to speed it up, e.g. with indexing?
Rather than using an anti-join (and finding the non-matching ROWID and then deleting all the others), you can use the ROW_NUMBER analytic function to directly find the ROWIDs to delete:
DELETE FROM sch.table1 t
WHERE EXISTS (
  SELECT 1
  FROM (
    SELECT rowid AS rid,
           ROW_NUMBER() OVER (
             PARTITION BY col1, col2, col3, col4, ..., col20
             ORDER BY rowid
           ) AS rn
    FROM sch.table1
  ) x
  WHERE x.rid = t.ROWID
  AND x.rn > 1
);
or:
DELETE FROM sch.table1 t
WHERE t.ROWID IN (
  SELECT rid
  FROM (
    SELECT rowid AS rid,
           ROW_NUMBER() OVER (
             PARTITION BY col1, col2, col3, col4, ..., col20
             ORDER BY rowid
           ) AS rn
    FROM sch.table1
  )
  WHERE rn > 1
);
fiddle
As a person who has spent 20 years as a professional in data warehousing doing this kind of operation, the best answer I can give you is that investing time in EXPLAIN PLAN will return enormous time savings in the long run. The link above is just the syntax of the command. Interpreting the execution plan, as detailed in the Oracle Database Performance Tuning Guide, is difficult at first but will be worth the time invested.
In this case, I can tell you that "not in" queries are rarely optimized efficiently by the database engine, but you don't have to believe me, just verify it from the explain plan. The reason is that the execution engine has to save the entire result of the subquery, all 200 million rows. Even worse, unless Oracle has advanced light years since I last used it, it does not index intermediate tables, so every row that is checked for "not in" is a full scan of the intermediate set. So it is potentially doing on the order of 200 million x 200 million comparisons (there may be some partitioning tricks Oracle uses to reduce that a bit). It's a pretty capable database that can do that in just a few minutes.
So knowing that, you know what to do: find a subquery that locates just the rows to delete instead of one that gives you every row that you don't want to delete. @MTO's answer goes along those lines. Personally I try to avoid "where exists" for similar reasons, but databases these days might well do a decent job with them.
As a refinement, I would make it a two-step process: create a separate table holding the rows identified as candidates for removal, then delete the matching rows from the base table. That way you have a record of the rows deleted in case somebody asks the next day, and you can run sanity checks (like counts) before running the actual deletes, which might someday prevent accidents.
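A minimal sketch of that two-step process, reusing the ROW_NUMBER logic from above against the same sch.table1 (the staging table name here is made up):
CREATE TABLE sch.table1_dup_rowids AS
SELECT rid
FROM (
    SELECT rowid AS rid,
           ROW_NUMBER() OVER (
               PARTITION BY col1, col2, col3, col4, ..., col20
               ORDER BY rowid
           ) AS rn
    FROM sch.table1
)
WHERE rn > 1;

-- sanity checks (e.g. counts) before touching the base table
SELECT COUNT(*) FROM sch.table1_dup_rowids;

DELETE FROM sch.table1
WHERE rowid IN (SELECT rid FROM sch.table1_dup_rowids);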

Select all columns then sample, or select IDs only then join and sample?

Problem: Assuming we're looking at 10 billion rows of numerical data, where the WHERE clause excludes 99% of entries, which method would you expect to perform better and why?
I could argue either way, but then again, I have maybe 6 months of SQL experience and no formal compsci education. The problem is formatted in ANSI-style Snowflake SQL.
Method 1: Sample all columns (with conditions).
SELECT col1, col2, col3.... coln
FROM table1
WHERE cond1 and cond2 and cond3... condn
SAMPLE (1000000 rows)
Method 2: Sample IDs only (with conditions) then join.
SELECT *
FROM
(SELECT IDcol
FROM table1
WHERE cond1 and cond2 and cond3... condn
SAMPLE (1000000 rows)
) as t1sampled
INNER JOIN
(SELECT col1, col2, col3.... coln
FROM table1
) as t1
ON t1sampled.IDcol = t1.IDcol
Similar run times!
I modified the above methods to sample 10,000 rows (not 1,000,000), because this new warehouse had less migrated data than I first thought.
I used our extra small/light (XS) Snowflake warehouse.
Method 1: 6 minutes; 75 GB read
Method 2: 6 minutes 2 seconds; 90 GB read
The first approach is better, I think. There is no need to join the table back onto itself; it's added complexity that isn't required, and ultimately it produces the same result either way. Mike already mentioned that the query compiler may even create the same plan for both queries anyway...
Also, FYI: block sampling is significantly faster than row sampling, but it may bias your results if you have small tables or if your micro-partitions contain similar data (biased towards the ingestion pattern if the table isn't clustered?).
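For reference, a minimal sketch of the block-sampling syntax; SAMPLE BLOCK takes a percentage of micro-partitions rather than a fixed row count, and the table name here is the one from the question:
SELECT col1, col2, col3
FROM table1 SAMPLE BLOCK (1);   -- roughly 1% of the table, selected whole micro-partitions at a time
Because whole micro-partitions are either kept or skipped, the sample can over-represent whatever data happened to be loaded together, which is the bias mentioned above.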

How to avoid evaluating the same calculated column in a Hive query repeatedly

Let's say I have a calculated column:
select str_to_map("k1:1,k2:2,k3:3")["k1"] as col1,
str_to_map("k1:1,k2:2,k3:3")["k2"] as col2,
str_to_map("k1:1,k2:2,k3:3")["k3"] as col3;
How do I 'fix' the column calculation only once and access its value multiple times in the query? The map being calculated is the same, only different keys are being accessed for different columns. Performing the same calculation repeatedly is a waste of resources. This example is purposely made too simple, but the point is I want to know how to avoid this kind of redundancy in Hive in general.
In general, use subqueries; they are calculated once.
select map_col["k1"] as col1,
map_col["k2"] as col2,
map_col["k3"] as col3
from
(
select str_to_map("k1:1,k2:2,k3:3") as map_col from table...
) s;
Also, you can materialize a query into a table to reuse the dataset across different queries or workflows.
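A minimal sketch of that materialization option; source_table and kv_string are made-up names for wherever the key/value string actually comes from:
CREATE TABLE map_stage AS
SELECT str_to_map(kv_string) AS map_col
FROM source_table;

SELECT map_col["k1"] AS col1,
       map_col["k2"] AS col2,
       map_col["k3"] AS col3
FROM map_stage;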

postgres indexed query running slower than expected

I have a table with ~250 columns and 10M rows in it. I am selecting 3 columns, with a WHERE clause applying an IN condition to an indexed column. The number of IDs in the IN clause is 2500 and the output is limited to 1000 rows; here's the rough query:
select col1, col2, col3 from table1 where col4 in (1, 2, 3, 4, etc) limit 1000;
This query takes much longer than I expected, ~1s. On an indexed integer column with only 2500 items to match, it seems like this should go faster? Maybe my assumption there is incorrect. Here is the explain:
http://explain.depesz.com/s/HpL9
I did not paste all 2500 ids into the EXPLAIN just for simplicity so ignore the fact that there are only 3 in that. Anything I am missing here?
It looks like you're pushing the limits of select x where y IN (...) type queries. You basically have a very large table with a large set of conditions to search on.
Depending on the type of index (I'm guessing you have a B+Tree), this kind of query is inefficient. These types of indexes do well with general-purpose range matching and inserts, while performing worse on single-value lookups. Your query is doing ~2500 lookups on this index for single values.
You have a few options to deal with this...
Use hash indexes (these perform much better on single-value lookups)
Help out the query optimizer by adding a few range-based constraints: take the 2500 values, find the min and max, and add those to the query, basically appending where x_id > min_val and x_id < max_val
Run the query in parallel if you have multiple DB backends: simply break up the 2500 constraints into, say, 100 groups, run all the queries at once, and collect the results. It works better if you group the constraints by value
The first option is certainly easier, but it comes at the price of making your inserts/deletes slower.
The second does not suffer from this, and you don't even need to limit it to one min/max group. You could create N groups with N min and max constraints. Test it out with different groupings and see what works; a sketch of this option follows.
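Here is a rough sketch of the second option using the table and column names from the question; min_val and max_val stand for the smallest and largest of the 2500 IDs:
select col1, col2, col3
from table1
where col4 between min_val and max_val   -- replace with the actual min/max of the ID list
  and col4 in (1, 2, 3, 4, ...)          -- the original 2500-value list
limit 1000;
The extra range predicate gives the planner a contiguous interval it can use for an index range scan before checking the individual values.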
The last option is by far the best performing of course.
Your query is equivalent to:
select col1, col2, col3
from table1
where
col4 = 1
OR col4 = 2
OR col4 = 3
OR col4 = 4
... repeat 2500 times ...
which is equivalent to:
select col1, col2, col3
from table1
where col4 = 1
UNION
select col1, col2, col3
from table1
where col4 = 2
UNION
select col1, col2, col3
from table1
where col4 = 3
... repeat 2500 times ...
Basically, it means that the index on a table with 10M rows is searched 2500 times. On top of that, if col4 is not unique, then each search is a scan, which may potentially return many rows. Then 2500 intermediate result sets are combined together.
The server doesn't know that the 2500 IDs listed in the IN clause do not repeat. It doesn't know that they are already sorted. So, it has little choice but to do 2500 independent index seeks, remember the intermediate results somewhere (such as an implicit temp table) and then combine them together.
If you had a separate table table_with_ids with the list of 2500 IDs, which had a primary or unique key on ID, then the server would know that they are unique and they are sorted.
Your query would be something like this:
select col1, col2, col3
from
table_with_ids
inner join table1 on table_with_ids.id = table1.col4
The server may be able to perform such a join more efficiently.
I would test the performance using pre-populated (temp) table of 2500 IDs and compare it to the original. If the difference is significant, you can investigate further.
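A quick way to run that comparison, assuming PostgreSQL and a throwaway temp table for the IDs (names here are placeholders):
CREATE TEMP TABLE table_with_ids (id integer PRIMARY KEY);
INSERT INTO table_with_ids (id) VALUES (1), (2), (3), (4);   -- ... the full 2500 IDs
ANALYZE table_with_ids;

EXPLAIN ANALYZE
SELECT col1, col2, col3
FROM table_with_ids
INNER JOIN table1 ON table_with_ids.id = table1.col4;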
Actually, I'd start with running this simple query:
select col1, col2, col3
from table1
where
col4 = 1
and measure the time it takes to run. You can't get better than this. So, you'll have a lower bound and a clear indication of what you can and can't achieve. Then, maybe change it to where col4 in (1,2) and see how things change.
One more way to somewhat improve performance is to have an index not just on col4, but on col4, col1, col2, col3. It would still be one index, but on several columns. (In SQL Server I would have columns col1, col2, col3 "included" in the index on col4, rather than part of the index itself, to make it smaller, but I don't think Postgres has such a feature.) In this case the server should be able to retrieve all the data it needs from the index itself, without doing additional look-ups in the main table, making it a so-called "covering" index.
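A sketch of that multicolumn index (the index name is made up):
CREATE INDEX table1_col4_covering_idx ON table1 (col4, col1, col2, col3);
For what it's worth, more recent PostgreSQL versions (11+) do support non-key "included" columns, much like the SQL Server feature described above:
CREATE INDEX table1_col4_inc_idx ON table1 (col4) INCLUDE (col1, col2, col3);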

Unique rows in duplicate data contained in one table

I have a table in an Oracle DB that stores transaction batches uploaded by users. A new upload mechanism has been implemented and I would like to compare its results. A single batch was uploaded using the original mechanism and then the new mechanism. I am trying to find the unique rows (i.e. rows that existed in the first upload but do not exist, or are different, in the second upload; or rows that do not exist in the first upload but do exist, or are different, in the second). I am dealing with a huge data set (over a million records) and that makes this analysis very difficult.
I have tried several approaches:
SELECT col1, col2 ...
FROM table
WHERE upload_id IN (first_upload_ID, second_upload_id)
GROUP BY col1, col2..
HAVING COUNT(*) = 1;
SELECT col1, col2 ...
FROM table
WHERE upload_id = first_upload_ID
MINUS
SELECT col1, col2 ...
FROM table
WHERE upload_id = second_upload_id;
SELECT col1, col2 ...
FROM table
WHERE upload_id = second_upload_id
MINUS
SELECT col1, col2 ...
FROM table
WHERE upload_id = first_upload_ID;
Both of these approaches returned several hundred thousand rows, making the results difficult to analyze.
Does anyone have any suggestions on how to approach/simplify this problem? Could I do a self join on several columns that are unique for each upload? If yes, what would that self join look like?
Thank you for the help.
One method that might be useful is to calculate a hash of each record and run a match based on that. It doesn't have to be some super-secure SHA-whatever, just the regular Oracle Ora_Hash(), as long as you're going to get a pretty small chance of hash collisions. Ora_Hash ought to be sufficient with a max_bucket_size of 4,294,967,295.
I'd just run joins between the two sets of hashes. Hash joins (as in the join mechanism) are very efficient.
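A minimal sketch of that hash-based comparison, assuming the batch table is called batch_table (a placeholder for the real table name), the uploaded data lives in col1..col3, and using a '|' separator that would need to be safe for your data:
SELECT a.rid AS first_upload_rowid,
       b.rid AS second_upload_rowid
FROM (
    SELECT rowid AS rid,
           ORA_HASH(col1 || '|' || col2 || '|' || col3, 4294967295) AS h
    FROM batch_table
    WHERE upload_id = :first_upload_id
) a
FULL OUTER JOIN (
    SELECT rowid AS rid,
           ORA_HASH(col1 || '|' || col2 || '|' || col3, 4294967295) AS h
    FROM batch_table
    WHERE upload_id = :second_upload_id
) b
ON a.h = b.h
WHERE a.rid IS NULL OR b.rid IS NULL;
Rows with no hash match on the other side are the ones unique to (or changed in) one of the two uploads.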
Alternatively you could join the two data sets in their entirety; as long as you're using equi-joins and only projecting the identifying ROWIDs from the data sets, it would be broadly equivalent performance-wise, because hashes would be computed on the join columns but only the ROWIDs would have to be stored as well, keeping the hash table size small. The tricky part there is dealing with NULLs in the join.
When doing a join like this, make sure not to include the column containing the upload ID, or any audit data added to the uploaded data. Restrict the joins to the columns that contain the uploaded data. Otherwise the MINUS approach should work well.