Unique rows in duplicate data contained in one table - sql

I have a table in an Oracle DB that stores transaction batches uploaded by users. A new upload mechanism has been implemented and I would like to compare its results with those of the old one. A single batch was uploaded using the original mechanism and then again using the new mechanism. I am trying to find the unique rows, i.e. rows that existed in the first upload but do not exist or are different in the second upload, and rows that do not exist in the first upload but do exist or are different in the second. I am dealing with a huge data set (over a million records), which makes this analysis very difficult.
I have tried several approaches:
SELECT col1, col2 ...
FROM table
WHERE upload_id IN (first_upload_ID, second_upload_id)
GROUP BY col1, col2..
HAVING COUNT(*) = 1;
SELECT col1, col2 ...
FROM table
WHERE upload_id = first_upload_ID
MINUS
SELECT col1, col2 ...
FROM table
WHERE upload_id = second_upload_id;
SELECT col1, col2 ...
FROM table
WHERE upload_id = second_upload_id
MINUS
SELECT col1, col2 ...
FROM table
WHERE upload_id = first_upload_ID;
Both of these approaches returned several hundred thousand rows, which makes the results difficult to analyze.
Does anyone have any suggestions on how to approach or simplify this problem? Could I do a self join on several columns that are unique for each upload? If yes, what would that self join look like?
Thank you for the help.

One method that might be useful is to calculate a hash of each record and match on that. It doesn't have to be some super-secure SHA-whatever; the regular Oracle Ora_Hash() will do, as long as the chance of hash collisions stays small. Ora_Hash ought to be sufficient with a max_bucket of 4,294,967,295 (the maximum, which is also the default).
I'd just run joins between the two sets of hashes. Hash joins (as in the join mechanism) are very efficient.
Alternatively, you could join the two data sets in their entirety. As long as you use equi-joins and project only the identifying rowids from the data sets, performance would be broadly equivalent: hashes are computed on the join columns, but only the rowids have to be stored alongside them, which keeps the hash table small. The tricky part there is dealing with nulls in the join.

When doing a join like this make sure not to include columns containing the upload-id, and any audit data added to the uploaded data. Restrict the joins to the columns that contain the uploaded data. The MINUS approach should work well otherwise.
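The advice above (compare only the data columns, in both directions) can be sketched as follows. This is a minimal illustration, not the asker's schema: it runs against an in-memory SQLite database from Python, so `EXCEPT` stands in for Oracle's `MINUS`, and the names `batch_rows`, `col1`, `col2` and `loaded_at` are hypothetical. Tagging each row with the side it came from and sorting on the data columns puts the two versions of a changed row next to each other, which may make a large result easier to read.

```python
import sqlite3

# Hypothetical schema: batch_rows(upload_id, col1, col2, loaded_at).
# upload_id and the audit column loaded_at are deliberately left out of the comparison.
DIFF_SQL = """
    SELECT 'only_in_first' AS side, d.* FROM (
        SELECT col1, col2 FROM batch_rows WHERE upload_id = :first
        EXCEPT                      -- SQLite's spelling of Oracle's MINUS
        SELECT col1, col2 FROM batch_rows WHERE upload_id = :second
    ) AS d
    UNION ALL
    SELECT 'only_in_second' AS side, d.* FROM (
        SELECT col1, col2 FROM batch_rows WHERE upload_id = :second
        EXCEPT
        SELECT col1, col2 FROM batch_rows WHERE upload_id = :first
    ) AS d
    ORDER BY 2, 1                   -- by col1, then by side
"""


def diff_uploads(conn, first_id, second_id):
    """Return rows that appear in only one of the two uploads, tagged with their side."""
    return conn.execute(DIFF_SQL, {"first": first_id, "second": second_id}).fetchall()


def demo_connection():
    """Build a tiny in-memory data set with one changed, one missing and one new row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE batch_rows (upload_id INTEGER, col1 TEXT, col2 TEXT, loaded_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO batch_rows VALUES (?, ?, ?, ?)",
        [
            (1, "a", "x", "2020-01-01"),
            (1, "b", "y", "2020-01-01"),
            (1, "c", "z", "2020-01-01"),
            (2, "a", "x", "2020-02-01"),        # identical apart from audit data
            (2, "b", "CHANGED", "2020-02-01"),  # same key, different content
            (2, "d", "w", "2020-02-01"),        # only in the second upload
        ],
    )
    return conn


if __name__ == "__main__":
    for row in diff_uploads(demo_connection(), 1, 2):
        print(row)
```

Set operators such as `MINUS` and `EXCEPT` treat two NULLs as equal, so this form avoids the null handling that an explicit join would need.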

Related

How to avoid evaluating the same calculated column in Hive query repeatedly

Let's say I have a calculated column:
select str_to_map("k1:1,k2:2,k3:3")["k1"] as col1,
str_to_map("k1:1,k2:2,k3:3")["k2"] as col2,
str_to_map("k1:1,k2:2,k3:3")["k3"] as col3;
How do I 'fix' the column calculation only once and access its value multiple times in the query? The map being calculated is the same, only different keys are being accessed for different columns. Performing the same calculation repeatedly is a waste of resources. This example is purposely made too simple, but the point is I want to know how to avoid this kind of redundancy in Hive in general.
In general, use subqueries; they are calculated once.
select map_col["k1"] as col1,
map_col["k2"] as col2,
map_col["k3"] as col3
from
(
select str_to_map("k1:1,k2:2,k3:3") as map_col from table...
)s;
You can also materialize a query into a table to reuse the data set across different queries or workflows.

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is, is it plausible that so many intermediate tables are necessary? Or is this a sign that I am "doing it wrong"?
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.
While temporary tables are used a lot in SQL Server (mainly to work around its peculiar locking behaviour), they are far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system store intermediate results on disk.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not complete, and this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables, at least create them as temporary or unlogged tables to minimize the IO overhead that comes with writing that data and thus keep as much data in memory as possible.
However I would always start with a single query instead of maintaining many different tables (that all need to be changed if you have to change the structure of the report).
For example your first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions, Postgres will automatically store the data on disk if it gets too big; otherwise it will process it in memory. When you create the tables manually, you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM race_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
If it performs well and produces the right output, I see nothing wrong with it. I do, however, suggest using (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.
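As an illustration of collapsing such a chain into one statement, here is a small sketch that runs against an in-memory SQLite database from Python rather than Postgres. The tables `primary_voters` and `registered_voters` are hypothetical stand-ins for the `gm.*` tables in the question, and the total turnout is computed with a scalar subquery instead of the hard-coded 11362, which is an assumption about what that constant represents.

```python
import sqlite3

# Both aggregation steps live in CTEs; no intermediate table is created.
REPORT_SQL = """
WITH turnout_count AS (
    SELECT race_code, count(*) AS cnt
    FROM primary_voters
    GROUP BY race_code
), total_count AS (
    SELECT race_code, count(*) AS cnt
    FROM registered_voters
    GROUP BY race_code
)
SELECT t1.race_code,
       t1.cnt AS turnout,
       round(t1.cnt * 100.0 / (SELECT sum(cnt) FROM turnout_count), 2)
           AS percent_of_total_turnout,
       round(t1.cnt * 100.0 / t2.cnt, 2) AS turnout_percentage
FROM turnout_count AS t1
JOIN total_count AS t2 ON t1.race_code = t2.race_code
ORDER BY t1.race_code
"""


def build_demo():
    """Create two tiny source tables: 4 + 10 registered voters, 3 + 5 of whom voted."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE registered_voters (race_code TEXT)")
    conn.execute("CREATE TABLE primary_voters (race_code TEXT)")
    conn.executemany(
        "INSERT INTO registered_voters VALUES (?)", [("A",)] * 4 + [("B",)] * 10
    )
    conn.executemany(
        "INSERT INTO primary_voters VALUES (?)", [("A",)] * 3 + [("B",)] * 5
    )
    return conn


def turnout_report(conn):
    """Return (race_code, turnout, percent_of_total_turnout, turnout_percentage) rows."""
    return conn.execute(REPORT_SQL).fetchall()


if __name__ == "__main__":
    for row in turnout_report(build_demo()):
        print(row)
```

If the report structure changes, only this one statement needs editing, which is the maintenance argument made above.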

Optimizing an Oracle SQL query which uses IN clause extensively

I maintain an application where I am trying to optimize an Oracle SQL query in which multiple IN clauses are used. This query is now a blocker, as it takes nearly 3 minutes to execute and severely affects application performance. The query is called from Java code (JDBC) and looks like this:
Select distinct col1,col2,col3,.. colN from Table1
where 1=1 and not(col1 in (idsetone1,idsetone2,... idsetoneN)) or
(col1 in(idsettwo1,idsettwo2,...idsettwoN))....
(col1 in(idsetN1,idsetN2,...idsetNN))
The ID sets are retrieved from a different schema and therefore a JOIN between column1 of table 1 and ID sets is not possible. ID sets have grown over time with use of the application and currently they number more than 10,000 records.
How can I start with optimizing this query ?
I really doubt the statement "The ID sets are retrieved from a different schema and therefore a JOIN between column1 of table 1 and ID sets is not possible." Of course you can join the tables, provided you have select privileges on them.
Anyway, let's assume it is not possible for whatever reason. One solution could be to put all entries into a nested table first and then use that:
CREATE OR REPLACE TYPE NUMBER_TABLE_TYPE AS TABLE OF NUMBER;
Select distinct col1,col2,col3,.. colN from Table1
where 1=1
and (col1 NOT MEMBER OF NUMBER_TABLE_TYPE(idsetone1,idsetone2,... idsetoneN)
OR
col1 MEMBER OF NUMBER_TABLE_TYPE(idsettwo1,idsettwo2,...idsettwoN))
Regarding the max. number of elements Oracle Documentation says: Because a nested table does not have a declared size, you can put as many elements in the constructor as necessary.
I don't know how seriously you can take this statement.
You should put all the items into one temporary table and do an explicit join:
Select your cols
from Table1
left join table_with_items
on table_with_items.id = Table1.col1
where table_with_items.id is null;
Also, that distinct suggests a problem in your business logic or in the architecture of the application. Why do you have duplicate ids? You should get rid of that distinct.
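The temporary-table approach can be sketched end to end like this. It is a minimal illustration using Python and an in-memory SQLite database, not the asker's JDBC code: `table1`, `col2` and the helper name are hypothetical, and in Oracle the staging table would more likely be a global temporary table filled with a batched insert.

```python
import sqlite3


def rows_not_in_id_set(conn, excluded_ids):
    """Stage the ID set in a temp table, then anti-join instead of using NOT IN (...)."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS table_with_items (id INTEGER PRIMARY KEY)"
    )
    conn.execute("DELETE FROM table_with_items")  # start from an empty staging table
    conn.executemany(
        "INSERT OR IGNORE INTO table_with_items (id) VALUES (?)",  # skip duplicate ids
        ((i,) for i in excluded_ids),
    )
    return conn.execute(
        """
        SELECT t.col1, t.col2
        FROM table1 AS t
        LEFT JOIN table_with_items AS x ON x.id = t.col1
        WHERE x.id IS NULL          -- keep only rows with no match in the ID set
        ORDER BY t.col1
        """
    ).fetchall()


def build_demo():
    """Create a five-row table1 to run the anti-join against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 TEXT)")
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?)", [(i, f"r{i}") for i in range(1, 6)]
    )
    return conn


if __name__ == "__main__":
    print(rows_not_in_id_set(build_demo(), [2, 4, 4]))
```

Because the IDs are bound as parameters rather than concatenated into the statement text, the query text stays the same no matter how large the ID set grows.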

Optimize Oracle SELECT on large dataset

I am new to Oracle (working on 11gR2). I have a table TABLE with roughly 10 million records in it, and this pretty simple query:
SELECT t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1
AND t.col11 = val2
AND t.col12 = val3
AND t.col13 = val4
The query is currently taking about 30s/1min.
My question is: how can I improve performance? After a lot of research, I am aware of the most classical ways to improve performance, but I have some problems:
Partitioning: can't really, the table is used in another project and it would be too impactful. Plus it would only delay the problem, given the number of rows inserted into the table every day.
Add an index: The thing is, the columns used in the WHERE clause are not the ones returned by the query (except for one). Thus, I have not been able to find an appropriate index yet. As far as I know, setting an index on 12~13 columns does not make a lot of sense (or does it?).
Materialized views: I must say I never used them, but I understood the maintenance cost is pretty high and my table is updated quite often.
I think the best way to do this would be to add an appropriate index, but I can't find the right columns on which it should be created.
An index makes sense provided that your query results in a small percentage of all rows. You would create one index on all four columns used in the WHERE clause.
If too many records match, then a full table scan will be done. You may be able to speed this up by having this done in parallel threads using the PARALLEL hint:
SELECT /*+parallel(t,4)*/
t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1 AND t.col11 = val2 AND t.col12 = val3 AND t.col13 = val4;
A table with 10 million records is quite a small table. You just need to create an appropriate index. Which columns to index depends on their content. For example, if a column contains only "1" and "0", or "yes" and "no", you shouldn't index it. The more distinct values a column contains, the more effective the index. You can also create an index on two, three or more columns, or a function-based index (in which case the index contains the results of your SQL function, not the column values). You can also create more than one index on a table.
And in any case, if your query selects more than 20 - 30% of all table records, an index will not help.
Also you said that table is used by many people. In this case, you need to cooperate with them to avoid duplicating indexes.
Indexes on each of the columns referenced in the WHERE clause will help performance of a query against a table with a large number of rows, where you are seeking a small subset, even if the columns in the WHERE clause are not returned in the SELECT column list.
The downside of course is that indexes impede insert/update performance. So when loading the table with large numbers of records, you might need to disable/drop the indexes prior to loading and then re-create/enable them again afterwards.
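To see the effect of one index on the four WHERE columns, here is a small sketch using Python and an in-memory SQLite database. It only illustrates the idea: the table `t`, its reduced column list and the index name `idx_t_filter` are hypothetical, and SQLite's `EXPLAIN QUERY PLAN` output is not what Oracle's optimizer would print.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (col1 INTEGER, col2 TEXT, col11 INTEGER, col12 INTEGER, col13 INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [(i % 50, f"row{i}", i % 7, i % 11, i % 13) for i in range(5000)],
)

# The filter columns are mostly different from the returned columns, as in the question.
QUERY = "SELECT col1, col2 FROM t WHERE col1 = ? AND col11 = ? AND col12 = ? AND col13 = ?"
PARAMS = (23, 4, 2, 6)  # matches the row generated for i = 123


def plan(connection):
    """Return the human-readable detail column of SQLite's query plan as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn)  # no index yet, so the whole table is scanned
conn.execute("CREATE INDEX idx_t_filter ON t (col1, col11, col12, col13)")
plan_after = plan(conn)   # the composite index can now satisfy all four predicates
result = conn.execute(QUERY, PARAMS).fetchall()

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
    print("rows:  ", result)
```

The index does not need to contain the returned columns; it only needs to narrow the search to the few matching rows, which are then fetched from the table.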

Comparing two partition's data in hive

I have two partitions in Hive with 9 million records each. The table has 20 columns. Now I want to compare the data sets between the partitions based on an id column. What is the best way to do it, considering that a self join with 9 million records will create performance issues?
Can you try the SMB (sort-merge-bucket) join? It works mostly like merging two sorted lists. However, in this case you will need to create two more tables.
Another option would be to write a UDF to do the same, but that would be a project by itself. The first option is easier.
Did you try the self join and have it fail? I don't think it should be an issue as long as you specify the join condition correctly. 9 million rows is actually not that much for Hive. It can handle large joins by using the join condition as a reduce key, so it doesn't actually do the full cartesian product.
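-- filter each partition before joining; a WHERE on both sides after the full outer join would drop the unmatched rows
select a.foo, b.foo
from (select * from my_table where partition = 'x') a
full outer join (select * from my_table where partition = 'y') b
on a.id <=> b.id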
To do a full comparison of 2 tables (or of 2 partitions of the same table), my experience is that using some checksum mechanism is a more effective and reliable solution than joining the tables (which gives performance problems, as you mentioned, and also causes difficulties when keys are repeated, for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq.
In your case, you would use that program specifying that the "2 tables to compare" are the same and using the "--source-where" and "--destination-where" to indicate which partitions you want to compare. The "--group-by-column" option might also be useful to specify the "id" column.
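The general checksum idea can be sketched in a few lines of Python. This is only an illustration of the technique under simple assumptions (rows already fetched as dictionaries, `id` unique within each partition); it is not a description of how the tool above works internally, and at 9 million rows the hashing would normally be pushed down into Hive itself. The function and column names are hypothetical.

```python
import hashlib


def row_checksum(row, columns):
    """Hash the chosen columns of one row; None gets its own marker so it differs from ''."""
    payload = "\x1f".join("\x00" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compare_partitions(rows_x, rows_y, key, columns):
    """Compare two row sets by key and report missing, extra and changed keys."""
    sums_x = {r[key]: row_checksum(r, columns) for r in rows_x}
    sums_y = {r[key]: row_checksum(r, columns) for r in rows_y}
    shared = sums_x.keys() & sums_y.keys()
    return {
        "only_in_x": sorted(sums_x.keys() - sums_y.keys()),
        "only_in_y": sorted(sums_y.keys() - sums_x.keys()),
        "changed": sorted(k for k in shared if sums_x[k] != sums_y[k]),
    }


if __name__ == "__main__":
    partition_x = [
        {"id": 1, "foo": "a", "bar": 10},
        {"id": 2, "foo": "b", "bar": None},
        {"id": 3, "foo": "c", "bar": 30},
    ]
    partition_y = [
        {"id": 1, "foo": "a", "bar": 10},
        {"id": 2, "foo": "b", "bar": 20},
        {"id": 4, "foo": "d", "bar": 40},
    ]
    print(compare_partitions(partition_x, partition_y, "id", ["foo", "bar"]))
```

Only one short digest per row has to be compared, regardless of how many of the 20 columns take part in the checksum.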