SQL: DISTINCT on multiple columns, but not all the columns

I am trying to write a merge query that reads some data from table A and merges it into table B.
While reading from A, I want to make sure the rows are distinct based on a few columns, so I have a query something like:
select distinct col1, col2, col3 from A.
However, I want the uniqueness to be on the (col1, col2) pair. The above query would still return rows that are unique for (col1, col2, col3) but not for (col1, col2).
Is there any other way to achieve this, or can I use something like
select distinct(col1,col2),col3 from A.
Also, I expect that A would have many records, ~10-20 million. Moreover, I am planning to iterate over multiple such tables A [each itself generated on the fly over a few joins]. Hopefully the solution's performance won't be too bad, but in either case I am looking for a solution that works first; then I may run the queries in parts.
The version is Oracle 10g.
Thanks!
Update: I am wondering if GROUP BY would invoke a sorting operation, which would be more expensive than using DISTINCT?
Update: Answering one of the questions below: with (A, B, C) and (A, B, D) rows, what is your expected result?
=> Assuming I want distinct on (A, B), it is fine to return either of (A, B, C) or (A, B, D), but not both.

If there is more than one distinct value in col3 for a specific (col1, col2) pair, you'll have to choose one of them. I've used max here (but you can use any of the Aggregate Functions (Oracle)):
select col1, col2, max(col3)
from A
group by col1, col2
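As a quick sanity check of this behavior, here is the GROUP BY approach run against the (A, B, C) / (A, B, D) example from the updates, sketched in SQLite via Python rather than Oracle (table and data are made up for illustration; the SQL is the same shape as the query above):

```python
import sqlite3

# Hypothetical table A with (col1, col2, col3), mirroring the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE A (col1 TEXT, col2 TEXT, col3 TEXT)")
con.executemany("INSERT INTO A VALUES (?, ?, ?)",
                [("A", "B", "C"), ("A", "B", "D"), ("X", "Y", "Z")])

# One row per (col1, col2) pair; MAX picks a deterministic representative for col3.
rows = con.execute(
    "SELECT col1, col2, MAX(col3) FROM A GROUP BY col1, col2 ORDER BY col1, col2"
).fetchall()
print(rows)  # [('A', 'B', 'D'), ('X', 'Y', 'Z')]
```

Only one of the (A, B, *) rows survives, which matches the asker's stated requirement.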

Related

SQL Query to Return Number of Distinct Values for Each Column

For background, I am querying a large database with hundreds of attributes per table and millions of rows. I'm trying to figure out which columns have no values at all.
I can do the following:
SELECT
COUNT(DISTINCT col1) as col1,
COUNT(DISTINCT col2) as col2,
...
COUNT(DISTINCT coln) as coln
FROM table;
Whenever a count of 1 is returned, I know that column has no values, great. The issue is that this is incredibly tedious to retype hundreds of times. How can I do this more efficiently? I only have a fundamental understanding of SQL, and the limited capabilities of Athena make this more difficult for me. Thank you.
Edit: Just to clarify, the reason the count needs to be 1 is that this database uses empty strings rather than NULL.
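Since the tedium is in writing the query rather than running it, one pragmatic route is to generate it. A minimal sketch in Python, assuming you can obtain the column names somehow (in Athena they could come from information_schema.columns; here the list and the table name my_table are hard-coded placeholders):

```python
# Generate the repetitive COUNT(DISTINCT ...) select list from a column list.
columns = ["col1", "col2", "col3"]

select_list = ",\n".join(
    f"  COUNT(DISTINCT {c}) AS {c}" for c in columns
)
query = f"SELECT\n{select_list}\nFROM my_table;"
print(query)
```

The generated string can then be pasted into the Athena console, however many columns the table has.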

How to avoid repeatedly evaluating the same calculated column in a Hive query

Let's say I have a calculated column:
select str_to_map("k1:1,k2:2,k3:3")["k1"] as col1,
str_to_map("k1:1,k2:2,k3:3")["k2"] as col2,
str_to_map("k1:1,k2:2,k3:3")["k3"] as col3;
How do I 'fix' the column calculation only once and access its value multiple times in the query? The map being calculated is the same, only different keys are being accessed for different columns. Performing the same calculation repeatedly is a waste of resources. This example is purposely made too simple, but the point is I want to know how to avoid this kind of redundancy in Hive in general.
In general, use subqueries; they are calculated once.
select map_col["k1"] as col1,
map_col["k2"] as col2,
map_col["k3"] as col3
from
(
select str_to_map("k1:1,k2:2,k3:3") as map_col from table...
)s;
You can also materialize a query into a table to reuse the dataset across different queries or workflows.
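The compute-once pattern generalizes beyond Hive. As an illustrative sketch in SQLite via Python, LENGTH() stands in for the expensive str_to_map() call: it is written once, in the derived table, and the outer query references the result three times:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The derived table s computes the expression once; the outer select reuses it.
row = con.execute("""
    SELECT n + 1 AS col1, n * 2 AS col2, n - 1 AS col3
    FROM (SELECT LENGTH('k1:1,k2:2,k3:3') AS n) s
""").fetchone()
print(row)  # (15, 28, 13)
```

Whether the engine truly evaluates the inner expression only once is an optimizer detail, but structurally the expression now appears in exactly one place, which is the point of the Hive answer above.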

Compute aggregate on insert and update rather than on select

In MS Sql Server 2016, I have a view that contains a query that looks something like this:
SELECT Col1, Col2, MAX(Col3) FROM TableA
WHERE Col1 = ???
GROUP BY Col1, Col2
I use this view all over the place, and as a result, most of my queries seem very inefficient, having to do an Index Scan on TableA on almost every SELECT in my system.
What I'd like to know is whether there a way to store that MAX(Col3) so that it is computed during INSERTs and UPDATEs instead of during SELECTs?
Here are some thoughts and why I rejected them:
I don't think I can use a clustered index on the view, because "MAX(Col3)" is not deterministic.
I don't see how I can use a filtered view.
I don't want to use triggers.
I would start with an index on tableA(col1, col2, col3).
Depending on the size of the table and the number of matches on col1, this should be quite reasonable performance-wise.
Otherwise, you might need a summary table that is maintained using triggers.
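To see why the suggested index helps, here is a sketch in SQLite via Python (names invented, SQL Server behaves analogously): the composite index can satisfy the WHERE, the GROUP BY, and the MAX without touching the base table, and the query plan confirms the index is used:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TableA (Col1 INT, Col2 INT, Col3 INT)")
# The suggested composite index: filter column first, then the grouping
# column, then the aggregated column.
con.execute("CREATE INDEX ix_a ON TableA (Col1, Col2, Col3)")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT Col1, Col2, MAX(Col3) FROM TableA
    WHERE Col1 = 1
    GROUP BY Col1, Col2
""").fetchall()
# Each plan row is (id, parent, notused, detail); the detail text should
# mention ix_a rather than a full table scan.
print(plan)
```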

Oracle using wrong index for select query?

So I have a table with two indexes:
Index_1: Column_A, Column_B
Index_2: Column_A, Column_B, Column_C
I am running a select query:
select * from table where (Column_A, Column_B, Column_C)
IN(('1','2','3'), ('4','5','6'),...);
When using "EXPLAIN PLAN FOR" in SQL Developer, it seems to be using the first index instead of the second, despite the second matching the values in my query.
Why is this, and is it hindering my optimal performance?
Expanding on my comment, although we can't analyze Oracle's query planning without knowing anything about the data or seeing the actual plan, the three-column index is not necessarily better suited for your query than is the two-column index, at least if the base table has additional columns (which you are selecting) beyond those three.
Oracle is going to need to read the base table anyway to get the other columns. Supposing that the values in column_C are not too correlated with the values in column_A and column_B, the three-column index will be a lot larger than the two-column index. Using the two-column index may therefore involve reading fewer blocks overall, especially if that index is relatively selective.
Oracle has a very good query planner. If it has good table statistics to work with then it is probably choosing a good plan. Even without good statistics it will probably do a fine job for a query as simple as yours.
I had a similar problem: per the explain plan, Oracle used an index with 2 columns, while my query selects 20 columns from the table with a WHERE clause on 5 columns, as below:
from tab1
where col1 = 'A'
and col2 = 'b'
and col3 = 'c'
and col4 = 'd'
and col5 >= 10
Index1: col1, col2
Index2: col1, col2, col3, col4, col5
If I added a hint to use index2, the query executed much faster than with index1. What could be done so that Oracle chooses index2?
I tried to ensure statistics are gathered, but the system still picks index1 as the best to use.
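As an illustration of pinning a query to a chosen index, here is a sketch in SQLite via Python, whose INDEXED BY clause plays the same role as an Oracle INDEX hint such as /*+ INDEX(tab1 index2) */ (schema invented to mirror the question; this shows the mechanism, not an Oracle fix):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab1 (col1, col2, col3, col4, col5)")
con.execute("CREATE INDEX index1 ON tab1 (col1, col2)")
con.execute("CREATE INDEX index2 ON tab1 (col1, col2, col3, col4, col5)")

# INDEXED BY forces this query onto index2; SQLite errors out if the
# named index cannot satisfy the query at all.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM tab1 INDEXED BY index2
    WHERE col1 = 'A' AND col2 = 'b' AND col3 = 'c' AND col4 = 'd' AND col5 >= 10
""").fetchall()
print(plan)
```

In Oracle itself, beyond hints, the usual levers are fresh statistics and histograms on the skewed columns so the optimizer costs index2 correctly.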

postgres indexed query running slower than expected

I have a table with ~250 columns and 10m rows in it. I am selecting 3 columns with the where clause on an indexed column with an IN query. The number of ids in the IN clause is 2500 and the output is limited by 1000 rows, here's the rough query:
select col1, col2, col3 from table1 where col4 in (1, 2, 3, 4, etc) limit 1000;
This query takes much longer than I expected, ~1s. On an indexed integer column with only 2500 items to match, it seems like this should go faster? Maybe my assumption there is incorrect. Here is the explain:
http://explain.depesz.com/s/HpL9
I did not paste all 2500 ids into the EXPLAIN, just for simplicity, so ignore the fact that there are only 3 in it. Am I missing anything here?
It looks like you're pushing the limits of select x where y IN (...) type queries. You basically have a very large table with a large set of conditions to search on.
I'm guessing you have a B+Tree index; this kind of query is inefficient for that index type. These indexes do well with general-purpose range matching and DB inserts, while performing worse on single-value lookups. Your query is doing ~2500 single-value lookups on this index.
You have a few options to deal with this...
Use Hash indexes (these perform much better on single value lookups)
Help out the query optimizer by adding a few range-based constraints: take the 2500 values, find the min and max, and add them to the query, basically appending where x_id > min_val and x_id < max_val
Run the query in parallel if you have multiple DB backends: simply break up the 2500 constraints into, say, 100 groups, run all the queries at once, and collect the results. It will be better if you group the constraints based on their values.
The first option is certainly easier, but it will come at the price of making your inserts/deletes slower.
The second does not suffer from this, and you don't even need to limit it to one min/max group. You could create N groups with N min and max constraints. Test different groupings and see what works.
The last option is by far the best performing of course.
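Option 2 above can be sketched as plain query-string construction in Python (names and values are illustrative; inclusive bounds are used here, a slight variant of the strict bounds suggested above):

```python
# Bound the IN list with a range predicate so the planner can narrow the
# index scan before checking membership. ids is a stand-in for the 2500 IDs.
ids = [17, 3, 42, 8, 29]

placeholders = ", ".join(str(i) for i in ids)
query = (
    "SELECT col1, col2, col3 FROM table1 "
    f"WHERE col4 >= {min(ids)} AND col4 <= {max(ids)} "
    f"AND col4 IN ({placeholders}) LIMIT 1000"
)
print(query)
```

(In real code the IDs should go through bind parameters rather than string interpolation; they are inlined here only to keep the sketch short.)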
Your query is equivalent to:
select col1, col2, col3
from table1
where
col4 = 1
OR col4 = 2
OR col4 = 3
OR col4 = 4
... repeat 2500 times ...
which is equivalent to:
select col1, col2, col3
from table1
where col4 = 1
UNION
select col1, col2, col3
from table1
where col4 = 2
UNION
select col1, col2, col3
from table1
where col4 = 3
... repeat 2500 times ...
Basically, it means that the index on a table with 10M rows is searched 2500 times. On top of that, if col4 is not unique, then each search is a scan, which may potentially return many rows. Then 2500 intermediate result sets are combined together.
The server doesn't know that the 2500 IDs listed in the IN clause do not repeat. It doesn't know that they are already sorted. So it has little choice but to do 2500 independent index seeks, remember the intermediate results somewhere (like in an implicit temp table), and then combine them together.
If you had a separate table table_with_ids with the list of 2500 IDs, which had a primary or unique key on ID, then the server would know that they are unique and they are sorted.
Your query would be something like this:
select col1, col2, col3
from
table_with_ids
inner join table1 on table_with_ids.id = table1.col4
The server may be able to perform such a join more efficiently.
I would test the performance using pre-populated (temp) table of 2500 IDs and compare it to the original. If the difference is significant, you can investigate further.
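That comparison can be sketched in SQLite via Python (schema and data invented): the long literal IN list is replaced by a keyed temp table that the server can join against, exactly as described above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (col1, col2, col3, col4 INT)")
con.execute("CREATE INDEX ix_col4 ON table1 (col4)")
# 1000 rows; col4 cycles through 0..99, so each value appears 10 times.
con.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)",
                [(i, i * 2, i * 3, i % 100) for i in range(1000)])

# The ID list goes into a temp table with a primary key, so the server
# knows the IDs are unique and ordered.
con.execute("CREATE TEMP TABLE table_with_ids (id INT PRIMARY KEY)")
con.executemany("INSERT INTO table_with_ids VALUES (?)", [(1,), (2,), (3,)])

rows = con.execute("""
    SELECT col1, col2, col3
    FROM table_with_ids
    INNER JOIN table1 ON table_with_ids.id = table1.col4
""").fetchall()
print(len(rows))  # 30: 10 matching rows for each of the 3 IDs
```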
Actually, I'd start with running this simple query:
select col1, col2, col3
from table1
where
col4 = 1
and measure the time it takes to run. You can't get better than this. So, you'll have a lower bound and a clear indication of what you can and can't achieve. Then, maybe change it to where col4 in (1,2) and see how things change.
One more way to somewhat improve performance is to have an index not just on col4, but on (col4, col1, col2, col3). It would still be one index, but on several columns. (In SQL Server I would have columns col1, col2, col3 "included" in the index on col4, rather than part of the index itself, to make it smaller, but I don't think Postgres has such a feature.) In this case the server should be able to retrieve all the data it needs from the index itself, without doing additional look-ups in the main table, making it a so-called "covering" index.
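The covering-index idea can be demonstrated in SQLite via Python (schema invented): with the filter column leading and the selected columns trailing, the plan reports a covering-index search, i.e. no look-ups in the main table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (col1, col2, col3, col4 INT)")
# Index leads with the filter column and carries the selected columns,
# so the query below can be answered from the index alone.
con.execute("CREATE INDEX ix_cover ON table1 (col4, col1, col2, col3)")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT col1, col2, col3 FROM table1 WHERE col4 = 1
""").fetchall()
print(plan[0][3])  # e.g. SEARCH table1 USING COVERING INDEX ix_cover (col4=?)
```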