Select independent distinct with one query - sql

I need to select distinct values from multiple columns in an H2 database so I can build a list of suggestions for the user based on what is in the database. In other words, I need something like
SELECT DISTINCT a FROM table
SELECT DISTINCT b FROM table
SELECT DISTINCT c FROM table
in one query. In case that is not clear enough, I want a query that given this table (columns ID, thing, other, stuff)
0 a 5 p
1 b 5 p
2 a 6 p
3 c 5 p
would result in something like this
a 5 p
b 6 -
c - -
where '-' is an empty entry.

This is a bit complicated, but you can do it as follows:
select max(thing) as thing, max(other) as other, max(stuff) as stuff
from ((select row_number() over (order by id) as seqnum, thing, NULL as other, NULL as stuff
from (select thing, min(id) as id from t group by thing
) t
) union all
(select row_number() over (order by id) as seqnum, NULL, other, NULL
from (select other, min(id) as id from t group by other
) t
) union all
(select row_number() over (order by id) as seqnum, NULL, NULL, stuff
from (select stuff, min(id) as id from t group by stuff
) t
)
) t
group by seqnum
What this does is assign a sequence number to each distinct value in each column. It then combines these together into a single row for each sequence number. The combination uses the union all/group by approach. An alternative formulation uses full outer join.
This version uses the id column to keep the values in the same order as they appear in the original data.
In H2 (which was not originally mentioned in the question), you can use the rownum() function instead. However, you may not be able to specify the ordering.
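As a rough check of the union all/group by approach, here is a small sketch that runs the same query shape against the sample table. It uses Python's built-in sqlite3 module as a stand-in for H2 (an assumption made only so the example is self-contained; it needs SQLite 3.25 or later for window functions), and the table name t follows the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, thing TEXT, other INTEGER, stuff TEXT);
    INSERT INTO t VALUES (0,'a',5,'p'), (1,'b',5,'p'), (2,'a',6,'p'), (3,'c',5,'p');
""")

# Number the distinct values of each column by first appearance (min(id)),
# stack the three lists with UNION ALL, then collapse on the sequence number.
query = """
SELECT max(thing) AS thing, max(other) AS other, max(stuff) AS stuff
FROM (
    SELECT row_number() OVER (ORDER BY id) AS seqnum, thing, NULL AS other, NULL AS stuff
    FROM (SELECT thing, min(id) AS id FROM t GROUP BY thing)
    UNION ALL
    SELECT row_number() OVER (ORDER BY id), NULL, other, NULL
    FROM (SELECT other, min(id) AS id FROM t GROUP BY other)
    UNION ALL
    SELECT row_number() OVER (ORDER BY id), NULL, NULL, stuff
    FROM (SELECT stuff, min(id) AS id FROM t GROUP BY stuff)
)
GROUP BY seqnum
ORDER BY seqnum
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The empty entries from the question come back as NULL (None in Python), because max() ignores the NULL padding from the other two branches.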

Joining two tables at random

I have two tables, one with main data, and another shorter table with additional data.
I would like to join the rows from the shorter table to some of the rows of the main table, at random. For example:
main table:
id | data
1 | apple
2 | banana
3 | cherry
4 | date
5 | elderberry
6 | fig
secondary table:
id | data
1 | accordion
2 | banjo
Desired result:
main | secondary
… ? | accordion
… ? | banjo
I can think of one way to do it, using a lot of pre-processing with CTEs:
WITH
cte1 AS (SELECT data FROM main ORDER BY random() LIMIT 2),
cte2 AS (SELECT row_number() OVER() AS row, data FROM cte1),
cte3 AS (SELECT row_number() OVER () AS row, data FROM secondary)
SELECT *
FROM cte2 JOIN cte3 ON cte2.row=cte3.row;
It works, but is there a more straightforward way of joining two tables at random?
I have attached a fiddle: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=21af08976112c7ac7c18329fa3699b8c&hide=2
A CTE is basically just a re-usable template for a subquery.
So this can be golfed down to two subqueries.
SELECT m.rn, m.data main_data, s.data secondary_data
FROM (SELECT data, ROW_NUMBER() OVER (ORDER BY random()) rn FROM main) m
JOIN (SELECT data, ROW_NUMBER() OVER (ORDER BY random()) rn FROM secondary) s USING (rn)
I could rewrite it to this:
SELECT *
FROM (SELECT row_number() OVER (ORDER BY random()) as id,
data
FROM main
ORDER BY RANDOM()) m1
JOIN secondary s on s.id = m1.id
Update: LIMIT is not needed after looking at @LukStorm's version.
I assumed that you know which table is shorter, so only one column of generated ids is needed.
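To see the row-number pairing in action, here is a minimal sketch of the two-subquery version. It runs on SQLite through Python's sqlite3 module rather than Postgres (an assumption for portability; random() and row_number() exist in both), with the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE secondary (id INTEGER PRIMARY KEY, data TEXT);
    INSERT INTO main VALUES (1,'apple'), (2,'banana'), (3,'cherry'),
                            (4,'date'), (5,'elderberry'), (6,'fig');
    INSERT INTO secondary VALUES (1,'accordion'), (2,'banjo');
""")

# Each side gets a random numbering; joining on that number pairs every
# secondary row with exactly one randomly chosen main row.
pairs = conn.execute("""
    SELECT m.data AS main_data, s.data AS secondary_data
    FROM (SELECT data, row_number() OVER (ORDER BY random()) AS rn FROM main) m
    JOIN (SELECT data, row_number() OVER (ORDER BY random()) AS rn FROM secondary) s
      ON s.rn = m.rn
""").fetchall()
print(pairs)
```

The inner join drops the main rows whose number exceeds the size of the shorter table, which is why no LIMIT is required.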

Automating Repeated Unions

I'm running a query like this:
SELECT id FROM table
WHERE table.type IN (1, 2, 3)
LIMIT 15
This returns an arbitrary sampling; for example, I might get 7 items from class_1 and 3 items from class_2. I would like to return exactly 5 items from each class, and the following code works:
SELECT id FROM (
SELECT id, type FROM table WHERE type = 1 LIMIT 5
UNION
SELECT id, type FROM table WHERE type = 2 LIMIT 5
UNION ...
ORDER BY type ASC)
This gets unwieldy if I want a random sampling from ten classes, instead of only three. What is the best way to do this?
(I'm using Presto/Hive, so any tips for those engines would be appreciated).
Use a function like row_number to do this. This makes the selection independent of the number of types.
SELECT id,type
FROM (SELECT id, type, row_number() over(partition by type order by id) as rnum --adjust the partition by and order by columns as needed
FROM table
) T
WHERE rnum <= 5
I would strongly suggest adding ORDER BY. Anyway, you can do something like:
with
x as (
select
id,
type,
row_number() over(partition by type order by id) as rn
from table
)
select * from x where rn <= 5
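Here is a small sketch of the row_number approach on made-up data. It uses SQLite through Python's sqlite3 module instead of Presto/Hive, and a hypothetical table named items (the question's table name is a reserved word); both are assumptions for the sake of a runnable example.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, type INTEGER)")

# Made-up data: 7 rows of type 1, 3 rows of type 2, 6 rows of type 3.
types = [1] * 7 + [2] * 3 + [3] * 6
conn.executemany("INSERT INTO items (id, type) VALUES (?, ?)",
                 list(enumerate(types, start=1)))

# Number the rows inside each type, then keep at most 5 per type.
picked = conn.execute("""
    SELECT id, type
    FROM (SELECT id, type,
                 row_number() OVER (PARTITION BY type ORDER BY id) AS rnum
          FROM items)
    WHERE rnum <= 5
    ORDER BY type, id
""").fetchall()

per_type = Counter(t for _, t in picked)
print(dict(per_type))
```

A type with fewer than 5 rows simply contributes all of its rows. Ordering the window by a random function instead of id should give a random pick per type, if the engine supports it.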

Could this query be optimized?

My goal is to select a record by two criteria that depend on each other, grouped by another criterion.
I found a solution that selects a record by a single criterion and groups it:
SELECT *
FROM "records"
NATURAL JOIN (
SELECT "group", min("priority1") AS "priority1"
FROM "records"
GROUP BY "group") AS "grouped"
I think I understand the concept behind this search (select the properties you care about and match them against the original table), but when I apply it to two priorities I get this monster:
SELECT *
FROM "records"
NATURAL JOIN (
SELECT *
FROM (
SELECT "group", "priority1", min("priority2") AS "priority2"
FROM "records"
GROUP BY "group", "priority1") AS "grouped2"
NATURAL JOIN (
SELECT "group", min("priority1") AS "priority1"
FROM "records"
NATURAL JOIN (
SELECT "group", "priority1", min("priority2") AS "priority2"
FROM "records"
GROUP BY "group", "priority1") AS "grouped2'"
GROUP BY "group") AS "GroupNested") AS "grouped1"
All I am asking is: couldn't this be written better (optimized and better looking)?
---- Update ----
The goal is to select a single id for each group by priority1 and priority2 (priority1 should be considered first, and then priority2).
Example:
When I have table records with id, group, priority1 and priority2
with data:
id , group , priority1 , priority2
56 , 1 , 1 , 2
34 , 1 , 1 , 3
78 , 1 , 3 , 1
the result should be 56,1,1,2. For each group, search first for the min of priority1, then search for the min of priority2.
I tried to combine max and min together (in one query), but it did not find anything (I do not have this query anymore).
EXISTS() to the rescue! (I did some renaming to avoid reserved words)
SELECT *
FROM zrecords r
WHERE NOT EXISTS (
SELECT *
FROM zrecords nx
WHERE nx.zgroup = r.zgroup
AND ( nx.priority1 < r.priority1
OR nx.priority1 = r.priority1 AND nx.priority2 < r.priority2
)
);
Or, to avoid the AND / OR logic, compare the two-tuples directly:
SELECT *
FROM zrecords r
WHERE NOT EXISTS (
SELECT *
FROM zrecords nx
WHERE nx.zgroup = r.zgroup
AND (nx.priority1, nx.priority2) < (r.priority1 , r.priority2)
);
Maybe this is what you expect:
with dat as (
SELECT "group" grp
, priority1, priority2, id
, row_number() over (partition by "group" order by priority1) +
row_number() over (partition by "group" order by priority2) as lp
FROM "records")
select dt.grp, priority1, priority2, dt.id
from dat dt
join (select min(lp) lpmin, grp from dat group by grp) dt1 on (dt1.lpmin = dt.lp and dt1.grp =dt.grp)
Simply use row_number() . . . once:
select r.*
from (select r.*,
row_number() over (partition by "group" order by priority1, priority2) as seqnum
from records r
) r
where seqnum = 1;
Note: I would advise you to avoid natural join. You can use using instead (if you don't want to explicitly include equality comparisons).
Queries with natural join are very hard to debug, because the join keys are not listed. Worse, "natural" joins do not use properly declared foreign key relationships. They depend simply on columns that have the same name.
In tables that I design, they would never be useful anyway, because almost all tables have createdAt and createdBy columns.
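For comparison, here is a sketch that runs the row_number() version and the tuple-comparison NOT EXISTS version side by side. It uses SQLite through Python's sqlite3 module (an assumption; the question does not name this engine), keeps the quoted "group" column, and adds a second, made-up group to the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER, "group" INTEGER,
                          priority1 INTEGER, priority2 INTEGER);
    INSERT INTO records VALUES (56,1,1,2), (34,1,1,3), (78,1,3,1),
                               (90,2,2,5), (91,2,2,4);  -- group 2 is made up
""")

# Variant 1: rank the rows inside each group and keep the first one.
via_row_number = conn.execute("""
    SELECT id, "group", priority1, priority2
    FROM (SELECT r.*,
                 row_number() OVER (PARTITION BY "group"
                                    ORDER BY priority1, priority2) AS seqnum
          FROM records r)
    WHERE seqnum = 1
    ORDER BY "group"
""").fetchall()

# Variant 2: keep a row only if no row in its group has a smaller
# (priority1, priority2) pair.
via_not_exists = conn.execute("""
    SELECT id, "group", priority1, priority2
    FROM records r
    WHERE NOT EXISTS (
        SELECT 1 FROM records nx
        WHERE nx."group" = r."group"
          AND (nx.priority1, nx.priority2) < (r.priority1, r.priority2))
    ORDER BY "group"
""").fetchall()

print(via_row_number)
print(via_not_exists)
```

The two variants differ on ties: row_number() keeps exactly one row per group, while NOT EXISTS returns every row that shares the minimal pair.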

SQL - Find Differences Between Columns

Let's say I have the following table
Sku | Number | Name
11 1 hat
12 1 hat
13 1 hats
22 2 car
33 3 truck
44 4 boat
45 4 boat
Is there an easy way to find the differences within each Number? For example, with the table above, I would want the query to output:
13 | 1 | hats
The reason for this is because our program processes the rows as long as the number matches the name. If there is an instance where the name doesn't match but the rest of the names do, it will fail.
You can find the most common value (the "mode") using window functions and aggregation:
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum = 1;
You could then find everything that is not the mode using a join. The easier way is just to change the where condition:
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum > 1;
Note: If there are ties in frequency for the most common value, then an arbitrary most common value is chosen.
EDIT:
Actually, if you want the original skus, you might as well do the join:
with modes as (
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum = 1
)
select t.*
from t join
modes
on t.number = modes.number and t.name <> modes.name;
This will ignore NULL values (but the logic can easily be fixed to accommodate them).
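Here is a minimal sketch of the mode-then-join idea on the sample table, run on SQLite through Python's sqlite3 module (an assumption for a self-contained example). To stay on the safe side, the grouping is done in its own CTE, called counts here, before the window function is applied; that split is a restructuring, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (sku INTEGER, number INTEGER, name TEXT);
    INSERT INTO t VALUES (11,1,'hat'), (12,1,'hat'), (13,1,'hats'),
                         (22,2,'car'), (33,3,'truck'),
                         (44,4,'boat'), (45,4,'boat');
""")

outliers = conn.execute("""
    WITH counts AS (          -- how often each name occurs per number
        SELECT number, name, count(*) AS cnt
        FROM t
        GROUP BY number, name
    ),
    modes AS (                -- the most frequent name per number
        SELECT number, name
        FROM (SELECT number, name,
                     row_number() OVER (PARTITION BY number
                                        ORDER BY cnt DESC) AS seqnum
              FROM counts)
        WHERE seqnum = 1
    )
    SELECT t.sku, t.number, t.name
    FROM t
    JOIN modes ON t.number = modes.number AND t.name <> modes.name
    ORDER BY t.sku
""").fetchall()
print(outliers)
```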

Inner join to check tables contain same values not working as expected

SELECT COUNT(1) FROM own.no_preselection_1_a;
SELECT COUNT(1) FROM own.no_preselection_1;
SELECT COUNT(1) FROM
(SELECT DISTINCT * FROM own.no_preselection_1_a
);
SELECT COUNT(1) FROM
(SELECT DISTINCT * FROM own.no_preselection_1
);
SELECT COUNT(1)
FROM OWN.no_preselection_1 t1
INNER JOIN OWN.no_preselection_1_a t2
ON t1.number = t2.number
AND t1.location_number = t2.location_number;
This returns:
COUNT(1)
----------------------
398
COUNT(1)
----------------------
398
COUNT(1)
----------------------
308
COUNT(1)
----------------------
308
COUNT(1)
----------------------
578
If we look at the visual explanation of joins here: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
The problem is in that last query. I would have thought that if the sets are the same (i.e. a perfect overlap), then the inner join would return a set the size of the original sets.
Is the problem that each of the duplicates is creating entries for all of the others? (e.g. if there are 3 dupes of the same value in each table, it would create 3x3 = 9 entries for it?)
What's the solution here? (Just select the distincts to do the inner join on?) Is this a good test for checking if two tables contain the same data?
You have duplicates in your table, as the first and third, and second and fourth counts in your list make clear.
The join is working as it should, so there is no "problem". What are you trying to accomplish? Your goal is not being satisfied by the join.
I would suggest that you annotate your question with some actual data and the results that you want.
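The 3x3 = 9 effect asked about in the question can be checked on a toy case. The sketch below uses two hypothetical tables a and b with made-up rows, on SQLite through Python's sqlite3 module (all assumptions; the original tables are not shown).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (number INTEGER, location_number INTEGER);
    CREATE TABLE b (number INTEGER, location_number INTEGER);
    -- the same four rows in both tables, three of them duplicates
    INSERT INTO a VALUES (1,10), (1,10), (1,10), (2,20);
    INSERT INTO b VALUES (1,10), (1,10), (1,10), (2,20);
""")

# Every duplicate in a matches every duplicate in b: 3 * 3 + 1 * 1 rows.
joined = conn.execute("""
    SELECT count(*)
    FROM a JOIN b
      ON a.number = b.number AND a.location_number = b.location_number
""").fetchone()[0]

# Joining the de-duplicated sets gives one row per distinct pair.
joined_distinct = conn.execute("""
    SELECT count(*)
    FROM (SELECT DISTINCT * FROM a) a
    JOIN (SELECT DISTINCT * FROM b) b
      ON a.number = b.number AND a.location_number = b.location_number
""").fetchone()[0]

print(joined, joined_distinct)
```

So a join count larger than either table is expected whenever the join key is duplicated on both sides.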
If you want to show that the two tables have the same values, you might try a union. Assuming that all the columns are the same in both tables and the columns in a row uniquely identify each row:
select < all the columns in the tables >, min(which) as which
from ((select '1' as which, t.*
from OWN.no_preselection_1 t
) union all
(select '1-a' as which, t.*
from OWN.no_preselection_1_a t
)
) t
group by < all the columns in the tables >
having count(*) <> 1
If you are limited to the two columns and want to see if there are corresponding entries (with duplicates), the following works:
select number, location_number, seqnum, min(which) as which
from ((select '1' as which, number, location_number,
row_number() over (partition by number, location_number order by number) as seqnum
from OWN.no_preselection_1 t
) union all
(select '1-a' as which, number, location_number,
row_number() over (partition by number, location_number order by number) as seqnum
from OWN.no_preselection_1_a
)
) t
group by number, location_number, seqnum
having count(*) <> 1
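Here is a sketch of the seqnum comparison on two hypothetical tables a and b with made-up rows, run on SQLite through Python's sqlite3 module (assumptions throughout). It runs the query twice: with having count(*) <> 1 as above, which lists the entries that have a counterpart, and with count(*) = 1, a variation that lists the entries found in only one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (number INTEGER, location_number INTEGER);
    CREATE TABLE b (number INTEGER, location_number INTEGER);
    INSERT INTO a VALUES (1,10), (1,10), (2,20);   -- (1,10) appears twice
    INSERT INTO b VALUES (1,10), (2,20), (3,30);   -- (3,30) only exists here
""")

template = """
    SELECT number, location_number, seqnum, min(which) AS which
    FROM (
        SELECT 'a' AS which, number, location_number,
               row_number() OVER (PARTITION BY number, location_number
                                  ORDER BY number) AS seqnum
        FROM a
        UNION ALL
        SELECT 'b', number, location_number,
               row_number() OVER (PARTITION BY number, location_number
                                  ORDER BY number)
        FROM b
    )
    GROUP BY number, location_number, seqnum
    HAVING count(*) {condition}
    ORDER BY number, location_number, seqnum
"""

# Entries present in both tables (the k-th copy pairs with the k-th copy).
matched = conn.execute(template.format(condition="<> 1")).fetchall()
# Entries present in only one table, including surplus duplicates.
unmatched = conn.execute(template.format(condition="= 1")).fetchall()

print([row[:3] for row in matched])
print(unmatched)
```

The second copy of (1,10) in a shows up as unmatched because b holds only one copy, which is the point of numbering the duplicates.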