SQL - Find Differences Between Columns - sql

Let's say I have the following table
Sku | Number | Name
11 | 1 | hat
12 | 1 | hat
13 | 1 | hats
22 | 2 | car
33 | 3 | truck
44 | 4 | boat
45 | 4 | boat
Is there an easy way to find the rows whose Name differs from the others within each Number? For example, with the table above, I would want the query to output:
13 | 1 | hats
The reason is that our program processes the rows only as long as every name within a number matches. If one name differs while the rest of the names for that number agree, it will fail.

You can find the most common value (the "mode") using window functions and aggregation:
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum = 1;
You could then find everything that is not the mode using a join. The easier way is just to change the where condition:
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum > 1;
Note: If there are ties in frequency for the most common value, then an arbitrary most common value is chosen.
EDIT:
Actually, if you want the original skus, you might as well do the join:
with modes as (
select t.*
from (select number, name, count(*) as cnt,
row_number() over (partition by number order by count(*) desc) as seqnum
from t
group by number, name
) t
where seqnum = 1
)
select t.*
from t join
modes
on t.number = modes.number and t.name <> modes.name;
This will ignore NULL values (but the logic can easily be fixed to accommodate them).
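To check the join approach against the sample table from the question, here is a minimal sketch that runs it in an in-memory SQLite database through Python's sqlite3 module. SQLite is only an assumption made for the demo (the question does not name a database), and the count is computed in a separate CTE before row_number() is applied, which is equivalent here and a little more portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (sku INTEGER, number INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(11, 1, "hat"), (12, 1, "hat"), (13, 1, "hats"), (22, 2, "car"),
     (33, 3, "truck"), (44, 4, "boat"), (45, 4, "boat")],
)

query = """
WITH counts AS (
    -- how often each name occurs within each number
    SELECT number, name, COUNT(*) AS cnt
    FROM t
    GROUP BY number, name
),
modes AS (
    -- keep the most frequent name per number (ties broken arbitrarily)
    SELECT number, name
    FROM (SELECT number, name,
                 ROW_NUMBER() OVER (PARTITION BY number ORDER BY cnt DESC) AS seqnum
          FROM counts) AS ranked
    WHERE seqnum = 1
)
-- original rows whose name differs from the mode of their number
SELECT t.sku, t.number, t.name
FROM t
JOIN modes ON t.number = modes.number AND t.name <> modes.name
"""
outliers = conn.execute(query).fetchall()
print(outliers)  # [(13, 1, 'hats')]
```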

Related

BigQuery to find max losing streak for each person

Assume I have a database like this. The original database has millions of rows for ~1000 unique names.
What I am after is the maximum losing streak for each person when sorted by date.
What I am looking for is a query that returns all unique names in one column and the maximum losing streak each person had in the next one.
Another option to consider
select distinct name, count(*) streak from (
select name, win_loss,
count(*) over win - countif(win_loss = 'LOSS') over win as grp
from your_table
window win as (partition by name order by date)
)
where win_loss = 'LOSS'
group by name, grp
qualify 1 = rank() over(partition by name order by count(*) desc)
if applied to sample data in your question - output is
(Updated - SKIP to be ignored)
You might consider the trick below.
Firstly, generate sequences of loss(1) and others(0) over time per user.
ABC - 111011110
XYZ - 1111111011
If the sequence is split with delimiter 0, you will get multiple losing-streaks.
Find a sequence with max length from losing-streaks array.
SELECT Name,
(SELECT MAX(LENGTH(r)) FROM UNNEST(SPLIT(results, '0')) r) AS losing_streaks
FROM (
SELECT Name,
STRING_AGG(
CASE WinLoss
WHEN 'LOSS' THEN '1'
WHEN 'SKIP' THEN NULL
ELSE '0'
END, '' ORDER BY Date
) AS results
FROM sample_table
GROUP BY 1
);
+------+----------------+
| Name | losing_streaks |
+------+----------------+
| ABC | 4 |
| XYZ | 7 |
+------+----------------+
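The string-splitting idea is independent of BigQuery, so it can be sketched in a few lines of plain Python. The two outcome lists are hypothetical data, built only to reproduce the sequences 111011110 and 1111111011 shown above (with one SKIP added for ABC); the function name is made up for illustration.

```python
def max_losing_streak(outcomes):
    """Longest run of LOSS in an ordered list of outcomes, ignoring SKIP."""
    # Encode LOSS as '1', anything else as '0', and drop SKIP entirely.
    encoded = "".join("1" if o == "LOSS" else "0" for o in outcomes if o != "SKIP")
    # Splitting on '0' leaves one piece per losing streak (possibly empty).
    return max(len(run) for run in encoded.split("0"))

# Hypothetical per-person results, already sorted by date.
abc = ["LOSS"] * 3 + ["WIN"] + ["LOSS"] * 2 + ["SKIP"] + ["LOSS"] * 2 + ["WIN"]
xyz = ["LOSS"] * 7 + ["WIN"] + ["LOSS"] * 2

streaks = {"ABC": max_losing_streak(abc), "XYZ": max_losing_streak(xyz)}
print(streaks)  # {'ABC': 4, 'XYZ': 7}
```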
SKIP to be ignored
ABC has a losing streak of 3 before the first win and then a losing streak of 4 before the next one (ignoring SKIP), so the answer has to be 4, not 3.
To address the above, you can use the version below:
select distinct name, count(*) streak from (
select name, win_loss,
countif(win_loss != 'SKIP') over win - countif(win_loss = 'LOSS') over win as grp
from your_table
window win as (partition by name order by date)
)
where win_loss = 'LOSS'
group by name, grp
qualify 1 = rank() over(partition by name order by count(*) desc)
with output
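The grouping trick in the SKIP-aware query can be tried outside BigQuery as well. Below is a rough sketch on SQLite via Python's sqlite3, under a few assumptions: the rows are hypothetical data matching the sequences used earlier, date is stored as a plain day number, countif is replaced by SUM(CASE ...), and qualify (not available in SQLite) is replaced by an outer MAX. For every LOSS row, grp equals the number of non-SKIP, non-LOSS rows seen so far, so consecutive losses share one grp value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (name TEXT, date INTEGER, win_loss TEXT)")

# Hypothetical results per person, in date order.
history = {
    "ABC": ["LOSS"] * 3 + ["WIN"] + ["LOSS"] * 2 + ["SKIP"] + ["LOSS"] * 2 + ["WIN"],
    "XYZ": ["LOSS"] * 7 + ["WIN"] + ["LOSS"] * 2,
}
rows = [(name, day, outcome)
        for name, outcomes in history.items()
        for day, outcome in enumerate(outcomes, start=1)]
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", rows)

query = """
SELECT name, MAX(streak) AS streak
FROM (
    SELECT name, grp, COUNT(*) AS streak
    FROM (
        SELECT name, win_loss,
               SUM(CASE WHEN win_loss <> 'SKIP' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY name ORDER BY date)
             - SUM(CASE WHEN win_loss = 'LOSS' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY name ORDER BY date) AS grp
        FROM your_table
    ) AS tagged
    WHERE win_loss = 'LOSS'
    GROUP BY name, grp
) AS streaks
GROUP BY name
ORDER BY name
"""
max_streaks = conn.execute(query).fetchall()
print(max_streaks)  # [('ABC', 4), ('XYZ', 7)]
```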

Presto SQL - Rank Multiple Conditions for Multiple Columns

I am trying to write a single query (if possible) to rank ids based on multiple conditions.
My table is like this:
id group subgroup value
1 A Q 12
2 A Z 10
3 B Z 14
4 A Z 20
5 B W 20
I tried this query:
SELECT id,
CASE WHEN group = 'A' THEN ROW_NUMBER() OVER (PARTITION BY group ORDER BY SUM(value) DESC) AS rank_group
CASE WHEN group = 'A' AND subgroup = 'Z' THEN ROW_NUMBER() OVER (PARTITION BY group, subgroup ORDER BY SUM(value) DESC) AS rank_subgroup
FROM table
GROUP BY group, subgroup
But ended up with something like this:
id rank_group rank_subgroup
1 1 1
1 2 2
I would like to get each distinct id and return the rank based on the conditions of the case statement, but it looks like adding the needed partition causes a multiplication as the group by is necessary. I could write individual queries for each column, but I'd like to avoid if possible.
Do you want something like this?
select t.*,
dense_rank() over (order by sumg, "group") as rank_group,
dense_rank() over (partition by "group" order by sumsg, subgroup) as rank_subgroup
from (select t.*,
sum(value) over (partition by "group") as sumg,
sum(value) over (partition by "group", subgroup) as sumsg
from t
) t;
This is my best guess at interpreting what you might want.
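To see what this kind of query returns on the sample rows, here is a small sketch using SQLite through Python's sqlite3 rather than Presto. Several things are assumptions made for the demo: the group column is renamed to grp to avoid the reserved word, the table is called t, and the ranks are ordered by descending sum, as in the asker's original attempt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, grp TEXT, subgroup TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, "A", "Q", 12), (2, "A", "Z", 10), (3, "B", "Z", 14),
     (4, "A", "Z", 20), (5, "B", "W", 20)],
)

query = """
SELECT id,
       DENSE_RANK() OVER (ORDER BY sumg DESC, grp) AS rank_group,
       DENSE_RANK() OVER (PARTITION BY grp ORDER BY sumsg DESC, subgroup) AS rank_subgroup
FROM (
    -- window sums keep one row per id, so no GROUP BY is needed
    SELECT id, grp, subgroup,
           SUM(value) OVER (PARTITION BY grp) AS sumg,
           SUM(value) OVER (PARTITION BY grp, subgroup) AS sumsg
    FROM t
) AS totals
ORDER BY id
"""
ranks = conn.execute(query).fetchall()
print(ranks)  # [(1, 1, 2), (2, 1, 1), (3, 2, 2), (4, 1, 1), (5, 2, 1)]
```

Because the sums are computed as window functions instead of with GROUP BY, every id keeps its own row, which avoids the row multiplication described in the question.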

Oracle ListaGG, Top 3 most frequent values, given in one column, grouped by ID

I have a problem with an SQL query. It can probably be done in "plain" SQL, but I am fairly sure I need some kind of group concatenation. I can't use MySQL; the target is an Oracle database, so the Oracle dialect is fine. Let's say we have the following entity:
Table: Veterinarian visits
Visit_Id,
Animal_id,
Veterinarian_id,
Sickness_code
Let's say there are 100 visits (100 visit_id values) and each animal_id visits around 20 times.
I need to create a SELECT, grouped by Animal_id, with 3 columns:
animal_id
the second shows the number of flu visits for this particular animal (let's say flu is sickness_code = 5)
the third shows the top three sickness codes for each animal (the 3 most frequent codes for this particular animal_id)
How can I do it? The first and second columns are easy, but the third? I know that I need to use LISTAGG from Oracle, OVER PARTITION BY, COUNT and RANK. I tried to tie them together, but it didn't work out as I expected :( What should this query look like?
Here sample data
create table VET as
select
rownum+1 Visit_Id,
mod(rownum+1,5) Animal_id,
cast(NULL as number) Veterinarian_id,
trunc(10*dbms_random.value)+1 Sickness_code
from dual
connect by level <=100;
Query
Basically, the subqueries do the following:
aggregate the count and calculate the flu count (over all records of the animal)
calculate RANK (if you really need only 3 records, use ROW_NUMBER; see the discussion below)
filter the top 3 RANKs
LISTAGGregate the result
with agg as (
select Animal_id, Sickness_code, count(*) cnt,
sum(case when SICKNESS_CODE = 5 then 1 else 0 end) over (partition by animal_id) as cnt_flu
from vet
group by Animal_id, Sickness_code
), agg2 as (
select ANIMAL_ID, SICKNESS_CODE, CNT, cnt_flu,
rank() OVER (PARTITION BY ANIMAL_ID ORDER BY cnt DESC) rnk
from agg
), agg3 as (
select ANIMAL_ID, SICKNESS_CODE, CNT, CNT_FLU, RNK
from agg2
where rnk <= 3
)
select
ANIMAL_ID, max(CNT_FLU) CNT_FLU,
LISTAGG(SICKNESS_CODE||'('||CNT||')', ', ') WITHIN GROUP (ORDER BY rnk) as cnt_lts
from agg3
group by ANIMAL_ID
order by 1;
gives
ANIMAL_ID CNT_FLU CNT_LTS
---------- ---------- ---------------------------------------------
0 1 6(5), 1(4), 9(3)
1 1 1(5), 3(4), 2(3), 8(3)
2 0 1(5), 10(3), 4(3), 6(3), 7(3)
3 1 5(4), 2(3), 4(3), 7(3)
4 1 2(5), 10(4), 1(2), 3(2), 5(2), 7(2), 8(2)
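I intentionally show Sickness_code(visit count) to demonstrate that the top 3 can contain ties, which you should handle.
Check the RANK function. Using ROW_NUMBER is not deterministic in this case.
Note that cnt_flu, as written, sums over the already aggregated rows, so it only shows whether code 5 occurs for the animal (0 or 1). To count the flu visits themselves, sum the group counts instead: sum(case when SICKNESS_CODE = 5 then count(*) else 0 end) over (partition by animal_id).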
I think the most natural way uses two levels of aggregation, along with a dash of window functions here and there:
select vas.animal_id,
sum(case when sickness_code = 5 then cnt else 0 end) as numflu,
listagg(case when seqnum <= 3 then sickness_code end, ',') within group (order by seqnum) as top3sicknesses
from (select animal_id, sickness_code, count(*) as cnt,
row_number() over (partition by animal_id order by count(*) desc) as seqnum
from visits
group by animal_id, sickness_code
) vas
group by vas.animal_id;
This uses the fact that listagg() ignores NULL values.
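The effect of ties on a "top 3" is easy to see on a tiny data set. The sketch below runs on SQLite through Python's sqlite3 instead of Oracle, so LISTAGG is replaced by collecting the codes in Python; the vet rows are hypothetical and chosen so that the first animal has a clean top 3 and the second has a three-way tie for second place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vet (visit_id INTEGER, animal_id INTEGER, sickness_code INTEGER)")

# Hypothetical visits: (animal_id, sickness_code); code 5 stands for flu.
visits = [(1, c) for c in (5, 5, 5, 2, 2, 7, 7, 9)] + [(2, c) for c in (1, 1, 3, 4, 6)]
conn.executemany(
    "INSERT INTO vet VALUES (?, ?, ?)",
    [(i, animal, code) for i, (animal, code) in enumerate(visits, start=1)],
)

query = """
WITH agg AS (
    SELECT animal_id, sickness_code, COUNT(*) AS cnt
    FROM vet
    GROUP BY animal_id, sickness_code
),
ranked AS (
    SELECT animal_id, sickness_code, cnt,
           RANK() OVER (PARTITION BY animal_id ORDER BY cnt DESC) AS rnk,
           -- total flu visits per animal, summed from the group counts
           SUM(CASE WHEN sickness_code = 5 THEN cnt ELSE 0 END)
               OVER (PARTITION BY animal_id) AS cnt_flu
    FROM agg
)
SELECT animal_id, cnt_flu, sickness_code
FROM ranked
WHERE rnk <= 3
ORDER BY animal_id, rnk, sickness_code
"""

# Collect the codes per animal in Python (stand-in for LISTAGG).
top_codes = {}
for animal_id, cnt_flu, code in conn.execute(query):
    top_codes.setdefault(animal_id, (cnt_flu, []))[1].append(code)
print(top_codes)  # {1: (3, [5, 2, 7]), 2: (0, [1, 3, 4, 6])}
```

With RANK, the second animal returns four codes because three of them tie; ROW_NUMBER would cut the list to three but pick among the tied codes arbitrarily.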

Aggregate function like MAX for most common cell in column?

Grouping by the highest number in a column worked great with MAX(), but what if I would like to get the value that is most common?
As example:
ID
100
250
250
300
200
250
So I would like to group by ID and, instead of getting the lowest (MIN) or highest (MAX) number, get the most common one (that would be 250, because it occurs 3 times).
Is there an easy way in SQL Server 2012, or am I forced to add a second SELECT where I COUNT(DISTINCT ID) and add that somehow to my first SELECT statement?
You can use dense_rank to return all the ids with the highest count. This also handles ties for the highest count.
select id from
(select id, dense_rank() over(order by count(*) desc) as rnk from tablename group by id) t
where rnk = 1
A simple way to do what you want uses top and order by:
SELECT top 1 id
FROM t
GROUP BY id
ORDER BY COUNT(*) DESC;
This is a statistic called the mode. Getting the mode and max is a bit challenging in SQL Server. I would approach it as:
WITH cte AS (
SELECT t.id, COUNT(*) AS cnt,
row_number() OVER (ORDER BY COUNT(*) DESC) AS seqnum
FROM t
GROUP BY id
)
SELECT MAX(id) AS themax, MAX(CASE WHEN seqnum = 1 THEN id END) AS MODE
FROM cte;
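The difference between the tie-aware dense_rank query and a "top 1" query only shows up when two values are equally common. Here is a short sketch on SQLite via Python's sqlite3 (an assumption for the demo; the question targets SQL Server 2012), first with the ids from the question and then with two extra, hypothetical rows that create a tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in (100, 250, 250, 300, 200, 250)])

mode_query = """
SELECT id
FROM (SELECT id, DENSE_RANK() OVER (ORDER BY cnt DESC) AS rnk
      FROM (SELECT id, COUNT(*) AS cnt FROM t GROUP BY id) AS counts) AS ranked
WHERE rnk = 1
ORDER BY id
"""

modes_before = [row[0] for row in conn.execute(mode_query)]
print(modes_before)  # [250]

# Hypothetical extra rows: 100 now also occurs three times.
conn.executemany("INSERT INTO t VALUES (?)", [(100,), (100,)])
modes_after = [row[0] for row in conn.execute(mode_query)]
print(modes_after)  # [100, 250]
```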

Select by greatest sum, but without the sum in the result

I need to select the top score of all combined attempts by a player and I need to use a WITH clause.
create table scorecard(
id integer primary key,
player_name varchar(20));
create table scores(
id integer references scorecard,
attempt integer,
score numeric,
primary key(id, attempt));
Sample Data for scorecard:
id player_name
1 Bob
2 Steve
3 Joe
4 Rob
Sample data for scores:
id attempt score
1 1 50
1 2 45
2 1 10
2 2 20
3 1 40
3 2 35
4 1 0
4 2 95
The results would simply look like this:
player_name
Bob
Rob
But would only be Bob if Rob had scored less than 95 total. I've gotten so far as to have the name and the total scores that they got in two columns using this:
select scorecard.player_name, sum(scores.score)
from scorecard
left join scores
on scorecard.id= scores.id
group by scorecard.player_name
order by sum(scores.score) desc;
But how do I get just the names with the highest score (or scores, if tied)?
And remember, it should be using a WITH clause.
Whoever told you to "use a WITH clause" was missing a more efficient solution. To just get the (possibly multiple) winners:
SELECT c.player_name
FROM scorecard c
JOIN (
SELECT id, rank() OVER (ORDER BY sum(score) DESC) AS rnk
FROM scores
GROUP BY 1
) s USING (id)
WHERE s.rnk = 1;
A plain subquery is typically faster than a CTE. If you must use a WITH clause:
WITH top_score AS (
SELECT id, rank() OVER (ORDER BY sum(score) DESC) AS rnk
FROM scores
GROUP BY 1
)
SELECT c.player_name
FROM scorecard c
JOIN top_score s USING (id)
WHERE s.rnk = 1;
You could add a final ORDER BY c.player_name to get a stable sort order, but that's not requested.
The key feature of the query is that you can run a window function like rank() over the result of an aggregate function. Related:
Postgres window function and group by exception
Get the distinct sum of a joined table column
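As a quick check of the rank-over-sum idea on the sample data from the question, here is a sketch that runs on SQLite through Python's sqlite3 instead of Postgres (an assumption for the demo). The sum is computed in an inner subquery named totals, a hypothetical alias, and rank() is then applied to it; the final ORDER BY is only there to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scorecard (id INTEGER PRIMARY KEY, player_name TEXT)")
conn.execute("""CREATE TABLE scores (
    id INTEGER REFERENCES scorecard,
    attempt INTEGER,
    score NUMERIC,
    PRIMARY KEY (id, attempt))""")
conn.executemany("INSERT INTO scorecard VALUES (?, ?)",
                 [(1, "Bob"), (2, "Steve"), (3, "Joe"), (4, "Rob")])
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)",
                 [(1, 1, 50), (1, 2, 45), (2, 1, 10), (2, 2, 20),
                  (3, 1, 40), (3, 2, 35), (4, 1, 0), (4, 2, 95)])

query = """
WITH top_score AS (
    -- rank players by their total score; ties share rank 1
    SELECT id, RANK() OVER (ORDER BY total DESC) AS rnk
    FROM (SELECT id, SUM(score) AS total FROM scores GROUP BY id) AS totals
)
SELECT c.player_name
FROM scorecard c
JOIN top_score s USING (id)
WHERE s.rnk = 1
ORDER BY c.player_name
"""
winners = [row[0] for row in conn.execute(query)]
print(winners)  # ['Bob', 'Rob']
```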
You can try something like the following.
With sumScoresTable as (
SELECT id, sum(score) as sum_scores
FROM scores
group by id),
maxScoresTable as (
SELECT max(sum_scores) as max_scores
FROM sumScoresTable)
select player_name
FROM scorecard
WHERE scorecard.id in (SELECT sumScoresTable.id
from sumScoresTable
where sumScoresTable.sum_scores = (select maxScoresTable.max_scores from maxScoresTable));
Try this code:
WITH CTE AS (
SELECT ID, RANK() OVER(ORDER BY SumScore DESC) As R
FROM (
SELECT ID, SUM(score) AS SumScore
FROM scores
GROUP BY ID ) AS s
)
SELECT player_name
FROM scorecard
WHERE ID IN (SELECT ID FROM CTE WHERE R = 1)