How do you flip rows into new columns? - sql

I've got a table that looks like this:
player_id | violation
---------------------
1 | A
1 | A
1 | B
2 | C
3 | D
3 | A
And I want to turn it into this, with a bunch of new columns that refer to the types of violations, and then the sum of the number of each individual type of violation that each player got (not that concerned with what the columns are called; a/b/c/d would work great as well):
player_id | violation_a | violation_b | violation_c | violation_d
-----------------------------------------------------------------
1 | 2 | 1 | 0 | 0
2 | 0 | 0 | 1 | 0
3 | 1 | 0 | 0 | 1
I know how I could do this, but it would take a ton of lines of code, since there are in reality 100+ types of violations. Is there any way (perhaps with a tablefunc()?) that I could do this more concisely than spelling out each of the new 100+ columns that I want and the logic for them each individually?

In pure SQL I don't see how you could avoid declaring the columns yourself. You either have to create subselects or filters in every column ..
SELECT
t.player_id,
count(*) FILTER (WHERE violation = 'A') AS violation_a,
count(*) FILTER (WHERE violation = 'B') AS violation_b,
count(*) FILTER (WHERE violation = 'C') AS violation_c,
count(*) FILTER (WHERE violation = 'D') AS violation_d
FROM t
GROUP BY t.player_id;
.. or create a pivot table with crosstab() from the tablefunc extension (the output columns still have to be declared, and the source query should be ordered):
SELECT *
FROM crosstab(
'SELECT player_id, t2.violation, count(*) FILTER (WHERE t.violation = t2.violation)::INT
FROM t,(SELECT DISTINCT violation FROM t) t2
GROUP BY player_id, t2.violation
ORDER BY 1, 2'
) AS ct(player_id INT,violation_a int,violation_b int,violation_c int,violation_d int);
Demo: db<>fiddle
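If the main goal is to avoid typing 100+ column expressions by hand, one option is to let a small script generate them from the distinct violation values. The following is only a minimal sketch, written in Python with the standard-library sqlite3 module and an in-memory copy of the sample data. The table name t comes from the queries above; build_pivot_sql and the violation_<x> column naming are hypothetical. It uses SUM(CASE ...) instead of FILTER so that it also runs on older SQLite builds, and the generated SQL text would need adapting for PostgreSQL.

```python
import sqlite3

def build_pivot_sql(conn):
    """Generate one conditional-count column per distinct violation value."""
    violations = [row[0] for row in conn.execute(
        "SELECT DISTINCT violation FROM t ORDER BY violation")]
    columns = []
    for v in violations:
        # The value is embedded in an identifier and a literal, so accept
        # only plain alphanumeric values (an assumption of this sketch).
        if not v.isalnum():
            raise ValueError(f"unexpected violation value: {v!r}")
        columns.append(
            f"SUM(CASE WHEN violation = '{v}' THEN 1 ELSE 0 END) "
            f"AS violation_{v.lower()}")
    return ("SELECT player_id, " + ", ".join(columns) +
            " FROM t GROUP BY player_id ORDER BY player_id")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (player_id INTEGER, violation TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "A"), (1, "A"), (1, "B"), (2, "C"), (3, "D"), (3, "A")])

# Run the generated query; each row is (player_id, a, b, c, d).
pivot_rows = conn.execute(build_pivot_sql(conn)).fetchall()
for row in pivot_rows:
    print(row)
```

The column list is still spelled out in the final SQL, but it is produced by the script rather than maintained by hand, so new violation types are picked up the next time the query is generated.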

Related

ORACLE SELECT DISTINCT VALUE ONLY IN SOME COLUMNS

+----+------+-------+---------+---------+
| id | order| value | type | account |
+----+------+-------+---------+---------+
| 1 | 1 | a | 2 | 1 |
| 1 | 2 | b | 1 | 1 |
| 1 | 3 | c | 4 | 1 |
| 1 | 4 | d | 2 | 1 |
| 1 | 5 | e | 1 | 1 |
| 1 | 5 | f | 6 | 1 |
| 2 | 6 | g | 1 | 1 |
+----+------+-------+---------+---------+
I need to select all fields of this table but get only one row for each combination of id + type (I don't care about the value of the type). I have tried a few approaches without success.
As soon as I use DISTINCT I can't include the rest of the fields to make them available in a subquery. If I add ROWNUM in the subquery, every row becomes different, so that does not work either.
Any ideas?
My best query at the moment is this:
SELECT ID, TYPE, VALUE, ACCOUNT
FROM MYTABLE
WHERE ROWID IN (SELECT DISTINCT MAX(ROWID)
FROM MYTABLE
GROUP BY ID, TYPE);
It seems you need to select one (random) row for each distinct combination of id and type. If so, you could do that efficiently using the row_number analytic function. Something like this:
select id, type, value, account
from (
select id, type, value, account,
row_number() over (partition by id, type order by null) as rn
from your_table
)
where rn = 1
;
order by null means an arbitrary (not truly random) ordering of rows within each (id, type) partition; this means that the ordering step, which is usually time-consuming, is trivial in this case. Also, Oracle optimizes such queries (for the filter rn = 1).
Or, in versions 12.1 and higher, you can get the same with the match_recognize clause:
select id, type, value, account
from my_table
match_recognize (
partition by id, type
all rows per match
pattern (^r)
define r as null is null
);
This partitions the rows by id and type, doesn't order them (which means the ordering is arbitrary), and selects just the "first" row from each partition. Note that some analytic functions, including row_number(), require an order by clause (even when we don't care about the ordering) - order by null is customary, but it can't be left out completely. By contrast, in match_recognize you can leave out the order by clause (the default is an arbitrary order). On the other hand, you can't leave out the define clause, even if it imposes no conditions whatsoever. Why Oracle doesn't use a default for that clause too, only Oracle knows.
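To try the "one arbitrary row per (id, type)" idea without an Oracle instance, here is a rough sketch using SQLite through Python's sqlite3 module. This is an assumption-laden stand-in, not Oracle syntax: it needs SQLite 3.25 or later for window functions, SQLite lets the ORDER BY inside OVER be omitted entirely, and the order column is renamed to the hypothetical ord because ORDER is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table "
             "(id INTEGER, ord INTEGER, value TEXT, type INTEGER, account INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "a", 2, 1), (1, 2, "b", 1, 1), (1, 3, "c", 4, 1), (1, 4, "d", 2, 1),
    (1, 5, "e", 1, 1), (1, 5, "f", 6, 1), (2, 6, "g", 1, 1)])

# Number the rows inside each (id, type) partition and keep the first one.
# Which row gets rn = 1 inside a partition is not specified.
rows = conn.execute("""
    SELECT id, type, value, account
    FROM (
        SELECT id, type, value, account,
               row_number() OVER (PARTITION BY id, type) AS rn
        FROM my_table
    )
    WHERE rn = 1
    ORDER BY id, type
""").fetchall()

for row in rows:
    print(row)
```

Only the (id, type) pairs in the result are deterministic; the value column may come from any row of the partition.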

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I have been running some queries using the PostgreSQL SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.
My goal is to perform a query that retrieves only the latest values in the table, along with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by wrapping the query in a CTE and applying count(*) over () to its result:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially in the long, thin way I write SQL), but it makes it clear what is happening. If you come back to it in a few months' time (somebody usually does), it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.
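As a quick check of the join-based query above against the sample data, here is a small sketch that runs it in an in-memory SQLite database via Python's sqlite3 module. SQLite stands in for PostgreSQL here, which is an assumption of the sketch, and the deliberate cartesian product is written as CROSS JOIN instead of "on true"; an ORDER BY is added so the output order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, my_field TEXT, id_reference INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)",
                 [(1, "a", 1), (1, "b", 2), (2, "a", 3), (2, "c", 4), (3, "x", 5)])

# b: latest id_reference per id; c: number of distinct ids (a single row).
latest = conn.execute("""
    SELECT c.id_count AS total, a.id, a.my_field, b.max_id_reference
    FROM my_table AS a
    JOIN (SELECT id, MAX(id_reference) AS max_id_reference
          FROM my_table
          GROUP BY id) AS b
      ON a.id = b.id AND a.id_reference = b.max_id_reference
    CROSS JOIN (SELECT COUNT(DISTINCT id) AS id_count FROM my_table) AS c
    ORDER BY a.id
""").fetchall()

for row in latest:
    print(row)
```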

How to find whether an unordered itemset exists

I am representing itemsets in SQL (SQLite, if relevant). My tables look like this:
ITEMS table:
| ItemId | Name |
| 1 | Ginseng |
| 2 | Honey |
| 3 | Garlic |
ITEMSETS:
| ItemSetId | Name |
| ... | ... |
| 7 | GinsengHoney |
| 8 | HoneyGarlicGinseng |
| 9 | Garlic |
ITEMSETS2ITEMS
| ItemsetId | ItemId |
| ... | .... |
| 7 | 1 |
| 7 | 2 |
| 8 | 2 |
| 8 | 1 |
| 8 | 3 |
As you can see, an Itemset may contain several Items, and this relationship is detailed in the Itemset2Items table.
How can I check whether a new itemset is already in the table, and if so, find its ID?
For instance, I want to check whether "Ginseng, Garlic, Honey" is an existing itemset. The desired answer would be "Yes", because there exists a single ItemsetId which contains exactly these three IDs. Note that the set is unordered: a query for "Honey, Garlic, Ginseng" should behave identically.
How can I do this?
I would recommend that you start by placing the item sets that you want to check into a table, with one row per item.
The question is now about the overlap of this "proposed" item set to other itemsets. The following query provides the answer:
select itemsetid
from (select coalesce(ps.itemid, is2i.itemid) as itemid, is2i.itemsetid,
max(case when ps.itemid is not null then 1 else 0 end) as inProposed,
max(case when is2i.itemid is not null then 1 else 0 end) as inItemset
from ProposedSet ps full outer join
ItemSets2items is2i
on ps.itemid = is2i.itemid
group by coalesce(ps.itemid, is2i.itemid), is2i.itemsetid
) t
group by itemsetid
having min(inProposed) = 1 and min(inItemSet) = 1
This joins all the proposed items with all the itemsets. It then groups by the items in each item set, giving a flag as to whether the item is in the set. Finally, it checks that all items in an item set are in both.
Sounds like you need to find an ItemSet that:
contains all the Items in your wanted list
doesn't contain any other Items
This example will return the ID of such an itemset if it exists.
Note: this solution is for MySQL, but it should work in SQLite once you change the @variables into something SQLite understands, e.g. bind variables.
-- these are the IDs of the items in the new itemset
-- if you add/remove some, make sure to change the IN clauses below
set @id1 = 1;
set @id2 = 2;
-- this is the count of items listed above
set @cnt = 2;
SELECT S.ItemSetId FROM ItemSets S
INNER JOIN
(SELECT ItemsetId, COUNT(*) as C FROM ItemSets2Items
WHERE ItemId IN (@id1, @id2)
GROUP BY ItemsetId
HAVING COUNT(*) = @cnt
) I -- included ingredients
ON I.ItemsetId = S.ItemSetId
LEFT JOIN
(SELECT ItemsetId, COUNT(*) as C FROM ItemSets2Items
WHERE ItemId NOT IN (@id1, @id2)
GROUP BY ItemsetId
) A -- additional ingredients
ON A.ItemsetId = S.ItemSetId
WHERE A.C IS NULL
See fiddle for MySQL.
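For SQLite specifically, the same "contains all wanted items and nothing else" test can be expressed with bind parameters and a single GROUP BY. The sketch below uses Python's sqlite3 module; find_itemset is a hypothetical helper, the row (9, 3) is assumed because the sample data elides it, and the query assumes each (ItemsetId, ItemId) pair appears at most once.

```python
import sqlite3

def find_itemset(conn, item_ids):
    """Return the ItemsetId whose items are exactly item_ids, or None."""
    wanted = sorted(set(item_ids))      # order and duplicates do not matter
    if not wanted:
        return None
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT ItemsetId
        FROM ItemSets2Items
        GROUP BY ItemsetId
        HAVING COUNT(*) = ?
           AND SUM(CASE WHEN ItemId IN ({placeholders}) THEN 1 ELSE 0 END) = ?
    """
    # Same size as the wanted set, and every member is a wanted item.
    row = conn.execute(sql, [len(wanted), *wanted, len(wanted)]).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ItemSets2Items (ItemsetId INTEGER, ItemId INTEGER, "
             "PRIMARY KEY (ItemsetId, ItemId))")
conn.executemany("INSERT INTO ItemSets2Items VALUES (?, ?)",
                 [(7, 1), (7, 2), (8, 2), (8, 1), (8, 3), (9, 3)])

print(find_itemset(conn, [1, 3, 2]))   # Ginseng, Garlic, Honey
print(find_itemset(conn, [1, 3]))      # no itemset has exactly these two
```

Because the input is turned into a set first, "Honey, Garlic, Ginseng" and "Ginseng, Garlic, Honey" produce the same query and the same answer.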

SQL: Find rows where field value differs

I have a database table structured like this (irrelevant fields omitted for brevity):
rankings
------------------
(PK) indicator_id
(PK) alternative_id
(PK) analysis_id
rank
All fields are integers; the first three (labeled "(PK)") are a composite primary key. A given "analysis" has multiple "alternatives", each of which will have a "rank" for each of many "indicators".
I'm looking for an efficient way to compare an arbitrary number of analyses whose ranks for any alternative/indicator combination differ. So, for example, if we have this data:
analysis_id | alternative_id | indicator_id | rank
----------------------------------------------------
1 | 1 | 1 | 4
1 | 1 | 2 | 6
1 | 2 | 1 | 3
1 | 2 | 2 | 9
2 | 1 | 1 | 4
2 | 1 | 2 | 7
2 | 2 | 1 | 4
2 | 2 | 2 | 9
...then the ideal method would identify the following differences:
analysis_id | alternative_id | indicator_id | rank
----------------------------------------------------
1 | 1 | 2 | 6
2 | 1 | 2 | 7
1 | 2 | 1 | 3
2 | 2 | 1 | 4
I came up with a query that does what I want for 2 analysis IDs, but I'm having trouble generalizing it to find differences between an arbitrary number of analysis IDs (i.e. the user might want to compare 2, or 5, or 9, or whatever, and find any rows where at least one analysis differs from any of the others). My query is:
declare @analysisId1 int, @analysisId2 int;
select @analysisId1 = 1, @analysisId2 = 2;
select
r1.indicator_id,
r1.alternative_id,
r1.[rank] as Analysis1Rank,
r2.[rank] as Analysis2Rank
from rankings r1
inner join rankings r2
on r1.indicator_id = r2.indicator_id
and r1.alternative_id = r2.alternative_id
and r2.analysis_id = @analysisId2
where
r1.analysis_id = @analysisId1
and r1.[rank] != r2.[rank]
(It puts the analysis values into additional fields instead of rows. I think either way would work.)
How can I generalize this query to handle many analysis ids? (Or, alternatively, come up with a different, better query to do the job?) I'm using SQL Server 2005, in case it matters.
If necessary, I can always pull all the data out of the table and look for differences in code, but a SQL solution would be preferable since often I'll only care about a few rows out of thousands and there's no point in transferring them all if I can avoid it. (However, if you have a compelling reason not to do this in SQL, say so--I'd consider that a good answer too!)
This will return your desired data set - Now you just need a way to pass the required analysis ids to the query. Or potentially just filter this data inside your application.
select r.* from rankings r
inner join
(
select alternative_id, indicator_id
from rankings
group by alternative_id, indicator_id
having count(distinct rank) > 1
) differ on r.alternative_id = differ.alternative_id
and r.indicator_id = differ.indicator_id
order by r.alternative_id, r.indicator_id, r.analysis_id, r.rank
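One possible way to pass the required analysis ids is to build the IN list from bind parameters in application code. The sketch below does this with Python's sqlite3 module and an in-memory copy of the sample data; SQLite stands in for SQL Server 2005 here, and differing_rows is a hypothetical helper. The id filter is applied both inside the grouping subquery and on the outer rows, so only the selected analyses are compared and returned.

```python
import sqlite3

def differing_rows(conn, analysis_ids):
    """Rows of the selected analyses whose rank differs for some alternative/indicator."""
    ids = sorted(set(analysis_ids))
    if not ids:
        return []
    marks = ", ".join("?" for _ in ids)
    sql = f"""
        SELECT r.analysis_id, r.alternative_id, r.indicator_id, r."rank"
        FROM rankings AS r
        JOIN (
            SELECT alternative_id, indicator_id
            FROM rankings
            WHERE analysis_id IN ({marks})
            GROUP BY alternative_id, indicator_id
            HAVING COUNT(DISTINCT "rank") > 1
        ) AS differ
          ON r.alternative_id = differ.alternative_id
         AND r.indicator_id = differ.indicator_id
        WHERE r.analysis_id IN ({marks})
        ORDER BY r.alternative_id, r.indicator_id, r.analysis_id
    """
    # The id list is bound twice: once for the subquery, once for the outer filter.
    return conn.execute(sql, ids + ids).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE rankings (analysis_id INTEGER, alternative_id INTEGER, '
             'indicator_id INTEGER, "rank" INTEGER, '
             'PRIMARY KEY (indicator_id, alternative_id, analysis_id))')
conn.executemany("INSERT INTO rankings VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 4), (1, 1, 2, 6), (1, 2, 1, 3), (1, 2, 2, 9),
    (2, 1, 1, 4), (2, 1, 2, 7), (2, 2, 1, 4), (2, 2, 2, 9)])

for row in differing_rows(conn, [1, 2]):
    print(row)
```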
I don't know which database you are using; in SQL Server I would do it like this:
-- STEP 1, create temporary table with all the alternative_id , indicator_id combinations with more than one rank:
select alternative_id , indicator_id
into #results
from rankings
group by alternative_id , indicator_id
having count (distinct rank)>1
-- STEP 2, retrieve the data
select a.* from rankings a, #results b
where a.alternative_id = b.alternative_id
and a.indicator_id = b.indicator_id
order by a.alternative_id, a.indicator_id, a.analysis_id
BTW, the other answers given here need the count(distinct rank)!
I think this is what you're trying to do:
select
r.analysis_id,
r.alternative_id,
rm.indicator_id_max,
rm.rank_max
from rankings r
join (
select
analysis_id,
alternative_id,
max(indicator_id) as indicator_id_max,
max(rank) as rank_max
from rankings
group by analysis_id,
alternative_id
having count(*) > 1
) as rm
on r.analysis_id = rm.analysis_id
and r.alternative_id = rm.alternative_id
Your example differences seem wrong. You say you want analyses whose ranks for any alternative/indicator combination differ, but example rows 3 and 4 don't satisfy this criterion. A correct result according to your requirement is:
analysis_id | alternative_id | indicator_id | rank
----------------------------------------------------
1 | 1 | 2 | 6
2 | 1 | 2 | 7
1 | 2 | 1 | 3
2 | 2 | 1 | 4
One query you could try is this:
with distinct_ranks as (
select alternative_id
, indicator_id
, rank
, count (*) as count
from rankings
group by alternative_id
, indicator_id
, rank
having count(*) = 1)
select r.analysis_id
, r.alternative_id
, r.indicator_id
, r.rank
from rankings r
join distinct_ranks d on r.alternative_id = d.alternative_id
and r.indicator_id = d.indicator_id
and r.rank = d.rank
You have to realize that with multiple analyses your criterion is ambiguous. What if analyses 1, 2 and 3 have rank 1 and analyses 4, 5 and 6 have rank 2 for alternative/indicator 1/1? The set (1,2,3) is 'different' from the set (4,5,6), but inside each set there is no difference. What behavior do you want in that case: should they show up or not? My query finds all records whose rank for a given alternative/indicator differs from that of all other analyses, but it is not clear whether this matches your requirement.

Deleting similar columns in SQL

In PostgreSQL 8.3, let's say I have a table called widgets with the following:
id | type | count
--------------------
1 | A | 21
2 | A | 29
3 | C | 4
4 | B | 1
5 | C | 4
6 | C | 3
7 | B | 14
I want to remove duplicates based upon the type column, leaving only those with the highest count column value in the table. The final data would look like this:
id | type | count
--------------------
2 | A | 29
3 | C | 4 /* `id` for this record might be '5' depending on your query */
7 | B | 14
I feel like I'm close, but I can't seem to wrap my head around a query that works to get rid of the duplicate rows.
count is also the name of an aggregate function, so it is safest to quote it when it is used as a column name. PostgreSQL quotes identifiers with double quotes (square brackets are SQL Server syntax). In any case, the following should theoretically work (but I didn't actually test it):
delete from widgets where id not in (
select max(w2.id) from widgets as w2 inner join
(select max(w1."count") as "count", type from widgets as w1 group by w1.type) as sq
on sq."count"=w2."count" and sq.type=w2.type group by w2.type
);
There is a slightly simpler answer than Asaph's, using the EXISTS operator:
DELETE FROM widgets AS a
WHERE EXISTS
(SELECT * FROM widgets AS b
WHERE (a.type = b.type AND b.count > a.count)
OR (b.id > a.id AND a.type = b.type AND b.count = a.count))
EXISTS operator returns TRUE if the following SQL statement returns at least one record.
According to your requirements, it seems to me that this should work (note that rows tied for the highest count within a type are all kept):
DELETE
FROM widgets
WHERE (type, count) NOT IN
(
SELECT type, MAX(count)
FROM widgets
GROUP BY type
)
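To see how ties are handled, the EXISTS-style delete can be tried on the sample data. The following is a minimal sketch using an in-memory SQLite database through Python's sqlite3 module, which is a stand-in for PostgreSQL 8.3 and therefore an assumption. The DELETE is written without a table alias on the target table, and ties on count are broken by keeping the highest id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE widgets (id INTEGER PRIMARY KEY, type TEXT, "count" INTEGER)')
conn.executemany("INSERT INTO widgets VALUES (?, ?, ?)", [
    (1, "A", 21), (2, "A", 29), (3, "C", 4), (4, "B", 1),
    (5, "C", 4), (6, "C", 3), (7, "B", 14)])

# Delete a row if another row of the same type has a higher count,
# or the same count and a higher id (the tie-breaker).
conn.execute("""
    DELETE FROM widgets
    WHERE EXISTS (
        SELECT 1 FROM widgets AS b
        WHERE b.type = widgets.type
          AND (b."count" > widgets."count"
               OR (b."count" = widgets."count" AND b.id > widgets.id))
    )
""")

remaining = conn.execute('SELECT id, type, "count" FROM widgets ORDER BY id').fetchall()
for row in remaining:
    print(row)
```

With this tie-breaker, the surviving row for type C is id 5 rather than id 3, which the question says is acceptable.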