Why does COUNT ignore GROUP BY? - SQL

I don't understand why my query doesn't group the results of COUNT by the column I specified. Instead it counts all occurrences of outcome_id in the 'un' CTE.
What am I missing here?
The full structure of my sample database and the query I tried are here:
https://www.db-fiddle.com/f/4HuLpTFWaE2yBSQSzf3dX4/4
CREATE TABLE combination (
combination_id integer,
ticket_id integer,
outcomes integer[]
);
CREATE TABLE outcome (
outcome_id integer,
ticket_id integer,
val double precision
);
insert into combination
values
(510,188,'{52,70,10}'),
(511,188,'{52,56,70,18,10}'),
(512,188,'{55,70,18,10}'),
(513,188,'{54,71,18,10}'),
(514,189,'{52,54,71,18,10}'),
(515,189,'{55,71,18,10,54,56}')
;
insert into outcome
values
(52,188,1.3),
(70,188,2.1),
(18,188,2.6),
(56,188,2),
(55,188,1.1),
(54,188,2.2),
(71,188,3),
(10,188,0.5),
(54,189,2.2),
(71,189,3),
(18,189,2.6),
(55,189,2);
with un AS (
SELECT combination_id, unnest(outcomes) outcome
FROM combination c JOIN
outcome o
on o.ticket_id = c.ticket_id
GROUP BY 1,2
)
SELECT combination_id, cnt
FROM (SELECT un.combination_id,
COUNT(CASE WHEN o.val >= 1.3 THEN 1 END) as cnt
FROM un JOIN
outcome o
on o.outcome_id = un.outcome
GROUP BY 1
) x
GROUP BY 1, 2
ORDER BY 1
Expected result should be:
510 2
511 4
512 2
513 3
514 4
515 4

Assuming you have these PK constraints:
CREATE TABLE combination (
combination_id integer PRIMARY KEY
, ticket_id integer
, outcomes integer[]
);
CREATE TABLE outcome (
outcome_id integer
, ticket_id integer
, val double precision
, PRIMARY KEY (ticket_id, outcome_id)
);
and assuming this objective:
For each row in table combination, count the number of array elements in outcomes for which there is at least one row with matching outcome_id and ticket_id in table outcome - and val >= 1.3.
Assuming the above PK, this boils down to a much simpler query:
SELECT c.combination_id, count(*) AS cnt
FROM combination c
JOIN outcome o USING (ticket_id)
WHERE o.outcome_id = ANY (c.outcomes)
AND o.val >= 1.3
GROUP BY 1
ORDER BY 1;
This alternative might be faster with index support:
SELECT c.combination_id, count(*) AS cnt
FROM combination c
CROSS JOIN LATERAL unnest(c.outcomes) AS u(outcome_id)
WHERE EXISTS (
SELECT
FROM outcome o
WHERE o.outcome_id = u.outcome_id
AND o.val >= 1.3
AND o.ticket_id = c.ticket_id -- ??
)
GROUP BY 1
ORDER BY 1;
Plus, it does not require the PK on outcome: any number of matching rows still counts as 1, due to EXISTS.
db<>fiddle here
As always, the best answer depends on the exact definition of setup and requirements.
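As a side note on "might be faster with index support" above: one possibility (my assumption, not something stated in the answer or the fiddle) is a partial index on outcome that matches the EXISTS predicate, since only rows with val >= 1.3 can ever qualify:
-- Sketch only: a partial index to support the EXISTS probe.
CREATE INDEX outcome_val_idx ON outcome (outcome_id, ticket_id)
WHERE val >= 1.3;
Whether this actually helps depends on data volume and distribution; for the handful of sample rows here it makes no measurable difference.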

A simpler version of @forpas's answer:
-- You don't need to join to outcomes in the "with" statement.
with un AS (
SELECT combination_id, ticket_id, unnest(outcomes) outcome
FROM combination c
-- no need to join to outcomes here
GROUP BY 1,2,3
)
SELECT combination_id, cnt FROM
(
SELECT un.combination_id,
COUNT(CASE WHEN o.val >= 1.3 THEN 1 END) as cnt
FROM un
JOIN outcome o on o.outcome_id = un.outcome
and o.ticket_id = un.ticket_id
GROUP BY 1
)x
GROUP BY 1,2
ORDER BY 1
As others have pointed out, the expected result for 514 should be 3 based on your input data.
I'd also like to suggest that using full field names in the group by and order by clauses makes queries easier to debug and maintain going forward.
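For illustration, here is the same query rewritten with explicit column names instead of ordinals (just a sketch of that suggestion; the logic is unchanged):
with un AS (
SELECT c.combination_id, c.ticket_id, unnest(c.outcomes) AS outcome
FROM combination c
GROUP BY combination_id, ticket_id, outcome
)
SELECT x.combination_id, x.cnt
FROM (
SELECT un.combination_id,
COUNT(CASE WHEN o.val >= 1.3 THEN 1 END) AS cnt
FROM un
JOIN outcome o
ON o.outcome_id = un.outcome
AND o.ticket_id = un.ticket_id
GROUP BY un.combination_id
) x
GROUP BY x.combination_id, x.cnt  -- redundant, kept only to mirror the original shape
ORDER BY x.combination_id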

You need to join on ticket_id also:
with un AS (
SELECT c.combination_id, c.ticket_id, unnest(c.outcomes) outcome
FROM combination c JOIN outcome o
on o.ticket_id = c.ticket_id
GROUP BY 1,2,3
)
SELECT combination_id, cnt
FROM (SELECT un.combination_id, un.ticket_id,
COUNT(CASE WHEN o.val >= 1.3 THEN 1 END) as cnt
FROM un JOIN outcome o
on o.outcome_id = un.outcome and o.ticket_id = un.ticket_id
GROUP BY 1,2
) x
GROUP BY 1, 2
ORDER BY 1
See the demo.
Results:
> combination_id | cnt
> -------------: | --:
> 510 | 2
> 511 | 4
> 512 | 2
> 513 | 3
> 514 | 3
> 515 | 4

Find next row with specific value in a given row

The table I have now looks something like this. Each row has a time value (on which the table is sorted in ascending order), and two values which can be replicated across rows:
Key TimeCall R_ID S_ID
-------------------------------------------
1 100 40 A
2 101 50 B
3 102 40 C
4 103 50 D
5 104 60 A
6 105 40 B
I would like to return something like this, wherein for each row, a JOIN is applied such that the S_ID and Time_Call of the next row that shares that row's R_ID is displayed (or is NULL if that row is the last instance of a given R_ID). Example:
Key TimeCall R_ID S_ID NextTimeCall NextS_ID
----------------------------------------------------------------------
1 100 40 A 102 C
2 101 50 B 103 D
3 102 40 C 105 B
4 103 50 D NULL NULL
5 104 60 A NULL NULL
6 105 40 B NULL NULL
Any advice on how to do this would be much appreciated. Right now I'm joining the table on itself and staggering the key on which I'm joining, but I know this won't work for the instance that I've outlined above:
SELECT TOP 10 Table.*, Table2.TimeCall AS NextTimeCall, Table2.S_ID AS NextS_ID
FROM tempdb..#Table AS Table
INNER JOIN tempdb..#Table AS Table2
ON Table.TimeCall + 1 = Table2.TimeCall
So if anyone could show me how to do this such that it can call rows that aren't just consecutive, much obliged!
Use LEAD() function:
SELECT *
, LEAD(TimeCall) OVER (PARTITION BY R_ID ORDER BY [Key]) AS NextTimeCall
, LEAD(S_ID) OVER (PARTITION BY R_ID ORDER BY [Key]) AS NextS_ID
FROM Table2
ORDER BY [Key]
SQLFiddle DEMO
This is only a test example I had close by, but I think it could help you out; just adapt it to your case. It uses LAG and LEAD, and it's for SQL Server.
if object_id('tempdb..#Test') IS NOT NULL drop table #Test
create table #Test (id int, value int)
insert into #Test (id, value)
values
(1, 1),
(1, 2),
(1, 3)
select id,
value,
lag(value, 1, 0) over (order by id) as [PreviousValue],
lead(Value, 1, 0) over (order by id) as [NextValue]
from #Test
Results are
id value PreviousValue NextValue
1 1 0 2
1 2 1 3
1 3 2 0
Use an OUTER APPLY to select the top 1 row that has the same R_ID as the outer query and a higher Key value.
Just change TableName to the actual name of your table in both parts of the query.
SELECT a.*, b.TimeCall AS NextTimeCall, b.S_ID AS NextS_ID
FROM
(
SELECT * FROM TableName
) AS a
OUTER APPLY
(
SELECT TOP 1 b.TimeCall, b.S_ID
FROM TableName AS b
WHERE a.R_ID = b.R_ID
AND b.[Key] > a.[Key] -- the next row has a higher Key
ORDER BY b.[Key] ASC
) AS b
Hope this helps! :)
For older versions, here is one trick using OUTER APPLY:
SELECT a.*,
nexttimecall,
nexts_id
FROM table1 a
OUTER apply (SELECT TOP 1 timecall,s_id
FROM table1 b
WHERE a.r_id = b.r_id
AND a.[key] < b.[key]
ORDER BY [key] ASC) oa (nexttimecall, nexts_id)
LIVE DEMO
Note: It is better to avoid reserved keywords (like Key) as column/table names.
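For example (a purely hypothetical table, only to illustrate the naming point), picking a non-reserved column name means you never have to bracket it:
-- Hypothetical illustration: a non-reserved column name needs no [brackets].
CREATE TABLE CallLog (
CallKey int IDENTITY(1,1) PRIMARY KEY, -- instead of a column named Key
TimeCall int,
R_ID int,
S_ID char(1)
);
SELECT TOP 1 * FROM CallLog ORDER BY CallKey; -- no quoting needed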

Return most results for a match based on a preferential order of keywords

I have built a program to index keywords in text files and put them into the database.
My tables are simple:
FILE_ID|Name
------------
1 | a.txt
2 | b.txt
3 | c.txt
KEYWORD_ID|FILE_ID|Hits
-----------------------
1 | 1 | 55
2 | 1 | 10
3 | 1 | 88
1 | 2 | 44
2 | 2 | 15
1 | 3 | 199
2 | 3 | 1
3 | 3 | 4
There is no primary key in this table. I didn't find it necessary.
Now I'd like to search which file has the most hits for certain keywords.
If I have only one keyword it is easy:
select top 10 *
from words
where keyword_id=1
order by hits desc
Let's say I want to search for files with keywords 1 and 3 (both must be present, and the first keyword has the highest importance). After many hours I came up with this:
select top 10 k.*
from
(
select file_id,
max(hits) as maxhits
from words
where keyword_id=3
group by file_id
) as x
inner join keyword as k
on (k.file_id = x.file_id
and k.keyword=1)
order by k.hits desc
How do I make this right? Especially if I want to search with N keywords. Would it be better to use a temp table and work with that?
If searching with keywords 1 and 3, I want FILE_ID 3 and 1 returned, in this order (because file_id 3 has a higher hit count for keyword 1).
Not sure, but (based on your comment) maybe this is what you need?
(I used the table declaration from @scsimon's answer.)
declare #words table (KEYWORD_ID int, [FILE_ID] int, HITS int)
insert into #words
values
(1,1,55),
(2,1,10),
(3,1,88),
(1,2,44),
(2,2,15),
(1,3,199),
(2,3,1),
(3,3,4)
select [FILE_ID] from (
select *, row_number() over(partition by KEYWORD_ID order by HITS desc) rn from #words
where KEYWORD_ID in(1,3)
)t
where rn = 1
order by hits desc
Assuming that all relevant keywords to be found are stored in a table KTable, which has two columns ID and KEYWORD_ID,
then the query should be:
SELECT
FileID,
SUM(Hits) NetHits,
SUM(Hits/K.ID) WeightedHits
FROM
Words w JOIN Ktable K
on w.KEYWORD_ID= K.KEYWORD_ID
GROUP BY FileID
HAVING count(1) = (SELECT COUNT(1) FROM Ktable )
ORDER BY 2 DESC,3 DESC
The same query using a windowing function will be:
SELECT
DISTINCT
FileID,
NetHitsPerFile
FROM
(
SELECT
FileID,
SUM(Hits) OVER (PARTITION BY FileID ORDER BY K.ID ASC) NetHitsPerFile,
SUM(FileID) OVER(PARTITION BY K.ID) Files,
SUM(Hits/K.ID) OVER (PARTITION BY FileID ORDER BY K.ID ASC) weightedHits
FROM
Words w JOIN Ktable K
on w.KEYWORD_ID= K.KEYWORD_ID
)T
WHERE Files= (SELECT COUNT(1) FROM Ktable)
ORDER BY NetHitsPerFile, weightedHits
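Both variants above presume that KTable is already populated with the keywords to match. A minimal sketch of that setup (my illustration, using the preference order from the question, where a lower ID means higher importance) could be:
-- Hypothetical KTable contents for searching keywords 1 and 3.
-- ID is the preference order (1 = most important), KEYWORD_ID is the keyword.
CREATE TABLE Ktable (ID int, KEYWORD_ID int);
INSERT INTO Ktable (ID, KEYWORD_ID)
VALUES (1, 1), -- keyword 1, highest importance
(2, 3); -- keyword 3, lower importance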
Here's one way. If you only want to see the rows with the KEYWORD_ID values you specify, just add that WHERE clause at the bottom as well. The INNER JOIN limits the FILE_ID values to those which contain both KEYWORD_ID values you specify, by checking that the distinct count equals the number of keywords. Thus, in the example below we limit the result set to 2 KEYWORD_ID values and use the HAVING clause to make sure each FILE_ID has 2 distinct KEYWORD_ID values associated with it.
declare #words table (KEYWORD_ID int, [FILE_ID] int, HITS int)
insert into #words
values
(1,1,55),
(2,1,10),
(3,1,88),
(1,2,44),
(2,2,15),
(1,3,199),
(2,3,1),
(3,3,4)
select top 10 w.*
from #words w
inner join
(select [FILE_ID]
from #words
where KEYWORD_ID in (1,3)
group by [FILE_ID]
having count(distinct KEYWORD_ID) = 2
) x on x.[FILE_ID] = w.[FILE_ID]
order by HITS desc
You can use TOP (n) WITH TIES for your query, as below:
declare #n int = 10 --10 in your scenario
select top (#n) with ties *
from (
select w.*, f.name from #words w inner join #files f
on w.[FILE_ID] = f.[file_id]
) a
order by (row_number() over (partition by a.[file_id] order by hits desc)-1)/#n +1

How to select all records of n groups?

I want to select the records of the top n groups. My data looks like this:
Table 'runner':
id gid status rtime
---------------------------
100 5550 1 2016-08-19
200 5550 2 2016-08-22
300 5550 1 2016-08-30
100 6050 3 2016-09-01
200 6050 1 2016-09-02
100 6250 1 2016-09-11
200 6250 1 2016-09-15
300 6250 3 2016-09-19
Table 'static'
id description env
-------------------------------
100 something 1 somewhere 1
200 something 2 somewhere 2
300 something 3 somewhere 3
The unit id (id) is unique within a group but not unique in its column, because an instance of the group is generated regularly. The group id (gid) is assigned to every unit but is not generated for more than one instance.
Now, combining the tables and selecting everything, or filtering by a specific value, is easy. But how do I select all records of, for example, the first two groups without directly referring to the group ids?
Expected result would be:
id gid description status rtime
--------------------------------------
300 6250 something 2 3 2016-09-19
200 6250 something 1 1 2016-09-15
100 6250 something 3 1 2016-09-11
200 6050 something 2 1 2016-09-02
100 6050 something 1 3 2016-09-01
Extra Question: When I filter for a timeframe like this:
[...]
WHERE runner.rtime BETWEEN '2016-08-25' AND '2016-09-16'
Is there a simple way of ensuring that groups are not cut off, but either appear with all their records or not at all?
You can use a ROW_NUMBER() to do this. First, create a query to rank groups:
SELECT gid, ROW_NUMBER() over (order by gid desc) as RN
FROM Runner
GROUP BY gid
Then use this as a derived table to get your other info, and use a WHERE clause to filter to the number of groups you want to see. For instance, the below would return the top 5 groups (RN <= 5):
SELECT id, R.gid, description, status, rtime
FROM (SELECT gid, ROW_NUMBER() over (order by gid desc) as RN
FROM Runner
GROUP BY gid) G
INNER JOIN Runner R on R.gid = G.gid
INNER JOIN Static S on S.id = R.id
WHERE RN <= 5 --Change this to see more or less groups
For your second question about dates, you can do this with a subquery like so:
SELECT *
FROM Runner
WHERE gid IN (SELECT gid
FROM Runner
WHERE rtime BETWEEN '2016-08-25' AND '2016-09-16')
Hmmm. I suspect this might do what you want:
select top (1) with ties r.*
from runner r
order by min(rtime) over (partition by gid), gid;
At least, this will get the complete first group.
In any case, the idea is to include gid as a key in the order by and to use top with ties.
You can do the following:
with report as(
select n.id,n.gid,m.description,n.status,n.rtime, dense_rank() over(order by gid desc) as RowNum
from #table1 n
inner join #table2 m on n.id = m.id )
select id,gid,description,status,rtime
from report
where RowNum<=2 -- <-- here n=2
order by gid desc,rtime desc
Here is a working demo.
DENSE_RANK looks like an ideal solution here:
Select * From
(
select DENSE_RANK() over (order by gid desc) as D_RN, r.*
from runner r
) A
Where D_RN = 1
No need to use ranking functions (ROW_NUMBER, DENSE_RANK, etc.).
SELECT r.id, gid, [description], [status], rtime
FROM runner r
INNER JOIN static s ON r.id = s.id
WHERE gid IN (
SELECT TOP 2 gid FROM runner GROUP BY gid ORDER BY gid DESC
)
ORDER BY rtime DESC;
The same using CTE:
WITH grouped
AS
(
SELECT TOP 2 gid
FROM runner GROUP BY gid ORDER BY gid DESC
)
SELECT r.id, grouped.gid, [description], [status], rtime
FROM runner r
INNER JOIN static s ON r.id = s.id
INNER JOIN grouped ON r.gid = grouped.gid
ORDER BY rtime DESC;

SQL Server group by set of results

I have a table with data that look like this:
product_id | filter_id
__________________
4525 5066
4525 5068
4525 5091
4526 5066
4526 5068
4526 5094
4527 5066
4527 5068
4527 5094
4528 5066
4528 5071
4528 5078
which is actually groups of three filters for each product, e.g. product 4525 has the filters 5066, 5068 and 5091.
The second and third groups are the exact same set of filters (5066, 5068 and 5094) bound to different products (4526 and 4527).
I want to have each unique filter set only one time (in other words, I want to remove the duplicate sets of filter_ids). I don't really care what happens to the product_id; I only want my unique sets of three filter_ids to be grouped with a key.
For example this will also do:
new_id | filter_id
__________________
1 5066
1 5068
1 5091
2 5066
2 5068
2 5094
3 5066
3 5071
3 5078
I hope I explained it well enough.
Thank you.
Please try the query below, which is a bit longer than I expected. I'm not getting any other logic as of now!
select
distinct filter_id,
DENSE_RANK() over(order by sc) new_id
from(
select *,
(SELECT ' ' + cast(filter_id as nvarchar(10))
FROM tbl b where b.product_id=a.product_id order by filter_id
FOR XML PATH('')) SC
From tbl a
)x
order by new_id
/* -------------- Other Way ------------------ */
SELECT
DENSE_RANK() OVER (ORDER BY PRODUCT_ID) new_id,
filter_id
FROM
Table1
WHERE product_id in (
SELECT MIN(product_id) FROM(
SELECT
product_id,
SUM(filter_id*RN) OVER (PARTITION BY PRODUCT_ID) SM
FROM(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY filter_id) RN
FROM Table1
)x
)xx GROUP BY SM)
Select dense_rank()
over(order by product_id asc),filter_id
from table
If I understand the question correctly, the expected result only has the filter_ids of products 4525, 4526 and 4528, because 4526 and 4527 have the same filter_ids, so only one of those is needed. In that case this query will do:
SELECT product_id
, dense_rank() OVER (ORDER BY PRODUCT_ID) new_id
, filter_id
FROM table1 c
WHERE NOT EXISTS (SELECT 1
FROM table1 a
LEFT JOIN table1 b ON a.product_id < b.product_id
WHERE b.product_id = c.product_id
GROUP BY a.product_id, b.product_id
HAVING COUNT(DISTINCT a.filter_id)
= COUNT(CASE WHEN a.filter_id = b.filter_id THEN 1
ELSE NULL
END));
SQLFiddle demo
To get the result, the first step is to remove the products with a fully duplicated list of filter_ids. To find those products, the subquery checks every pair of products to see if the number of filter_ids in one equals the number of filter_ids shared by the pair.
If you can have products with different numbers of filters, and a product whose filter list is fully contained in the filter list of another product should also be removed from the result, then for example with the base data
product_id | filter_id
-----------+----------
4525 | 5066
4525 | 5068
4525 | 5091
4526 | 5066
4526 | 5068
the expected result is
new_id | filter_id
-------+----------
1 | 5066
1 | 5068
1 | 5091
the query needs to be changed to:
SELECT product_id
, dense_rank() OVER (ORDER BY PRODUCT_ID) new_id
, filter_id
FROM table1 c
WHERE NOT EXISTS (SELECT b.product_id
FROM table1 a
LEFT JOIN table1 b ON a.product_id < b.product_id
WHERE b.product_id IS NOT NULL
AND b.product_id = c.product_id
GROUP BY a.product_id, b.product_id
HAVING COUNT(DISTINCT a.filter_id)
= COUNT(CASE WHEN a.filter_id = b.filter_id THEN 1
ELSE NULL
END)
OR COUNT(DISTINCT b.filter_id)
= COUNT(CASE WHEN a.filter_id = b.filter_id THEN 1
ELSE NULL
END));
SQLFiddle Demo
I came up with a query quite similar to the second one from TechDo, nine hours after him. Even if the result is similar, the idea is different: my idea is to concatenate the values of filter_id with math.
;WITH B AS (
SELECT Product_ID
, filter_id = filter_id - MIN(filter_id) OVER (PARTITION BY NULL)
, _ID = Row_Number() OVER (PARTITION BY Product_ID ORDER BY filter_id) - 1
, N = CEILING(LOG10(MAX(filter_id) OVER (PARTITION BY NULL)
- MIN(filter_id) OVER (PARTITION BY NULL)))
FROM table1 a
), G1 AS (
SELECT Product_ID
, _ID = SUM(Filter_ID * POWER(10, N * _ID))
FROM B
GROUP BY Product_ID
), G2 AS (
SELECT Product_ID = MIN(Product_ID)
FROM G1
GROUP BY _ID
)
SELECT g2.product_id
, dense_rank() OVER (ORDER BY g2.PRODUCT_ID) new_id
, a.filter_id
FROM G2
INNER JOIN table1 a ON g2.product_id = a.product_id;
SQLFiddle demo
The first CTE does a lot of work:
filter_id is reduced (shifted down by the minimum value, so it only needs as many digits as the range of the data requires)
an order number is generated for the filter within the product (_ID)
the maximum number of digits of the reduced filter_id is calculated (N)
In the following CTE those values are used to generate the filter concatenation using SUM: the formula SUM(Filter_ID * POWER(10, N * _ID)) places each reduced filter_id in its own block of N digits. For example, with the data provided by the OP the maximum difference of filter_id is 28, so N is 2 and the results are (the dots are added for readability):
Product_ID _ID
----------- -----------
4525 25.02.00
4526 28.02.00
4527 28.02.00
4528 12.05.00
The formula used makes collisions between different filter groups impossible, but it needs a larger space to be calculated; if the range of filter_id is big, it can hit the limit of the integer type.
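If that integer limit is a concern, one alternative (my sketch, assuming SQL Server 2017+ for STRING_AGG; it is not part of the original answer) is to fingerprint each product's ordered filter list as a string instead of a number:
-- Sketch: build a string fingerprint of the ordered filter set per product,
-- then keep only the lowest product_id per distinct fingerprint.
WITH fingerprints AS (
SELECT product_id,
STRING_AGG(CAST(filter_id AS varchar(10)), ',')
WITHIN GROUP (ORDER BY filter_id) AS filter_set
FROM table1
GROUP BY product_id
)
SELECT DENSE_RANK() OVER (ORDER BY f.filter_set) AS new_id,
t.filter_id
FROM fingerprints f
JOIN table1 t ON t.product_id = f.product_id
WHERE f.product_id = (SELECT MIN(f2.product_id)
FROM fingerprints f2
WHERE f2.filter_set = f.filter_set)
ORDER BY new_id, t.filter_id;
A string fingerprint cannot overflow, and identical filter sets still collapse onto a single key.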

Join a dynamic number of rows in postgres

Let's say I have the following tables:
Batch               Items
---+------          ---+----------+--------
id | size           id | batch_id | quality
---+------          ---+----------+--------
 1 |   10            1 |        1 |       9
 2 |    2            2 |        1 |      10
                     3 |        2 |       1
                     4 |        2 |       2
                     5 |        2 |       1
                     6 |        2 |       9
I have batches of items. They are sent in batches of size batch.size. An item is broken if its quality is <= 3.
I want to know the number of broken items in the last batches sent:
batch_id | broken_item_count
---------+-------------------
       1 | 0
       2 | 2 (and not 3)
My idea is the following:
SELECT batch.id as batch_id, COUNT(broken_items.*) as broken_item_count
FROM batch
INNER JOIN (
SELECT id
FROM items
WHERE items.quality <= 3
ORDER BY items.id asc
LIMIT batch.size -- invalid reference to FROM-clause entry for table "batch"
) broken_items ON broken_items.batch_id = batch.id
(I would ORDER BY items.shipped_at. But for simplicity, I order by items.id)
But this query shows me the error I put as the comment.
How can I limit the number of joined items based on batch.size, which is different for each row?
Is there any other way to achieve what I want ?
SELECT b.id AS batch_id
, count(i.quality < 4 OR NULL) AS broken_item_count
FROM batch b
LEFT JOIN (
SELECT batch_id, quality
, row_number() OVER (PARTITION BY batch_id ORDER BY id DESC) AS rn
FROM items
) i ON i.batch_id = b.id
AND i.rn <= b.size
GROUP BY 1
ORDER BY 1;
SQL Fiddle with added examples.
This is much like @Clodoaldo's answer, but with a couple of differences. Most importantly:
You want to count the broken items in the last batches sent, so we have to ORDER BY id DESC
If there can be batches without items at all you need to use LEFT JOIN instead of a plain JOIN or those batches are excluded.
Consequently, the check i.rn <= b.size needs to move from the WHERE clause to the JOIN clause.
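One more sketch (my own addition, not from either answer; it assumes PostgreSQL 9.3+ for LATERAL and 9.4+ for the FILTER clause): a LATERAL subquery with a per-row LIMIT states "only the last b.size items of each batch" directly:
-- Sketch: LIMIT the joined rows per batch to b.size, newest first.
SELECT b.id AS batch_id
, count(*) FILTER (WHERE i.quality <= 3) AS broken_item_count
FROM batch b
LEFT JOIN LATERAL (
SELECT quality
FROM items
WHERE batch_id = b.id
ORDER BY id DESC -- "last" items of the batch, as above
LIMIT b.size
) i ON true
GROUP BY b.id
ORDER BY b.id;
The LEFT JOIN ... ON true keeps batches without items, which then report a count of 0.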
SQL Fiddle
select
b.id as batch_id,
count(quality <= 3 or null) as broken_item_count
from
batch b
inner join (
select
id, quality, batch_id,
row_number() over (partition by batch_id order by id) as rn
from items
) i on i.batch_id = b.id
where rn <= b.size
group by b.id
order by b.id
From what I understand the count of defective items cannot be greater than the batch size.
EDIT: After reading your comments, I think using the RANK() function, and then join by rank and size should work for you. The following query attempts that.
SELECT b.id,
SUM(CASE WHEN i1.quality <= 3 THEN 1 ELSE 0 END) as broken_item_count
FROM BATCH as b
LEFT JOIN (SELECT i.id, i.batch_id, i.quality,
RANK() OVER(PARTITION BY i.batch_id ORDER BY i.id) as RANK
FROM ITEMS as i) as i1 ON b.id = i1.batch_id AND i1.RANK <= b.size
GROUP BY b.id
EDIT2: Updated the query with a LEFT JOIN to cover the case where there are no samples in some batch.