For example, I have a table
Date
ID
Column2
Result of count()
01-01-2022
1
Yes
3
01-02-2022
1
No
2
01-03-2022
2
Yes
5
And when I want to check totals by date I get the same result when I count directly from table1.
select date, sum(Result of count()) from (select date, count(distinct ID) from table1 group by date) group by date
Result: 10
But when I add another column in subquery to count by, like this
select date, sum(Result of count()) from (select date, column2, count(distinct ID) from table1 group by date, column2) group by date
I get result with more values: 13
How an addition of another column affects the rows counted?
Related
I need to make a PIVOT table from Source like this table
FactID UserID Date Product QTY
1 11 01/01/2020 A 600
2 11 02/01/2020 A 400
3 11 03/01/2020 B 500
4 11 04/01/2020 B 200
6 22 06/01/2020 A 1000
7 22 07/01/2020 A 200
8 22 08/01/2020 B 300
9 22 09/01/2020 B 100
Need Pivot Like this where Product QTY is QTY by Last Date
UserID A B
11 400 200
22 200 100
My try PostgreSQL
Select
UserID,
MAX(CASE WHEN Product='A' THEN 'QTY' END) AS 'A',
MAX(CASE WHEN Product='B' THEN 'QTY' END) AS 'B'
FROM table
GROUP BY UserID
And Result
UserID A B
11 600 500
22 1000 300
I mean I get a result by the maximum QTY and not by the maximum date!
What do I need to add to get results by the maximum (last) date ??
Postgres doesn't have "first" and "last" aggregation functions. One method for doing this (without a subquery) uses arrays:
select userid,
(array_agg(qty order by date desc) filter (where product = 'A'))[1] as a,
(array_agg(qty order by date desc) filter (where product = 'B'))[1] as b
from tab
group by userid;
Another method uses select distinct with first_value():
select distinct userid,
first_value(qty) over (partition by userid order by product = 'A' desc, date desc) as a,
first_value(qty) over (partition by userid order by product = 'B' desc, date desc) as b
from tab;
With the appropriate indexes, though, distinct on might be the fastest approach:
select userid,
max(qty) filter (where product = 'A') as a,
max(qty) filter (where product = 'B') as b
from (select distinct on (userid, product) t.*
from tab t
order by userid, product, date desc
) t
group by userid;
In particular, this can use an index on userid, product, date desc). The improvement in performance will be most notable if there are many dates for a given user.
You can use DENSE_RANK() window function in order to filter by the last date per each product and UserID before applying conditional aggregation such as
SELECT UserID,
MAX(CASE WHEN Product='A' THEN QTY END) AS "A",
MAX(CASE WHEN Product='B' THEN QTY END) AS "B"
FROM
(
SELECT t.*, DENSE_RANK() OVER (PARTITION BY Product,UserID ORDER BY Date DESC) AS rn
FROM tab t
) q
WHERE rn = 1
GROUP BY UserID
Demo
presuming all date values are distinct(no ties occur for dates)
I have this table, i want to count the number of orders which are of the same type , and the count of all orders, as follow
ord_id type
1 A
2 B
3 A
4 C
Here is the result :
TYPE COUNT TOTAL
A 2 4
B 1 4
C 1 4
where count column is the count of orders based on their type, and total is the total orders.
Here is my code:
SELECT type, COUNT(*)
FROM
table
where type = 'A'
Union
SELECT type, COUNT(*)
FROM
table
where type = 'b';
Use aggregation and window functions:
select
type,
count(*) cnt,
sum(count(*)) over() total
from mytable
group by type
This seems something simple, but couldn't find an answer for this question last few hours.
I have a table request_state, where "id" is primary key, it can have multiple entries with same state_id. I want to get the id after grouping by state_id using max datetime.
So I tried this, but it gives error "state_id" must appear in the GROUP BY clause or be used in an aggregate function
select id, state_id, max(datetime)
from request_state
group by id
but when I use following query, I get multiple entries with same state_id.
select id, state_id, max(datetime)
from request_state
group by id, state_id
My table:
id state_id date_time
cef 1 Jan 1
ter 1 Jan 2
ijk 1 Jan 3
uuu 2 Feb 1
rrr 2 Feb 2
This is what I want as my result,
id state_id date_time
__ ________ _________
ijk 1 Jan 3
rrr 2 Feb 2
You seem to want:
select max(id) as id, state_id, max(datetime)
from request_state
group by state_id;
If you want the row where datetime is maximum for each state, then use distinct on:
select distinct on (state) rs.*
from request_state rs
order by state, datetime desc;
Try this query:
select id, state_id, date_time from (
select id, state_id, date_time,
row_number() over (partition by state_id order by date_time desc) rn
from tbl
) a where rn = 1
You can use correlated suqbuery :
select t.*
from table t
where date_time = (select max(date_time) from table t1 where t1.state_id = t.state_id);
I have a table where I need to get the ID, for a group(based on ID and Name) with a COUNT(*) = 3, for the latest set of timestamps.
So for example below, I want to retrieve ID 2. As it has 3 rows, and the latest timestamps (even though ID 3 has latest timestamps overall, it doesn't have a count of 3).
But I don't understand how to order by Date, as I cannot contain it in the Group By clause, as it is not the same:
SELECT TOP 1 ID
FROM TABLE
GROUP BY ID,Name
HAVING COUNT(ID) > 2
AND Name = 'ABC'
--ORDER BY Date DESC
Sample Data
ID Name Date
1 ABC 2015-05-27 08:00
1 ABC 2015-05-27 09:00
1 ABC 2015-05-27 10:00
2 ABC 2015-05-27 11:00
2 ABC 2015-05-27 12:00
2 ABC 2015-05-27 13:00
3 ABC 2015-05-27 14:00
3 ABC 2015-05-27 15:00
In SQL server, you need aggregate the columns not on group by list:
SELECT TOP 1 ID
FROM TABLE
WHERE Name = 'ABC'
GROUP BY ID,Name
HAVING COUNT(ID) > 2
ORDER BY MAX(Date) DESC
The name filter should be put before the group by for better performance, if you really need it.
You could do it in a nested query.
Subquery:
SELECT ID
from TABLE
GROUP BY ID
HAVING Count(ID) > 2
That gives you the IDs you want. Put that in another query:
SELECT ID, Data
FROM Table
Where ID in (Subquery)
Order by Date DESC;
First get all desired IDs. That is all IDs having a count > 2. Get the maximum date for each such ID. Then rank these records with ROW_NUMBER, giving the latest ID #1. At last remove all IDs that are not ranked #1.
select name, id
from
(
select
name, id, row_count() over (partition by name order by max_date desc) as rn
from
(
select name, id, max(date) as max_date
from mytable
--where name = 'ABC'
group by name, id
having count(*) > 2
) wanted_ids
) ranked_ids
where rn = 1;
I have daily time series for different companies in my dataset and work with PostgreSQL. My goal is to exclude companies with too incomplete time series. Therefor I want to exclude all companies which have 3 or more consecutive missing values. Furthermore I want to exclude all companies which have more than 50% missing values between their first and final date in the dataset.
We can work with the following example data:
date company value
2012-01-01 A 5
2012-01-01 B 2
2012-01-02 A NULL
2012-01-02 B 2
2012-01-02 C 4
2012-01-03 A NULL
2012-01-03 B NULL
2012-01-03 C NULL
2012-01-04 A NULL
2012-01-04 B NULL
2012-01-04 C NULL
2012-01-05 A 8
2012-01-05 B 9
2012-01-05 C 3
2012-01-06 A 8
2012-01-06 B 9
2012-01-06 C NULL
So A has to be excluded because it has a gap of three consecutive missing values, and C because it has more than 50% missing values between its first and final date.
Combining other answers in this forum I made up the following code:
Add an autoincrement primary key to identify each row
CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date;
CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE;
ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL('id_seq');
UPDATE test SET id = NEXTVAL('id_seq');
ALTER TABLE test ADD PRIMARY KEY (id);
Detect the gaps in the time series
CREATE TABLE to_del AS WITH count3 AS
( SELECT *,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
AS cnt FROM test)
SELECT company, id FROM count3 WHERE cnt >= 3;
Delete the gaps from mytable
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
It seems to achieve to detect and delete gaps of 3 or more consecutive missing values from the time series. But this approach is very cumbersome. And I can't figure out how to additinoally exclude all companies with more than 50% missing values.
Can you think of a more effective solution than mine (I just learn to work with PostgreSQL), that also manages to exclude companies with more than 50% missing values?
I would create only one query:
DELETE FROM mytable
WHERE company in (
SELECT Company
FROM (
SELECT Company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company)
/
COUNT(*)
OVER (PARTITION BY company) As p50
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
A composite index on (company + value) columns can help to gain a maximum speed of this query.
EDIT
The above query doesn't work
I've corrected it slightly, here is a demo: http://sqlfiddle.com/#!15/c9bfe/7
Two things have been changed:
- PARTITION BY company ORDER BY date instead of ORDER BY id
- explicit cast to numeric( because integer have been truncated to 0):
OVER (PARTITION BY company)::numeric
SELECT company, cnt, p50
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
-- WHERE cnt >= 3 OR p50 > 0.5
and now the delete query should work:
DELETE FROM mytable
WHERE company in (
SELECT company
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
For the 50% criteria, you could select all the companies for which the number of distinct dates in lower than half the number of days between the min and max dates.
I have not tested this but it should give you an idea. I used a CTE to make it easier to read.
WITH MinMax AS
(
SELECT Company, DATE_PART('day', AGE(MIN(Date), MAX(Date))) AS calendar_days, COUNT(DISTINCT date) AS days FROM table
GROUP By Company
)
SELECT Company FROM MinMax
WHERE (calendars_days / 2) > days