I have a large PostgreSQL table. From it I need to select rows grouped by the Car_id and position columns.
The problem is that the table contains many duplicates, and I need just one row per car: the one with the best position.
I wrote an SQL query that gives me the correct results, but it needs to be modified. Is there a cleaner way to do it?
For each unique Car_id I need the row with the minimum position and, among those, the latest scrapedAt, across all the license plate numbers passed in. I don't care which particular license plate number ends up in the result.
Example of SQL:
select
"eventDate",
"Car_id",
min("position") as "carPosition",
groupArray(concat(toString("scrapedAt"), '_', toString("position"))) as "scrapedAtByPosition",
groupArray(concat("licensePlate", '_', toString("position"))) as "licensePlateByPosition",
groupArray(concat(toString("amazonChoice"), '_', toString("position"))) as "amazonChoicesByPosition",
'organic' as "matchType"
from "Car1_ScrapeHistoryLicensePlate"
inner join (
select "Car_id", max("scrapedAt") as "scrapedAt"
from "Car1_ScrapeHistoryLicensePlate"
where "licensePlate" IN ('ALPR912', 'JGPD831') and "eventDate" between '2022-08-12' and '2022-09-12'
group by "Car_id", "eventDate"
) as t1 USING ("Car_id", "scrapedAt")
where "licensePlate" IN ('ALPR912', 'JGPD831') and "eventDate" between '2022-08-12' and '2022-09-12'
group by "eventDate", "Car_id"
order by "eventDate" desc;
Database records:
eventDate Car_id licensePlate position scrapedAt
---------- ------ ------------ ------- ---------
2022-09-10, 1, APRJSC512, 1, 1660000001
2022-09-10, 1, APRJSC512, 1, 1660000002
2022-09-10, 1, PLBQWN035, 1, 1660000003
2022-09-10, 1, PLBQWN035, 1, 1660000004
2022-09-10, 1, PLBQWN035, 2, 1660000002
2022-09-11, 2, APRJSC512, 1, 1660000011
2022-09-11, 2, APRJSC512, 2, 1660000022
2022-09-11, 2, PLBQWN035, 1, 1660000033
2022-09-11, 2, PLBQWN035, 2, 1660000044
2022-09-11, 2, PLBQWN035, 5, 1660000022
2022-09-12, 3, APRJSC512, 3, 1660000111
2022-09-12, 3, PLBQWN035, 3, 1660000222
2022-09-13, 4, PLBQWN035, 4, 1660001111
2022-09-14, 5, PLBQWN035, 5, 1660011111
Expected result:
eventDate Car_id licensePlate position scrapedAt
---------- ------ ------------ ------- ---------
2022-09-10, 1, PLBQWN035, 1, 1660000004
2022-09-11, 2, PLBQWN035, 1, 1660000033
2022-09-12, 3, PLBQWN035, 3, 1660000222
In PostgreSQL you can use distinct on.
The order by expression list determines which record is picked for each car_id: within each group of rows with the same car_id, the first one in that order is kept.
select distinct on (car_id) * -- or the relevant expression list here
from the_table
order by car_id, position, scrapedat desc;
select eventDate
,Car_id
,licensePlate
,position
,scrapedAt
from
(
select *
,row_number() over(partition by car_id order by position, scrapedat desc) as rn
from t
) t
where rn = 1
eventdate car_id licenseplate position scrapedat
---------- ------ ------------ ------- ---------
2022-09-10, 1, PLBQWN035, 1, 1660000004
2022-09-11, 2, PLBQWN035, 1, 1660000033
2022-09-12, 3, PLBQWN035, 3, 1660000222
2022-09-13, 4, PLBQWN035, 4, 1660001111
2022-09-14, 5, PLBQWN035, 5, 1660011111
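If you want to try the row_number() approach without a PostgreSQL instance, here is a rough sketch that runs the same query against the sample rows using Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer (for window functions); the table name t follows the query above, and the column types are my guess.

```python
import sqlite3

# Sample rows from the question:
# (eventDate, car_id, licensePlate, position, scrapedAt)
rows = [
    ("2022-09-10", 1, "APRJSC512", 1, 1660000001),
    ("2022-09-10", 1, "APRJSC512", 1, 1660000002),
    ("2022-09-10", 1, "PLBQWN035", 1, 1660000003),
    ("2022-09-10", 1, "PLBQWN035", 1, 1660000004),
    ("2022-09-10", 1, "PLBQWN035", 2, 1660000002),
    ("2022-09-11", 2, "APRJSC512", 1, 1660000011),
    ("2022-09-11", 2, "APRJSC512", 2, 1660000022),
    ("2022-09-11", 2, "PLBQWN035", 1, 1660000033),
    ("2022-09-11", 2, "PLBQWN035", 2, 1660000044),
    ("2022-09-11", 2, "PLBQWN035", 5, 1660000022),
    ("2022-09-12", 3, "APRJSC512", 3, 1660000111),
    ("2022-09-12", 3, "PLBQWN035", 3, 1660000222),
    ("2022-09-13", 4, "PLBQWN035", 4, 1660001111),
    ("2022-09-14", 5, "PLBQWN035", 5, 1660011111),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (eventDate TEXT, car_id INTEGER, licensePlate TEXT, "
    "position INTEGER, scrapedAt INTEGER)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)

# Number the rows of each car: best (lowest) position first,
# and the latest scrape first among equal positions. Keep row 1.
best = conn.execute("""
    SELECT eventDate, car_id, licensePlate, position, scrapedAt
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY car_id
                                  ORDER BY position, scrapedAt DESC) AS rn
        FROM t
    )
    WHERE rn = 1
    ORDER BY car_id
""").fetchall()

for row in best:
    print(row)
```

Cars 4 and 5 show up here because the sketch applies no eventDate or licensePlate filter; put your where clause inside the subquery to restrict the candidates.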
I'm trying to query for the last report each user read and the date it was read.
UserReport
UserId, ReportId, DateRead
1, 2, 2018-01-01
1, 1, 2015-02-12
2, 3, 2016-03-11
3, 2, 2017-04-10
1, 3, 2016-01-01
2, 1, 2018-02-02
To get this for a specific user I can run a query like this:
SELECT TOP 1 *
FROM UserReport
WHERE UserId = 1
ORDER BY DateRead DESC
But I'm having trouble figuring out how to do this for each user. What is throwing me off is the TOP 1.
Expected Result:
UserId, ReportId, DateRead
1, 2, 2018-01-01
2, 1, 2018-02-02
3, 2, 2017-04-10
You could use:
SELECT TOP 1 WITH TIES *
FROM UserReport
ORDER BY ROW_NUMBER() OVER(PARTITION BY UserId ORDER BY DateRead DESC)
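The trick works because every user's most recent row gets row number 1, so all of those rows tie for first place in the ORDER BY and WITH TIES returns them all. To see the row numbers themselves, here is a small sketch using Python's sqlite3 module. SQLite has no TOP ... WITH TIES, so this swaps in the equivalent "filter on rn = 1" form; it assumes SQLite 3.25 or newer, and the column types are my assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserReport (UserId INTEGER, ReportId INTEGER, DateRead TEXT)")
conn.executemany(
    "INSERT INTO UserReport VALUES (?, ?, ?)",
    [
        (1, 2, "2018-01-01"),
        (1, 1, "2015-02-12"),
        (2, 3, "2016-03-11"),
        (3, 2, "2017-04-10"),
        (1, 3, "2016-01-01"),
        (2, 1, "2018-02-02"),
    ],
)

# Row number per user, newest DateRead first.
ranked = conn.execute("""
    SELECT UserId, ReportId, DateRead,
           row_number() OVER (PARTITION BY UserId
                              ORDER BY DateRead DESC) AS rn
    FROM UserReport
    ORDER BY rn, UserId
""").fetchall()

for row in ranked:
    print(row)

# The rows with rn = 1 are exactly the ones WITH TIES would return.
latest = [row[:3] for row in ranked if row[3] == 1]
print(latest)
```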
I have this table:
CREATE TABLE yourtable
(
HevEvenementID INT,
HjvNumeSequJour INT,
HteTypeEvenID INT
);
INSERT INTO yourtable
VALUES (12074, 1, 66), (12074, 2, 66), (12074, 3, 5),
(12074, 4, 7), (12074, 5, 17), (12074, 6, 17),
(12074, 7, 17), (12074, 8, 17), (12074, 9, 17), (12074, 10, 5)
I need to group by consecutive HteTypeEvenID. Right now I am doing this:
SELECT
HevEvenementID,
MAX(HjvNumeSequJour) AS HjvNumeSequJour,
HteTypeEvenID
FROM
(SELECT
HevEvenementID,
HjvNumeSequJour,
HteTypeEvenID
FROM
yourtable y) AS s
GROUP BY
HevEvenementID, HteTypeEvenID
ORDER BY
HevEvenementID,HjvNumeSequJour, HteTypeEvenID
which returns this:
HevEvenementID HjvNumeSequJour HteTypeEvenID
---------------------------------------------
12074 2 66
12074 4 7
12074 9 17
12074 10 5
I need to group by consecutive HteTypeEvenID, to get this result:
HevEvenementID HjvNumeSequJour HteTypeEvenID
----------------------------------------------
12074 2 66
12074 3 5
12074 4 7
12074 9 17
12074 10 5
Any suggestions?
In SQL Server, you can do this with aggregation and difference of row numbers:
select HevEvenementID, HteTypeEvenID,
max(HjvNumeSequJour)
from (select t.*,
row_number() over (partition by HevEvenementID order by HjvNumeSequJour) as seqnum_1,
row_number() over (partition by HevEvenementID, HteTypeEvenID order by HjvNumeSequJour) as seqnum_2
from yourtable t
) t
group by HevEvenementID, HteTypeEvenID, (seqnum_1 - seqnum_2)
order by max(HjvNumeSequJour);
I think the best way to understand how this works is by staring at the results of the subquery. You will see how the difference between the two values defines the groups of adjacent values.
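If you don't have SQL Server handy, the subquery can be inspected with a sketch like the one below, which uses Python's sqlite3 module (assuming SQLite 3.25 or newer for window functions). The table and column names come from the question; the diff column is added only for display.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (HevEvenementID INTEGER, HjvNumeSequJour INTEGER, "
    "HteTypeEvenID INTEGER)"
)
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [(12074, 1, 66), (12074, 2, 66), (12074, 3, 5), (12074, 4, 7),
     (12074, 5, 17), (12074, 6, 17), (12074, 7, 17), (12074, 8, 17),
     (12074, 9, 17), (12074, 10, 5)],
)

numbered_sql = """
    SELECT t.*,
           row_number() OVER (PARTITION BY HevEvenementID
                              ORDER BY HjvNumeSequJour) AS seqnum_1,
           row_number() OVER (PARTITION BY HevEvenementID, HteTypeEvenID
                              ORDER BY HjvNumeSequJour) AS seqnum_2
    FROM yourtable t
"""

# Step 1: look at the two row numbers and their difference.
# The difference stays constant within a run of adjacent equal values.
inner = conn.execute(
    "SELECT HjvNumeSequJour, HteTypeEvenID, seqnum_1, seqnum_2, "
    "seqnum_1 - seqnum_2 AS diff "
    "FROM (" + numbered_sql + ") ORDER BY HjvNumeSequJour"
).fetchall()
for row in inner:
    print(row)

# Step 2: group on (type, difference) to get one row per run.
groups = conn.execute(
    "SELECT HevEvenementID, MAX(HjvNumeSequJour), HteTypeEvenID "
    "FROM (" + numbered_sql + ") "
    "GROUP BY HevEvenementID, HteTypeEvenID, (seqnum_1 - seqnum_2) "
    "ORDER BY MAX(HjvNumeSequJour)"
).fetchall()
print(groups)
```

Note how the two rows with HteTypeEvenID 5 get differences 2 and 8, which is what keeps them in separate groups.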
I am trying to get a dense rank to group sets of data together. In my table I have ID, GRP_SET, SUB_SET, and INTERVAL which simply represents a date field. When records are inserted using an ID they get inserted as GRP_SETs of 3 rows shown as a SUB_SET. As you can see when inserts happen the interval can change slightly before it finishes inserting the set.
Here is some example data and the DRANK column represents what ranking I'm trying to get.
with q as (
select 1 id, 'a' GRP_SET, 1 as SUB_SET, 123 as interval, 1 as DRANK from dual union all
select 1, 'a', 2, 123, 1 from dual union all
select 1, 'a', 3, 124, 1 from dual union all
select 1, 'b', 1, 234, 2 from dual union all
select 1, 'b', 2, 235, 2 from dual union all
select 1, 'b', 3, 235, 2 from dual union all
select 1, 'a', 1, 331, 3 from dual union all
select 1, 'a', 2, 331, 3 from dual union all
select 1, 'a', 3, 331, 3 from dual)
select * from q
Example Data
ID GRP_SET SUB_SET INTERVAL DRANK
1 a 1 123 1
1 a 2 123 1
1 a 3 124 1
1 b 1 234 2
1 b 3 235 2
1 b 2 235 2
1 a 1 331 3
1 a 2 331 3
1 a 3 331 3
Here is the query I have. It gets close, but I seem to need something like:
Partition By: ID
Order within partition by: ID, Interval
Change Rank when: ID, GRP_SET (change)
select
id, GRP_SET, SUB_SET, interval,
DENSE_RANK() over (partition by ID order by id, GRP_SET) as DRANK_TEST
from q
Order by
id, interval
Using the MODEL clause
Behold, for you are pushing your requirements beyond the limits of what is easy to express in "ordinary" SQL. But luckily, you're using Oracle, which features the MODEL clause, a device whose mystery is only exceeded by its power. You could write:
SELECT
id, grp_set, sub_set, interval, drank
FROM (
SELECT id, grp_set, sub_set, interval, 1 drank
FROM q
)
MODEL PARTITION BY (id)
DIMENSION BY (row_number() OVER (ORDER BY interval, sub_set) rn)
MEASURES (grp_set, sub_set, interval, drank)
RULES (
drank[any] = NVL(drank[cv(rn) - 1] +
DECODE(grp_set[cv(rn) - 1], grp_set[cv(rn)], 0, 1), 1)
)
Explanation:
SELECT
id, grp_set, sub_set, interval, drank
FROM (
-- Here, we initialise your "dense rank" to 1
SELECT id, grp_set, sub_set, interval, 1 drank
FROM q
)
-- Then we partition the data set by ID (that's your requirement)
MODEL PARTITION BY (id)
-- We generate row numbers for all rows, ordered by interval and sub_set,
-- such that we can then access rows by number in that particular order
DIMENSION BY (row_number() OVER (ORDER BY interval, sub_set) rn)
-- These are the columns that we want to generate from the MODEL clause
MEASURES (grp_set, sub_set, interval, drank)
-- And the rules are simple: Each "dense rank" value is equal to the
-- previous "dense rank" value + 1, if the grp_set value has changed
RULES (
drank[any] = NVL(drank[cv(rn) - 1] +
DECODE(grp_set[cv(rn) - 1], grp_set[cv(rn)], 0, 1), 1)
)
Of course, this only works if there are no interleaving events, i.e. there is no grp_set other than 'a' between intervals 123 and 124.
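For comparison, the same "previous rank + 1 whenever grp_set changes" rule can be sketched without MODEL, using LAG to flag the changes and a running SUM to count them. This is a different technique from the MODEL query above, not a translation of it. The sketch runs on SQLite through Python's sqlite3 module (assuming SQLite 3.25 or newer); the changed column is a name I made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q (id INTEGER, grp_set TEXT, sub_set INTEGER, interval INTEGER)")
conn.executemany(
    "INSERT INTO q VALUES (?, ?, ?, ?)",
    [(1, "a", 1, 123), (1, "a", 2, 123), (1, "a", 3, 124),
     (1, "b", 1, 234), (1, "b", 2, 235), (1, "b", 3, 235),
     (1, "a", 1, 331), (1, "a", 2, 331), (1, "a", 3, 331)],
)

ranked = conn.execute("""
    SELECT id, grp_set, sub_set, interval,
           -- Running count of change flags = rank that bumps on each change
           SUM(changed) OVER (PARTITION BY id
                              ORDER BY interval, sub_set
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND CURRENT ROW) AS drank
    FROM (
        SELECT q.*,
               -- 1 when grp_set differs from the previous row (or there is
               -- no previous row, where LAG returns NULL), else 0
               CASE WHEN grp_set = LAG(grp_set) OVER (PARTITION BY id
                                                      ORDER BY interval, sub_set)
                    THEN 0 ELSE 1 END AS changed
        FROM q
    )
    ORDER BY id, interval, sub_set
""").fetchall()

for row in ranked:
    print(row)
```

The same caveat applies: the ordering by interval and sub_set has to keep each set's rows together.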
This might work for you. The complicating factor is that you want the same "DENSE RANK" for intervals 123 and 124 and for intervals 234 and 235. So we'll truncate them down to a multiple of 10 for purposes of ordering the DENSE_RANK() function:
SELECT id, grp_set, sub_set, interval, drank
, DENSE_RANK() OVER ( PARTITION BY id ORDER BY TRUNC(interval, -1), grp_set ) AS drank_test
FROM q
If you want the intervals to be even closer together in order to be grouped together, then you can multiply the value before truncating. This would group them by 3s (but maybe you don't need them so granular):
SELECT id, grp_set, sub_set, interval, drank
, DENSE_RANK() OVER ( PARTITION BY id ORDER BY TRUNC(interval*10/3, -1), grp_set ) AS drank_test
FROM q
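To get a feel for which intervals end up in the same bucket, here is the arithmetic on its own as a small Python sketch. The trunc_tens helper is hypothetical and only mimics TRUNC(n, -1) for the non-negative numbers used here.

```python
import math

def trunc_tens(value):
    """Mimic TRUNC(value, -1): drop everything below the tens digit."""
    return math.trunc(value / 10) * 10

intervals = [123, 124, 234, 235, 331]

# Buckets of 10: TRUNC(interval, -1)
by_ten = {v: trunc_tens(v) for v in intervals}
print(by_ten)

# Buckets of 3: TRUNC(interval*10/3, -1)
by_three = {v: trunc_tens(v * 10 / 3) for v in intervals}
print(by_three)

# Caveat: neighbours that straddle a bucket boundary are split apart.
print(trunc_tens(129), trunc_tens(130))
```

The last line is the weak spot of any fixed bucket size: 129 and 130 are only one apart but land in different buckets, so this only works if a set's intervals never cross a boundary.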
I have the data below:
Date Sec ID Price
01-Jan-2014, 1, 100
02-Jan-2014, 1, 111
03-Jan-2014, 1, 90
04-Jan-2014, 1, 121
01-Jan-2014, 2, 10
02-Jan-2014, 2, 11
03-Jan-2014, 2, 9
04-Jan-2014, 2, 12
I am using the LAG function in the query below, but I am not getting the right results:
select sec_id,date_of_data,price,
LAG(sec_id,1) over (order by sec_id) as prev_sec_id,
LAG(date_of_data,1) over (order by sec_id) as prev_date,
LAG(price,1) over (order by sec_id) as prev_price,
price/LAG(price,1) over (order by sec_id)-1 as price_return
from eqa.asset_mkt_price_ts
where sec_id in (1,2)
and date_of_data between '01-Jan-2014' and '04-Jan-2014';
Results are as below
Date Sec ID Price Prev Sec ID Prev Price
01-Jan-2014, 1, 100, NULL, NULL
02-Jan-2014, 1, 111, 1, 100
03-Jan-2014, 1, 90, 1, 111
04-Jan-2014, 1, 121, 1, 90
01-Jan-2014, 2, 10, 1, 121 ----- Issue Case
02-Jan-2014, 2, 11, 2, 10
03-Jan-2014, 2, 9, 2, 11
04-Jan-2014, 2, 12, 2, 12
As seen above, the results are not logical: for Sec ID 2, the previous price is taken from Sec ID 1, which is not correct.
Hope any expert around here can help me
Thanks
Hitesh
You need to replace order by sec_id with partition by sec_id order by date_of_data. Using order by sec_id alone produces a single analytic window over the whole input, ordered only by sec_id. The order of rows within each sec_id is then undefined, which can give unpredictable results, and LAG always returns the previous row even when that row belongs to a different sec_id.
Partitioning by sec_id gives two analytic windows, so the lag function works as you would like it to:
select x.*
from
(select sec_id,date_of_data,price,
    LAG(sec_id,1) over (partition by sec_id order by date_of_data) as prev_sec_id,
    LAG(date_of_data,1) over (partition by sec_id order by date_of_data) as prev_date,
    LAG(price,1) over (partition by sec_id order by date_of_data) as prev_price,
    price/LAG(price,1) over (partition by sec_id order by date_of_data)-1 as price_return
from eqa.asset_mkt_price_ts
where sec_id in (1,2)) x where x.price_return < 0.3;
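Here is a quick way to check the partitioned LAG against the question's data, sketched with Python's sqlite3 module (assuming SQLite 3.25 or newer). The table name is shortened to asset_mkt_price_ts, and the dates are stored as ISO strings so they sort correctly; both are my assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asset_mkt_price_ts (date_of_data TEXT, sec_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO asset_mkt_price_ts VALUES (?, ?, ?)",
    [("2014-01-01", 1, 100), ("2014-01-02", 1, 111),
     ("2014-01-03", 1, 90), ("2014-01-04", 1, 121),
     ("2014-01-01", 2, 10), ("2014-01-02", 2, 11),
     ("2014-01-03", 2, 9), ("2014-01-04", 2, 12)],
)

# Each sec_id gets its own window, ordered by date, so LAG never
# reaches back into another security's rows.
result = conn.execute("""
    SELECT sec_id, date_of_data, price,
           LAG(price, 1) OVER (PARTITION BY sec_id
                               ORDER BY date_of_data) AS prev_price,
           price / LAG(price, 1) OVER (PARTITION BY sec_id
                                       ORDER BY date_of_data) - 1 AS price_return
    FROM asset_mkt_price_ts
    WHERE sec_id IN (1, 2)
    ORDER BY sec_id, date_of_data
""").fetchall()

for row in result:
    print(row)
```

The first row of each sec_id now has NULL (None in Python) for prev_price and price_return instead of borrowing from the other security.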
I have the following table structure
Key int
MemberID int
VisitDate DateTime
How can I group all the dates that fall within a given date range, say 15 days? The first visit for the same member should be considered the starting date.
For example:
Key ID VisitDate(MM/dd/YY)
1 1 02/01/11
2 1 02/09/11
3 1 02/12/11
4 1 02/17/11
5 2 02/03/11
6 2 02/19/11
In this case the result should be
ID StartDate EndDate
1 02/01/11 02/12/11
1 02/17/11 02/17/11
2 02/03/11 02/03/11
2 02/19/11 02/19/11
One way to do this would be to use window aggregating. Here's how:
Setup:
DECLARE @data TABLE (
[Key] int, ID int, VisitDate date
);
INSERT INTO @data ([Key], ID, VisitDate)
SELECT 1, 1, '02/01/2011' UNION ALL
SELECT 2, 1, '02/09/2011' UNION ALL
SELECT 3, 1, '02/12/2011' UNION ALL
SELECT 4, 1, '02/17/2011' UNION ALL
SELECT 5, 2, '02/03/2011' UNION ALL
SELECT 6, 2, '02/19/2011';
Query:
WITH marked AS (
SELECT
*,
Grp = DATEDIFF(DAY, MIN(VisitDate) OVER (PARTITION BY ID), VisitDate) / 15
FROM @data
)
)
SELECT
ID,
StartDate = MIN(VisitDate),
EndDate = MAX(VisitDate)
FROM marked
GROUP BY ID, Grp
ORDER BY ID, StartDate
Output:
ID StartDate EndDate
----------- ---------- ----------
1 2011-02-01 2011-02-12
1 2011-02-17 2011-02-17
2 2011-02-03 2011-02-03
2 2011-02-19 2011-02-19
Basically, for each row, the query is calculating the difference of days between VisitDate and the first VisitDate for the same ID and divides it by 15. The result is then used as a grouping criterion. Note that SQL Server uses integer division when both operands of the / operator are integers.
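The same bucketing can be tried outside SQL Server with a sketch like this one, which uses Python's sqlite3 module (assuming SQLite 3.25 or newer). SQLite has no DATEDIFF, so the day difference is computed with julianday() instead; the table name visits and the column name VisitKey are placeholders of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (VisitKey INTEGER, ID INTEGER, VisitDate TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, 1, "2011-02-01"), (2, 1, "2011-02-09"), (3, 1, "2011-02-12"),
     (4, 1, "2011-02-17"), (5, 2, "2011-02-03"), (6, 2, "2011-02-19")],
)

ranges = conn.execute("""
    WITH marked AS (
        SELECT ID, VisitDate,
               -- Whole days since this ID's first visit, then integer
               -- division by 15 to get the bucket number
               CAST(julianday(VisitDate)
                    - julianday(MIN(VisitDate) OVER (PARTITION BY ID))
                    AS INTEGER) / 15 AS Grp
        FROM visits
    )
    SELECT ID, MIN(VisitDate) AS StartDate, MAX(VisitDate) AS EndDate
    FROM marked
    GROUP BY ID, Grp
    ORDER BY ID, StartDate
""").fetchall()

for row in ranges:
    print(row)
```

Keep in mind that the buckets are fixed 15-day windows counted from each ID's first visit. They do not restart at the first visit that falls outside the previous window.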