SQL aggregate using DISTINCT on ID by latest date - sql

Request
I have the data below, and my goal is to return each agent only once, keeping the row with the latest modified date for that agent.
Existing Data
modified agent rank
2016-10-18 346502 0
2013-06-04 346502 41
2011-10-31 346503 0
2012-08-13 346505 0
2016-04-18 346506 66
2015-01-27 346506 1
2016-01-21 346507 103
2015-01-27 346507 130
2012-01-30 346508 0
I am trying to use this answer https://stackoverflow.com/a/29912858/461887 as a basis, but I cannot work out where to aggregate properly.
SQL not working
SELECT DISTINCT
FLiex.agtprof.modify_date_time
,FLiex.agtprof.agent_id
,FLiex.agtprof.rank
,FLiex.agtprof.external_id
WHERE
FLiex.agtprof.modify_date_time = MAX( FLiex.agtprof.modify_date_time)
FROM
FLiex.agtprof
Desired Output
modify agent rank
18/10/2016 346502 0
18/04/2016 346506 66
21/01/2016 346507 103
13/08/2012 346505 0
30/01/2012 346508 0
31/10/2011 346503 0

You're trying to return single rows, but which row to return depends on the other rows for the same agent. While this may be possible with aggregate functions, it's much easier to do with window (analytic) functions:
SELECT [modified], [agent], [rank], [id]
FROM (SELECT [modified], [agent], [rank], [id],
ROW_NUMBER() OVER (PARTITION BY [agent]
ORDER BY [modified] DESC) AS rn
FROM [agtprof]) t
WHERE rn = 1
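To try the window-function approach without a database server, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The table and column names follow the answer above; the in-memory database is an assumption for illustration, and the rows are copied from the question.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE agtprof (modified TEXT, agent INTEGER, "rank" INTEGER)')
conn.executemany(
    "INSERT INTO agtprof VALUES (?, ?, ?)",
    [
        ("2016-10-18", 346502, 0),
        ("2013-06-04", 346502, 41),
        ("2011-10-31", 346503, 0),
        ("2012-08-13", 346505, 0),
        ("2016-04-18", 346506, 66),
        ("2015-01-27", 346506, 1),
        ("2016-01-21", 346507, 103),
        ("2015-01-27", 346507, 130),
        ("2012-01-30", 346508, 0),
    ],
)

# Number the rows of each agent from newest to oldest and keep row 1.
latest = conn.execute("""
    SELECT modified, agent, "rank"
    FROM (SELECT modified, agent, "rank",
                 ROW_NUMBER() OVER (PARTITION BY agent
                                    ORDER BY modified DESC) AS rn
          FROM agtprof) t
    WHERE rn = 1
    ORDER BY modified DESC
""").fetchall()

for row in latest:
    print(row)
```

Because ROW_NUMBER() assigns exactly one row the number 1 per partition, each agent appears once even if two rows share the same latest date.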

SELECT DISTINCT max(id_date), agent, rank, id
FROM fliex.agtprof
GROUP BY 2,3,4;
Try this. I think if you choose the max id_date and then group by the rest, you should get the results you're looking for.

Try this:
SELECT
FLiex.agtprof.modify_date_time
,FLiex.agtprof.agent_id
,FLiex.agtprof.rank
,FLiex.agtprof.external_id
FROM
FLiex.agtprof
INNER JOIN (
SELECT
Max(FLiex.agtprof.modify_date_time) as max_mod_date_time
,FLiex.agtprof.agent_id as agent_id
FROM
FLiex.agtprof
GROUP BY FLiex.agtprof.agent_id
) Filter
ON FLiex.agtprof.agent_id = Filter.agent_id
AND FLiex.agtprof.modify_date_time = Filter.max_mod_date_time
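The join against a per-agent MAX subquery can be checked the same way. This is a small sketch with Python's sqlite3 module; the short aliases a and f, the in-memory database, and the reduced set of sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE agtprof (modify_date_time TEXT, agent_id INTEGER, "rank" INTEGER)'
)
conn.executemany(
    "INSERT INTO agtprof VALUES (?, ?, ?)",
    [
        ("2016-10-18", 346502, 0),
        ("2013-06-04", 346502, 41),
        ("2016-04-18", 346506, 66),
        ("2015-01-27", 346506, 1),
        ("2012-01-30", 346508, 0),
    ],
)

# The derived table f holds the latest date per agent; joining back on
# both agent_id and that date keeps only the matching full rows.
joined = conn.execute("""
    SELECT a.modify_date_time, a.agent_id, a."rank"
    FROM agtprof AS a
    INNER JOIN (SELECT agent_id,
                       MAX(modify_date_time) AS max_mod_date_time
                FROM agtprof
                GROUP BY agent_id) AS f
        ON a.agent_id = f.agent_id
       AND a.modify_date_time = f.max_mod_date_time
    ORDER BY a.agent_id
""").fetchall()

print(joined)
```

Unlike the ROW_NUMBER() version, this form returns more than one row for an agent when several rows share the same latest modify_date_time.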

Related

Turn several queries into one in SQL Server

I have a table in SQL Server called schedule that has the following columns (and others not listed):
scheduleId roomId dateRegistered dateFreed
4564 2 2022-12-25 2022-12-26
4565 3 2022-12-25 2022-12-27
4566 15 2022-12-26 2022-12-27
4567 2 2022-12-28 2022-12-31
4568 3 2022-12-28 2022-12-30
In some part of my app I need to show all the rooms occupied at a certain date.
Currently I run a query like this:
SELECT TOP (1) *
FROM schedule
WHERE roomId = [theNeededRoom] AND dateFreed < [providedDate]
ORDER BY dateFreed DESC
The thing is that I have to run that query in a for loop so that I get the information for every room.
I'm sure there has to be a better way to do this in a single query that returns a row for each of the different roomIds possible, how can I go about this?
Also, when the room is first registered, the dateFreed column has a null value, if I wanted to take this into account, how can I make the query so that, in the case the dateFreed value is null, that is the row that gets chosen?
You can use TOP(1) WITH TIES, while ordering on the last "dateFreed" date.
In order to have a "tied" value to match on, instead of ordering on "dateFreed DESC" we can use the ROW_NUMBER window function to generate a ranking on the same field (which will store 1 for each most recent "dateFreed" value, per "roomId").
SELECT TOP (1) WITH TIES *
FROM schedule
WHERE dateFreed < [providedDate]
ORDER BY ROW_NUMBER() OVER(PARTITION BY roomId ORDER BY dateFreed DESC)
SELECT
t.*
FROM
(
SELECT
roomId AS rId,
max(dateFreed) AS dateFreedMax
FROM
schedule s
GROUP BY
s.roomId
) AS t
WHERE
t.dateFreedMax < [providedDate]
OR t.dateFreedMax IS NULL
Or
SELECT roomId
FROM
schedule s
GROUP BY s.roomId, dateFreed
HAVING
max(dateFreed)<[providedDate] OR dateFreed IS NULL
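For the NULL case asked about in the question, one possible sketch is to rank the rows per room so that a NULL dateFreed sorts first and the most recent dateFreed comes next. The example below runs on Python's sqlite3 module rather than SQL Server; the row with scheduleId 4569 and the provided date 2022-12-29 are hypothetical values added for illustration, and the CASE expression is used because it is portable to SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schedule (scheduleId INTEGER, roomId INTEGER, "
    "dateRegistered TEXT, dateFreed TEXT)"
)
conn.executemany(
    "INSERT INTO schedule VALUES (?, ?, ?, ?)",
    [
        (4564, 2, "2022-12-25", "2022-12-26"),
        (4565, 3, "2022-12-25", "2022-12-27"),
        (4566, 15, "2022-12-26", "2022-12-27"),
        (4567, 2, "2022-12-28", "2022-12-31"),
        (4568, 3, "2022-12-28", "2022-12-30"),
        (4569, 15, "2022-12-29", None),  # hypothetical row: room not freed yet
    ],
)

# Per room: a NULL dateFreed ranks first, then the latest dateFreed
# that is still before the provided date.
per_room = conn.execute("""
    SELECT scheduleId, roomId, dateFreed
    FROM (SELECT s.*,
                 ROW_NUMBER() OVER (
                     PARTITION BY roomId
                     ORDER BY CASE WHEN dateFreed IS NULL THEN 0 ELSE 1 END,
                              dateFreed DESC) AS rn
          FROM schedule AS s
          WHERE dateFreed IS NULL OR dateFreed < :provided) AS t
    WHERE rn = 1
    ORDER BY roomId
""", {"provided": "2022-12-29"}).fetchall()

print(per_room)
```

The WHERE clause has to let NULL rows through explicitly, because a comparison such as dateFreed < :provided is never true for NULL.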

count most repeated value per group in hive?

I am using Hive 0.14.0 on a Hortonworks Data Platform, on a big file similar to this input data:
tpep_pickup_datetime   pulocationid
2022-01-28 23:32:52.0  100
2022-02-28 23:02:40.0  202
2022-02-28 17:22:45.0  102
2022-02-28 23:19:37.0  102
2022-03-29 17:32:02.0  102
2022-01-28 23:32:40.0  101
2022-02-28 17:28:09.0  201
2022-03-28 23:59:54.0  100
2022-02-28 21:02:40.0  100
I want to find out what was the most common hour in each locationid, this being the result:
locationid hour
100 23
101 17
102 17
201 17
202 23
I was thinking of using a partition clause like this:
select * from (
select hour(tpep_pickup_datetime), pulocationid
(max (hour(tpep_pickup_datetime))) over (partition by pulocationid) as max_hour,
row_number() over (partition by pulocationid) as row_no
from yellowtaxi22
) res
where res.row_no = 1;
but it shows me this error:
SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Invalid function pulocationid
Is there any other way of doing this?
with raw_groups as -- subquery syntax
(
select
named_struct( -- named_struct gives the fields names that can be referenced later
'cnt', count(time_stamp), -- must be first for max to use it to sort on
'pulocationid', location.pulocationid,
'hour', hour(time_stamp)
) as mylocation -- create a struct to make max do the work for us
from
location
group by
location.pulocationid,
hour(time_stamp)
),
grouped_data as -- another subquery syntax based on `with`
(
select
max(mylocation) as location -- will pick max based on count(time_stamp)
from
raw_groups
group by
mylocation.pulocationid
)
select --format data into your requested format
location.pulocationid,
location.hour
from
grouped_data
I do not remember whether Hive 0.14 supports the WITH clause, but you can easily rewrite the query without it by substituting each subquery in place of its name. I just don't find it as readable:
select --format data into your requested format
location.pulocationid,
location.hour
from
(
select
max(mylocation) as location -- will pick max based on count(time_stamp)
from
(
select
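named_struct( -- named_struct gives the fields names that can be referenced later
'cnt', count(time_stamp), -- must be first for max to use it to sort on
'pulocationid', location.pulocationid,
'hour', hour(time_stamp)
) as mylocation -- create a struct to make max do the work for us
from
location
group by
location.pulocationid,
hour(time_stamp)
) raw_groups -- Hive requires an alias for a subquery in from
group by
mylocation.pulocationid
) grouped_data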
You were half way there!
The idea was in the right direction, but the syntax is a little off.
First, find the count for each hour:
select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) as cnt from yellowtaxi22
group by pulocationid, hour(tpep_pickup_datetime)
Then add the row_number on top of that result, ordered by the count in descending order (the counting query goes after the from):
select pulocationid, hour, cnt, row_number() over (partition by pulocationid order by cnt desc) as row_no from
Last but not least, take only the rows with the highest count (this could also be done with the max function rather than row_number, by the way).
Or in total:
select pulocationid, hour from (
select pulocationid, hour, cnt, row_number()
over (partition by pulocationid order by cnt desc)
as row_no from (
select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) as cnt from yellowtaxi22
group by pulocationid, hour(tpep_pickup_datetime)) counts) ranked
where row_no = 1
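The count-then-rank steps above can be sketched end to end with Python's sqlite3 module. SQLite has no hour() function, so this sketch extracts the hour with substr() instead; that substitution, the pickup_hour alias, the subquery aliases, and the extra tie-break on the hour are assumptions for illustration. The rows are the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yellowtaxi22 (tpep_pickup_datetime TEXT, pulocationid INTEGER)"
)
conn.executemany(
    "INSERT INTO yellowtaxi22 VALUES (?, ?)",
    [
        ("2022-01-28 23:32:52.0", 100),
        ("2022-02-28 23:02:40.0", 202),
        ("2022-02-28 17:22:45.0", 102),
        ("2022-02-28 23:19:37.0", 102),
        ("2022-03-29 17:32:02.0", 102),
        ("2022-01-28 23:32:40.0", 101),
        ("2022-02-28 17:28:09.0", 201),
        ("2022-03-28 23:59:54.0", 100),
        ("2022-02-28 21:02:40.0", 100),
    ],
)

# Innermost query: pickups per (location, hour).
# Middle query: rank the hours of each location by that count.
# Outer query: keep the top-ranked hour per location.
top_hours = conn.execute("""
    SELECT pulocationid, pickup_hour
    FROM (SELECT pulocationid, pickup_hour, cnt,
                 ROW_NUMBER() OVER (PARTITION BY pulocationid
                                    ORDER BY cnt DESC, pickup_hour) AS row_no
          FROM (SELECT pulocationid,
                       CAST(substr(tpep_pickup_datetime, 12, 2) AS INTEGER)
                           AS pickup_hour,
                       COUNT(*) AS cnt
                FROM yellowtaxi22
                GROUP BY pulocationid, substr(tpep_pickup_datetime, 12, 2)
               ) AS counts
         ) AS ranked
    WHERE row_no = 1
    ORDER BY pulocationid
""").fetchall()

for location, hour in top_hours:
    print(location, hour)
```

The second ORDER BY key makes the result deterministic when two hours have the same count for a location; without it, which of the tied hours wins is unspecified.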

Remove duplicate rows based on field in a select query with PostgreSQL?

Considering the table mdl_files that contains the following fields: id, contenthash, timecreated, filesize.
This table stores attachment files.
We consider all rows with the same content hash to be duplicates, and I just want to keep the oldest row (or the first one if the dates are equal).
How can I do that?
The following query:
SELECT
id,
contenthash,
filesize,
to_timestamp(timecreated) :: DATE
FROM mdl_files
ORDER BY contenthash;
returns:
2480229 00002e87605311feb82b70473b61e81f0223c774 18178 2016-10-05
2997411 0000bfd20ef84948eee6811ce5bbac03de42ccb0 1293 2017-03-31
1304839 000280169fc78d704a2d4569bfb6f42ea4a1d5ae 8203 2015-11-10
1364656 000280169fc78d704a2d4569bfb6f42ea4a1d5ae 8203 2015-11-17
71568 0003c6aec5835964870902d697c06d21abf76bf7 139439 2013-04-19
2959945 000419c19d77df7285e669614075b47414e3ab2c 398 2017-03-20
3483049 00061dc0bc2452304107ddc75e7ee2908c729905 28618 2017-08-17
3483047 00061dc0bc2452304107ddc75e7ee2908c729905 28618 2017-08-17
I want to get this resultset:
2480229 00002e87605311feb82b70473b61e81f0223c774 18178 2016-10-05
2997411 0000bfd20ef84948eee6811ce5bbac03de42ccb0 1293 2017-03-31
1304839 000280169fc78d704a2d4569bfb6f42ea4a1d5ae 8203 2015-11-10
71568 0003c6aec5835964870902d697c06d21abf76bf7 139439 2013-04-19
2959945 000419c19d77df7285e669614075b47414e3ab2c 398 2017-03-20
3483049 00061dc0bc2452304107ddc75e7ee2908c729905 28618 2017-08-17
I want the following duplicated lines to be removed from the resultset:
1364656 000280169fc78d704a2d4569bfb6f42ea4a1d5ae 8203 2015-11-17
3483047 00061dc0bc2452304107ddc75e7ee2908c729905 28618 2017-08-17
Use DISTINCT ON:
SELECT DISTINCT ON (contenthash)
id,
contenthash,
filesize,
to_timestamp(timecreated) :: DATE
FROM mdl_files
ORDER BY contenthash, timecreated, id;
DISTINCT ON is a Postgres extension that returns one row for each unique combination of the keys in parentheses. The specific row is the first one encountered according to the ORDER BY clause.
You can also use the ROW_NUMBER() window function to number the rows within each group of duplicates and then keep only the first one.
SELECT t.*
FROM (
SELECT
id,
contenthash,
filesize,
ROW_NUMBER() OVER (PARTITION BY contenthash,filesize order by timecreated) rn
FROM mdl_files
) t
where t.rn = 1
If you want to DELETE duplicate data you can use EXISTS in where clause.
DELETE
FROM mdl_files f WHERE EXISTS(
SELECT 1
FROM (
SELECT
id,
contenthash,
filesize,
ROW_NUMBER() OVER (PARTITION BY contenthash,filesize order by timecreated) rn
FROM mdl_files
) t
where t.rn > 1 and t.id = f.id
)
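As a rough way to check the delete logic before running it on real data, here is a sketch on Python's sqlite3 module. The ids, hashes, and timestamps are made-up illustration values, the IN form replaces the EXISTS form above for brevity, and the partition uses contenthash only with id as an extra tie-break; these choices are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mdl_files (id INTEGER PRIMARY KEY, contenthash TEXT, "
    "filesize INTEGER, timecreated INTEGER)"
)
conn.executemany(
    "INSERT INTO mdl_files VALUES (?, ?, ?, ?)",
    [
        (1, "aaa", 8203, 1000),    # oldest row for hash aaa: kept
        (2, "aaa", 8203, 2000),    # newer duplicate: deleted
        (3, "bbb", 139439, 1500),  # no duplicates: kept
        (4, "ccc", 28618, 3000),   # same time as id 5, lower id: kept
        (5, "ccc", 28618, 3000),   # deleted
    ],
)

# Every row numbered above 1 within its contenthash is a duplicate.
conn.execute("""
    DELETE FROM mdl_files
    WHERE id IN (SELECT id
                 FROM (SELECT id,
                              ROW_NUMBER() OVER (PARTITION BY contenthash
                                                 ORDER BY timecreated, id) AS rn
                       FROM mdl_files) AS t
                 WHERE rn > 1)
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM mdl_files ORDER BY id")]
print(remaining)
```

Running the inner SELECT on its own first is a cheap way to review which rows would be removed.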

Find out the last updated record in my DB using MAX in CASE statement

I have APPLICATIONSTATUSLOG_ID primary key field on my table.
To find the last updated record in my DB, I presume that the row with MAX(APPLICATIONSTATUSLOG_ID) is the most recent one.
I tried this code :
SELECT
MAX(CASE WHEN MAX(d.ApplicationStatusLog_ID) = d.ApplicationStatusLog_ID THEN d.ApplicationStatusID END) AS StatusID,
FROM
ApplicationStatusLog d
But I get error:
Msg 130, Level 15, State 1, Line 53 Cannot perform an aggregate function on an expression containing an aggregate or a subquery.
My table looks like
ApplicationID - ApplicationStatusID - ApplicationStatusLogID
10000 17 100
10000 08 101
10000 10 102
10001 06 103
10001 10 104
10002 06 105
10002 07 106
My output should be:
10000 10
10001 10
10002 07
Please help me understand and resolve my problem.
If you just want to find the last updated row, assuming it has the maximum value in the APPLICATIONSTATUSLOG_ID column, the query would be:
SELECT *
FROM ApplicationStatusLog
WHERE ApplicationStatusLog_ID = (SELECT MAX(ApplicationStatusLog_ID) FROM ApplicationStatusLog )
EDIT
As you stated in the comment, the query for it will be:
DECLARE @statusId INT
SELECT @statusId = STATUSID
FROM ApplicationStatusLog
WHERE ApplicationStatusLog_ID = (SELECT MAX(ApplicationStatusLog_ID) FROM ApplicationStatusLog )
EDIT 2:
The query as per your edit in question will be:
WITH C AS
(
SELECT ApplicationID,ApplicationStatusID,ApplicationStatusLogID, ROW_NUMBER() OVER (PARTITION BY ApplicationID ORDER BY ApplicationStatusLogID DESC) AS ranking
FROM ApplicationStatusLog
)
SELECT ApplicationID,ApplicationStatusID
FROM C
WHERE ranking = 1
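The CTE above can be tried against the sample table from the question with Python's sqlite3 module. This is only a sketch: the in-memory database is an assumption, and the status IDs are stored as integers, so 08 appears as 8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ApplicationStatusLog (ApplicationID INTEGER, "
    "ApplicationStatusID INTEGER, ApplicationStatusLogID INTEGER)"
)
conn.executemany(
    "INSERT INTO ApplicationStatusLog VALUES (?, ?, ?)",
    [
        (10000, 17, 100),
        (10000, 8, 101),
        (10000, 10, 102),
        (10001, 6, 103),
        (10001, 10, 104),
        (10002, 6, 105),
        (10002, 7, 106),
    ],
)

# Rank each application's log rows from newest to oldest log id,
# then keep the newest one.
latest_status = conn.execute("""
    WITH C AS (
        SELECT ApplicationID, ApplicationStatusID, ApplicationStatusLogID,
               ROW_NUMBER() OVER (PARTITION BY ApplicationID
                                  ORDER BY ApplicationStatusLogID DESC) AS ranking
        FROM ApplicationStatusLog
    )
    SELECT ApplicationID, ApplicationStatusID
    FROM C
    WHERE ranking = 1
    ORDER BY ApplicationID
""").fetchall()

print(latest_status)
```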
You can join same table twice like this:
select IT.JoiningID, JT.MAXAPPLICATIONSTATUSID FROM dbo.[Table] IT
INNER JOIN (
Select JoiningID, MAX (APPLICATIONSTATUSID) MAXAPPLICATIONSTATUSID
FROM dbo.[Table]
GROUP BY JoiningID
) JT ON IT.JoiningID = JT.JoiningID
Now you have MAXAPPLICATIONSTATUSID per ID, so you can write what you want based on MAXAPPLICATIONSTATUSID.
Without full query
SELECT
x.StatusId
...
FROM <Table> a
CROSS APPLY
(
SELECT s.APPLICATIONSTATUSID as StatusId
FROM <Table> s
GROUP BY s.APPLICATIONSTATUSID
HAVING MAX(s.APPLICATIONSTATUSLOG_ID) = a.APPLICATIONSTATUSLOG_ID
) x

GROUP values separated by specific records

I want to make a counter that increases by one each time a specific record is found in a row.
time event revenue counter
13.37 START 20 1
13.38 action A 10 1
13.40 action B 5 1
13.42 end 1
14.15 START 20 2
14.16 action B 5 2
14.18 end 2
15.10 START 20 3
15.12 end 3
I need to find the total revenue for every visit (the actions between START and END). I was thinking the best way would be to set a counter like the one in the last column above, so I could group the events. But if you have a better solution, I would be grateful.
You can use a query similar to the following:
with StartTimes as
(
select time,
startRank = row_number() over (order by time)
from events
where event = 'START'
)
select e.*, counter = st.startRank
from events e
outer apply
(
select top 1 st.startRank
from StartTimes st
where e.time >= st.time
order by st.time desc
) st
SQL Fiddle with demo.
It may need to be adjusted for the characteristics of the actual data, such as duplicate times or missing events, but it works for the sample data.
SQL Server 2012 supports an OVER clause for aggregates, so if you're up to date on version, this will give you the counter you want:
count(case when eventname='START' then 1 end) over (order by eventtime)
You could also use the latest START time instead of a counter to group by, like this:
with t as (
select
*,
max(case when eventname='START' then eventtime end)
over (order by eventtime) as timeStart
from YourTable
)
select
timeStart,
max(eventtime) as timeEnd,
sum(revenue) as totalRevenue
from t
group by timeStart;
Here's a SQL Fiddle demo using the schema Ian posted for his solution.
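The running-count idea can also be sketched with Python's sqlite3 module (SQLite 3.25 or later). The table name YourTable and the columns eventtime and eventname follow the answer above; storing the times as text and the in-memory database are assumptions for illustration, with rows taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (eventtime TEXT, eventname TEXT, revenue INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [
        ("13.37", "START", 20),
        ("13.38", "action A", 10),
        ("13.40", "action B", 5),
        ("13.42", "end", None),
        ("14.15", "START", 20),
        ("14.16", "action B", 5),
        ("14.18", "end", None),
        ("15.10", "START", 20),
        ("15.12", "end", None),
    ],
)

# The running count of START rows seen so far is the visit counter;
# grouping by it gives one revenue total per visit.
totals = conn.execute("""
    WITH t AS (
        SELECT *,
               COUNT(CASE WHEN eventname = 'START' THEN 1 END)
                   OVER (ORDER BY eventtime) AS counter
        FROM YourTable
    )
    SELECT counter, SUM(revenue) AS totalRevenue
    FROM t
    GROUP BY counter
    ORDER BY counter
""").fetchall()

print(totals)
```

COUNT ignores the NULL produced by the CASE for non-START rows, so the counter only moves when a new visit begins.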