Remove duplicate rows based on field in a select query with PostgreSQL?

Consider the table mdl_files, which contains the following fields: id, contenthash, timecreated, filesize.
This table stores attachment files.
All rows with the same content hash are considered duplicates, and I want to keep only the oldest row (or the first one if the dates are equal).
How can I do that?
The following query:
SELECT
id,
contenthash,
filesize,
to_timestamp(timecreated) :: DATE
FROM mdl_files
ORDER BY contenthash;
returns:
2480229 00002e87605311feb82b70473b61e81f0223c774 18178 2016-10-05
2997411 0000bfd20ef84948eee6811ce5bbac03de42ccb0 1293 2017-03-31
1304839 000280169fc78d704a2d4569bfb6f42ea4a1d5ae 8203 2015-11-10
1364656 000280169fc78d704a2d4569bfb6f42ea4a1d5ae 8203 2015-11-17
71568 0003c6aec5835964870902d697c06d21abf76bf7 139439 2013-04-19
2959945 000419c19d77df7285e669614075b47414e3ab2c 398 2017-03-20
3483049 00061dc0bc2452304107ddc75e7ee2908c729905 28618 2017-08-17
3483047 00061dc0bc2452304107ddc75e7ee2908c729905 28618 2017-08-17
I want to get this resultset:
2480229 00002e87605311feb82b70473b61e81f0223c774 18178 2016-10-05
2997411 0000bfd20ef84948eee6811ce5bbac03de42ccb0 1293 2017-03-31
1304839 000280169fc78d704a2d4569bfb6f42ea4a1d5ae 8203 2015-11-10
71568 0003c6aec5835964870902d697c06d21abf76bf7 139439 2013-04-19
2959945 000419c19d77df7285e669614075b47414e3ab2c 398 2017-03-20
3483049 00061dc0bc2452304107ddc75e7ee2908c729905 28618 2017-08-17
I want the following duplicated lines to be removed from the resultset:
1364656 000280169fc78d704a2d4569bfb6f42ea4a1d5ae 8203 2015-11-17
3483047 00061dc0bc2452304107ddc75e7ee2908c729905 28618 2017-08-17

Use DISTINCT ON:
SELECT DISTINCT ON (contenthash)
id,
contenthash,
filesize,
to_timestamp(timecreated) :: DATE
FROM mdl_files
ORDER BY contenthash, timecreated, id;
DISTINCT ON is a Postgres extension that returns exactly one row for each unique combination of the keys in parentheses. The row that is kept is the first one found according to the ORDER BY clause.
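To make the "first row per key" rule concrete, here is a minimal Python sketch of what DISTINCT ON (contenthash) with ORDER BY contenthash, timecreated, id ends up selecting. The sample rows, the shortened hashes, and the helper name distinct_on_first are hypothetical. It only mimics the semantics and says nothing about how Postgres implements them.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (id, contenthash, timecreated) rows; the values are made up.
rows = [
    (1364656, "0002", 200),
    (1304839, "0002", 100),
    (3483049, "0006", 300),
    (3483047, "0006", 300),
    (71568, "0003", 50),
]

def distinct_on_first(rows):
    """Keep the first row per contenthash after sorting by
    (contenthash, timecreated, id), like DISTINCT ON + ORDER BY."""
    ordered = sorted(rows, key=itemgetter(1, 2, 0))
    return [next(group) for _, group in groupby(ordered, key=itemgetter(1))]

kept = distinct_on_first(rows)
for row in kept:
    print(row)
```

When two rows share the same timecreated, the id column in the ORDER BY breaks the tie, so the lower id is the one that is kept.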

You can use the ROW_NUMBER() window function to number the rows within each group of duplicates, then keep only the first row of each group.
SELECT t.*
FROM (
SELECT
id,
contenthash,
filesize,
ROW_NUMBER() OVER (PARTITION BY contenthash,filesize order by timecreated) rn
FROM mdl_files
) t
where t.rn = 1
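As a quick way to try the ROW_NUMBER() idea without a Postgres server, here is a small sketch that runs a similar query through Python's built-in sqlite3 module. It assumes an SQLite build with window functions (3.25 or later), uses made-up rows, and partitions by contenthash only, with id as a tie-breaker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mdl_files ("
    "id INTEGER PRIMARY KEY, contenthash TEXT, timecreated INTEGER, filesize INTEGER)"
)
# Hypothetical rows: 'aaa' and 'ccc' each appear twice.
conn.executemany(
    "INSERT INTO mdl_files VALUES (?, ?, ?, ?)",
    [
        (1, "aaa", 100, 10),
        (2, "aaa", 200, 10),
        (3, "bbb", 300, 20),
        (5, "ccc", 400, 30),
        (4, "ccc", 400, 30),
    ],
)

# Number the rows inside each contenthash group, oldest first, then keep row 1.
first_rows = conn.execute(
    """
    SELECT id, contenthash, filesize
    FROM (
        SELECT id, contenthash, filesize,
               ROW_NUMBER() OVER (
                   PARTITION BY contenthash
                   ORDER BY timecreated, id
               ) AS rn
        FROM mdl_files
    ) AS t
    WHERE rn = 1
    ORDER BY contenthash
    """
).fetchall()

print(first_rows)
```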
If you want to DELETE the duplicate rows, you can use EXISTS in the WHERE clause.
DELETE
FROM mdl_files f WHERE EXISTS(
SELECT 1
FROM (
SELECT
id,
contenthash,
filesize,
ROW_NUMBER() OVER (PARTITION BY contenthash,filesize order by timecreated) rn
FROM mdl_files
) t
where t.rn > 1 and t.id = f.id
)
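The same deletion can be tried locally with Python's sqlite3 module. This is only a sketch under a few assumptions: the rows are made up, SQLite 3.25 or later is needed for ROW_NUMBER(), and the statement uses id IN (...) instead of the EXISTS form above, which is a plain variation rather than the exact query from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mdl_files ("
    "id INTEGER PRIMARY KEY, contenthash TEXT, timecreated INTEGER, filesize INTEGER)"
)
# Hypothetical rows: 'aaa' and 'ccc' each appear twice.
conn.executemany(
    "INSERT INTO mdl_files VALUES (?, ?, ?, ?)",
    [
        (1, "aaa", 100, 10),
        (2, "aaa", 200, 10),
        (3, "bbb", 300, 20),
        (5, "ccc", 400, 30),
        (4, "ccc", 400, 30),
    ],
)

# Delete every row that is not the first one of its contenthash group.
conn.execute(
    """
    DELETE FROM mdl_files
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY contenthash
                       ORDER BY timecreated, id
                   ) AS rn
            FROM mdl_files
        ) AS ranked
        WHERE rn > 1
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM mdl_files ORDER BY id")]
print(remaining_ids)
```

Running the inner SELECT on its own first is a cheap way to check which rows would be removed before issuing the DELETE.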

Related

count most repeated value per group in hive?

I am using hive 0.14.0 in a hortonworks data platform, on a big file similar to this input data:
tpep_pickup_datetime pulocationid
2022-01-28 23:32:52.0 100
2022-02-28 23:02:40.0 202
2022-02-28 17:22:45.0 102
2022-02-28 23:19:37.0 102
2022-03-29 17:32:02.0 102
2022-01-28 23:32:40.0 101
2022-02-28 17:28:09.0 201
2022-03-28 23:59:54.0 100
2022-02-28 21:02:40.0 100
I want to find out what was the most common hour in each locationid, this being the result:
locationid hour
100 23
101 17
102 17
201 17
202 23
I was thinking of using a window function with PARTITION BY, like this:
select * from (
select hour(tpep_pickup_datetime), pulocationid
(max (hour(tpep_pickup_datetime))) over (partition by pulocationid) as max_hour,
row_number() over (partition by pulocationid) as row_no
from yellowtaxi22
) res
where res.row_no = 1;
But it gives me this error:
SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Invalid function pulocationid
Is there any other way of doing this?
with raw_groups as -- subquery syntax
(
select
struct(
count(time_stamp), -- must be first for max to use it to sort on
location.pulocationid ,
hour(time_stamp) as hour
) as mylocation -- create a struct to make max do the work for us
from
location
group by
location.pulocationid,
hour(time_stamp)
),
grouped_data as -- another subquery syntax based on `with`
(
select
max(mylocation) as location -- will pick max based on count(time_stamp)
from
raw_groups
group by
mylocation.pulocationid
)
select --format data into your requested format
location.pulocationid,
location.hour
from
grouped_data
I do not remember whether Hive 0.14 supports the WITH clause, but you could easily rewrite the query without it (by substituting each select in place of the corresponding table name). I just don't find it as readable:
select --format data into your requested format
location.pulocationid,
location.hour
from
(
select
max(mylocation) as location -- will pick max based on count(time_stamp)
from
(
select
struct(
count(time_stamp), -- must be first for max to use it to sort on
location.pulocationid ,
hour(time_stamp) as hour
) as mylocation -- create a struct to make max do the work for us
from
location
group by
location.pulocationid,
hour(time_stamp)
) raw_groups
group by
mylocation.pulocationid
) grouped_data
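The trick in this answer is that max() compares the struct field by field, so putting the count first makes the count decide the winner. Python tuples compare the same way, so the idea can be sketched in a few lines. The rows are a subset of the sample data from the question, and the variable names are made up for illustration.

```python
from collections import Counter

# (tpep_pickup_datetime, pulocationid) pairs taken from the question's sample.
trips = [
    ("2022-01-28 23:32:52.0", 100),
    ("2022-02-28 17:22:45.0", 102),
    ("2022-02-28 23:19:37.0", 102),
    ("2022-03-29 17:32:02.0", 102),
    ("2022-03-28 23:59:54.0", 100),
    ("2022-02-28 21:02:40.0", 100),
]

# Step 1: count trips per (location, hour); characters 11-12 hold the hour.
counts = Counter((loc, int(ts[11:13])) for ts, loc in trips)

# Step 2: per location keep the largest (count, hour) tuple.
# Tuples compare element by element, so the count is compared first.
best = {}
for (loc, hour), cnt in counts.items():
    candidate = (cnt, hour)
    if loc not in best or candidate > best[loc]:
        best[loc] = candidate

most_common_hour = {loc: hour for loc, (cnt, hour) in best.items()}
print(most_common_hour)
```

If two hours have the same count for a location, the later fields decide, which here means the larger hour wins.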
You were half way there!
The idea was heading in the right direction, but the syntax is a little off.
First, find the count for each hour per location:
select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) as cnt from yellowtaxi22
group by pulocationid, hour(tpep_pickup_datetime)
Then add row_number() on top of that grouped result, ordered by the count in descending order:
select pulocationid, hour, cnt, row_number() over (partition by pulocationid order by cnt desc) as row_no from ( <grouped query from the first step> ) counts
Last but not least, keep only the rows with the highest count (by the way, this can also be done with the max function rather than row_number).
Or, all together:
select pulocationid, hour from (
select pulocationid, hour, cnt, row_number()
over (partition by pulocationid order by cnt desc)
as row_no from (
select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) as cnt from yellowtaxi22
group by pulocationid, hour(tpep_pickup_datetime) ) counts ) ranked
where row_no = 1
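For anyone who wants to check the count-then-rank logic without a Hive cluster, here is a rough equivalent run through Python's sqlite3 module. It is a sketch with several assumptions: SQLite 3.25 or later for window functions, substr() standing in for Hive's hour() function, a pickup_hour alias chosen for this example, and only some of the question's sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yellowtaxi22 (tpep_pickup_datetime TEXT, pulocationid INTEGER)")
conn.executemany(
    "INSERT INTO yellowtaxi22 VALUES (?, ?)",
    [
        ("2022-01-28 23:32:52.0", 100),
        ("2022-02-28 23:02:40.0", 202),
        ("2022-02-28 17:22:45.0", 102),
        ("2022-02-28 23:19:37.0", 102),
        ("2022-03-29 17:32:02.0", 102),
        ("2022-03-28 23:59:54.0", 100),
        ("2022-02-28 21:02:40.0", 100),
    ],
)

most_common = conn.execute(
    """
    SELECT pulocationid, pickup_hour
    FROM (
        SELECT pulocationid, pickup_hour, cnt,
               ROW_NUMBER() OVER (
                   PARTITION BY pulocationid
                   ORDER BY cnt DESC
               ) AS row_no
        FROM (
            -- innermost step: one row per (location, hour) with its count
            SELECT pulocationid,
                   CAST(substr(tpep_pickup_datetime, 12, 2) AS INTEGER) AS pickup_hour,
                   COUNT(*) AS cnt
            FROM yellowtaxi22
            GROUP BY pulocationid, substr(tpep_pickup_datetime, 12, 2)
        ) AS counts
    ) AS ranked
    WHERE row_no = 1
    ORDER BY pulocationid
    """
).fetchall()

print(most_common)
```

When two hours are tied for a location, ROW_NUMBER() picks one of them arbitrarily unless a second ORDER BY column is added.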

Get last_modified_date from a group of date of each target

I've a table in oracle database:
Transaction_ID Target Status Last_modified_date
80913570 8536349 1 2018-10-03 03:40:36.0
80913540 8860342 1 2018-09-28 08:45:32.0
80913541 9135368 1 2018-09-28 08:45:42.0
80913532 8860342 1 2018-09-28 08:12:52.0
80913624 9256309 1 2018-10-05 01:25:06.0
80913573 9256309 0 2018-10-03 07:18:35.0
80913574 9256309 0 2018-10-03 07:21:26.0
80913576 9256309 1 2018-10-03 07:28:36.0
80913613 5429179 0 2018-10-08 05:45:00.0
80913614 5429179 1 2018-10-04 06:48:06.0
From this table, I want the most recent modified date for each Target. Some Targets have a single record, while others have several modified dates.
I tried following query:
select max(last_modified_date) from demoTable where target in (select distinct target from demoTable);
But because of the IN condition, I get only a single value across all Targets, while I want one value for each Target.
*PL/SQL can also be used to achieve the result, but I'm new to the industry and don't know exactly how to do it.
Use group by
select target,max(last_modified_date) from demoTable
group by target
Since you need the most recent date for each target, you can choose either of the two methods below. The first uses a correlated subquery:
select t.* from demoTable t
where t.Last_modified_date in
( select max(Last_modified_date) from demoTable t1
where t1.Target=t.Target
)
Or use row_number window function
select Transaction_ID ,Target , Status, Last_modified_date from
(
select Transaction_ID ,Target , Status, Last_modified_date , row_number() over(partition by target order by Last_modified_date desc) as rn from demoTable
) t where t.rn=1
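To compare what these approaches return, here is a small sketch that loads the sample rows from the question into an in-memory SQLite database via Python's sqlite3 module. It assumes SQLite 3.25 or later for ROW_NUMBER() and stores the dates as text, which sorts chronologically only because they are in year-month-day order. Oracle itself is not involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demoTable ("
    "transaction_id INTEGER, target INTEGER, status INTEGER, last_modified_date TEXT)"
)
conn.executemany(
    "INSERT INTO demoTable VALUES (?, ?, ?, ?)",
    [
        (80913570, 8536349, 1, "2018-10-03 03:40:36.0"),
        (80913540, 8860342, 1, "2018-09-28 08:45:32.0"),
        (80913541, 9135368, 1, "2018-09-28 08:45:42.0"),
        (80913532, 8860342, 1, "2018-09-28 08:12:52.0"),
        (80913624, 9256309, 1, "2018-10-05 01:25:06.0"),
        (80913573, 9256309, 0, "2018-10-03 07:18:35.0"),
        (80913574, 9256309, 0, "2018-10-03 07:21:26.0"),
        (80913576, 9256309, 1, "2018-10-03 07:28:36.0"),
        (80913613, 5429179, 0, "2018-10-08 05:45:00.0"),
        (80913614, 5429179, 1, "2018-10-04 06:48:06.0"),
    ],
)

# GROUP BY: only the target and its latest date.
latest_dates = conn.execute(
    "SELECT target, MAX(last_modified_date) FROM demoTable "
    "GROUP BY target ORDER BY target"
).fetchall()

# ROW_NUMBER(): the whole latest row per target, including transaction_id.
latest_rows = conn.execute(
    """
    SELECT transaction_id, target, status, last_modified_date
    FROM (
        SELECT d.*,
               ROW_NUMBER() OVER (
                   PARTITION BY target
                   ORDER BY last_modified_date DESC
               ) AS rn
        FROM demoTable AS d
    ) AS t
    WHERE rn = 1
    ORDER BY target
    """
).fetchall()

print(latest_dates)
print(latest_rows)
```

GROUP BY is enough when only the date is needed. The window-function form is the one to reach for when the other columns of the latest row matter too.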

SQL aggregate using DISTINCT on ID by latest date

Request
I have a section of data below, and my goal is to return each agent only once, keeping the row with the latest modified date for that agent.
Existing Data
modified agent rank
2016-10-18 346502 0
2013-06-04 346502 41
2011-10-31 346503 0
2012-08-13 346505 0
2016-04-18 346506 66
2015-01-27 346506 1
2016-01-21 346507 103
2015-01-27 346507 130
2012-01-30 346508 0
Trying to use this answer https://stackoverflow.com/a/29912858/461887 as a basis but cannot get where to aggregate it properly.
SQL not working
SELECT DISTINCT
FLiex.agtprof.modify_date_time
,FLiex.agtprof.agent_id
,FLiex.agtprof.rank
,FLiex.agtprof.external_id
WHERE
FLiex.agtprof.modify_date_time = MAX( FLiex.agtprof.modify_date_time)
FROM
FLiex.agtprof
Desired Output
modify agent rank
18/10/2016 346502 0
18/04/2016 346506 66
21/01/2016 346507 103
13/08/2012 346505 0
30/01/2012 346508 0
31/10/2011 346503 0
You're attempting to get single row data, but based on the other rows. While this may be possible with aggregate functions, it's much easier to do with window (analytic) functions:
SELECT [modified], [agent], [rank], [id]
FROM (SELECT [modified], [agent], [rank], [id],
ROW_NUMBER() OVER (PARTITION BY [agent]
ORDER BY [modified] DESC) AS rn
FROM [agtprof]) t
WHERE rn = 1
SELECT DISTINCT max(id_date), agent, rank, id
FROM fliex.agtprof
GROUP BY 2,3,4;
Try this. I think if you choose the max id_date and then group by the rest, you should get the results you're looking for.
Try this:
SELECT
FLiex.agtprof.modify_date_time
,FLiex.agtprof.agent_id
,FLiex.agtprof.rank
,FLiex.agtprof.external_id
FROM
FLiex.agtprof
INNER JOIN (
SELECT
Max(FLiex.agtprof.modify_date_time) as max_mod_date_time
,FLiex.agtprof.agent_id as agent_id
FROM
FLiex.agtprof
GROUP BY FLiex.agtprof.agent_id
) Filter
ON FLiex.agtprof.agent_id = Filter.agent_id
AND FLiex.agtprof.modify_date_time = Filter.max_mod_date_time
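The join-on-max approach can be checked against the sample data from the question with Python's sqlite3 module. This is a sketch only: the table is reduced to the three columns shown in the question, the dates are stored as text, and the final ORDER BY is added just to match the order of the desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE agtprof (modify_date_time TEXT, agent_id INTEGER, "rank" INTEGER)')
conn.executemany(
    "INSERT INTO agtprof VALUES (?, ?, ?)",
    [
        ("2016-10-18", 346502, 0),
        ("2013-06-04", 346502, 41),
        ("2011-10-31", 346503, 0),
        ("2012-08-13", 346505, 0),
        ("2016-04-18", 346506, 66),
        ("2015-01-27", 346506, 1),
        ("2016-01-21", 346507, 103),
        ("2015-01-27", 346507, 130),
        ("2012-01-30", 346508, 0),
    ],
)

# Join each row to the latest date of its agent and keep only the matches.
rows = conn.execute(
    """
    SELECT a.modify_date_time, a.agent_id, a."rank"
    FROM agtprof AS a
    INNER JOIN (
        SELECT agent_id, MAX(modify_date_time) AS max_mod_date_time
        FROM agtprof
        GROUP BY agent_id
    ) AS latest
        ON a.agent_id = latest.agent_id
       AND a.modify_date_time = latest.max_mod_date_time
    ORDER BY a.modify_date_time DESC
    """
).fetchall()

for row in rows:
    print(row)
```

One difference from the ROW_NUMBER() version: if an agent has two rows with the same latest date, this join returns both of them.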

SQL: How to make a query that return last created row per each user from table's data

Consider the following table data:
ID UserID ClassID SchoolID Created
2184 19313 10 28189 2010-10-25 14:16:39.823
46697 19313 10 27721 2011-04-04 14:50:49.433
•47423 19313 11 27721 2011-09-15 09:15:51.740
•47672 19881 11 42978 2011-09-19 17:31:12.853
3176 19881 11 42978 2010-10-27 22:29:41.130
22327 19881 9 45263 2011-02-14 19:42:41.320
46661 32810 11 41861 2011-04-04 14:26:14.800
•47333 32810 11 51721 2011-09-13 22:43:06.053
131 32810 11 51721 2010-09-22 03:16:44.520
I want to write a SQL query that returns the last created row for each UserID, so that the result is as below (the rows that begin with • above):
ID UserID ClassID SchoolID Created
47423 19313 11 27721 2011-09-15 09:15:51.740
47672 19881 11 42978 2011-09-19 17:31:12.853
47333 32810 11 51721 2011-09-13 22:43:06.053
You can use a CTE (Common Table Expression) with the ROW_NUMBER function:
;WITH LastPerUser AS
(
SELECT
ID, UserID, ClassID, SchoolID, Created,
ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY Created DESC) AS RowNum
FROM dbo.YourTable
)
SELECT
ID, UserID, ClassID, SchoolID, Created
FROM LastPerUser
WHERE RowNum = 1
This CTE "partitions" your data by UserID, and for each partition, the ROW_NUMBER function hands out sequential numbers, starting at 1 and ordered by Created DESC - so the latest row gets RowNum = 1 (for each UserID) which is what I select from the CTE in the SELECT statement after it.
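To see the numbering that the explanation above describes, the sketch below prints RowNum for every row rather than filtering on it. It uses the sample data from the question (ID, UserID, and Created only) in an in-memory SQLite database through Python's sqlite3 module, which assumes SQLite 3.25 or later. The table name YourTable is the placeholder from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (ID INTEGER, UserID INTEGER, Created TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [
        (2184, 19313, "2010-10-25 14:16:39.823"),
        (46697, 19313, "2011-04-04 14:50:49.433"),
        (47423, 19313, "2011-09-15 09:15:51.740"),
        (47672, 19881, "2011-09-19 17:31:12.853"),
        (3176, 19881, "2010-10-27 22:29:41.130"),
        (22327, 19881, "2011-02-14 19:42:41.320"),
        (46661, 32810, "2011-04-04 14:26:14.800"),
        (47333, 32810, "2011-09-13 22:43:06.053"),
        (131, 32810, "2010-09-22 03:16:44.520"),
    ],
)

# Show the number each row receives inside its UserID partition.
numbered = conn.execute(
    """
    WITH LastPerUser AS (
        SELECT ID, UserID, Created,
               ROW_NUMBER() OVER (
                   PARTITION BY UserID
                   ORDER BY Created DESC
               ) AS RowNum
        FROM YourTable
    )
    SELECT UserID, ID, RowNum
    FROM LastPerUser
    ORDER BY UserID, RowNum
    """
).fetchall()

for row in numbered:
    print(row)

# The rows numbered 1 are the latest row of each user.
latest_ids = [row[1] for row in numbered if row[2] == 1]
print(latest_ids)
```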
I know this is an old question at this point, but I was having the same problem in MySQL, and I think I have figured out a standard sql way of doing this. I have only tested this with MySQL, but I don't believe I am using anything MySQL-specific.
select mainTable.* from YourTable mainTable, (
select UserID, max(Created) as Created
from YourTable
group by UserID
) dateTable
where mainTable.UserID = dateTable.UserID
and mainTable.Created = dateTable.Created

SQL CTE and ORDER BY affecting result set

I've pasted a very simplified version of my SQL query below. The problem that I'm running into is that the ORDER BY statement is affecting the select results of my CTE. I haven't been able to understand why this is, my original thinking was that within the CTE, I execute some SELECT statement, then the ORDER BY should work on THOSE results.
Unfortunately the behavior that I'm seeing is that my inner SELECT statement is being affected by the order by, giving me 'items' that are not in the TOP 10.
Here is an example of data:
(Indexed in reverse order by ID)
ID, Date
9600 2010-10-12
9599 2010-09-08
9598 2010-08-31
9597 2010-08-31
9596 2010-08-30
9595 2010-08-11
9594 2010-08-06
9593 2010-08-05
9592 2010-08-02
....
9573 2010-08-10
....
8174 2010-08-05
....
38 2029-12-20
My basic query:
;with results as(
select TOP 10 ID, Date
from dbo.items
)
SELECT ID
FROM results
query returns:
ID, Date
9600 2010-10-12
9599 2010-09-08
9598 2010-08-31
9597 2010-08-31
9596 2010-08-30
9595 2010-08-11
9594 2010-08-06
9593 2010-08-05
9592 2010-08-02
My query with the ORDER BY
;with results as(
select TOP 10 ID, Date
from dbo.items
)
SELECT ID
FROM results
ORDER BY Date DESC
query returns:
ID, Date
38 2029-12-20
9600 2010-10-12
9599 2010-09-08
9598 2010-08-31
9597 2010-08-31
9596 2010-08-30
9595 2010-08-11
9573 2010-08-10
9594 2010-08-06
8174 2010-08-05
Can anyone explain why the first query only returns IDs that are in the top 10 of the table, while the second query returns the top 10 of the entire table (after the sorting is applied)?
When you use SELECT TOP n, you must supply an ORDER BY if you want deterministic behaviour; otherwise the server is free to return any 10 rows it feels like. The behaviour you are seeing is perfectly valid.
To solve the problem, specify an ORDER BY inside the CTE:
WITH results AS
(
SELECT TOP 10 ID, Date
FROM dbo.items
ORDER BY ID DESC
)
SELECT ID
FROM results
ORDER BY Date
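Here is a small sketch of the same pattern that can be run with Python's sqlite3 module. SQLite has no TOP, so LIMIT 3 stands in for TOP 10, and the table holds just a handful of the rows from the question. The point it illustrates is that the ORDER BY inside the CTE decides which rows are picked, and the outer ORDER BY only sorts those rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (ID INTEGER PRIMARY KEY, Date TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [
        (9600, "2010-10-12"),
        (9599, "2010-09-08"),
        (9598, "2010-08-31"),
        (9573, "2010-08-10"),
        (38, "2029-12-20"),
    ],
)

ids = [
    row[0]
    for row in conn.execute(
        """
        WITH results AS (
            -- the inner ORDER BY fixes WHICH rows are taken
            SELECT ID, Date
            FROM items
            ORDER BY ID DESC
            LIMIT 3
        )
        -- the outer ORDER BY only sorts the rows already chosen
        SELECT ID
        FROM results
        ORDER BY Date
        """
    )
]

print(ids)
```

ID 38 has the latest date in the table but never shows up, because it is not among the three highest IDs selected inside the CTE.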
I think you can add a new column such as
SELECT ROW_NUMBER() OVER(ORDER BY <ColumnName>) AS RowNo
followed by the rest of your columns. You can then filter on RowNo when querying the CTE, using BETWEEN, WHERE, and similar clauses.