Redundant values while fetching distinct values from column after joins - sql

I am trying to fetch unique email IDs from my Postgres database, but I still get redundant values. The query is as follows:
select distinct(t2.email_id), user_id, registration_date,
last_login, status, count_uo
from (
select t1.*
from (
select distinct(u.email_id), u.user_id,
u.registration_date, u.last_login,
u.status, count(distinct(uo.id)) as count_uo
from users u
join user_offers uo on u.user_id = uo.user_id
and u.email_id != ''
and uo.offer_id in ('13', '9', 18, 7, 19, 25)
join user_utils uu on u.user_id = uu.user_id
and uu.carrier ~* 'Airtel'
or uu.carrier ~* 'Jio'
or uu.carrier ~* 'Idea'
or uu.carrier ~* '!dea'
where u.registration_date::date between date'2016-08-04' and date'2017-09-28'
and u.last_login::date between date'2017-06-01' and date'2017-09-29'
and u.gender = 'm'
and u.status = 'sms-verified'
and u.email_verification_status = 'UN-VERIFIED'
and u.email_id != '' group by u.user_id
) as t1
where t1.count_uo >1 and t1.count_uo < 100
) t2;
I get the output as follows, even after applying distinct twice.
   email_id    | user_id |     registration_date      |         last_login         |    status    | count_uo
---------------+---------+----------------------------+----------------------------+--------------+----------
 abc@gmail.com |     509 | 2017-07-26 16:59:50.608219 | 2017-07-26 17:56:54.88664  | sms-verified |        3
 def@gmail.com |     518 | 2017-08-18 19:26:45.217283 | 2017-08-22 15:38:01.591841 | sms-verified |        3
 abc@gmail.com |     512 | 2017-08-17 12:01:00.003048 | 2017-08-21 17:52:56.303841 | sms-verified |        3
Since I'm weak in SQL, any help will be appreciated very much.

If you are using Postgres, you can use distinct on:
select distinct on (t2.email_id) t2.email_id, user_id,
registration_date, last_login, status, count_uo
from ( . . . ) t2
order by t2.email_id;
You can add a second key to the order by to get a particular row (say the most recent login by using order by t2.email_id, last_login desc).
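As a rough, self-contained sketch of that tie-breaking idea: the temporary table t2_demo and its three rows are made up here, modelled on the sample output in the question, and only a few of the columns are kept.

```sql
-- Hypothetical stand-in for the t2 subquery from the question.
CREATE TEMP TABLE t2_demo (email_id text, user_id int, last_login timestamp);

INSERT INTO t2_demo VALUES
    ('abc@gmail.com', 509, '2017-07-26 17:56:54'),
    ('def@gmail.com', 518, '2017-08-22 15:38:01'),
    ('abc@gmail.com', 512, '2017-08-21 17:52:56');

-- One row per email_id; within each email_id the row with the
-- most recent last_login sorts first and is the one that is kept.
SELECT DISTINCT ON (email_id) email_id, user_id, last_login
FROM t2_demo
ORDER BY email_id, last_login DESC;
-- keeps user_id 512 for abc@gmail.com and user_id 518 for def@gmail.com
```

Without the second sort key, which of the two abc@gmail.com rows survives is not predictable.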

You have two users (rows) with 'abc@gmail.com' as email_id. Notice that they have distinct values in the user_id column (509 and 512).
As @GordonLinoff said, you can hide one of those results by using the DISTINCT ON clause. But I suspect that is not what you want...
I imagine it's more likely that you inserted some test data and duplicated 'abc@gmail.com' in it.
This also points to (I think) a mistake in your model definition: your users table is missing UNIQUE constraints on the email_id and user_id columns, which would prevent this from happening again.
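A minimal sketch of such a constraint, under some assumptions: users_demo and the index name users_demo_email_key are hypothetical, and because the question filters on email_id != '' I am guessing that empty strings occur in the real table, so the sketch uses a partial unique index that ignores them. A plain UNIQUE constraint would be the simpler choice if empty values never occur.

```sql
-- Hypothetical, cut-down copy of the users table.
CREATE TEMP TABLE users_demo (user_id int PRIMARY KEY, email_id text NOT NULL);

INSERT INTO users_demo VALUES
    (509, 'abc@gmail.com'),
    (518, 'def@gmail.com'),
    (600, ''),
    (601, '');

-- Step 1: list addresses shared by more than one user. These have to be
-- cleaned up first, otherwise creating the unique index fails.
SELECT email_id, count(*) AS users_with_email
FROM users_demo
WHERE email_id <> ''
GROUP BY email_id
HAVING count(*) > 1;

-- Step 2: enforce uniqueness for non-empty addresses only.
CREATE UNIQUE INDEX users_demo_email_key
    ON users_demo (email_id)
    WHERE email_id <> '';

-- From now on an insert like this one is rejected with a unique violation:
-- INSERT INTO users_demo VALUES (512, 'abc@gmail.com');
```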

Related

Redshift join each values in an array

I have a table like the one below (it's actually the pg_group table)
group_id | group_name | userid
_____________________________________
101 | gr1 | {100,101}
102 | gr2 | {100,110,120}
I have another table where I can see the name of the user id.
userid | username
______________________
100 | user1
101 | user2
110 | user3
120 | user4
I want to join these 2 tables and generate the output like this.
group_id | group_name | username
_____________________________________
101 | gr1 | user1,user2
102 | gr2 | user1,user3,user4
I tried listagg and similar functions, but they didn't work as expected.
Update:
I tried this query, but listagg does not seem to work.
SELECT I.group_name, listagg(J.username,',')
FROM pg_group I
LEFT JOIN pg_user J
ON J.userid = ANY(I.userid)
GROUP BY I.group_name
ERROR: One or more of the used functions must be applied on at least one user created tables. Examples of user table only functions are LISTAGG, MEDIAN, PERCENTILE_CONT, etc;
Here I first convert the arrays of user IDs into rows with unnest(), then look up the username for each user ID, and finally use string_agg() to group those usernames back into one comma-separated column.
select group_id, group_name, string_agg(username, ',') as usrname
from (select group_id, group_name, unnest(userid::text[]) as user_id
      from pg_group) pg
inner join pg_user u
on pg.user_id::int = u.userid
group by group_id, group_name
From what I have found by googling so far, you cannot use listagg() unless a user-defined table is involved. I have found a workaround, but I cannot check it since I don't have access to Redshift. Please try it out:
select t.group_name, listagg(t.username, ', ') within group (order by t.username)
from
(
SELECT I.group_name, J.username
FROM pg_group I
LEFT JOIN pg_user J
ON J.userid = ANY(I.userid)
left join (select top 1 1 as dummy from my_schema.my_table) d
on 1=1
) t
group by t.group_name
Instead of my_schema.my_table, please use any of your own user-defined tables.

Postgresql select distinct Column A based on certain conditions on Column B

I have a table with data:
+--------+---------+
| userid | status |
+--------+---------+
| user_1 | success |
| user_2 | fail |
| user_2 | success |
| user_3 | fail |
| user_3 | fail |
+--------+---------+
I would like my query output to be distinct on userid, with one condition: when a userid has both fail and success values in the status column, I would like to choose success (if all rows are fail, as for user_3, choose fail). The table below shows the output that I would like to have as my result:
+--------+---------+
| userid | status |
+--------+---------+
| user_1 | success |
| user_2 | success |
| user_3 | fail |
+--------+---------+
Any efficient query would be nice as well. Thanks!
Here is a pretty efficient way to get the results you need.
SELECT userid, MAX(status)
FROM table1
GROUP BY userid
The MAX() function works for strings as well.
Since "success" > "fail" in string comparison (alphabetical order),
if a userid has 1 row of "success" and 1 row of "fail", the maximum value is "success".
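The MAX() trick depends on 'success' happening to sort after 'fail'. If that feels fragile, one possible alternative in PostgreSQL is to state the rule explicitly with the bool_or() aggregate. This is only a sketch; the temporary table status_demo is a made-up copy of the data in the question.

```sql
-- Hypothetical copy of the sample data from the question.
CREATE TEMP TABLE status_demo (userid text, status text);

INSERT INTO status_demo VALUES
    ('user_1', 'success'),
    ('user_2', 'fail'),
    ('user_2', 'success'),
    ('user_3', 'fail'),
    ('user_3', 'fail');

-- bool_or() is true for a group if at least one of its rows is a success,
-- so the result does not depend on how the status strings sort.
SELECT userid,
       CASE WHEN bool_or(status = 'success') THEN 'success' ELSE 'fail' END AS status
FROM status_demo
GROUP BY userid
ORDER BY userid;
-- user_1 -> success, user_2 -> success, user_3 -> fail
```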
Use SELECT DISTINCT ON, which provides a simple and readable method to get the rows unique on userid. The ORDER BY ensures that status = 'success' is sorted before 'fail', and hence 'success' is selected if present:
SELECT DISTINCT ON (userid) userid,
status
FROM my_table
ORDER BY userid,
status DESC;
Note: A multicolumn index on (status, userid) may help performance. Also, in some cases a query using GROUP BY (see the answer from Terence) may be faster than the one using DISTINCT.
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row
of each set of rows where the given expressions evaluate to equal. ...
The DISTINCT ON expression(s) must match the leftmost ORDER BY
expression(s). The ORDER BY clause will normally contain additional
expression(s) that determine the desired precedence of rows within
each DISTINCT ON group.
(From the SELECT DISTINCT docs)
First, select the userids whose status is success.
select distinct userid,status from yourtable where status='success'
Second, select the userids that have no row with a success status.
select distinct userid,status from yourtable where
userid not in(select distinct userid from yourtable where status='success')
Then combine the two with UNION.
select distinct userid,status from yourtable where status='success'
union
select distinct userid,status from yourtable where
userid not in(select distinct userid from yourtable where status='success')

PostgreSQL remove duplicates by GROUP BY

I would like to print the latest message of each person, so each person should appear only once. I use PostgreSQL 10. The desired output:
+-----------+----------+--------------+
| name | body | created_at |
+-----------+----------+--------------+
| Maria | Test3 | 2017-07-07 |
| Paul | Test5 | 2017-06-01 |
+-----------+----------+--------------+
I have tried this with the following SQL query. It returns the messages, but unfortunately each person appears more than once.
SELECT * FROM messages
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
+-----------+----------+--------------+
| name | body | created_at |
+-----------+----------+--------------+
| Maria | Test1 | 2016-06-01 |
| Maria | Test2 | 2016-11-01 |
| Maria | Test3 | 2017-07-07 |
| Paul | Test4 | 2017-01-01 |
| Paul | Test5 | 2017-06-01 |
+-----------+----------+--------------+
I tried to remove the duplicates with a DISTINCT, but unfortunately I get this error message:
SELECT DISTINCT ON (name) * FROM messages
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON (name) * FROM messages
                            ^
Do you have any ideas how I can solve this ?
You would use DISTINCT ON as follows:
SELECT DISTINCT ON (name) *
FROM messages
WHERE receive = 't'
ORDER BY name, created_at DESC
That is:
no GROUP BY clause is needed
the column(s) listed in DISTINCT ON(...) must appear first in the ORDER BY clause
... followed by the column that should be used to pick the row within each group (here, that is created_at)
Note that the results of a DISTINCT ON query are always sorted by the columns in the clause (because this sort is what is used to identify which rows should be kept).
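One way to get a different final order while still using DISTINCT ON is to wrap the query in a subquery and sort again outside. A small sketch, assuming a messages table shaped like the one in the question (the temporary table messages_demo and the alias latest are made up for illustration):

```sql
-- Hypothetical copy of the sample data from the question.
CREATE TEMP TABLE messages_demo (name text, body text, created_at date, receive boolean);

INSERT INTO messages_demo VALUES
    ('Maria', 'Test1', '2016-06-01', true),
    ('Maria', 'Test2', '2016-11-01', true),
    ('Maria', 'Test3', '2017-07-07', true),
    ('Paul',  'Test4', '2017-01-01', true),
    ('Paul',  'Test5', '2017-06-01', true);

-- Inner query: newest message per name (has to be sorted by name first).
-- Outer query: re-sort the surviving rows, newest message first.
SELECT *
FROM (
    SELECT DISTINCT ON (name) *
    FROM messages_demo
    WHERE receive = 't'
    ORDER BY name, created_at DESC
) latest
ORDER BY created_at DESC;
-- returns Maria / Test3 / 2017-07-07, then Paul / Test5 / 2017-06-01
```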
If you want more control over the sort order, then you can use window functions instead:
SELECT *
FROM (
SELECT m.*, ROW_NUMBER() OVER(PARTITION BY name ORDER BY created_at DESC) rn
FROM messages m
WHERE receive = 't'
) t
WHERE rn = 1
ORDER BY created_at DESC
Use DISTINCT ON, but with the right ORDER BY:
SELECT DISTINCT ON (name) m.*
FROM messages m
WHERE receive = 't'
ORDER BY name, created_at DESC;
In general, you don't use DISTINCT ON with GROUP BY. It is used with ORDER BY. It works by choosing the first row for each name, based on the ORDER BY clause.
You should not be thinking of what you are doing as aggregation. You want to filter based on the created_at. In many databases, you would express this using a correlated subquery:
select m.*
from messages m
where m.created_at = (select max(m2.created_at)
from messages m2
where m2.name = m.name and m2.receive = 't'
) and
m.receive = 't'; -- this condition is probably not needed
SELECT *
FROM messages
WHERE receive = 't' and not exists (
select 1
from messages m
where m.receive = messages.receive and messages.name = m.name and m.created_at > messages.created_at
)
ORDER BY created_at DESC
The query above finds the messages which fulfill the following criteria:
receive is 't'
there does not exist another message which
has the same value for receive
has the same name
and is newer
Assuming that the same name does not send two messages at exactly the same time, this should be enough. Another point is that two names might look the same but actually differ if whitespace characters are present inside the value. So if the query above returns two records with what looks like the same name but different created_at values, it is highly probable that whitespace characters are playing tricks on you.
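To check for that whitespace problem, something along these lines might help. It is only a sketch for PostgreSQL: the temporary table messages_ws_demo and its rows are invented, with a trailing space deliberately added to one name.

```sql
-- Hypothetical data: the second 'Maria ' has a trailing space.
CREATE TEMP TABLE messages_ws_demo (name text, body text, created_at date);

INSERT INTO messages_ws_demo VALUES
    ('Maria',  'Test1', '2016-06-01'),
    ('Maria ', 'Test3', '2017-07-07'),
    ('Paul',   'Test5', '2017-06-01');

-- Strip all whitespace from each name, then report stripped names that
-- correspond to more than one distinct stored spelling.
SELECT regexp_replace(name, '\s', '', 'g') AS stripped_name,
       count(DISTINCT name) AS spellings
FROM messages_ws_demo
GROUP BY regexp_replace(name, '\s', '', 'g')
HAVING count(DISTINCT name) > 1;
-- returns one row: Maria with 2 spellings
```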

SQL filtering rows using WHERE clause with array_agg function and joins

My current query:
SELECT
i.ID AS interview_id, i.board, i.time_taken, i.notes, i.interview_date,
u.ID AS user_id, u.first_name, u.last_name, u.state, u.district, u.optional, u.is_interview_user_only,
COALESCE((SELECT array_agg(DISTINCT uj.job_name) as jobs FROM user_jobs uj WHERE uj.user_id = u.id), '{}') as jobs
FROM interview i
JOIN users u ON i.user_id = u.id
GROUP BY u.id, i.id;
I want to implement filters for the interviews by "checking if a certain Job is in the array_agg of that interview row."
Current output:
interview_id | time_taken | ... | jobs
1001 | 25 | ... | {CEO, Product Manager}
1002 | 20 | ... | {Customer Care, Hospitality}
1003 | 40 | ... | {CEO, CFO}
1004 | 35 | ... | {Army Official, Sales Manager}
Output I want when I specify jobs containing "CEO" as the filter criterion:
interview_id | time_taken | ... | jobs
1001 | 25 | ... | {CEO, Product Manager}
1003 | 40 | ... | {CEO, CFO}
That is, return only the rows that have "CEO" as one of the values in the jobs array aggregate.
I am not sure how to use the WHERE or HAVING or IN clause with the above query so as to filter the rows based on results from the subquery aggregate array.
Is this possible in a single query where joins and aggregates are present?
If not, how else can I make this possible?
I want to implement this statement for applying sorts and filters options selected from Frontend.
Database: PostgreSQL
Environment: Node using Express and node-pg package.
Frontend: Vue
Rewrite the query so there is no subquery. That is, just add in user_jobs to the FROM clause:
SELECT i.ID AS interview_id, i.board, i.time_taken, i.notes, i.interview_date,
u.ID AS user_id, u.first_name, u.last_name, u.state, u.district,
u.optional, u.is_interview_user_only,
array_agg(DISTINCT uj.job_name) as jobs
FROM interview i
JOIN users u ON i.user_id = u.id
LEFT JOIN user_jobs uj ON uj.user_id = u.id
GROUP BY u.id, i.id;
Now, you can add filtering in a HAVING clause:
HAVING COUNT(*) FILTER (WHERE uj.job_name = 'CEO') > 0
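Put together, the filter could look like the sketch below. The three temporary tables are cut-down, hypothetical versions of the ones in the question (suffixed _demo, with made-up rows and only the columns needed here). The COALESCE plus FILTER on array_agg is my own addition so that users without jobs still get '{}', as in the original query.

```sql
-- Hypothetical minimal tables; the primary keys allow selecting other
-- columns of the same table while grouping only by the ids.
CREATE TEMP TABLE users_demo2 (id int PRIMARY KEY, first_name text);
CREATE TEMP TABLE interview_demo (id int PRIMARY KEY, user_id int, time_taken int);
CREATE TEMP TABLE user_jobs_demo (user_id int, job_name text);

INSERT INTO users_demo2 VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
INSERT INTO interview_demo VALUES (1001, 1, 25), (1002, 2, 20), (1003, 3, 40);
INSERT INTO user_jobs_demo VALUES
    (1, 'CEO'), (1, 'Product Manager'),
    (2, 'Customer Care'), (2, 'Hospitality'),
    (3, 'CEO'), (3, 'CFO');

-- Variant 1: join user_jobs and filter the groups with HAVING.
SELECT i.id AS interview_id, i.time_taken, u.id AS user_id, u.first_name,
       COALESCE(array_agg(DISTINCT uj.job_name)
                FILTER (WHERE uj.job_name IS NOT NULL), '{}') AS jobs
FROM interview_demo i
JOIN users_demo2 u ON i.user_id = u.id
LEFT JOIN user_jobs_demo uj ON uj.user_id = u.id
GROUP BY u.id, i.id
HAVING COUNT(*) FILTER (WHERE uj.job_name = 'CEO') > 0;

-- Variant 2: keep the aggregating subquery and filter with EXISTS in WHERE.
SELECT i.id AS interview_id, i.time_taken, u.id AS user_id, u.first_name,
       COALESCE((SELECT array_agg(DISTINCT uj.job_name)
                 FROM user_jobs_demo uj
                 WHERE uj.user_id = u.id), '{}') AS jobs
FROM interview_demo i
JOIN users_demo2 u ON i.user_id = u.id
WHERE EXISTS (SELECT 1
              FROM user_jobs_demo uj
              WHERE uj.user_id = u.id AND uj.job_name = 'CEO');
-- both variants return interviews 1001 and 1003 only
```

With node-pg, the 'CEO' literal would normally be passed as a query parameter ($1) rather than concatenated into the SQL string.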

Beginner SQL query with ROW_NUMBER

I'm kind of a beginner with SQL.
Right now I'm trying to create a somewhat complex SELECT, but I'm getting an error, which I know is a beginner mistake.
Any help is appreciated.
SELECT ROW_NUMBER() OVER (ORDER BY score) AS rank, userID, facebookID, name, score FROM (
SELECT * FROM Friends AS FR WHERE userID = ?
JOIN
Users WHERE Users.facebookID = FR.facebookFriendID
)
UNION (
SELECT * FROM User WHERE userID = ?
)
Where the 2 ? will be replaced with my user's ID.
The table User contains every user in my db, while the Friends table contains all facebookFriends for a user.
USER TABLE
userID | facebookID | name | score
FRIENDS TABLE
userID | facebookFriendID
Sample data
USER
A | facebookID1 | Alex | 100
B | facebookID2 | Mike | 200
FRIENDS
A | facebookID2
A | facebookID3
B | facebookID1
I'd like this result since Alex and Mike are friends:
rank | userID | facebookID | name
1 | B | facebookID2 | Mike
2 | A | facebookID1 | Alex
I hope this explanation is clear.
I'm getting this error at the moment:
Error occurred executing query: Incorrect syntax near the keyword 'AS'.
You've got several issues with your query. JOINs come before WHERE clauses, and each JOIN needs an ON clause. Also, when using a UNION, you need to make sure both queries return the same number of columns.
Give this a try:
SELECT ROW_NUMBER() OVER (ORDER BY score) AS rank, userID, facebookID, name, score
FROM (
SELECT *
FROM Users
WHERE UserId = 'A'
UNION
SELECT U.userId, u.facebookId, u.name, u.score
FROM Friends FR
JOIN Users U ON U.facebookID = FR.facebookFriendID
WHERE FR.userID = 'A' ) t
Also, the way you're using ROW_NUMBER, it really will be a row number rather than a rank. If you want rankings (with potential ties), replace ROW_NUMBER with RANK.
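To illustrate the difference, here is a small sketch that should run on engines that support window functions and VALUES in the FROM clause (for example SQL Server or PostgreSQL). The rows are invented: Alex and Mike come from the question, Sam is a hypothetical third user with the same score as Mike. It sorts by score descending, which is what the expected output in the question (Mike first) seems to call for.

```sql
SELECT name,
       score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,   -- always 1, 2, 3, ...
       RANK()       OVER (ORDER BY score DESC) AS score_rank -- ties share a rank
FROM (VALUES ('Alex', 100),
             ('Mike', 200),
             ('Sam',  200)) AS t (name, score);
-- Mike and Sam both get score_rank 1 and Alex gets score_rank 3;
-- row_num is 1 and 2 for Mike and Sam (in no guaranteed order) and 3 for Alex.
```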