Trouble Finding IDs with Duplicate Fields - sql

My data looks like this:
ID Email
1 someone@hotmail.com
2 someone1@hotmail.com
3 someone2@hotmail.com
4 someone3@hotmail.com
5 someone4@hotmail.com
6 someone5@hotmail.com
There should be exactly one email per ID, but there isn't.
> dim(data)
[1] 5071 2
> length(unique(data$Person_Onyx_Id))
[1] 5071
> length((data$Email))
[1] 5071
> length(unique(data$Email))
[1] 4481
So I need to find the IDs with duplicated email addresses.
It seems like this should be easy, but I'm striking out:
> sqldf("select ID, count(Email) from data group by ID having count(Email) > 1")
[1] ID count(Email)
<0 rows> (or 0-length row.names)
I've also tried removing the having clause, saving the result to an object, and sorting that object by count(Email). It appears that every ID has a count(Email) of 1.
I would dput the actual data but I can't due to the sensitivity of email addresses.

Are you also sure you don't have the opposite condition, multiple ids with the same email?
select Email, count(*)
from data
group by Email
having count(*) > 1;

My guess is that you have NULL emails. You could check this by using count(*) rather than count(email), because count(email) ignores NULL values:
select ID, count(*)
from data
group by ID
having count(*) > 1;
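
To list the IDs themselves rather than only the duplicated addresses, the grouped query can be joined back to the table. Below is a minimal sketch using Python's built-in sqlite3 module in place of sqldf; the five rows are made up for illustration and the table layout (ID, Email) is assumed from the question. It also shows the difference between count(*) and count(Email) when some emails are NULL.

```python
import sqlite3

# Hypothetical sample rows: IDs 1 and 3 share an address, IDs 4 and 5 have none.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ID INTEGER, Email TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"),
     (3, "a@example.com"), (4, None), (5, None)],
)

# IDs whose (non-NULL) email also appears on at least one other row.
# NULL never equals NULL in the join, so rows without an email drop out.
shared = conn.execute(
    """
    SELECT d.ID, d.Email
    FROM data AS d
    JOIN (SELECT Email
          FROM data
          GROUP BY Email
          HAVING COUNT(*) > 1) AS dup
      ON d.Email = dup.Email
    ORDER BY d.ID
    """
).fetchall()

# COUNT(*) counts every row; COUNT(Email) skips rows where Email is NULL.
null_check = conn.execute("SELECT COUNT(*), COUNT(Email) FROM data").fetchone()

print(shared)
print(null_check)
```

If the two numbers in null_check differ, the difference is the number of rows with a NULL email.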

Related

Count with Group By always shows '1'

I'm trying to count the number of articles mentioning Donald Trump in a Google BigQuery table.
SELECT
sourcecommonname,
COUNT(DISTINCT sourcecommonname) counter
FROM
`israel_media`
WHERE
persons LIKE '%donald trump%'
GROUP BY
sourcecommonname
The results are always:
sourcecommonname counter
first_newspaper 1
second_newspaper 1
third_newspaper 1
forth_newspaper 1
What am I not seeing?
Of course. You are counting distinct values of the very column you group by, so there is only one value per group. That is what the GROUP BY does.
Perhaps you just want COUNT(*) without DISTINCT:
SELECT sourcecommonname,
COUNT(*) as counter
FROM `israel_media`
WHERE persons LIKE '%donald trump%'
GROUP BY sourcecommonname
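
The difference is easy to see side by side. Here is a small sketch using Python's built-in sqlite3 module rather than BigQuery; the four rows and the contents of the persons column are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE israel_media (sourcecommonname TEXT, persons TEXT)")
# Hypothetical rows: two matching articles for the first paper, one for the second.
conn.executemany(
    "INSERT INTO israel_media VALUES (?, ?)",
    [("first_newspaper", "donald trump;someone else"),
     ("first_newspaper", "donald trump"),
     ("second_newspaper", "donald trump"),
     ("second_newspaper", "nobody relevant")],
)

rows = conn.execute(
    """
    SELECT sourcecommonname,
           COUNT(DISTINCT sourcecommonname) AS distinct_names,  -- always 1 per group
           COUNT(*) AS articles                                 -- rows in the group
    FROM israel_media
    WHERE persons LIKE '%donald trump%'
    GROUP BY sourcecommonname
    ORDER BY sourcecommonname
    """
).fetchall()

print(rows)
```

The distinct_names column is 1 for every group, while articles holds the actual number of matching rows.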

SQL Rows to Columns if column values are unknown

I have a table that has demographic information about a set of users which looks like this:
User_id Category IsMember
1 College 1
1 Married 0
1 Employed 1
1 Has_Kids 1
2 College 0
2 Married 1
2 Employed 1
3 College 0
3 Employed 0
The result set I want is a table that looks like this:
User_Id College Married Employed Has_Kids
1 1 0 1 1
2 0 1 1 0
3 0 0 0 0
In other words, the table indicates the presence or absence of a category for each user. Sometimes the user will have a category where the value is false, and sometimes the user will have no row for a category, in which case IsMember is assumed to be false.
Also, from time to time additional categories will be added to the data set, and I'm wondering if it's possible to do this query without knowing up front all the possible category names. In other words, I won't be able to specify all the column names I want to count in the result. (Note that only user 1 has the category "Has_Kids" and user 3 is missing a row for the category "Married".)
(using Postgres)
Thanks.
You can use jsonb functions.
with titles as (
    select jsonb_object_agg(Category, Category) as titles,
           jsonb_object_agg(Category, -1) as defaults
    from demog
),
the_rows as (
    select null::bigint as id, titles as data
    from titles
    union
    select User_id, defaults || jsonb_object_agg(Category, IsMember)
    from demog, titles
    group by User_id, defaults
)
select id, string_agg(value, '|' order by key)
from (
    select id, key, value
    from the_rows, jsonb_each_text(data)
) x
group by id
order by id nulls first
You can see a running example in http://rextester.com/QEGT70842
You can replace -1 with 0 for the default value and '|' with ',' for the separator.
You can install the tablefunc module and use the crosstab function.
https://www.postgresql.org/docs/9.1/static/tablefunc.html
I found a Postgres function script called colpivot which does the trick. I ran the script to create the function, then created the table in one statement:
select colpivot ('_pivoted', 'select * from user_categories', array['user_id'],
array ['category'], '#.is_member', null);
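
Another option, if the query is issued from application code, is to build the column list in two steps: first read the distinct categories, then generate one conditional aggregate per category. This is a different technique from the jsonb, crosstab and colpivot answers above. The sketch below uses Python's built-in sqlite3 module instead of Postgres, the table name demog is borrowed from the first answer, and quote_ident is a hypothetical helper written for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demog (User_id INTEGER, Category TEXT, IsMember INTEGER)")
conn.executemany(
    "INSERT INTO demog VALUES (?, ?, ?)",
    [(1, "College", 1), (1, "Married", 0), (1, "Employed", 1), (1, "Has_Kids", 1),
     (2, "College", 0), (2, "Married", 1), (2, "Employed", 1),
     (3, "College", 0), (3, "Employed", 0)],
)


def quote_ident(name):
    """Hypothetical helper: quote a string as an SQL identifier."""
    return '"' + name.replace('"', '""') + '"'


# Step 1: discover whichever categories exist right now.
categories = [row[0] for row in conn.execute(
    "SELECT DISTINCT Category FROM demog ORDER BY Category")]

# Step 2: one conditional aggregate per category; a missing row counts as 0.
columns = ", ".join(
    f"MAX(CASE WHEN Category = ? THEN IsMember ELSE 0 END) AS {quote_ident(c)}"
    for c in categories
)
sql = f"SELECT User_id, {columns} FROM demog GROUP BY User_id ORDER BY User_id"

cursor = conn.execute(sql, categories)  # category values are bound as parameters
header = [d[0] for d in cursor.description]
pivot = cursor.fetchall()

print(header)
print(pivot)
```

The columns come out in alphabetical order of category name, and a newly added category shows up as a new column the next time the query is built.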

sql query order by parts of

Let's say you have a table with 10,000 records of different email addresses, and within this table a few hundred addresses (the number can vary and should not matter) contain a specific domain name, e.g. @horses.com.
I would like to retrieve all 10,000 records in one single query, with the ones that contain @horses.com always at the top of the list.
Something like this: "SELECT TOP 10000 * FROM dbo.Emails ORDER BY -- the records that contain @horses.com come first"
OR
Give me 10,000 records from the table dbo.Emails, but make sure every one that contains "@horses.com" comes first, no matter how many there are.
BTW, this is on SQL Server 2012.
Anyone??
Try this:
SELECT TOP 10000 *
FROM dbo.Emails
ORDER BY IIF(Email LIKE '%@horses.com', 0, 1)
This assumes the email ends in '@horses.com', which isn't unreasonable. If you really want a contains-like match, add another % after the .com.
Edit: The IIF function is only available in SQL Server 2012 and later. For a more portable solution, use CASE WHEN Email LIKE '%@horses.com' THEN 0 ELSE 1 END.
SELECT TOP 10000 *
FROM dbo.Emails
ORDER BY case when charindex('@horses.com', email) > 0
              then 1
              else 2
         end,
         email
SELECT 1,* FROM dbo.Emails where namn like '%@horses.com%'
union
SELECT 2,* FROM dbo.Emails where namn not like '%@horses.com%'
order by 1
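
The portable CASE form of the sort key can be tried out quickly. This is a minimal sketch with Python's built-in sqlite3 module, so it uses LIMIT where SQL Server uses TOP and drops the dbo schema prefix; the four addresses are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emails (Email TEXT)")
# Hypothetical addresses, two of them in the horses.com domain.
conn.executemany(
    "INSERT INTO Emails VALUES (?)",
    [("zoe@example.com",), ("bob@horses.com",),
     ("amy@example.com",), ("al@horses.com",)],
)

ordered = [row[0] for row in conn.execute(
    """
    SELECT Email
    FROM Emails
    ORDER BY CASE WHEN Email LIKE '%@horses.com' THEN 0 ELSE 1 END,  -- domain first
             Email                                                   -- then A to Z
    LIMIT 10000
    """
)]

print(ordered)
```

The CASE expression yields 0 for matching rows and 1 for the rest, so the matching rows sort to the top regardless of how many there are.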

combine results from different selects

I have one table that contains the fields "ID", "mailSent" and "serviceUsed". "mailSent" contains the time when a mail was sent, and "serviceUsed" contains a counter that says whether the user has used the service for the particular mail that I sent.
I am trying to do a report that gives me back for each ID the following two facts:
1. The last time when a user has used the service, i.e., the time when for a particular user serviceUsed != 0
2. The total number of times a user has used the service, i.e., sum(serviceUsed) for each user
I would like to display this in one view and map the result always to the particular user. I can build each of the two queries separately but do not know how to combine it into one view. The two queries look as follows:
1. Select ID, max(mailSent) from Mails where serviceUsed > 0 group by ID
2. Select ID, sum(serviceUsed) from Mails group by ID
Notice that I cannot just combine them both because I also want to show the IDs that have never used my service, i.e., where serviceUsed = 0. Hence, if I just eliminate the where clause in my first query, then I will get wrong results for max(mailSent). Any idea how I can combine both?
In other words what I want is then something like this:
ID, max(mailSent), sum(serviceUsed)
where max(mailSent) is from the first query and sum(serviceUsed) from the second query.
Regards!
Try like this
SELECT * FROM
(
Select ID, max(mailSent) from Mails where serviceUsed > 0 group by ID
UNION ALL
Select ID, sum(serviceUsed) from Mails group by ID
) AS T
You can write it within one Query:
SELECT ID, sum(serviceUsed), max(mailSent) from Mails group by ID;
The fact that your second query lacks the serviceUsed > 0 condition doesn't matter for the sum: rows with serviceUsed = 0 add nothing to it. Note, however, that max(mailSent) here is taken over all mails, including those where serviceUsed = 0.
If you have the following input:
id serviceUsed mailSent
--------------------------
1 0 1.1.1970
1 4 3.1.1970
1 3 4.1.1970
2 0 2.1.1970
The Query should return this result:
id serviceUsed mailSent
--------------------------
1 7 4.1.1970
2 0 2.1.1970
But I wonder where your primary key is.
You want to do this with conditional aggregation:
select ID, max(case when serviceUsed > 0 then mailSent end),
sum(serviceUsed)
from Mails
group by ID;
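
To see how the conditional aggregate differs from a plain max(mailSent), here is a small sketch using Python's built-in sqlite3 module. The rows are invented, dates are stored as ISO text so that MAX compares them correctly, and the column aliases lastUsed, lastMail and totalUsed are names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Mails (ID INTEGER, serviceUsed INTEGER, mailSent TEXT)")
# Hypothetical rows: user 1's latest mail was not used, user 2 never used the service.
conn.executemany(
    "INSERT INTO Mails VALUES (?, ?, ?)",
    [(1, 4, "1970-01-03"), (1, 3, "1970-01-04"), (1, 0, "1970-01-05"),
     (2, 0, "1970-01-02")],
)

report = conn.execute(
    """
    SELECT ID,
           MAX(CASE WHEN serviceUsed > 0 THEN mailSent END) AS lastUsed,  -- used mails only
           MAX(mailSent) AS lastMail,                                     -- all mails
           SUM(serviceUsed) AS totalUsed
    FROM Mails
    GROUP BY ID
    ORDER BY ID
    """
).fetchall()

print(report)
```

The CASE without an ELSE yields NULL for unused mails, and MAX ignores NULLs, so a user who never used the service still gets a row, with lastUsed empty.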

SQL select query for ranking

I have a table called logs that records who input or output data.
Now I want to get statistics on who has the most contributions and rank them.
Columns are
Occur_Time | iUser_id | iUsername | oUser_id | oUsername
--iUser_id is the input person's index from another table that lists the usernames.
--iUsername is the input person's name.
--oUser_id is the index of the person who took the input away.
--oUsername is the name of the person who took the input away.
Now I want to know who has the most input.
My logic:
Example:
User_id is 1, name is One.
Check how many times 1 is repeated on iUser_id = 100 times.
Check how many times 1 is repeated on oUser_id = 10 times.
User_id=1 has contributed 90 times.
Then sort by who has most contribution.
Thank you.
(untested):
SELECT L.iUsername,
       -- inputs minus outputs for this user
       ((SELECT COUNT(1) FROM logs WHERE iUsername = L.iUsername) -
        (SELECT COUNT(1) FROM logs WHERE oUsername = L.iUsername)) AS contribution
FROM logs L
GROUP BY L.iUsername
ORDER BY contribution DESC
The RANK function is probably what you are looking for.
http://msdn.microsoft.com/en-us/library/ms176102.aspx
Try this query:
Select user_id, count(user_id) from tablename group by user_id;
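
The "inputs minus outputs" logic from the question can also be written without correlated subqueries, by stacking the two ID columns with +1 and -1 and summing per user. This is a swapped-in technique, not the method from the answers above. A minimal sketch with Python's built-in sqlite3 module follows; the rows are invented, only the two ID columns of logs are modelled, and the names delta, moves and contribution are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (iUser_id INTEGER, oUser_id INTEGER)")
# Hypothetical rows: (who put the item in, who took it away).
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [(1, 2), (1, 2), (1, 3), (3, 2), (2, 1)],
)

ranking = conn.execute(
    """
    SELECT user_id, SUM(delta) AS contribution
    FROM (
        SELECT iUser_id AS user_id,  1 AS delta FROM logs   -- each input counts +1
        UNION ALL
        SELECT oUser_id AS user_id, -1 AS delta FROM logs   -- each removal counts -1
    ) AS moves
    GROUP BY user_id
    ORDER BY contribution DESC, user_id
    """
).fetchall()

print(ranking)
```

Each tuple is (user_id, net contribution), with the largest contributor first. Unlike a query that starts from iUsername only, this also lists users who only ever took inputs away.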