Using SQL to detect sex change

I have a dataset like this:
id   year  sex
111  2010  M
111  2011  M
111  2012  F
112  2010  F
112  2011  F
113  2010  M
113  2015  M
In this dataset, id 111 had a sex change (a switch from M to F, or from F to M).
Using PostgreSQL, I am trying to find out:
A: How many ids stay as man (and which ids)
B: How many ids stay as woman (and which ids)
C: How many ids go from man to woman (and which ids)
D: How many ids go from woman to man (and which ids)
I tried this:
-- problem A
SELECT COUNT(DISTINCT ID) FROM table WHERE ID NOT IN (SELECT ID FROM table WHERE SEX = 'M');
SELECT DISTINCT ID FROM table WHERE ID NOT IN (SELECT ID FROM table WHERE SEX = 'M');
-- problem B
SELECT COUNT(DISTINCT ID) FROM table WHERE ID NOT IN (SELECT ID FROM table WHERE SEX = 'F');
SELECT DISTINCT ID FROM table WHERE ID NOT IN (SELECT ID FROM table WHERE SEX = 'F');
-- all sex changes
SELECT COUNT(DISTINCT ID) FROM table WHERE ID IN (SELECT ID FROM table WHERE SEX = 'M') AND ID IN (SELECT ID FROM table WHERE SEX = 'F');
SELECT DISTINCT ID FROM table WHERE ID IN (SELECT ID FROM table WHERE SEX = 'M') AND ID IN (SELECT ID FROM table WHERE SEX = 'F');
Is this correct, or is a window function such as LAG needed?

You can first compute some per-id metrics:
SELECT *
,MAX(CASE WHEN sex = 'M' THEN 1 ELSE 0 END) OVER (PARTITION BY ID) AS has_M
,MAX(CASE WHEN sex = 'F' THEN 1 ELSE 0 END) OVER (PARTITION BY ID) AS has_F
,DENSE_RANK() OVER (PARTITION BY id ORDER BY id, year) AS initial_sex
FROM mytable;
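and then use them to answer your questions (a count and an id list for each of A to D):
SELECT SUM(CASE WHEN initial_sex = 1 AND sex = 'M' AND has_F = 0 THEN 1 ELSE 0 END) AS stay_m_count
,string_agg(CASE WHEN initial_sex = 1 AND sex = 'M' AND has_F = 0 THEN CAST(id AS VARCHAR(12)) END, ', ') AS stay_m_ids
,SUM(CASE WHEN initial_sex = 1 AND sex = 'F' AND has_M = 0 THEN 1 ELSE 0 END) AS stay_f_count
,string_agg(CASE WHEN initial_sex = 1 AND sex = 'F' AND has_M = 0 THEN CAST(id AS VARCHAR(12)) END, ', ') AS stay_f_ids
,SUM(CASE WHEN initial_sex = 1 AND sex = 'M' AND has_F = 1 THEN 1 ELSE 0 END) AS m_to_f_count
,string_agg(CASE WHEN initial_sex = 1 AND sex = 'M' AND has_F = 1 THEN CAST(id AS VARCHAR(12)) END, ', ') AS m_to_f_ids
,SUM(CASE WHEN initial_sex = 1 AND sex = 'F' AND has_M = 1 THEN 1 ELSE 0 END) AS f_to_m_count
,string_agg(CASE WHEN initial_sex = 1 AND sex = 'F' AND has_M = 1 THEN CAST(id AS VARCHAR(12)) END, ', ') AS f_to_m_ids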
FROM
(
SELECT *
,MAX(CASE WHEN sex = 'M' THEN 1 ELSE 0 END) OVER (PARTITION BY ID) AS has_M
,MAX(CASE WHEN sex = 'F' THEN 1 ELSE 0 END) OVER (PARTITION BY ID) AS has_F
,DENSE_RANK() OVER (PARTITION BY id ORDER BY id, year) AS initial_sex
FROM mytable
) DS;

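Assuming column SEX only ever contains 'F' or 'M', problem A can be solved by grouping per id and requiring that every row is 'M' (a plain WHERE SEX != 'F' would also return ids that changed):
-- problem A
SELECT COUNT(*) FROM (SELECT ID FROM mytable GROUP BY ID HAVING MIN(SEX) = 'M' AND MAX(SEX) = 'M') s;
SELECT ID FROM mytable GROUP BY ID HAVING MIN(SEX) = 'M' AND MAX(SEX) = 'M';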
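To decide whether an id never changes, all rows of that id have to be inspected together, not row by row. Here is a minimal sketch that compares a row-level filter with a per-id GROUP BY ... HAVING check. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table name mytable and the sample rows come from the question, the helper ids() is made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, year INTEGER, sex TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(111, 2010, "M"), (111, 2011, "M"), (111, 2012, "F"),
     (112, 2010, "F"), (112, 2011, "F"),
     (113, 2010, "M"), (113, 2015, "M")],
)


def ids(sql):
    """Hypothetical helper: run a query returning one id column, return sorted ids."""
    return sorted(row[0] for row in conn.execute(sql))


# Row-level filter: id 111 slips through because it also has 'M' rows.
row_filter = ids("SELECT DISTINCT id FROM mytable WHERE sex <> 'F'")

# Per-id check: MIN and MAX agree only when every row of the id has the same value.
stay_m = ids("SELECT id FROM mytable GROUP BY id HAVING MIN(sex) = 'M' AND MAX(sex) = 'M'")
stay_f = ids("SELECT id FROM mytable GROUP BY id HAVING MIN(sex) = 'F' AND MAX(sex) = 'F'")
changed = ids("SELECT id FROM mytable GROUP BY id HAVING MIN(sex) <> MAX(sex)")

print(row_filter, stay_m, stay_f, changed)
```

This tells you who changed, but not the direction of the change; for that the year ordering is needed.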

Assuming the change happens only once, you could use the first_value() window function:
SELECT DISTINCT -- 5
id,
CASE
WHEN first_sex = last_sex THEN 'Stay ' || first_sex -- 3
ELSE 'Change from ' || first_sex || ' To ' || last_sex -- 4
END sex_status
FROM (
SELECT
id,
sex,
first_value(sex) OVER (PARTITION BY id ORDER BY year) as first_sex, -- 1
first_value(sex) OVER (PARTITION BY id ORDER BY year DESC) as last_sex -- 2
FROM mytable
) s
1. Fetch the first sex value per id, ordered by year.
2. Fetch the last sex value per id, ordered by year. (Notice the different order: it gives the first value from the "bottom".)
3. Compare first and last; if they are the same, return "Stay" and the sex.
4. Otherwise return "Change" with both sexes. (Of course, you can do whatever you want here. Adding appropriate status identifiers or similar instead of pure text seems to make sense at this point.)
5. The DISTINCT clause reduces the records to one per id.
Afterwards you can compute whatever statistics you want, for example counting the ids per status with GROUP BY sex_status:
SELECT
sex_status,
COUNT(*)
FROM (
-- query from above
) s
GROUP BY sex_status
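To try both steps end to end, here is a small sketch on the sample data from the question. It runs the query through Python's built-in sqlite3 module, which is only a stand-in for PostgreSQL (window functions need SQLite 3.25 or newer), and it assumes, like the answer, that the sex changes at most once per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, year INTEGER, sex TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(111, 2010, "M"), (111, 2011, "M"), (111, 2012, "F"),
     (112, 2010, "F"), (112, 2011, "F"),
     (113, 2010, "M"), (113, 2015, "M")],
)

# One status row per id, derived from the first and last sex ordered by year.
STATUS_SQL = """
SELECT DISTINCT
       id,
       CASE WHEN first_sex = last_sex THEN 'Stay ' || first_sex
            ELSE 'Change from ' || first_sex || ' to ' || last_sex
       END AS sex_status
FROM (
    SELECT id,
           first_value(sex) OVER (PARTITION BY id ORDER BY year)      AS first_sex,
           first_value(sex) OVER (PARTITION BY id ORDER BY year DESC) AS last_sex
    FROM mytable
) AS s
"""

status_by_id = dict(conn.execute(STATUS_SQL).fetchall())

# Statistics on top: number of ids per status.
counts = dict(conn.execute(
    "SELECT sex_status, COUNT(*) FROM (" + STATUS_SQL + ") AS t GROUP BY sex_status"
).fetchall())

print(status_by_id)
print(counts)
```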

Related

How to flatten a SQL statement

I have a case statement
Select customer, group, case when group = 'One' then 'A' else 'B' end as Indicator FROM TABLE1
How do I "flatten" the indicator so that each customer has a single row with one column per indicator type (see Goal Table)?
Current Table:
Customer  Group  Indicator
Joh       One    A
Joh       Two    B
Jane      One    A
Jane      Two    B
Goal Table:
Customer  Indicator1  Indicator2
Joh       A           B
Jane      A           B
Since the values of the indicator column are hard-coded ('A', 'B'), we can use max, as it will yield only one value per customer:
with data_cte(Customer,Group_1,Indicator) as(
select * from values
('Joh','One','A'),
('Joh','Two','B'),
('Jane','One','A'),
('Jane','Two','B')
)select d.customer
,max(case when d.group_1 = 'One' then 'A' end) as indicator1
,max(case when d.group_1 = 'Two' then 'B' end) as indicator2
from data_cte d
group by d.customer;
The form of Pankaj's answer is good if you have fixed groups, but his code has the indicator values hard-coded, so it should look like this:
with data_cte(Customer, Group_1, Indicator) as (
select *
from values
('Joh','One','A'),
('Joh','Two','B'),
('Jane','One','A'),
('Jane','Two','B')
)
select
d.customer
,max(case when d.group_1 = 'One' then d.indicator end) as indicator1
,max(case when d.group_1 = 'Two' then d.indicator end) as indicator2
from data_cte as d
group by 1;
The CASE in the MAX can be swapped for an IFF in the form
MAX(IFF(d.group_1 = 'One', d.indicator, null)) as indicator1
This works because MAX takes the largest value and ignores nulls: if you only have one matching group_1 per customer, the other rows yield null, and the wanted value is taken.
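To see the MAX-over-CASE pivot and its null handling in action, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Snowflake (so CASE instead of IFF); the extra customer 'Ann' with only one group is a made-up row to show the null result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (customer TEXT, group_1 TEXT, indicator TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [("Joh", "One", "A"), ("Joh", "Two", "B"),
     ("Jane", "One", "A"), ("Jane", "Two", "B"),
     ("Ann", "One", "C")],  # hypothetical customer without a 'Two' row
)

# Each CASE yields the indicator for one group and NULL otherwise;
# MAX ignores the NULLs, so one value per customer and column survives.
pivot = conn.execute("""
    SELECT customer,
           MAX(CASE WHEN group_1 = 'One' THEN indicator END) AS indicator1,
           MAX(CASE WHEN group_1 = 'Two' THEN indicator END) AS indicator2
    FROM table1
    GROUP BY customer
    ORDER BY customer
""").fetchall()

print(pivot)
```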
If you have many, you will want to rank them somehow and then use FIRST_VALUE with a partition on customer, ordered by something like a date.
Anyway, if you have unknown/dynamic columns, this can be solved using Snowflake Scripting to query the data twice.
create or replace table table1 as
select column1 customer, column2 as _group, column3 as indicator
from values
('Joh',1,'A'),
('Joh',2,'B'),
('Jane',1,'C'),
('Jane',3,'E'),
('Jane',2,'D');
declare
sql string;
res resultset;
c1 cursor for select distinct _group as key from table1 order by key;
begin
sql := 'select customer ';
for record in c1 do
sql := sql || ',max(iff(_group = '|| record.key ||', indicator, null)) as col_' || record.key::text;
end for;
sql := sql || ' from table1 group by 1 order by 1';
res := (execute immediate :sql);
return table (res);
end;
gives:
CUSTOMER  COL_1  COL_2  COL_3
Jane      C      D      E
Joh       A      B      null

SQL works in Athena Engine v1 but not v2

I have a SQL query embedded into a system that has worked successfully until now in Athena with engine version 1. However it fails in engine version 2 and I haven't been able to work out why.
Here is a generalised version of the SQL. It sums the number of people in 3 groups: adults, NY residents and the overlap of the two. (NY adults).
In version 1 this works, but in v2 I get the error "column z.id_field cannot be resolved"
WITH BASE AS (SELECT person_id, age, state
FROM people
WHERE gender = 'male'
)
,group_a as (
SELECT distinct (person_id) as id_field
FROM BASE
WHERE age > 17
),
group_b as (
SELECT distinct (person_id) as id_field
FROM BASE
WHERE state = 'NY'
)
SELECT CASE WHEN z.id_field is null then 'group_b_only' WHEN r.id_field is null then 'group_a_only' ELSE 'Overlap' END as group
, COUNT (coalesce (z.id_field, r.id_field)) as count
FROM group_a AS z FULL OUTER JOIN group_b as r USING (id_field)
GROUP BY 1;
As a note, in any database this would be simpler as an aggregation and probably faster too:
SELECT grp, COUNT(*)
FROM (SELECT person_id,
(CASE WHEN MAX(age) > 17 AND MAX(state) = 'NY' THEN 'Both'
WHEN MAX(age) > 17 THEN 'Age Only'
ELSE 'State Only'
END) as grp
FROM people
WHERE gender = 'male' AND
(age > 17 OR state = 'NY')
GROUP BY person_id
) x
GROUP BY grp;
The above assumes that person_id can be repeated in people. If that is not the case, then this can be simplified to:
SELECT (CASE WHEN age > 17 AND state = 'NY' THEN 'Both'
WHEN age > 17 THEN 'Age Only'
ELSE 'State Only'
END) as grp, COUNT(*)
FROM people
WHERE gender = 'male' AND
(age > 17 OR state = 'NY')
GROUP BY grp;
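As a quick check of the simplified query, here is a sketch on a handful of made-up rows. It uses Python's built-in sqlite3 module rather than Athena, and it assumes person_id is unique in people; the sample people are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (person_id INTEGER PRIMARY KEY, age INTEGER, state TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(1, 30, "NY", "male"),    # adult and NY resident
     (2, 40, "CA", "male"),    # adult only
     (3, 15, "NY", "male"),    # NY resident only
     (4, 16, "CA", "male"),    # neither, filtered out by WHERE
     (5, 25, "NY", "female"),  # filtered out by gender
     (6, 50, "NY", "male")],   # adult and NY resident
)

# One pass over the table: classify each remaining row, then count per class.
group_counts = dict(conn.execute("""
    SELECT CASE WHEN age > 17 AND state = 'NY' THEN 'Both'
                WHEN age > 17 THEN 'Age Only'
                ELSE 'State Only'
           END AS grp,
           COUNT(*)
    FROM people
    WHERE gender = 'male' AND (age > 17 OR state = 'NY')
    GROUP BY grp
"""))

print(group_counts)
```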

SQL Server : split a column with varied data into 3 specific columns grouped by ID

Using SQL Server, I'm trying to split information shared in one column into three based on the number of IDs. Ideally I'd have distinct IDs at the end.
There can be from 1-3 rows per PersonID depending on the information in the contact column.
If a personID appears more than once I'd like to have the data split into two columns, one for phone and one for email.
I'd need to check that the data contained an "#" symbol for it to be put into the Email column, and the rest put into Phone or Alt Phone.
It's pretty hard to explain so if you need any more information please comment.
Hopefully the example below will help:
PersonID  Name   Contact
----------------------------------------
1         Chen   212747
1         Chen   Chen#test.com
2         Pudge  18191
2         Pudge  18182222
2         Pudge  Pudge#test.com
3         Riki   Riki#test.com
3         Riki   19192
4         Lina   18424
I want to convert this into:
PersonID  Name   Phone   Alt Phone  Email
--------------------------------------------------------
1         Chen   212747  NULL       Chen#test.com
2         Pudge  18191   18182222   Pudge#test.com
3         Riki   19192   NULL       Riki#test.com
4         Lina   18424   NULL       NULL
CREATE TABLE #Table
(
PersonID INT ,
Name VARCHAR(100),
Contact VARCHAR(100)
)
INSERT #Table
( PersonID, Name, Contact)
VALUES
(1 ,'Chen','212747'),
(1 ,'Chen','Chen#test.com'),
(2 ,'Pudge','18191'),
(2 ,'Pudge','18182222'),
(2 ,'Pudge','Pudge#test.com'),
(3 ,'Riki','Riki#test.com'),
(3 ,'Riki','19192'),
(4 ,'Lina','18424')
SELECT
xQ.PersonID,
xQ.Name,
MAX(CASE WHEN xQ.IsEmail = 0 AND xQ.RowNumberPhone = 1 THEN xQ.Contact ELSE NULL END) AS Phone,
MAX(CASE WHEN xQ.IsEmail = 0 AND xQ.RowNumberPhone = 2 THEN xQ.Contact ELSE NULL END) AS [Alt Phone],
MAX(CASE WHEN xQ.IsEmail = 1 AND xQ.RowNumberEmail = 1 THEN xQ.Contact ELSE NULL END) AS Email
FROM
(
SELECT *
,CASE WHEN PATINDEX('%#%',T.Contact)>0 THEN 1 ELSE 0 END AS IsEmail
,RANK() OVER(PARTITION BY T.PersonID, CASE WHEN PATINDEX('%#%',T.Contact)=0 THEN 1 ELSE 0 END ORDER BY T.Contact) AS RowNumberPhone
,RANK() OVER(PARTITION BY T.PersonID, CASE WHEN PATINDEX('%#%',T.Contact)>0 THEN 1 ELSE 0 END ORDER BY T.Contact) AS RowNumberEmail
FROM #Table AS T
)AS xQ
GROUP BY
xQ.PersonID,
xQ.Name
ORDER BY xQ.PersonID
You can do it using subqueries:
create table #tbl (PersonID int, Name varchar(50), Contact varchar(100))
insert into #tbl
select 1,'Chen','212747' union
select 1,'Chen','Chen#test.com' union
select 2,'Pudge','18191' union
select 2,'Pudge','18182222' union
select 2,'Pudge','Pudge#test.com' union
select 3,'Riki','Riki#test.com' union
select 3,'Riki','19192' union
select 4,'Lina','18424'
SELECT DISTINCT
M.PersonID
,M.Name
,(SELECT TOP 1 Contact FROM #tbl WHERE PersonID = M.PersonID AND Contact NOT LIKE '%#%' ORDER BY Contact) AS Phone
,(SELECT TOP 1 Contact FROM #tbl WHERE PersonID = M.PersonID AND Contact NOT LIKE '%#%'
AND Contact NOT IN (SELECT TOP 1 Contact FROM #tbl WHERE PersonID = M.PersonID AND Contact NOT LIKE '%#%' ORDER BY Contact)) AS AltPhone
,(SELECT TOP 1 Contact FROM #tbl WHERE PersonID = M.PersonID AND Contact LIKE '%#%') AS Email
FROM #tbl M
Output
1 Chen 212747 NULL Chen#test.com
2 Pudge 18182222 18191 Pudge#test.com
3 Riki 19192 NULL Riki#test.com
4 Lina 18424 NULL NULL
Using row_number() and a GROUP BY on PersonID, you can achieve the same with the query below:
Select PersonID, max(Name) name,
max(case when rn=1 and contact not like '%#%' then contact end) phone,
max(case when rn=2 and contact not like '%#%' then contact end) Alt_Phone,
max(case when contact like '%#%' then contact end) mailid
from(select t.*, row_number() over(partition by personid order by contact) as rn from table t) as t2
group by PersonID
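Here is a runnable sketch of the row-number idea on the sample data. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (window functions need SQLite 3.25 or newer), and the table name contacts is made up. Unlike the query above, it numbers phones and emails separately, so the result does not depend on emails sorting after the phone numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (person_id INTEGER, name TEXT, contact TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [(1, "Chen", "212747"), (1, "Chen", "Chen#test.com"),
     (2, "Pudge", "18191"), (2, "Pudge", "18182222"), (2, "Pudge", "Pudge#test.com"),
     (3, "Riki", "Riki#test.com"), (3, "Riki", "19192"),
     (4, "Lina", "18424")],
)

# Number the contacts per person and per kind (phone vs. email),
# then fold each numbered contact into its own column.
split = conn.execute("""
    SELECT person_id,
           MAX(name) AS name,
           MAX(CASE WHEN is_email = 0 AND rn = 1 THEN contact END) AS phone,
           MAX(CASE WHEN is_email = 0 AND rn = 2 THEN contact END) AS alt_phone,
           MAX(CASE WHEN is_email = 1 AND rn = 1 THEN contact END) AS email
    FROM (
        SELECT person_id, name, contact,
               contact LIKE '%#%' AS is_email,
               ROW_NUMBER() OVER (
                   PARTITION BY person_id, contact LIKE '%#%'
                   ORDER BY contact
               ) AS rn
        FROM contacts
    ) AS ranked
    GROUP BY person_id
    ORDER BY person_id
""").fetchall()

for row in split:
    print(row)
```

Phones are ordered as strings here, which is why '18182222' comes before '18191' for Pudge.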

Calculation of occurrence of strings

I have a table with 3 columns: id, name and vote. It holds many records. I need to return the name with the best balance of votes. The vote types are 'yes' and 'no'.
Yes -> Plus 1
No -> Minus 1
The vote column is a string column. I am using SQL Server.
Example (the sample data was posted as an image): it should return Ann.
Use conditional aggregation to tally the votes, as Kannan suggests in his answer.
If you really only want 1 record, you can do it like so:
SELECT TOP 1
name
,SUM(CASE WHEN vote = 'yes' THEN 1 ELSE -1 END) AS VoteTotal
FROM
#Table
GROUP BY
name
ORDER BY
VoteTotal DESC
This does not handle ties. To handle them, rank the totals as below: filter on RowNum to get exactly one row, or on RankNum to include ties.
;WITH cteVoteTotals AS (
SELECT
name
,SUM(CASE WHEN vote = 'yes' THEN 1 ELSE -1 END) AS VoteTotal
,ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY SUM(CASE WHEN vote = 'yes' THEN 1 ELSE -1 END) DESC) as RowNum
,DENSE_RANK() OVER (PARTITION BY 1 ORDER BY SUM(CASE WHEN vote = 'yes' THEN 1 ELSE -1 END) DESC) as RankNum
FROM
#Table
GROUP BY
name
)
SELECT name, VoteTotal
FROM
cteVoteTotals
WHERE
RowNum = 1
--RankNum = 1 --if you want with ties use this line instead
Here is the test data I used. In the future, please do NOT just post an image of your test data; spend the 2 minutes to make a temp table or a table variable so that the people you are asking for help do not have to!
CREATE TABLE #Table (id INT, name VARCHAR(25), vote VARCHAR(4))
INSERT INTO #Table (id, name, vote)
VALUES (1, 'John','no'),(2, 'John','no'),(3, 'John','yes')
,(4, 'Ann','no'),(5, 'Ann','yes'),(6, 'Ann','yes')
,(9, 'Marie','no'),(8, 'Marie','no'),(7, 'Marie','yes')
,(10, 'Matt','no'),(11, 'Matt','yes'),(12, 'Matt','yes')
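With this test data Ann and Matt both end up at +1, so a plain TOP 1 returns only one of them. The sketch below shows the tie-aware variant using Python's built-in sqlite3 module as a stand-in for SQL Server (window functions need SQLite 3.25 or newer); the table name votes is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (id INTEGER, name TEXT, vote TEXT)")
conn.executemany(
    "INSERT INTO votes VALUES (?, ?, ?)",
    [(1, "John", "no"), (2, "John", "no"), (3, "John", "yes"),
     (4, "Ann", "no"), (5, "Ann", "yes"), (6, "Ann", "yes"),
     (9, "Marie", "no"), (8, "Marie", "no"), (7, "Marie", "yes"),
     (10, "Matt", "no"), (11, "Matt", "yes"), (12, "Matt", "yes")],
)

# Tally +1 / -1 per name, rank the totals, keep everyone sharing the top rank.
winners = conn.execute("""
    SELECT name, vote_total
    FROM (
        SELECT name, vote_total,
               DENSE_RANK() OVER (ORDER BY vote_total DESC) AS rank_num
        FROM (
            SELECT name,
                   SUM(CASE WHEN vote = 'yes' THEN 1 ELSE -1 END) AS vote_total
            FROM votes
            GROUP BY name
        ) AS totals
    ) AS ranked
    WHERE rank_num = 1
    ORDER BY name
""").fetchall()

print(winners)
```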
Use this code:
;with cte as (
select id, name, case when vote = 'yes' then 1 else -1 end as votenum from register
) select name, sum(votenum) from cte group by name
You can then take the maximum or minimum of these totals.
This one gives the 'yes' rate for each person:
SELECT Name, SUM(CASE WHEN Vote = 'Yes' THEN 1.0 ELSE 0 END)/COUNT(*) AS Rate -- 1.0 avoids integer division
FROM My_Table
GROUP BY Name
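One thing to watch with a rate like this is integer division: dividing an integer sum by COUNT(*) truncates to 0 or 1. The sketch below shows the difference using Python's built-in sqlite3 module, which truncates integer division in the same way; it is a stand-in for SQL Server, and the table name votes is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (id INTEGER, name TEXT, vote TEXT)")
conn.executemany(
    "INSERT INTO votes VALUES (?, ?, ?)",
    [(1, "John", "no"), (2, "John", "no"), (3, "John", "yes"),
     (4, "Ann", "no"), (5, "Ann", "yes"), (6, "Ann", "yes")],
)

# Integer sum divided by integer count: the fraction is truncated.
int_rate = dict(conn.execute("""
    SELECT name, SUM(CASE WHEN vote = 'yes' THEN 1 ELSE 0 END) / COUNT(*)
    FROM votes GROUP BY name
"""))

# Summing 1.0 instead of 1 makes the division a floating-point one.
real_rate = dict(conn.execute("""
    SELECT name, SUM(CASE WHEN vote = 'yes' THEN 1.0 ELSE 0 END) / COUNT(*)
    FROM votes GROUP BY name
"""))

print(int_rate)
print(real_rate)
```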

Looping in select query

I want to do something like this:
select id,
count(*) as total,
FOR temp IN SELECT DISTINCT somerow FROM mytable ORDER BY somerow LOOP
sum(case when somerow = temp then 1 else 0 end) temp,
END LOOP;
from mytable
group by id
order by id
I created working select:
select id,
count(*) as total,
sum(case when somerow = 'a' then 1 else 0 end) somerow_a,
sum(case when somerow = 'b' then 1 else 0 end) somerow_b,
sum(case when somerow = 'c' then 1 else 0 end) somerow_c,
sum(case when somerow = 'd' then 1 else 0 end) somerow_d,
sum(case when somerow = 'e' then 1 else 0 end) somerow_e,
sum(case when somerow = 'f' then 1 else 0 end) somerow_f,
sum(case when somerow = 'g' then 1 else 0 end) somerow_g,
sum(case when somerow = 'h' then 1 else 0 end) somerow_h,
sum(case when somerow = 'i' then 1 else 0 end) somerow_i,
sum(case when somerow = 'j' then 1 else 0 end) somerow_j,
sum(case when somerow = 'k' then 1 else 0 end) somerow_k
from mytable
group by id
order by id
This works, but it is static: if a new value is added to somerow, I have to change the SQL manually to cover all values of the column. That is why I'm wondering whether something like a for loop is possible.
So what I want to get is this:
id somerow_a somerow_b ....
0 3 2 ....
1 2 10 ....
2 19 3 ....
. ... ...
. ... ...
. ... ...
So what I'd like to do is count, for each id, the rows that contain a specific letter. (This id isn't a primary key; it repeats, and there are about 80 different id values.)
http://sqlfiddle.com/#!15/18feb/2
Would arrays work for you?
select
id,
sum(totalcol) as total,
array_agg(somecol) as somecol,
array_agg(totalcol) as totalcol
from (
select id, somecol, count(*) as totalcol
from mytable
group by id, somecol
) s
group by id
;
id | total | somecol | totalcol
----+-------+---------+----------
1 | 6 | {b,a,c} | {2,1,3}
2 | 5 | {d,f} | {2,3}
In PostgreSQL 9.2 it is possible to return a set of JSON objects:
select row_to_json(s)
from (
select
id,
sum(totalcol) as total,
array_agg(somecol) as somecol,
array_agg(totalcol) as totalcol
from (
select id, somecol, count(*) as totalcol
from mytable
group by id, somecol
) s
group by id
) s
;
row_to_json
---------------------------------------------------------------
{"id":1,"total":6,"somecol":["b","a","c"],"totalcol":[2,1,3]}
{"id":2,"total":5,"somecol":["d","f"],"totalcol":[2,3]}
In 9.3, with the addition of LATERAL, it can be a single object:
select to_json(format('{%s}', (string_agg(j, ','))))
from (
select format('%s:%s', to_json(id), to_json(c)) as j
from
(
select
id,
sum(totalcol) as total_sum,
array_agg(somecol) as somecol_array,
array_agg(totalcol) as totalcol_array
from (
select id, somecol, count(*) as totalcol
from mytable
group by id, somecol
) s
group by id
) s
cross join lateral
(
select
total_sum as total,
somecol_array as somecol,
totalcol_array as totalcol
) c
) s
;
to_json
---------------------------------------------------------------------------------------------------------------------------------------
"{1:{\"total\":6,\"somecol\":[\"b\",\"a\",\"c\"],\"totalcol\":[2,1,3]},2:{\"total\":5,\"somecol\":[\"d\",\"f\"],\"totalcol\":[2,3]}}"
In 9.2 it is also possible to get a single object, in a more convoluted way, using subqueries instead of LATERAL.
SQL is very rigid about the return type. It demands to know what to return beforehand.
For a completely dynamic number of resulting values, you can only use arrays like @Clodoaldo posted. Effectively a static return type, you do not get individual columns for each value.
If you know the number of columns at call time ("semi-dynamic"), you can create a function taking (and returning) polymorphic parameters. Closely related answer with lots of details:
Dynamic alternative to pivot with CASE and GROUP BY
(You will also find a related answer with arrays from @Clodoaldo there.)
Your remaining option is to use two round-trips to the server: the first to determine the actual query with the actual return type, the second to execute the query built from the first call.
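The two round-trips could look roughly like this on the client side. This is only a sketch: it uses Python's built-in sqlite3 module instead of PostgreSQL, the sample rows are made up, and the column name somecol follows the answers (the question calls it somerow). The values are passed as bound parameters, and only alphanumeric values are accepted for the generated column aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, somecol TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "b"), (2, "a"), (2, "c")],  # hypothetical rows
)

# Round trip 1: find out which values (and therefore which columns) exist.
values = [row[0] for row in conn.execute(
    "SELECT DISTINCT somecol FROM mytable ORDER BY somecol"
)]

# Build the static query text. Values are bound as parameters; the aliases
# are embedded in the SQL, so restrict them to harmless characters.
for value in values:
    if not value.isalnum():
        raise ValueError(f"unsupported column value: {value!r}")
columns = ", ".join(
    f'COUNT(somecol = ? OR NULL) AS "somerow_{value}"' for value in values
)
sql = f"SELECT id, COUNT(*) AS total, {columns} FROM mytable GROUP BY id ORDER BY id"

# Round trip 2: run the generated query.
cursor = conn.execute(sql, values)
header = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(header)
print(rows)
```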
Else, you have to go with a static query. While doing that, I see two nicer options for what you have right now:
1. Simpler expression
select id
, count(*) AS total
, count(somecol = 'a' OR NULL) AS somerow_a
, count(somecol = 'b' OR NULL) AS somerow_b
, ...
from mytable
group by id
order by id;
How does it work?
Compute percents from SUM() in the same SELECT sql query
SQL Fiddle.
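In short, count() only counts non-null values, and under three-valued logic TRUE OR NULL is TRUE while FALSE OR NULL is NULL, so only matching rows are counted. A small sketch to verify this, using Python's built-in sqlite3 module as a stand-in for PostgreSQL (SQLite shows TRUE as 1) with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# TRUE OR NULL -> TRUE, FALSE OR NULL -> NULL, NULL OR NULL -> NULL
truth_table = conn.execute(
    "SELECT (1 = 1) OR NULL, (1 = 0) OR NULL, NULL OR NULL"
).fetchone()

conn.execute("CREATE TABLE mytable (id INTEGER, somecol TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "a"), (2, "b")],  # hypothetical rows
)

# The short form and the SUM(CASE ...) form give the same counts.
comparison = conn.execute("""
    SELECT id,
           COUNT(somecol = 'a' OR NULL)                    AS short_form,
           SUM(CASE WHEN somecol = 'a' THEN 1 ELSE 0 END)  AS case_form
    FROM mytable
    GROUP BY id
    ORDER BY id
""").fetchall()

print(truth_table)
print(comparison)
```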
2. crosstab()
crosstab() is more complex at first, but it is written in C, optimized for the task, and shorter for long lists. You need the additional module tablefunc installed. Read the basics here if you are not familiar with it:
PostgreSQL Crosstab Query
SELECT * FROM crosstab(
$$
SELECT id
, sum(count(*)) OVER (PARTITION BY id)::int AS total -- sum of the per-group counts = rows per id
, somecol
, count(*)::int AS ct -- casting to int, don't think you need bigint?
FROM mytable
GROUP BY 1,3
ORDER BY 1,3
$$
,
$$SELECT unnest('{a,b,c,d}'::text[])$$
) AS f (id int, total int, a int, b int, c int, d int);