SQL Query: Combining group by with different record entries - sql

Given the following table and records:
location interaction session
us 5 xyz
us 10 xyz
us 20 xyz
us 5 qrs
us 10 qrs
us 20 qrs
de 5 abc
de 10 abc
de 20 abc
fr 5 mno
fr 10 mno
I'm trying to create a query that will get a count of locations for all sessions that have interactions of 5 and 10, but NOT 20.
So assuming the above, the query will return
count location
2 us
1 de
FR will not be in the results, because session 'mno' did not have an interaction of 20.
As far as I can tell, I need to group by session first, then group by location afterwards. Which might mean using a nested select statement. I've tried a few things, but am not sure how to proceed. Would appreciate any help on how to approach a query like this.

Maybe your example is wrong: if a session must have interactions 5 and 10 but not 20, then only 'fr' should be in the result, because every session in 'us' and 'de' has an interaction of 20.
SQL fiddle here.
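with t as
(
    -- one array of interactions per (location, session)
    select location, array_agg(interaction) arr
    from the_table
    group by location, session
)
select count(*) cnt, location
from t
-- @> is the PostgreSQL array "contains" operator
where arr @> array[5,10] and not arr @> array[20]
group by location;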
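The same filter can be sketched without PostgreSQL arrays. The example below is only an illustration under stated assumptions: it swaps array containment for conditional aggregation in a HAVING clause and runs it on SQLite through Python's sqlite3 module; the function name count_sessions is made up for this sketch.

```python
import sqlite3

# Sample rows from the question: (location, interaction, session)
ROWS = [
    ("us", 5, "xyz"), ("us", 10, "xyz"), ("us", 20, "xyz"),
    ("us", 5, "qrs"), ("us", 10, "qrs"), ("us", 20, "qrs"),
    ("de", 5, "abc"), ("de", 10, "abc"), ("de", 20, "abc"),
    ("fr", 5, "mno"), ("fr", 10, "mno"),
]

# Step 1: keep sessions that have 5 and 10 but no 20.
# Step 2: count the surviving sessions per location.
# In SQLite a comparison evaluates to 0 or 1, so SUM(...) counts matches.
QUERY = """
WITH qualifying AS (
    SELECT location, session
    FROM the_table
    GROUP BY location, session
    HAVING SUM(interaction = 5) > 0
       AND SUM(interaction = 10) > 0
       AND SUM(interaction = 20) = 0
)
SELECT COUNT(*) AS cnt, location
FROM qualifying
GROUP BY location
ORDER BY location
"""


def count_sessions(rows):
    """Hypothetical helper: load rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE the_table (location TEXT, interaction INTEGER, session TEXT)"
    )
    conn.executemany("INSERT INTO the_table VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(count_sessions(ROWS))  # [(1, 'fr')]
```

With the sample data only session 'mno' qualifies, which matches the point made in the answer: every 'us' and 'de' session contains an interaction of 20.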

Related

Multiple group by keeping their orders

Let's consider this example :
Clients Routes City Timestamp
1 10 NY 0
1 11 NY 10
1 12 WDC 11
1 13 NY 20
2 22 LA 15
What I want as an output is something like this :
Clients Routes_number City min(Timestamp)
1 2 NY 0
1 1 WDC 11
1 1 NY 20
2 1 LA 15
The idea here is that I have to do multiple group by that kept their orders. For example, if we see the cities for Client 1, we can understand that he travelled from NY -> WDC -> NY (in the same day). So the idea is like to do a group by that counts the routes_number and the minimum timestamp but it will stop EACH TIME it finds a new city. If I do a global group by I will get something like this :
Clients Routes_number City min(Timestamp)
1 3 NY 0
1 1 WDC 11
2 1 LA 15
With an output like this, we lose the information that the client went NY -> WDC and then back to NY. It looks as if he only travelled NY -> WDC once.
I don't even know if it's possible to do such a request in SQL or if I have to do it in my code (I am a newbie in Spark & Scala, but Scala is the language that I use).
Thank you !
This is a gaps-and-islands problem that can be solved with a difference of row numbers:
select client, city, count(*), min(timestamp)
from (select t.*,
             row_number() over (partition by client, city order by timestamp) as seqnum_1,
             row_number() over (partition by client order by timestamp) as seqnum_2
      from t
     ) t
group by client, city, (seqnum_2 - seqnum_1);
Here is a db<>fiddle.
It can be tricky to see how the difference of row numbers works to identify adjacent rows with the same city value. If you look at the results of the subquery, you'll get a good idea on how it works.
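To make the subquery results concrete, here is a small sketch that runs both the inner and the outer query on the sample data. It assumes SQLite 3.25 or later (for window functions) via Python's sqlite3 module; the column name ts and the helper names are chosen for this illustration only.

```python
import sqlite3

# Sample rows from the question: (client, route, city, ts)
ROWS = [
    (1, 10, "NY", 0),
    (1, 11, "NY", 10),
    (1, 12, "WDC", 11),
    (1, 13, "NY", 20),
    (2, 22, "LA", 15),
]

# Inner query: two row numbers per row and their difference.
# Consecutive rows with the same city share the same difference.
NUMBERED = """
SELECT client, city, ts,
       ROW_NUMBER() OVER (PARTITION BY client, city ORDER BY ts) AS seqnum_1,
       ROW_NUMBER() OVER (PARTITION BY client ORDER BY ts) AS seqnum_2
FROM t
"""

# Outer query: group by the difference to get one row per "island".
ISLANDS = f"""
SELECT client, city, COUNT(*) AS routes_number, MIN(ts) AS first_ts
FROM ({NUMBERED}) AS numbered
GROUP BY client, city, (seqnum_2 - seqnum_1)
ORDER BY client, first_ts
"""


def run(query, rows):
    """Hypothetical helper: load rows into an in-memory table and run query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (client INTEGER, route INTEGER, city TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Show (client, city, ts, seqnum_1, seqnum_2, difference) in travel order.
for client, city, ts, s1, s2 in sorted(run(NUMBERED, ROWS), key=lambda r: (r[0], r[2])):
    print(client, city, ts, s1, s2, s2 - s1)

print(run(ISLANDS, ROWS))
# [(1, 'NY', 2, 0), (1, 'WDC', 1, 11), (1, 'NY', 1, 20), (2, 'LA', 1, 15)]
```

For client 1 the differences come out as 0, 0, 2, 1: the first two NY rows share 0, while the later NY row gets 1, so it lands in its own group.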

Identifying Records Where a String Appears More Than Once

I have the following dataset:
ID Medication Dose
1 Aspirin 4
1 Tylenol 7
1 Aspirin 2
1 Ibuprofen 1
2 Aspirin 6
2 Aspirin 2
2 Ibuprofen 6
2 Tylenol 4
3 Tylenol 3
3 Tylenol 7
3 Tylenol 2
I would like to write code that identifies patients who have been administered a medication more than once. So for example, ID 1 had Aspirin twice, ID 2 had Aspirin twice and ID 3 had Tylenol three times.
I could be wrong but I think the easiest way to do this would be to concatenate each ID based on Medication using a code similar to the one below; but I'm not quite sure what to do after that - is it possible to count if a string appears twice within a cell?
SELECT DISTINCT ST2.[ID],
SUBSTRING(
(
SELECT ','+ST1.Medication AS [text()]
FROM ED_NOTES_MASTER ST1
WHERE ST1.[ID] = ST2.[ID]
Order BY [ID]
FOR XML PATH ('')
), 1, 200000) [Result]
FROM ED_NOTES_MASTER ST2
I would like the output to look like the following:
ID MEDICATION Aspirin2x Tylenol2x Ibuprofen2x
1 Aspirin, Tylenol , Aspirin YES NO NO
2 Ibuprofen, Aspirin, Aspirin YES NO NO
3 Tylenol, Tylenol ,Tylenol NO YES NO
For the first part of your question (identify patients that have had a particular medication more than once), you can do this using GROUP BY to group by the ID and medication, and then using COUNT to get how many times each medication was given to each patient. For example:
SELECT ID, Medication, COUNT(*) AS amount
FROM ED_NOTES_MASTER
GROUP BY ID, Medication
This will give you a list of all ID - Medication combinations that appear in the table and a count of how many times each combo appears. To limit these results to just those that appear at least twice, you can add a condition on the count using HAVING. SQL Server does not allow the column alias in HAVING, so the aggregate is repeated:
SELECT ID, Medication, COUNT(*) AS amount
FROM ED_NOTES_MASTER
GROUP BY ID, Medication
HAVING COUNT(*) >= 2
The problem now is formatting the results in the way you want. What you will get from the query above is a list of all patient - medication combinations that came up in the table more than once, like this:
ID | Medication | Count
------+---------------+-------
1 | Aspirin | 2
2 | Aspirin | 2
3 | Tylenol | 3
I'd suggest that you try to work with this format if possible, because as you have found, getting multiple values returned as a comma-delimited list, as in your Medication column, requires some hacks (although SQL Server 2017 and later provide STRING_AGG for proper group concatenation). If you really need the Aspirin2x etc. columns, take a look at the PIVOT operation in SQL Server.
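As a quick check of the GROUP BY / HAVING approach, here is a minimal sketch on the sample data. It runs on SQLite through Python's sqlite3 module rather than SQL Server, and the function name repeated_medications is an assumption made for this example.

```python
import sqlite3

# Sample rows from the question: (ID, Medication, Dose)
ROWS = [
    (1, "Aspirin", 4), (1, "Tylenol", 7), (1, "Aspirin", 2), (1, "Ibuprofen", 1),
    (2, "Aspirin", 6), (2, "Aspirin", 2), (2, "Ibuprofen", 6), (2, "Tylenol", 4),
    (3, "Tylenol", 3), (3, "Tylenol", 7), (3, "Tylenol", 2),
]

# Count each (ID, Medication) pair and keep pairs seen at least twice.
QUERY = """
SELECT ID, Medication, COUNT(*) AS amount
FROM ED_NOTES_MASTER
GROUP BY ID, Medication
HAVING COUNT(*) >= 2
ORDER BY ID, Medication
"""


def repeated_medications(rows):
    """Hypothetical helper: load rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ED_NOTES_MASTER (ID INTEGER, Medication TEXT, Dose INTEGER)")
    conn.executemany("INSERT INTO ED_NOTES_MASTER VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(repeated_medications(ROWS))
# [(1, 'Aspirin', 2), (2, 'Aspirin', 2), (3, 'Tylenol', 3)]
```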

SQL query to get only rows match the condition based on two separated columns under one 'group by'

The simple SELECT query would return the data as below:
Select ID, User, Country, TimeLogged from Data
ID User Country TimeLogged
1 Samantha SCO 10
1 John UK 5
1 Andrew NZL 15
2 John UK 20
3 Mark UK 10
3 Mark UK 20
3 Steven UK 10
3 Andrew NZL 15
3 Sharon IRL 5
4 Andrew NZL 25
4 Michael AUS 5
5 Jessica USA 30
I would like to return the sum of time logged for each user, grouped by ID,
but only for the ID numbers whose rows include both Country = UK and User = Andrew.
So the output in the above example would be
ID User Country TimeLogged
1 John UK 5
1 Andrew NZL 15
3 Mark UK 30
3 Steven UK 10
3 Andrew NZL 15
First you need to identify which IDs you're going to be returning
SELECT ID FROM MyTable WHERE Country='UK'
INTERSECT
SELECT ID FROM MyTable WHERE [User]='Andrew';
and based on that, you can then filter to aggregate the expected rows.
SELECT ID,
       [User],
       Country,
       SUM(Timelogged) as Timelogged
FROM mytable
WHERE (Country='UK' OR [User]='Andrew')
  AND ID IN (SELECT ID FROM MyTable WHERE Country='UK'
             INTERSECT
             SELECT ID FROM MyTable WHERE [User]='Andrew')
GROUP BY ID, [User], country;
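The two steps (pick the IDs with INTERSECT, then aggregate) can be tried on the sample data. This is a sketch only: it uses SQLite through Python's sqlite3 module, and the table name time_log, the column name user_name, and the function name are placeholders chosen for this example.

```python
import sqlite3

# Sample rows from the question: (ID, User, Country, TimeLogged)
ROWS = [
    (1, "Samantha", "SCO", 10), (1, "John", "UK", 5), (1, "Andrew", "NZL", 15),
    (2, "John", "UK", 20),
    (3, "Mark", "UK", 10), (3, "Mark", "UK", 20), (3, "Steven", "UK", 10),
    (3, "Andrew", "NZL", 15), (3, "Sharon", "IRL", 5),
    (4, "Andrew", "NZL", 25), (4, "Michael", "AUS", 5),
    (5, "Jessica", "USA", 30),
]

# The INTERSECT subquery keeps IDs that have a UK row AND an Andrew row;
# the outer WHERE keeps only the UK and Andrew rows of those IDs.
QUERY = """
SELECT id, user_name, country, SUM(time_logged) AS time_logged
FROM time_log
WHERE (country = 'UK' OR user_name = 'Andrew')
  AND id IN (SELECT id FROM time_log WHERE country = 'UK'
             INTERSECT
             SELECT id FROM time_log WHERE user_name = 'Andrew')
GROUP BY id, user_name, country
ORDER BY id, user_name
"""


def logged_time(rows):
    """Hypothetical helper: load rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE time_log (id INTEGER, user_name TEXT, country TEXT, time_logged INTEGER)"
    )
    conn.executemany("INSERT INTO time_log VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in logged_time(ROWS):
    print(row)
```

IDs 2 and 4 drop out because each has only one of the two required values, which is the difference between this filter and a plain OR condition.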
So, you have described what you need almost perfectly, but not quite. Your result table indicates that you want Country = UK OR User = Andrew, rather than AND.
You need to select and group by, then include a WHERE clause:
Select ID, [User], Country, SUM(Timelogged) as Timelogged from mytable
WHERE Country='UK' OR [User]='Andrew'
Group by ID, [User], Country

Hive sql: count and avg

I've recently started learning Hive and I have a problem with a SQL query.
I have a JSON file with some information. I want to get the average for each record. It is easier to show with an example:
country times
USA 1
USA 1
USA 1
ES 1
ES 1
ENG 1
FR 1
Then with the following query:
select country, count(*) from data group by country;
I obtain:
country times
USA 3
ES 2
ENG 1
FR 1
Then I should get the following output:
country avg
USA 0,42 (3/7)
ES 0,28 (2/7)
ENG 0,14 (1/7)
FR 0,14 (1/7)
I don't know how to obtain this output from the first table.
I tried:
select t1.country, avg(t1.tm),
from (
select country,count(*)as tm from data where not country is null group by country
) t1
group by t1.country;
but my output is wrong.
Thanks for help!! BR.
Divide each group's count by the total count to get the result. Use a subquery to find the total number of records in your table.
Try this:
select t1.country, count(*)/IFNULL((select cast(count(*) as float) from data),0)
from data t1
group by t1.country;
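The count-over-total idea can be checked on the sample data. The sketch below runs on SQLite through Python's sqlite3 module, not on Hive, so the exact functions differ from the query above; multiplying by 1.0 stands in for the float cast, and the function name country_shares is an assumption of this example.

```python
import sqlite3

# Sample rows from the question: one row per record, keyed by country
COUNTRIES = ["USA", "USA", "USA", "ES", "ES", "ENG", "FR"]

# Per-country count divided by the total row count.
# Multiplying by 1.0 forces floating-point division in SQLite.
QUERY = """
SELECT country, COUNT(*) * 1.0 / (SELECT COUNT(*) FROM data) AS share
FROM data
WHERE country IS NOT NULL
GROUP BY country
ORDER BY share DESC, country
"""


def country_shares(countries):
    """Hypothetical helper: load countries into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (country TEXT, times INTEGER)")
    conn.executemany("INSERT INTO data VALUES (?, 1)", [(c,) for c in countries])
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for country, share in country_shares(COUNTRIES):
    print(country, round(share, 2))
```

The shares sum to 1 because every row belongs to exactly one country group.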

Coalescing values in a column over a partition

I have chosen to ask this question via an example as I think it most clearly illustrates what I am trying to do.
Say I have the following table:
member number time
------ ----- -----
1 2 19:21
1 4 19:24
1 27 19:37
2 4 19:01
2 7 21:56
2 8 22:00
2 21 22:01
How can I obtain the following column?
member number new column
------ ----- ---------
1 2 2.4.27
1 4 2.4.27
1 27 2.4.27
2 4 4.7.8.21
2 7 4.7.8.21
2 8 4.7.8.21
2 21 4.7.8.21
EDIT(S):
I am using DB2 SQL.
There is not necessarily the same number of rows for each member.
The order is determined by time say.
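Depending on your version of DB2, the LISTAGG() function may be available to you. I think it is included in any DB2 version after 9.7.
Example:
-- returns one row per member; join back to tablename on member
-- if the list is needed on every row
select
    member,
    listagg(number, '.') within group (order by time) as new_column
from
    tablename
group by
    member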
In Oracle this will do the job:
select a.member, a.number, b.newcol
from table a,
     (select member, replace(wm_concat(number), ',', '.') newcol
      from test11
      group by member) b
where a.member = b.member;
I know it's bad form answering your own question but I have found this useful page:
https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/aggregating_strings42?lang=en
Modifying the code there gives:
create table test (member int, number int, time_stamp time);
insert into test values
(1,2,'19:21'),
(1,4,'19:24'),
(1,27,'19:37'),
(2,4,'19:01'),
(2,7,'21:56'),
(2,8,'22:00'),
(2,21,'22:01');
select
    member,
    substr(xmlcast(xmlgroup('.' || number as a order by time_stamp) as varchar(60)), 2)
from test
group by member
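For comparison, the per-row output asked for in the question can be sketched with a window aggregate. This is an illustration under assumptions, not DB2 code: it uses SQLite's group_concat as a window function (SQLite 3.25 or later) through Python's sqlite3 module, and the function name with_member_list is made up here.

```python
import sqlite3

# Sample rows from the question: (member, number, time_stamp)
ROWS = [
    (1, 2, "19:21"), (1, 4, "19:24"), (1, 27, "19:37"),
    (2, 4, "19:01"), (2, 7, "21:56"), (2, 8, "22:00"), (2, 21, "22:01"),
]

# group_concat over the whole partition, ordered by time, repeats the
# full dot-separated list on every row of the member.
QUERY = """
SELECT member, number,
       group_concat(number, '.') OVER (
           PARTITION BY member
           ORDER BY time_stamp
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS new_column
FROM test
ORDER BY member, time_stamp
"""


def with_member_list(rows):
    """Hypothetical helper: load rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (member INTEGER, number INTEGER, time_stamp TEXT)")
    conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in with_member_list(ROWS):
    print(row)
```

The explicit frame (unbounded preceding to unbounded following) matters: with the default frame, each row would only see the numbers up to its own time.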