I have the following table
cust_id | category | counts
1 | food | 2
1 | pets | 5
3 | pets | 3
I would like to get this output
cust_id | food_count | pets_count
1 | 2 | 5
3 | 0 | 3
where there is one output column for each unique value in the category column. Do you know how that can be done in Presto SQL? If I were doing this in PySpark I would use CountVectorizer, but I'm struggling a bit with SQL.
You can use GROUP BY with a conditional sum, for example using the if function:
-- sample data
WITH dataset (cust_id, category, counts) AS (
VALUES (1, 'food', 2),
(1, 'pets', 5),
(3, 'pets', 3)
)
--query
select cust_id, sum(if(category = 'food', counts, 0)) food_counts, sum(if(category = 'pets', counts, 0)) pets_counts
from dataset
group by cust_id
Output:
cust_id | food_counts | pets_counts
1 | 2 | 5
3 | 0 | 3
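The query above hard-codes one column per category. If the categories are not known in advance, the column list can be generated from the data first. The following is only a sketch under assumptions: it uses Python's built-in sqlite3 module rather than Presto, replaces if(...) with a portable CASE expression, and the helper name pivot_counts is made up for illustration.

```python
import sqlite3

def pivot_counts(conn):
    """Build and run a conditional-aggregation pivot over all categories."""
    # Step 1: discover the distinct categories; each becomes one output column.
    categories = [row[0] for row in conn.execute(
        "SELECT DISTINCT category FROM dataset ORDER BY category")]
    parts, params = [], []
    for cat in categories:
        # Aliases are interpolated into the SQL text, so restrict them to
        # letters, digits and underscores; the values themselves are bound.
        if not cat.replace("_", "").isalnum():
            raise ValueError(f"unsupported category name: {cat!r}")
        parts.append(
            f"SUM(CASE WHEN category = ? THEN counts ELSE 0 END) AS {cat}_count")
        params.append(cat)
    # Step 2: one conditional SUM per category, grouped by customer.
    sql = (f"SELECT cust_id, {', '.join(parts)} "
           "FROM dataset GROUP BY cust_id ORDER BY cust_id")
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (cust_id INTEGER, category TEXT, counts INTEGER)")
conn.executemany("INSERT INTO dataset VALUES (?, ?, ?)",
                 [(1, "food", 2), (1, "pets", 5), (3, "pets", 3)])

columns, rows = pivot_counts(conn)
print(columns)
print(rows)
```

The same two-step idea (read the distinct values, then generate the SUM expressions) should carry over to Presto from any client language, since SQL itself needs the output columns to be fixed when the query is written.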
I am trying to create a field which shows an order based on two columns. I have one column with a code in and one with a date. There are many dates for each code, but I am trying to pick out the latest date for each code. The table below shows the two columns I have and the order column that I need to create.
code date order column
1 10/04/22 3
1 11/04/22 2
1 14/05/22 1
2 10/04/22 2
2 15/04/22 1
3 11/04/22 1
4 12/04/22 2
4 16/04/22 1
5 15/04/22 2
5 17/04/22 1
As Larnu and Sean have already stated, Row_number is your friend here.
Start with the data:
CREATE TABLE #Table (code int, date date)
INSERT INTO #table
VALUES
(1, '04/10/22')
,(1, '04/11/22')
,(1, '05/14/22')
,(2, '04/10/22')
,(2, '04/15/22')
,(3, '04/11/22')
,(4, '04/12/22')
,(4, '04/16/22')
,(5, '04/15/22')
,(5, '04/17/22');
Then we write the query with the row numbers. The magic here is in the partition by/order by. That partitions your data based on the code, so it takes the three 1s and puts the dates in descending order. It then numbers them 1, 2, 3 with the latest date being number 1. Then it does code 2...
SELECT code
, date
, ROW_NUMBER() OVER (PARTITION BY code ORDER BY date desc) rn
FROM #table
GROUP BY code, date
ORDER BY code asc, date asc;
And that gets us the result you asked for:
|code | date | rn |
|:----|:---------|:----|
| 1 |2022-04-10| 3 |
| 1 |2022-04-11| 2 |
| 1 |2022-05-14| 1 |
| 2 |2022-04-10| 2 |
| 2 |2022-04-15| 1 |
| 3 |2022-04-11| 1 |
| 4 |2022-04-12| 2 |
| 4 |2022-04-16| 1 |
| 5 |2022-04-15| 2 |
| 5 |2022-04-17| 1 |
And then if you only want the max date for each code... keep only the ones where row number equals 1.
WITH CTE AS
(SELECT code
, date
, ROW_NUMBER() OVER (PARTITION BY code ORDER BY date desc) rn
FROM #table
)
SELECT code
, date
FROM CTE
WHERE rn = 1
ORDER BY code
| code | date |
|:-----|:---------|
| 1 |2022-05-14|
| 2 |2022-04-15|
| 3 |2022-04-11|
| 4 |2022-04-16|
| 5 |2022-04-17|
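For anyone who wants to try the PARTITION BY / ORDER BY step without a SQL Server instance, here is a small sketch using Python's built-in sqlite3 module. It assumes a SQLite build with window function support (3.25 or later), a regular table named codes instead of the #table temp table, and ISO-formatted date strings so that text ordering matches date ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codes (code INTEGER, date TEXT)")
conn.executemany("INSERT INTO codes VALUES (?, ?)", [
    (1, "2022-04-10"), (1, "2022-04-11"), (1, "2022-05-14"),
    (2, "2022-04-10"), (2, "2022-04-15"),
    (3, "2022-04-11"),
    (4, "2022-04-12"), (4, "2022-04-16"),
    (5, "2022-04-15"), (5, "2022-04-17"),
])

# Number the rows inside each code, newest date first, then keep row 1.
latest = conn.execute("""
    WITH numbered AS (
        SELECT code,
               date,
               ROW_NUMBER() OVER (PARTITION BY code ORDER BY date DESC) AS rn
        FROM codes
    )
    SELECT code, date
    FROM numbered
    WHERE rn = 1
    ORDER BY code
""").fetchall()

for code, date in latest:
    print(code, date)
```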
I have a table like this:
+----+--------------+--------+----------+
| id | name | weight | some_key |
+----+--------------+--------+----------+
| 1 | strawberries | 12 | 1 |
| 2 | blueberries | 7 | 1 |
| 3 | elderberries | 0 | 1 |
| 4 | cranberries | 8 | 2 |
| 5 | raspberries | 18 | 2 |
+----+--------------+--------+----------+
I'm looking for a generic query that would get me all berries where there are three entries with the same 'some_key' and one of those entries (within the three belonging to the same some_key) has weight = 0.
In the case of the sample table, the expected output would be:
1 strawberries
2 blueberries
3 elderberries
As you want to include non-grouped columns, I would approach this with window functions:
select id, name
from (
select id,
name,
count(*) over w as key_count,
count(*) filter (where weight = 0) over w as num_zero_weight
from fruits
window w as (partition by some_key)
) x
where x.key_count = 3
and x.num_zero_weight >= 1
The count(*) over w counts the number of rows in that group (= partition) and the count(*) filter (where weight = 0) over w counts how many of those have a weight of zero.
The window w as ... avoids repeating the same partition by clause for the window functions.
Online example: https://rextester.com/SGWFI49589
Try this-
SELECT some_key,
SUM(weight) --Sample aggregations on column
FROM your_table
GROUP BY some_key
HAVING COUNT(*) = 3 -- If you want at least 3, use >= 3
AND SUM(CASE WHEN weight = 0 THEN 1 ELSE 0 END) >= 1
As per your edited question, you can try this below-
SELECT id, name
FROM your_table
WHERE some_key IN (
SELECT some_key
FROM your_table
GROUP BY some_key
HAVING COUNT(*) = 3 -- If you want at least 3, use >= 3
AND SUM(CASE WHEN weight = 0 THEN 1 ELSE 0 END) >= 1
)
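To check the GROUP BY ... HAVING filter against the sample data from the question, a quick sketch with Python's built-in sqlite3 module could look like this. The table name fruits and the ORDER BY id are assumptions added for the demo; the HAVING conditions are the ones from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fruits (id INTEGER, name TEXT, weight INTEGER, some_key INTEGER)")
conn.executemany("INSERT INTO fruits VALUES (?, ?, ?, ?)", [
    (1, "strawberries", 12, 1),
    (2, "blueberries", 7, 1),
    (3, "elderberries", 0, 1),
    (4, "cranberries", 8, 2),
    (5, "raspberries", 18, 2),
])

# Inner query: keys with exactly three rows, at least one of them weighing 0.
# Outer query: every row that belongs to one of those keys.
matches = conn.execute("""
    SELECT id, name
    FROM fruits
    WHERE some_key IN (
        SELECT some_key
        FROM fruits
        GROUP BY some_key
        HAVING COUNT(*) = 3
           AND SUM(CASE WHEN weight = 0 THEN 1 ELSE 0 END) >= 1
    )
    ORDER BY id
""").fetchall()

print(matches)
```

Only some_key = 1 qualifies here: some_key = 2 has just two rows and no zero weight.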
Try doing this.
Table structure and sample data
CREATE TABLE tmp (
id int,
name varchar(50),
weight int,
some_key int
);
INSERT INTO tmp
VALUES
(1, 'strawberries', 12, 1),
(2, 'blueberries', 7, 1),
(3, 'elderberries', 0, 1),
(4, 'cranberries', 8, 2),
(5, 'raspberries', 18, 2);
Query
SELECT t1.*
FROM tmp t1
INNER JOIN (SELECT some_key
FROM tmp
GROUP BY some_key
HAVING Count(some_key) >= 3
AND Min(Abs(weight)) = 0) t2
ON t1.some_key = t2.some_key;
Output
+-----+---------------+---------+----------+
| id | name | weight | some_key |
+-----+---------------+---------+----------+
| 1 | strawberries | 12 | 1 |
| 2 | blueberries | 7 | 1 |
| 3 | elderberries | 0 | 1 |
+-----+---------------+---------+----------+
Online Demo: http://sqlfiddle.com/#!15/70cca/26/0
Thank you, @mkRabbani, for reminding me about the negative values.
I have a table holding categories with an inner parent child relationship.
The table looks like this:
ID | ParentID | OrderID
---+----------+---------
1 | Null | 1
2 | Null | 2
3 | 2 | 1
4 | 1 | 1
OrderID is the order inside the current level.
I want to create a recursive SQL query to create the natural order of the table.
Meaning the output will be something like:
ID | Order
-----+-------
1 | 100
4 | 101
2 | 200
3 | 201
Appreciate any help.
Thanks
I am not really sure what you mean by "natural order", but the following query generates the results you want for this data:
with t as (
select v.*
from (values (1, NULL, 1), (2, NULL, 2), (3, 2, 1), (4, 1, 1)) v(ID, ParentID, OrderID)
)
select t.*,
(100 * coalesce(tp.orderid, t.orderid) + (case when t.parentid is null then 0 else 1 end)) as natural_order
from t left join
t tp
on t.parentid = tp.id
order by natural_order;
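The join above covers two levels (parents and their direct children). Since the question asked for a recursive query, here is one possible sketch for deeper trees, run through Python's built-in sqlite3 module. It is a different technique from the answer's: instead of computing 100, 101, ... it builds a zero-padded sort path per row and orders by it. The table name categories, the column name sort_path, and the 5-digit padding are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categories (id INTEGER, parentid INTEGER, orderid INTEGER)")
conn.executemany("INSERT INTO categories VALUES (?, ?, ?)",
                 [(1, None, 1), (2, None, 2), (3, 2, 1), (4, 1, 1)])

# Each row's sort_path is its ancestors' OrderIDs joined with '.', zero-padded
# so that string comparison matches numeric order within a level.
ordered = conn.execute("""
    WITH RECURSIVE tree(id, sort_path) AS (
        SELECT id, printf('%05d', orderid)
        FROM categories
        WHERE parentid IS NULL
        UNION ALL
        SELECT c.id, tree.sort_path || '.' || printf('%05d', c.orderid)
        FROM categories c
        JOIN tree ON c.parentid = tree.id
    )
    SELECT id, sort_path
    FROM tree
    ORDER BY sort_path
""").fetchall()

ordered_ids = [row[0] for row in ordered]
print(ordered_ids)
```

A parent's path is a prefix of its children's paths, so each parent sorts directly before its own subtree.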
Here is my first select:
SELECT
user.id, analytic_youtube_demographic.age,
analytic_youtube_demographic.percent
FROM
`user`
INNER JOIN
analytic ON analytic.user_id = user.id
INNER JOIN
analytic_youtube_demographic ON analytic_youtube_demographic.analytic_id = analytic.id
Result:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |13-17| 19.6 |
| 1 |18-24| 38.4 |
| 1 |25-34| 22.5 |
| 1 |35-44| 11.5 |
| 1 |45-54| 5.3 |
| 1 |55-64| 1.6 |
| 1 |65+ | 1.2 |
| 2 |13-17| 10 |
| 2 |18-24| 10 |
| 2 |25-34| 25 |
| 2 |35-44| 5 |
| 2 |45-54| 25 |
| 2 |55-64| 5 |
| 2 |65+ | 20 |
---------------------------
The max value by user_id:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |18-24| 38.4 |
| 2 |45-54| 25 |
| 2 |25-34| 25 |
---------------------------
And I need to filter on Age in ['25-34', '65+'].
At the end I must have:
-----------
| id |
|----------
| 2 |
-----------
I have tried to use MAX(analytic_youtube_demographic.percent), but I don't know how to filter on the age as well.
Thanks a lot for your help.
You can use the rank() function to identify the largest percentage values within each user's data set, and then a simple WHERE clause to get those entries that are both of the highest rank and belong to one of the specific demographics you're interested in. Since you can't use windowed functions like rank() in a WHERE clause, this is a two-step process with a subquery or a CTE. Something like this ought to do it:
-- Sample data from the question:
create table [user] (id bigint);
insert [user] values
(1), (2);
create table analytic (id bigint, [user_id] bigint);
insert analytic values
(1, 1), (2, 2);
create table analytic_youtube_demographic (analytic_id bigint, age varchar(32), [percent] decimal(5, 2));
insert analytic_youtube_demographic values
(1, '13-17', 19.6),
(1, '18-24', 38.4),
(1, '25-34', 22.5),
(1, '35-44', 11.5),
(1, '45-54', 5.3),
(1, '55-64', 1.6),
(1, '65+', 1.2),
(2, '13-17', 10),
(2, '18-24', 10),
(2, '25-34', 25),
(2, '35-44', 5),
(2, '45-54', 25),
(2, '55-64', 5),
(2, '65+', 20);
-- First, within the set of records for each user.id, use the rank() function to
-- identify the demographics with the highest percentage.
with RankedDataCTE as
(
select
[user].id,
youtube.age,
youtube.[percent],
[rank] = rank() over (partition by [user].id order by youtube.[percent] desc)
from
[user]
inner join analytic on analytic.[user_id] = [user].id
inner join analytic_youtube_demographic youtube on youtube.analytic_id = analytic.id
)
-- Now select only those records that are (a) of the highest rank within their
-- user.id and (b) either the '25-34' or the '65+' age group.
select
id,
age,
[percent]
from
RankedDataCTE
where
[rank] = 1 and
age in ('25-34', '65+');
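The same rank-then-filter idea can be tried outside SQL Server. The sketch below uses Python's built-in sqlite3 module and assumes a SQLite build with window function support (3.25 or later); the bracketed identifiers are replaced with plain or double-quoted names, and the alias rnk is made up to avoid clashing with the function name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (id INTEGER);
    CREATE TABLE analytic (id INTEGER, user_id INTEGER);
    CREATE TABLE analytic_youtube_demographic (
        analytic_id INTEGER, age TEXT, percent REAL);
    INSERT INTO "user" VALUES (1), (2);
    INSERT INTO analytic VALUES (1, 1), (2, 2);
""")
conn.executemany("INSERT INTO analytic_youtube_demographic VALUES (?, ?, ?)", [
    (1, "13-17", 19.6), (1, "18-24", 38.4), (1, "25-34", 22.5),
    (1, "35-44", 11.5), (1, "45-54", 5.3), (1, "55-64", 1.6), (1, "65+", 1.2),
    (2, "13-17", 10), (2, "18-24", 10), (2, "25-34", 25),
    (2, "35-44", 5), (2, "45-54", 25), (2, "55-64", 5), (2, "65+", 20),
])

# RANK() gives tied top percentages the same rank 1, so user 2 keeps both
# '25-34' and '45-54' before the age filter is applied.
top_rows = conn.execute("""
    WITH ranked AS (
        SELECT u.id, y.age, y.percent,
               RANK() OVER (PARTITION BY u.id ORDER BY y.percent DESC) AS rnk
        FROM "user" u
        JOIN analytic a ON a.user_id = u.id
        JOIN analytic_youtube_demographic y ON y.analytic_id = a.id
    )
    SELECT id, age, percent
    FROM ranked
    WHERE rnk = 1
      AND age IN ('25-34', '65+')
    ORDER BY id
""").fetchall()

user_ids = [row[0] for row in top_rows]
print(user_ids)
```

Using RANK() rather than ROW_NUMBER() matters here: with ROW_NUMBER() only one of the two tied 25% rows for user 2 would get number 1, and the filter could miss it.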
Can anyone tell me what is wrong with the following SQL query?
Select *,
(SELECT [DiseaseID], COUNT(*) AS [Rank] FROM [DiseaseSymptom] WHERE
([SymptomID] IN(1, 5)) GROUP BY [DiseaseID] ORDER BY [Rank] DESC)
FROM Disease WHERE GenderID in (1, 3)
I have 2 tables one contains disease and the gender it is associated with
Disease
+-----------+-------------------+----------+
| DiseaseID | DiseaseName | GenderID |
+-----------+-------------------+----------+
| 1 | Fever | 3 |
| 2 | Flu | 3 |
| 3 | Lady Disease | 2 |
| 4 | Gentlemen Disease | 1 |
+-----------+-------------------+----------+
Gender 1 = Male, 2 = Female, 3 = Common
And a Symptom Disease Matrix like this
DiseaseSymptom
+-----------+-----------+----------+
| DiseaseID | SymptomID | DissymID |
+-----------+-----------+----------+
| 1 | 1 | 1 |
| 1 | 2 | 3 |
| 1 | 4 | 4 |
| 2 | 1 | 5 |
| 2 | 3 | 9 |
| 2 | 4 | 6 |
| 2 | 5 | 7 |
+-----------+-----------+----------+
I get symptoms from the user, match them in the DiseaseSymptom table, and rank the diseases according to the number of symptoms matched (the inner SQL statement).
In the outer statement I simply want to get the result from the inner statement and evaluate whether it belongs to a specific gender. The error I get when I try to run the above query is
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified.
Subqueries in the select clause must generate a single scalar value, not a result set with multiple columns or rows. If you want both values, put the subquery in the from clause (properly joined) and refer to the two different values in the select clause:
Select d.*, z.DiseaseID, z.[Rank]
FROM Disease d
join (SELECT DiseaseID, COUNT(*) AS [Rank]
FROM DiseaseSymptom
WHERE SymptomID IN (1, 5)
GROUP BY DiseaseID) z
On z.DiseaseID = d.DiseaseID
WHERE d.GenderID in (1, 3)
Order By z.[Rank] DESC
You are using a subquery with group by. Your intention is to have a correlated subquery. The problem is that the subquery is returning more than one row. I think this is what you want:
Select d.*,
(SELECT COUNT(*)
FROM [DiseaseSymptom] ds
WHERE ds.[SymptomID] IN (1, 5) AND ds.DiseaseID = d.DiseaseID
) AS [Rank]
FROM Disease d
WHERE d.GenderID in (1, 3);
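You could use a common table expression (CTE). GenderID lives in Disease, not DiseaseSymptom, so join the two inside the CTE, and keep the ORDER BY in the outer query because it is not allowed inside a CTE:
with cte as (SELECT d.[DiseaseID], d.GenderID, COUNT(*) AS [Rank]
FROM [DiseaseSymptom] ds
JOIN Disease d ON d.DiseaseID = ds.DiseaseID
WHERE ds.[SymptomID] IN (1, 5)
GROUP BY d.[DiseaseID], d.GenderID)
select * FROM cte WHERE GenderID in (1, 3) ORDER BY [Rank] DESC
Hope this helps ;)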
There is really no need to have a nested query, just join and filter
SELECT d.DiseaseID, d.DiseaseName, d.GenderID
, Symptoms = Count(ds.SymptomID)
FROM Disease d
INNER JOIN DiseaseSymptom ds ON d.DiseaseID = ds.DiseaseID
WHERE ds.SymptomID IN (1, 5)
AND d.GenderID IN (1, 3)
GROUP BY d.DiseaseID, d.DiseaseName, d.GenderID
ORDER BY Count(SymptomID) Desc
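As a quick check of the join-and-filter approach against the sample tables from the question, here is a sketch using Python's built-in sqlite3 module. It is an approximation for testing, not SQL Server: the T-SQL "Symptoms = Count(...)" alias form is written as "COUNT(...) AS Symptoms", and a tie-breaker on DiseaseID is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Disease (DiseaseID INTEGER, DiseaseName TEXT, GenderID INTEGER);
    CREATE TABLE DiseaseSymptom (
        DiseaseID INTEGER, SymptomID INTEGER, DissymID INTEGER);
    INSERT INTO Disease VALUES
        (1, 'Fever', 3), (2, 'Flu', 3),
        (3, 'Lady Disease', 2), (4, 'Gentlemen Disease', 1);
    INSERT INTO DiseaseSymptom VALUES
        (1, 1, 1), (1, 2, 3), (1, 4, 4),
        (2, 1, 5), (2, 3, 9), (2, 4, 6), (2, 5, 7);
""")

# Count the matched symptoms per disease, restricted to the allowed genders,
# and list the best match first.
ranked_diseases = conn.execute("""
    SELECT d.DiseaseID, d.DiseaseName, d.GenderID,
           COUNT(ds.SymptomID) AS Symptoms
    FROM Disease d
    INNER JOIN DiseaseSymptom ds ON d.DiseaseID = ds.DiseaseID
    WHERE ds.SymptomID IN (1, 5)
      AND d.GenderID IN (1, 3)
    GROUP BY d.DiseaseID, d.DiseaseName, d.GenderID
    ORDER BY Symptoms DESC, d.DiseaseID
""").fetchall()

for row in ranked_diseases:
    print(row)
```

With symptoms 1 and 5, Flu matches both and Fever matches one, so Flu is listed first. Diseases 3 and 4 have no rows in DiseaseSymptom, so the inner join drops them.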