Problems with group by and order by - sql

Hi, I am a newbie to the world of SQL and am struggling to get some of the basics to work.
I have a set of data that looks like this:
Table name: Sample
PROJECT   WORK ORDER   AMOUNT
-----------------------------------------
111       a            100
222       b            200
111       c            300
444       d            400
111       e            500
666       f            600
I want it to end up looking like this:
Table name: Sample
PROJECT   WORK ORDER   AMOUNT   PROJECT AMOUNT
--------------------------------------------------------
111       e            500      900
111       c            300      900
111       a            100      900
666       f            600      600
444       d            400      400
222       b            200      200
That is, sorted so that the project with the greatest TOTAL amount comes first.
Group by does not work for me, as it collapses all the rows of a project into one, so I can't see the 3 work order lines for "Project 111":
PROJECT   WORK ORDER   AMOUNT
-----------------------------------------
111       a            900
222       b            200
444       d            400
666       f            600
Order by does not work either, as I can't get it to sort on the basis of the greatest project total:
Table name: Sample
PROJECT   WORK ORDER   AMOUNT
-----------------------------------------
666       f            600
111       e            500
444       d            400
111       c            300
222       b            200
111       a            100
My alternative idea was to create another column, "Project Amount", that holds each project's total based on the values in the "Project" column. I could then sort by Project Amount to achieve the desired format:
Table name: Sample
PROJECT   WORK ORDER   AMOUNT   PROJECT AMOUNT
--------------------------------------------------------
111       e            500      900
111       c            300      900
111       a            100      900
666       f            600      600
444       d            400      400
222       b            200      200
But I am struggling to get the "Project Amount" column to calculate each project's total and show it on every row with the same project number.
Any advice?

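select *
,      sum(amount) over (partition by project) as ProjAmount
from YourTable
order by
    ProjAmount desc
,   project        -- keeps rows of one project together when totals tie
,   amount desc    -- largest work order first within a project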
To select only the top two projects with the highest amounts, you could use dense_rank:
select *
from (
    select *
    ,      dense_rank() over (order by ProjAmount desc) as dr
    from (
        select *
        ,      sum(amount) over (partition by project) as ProjAmount
        from YourTable
    ) WithProjAmount
) WithDenseRank
where dr < 3
order by
    ProjAmount desc

A version with a plain correlated subquery, for databases without window functions:
SELECT s.*,
(SELECT SUM(Amount) FROM Sample WHERE Project = s.Project) ProjectAmount
FROM Sample s
ORDER BY ProjectAmount DESC

SELECT a.project ,
       a.work ,
       a.amount ,
       b.proj_amount
FROM Sample a
JOIN ( SELECT SUM(amount) proj_amount ,
              project
       FROM Sample
       GROUP BY project
     ) b ON a.project = b.project
ORDER BY b.proj_amount DESC ,
         a.project ,
         a.amount DESC
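The windowed-sum approach from the answers above can be checked against the question's sample data. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in database (window functions need SQLite 3.25 or newer), and the column names work_order and project_amount are assumptions, since the question does not give exact identifiers.

```python
import sqlite3

# In-memory stand-in for the asker's "Sample" table (column names assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sample (project INTEGER, work_order TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO Sample VALUES (?, ?, ?)",
    [(111, "a", 100), (222, "b", 200), (111, "c", 300),
     (444, "d", 400), (111, "e", 500), (666, "f", 600)],
)

# SUM(...) OVER (PARTITION BY project) repeats the project total on every row,
# so the rows are kept and can still be sorted by that total.
rows = conn.execute("""
    SELECT project, work_order, amount,
           SUM(amount) OVER (PARTITION BY project) AS project_amount
    FROM Sample
    ORDER BY project_amount DESC, project, amount DESC
""").fetchall()

for row in rows:
    print(row)
```

The extra sort keys (project, then amount descending) are there only to make the order deterministic when two projects share the same total.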

Related

How to Select ID's in SQL (Databricks) in which at least 2 items from a list are present

I'm working with patient-level data in Azure Databricks and I'm trying to build out a cohort of patients that have at least 2 diagnoses from a list of specific diagnosis codes. This is essentially what the table looks like:
CLAIM_ID | PTNT_ID | ICD_CD | DATE
---------+---------+--------+------------
1        | 101     | 2500   | 01_25_2020
2        | 101     | 3850   | 03_13_2018
3        | 222     | 2500   | 10_26_2018
4        | 222     | 8888   | 11_30_2018
5        | 222     | 9155   | 04_01_2019
6        | 871     | 2500   | 02_17_2020
7        | 871     | 3200   | 09_09_2019
The list of ICD_CD codes of interest is something like [2500, 3850, 8888]. In this case, I would want to return TOTAL UNIQUE PTNT_ID = 2. These would be PTNT_ID = (101, 222) as these are the only two patients that have at least 2 ICD_CD codes of interest.
When I use something like this, I'm able to return all of the relevant PTNT_ID values, but I'm not able to get the total count of these PTNT_ID:
select mc.PTNT_ID
from MEDICAL_CLAIMS mc
where mc.ICD_CD in ( /* list of ICD_CD codes of interest */ )
group by mc.PTNT_ID
having count(distinct mc.ICD_CD) >= 2
When I try to add a COUNT to this query, it returns an error.
Just select from the query:
select count(*)
from
(
    select mc.PTNT_ID
    from MEDICAL_CLAIMS mc
    where mc.ICD_CD in ( /* list of ICD_CD codes of interest */ )
    group by mc.PTNT_ID
    having count(distinct mc.ICD_CD) >= 2
) ptnts;
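To see the "count over a grouped subquery" pattern run end to end, here is a minimal sketch on the question's sample rows. It uses Python's sqlite3 rather than Databricks, and it renames the DATE column to CLAIM_DATE; both are assumptions made only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MEDICAL_CLAIMS "
    "(CLAIM_ID INTEGER, PTNT_ID INTEGER, ICD_CD TEXT, CLAIM_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO MEDICAL_CLAIMS VALUES (?, ?, ?, ?)",
    [(1, 101, "2500", "01_25_2020"), (2, 101, "3850", "03_13_2018"),
     (3, 222, "2500", "10_26_2018"), (4, 222, "8888", "11_30_2018"),
     (5, 222, "9155", "04_01_2019"), (6, 871, "2500", "02_17_2020"),
     (7, 871, "3200", "09_09_2019")],
)

# Inner query: patients with at least 2 distinct codes from the list.
cohort_sql = """
    SELECT PTNT_ID
    FROM MEDICAL_CLAIMS
    WHERE ICD_CD IN ('2500', '3850', '8888')
    GROUP BY PTNT_ID
    HAVING COUNT(DISTINCT ICD_CD) >= 2
"""
patients = [r[0] for r in conn.execute(cohort_sql + " ORDER BY PTNT_ID")]

# Outer query: count the rows the inner query returns.
total = conn.execute(f"SELECT COUNT(*) FROM ({cohort_sql}) ptnts").fetchone()[0]

print(patients, total)
```

Patient 871 drops out because only one of its codes (2500) is on the list.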

Getting latest price of different products from control table

I have a control table where the price of each item number is tracked by date.
id ItemNo Price Date
---------------------------
1 a001 100 1/1/2003
2 a001 105 1/2/2003
3 a001 110 1/3/2003
4 b100 50 1/1/2003
5 b100 55 1/2/2003
6 b100 60 1/3/2003
7 c501 35 1/1/2003
8 c501 38 1/2/2003
9 c501 42 1/3/2003
10 a001 95 1/1/2004
This is the query I am running.
SELECT pr.*
FROM prices pr
INNER JOIN
(
SELECT ItemNo, max(date) max_date
FROM prices
GROUP BY ItemNo
) p ON pr.ItemNo = p.ItemNo AND
pr.date = p.max_date
order by ItemNo ASC
I am getting below values
id ItemNo Price Date
------------------------------
10 a001 95 2004-01-01
6 b100 60 2003-01-03
9 c501 42 2003-01-03
I am getting my desired result, but is my query right or wrong?
Your query does what you want, and is a valid approach to solve your problem.
An alternative option would be to use a correlated subquery for filtering:
select p.*
from prices p
where p.date = (select max(p1.date) from prices p1 where p1.itemno = p.itemno)
The upside of this query is that it can take advantage of an index on (itemno, date).
You can also use window functions:
select *
from (
select p.*, rank() over(partition by itemno order by date desc) rn
from prices p
) p
where rn = 1
I would recommend benchmarking the three options against your real data to assess which one performs better.
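As a quick sanity check before benchmarking, the three variants can be run side by side on the question's data to confirm they return the same rows. This sketch assumes Python's sqlite3 (SQLite 3.25 or newer for rank()) and stores the dates as ISO strings so that max() and ordering behave chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (id INTEGER, ItemNo TEXT, Price INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?)",
    [(1, "a001", 100, "2003-01-01"), (2, "a001", 105, "2003-01-02"),
     (3, "a001", 110, "2003-01-03"), (4, "b100", 50, "2003-01-01"),
     (5, "b100", 55, "2003-01-02"), (6, "b100", 60, "2003-01-03"),
     (7, "c501", 35, "2003-01-01"), (8, "c501", 38, "2003-01-02"),
     (9, "c501", 42, "2003-01-03"), (10, "a001", 95, "2004-01-01")],
)

queries = {
    # Join against the per-item maximum date.
    "join": """
        SELECT pr.id, pr.ItemNo, pr.Price, pr.date
        FROM prices pr
        INNER JOIN (SELECT ItemNo, MAX(date) AS max_date
                    FROM prices GROUP BY ItemNo) p
                ON pr.ItemNo = p.ItemNo AND pr.date = p.max_date
        ORDER BY pr.ItemNo""",
    # Correlated subquery used as a filter.
    "correlated": """
        SELECT p.id, p.ItemNo, p.Price, p.date
        FROM prices p
        WHERE p.date = (SELECT MAX(p1.date) FROM prices p1
                        WHERE p1.ItemNo = p.ItemNo)
        ORDER BY p.ItemNo""",
    # Window function: rank rows per item, newest first.
    "window": """
        SELECT id, ItemNo, Price, date
        FROM (SELECT p.*, RANK() OVER (PARTITION BY ItemNo
                                       ORDER BY date DESC) AS rn
              FROM prices p) ranked
        WHERE rn = 1
        ORDER BY ItemNo""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

All three return every row tied for the latest date of an item, so they stay equivalent even if an item has two prices on the same day.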

How to get latest records based on two columns of max

I have a table called Inventory with the below columns
item   warehouse   date                      sequence number   value
111    100         2019-09-25 12:29:41.000   1                 10
111    100         2019-09-26 12:29:41.000   1                 20
222    200         2019-09-21 16:07:10.000   1                 5
222    200         2019-09-21 16:07:10.000   2                 10
333    300         2020-01-19 12:05:23.000   1                 4
333    300         2020-01-20 12:05:23.000   1                 5
Expected Output:
item   warehouse   date                      sequence number   value
111    100         2019-09-26 12:29:41.000   1                 20
222    200         2019-09-21 16:07:10.000   2                 10
333    300         2020-01-20 12:05:23.000   1                 5
For each item and warehouse, I need to pick the value from the row with the latest date and, within that date, the latest sequence number.
I tried with below code
select item,warehouse,sequencenumber,sum(value),max(date) as date1
from Inventory t1
where
t1.date IN (select max(date) from Inventory t2
where t1.warehouse=t2.warehouse
and t1.item = t2.item
group by t2.item,t2.warehouse)
group by t1.item,t1.warehouse,t1.sequencenumber
It works for the latest date but not for the latest sequence number.
Can you please suggest how to write a query that gives my expected output?
You can use row_number() for this:
select *
from (
    select
        t.*,
        row_number() over(
            partition by item, warehouse
            order by date desc, sequencenumber desc
        ) rn
    from Inventory t
) t
where rn = 1
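A minimal sketch of the row_number() approach on the question's rows, assuming Python's sqlite3 (SQLite 3.25 or newer) and the column name sequencenumber taken from the asker's own query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Inventory "
    "(item INTEGER, warehouse INTEGER, date TEXT, sequencenumber INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO Inventory VALUES (?, ?, ?, ?, ?)",
    [(111, 100, "2019-09-25 12:29:41.000", 1, 10),
     (111, 100, "2019-09-26 12:29:41.000", 1, 20),
     (222, 200, "2019-09-21 16:07:10.000", 1, 5),
     (222, 200, "2019-09-21 16:07:10.000", 2, 10),
     (333, 300, "2020-01-19 12:05:23.000", 1, 4),
     (333, 300, "2020-01-20 12:05:23.000", 1, 5)],
)

# Number the rows of each (item, warehouse) group, newest date first and,
# within the same date, highest sequence number first; keep row 1 only.
latest = conn.execute("""
    SELECT item, warehouse, date, sequencenumber, value
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY item, warehouse
                   ORDER BY date DESC, sequencenumber DESC
               ) AS rn
        FROM Inventory t
    ) ranked
    WHERE rn = 1
    ORDER BY item
""").fetchall()

for row in latest:
    print(row)
```

Item 222 shows why the second sort key matters: both of its rows share a date, so only the sequence number decides which one is kept.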

How can I get 5 most viewed products per customer?

I need help with a query to extract the top 5 most viewed products per customer in the last week. I know basic SQL so any help would be appreciated.
My tables look like this
db1.views
cust_id hit date_hit prod_id
111 abc [timestamp] 12345
222 bcs [timestamp] 87653
333 pdr [timestamp] 36702
444 lao [timestamp] 90165
444 afe [timestamp] 89104
333 wgt [timestamp] 46177
111 gfr [timestamp] 46468
db2.item
prod_id color
12345 red
87653 green
36702 blue
90165 red
89104 green
46177 yellow
46468 pink
db3.price
prod_id price
12345 500
87653 450
36702 600
90165 570
89104 650
46177 430
46468 900
This was my original query:
SELECT *
FROM (
SELECT COUNT(v.hit) AS hit, i.prod_id, v.cust_id
FROM db1.views v
JOIN db2.item i
ON v.prod_id = i.prod_id
JOIN db3.price p
ON i.prod_id = p.prod_id
WHERE i.color = "red" OR p.price > 500
GROUP BY i.prod_id, v.cust_id
) AS A
JOIN db1.views B
ON A.prod_id = B.prod_id
WHERE A.hit>1 AND B.date_hit BETWEEN date_sub(current_timestamp(), 7) AND current_timestamp()
Unfortunately with this one I could not find a way to limit my results to the 5 prod_id with the most views.
I then read around and found the rank() and row_number() functions, and started trying something like this:
SELECT rank() over(PARTITION BY A.prod_id ORDER BY A.hits DESC) AS row_num
FROM (
SELECT i.prod_id, COUNT(v.hit) AS hits
FROM db1.views v
JOIN db2.item i
ON v.prod_id = i.prod_id
JOIN db3.price p
ON i.prod_id = p.prod_id
WHERE i.color = "red" OR p.price > 500
GROUP BY i.prod_id
SORT BY hits DESC) AS A
GROUP BY A.prod_id, A.hits;
My issue with this one is that it always, always times out! I'm not sure if I have a syntax error or if I'm doing something that SQL is unable to resolve but I haven't been able to have this one work. I tried the same with row_number() and didn't work either. I feel like I might be close with this one but I'm not sure why it keeps timing out. Also, I know this second one does not have the cust_id, it's only because I can't even make it work at the moment.
What I'd like to have is something like this:
cust_id hit prod_id
111 50 84304
111 45 12345
111 42 16730
111 11 17592
111 4 43024
222 93 87653
222 91 23489
222 34 83920
222 22 57482
222 20 38402
333 43 36702
You can use aggregation to count the views and then window functions to determine the top 5:
select vc.*
from (select v.cust_id, v.prod_id, count(*) as cnt,
             row_number() over (partition by v.cust_id order by count(*) desc) as seqnum
      from db1.views v
      where v.date_hit > date_sub(current_timestamp(), 7)
      group by v.cust_id, v.prod_id
     ) vc
where seqnum <= 5;
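The aggregate-then-rank idea can be tried on a small scale. The sketch below is not the asker's data: the rows, the fixed cutoff date and the top-2 limit are all hypothetical, chosen so the result is easy to verify by hand. It uses Python's sqlite3 (SQLite 3.25 or newer) and splits counting and ranking into two nested queries to keep each step visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (cust_id INTEGER, prod_id INTEGER, date_hit TEXT)")
conn.executemany(
    "INSERT INTO views VALUES (?, ?, ?)",
    [(111, 1, "2024-01-10"), (111, 1, "2024-01-11"), (111, 1, "2024-01-12"),
     (111, 2, "2024-01-10"), (111, 2, "2024-01-11"),
     (111, 3, "2024-01-12"), (111, 3, "2023-12-01"),  # second hit is too old
     (222, 4, "2024-01-10"), (222, 4, "2024-01-13"),
     (222, 5, "2024-01-11")],
)

cutoff = "2024-01-08"  # hypothetical stand-in for "7 days ago"
top_n = 2              # the question asks for 5; 2 keeps the sample small

top_products = conn.execute("""
    SELECT cust_id, prod_id, cnt
    FROM (
        SELECT cust_id, prod_id, cnt,
               ROW_NUMBER() OVER (PARTITION BY cust_id
                                  ORDER BY cnt DESC, prod_id) AS seqnum
        FROM (
            -- Step 1: count recent views per customer and product.
            SELECT cust_id, prod_id, COUNT(*) AS cnt
            FROM views
            WHERE date_hit > ?
            GROUP BY cust_id, prod_id
        ) counted
    ) ranked
    -- Step 2: keep the top N products of each customer.
    WHERE seqnum <= ?
    ORDER BY cust_id, cnt DESC
""", (cutoff, top_n)).fetchall()

for row in top_products:
    print(row)
```

Partitioning by cust_id is what makes the ranking restart for each customer; partitioning by prod_id, as in the second attempt in the question, ranks each product only against itself.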

Group by in PIVOT operator

How do I use a group by clause with the PIVOT operator?
I tried the following code, but I get null values and the results are not aggregated.
select EmpName, CHN,HYD FROM location
PIVOT (Sum(salary) for EmpLoc in ([CHN], [HYD]))
AS
pivottable
I want the final output to be like this.
         CHN   HYD
kunder   400   200
shetty   150   150
or
         CHN   HYD   Total
kunder   400   200   600
shetty   150   150   300
Total    550   350   900
Just add the derived column Total=CHN+HYD and a sub-query with Union All to create the Total row.
Ordering by Seq (though it is not displayed) puts the Total row at the bottom.
Declare @YourTable table (EmpLoc varchar(25),EmpName varchar(25),Salary int)
Insert Into @YourTable values
 ('HYD','kunder',200)
,('HYD','shetty',150)
,('CHN','shetty',150)
,('CHN','kunder',200)
,('CHN','kunder',200)
Select EmpName, CHN, HYD, Total=CHN+HYD
From (
      Select Seq=0,EmpLoc,EmpName,Salary From @YourTable
      Union All
      Select Seq=1,EmpLoc,'Total',Salary From @YourTable
     ) A
Pivot (sum(Salary) for EmpLoc in ([CHN], [HYD])) P
Order By Seq, EmpName
Returns
EmpName CHN HYD Total
kunder 400 200 600
shetty 150 150 300
Total 550 350 900
Declare @YourTable table (EmpLoc varchar(25),EmpName varchar(25),Salary int)
Insert Into @YourTable values
 ('HYD','kunder',200)
,('HYD','shetty',150)
,('CHN','shetty',150)
,('CHN','kunder',200)
,('CHN','kunder',200)
;with cte as
(
    select * from
    (
        select * from @YourTable
    ) as y
    pivot
    (
        sum(Salary)
        for EmpLoc in ([CHN], [HYD])
    ) as p
)
select
    EmpName, sum(CHN) CHN, sum(HYD) HYD
from cte
group by EmpName;
I have no issue using your code from your example to get your desired results. I am guessing that your query is not as simple as your example, and as such is introducing other complications not shown here.
You may need to use a subquery and pivot using just the columns necessary for the pivot and join back to the rest of your query to get the results you are looking for using pivot().
Using conditional aggregation may be a simpler solution:
select
empname
, CHN = sum(case when emploc = 'CHN' then salary else 0 end)
, HYD = sum(case when emploc = 'HYD' then salary else 0 end)
--, Total = sum(salary) /* Optional total */
from location
group by empname
rextester demo: http://rextester.com/LYRH81756
returns:
+---------+-----+-----+
| EmpName | CHN | HYD |
+---------+-----+-----+
| kunder | 400 | 200 |
| shetty | 150 | 150 |
+---------+-----+-----+
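Conditional aggregation can also produce the variant with a Total column and a Total row. The sketch below is one way to do it, assuming Python's sqlite3 as a stand-in for SQL Server and reusing the sample rows from the answers above. The Seq column is a helper that exists only for ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (EmpLoc TEXT, EmpName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO location VALUES (?, ?, ?)",
    [("HYD", "kunder", 200), ("HYD", "shetty", 150), ("CHN", "shetty", 150),
     ("CHN", "kunder", 200), ("CHN", "kunder", 200)],
)

pivoted = conn.execute("""
    SELECT EmpName, CHN, HYD, CHN + HYD AS Total
    FROM (
        -- One row per employee, one column per location.
        SELECT 0 AS Seq, EmpName,
               SUM(CASE WHEN EmpLoc = 'CHN' THEN Salary ELSE 0 END) AS CHN,
               SUM(CASE WHEN EmpLoc = 'HYD' THEN Salary ELSE 0 END) AS HYD
        FROM location
        GROUP BY EmpName
        UNION ALL
        -- Grand total row: the same sums without grouping.
        SELECT 1, 'Total',
               SUM(CASE WHEN EmpLoc = 'CHN' THEN Salary ELSE 0 END),
               SUM(CASE WHEN EmpLoc = 'HYD' THEN Salary ELSE 0 END)
        FROM location
    ) t
    ORDER BY Seq, EmpName
""").fetchall()

for row in pivoted:
    print(row)
```

Because each CASE falls back to 0 instead of NULL, CHN + HYD stays numeric even for an employee who has no rows in one of the locations.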