SQL query to conditionally extract the top spending category, or the 2nd one when the top CATEGORY is OTHER - sql

I have this table. The table is just a small example; the real one has more observations.
id  CATEGORY  AMOUNT
1   TECH      120
1   FUN       220
2   OTHER     340
2   PARENTS   220
It is made up of id, category, and the amount spent in each category. I want to select the ID and the category on which that ID spends the most, but if that category is OTHER I want to get the 2nd highest-spending category instead.
I have a constraint: I CANNOT use a subquery that filters with WHERE CATEGORY <> 'OTHER'. It just makes my machine run out of memory (for reasons I don't know).
This is what I have tried.
I have tried creating a row_number() over (partition by id order by amount desc) rn,
and then
select id, category from table where rn = 1 group by 1,2
**But I don't know how to tell the query: if CATEGORY is OTHER, then take rn = 2.**
id  CATEGORY  AMOUNT  ROW NUM
1   TECH      120     2
1   FUN       220     1
2   OTHER     340     1
2   PARENTS   220     2
Another thing I was thinking of doing was to use QUALIFY:
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY AMOUNT DESC) = 1
But here too I only get the 1st records, which can still be OTHER. If I could filter it out within QUALIFY and say: if CATEGORY is 'OTHER', don't consider it.
I am using Databricks.
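For what it's worth, one way to express this without a filtered subquery is to fold the condition into the window ordering so that OTHER sorts last; the top-ranked row per id is then the highest-spending non-OTHER category, which is exactly the 2nd-highest overall whenever OTHER is on top. A minimal sketch for Databricks SQL (the table name my_table is assumed, as is at most one OTHER row per id):
-- Sketch only: demote OTHER in the window ordering, then keep rank 1 per id.
SELECT id, category
FROM my_table                                                         -- hypothetical table name
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY CASE WHEN category = 'OTHER' THEN 1 ELSE 0 END,  -- push OTHER to the bottom
                     amount DESC
        ) = 1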

Related

Get occurrence count of specific categories in a table

Looking to get the transition count of categories from a table. For Name type B, category transitions from Good to Bad so count is 2. For Name type A, it transitions from Good - Moderate - Good - Moderate - Bad, hence gets a count of 5.
Any help would be appreciated.
This is my input data:
Name  order no  category
A     1         Good
A     2         Good
A     3         MODERATE
A     4         Good
A     5         MODERATE
A     6         Bad
A     7         Bad
B     1         Good
B     2         Good
B     3         Good
B     4         BAD
And this is my desired output:
Name  category_transition_count
A     5
B     2
select name
     , count(cnt) as category_transition_count
from (select name
           , case when category <> lag(category) over (partition by Name order by order_no)
                    or lag(category) over (partition by Name order by order_no) is null
                  then 1
             end as cnt
      from t) t
group by name
name  category_transition_count
A     5
B     2
Fiddle
You could use the lag window function to get the category of the previous row, and then compare it with the current row to see whether it changed, and count those occurrences. Note that, by definition, the lag of the first value is null, which can't compare as different from the current value, so you'll need to handle that explicitly:
SELECT name, COUNT(changed) + 1
FROM   (SELECT name,
               CASE WHEN category <> LAG(category) OVER (PARTITION BY name ORDER BY order_no ASC)
                    THEN 1
               END AS changed
        FROM   mytable) t
GROUP BY name
SQLFiddle (PostgreSQL) demo

How to consecutively count everything greater than or equal to itself in SQL?

Let's say I have a table that contains equipment IDs for each equipment type and equipment age. How can I do a count distinct of equipment IDs that are at least that equipment age?
For example, let's say this is all the data we have:
equipment_type  equipment_id  equipment_age
Screwdriver     A123          1
Screwdriver     A234          2
Screwdriver     A345          2
Screwdriver     A456          2
Screwdriver     A567          3
I would like the output to be:
equipment_type  equipment_age  count_of_equipment_at_least_this_age
Screwdriver     1              5
Screwdriver     2              4
Screwdriver     3              1
Reason is there are 5 screwdrivers that are at least 1 day old, 4 screwdrivers at least 2 days old and only 1 screwdriver at least 3 days old.
So far I was only able to count the equipment that falls within each equipment_age (like the query shown below), but not "at least that equipment_age".
SELECT equipment_type,
       equipment_age,
       COUNT(DISTINCT equipment_id) AS count_of_equipments
FROM   equipment_table
GROUP BY 1, 2
Consider the join-less solution below:
select distinct
  equipment_type,
  equipment_age,
  count(*) over equipment_at_least_this_age as count_of_equipment_at_least_this_age
from equipment_table
window equipment_at_least_this_age as (
  partition by equipment_type
  order by equipment_age
  range between current row and unbounded following
)
If applied to the sample data in your question, the output matches the desired result shown above.
Use a self join approach:
SELECT e1.equipment_type,
       e1.equipment_age,
       COUNT(DISTINCT e2.equipment_id) AS count_of_equipments  -- distinct, so duplicate ages on the e1 side don't inflate the count
FROM   equipment_table e1
INNER JOIN equipment_table e2
       ON  e2.equipment_type = e1.equipment_type
       AND e2.equipment_age >= e1.equipment_age
GROUP BY 1, 2
ORDER BY 1, 2;
GROUP BY restricts the scope of COUNT to the rows in the group, i.e. it will not let you reach other rows (rows with equipment_age greater than that of the current group). So you need a subquery or windowing functions to get those. One way:
SELECT equipment_type,
       equipment_age,
       (SELECT COUNT(*)
        FROM   equipment_table cnt
        WHERE  cnt.equipment_type = a.equipment_type
          AND  cnt.equipment_age >= a.equipment_age
       ) AS count_of_equipments
FROM   equipment_table a
GROUP BY 1, 2   -- group by the two key columns; the correlated count is computed per group
I am not sure if your environment supports this syntax, though. If not, let us know and we will find another way.
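If the named WINDOW clause from the earlier answer isn't available, the same window can usually be written inline (a sketch, not taken from any of the answers above, and untested against a specific engine):
-- Sketch: inline window specification equivalent to the named-window answer.
select distinct
  equipment_type,
  equipment_age,
  count(*) over (
    partition by equipment_type
    order by equipment_age
    range between current row and unbounded following
  ) as count_of_equipment_at_least_this_age
from equipment_table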

Case Statement for multiple criteria

I would like to ignore some of the results of my query, because for all intents and purposes they are duplicates. Based on the way the request was made we need to use this hierarchy, and although we are seeing different 'Company_Name' values, we need to ignore one of the results.
Query:
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
2
ORDER BY
3 ASC, 2 ASC
This code omits half a dozen joins and WHERE clauses that are not germane to this question.
Results:
   Customer_Name_Count  Company_Name        Total_Sales
--------------------------------------------------------
1  3                    Blockbuster         1,000
2  6                    Jimmy's Bar         1,500
3  6                    Jimmy's Restaurant  1,500
4  9                    Impala Hotel        2,000
5  12                   Sports Drink        2,500
In the above set, we can see that rows 2 and 3 have the same count, the same total_sales figure, and similar company names. Is there a way to create a case statement that takes these three factors into consideration and then drops one or the other of Jimmy's enterprises? The other issue is that this has to be generic, as there are other instances where this happens. I only want a row dropped when the count and sales figures match and the company names are similar.
Desired result:
   Customer_Name_Count  Company_Name   Total_Sales
----------------------------------------------------
1  3                    Blockbuster    1,000
2  6                    Jimmy's Bar    1,500
3  9                    Impala Hotel   2,000
4  12                   Sports Drink   2,500
It looks like the other answers are accurate, based on the assumption that the Company_IDs are the same for both.
If the Company_IDs are different for Jimmy's Bar and Jimmy's Restaurant, then you can use something like this. I suggest you get functional users involved and do some data clean-up, else you'll be maintaining this every time the issue arises:
SELECT
    COUNT(DISTINCT CASE
                       WHEN A12.Company_Name = 'Name2' THEN 'Name1'
                       ELSE A12.Company_Name
                   END) AS Customer_Name_Count
    ,CASE
         WHEN A12.Company_Name = 'Name2' THEN 'Name1'
         ELSE A12.Company_Name
     END AS Company_Name
    ,SUM(A12.Total_Sales) AS Total_Sales
FROM some_table AS A12   -- alias fixed: the rest of the query references A12
GROUP BY CASE
             WHEN A12.Company_Name = 'Name2' THEN 'Name1'
             ELSE A12.Company_Name
         END
Your problem is that the joins you are using are multiplying the number of rows. Somewhere along the way, multiple names are associated with exactly the same entity (which is why the numbers are the same). You can fix this by aggregating by the right id:
SELECT COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
MAX(Company_Name) as Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY Company_id -- I'm guessing the column is something like this
ORDER BY 3 ASC, 2 ASC;
This might actually overstate the sales (I don't know). Better would be fixing the join so it only returned one name. One possibility is that it is a type-2 dimension, meaning that there is a time component for values that change over time. You may need to restrict the join to a single time period.
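For illustration only (none of these table or column names are from the question): if the company name comes from a type-2 dimension with effective-date columns, restricting the join to the version in effect at the time of the sale keeps a single name per company:
-- Hypothetical sketch: sales_fact, company_dim, valid_from/valid_to and sale_date are assumed names.
SELECT c.Company_id,
       d.Company_Name,
       SUM(c.Total_Sales) AS Total_Sales
FROM   sales_fact c
JOIN   company_dim d
       ON  d.Company_id = c.Company_id
       AND d.valid_from <= c.sale_date
       AND (d.valid_to IS NULL OR c.sale_date < d.valid_to)  -- only the dimension row in effect at sale time
GROUP BY c.Company_id, d.Company_Name;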
You need a function that returns a common name for the companies, and then use DISTINCT:
SELECT DISTINCT
Customer_Name_Count,
dbo.GetCommonName(Company_Name) as Company_Name,
Total_Sales
FROM dbo.theTable
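The dbo.GetCommonName function isn't shown in the answer; a minimal, purely hypothetical implementation could hard-code the known duplicates (real data clean-up is still the better long-term fix):
-- Hypothetical sketch of the helper used above: map known duplicate names to
-- one canonical name, otherwise return the name unchanged.
CREATE FUNCTION dbo.GetCommonName (@name NVARCHAR(255))
RETURNS NVARCHAR(255)
AS
BEGIN
    RETURN CASE
               WHEN @name IN ('Jimmy''s Bar', 'Jimmy''s Restaurant') THEN 'Jimmy''s Bar'
               ELSE @name
           END;
END;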
You can try using the ROW_NUMBER window function to number the rows by Customer_Name_Count and Total_Sales, and then keep rn = 1:
SELECT * FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY Customer_Name_Count,Total_Sales ORDER BY Company_Name) rn
FROM (
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
Company_Name
)t1
)t1
WHERE rn = 1

SQL view with a column that shows top result of relationship with multiple weightings

I have three tables, an Objects table, a Status table and a StatusTypes Table.
An Object has multiple Statuses, each of which has a status type. I would like to create a view that gives me the object's ID, the most important status description (found in the StatusTypes table), and the most important status date (in the Status table).
The part I am getting hung up on is finding the most important status: it must first be sorted by the latest date, then by an integer weighting (Priority) in the Status table, and then again by another weighting in the StatusTypes table (Weighting).
What would be the best SQL statement to deliver these results quickly?
Objects
ID Acquisition Date Serial Number
127237 1997-04-21 2151513515
127239 1997-10-31 2151513523
127242 1998-01-20 2165588481
127272 1998-10-20 2195689842
127286 1999-06-15 2231549489
127291 1999-06-01 2229564978
Status
ID ObjectID Priority StatusMessage Date Status
1 127237 1 Online 22.02.12 07.01.00 1
2 127237 3 Job Received 22.02.12 07.01.00 3
3 127237 5 Job Started 22.02.12.07.01.00 3
4 127237 5 Jam 22.02.12.07.01.00 2
5 127286 1 Online 22.02.12.07.09.00 1
Status Types
ID Description Weighting
1 Idle 0
2 Error 9
3 Working 5
Expected Output
ID Status Date
127237 Error 22.02.12 07.01.00
127286 Idle 22.02.12.07.09.00
Sounds like you could use ROW_NUMBER():
SELECT *
FROM (SELECT *,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY Date DESC, Priority, Weighting) 'RowRank'
FROM YourTable a
)sub
WHERE RowRank = 1
Obviously, replace YourTable with the relevant JOINs.
The ROW_NUMBER() function assigns a number to each row. PARTITION BY is optional, but is used to restart the numbering for each value in that group, i.e. if you PARTITION BY ID then for each unique ID value the numbering will start over at 1. ORDER BY, of course, is used to define how the counting should go, and is required in the ROW_NUMBER() function.
Updated with your data:
SELECT ObjectID,Description,Date
FROM (SELECT a.*,b.Description,ROW_NUMBER() OVER(PARTITION BY a.ObjectID ORDER BY CONVERT(DATE,LEFT([Date],8),4) DESC, Priority DESC, Weighting DESC) 'RowRank'
FROM Status a
JOIN Status_Types b
ON a.Status = b.ID
)sub
WHERE RowRank = 1
Demo: SQL Fiddle

How to loop through a table and look for adjacent rows with identical values in one field and update another column conditionally in SQL?

I have a table with a field called 'group_quartile' which uses the SQL ntile() function to calculate which quartile each customer lies in on the basis of their activity score. However, using this ntile() function I find there are some customers who have the same activity score but are in different quartiles. I need to modify the 'group_quartile' column so that all customers with the same activity score lie in the same group_quartile.
A view of the table values:
Customer_id Product Activity_Score Group_Quartile
CH002 T 2328 1
CR001 T 268 1
CN001 T 178 1
MS006 T 45 2
ST001 T 21 2
CH001 T 0 2
CX001 T 0 3
KH001 T 0 3
MH002 T 0 4
SJ003 T 0 4
CN001 S 439 1
AC002 S 177 1
SC001 S 91 2
PV001 S 69 3
TS001 S 0 4
I used a CTE expression but it did not work.
My query only updates (from the above example):
CX001 T 0 3
modified to
CX001 T 0 2
So only the first repeating activity score is checked and that row’s group_quartile is updated to 2.
I need to update all the below rows as well.
CX001 T 0 3
KH001 T 0 3
MH002 T 0 4
SJ003 T 0 4
I cannot use DENSE_RANK() instead of the quartile to segregate the records, as arranging the customers per product into approximately 4 quartiles is a business requirement.
From my understanding I need to loop through the table:
Find a row which has the same activity score and the same product as its predecessor but a different group_quartile.
Update the selected row's group_quartile to its predecessor's quartile value.
Then loop through the updated table again to look for any row matching the above condition, and update that row similarly.
The loop continues until all rows with the same activity score (for the same product) are put in the same group_quartile.
--
THIS IS THE TABLE STRUCTURE I AM WORKING ON:
CREATE TABLE #custs
(
customer_id NVARCHAR(50),
PRODUCT NVARCHAR(50),
ACTIVITYSCORE INT,
GROUP_QUARTILE INT,
RANKED int,
rownum int
)
INSERT INTO #custs
-- adding a column to give row numbers(unique id) for each row
SELECT customer_id, PRODUCT, ACTIVITYSCORE,GROUP_QUARTILE,RANKED,
Row_Number() OVER(partition by product ORDER BY activityscore desc) N
FROM
-- rows derived form a parent table based on 'segmentation' column value
(SELECT customer_id, PRODUCT, ACTIVITYSCORE,
DENSE_RANK() OVER (PARTITION BY PRODUCT ORDER BY ACTIVITYSCORE DESC) AS RANKED,
NTILE(4) OVER(PARTITION BY PRODUCT ORDER BY ACTIVITYSCORE DESC) AS GROUP_QUARTILE
FROM #parent_score_table WHERE (SEGMENTATION = 'Large')
) as temp
ORDER BY PRODUCT
The method I used to achieve this partially is as follows :
-- The query finds the rows which have the same activity score as the previous row but a different Group_Quartile value.
-- I need to use a query to update this row.
-- Next, find any rows in this newly updated table that have the same activity score as the previous row but a different group_quartile value.
-- Continue to update the table in the above manner until all rows with the same activity score have been updated to have the same quartile value.
I managed to find only the rows which have the same activity score as the previous row but a different Group_Quartile value, but I cannot loop through to find new rows that may match the updated row.
select t1.customer_id,t1.ACTIVITYSCORE,t1.PRODUCT, t1.RANKED, t1.GROUP_QUARTILE, t2.GROUP_QUARTILE as modified_quartile
from #custs t1, #custs t2
where (
t1.rownum = t2.rownum + 1
and t1.ACTIVITYSCORE = t2.ACTIVITYSCORE
and t1.PRODUCT = t2.PRODUCT
and not(t1.GROUP_QUARTILE = t2.GROUP_QUARTILE))
Can anyone help with what the T-SQL statement for the above should be?
Cheers!
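For completeness, the iterative loop described in the question could be written literally as below (a rough sketch only, reusing the #custs columns defined above; the set-based answer that follows avoids looping altogether):
WHILE 1 = 1
BEGIN
    -- Copy the predecessor's quartile onto any row that has the same product
    -- and activity score but a different quartile.
    UPDATE t1
    SET    t1.GROUP_QUARTILE = t2.GROUP_QUARTILE
    FROM   #custs t1
    JOIN   #custs t2
           ON  t1.rownum        = t2.rownum + 1
           AND t1.PRODUCT       = t2.PRODUCT
           AND t1.ACTIVITYSCORE = t2.ACTIVITYSCORE
           AND t1.GROUP_QUARTILE <> t2.GROUP_QUARTILE;

    IF @@ROWCOUNT = 0 BREAK;   -- stop once no mismatched neighbours remain
END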
Assuming you've already worked out a basic Group_Quartile as indicated above, you can update the table with a query similar to the following:
update a
set Group_Quartile = coalesce(topq.Group_Quartile, a.Group_Quartile)
from activityScores a
outer apply
(
select top 1 Group_Quartile
from activityScores topq
where a.Product = topq.Product
and a.Activity_Score = topq.Activity_Score
order by Group_Quartile
) topq
SQL Fiddle with demo.
Edit after comment:
I think you did a lot of the work already by getting the Group_Quartile working.
For each row in the table, the statement above will join another row to it using the outer apply statement. Only one row will be joined back to the original table due to the top 1 clause.
So for each row, we are returning one more row. The extra row will be matched on Product and Activity_Score, and will be the row with the lowest Group_Quartile (order by Group_Quartile). Finally, we update the original row with this lowest Group_Quartile value so each row with the same Product and Activity_Score will now have the same, lowest possible Group_Quartile.
So SJ003, MH002, etc will all be matched to CH001 and be updated with the Group_Quartile value of CH001, i.e. 2.
It's hard to explain code! Another thing that might help is looking at the join without the update statement:
select a.*
, TopCustomer_id = topq.Customer_Id
, NewGroup_Quartile = topq.Group_Quartile
from activityScores a
outer apply
(
select top 1 *
from activityScores topq
where a.Product = topq.Product
and a.Activity_Score = topq.Activity_Score
order by Group_Quartile
) topq
SQL Fiddle without update.