SQL JOIN with CASE statement result - sql

Is there any way of joining the result of a case statement with a reference table without creating a CTE, ect.
Result AFTER CASE statement:
ID Name Bonus Level (this is the result of a CASE statement)
01 John A
02 Jim B
01 John B
03 Jake C
Reference table
A 10%
B 20%
C 30%
I want to then get the % next to each employee, then the max %age using the MAX function and grouping by ID, then link it back again to the reference so that each employee has the single correct (highest) bonus level next to their name. (This is a totally fictitious scenario, but very similar to what I am looking for).
Just need help with joining the result of the CASE statement with the reference table.
Thanks in advance.

In place of a temporary value as the result of the case statement, you could use a select statement from the reference table.
So if your case statement looks like:
case when variable1=value then bonuslevel =A
Then, replacing it like this might help
case when variable1=value then (select percentage from ReferenceTable where variable2InReferenceTable=A)

Don't know if I am overly simplifying, but based on the results of your case result query, why not just join that to the reference table, and do a max grouped by ID/Name. Since the ID and persons name wont change anyhow since they are the same person, you are just getting the max you want. To complete the Bonus level, rejoin just that portion after the max percentage determined for the person.
select
lvl1.ID,
lvl1.Name,
lvl1.FinalBonus,
rt2.BonusLvl
from
( select
PQ.ID,
PQ.Name,
max( rt.PcntBonus ) as FinalBonus
from
(however you
got your
data query ) PQ
JOIN RefTbl rt
on PQ.BonusLvl = rt.BonusLvl
) lvl1
JOIN RefTbl rt2
on lvl1.FinalBonus = rt2.PcntBonus
Since the Bonus levels (A,B,C) do not guarantee corresponding % levels (10,20,30), I did it this way... OTHERWISE, you could have just used max() on both the bonus level and percent. But what if your bonus levels were listed as something like
Limited 10%
Aggressive 20%
Ace 30%
You could see that a max of the level above would have "Limited", but the max % = 30 is associated with an "Ace" sales rep... Get the 30% first, then see what the label that matched that is.

Related

Calculating the mode/median/most frequent observation in categorical variables in SQL impala

I would like to calculate the mode/median or better, most frequent observation of a categorical variable within my query.
E.g, if the variable has the following string values:
dog, dog, dog, cat, cat and I want to get dog since its 3 vs 2.
Is there any function that does that? I tried APPX_MEDIAN() but it only returns the first 10 characters as median and I do not want that.
Also, I would like to get the most frequent observation with respect to date if there is a tie-break.
Thank you!
the most frequent observation is mode and you can calculate it like this.
Single value mode can be calculated like this on a value column. Get the count and pick up row with max count.
select count(*),value from mytable group by value order by 1 desc limit 1
now, in case you have multiple modes, you need to join back to the main table to find all matches.
select orig.value from
(select count(*) c, value v from mytable) orig
join (select count(*) cmode from mytable group by value order by 1 desc limit 1) cmode
ON orig.c= cmode.cmode
This will get all count of values and then match them based on count. Now, if one value of count matches to max count, you will get 1 row, if you have two value counts matches to max count, you will get 2 rows and so on.
Calculation of median is little tricky - and it will give you middle value. And its not most frequent one.

Why aren't these two sql statements returning same output?

I'm just getting started with sql and have the objective to transform this:
select X.persnr
from Pruefung X
where X.persnr in (
select Y.persnr
from pruefung Y
where X.matrikelnr <> Y.matrikelnr)
output:
into the same output but using a form of join. I tried it the way below but I can't seem to get "rid" of the cartesian product as far as i can see. Or maybe i misunderstood the above statement what it should actually do. For me the above says "for each unique matrikelnr display all corresponding persnr".
select X.persnr
from Pruefung X
join pruefung y on x.persnr=y.persnr
where x.matrikelnr<>y.matrikelnr
output: A long list (I don't want to fill the entire question with it) - i am guessing the cartesian product from the join
This is the relation I am using.
Edit: Distinct (unless i am using it in the wrong place) won't work because then persnr is only displayed once, thats not the objective though.
Your initial query actually does:
select persnr from Pruefung if the same persnr exists for a a diferent matrikelnr.
"for each unique matrikelnr display all corresponding persnr"
This is achieved using aggregation:
Depending on the DBMS you are using you could use something like (SQL Server uses STRING_AGG, but MySQL uses GROUP_CONCAT)
SELECT matrikelnr,STRING_AGG(matrikelnr,',')
GROUP BY matrikelnr
You cannot easily achieve what you got from a correlated query (your first attempt) by using a join.
Edit:
A join does not result in a "Cartesian product" expect from when there is no join condition (CROSS JOIN).
A join matches two sets based on a join condition. The reason why you get more entries is that the join looks at the join key (PERSNR) and does its matching.
For example for 101 you have 3 entries. That means you will get 3x3 reults.
You then filter out the results for the cases where X.matrikelnr <> Y.matrikelnr If we assume matrikelnr is unique that would mean the row matched with itself. so you will lose 3 results ending up with 3x3 - 3 = 6.
If you want to achieve something in SQL you must first define what you are expecting to use and then use the appropiate tools (in this case correlated queries not joins)
You can write your 1st query with EXISTS instead of IN like:
select X.persnr
from Pruefung X
where exists (
select 1
from pruefung Y
where X.persnr = Y.persnr and X.matrikelnr <> Y.matrikelnr
)
This way it's obvious that this query means:
return all the persnrs of the table for which there exists another
row with the same persnr but different matrikelnr
For your sample data the result is all the persnrs of the table.
Your 2nd query though, does something different.
It links every row of the table with all the rows of the same table with the same persnr but different matrikelnr.
So for every row of the table you will get as many as rows as there are for the same persnrs but different matrikelnrs.
For example for the 1st row with persnr = 101 and matrikelnr = 8532478 you will get 2 rows because there are 2 rows in the table with persnr = 101 and matrikelnr <> 8532478.
You are right. It's the cartesian product's fault. Suppose you have persnr 1,1,1,2,2,2 in the first table and persnr 1,1,1,2,2 in the second. How many lines are you expecting to be returned?
In pdeuso-code it would go like this
Select
...
WHERE persnr in (second table)
-- 6 lines
Select persnr
FROM ...
JOIN ... ON a.persnr = b.persnr
-- 3X3 + 3X2 = 15 lines.
SELECT DISTINCT persnr
FROM ...
JOIN ... ON a.persnr = b.persnr
-- 2 lines (1 and 2)
Take your pick

SQL Case with calculation on 2 columns

I have a value table and I need to write a case statement that touches 2 columns: Below is the example
Type State Min Max Value
A TX 2 15 100
A TX 16 30 200
A TX 31+ 500
Let say I have another table that has the following
Type State Weight Value
A TX 14 ?
So when I join the table , I need a case statement that looks at weight from table 2 , type and state - compare it to the table 1 , know that the weight falls between 2 and 15 from row 1 and update Value in table 2 with 100
Is this doable ?
Thanks
It returns 0 if there aren't rows in this range of values.
select Type, State, Weight,
(select coalesce(Value, 0)
from table_b
where table_b.Type = table_a.Type
and table_b.State = table_a.State
and table_a.Value between table_b.Min and table_b.Max) as Value
from table_a
For an Alteryx solution: (1) run both tables into a Join tool, joining on Type and State; (2) Send the output to a Filter tool where you force Weight to be between Min and Max; (3) Send that output to a Select tool, where you grab only the specific columns you want; (since the Join will give you all columns from all tables). Done.
Caveats: the data running from Join to Filter could be large, since you are joining every Type/State combination in the Lookup table to the other table. Depending on the size of your datasets, that might be cumbersome. Alteryx is very fast though, and at least we're limiting on State and Type, so if your datasets aren't too large, this simple solution will work fine.
With larger data, try to do it as part of your original select, utilizing one of the other solutions given here for your SQL query.
Considering that Min and Max columns in first table are of Integer type
You need to use INNER JOIN on ranges
SELECT *
FROM another_table a
JOIN first_table b
ON a.type = b.type
AND a.State = b.State
AND a.Weight BETWEEN b.min AND b.max

Retrieve names by ratio of their occurrence

I'm somewhat new to SQL queries, and I'm struggling with this particular problem.
Let's say I have query that returns the following 3 records (kept to one column for simplicity):
Tom
Jack
Tom
And I want to have those results grouped by the name and also include the fraction (ratio) of the occurrence of that name out of the total records returned.
So, the desired result would be (as two columns):
Tom | 2/3
Jack | 1/3
How would I go about it? Determining the numerator is pretty easy (I can just use COUNT() and GROUP BY name), but I'm having trouble translating that into a ratio out of the total rows returned.
SELECT name, COUNT(name)/(SELECT COUNT(1) FROM names) FROM names GROUP BY name;
Since the denominator is fixed, the "ratio" is directly proportional to the numerator. Unless you really need to show the denominator, it'll be a lot easier to just use something like:
select name, count(*) from your_table_name
group by name
order by count(*) desc
and you'll get the right data in the right order, but the number that's shown will be the count instead of the ratio.
If you really want that denominator, you'd do a count(*) on a non-grouped version of the same select -- but depending on how long the select takes, that could be pretty slow.

MySQL: Getting highest score for a user

I have the following table (highscores),
id gameid userid name score date
1 38 2345 A 100 2009-07-23 16:45:01
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
5 38 2345 A 50 2009-07-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
7 32 2345 A 100 2009-07-20 16:45:01
Now in the above structure, a user can play a game multiple times but I want to display the "Games Played" by a specific user. So in games played section I can't display multiple games. So the concept should be like if a user played a game 3 times then the game with highest score should be displayed out of all.
I want result data like:
id gameid userid name score date
2 39 2345 A 500 2009-07-20 16:45:01
3 31 2345 A 100 2009-07-20 16:45:01
4 38 2345 A 200 2009-10-20 16:45:01
6 32 2345 A 120 2009-07-20 16:45:01
I tried following query but its not giving me the correct result:
SELECT id,
gameid,
userid,
date,
MAX(score) AS score
FROM highscores
WHERE userid='2345'
GROUP BY gameid
Please tell me what will be the query for this?
Thanks
Requirement is a bit vague/confusing but would something like this satisfy the need ?
(purposely added various aggregates that may be of interest).
SELECT gameid,
MIN(date) AS FirstTime,
MAX(date) AS LastTime,
MAX(score) AS TOPscore.
COUNT(*) AS NbOfTimesPlayed
FROM highscores
WHERE userid='2345'
GROUP BY gameid
-- ORDER BY COUNT(*) DESC -- for ex. to have games played most at top
Edit: New question about adding the id column to the the SELECT list
The short answer is: "No, id cannot be added, not within this particular construct". (Read further to see why) However, if the intent is to have the id of the game with the highest score, the query can be modified, using a sub-query, to achieve that.
As explained by Alex M on this page, all the column names referenced in the SELECT list and which are not used in the context of an aggregate function (MAX, MIN, AVG, COUNT and the like), MUST be included in the ORDER BY clause. The reason for this rule of the SQL language is simply that in gathering the info for the results list, SQL may encounter multiple values for such an column (listed in SELECT but not GROUP BY) and would then not know how to deal with it; rather than doing anything -possibly useful but possibly silly as well- with these extra rows/values, SQL standard dictates a error message, so that the user can modify the query and express explicitly his/her goals.
In our specific case, we could add the id in the SELECT and also add it in the GROUP BY list, but in doing so the grouping upon which the aggregation takes place would be different: the results list would include as many rows as we have id + gameid combinations the aggregate values for each of this row would be based on only the records from the table where the id and the gameid have the corresponding values (assuming id is the PK in table, we'd get a single row per aggregation, making the MAX() and such quite meaningless).
The way to include the id (and possibly other columns) corresponding to the game with the top score, is with a sub-query. The idea is that the subquery selects the game with TOP score (within a given group by), and the main query's SELECTs any column of this rows, even when the fieds wasn't (couldn't be) in the sub-query's group-by construct. BTW, do give credit on this page to rexem for showing this type of query first.
SELECT H.id,
H.gameid,
H.userid,
H.name,
H.score,
H.date
FROM highscores H
JOIN (
SELECT M.gameid, hs.userid, MAX(hs.score) MaxScoreByGameUser
FROM highscores H2
GROUP BY H2.gameid, H2.userid
) AS M
ON M.gameid = H.gameid
AND M.userid = H.userid
AND M.MaxScoreByGameUser = H.score
WHERE H.userid='2345'
A few important remarks about the query above
Duplicates: if there the user played several games that reached the same hi-score, the query will produce that many rows.
GROUP BY of the sub-query may need to change for different uses of the query. If rather than searching for the game's hi-score on a per user basis, we wanted the absolute hi-score, we would need to exclude userid from the GROUP BY (that's why I named the alias of the MAX with a long, explicit name)
The userid = '2345' may be added in the [now absent] WHERE clause of the sub-query, for efficiency purposes (unless MySQL's optimizer is very smart, currently all hi-scores for all game+user combinations get calculated, whereby we only need these for user '2345'); down side duplication; solution; variables.
There are several ways to deal with the issues mentioned above, but these seem to be out of scope for a [now rather lenghty] explanation about the GROUP BY constructs.
Every field you have in your SELECT (when a GROUP BY clause is present) must be either one of the fields in the GROUP BY clause, or else a group function such as MAX, SUM, AVG, etc. In your code, userid is technically violating that but in a pretty harmless fashion (you could make your code technically SQL standard compliant with a GROUP BY gameid, userid); fields id and date are in more serious violation - there will be many ids and dates within one GROUP BY set, and you're not telling how to make a single value out of that set (MySQL picks a more-or-less random ones, stricter SQL engines might more helpfully give you an error).
I know you want the id and date corresponding to the maximum score for a given grouping, but that's not explicit in your code. You'll need a subselect or a self-join to make it explicit!
Use:
SELECT t.id,
t.gameid,
t.userid,
t.name,
t.score,
t.date
FROM HIGHSCORES t
JOIN (SELECT hs.gameid,
hs.userid,
MAX(hs.score) 'max_score'
FROM HIGHSCORES hs
GROUP BY hs.gameid, hs.userid) mhs ON mhs.gameid = t.gameid
AND mhs.userid = t.userid
AND mhs.max_score = t.score
WHERE t.userid = '2345'