Calculating Proportions for a Baseball-Related Query - SQL

Here are my two tables:
BIO - contains player biographical information with the following columns
i. PLAYER_ID
ii. PLAYER_NAME
iii. DATE_OF_BIRTH
iv. TEAM_NAME
PITCHES - contains batter and pitcher statistics by pitch with the following columns
i. GAME_DATE (formatted YYYY-MM-DD, e.g. 2016-01-01)
ii. BATTER_PLAYER_ID
iii. PITCHER_PLAYER_ID
iv. PITCHER_THROW_SIDE (L/R)
v. BATTER_HAND (L/R)
vi. PITCH_TYPE (Changeup, Curveball, Cutter, 4-seam fastball, Knuckleball, 2-Seam Fastball, Slider, Splitter)
vii. PITCH_CALL (Ball, CatcherInterference, FoulBall, HitByPitch, InPlay, StrikeCalled, StrikeSwinging)
viii. IN_ZONE (YES/NO)
I want a query that returns the names of players with an in-zone or out-of-zone swinging strike rate of greater than 15% on fastballs for the 2016 and 2017 seasons combined. I also want team name and pitcher handedness, and I want cutters and sinkers counted as fastballs.
Here is what I have so far:
SELECT b.PLAYER_NAME, b.TEAM_NAME, p.PITCHER_THROW_SIDE
FROM BIO AS b INNER JOIN PITCHES AS p
ON b.PLAYER_ID = p.PITCHER_PLAYER_ID
WHERE p.PITCH_TYPE IN ('4-seam fastball', '2-Seam Fastball', 'Cutter')
AND p.GAME_DATE BETWEEN '2016-01-01' AND '2017-12-31'
GROUP BY b.PLAYER_ID
HAVING (Count(IN_ZONE)) ....
I think this is the right idea... but I'm a bit lost now as to how I can include the 15% in-zone/out-of-zone rates.
Thank you for any help.

How do you calculate an in-zone or out-of-zone rate of greater than 15%?
If it is Nr_Yes / total:
HAVING SUM(CASE WHEN IN_ZONE = 'YES' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) > 0.15
If it is Nr_Yes / Nr_No:
HAVING SUM(CASE WHEN IN_ZONE = 'YES' THEN 1 ELSE 0 END) * 1.0
     / SUM(CASE WHEN IN_ZONE = 'NO' THEN 1 ELSE 0 END) > 3.0 / 17
Note that 3/17 is 15/85; since YES and NO are the only possibilities, the two conditions are equivalent. The CASE expression needs its closing END, and the * 1.0 forces decimal division (otherwise many engines would truncate the integer ratio, and 3/17 itself, to 0).
Also note that with the sum of 0s and 1s I am actually simulating a
count(*) where IN_ZONE='YES'
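Putting those pieces together with your draft, here is a minimal sketch of the full query. It makes a few assumptions: that "swinging strike rate" means swinging strikes (PITCH_CALL = 'StrikeSwinging') divided by pitches in that zone bucket, that "sinker" corresponds to the '2-Seam Fastball' value of PITCH_TYPE (the schema has no separate Sinker value), and that the join on PITCHER_PLAYER_ID from your draft is the one you intend. The * 1.0 forces decimal division and NULLIF guards against an empty bucket:

SELECT b.PLAYER_NAME, b.TEAM_NAME, p.PITCHER_THROW_SIDE
FROM BIO AS b
INNER JOIN PITCHES AS p
    ON b.PLAYER_ID = p.PITCHER_PLAYER_ID
WHERE p.PITCH_TYPE IN ('4-seam fastball', '2-Seam Fastball', 'Cutter')
  AND p.GAME_DATE BETWEEN '2016-01-01' AND '2017-12-31'
GROUP BY b.PLAYER_NAME, b.TEAM_NAME, p.PITCHER_THROW_SIDE
HAVING SUM(CASE WHEN p.IN_ZONE = 'YES' AND p.PITCH_CALL = 'StrikeSwinging' THEN 1 ELSE 0 END) * 1.0
       / NULLIF(SUM(CASE WHEN p.IN_ZONE = 'YES' THEN 1 ELSE 0 END), 0) > 0.15
    OR SUM(CASE WHEN p.IN_ZONE = 'NO' AND p.PITCH_CALL = 'StrikeSwinging' THEN 1 ELSE 0 END) * 1.0
       / NULLIF(SUM(CASE WHEN p.IN_ZONE = 'NO' THEN 1 ELSE 0 END), 0) > 0.15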

Related

JOIN the same table on two columns

I use JOINs to replace country and product IDs in import and export data with the actual country and product names stored in separate tables. In the data source table (data), there are two columns with country IDs, for origin and destination, both of which I am replacing with country names.
The code I have come up with refers to the country_names table twice (as country_names and country_names2), which doesn't seem very elegant. I expected to be able to refer to the table just once, by a single name. I would be grateful if someone pointed me to a more elegant and maybe more efficient way to achieve the same result.
SELECT
country_names.name AS origin,
country_names2.name AS dest,
product_names.name AS product,
SUM(data.export_val) AS export_val,
SUM(data.import_val) AS import_val
FROM
OEC.year_origin_destination_hs92_6 AS data
JOIN
OEC.products_hs_92 AS product_names
ON
data.hs92 = product_names.hs92
JOIN
OEC.country_names AS country_names
ON
data.origin = country_names.id_3char
JOIN
OEC.country_names AS country_names2
ON
data.dest = country_names2.id_3char
WHERE
data.year > 2012
AND data.export_val > 1E8
GROUP BY
origin,
dest,
product
The table to convert product IDs to product names has 6K+ rows. Here is a small sample:
id hs92 name
63215 3215 Ink
2130110 130110 Lac
21002 1002 Rye
2100200 100200 Rye
52706 2706 Tar
20902 902 Tea
42203 2203 Beer
42302 2302 Bran
178703 8703 Cars
The table to convert country IDs to country names (which is the table I have to JOIN on twice) has 264 rows for all countries in the world. (id_3char is the column used.) Here is a sample:
id id_3char name
euchi chi Channel Islands
askhm khm Cambodia
eublx blx Belgium-Luxembourg
eublr blr Belarus
eumne mne Montenegro
euhun hun Hungary
asmng mng Mongolia
nabhs bhs Bahamas
afsen sen Senegal
And here is a sample of data from the import and export data table with a total of 205M rows that has the two columns origin and dest that I am making a join on:
year origin dest hs92 export_val import_val
2009 can isr 300410 2152838.47 3199.24
1995 chn jpn 590190 275748.65 554154.24
2000 deu gmb 100610 1573508.44 1327.0
2008 deu jpn 540822 10000.0 202062.43
2010 deu ukr 950390 1626012.04 159423.38
2006 esp prt 080530 2470699.19 125291.33
2006 grc ind 844859 8667.0 3182.0
2000 ltu deu 630399 6018.12 5061.96
2005 usa zaf 290219 2126216.52 34561.61
1997 ven ecu 281122 155347.73 1010.0
I think what you already have can be considered good enough to just use as is :o)
Meanwhile, if for some reason you really want to avoid the two joins on that country table, what you can do is materialize the select statement below into, say, an `OEC.origin_destination_pairs` table
SELECT
o.id_3char o_id_3char,
o.name o_name,
d.id_3char d_id_3char,
d.name d_name
FROM `OEC.country_names` o
CROSS JOIN `OEC.country_names` d
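To materialize it, one option (a sketch, assuming BigQuery standard SQL DDL is available; with legacy SQL you would instead write the query results to a destination table) is to wrap that select in CREATE TABLE ... AS:

CREATE TABLE `OEC.origin_destination_pairs` AS
SELECT
  o.id_3char AS o_id_3char,
  o.name AS o_name,
  d.id_3char AS d_id_3char,
  d.name AS d_name
FROM `OEC.country_names` o
CROSS JOIN `OEC.country_names` d;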
Then you can just join on that new table as below
SELECT
country_names.o_name AS origin,
country_names.d_name AS dest,
product_names.name AS product,
SUM(data.export_val) AS export_val,
SUM(data.import_val) AS import_val
FROM OEC.year_origin_destination_hs92_6 AS data
JOIN OEC.products_hs_92 AS product_names
ON data.hs92 = product_names.hs92
JOIN OEC.origin_destination_pairs AS country_names
ON data.origin = country_names.o_id_3char
AND data.dest = country_names.d_id_3char
WHERE data.year > 2012
AND data.export_val > 1E8
GROUP BY
origin,
dest,
product
The motivation behind the above is the cost of storing and querying in your particular case.
Your `OEC.country_names` table is only about 10KB in size, but each time you query it you pay as if it were 10MB (charges are rounded to the nearest MB, with a minimum of 10MB of data processed per table referenced by the query, and a minimum of 10MB of data processed per query).
So if you materialize the above-mentioned table, it will still be less than 10MB, so there is no difference in querying charges. The situation is similar for storing the table: no visible change in charges.
You can read more about pricing here

SQL (COUNT(*) / locations.area)

We are learning SQL at school, and my professor has this SQL code in his documents.
SELECT wp.city, (COUNT(*) / locations.area) AS population_density
FROM world_poulation AS wp
INNER JOIN location
ON wp.city = locations.city
WHERE locations.state = 'Hessen'
GROUP BY wp.city, locations.area
Almost everything is clear to me; it's just the aggregate function with /locations.area that doesn't make any sense to me. Can anybody help?
Thank you in advance!
Look at what the query is grouped on; that tells you what each group consists of. In this case, each group is a city, and contains all the rows that have the same value for wp.city (and as the location table is joined on that value too, the locations.area is only included in the grouping so that it can be used in the result).
So each group has a number of rows, and the COUNT(*) aggregate will contain the number of rows for each group. The value of (COUNT(*) / locations.area) will be the number of rows in the group divided by the value of locations.area for that group.
If you had data like this:
world_population
name city
--------- ---------
John London
Peter London
Sarah London
Malcolm London
Ian Cardiff
Johanna Stockholm
Sven Stockholm
Egil Stockholm
locations
city state area
----------- -------------- ---------
London Hessen 2
Cardiff Somehere else 14
Stockholm Hessen 1
Then you would get a result with two groups (as Cardiff is not in the state Hessen). One group has four people from London which has the area 2, so the population density would be 2. The other group has three people from Stockholm which has the area 1, so the population density would be 3.
Side note: There is a typo in the query, as it joins in the table location but refers to it as locations everywhere else.
Try writing it like:
SELECT wp.city,
locations.area,
COUNT(*) AS population,
(COUNT(*) / locations.area) AS population_density
FROM world_population AS wp
INNER JOIN locations
ON wp.city = locations.city
WHERE locations.state = 'Hessen'
GROUP BY wp.city, locations.area
The key is the GROUP BY statement. You are showing pairs of cities and areas. The COUNT(*) is the number of times a given pair shows up in the table you created by joining world population and location. The area is just a number, so you can divide the COUNT by the area.
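With the sample data above, that rewritten query would return one row per qualifying city (a sketch of the expected output, following the walkthrough earlier: London has 4 rows over area 2, Stockholm 3 rows over area 1):

city        area  population  population_density
----------  ----  ----------  ------------------
London      2     4           2
Stockholm   1     3           3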

SQL - Selecting Records with an odd number of a given attribute

I'm just brushing up on some SQL - in other words, I'm really rusty - and am a bit stuck at the moment. It's probably something trivial, but we'll see.
I'd like to select all people who have an odd number of rows for a certain non-integer attribute (in this example, TransactionType). So, for example, take the following test (not real) info, where these people are buying a car or making some similarly big purchase.
Name TransactionType Date
John Buy 5/1
John Cancel 5/1
John Buy 5/2
Joseph Buy 5/25
Joseph Cancel 5/25
Tanya Buy 5/28
I would like it to return the people who had an odd number of transactions; in other words, they ended up purchasing the item. So, in this case, John and Tanya would be selected and Joseph would not.
I know I can use the modulus operator here, but I'm a bit lost on how to utilize it correctly.
I thought of using
count(TransactionType) % 2 != 0
in the where clause but that's obviously a no-go. Any pointers in the right direction would be very helpful. Let me know if this is unclear, and thanks!
You are close. You need a having clause instead of a where clause.
select Name
from table
group by Name
having count(TransactionType) % 2 != 0
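With the sample data, COUNT(TransactionType) per name is John = 3, Joseph = 2, Tanya = 1, so the HAVING filter keeps John and Tanya, matching the expected result.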
Wouldn't you be better off getting the latest status by the transaction date and using that, rather than relying on a count of TransactionType to determine the latest status?
Something like this:
SELECT b.Name, b.TransactionType, b.[Date]
FROM (
SELECT Name, MAX(t1.[DATE]) latestDate
FROM [Transactions] t1
GROUP BY t1.Name
) a
INNER JOIN [Transactions] b ON b.Name = a.Name AND a.latestDate = b.[Date]
WHERE b.TransactionType = 'Buy'
Assuming your dates are valid dates with times included, this should work.
Sample SQL Fiddle
If you only store the date portion, the max date would be the same for people who Buy and Cancel on the same date, so it would return more data and some incorrect records.

SQL SUM with Repeating Sub Entries - Best Practice?

I hit this issue regularly but here is an example....
I have Order and Delivery tables. Each order can have one to many deliveries.
I need to report totals based on the Order Table but also show deliveries line by line.
I can write the SQL and associated Access Report for this with ease ....
SELECT xxx
FROM
Order
LEFT OUTER JOIN
Delivery on Delivery.OrderNO = Order.OrderNo
until I get to the summing element. I obviously only want to sum each Order once, not the 1-many times there are deliveries for that order.
e.g. the SQL might return the following based on three orders (ignore the banality of the report; this is very much simplified):
Region OrderNo Value Delivery Date
North 1 £100 12-04-2012
North 1 £100 14-04-2012
North 2 £73 01-05-2012
North 2 £73 03-05-2012
North 2 £73 07-05-2012
South 3 £50 23-04-2012
I would want to report:
Total Sales North - £173
Delivery 12-04-2012
Delivery 14-04-2012
Delivery 01-05-2012
Delivery 03-05-2012
Delivery 07-05-2012
Total Sales South - £50
Delivery 23-04-2012
The bit I'm referring to is the calculation of the £173 and £50, the first of which obviously shouldn't come out as £419!
In the past I've used things like MAX (for a given Order) but that seems like a fudge.
Surely there must be a regular answer to this seemingly common problem but I can't find one.
I don't necessarily need the code - just a helpful point in the right direction.
Many thanks,
Chris.
A ROLLUP operator may not look pretty. However, it would do the regular aggregates that you see now, and it would also show the subtotals of the orders, which is what you're looking for.
SELECT xxx
FROM
Order
LEFT OUTER JOIN
Delivery on Delivery.OrderNO = Order.OrderNo
GROUP BY xxx
WITH ROLLUP;
I'm not exactly sure how the rest of your query is set up, but it would look something like this:
Region OrderNo Value Delivery Date
North 1 £100 12-04-2012
North 1 £100 14-04-2012
North 2 £73 01-05-2012
North 2 £73 03-05-2012
North 2 £73 07-05-2012
NULL NULL £419 NULL
I believe what you want is called a windowing function for your aggregate operation. It looks like the following:
SELECT xxx, SUM(Value) OVER (PARTITION BY Order.Region) as OrderTotal
FROM
Order
LEFT OUTER JOIN
Delivery on Delivery.OrderNO = Order.OrderNo
Here's the MSDN article. The PARTITION BY tells the SUM to be done separately for each distinct Order.Region.
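(Note that over the joined rows shown in the question, this would give North a total of £419, i.e. 100 + 100 + 73 + 73 + 73: each order is counted once per delivery row, which is exactly the double counting being asked about; hence the edit below.)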
Edit: I just noticed that I missed what you said about orders being counted multiple times. One thing you could do is SUM() the values before joining, as a CTE (guessing at your schema a bit):
WITH RegionOrders AS (
SELECT Region, OrderNo, Value, SUM(Value) OVER (PARTITION BY Region) AS RegionTotal
FROM Order
)
SELECT RO.Region, RO.OrderNo, RO.Value, D.DeliveryDate, RO.RegionTotal
FROM RegionOrders RO
INNER JOIN Delivery D ON D.OrderNo = RO.OrderNo
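Because RegionTotal is computed from the Order table before the join to Delivery, each order's Value contributes to its regional total only once, no matter how many delivery rows it has. With the question's sample data (and assuming the Delivery table's date column is DeliveryDate, as the CTE guesses), this would return:

Region  OrderNo  Value  DeliveryDate  RegionTotal
North   1        £100   12-04-2012    £173
North   1        £100   14-04-2012    £173
North   2        £73    01-05-2012    £173
North   2        £73    03-05-2012    £173
North   2        £73    07-05-2012    £173
South   3        £50    23-04-2012    £50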

Tricky SQL I can't get... Dates and NULLs

Imagine a table that has some line items
LineItemID CountryID Date
1 China 6/26/2011
2 China 6/27/2011
3 US 3/21/2011
I also have a table that has some rates so:
CountryID ExchangeRateDate ExchangeRateDateTo Rate
US 1/1/2011 NULL 1
China 6/1/2011 6/13/2011 6.06
China 6/13/2011 6/26/2011 6.13
China 6/26/2011 NULL 6.26
Notice the rate for the US doesn't change; it's simply a rate of 1 with a NULL for the ExchangeRateDateTo. I can join the tables by the CountryID no problem; my issue is how I can join not only on the country ID but also use the first table's date with the second table's ExchangeRateDate/To to pick up the correct rate.
For instance, I cannot say
WHERE FirstTable.[Date] BETWEEN SecondTable.ExchangeRateDate AND SecondTable.ExchangeRateDateTo
because, for instance, China's latest rate has a NULL end date (it runs from 6/26/2011 until NULL).
So basically I am looking for a way to get the result, for instance for China, using the rate of 6.26, because the dates are 6/26-6/27. The rate of 6.13 ended right on 6/26, so it should pick up the new rate.
So my join would be on CountryID plus using the date range to pick up the right rate; if I only join by the CountryID, you can see that that would yield a Cartesian product.
Use a sentinel value
WHERE
FirstTable.[Date]
BETWEEN SecondTable.ExchangeRateDate
AND ISNULL(SecondTable.ExchangeRateDateTo, '99991231')
You can also do this in the JOIN:
FROM
FirstTable F
JOIN
SecondTable S ON F.[Date] BETWEEN S.ExchangeRateDate AND ISNULL(S.ExchangeRateDateTo, '99991231')
COALESCE is more portable but has side effects around datatype precedence.
It's acceptable to store 99991231 as the ExchangeRateDateTo date: it may not be "correct", but it simplifies code and JOINs.
Edit: to work around overlapping range boundaries (the 6.13 rate ends on the same day the 6.26 rate starts), use a non-inclusive comparison
Assuming the FromDate is the start of the range...
FROM
FirstTable F
JOIN
SecondTable S ON F.[Date] >= S.ExchangeRateDate AND
F.[Date] < ISNULL(S.ExchangeRateDateTo, '99991231')
Change to > and <= to make it more confusing, if required
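With the sample data, the half-open comparison puts both China line items (6/26 and 6/27) into the range that starts on 6/26/2011, so both pick up the 6.26 rate; the 6.13 range no longer claims its end date of 6/26, and the US item falls into the open-ended range with rate 1.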
For your ExchangeRateDateTo comparisons, just use:
COALESCE(ExchangeRateDateTo, GETDATE())
...which will use either the non-NULL end date, or the current date if the date field is NULL.