I have a table called Graduates recording the name and income of each graduate. Now I need to calculate the median income. Here is the code from a book.
My questions are:
What is the result of the HAVING clause?
What is the result of the self-join?
SELECT AVG(DISTINCT income)
FROM (
    SELECT T1.income
    FROM Graduates T1, Graduates T2   -- self-join: pairs every row with every row
    GROUP BY T1.income
    -- at least half of all values lie at or above this income...
    HAVING SUM(CASE WHEN T2.income >= T1.income THEN 1 ELSE 0 END)
               >= COUNT(*) / 2
    -- ...and at least half lie at or below it
       AND SUM(CASE WHEN T2.income <= T1.income THEN 1 ELSE 0 END)
               >= COUNT(*) / 2
) TMP;
The code finds the one or two middle values. It does this by counting, for each income, how many values lie above it and how many lie below it.
Each of the SUM()s in the HAVING clause counts the number of values at or above (respectively, at or below) a given income. The expression is saying something to this effect:
The middle value(s) are the ones with at least half of the values at or above them and at least half at or below them.
The median is then the average of the middle values. If there is one middle value, the average is that value itself. If there are two, their average is the median.
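A worked micro-example (hypothetical data, not from the question): suppose Graduates holds the four incomes 100, 200, 300 and 400, so every T1.income group compares against all four values and the threshold COUNT(*) / 2 is 2. For income 200, three values are >= 200 and two are <= 200, so both conditions hold; 300 passes symmetrically, while 100 and 400 each fail one side. AVG(DISTINCT income) over {200, 300} then gives the median, 250.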
This is awful for multiple reasons:
It only works on numeric values. But medians are defined for strings and dates as well.
It requires a self-join, which is expensive.
It is rather indecipherable.
It is not obvious how to get medians within groups rather than over the entire dataset.
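For what it's worth, many dialects now sidestep most of these complaints with an ordered-set aggregate. A minimal sketch, assuming a database such as PostgreSQL or Oracle that supports PERCENTILE_CONT:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY income) AS median_income
FROM Graduates;
Adding a GROUP BY column to the same query yields per-group medians with no self-join.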
I've been playing around with it for the whole day and it's by far the hardest topic in SQL for me to understand.
Say we have a students table that holds a group number and a rating for each student, like so:
Each group can contain multiple students.
Now I want to find the groups where at least 60% of the students have a rating of 4 or higher.
Expecting something like:
group_n    percentage_of_goodies
1120       0.7
1200       0.66
1111       1
I tried this
with group_goodies as (
select group_n, count(id) goodies from students
where rating >= 4
group by group_n
), group_counts as (
select group_n, count(id) acount from students
group by group_n
)
select cast(group_goodies.goodies as float)/group_counts.acount from group_counts, group_goodies
where cast(group_goodies.goodies as float)/group_counts.acount > 0.6;
and got an unexpected result
where the percentage seems to surpass 100% (and it's not because I misplaced the denominator, since there are contradictory outputs below as well), which is obviously not intended. There are also more output rows than there are groups. Apparently I could use window functions, but I can't figure them out myself. So how can I get this query right?
The problem is that extracting the count of students both with and without the rating filter seemed impossible within a single query, so I created two CTEs to get the numbers I need. Each CTE seems to output a proper result on its own: in the first, the per-group counts rarely exceed 10, and in the second, the totals likewise match the data. But when I divide them like that, the result is unreasonable.
If someone could explain this properly, you would make my day 😳
If I understand correctly, the unexpected output comes from your FROM clause: group_counts, group_goodies with no join condition on group_n is a cross join, so every group's goodie count gets divided by every other group's total. That is why the ratios exceed 100% and why there are more rows than groups. The task itself is a pretty direct aggregation query:
select group_n, avg( (rating >= 4)::int ) as percentage_of_goodies
from students
group by group_n
having avg( (rating >= 4)::int ) > 0.6;
I don't see why two levels of aggregation would be needed.
The avg() works by converting each rating to 1 if it is 4 or higher and to 0 otherwise; the average of these 0/1 values is exactly the fraction that are 1. For example, ratings (5, 4, 3) become (1, 1, 0), whose average is 2/3. (The ::int cast is PostgreSQL syntax; in other dialects use CASE WHEN rating >= 4 THEN 1 ELSE 0 END.)
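For completeness, a minimal repair of your original two-CTE attempt; the only substantive change is joining the CTEs on group_n instead of cross-joining them:
with group_goodies as (
    select group_n, count(id) as goodies
    from students
    where rating >= 4
    group by group_n
), group_counts as (
    select group_n, count(id) as acount
    from students
    group by group_n
)
select g.group_n,
       cast(g.goodies as float) / c.acount as percentage_of_goodies
from group_goodies g
join group_counts c on c.group_n = g.group_n   -- the missing join condition
where cast(g.goodies as float) / c.acount > 0.6;
Groups with no rating of 4 or higher simply drop out of the join, which is fine here since they could never pass the 60% bar anyway.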
First, aggregate the students per group and rating to get partial counts.
Then calculate the fraction of ratings that are 4 or better and report only those groups where that fraction exceeds 0.6.
SELECT group_n,
       1.0 *   -- force numeric division; integer division would floor to 0
       sum(ratings_per_group) FILTER (WHERE rating >= 4) /
       sum(ratings_per_group)
           AS fraction_of_goodies
FROM (SELECT group_n, rating,
             count(*) AS ratings_per_group
      FROM students
      GROUP BY group_n, rating
     ) AS per_group_and_rating
GROUP BY group_n
HAVING 1.0 *
       sum(ratings_per_group) FILTER (WHERE rating >= 4) /
       sum(ratings_per_group) > 0.6;
I'm a newbie to HiveQL.
I have a raw table with 6 million records like this:
I want to count the number of distinct IP_address values accessing each Modem_id every week.
The result table I want would look like this:
I did it with a left join, and it worked. But since the join is time-consuming, I want to do it with a CASE WHEN statement instead, and I can't write a correct one. Do you have any ideas?
This is the join statement I used:
select a.modem_id,
a.Number_of_IP_in_Day_1,
b.Number_of_IP_in_Day_2
from
(select modem_id,
count(distinct ip_address) as Number_of_IP_in_Day_1
from F_ACS_DEVICE_INFORMATION_NEW
where day=1
group by modem_id) a
left join
(select modem_id,
count(distinct ip_address) as Number_of_IP_in_Day_2
from F_ACS_DEVICE_INFORMATION_NEW
where day=2
group by modem_id) b
on a.modem_id= b.modem_id;
You can express your logic using just aggregation:
select a.modem_id,
count(distinct case when day = 1 then ip_address end) as day_1,
count(distinct case when day = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id;
You can obviously extend this for more days.
Note: As your question and code are written, this assumes that your base table has data for only one week. Otherwise, I would expect some date filtering. Presumably, that is what the _NEW suffix means on the table name.
Based on your question and further comments, you would like
The number of different IP addresses accessed by each modem
In counts by week (as columns) for 4 weeks
e.g., result would be 5 columns
modem_id
IPs_accessed_week1
IPs_accessed_week2
IPs_accessed_week3
IPs_accessed_week4
My answer here is based on knowledge of SQL - I haven't used Hive but it appears to support the things I use (e.g., CTEs). You may need to tweak the answer a bit.
The first key step is to turn the day_num into a week_num. A straightforward way to do this is FLOOR((day_num-1)/7)+1, so days 1-7 become week 1, days 8-14 become week 2, etc. (for example, day 8 gives FLOOR(7/7)+1 = 2).
Note - it is up to you to make sure the day_nums are correct. I would guess you'd actually want info for the last 4 weeks, not the first four weeks of data - and as such you'd probably calculate the day_num as something like SELECT DATEDIFF(day, IP_access_date, CAST(getdate() AS date)) - or whatever the equivalent is in Hive.
There are a few ways to do this - I think the clearest is to use a CTE to convert your dataset to what you need e.g.,
convert day_nums to weeknums
get rid of duplicates within the week (your code has COUNT(DISTINCT ...) - I assume this is what you want) - I'm doing this with SELECT DISTINCT (rather than grouping by all fields)
From there, you could PIVOT the data to get it into your table, or just use SUM of CASE statements. I'll use SUM of CASE here as I think it's clearer to understand.
WITH IPs_per_week AS
(SELECT DISTINCT
modem_id,
ip_address,
FLOOR((day-1)/7)+1 AS week_num -- Note I've referred to it as day_num in text for clarity
FROM F_ACS_DEVICE_INFORMATION_NEW
)
SELECT modem_id,
SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_accessed_week1,
SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_accessed_week2,
SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_accessed_week3,
SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_accessed_week4
FROM IPs_per_week
GROUP BY modem_id;
I have a view consisting of data from different tables; the major fields are BillNo, ITEM_FEE and GroupNo. I need to calculate the total discount for a given GroupNo. The discount calculation is based on the fractional part of the amount, grouped by BillNo (a single BillNo can have multiple entries): if there are multiple transactions for a BillNo, the discount is the decimal part of the sum of ITEM_FEE when that decimal part is greater than 0; if there is only a single transaction and the decimal part of its ITEM_FEE is greater than 0, that decimal part is treated as the discount.
I have prepared a script and I am getting the total discount for a particular GroupNo.
declare @GroupNo as nvarchar(100)
set @GroupNo = '3051'

SELECT Sum(disc) Discount
FROM (SELECT (SELECT CASE
                       WHEN ( Sum(item_fee) )%1 > 0 THEN Sum(( item_fee )%1)
                     END
              FROM view_bi_sales VBS
              WHERE VBS.billno = VB.billno
              GROUP BY billno) Disc
      FROM view_bi_sales VB
      WHERE groupno = @GroupNo) temp
The problem is that it takes almost 2 minutes to get the result.
Please help me to find the result faster if possible.
Thank you for all your help and support. As I was already calculating the sum of the decimal parts of ITEM_FEE grouped by BillNo, there was no need to check whether it was greater than 0 or not. The query below gives me the desired output in less than 10 seconds:
select sum(discount)
from (select sum((ITEM_FEE)%1) discount
      from view_BI_Sales
      where groupno = 3051
      group by BillNo) temp
If I understand correctly, you don't need a JOIN: your version runs a correlated subquery against the view once per outer row, whereas grouping by billno in a single pass scans the view only once. This might help performance:
SELECT SUM(disc) as Discount
FROM (SELECT (CASE WHEN SUM(item_fee % 1) > 0
THEN SUM(item_fee % 1)
END) as disc
FROM view_bi_sales VBS
WHERE groupno = @GroupNo
GROUP BY billno
) vbs;
Let's say I have two columns: Date and Indicator
Usually the indicator goes from 0 to 1 (when the data is sorted by date) and I want to be able to identify if it goes from 1 to 0 instead. Is there an easy way to do this with SQL?
I am already aggregating other fields in the same table. If I can add this as another aggregation (e.g., without using a separate "where" clause or passing over the data a second time) it would be pretty awesome.
This is the phenomenon I want to catch:
Date Indicator
1/5/01 0
1/4/01 0
1/3/01 1
1/2/01 1
1/1/01 0
This isn't a Teradata-specific answer; it can be done in normal SQL.
Assuming that the sequence is already 'complete', so that each element's successor can be derived from it (such as when the dates are sequential and all present):
SELECT curr.date -- the 0 on the day following the 1
FROM r curr
JOIN r prev
-- join each day with the previous day
ON curr.date = dateadd(d, 1, prev.date)
WHERE curr.indicator = 0
AND prev.indicator = 1
YMMV on the ability of such a query to use indexes efficiently.
If the sequence is not complete, the same approach can be applied after constructing a delegate sequence that is well ordered and similarly 'complete'.
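For instance, a ROW_NUMBER() over the dates can serve as such a delegate sequence (a sketch, assuming the dialect supports window functions; the table r and its columns follow the query above):
WITH seq AS (
    SELECT r.*,
           ROW_NUMBER() OVER (ORDER BY r.date) AS rn   -- gap-free surrogate for the date
    FROM r
)
SELECT curr.date
FROM seq curr
JOIN seq prev
  ON curr.rn = prev.rn + 1   -- adjacent rows, even where the dates have gaps
WHERE curr.indicator = 0
  AND prev.indicator = 1;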
This can also be done using correlated subqueries, each selecting the indicator of the 'previous max', but... ugh.
Joining the table against itself is quite generic, but most SQL dialects now support analytic functions. Ideally you could use LAG(), but Teradata seems to support only the absolute minimum of these, so they point you to SUM() combined with a preceding-rows window.
In any regard, this method avoids a potentially costly join and deals effectively with gaps in the data, while making maximum use of indexes.
SELECT
    *
FROM
    yourTable t
QUALIFY
    t.indicator <
    SUM(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
                           ORDER BY t.Date
                           ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
QUALIFY is a bit Teradata-specific, but slightly tidier than the alternative...
SELECT
    *
FROM
    (
    SELECT
        t.*,
        SUM(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
                               ORDER BY t.Date
                               ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
            AS previous_indicator
    FROM
        yourTable t
    ) lagged
WHERE
    lagged.indicator < lagged.previous_indicator
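For dialects with fuller analytic-function support (a sketch, not Teradata-tested), LAG() expresses the same idea more directly:
SELECT *
FROM (
    SELECT t.*,
           LAG(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
                                  ORDER BY t.Date) AS previous_indicator
    FROM yourTable t
) lagged
WHERE lagged.indicator < lagged.previous_indicator;   -- a 0 directly after a 1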
Supposing you mean that you want to determine whether any row with indicator 1 has an earlier Date than some row in its group with indicator 0, you can identify groups with that characteristic by including the appropriate extreme dates in your aggregate results:
SELECT
...
MAX(CASE indicator WHEN 0 THEN Date END) AS last_ind_0,
MIN(CASE indicator WHEN 1 THEN Date END) AS first_ind_1,
...
You then test whether first_ind_1 is less than last_ind_0, either in code or as another selection item.
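For example, folding the test into the same query (a sketch; grp stands for whatever your grouping column is):
SELECT grp,
       CASE WHEN MIN(CASE indicator WHEN 1 THEN Date END)
              <  MAX(CASE indicator WHEN 0 THEN Date END)
            THEN 1 ELSE 0 END AS went_from_1_to_0
FROM yourTable
GROUP BY grp;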
I have a table called 'tweets'. The table includes (amongst others) the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh:mm:ss), 'classified' and 'processed_text'. Within the 'processed_text' column there are certain strings such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of 'classified' per ticker-string per day. The column 'classified' holds the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of ‘classified’ for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I could also do them manually, one by one, but this would cost a terrible amount of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I stated, however, I am now looking for the average classified score per ticker-string. My question is therefore: could someone show me how to adjust this script so that it calculates the average classified score per ticker-string per day?
SELECT d.date, t.ticker,
       COALESCE(COUNT(DISTINCT t.tweet_id), 0) AS tweets,
       AVG(t.classified) AS avg_classified
FROM dates d
LEFT JOIN
     (SELECT DATE(created_at) AS date,
             tweet_id,
             classified,
             SUBSTR(processed_text,
                    LOCATE('{TICKER|', processed_text) + 8,
                    LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
                        - LOCATE('{TICKER|', processed_text) - 8) AS ticker
      FROM tweets
     ) t
     ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
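Alternatively, if you want to keep the column-per-ticker layout of your counting script, the same conditional-aggregation pattern adapts to averages (a sketch based on the script above; AVG simply ignores the NULLs that the CASE produces for non-matching rows):
SELECT d.date,
       COALESCE(t.IBM, 0) AS IBM,
       COALESCE(t.GOOG, 0) AS GOOG,
       COALESCE(t.BAC, 0) AS BAC
FROM dates d LEFT JOIN
     (SELECT DATE(created_at) AS date,
             AVG(CASE WHEN processed_text LIKE '%{TICKER|IBM}%' THEN classified END) AS IBM,
             AVG(CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' THEN classified END) AS GOOG,
             AVG(CASE WHEN processed_text LIKE '%{TICKER|BAC}%' THEN classified END) AS BAC
      FROM tweets
      GROUP BY date
     ) t
     ON d.date = t.date;
Note that COALESCE maps days with no matching tweets to 0, which matches your stated requirement but conflates "no tweets" with an average score of 0; drop it if you would rather see NULL.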