Hive query to select only records in certain percentile - hive

I have table with two columns - ID and total duration:
id tot_dur
123 1
124 2
125 5
126 8
I want to have a Hive query that select only 75th percentile. It should be only the last record:
id tot_dur
126 8
This is what I have, but its hard for me to understand the use of OVER() and PARTITIONED BY() functions, since from what I researched, this are the functions I should use. Before I get the tot_dur column I should sum and group by column duration. Not sure if percentile is the correct function, because I found use cases with percentile_approx.
select k1.id as id, percentile(cast(tot_dur as bigint),0.75) OVER () as tot_dur
from (
SELECT id, sum(duration) as tot_dur
FROM data_source
GROUP BY id) k1
group by id

If I've got you right, this is what you want:
with data as (select stack(4,
123, 1,
124, 2,
125, 5,
126, 8) as (id, tot_dur))
-----------------------------------------------------------------------------
select data.id, data.tot_dur
from data
join (select percentile(tot_dur, 0.75) as threshold from data) as t
where data.tot_dur >= t.threshold;

Related

How to obtain a uniform sample from my SQL table?

Say I have a simple table:
id, value
2, 5
4, 3
10, 4
20, 5
24, 4
40, 3
60, 3
80, 3
150, 3
90, 3
120, 3
As you can see the majority of the value column is 3. If I want to obtain a subset of this table, there is a high likelihood that 3 would dominate, i.e., SELECT * FROM TABLE LIMIT 10. So how can I thus perform some statistics to ensure that I have a uniform distribution, i.e., a subset that contains 2 of each distinct value?
You can use a query like this
WITH cte AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY value ORDER BY id) AS rn
FROM data
)
SELECT
id,
value
FROM cte
WHERE rn < 3
GROUP BY value, id
It would give you at most 2 rows per value.
You can check a working demo here

SQL query to get average sum of other rows and store in current rows

I have table like this, and I want to query to store the average of others row points.
USER_ID POINTS
------------- --------
a14e43e4f851 134
1e86e5adedbf 40
3c66730edf69 149
32e24082f97b 67
b33e3100a7be 124
274ee414ad8f 85
bdeef25fc797 172
For example - for user_id = a14e43e4f851, the average sum of points should be
avg(40+149+67+124+85+172) .
PS - not taken the points (134) in calculation for user a14e43e4f851.
Output should look like this --
USER_ID POINTS AVG
------------- ------- ------
a14e43e4f851 134 106 which is avg(40+149+67+124+85+172)
1e86e5adedbf 40 avg(134+149+67+124+85+172)
3c66730edf69 149 avg(134+40+67+124+85+172)
32e24082f97b 67 avg(134+40+149+124+85+172)
b33e3100a7be 124 ...
274ee414ad8f 85 ...
bdeef25fc797 172 ...
You could use a correlated subquery:
select t.*,
(select avg(t1.points) from mytable t1 where t1.user_id <> t.user_id) as average
from mytable t
An alternative uses window functions:
select t.*,
(sum(points) over() - points) / nullif(count(*) - 1, 0) as average
from mytable t
Note: avg obviously conflicts with a language keyword, I use average instead.
If you wanted an update statement:
update mytable t
set t.average = (
select avg(t1.points) from mytable t1 where t1.user_id <> t.user_id
)
However, I would not recommend actually storing this value; this is derived information, that can easily be computed on the fly whenever needed, using the first statement. If you are going to run the query often, you could create a view:
create view myview as
select t.*,
(sum(points) over() - points) / nullif(count(*) - 1, 0) as average
from mytable t
I'm assuming user_id is the PK.
WITH q AS (SELECT sum(points) AS s, count(*) AS n FROM mytable)
UPDATE table SET average = (q.s-points)/(q.n-1);
The idea is that
average of all other user's score is sum(score)/count(*)
sum of all user scores except this one is equal to sum of all scores minus this user's score
average of all other users' score except this one is (sum(score)-score_for_this_user)/(count(*)-1)
Nice thing is it only has to calculate the sum() and count() once.
To handle the case where there is only one row in the table:
WITH q AS (SELECT sum(points) AS s, NULLIF(count(*),0) AS n FROM mytable)
UPDATE table SET average = (q.s-points)/(q.n-1);
This makes the count NULL instead of 0, so the average updated should be NULL too.

Calculating growth and retention rate with sql

So I wrote a query to calculate the retention, new and returning student growth rate. The code below returns a result similar to this.
Row visit_month student_type numberofstd growth
1 2013 new 574 null
2 2014 new 220 -62%
3 2014 retained 442 245%
4 2015 new 199 -10%
5 2015 retained 533 21%
6 2016 new 214 8%
7 2016 retained 590 11%
8 2016 returning 1 -100%
Query I have tried.
with visit_log AS (
SELECT studentid,
cast(substr(session, 1, 4) as numeric) as visit_month,
FROM abaresult
GROUP BY 1,
2
ORDER BY 1,
2),
time_lapse_2 AS (
SELECT studentid,
Visit_month,
lag(visit_month, 1) over (partition BY studentid ORDER BY studentid, visit_month) lag
FROM visit_log),
time_diff_calculated_2 AS (
SELECT studentid,
visit_month,
lag,
visit_month - lag AS time_diff
FROM time_lapse_2),
student_categorized AS (
SELECT studentid,
visit_month,
CASE
WHEN time_diff=1 THEN 'retained'
WHEN time_diff>1 THEN 'returning'
WHEN time_diff IS NULL THEN 'new'
END AS student_type,
FROM time_diff_calculated_2)
SELECT visit_month,
student_type,
Count(distinct studentid) as numberofstd,
ROUND(100 * (COUNT(student_type) - LAG(COUNT(student_type), 1) OVER (ORDER BY student_type)) / LAG(COUNT(student_type), 1) OVER (ORDER BY student_type),0) || '%' AS growth
FROM student_categorized
group by 1,2
order by 1,2
The query above calculates the retention, new and returning rate based on the figures of the last session student_type category.
I am looking for a way to calculate these figures based on the total number of students in each visit_month and not from each category. Is there a way I can achieve this?
I am trying to get a table similar to this
Row visit_month student_type totalstd numberofstd growth
1 2013 new 574 574 null
2 2014 new 662 220 62%
3 2014 retained 662 442 22%
4 2015 new 732 199 10%
5 2015 retained 732 533 21%
6 2016 new 804 214 8%
7 2016 retained 804 590 11%
8 2016 returning 804 1 100%
Note:
The totalstd is the total number of student in each session and is gotten by new+retention+returning.
The growth calculation was assumed.
Please help!
Thank you.
While I do not have your source data, I am relying myself in the query you shared and the output results.
I created some extra code in order to output the desired result. I would like to point that I did not have access to BigQuery's compilation because I did not have the data. Thus, I have tried to prevent any possible errors the query myself. In addition, the queries between ** are unchanged and were copied from your code. Below is the code (it is a mix of yours and the extra bits I created):
#*****************************************************************
with visit_log AS (
SELECT studentid,
cast(substr(session, 1, 4) as numeric) as visit_month,
FROM abaresult
GROUP BY 1,
2
ORDER BY 1,
2),
time_lapse_2 AS (
SELECT studentid,
Visit_month,
lag(visit_month, 1) over (partition BY studentid ORDER BY studentid, visit_month) lag
FROM visit_log),
time_diff_calculated_2 AS (
SELECT studentid,
visit_month,
lag,
visit_month - lag AS time_diff
FROM time_lapse_2),
student_categorized AS (
SELECT studentid,
visit_month,
CASE
WHEN time_diff=1 THEN 'retained'
WHEN time_diff>1 THEN 'returning'
WHEN time_diff IS NULL THEN 'new'
END AS student_type,
FROM time_diff_calculated_2)
#**************************************************************
#Code I added
#each unique visit_month will have its count
WITH total_stud AS (
SELECT visit_month, count(distinct studentid) as totalstd FROM visit_log
GROUP BY 1
ORDER BY visit_month
),
#After you have your student_categorized temp table, create another one
#It will have the count of the number of students per visit_month per student_type
number_std_monthType AS (
SELECT visit_month,student_type, Count(distinct studentid) as numberofstd from student_categorized
GROUP BY 1, 2
),
#You will have one row per combination of visit_month and student_type
student_categorized2 AS(
SELECT DISTINCT visit_month,student_type FROM student_categorized2
GROUP BY 1,2
),
#before calculation, create the table with all the necessary data
#you have the desired table without the growth
#notice that I used two keys to join t1 and t3 so the results are correct
final_t AS (
SELECT t1.visit_month,
t1.student_type,
t2.totalstd as totalstd,
t3.numberofstd
FROM student_categorized2 t1
LEFT JOIN total_stud AS t2 ON t1.visit_month = t2.visit_month
LEFT JOIN number_std_monthType t3 ON (t1.visit_month = t3.visit_month and t1.student_type = t3.student_type)
ORDER BY
)
#Now all the necessary values to calculate growth are in the temp table final_t
SELECT visit_month, student_type, totalstd, numberofstd,
ROUND(100 * (totalstd - LAG(totalstd) OVER (PARTITION BY visit_month ORDER BY visit_month ASC) /LAG(totalstd) OVER (PARTITION BY visit_month ORDER BY visit_month ASC) || '%' AS growth
FROM final_t
Notice that I used LEFT JOIN in order to have the proper counts in the final table, once each count was calculated in a different temp table. Also, I did not use your final SELECT statement.
If you have any issues with the code, do not hesitate to ask.

Two select count queries and then calculate percentage

I'd like to combine two queries
SELECT COUNT(*) FROM abc;
SELECT COUNT(Status) FROM ABC WHERE Status='Active';
And then calculate the percentage (by taking the 2nd query divided by first query). I'd like to achieve this in one single query. What i've attempted so far:
SELECT COUNT(*) AS A FROM abc
UNION
SELECT COUNT(Status) AS B FROM ABC WHERE Status='Active';
UNION
SELECT(COUNT(Status)*100/SELECT COUNT(*) FROM abc)) AS %ofAB FROM abc WHERE Status='Active'
What I get:
A
--
31
36
86,11111111
What I want:
A | B | %ofAB
---------------------
36 | 31 | 86,1111111%
This should give you what you want:
SELECT
COUNT(*) AS TotalCount,
SUM(IIF(Status = 'Active', 1, 0)) AS ActiveCount,
ROUND((SUM(IIF(Status = 'Active', 1, 0)) * 100/ COUNT(*)),2) AS PctActive
FROM
Abc
EDIT: Didn't notice that this was for Access. I don't know if CAST is available in Access, so you may need to use an equivalent function to make sure that the integers don't simply yield 1 or 0. It's possible that Access will convert a division into a decimal automatically, but in SQL Server it does not.

Joining onto a table that doesn't have ranges, but requires ranges

Trying to find the best way to write this SQL statement.
I have a customer table that has the internal credit score of that customer. Then i have another table with definitions of that credit score. I would like to join these tables together, but the second table doesn't have any way to link it easily.
The score of the customer is an integer between 1-999, and the definition table has these columns:
Score
Description
And these rows:
60 LOW
99 MED
999 HIGH
So basically if a customer has a score between 1 and 60 they are low, 61-99 they are med, and 100-999 they are high.
I can't really INNER JOIN these, because it would only join them IF the score was 60, 99, or 999, and that would exclude anyone else with those scores.
I don't want to do a case statement with the static numbers, because our scores may change in the future and I don't want to have to update my initial query when/if they do. I also cannot create any tables or functions to do this- I need to create a SQL statement to do it for me.
EDIT:
A coworker said this would work, but its a little crazy. I'm thinking there has to be a better way:
SELECT
internal_credit_score
(
SELECT
credit_score_short_desc
FROM
cf_internal_credit_score
WHERE
internal_credit_score = (
SELECT
max(credit.internal_credit_score)
FROM
cf_internal_credit_score credit
WHERE
cs.internal_credit_score <= credit.internal_credit_score
AND credit.internal_credit_score <= (
SELECT
min(credit2.internal_credit_score)
FROM
cf_internal_credit_score credit2
WHERE
cs.internal_credit_score <= credit2.internal_credit_score
)
)
)
FROM
customer_statements cs
try this, change your table to contain the range of the scores:
ScoreTable
-------------
LowScore int
HighScore int
ScoreDescription string
data values
LowScore HighScore ScoreDescription
-------- --------- ----------------
1 60 Low
61 99 Med
100 999 High
query:
Select
.... , Score.ScoreDescription
FROM YourTable
INNER JOIN Score ON YourTable.Score>=Score.LowScore
AND YourTable.Score<=Score.HighScore
WHERE ...
Assuming you table is named CreditTable, this is what you want:
select * from
(
select Description, Score
from CreditTable
where Score > 80 /*client's credit*/
order by Score
)
where rownum = 1
Also, make sure your high score reference value is 1000, even though client's highest score possible is 999.
Update
The above SQL gives you the credit record for a given value. If you want to join with, say, Clients table, you'd do something like this:
select
c.Name,
c.Score,
(select Description from
(select Description from CreditTable where Score > c.Score order by Score)
where rownum = 1)
from clients c
I know this is a sub-select that executed for each returning row, but then again, CreditTable is ridiculously small and there will be no significant performance loss because of the the sub-select usage.
You can use analytic functions to convert the data in your score description table to ranges (I assume that you meant that 100-999 should map to 'HIGH', not 99-999).
SQL> ed
Wrote file afiedt.buf
1 with x as (
2 select 60 score, 'Low' description from dual union all
3 select 99, 'Med' from dual union all
4 select 999, 'High' from dual
5 )
6 select description,
7 nvl(lag(score) over (order by score),0) + 1 low_range,
8 score high_range
9* from x
SQL> /
DESC LOW_RANGE HIGH_RANGE
---- ---------- ----------
Low 1 60
Med 61 99
High 100 999
You can then join this to your CUSTOMER table with something like
SELECT c.*,
sd.*
FROM customer c,
(select description,
nvl(lag(score) over (order by score),0) + 1 low_range,
score high_range
from score_description) sd
WHERE c.credit_score BETWEEN sd.low_range AND sd.high_range