I have a table with two columns, ID and total duration:
id tot_dur
123 1
124 2
125 5
126 8
I want a Hive query that selects only the 75th percentile. The result should be just the last record:
id tot_dur
126 8
This is what I have, but it's hard for me to understand the use of OVER() and PARTITION BY, which, from what I have researched, are what I should use. Before I get the tot_dur column I have to sum the duration column and group by id. I am also not sure whether percentile is the correct function, because I found use cases with percentile_approx.
select k1.id as id, percentile(cast(tot_dur as bigint),0.75) OVER () as tot_dur
from (
SELECT id, sum(duration) as tot_dur
FROM data_source
GROUP BY id) k1
group by id
If I've got you right, this is what you want:
with data as (select stack(4,
123, 1,
124, 2,
125, 5,
126, 8) as (id, tot_dur))
-----------------------------------------------------------------------------
select data.id, data.tot_dur
from data
join (select percentile(tot_dur, 0.75) as threshold from data) as t
where data.tot_dur >= t.threshold;
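To see why only the last record survives the filter, here is a minimal Python sketch of an exact percentile with linear interpolation between neighbouring ranks. That Hive's percentile interpolates this way is my assumption here, and the helper name percentile_linear is made up for illustration.

```python
def percentile_linear(values, p):
    """Exact percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = p * (len(ordered) - 1)          # fractional zero-based rank
    lower = int(position)                      # floor, since position >= 0
    upper = min(lower + 1, len(ordered) - 1)   # neighbour above, clamped
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Sample rows from the question: (id, tot_dur)
rows = [(123, 1), (124, 2), (125, 5), (126, 8)]

threshold = percentile_linear([dur for _, dur in rows], 0.75)
selected = [(rid, dur) for rid, dur in rows if dur >= threshold]

print(threshold)  # 5.75
print(selected)   # [(126, 8)]
```

The threshold falls between 5 and 8, so the >= comparison keeps only the row with tot_dur = 8.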
Say I have a simple table:
id, value
2, 5
4, 3
10, 4
20, 5
24, 4
40, 3
60, 3
80, 3
150, 3
90, 3
120, 3
As you can see, the majority of the entries in the value column are 3. If I take a subset of this table, e.g. with SELECT * FROM TABLE LIMIT 10, there is a high likelihood that 3 will dominate. How can I make sure I get a uniform distribution instead, i.e., a subset that contains 2 rows for each distinct value?
You can use a query like this:
WITH cte AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY value ORDER BY id) AS rn
FROM data
)
SELECT
id,
value
FROM cte
WHERE rn < 3
GROUP BY value, id
It would give you at most 2 rows per value.
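As a quick check of the ROW_NUMBER() approach, here is a small sketch that runs the same idea against the sample rows in an in-memory SQLite database from Python. SQLite is only a stand-in for whatever database you use, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(2, 5), (4, 3), (10, 4), (20, 5), (24, 4), (40, 3),
     (60, 3), (80, 3), (150, 3), (90, 3), (120, 3)],
)

# Number the rows inside each value group, then keep the first two per group.
sample = conn.execute("""
    WITH cte AS (
        SELECT id, value,
               ROW_NUMBER() OVER (PARTITION BY value ORDER BY id) AS rn
        FROM data
    )
    SELECT id, value
    FROM cte
    WHERE rn <= 2
    ORDER BY value, id
""").fetchall()

print(sample)  # [(4, 3), (40, 3), (10, 4), (24, 4), (2, 5), (20, 5)]
```

Each distinct value contributes exactly two rows here because every value occurs at least twice in the sample data.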
I have a table like this, and I want a query that stores, for each row, the average of the other rows' points.
USER_ID POINTS
------------- --------
a14e43e4f851 134
1e86e5adedbf 40
3c66730edf69 149
32e24082f97b 67
b33e3100a7be 124
274ee414ad8f 85
bdeef25fc797 172
For example, for user_id = a14e43e4f851 the average of the other users' points should be
avg(40+149+67+124+85+172).
PS: the user's own points (134) are not included in the calculation for user a14e43e4f851.
Output should look like this --
USER_ID POINTS AVG
------------- ------- ------
a14e43e4f851 134 106 which is avg(40+149+67+124+85+172)
1e86e5adedbf 40 avg(134+149+67+124+85+172)
3c66730edf69 149 avg(134+40+67+124+85+172)
32e24082f97b 67 avg(134+40+149+124+85+172)
b33e3100a7be 124 ...
274ee414ad8f 85 ...
bdeef25fc797 172 ...
You could use a correlated subquery:
select t.*,
(select avg(t1.points) from mytable t1 where t1.user_id <> t.user_id) as average
from mytable t
An alternative uses window functions:
select t.*,
(sum(points) over() - points) / nullif(count(*) over() - 1, 0) as average
from mytable t
Note: avg obviously conflicts with a language keyword, so I use average instead.
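A small sketch to confirm that the two approaches agree on the sample data, run from Python against in-memory SQLite. To avoid depending on window-function support, the second query uses a cross join with a one-row aggregate in place of the OVER() form; that substitution, and the * 1.0 to force non-integer division, are my additions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (user_id TEXT PRIMARY KEY, points INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [
    ("a14e43e4f851", 134), ("1e86e5adedbf", 40), ("3c66730edf69", 149),
    ("32e24082f97b", 67), ("b33e3100a7be", 124), ("274ee414ad8f", 85),
    ("bdeef25fc797", 172),
])

# Approach 1: correlated subquery over all other rows.
correlated = dict(conn.execute("""
    SELECT t.user_id,
           (SELECT AVG(t1.points) FROM mytable t1 WHERE t1.user_id <> t.user_id)
    FROM mytable t
""").fetchall())

# Approach 2: (total - own points) / (row count - 1), totals computed once.
arithmetic = dict(conn.execute("""
    SELECT t.user_id,
           (agg.s - t.points) * 1.0 / NULLIF(agg.n - 1, 0)
    FROM mytable t
    CROSS JOIN (SELECT SUM(points) AS s, COUNT(*) AS n FROM mytable) agg
""").fetchall())

for user_id in correlated:
    print(user_id, round(correlated[user_id], 2), round(arithmetic[user_id], 2))
```

For a14e43e4f851 both give about 106.17, which matches the 106 shown in the question.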
If you wanted an update statement:
update mytable t
set t.average = (
select avg(t1.points) from mytable t1 where t1.user_id <> t.user_id
)
However, I would not recommend actually storing this value; it is derived information that can easily be computed on the fly whenever needed, using the first statement. If you are going to run the query often, you could create a view:
create view myview as
select t.*,
(sum(points) over() - points) / nullif(count(*) over() - 1, 0) as average
from mytable t
I'm assuming user_id is the PK.
WITH q AS (SELECT sum(points) AS s, count(*) AS n FROM mytable)
UPDATE mytable SET average = (q.s-points)/(q.n-1) FROM q;
The idea is that:
the average of all users' scores is sum(score)/count(*)
the sum of all users' scores except this one equals the sum of all scores minus this user's score
the average of all users' scores except this one is (sum(score)-score_for_this_user)/(count(*)-1)
The nice thing is that sum() and count() only have to be calculated once.
To handle the case where there is only one row in the table:
WITH q AS (SELECT sum(points) AS s, NULLIF(count(*),1) AS n FROM mytable)
UPDATE mytable SET average = (q.s-points)/(q.n-1) FROM q;
This makes the count NULL instead of 1, so the divisor, and with it the updated average, is NULL too.
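The same arithmetic, including the single-row case, can be sketched in a few lines of Python. The function name leave_one_out_averages is made up for illustration, and None plays the role of SQL NULL.

```python
def leave_one_out_averages(points):
    """For each entry, average of all the other entries; None if there are none."""
    total = sum(points)   # computed once
    n = len(points)       # computed once
    if n <= 1:
        # No other rows to average over, so the result is undefined (NULL).
        return [None] * n
    return [(total - p) / (n - 1) for p in points]


sample = [134, 40, 149, 67, 124, 85, 172]
averages = leave_one_out_averages(sample)
print(round(averages[0], 2))           # 106.17
print(leave_one_out_averages([50]))    # [None]
```

The guard for n <= 1 corresponds to the NULL divisor in the SQL version: with a single row there is nothing to divide by.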
So I wrote a query to calculate the retention, new and returning student growth rate. The code below returns a result similar to this.
Row visit_month student_type numberofstd growth
1 2013 new 574 null
2 2014 new 220 -62%
3 2014 retained 442 245%
4 2015 new 199 -10%
5 2015 retained 533 21%
6 2016 new 214 8%
7 2016 retained 590 11%
8 2016 returning 1 -100%
Query I have tried.
with visit_log AS (
SELECT studentid,
cast(substr(session, 1, 4) as numeric) as visit_month,
FROM abaresult
GROUP BY 1,
2
ORDER BY 1,
2),
time_lapse_2 AS (
SELECT studentid,
Visit_month,
lag(visit_month, 1) over (partition BY studentid ORDER BY studentid, visit_month) lag
FROM visit_log),
time_diff_calculated_2 AS (
SELECT studentid,
visit_month,
lag,
visit_month - lag AS time_diff
FROM time_lapse_2),
student_categorized AS (
SELECT studentid,
visit_month,
CASE
WHEN time_diff=1 THEN 'retained'
WHEN time_diff>1 THEN 'returning'
WHEN time_diff IS NULL THEN 'new'
END AS student_type,
FROM time_diff_calculated_2)
SELECT visit_month,
student_type,
Count(distinct studentid) as numberofstd,
ROUND(100 * (COUNT(student_type) - LAG(COUNT(student_type), 1) OVER (ORDER BY student_type)) / LAG(COUNT(student_type), 1) OVER (ORDER BY student_type),0) || '%' AS growth
FROM student_categorized
group by 1,2
order by 1,2
The query above calculates the retention, new and returning rate based on the figures of the last session student_type category.
I am looking for a way to calculate these figures based on the total number of students in each visit_month and not from each category. Is there a way I can achieve this?
I am trying to get a table similar to this
Row visit_month student_type totalstd numberofstd growth
1 2013 new 574 574 null
2 2014 new 662 220 62%
3 2014 retained 662 442 22%
4 2015 new 732 199 10%
5 2015 retained 732 533 21%
6 2016 new 804 214 8%
7 2016 retained 804 590 11%
8 2016 returning 804 1 100%
Note:
The totalstd is the total number of students in each session, obtained as new + retained + returning.
The growth calculation was assumed.
Please help!
Thank you.
While I do not have your source data, I am relying on the query you shared and its output.
I added some extra code to produce the desired result. Note that I could not run it in BigQuery because I do not have the data, so I have tried to catch any possible errors by reading the query myself. The queries between the ** lines are unchanged and were copied from your code. Below is the code (a mix of yours and the extra bits I created):
#*****************************************************************
with visit_log AS (
SELECT studentid,
cast(substr(session, 1, 4) as numeric) as visit_month,
FROM abaresult
GROUP BY 1,
2
ORDER BY 1,
2),
time_lapse_2 AS (
SELECT studentid,
Visit_month,
lag(visit_month, 1) over (partition BY studentid ORDER BY studentid, visit_month) lag
FROM visit_log),
time_diff_calculated_2 AS (
SELECT studentid,
visit_month,
lag,
visit_month - lag AS time_diff
FROM time_lapse_2),
student_categorized AS (
SELECT studentid,
visit_month,
CASE
WHEN time_diff=1 THEN 'retained'
WHEN time_diff>1 THEN 'returning'
WHEN time_diff IS NULL THEN 'new'
END AS student_type,
FROM time_diff_calculated_2)
#**************************************************************
#Code I added; these CTEs continue the WITH clause above
#each unique visit_month will have its count
, total_stud AS (
SELECT visit_month, count(distinct studentid) as totalstd FROM visit_log
GROUP BY 1
ORDER BY visit_month
),
#After you have your student_categorized temp table, create another one
#It will have the count of the number of students per visit_month per student_type
number_std_monthType AS (
SELECT visit_month,student_type, Count(distinct studentid) as numberofstd from student_categorized
GROUP BY 1, 2
),
#You will have one row per combination of visit_month and student_type
student_categorized2 AS(
SELECT DISTINCT visit_month,student_type FROM student_categorized
),
#before calculation, create the table with all the necessary data
#you have the desired table without the growth
#notice that I used two keys to join t1 and t3 so the results are correct
final_t AS (
SELECT t1.visit_month,
t1.student_type,
t2.totalstd as totalstd,
t3.numberofstd
FROM student_categorized2 t1
LEFT JOIN total_stud AS t2 ON t1.visit_month = t2.visit_month
LEFT JOIN number_std_monthType t3 ON (t1.visit_month = t3.visit_month and t1.student_type = t3.student_type)
)
#Now all the necessary values to calculate growth are in the temp table final_t
SELECT visit_month, student_type, totalstd, numberofstd,
ROUND(100 * (totalstd - LAG(totalstd) OVER (PARTITION BY visit_month ORDER BY visit_month ASC)) / LAG(totalstd) OVER (PARTITION BY visit_month ORDER BY visit_month ASC)) || '%' AS growth
FROM final_t
Notice that I used LEFT JOIN in order to have the proper counts in the final table, since each count was calculated in a different temp table. Also, I did not use your final SELECT statement.
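To make the categorization and the two counts easier to follow, here is a rough Python sketch of the same logic on a tiny made-up visit log (the student ids and years are invented, not taken from your data). It only produces numberofstd and totalstd; the growth formula is left out because it was not pinned down in the question.

```python
from collections import Counter


def categorize(visits):
    """visits: iterable of (studentid, year). Returns Counter {(year, type): count}."""
    by_student = {}
    for student, year in set(visits):      # de-duplicate, like GROUP BY 1, 2
        by_student.setdefault(student, []).append(year)

    counts = Counter()
    for years in by_student.values():
        previous = None                     # plays the role of LAG(visit_month)
        for year in sorted(years):
            if previous is None:
                kind = "new"
            elif year - previous == 1:
                kind = "retained"
            else:
                kind = "returning"
            counts[(year, kind)] += 1
            previous = year
    return counts


# Hypothetical visit log
visits = [("s1", 2013), ("s1", 2014), ("s2", 2013),
          ("s2", 2015), ("s3", 2014), ("s3", 2015)]

numberofstd = categorize(visits)

# totalstd: all students seen in a year, i.e. new + retained + returning
totalstd = Counter()
for (year, _kind), n in numberofstd.items():
    totalstd[year] += n

for (year, kind), n in sorted(numberofstd.items()):
    print(year, kind, totalstd[year], n)
```

Because each student gets exactly one type per year, summing the per-type counts gives the same total as counting distinct students per year.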
If you have any issues with the code, do not hesitate to ask.
I'd like to combine two queries
SELECT COUNT(*) FROM abc;
SELECT COUNT(Status) FROM ABC WHERE Status='Active';
And then calculate the percentage (by dividing the 2nd query's result by the first's). I'd like to achieve this in one single query. What I've attempted so far:
SELECT COUNT(*) AS A FROM abc
UNION
SELECT COUNT(Status) AS B FROM ABC WHERE Status='Active';
UNION
SELECT(COUNT(Status)*100/SELECT COUNT(*) FROM abc)) AS %ofAB FROM abc WHERE Status='Active'
What I get:
A
--
31
36
86,11111111
What I want:
A | B | %ofAB
---------------------
36 | 31 | 86,1111111%
This should give you what you want:
SELECT
COUNT(*) AS TotalCount,
SUM(IIF(Status = 'Active', 1, 0)) AS ActiveCount,
ROUND((SUM(IIF(Status = 'Active', 1, 0)) * 100/ COUNT(*)),2) AS PctActive
FROM
Abc
EDIT: Didn't notice that this was for Access. I don't know if CAST is available in Access, so you may need to use an equivalent function to make sure that the integers don't simply yield 1 or 0. It's possible that Access will convert a division into a decimal automatically, but in SQL Server it does not.
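Here is a small sketch of the same single-query idea, run from Python against in-memory SQLite rather than Access. It uses CASE instead of IIF and multiplies by 100.0 so the division is not done on integers. The 36 rows (31 active) mirror the counts in the question; the 'Inactive' label is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abc (status TEXT)")
conn.executemany(
    "INSERT INTO abc VALUES (?)",
    [("Active",)] * 31 + [("Inactive",)] * 5,
)

# One pass over the table: total rows, active rows, and the percentage.
total, active, pct = conn.execute("""
    SELECT COUNT(*),
           SUM(CASE WHEN status = 'Active' THEN 1 ELSE 0 END),
           ROUND(SUM(CASE WHEN status = 'Active' THEN 1 ELSE 0 END) * 100.0
                 / COUNT(*), 2)
    FROM abc
""").fetchone()

print(total, active, pct)  # 36 31 86.11
```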
Trying to find the best way to write this SQL statement.
I have a customer table that has the internal credit score of that customer. Then I have another table with definitions of that credit score. I would like to join these tables together, but the second table doesn't have any way to link it easily.
The score of the customer is an integer between 1-999, and the definition table has these columns:
Score
Description
And these rows:
60 LOW
99 MED
999 HIGH
So basically if a customer has a score between 1 and 60 they are low, 61-99 they are med, and 100-999 they are high.
I can't really INNER JOIN these, because the join would only match customers whose score is exactly 60, 99, or 999, and that would exclude everyone with any other score.
I don't want to do a case statement with the static numbers, because our scores may change in the future and I don't want to have to update my initial query when/if they do. I also cannot create any tables or functions to do this; I need a single SQL statement that does it for me.
EDIT:
A coworker said this would work, but it's a little crazy. I'm thinking there has to be a better way:
SELECT
internal_credit_score,
(
SELECT
credit_score_short_desc
FROM
cf_internal_credit_score
WHERE
internal_credit_score = (
SELECT
max(credit.internal_credit_score)
FROM
cf_internal_credit_score credit
WHERE
cs.internal_credit_score <= credit.internal_credit_score
AND credit.internal_credit_score <= (
SELECT
min(credit2.internal_credit_score)
FROM
cf_internal_credit_score credit2
WHERE
cs.internal_credit_score <= credit2.internal_credit_score
)
)
)
FROM
customer_statements cs
Try this: change your table to contain the range of the scores:
ScoreTable
-------------
LowScore int
HighScore int
ScoreDescription string
data values
LowScore HighScore ScoreDescription
-------- --------- ----------------
1 60 Low
61 99 Med
100 999 High
query:
Select
.... , Score.ScoreDescription
FROM YourTable
INNER JOIN ScoreTable Score ON YourTable.Score>=Score.LowScore
AND YourTable.Score<=Score.HighScore
WHERE ...
Assuming your table is named CreditTable, this is what you want:
select * from
(
select Description, Score
from CreditTable
where Score >= 80 /*client's credit*/
order by Score
)
where rownum = 1
Also, make sure your high score reference value is 1000, even though client's highest score possible is 999.
Update
The above SQL gives you the credit record for a given value. If you want to join with, say, Clients table, you'd do something like this:
select
c.Name,
c.Score,
(select Description from
(select Description from CreditTable where Score >= c.Score order by Score)
where rownum = 1)
from clients c
I know this is a sub-select that is executed for each returned row, but then again, CreditTable is ridiculously small, so there will be no significant performance loss from the sub-select.
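The lookup itself boils down to "find the smallest threshold that is at least the customer's score". Here is a minimal Python sketch of that rule, assuming the thresholds are inclusive upper bounds as the question describes (1-60 low, 61-99 med, 100-999 high); the function name describe is made up.

```python
from bisect import bisect_left

# Upper bounds and labels from the definition table, sorted by score.
upper_bounds = [60, 99, 999]
labels = ["LOW", "MED", "HIGH"]


def describe(score):
    """Return the label of the smallest upper bound that is >= score."""
    index = bisect_left(upper_bounds, score)
    if index == len(upper_bounds):
        raise ValueError(f"score {score} is above the highest threshold")
    return labels[index]


for score in (1, 60, 61, 99, 100, 999):
    print(score, describe(score))
# 1 LOW, 60 LOW, 61 MED, 99 MED, 100 HIGH, 999 HIGH
```

Because the thresholds live in a list (or table) rather than in the code, changing them later does not require touching the lookup.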
You can use analytic functions to convert the data in your score description table to ranges (I assume that you meant that 100-999 should map to 'HIGH', not 99-999).
with x as (
select 60 score, 'Low' description from dual union all
select 99, 'Med' from dual union all
select 999, 'High' from dual
)
select description,
nvl(lag(score) over (order by score),0) + 1 low_range,
score high_range
from x;
DESC LOW_RANGE HIGH_RANGE
---- ---------- ----------
Low 1 60
Med 61 99
High 100 999
You can then join this to your CUSTOMER table with something like
SELECT c.*,
sd.*
FROM customer c,
(select description,
nvl(lag(score) over (order by score),0) + 1 low_range,
score high_range
from score_description) sd
WHERE c.credit_score BETWEEN sd.low_range AND sd.high_range
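For a quick sanity check of the range join, here is a sketch that runs it from Python against in-memory SQLite instead of Oracle. To stay independent of window-function support it derives low_range with a correlated MAX() subquery in place of LAG(), which is a substitution on my part; the customer rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE score_description (score INTEGER, description TEXT);
    INSERT INTO score_description VALUES (60, 'LOW'), (99, 'MED'), (999, 'HIGH');

    -- Hypothetical customers covering the range boundaries
    CREATE TABLE customer (name TEXT, credit_score INTEGER);
    INSERT INTO customer VALUES ('a', 1), ('b', 60), ('c', 61), ('d', 100), ('e', 999);
""")

rows = conn.execute("""
    SELECT c.name, c.credit_score, sd.description
    FROM customer c
    JOIN (
        -- low_range = previous threshold + 1 (or 1 for the lowest band)
        SELECT s.description,
               COALESCE((SELECT MAX(p.score)
                         FROM score_description p
                         WHERE p.score < s.score), 0) + 1 AS low_range,
               s.score AS high_range
        FROM score_description s
    ) sd ON c.credit_score BETWEEN sd.low_range AND sd.high_range
    ORDER BY c.credit_score
""").fetchall()

for row in rows:
    print(row)
```

Each customer matches exactly one band because the derived ranges do not overlap.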