Get the first appearance of each value in a hive query

I have an event table, every row has:
event-id (primary key)
user-id
item-id
day
The same item-id can appear on different days, but I need to find the first day on which each item-id appears and count all of its occurrences on that day.
For example:
event-id user-id item-id day
1 pp a 2015/05/01
2 df a 2015/05/01
3 pp b 2015/05/02
4 al a 2015/05/02
I want the following result:
day item-id count
2015/05/01 a 2
2015/05/02 b 1
I'm using this query:
SELECT
    min(day) as day,
    item_id,
    count(event_id) as count
FROM
    events
GROUP BY
    day,
    item_id;
but it doesn't work correctly.

Here is one method:
SELECT e.*
FROM (SELECT day, item_id, count(*) as cnt,
             MIN(day) OVER (PARTITION BY item_id) as minday
      FROM events
      GROUP BY day, item_id
     ) e
WHERE day = minday;
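
To see why this works, here is a minimal sketch that runs the same idea against the question's sample rows. It assumes SQLite (3.25 or later, for window functions) through Python's sqlite3 module rather than Hive, and it splits the aggregation and the window function into two nested subqueries for clarity. The last sample row is given event_id 4 here because event_id is a primary key.

```python
import sqlite3

# Hypothetical in-memory copy of the question's sample data (SQLite, not Hive).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_id INTEGER PRIMARY KEY, user_id TEXT, item_id TEXT, day TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "pp", "a", "2015/05/01"),
        (2, "df", "a", "2015/05/01"),
        (3, "pp", "b", "2015/05/02"),
        (4, "al", "a", "2015/05/02"),
    ],
)

FIRST_DAY_COUNTS = """
SELECT day, item_id, cnt
FROM (
    -- Step 2: attach each item's earliest day to every (day, item) row.
    SELECT day, item_id, cnt,
           MIN(day) OVER (PARTITION BY item_id) AS minday
    FROM (
        -- Step 1: count events per (day, item).
        SELECT day, item_id, COUNT(*) AS cnt
        FROM events
        GROUP BY day, item_id
    ) AS per_day
) AS e
-- Step 3: keep only the row for the first day of each item.
WHERE day = minday
ORDER BY day, item_id
"""

rows = conn.execute(FIRST_DAY_COUNTS).fetchall()
print(rows)
```

The original attempt fails because grouping by day produces one row per (day, item) pair, so min(day) is just that row's own day. The window function compares each day against the earliest day for the whole item instead.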

Related

Finding the average when values are missing using SQL

I'm using Presto but any flavor of SQL will do.
I have a table in this format:
Group_id  event_id  month  party     time_interval
1         1         Jan    Player A  1 hour
1         1         Jan    Player A  2 hours
1         1         Jan    Player B  1 hour
1         1         Jan    Player B  1 hour
1         2         Jan    Player A  3 hours
I need to get the average per group_id, per month, per party.
Here's how my average should be calculated:
(total number of hours per group, per month, per party) / (total number of distinct events per group, per month)
Here's the output I should be expecting for clarity's sake:
Group_id  month  party     avg_time_interval
1         Jan    Player A  3 hours
1         Jan    Player B  1 hour
Now here's the tricky part. For the first row everything makes perfect sense. We have 6 hours across both events, which we divide by 2 distinct events to get an average of 3.
However, for the second row we get 1 hour instead of 2: Player B has no time recorded for the second event, so we should assume the interval there was 0. There are still 2 unique events for that group and month, so the 2 hours total should be divided by 2 and not by 1.
This missing data has made the problem more complicated than it should be. Otherwise, I believe running the following would have solved it:
SELECT Group_id, month, party, total/num_cases FROM (
    SELECT Group_id, month, party, SUM(time_interval) AS total, COUNT(DISTINCT(event_id)) AS num_cases
    FROM table
    GROUP BY Group_id, month, party
)

You can compute the count of distinct event_id values grouped by group_id and month, then join it with your table as follows:
SELECT T.Group_id, T.month, T.party,
       SUM(T.time_interval) * 1.0 / MAX(D.eid) AS avg_time_interval
FROM tbl T
JOIN
(
    SELECT Group_id, month,
           COUNT(DISTINCT event_id) AS eid
    FROM tbl
    GROUP BY Group_id, month
) D
    ON T.Group_id = D.Group_id
   AND T.month = D.month
GROUP BY T.Group_id, T.month, T.party
ORDER BY T.Group_id, T.month, T.party
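
As a quick check of the join approach, the sketch below loads the question's five rows into an in-memory SQLite database through Python's sqlite3 module and runs the same query. Storing time_interval as a plain number of hours is an assumption made for the example; the original column appears to hold text such as "1 hour".

```python
import sqlite3

# Hypothetical in-memory table; time_interval is stored as hours (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl ("
    "Group_id INTEGER, event_id INTEGER, month TEXT, party TEXT, time_interval REAL)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "Jan", "Player A", 1),
        (1, 1, "Jan", "Player A", 2),
        (1, 1, "Jan", "Player B", 1),
        (1, 1, "Jan", "Player B", 1),
        (1, 2, "Jan", "Player A", 3),
    ],
)

AVG_PER_PARTY = """
SELECT T.Group_id, T.month, T.party,
       SUM(T.time_interval) * 1.0 / MAX(D.eid) AS avg_time_interval
FROM tbl T
JOIN (
    -- Distinct events per group and month, regardless of party.
    SELECT Group_id, month, COUNT(DISTINCT event_id) AS eid
    FROM tbl
    GROUP BY Group_id, month
) D
  ON T.Group_id = D.Group_id AND T.month = D.month
GROUP BY T.Group_id, T.month, T.party
ORDER BY T.Group_id, T.month, T.party
"""

averages = conn.execute(AVG_PER_PARTY).fetchall()
print(averages)
```

Player B has no row for event 2, but the denominator still counts that event because it is computed per group and month before the join.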

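Another option uses window functions:
select distinct Group_id
      ,month
      ,party
      ,total_hours_per_party / max(dns_rnk) over(partition by Group_id, month) as avg_time_interval
from (
    select Group_id
          ,month
          ,party
          -- partition by group and month too, so totals are not mixed across them
          ,sum(time_interval) over(partition by Group_id, month, party) as total_hours_per_party
          -- the highest dense rank equals the number of distinct events in the group and month
          ,dense_rank() over(partition by Group_id, month order by event_id) as dns_rnk
    from t
) t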
Group_id  month  party     avg_time_interval
1         Jan    Player A  3
1         Jan    Player B  1

Retrieve Customers with a Monthly Order Frequency greater than 4

I am trying to optimize the query below to fetch all customers who have had a monthly order frequency of 4 or more in each of the past three months.
Customer ID  Feb  Mar  Apr
0001         4    5    6
0002         3    2    4
0003         4    2    3
In the table above, only the customer with Customer ID 0001 should be picked, as they consistently have 4 or more orders per month.
Below is the query I have written. It pulls all customers with an average purchase frequency of 4 over the last 90 days, but it does not check that there are consistently 4 or more purchases in each of the last three months.
Query:
SELECT distinct lines.customer_id Customer_ID, (COUNT(lines.order_id)/90) PurchaseFrequency
from fct_customer_order_lines lines
LEFT JOIN product_table product
    ON lines.entity_id = product.entity_id
    AND lines.vendor_id = product.vendor_id
WHERE LOWER(product.country_code) = "IN"
    AND lines.date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    AND lines.date < CURRENT_DATE()
GROUP BY Customer_ID
HAVING PurchaseFrequency >= 4;
I tried to use window functions, but I'm not sure whether they are needed in this case.

I would count the orders per month instead of computing the average, and then keep the customers whose count is 4 or more in each of the last three months.
Also, I think you should select your interval using "month(CURRENT_DATE()) - 3" instead of a 90-day window. If needed, you should handle the case where current_date falls in January, February, or March, in which case you have to go back to October, November, or December of the previous year.
I'm not familiar with Google BigQuery so I can't write your query but I hope this helps.

So I've found a solution to this using a WITH clause, as below:
WITH filtered_orders AS (
    select
        customer_id ID,
        extract(MONTH from date) Order_Month,
        count(order_id) CountofOrders
    from customer_order_lines lines
    where EXTRACT(YEAR FROM date) = 2022 AND EXTRACT(MONTH FROM date) IN (2,3,4)
    group by ID, Order_Month
    having CountofOrders >= 4)
select ID
from filtered_orders
group by ID
having count(Order_Month) = 3;
Hope this helps!

One option is to first count the orders by month and then keep the users whose count is at or above your threshold in every month in which they ordered:
WITH ORDERS_BY_MONTH AS (
    SELECT
        DATE_TRUNC(lines.date, MONTH) PurchaseMonth,
        lines.customer_id Customer_ID,
        COUNT(lines.order_id) PurchaseFrequency
    FROM fct_customer_order_lines lines
    LEFT JOIN product_table product
        ON lines.entity_id = product.entity_id
        AND lines.vendor_id = product.vendor_id
    -- UPPER, not LOWER: a lowercased value can never equal "IN"
    WHERE UPPER(product.country_code) = "IN"
        AND lines.date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
        AND lines.date < CURRENT_DATE()
    GROUP BY PurchaseMonth, Customer_ID
)
SELECT
    Customer_ID,
    AVG(PurchaseFrequency) AvgPurchaseFrequency
FROM ORDERS_BY_MONTH
GROUP BY Customer_ID
HAVING COUNT(1) = COUNTIF(PurchaseFrequency >= 4)
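
The two-step idea in these answers (count per month, then require every month to pass) can be tried on the question's sample counts with the sketch below. It is an illustration only: it uses SQLite through Python's sqlite3 module instead of BigQuery, a hypothetical orders table with an order_date column, fixed 2022 dates, and MIN(...) >= 4 in place of COUNTIF, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id TEXT, order_id INTEGER, order_date TEXT)"
)

# Monthly order counts from the question's table (Feb, Mar, Apr 2022).
monthly_counts = {"0001": (4, 5, 6), "0002": (3, 2, 4), "0003": (4, 2, 3)}
order_id = 0
for customer_id, counts in monthly_counts.items():
    for month, n in zip((2, 3, 4), counts):
        for day in range(1, n + 1):
            order_id += 1
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?)",
                (customer_id, order_id, f"2022-{month:02d}-{day:02d}"),
            )

CONSISTENT_CUSTOMERS = """
WITH orders_by_month AS (
    SELECT customer_id,
           strftime('%Y-%m', order_date) AS purchase_month,
           COUNT(order_id) AS purchase_frequency
    FROM orders
    WHERE order_date >= '2022-02-01' AND order_date < '2022-05-01'
    GROUP BY customer_id, strftime('%Y-%m', order_date)
)
SELECT customer_id
FROM orders_by_month
GROUP BY customer_id
HAVING COUNT(*) = 3                  -- ordered in all three months
   AND MIN(purchase_frequency) >= 4  -- and never fell below the threshold
ORDER BY customer_id
"""

consistent = [row[0] for row in conn.execute(CONSISTENT_CUSTOMERS)]
print(consistent)
```

The COUNT(*) = 3 condition matters: without it, a customer who placed 4 orders in a single month and none in the other two would also pass, because months with no orders produce no rows at all.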

PostgreSQL for the average number of attendances of an event per month

I'm trying to write some SQL to understand the average number of events attended, per month.
Attendees
| id | user_id | event_id | created_at |
I've tried:
SELECT AVG(b.rcount) FROM (select count(*) as rcount FROM attendees GROUP BY attendees.user_id) as b;
But this returns 5.77 (which is just the average of all time). I'm trying to get the average per month.
The results would ideally be:
2020-01-01, 2.1
2020-01-02, 2.4
2020-01-03, 3.3
...
I also tried this:
SELECT mnth, AVG(b.rcount) FROM (select date_trunc('month', created_at) as mnth, count(*) as rcount FROM attendees GROUP BY 1, 2) as b;
But got: ERROR: aggregate functions are not allowed in GROUP BY

If I follow you correctly, a simple approach is to divide the number of rows per month by the count of distinct users:
select
    date_trunc('month', created_at) created_month,
    1.0 * count(*) / count(distinct user_id) avg_events_per_user
from attendees
group by date_trunc('month', created_at)
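
A small runnable sketch of the same calculation follows. It assumes SQLite through Python's sqlite3 module, so strftime('%Y-%m-01', ...) stands in for PostgreSQL's date_trunc('month', ...), and the attendance rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attendees ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, event_id INTEGER, created_at TEXT)"
)
# Made-up rows: in January user 1 attends 3 events and user 2 attends 1;
# in February user 1 attends 1 event.
conn.executemany(
    "INSERT INTO attendees VALUES (?, ?, ?, ?)",
    [
        (1, 1, 10, "2020-01-05"),
        (2, 1, 11, "2020-01-12"),
        (3, 1, 12, "2020-01-20"),
        (4, 2, 10, "2020-01-05"),
        (5, 1, 13, "2020-02-03"),
    ],
)

AVG_PER_MONTH = """
SELECT strftime('%Y-%m-01', created_at) AS created_month,
       1.0 * COUNT(*) / COUNT(DISTINCT user_id) AS avg_events_per_user
FROM attendees
GROUP BY strftime('%Y-%m-01', created_at)
ORDER BY created_month
"""

monthly_avg = conn.execute(AVG_PER_MONTH).fetchall()
print(monthly_avg)
```

Note that the denominator only counts users who attended at least one event in that month, so the result is the average among active users, not among all registered users.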

SQL Retention Cohort Analysis

I am trying to write a query for monthly retention, to calculate percentage of users returning from their initial start month and moving forward.
TABLE: customer_order
fields
id
date
store_id
TABLE: customer
id
person_id
job_id
first_time (bool)
This gets me the initial monthly cohorts based on the first dates
SELECT first_job_month, COUNT(DISTINCT person_id) user_counts
FROM (
    SELECT DATE_TRUNC(MIN(CAST(date AS DATE)), month) first_job_month, person_id
    FROM customer_order cd
    INNER JOIN consumer co ON co.job_id = cd.id
    GROUP BY 2
    ORDER BY 1
) first_d
GROUP BY 1
ORDER BY 1
first_job_month user_counts
2018-04-01 36
2018-05-01 37
2018-06-01 39
2018-07-01 45
2018-08-01 38
I have tried a bunch of things, but I can't figure out how to keep track of the original cohorts/users from the first month onwards.

1. Get the first order month for every customer.
2. Join orders to the previous subquery to find the difference in months between each order and the first order.
3. Use conditional aggregates to count the customers that still order by month X.
There are alternative options, like using window functions to do (1) and (2) in the same subquery, but the easiest option is this one:
WITH
cohorts as (
    SELECT person_id, DATE_TRUNC(MIN(CAST(date AS DATE)), month) as first_job_month
    FROM customer_order cd
    JOIN consumer co
        ON co.job_id = cd.id
    GROUP BY 1
)
,orders as (
    SELECT
        c.person_id
        ,c.first_job_month
        -- whole months between the order's month and the cohort month
        -- (DATE_DIFF with a month part assumes BigQuery, like the DATE_TRUNC(..., month) calls)
        ,DATE_DIFF(DATE_TRUNC(CAST(cd.date AS DATE), month), c.first_job_month, month) as months_since_first_order
    FROM cohorts c
    -- customer_order has no person_id, so go through consumer again
    JOIN consumer co
        ON co.person_id = c.person_id
    JOIN customer_order cd
        ON co.job_id = cd.id
)
SELECT
    first_job_month as cohort
    ,count(distinct person_id) as size
    ,count(distinct case when months_since_first_order>=1 then person_id end) as m1
    ,count(distinct case when months_since_first_order>=2 then person_id end) as m2
    ,count(distinct case when months_since_first_order>=3 then person_id end) as m3
    -- hardcode up to the number of months you want and the history you have
FROM orders
GROUP BY 1
ORDER BY 1
As you can see, you can use CASE expressions inside aggregate functions like COUNT to identify different subsets of rows to aggregate within the same group. This is one of the most important BI techniques in SQL.
Note that >= rather than = is used in the conditional aggregate, so that if, for example, a customer buys in m3 after m1 but doesn't buy in m2, they are still counted in m2. If you want to count only customers who actually buy in each month and see the actual retention for every month, and you are OK with later months sometimes having higher values than earlier ones, you can use = instead.
Also, if you don't want the "triangle" view that this query produces, or you don't want to hardcode the "mX" columns, you can just group by first_job_month and months_since_first_order and count distinct customers. Some visualization tools can consume this simpler format and build a triangle view from it.
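
The three steps above can be sketched end to end on a tiny data set. This is an illustration under several assumptions: it runs on SQLite through Python's sqlite3 module rather than BigQuery, it collapses the two source tables into one hypothetical orders(person_id, order_date) table, and it computes the month difference from the year and month parts because SQLite has no DATE_DIFF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (person_id INTEGER, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2018-04-10"), (1, "2018-05-03"), (1, "2018-07-21"),
        (2, "2018-04-15"),
        (3, "2018-05-02"), (3, "2018-06-11"),
    ],
)

RETENTION = """
WITH cohorts AS (
    -- Step 1: first order date per person.
    SELECT person_id, MIN(order_date) AS first_date
    FROM orders
    GROUP BY person_id
),
activity AS (
    -- Step 2: months elapsed between each order and the first order.
    SELECT o.person_id,
           strftime('%Y-%m-01', c.first_date) AS cohort,
           (CAST(strftime('%Y', o.order_date) AS INTEGER)
              - CAST(strftime('%Y', c.first_date) AS INTEGER)) * 12
           + (CAST(strftime('%m', o.order_date) AS INTEGER)
              - CAST(strftime('%m', c.first_date) AS INTEGER)) AS months_since
    FROM orders o
    JOIN cohorts c ON o.person_id = c.person_id
)
-- Step 3: conditional aggregates, one column per month offset.
SELECT cohort,
       COUNT(DISTINCT person_id) AS size,
       COUNT(DISTINCT CASE WHEN months_since >= 1 THEN person_id END) AS m1,
       COUNT(DISTINCT CASE WHEN months_since >= 2 THEN person_id END) AS m2,
       COUNT(DISTINCT CASE WHEN months_since >= 3 THEN person_id END) AS m3
FROM activity
GROUP BY cohort
ORDER BY cohort
"""

retention = conn.execute(RETENTION).fetchall()
print(retention)
```

Person 1 skips June but orders again in July, so with >= they still count toward m2 for the April cohort, which is the behavior described in the note above.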

Group by statement to do average of time

My existing database receives data for an id, a value, and a time, with one record arriving every 3 seconds. I want my SELECT statement to group this data by id and by hour and show the average of the values in that hour. How can I use GROUP BY to achieve this?
This is my sample data:
id value date time
a 5 5/18/2015 10:27:22
a 9 5/18/2015 10:27:25
b 7 5/18/2015 10:27:22
b 8 5/18/2015 10:27:22
Data comes in every 3 seconds. I want it aggregated by hour of the day to show the average value for each id in that hour.
I want the output to look like:
id a gives an average of 7 at hour 10 on 5/18/2015

This is a relatively simple GROUP BY, which generally has two types of columns: your grouped columns and your aggregates. In this case your grouped columns are ID, date, and hr (calculated from [time]). You have only one aggregated column: the average of value. Check out my code:
SELECT ID,
       [date],
       DATEPART(HOUR, [time]) AS hr,
       AVG(value) AS avg_val
FROM yourTable
GROUP BY ID, [date], DATEPART(HOUR, [time])

This query pulls each ID along with its average value, grouped by hour of the day. If you want to run it for more than one day, you have to group by the date plus the hour, so 6/19/2015 10:00, then 6/19/2015 11:00, and so forth.
SELECT
    id,
    avg(value) AS avg_val,
    datepart(hh, time_interval) AS time_interval
FROM my_table
WHERE convert(date, time_interval) = '2015-06-19' -- compare only the date part of the timestamp
GROUP BY id, datepart(hh, time_interval)
To include multiple days, you could change the GROUP BY clause to:
GROUP BY id, convert(varchar(10), time_interval, 120), datepart(hh, time_interval)
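
For a runnable illustration of grouping by id, day, and hour, the sketch below uses SQLite through Python's sqlite3 module, so date() and strftime('%H', ...) stand in for SQL Server's CONVERT and DATEPART. The readings table, its single ts timestamp column, and the extra 11:00 row are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id TEXT, value REAL, ts TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("a", 5, "2015-05-18 10:27:22"),
        ("a", 9, "2015-05-18 10:27:25"),
        ("b", 7, "2015-05-18 10:27:22"),
        ("b", 8, "2015-05-18 10:27:22"),
        ("a", 3, "2015-05-18 11:00:01"),  # extra row to show a second hour bucket
    ],
)

HOURLY_AVG = """
SELECT id,
       date(ts) AS day,
       CAST(strftime('%H', ts) AS INTEGER) AS hr,
       AVG(value) AS avg_val
FROM readings
GROUP BY id, date(ts), strftime('%H', ts)
ORDER BY id, day, hr
"""

hourly = conn.execute(HOURLY_AVG).fetchall()
print(hourly)
```

Grouping by both the day and the hour keeps 10:00 on one date separate from 10:00 on another, which is the multi-day case mentioned above.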
