Cumulative sum function in hive - sql

Is there any way to get the cumulative count(customer_id) for today's date and for a number of days leading up to today's date? I am running the count in Hive. The date column is in this format:
20120907
I have 2 columns in my dataset, customer_id and date.
There are also partitions in my table and some of the values in the customer_id column are NULL. I am not sure if there are duplicates so I will use
count(distinct(customer_id))
Here is an example of my data.
customer_id date
10001 20140901
10003 20141001
NULL 20150101
10007 20150102
Please let me know if you need any more info.

I have the same problem, getting the cumulative count of distinct users per day.
The difficulty here is that you can hardly pre-aggregate the counts per day and sum them up, because the days might have "overlapping" users, so you would count them multiple times instead of once.
But I stumbled upon this approach, which basically hashes all users into a "sketch_set" per day, later unions the different hash sets, and then applies a count estimation to the union.
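If an exact figure is needed instead of an estimate, and the volume allows it, a minimal Hive sketch of the cumulative distinct count, assuming a table named customer_events with the customer_id and date (yyyymmdd string) columns from the question:
-- For every date, count the distinct non-NULL customers seen on or before that date.
-- The fixed-width yyyymmdd format makes the string comparison safe.
SELECT d.dt,
       COUNT(DISTINCT e.customer_id) AS cumulative_customers
FROM (SELECT DISTINCT `date` AS dt FROM customer_events) d
CROSS JOIN customer_events e
WHERE e.`date` <= d.dt
  AND e.customer_id IS NOT NULL
GROUP BY d.dt
Each day re-scans all earlier days, which is exactly the cost the sketch-set estimate avoids on large data.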

Related

SQL - Returning max count, after breaking down a day into hourly rows

I need to write a SQL query that returns the highest count in a given hourly range. The problem is that my table just logs orders as they come in and doesn't have an identifier that separates one hour from another.
So basically, I need to find the highest number of orders in any given hour from 7/08/2022 - 7/15/2022, from a table that does not distinguish hour buckets and simply logs orders as they come in.
I have tried to use a query that combines MAX(), COUNT(), and DATETIME(), but to no avail.
Can I please receive some help?
I've had to tackle this kind of measurement in the past.
Here's what I did for 15 minute intervals:
My datetime column is named datreg in my database log area.
cast(round(floor(cast(datreg as float(53))*24*4)/(24*4),5) as smalldatetime)
I multiply by 4 in this formula to get 4 intervals inside my 24 hour period. For you it would look like this to get just hourly intervals:
cast(round(floor(cast(datreg as float(53))*24)/(24),5) as smalldatetime)
This is a little piece of magic when it comes to dashboards and reports.
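To tie that back to the original question, a hedged T-SQL sketch, assuming a hypothetical orders table named OrderLog with the datreg datetime column used above:
-- Bucket each order to the start of its hour, count orders per bucket,
-- then keep the busiest hour in the requested range (7/08/2022 - 7/15/2022).
SELECT TOP 1
    CAST(ROUND(FLOOR(CAST(datreg AS float(53)) * 24) / 24, 5) AS smalldatetime) AS hour_start,
    COUNT(*) AS orders_in_hour
FROM OrderLog
WHERE datreg >= '20220708' AND datreg < '20220716'
GROUP BY CAST(ROUND(FLOOR(CAST(datreg AS float(53)) * 24) / 24, 5) AS smalldatetime)
ORDER BY COUNT(*) DESC
Dropping the TOP 1 gives the full hour-by-hour breakdown instead of just the peak hour.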

Cohort retention with SQL BigQuery

I am trying to create a retention table with MONTHLY cohorts using SQL in BigQuery.
I have the following columns to use in my dataset. I am only using one table and its name is 'curious-furnace-341507.TEST.Test_Dataset_-_Orders':
order_date    order_id    customer_id
2020-01-02    12345       6789
I do not need the new user column, and the data goes through June 2020. Ideally I think I need a cohort month column that lists the January-June cohorts and then 5 periods across.
I have tried so many different things and keep getting errors in BigQuery; I think I am approaching it all wrong. The online queries I am trying to adapt seem to use dates rather than months, which is also causing some confusion, as I think I need to truncate my date column to months only in the query?
Does anyone have a go-to query that will work in BigQuery for a retention table or can help me approach this? Thanks!
This may help you:
WITH cohorts AS (
  SELECT
    customer_id,
    MIN(DATE(order_date)) AS cohort_date
  FROM `curious-furnace-341507.TEST.Test_Dataset_-_Orders`
  GROUP BY 1)
SELECT
  FORMAT_DATE("%Y-%m", c.cohort_date) AS cohort_mth,
  t.customer_id AS cust_id,
  DATE_DIFF(t.order_date, c.cohort_date, MONTH) AS order_period
FROM `curious-furnace-341507.TEST.Test_Dataset_-_Orders` t
JOIN cohorts c ON t.customer_id = c.customer_id
WHERE c.cohort_date >= '2020-01-01'
  AND DATE_DIFF(t.order_date, c.cohort_date, MONTH) <= 5
GROUP BY 1, 2, 3
I typically do pivots and % calcs in Excel/Sheets, so this will give you just the input data you need for that.
NOTE:
This will give you a count of unique customers who ordered in period X (ignores repeat orders in period).
This also has period 0 (ordered again in the cohort_mth) which you may wish to keep or exclude.
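If you do want the pivot in BigQuery itself rather than in Excel/Sheets, a hedged sketch, assuming the query above has been saved as a view or CTE named cohort_periods exposing cohort_mth, cust_id and order_period:
-- One row per cohort month, one column per period, counting distinct customers.
SELECT
  cohort_mth,
  COUNT(DISTINCT IF(order_period = 0, cust_id, NULL)) AS period_0,
  COUNT(DISTINCT IF(order_period = 1, cust_id, NULL)) AS period_1,
  COUNT(DISTINCT IF(order_period = 2, cust_id, NULL)) AS period_2,
  COUNT(DISTINCT IF(order_period = 3, cust_id, NULL)) AS period_3,
  COUNT(DISTINCT IF(order_period = 4, cust_id, NULL)) AS period_4,
  COUNT(DISTINCT IF(order_period = 5, cust_id, NULL)) AS period_5
FROM cohort_periods
GROUP BY cohort_mth
ORDER BY cohort_mth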

SQL GROUPING SETS averages with multiple many-to-many dimensions

I have a table of data with the following:
User,Platform,Dt,Activity_Flag,Total_Purchases
1,iOS,05/05/2016,1,1
1,Android,05/05/2016,1,2
2,iOS,05/05/2016,1,0
2,Android,05/05/2016,1,2
3,iOS,05/05/2016,1,1
3,Android,06/05/2016,1,3
1,iOS,06/05/2016,1,2
4,Android,06/05/2016,1,2
1,Android,06/05/2016,1,0
3,iOS,07/05/2016,1,2
2,iOS,08/05/2016,1,0
I want to do a GROUPING SETS (Platform,Dt,(Platform,Dt),()) aggregation to be able to find for each combination of Platform and Dt the following:
Total Purchases
Total Unique Users
Average Purchases per User per Day
The first two are simple as these can be achieved via a sum(Total_Purchases) and count(distinct user) respectively.
The problem I have is with the last metric. The result set should look like this but I don't know how to get the last column to be calculated correctly:
Platform,Dt,Total_Purchases,Total_Unique_Users,Average_Purchases_Per_User_Per_Day
Android,05/05/2016,4,2,2.0
iOS,05/05/2016,2,3,0.7
Android,06/05/2016,5,3,1.7
iOS,06/05/2016,2,1,2.0
iOS,07/05/2016,2,1,2.0
iOS,08/05/2016,0,1,0.0
,05/05/2016,6,3,2.0
,06/05/2016,7,3,2.3
,07/05/2016,1,1,1.0
,08/05/2016,1,1,1.0
Android,,9,4,1.8
iOS,,6,3,1.2
,,15,4,1.6
For the first ten rows we see that getting the average purchases per user per day is a simple division of the first two metric columns, as the dimensions in these rows cover a single date only. But when we look at the final 3 rows, we see that this division does not give the desired result. This is because it needs to take the average for each day in turn to get the overall per-day amount.
If this isn't clear please let me know and I'll be happy to explain better. This is my first post on this site!
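One way to get that last column for the rolled-up rows is to aggregate in two steps: collapse to one row per group per day first, then average the daily (purchases / users) ratios across days. A hedged, standard-SQL-flavoured sketch for the per-platform rollup, assuming the source table is named activity and has the columns listed above:
-- Inner query: purchases and distinct users per platform per day.
-- Outer query: average each day's per-user ratio across days.
SELECT
  Platform,
  SUM(day_purchases)                   AS Total_Purchases,
  AVG(day_purchases * 1.0 / day_users) AS Average_Purchases_Per_User_Per_Day
FROM (
  SELECT Platform, Dt,
         SUM(Total_Purchases)   AS day_purchases,
         COUNT(DISTINCT "User") AS day_users   -- "User" quoted because it is reserved in many dialects
  FROM activity
  GROUP BY Platform, Dt
) daily
GROUP BY Platform
For Android this gives (4/2 + 5/3) / 2 = 1.8, matching the expected row. The rollup's Total_Unique_Users still has to be counted from the raw rows (as in the GROUPING SETS query) because summing the daily distinct counts would double-count users who appear on several days.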

Iterate through table by date column for each common value of different column

Below I have the following table structure:
CREATE TABLE StandardTable
(
RecordId varchar(50),
Balance float,
Payment float,
ProcDate date,
RecordIdCreationDate date,
-- multiple other columns used for calculations
)
And here is what a small sample of what my data might look like:
RecordId Balance Payment ProcDate RecordIdCreationDate
1 1000 100 2005-01-01 2005-01-01
2 5000 250 2008-01-01 2008-01-01
3 7500 350 2006-06-01 2006-06-01
1 900 100 2005-02-01 NULL
2 4750 250 2008-02-01 NULL
3 7150 350 2006-07-01 NULL
The table holds data on a transactional basis and has millions of rows in it. The ProcDate field indicates the month that each transaction is processed in. Regardless of when the transaction occurs during the month, the ProcDate field is hard coded to the first of the month that the transaction happened in. So if a transaction occurred on 2009-01-17, the ProcDate field would be 2009-01-01. I'm dealing with historical data, and it goes back as early as 2005-01-01. There are multiple instances of each RecordId in the table. A RecordId shows up in each month until the Balance column reaches 0. Some RecordId's originate in the month the data starts (where ProcDate is 2005-01-01) and others don't originate until a later date. The RecordIdCreationDate field represents the date the RecordId was originated, so that column has millions of NULL values in the table because it is NULL in every month other than the one each RecordId originated in.
I need to somehow look at each RecordId, and run a number of different calculations on a month to month basis. What I mean is I have to compare column values for each RecordId where the ProcDate might be something like 2008-01-01, and compare those values to the same column values where the ProcDate would be 2008-02-01. Then after I run my calculations for the RecordId in that month, I have to compare values from 2008-02-01 to values in 2008-03-01 and run my calculations again, etc. I'm thinking that I can do this all within one big WHILE loop, but I'm not entirely sure what that would look like.
The first thing I did was create another table in my database that had the same table design as my StandardTable and I called it ProcTable. In that ProcTable, I inserted all of the data where the RecordIdCreationDate was not equal to NULL. This gave me the first instance of each RecordId in the database. I was able to run my calculations for the first month successfully, but where I'm struggling is how I use the column values in the ProcTable, and compare those to the column values where the ProcDate is the month after that. Even if I could somehow do that, I'm not sure how I would repeat that process to compare the 2nd month's data to the 3rd month's data, and the 3rd month's data to the 4th month's data, etc.
Any suggestions? Thanks in advance.
Seems to me, all you need to do is JOIN the table to itself, on this condition
ON MyTable1.RecordId = MyTable2.RecordId
AND MyTable1.ProcDate = DATEADD(month, -1, MyTable2.ProcDate)
Then you will have all the rows in your table (MyTable1), joined to the same RecordId's row from the next month (MyTable2).
And in each row you can do whatever calculations you want between the two joined tables.
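A hedged T-SQL sketch of that self-join against the StandardTable from the question; the BalanceChange column is only an example of the kind of month-over-month calculation you could put there:
-- Pair each month's row with the same RecordId's row from the following month.
SELECT
    cur.RecordId,
    cur.ProcDate              AS CurrentMonth,
    nxt.ProcDate              AS NextMonth,
    cur.Balance - nxt.Balance AS BalanceChange,   -- example month-to-month calculation
    nxt.Payment               AS NextMonthPayment
FROM StandardTable cur
JOIN StandardTable nxt
  ON  nxt.RecordId = cur.RecordId
  AND cur.ProcDate = DATEADD(month, -1, nxt.ProcDate)
RecordId's whose final month has no following row simply drop out of the inner join; switch to a LEFT JOIN if you need to keep them.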

Query to find a weekly average

I have an SQLite database with the following fields for example:
date (yyyymmdd format)
total (0.00 format)
There are typically 2 months of records in the database. Does anyone know a SQL query to find a weekly average?
I could easily just execute:
SELECT COUNT(1) as total_records, SUM(total) as total FROM stats_adsense
Then just divide total by 7, but unless the number of days in the db is exactly divisible by 7 I don't think it will be very accurate, especially if there are fewer than 7 days of records.
To get a daily summary it's obviously just total / total_records.
Can anyone help me out with this?
You could try something like this:
SELECT strftime('%W', substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2)) AS theweek, avg(total) AS theaverage
FROM stats_adsense GROUP BY theweek
I'm not sure how the syntax would work in SQLite, but one way would be to parse out the date parts of each [date] field, specify the WEEK and DAY boundaries in your WHERE clause, and then GROUP BY the week. This will give you a true average regardless of whether there are rows or not.
Something like this (using T-SQL):
SELECT DATEPART(w, theDate), Avg(theAmount) as Average
FROM Table
GROUP BY DATEPART(w, theDate)
This will return a row for every week. You could filter it in your WHERE clause to restrict it to a given date range.
Hope this helps.
Your weekly average is
daily * 7
Obviously this doesn't take into account specific weeks, but you can get that by narrowing the result set to a date range.
You'll have to omit from the sum those records which don't belong to a full week. So, prior to summing up, you'll have to find the min and max of the dates, adjust them so that they form "whole" weeks, and then run your original query with a WHERE clause that limits the date values to the new range. Maybe you can even put all this into one query. I'll leave that up to you. ;-)
The values which are "truncated" off are simply not used then, obviously. If there aren't enough values for even one full week, there's no result at all. But there's no way around that, apparently.
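A hedged SQLite sketch of that idea, assuming the stats_adsense table from the question, one row per day, and the date stored as a yyyymmdd string:
-- Rewrite yyyymmdd as yyyy-mm-dd so strftime can bucket rows into year-week groups,
-- total each week, then average only the weeks that have all 7 days.
SELECT avg(weekly_total) AS weekly_average
FROM (
    SELECT strftime('%Y-%W',
                    substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2)) AS yearweek,
           sum(total) AS weekly_total,
           count(*)   AS days_in_week
    FROM stats_adsense
    GROUP BY yearweek
) weekly
WHERE days_in_week = 7   -- drop the partial weeks at either end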