Count the number of transactions per month for an individual group by date Hive - sql

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. I have another col called visit_date.
There is a category column called cal_month_nbr which ranges from 1 to 12 based on which month transaction occurred.
The data looks like below
Id visit_date Cal_month_nbr
---- ------ ------
1 01/01/2020 1
1 01/02/2020 1
1 01/01/2020 1
2 02/01/2020 2
1 02/01/2020 2
1 03/01/2020 3
3 03/01/2020 3
first
I want to know how many times customer visits per month using their visit_date
i.e i want below output
id cal_month_nbr visit_per_month
--- --------- ----
1 1 2
1 2 1
1 3 1
2 2 1
3 3 1
and what is the avg frequency of visit per ids
ie.
id Avg_freq_per_month
---- -------------
1 1.33
2 1
3 1
I tried with below query but it counts each item as one transaction
select avg(count_e) as num_visits_per_month,individual_id
from
(
select r.individual_id, cal_month_nbr, count(*) as count_e
from
ww_customer_dl_secure.cust_scan
GROUP by
r.individual_id, cal_month_nbr
order by count_e desc
) as t
group by individual_id
I would appreciate any help, guidance or suggestions

You can divide the total visits by the number of months:
select individual_id,
count(*) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
If you want the average number of days per month, then:
select individual_id,
count(distinct visit_date) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:
select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
from (select distinct individual_id, visit_date, cal_month_nbr
from ww_customer_dl_secure.cust_scan c
) iv
group by individual_id, cal_month_nbr
) ic
group by individual_id;

Related

Find Individuals who have purchased 10 times within a rolling 1 year period

So let's say I have 2 tables. One table is for consumers, and another is for sales.
Consumers
ID
Name
...
1
John Johns
...
2
Cathy Dans
...
Sales
ID
consumer_id
purchase_date
...
1
1
01/03/05
...
2
1
02/04/10
...
3
1
03/04/11
...
4
2
02/14/07
...
5
2
09/24/08
...
6
2
12/15/09
...
I want to find all instances of consumers who made more than 10 purchases within any 6 month rolling period.
SELECT
consumers.id
, COUNT(sales.id)
FROM
consumers
JOIN sales ON consumers.id = sales.consumer_id
GROUP BY
consumers.id
HAVING
COUNT(sales.id) >= 10
ORDER BY
COUNT(sales.id) DESC
So I have this code, which just gives me a list of consumers who have made more than 10 purchases ALL TIME. But how do I incorporate the rolling 6 month period logic?!
Any help or guidance on which functions can help me accomplish this would be appreciated!
You can use window functions to count the number of sales in a six-month period. Then just filter down to those consumers:
select distinct consumer_id
from (select s.*,
count(*) over (partition by consumer_id
order by purchase_date
range between current row and interval '6 month' following
) as six_month_count
from sales s
) s
where six_month_count > 10;

How to write a sql script for a range of Oracle assignment date records by different employee's job titles

I am trying to write an ad-hoc query for a range of assignment date records by employee's job title. These examples are used for the Oracle application assignment table.
First sample:
AsgId Start_Date End_Date Job_ID
1 1/1/14 6/30/14 10
1 7/1/14 11/15/14 10
1 11/16/14 1/10/15 20
1 1/11/15 3/10/15 10
1 3/11/15 3/31/15 10
1 4/1/15 12/31/18 20
I have tried analytical functions, in-line views, and other code without success.
Expected report results of 3 date-range records by job title:
asgid start_date end_date job_title
1 1/1/14 11/15/14 10
1 11/16/14 1/10/15 20
1 1/11/15 3/31/15 10
1 4/1/15 12/31/18 20
Second sample:
EMP_ID START_DATE END_DATE JOB_TITLE
1 1/1/14 11/15/14 10
1 11/16/14 11/10/15 10
1 11/11/15 12/31/15 20
1 1/1/16 1/31/16 10
1 2/1/16 12/31/16 10
Expected report results of 3 date-range records by job title
EMP_ID START_DATE END_DATE JOB_TITLE
1 1/1/14 11/10/15 10
1 11/11/15 12/31/15 20
1 1/1/16 12/31/16 10
This is a type of gaps-and-islands problem. Assuming that there are no gaps or overlaps, you can use left join and a cumulative sum to determine the islands. The rest is aggregation:
select asgid, job_id, min(start_date) as start_date,
max(end_date) as end_date
from (select a.*,
sum(case when aprev.asgid is null then 1 else 0 end) over (partition by a.asgid, a.job_id order by a.start_date) as grp
from assignment a left join
assignment aprev
on aprev.asgid = a.asgid and
aprev.job_id = a.job_id and
aprev.end_date = a.start_date - 1
) a
group by asgid, job_id, grp
order by asgid, min(a.start_date);
Here is a db<>fiddle.

Most Efficient SQL to Calculate Running Streak Occurrences

I am looking for the most efficient manner to determine the longest occurrence of a streak within a given data set; specifically, to determine the longest winning streak of games.
Below is the SQL that I have thus far, and it does seem to perform very fast, and as expected from the limited testing I've done on a dataset with around 100,000 records.
DECLARE #HistoryDateTimeLimit datetime = '3/15/2018';
CTE to create result subset from voting dataset.
WITH Results AS (
SELECT
EntityPlayerId,
(CASE
WHEN VoteTeamA = 1 AND ParticipantAScore > ParticipantBScore THEN 'W'
WHEN VoteTeamA = 0 AND ParticipantBScore > ParticipantAScore THEN 'W'
ELSE 'L'
END) AS WinLoss,
match.ScheduledStartDateTime
FROM
[dbo].[MatchVote] vote
INNER JOIN [dbo].[MatchMetaData] match ON vote.MatchId = match.MatchId
WHERE
IsComplete = 1
AND ScheduledStartDateTime >= #HistoryDateTimeLimit
)
CTE to create subset of data with streak type as WinLoss and total count of votes in the partition using ROW_NUMBER().
WITH Streaks AS (
SELECT
EntityPlayerId,
ScheduledStartDateTime,
WinLoss,
ROW_NUMBER() OVER (PARTITION BY EntityPlayerId ORDER BY ScheduledStartDateTime) -
ROW_NUMBER() OVER (PARTITION BY EntityPlayerId, WinLoss ORDER BY ScheduledStartDateTime) AS Streak
FROM
Results
)
CTE to summarize the partitioned vote streaks by WinLoss and a begin date/time, with the total count in the streak.
WITH StreakCounts AS (
SELECT
EntityPlayerId,
WinLoss,
MIN(ScheduledStartDateTime) StreakStart,
MAX(ScheduledStartDAteTime) StreakEnd,
COUNT(*) as Streak
FROM
Streaks
GROUP BY
EntityPlayerId, WinLoss, Streak
)
CTE to select the MAXIMUM (longest) vote streak for WinLoss of W (win) grouped by players.
WITH LongestWinStreak AS (
SELECT
EntityPlayerId,
MAX(Streak) AS LongestStreak
FROM
StreakCounts
WHERE
WinLoss = 'W'
GROUP BY
EntityPlayerId
)
Selecting the useful data from the LongestWinStreak CTE.
SELECT * FROM LongestWinStreak
This is the 3rd iteration of the code; at first I feel like I was overthinking and using windows with the LAG function to define a reset period that was later used for partitioning.
[UPDATE]: SQLFiddle example at http://sqlfiddle.com/#!18/5b33a/1 -- Sample data for the two tables that are used above are as follows.
The data is meant to show the schema, and can be extrapolated for your own testing/usage;
MatchVote table data.
EntityPlayerId IsExtMatch MatchId VoteTeamA VoteDateTime IsComplete
-------------------- ------------ -------------------- --------- ----------------------- ----------
158 1 152639 0 2018-03-20 23:25:28.910 1
158 1 156058 1 2018-03-13 23:36:57.517 1
MatchMetaData table data.
MatchId IsTeamTournament MatchCompletedDateTime ScheduledStartDateTime MatchIsFinalized TournamentId TournamentTitle TournamentLogoUrl TournamentLogoThumbnailUrl GameName GameShortCode GameLogoUrl ParticipantAScore ParticipantAName ParticipantALogoUrl ParticipantBScore ParticipantBName ParticipantBLogoUrl
--------- ---------------- ----------------------- ----------------------- ---------------- -------------------- ------------------ ----------------------- ---------------------------- --------------------------------- -------------- ----------------------- ------------------ ------------------- --------------------- ----------------- ------------------- --------------------
23354 1 2014-07-30 00:30:00.000 2014-07-30 00:00:00.000 1 543 Sample https://...Small.png https://...Small.png Dota 2 Dota 2 https://...logo.png 3 Natus Vincere.US https://...VI.png 0 Not Today https://...ay.png
44324 1 2014-12-15 12:40:00.000 2014-12-15 11:40:00.000 1 786 Sample https://...Small.png https://...Small.png Counter-Strike: Global Offensive CS:GO https://...logo.png 0 Avalier's stars https://...oto.png 1 Kassad's Legends https://...oto.png

How do I create a frequency distribution?

I'm trying to create a frequency distribution to show how many customers have transacted 1x, 2x, 3x, etc.
I have a database transactions and column user_id. Each row indicates a transaction, and if a user_id shows up in multiple rows, that user has done multiple transactions.
Now I'd like to get a list that looks something like this:
Tra. | Freq.
0 | 345
1 | 543
2 | 45
3 | 20
4 | 0
5 | 3
etc
Currently I have this, but it just shows a list of users and how many transactions they have had.
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
ORDER BY number_of_transactions DESC;
I did some digging and was suggested that generate_series might help, but I'm stuck and don't know how to move forward.
Use the first result as input to an outer query where you apply the count again, but this time grouping on number_of_transactions:
SELECT number_of_transactions, COUNT(*) AS freq
FROM (
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
) A
GROUP BY number_of_transactions;
This would transform a result like:
user_id number_of_transactions
----------- ----------------------
1 2
2 1
3 2
4 4
to this:
number_of_transactions freq
---------------------- -----------
1 1
2 2
4 1

SQL Server : count types with totals by date change

I need to count a value (M_Id) at each change of a date (RS_Date) and create a column grouped by the RS_Date that has an active total from that date.
So the table is:
Ep_Id Oa_Id M_Id M_StartDate RS_Date
--------------------------------------------
1 2001 5 1/1/2014 1/1/2014
1 2001 9 1/1/2014 1/1/2014
1 2001 3 1/1/2014 1/1/2014
1 2001 11 1/1/2014 1/1/2014
1 2001 2 1/1/2014 1/1/2014
1 2067 7 1/1/2014 1/5/2014
1 2067 1 1/1/2014 1/5/2014
1 3099 12 1/1/2014 3/2/2014
1 3099 14 2/14/2014 3/2/2014
1 3099 4 2/14/2014 3/2/2014
So my goal is like
RS_Date Active
-----------------
1/1/2014 5
1/5/2014 7
3/2/2014 10
If the M_startDate = RS_Date I need to count the M_id and then for
each RS_Date that is not equal to the start date I need to count the M_Id and then add that to the M_StartDate count and then count the next RS_Date and add that to the last active count.
I can get the basic counts with something like
(Case when M_StartDate <= RS_Date
then [m_Id] end) as Test.
But I am stuck as how to get to the result I want.
Any help would be greatly appreciated.
Brian
-added in response to comments
I am using Server Ver 10
If using SQL SERVER 2012+ you can use ROWS with your the analytic/window functions:
;with cte AS (SELECT RS_Date
,COUNT(DISTINCT M_ID) AS CT
FROM Table1
GROUP BY RS_Date
)
SELECT *,SUM(CT) OVER(ORDER BY RS_Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Run_CT
FROM cte
Demo: SQL Fiddle
If stuck using something prior to 2012 you can use:
;with cte AS (SELECT RS_Date
,COUNT(DISTINCT M_ID) AS CT
FROM Table1
GROUP BY RS_Date
)
SELECT a.RS_Date
,SUM(b.CT)
FROM cte a
LEFT JOIN cte b
ON a.RS_DAte >= b.RS_Date
GROUP BY a.RS_Date
Demo: SQL Fiddle
You need a cumulative sum, easy in SQL Server 2012 using Windowed Aggregate Functions. Based on your description this will return the expected result
SELECT p_id, RS_Date,
SUM(COUNT(*))
OVER (PARTITION BY p_id
ORDER BY RS_Date
ROWS UNBOUNDED PRECEDING)
FROM tab
GROUP BY p_id, RS_Date
It looks like you want something like this:
SELECT
RS_Date,
SUM(c) OVER (PARTITION BY M_StartDate ORDER BY RS_Date ROWS UNBOUNDED PRECEEDING)
FROM
(
SELECT M_StartDate, RS_Date, COUNT(DISTINCT M_Id) AS c
FROM my_table
GROUP BY M_StartDate, RS_Date
) counts
The inline view computes the counts of distinct M_Id values within each (M_StartDate, RS_Date) group (distinctness enforced only within the group), and the outer query uses the analytic version of SUM() to add up the counts within each M_StartDate.
Note that this particular query will not exactly reproduce your example results. It will instead produce:
RS_Date Active
-----------------
1/1/2014 5
1/5/2014 7
3/2/2014 8
3/2/2014 2
This is on account of some rows in your example data with RS_Date 3/2/2014 having a later M_StartDate than others. If this is not what you want then you need to clarify the question, which currently seems a bit inconsistent.
Unfortunately, analytic functions are not available until SQL Server 2012. In SQL Server 2010, the job is messier. It could be done like this:
WITH gc AS (
SELECT M_StartDate, RS_Date, COUNT(DISTINCT M_Id) AS c
FROM my_table
GROUP BY M_StartDate, RS_Date
)
SELECT
RS_Date,
(
SELECT SUM(c)
FROM gc2
WHERE gc2.M_StartDate = gc.M_StartDate AND gc2.RS_Date <= gc.RS_Date
) AS Active
FROM gc
If you are using SQL 2012 or newer you can use LAG to produce a running total.
https://msdn.microsoft.com/en-us/library/hh231256(v=sql.110).aspx