How to calculate the average number of actions per client in selected months in Teradata SQL?

I have a table of transactions in Teradata SQL like the one below:
ID | trans_date
-------------------
123 | 2021-09-15
456 | 2021-10-20
777 | 2021-11-02
890 | 2021-02-14
... | ...
I need to calculate the average number of transactions made per client in months 09, 10 and 11, so the result should look something like this:
Month | Avg_num_trx
--------------------------------------------------------
09 | *average number of transactions per client in month 09*
10 | *average number of transactions per client in month 10*
11 | *average number of transactions per client in month 11*
How can I do that in Teradata SQL?

I'm not that familiar with Teradata, but you could start by extracting the month from trans_date, then grouping by ID and month and adding COUNT(ID). From there you can group by month and take AVG(id_count). Something like this:
WITH extraction AS(
SELECT
ID,
EXTRACT (MONTH FROM trans_date) AS MM
FROM your_table)
,
id_counter AS(
SELECT
ID,
MM,
COUNT(ID) as id_count
FROM extraction
GROUP BY ID, MM)
SELECT
MM,
AVG(id_count) AS Avg_num_trx
FROM id_counter
GROUP BY MM
ORDER BY MM;
The first CTE grabs month from trans_date.
The second CTE groups ID and month with count(ID) - should give you the total actions in that month for that client ID as id_count.
The final table gets the average of id_count grouped by month, which should be the average interactions per client for the period.
If EXTRACT doesn't work for some reason you could also try STRTOK(trans_date, '-', 2).
Other potential methods to replace -
--current
EXTRACT (MONTH FROM trans_date) AS MM
--option 1
STRTOK(trans_date, '-', 2) AS MM
--option 2
LEFT(RIGHT(trans_date, 5),2) AS MM
The same query reworked as nested subqueries, which should help with debugging:
SELECT
MM,
AVG(id_count) AS Avg_num_trx
FROM (SELECT
ID,
MM,
COUNT(ID) as id_count
FROM (SELECT
ID,
EXTRACT (MONTH FROM trans_date) AS MM
FROM your_table) AS a
GROUP BY ID, MM) AS b
GROUP BY MM
ORDER BY MM;

This will return the expected answer:
SELECT
Extract (MONTH From trans_date) AS MM,
Cast(Count(*) AS FLOAT) / Count(DISTINCT id)
FROM my_table
GROUP BY MM
Compare it with @procopypaster's answer to see which one is more efficient for your data.
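As a quick sanity check of the two queries above, here is a minimal plain-Python sketch of the same calculation. It is not Teradata code. The first four rows are the sample rows from the question; the last two are made up (and marked as such) so that one client has more than one transaction in a month. Like the SQL, it looks only at the month number and ignores the year. The function name avg_trx_per_client is just an illustration.

```python
from collections import Counter, defaultdict
from datetime import date

# Sample rows from the question, plus two hypothetical rows (marked) so that
# a client has several transactions in the same month.
rows = [
    (123, date(2021, 9, 15)),
    (456, date(2021, 10, 20)),
    (777, date(2021, 11, 2)),
    (890, date(2021, 2, 14)),
    (123, date(2021, 9, 28)),   # hypothetical
    (456, date(2021, 9, 3)),    # hypothetical
]


def avg_trx_per_client(rows, months):
    """Average number of transactions per active client, per month number."""
    per_client = Counter()  # (month, client id) -> number of transactions
    for client_id, trans_date in rows:
        if trans_date.month in months:
            per_client[(trans_date.month, client_id)] += 1

    counts_by_month = defaultdict(list)
    for (month, _client_id), n in per_client.items():
        counts_by_month[month].append(n)

    # Mean of the per-client counts, which equals
    # total transactions / number of distinct clients in that month.
    return {m: sum(c) / len(c) for m, c in sorted(counts_by_month.items())}


result = avg_trx_per_client(rows, months={9, 10, 11})
print(result)
```

Both SQL answers should agree with this figure, because the mean of the per-client counts in a month is the same as the total number of rows divided by the number of distinct clients in that month.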

Related

SQL BigQuery: For the current month, count number of distinct CustomerIDs in the previous 3 month

The following is the table with distinct CustomerID and Trunc_Date(Date,MONTH) called Date.
DATE | CustomerID
-----------------------
2021-01-01 | 111
2021-01-01 | 112
2021-02-01 | 111
2021-03-01 | 113
2021-03-01 | 115
2021-04-01 | 119
For a given month M, I want to get the count of distinct CustomerIDs of the three previous months combined. Eg. for the month of July (7), I want to get the distinct count of CustomerIDs from the month of April (4), May (5) and until June (6). I do not want the customer in July (7) to be included for the record for July.
So the output will be like:
DATE | CustomerID Count
-----------------------------
2021-01-01 | 535
2021-02-01 | 657
2021-03-01 | 777
2021-04-01 | 436
2021-05-01 | 879
2021-06-01 | 691
Consider below
select distinct date,
( select count(distinct id)
from t.prev_3_month_customers id
) customerid_count
from (
select *,
array_agg(customerid) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_customers,
from (
select *, date_diff(date, '2000-01-01', month) pos
from `project.dataset.table`
)
) t
You can also solve this problem by creating a record for each month in the three following months and then aggregating:
select date_add(date, interval n month) as month,
count(distinct customerid)
from t cross join
unnest(generate_array(1, 3, 1)) n
group by month;
BigQuery ran out of memory running this since we have lots of data
In cases like this - the most scalable and performant approach is to use HyperLogLog++ functions as in example below
select distinct date,
( select hll_count.merge(sketch)
from t.prev_3_month_sketches sketch
) customerid_count
from (
select *,
array_agg(customers_hhl) over(order by pos range between 3 preceding and 1 preceding) prev_3_month_sketches,
from (
select date_diff(date, '2000-01-01', month) pos,
min(date) date,
hll_count.init(customerid) customers_hhl
from `project.dataset.table`
group by pos
)
) t
Note: HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical uncertainty. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
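For reference, the exact (non-approximate) result that the queries above aim for can be sketched in plain Python on the question's sample rows. This is only an illustration of the windowing logic, not BigQuery code; month_index and prev_3_month_counts are names made up for this sketch.

```python
from datetime import date

# Sample rows from the question: (month start, CustomerID).
rows = [
    (date(2021, 1, 1), 111), (date(2021, 1, 1), 112),
    (date(2021, 2, 1), 111),
    (date(2021, 3, 1), 113), (date(2021, 3, 1), 115),
    (date(2021, 4, 1), 119),
]


def month_index(d):
    """Running month number, so that consecutive months differ by exactly 1."""
    return d.year * 12 + d.month


def prev_3_month_counts(rows):
    """Distinct customers seen in the three months before each month."""
    customers = {}   # month index -> set of customer ids
    first_day = {}   # month index -> date shown in the output
    for d, customer_id in rows:
        customers.setdefault(month_index(d), set()).add(customer_id)
        first_day[month_index(d)] = d

    result = {}
    for m in sorted(customers):
        seen = set()
        for back in (1, 2, 3):  # months m-1, m-2, m-3; month m itself is excluded
            seen |= customers.get(m - back, set())
        result[first_day[m]] = len(seen)
    return result


counts = prev_3_month_counts(rows)
for d, n in counts.items():
    print(d, n)
```

The month index plays the same role as the pos column in the SQL: a range of 3 preceding to 1 preceding on it covers exactly the three previous calendar months, even when some months have no rows.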

SQL Bigquery Counting repeated customers from transaction table

I have a transaction table that looks something like this.
userid | orderDate | amount
-----------------------------
111 | 2021-11-01 | 20
112 | 2021-09-07 | 17
111 | 2021-11-21 | 17
I want to count how many distinct customers (userid) that bought from our store this month also bought from our store in the previous month. For example, in February 2020, we had 20 customers and out of these 20 customers 7 of them also bought from our store in the previous month, January 2020. I want to do this for all the previous months so ending up with something like.
year | month | repeated customers
---------------------------------
2020 | 01 | 11
2020 | 02 | 7
2020 | 03 | 9
I have written the query below, but it only works for the current month. How would I iterate over it or rewrite it to get the table shown above?
WITH CURRENT_PERIOD AS (
SELECT DISTINCT userid
FROM table1
WHERE DATE(orderDate) BETWEEN DATE_TRUNC(CURRENT_DATE(),MONTH) AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
),
PREVIOUS_PERIOD AS (
SELECT DISTINCT userid
FROM table1
WHERE DATE(orderDate) BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH),MONTH) AND LAST_DAY(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
)
SELECT count(1)
FROM CURRENT_PERIOD RC
WHERE RC.userid IN (SELECT DISTINCT userid FROM PREVIOUS_PERIOD)
You can summarize to get one record per month, use lag(), and then aggregate:
select yyyymm,
countif(prev_yyyymm = date_add(yyyymm, interval -1 month)) as repeated_customers
from (select userid, date_trunc(order_date, month) as yyyymm,
lag(date_trunc(order_date, month)) over (partition by userid order by date_trunc(order_date, month)) as prev_yyyymm
from table1
group by 1, 2
) t
group by yyyymm
order by yyyymm;
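The idea behind the query above (one record per customer per month, then a comparison with the previous month) can be sketched in plain Python as follows. The first three orders come from the question's sample table; the two October orders are hypothetical and were added so that the result is not all zeros. repeated_customers is an illustrative name, not part of any library.

```python
from datetime import date

# (userid, orderDate); the rows marked "hypothetical" are not in the question.
orders = [
    (111, date(2021, 11, 1)),
    (112, date(2021, 9, 7)),
    (111, date(2021, 11, 21)),
    (111, date(2021, 10, 5)),   # hypothetical
    (112, date(2021, 10, 9)),   # hypothetical
]


def repeated_customers(orders):
    """For each (year, month), count buyers who also bought the month before."""
    buyers = {}  # (year, month) -> set of userids that ordered in that month
    for userid, order_date in orders:
        buyers.setdefault((order_date.year, order_date.month), set()).add(userid)

    result = {}
    for year, month in sorted(buyers):
        # Previous calendar month, rolling over the year boundary in January.
        prev = (year - 1, 12) if month == 1 else (year, month - 1)
        result[(year, month)] = len(buyers[(year, month)] & buyers.get(prev, set()))
    return result


repeats = repeated_customers(orders)
for (year, month), n in repeats.items():
    print(year, f"{month:02d}", n)
```

Using sets per month mirrors the group by in the inner query: a customer with several orders in one month is still counted once.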

PostgreSQL group by and order by

I have a table with a date column. I wanted to get the count of months and display them in the order of months. Months should be displayed as 'Jan', 'Feb' etc. If I use to_char function, the order by happens on text. I can use extract(month from dt), but that will also display month in number format. This is part of a report and month should be displayed in 'Mon' format only.
SELECT to_char(dt,'Mon'), COUNT(*) FROM tb GROUP BY to_char(dt,'Mon') ORDER BY to_char(dt,'Mon');
to_char | count
---------+-------
Dec | 1
Jan | 1
Jul | 2
select month, total
from (
select
extract(month from dt) as month_number,
to_char(dt,'Mon') as month,
count(*) as total
from tb
group by 1, 2
) s
order by month_number
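The trick in the query above is to sort on the month number while displaying only the month name. A minimal plain-Python sketch of the same idea follows; the dates are hypothetical, chosen to reproduce the counts shown in the question (Dec 1, Jan 1, Jul 2), and MONTH_NAMES is a helper list defined here for the illustration.

```python
from collections import Counter
from datetime import date

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical dates matching the counts in the question's output.
dates = [date(2012, 7, 4), date(2012, 12, 25), date(2012, 1, 9), date(2012, 7, 30)]

# Group by month number (the sort key) ...
counts = Counter(d.month for d in dates)

# ... sort on the number, but emit only the display name and the total.
report = [(MONTH_NAMES[m - 1], counts[m]) for m in sorted(counts)]

for name, total in report:
    print(name, total)
```

Sorting on the name alone would give Dec, Jan, Jul, which is the alphabetical order the question complains about.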

Calculating difference between daily sum and a average for the same day of the week in defined time range. SQL 10g Oracle

Hi I'm working with data depending mostly on the day of the week. Data is formatted in a table
Date - position - count/number.
There are multiple different positions.
I was able to sort my data for each day of the week using:
select MOD(to_char(time, 'J'),7),
sum(COUNT)
from TABLE
where time > sysdate -x
group by to_char(time, 'J')
order by to_char(time, 'J');
This outputs daily sums according to day of the week.
Now I'm able to get an average for a single day of a week in a year.
This code outputs an average for only Sunday
SELECT AVG(asset_sums)
FROM (
select MOD(to_char(time, 'J'),7),
sum(COUNT) as asset_sums
from table
where time > sysdate -365
and MOD(TO_CHAR(time, 'J'), 7) + 1 IN (7)
group by to_char(time, 'J')
order by to_char(time, 'J')
);
My goal is to be able to get a table with daily sum compared with yearly average for that particular day of the week.
For example yearly average number for Mondays is 57 , Tuesdays 60.
This week my Monday is 59 and Tuesday is 57. Output of the table is
Monday +2, Tuesday -3.
What is the easiest or most efficient way to do this?
Thanks for your help.
Edit : Format of my data
Date : yyyy-mm-dd | Place : xxxx | Number( of customers) 0 to 10000
2013-09-16 | AAAA | 1534
2013-09-16 | AAAB | 534
2013-09-17 | AAAA | 1434
2013-09-17 | AAAC | 834
2013-09-18 | AAAA | 134
2013-09-18 | AAAD | 183
Needed output
2013-09-16 | Day of the week | Sum | Average monday this year | Difference Sum-AVG
2013-09-16 | 1 (= Monday) | 2068 | 2015| 53
For clarity I will use subquery factoring. First, select the current week's data. Next, subquery the sum for the day over the current week. Then, subquery the sum for each day over the past year. Then, average the daily sums for each day of the week. Finally, join the two and display the difference.
with
this_week as (
select
time,
sum(count) sum
from table
where time > x - 7
group by time
),
this_week_dly_sum as (
select
to_char(time, 'd') day,
sum(sum) sum
from this_week
group by to_char(time, 'd')
),
this_year_dly_sum as (
select
time,
sum(count) sum
from table
where time > x - 365
group by time
),
this_year_dly_avg as (
select
to_char(time, 'd') day,
avg(sum) avg
from this_year_dly_sum
group by to_char(time, 'd')
)
select
this_week.time,
to_char(this_week.time, 'day') day_of_week,
this_week_dly_sum.sum,
this_year_dly_avg.avg,
this_week_dly_sum.sum - this_year_dly_avg.avg difference
from this_week
inner join this_week_dly_sum
on to_char(this_week.time, 'd') = this_week_dly_sum.day
inner join this_year_dly_avg
on to_char(this_week.time, 'd') = this_year_dly_avg.day
order by this_week.time
;
You can use analytic functions for this.
select date1, to_char(date1, 'd'),
sum(val) over(partition by to_char(date1, 'd')),
avg(val) over(partition by to_char(date1, 'd')),
sum(val) over(partition by to_char(date1, 'd'))-
avg(val) over(partition by to_char(date1, 'd'))
from table1
where date1 > add_months(sysdate, -12);
This will give you daily counts for the last year:
SELECT TRUNC(time, 'DD') AS date,
SUM(count) AS asset_sum
FROM yourtable
WHERE time > SYSDATE - 365
GROUP BY TRUNC(time, 'DD')
You can modify it to additionally return averages per day of the week for the specified range:
SELECT TRUNC(time, 'DD') AS date,
SUM(count) AS asset_sum,
AVG(SUM(count)) OVER
(PARTITION BY TO_CHAR(TRUNC(time, 'DD'), 'D')) AS asset_sum_avg
FROM yourtable
WHERE time > SYSDATE - 365
GROUP BY TRUNC(time, 'DD')
At this point you have all the initial data you need but probably for more days than necessary. You can use the above query as a derived table to limit the rows to just those where date > SYSDATE - x:
WITH last_year_by_day AS
(
SELECT TRUNC(time, 'DD') AS date,
SUM(count) AS asset_sum,
AVG(SUM(count)) OVER
(PARTITION BY TO_CHAR(TRUNC(time, 'DD'), 'D')) AS asset_sum_avg
FROM yourtable
WHERE time > SYSDATE - 365
GROUP BY TRUNC(time, 'DD')
)
SELECT date,
TO_CHAR(date, 'D') AS day_of_week,
asset_sum,
asset_sum_avg,
asset_sum - asset_sum_avg AS asset_sum_diff
FROM last_year_by_day
WHERE date > SYSDATE - x
;
As some expressions are being repeated multiple times, it can be a good idea to re-factor the query to avoid the repetition. Here's one way:
WITH last_year AS
(
SELECT TRUNC(time, 'DD') AS date,
TO_CHAR(time, 'D') AS day_of_week,
count
FROM yourtable
WHERE time > SYSDATE - 365
),
last_year_by_day AS
(
SELECT date,
day_of_week,
SUM(count) AS asset_sum,
AVG(SUM(count)) OVER (PARTITION BY day_of_week) AS asset_sum_avg
FROM last_year
GROUP BY date, day_of_week
)
SELECT date,
day_of_week,
asset_sum,
asset_sum_avg,
asset_sum - asset_sum_avg AS asset_sum_diff
FROM last_year_by_day
WHERE date > SYSDATE - x
;
One last note is about TO_CHAR('D'), which is used to obtain the day_of_week values. Since you are using a different method for the same results, you may not be aware that the results of TO_CHAR('D') are affected by the NLS_TERRITORY setting. You may want to use an ALTER SESSION statement to set NLS_TERRITORY to the value that would cause TO_CHAR('D') to return 1 for Monday, 2 for Tuesday etc. Here is the list of territories supported.
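To make the target calculation concrete, here is a minimal plain-Python sketch of "daily sum versus the average daily sum for the same weekday". It is not Oracle code. The rows for 2013-09-16 to 2013-09-18 are the sample rows from the question; the two rows for 2013-09-09 are hypothetical and were added so that Monday has more than one day to average over. The function name daily_vs_weekday_avg is made up for this sketch, and it uses isoweekday(), which always returns 1 for Monday regardless of locale.

```python
from collections import defaultdict
from datetime import date

# (date, place, number of customers)
rows = [
    (date(2013, 9, 9), "AAAA", 1500),   # hypothetical earlier Monday
    (date(2013, 9, 9), "AAAB", 462),    # hypothetical
    (date(2013, 9, 16), "AAAA", 1534),
    (date(2013, 9, 16), "AAAB", 534),
    (date(2013, 9, 17), "AAAA", 1434),
    (date(2013, 9, 17), "AAAC", 834),
    (date(2013, 9, 18), "AAAA", 134),
    (date(2013, 9, 18), "AAAD", 183),
]


def daily_vs_weekday_avg(rows):
    """Return (date, weekday, daily sum, weekday average, difference) per day."""
    # Step 1: sum over all places for each day.
    daily = defaultdict(int)
    for day, _place, number in rows:
        daily[day] += number

    # Step 2: average the daily sums per weekday (1 = Monday ... 7 = Sunday).
    sums_by_weekday = defaultdict(list)
    for day, total in daily.items():
        sums_by_weekday[day.isoweekday()].append(total)
    weekday_avg = {wd: sum(v) / len(v) for wd, v in sums_by_weekday.items()}

    # Step 3: compare each day with the average for its weekday.
    return [
        (day, day.isoweekday(), total,
         weekday_avg[day.isoweekday()], total - weekday_avg[day.isoweekday()])
        for day, total in sorted(daily.items())
    ]


report = daily_vs_weekday_avg(rows)
for line in report:
    print(*line)
```

Unlike the SQL, the sketch reports every day in the input; restricting the output to the last x days corresponds to the final WHERE clause in the queries above.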

Need to find Average of top 3 records grouped by ID in SQL

I have a PostgreSQL table with customer IDs, dates, and integers. I need to find the average of the top 3 records for each customer ID that have dates within the last year. I can do it for a single ID using the SQL below (id is the customer ID, weekending is the date, and maxattached is the integer).
One caveat: the maximum values are per month, meaning we're only looking at the highest value in a given month to create our dataset, thus why we're extracting month from the date.
SELECT
id,
round(avg(max),0)
FROM
(
select
id,
extract(month from weekending) as month,
extract(year from weekending) as year,
max(maxattached) as max
FROM
myTable
WHERE
weekending >= now() - interval '1 year' AND
id=110070 group by id,month,year
ORDER BY
max desc limit 3
) AS t
GROUP BY id;
How can I expand this query to include all IDs and a single averaged number for each one?
Here is some sample data:
ID | MaxAttached | Weekending
110070 | 5 | 2011-11-10
110070 | 6 | 2011-11-17
110071 | 4 | 2011-11-10
110071 | 7 | 2011-11-17
110070 | 3 | 2011-12-01
110071 | 8 | 2011-12-01
110070 | 5 | 2012-01-01
110071 | 9 | 2012-01-01
So, for this sample table, I would expect to receive the following results:
ID | MaxAttached
110070 | 5
110071 | 8
This averages the highest value in a given month for each ID (6,3,5 for 110070 and 7,8,9 for 110071)
Note: postgres version 8.1.15
First - get the max(maxattached) for every customer and month:
SELECT id,
max(maxattached) as max_att
FROM myTable
WHERE weekending >= now() - interval '1 year'
GROUP BY id, date_trunc('month',weekending);
Next - for every customer rank all his values:
SELECT id,
max_att,
row_number() OVER (PARTITION BY id ORDER BY max_att DESC) as max_att_rank
FROM <previous select here>;
Next - get the top 3 for every customer:
SELECT id,
max_att
FROM <previous select here>
WHERE max_att_rank <= 3;
Next - get the avg of the values for every customer:
SELECT id,
avg(max_att) as avg_att
FROM <previous select here>
GROUP BY id;
Next - just put all the queries together and rewrite/simplify them for your case.
UPDATE: Here is an SQLFiddle with your test data and the queries: SQLFiddle.
UPDATE2: Here is a query that will work on 8.1:
SELECT customer_id,
(SELECT round(avg(max_att),0)
FROM (SELECT max(maxattached) as max_att
FROM table1
WHERE weekending >= now() - interval '2 year'
AND id = ct.customer_id
GROUP BY date_trunc('month',weekending)
ORDER BY max_att DESC
LIMIT 3) sub
) as avg_att
FROM customer_table ct;
The idea is to take your initial query and run it for every customer (customer_table is a table with all the unique customer IDs).
Here is SQLFiddle with this query: SQLFiddle.
Only tested on version 8.3 (8.1 is too old to be on SQLFiddle).
8.3 version
8.3 is the oldest version I've got access to, so I can't guarantee it'll work in 8.1
I'm using a temporary table to work out the best three records.
CREATE TABLE temp_highest_per_month as
select
id,
extract(month from weekending) as month,
extract(year from weekending) as year,
max(maxattached) as max_in_month,
0 as priority
FROM
myTable
WHERE
weekending >= now() - interval '1 year'
group by id,month,year;
UPDATE temp_highest_per_month t
SET priority =
(select count(*) from temp_highest_per_month t2
where t2.id = t.id and
(t.max_in_month < t2.max_in_month or
(t.max_in_month= t2.max_in_month and
t.year * 12 + t.month > t2.year * 12 + t2.month)));
select id,round(avg(max_in_month),0)
from temp_highest_per_month
where priority < 3
group by id;
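The priority of a row is the number of rows for the same id that rank ahead of it, so the best month gets 0 and the top three are the rows with priority below 3. The year and month are included when working out the priority so that if two months have the same maximum, they'll still be numbered correctly.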
9.1 version
Similar to Igor's answer, but I used the With clause to split the steps.
with highest_per_month as
( select
id,
extract(month from weekending) as month,
extract(year from weekending) as year,
max(maxattached) as max_in_month
FROM
myTable
WHERE
weekending >= now() - interval '1 year'
group by id,month,year),
prioritised as
( select id, month, year, max_in_month,
row_number() over (partition by id
order by max_in_month desc)
as priority
from highest_per_month
)
select id, round(avg(max_in_month),0)
from prioritised
where priority <= 3
group by id;
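The steps shared by the answers above (monthly maximum per customer, keep the three highest, average them) can be cross-checked with a short plain-Python sketch on the question's sample data. It is an illustration only: the "within the last year" date filter is left out because the sample dates are fixed, and avg_top3_monthly_max is a name made up for this sketch.

```python
from collections import defaultdict
from datetime import date

# Sample data from the question: (ID, MaxAttached, Weekending).
rows = [
    (110070, 5, date(2011, 11, 10)),
    (110070, 6, date(2011, 11, 17)),
    (110071, 4, date(2011, 11, 10)),
    (110071, 7, date(2011, 11, 17)),
    (110070, 3, date(2011, 12, 1)),
    (110071, 8, date(2011, 12, 1)),
    (110070, 5, date(2012, 1, 1)),
    (110071, 9, date(2012, 1, 1)),
]


def avg_top3_monthly_max(rows, top_n=3):
    """Rounded average of each customer's top_n monthly maximums."""
    # Step 1: highest value per customer and calendar month.
    monthly_max = {}
    for customer_id, value, weekending in rows:
        key = (customer_id, weekending.year, weekending.month)
        monthly_max[key] = max(value, monthly_max.get(key, value))

    # Step 2: collect the monthly maximums per customer.
    per_customer = defaultdict(list)
    for (customer_id, _year, _month), value in monthly_max.items():
        per_customer[customer_id].append(value)

    # Step 3: keep the top_n values per customer and average them.
    result = {}
    for customer_id, values in sorted(per_customer.items()):
        top = sorted(values, reverse=True)[:top_n]
        result[customer_id] = round(sum(top) / len(top))
    return result


averages = avg_top3_monthly_max(rows)
print(averages)
```

Note that Python's round() rounds exact halves to the nearest even number, which can differ from PostgreSQL's round() on numeric values; it makes no difference for this sample.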