I want to get a specific answer by comparing two columns in PostgreSQL - SQL

I have a query like this :
with base_data as
( Select
receipt_date,
receipt_value,
receipt_customer_id
From table1 )
Select
count(distinct (receipt_customer_id) , sum(receipt_value)
From
base_data
where
(receipt_date:timestamp <= current_date - interval '1 month' and
receipt_date: timestamp >= current_date - interval '2 month)
This basically gives me the number of distinct clients and the sum of their receipt values for July and August, assuming the current month is September.
I want to reduce this further and just want the data for
distinct clients and the sum of their receipt values
for whom there was no receipt in July, i.e. they never transacted with us in July but came back in August; basically, they skipped a month and then transacted again.
I am unable to write this clause, which I am putting in English below as a problem statement:
Give me the data for a distinct count of clients and their total sum of receipts who transacted with us in August but had no receipt value in July
I hope I have been able to explain it. I have been racking my brain on this for a while but am unable to figure out a solution. Please help.
The current result looks like this
Count: 120
Sum: 207689
I want it reduced to (assumption)
Count: 12
Sum: 7000

The first issue I can see is with "sum of receipt values for July and August": the result of your current query will depend on when it is run (and will not cover calendar months). Let's put that aside and simplify/fix your query (as stated, it does not run) to one that will list all transactions in August (I think it's simpler to understand using hard-coded dates for now):
Select
receipt_customer_id, sum(receipt_value)
From
table1
where
-- Transacted in August
receipt_date >= '2020-08-01'::timestamp and
receipt_date < '2020-09-01'::timestamp
group by receipt_customer_id;
We can now add another condition to the WHERE clause to keep only customers whose July transactions total $0 or NULL (so a total of $0, or no transactions at all):
Select
receipt_customer_id, sum(receipt_value)
From
table1 t
where
-- Transacted in August
t.receipt_date >= '2020-08-01'::timestamp and
t.receipt_date < '2020-09-01'::timestamp
and (
select coalesce(sum(receipt_value), 0)
from table1
where
receipt_customer_id = t.receipt_customer_id and
-- Transacted in July
receipt_date >= '2020-07-01'::timestamp and
receipt_date < '2020-08-01'::timestamp
) = 0
group by receipt_customer_id;
or if you just want the count of customers and sum of receipt_value:
Select
count(distinct receipt_customer_id), sum(receipt_value)
From
table1 t
where
-- Transacted in August
t.receipt_date >= '2020-08-01'::timestamp and
t.receipt_date < '2020-09-01'::timestamp
and (
select coalesce(sum(receipt_value), 0)
from table1
where
receipt_customer_id = t.receipt_customer_id and
-- Transacted in July
receipt_date >= '2020-07-01'::timestamp and
receipt_date < '2020-08-01'::timestamp
) = 0;
See this db fiddle for a test of this (feel free to use it if you want to ask follow-up questions). Note that if you want to reintroduce current_date you can do so (but you probably want to calculate the start of the month; date_trunc can help with this).
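For example, here is a sketch of the last query with the hard-coded dates replaced by month boundaries derived from current_date (untested against your data; date_trunc('month', ...) returns the first instant of the month):
Select
count(distinct receipt_customer_id), sum(receipt_value)
From
table1 t
where
-- Transacted last month (August, if run in September)
t.receipt_date >= date_trunc('month', current_date) - interval '1 month' and
t.receipt_date < date_trunc('month', current_date)
and (
select coalesce(sum(receipt_value), 0)
from table1
where
receipt_customer_id = t.receipt_customer_id and
-- No receipts the month before that (July)
receipt_date >= date_trunc('month', current_date) - interval '2 month' and
receipt_date < date_trunc('month', current_date) - interval '1 month'
) = 0;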

Related

Count distinct customers, active within a year, for every week of the year

I am working with an existing E-commerce database. Actually, this process is usually done in Excel, but we want to try it directly with a query in PostgreSQL (version 10.6).
We define as an active customer a person who has bought at least once within 1 year. This means, if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
I want the output for each week of the year (2020). Basically what I need is ...
select
email,
orderdate,
id
from
orders_table
where
paid = true;
|---------------------|-------------------|-----------------|
| email | orderdate | id |
|---------------------|-------------------|-----------------|
| email1@email.com |2020-06-02 05:04:32| Order-2736 |
|---------------------|-------------------|-----------------|
I can't create new tables. And I would like to see the output like this:
Year| Week | Active customers
2020| 25 | 6978
2020| 24 | 3948
Depending on whether there is a year and week column, you can use OVER (PARTITION BY ...) with extract:
SELECT DISTINCT
extract(year from orderdate) AS year,
extract(week from orderdate) AS week,
count(*) OVER (PARTITION BY extract(year from orderdate),
extract(week from orderdate)) AS customer_count_in_week
FROM orders_table
WHERE paid = true;
This buckets all orders by year and week, showing the total count of paid orders per week of each year.
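Note that a plain GROUP BY produces the same buckets and is arguably simpler (a sketch):
SELECT
extract(year from orderdate) AS year,
extract(week from orderdate) AS week,
count(*) AS customer_count_in_week
FROM orders_table
WHERE paid = true
GROUP BY 1, 2
ORDER BY 1, 2;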
references:
https://www.postgresql.org/docs/9.1/tutorial-window.html
https://www.postgresql.org/docs/8.1/functions-datetime.html
if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
Problems on your side
This method has some corner case ambiguities / issues:
Do you include or exclude "week 22 in 2020"? (I exclude it below to stay closer to "a year".)
A year can have 52 or 53 full weeks. Depending on the current date, the calculation is based on 52 or 53 weeks, causing a possible bias of almost 2 %!
If you start the time range on "the same date last year", then the margin of error is only 1 / 365 or ~ 0.3 %, due to leap years.
A fixed "period of 365 days" (or 366) would eliminate the bias altogether.
Problems on the SQL side
Unfortunately, window functions do not currently allow the DISTINCT keyword (for good reasons). So something of the form:
SELECT count(DISTINCT email) OVER (ORDER BY year, week
GROUPS BETWEEN 52 PRECEDING AND 1 PRECEDING)
FROM ...
.. triggers:
ERROR: DISTINCT is not implemented for window functions
The GROUPS keyword has only been added in Postgres 11 and would otherwise be just what we need.
What's more, your odd frame definition wouldn't even work exactly, since the number of weeks to consider is not always 52, as discussed above.
So we have to roll our own.
Solution
The following simply generates all weeks of interest, and computes the distinct count of customers for each. Simple, except that date math is never entirely simple. But, depending on details of your setup, there may be faster solutions. (I had several other ideas.)
The time range for which to report may change. Here is an auxiliary function to generate weeks of a given year:
CREATE OR REPLACE FUNCTION f_weeks_of_year(_year int)
RETURNS TABLE(year int, week int, week_start timestamp)
LANGUAGE sql STABLE STRICT PARALLEL SAFE
ROWS 52 COST 10 AS
$func$
SELECT _year, d.week::int, d.week_start
FROM generate_series(date_trunc('week', make_date(_year, 01, 04)::timestamp) -- first day of first week
, LEAST(date_trunc('week', localtimestamp), make_date(_year, 12, 28)::timestamp) -- latest possible start of week
, interval '1 week') WITH ORDINALITY d(week_start, week)
$func$;
Call:
SELECT * FROM f_weeks_of_year(2020);
It returns 1 row per week, but stops at the current week for the current year. (Empty set for future years.)
The calculation is based on these facts:
The first ISO week of the year always contains January 04.
The last ISO week cannot start after December 28.
Actual week numbers are computed on the fly using WITH ORDINALITY. See:
PostgreSQL unnest() with element number
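A minimal standalone demonstration of WITH ORDINALITY (hypothetical dates, just to show the numbering):
SELECT *
FROM generate_series(timestamp '2020-01-06'
, timestamp '2020-01-27'
, interval '1 week') WITH ORDINALITY AS d(week_start, week);
-- week_start: 2020-01-06, 2020-01-13, ... / week: 1, 2, ...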
Aside, I stick to timestamp and avoid timestamptz for this purpose. See:
Generating time series between two dates in PostgreSQL
The function also returns the timestamp of the start of the week (week_start), which we don't need for the problem at hand. But I left it in to make the function more useful in general.
Makes the main query simpler:
WITH weekly_customer AS (
SELECT DISTINCT
EXTRACT(YEAR FROM orderdate)::int AS year
, EXTRACT(WEEK FROM orderdate)::int AS week
, email
FROM orders_table
WHERE paid
AND orderdate >= date_trunc('week', timestamp '2019-01-04') -- max range for 2020!
ORDER BY 1, 2, 3 -- optional, might improve performance
)
SELECT d.year, d.week
, (SELECT count(DISTINCT email)
FROM weekly_customer w
WHERE (w.year, w.week) >= (d.year - 1, d.week) -- row values, see below
AND (w.year, w.week) < (d.year , d.week) -- exclude current week
) AS active_customers
FROM f_weeks_of_year(2020) d; -- (year int, week int, week_start timestamp)
db<>fiddle here
The CTE weekly_customer folds to unique customers per calendar week once, as duplicate entries are just noise for our calculation. It's used many times in the main query. The cut-off condition is based on Jan 04 once more. Adjust to your actual reporting period.
The actual count is done with a lowly correlated subquery. Could be a LEFT JOIN LATERAL ... ON true instead. See:
What is the difference between LATERAL and a subquery in PostgreSQL?
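For example, the correlated subquery above rewritten as a lateral join, keeping the same weekly_customer CTE (a sketch):
SELECT d.year, d.week, a.active_customers
FROM f_weeks_of_year(2020) d
LEFT JOIN LATERAL (
SELECT count(DISTINCT w.email) AS active_customers
FROM weekly_customer w
WHERE (w.year, w.week) >= (d.year - 1, d.week)
AND (w.year, w.week) < (d.year , d.week)
) a ON true;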
Using row value comparison to make the range definition simple. See:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
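Row values compare lexicographically, so a single comparison handles the year wrap-around, e.g.:
SELECT (2019, 40) < (2020, 5) AS earlier; -- true: the years already differ, so the weeks are never compared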

Last year occurrence of date and month

I am looking for a way to find the last time a day and month combination has happened.
I have a long list of dates and I want to find out in what year the day and month combination last occurred.
i.e. 01/01 has happened in 2018, so I want 2018 as the output.
31/12 has not happened in 2018 yet; the last time it happened was in 2017, so I want 2017 as the output.
Table 1
(01/01/2015),
(01/01/2016),
(31/12/2015),
(25/07/2004)
Return table 2
(01/01/2015, 01/01/2018),
(01/01/2016, 01/01/2018),
(31/12/2015, 31/12/2017),
(25/07/2004, 25/07/2017)
OR even just return
(01/01/2015, 2018),
(01/01/2016, 2018),
(31/12/2015, 2017),
(25/07/2004, 2017)
Is this what you want?
select t2.*,
(case when month(col) < month(current_date) or
(month(col) = month(current_date) and day(col) <= day(current_date))
then year(current_date)
else year(current_date) - 1
end)
from table2 t2;
This is using a reasonable set of date/time functions. These can vary by database.
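For completeness, in PostgreSQL (which has no month()/day()/year() functions) the same idea can be written with extract and a row comparison (a sketch; col stands for your date column):
SELECT t2.*,
CASE WHEN (extract(month FROM col), extract(day FROM col))
<= (extract(month FROM current_date), extract(day FROM current_date))
THEN extract(year FROM current_date)
ELSE extract(year FROM current_date) - 1
END AS last_occurrence_year
FROM table2 t2;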
To filter rows whose month and year match those of a given date, you can use:
SELECT *
FROM YourTable
WHERE month(date) = month(get_some_date()) and year(date) = year(get_some_date())
Here you can replace get_some_date with your own function logic.

Sum of shifting range in SQL Query

I am trying to write an efficient query to get the sum of the previous 7 days' worth of values from a relational DB table, and record each total against the final date in the 7-day period (e.g. the 'WeeklyTotals' table in the example below). For example, in my WeeklyTotals query, I would like the value for February 15th to be 333, since that is the total sum of users from Feb 9th to Feb 15th, and so on:
I have a base query which gets me the previous week's users for today's date (simplified for the sake of the example):
SELECT "Date", Sum("Total Users")
FROM "UserRecords"
WHERE dateadd(hour, -8, "UserRecords"."Date") BETWEEN
dateadd(hour, -8, sysdate) - INTERVAL '7 DAY' AND dateadd(hour, -8, sysdate)
GROUP BY "Date";
The problem is, this only gets me the total for today's date. I need a query which will get me this information for each of the previous seven days.
I know I can make a view for each date (since I only need the previous seven entries) and join them all together, but that seems really inefficient (I'll have to create/update 7 views, and then do all the inner join operations). I am wondering if there's a more efficient way to achieve this.
Provided there are no gaps, you can use a running total with SUM OVER including the six previous rows. Use ROW_NUMBER to exclude the first six records, as their totals don't represent complete weeks.
select log_date, week_total
from
(
select
log_date,
sum(total_users) over (order by log_date rows 6 preceding) as week_total,
row_number() over (order by log_date) as rn
from mytable
) t
where rn >= 7
order by log_date;
UPDATE: In case there are gaps, it should be
sum(total_users) over (order by log_date range interval '6' day preceding)
but I don't know whether PostgreSQL supports this already. (Moreover the ROW_NUMBER exclusion wouldn't work then and would have to be replaced by something else.)
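For the record, PostgreSQL 11 and later do accept a RANGE frame with an offset, so the gap-tolerant version can be written like this (a sketch; the first rows still cover incomplete weeks and would need a different exclusion than ROW_NUMBER):
select log_date,
sum(total_users) over (order by log_date
range between interval '6 days' preceding and current row) as week_total
from mytable
order by log_date;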
Here's a query that self-joins to the previous 6 days and sums the values to get the weekly totals:
select u1.date, sum(u2.total_users) as weekly_users
from UserRecords u1
join UserRecords u2
on u1.date - u2.date < 7
and u1.date >= u2.date
group by u1.date
order by u1.date
You can use SUM as a window function, with an expression using date_part of week.
Self-joins are much slower than window functions.
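One reading of that suggestion, bucketing by calendar week with date_trunc (a sketch; note this yields calendar-week totals, not a rolling 7-day window):
select "Date",
sum("Total Users") over (partition by date_trunc('week', "Date")) as calendar_week_total
from "UserRecords";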

How to write a Report query based on number of days in a month?

I am trying to write a query to generate an automated report. The report covers last month's transactions, and I have a job that runs it on the 1st of every month. But how can I make this query account for the number of days in a month? (Some months have 30 days, some 31, and the number of days in February depends on leap years.)
One more requirement: I can only pass one parameter to the query. Below is an example of the query that I have now:
select id,name,address,trans_dt from tab1 where trans_dt between to_date('&1','MM-DD-YYYY')-30 AND to_date('&1','MM-DD-YYYY');
The above query generates the last 30 days' transactions, but it will be wrong when the month has 31 or 28 days. I am using Oracle 11gR2 as the database. Please help in writing this.
MySQL
SELECT
id,
NAME,
address,
trans_dt
FROM
tab1
WHERE
MONTH(trans_dt) = 02 /* Param for month passed in */
AND
YEAR(trans_dt) = YEAR(CURDATE())
Oracle
SELECT
id,
NAME,
address,
trans_dt
FROM
tab1
WHERE
to_char(trans_dt, 'mm') = '02' /* Param for month passed in */
AND
EXTRACT(YEAR FROM trans_dt) = EXTRACT(YEAR FROM sysdate)
Maybe this can work? I only really know MySQL.
Found a solution but forgot to reply here. Here is my solution:
select id,name,address,trans_dt from tab1 where trans_dt between trunc(trunc(sysdate,'MM')-1,'MM') and trunc(sysdate,'MM');
The above will give you the report from the first day of the previous calendar month through the first instant of the current month (note that BETWEEN is inclusive, so a transaction stamped exactly at midnight on the 1st would also be caught).
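To see what the two trunc expressions evaluate to, here is an illustrative probe (assuming it runs on some day in September 2020):
SELECT trunc(trunc(sysdate, 'MM') - 1, 'MM') AS prev_month_start, -- 01-AUG-2020
trunc(sysdate, 'MM') AS curr_month_start -- 01-SEP-2020
FROM dual;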
select id
,name
,address
,trans_dt from tab1
where trans_dt between to_date(p_dt,'yyyy-mm-dd') and add_months(to_date(p_dt,'yyyy-mm-dd'), 1)

PostgreSQL - Getting statistical data

I need to collect some statistical information in my application.
I have a table of users (tb_user)
Every time a new user accesses the application, it adds a new record to this table, i.e., one row for each user. The main fields are id and date_time (the timestamp of the first time the user accessed the application).
tb_user
id (bigint) | date_time (timestamp with time zone)
1 | 2012-01-29 11:29:50.359-03
2 | 2012-01-31 14:27:10.359-03
I need to get:
the average number of users per day, week, and month
Example:
by day: 55.45
by week : XX.XX
month: XX.XX
EDIT:
My best solution was:
WITH daily_count AS (SELECT COUNT(id) AS user_count FROM tb_user)
SELECT user_count, tbaux2.days, (user_count/tbaux2.days) FROM daily_count,
(SELECT EXTRACT(DAY FROM (t2.diff) ) + 1 AS days
FROM
(with tbaux AS(SELECT min(date_time) AS min FROM tb_user)
SELECT (now() - min) AS diff
FROM tbaux) AS t2) AS tbaux2
GROUP BY user_count, tbaux2.days
But this solution only worked with EXTRACT(DAY ...); with weeks and months it did not work.
Any help is welcome.
Alternatively:
SELECT user_count, tbaux2.days, (user_count/tbaux2.days) AS userPerDay, ((user_count/tbaux2.days) * 7) AS userPerWeek, ((user_count/tbaux2.days) * 30) AS userPerMonth
EDIT 2:
Based on responses from @Bruno, there are some considerations:
When I asked the question, what I really requested was a way to select data by day, month, and year. I believe that the query I posted, and that @Bruno refined, should be interpreted as the average per "one day, every 7 days, and every 30 days", and not per calendar day, week, and month. I believe that if it is interpreted this way, there will not be problems like the one quoted in the example (the 10% drop). This "every N days" approach is the answer I need at the moment, so I will accept this answer.
I suggest as improvements to the post:
Consider only closed days in the result (do not collect users of the current day, and do not count the current day in the division).
Round the result to two decimal digits.
New research considering data truly per week and per month.
Thanks.
You should look into aggregate functions (min, max, count, avg), which go hand in hand with GROUP BY. For date-based aggregations, date_trunc is also useful.
For example, this will return the number of rows per day:
SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count FROM tb_user
GROUP BY date_trunc('day', date_time);
You can then do the daily average using something like this (with a CTE):
WITH daily_count AS (SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count FROM tb_user
GROUP BY date_trunc('day', date_time))
SELECT AVG(user_count) FROM daily_count;
Use 'week' instead of day for the weekly counts, and so on (see date_trunc documentation).
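For example, the weekly average would be (a sketch):
WITH weekly_count AS (SELECT date_trunc('week', date_time) AS week_start,
COUNT(id) AS user_count FROM tb_user
GROUP BY date_trunc('week', date_time))
SELECT AVG(user_count) FROM weekly_count;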
EDIT: (Following a comment: average up to and including 2012-01-05, i.e. before the 6th.)
WITH daily_count AS (SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count
FROM tb_user
WHERE date_time >= DATE('2012-01-01') AND date_time < DATE('2012-01-06')
GROUP BY date_trunc('day', date_time))
SELECT SUM(user_count)/(DATE('2012-01-06') - DATE('2012-01-01')) FROM daily_count;
What's above is over-complicated, in this case. This should give you the same result:
SELECT COUNT(id)/(DATE('2012-01-06') - DATE('2012-01-01'))
FROM tb_user
WHERE date_time >= DATE('2012-01-01') AND date_time < DATE('2012-01-06');
EDIT 2: After your edit, I guess what you're after is just a single global average for the entire period of existence of your database, rather than groups by month/week/day.
This should give you the average number of rows per day:
WITH total_min_max AS (SELECT
COUNT(id) AS total_visits,
MIN(date_time) AS first_date_time,
MAX(date_time) AS last_date_time
FROM tb_user)
SELECT total_visits/((last_date_time::date-first_date_time::date)+1) AS users_per_day
FROM total_min_max
(I would replace last_date_time with NOW() to make the average over the time until now, rather than until the last visit, if there's no recent visit.)
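That replacement would look like this (a sketch, keeping the same total_min_max CTE):
SELECT total_visits/((now()::date - first_date_time::date) + 1) AS users_per_day
FROM total_min_max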
Then, for daily, weekly, and "monthly":
WITH daily_avg AS (
WITH total_min_max AS (SELECT
COUNT(id) AS total_visits,
MIN(date_time) AS first_date_time,
MAX(date_time) AS last_date_time
FROM tb_user)
SELECT total_visits/((last_date_time::date-first_date_time::date)+1) AS users_per_day
FROM total_min_max)
SELECT
users_per_day,
(users_per_day * 7) AS users_per_week,
(users_per_day * 30) AS users_per_month
FROM daily_avg
This being said, conclusions you draw from such statistics might not be great, especially if you want to see how it changes.
I would also normalise the data per day rather than assuming 30 days in a month (if not per hour, because not all days have 24 hours). Say you have 10 visits per day in Jan 2011 and 10 visits per day in Feb 2011. That gives you 310 visits in Jan and 280 visits in Feb. If you don't pay attention, you could think you've had almost a 10% drop in the number of visitors and that something went wrong in Feb, when really this isn't the case.