Monthly data not reflecting properly - sql

Need the last four months' data:
select count(distinct session_id)
from master_gui partition for (to_date('11-25-2020','MM-DD-YYYY'))
where session_id in (select distinct session_id
from reporting_data partition for (to_date('11-25-2020','MM-DD-YYYY'))
where flow_name in ('BEGIN_STATUS'));
Any suggestion on how to modify the above query to include dates for the last 4 months?
I checked the partition key values with the query below:
SELECT OWNER, NAME, OBJECT_TYPE, COLUMN_NAME, COLUMN_POSITION FROM ALL_PART_KEY_COLUMNS
OWNER           NAME            OBJECT_TYPE  COLUMN_NAME         COLUMN_POSITION
REPORTING_USER  REPORTING_DATA  TABLE        CREATE_TIME         1
REPORTING_USER  MASTER_GUI      TABLE        SESSION_START_TIME  1
Using the below query to get the last 4 months' records (Aug, Sept, Oct and Nov):
select count(distinct session_id)
from master_gui where SESSION_START_TIME >= add_months(trunc(sysdate), -4)
and session_id in (select distinct session_id from reporting_data where CREATE_TIME>= add_months(trunc(sysdate), -4)
and flow_name in ('BEGIN_STATUS'));
Thanks Experts,
Used the below query after changes; is it correct?
As we have to get the count from the master_gui table, I used it with its partition key column SESSION_START_TIME, and the reporting_data table's partition key column CREATE_TIME.
select count(distinct session_id)
from master_gui where SESSION_START_TIME < trunc(sysdate,'mm')
and SESSION_START_TIME >= add_months( trunc(sysdate, 'mm'),-4)
and session_id in (select distinct session_id from REPORTING_DATA where create_time < trunc(sysdate,'mm')
and create_time >= add_months( trunc(sysdate, 'mm'),-4)
and flow_name in ('BEGIN_STATUS'));
Thanks experts,
Is the below correct? Will I get somewhat better performance by using the query below, with the DISTINCT clause removed from the inner subquery?
select count(distinct session_id)
from master_gui where SESSION_START_TIME < trunc(sysdate,'mm')
and SESSION_START_TIME >= add_months( trunc(sysdate, 'mm'),-4)
and session_id in (select session_id from REPORTING_DATA where create_time < trunc(sysdate,'mm')
and create_time >= add_months( trunc(sysdate, 'mm'),-4)
and flow_name in ('BEGIN_STATUS'));
Thanks Experts,
I need to use the partition clause only, to get faster performance:
select count(distinct session_id)
from master_gui partition for (to_date('11-01-2020','MM-DD-YYYY'))
where session_id in (select distinct session_id from reporting_data partition for (to_date('11-30-2020','MM-DD-YYYY'))
where flow_name in ('BEGIN_STATUS'));
Is the above query correct for 1st Nov 2020 to 30th Nov 2020?

This part of your query means you are selecting records only from the partition which holds values for 25-NOV-2020:
from reporting_data partition for (to_date('11-25-2020','MM-DD-YYYY'))
Therefore if your table is partitioned by daily intervals you will get records only for the 25th. If the partition key is monthly you will get records only for November. Using this syntax you could only get records for the last four months if the partition key is (say) annual.
The solution is simply to omit the partition clause and use a WHERE clause instead.
select count(distinct session_id)
from master_gui
where session_id in (select distinct session_id
from reporting_data
where flow_name in ('BEGIN_STATUS')
and <<partition_key_column>> >= sysdate - interval '4' month)
and <<partition_key_column>> >= sysdate - interval '4' month;
This query will still use partition pruning.
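If you want to confirm the pruning, one option (a sketch, with the placeholder partition-key columns replaced by SESSION_START_TIME and CREATE_TIME as identified from ALL_PART_KEY_COLUMNS earlier in the thread) is to look at the execution plan:
EXPLAIN PLAN FOR
select count(distinct session_id)
from master_gui
where session_start_time >= sysdate - interval '4' month
and session_id in (select session_id
from reporting_data
where flow_name in ('BEGIN_STATUS')
and create_time >= sysdate - interval '4' month);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- Look for PARTITION RANGE ITERATOR steps and check the Pstart/Pstop columns:
-- anything narrower than the full 1..N partition range means pruning happened.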
is it correct?
Looks like what I suggested. However, you have refined "last four months" to mean the last four complete months, i.e. excluding the current month. My search criteria include the current month. So maybe what you actually need is something like:
select session_id
from reporting_data
where create_time < trunc(sysdate,'mm')
and create_time >= add_months( trunc(sysdate, 'mm'),-4)
This will provide a span from 01-AUG-2020 to 30-NOV-2020.
Incidentally, you don't need the DISTINCT in the subquery. The IN clause will handle duplicates so DISTINCT just adds unnecessary work, which could matter if you're dealing with large amounts of data.

There's a DATE datatype column, I presume. If so, include it in the WHERE clause, e.g.
... and date_column >= add_months(trunc(sysdate, 'mm'), -4)

Related

Postgresql Distinct Statement

How can I get the distinct minute values from a timestamp column?
Like, if the table contains 100 records within one minute, I want the count of records per minute.
For example,
SELECT DISTINCT(timestamp) FROM customers WHERE DATE(timestamp) = CURRENT_DATE
The result should be like:
timestamp          record
30-12-2019 11:30   5
30-12-2019 11:31   8
One option would be a ::date conversion of the timestamp column, combined with GROUP BY:
SELECT timestamp, count(*)
FROM tab
WHERE timestamp::date = current_date
GROUP BY timestamp
Demo for current day
timestamp::date might be replaced with date(timestamp) like in your case.
Update: If the table contains data with precision up to microseconds, then
SELECT to_char(timestamp,'YYYY-MM-DD HH24:MI'), count(*)
FROM tab
WHERE date(timestamp) = current_date
GROUP BY to_char(timestamp,'YYYY-MM-DD HH24:MI')
might be considered.
Try something like the following:
SELECT DATE_TRUNC('minute', timestamp) as timestamp, COUNT(*) as record
FROM customers
WHERE DATE(timestamp) = CURRENT_DATE
GROUP BY DATE_TRUNC('minute', timestamp)
ORDER BY DATE_TRUNC('minute', timestamp)

Can I reduce the number of SQL queries here (Postgresql)?

It's been a while since I've touched SQL.
I'm working on a pretty large database.
In a certain table, which has some 30 million rows, I'm trying to figure out when the highest number of entries was made for a certain period, e.g. a year, down to the detail level of one hour.
What I do now is something like this:
For the year 2018:
Find month with highest entry number for 2018 (i.e. 12 queries):
select count(*) from sing
where to_char(create_time, 'YYYY-MM-DD') like '2018-01-%'
select count(*) from sing
where to_char(create_time, 'YYYY-MM-DD') like '2018-02-%'
After I find the month with the highest number I must find the day (i.e. up to 31 queries):
select count(*) from sing
where to_char(create_time, 'YYYY-MM-DD') = '2018-01-01'
select count(*) from sing
where to_char(create_time, 'YYYY-MM-DD') = '2018-01-02'
After I find the day with the highest number I must find the hour (i.e. 24 queries):
select count(*) from sing
where to_char(create_time, 'YYYY-MM-DD HH24:MI:SS') >= '2018-01-02 08:00:00'
and to_char(create_time, 'YYYY-MM-DD HH24:MI:SS') <= '2018-01-02 08:59:59'
As you can see this is a tedious task. So my question is, if and how I can optimize this process?
The database is a PostgreSQL, and I'm using the pgadmin.
Thanks in advance.
You can use GROUP BY and the date_part function to simplify things:
SELECT date_part('month', create_time), count(*)
FROM sing
WHERE date_part('year', create_time) = 2018
GROUP BY date_part('month', create_time)
and then for the day
SELECT date_part('day', create_time), count(*)
FROM sing
WHERE date_part('year', create_time) = 2018
AND date_part('month', create_time) = <month from previous query>
GROUP BY date_part('day', create_time)
and so on
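If the end goal is just the single busiest hour of 2018, the whole drill-down could arguably be collapsed into one query; a sketch, assuming the sing table and create_time column from the question:
-- Count rows per hour for 2018 and keep only the busiest hour.
SELECT date_trunc('hour', create_time) AS hour, count(*) AS entries
FROM sing
WHERE create_time >= DATE '2018-01-01'
  AND create_time <  DATE '2019-01-01'
GROUP BY date_trunc('hour', create_time)
ORDER BY count(*) DESC
LIMIT 1;
The plain range predicate on create_time also allows PostgreSQL to use an index on that column, which the to_char comparisons in the original approach cannot.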
For the year 2018 it would be one query:
select count(*) from sing where date_part('year', create_time) = '2018'
So I think you can better use date_part than to_char.
https://www.w3resource.com/PostgreSQL/date_part-function.php

Postgres - Cohort analysis across months sequentially, not if exists in any later month

I'm doing a cohort analysis and can get the group of users to examine, then see whether they transacted in the months following on. But I want it like this:
Of that group in December, who transacted in Jan; of the Jan group from Dec, who transacted in Feb. Basically I'm tracking decay of the customer base.
What I don't want is those that return in any month following Dec, which is this:
WITH start_sample AS (
SELECT
user_fk,
created_at AS start_sample_date
FROM transactions
WHERE created_at >= '2016-11-01' AND created_at < '2016-12-01'
GROUP BY user_fk,
start_sample_date),
start_sample_min AS (
SELECT
user_fk,
MIN(start_sample_date) AS first_transaction
FROM start_sample
GROUP BY user_fk
)
SELECT
DATE_TRUNC('month', created_at) AS transacting_month,
COUNT(DISTINCT user_fk)
FROM transactions
WHERE created_at >= '2016-11-01'
AND user_fk IN (SELECT user_fk FROM start_sample_min)
GROUP BY transacting_month
ORDER BY transacting_month;
Then I made a churn model to see if it would get what I need, but it doesn't:
WITH monthly_users AS (
SELECT
user_fk AS monthly_user_fk,
DATE_TRUNC('month', created_at) AS month
FROM transactions
WHERE created_at >= '2016-11-01' AND created_at < '2017-12-01'
GROUP BY monthly_user_fk, month
ORDER BY monthly_user_fk, month
),
lag_lead AS (
SELECT
monthly_user_fk,
month,
LAG(month,1) OVER (PARTITION BY monthly_user_fk ORDER BY month) AS lag,
LEAD(month,1) OVER (PARTITION BY monthly_user_fk ORDER BY month) AS lead
FROM monthly_users),
lag_lead_with_diffs AS (
SELECT
monthly_user_fk,
month,
lag AS previous_month,
lead AS next_month,
EXTRACT(EPOCH FROM (month - lag)/86400)::INT AS lag_size,
EXTRACT(EPOCH FROM (lead - month)/86400)::INT AS lead_size
FROM lag_lead
),
calculated AS (
SELECT
month,
CASE WHEN previous_month IS NULL THEN 'ACTIVATION'
WHEN lag_size <= 31 THEN 'ACTIVE'
WHEN lag_size > 31 THEN 'RETURN' END AS this_month_values,
CASE WHEN (lead_size > 31 OR lead_size IS NULL) THEN 'CHURN' ELSE NULL END AS next_month_churn,
COUNT(DISTINCT monthly_user_fk) AS c_d_users
FROM lag_lead_with_diffs
GROUP BY month, 2, 3
)
SELECT
month,
this_month_values,
SUM(c_d_users) AS distinct_users
FROM calculated
GROUP BY month, this_month_values
UNION
SELECT month + INTERVAL '1 month',
'CHURN',
SUM(c_d_users)
FROM calculated
WHERE next_month_churn IS NOT NULL
GROUP BY month + INTERVAL '1 month', 2
HAVING (EXTRACT(EPOCH FROM (month + INTERVAL '1 month'))) < 1512086400
ORDER BY month, this_month_values;
However this is not fixed at the initial group. The Active group rolls from month to month.
I understand that the above is likely more complicated than what I'm asking, but I can't seem to get my head around it.
Thanks in advance
Perhaps this is what you are looking for:
with Monthly_Users as (
select user_fk
, date_trunc('month',created_at) as month
, (date_part('year', created_at) - 2016) * 12
+ date_part('month', created_at) - 11 as Months_Between
from transactions
where created_at between date '2016-11-01'
and date '2017-12-01'
group by user_fk, month, months_between
), t2 as (
select Monthly_Users.*
, count(*) over (partition by user_fk
order by month rows between unbounded preceding
and 1 preceding) prev_rec_cnt
from Monthly_Users
)
select month
, count(*)
from t2
where Months_Between = Prev_Rec_Cnt
group by month
order by month;
In this query the Monthly_Users CTE is just like yours, but adds a computation of the number of Months_Between the created_at date and your initial starting date. In the second Common Table Expression, I count the number of occurrences of each user_fk prior to the current month's record. Finally, in the output query I limit the results to only those records where the Months_Between value matches the Prev_Rec_Cnt value. Any missed months will cause the Prev_Rec_Cnt value to not match the Months_Between value, so you'll be able to see the fall-off of user_fk values from month to month.
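As a minimal, self-contained illustration of that filter (the user_fk value and dates below are hypothetical, fed in through a VALUES list rather than the real transactions table):
-- One user transacts in Nov, Dec and Jan, then skips February and returns in March.
with monthly_users as (
  select user_fk
       , month
       , (date_part('year', month) - 2016) * 12
         + date_part('month', month) - 11 as months_between
  from (values (1, date '2016-11-01')
             , (1, date '2016-12-01')
             , (1, date '2017-01-01')
             , (1, date '2017-03-01')) as t(user_fk, month)
), t2 as (
  select monthly_users.*
       , count(*) over (partition by user_fk
                        order by month rows between unbounded preceding
                        and 1 preceding) as prev_rec_cnt
  from monthly_users
)
select month, count(*)
from t2
where months_between = prev_rec_cnt
group by month
order by month;
-- Returns rows for 2016-11, 2016-12 and 2017-01 only; the March row is dropped
-- because months_between (4) no longer matches prev_rec_cnt (3), and it stays
-- dropped for every later month, which matches the "decay" behaviour being asked for.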

How to select a row having a column with max value with a group by

I have a table with the next columns
MSG_ID NOT NULL NUMBER(10)
CREATION_DATE DATE
PORT VARCHAR2(50)
MESSAGE VARCHAR2(1024)
IP_ADDRESS VARCHAR2(50)
PARSED NUMBER(1)
PARSED_ON DATE
Where parse time is parsed_on - creation_date.
I would like to know if it is possible, in one single query, to extract for each hour the message that takes the longest to parse, getting the HOUR, PORT, MSG_ID and MINUTES. I am blocked here:
select TO_CHAR(CREATION_DATE, 'HH24') || ':mm' HOUR, PORT, MSG_ID, ROUND(MAX(parsed_on - creation_date)) * 24*60 MINUTES
from T_INCOME_CALLS
where TO_CHAR(CREATION_DATE, 'dd/mm/yyyy') = TO_CHAR(SYSDATE, 'dd/mm/yyyy')
group by TO_CHAR(CREATION_DATE, 'HH24'), PORT, MSG_ID
order by TO_CHAR(CREATION_DATE, 'HH24') ;
You can use the window function row_number to find the row with the largest parse time in each hour, like this:
select *
from (
select to_number(to_char(creation_date, 'HH24')) as hour,
port,
msg_id,
round((parsed_on - creation_date) * 24 * 60) as parse_time,
row_number() over (
partition by to_char(creation_date, 'HH24')
order by (parsed_on - creation_date) desc nulls last
) as rn
from t_income_calls t
where creation_date between trunc(sysdate)
and trunc(sysdate + 1) - interval '1' second
) t
where rn = 1;
Also, notice the filter: I used a date range instead of to_char on creation_date. Using to_char on creation_date inhibits the use of an index on creation_date, if one is present.
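For reference, a plain index on creation_date is what the range predicate above can use, whereas a to_char predicate could only be served by a matching function-based index (the index names below are hypothetical):
-- A normal index on the date column supports the BETWEEN / range filter.
CREATE INDEX ix_income_calls_creation ON t_income_calls (creation_date);
-- A filter written as to_char(creation_date, 'dd/mm/yyyy') = ... would instead need
-- a function-based index on exactly that expression, e.g.:
CREATE INDEX ix_income_calls_day ON t_income_calls (to_char(creation_date, 'dd/mm/yyyy'));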
I have assumed that the need is for the item that takes the most time, per hour, for a grouping of IP_ADDRESS and PORT, which is different from your original query. I am also assuming MSG_ID is unique.
If you want 1 and only 1 row per recorded hour then use row_number(); if however you want tied values as well, substitute dense_rank() in the query below. The CREATION_DATE has been used as a tie-breaker for sorting.
SELECT
TO_CHAR(CREATION_DATE, 'HH24') || ':mm' HOUR
, PORT, MSG_ID
, ROUND((parsed_on - creation_date) * 24*60) MINUTES
FROM (
SELECT
T_INCOME_CALLS.*
, ROW_NUMBER() OVER(PARTITION BY IP_ADDRESS, port, TO_CHAR(CREATION_DATE, 'HH24')
ORDER BY (parsed_on - creation_date) desc, CREATION_DATE) AS rn
FROM T_INCOME_CALLS
WHERE CREATION_DATE >= TRUNC(SYSDATE) AND CREATION_DATE < TRUNC(SYSDATE) + 1
)
WHERE rn = 1
Please avoid converting dates into strings in your WHERE clause; this is not efficient. Instead, leave CREATION_DATE untouched and amend the criteria to suit that data, which will allow access to indexes for the filtering.
You can also get it without a subquery if you use the FIRST function:
SELECT TO_CHAR(CREATION_DATE, 'HH24') || ':mm' HOUR, PORT, MSG_ID,
MAX(MESSAGE) KEEP (DENSE_RANK FIRST ORDER BY (parsed_on - creation_date) desc, CREATION_DATE)
FROM T_INCOME_CALLS
WHERE CREATION_DATE >= TRUNC(SYSDATE) AND CREATION_DATE < TRUNC(SYSDATE) + 1
GROUP BY TO_CHAR(CREATION_DATE, 'HH24'), PORT, MSG_ID
ORDER BY TO_CHAR(CREATION_DATE, 'HH24');

Grab abandoned carts from the last hour in Oracle Responsys

I'm trying to grab people out of a table who have an abandon date between 20 minutes ago and 2 hours ago. This seems to grab the right amount of time, but the data is all 4 hours old:
SELECT *
FROM $A$
WHERE ABANDONDATE >= SYSDATE - INTERVAL '2' HOUR
AND ABANDONDATE < SYSDATE - INTERVAL '20' MINUTE
AND EMAIL_ADDRESS_ NOT IN(SELECT EMAIL_ADDRESS_ FROM $B$ WHERE ORDERDATE >= sysdate - 4)
Also, it grabs every record for everyone, and I only want the most recently abandoned product (highest ABANDONDATE) for each email address. I can't seem to figure this one out.
If the results are EXACTLY four hours old, it is possible that there is a time zone mismatch. What is the EXACT data type of ABANDONDATE in your database? Perhaps TIMESTAMP WITH TIME ZONE? Four hours looks like the difference between UTC and EDT (Eastern U.S. with the daylight saving time offset).
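One quick way to check for an offset, assuming you can run ad-hoc SQL against the underlying Oracle schema (the dictionary view and functions below are standard Oracle; the table name is a hypothetical stand-in for whatever $A$ resolves to):
-- Compare database, session and current timestamps to spot a fixed offset.
SELECT DBTIMEZONE, SESSIONTIMEZONE, SYSTIMESTAMP, CURRENT_TIMESTAMP FROM dual;
-- Confirm the declared data type of ABANDONDATE.
SELECT column_name, data_type
FROM all_tab_columns
WHERE table_name = 'YOUR_ABANDON_TABLE'  -- hypothetical name
AND column_name = 'ABANDONDATE';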
For your other question, did you EXPECT your query to only pick up the most recent product abandoned? Which part of your query would do that? Instead, you need to add row_number() over (partition by [whatever identifies clients etc.] order by abandondate), make the resulting query into a subquery and wrap it within an outer query where you filter by (WHERE clause) rn = 1. We can help with this if you show us the table structure (name and data type of columns in the table - only the relevant columns - including which is or are Primary Key).
Try
SELECT * FROM (
SELECT t.*,
row_number()
over (PARTITION BY email_address_ ORDER BY ABANDONDATE DESC) As RN
FROM $A$ t
WHERE ABANDONDATE >= SYSDATE - INTERVAL '2' HOUR
AND ABANDONDATE < SYSDATE - INTERVAL '20' MINUTE
AND EMAIL_ADDRESS_ NOT IN(
SELECT EMAIL_ADDRESS_ FROM $B$
WHERE ORDERDATE >= sysdate - 4)
)
WHERE rn = 1
Another approach:
SELECT *
FROM $A$
WHERE (EMAIL_ADDRESS_, ABANDONDATE) IN (
SELECT EMAIL_ADDRESS_, MAX( ABANDONDATE )
FROM $A$
WHERE ABANDONDATE >= SYSDATE - INTERVAL '2' HOUR
AND ABANDONDATE < SYSDATE - INTERVAL '20' MINUTE
AND EMAIL_ADDRESS_ NOT IN(
SELECT EMAIL_ADDRESS_ FROM $B$
WHERE ORDERDATE >= sysdate - 4)
GROUP BY EMAIL_ADDRESS_
)