Oracle SQL: efficient way to calculate business days in a month

I have a pretty huge table with columns date, account, amount, etc. For example:
date      account  amount
4/1/2014  XXXXX1       80
4/1/2014  XXXXX1       20
4/2/2014  XXXXX1      840
4/3/2014  XXXXX1      120
4/1/2014  XXXXX2      130
4/3/2014  XXXXX2      300
...........
(I have 40 months' worth of daily data and multiple accounts.)
The final output I want is the average amount for each account for each month. Since there may or may not be a record for a given account on any single day, and I have a separate table of holidays for 2011~2014, I am summing the amount for each account within a month and dividing it by the number of business days in that month. Note that there are very likely to be records on weekends/holidays, so I need to exclude them from the calculation. Also, I want a record for each date available in the original table. For example:
date      account  amount
4/1/2014  XXXXX1       48   ((80+20+840+120)/22)
4/2/2014  XXXXX1       48
4/3/2014  XXXXX1       48
4/1/2014  XXXXX2       19   ((130+300)/22)
4/3/2014  XXXXX2       19
...........
(Suppose the above is the only data I have for Apr-2014.)
I am able to do this in a hacky and slow way, but since I need to join this process with other subqueries, I really need to optimize this query. My current code looks like this:
select
  date,
  account,
  sum(amount / days_mon) over (partition by last_day(date))
from (
  select
    date,
    -- there are more calculations to get the account numbers,
    -- so this subquery is necessary
    account,
    amount,
    -- this is a list of the month-end dates of months with
    -- 19 business days; similar below.
    case when last_day(date) in ('','',...,'') then 19
         when last_day(date) in ('','',...,'') then 20
         when last_day(date) in ('','',...,'') then 21
         when last_day(date) in ('','',...,'') then 22
         when last_day(date) in ('','',...,'') then 23
    end as days_mon
  from mytable tb
  inner join lookup_businessday_list busi
    on tb.date = busi.date)
So how can I do all of the above efficiently? Thank you!

This approach uses sub-query factoring - what other RDBMS flavours call common table expressions. The attraction here is that we can pass the output from one CTE as input to another. Find out more.
The first CTE generates a list of dates in a given month (you can extend this over any range you like).
The second CTE uses an anti-join on the first to filter out dates which are holidays, and also dates which aren't weekdays. Note that the day number varies according to the NLS_TERRITORY setting; in my realm the weekend is days 6 and 7, but SQL Fiddle is American, so there it is days 1 and 7.
with dates as (
  select date '2014-04-01' + (level - 1) as d
  from dual
  connect by level <= 30
)
, bdays as (
  select d
       , count(d) over () as tot_d
  from dates
  left join holidays
    on dates.d = holidays.hol_date
  where holidays.hol_date is null
    and to_number(to_char(dates.d, 'D')) between 2 and 6
)
select yt.account
     , yt.txn_date
     , sum(yt.amount) over (partition by yt.account, trunc(yt.txn_date, 'MM'))
         / tot_d as avg_amt
from your_table yt
join bdays
  on bdays.d = yt.txn_date
order by yt.account
       , yt.txn_date
/
I haven't rounded the average amount.
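If you do want rounding, the averaged expression can simply be wrapped in ROUND(); for example, to two decimal places:

round(sum(yt.amount) over (partition by yt.account, trunc(yt.txn_date, 'MM'))
      / tot_d, 2) as avg_amt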

You have 40 months of data, and this data should be very stable.
I will assume that you have a cold body (a big, stable, easily definable range of data) and a hot tail (a small and active part).
Next, I would like to define a minimal period: the smallest date range that is interesting for the business.
It might be a year, month, day, hour, etc. Do you expect to get questions like "what was the average for that account between 19:00 and 12am yesterday?"
I will assume that the answer is DAY.
Then:
I will calculate sum(amount) and count(*) for every account for every DAY of the cold body.
I will not create dummy records if a particular account had no activity on some day.
I will save day, account, total amount, and count in a TABLE.
If there are later modifications to the cold body, you delete and reload the affected days in that table.
For the hot tail there might be multiple strategies:
1. Do the same as above (same process, easy to support).
2. Always calculate on the fly.
3. Use a materialized view as a middle ground between 1 and 2.
The cold body table totalc could also be implemented as a materialized view, but if the data never changes there is no need to rebuild it.
With this you go from (number of accounts) x (number of transactions per day) x (number of days) records down to (number of accounts) x (number of active days) records.
That should speed up all subsequent calculations.
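A minimal sketch of the totalc idea, assuming hypothetical column names (tx_date, account, amount on mytable) and an arbitrary cold-body cutoff date:

-- pre-aggregate one row per account per active day
create table totalc as
select trunc(tx_date) as tx_day
     , account
     , sum(amount)    as total_amount
     , count(*)       as tx_count
from mytable
where tx_date < date '2014-01-01'   -- the stable "cold body" cutoff
group by trunc(tx_date), account;

-- if a day in the cold body is later corrected, reload just that day
delete from totalc where tx_day = date '2013-06-03';

insert into totalc
select trunc(tx_date), account, sum(amount), count(*)
from mytable
where trunc(tx_date) = date '2013-06-03'
group by trunc(tx_date), account;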

Related

How to segregate monthly data from daily data using SQL or BigQuery?

We are receiving data monthly for some ids and daily for other ids. I need to segregate the data as monthly or daily before applying the logic required for further analysis.
I have tried to use the DATEDIFF SQL function to do that, but it's not really helpful in this case. Is there any way to segregate the daily data from the monthly data using SQL or BigQuery?
Id  date
55  11-02-2022 00:00
66  15-05-2022 00:00
77  13-08-2022 00:00
66  15-07-2022 00:00
77  12-08-2022 00:00
55  12-02-2022 00:00
66  15-06-2022 00:00
A count aggregation per id per month is an efficient way to separate the two groups. The question is whether a single id consistently updates at the same cadence, and if it does not, whether you want to treat it differently per time period, or always as daily if it is ever daily, etc.
Here's the basic logic to categorize the two groups by month:
select
  id
  , date_trunc(date, MONTH) as mo
  , count(*) > 1 as is_daily
from tbl
group by 1, 2;
That gives you a per-id, per-month categorization.
Here is the same logic taken a step further to categorize any ID that ever receives daily updates as daily.
with bins as (
  select
    id
    , date_trunc(date, MONTH) as mo
    , count(*) > 1 as is_daily
  from tbl
  group by 1, 2
)
select
  id
  , sum(case when is_daily then 1 else 0 end) > 0 as is_daily
from bins
group by 1;
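If you then want to actually split the rows, one possible follow-up (an illustrative sketch, not from the original answer) is to join that per-id classification back onto the raw table:

select t.id, t.date, b.is_daily
from tbl t
join (
  select id
       , sum(case when is_daily then 1 else 0 end) > 0 as is_daily
  from (
    select id, date_trunc(date, MONTH) as mo, count(*) > 1 as is_daily
    from tbl
    group by 1, 2
  ) m
  group by 1
) b
  on t.id = b.id;

Rows with is_daily = true are the daily feed; the rest are monthly.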

Oracle query for the attached output

I have a scenario where I need to show daily transactions and also the total transactions for the month, with the date and other fields like type, product, etc.
Once I have that, the main requirement is to get each day's percentage of the total for that month; below is an example. There are 3 transactions on 1st Jan and 257 in total for Jan, so the percentage for 1st Jan is (3/257)*100; similarly 10 is for 2nd Jan and the percentage is (10/257), and so on.
Can anyone help me with the SQL query?
Date      Type  Transaction  Total_For_month  Percentage
1/1/2017  A       3          257              1%
1/2/2017  B      10          257              4%
1/3/2017  A       5          257              2%
1/4/2017  C       8          257              3%
1/5/2017  D      12          257              5%
1/6/2017  D      17          257              7%
Use window functions:
select t.*,
       sum(transaction) over (partition by to_char(date, 'YYYY-MM')) as total_for_month,
       transaction / sum(transaction) over (partition by to_char(date, 'YYYY-MM')) as ratio
from t;
DATE and TYPE are Oracle keywords; I hope you are not using them literally as column names. I will use DT and TP below.
You didn't say one way or the other, but it seems you must filter your data so that the final report covers a single month (rather than a full year, say). If so, you could do something like this. Notice the analytic function RATIO_TO_REPORT. Note that I multiply the ratio by 100, and I use some non-standard formatting to get the result in "percentage" format; don't worry too much if you don't understand that part on first reading.
select dt, tp, transaction,
       sum(transaction) over () as total_trans_for_month,
       to_char(100 * ratio_to_report(transaction) over (), '90.0L',
               'nls_currency=%') as pct_of_monthly_trans
from your_table
where dt >= date '2017-01-01' and dt < add_months(date '2017-01-01', 1)
order by dt -- if needed (add more criteria as appropriate).
Notice the analytic clause: over (). We are not partitioning by anything, and we are not ordering by anything either; but since we want every row of input to generate a row in the output, we still need the analytic version of sum, and the analytic function ratio_to_report. The proper way to achieve this is to include the over clause, but leave it empty: over ().
Note also that in the where clause I did not wrap dt within trunc or to_char or any other function. If you are lucky, there is an index on that column, and writing the where conditions as I did allows that index to be used, if the Optimizer finds it should be.
The date '2017-01-01' is arbitrary (chosen to match your example); in production it should probably be a bind variable.
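For instance, with a hypothetical bind variable :month_start for the first day of the month, the filter would become:

where dt >= :month_start
  and dt < add_months(:month_start, 1)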

Multiple aggregate sums from different conditions in one sql query

Whereas I believe this is a fairly general SQL question, I am working in PostgreSQL 9.4 without an option to use other database software, and thus request that any answer be compatible with its capabilities.
I need to be able to return multiple aggregate totals from one query, such that each sum is in a new row and each grouping is determined by a unique span of time, e.g. WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'. The number of records that satisfy the WHERE clause is unknown and may be zero, in which case ideally the result is "0". This is what I have worked out so far:
(
SELECT SUM(minutes) AS min
FROM downtime
WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-14' AND '2016-02-21'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-28' AND '2016-03-06'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-06' AND '2016-03-13'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-13' AND '2016-03-20'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-20' AND '2016-03-27'
)
Result:
   | min
---+-----
 1 | 119
 2 |   4
 3 |  30
 4 |
 5 |  62
 6 | 350
That query gets me almost the exact result I want; certainly good enough, in that I can do exactly what I need with the results. Time spans with no records come back blank, but that was predictable, and whereas I would prefer "0", I can account for the blank rows in software.
But while it isn't terrible for the 6 weeks it represents, I want to be flexible and able to do the same thing for different time spans and for a different number of data points, such as each day in a week; each week in 3 months or 6 months; each month in 1 year or 2 years; etc. As written above, it feels as if it is going to get tedious fast: for instance, 1-week spans over a 2-year period is 104 sub-queries.
What I'm after is a more elegant way to get the same (or similar) result.
I also don't know if doing 104 iterations of a similar query to the above (vs. the 6 it does now) is a particularly efficient usage.
Ultimately I am going to write some code to help me build (and thus abstract away) the long, ugly query, but it would still be great to have a more concise and scalable query.
In Postgres, you can generate a series of times and then use these for the aggregation:
select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-03-20'::timestamp, interval '7 day') g(dte)
left join downtime dt
  on dt.time_stamp >= g.dte and dt.time_stamp < g.dte + interval '7 day'
group by g.dte
order by g.dte;
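The same pattern covers the other cadences you mention just by changing the series step and the bucket width; for example, daily buckets over one week (same assumed schema):

select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-02-13'::timestamp, interval '1 day') g(dte)
left join downtime dt
  on dt.time_stamp >= g.dte and dt.time_stamp < g.dte + interval '1 day'
group by g.dte
order by g.dte;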

Count occurrences for each week using DB2

I am looking for some general advice rather than a solution. My problem is that I have a list of dates per person where, due to administrative procedures, a person may have multiple records stored for one instance, yet the date recorded is when the data was entered as the person passed through the paper trail. I understand this is quite difficult to explain, so I'll give an example:
Person  Date        Audit
------  ----------  -----
1       2000-01-01  A
1       2000-01-01  B
1       2000-01-02  C
1       2003-04-01  A
1       2003-04-03  A
where I want to know how many valid records a person has, by removing the annoying audit rows whose recorded date is the day the data was entered rather than the date the person first arrives in the dataset. So for the above person I am only interested in:
Person  Date        Audit
------  ----------  -----
1       2000-01-01  A
1       2003-04-01  A
What makes this problem difficult is that I do not have the luxury of an audit column (the audit column here is just to show how the data is collected); I merely have dates. So one way I could crudely count real events (and remove repeat audit data) is to look at individual weeks within a person's history and, if any record exists for a given week, add 1 to my counter. This way, even though there are multiple records split over a few days, I am only counting the succession of dates as one record (which, after all, I am counting by date).
So does anyone know of any db2 functions that could help me solve this problem?
If you can live with standard weeks it's pretty simple:
select
  person, year(dt), week(dt), min(dt), min(audit)
from
  blah
group by
  person, year(dt), week(dt)
If you need seven-day ranges starting with the first date you'd need to generate your own week numbers, a calendar of sorts, e.g. like so:
with minmax(mindt, maxdt) as ( -- date range of the "calendar"
  select min(dt), max(dt)
  from blah
),
cal(dt, i) as ( -- fill the range with every date, count days
  select mindt, 0
  from minmax
  union all
  select dt + 1 day, i + 1
  from cal
  where dt < (select maxdt from minmax) and i < 100000
)
select
  person, year(blah.dt), wk, min(blah.dt), min(audit)
from
  (select dt, int(i / 7) + 1 as wk from cal) t -- generate week numbers
inner join
  blah
  on t.dt = blah.dt
group by person, year(blah.dt), wk

Oracle SQL Month Statement Generation

I am having a performance issue with a set of SQL queries that generate the current month's statement in real time.
Customers purchase goods using points in an online system, and a statement containing "open_balance", "point_earned", "point_used", "current_balance" should be generated.
The following shows the shortened schema:
//~200k records
customer: {account_id:string, create_date:timestamp, bill_day:int} //totally 14 fields
//~250k records per month, kept for 6 month
history_point: {point_id:long, account_id:string, point_date:timestamp, point:int} //totally 9 fields
//each customer have maximum of 12 past statements kept
history_statement: {account_id:string, open_date:date, close_date:date, open_balance:int, point_earned:int, point_used:int, close_balance:int} //totally 9 fields
On every bill day, the view should automatically create a new month's statement.
I.e. if bill_day is 15, then transactions done on or after 16 Dec 2013 00:00:00 should belong to the new bill cycle of 16 Dec 2013 00:00:00 - 15 Jan 2014 23:59:59.
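For illustration (a sketch, not from the original post), the open date of the bill cycle containing sysdate could be derived from the customer.bill_day column along these lines:

select account_id,
       case
         when extract(day from sysdate) > bill_day
           then trunc(sysdate, 'MM') + bill_day                -- day after this month's bill day
         else add_months(trunc(sysdate, 'MM'), -1) + bill_day  -- day after last month's bill day
       end as cycle_open_date
from customer;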
I tried the approach described below:
1. Calculate the last close day for each account (in a materialized view, so that it updates only after a new customer or a past month's statement is inserted into history_statement).
2. Generate a record for each customer for each month that I need to calculate (also in a materialized view).
3. Sieve the point records down to only those within the dates I will calculate (this takes ~0.1s only).
4. Join 2 with 3 to obtain the points earned and used for each customer for each month.
5. Join 4 with 4 on date less than open date to sum the open and close balances.
6a. Select from 5 where the open date is less than 1 month old as the current balance (these are not closed yet, and the points reflect what each customer owns now).
6b. All the statements are obtained by a union of history_statement and 5.
On a development server, the average response time (200K customers, 1.5M transactions in the current month) is ~3s, which is pretty slow for a web application; on the testing server, where resources are likely to be shared, the average response time (200K customers, ~200K transactions each month for 8 months) is 10-15s.
Does anyone have ideas for a better approach or for speeding up the query?
Related SQL:
2: IV_STCLOSE_2_1_T (materialized view)
3: IV_STCLOSE_2_2_T (~0.15s)
SELECT ACCOUNT_ID, POINT_DATE, POINT
FROM history_point
WHERE point_date >= (
SELECT MIN(open_date)
FROM IV_STCLOSE_2_1_t
)
4: IV_STCLOSE_3_T (~1.5s)
SELECT p0.account_id, p0.open_date, p0.close_date,
       COALESCE(SUM(DECODE(SIGN(p.point), -1, p.point)), 0) AS point_used,
       COALESCE(SUM(DECODE(SIGN(p.point), 1, p.point)), 0) AS point_earned
FROM iv_stclose_2_1_t p0
LEFT JOIN iv_stclose_2_2_t p
  ON p.account_id = p0.account_id
  AND p.point_date >= p0.open_date
  AND p.point_date < p0.close_date + INTERVAL '1' DAY
GROUP BY p0.account_id, p0.open_date, p0.close_date
5: IV_STCLOSE_4_T (~3s)
WITH t AS (SELECT * FROM IV_STCLOSE_3_T)
SELECT t1.account_id AS stat_account_id, t1.open_date, t1.close_date, t1.open_balance,
       t1.point_earned AS point_earn, t1.point_used,
       t1.open_balance + t1.point_earned + t1.point_used AS close_balance
FROM (
  SELECT v1.account_id, v1.open_date, v1.close_date, v1.point_earned, v1.point_used,
         COALESCE(SUM(v2.point_used + v2.point_earned), 0) AS open_balance
  FROM t v1
  LEFT JOIN t v2
    ON v1.account_id = v2.account_id
    AND v1.open_date > v2.open_date
  GROUP BY v1.account_id, v1.open_date, v1.close_date, v1.point_earned, v1.point_used
) t1
It turns out that in IV_STCLOSE_4_T the line
WITH t AS (SELECT * FROM IV_STCLOSE_3_T)
is problematic.
At first I thought WITH t AS would be faster, since IV_STCLOSE_3_T is only evaluated once, but it apparently forced materializing the whole of IV_STCLOSE_3_T, generating over 200k records even though I only need at most 12 of them for a single customer at any time.
With that statement removed and account_id appropriately indexed, the cost dropped from over 500k to less than 500.
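In other words, a sketch of the fix (not the exact production query): reference IV_STCLOSE_3_T directly, so that with account_id indexed the optimizer can push a per-customer predicate into the view instead of materializing all 200k rows:

SELECT t1.account_id AS stat_account_id, t1.open_date, t1.close_date, t1.open_balance,
       t1.point_earned, t1.point_used,
       t1.open_balance + t1.point_earned + t1.point_used AS close_balance
FROM (
  SELECT v1.account_id, v1.open_date, v1.close_date, v1.point_earned, v1.point_used,
         COALESCE(SUM(v2.point_used + v2.point_earned), 0) AS open_balance
  FROM iv_stclose_3_t v1
  LEFT JOIN iv_stclose_3_t v2
    ON v1.account_id = v2.account_id
    AND v1.open_date > v2.open_date
  GROUP BY v1.account_id, v1.open_date, v1.close_date, v1.point_earned, v1.point_used
) t1
WHERE t1.account_id = :account_id  -- hypothetical bind variable for a single customer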