Select most current data in grouped set in Oracle - sql

I am writing a procedure to query some data in Oracle and grouping it:
Account  Amt Due  Last Payment  Last Payment Date (mm/dd/yyyy format)
1234     10.00    5.00          12/12/2013
1234     35.00    8.00          12/12/2013
3293     15.00    10.00         11/18/2013
4455     8.00     3.00          5/23/2013
4455     14.00    5.00          10/18/2013
I want to group the data so there is one record per account, with the Amt Due summed. The last payment should be summed as well, unless the last payment dates differ; in that case I just want the most recent last payment. So I would want a result something like this:
Account  Amt Due  Last Payment  Last Payment Date
1234     45.00    13.00         12/12/2013
3293     15.00    10.00         11/18/2013
4455     22.00    5.00          10/18/2013
I was doing something like
select Account, sum (AmtDue), sum (LastPmt), Max (LastPmtDt)
from all my tables
group by Account
But, that doesn't work for the last record above, because the last payment was only the $5.00 on 10/18, not the sum of them on 10/18.
If I group by Account and LastPmtDt, then I get two records for the last, but I only want one per account.
I have other data I'm querying, and I'm using a CASE, INSTR, and LISTAGG on another field (if combining them gives me this substring and that, then output 'Both'; else if it only gives me this substring, then output the substring; else if it only gives me the other substring, then output that one). It seems like I may need something similar, but not by looking for a specific date. If the dates are the same, then sum (LastPmt) and max (LastPmtDt) works fine, if they are not the same, then I want to ignore all but the most recent LastPmt and LastPmtDt record(s).
Oh, and my LastPmt and LastPmtDt fields are already case statements within the select. They aren't fields that I already can just access. I'm reading other posts about RANK and KEEP, but to involve both fields, I'd need all that calculation of each field as well. Would it be more efficient to query everything, and then wrap another query around that to do the grouping, summing, and selecting fields I want?
Related: HAVING - GROUP BY to get the latest record
Can someone provide some direction on how to solve this?

Try this:
select Account,
       sum(Amt_Due),
       sum(CASE WHEN Last_Payment_Date = last_dat THEN Last_payment ELSE 0 END),
       max(Last_Payment_Date)
from (
    SELECT t.*,
           max(Last_Payment_Date) OVER (partition by Account) last_dat
    FROM table1 t
)
group by Account
Demo --> http://www.sqlfiddle.com/#!4/fc650/8
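To check the query above against the sample data from the question without an Oracle instance, here is a minimal sketch that runs the same SQL through Python's built-in sqlite3 module. SQLite standing in for Oracle is an assumption (it needs SQLite 3.25 or later for window functions), and the dates are stored as ISO text so that MAX compares them chronologically.

```python
import sqlite3

# In-memory SQLite database standing in for the Oracle table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (Account INTEGER, Amt_Due REAL, "
    "Last_payment REAL, Last_Payment_Date TEXT)"
)
# Sample rows from the question, with dates rewritten as ISO text.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1234, 10.00, 5.00, "2013-12-12"),
        (1234, 35.00, 8.00, "2013-12-12"),
        (3293, 15.00, 10.00, "2013-11-18"),
        (4455, 8.00, 3.00, "2013-05-23"),
        (4455, 14.00, 5.00, "2013-10-18"),
    ],
)

# Same shape as the answer: tag every row with the account's latest date,
# then sum Last_payment only for rows that fall on that date.
query = """
SELECT Account,
       SUM(Amt_Due) AS amt_due,
       SUM(CASE WHEN Last_Payment_Date = last_dat THEN Last_payment ELSE 0 END) AS last_payment,
       MAX(Last_Payment_Date) AS last_payment_date
FROM (
    SELECT t.*,
           MAX(Last_Payment_Date) OVER (PARTITION BY Account) AS last_dat
    FROM table1 t
)
GROUP BY Account
ORDER BY Account
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Account 4455 is the interesting case: its Amt_Due is summed over both rows, while only the payment on the latest date is counted.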

Rank is the right idea.
Try this
select a.Account, a.AmtDue, a.LastPmt, a.LastPmtDt from (
    select Account,
           -- total AmtDue over every row of the account, not only the latest date
           sum(sum(AmtDue)) over (partition by Account) AmtDue,
           sum(LastPmt) LastPmt, LastPmtDt,
           RANK() OVER (PARTITION BY Account ORDER BY LastPmtDt desc) as rnk
    from all my tables
    group by Account, LastPmtDt
) a
where a.rnk = 1
I haven't tested this, but it should give you the right idea.

Try this:
select Account, max(TotalAmtDue), sum(LastPmt), LastPmtDt
from (select Account,
             LastPmt,
             LastPmtDt,
             -- sum AmtDue before filtering, so rows with older dates still count
             sum(AmtDue) over(partition by Account) TotalAmtDue,
             max(LastPmtDt) over(partition by Account) MaxLastPmtDt
      from your_table) t
where t.LastPmtDt = t.MaxLastPmtDt
group by Account, LastPmtDt

Related

Calculating Datediff of two days based on when the sum of a column hits a number cap

Tried to see if this was asked anywhere else but doesn't seem like it. Trying to create a sql query to give me the date difference in days between '2022-10-01' and the date when our impression sum hits our cap of 5.
For context, we may see duplicate dates because someone revisits our website that day, so we'll get a different session number to pair with that count. Here's an example table of one individual and how many impressions were logged.
My goal is to get the number of days it takes to hit an impression cap of 5. So for this individual, they would hit the cap on '2022-10-07' and the days between '2022-10-01' and '2022-10-07' is 6. I am also calculating the difference before/after '2023-01-01' since I need this count for Q4 of '22 and Q1 of '23 but will not include in the example table. I have other individuals to include but for the purpose of asking here, I kept it to one.
Current Query:
select
click_date,
case
when date(click_date) < date('2023-01-01') and sum(impression_cnt = 5) then datediff('day', '2022-10-01', click_date)
when date(click_date) >= date('2023-01-01') and sum(impression_cnt = 5) then datediff('day', '2023-01-01', click_date)
else 0
end days_to_capped
from table
group by customer, click_date, impression_cnt
customer  click_date  impression_cnt
123456    2022-10-05  2
123456    2022-10-05  1
123456    2022-10-06  1
123456    2022-10-07  1
123456    2022-10-11  1
123456    2022-10-11  3
Result Table
customer  days_to_cap
123456    6
I'm currently only getting 0 days and then 81 days once it hits 2022-12-21 (last date) for this individual, so I know I need to fix my query. Any help would be appreciated!
Edited: This is in Snowflake!
So, the issue with your query is that the sum is being calculated at the level that you are grouping by, which is every field, so it will always just be the value of the impressions field every time.
What you need is a running sum, which is a SUM() OVER (PARTITION BY ... ORDER BY ...) expression, and then you qualify the results of that:
First, just to get the data that you have:
with x as (
select *
from values
(123456,'2022-10-05'::date,2),
(123456,'2022-10-05'::date,1),
(123456,'2022-10-06'::date,1),
(123456,'2022-10-07'::date,1),
(123456,'2022-10-11'::date,1),
(123456,'2022-10-11'::date,3) x (customer,click_date,impression_cnt)
)
Then, I query the CTE to do the running sum with a QUALIFY statement to choose the record that actually has the value I'm looking for
select
    customer,
    case
        when click_date < '2023-01-01'::date
             and sum(impression_cnt) OVER (partition by customer order by click_date) = 5
            then datediff('day', '2022-10-01', click_date)
        when click_date >= '2023-01-01'::date
             and sum(impression_cnt) OVER (partition by customer order by click_date) = 5
            then datediff('day', '2023-01-01', click_date)
        else 0
    end days_to_capped
from x
qualify days_to_capped > 0;
The qualify filters your results to just the record that you cared about.
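For anyone who wants to try the running-sum idea without a Snowflake account, here is a rough sketch using Python's sqlite3 module. Several things are assumptions: SQLite stands in for Snowflake (so there is no QUALIFY or DATEDIFF, and julianday arithmetic is used instead), the table name clicks is made up, and only the '2022-10-01' baseline is handled. It also compares with >= 5 and takes the earliest qualifying date, which is a deliberate variation on the = 5 comparison above so that a running total that jumps from 4 straight to 6 is still caught.

```python
import sqlite3

# In-memory SQLite database standing in for Snowflake (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clicks (customer INTEGER, click_date TEXT, impression_cnt INTEGER)"
)
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [
        (123456, "2022-10-05", 2),
        (123456, "2022-10-05", 1),
        (123456, "2022-10-06", 1),
        (123456, "2022-10-07", 1),
        (123456, "2022-10-11", 1),
        (123456, "2022-10-11", 3),
    ],
)

# Running total per customer ordered by date; rows sharing a date are peers,
# so each of them sees the total through the end of that date.
query = """
WITH running AS (
    SELECT customer,
           click_date,
           SUM(impression_cnt) OVER (
               PARTITION BY customer ORDER BY click_date
           ) AS running_cnt
    FROM clicks
)
SELECT customer,
       -- earliest date on which the cap is reached or exceeded
       CAST(julianday(MIN(click_date)) - julianday('2022-10-01') AS INTEGER)
           AS days_to_cap
FROM running
WHERE running_cnt >= 5
GROUP BY customer
"""
result = conn.execute(query).fetchall()
print(result)
```

With the sample rows, the running total reaches 5 on 2022-10-07, which is 6 days after 2022-10-01.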

How can I use lag function to obtain same day changes in customer movement type

I have a dataset that I am trying to get the total number of times a customer has left during a same day period (basically a refund).
If a customer has a new business and a churn with the same transaction time, then it is considered a refund. I was trying to use the lag function for this, but I am getting a result whenever there is any change from new_business to churn. What I need is a change from new_business to churn that also happens within the same day.
Data looks like:
user_id  time        transaction_type
1234     2020-01-10  new_business
1234     2020-01-10  churn
5678     2020-01-10  new_business
5678     2020-05-01  churn
1011     2020-01-10  new_business
In the above example, user_id 1234 would be a refund but 5678 would not be. user 1011 is still a customer. I am trying to get the total count of refund customers
My query:
select count(*)
lag(time) over (partition by user_id order by time)
from data
where transaction_type in('churn','new_business')
However, what's happening with this query is that I am getting every case where there is a change between the two. So I am getting user_id 1234 and 5678. What am I missing in order to limit this to only user_id 1234?
If you want people who have the two types on the same date, then you can use aggregation:
select user_id, time
from data
where transaction_type in ('churn', 'new_business')
group by user_id, time
having count(distinct transaction_type) = 2;
If you want a count of these, you can use a subquery.
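Wrapping the aggregation in a subquery to get the count could look like the sketch below. It runs the query through Python's sqlite3 module against the sample rows from the question; using SQLite here is an assumption made only so the example is runnable, and the alias refunds is a made-up name.

```python
import sqlite3

# In-memory SQLite database holding the sample rows (assumption: any SQL
# engine with GROUP BY / HAVING behaves the same way here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id INTEGER, time TEXT, transaction_type TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (1234, "2020-01-10", "new_business"),
        (1234, "2020-01-10", "churn"),
        (5678, "2020-01-10", "new_business"),
        (5678, "2020-05-01", "churn"),
        (1011, "2020-01-10", "new_business"),
    ],
)

# Inner query: one row per (user_id, time) that has both transaction types.
# Outer query: count those rows.
query = """
SELECT COUNT(*)
FROM (
    SELECT user_id, time
    FROM data
    WHERE transaction_type IN ('churn', 'new_business')
    GROUP BY user_id, time
    HAVING COUNT(DISTINCT transaction_type) = 2
) AS refunds
"""
refund_count = conn.execute(query).fetchone()[0]
print(refund_count)
```

Only user 1234 has both types on the same date, so user 5678 (churn months later) and user 1011 (no churn) are not counted.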

STDEVP for calculated fields

I have a table that looks like this:
ID      CHANNEL   VENDOR  num_PERIOD  SALES.A  SALES.B
000001  Business  Shop    1           40       30
000001  Business  Shop    2           60       20
000001  Business  Shop    3           NULL     30
With many combinations of ID, CHANNEL and VENDOR, and sales records for each of them over time (num_PERIOD).
I want to get the average Standard Deviation of a new field, which returns the sum of SALES.A + SALES.B: sum(ISNULL(SALES.A,0) + ISNULL(SALES.B,0)).
The problem I have is that STDEVP seem to fail with calculated fields, and the result that returns is invalid.
I have been trying with:
select ID, CHANNEL, VENDOR, stdevp(sum(isnull(SALES.A,0) + ISNULL(QSALES.B,0))) OVER (PARTITION BY ID, CHANNEL, VENDOR) as STDEV_SALES
FROM TABLE
GROUP BY ID, CHANNEL, VENDOR
However, the results I'm obtaining are always 0 or NULL.
What I want to obtain is the Average Standard Deviation of each ID, CHANNEL and VENDOR over time (num_PERIOD).
Can someone find an approximation for this please?
Your query doesn't match the sample data.
I can see the problem, though. The SUM() is calculating a single value for each group, and then you are taking the standard deviation of that one value. Because you cannot nest aggregation functions, you have turned it into a window function.
Get rid of the sum(). The following should work in SQL Server:
SELECT ID, CHANNEL, VENDOR,
STDEVP(COALESCE(SALES.A, 0) + COALESCE(QSALES.B, 0)) as STDEV_SALES
FROM SALES . . .
QSALES
GROUP BY ID, CHANNEL, VENDOR;
I would also return the COUNT(*) . . . the standard deviation doesn't make sense if you have fewer than 3 rows. (Okay, it is defined for two values, but not very useful.)
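Here is a small sketch of the "drop the SUM()" idea using Python's sqlite3 module on the sample rows from the question. A few assumptions: SQLite has no built-in STDEVP, so a stand-in aggregate named stdevp is registered from Python using statistics.pstdev (population standard deviation); the columns are renamed SALES_A and SALES_B because a dot is awkward in a column name; and the table name sales is made up.

```python
import sqlite3
import statistics


class PopulationStdev:
    """Stand-in for SQL Server's STDEVP: population standard deviation."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Like SQL aggregates, ignore NULLs.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.pstdev(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("stdevp", 1, PopulationStdev)
conn.execute(
    "CREATE TABLE sales (ID TEXT, CHANNEL TEXT, VENDOR TEXT, "
    "num_PERIOD INTEGER, SALES_A REAL, SALES_B REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("000001", "Business", "Shop", 1, 40, 30),
        ("000001", "Business", "Shop", 2, 60, 20),
        ("000001", "Business", "Shop", 3, None, 30),
    ],
)

# The per-row total (A + B) is fed straight into the aggregate, so the
# deviation is taken across periods rather than over one pre-summed value.
query = """
SELECT ID, CHANNEL, VENDOR,
       stdevp(COALESCE(SALES_A, 0) + COALESCE(SALES_B, 0)) AS STDEV_SALES,
       COUNT(*) AS n_periods
FROM sales
GROUP BY ID, CHANNEL, VENDOR
"""
row = conn.execute(query).fetchone()
print(row)
```

The per-period totals are 70, 80 and 30, so the deviation is taken over three values instead of over a single summed value, which is what produced the 0 results.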

select and delete query based on older entries

I have an Excel sheet that is pushing data to an Access database using ADO. It is essentially putting invoices into a database. Sometimes I will revise my invoice and therefore the database will end up with the same invoice twice. I need to make a select and delete query that will find duplicates based on the invoice number, and delete the older version of the invoice (older record), for a simple example:
id  invoice#  total    item   datestamp
1   1234      456.29$  shoes  06/06/2016 03:51
2   1234      78.58$   boots  06/06/2016 03:51
3   1234      22.74$   scarf  06/06/2016 03:51
4   1234      539.34$  shoes  06/07/2016 12:44
5   1234      66.24$   pants  06/07/2016 12:44
As you can see, rows 4 and 5 are my new invoice for this customer. I want every previous order with the same invoice # to be deleted. Please note: they are not actually duplicates; only the invoice number is duplicated. The query needs to find duplicates based on invoice number and delete the rows whose datestamp is older than the most recent one.
At that point it is way beyond me. I would appreciate the help.
Consider using a correlated aggregate subquery in WHERE clause:
DELETE *
FROM InvoiceTable
WHERE NOT datestamp IN
    (SELECT Max(datestamp)
     FROM InvoiceTable sub
     WHERE sub.InvoiceNumber = InvoiceTable.InvoiceNumber)
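A quick way to see what the correlated subquery keeps and removes is to run it on the sample rows. The sketch below uses Python's sqlite3 module rather than Access, which is an assumption made for the sake of a runnable example: SQLite wants DELETE FROM instead of DELETE *, and the datestamps are stored as ISO text so that MAX compares them chronologically.

```python
import sqlite3

# In-memory SQLite database standing in for the Access table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE InvoiceTable (id INTEGER PRIMARY KEY, InvoiceNumber INTEGER, "
    "total REAL, item TEXT, datestamp TEXT)"
)
conn.executemany(
    "INSERT INTO InvoiceTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1234, 456.29, "shoes", "2016-06-06 03:51"),
        (2, 1234, 78.58, "boots", "2016-06-06 03:51"),
        (3, 1234, 22.74, "scarf", "2016-06-06 03:51"),
        (4, 1234, 539.34, "shoes", "2016-06-07 12:44"),
        (5, 1234, 66.24, "pants", "2016-06-07 12:44"),
    ],
)

# The filter shared by the preview and the delete: rows whose datestamp is
# not the newest datestamp recorded for their invoice number.
stale_filter = """
datestamp NOT IN (
    SELECT MAX(datestamp)
    FROM InvoiceTable AS sub
    WHERE sub.InvoiceNumber = InvoiceTable.InvoiceNumber
)
"""

# Preview first, so nothing is deleted by surprise.
stale_ids = [
    r[0]
    for r in conn.execute(f"SELECT id FROM InvoiceTable WHERE {stale_filter} ORDER BY id")
]

conn.execute(f"DELETE FROM InvoiceTable WHERE {stale_filter}")
remaining_ids = [r[0] for r in conn.execute("SELECT id FROM InvoiceTable ORDER BY id")]
print(stale_ids, remaining_ids)
```

Running the same filter as a SELECT before the DELETE is a cheap safety check on real data.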
As I said, try being conservative and not deleting. Instead, select rows that are based on the maximum date stamp for a given invoice number:
SELECT
    invoices.id, invoices.invoice, invoices.total, invoices.item, invoices.datestamp
FROM
    invoices
INNER JOIN
    (SELECT
        invoice, MAX(datestamp) AS maxdate   -- latest datestamp per invoice number, not per row id
    FROM
        invoices
    GROUP BY
        invoice) lastinv
ON invoices.invoice = lastinv.invoice AND
   invoices.datestamp = lastinv.maxdate
This is untested code, but it should pretty much do what you want. All you have to do is mangle it into Microsoft Access, as this is T-SQL.

MS Access: Rank SUM() Values

I am working on an old web app that is still using MS Access as its data source, and I have run into an issue while trying to rank SUM() values.
Let's say I have 2 different account numbers, and each of those account numbers has an unknown number of invoices. I need to sum up the total of all the invoices, group it by account number, then add a rank (1-2).
RAW TABLE EXAMPLE...
Account | Sales | Invoice Number
001 | 400 | 123
002 | 150 | 456
001 | 300 | 789
DESIRED RESULTS...
Account | Sales | Rank
001 | 700 | 1
002 | 150 | 2
I tried...
SELECT Account, SUM(Sales) AS Sales,
(SELECT COUNT(*) FROM Invoices) AS RANK
FROM Invoices
ORDER BY Account
But that query keeps returning the number of records assigned to that account and not a rank.
This would be easier in a report, with a running count: Report - Running Count within a Group
This is not standard in a query, but you can do something with custom functions (it's elaborate, but possible):
http://support.microsoft.com/kb/94397/en-us
Easiest way is to break it up in to 2 queries, the first one is this and I've saved it as qryInvoices:
SELECT Invoices.Account, Sum(Invoices.Sales) AS Sales
FROM Invoices
GROUP BY Invoices.Account;
And then the second query uses the first as follows:
SELECT qryInvoices.Account, qryInvoices.Sales, (SELECT Count(*) FROM qryInvoices AS I WHERE I.Sales > qryInvoices.Sales)+1 AS Rank
FROM qryInvoices
ORDER BY qryInvoices.Sales DESC;
I've tested this and got the desired results as outlined in the question.
Note: It may be possible to achieve in one query using a Defined table, but in this instance it was looking a little ugly.
If you need the answer in one query, it should be
SELECT inv.*, (
SELECT 1+COUNT(*) FROM (
SELECT Account, Sum(Sales) AS Sum_sales FROM Invoices GROUP BY Account
) WHERE Sum_sales > inv.Sum_sales
) AS Rank
FROM (
SELECT Account, Sum(Sales) AS Sum_sales FROM Invoices GROUP BY Account
) inv
I have tried it on Access and it works. You may also use different names for the two instances of "Sum_sales" above to avoid confusion (in which case you can drop the "inv." prefix).
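For reference, here is a minimal sketch of the same single-query ranking run through Python's sqlite3 module on the sample table from the question. SQLite standing in for Access is an assumption, and the aliases other and SalesRank are names chosen for this example (SalesRank simply avoids any clash with a RANK keyword).

```python
import sqlite3

# In-memory SQLite database standing in for the Access table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Invoices (Account TEXT, Sales INTEGER, InvoiceNumber INTEGER)")
conn.executemany(
    "INSERT INTO Invoices VALUES (?, ?, ?)",
    [
        ("001", 400, 123),
        ("002", 150, 456),
        ("001", 300, 789),
    ],
)

# Rank = 1 + number of accounts whose summed sales are strictly higher.
# The per-account totals are computed twice: once as the rows being ranked
# (inv) and once as the set they are compared against (other).
query = """
SELECT inv.Account,
       inv.Sum_sales,
       (SELECT 1 + COUNT(*)
        FROM (SELECT Account, SUM(Sales) AS Sum_sales
              FROM Invoices
              GROUP BY Account) AS other
        WHERE other.Sum_sales > inv.Sum_sales) AS SalesRank
FROM (SELECT Account, SUM(Sales) AS Sum_sales
      FROM Invoices
      GROUP BY Account) AS inv
ORDER BY inv.Sum_sales DESC
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

Because the comparison is strictly greater than, accounts with equal totals share a rank and the next rank is skipped.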