I have a data set which contains account_number, date, balance, interest charged, and code. This is accounting data, so transactions are posted and then reversed if there was a mistake by the data provider, which means things can be posted and reversed multiple times.
Account_Number Date Balance Interest Charged Code
0012 01/01/2017 1,000,000 $50.00 Posted
0012 01/05/2017 1,000,000 $-50.00 Reversed
0012 01/07/2017 1,000,000 $50.00 Posted
0012 01/10/2017 1,000,000 $-50.00 Reversed
0012 01/15/2017 1,000,000 $50.00 Posted
0012 01/17/2017 1,500,000 $25.00 Posted
0012 01/18/2017 1,500,000 $-25.00 Reversed
Looking at the data set above, I am trying to figure out a way to look at every row by account number and balance: if there is an inverse charge, both of those rows should be removed, and a charge should only be kept if there is no corresponding reversal for it (01/15/2017). For example, on 01/01/2017 a charge of $50.00 was posted on a balance of 1,000,000, and on 01/05/2017 the charge was reversed on the same balance, so both of these rows should be thrown out. The same goes for 01/07 and 01/10.
I am not too sure how to code this problem - any ideas or tips would be great!
So the problem with a question like this is that there are many corner cases, and optimizing for them may or may not depend on how the data is already processed. That being said, here is one solution, assuming:
For each account number and balance, the row for each Reversed transaction comes just after the corresponding Posted charge.
>>import pandas as pd
>>from datetime import date
>>df = pd.DataFrame(data = [
    ['0012', date(2017, 1, 1), 1000000, 50, 'Posted'],
    ['0012', date(2017, 1, 5), 1000000, -50, 'Reversed'],
    ['0012', date(2017, 1, 7), 1000000, 50, 'Posted'],
    ['0012', date(2017, 1, 10), 1000000, -50, 'Reversed'],
    ['0012', date(2017, 1, 15), 1000000, 50, 'Posted'],
    ['0012', date(2017, 1, 17), 1500000, 25, 'Posted'],
    ['0012', date(2017, 1, 18), 1500000, -25, 'Reversed'],
    ],
    columns=['Account_Number', 'Date', 'Balance', 'Interest Charged', 'Code'])
>>df
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-01 1000000 50 Posted
1 0012 2017-01-05 1000000 -50 Reversed
2 0012 2017-01-07 1000000 50 Posted
3 0012 2017-01-10 1000000 -50 Reversed
4 0012 2017-01-15 1000000 50 Posted
5 0012 2017-01-17 1500000 25 Posted
6 0012 2017-01-18 1500000 -25 Reversed
>>def f(df_g):
    idx = df_g[df_g['Code'] == 'Reversed'].index
    return df_g.loc[~df_g.index.isin(idx.union(idx - 1)), ['Date', 'Interest Charged', 'Code']]
>>df.groupby(['Account_Number', 'Balance']).apply(f).reset_index().loc[:, df.columns]
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-15 1000000 50 Posted
How it works: basically, for each combination of Account_Number and Balance, I look at the rows marked Reversed and remove each of them together with the row just before it.
EDIT: To make it slightly more robust (it now matches a reversal to its posting by account number, balance, and absolute amount, rather than relying purely on row order):
>>df = pd.DataFrame(data = [
    ['0012', date(2017, 1, 1), 1000000, 53, 'Posted'],
    ['0012', date(2017, 1, 7), 1000000, 50, 'Posted'],
    ['0012', date(2017, 1, 5), 1000000, -50, 'Reversed'],
    ['0012', date(2017, 1, 10), 1000000, -53, 'Reversed'],
    ['0012', date(2017, 1, 15), 1000000, 50, 'Posted'],
    ['0012', date(2017, 1, 17), 1500000, 25, 'Posted'],
    ['0012', date(2017, 1, 18), 1500000, -25, 'Reversed'],
    ],
    columns=['Account_Number', 'Date', 'Balance', 'Interest Charged', 'Code'])
>>df
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-01 1000000 53 Posted
1 0012 2017-01-07 1000000 50 Posted
2 0012 2017-01-05 1000000 -50 Reversed
3 0012 2017-01-10 1000000 -53 Reversed
4 0012 2017-01-15 1000000 50 Posted
5 0012 2017-01-17 1500000 25 Posted
6 0012 2017-01-18 1500000 -25 Reversed
>>output_cols = df.columns
>>df['ABS_VALUE'] = df['Interest Charged'].abs()
>>def f(df_g):
    df_g = df_g.reset_index()  # added this line so that idx - 1 points at the previous row within each group
    idx = df_g[df_g['Code'] == 'Reversed'].index
    return df_g.loc[~df_g.index.isin(idx.union(idx - 1)), ['Date', 'Interest Charged', 'Code']]
>>df.groupby(['Account_Number', 'Balance', 'ABS_VALUE']).apply(f).reset_index().loc[:, output_cols]
Account_Number Date Balance Interest Charged Code
0 0012 2017-01-15 1000000 50 Posted
Please consider this table:
Year Month Value YearMonth
2011 1 70 201101
2011 1 100 201101
2011 2 200 201102
2011 2 50 201102
2011 3 80 201103
2011 3 250 201103
2012 1 100 201201
2012 2 200 201202
2012 3 250 201203
I want to get a cumulative sum based on each year. For the above table I want to get this result:
Year Month Sum
-----------------------
2011 1 170
2011 2 420 <--- 250 + 170
2011 3 750 <--- 330 + 250 + 170
2012 1 100
2012 2 300 <--- 200 + 100
2012 3 550 <--- 250 + 200 + 100
I wrote this code:
Select c1.YearMonth, Sum(c2.Value) CumulativeSumValue
From #Tbl c1, #Tbl c2
Where c1.YearMonth >= c2.YearMonth
Group By c1.YearMonth
Order By c1.YearMonth Asc
But its CumulativeSumValue is calculated twice for each YearMonth:
YearMonth CumulativeSumValue
201101 340 <--- 170 * 2
201102 840 <--- 420 * 2
201103 1500
201201 850
201202 1050
201203 1300
How can I achieve my desired result?
I wrote this query:
select year, (Sum (aa.[Value]) Over (partition by aa.Year Order By aa.Month)) as 'Cumulative Sum'
from #Tbl aa
But it returned multiple records for 2011:
Year Cumulative Sum
2011 170
2011 170
2011 420
2011 420
2011 750
2011 750
2012 100
2012 300
2012 550
You are creating a cartesian product here. In your ANSI-89 implicit JOIN (you really need to stop using those and switch to ANSI-92 syntax) you are joining on c1.YearMonth >= c2.YearMonth.
For your first month you have two rows with the same value of the year and month, so each of those 2 rows joins to the other 2; this results in 4 rows:
Year  Month  Value1  Value2
2011  1      70      70
2011  1      70      100
2011  1      100     70
2011  1      100     100
When you SUM this value you get 340, not 170, as you have 70+70+100+100.
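To see the join concretely, restricting the same condition to the first month reproduces exactly those four rows. This is just a sketch, assuming #Tbl has the four columns shown in the question:
SELECT c1.Year, c1.Month, c1.Value AS Value1, c2.Value AS Value2
FROM #Tbl c1
INNER JOIN #Tbl c2
    ON c1.YearMonth >= c2.YearMonth
WHERE c1.YearMonth = 201101;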
Instead of a triangular JOIN, however, you should be using a windowed SUM. As you also want to aggregate months into single rows, you'll need to aggregate inside the windowed SUM like so:
SELECT V.YearMonth,
SUM(SUM(V.Value)) OVER (PARTITION BY Year ORDER BY V.YearMonth) AS CumulativeSum
FROM (VALUES (2011, 1, 70, 201101),
(2011, 1, 100, 201101),
(2011, 2, 200, 201102),
(2011, 2, 50, 201102),
(2011, 3, 80, 201103),
(2011, 3, 250, 201103),
(2012, 1, 100, 201201),
(2012, 2, 200, 201202),
(2012, 3, 250, 201203)) V (Year, Month, Value, YearMonth)
GROUP BY V.YearMonth,
V.Year;
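Run against the sample VALUES above, this should return one row per YearMonth with the desired cumulative sums:
YearMonth  CumulativeSum
201101     170
201102     420
201103     750
201201     100
201202     300
201203     550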
I have something like this:
import pandas as pd

df_columns = {
'firm_ID': [1, 1, 2, 2, 2],
'date_incident' : ['2015-01-01', '2015-01-01', '2016-10-01', '2016-10-01', '2016-10-01'],
'date_meeting' : ['2014-02-01', '2016-03-01', '2015-10-01', '2017-02-01', '2018-11-01'],
}
simple_df = pd.DataFrame(df_columns)
simple_df['date_incident'] = pd.to_datetime(simple_df['date_incident'])
simple_df['date_meeting'] = pd.to_datetime(simple_df['date_meeting'])
simple_df['date_delta'] = simple_df['date_incident'] - simple_df['date_meeting']
There is only one date_incident per firm_ID, but several date_meetings per firm_ID. I want an additional column that returns the date delta closest to zero (smallest in absolute terms) per firm_ID. Note that this delta can be negative as well.
So I would get this (e.g., for firm_ID = 2 the closest meeting is at -123 days, i.e. 123 days after the incident):
Thanks.
Use DataFrameGroupBy.idxmin on the absolute value of the timedeltas (converted to days) to find the row with the smallest delta per firm_ID, and then create the new column by mapping with Series.map:
idx = simple_df['date_delta'].dt.days.abs().groupby(simple_df['firm_ID']).idxmin()
df = simple_df.loc[idx]
simple_df['new'] = simple_df['firm_ID'].map(df.set_index('firm_ID')['date_delta'])
print (simple_df)
firm_ID date_incident date_meeting date_delta new
0 1 2015-01-01 2014-02-01 334 days 334 days
1 1 2015-01-01 2016-03-01 -425 days 334 days
2 2 2016-10-01 2015-10-01 366 days -123 days
3 2 2016-10-01 2017-02-01 -123 days -123 days
4 2 2016-10-01 2018-11-01 -761 days -123 days
I have a table that has two different columns for AMOUNTS. FIRST_AMOUNT & SECOND_AMOUNT. I need to find the balance of the difference of the two.
For example, in the table below, for each number_id we have two separate amount columns. I need to take (FIRST_AMOUNT - SECOND_AMOUNT) to calculate the difference and then, based on this, calculate the BALANCE in the query.
However, because of the way we are receiving the data, we need to subtract the ABSOLUTE values of the amounts, so ABS(FIRST_AMOUNT) - ABS(SECOND_AMOUNT). However, if both the FIRST_AMOUNT and SECOND_AMOUNT are negative, we need the subtraction to also carry a negative. So -80 (first_amount) and -32 (second_amount) would be (80 - 32) WITH a negative, so -48. If there is a negative on only one of the columns, take normal subtraction. So -10 (first_amount) and 0 (second_amount) would be -10.
Then I need a column that calculates the balance of the difference starting from the EARLIEST date.
However, in the final query, I would not like to display the difference column, only the balance. Something like this:
Is there a way to formulate a query that could give this result?
In Snowflake this is trivial, because a later expression in the SELECT can reference a prior output column and Snowflake works out the order of dependencies for you - you can just say what you want.
This breaks down once you start nesting window functions, but for now it can all be written in one block.
WITH data(number_id, first_amount, second_amount, date) AS (
SELECT column1, column2, column3, to_date(column4, 'YYYY-MM-DD') FROM VALUES
(111, -10, 0, '2021-12-23'),
(111, -20, 0, '2021-12-22'),
(111, 30, 0, '2021-12-21'),
(111, -80, -32, '2021-12-20'),
(111, 48, 0, '2021-12-19'),
(111, 5, 5, '2021-12-18'),
(111, 72, 72, '2021-12-17'),
(111, 150, 150, '2021-12-16')
)
SELECT number_id
,FIRST_AMOUNT
,SECOND_AMOUNT
,date
,FIRST_AMOUNT - SECOND_AMOUNT as difference
,sum( difference )over(partition by number_id order by date rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) as balance
FROM data
ORDER BY 1, 4 DESC;
gives:
NUMBER_ID  FIRST_AMOUNT  SECOND_AMOUNT  DATE        DIFFERENCE  BALANCE
111        -10           0              2021-12-23  -10         0
111        -20           0              2021-12-22  -20         10
111        30            0              2021-12-21  30          30
111        -80           -32            2021-12-20  -48         0
111        48            0              2021-12-19  48          48
111        5             5              2021-12-18  0           0
111        72            72             2021-12-17  0           0
111        150           150            2021-12-16  0           0
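It is worth noting why plain subtraction is enough here: for every sign combination, the rule described in the question (subtract absolute values, keep the negative when both amounts are negative, normal subtraction when only one is negative) works out to the same result as FIRST_AMOUNT - SECOND_AMOUNT. If you wanted to spell the rule out explicitly anyway, a sketch along these lines (my own illustration against the data CTE above, not part of the original query) computes the same difference:
SELECT number_id
      ,date
      ,CASE
           WHEN first_amount < 0 AND second_amount < 0
               THEN -(ABS(first_amount) - ABS(second_amount))  -- both negative: keep the negative sign
           WHEN first_amount < 0 OR second_amount < 0
               THEN first_amount - second_amount               -- only one negative: normal subtraction
           ELSE ABS(first_amount) - ABS(second_amount)         -- both non-negative: absolute difference
       END AS difference
FROM data
ORDER BY date DESC;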
I'm trying to calculate and find the largest percentage changes between dates based on an indicator_id
year indicator_id value
-------------------------- --------------- -------
January 1, 1999, 12:00 AM 1 1.99
January 1, 2000, 12:00 AM 1 1.76
January 1, 2001, 12:00 AM 2 3.37
January 1, 2006, 12:00 AM 2 4.59
The output I'm trying to get is
year indicator_id value % change
-------------------------- --------------- ------- ---------
January 1, 1999, 12:00 AM 1 1.99 0%
January 1, 2000, 12:00 AM 1 1.76 ?
January 1, 2001, 12:00 AM 2 3.37 0%
January 1, 2006, 12:00 AM 2 4.59 ?
Please help
You want lag() and some arithmetic:
select t.*,
(1 - value / nullif(lag(value) over (partition by indicator_id order by year), 0)) as ratio
from t;
Note: This returns the change as a ratio rather than a percentage. You can multiply by 100 if you want a percentage.
Also, the first result is NULL, which makes more sense to me than 0. If you really want 0, you can use the 3 argument form of lag(): lag(value, 1, value).
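For example, with the sample data for indicator_id 1, the year-2000 row would get 1 - 1.76 / 1.99 ≈ 0.116, i.e. roughly an 11.6% drop from 1.99 to 1.76, while the 1999 row would be NULL (or 0 with the three-argument form).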
I am looking to pivot the rows in the attached image and I want the output to look something like this
ID Age Factor
1 30 8.650
1 35 11.52
1 40 13.87
till 100
2 30 7.99
2 35 10.98
2 40 13.43
till 100
3 30 7.32
3 35 10.98
3 40 13.43
till 100
and so on until I reach the last row (81) in the attached data source.
Thank you :)
Source data
Finally got it working -
SELECT a.ID - 1 AS ID, b.Age, CAST(b.Factor AS DECIMAL(19,2)) AS Factor
FROM t1 a -- data source table
CROSS APPLY (VALUES -- unpivot the factor columns F1..F10 into one (Age, Factor) row each
    (30, F1),
    (35, F2),
    (40, F3),
    (45, F4),
    (50, F5),
    (55, F6),
    (60, F7),
    (65, F8),
    (70, F9),
    (100, F10)
) b(Age, Factor)
WHERE a.ID >= 2 -- skip the first source row and shift the remaining IDs down by one