groupby sum month-wise on datetime data - pandas

I have transaction data as shown below, covering three months.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-18
6 1 PLATINUM GROCERY 400 11-Feb-18
7 1 PLATINUM RESTRAUNT 600 21-Feb-18
8 1 PLATINUM GROCERY 800 17-Mar-18
9 1 PLATINUM GROCERY 200 21-Mar-18
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-18
16 2 GOLD GROCERY 400 11-Feb-18
17 2 GOLD RESTRAUNT 600 21-Mar-18
18 2 GOLD GROCERY 200 21-Mar-18
19 2 GOLD HOTEL 700 25-Mar-18
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-18
26 3 SILVER GROCERY 400 11-Feb-18
27 3 SILVER HOTEL 600 25-Mar-18
28 3 SILVER GROCERY 200 29-Mar-18
29 3 SILVER RESTRAUNT 700 30-Mar-18
I am struggling to get the dataframe below.
Card_No Card_Type D Jan_Sp Jan_N Feb_Sp Feb_N Mar_Sp GR_T RES_T
1 PLATINUM 70 3300 5 1500 3 1000 2300 100
2 GOLD 72 5200 5 1900 2 1500 2300 1100
3 SILVER 76 2900 5 900 2 1500 1800 1700
D = Duration in days from first transaction to last transaction.
Jan_Sp = Total spending in January.
Feb_Sp = Total spending in February.
Mar_Sp = Total spending in March.
Jan_N = Number of transactions in January.
Feb_N = Number of transactions in February.
GR_T = Total spending on GROCERY.
RES_T = Total spending on RESTRAUNT.
I tried the following code. I am very new to pandas.
q9['Date'] = pd.to_datetime(q9['Date'])
q9 = q9.sort_values(['Card_Number', 'Date'])
q9['D'] = q9.groupby('Card_Number')['Date'].diff().dt.days

My approach has three steps:
1. get the date range
2. get the monthly spending
3. get the category spending
Step 1: Date
# days between each card type's first and last transaction
date_df = df.groupby('Card_type').Date.apply(lambda x: (x.max() - x.min()).days)
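For the sample data this gives:
Card_type
GOLD        72
PLATINUM    70
SILVER      76
Name: Date, dtype: int64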
Step 2: Month
month_df = (df.groupby(['Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack('Date'))
# flatten the MultiIndex: ('_Sp', 'Jan') -> 'Jan_Sp'
month_df.columns = [b + a for a, b in month_df.columns]
Step 3: Category
cat_df = df.pivot_table(index='Card_type',
                        columns='Category',
                        values='Amount',
                        aggfunc='sum')
# first two letters of the category + '_T', e.g. GROCERY -> GR_T
cat_df.columns = [a[:2] + "_T" for a in cat_df.columns]
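For the sample data this yields the per-category totals:
           GR_T  HO_T  RE_T
Card_type
GOLD       2300  5200  1100
PLATINUM   2300  2500  1000
SILVER     1800  2100  1700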
And finally concat:
pd.concat((date_df, month_df, cat_df), axis=1)
gives:
           Date  Feb_Sp  Jan_Sp  Mar_Sp  Feb_N  Jan_N  Mar_N  GR_T  HO_T  RE_T
Card_type
GOLD         72    1900    5200    1500      2      5      3  2300  5200  1100
PLATINUM     70    1500    3300    1000      3      5      2  2300  2500  1000
SILVER       76     900    3200    1500      2      5      3  1800  2100  1700
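Note that a couple of column names differ from the ones requested in the question (Date instead of D, RE_T instead of RES_T). If that matters, a final rename lines them up:
result = pd.concat((date_df, month_df, cat_df), axis=1)
result = result.rename(columns={'Date': 'D', 'RE_T': 'RES_T'})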
If your data span several years and you want to separate them by year, you can add df.Date.dt.year to each groupby above:
date_df = df.groupby([df.Date.dt.year, 'Card_type']).Date.apply(lambda x: (x.max() - x.min()).days)

month_df = (df.groupby([df.Date.dt.year, 'Card_type', df.Date.dt.month_name().str[:3]])
              .Amount
              .agg(['sum', 'count'])
              .rename({'sum': '_Sp', 'count': '_N'}, axis=1)
              .unstack(level=-1))
# flatten the MultiIndex as before
month_df.columns = [b + a for a, b in month_df.columns]

cat_df = (df.groupby([df.Date.dt.year, 'Card_type', 'Category'])
            .Amount
            .sum()
            .unstack(level=-1))
# first two letters of the category + '_T'
cat_df.columns = [a[:2] + "_T" for a in cat_df.columns]
pd.concat((date_df, month_df, cat_df), axis=1)
gives:
                Date  Feb_Sp  Jan_Sp  Mar_Sp  Feb_N  Jan_N  Mar_N  GR_T  HO_T
Date Card_type
2017 GOLD         72    1900    5200    1500      2      5      3  2300  5200
     PLATINUM     70    1500    3300    1000      3      5      2  2300  2500
     SILVER       76     900    3200    1500      2      5      3  1800  2100
2018 GOLD         72    1900    5200    1500      2      5      3  2300  5200
     PLATINUM     70    1500    3300    1000      3      5      2  2300  2500
     SILVER       76     900    3200    1500      2      5      3  1800  2100
I would recommend keeping the dataframe this way, so you can access the annual data, e.g. result_df.loc[2017] gives you the 2017 data. If you really want the year as part of the column labels instead, you can do result_df.unstack(level=0).

Related

Cumulative over table rows with condition Oracle PL/SQL

I have two tables:
Employees:
employee_id field max_amount
3 a 3000
4 a 3000
1 a 1600
2 a 500
4 b 4000
2 b 4000
3 b 1700
ordered by employee, field, amount desc.
Amounts (pol, premia,field):
pol premia field assign_to_employee
11 900 a 3
44 1000 a 3
55 1400 a 4
77 500 a 3
88 1300 a 1
22 800 b 4
33 3900 b 2
66 1300 b 4
Assign Stats Table:
employee_id field max_amount true_amount remain
3 a 3000 2400 600
4 a 3000 1400 1600
1 a 1600 1300 300
2 a 500 0 500
4 b 4000 2100 1900
2 b 4000 3900 100
3 b 1700 0 1700
The output: the assign_to_employee field (merged into the Amounts table).
Algorithm-wise: the method is to assign pols to employees for as long as the premia to be added to the cumulative sum does not exceed the max amount per employee listed in the Employees table. You always start with the employee with the largest max amount and keep going until no further pols can be added to that employee.
I start with the employee with the greater max_amount per field.
I keep doing this until no pols remain to be assigned.
Can you help me solve this?
Thank you.
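For what it's worth, here is a rough Python sketch of one reading of the greedy rule described above (employees visited in descending max_amount order, each taking every pol, in listed order, that still fits under its cap). The exact tie-breaking between employees with spare capacity is not fully pinned down by the question, so treat this as illustrative only:
# Sketch of the greedy cumulative assignment (illustrative assumptions above).
employees = [(3, 'a', 3000), (4, 'a', 3000), (1, 'a', 1600), (2, 'a', 500),
             (4, 'b', 4000), (2, 'b', 4000), (3, 'b', 1700)]
amounts = [(11, 900, 'a'), (44, 1000, 'a'), (55, 1400, 'a'), (77, 500, 'a'),
           (88, 1300, 'a'), (22, 800, 'b'), (33, 3900, 'b'), (66, 1300, 'b')]

assign_to_employee = {}   # pol -> employee_id
cumulative = {}           # (employee_id, field) -> running premia total
for emp, field, cap in sorted(employees, key=lambda e: -e[2]):
    for pol, premia, pol_field in amounts:
        if pol_field != field or pol in assign_to_employee:
            continue
        if cumulative.get((emp, field), 0) + premia <= cap:
            assign_to_employee[pol] = emp
            cumulative[(emp, field)] = cumulative.get((emp, field), 0) + premia

print(assign_to_employee)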

How can I transfer the data from one column to another based on another column's values in SQL

SELECT Products, Fiscal_year, Fiscal_Period, Stock_QTY, DaysRemaining,
       (Stock_QTY / DaysRemaining) AS QtyforPeriod,
       Stock_QTY - (Stock_QTY / DaysRemaining) AS LeftforNextmonth
FROM Stocks
Products | Fiscal_year | Fiscal_period | Stock_QTY | DaysRemaining | QtyforPeriod | LeftforNextMonth
5000 22 1 100 4
6000 22 1 200 4
7000 22 2 300 20
7000 22 3 400 40
8000 23 1 500 60
5000 23 1 600 60
7000 23 2 700 90
8000 23 3 800 100
Is there any way to write a query such that, if Fiscal_year = 22 and Fiscal_period = 4, it subtracts the LeftforNextMonth of period 3 from Stock_QTY and divides by DaysRemaining?
Likewise, if Fiscal_year = 22 and Fiscal_period = 5, subtract the LeftforNextMonth of period 4 from Stock_QTY and divide by DaysRemaining.
And if Fiscal_year = 22 and Fiscal_period = 6, subtract the LeftforNextMonth of period 5 from Stock_QTY and divide by DaysRemaining.
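In SQL this kind of "value from the previous period" lookup is typically done with the LAG() window function. Purely as an illustration of the arithmetic (using only the question's period-2 and period-3 rows for product 7000), the same calculation in pandas might look like:
import pandas as pd

stocks = pd.DataFrame({'Products': [7000, 7000],
                       'Fiscal_year': [22, 22],
                       'Fiscal_period': [2, 3],
                       'Stock_QTY': [300, 400],
                       'DaysRemaining': [20, 40]})
stocks = stocks.sort_values(['Products', 'Fiscal_year', 'Fiscal_period'])
stocks['LeftforNextMonth'] = stocks['Stock_QTY'] - stocks['Stock_QTY'] / stocks['DaysRemaining']
# previous period's LeftforNextMonth per product and year (what LAG() would fetch)
prev_left = stocks.groupby(['Products', 'Fiscal_year'])['LeftforNextMonth'].shift()
stocks['QtyforPeriod'] = (stocks['Stock_QTY'] - prev_left) / stocks['DaysRemaining']
print(stocks)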

pandas groupby sum, count by datetime, considering only the year

I have transaction data as shown below, covering three years.
Card_Number Card_type Category Amount Date
0 1 PLATINUM GROCERY 100 10-Jan-18
1 1 PLATINUM HOTEL 2000 14-Jan-18
2 1 PLATINUM GROCERY 500 17-Jan-18
3 1 PLATINUM GROCERY 300 20-Jan-18
4 1 PLATINUM RESTRAUNT 400 22-Jan-18
5 1 PLATINUM HOTEL 500 5-Feb-19
6 1 PLATINUM GROCERY 400 11-Feb-19
7 1 PLATINUM RESTRAUNT 600 21-Feb-19
8 1 PLATINUM GROCERY 800 17-Mar-17
9 1 PLATINUM GROCERY 200 21-Mar-17
10 2 GOLD GROCERY 1000 12-Jan-18
11 2 GOLD HOTEL 3000 14-Jan-18
12 2 GOLD RESTRAUNT 500 19-Jan-18
13 2 GOLD GROCERY 300 20-Jan-18
14 2 GOLD GROCERY 400 25-Jan-18
15 2 GOLD HOTEL 1500 5-Feb-19
16 2 GOLD GROCERY 400 11-Feb-19
17 2 GOLD RESTRAUNT 600 21-Mar-17
18 2 GOLD GROCERY 200 21-Mar-17
19 2 GOLD HOTEL 700 25-Mar-17
20 3 SILVER RESTRAUNT 1000 13-Jan-18
21 3 SILVER HOTEL 1000 16-Jan-18
22 3 SILVER GROCERY 500 18-Jan-18
23 3 SILVER GROCERY 300 23-Jan-18
24 3 SILVER GROCERY 400 28-Jan-18
25 3 SILVER HOTEL 500 5-Feb-19
26 3 SILVER GROCERY 400 11-Feb-19
27 3 SILVER HOTEL 600 25-Mar-17
28 3 SILVER GROCERY 200 29-Mar-17
29 3 SILVER RESTRAUNT 700 30-Mar-17
I am struggling to get the dataframe below.
Card_No Card_Type D 2018_Sp 2018_N 2019_Sp 2019_N 2017_Sp
1 PLATINUM 70 3300 5 1500 3 1000
2 GOLD 72 5200 5 1900 2 1500
3 SILVER 76 2900 5 900 2 1500
D = Duration in days from first transaction to last transaction.
2018_Sp = Total spending in 2018.
2019_Sp = Total spending in 2019.
2017_Sp = Total spending in 2017.
2018_N = Number of transactions in 2018.
2019_N = Number of transactions in 2019.
Use:
#convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#sorting if necessary
df = df.sort_values(['Card_Number', 'Card_type', 'Date'])
#aggregate sum and count per year
df1 = (df.groupby(['Card_Number', 'Card_type', df['Date'].dt.year])['Amount']
         .agg([('Sp', 'sum'), ('N', 'size')])
         .unstack()
         .sort_index(axis=1, level=1))
#flatten the MultiIndex to columns like 2018_Sp
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
#duration in days (larger than in the question because the data span years)
s = df.groupby('Card_type').Date.apply(lambda x: (x.max() - x.min()).days).rename('D')
#join together
df1 = df1.join(s).reset_index()
print(df1)
  Card_Number Card_type  2017_N  2017_Sp  2018_N  2018_Sp  2019_N  2019_Sp  \
0           1  PLATINUM       2     1000       5     3300       3     1500
1           2      GOLD       3     1500       5     5200       2     1900
2           3    SILVER       3     1500       5     3200       2      900

     D
0  706
1  692
2  688

Select most recent entry in table by another column

I have 3 SQL tables: StudentTable, FeeAssociationTable and InvoiceTable.
StudentTable has the primary key AdmissionNumber, FeeAssociationTable has the primary key FeeAssociationID and InvoiceTable has the primary key InvoiceID.
The FeeAssociationTable takes an AdmissionNumber and assigns a fee to it; then, while depositing the fee, the InvoiceTable takes the AdmissionNumber, calculates the total fee, subtracts the amount paid and inserts the difference into the dues.
The problem is that the same AdmissionNumber can have multiple rows in InvoiceTable. How can I select and sum the most recent dues of each AdmissionNumber (without repeating)?
Here is some data:
37 1 3000 January-2018 3000 0 2018-08-17
38 2 3000 January-2018 3000 0 2018-08-17
39 3 3000 January-2018 3000 0 2018-08-17
40 4 3000 January-2018 3000 0 2018-08-17
41 5 3000 January-2018 3000 0 2018-08-17
42 6 3000 January-2018 3000 0 2018-08-17
43 7 3000 January-2018 3000 0 2018-08-17
44 8 3000 January-2018 3000 0 2018-08-17
45 9 3000 January-2018 3000 0 2018-08-17
46 10 3000 January-2018 3000 0 2018-08-17
47 1 3200 June-2018 2500 700 2018-08-17
48 2 3200 June-2018 2500 700 2018-08-17
49 3 3200 June-2018 2500 700 2018-08-17
50 4 3200 June-2018 2500 700 2018-08-17
51 5 3200 June-2018 2500 700 2018-08-17
52 6 3200 June-2018 2500 700 2018-08-17
53 7 3200 June-2018 2500 700 2018-08-17
54 8 3200 June-2018 2500 700 2018-08-17
55 9 3200 June-2018 2500 700 2018-08-17
56 10 3200 June-2018 2500 700 2018-08-17
57 10 3700 July-2018 2500 1200 2018-08-17
58 9 3700 July-2018 2400 1300 2018-08-17
59 8 3700 July-2018 200 3500 2018-08-17
60 7 3700 July-2018 2000 1700 2018-08-17
61 1 3700 July-2018 1500 2200 2018-08-17
62 2 3700 July-2018 3100 600 2018-08-17
Expectation:
I want the most recent dues of each student, without adding in the previous ones.
For example:
InvoiceId AdmissionNumber TotalFee Month Paid Dues Date
37 1 3000 January-2018 3000 0 2018-08-17
47 1 3200 June-2018 2500 700 2018-08-17
61 1 3700 July-2018 1500 2200 2018-08-17
There are 3 entries for AdmissionNumber 1 in the InvoiceTable. In the first there are no dues, in the second there are dues of 700, and in the third there are dues of 2200 on AdmissionNumber 1.
What I want to do is select the last one, which can be done with the code given below:
SELECT Dues FROM InvoiceTable AS IT
WHERE IT.InvoiceID = (SELECT MAX(InvoiceID)
                      FROM InvoiceTable
                      WHERE AdmissionNumber = 1)
This works for a single student; I want a list of the most recent dues of every student.
Thanks in advance.
Based on your follow-up information, I believe the following will give you what you are looking for:
SELECT InvoiceId, AdmissionNumber, Dues
FROM InvoiceTable AS IT
WHERE IT.InvoiceId IN (SELECT MAX(InvoiceId)
                       FROM InvoiceTable
                       GROUP BY AdmissionNumber)
ORDER BY AdmissionNumber ASC
The query you tried in your example was close; it just had to be adapted to work for the whole table rather than a single AdmissionNumber.
Here is a demo of this working: SQL Fiddle
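For comparison, the same "latest row per group" idea in pandas (a hypothetical frame mirroring a few InvoiceTable columns from the question):
import pandas as pd

invoices = pd.DataFrame({'InvoiceId': [37, 47, 61, 38, 48, 62],
                         'AdmissionNumber': [1, 1, 1, 2, 2, 2],
                         'Dues': [0, 700, 2200, 0, 700, 600]})
# keep the row with the highest InvoiceId per AdmissionNumber
latest = invoices.loc[invoices.groupby('AdmissionNumber')['InvoiceId'].idxmax()]
print(latest.sort_values('AdmissionNumber'))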

Splitting Column Headers and Duplicating Row Values in Pandas Dataframe

In the example df below, I'm trying to find a way to split the column headers ('1;2', '4', '5;6') on the ';' and duplicate the row values into the resulting split columns. (My actual df comes from an imported csv file, so I generally have around 50-80 column headers that need splitting.)
Below is my code with its output:
import pandas as pd
import numpy as np

data = np.array([['Market', 'Product Code', '1;2', '4', '5;6'],
                 ['Total Customers', 123, 1, 500, 400],
                 ['Total Customers', 123, 2, 400, 320],
                 ['Major Customer 1', 123, 1, 100, 220],
                 ['Major Customer 1', 123, 2, 230, 230],
                 ['Major Customer 2', 123, 1, 130, 30],
                 ['Major Customer 2', 123, 2, 20, 10],
                 ['Total Customers', 456, 1, 500, 400],
                 ['Total Customers', 456, 2, 400, 320],
                 ['Major Customer 1', 456, 1, 100, 220],
                 ['Major Customer 1', 456, 2, 230, 230],
                 ['Major Customer 2', 456, 1, 130, 30],
                 ['Major Customer 2', 456, 2, 20, 10]])
df = pd.DataFrame(data)
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
print(df)
0 Market Product Code 1;2 4 5;6
1 Total Customers 123 1 500 400
2 Total Customers 123 2 400 320
3 Major Customer 1 123 1 100 220
4 Major Customer 1 123 2 230 230
5 Major Customer 2 123 1 130 30
6 Major Customer 2 123 2 20 10
7 Total Customers 456 1 500 400
8 Total Customers 456 2 400 320
9 Major Customer 1 456 1 100 220
10 Major Customer 1 456 2 230 230
11 Major Customer 2 456 1 130 30
12 Major Customer 2 456 2 20 10
Below is my desired output
0 Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
Ideally I would like to perform such a task at the 'read_csv' level. Any thoughts?
Try reindex with repeat:
s = df.columns.str.split(';')
# repeat each original column once per split name
df = df.reindex(columns=df.columns.repeat(s.str.len()))
# flatten the lists of split names into the new header
df.columns = sum(s.tolist(), [])
df
Out[247]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10
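On the read_csv part of the question: read_csv itself has no option for splitting headers, but the same reshaping can be applied immediately after reading. A minimal sketch (data.csv is a hypothetical file with headers like '1;2'):
import pandas as pd

df = pd.read_csv('data.csv')
pairs = [(orig, new) for orig in df.columns for new in str(orig).split(';')]
df = df[[orig for orig, _ in pairs]]    # repeat columns whose header splits
df.columns = [new for _, new in pairs]  # assign the split names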
You can split the column names on ';' and then reconstruct the df:
pd.DataFrame({c:df[t] for t in df.columns for c in t.split(';')})
Out[157]:
1 2 4 5 6 Market Product Code
1 1 1 500 400 400 Total Customers 123
2 2 2 400 320 320 Total Customers 123
3 1 1 100 220 220 Major Customer 1 123
4 2 2 230 230 230 Major Customer 1 123
5 1 1 130 30 30 Major Customer 2 123
6 2 2 20 10 10 Major Customer 2 123
7 1 1 500 400 400 Total Customers 456
8 2 2 400 320 320 Total Customers 456
9 1 1 100 220 220 Major Customer 1 456
10 2 2 230 230 230 Major Customer 1 456
11 1 1 130 30 30 Major Customer 2 456
12 2 2 20 10 10 Major Customer 2 456
Or if you would like to preserve the column order:
pd.concat([df[t].to_frame(c) for t in df.columns for c in t.split(';')], axis=1)
Out[167]:
Market Product Code 1 2 4 5 6
1 Total Customers 123 1 1 500 400 400
2 Total Customers 123 2 2 400 320 320
3 Major Customer 1 123 1 1 100 220 220
4 Major Customer 1 123 2 2 230 230 230
5 Major Customer 2 123 1 1 130 30 30
6 Major Customer 2 123 2 2 20 10 10
7 Total Customers 456 1 1 500 400 400
8 Total Customers 456 2 2 400 320 320
9 Major Customer 1 456 1 1 100 220 220
10 Major Customer 1 456 2 2 230 230 230
11 Major Customer 2 456 1 1 130 30 30
12 Major Customer 2 456 2 2 20 10 10