Get the sign change count from a DataFrame - pandas

I have a DataFrame like this:
Date amount
0 2021-06-18 14
1 2021-06-19 -8
2 2021-06-20 -8
3 2021-06-21 17
4 2021-07-02 -8
5 2021-07-05 77
6 2021-07-06 -10
7 2021-08-02 -78
8 2021-08-06 77
9 2021-07-08 10
I want the count of sign changes in amount for each month, like this:
count = [{"June-2021": 2},{"July-2021" : 3},{"Aug-2021" : 1}]
Note: the last date of one month and the first date of the next month belong to different months, so each month is counted separately and a sign change between them is not counted.
I want a function for this.

You can use (x.mul(x.shift()) < 0).sum() to get the count of sign changes within each month-year group: a negative product of the current entry and the previous entry indicates a sign change. This assumes the Date column already has a datetime dtype:
count = (df.groupby(df['Date'].dt.strftime('%b-%Y'), sort=False)['amount']
           .agg(lambda x: (x.mul(x.shift()) < 0).sum())
           .to_dict()
        )
Result:
print(count)
{'Jun-2021': 2, 'Jul-2021': 3, 'Aug-2021': 1}
Edit
If you want a list of dicts, you can use:
count = (df.groupby(df['Date'].dt.strftime('%b-%Y'), sort=False)['amount']
           .agg(lambda x: (x.mul(x.shift()) < 0).sum())
           .reset_index()
           .apply(lambda x: {x['Date']: x['amount']}, axis=1)
           .to_list()
        )
Result:
print(count)
[{'Jun-2021': 2}, {'Jul-2021': 3}, {'Aug-2021': 1}]
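Since the question asks for a function, the approach above could be wrapped roughly as follows. This is only a sketch: the name count_sign_changes and its date_col / value_col parameters are made up for illustration, and it converts the date column itself rather than assuming a datetime dtype. Note that a product involving zero is not negative, so a move to or from 0 is not counted as a sign change.

```python
import pandas as pd

def count_sign_changes(df, date_col="Date", value_col="amount"):
    """Hypothetical helper: per-month sign-change counts as a list of dicts."""
    # Group key such as 'Jun-2021'; sort=False keeps first-appearance order
    month_key = pd.to_datetime(df[date_col]).dt.strftime("%b-%Y")
    counts = (df.groupby(month_key, sort=False)[value_col]
                .agg(lambda x: int((x.mul(x.shift()) < 0).sum())))
    return [{month: int(n)} for month, n in counts.items()]

# Sample data from the question
df = pd.DataFrame({
    "Date": ["2021-06-18", "2021-06-19", "2021-06-20", "2021-06-21", "2021-07-02",
             "2021-07-05", "2021-07-06", "2021-08-02", "2021-08-06", "2021-07-08"],
    "amount": [14, -8, -8, 17, -8, 77, -10, -78, 77, 10],
})
count = count_sign_changes(df)
print(count)
```

Because the shift happens inside each month group, the first row of a month is compared with NaN and never contributes a sign change across a month boundary.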

Related

Pandas -- get dates closest to nth day of month

This Python code identifies the rows where the day of the month equals 5. For a month that does not have day 5, because it is a weekend or holiday, I want the mask to be True for the earlier date that is closest to day 5. I could write a loop to identify such dates, but is there an array formula to do this?
import pandas as pd
infile = "dates.csv"
df = pd.read_csv(infile)
dtimes = pd.to_datetime(df.iloc[:,0])
mask = (dtimes.dt.day == 5)
For testing purposes, I created the following DataFrame (with a single column):
xxx
0 2022-11-03
1 2022-11-04
2 2022-11-07
3 2022-12-02
4 2022-12-05
5 2022-12-06
6 2023-01-04
7 2023-01-05
8 2023-01-06
9 2023-02-02
10 2023-02-03
11 2023-02-06
12 2023-02-07
13 2023-04-02
14 2023-04-05
15 2023-04-06
Because my solution is based on the groupby method, I created dtimes
as a Series whose index is equal to its values:
wrk = pd.to_datetime(df.iloc[:,0])
dtimes = pd.Series(wrk.values, index=wrk)
Then, to find the valid date within the current group of dates
(a single month), I defined the following function:
def findDate(grp):
    if grp.size == 0:
        return None
    dd = grp.dt.day
    if dd.eq(5).any():
        dd = dd[dd.eq(5)]
    else:
        dd = dd[dd.lt(5)]
    return dd.index[-1]
To find the valid dates for the "existing" months, run the following (in pandas 2.2 and later, the 'M' alias is deprecated in favor of 'ME'):
validDates = dtimes.groupby(pd.Grouper(freq='M')).apply(findDate).dropna()
The result is:
xxx
2022-11-30 2022-11-04
2022-12-31 2022-12-05
2023-01-31 2023-01-05
2023-02-28 2023-02-03
2023-04-30 2023-04-05
dtype: datetime64[ns]
And to create your mask, run:
mask = dtimes.isin(validDates).values
To see the filtered rows, run:
df[mask]
getting:
xxx
1 2022-11-04
4 2022-12-05
7 2023-01-05
10 2023-02-03
14 2023-04-05
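As a possible loop-free variant of the same idea, one could keep only the dates whose day is at most 5 and take the latest such date in each month. This is a sketch under the assumption that "closest earlier date" means the maximum date with day <= 5 within the month; the names eligible and closest are made up for illustration, and the dates are the test data above.

```python
import pandas as pd

dtimes = pd.to_datetime(pd.Series([
    "2022-11-03", "2022-11-04", "2022-11-07", "2022-12-02", "2022-12-05",
    "2022-12-06", "2023-01-04", "2023-01-05", "2023-01-06", "2023-02-02",
    "2023-02-03", "2023-02-06", "2023-02-07", "2023-04-02", "2023-04-05",
    "2023-04-06",
]))

# Candidates: dates falling on or before the 5th of their month
eligible = dtimes[dtimes.dt.day <= 5]
# Latest candidate per calendar month (day 5 itself wins whenever it exists)
closest = eligible.groupby(eligible.dt.to_period("M")).max()
# True for the rows holding one of those dates
mask = dtimes.isin(closest)
print(dtimes[mask])
```

Months with no date on or before the 5th simply produce no group, so no row is flagged for them.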

Write text in a column based on ascending dates. Pandas Python

There are three dates in a df Date column sorted in ascending order. How to write text 'Short' for nearest date, 'Mid' for next date, 'Long' for the farthest date in a new column adjacent to the Date column ? i.e. 2021-04-23 = Short, 2021-05-11 = Mid and 2021-10-08 = Long.
import pandas as pd
data = {"product_name":["Keyboard","Mouse", "Monitor", "CPU","CPU", "Speakers"],
        "Unit_Price":[500,200, 5000.235, 10000.550, 10000.550, 250.50],
        "No_Of_Units":[5,5, 10, 20, 20, 8],
        "Available_Quantity":[5,6,10,1,3,2],
        "Date":['11-05-2021', '23-04-2021', '08-10-2021','23-04-2021', '08-10-2021','11-05-2021']
       }
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'], format = '%d-%m-%Y')
df = df.sort_values(by='Date')
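Convert the column with to_datetime, rank the dates with method='dense' (so equal dates share a rank and ranks run 1, 2, 3 without gaps), then map the ranks to your values in the desired order: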
df['New'] = (pd.to_datetime(df['Date']).rank(method='dense')
.map(dict(enumerate(['Short', 'Mid', 'Long'], start=1)))
)
Output:
product_name Unit_Price No_Of_Units Available_Quantity Date New
1 Mouse 200.000 5 6 2021-04-23 Short
3 CPU 10000.550 20 1 2021-04-23 Short
0 Keyboard 500.000 5 5 2021-05-11 Mid
5 Speakers 250.500 8 2 2021-05-11 Mid
2 Monitor 5000.235 10 10 2021-10-08 Long
4 CPU 10000.550 20 3 2021-10-08 Long
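To see why this works, it may help to look at the intermediate ranks. The sketch below uses only the Date values from the question; the cast to int before mapping is a small precaution added here, not part of the answer above, and the variable names are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["11-05-2021", "23-04-2021", "08-10-2021",
               "23-04-2021", "08-10-2021", "11-05-2021"]),
    format="%d-%m-%Y",
)

# Dense ranking: identical dates share a rank, and ranks have no gaps
ranks = dates.rank(method="dense").astype(int)
print(ranks.tolist())   # one rank per row, in the original row order

# Map rank 1 -> 'Short', 2 -> 'Mid', 3 -> 'Long'
labels = ranks.map(dict(enumerate(["Short", "Mid", "Long"], start=1)))
print(labels.tolist())
```

With more than three distinct dates, ranks above 3 would have no entry in the mapping and would come out as NaN.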

Pandas groupby and rolling window

I'm trying to calculate the sum of one field over a specific period of time, after a grouping function is applied.
My dataset looks like this:
Date Company Country Sold
01.01.2020 A BE 1
02.01.2020 A BE 0
03.01.2020 A BE 1
03.01.2020 A BE 1
04.01.2020 A BE 1
05.01.2020 B DE 1
06.01.2020 B DE 0
I would like to add a new column that, for each row, holds the sum of Sold (per group of "Company, Country") over the last 7 days, not including the current day:
Date Company Country Sold LastWeek_Count
01.01.2020 A BE 1 0
02.01.2020 A BE 0 1
03.01.2020 A BE 1 1
03.01.2020 A BE 1 1
04.01.2020 A BE 1 3
05.01.2020 B DE 1 0
06.01.2020 B DE 0 1
I tried the following, but it also includes the current date, and it gives different values for the same date, e.g. 03.01.2020:
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(7, on ='Date')['Sold'].sum().reset_index()
Is there a built-in function in pandas that I can use to perform these calculations?
You can use a .rolling window of 8 and then subtract the current date's sum (per group) to effectively get the previous 7 days. For this sample data, we should also pass min_periods=1 (otherwise you will get NaN values; for your actual dataset, you will need to decide what to do with windows that hold fewer than 8 rows). Note that an integer window counts rows, not calendar days.
Then, from the .rolling window of 8, do another .groupby on the same columns plus Date, and take the max of the newly created LastWeek_Count column. You need the max because there are multiple records per day, so the max is the total aggregated amount through the end of that Date.
Then, create a series holding the sum of Sold per group and Date. In the final step, subtract that per-date sum from the rolling max, which is a workaround for excluding the current day from the window:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(8, min_periods=1, on='Date')['Sold'].sum().reset_index()['Sold']
df['LastWeek_Count'] = df.groupby(['Company', 'Country', 'Date'])['LastWeek_Count'].transform('max')
s = df.groupby(['Company', 'Country', 'Date'])['Sold'].transform('sum')
df['LastWeek_Count'] = (df['LastWeek_Count']-s).astype(int)
Out[17]:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0
1 2020-01-02 A BE 0 1
2 2020-01-03 A BE 1 1
3 2020-01-03 A BE 1 1
4 2020-01-04 A BE 1 3
5 2020-01-05 B DE 1 0
6 2020-01-06 B DE 0 1
Another way is to first consolidate the Sold values of each group (['Date', 'Company', 'Country']) into a single row using a temporary DataFrame.
After that, apply your .groupby with .rolling over a window of 8 rows.
After calculating the sum, subtract the current row's Sold value from it, and add the resulting column to the original DataFrame with .merge:
#convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
#create a temporary DataFrame
df2 = df.groupby(['Date', 'Company', 'Country'])['Sold'].sum().reset_index()
#calculate the rolling sum over 8 rows (the current row plus the previous 7)
df2['LastWeek_Count'] = (df2.groupby(['Company', 'Country'])
                            .rolling(8, min_periods=1, on = 'Date')['Sold']
                            .sum().reset_index(drop=True)
                        )
#subtract the current 'Sold' from the rolling sum
df2['LastWeek_Count'] = df2['LastWeek_Count'] - df2['Sold']
#add the new column to the original DF
df.merge(df2.drop(columns=['Sold']), on = ['Date', 'Company', 'Country'])
#output:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0.0
1 2020-01-02 A BE 0 1.0
2 2020-01-03 A BE 1 1.0
3 2020-01-03 A BE 1 1.0
4 2020-01-04 A BE 1 3.0
5 2020-01-05 B DE 1 0.0
6 2020-01-06 B DE 0 1.0
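Both answers use an integer window, which counts rows. If the dates can have gaps, a time-based window may be closer to "the last 7 days": .rolling also accepts an offset such as '7D', and closed='left' leaves the current timestamp out of the window. The sketch below is one possible variant, not the method of either answer; it first consolidates to one row per group and date (as in the second answer) and relies on that daily frame being sorted by group and date, which the groupby produces here. The names daily, rolled and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01.01.2020", "02.01.2020", "03.01.2020", "03.01.2020",
             "04.01.2020", "05.01.2020", "06.01.2020"],
    "Company": ["A", "A", "A", "A", "A", "B", "B"],
    "Country": ["BE", "BE", "BE", "BE", "BE", "DE", "DE"],
    "Sold": [1, 0, 1, 1, 1, 1, 0],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")

# One row per (Company, Country, Date), sorted by those keys
daily = df.groupby(["Company", "Country", "Date"], as_index=False)["Sold"].sum()

# 7-day window ending just before the current date (closed='left' excludes it)
rolled = (daily.groupby(["Company", "Country"])
               .rolling("7D", on="Date", closed="left")["Sold"]
               .sum())

# The rolled values come back in the same group/date order as `daily`,
# so they are assigned by position; an empty window yields NaN -> 0
daily["LastWeek_Count"] = rolled.fillna(0).astype(int).to_numpy()

result = df.merge(daily.drop(columns="Sold"),
                  on=["Company", "Country", "Date"], how="left")
print(result)
```

Consolidating first means every row of the same date receives the same value, which avoids the differing values for 03.01.2020 mentioned in the question.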

Sorting columns in pandas dataframe while automating reports

I am working on an automation task and my dataframe columns are as shown below
Defined Discharge Bin  Apr-20  Jan-20  Mar-20  May-20  Grand Total
2-4 min                                             1            1
4-6 min                             5               1            6
6-8 min                             5       7       2           14
I want to sort the columns starting from Jan-20. The problem here is that the columns automatically get sorted according to alphabetical order. Sorting can be done manually but since I'm working on an automation task I need to ensure that each month when we feed the data the columns should automatically get sorted according to the months of the year.
Try this:
import pandas as pd
df = pd.DataFrame(data={'Defined Discharge Bin':['2-4 min', '4-6 min','6-8 min'], 'Apr-20':['', '', ''], 'Jan-20':['', 5, 5], 'Mar-20':['', '', 7], 'May-20':[1, 1, 2], 'Grand Total':[1, 6, 14]})
cols_exclude = ['Defined Discharge Bin', 'Grand Total']
cols_date = [c for c in df.columns.tolist() if c not in cols_exclude]
cols_sorted = sorted(cols_date, key=lambda x: pd.to_datetime(x, format='%b-%y'))
df = df[cols_exclude[0:1] + cols_sorted + cols_exclude[-1:]]
print(df)
Output:
  Defined Discharge Bin  Jan-20  Mar-20  Apr-20  May-20  Grand Total
0               2-4 min                               1            1
1               4-6 min       5                       1            6
2               6-8 min       5       7               2           14
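For an automated report where the non-month columns may vary, the excluded columns do not have to be hard-coded. The sketch below is one possible generalization: sort_month_columns is a hypothetical helper that treats any column name parsing as '%b-%y' as a month, sorts those chronologically, and leaves every other column in its original position. It assumes English month abbreviations.

```python
from datetime import datetime

import pandas as pd

def sort_month_columns(df, fmt="%b-%y"):
    """Hypothetical helper: reorder month-like columns chronologically."""
    def parse(col):
        try:
            return datetime.strptime(str(col), fmt)
        except ValueError:
            return None  # not a month column

    parsed = {c: parse(c) for c in df.columns}
    months = iter(sorted((c for c in df.columns if parsed[c] is not None),
                         key=parsed.get))
    # Fill each month slot with the next month in order; keep other columns
    ordered = [next(months) if parsed[c] is not None else c for c in df.columns]
    return df[ordered]

df = pd.DataFrame({
    "Defined Discharge Bin": ["2-4 min", "4-6 min", "6-8 min"],
    "Apr-20": ["", "", ""],
    "Jan-20": ["", 5, 5],
    "Mar-20": ["", "", 7],
    "May-20": [1, 1, 2],
    "Grand Total": [1, 6, 14],
})
sorted_df = sort_month_columns(df)
print(sorted_df.columns.tolist())
```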

week number from given date in pandas

I have a data frame with two columns, Date and value.
I want to add a new column named week_number that holds how many weeks back each Date is from a given date.
import pandas as pd
df = pd.DataFrame(columns=['Date','value'])
df['Date'] = [ '04-02-2019','03-02-2019','28-01-2019','20-01-2019']
df['value'] = [10,20,30,40]
df
Date value
0 04-02-2019 10
1 03-02-2019 20
2 28-01-2019 30
3 20-01-2019 40
Suppose the given date is 05-02-2019.
Then I need to add a column week_number that shows how many weeks back each date in the Date column is from the given date.
The output should be
Date value week_number
0 04-02-2019 10 1
1 03-02-2019 20 1
2 28-01-2019 30 2
3 20-01-2019 40 3
How can I do this in pandas?
First convert the column to datetimes with to_datetime and dayfirst=True, then subtract it from the given date with rsub, convert the timedeltas to days, floor-divide by 7 and add 1:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['week_number'] = df['Date'].rsub(pd.Timestamp('2019-02-05')).dt.days // 7 + 1
#alternative
#df['week_number'] = (pd.Timestamp('2019-02-05') - df['Date']).dt.days // 7 + 1
print (df)
Date value week_number
0 2019-02-04 10 1
1 2019-02-03 20 1
2 2019-01-28 30 2
3 2019-01-20 40 3
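To make the arithmetic concrete, the intermediate steps can be printed one at a time. This is only an illustrative sketch using the dates from the question; the names ref, days_back and week_number are chosen here for readability.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["04-02-2019", "03-02-2019", "28-01-2019", "20-01-2019"]),
    format="%d-%m-%Y",
)
ref = pd.Timestamp("2019-02-05")

# Whole days between each date and the reference date
days_back = (ref - dates).dt.days
print(days_back.tolist())

# Floor division buckets days 0-6 into week 0, 7-13 into week 1, and so on;
# adding 1 makes the numbering start at 1
week_number = days_back // 7 + 1
print(week_number.tolist())
```

A date exactly 7 days back falls into week 2 under this scheme, since 7 // 7 + 1 is 2.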