How to slice a pandas dataframe which has dates as its index - pandas

I have a pandas dataframe which reads like below:
SKU
1/1/2017 1
2/1/2017 2
3/1/2017 3
4/1/2017 4
5/1/2017 5
So it has date strings as its index.
How can I perform a slicing operation on this dataframe?
I tried:
df.loc['1/1/2017':'3/1/2017']
It threw an error saying that I have to convert the string index into datetime.
Kindly help

For me it works nicely with your sample data:
print (df.loc['1/1/2017':'3/1/2017'])
SKU
1/1/2017 1
2/1/2017 2
3/1/2017 3
But I suggest creating a DatetimeIndex:
df.index = pd.to_datetime(df.index, dayfirst=True)
print (df.loc['2017-01-01':'2017-01-03'])
SKU
2017-01-01 1
2017-01-02 2
2017-01-03 3
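Here is a small self-contained sketch of the whole flow, assuming the dates are day-first strings and the column is named SKU as in the question:
import pandas as pd

# rebuild the sample frame with the string dates as the index (assumed day-first)
df = pd.DataFrame({'SKU': [1, 2, 3, 4, 5]},
                  index=['1/1/2017', '2/1/2017', '3/1/2017', '4/1/2017', '5/1/2017'])

# convert the string index to a DatetimeIndex so date-based slicing works reliably
df.index = pd.to_datetime(df.index, dayfirst=True)

print(df.loc['2017-01-01':'2017-01-03'])  # inclusive slice on the DatetimeIndex
print(df.loc['2017-01'])                  # partial-string indexing: the whole month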

Related

How to sum up a selected range of rows via a condition?

I hope that with this additional information someone can find time to help me with this new issue.
sample data here --> file
'Date as index' (datetime.date)
As I said, I'm trying to select a range of rows in the dataframe every time x is in the interval (-190, 0), create a new dataframe with a new column which is the sum of the selected rows, and keep the last "encountered" date as the index.
EDIT: The "loop" starts at the first date (the beginning of the df); when a value less than 0 or -190 is found, sum it up and continue to find and sum, and so on.
BUT I still get values which are in the interval (-190, 0);
example and code below.
Thanks
import pandas as pd
df = pd.read_csv('http://www.sharecsv.com/s/0525f76a07fca54717f7962d58cac692/sample_file.csv', sep = ';')
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3
##### output #####
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 11:28:00 -154.35
3 2019-01-02 12:08:00 -4706.87
4 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-29 16:58:00 -0.38
833 2019-09-30 17:08:00 -129365.71
834 2019-09-30 17:13:00 -157.05
835 2019-10-01 08:58:00 -111911.98
########## expected output #############
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 12:08:00 -4706.87
3 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-30 17:08:00 -129365.71
833 2019-10-01 08:58:00 -111911.98
...
...
Use Series.where with Series.between to replace the Date values with NaN where x is outside the range, back fill the missing values, and then aggregate with sum. The next step is to filter out rows that match the range with boolean indexing, and last use DataFrame.resample, casting the Series to a one-column DataFrame with Series.to_frame:
# keep Date only where x lies in [-190, 0], then back fill the gaps with the next valid date
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
# sum x per filled Date
df3 = df.groupby('Date', as_index=False)['x'].sum()
# drop group sums that themselves fall inside the range
df3 = df3[~df3['x'].between(-190, 0)]
# optional: resample to daily totals, back to a one-column DataFrame
df3 = df3.resample('D', on='Date')['x'].sum().to_frame()
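A tiny usage sketch on made-up numbers (the values below are invented just to show how the where/bfill/groupby chain behaves):
import pandas as pd

# invented mini example: out-of-range values are rolled forward onto the next
# row whose x lies inside [-190, 0], then the groups are summed
df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01 10:00', '2019-01-01 11:00',
                            '2019-01-01 13:48', '2019-01-02 11:00',
                            '2019-01-02 11:23']),
    'x': [-100000.0, -31395.21, -154.35, -250830.08, -0.38],
})

df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
out = df.groupby('Date', as_index=False)['x'].sum()
out = out[~out['x'].between(-190, 0)]
print(out)   # one summed row per in-range "trigger" date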

Not getting top5 values for each month using grouper and groupby in pandas

I'm trying to get the top 5 values of amount for each month, along with the text column. I've tried resampling and a groupby statement.
Dataset:
text amount date
123… 11.00 11-05-17
123abc… 10.00 11-08-17
Xyzzy… 22.00 12-07-17
Xyzzy… 221.00 11-08-17
Xyzzy… 212.00 10-08-17
Xyzzy… 242.00 18-08-17
Code:
df1 = df.groupby(['text', pd.Grouper(key='date', freq='M')])['amount'].apply(lambda x: x.nlargest(5))
I get groups of text, but they are not arranged by month, nor are the largest values sorted in descending order.
df1 = df.groupby([pd.Grouper(key='date', freq='M')])['amount'].apply(lambda x: x.nlargest(5))
This code works fine but does not give the text column.
Assuming that amount is a numeric column:
In [8]: df.groupby(['text', pd.Grouper(key='date', freq='M')]).apply(lambda x: x.nlargest(2, 'amount'))
Out[8]:
text amount date
text date
123abc… 2017-11-30 1 123abc… 10.0 2017-11-08
123… 2017-11-30 0 123… 11.0 2017-11-05
Xyzzy… 2017-08-31 5 Xyzzy… 242.0 2017-08-18
2017-10-31 4 Xyzzy… 212.0 2017-10-08
2017-11-30 3 Xyzzy… 221.0 2017-11-08
2017-12-31 2 Xyzzy… 22.0 2017-12-07
You can use head with sort_values:
df1 = df.sort_values('amount',ascending=False).groupby(['text', pd.Grouper(key='date', freq='M')]).head(2)
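If the goal is instead the top 5 amounts per month regardless of text, while still keeping the text column, here is a hedged sketch on invented data (the text labels, amounts, and dates below are made up just to show the mechanics):
import pandas as pd

df = pd.DataFrame({'text':   ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                   'amount': [11.0, 10.0, 22.0, 221.0, 212.0, 242.0, 5.0],
                   'date':   pd.to_datetime(['2017-11-05', '2017-11-08',
                                             '2017-12-07', '2017-11-08',
                                             '2017-10-08', '2017-08-18',
                                             '2017-11-20'])})

# keep the 5 largest amounts within each calendar month, with every column intact
top5 = (df.sort_values('amount', ascending=False)
          .groupby(pd.Grouper(key='date', freq='M'))
          .head(5)
          .sort_values(['date', 'amount'], ascending=[True, False]))
print(top5)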

How do I sort column by targeting a specific number within that cell?

I would like to use pandas in Python to sort a specific column by date (more specifically, by the year). However, the year is buried within a bunch of other numbers. How do I just target the 2 digits that I need?
In the example below, I want to sort this column by the numbers [16,14,15...] rather than considering all the numbers in that row.
3/18/16 11:46
6/19/14 14:58
7/27/15 14:22
8/3/15 12:59
2/20/13 12:33
9/27/16 12:08
7/27/15 14:22
Given a dataframe like this,
date
0 3/18/16
1 6/19/14
2 7/27/15
3 8/3/15
4 2/20/13
5 9/27/16
6 7/27/15
You can convert the date column to datetime format and then sort.
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by = 'date')
The resulting dataframe
date
4 2013-02-20
1 2014-06-19
2 2015-07-27
6 2015-07-27
3 2015-08-03
0 2016-03-18
5 2016-09-27
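If the sort really should look only at the two-digit year and otherwise leave the original order alone, one possible sketch uses the key argument of sort_values (available in pandas 1.1+); the timestamp layout from the question is assumed:
import pandas as pd

df = pd.DataFrame({'date': ['3/18/16 11:46', '6/19/14 14:58', '7/27/15 14:22',
                            '8/3/15 12:59', '2/20/13 12:33', '9/27/16 12:08',
                            '7/27/15 14:22']})

# parse the full timestamp, then sort by the year component only,
# using a stable sort so ties keep their original order
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y %H:%M')
df = df.sort_values(by='date', key=lambda s: s.dt.year, kind='mergesort')
print(df)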

Pandas group by cumsum keep columns

I have spent a few hours now trying to do a "cumulative group by sum" on a pandas dataframe. I have looked at all the stackoverflow answers and surprisingly none of them can solve my (very elementary) problem:
I have a dataframe:
df1
Out[8]:
Name Date Amount
0 Jack 2016-01-31 10
1 Jack 2016-02-29 5
2 Jack 2016-02-29 8
3 Jill 2016-01-31 10
4 Jill 2016-02-29 5
I am trying to
group by ['Name','Date'] and
cumsum 'Amount'.
That is it.
So the desired output is:
df1
Out[10]:
Name Date Cumsum
0 Jack 2016-01-31 10
1 Jack 2016-02-29 23
2 Jill 2016-01-31 10
3 Jill 2016-02-29 15
EDIT: I am simplifying the question. With the current answers I still can't get the correct "running" cumsum. Look closely, I want to see the cumulative sum "10, 23, 10, 15". In words, I want to see, at every consecutive date, the total cumulative sum for a person. NB: If there are two entries on one date for the same person, I want to sum those and then add them to the running cumsum and only then print the sum.
You need to assign the output to a new column and then remove the Amount column with drop:
df1['Cumsum'] = df1.groupby(by=['Name','Date'])['Amount'].cumsum()
df1 = df1.drop('Amount', axis=1)
print (df1)
Name Date Cumsum
0 Jack 2016-01-31 10
1 Jack 2016-02-29 5
2 Jack 2016-02-29 13
3 Jill 2016-01-31 10
4 Jill 2016-02-29 5
Another solution with assign:
df1 = (df1.assign(Cumsum=df1.groupby(by=['Name','Date'])['Amount'].cumsum())
          .drop('Amount', axis=1))
print (df1)
Name Date Cumsum
0 Jack 2016-01-31 10
1 Jack 2016-02-29 5
2 Jack 2016-02-29 13
3 Jill 2016-01-31 10
4 Jill 2016-02-29 5
EDIT by comment:
First group by the columns Name and Date and aggregate with sum, then group by the level Name and aggregate with cumsum.
df = (df1.groupby(by=['Name','Date'])['Amount'].sum()
         .groupby(level='Name').cumsum().reset_index(name='Cumsum'))
print (df)
Name Date Cumsum
0 Jack 2016-01-31 10
1 Jack 2016-02-29 23
2 Jill 2016-01-31 10
3 Jill 2016-02-29 15
Set the index first, then groupby.
df.set_index(['Name', 'Date']).groupby(level=[0, 1]).Amount.cumsum().reset_index()
After the OP changed their question, this is now the correct answer.
df1.groupby(
    ['Name', 'Date']
).Amount.sum().groupby(
    level='Name'
).cumsum()
This is the same answer as the one provided by jezrael.
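For completeness, a self-contained sketch of the two-step approach that reproduces the running cumsum 10, 23, 10, 15 asked for above:
import pandas as pd

df1 = pd.DataFrame({'Name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
                    'Date': ['2016-01-31', '2016-02-29', '2016-02-29',
                             '2016-01-31', '2016-02-29'],
                    'Amount': [10, 5, 8, 10, 5]})

# 1) collapse duplicate (Name, Date) rows with sum
# 2) running cumulative sum per Name
out = (df1.groupby(['Name', 'Date'])['Amount'].sum()
          .groupby(level='Name').cumsum()
          .reset_index(name='Cumsum'))
print(out)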

pandas pivot_table with dates as values

Let's say I have the following table of customer data:
df = pd.DataFrame.from_dict({"Customer":[0,0,1],
"Date":['01.01.2016', '01.02.2016', '01.01.2016'],
"Type":["First Buy", "Second Buy", "First Buy"],
"Value":[10,20,10]})
which looks like this:
Customer | Date | Type | Value
-----------------------------------------
0 |01.01.2016|First Buy | 10
-----------------------------------------
0 |01.02.2016|Second Buy| 20
-----------------------------------------
1 |01.01.2016|First Buy | 10
I want to pivot the table by the Type column.
However, the pivoting only gives the numeric Value columns as a result.
I'd desire a structure like:
Customer | First Buy Date | First Buy Value | Second Buy Date | Second Buy Value
---------------------------------------------------------------------------------
where the missing values are NaN or NaT.
Is this possible using pivot_table? If not, I can imagine some workarounds, but they are quite lengthy. Any other suggestions?
Use unstack:
df1 = df.set_index(['Customer', 'Type']).unstack()
df1.columns = ['_'.join(cols) for cols in df1.columns]
print (df1)
Date_First Buy Date_Second Buy Value_First Buy Value_Second Buy
Customer
0 01.01.2016 01.02.2016 10.0 20.0
1 01.01.2016 None 10.0 NaN
If you need a different order of columns, use swaplevel and sort_index:
df1 = df.set_index(['Customer', 'Type']).unstack()
df1.columns = ['_'.join(cols) for cols in df1.columns.swaplevel(0,1)]
df1.sort_index(axis=1, inplace=True)
print (df1)
First Buy_Date First Buy_Value Second Buy_Date Second Buy_Value
Customer
0 01.01.2016 10.0 01.02.2016 20.0
1 01.01.2016 10.0 None NaN
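An alternative sketch that stays with pivot_table, reusing the df from the question and assuming each (Customer, Type) pair occurs at most once, so aggfunc='first' simply picks the single value (it also works for the string dates):
df1 = df.pivot_table(index='Customer', columns='Type',
                     values=['Date', 'Value'], aggfunc='first')

# flatten the (column, Type) MultiIndex into "First Buy_Date", "First Buy_Value", ...
df1.columns = ['_'.join([typ, col]) for col, typ in df1.columns]
df1 = df1.sort_index(axis=1)
print(df1)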