pandas sort by multiple columns - pandas

I want to sort the values in column C in ascending order, and the values in column B in the order "April", "August", "December", followed by any remaining values (e.g. NaN in the current example). Can anyone help?
Before:
A B C
0 354.7 April 4
1 278.8 NaN 4
2 283.5 December 2
3 249.6 NaN 2
4 95.5 April 2
5 85.6 August 2
6 55.4 August 4
7 176.5 December 4
8 104.8 August 8
9 278.8 NaN 10
10 238.7 April 8
11 278.8 April 5
12 152 December 8
After:
A B C
0 95.5 April 2
1 85.6 August 2
2 283.5 December 2
3 249.6 NaN 2
4 354.7 April 4
5 55.4 August 4
6 176.5 December 4
7 278.8 NaN 4
8 278.8 April 5
9 238.7 April 8
10 104.8 August 8
11 152 December 8
12 278.8 NaN 10

Is this what you need? Convert B to a categorical whose categories are listed in the order you want, then sort by C and B. NaN values sort last:
df.B=pd.Categorical(df.B,['April','August','December'])
df.sort_values(['C','B'])
Out[284]:
A B C
4 95.5 April 2
5 85.6 August 2
2 283.5 December 2
3 249.6 NaN 2
0 354.7 April 4
6 55.4 August 4
7 176.5 December 4
1 278.8 NaN 4
11 278.8 April 5
10 238.7 April 8
8 104.8 August 8
12 152.0 December 8
9 278.8 NaN 10
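If you would rather not change the dtype of column B, one possible alternative is to sort on a temporary rank column. This is only a sketch: the helper column name `_rank` and the `order` list are assumptions made for illustration, and the frame is rebuilt from the question's sample data.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": [354.7, 278.8, 283.5, 249.6, 95.5, 85.6, 55.4,
          176.5, 104.8, 278.8, 238.7, 278.8, 152.0],
    "B": ["April", np.nan, "December", np.nan, "April", "August", "August",
          "December", "August", np.nan, "April", "April", "December"],
    "C": [4, 4, 2, 2, 2, 2, 4, 4, 8, 10, 8, 5, 8],
})

# Desired order for column B; anything not listed (here NaN) maps to NaN.
order = ["April", "August", "December"]
rank = {name: pos for pos, name in enumerate(order)}

out = (
    df.assign(_rank=df["B"].map(rank))                 # hypothetical helper column
      .sort_values(["C", "_rank"], na_position="last")  # unlisted values go last
      .drop(columns="_rank")
      .reset_index(drop=True)
)
print(out)
```

Column B keeps its original object dtype here, which can matter if later code appends month names that are not in the category list.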

Related

creating multi index from data grouped by month in Pandas

Consider this sample data:
Month Location Products Sales Profit
JAN 1 43 32 20
JAN 2 82 54 25
JAN 3 64 43 56
FEB 1 37 28 78
FEB 2 18 15 34
FEB 3 5 2 4
MAR 1 47 40 14
The multi-index transformation I am trying to achieve is this:
JAN FEB MAR
Location Products Sales Profit Products Sales Profit Products Sales Profit
1 43 32 20 37 28 78 47 40 14
2 82 54 25 18 15 34 null null null
3 64 43 56 5 2 4 null null null
I tried versions of this:
df.stack().to_frame().T
It put all the data into one row. So, that's not the goal.
I presume I am close in that it should be a stacking or unstacking, melting or unmelting, but my attempts have all resulted in data oatmeal at this point. Appreciate your time trying to solve this one.
You can use pivot with reorder_levels and sort_index():
df.pivot(index='Location',columns='Month').reorder_levels(order=[1,0],axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
In case you are interested, this answer elaborates on the difference between swaplevel and reorder_levels.
Use pivot:
>>> df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
To preserve the month order, convert the Month column to an ordered CategoricalDtype first:
df['Month'] = df['Month'].astype(pd.CategoricalDtype(df['Month'].unique(), ordered=True))
out = df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
print(out)
# Output:
Month JAN FEB MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 43.0 20.0 32.0 37.0 78.0 28.0 47.0 14.0 40.0
2 82.0 25.0 54.0 18.0 34.0 15.0 NaN NaN NaN
3 64.0 56.0 43.0 5.0 4.0 2.0 NaN NaN NaN
Update 2
To also force the order of the second-level columns, try passing the original column order as the categories:
df1 = df.set_index(['Month', 'Location'])
df1.columns = pd.CategoricalIndex(df1.columns, categories=df1.columns, ordered=True)
df1 = df1.unstack('Month').swaplevel(axis=1).sort_index(axis=1)
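Another way to get both levels in a fixed order, without relying on how the columns sort, is to reindex against an explicitly built MultiIndex. This is a sketch under the assumption that the month and value-column orders are known up front; the name `wanted` is made up for the example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Month": ["JAN", "JAN", "JAN", "FEB", "FEB", "FEB", "MAR"],
    "Location": [1, 2, 3, 1, 2, 3, 1],
    "Products": [43, 82, 64, 37, 18, 5, 47],
    "Sales": [32, 54, 43, 28, 15, 2, 40],
    "Profit": [20, 25, 56, 78, 34, 4, 14],
})

# Explicit column order: months first, then the value columns within each month.
wanted = pd.MultiIndex.from_product(
    [["JAN", "FEB", "MAR"], ["Products", "Sales", "Profit"]],
    names=["Month", None],
)

out = (
    df.pivot(index="Location", columns="Month")  # columns: (value, Month)
      .swaplevel(axis=1)                         # columns: (Month, value)
      .reindex(columns=wanted)                   # impose the explicit order
)
print(out)
```

Missing combinations, such as MAR for locations 2 and 3, come out as NaN, as in the outputs above.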

Assigning a day, week, and year column in Pandas in one line

I usually have to extract days, weeks and years into separate columns like this:
data['Day'] = data.SALESDATE.dt.isocalendar().day
data['Week'] = data.SALESDATE.dt.isocalendar().week
data['Year'] = data.SALESDATE.dt.isocalendar().year
But is there a way where I can assign all three in one nice line?
data[['Day', 'Week', 'Year']] = ....
For a one-line solution, use DataFrame.join and rename the columns if necessary:
rng = pd.date_range('2017-04-03', periods=10)
data = pd.DataFrame({'SALESDATE': rng, 'a': range(10)})
data = data.join(data.SALESDATE.dt.isocalendar().rename(columns=lambda x: x.title()))
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
Or list the new column names in the order isocalendar() returns them (year, week, day) and assign:
data[['Year', 'Week', 'Day']] = data.SALESDATE.dt.isocalendar()
print (data)
SALESDATE a Year Week Day
0 2017-04-03 0 2017 14 1
1 2017-04-04 1 2017 14 2
2 2017-04-05 2 2017 14 3
3 2017-04-06 3 2017 14 4
4 2017-04-07 4 2017 14 5
5 2017-04-08 5 2017 14 6
6 2017-04-09 6 2017 14 7
7 2017-04-10 7 2017 15 1
8 2017-04-11 8 2017 15 2
9 2017-04-12 9 2017 15 3
If you need the values in a different order, select the isocalendar() columns in that order:
data[['Day', 'Week', 'Year']] = data.SALESDATE.dt.isocalendar()[['day','week','year']]
print (data)
SALESDATE a Day Week Year
0 2017-04-03 0 1 14 2017
1 2017-04-04 1 2 14 2017
2 2017-04-05 2 3 14 2017
3 2017-04-06 3 4 14 2017
4 2017-04-07 4 5 14 2017
5 2017-04-08 5 6 14 2017
6 2017-04-09 6 7 14 2017
7 2017-04-10 7 1 15 2017
8 2017-04-11 8 2 15 2017
9 2017-04-12 9 3 15 2017

Shrinking multiple rows to one row

I want to shrink multiple rows in a dataframe to one row.
For example, if I have a dataframe like this:
name year project_name month week worklogs
Ahkam 2019 Proj1 1 1 10
Ahkam 2019 proj2 1 1 14
Ahkam 2019 proj3 1 2 6
Ahkam 2019 proj4 1 2 14
Naser 2019 Proj1 1 1 7
Naser 2019 proj2 1 1 8
Naser 2019 proj3 1 2 5
Naser 2019 proj4 1 2 3
and my output dataframe should be:
name year project_name month week worklogs
Ahkam 2019 NaN 1 1 24
Ahkam 2019 NaN 1 2 20
Naser 2019 NaN 1 1 15
Naser 2019 NaN 1 2 8
The value in the project_name column does not matter. The worklogs must be summed within each group of (name, year, month, week).
Thanks in advance.
Use DataFrameGroupBy.agg:
df = (df.groupby(['name', 'year', 'month', 'week'], as_index=False)
.agg({'project_name':'first', 'worklogs':'sum'}))
print(df)
name year month week project_name worklogs
0 Ahkam 2019 1 1 Proj1 24
1 Ahkam 2019 1 2 proj3 20
2 Naser 2019 1 1 Proj1 15
3 Naser 2019 1 2 proj3 8
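If the output should match the question exactly, with project_name left empty and the original column order kept, a variant along these lines may work. It is a sketch that rebuilds the sample data and uses named aggregation; filling project_name with NaN is an assumption based on the expected output shown in the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["Ahkam"] * 4 + ["Naser"] * 4,
    "year": [2019] * 8,
    "project_name": ["Proj1", "proj2", "proj3", "proj4"] * 2,
    "month": [1] * 8,
    "week": [1, 1, 2, 2] * 2,
    "worklogs": [10, 14, 6, 14, 7, 8, 5, 3],
})

out = (
    df.groupby(["name", "year", "month", "week"], as_index=False)
      .agg(worklogs=("worklogs", "sum"))  # sum worklogs per group
      .assign(project_name=np.nan)        # project_name is not meaningful after grouping
)
out = out[df.columns]                     # restore the original column order
print(out)
```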

How to separate columns by a given character width in pandas?

How can I separate a dataframe like the following into columns?
yr mon day Tmax Tmin pcp
2013 4 22 5.09-10.92 0.0
2013 4 23 2.77 -9.63 0.5
2013 4 24 0.28 -9.90 9.9
2013 4 25 0.76 -6.70 12.2
2013 4 26 -0.35 -9.48 0.0
2013 4 27 7.22-10.47 0.0
2013 4 28 4.19-10.78 0.0
As you can see, there is not always whitespace between Tmax and Tmin. Tmax and Tmin are each at most 6 characters wide; shorter values are padded with whitespace. I want to read this into a df and separate each column.
How can I separate the columns by a given character width?
Try this:
df = pd.read_fwf(filename)
It seems you need str.extract to pull out the floats and ints. This solution works if all the data is in one column, which is selected here by iloc:
pat=r"(\d+)\s*(\d+)\s*(\d+)\s*([-+]?\d+\.\d+|\d+)\s*([-+]?\d+\.\d+|\d+)\s*([-+]?\d+\.\d+|\d+)"
df1 = df.iloc[:, 0].str.extract(pat, expand=True)
df1.columns = ['year', 'mont','day','Tmax','Tmin','pcp']
print (df1)
year mont day Tmax Tmin pcp
0 2013 4 22 5.09 -10.92 0.0
1 2013 4 23 2.77 -9.63 0.5
2 2013 4 24 0.28 -9.90 9.9
3 2013 4 25 0.76 -6.70 12.2
4 2013 4 26 -0.35 -9.48 0.0
5 2013 4 27 7.22 -10.47 0.0
6 2013 4 28 4.19 -10.78 0.0
Another solution is to use read_fwf and specify colspecs:
import pandas as pd
from io import StringIO
temp=u"""yr mon day Tmax Tmin pcp
2013 4 22 5.09-10.92 0.0
2013 4 23 2.77 -9.63 0.5
2013 4 24 0.28 -9.90 9.9
2013 4 25 0.76 -6.70 12.2
2013 4 26 -0.35 -9.48 0.0
2013 4 27 7.22-10.47 0.0
2013 4 28 4.19-10.78 0.0 """
#after testing, replace 'StringIO(temp)' with 'filename.csv'
names=['year', 'mont','day','Tmax','Tmin','pcp']
colspecs = [(0, 6), (9, 10), (12, 14), (21, 26),(26,32),(34,38)]
df = pd.read_fwf(StringIO(temp),colspecs=colspecs, names=names, header=0)
print (df)
year mont day Tmax Tmin pcp
0 2013 4 22 5.09 -10.92 0.0
1 2013 4 23 2.77 -9.63 0.5
2 2013 4 24 0.28 -9.90 9.9
3 2013 4 25 0.76 -6.70 12.2
4 2013 4 26 -0.35 -9.48 0.0
5 2013 4 27 7.22 -10.47 0.0
6 2013 4 28 4.19 -10.78 0.0

how can we give index while calculating 3 days moving average

I have a data set like the one below and want to find the maximum of the 3-day moving average. I tried this code:
pd.rolling_mean(data['prec'], 3).max()
This code gives the maximum of the moving average, but without the date.
year month day prec
0 1981 1 1 1.5
1 1981 1 2 0.0
2 1981 1 3 0.0
3 1981 1 4 0.4
4 1981 1 5 0.0
5 1981 1 6 1.0
6 1981 1 7 1.9
7 1981 1 8 0.6
8 1981 1 9 3.7
9 1981 1 10 0.0
10 1981 1 11 0.0
11 1981 1 12 0.0
12 1981 1 13 0.0
13 1981 1 14 12.2
14 1981 1 15 1.7
15 1981 1 16 0.6
16 1981 1 17 0.9
17 1981 1 18 0.6
18 1981 1 19 0.4
19 1981 1 20 0.2
20 1981 1 21 1.4
21 1981 1 22 3.2
22 1981 1 23 0.0
The format I want is:
year month day prec
.... .. .. ...
Can anyone help solve this problem?
pd.rolling_mean and pd.rolling_max have been removed from pandas; use the rolling method instead and assign the result to a DataFrame column:
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
df['moving average'] = df['prec'].rolling(3).mean()
df['max of moving average'] = df['moving average'].rolling(3).max()
yields
In [32]: df
Out[32]:
year month day prec moving average max of moving average
0 1981 1 1 1.5 NaN NaN
1 1981 1 2 0.0 NaN NaN
2 1981 1 3 0.0 5.000000e-01 NaN
3 1981 1 4 0.4 1.333333e-01 NaN
4 1981 1 5 0.0 1.333333e-01 0.500000
5 1981 1 6 1.0 4.666667e-01 0.466667
6 1981 1 7 1.9 9.666667e-01 0.966667
7 1981 1 8 0.6 1.166667e+00 1.166667
8 1981 1 9 3.7 2.066667e+00 2.066667
9 1981 1 10 0.0 1.433333e+00 2.066667
10 1981 1 11 0.0 1.233333e+00 2.066667
11 1981 1 12 0.0 1.480297e-16 1.433333
12 1981 1 13 0.0 1.480297e-16 1.233333
13 1981 1 14 12.2 4.066667e+00 4.066667
14 1981 1 15 1.7 4.633333e+00 4.633333
15 1981 1 16 0.6 4.833333e+00 4.833333
16 1981 1 17 0.9 1.066667e+00 4.833333
17 1981 1 18 0.6 7.000000e-01 4.833333
18 1981 1 19 0.4 6.333333e-01 1.066667
19 1981 1 20 0.2 4.000000e-01 0.700000
20 1981 1 21 1.4 6.666667e-01 0.666667
21 1981 1 22 3.2 1.600000e+00 1.600000
22 1981 1 23 0.0 1.533333e+00 1.600000
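To get the date that goes with the overall maximum of the 3-day moving average, one option is to look up the row label with idxmax. This is a sketch that rebuilds the question's sample data; the column name `ma3` is made up for the example.

```python
import pandas as pd

# prec values copied from the question (1981-01-01 through 1981-01-23).
prec = [1.5, 0.0, 0.0, 0.4, 0.0, 1.0, 1.9, 0.6, 3.7, 0.0, 0.0, 0.0,
        0.0, 12.2, 1.7, 0.6, 0.9, 0.6, 0.4, 0.2, 1.4, 3.2, 0.0]
df = pd.DataFrame({
    "year": 1981,
    "month": 1,
    "day": range(1, len(prec) + 1),
    "prec": prec,
})

# 3-day moving average; the first two rows are NaN because the window is incomplete.
df["ma3"] = df["prec"].rolling(3).mean()

# idxmax skips NaN and returns the index label of the largest moving average.
best = df.loc[df["ma3"].idxmax()]
print(best[["year", "month", "day", "ma3"]])
```

The row returned is the last day of the window with the highest average, since rolling(3) labels each window by its final row.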