I have a Dataframe:
df =
A B C D
DATA_DATE
20170103 5.0 3.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 1.0 NaN 2.0 3.0
And I have a series
s =
DATA_DATE
20170103 4.0
20170104 0.0
20170105 2.2
I'd like to run an element-wise max() function and align s along the columns of df. In other words, I want to get
result =
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
What is the best way to do this? I've checked single column comparison and series to series comparison but haven't found an efficient way to run dataframe against a series.
Bonus: Not sure if the answer will be self-evident from above, but how to do it if I want to align s along the rows of df (assume dimensions match)?
Data:
In [135]: df
Out[135]:
A B C D
DATA_DATE
20170103 5.0 3.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 1.0 NaN 2.0 3.0
In [136]: s
Out[136]:
20170103 4.0
20170104 0.0
20170105 2.2
Name: DATA_DATE, dtype: float64
Solution:
In [66]: df.clip_lower(s, axis=0)
C:\Users\Max\Anaconda4\lib\site-packages\pandas\core\ops.py:1247: RuntimeWarning: invalid value encountered in greater_equal
result = op(x, y)
Out[66]:
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
We can use the following hack to get rid of the RuntimeWarning:
In [134]: df.fillna(np.inf).clip_lower(s, axis=0).replace(np.inf, np.nan)
Out[134]:
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
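On newer pandas releases clip_lower is no longer available (it was deprecated and later removed), so the same idea would go through DataFrame.clip. A minimal sketch, rebuilding the question's df and s by hand; col_min is a hypothetical per-column threshold added only to illustrate the "bonus" case of aligning along the other axis:

```python
import numpy as np
import pandas as pd

# Rebuild the data from the question
df = pd.DataFrame(
    {"A": [5.0, np.nan, 1.0], "B": [3.0, np.nan, np.nan],
     "C": [np.nan, np.nan, 2.0], "D": [np.nan, 1.0, 3.0]},
    index=pd.Index([20170103, 20170104, 20170105], name="DATA_DATE"),
)
s = pd.Series([4.0, 0.0, 2.2], index=df.index)

# axis=0 matches s against the row index: each row gets its own lower bound
result = df.clip(lower=s, axis=0)

# Bonus case (hypothetical thresholds): axis=1 matches the Series against the column labels
col_min = pd.Series({"A": 2.0, "B": 4.0, "C": 0.0, "D": 2.0})
by_col = df.clip(lower=col_min, axis=1)

print(result)
print(by_col)
```

NaN cells in df stay NaN in both results, which is what the desired output in the question asks for.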
This is called broadcasting and can be done as follows:
import numpy as np
np.maximum(df, s[:, None])
Out:
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
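Here, s[:, None] adds a new axis to s; s[:, np.newaxis] is equivalent. (Recent pandas versions no longer support multi-dimensional indexing on a Series, so there you would index the underlying array instead, e.g. s.to_numpy()[:, None].) The two operands can then be broadcast together: shapes (3, 4) and (3, 1) agree in the first dimension, and the second dimension of size 1 is stretched across the 4 columns.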
Note the difference between s and s[:, None]:
s.values
Out: array([ 4. , 0. , 2.2])
s[:, None]
Out:
array([[ 4. ],
[ 0. ],
[ 2.2]])
s.shape
Out: (3,)
s[:, None].shape
Out: (3, 1)
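The broadcasting approach can also be written against plain NumPy arrays, which sidesteps any version differences in how a Series handles [:, None]. This is only a sketch: it rebuilds the question's data by hand and assumes s is already in the same row order as df, because NumPy matches by position and does not align on the index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [5.0, np.nan, 1.0], "B": [3.0, np.nan, np.nan],
     "C": [np.nan, np.nan, 2.0], "D": [np.nan, 1.0, 3.0]},
    index=pd.Index([20170103, 20170104, 20170105], name="DATA_DATE"),
)
s = pd.Series([4.0, 0.0, 2.2], index=df.index)

# Turn s into a (3, 1) column so it broadcasts across the 4 columns of df
column = s.to_numpy()[:, None]

# np.maximum propagates NaN, so missing cells in df stay missing
values = np.maximum(df.to_numpy(), column)
result = pd.DataFrame(values, index=df.index, columns=df.columns)

print(column.shape)
print(result)
```

For the row-wise variant, a 1-D array of length 4 broadcasts against shape (3, 4) without adding an axis.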
An alternative would be:
df.mask(df.le(s, axis=0), s, axis=0)
Out:
A B C D
DATA_DATE
20170103 5.0 4.0 NaN NaN
20170104 NaN NaN NaN 1.0
20170105 2.2 NaN 2.2 3.0
This reads: compare df and s row by row. Where df is less than or equal to s, use s; otherwise keep df. Comparisons with NaN are False, so NaN cells in df are left untouched.
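While there may be better solutions for your problem, I believe this should give you what you need:
for c in df.columns:
    df[c] = pd.concat([df[c], s], axis=1).max(axis=1)
Note that max(axis=1) skips NaN by default, so cells that are NaN in df end up holding the value from s instead of staying NaN.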
I am having trouble appending later values from column C to column A within the same df using pandas. I have tried .append and .concat with ignore_index=True, still not working.
import pandas as pd
d = {'a':[1,2,3,None, None], 'b':[7,8,9, None, None], 'c':[None, None, None, 5, 6]}
df = pd.DataFrame(d)
df['a'] = df['a'].append(df['c'], ignore_index=True)
print(df)
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 NaN NaN 5.0
4 NaN NaN 6.0
Desired:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
Thank you for updating that, this is what I would do:
df['a'] = df['a'].fillna(df['c'])
print(df)
Output:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
I can show the NaN counts with df.isnull().sum() and get the maximum with df.isnull().sum().max(),
but can someone tell me how to get the name of the column with the most NaNs?
Thank you all!
Use Series.idxmax with DataFrame.loc to select the column with the most missing values:
df.loc[:, df.isnull().sum().idxmax()]
If you need to select all columns that tie for the most missing values, compare the Series with its max value:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,np.nan,5,np.nan,4],
'C':[7,8,9,np.nan,2,np.nan],
'D':[1,np.nan,5,7,1,0]
})
print (df)
A B C D
0 a 4.0 7.0 1.0
1 b 5.0 8.0 NaN
2 c NaN 9.0 5.0
3 d 5.0 NaN 7.0
4 e NaN 2.0 1.0
5 f 4.0 NaN 0.0
s = df.isnull().sum()
df = df.loc[:, s.eq(s.max())]
print (df)
B C
0 4.0 7.0
1 5.0 8.0
2 NaN 9.0
3 5.0 NaN
4 NaN 2.0
5 4.0 NaN
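Since the question asks for the column name rather than the column itself, a short sketch of getting just the label(s) may help. It reuses the sample frame from this answer; the variable names nan_counts, top_column and tied_columns are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("abcdef"),
    "B": [4, 5, np.nan, 5, np.nan, 4],
    "C": [7, 8, 9, np.nan, 2, np.nan],
    "D": [1, np.nan, 5, 7, 1, 0],
})

# Number of missing values per column
nan_counts = df.isnull().sum()

# idxmax returns the label of the first column that reaches the maximum
top_column = nan_counts.idxmax()

# All labels that tie for the maximum
tied_columns = nan_counts.index[nan_counts.eq(nan_counts.max())].tolist()

print(top_column)
print(tied_columns)
```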
I am not entirely new to data science, but rather novice with pandas.
My data looks like this:
Date Obser_Type
0 2001-01-05 A
1 2002-02-06 A
2 2002-02-06 B
3 2004-03-07 C
4 2005-04-08 B
5 2006-05-09 A
6 2007-06-10 C
7 2007-07-11 B
I would like to get the following output with the proportions for the different kinds of observations as of total (i.e. accumulated from the beginning up to and including the specified year) and within each year:
Year A_%_total B_%_total C_%_total A_%_Year B_%_Year C_%_Year
0 2001 100 0 0 100 0 0
1 2002 67 33 0 50 50 0
2 2004 50 25 25 0 0 100
3 2005 40 40 20 0 100 0
4 2006 50 33 17 100 0 0
5 2007 37.5 37.5 25 0 50 50
I tried various approaches involving groupby, multiindexing, count etc but to no avail. I got either errors or something unsatisfying.
After extensively digging Stack Overflow and the rest of the internet for days, I am stumped.
The medieval way would be a bucket of loops and ifs, but what is the proper way to do this?
I have made up suitable values for the numbers. I don't know the aggregation logic for each of them, so I decided to create a composition ratio for 'Obser_Type' and a composition ratio for 'year'.
1. Add a new column for year data
2. Aggregate and create DF
3. Create the composition ratio
4. Aggregate and create DF
5. Create the composition ratio
6. Combine the two DFs
import pandas as pd
import numpy as np
import io
data = '''
Date Obser_Type Value
0 2001-01-05 A 34
1 2002-02-06 A 39
2 2002-02-06 B 67
3 2004-03-07 C 20
4 2005-04-08 B 29
5 2006-05-09 A 10
6 2007-06-10 C 59
7 2007-07-11 B 43
'''
df = pd.read_csv(io.StringIO(data), sep=' ')
df['Date'] = pd.to_datetime(df['Date'])
df['yyyy'] = df['Date'].dt.year
df1 = df.groupby(['yyyy','Obser_Type'])['Value'].agg(sum).unstack().fillna(0)
df1 = df1.apply(lambda x: x/sum(x), axis=0).rename(columns={'A':'A_%_total','B':'B_%_total','C':'C_%_total'})
df2 = df.groupby(['Obser_Type','yyyy'])['Value'].agg(sum).unstack().fillna(0)
df2 = df2.apply(lambda x: x/sum(x), axis=0)
df2 = df2.unstack().unstack().rename(columns={'A':'A_%_Year','B':'B_%_Year','C':'C_%_Year'})
pd.merge(df1, df2, on='yyyy')
Obser_Type A_%_total B_%_total C_%_total A_%_Year B_%_Year C_%_Year
yyyy
2001 0.409639 0.000000 0.000000 1.000000 0.000000 0.000000
2002 0.469880 0.482014 0.000000 0.367925 0.632075 0.000000
2004 0.000000 0.000000 0.253165 0.000000 0.000000 1.000000
2005 0.000000 0.208633 0.000000 0.000000 1.000000 0.000000
2006 0.120482 0.000000 0.000000 1.000000 0.000000 0.000000
2007 0.000000 0.309353 0.746835 0.000000 0.421569 0.578431
Thank you very much for your answer. However, I probably should have made it clearer that the actual dataframe is much bigger and has many more types of observations than A B C, so listing them manually would be inconvenient. My scope here is just the statistics for the different types of observations, not their associated numerical values.
I was able to build something and would like to share:
# convert dates to datetimes
#
df['Date'] = pd.to_datetime(df['Date'])
# get years from the dates
#
df['Year'] = df.Date.dt.year
# get total number of observations per type of observation and year in tabular form
#
grouped = df.groupby(['Year', 'Obser_Type']).count().unstack(1)
Date
Obser_Type A B C
Year
2001 1.0 NaN NaN
2002 1.0 1.0 NaN
2004 NaN NaN 1.0
2005 NaN 1.0 NaN
2006 1.0 NaN NaN
2007 NaN 1.0 1.0
# sum total number of observations per type over all years
#
grouped.loc['Total_Obs_per_Type',:] = grouped.sum(axis=0)
Date
Obser_Type A B C
Year
2001 1.0 NaN NaN
2002 1.0 1.0 NaN
2004 NaN NaN 1.0
2005 NaN 1.0 NaN
2006 1.0 NaN NaN
2007 NaN 1.0 1.0
Total_Obs_per_Type 3.0 3.0 2.0
# at this point the columns have a multiindex
#
grouped.columns
MultiIndex([('Date', 'A'),
('Date', 'B'),
('Date', 'C')],
names=[None, 'Obser_Type'])
# i only needed the second layer which looks like this
#
grouped.columns.get_level_values(1)
Index(['A', 'B', 'C'], dtype='object', name='Obser_Type')
# so i flattened the index
#
grouped.columns = grouped.columns.get_level_values(1)
# now i can easily address the columns
#
grouped.columns
Index(['A', 'B', 'C'], dtype='object', name='Obser_Type')
# create list of columns with observation types
# this refers to columns "A B C"
#
types_list = grouped.columns.values.tolist()
# create list to later access the columns with the cumulative sum of observations per type
# this refers to columns "A_cum B_cum C_cum"
#
types_cum_list = []
# calculate cumulative sum for the different kinds of observations
#
for columnName in types_list:
    # create new columns with modified name and calculate for each type of observation the cumulative sum of observations
    #
    grouped[columnName+'_cum'] = grouped[columnName].cumsum()
    # put the new column names in the list of columns with cumulative sum of observations per type
    #
    types_cum_list.append(columnName+'_cum')
# this gives
Obser_Type A B C A_cum B_cum C_cum
Year
2001 1.0 NaN NaN 1.0 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 NaN
2004 NaN NaN 1.0 NaN NaN 1.0
2005 NaN 1.0 NaN NaN 2.0 NaN
2006 1.0 NaN NaN 3.0 NaN NaN
2007 NaN 1.0 1.0 NaN 3.0 2.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0
# create new column with total number of observations for all types of observation within a single year
#
grouped['All_Obs_Y'] = grouped.loc[:,types_list].sum(axis=1)
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0
2004 NaN NaN 1.0 NaN NaN 1.0 1.0
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0
2006 1.0 NaN NaN 3.0 NaN NaN 1.0
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0
# create new columns with cumulative sum of all kinds observations up to each year
#
grouped['All_Obs_Cum'] = grouped['All_Obs_Y'].cumsum()
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0 1.0
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0 3.0
2004 NaN NaN 1.0 NaN NaN 1.0 1.0 4.0
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0 5.0
2006 1.0 NaN NaN 3.0 NaN NaN 1.0 6.0
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0 8.0
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0
# create list of columns with the percentages each type of observation has within the observations of each year
# this refers to columns "A_%_Y B_%_Y C_%_Y"
#
types_percent_Y_list = []
# calculate the percentages each type of observation has within each year
#
for columnName in types_list:
    # calculate percentages
    #
    grouped[columnName+'_%_Y'] = grouped[columnName] / grouped['All_Obs_Y']
    # put the new columns names in list of columns with percentages each type of observation has within a year for later access
    #
    types_percent_Y_list.append(columnName+'_%_Y')
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y
Year
2001 1.0 NaN NaN 1.0 NaN NaN 1.0 1.0 1.000 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 NaN 2.0 3.0 0.500 0.500 NaN
2004 NaN NaN 1.0 NaN NaN 1.0 1.0 4.0 NaN NaN 1.00
2005 NaN 1.0 NaN NaN 2.0 NaN 1.0 5.0 NaN 1.000 NaN
2006 1.0 NaN NaN 3.0 NaN NaN 1.0 6.0 1.000 NaN NaN
2007 NaN 1.0 1.0 NaN 3.0 2.0 2.0 8.0 NaN 0.500 0.50
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25
# replace the NaNs in the types_cum columns, otherwise the calculation of the cumulative percentages in the next step would not work
#
# types_cum_list :
# if there is no observation for e.g. type B in the first year (2001) we put a count of 0 for that year,
# that is, in the first row.
# If there is no observation for type B in a later year (e.g. 2004) the cumulative count of Bs
# from the beginning up to that year does not change in that year, so we replace the NaN there with
# the last non-NaN value preceding it
#
# replace NaNs in first row by 0
#
for columnName in types_cum_list:
    grouped.update(grouped.iloc[:1][columnName].fillna(value=0))
# replace NaNs in later rows with preceding non-NaN value
#
for columnName in types_cum_list:
    grouped[columnName] = grouped[columnName].ffill()
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y
Year
2001 1.0 NaN NaN 1.0 0.0 0.0 1.0 1.0 1.000 NaN NaN
2002 1.0 1.0 NaN 2.0 1.0 0.0 2.0 3.0 0.500 0.500 NaN
2004 NaN NaN 1.0 2.0 1.0 1.0 1.0 4.0 NaN NaN 1.00
2005 NaN 1.0 NaN 2.0 2.0 1.0 1.0 5.0 NaN 1.000 NaN
2006 1.0 NaN NaN 3.0 2.0 1.0 1.0 6.0 1.000 NaN NaN
2007 NaN 1.0 1.0 3.0 3.0 2.0 2.0 8.0 NaN 0.500 0.50
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25
# create list of the columns with the cumulative percentages of the different observation types from the beginning up to that year
# this refers to columns "A_cum_% B_cum_% C_cum_%"
#
types_cum_percent_list = []
# calculate cumulative proportions of different types of observations from beginning up to each year
#
for columnName in types_cum_list:
    # if we had not taken care of the NaNs in the types_cum columns this would produce incorrect numbers
    #
    grouped[columnName+'_%'] = grouped[columnName] / grouped['All_Obs_Cum']
    # put the new columns in their respective list so we can access them conveniently later
    #
    types_cum_percent_list.append(columnName+'_%')
# this gives
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y A_cum_% B_cum_% C_cum_%
Year
2001 1.0 NaN NaN 1.0 0.0 0.0 1.0 1.0 1.000 NaN NaN 1.000000 0.000000 0.000000
2002 1.0 1.0 NaN 2.0 1.0 0.0 2.0 3.0 0.500 0.500 NaN 0.666667 0.333333 0.000000
2004 NaN NaN 1.0 2.0 1.0 1.0 1.0 4.0 NaN NaN 1.00 0.500000 0.250000 0.250000
2005 NaN 1.0 NaN 2.0 2.0 1.0 1.0 5.0 NaN 1.000 NaN 0.400000 0.400000 0.200000
2006 1.0 NaN NaN 3.0 2.0 1.0 1.0 6.0 1.000 NaN NaN 0.500000 0.333333 0.166667
2007 NaN 1.0 1.0 3.0 3.0 2.0 2.0 8.0 NaN 0.500 0.50 0.375000 0.375000 0.250000
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25 0.375000 0.375000 0.250000
# to conclude i replace the remaining NaNs to make plotting easier
# replace NaNs in columns in types_list
#
# if there is no observation for a type of observation in a year we put a count of 0 for that year
#
for columnName in types_list:
    grouped[columnName] = grouped[columnName].fillna(value=0)
# replace NaNs in columns in types_percent_Y_list
#
# if there is no observation for a type of observation in a year we put a percentage of 0 for that year
#
for columnName in types_percent_Y_list:
    grouped[columnName] = grouped[columnName].fillna(value=0)
Obser_Type A B C A_cum B_cum C_cum All_Obs_Y All_Obs_Cum A_%_Y B_%_Y C_%_Y A_cum_% B_cum_% C_cum_%
Year
2001 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.000 0.000 0.00 1.000000 0.000000 0.000000
2002 1.0 1.0 0.0 2.0 1.0 0.0 2.0 3.0 0.500 0.500 0.00 0.666667 0.333333 0.000000
2004 0.0 0.0 1.0 2.0 1.0 1.0 1.0 4.0 0.000 0.000 1.00 0.500000 0.250000 0.250000
2005 0.0 1.0 0.0 2.0 2.0 1.0 1.0 5.0 0.000 1.000 0.00 0.400000 0.400000 0.200000
2006 1.0 0.0 0.0 3.0 2.0 1.0 1.0 6.0 1.000 0.000 0.00 0.500000 0.333333 0.166667
2007 0.0 1.0 1.0 3.0 3.0 2.0 2.0 8.0 0.000 0.500 0.50 0.375000 0.375000 0.250000
Total_Obs_per_Type 3.0 3.0 2.0 6.0 6.0 4.0 8.0 16.0 0.375 0.375 0.25 0.375000 0.375000 0.250000
This has the functionality and flexibility I was looking for. But as I am still learning pandas, suggestions for improvement are appreciated.
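For comparison, the same counts and proportions can probably be had in a few lines with pd.crosstab, without listing the observation types by hand. This is only a sketch on the sample data from the question; the names counts, year_pct, cum_pct and the column suffixes are illustrative choices, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2001-01-05", "2002-02-06", "2002-02-06", "2004-03-07",
             "2005-04-08", "2006-05-09", "2007-06-10", "2007-07-11"],
    "Obser_Type": list("AABCBACB"),
})
df["Date"] = pd.to_datetime(df["Date"])

# Observations per year and type; missing combinations are counted as 0
counts = pd.crosstab(df["Date"].dt.year, df["Obser_Type"], rownames=["Year"])

# Share of each type within a single year
year_pct = counts.div(counts.sum(axis=1), axis=0)

# Share of each type accumulated from the first year up to and including each year
cum = counts.cumsum()
cum_pct = cum.div(cum.sum(axis=1), axis=0)

result = pd.concat(
    [cum_pct.add_suffix("_cum_%"), year_pct.add_suffix("_%_Y")], axis=1
)
print(result.round(3))
```

Because crosstab fills absent combinations with 0, no NaN handling is needed before the cumulative sums.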
I'm having difficulty in preventing pd.DataFrame.interpolate(method='index') from extrapolation.
Specifically:
>>> df = pd.DataFrame({1: range(1, 5), 2: range(2, 6), 3 : range(3, 7)}, index = [1, 2, 3, 4])
>>> df = df.reindex(range(6)).reindex(range(5), axis=1)
>>> df.iloc[3, 2] = np.nan
>>> df
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN 1.0 2.0 3.0 NaN
2 NaN 2.0 3.0 4.0 NaN
3 NaN 3.0 NaN 5.0 NaN
4 NaN 4.0 5.0 6.0 NaN
5 NaN NaN NaN NaN NaN
So df is just a block of data surrounded by NaN, with an interior missing point at iloc[3, 2]. Now when I apply .interpolate() (along either the horizontal or vertical axis), my goal is to have ONLY that interior point filled, leaving the surrounding NaNs untouched. But somehow I'm not able to get it to work.
I tried:
>>> df.interpolate(method='index', axis=0, limit_area='inside')
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN 1.0 2.0 3.0 NaN
2 NaN 2.0 3.0 4.0 NaN
3 NaN 3.0 4.0 5.0 NaN
4 NaN 4.0 5.0 6.0 NaN
5 NaN 4.0 5.0 6.0 NaN
Note the last row got filled, which is undesirable. (btw, I'd think the fill value should be linear extrapolation based on index, but it is just padding the last value, which is highly undesirable.)
I also tried combination of limit and limit_direction to no avail.
What would be the correct argument setting to get the desired result? Hopefully without some contorted masking (but that would work too). Thx.
Ok, turns out I'm running this on Pandas 0.21, hence the limit_area argument is silently failing. Looks like starting from 0.24 this is fixed. Case closed.
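For anyone checking their own setup, a small sketch of the behaviour to expect once limit_area is actually honoured (this assumes a pandas version that supports the argument): only the interior gap is filled, and the surrounding NaN border is left alone. The names filled and changed are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({1: range(1, 5), 2: range(2, 6), 3: range(3, 7)}, index=[1, 2, 3, 4])
df = df.reindex(range(6)).reindex(range(5), axis=1)
df.iloc[3, 2] = np.nan

# limit_area='inside' restricts filling to NaNs surrounded by valid values
filled = df.interpolate(method="index", axis=0, limit_area="inside")

# Cells that were NaN before and hold a value now
changed = filled.notna() & df.isna()

print(filled)
print(int(changed.sum().sum()))
```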
I have a pandas DataFrame compiled from some web data (for tennis games) that exhibits strange behaviour when summing across selected rows.
DataFrame:
In [178]: tdf.shape
Out[178]: (47028, 57)
In [201]: cols
Out[201]: ['L1', 'L2', 'L3', 'L4', 'L5', 'W1', 'W2', 'W3', 'W4', 'W5']
In [177]: tdf[cols].head()
Out[177]:
L1 L2 L3 L4 L5 W1 W2 W3 W4 W5
0 4.0 2 NaN NaN NaN 6.0 6 NaN NaN NaN
1 3.0 3 NaN NaN NaN 6.0 6 NaN NaN NaN
2 7.0 5 3 NaN NaN 6.0 7 6 NaN NaN
3 1.0 4 NaN NaN NaN 6.0 6 NaN NaN NaN
4 6.0 7 4 NaN NaN 7.0 5 6 NaN NaN
When I then compute the sum over the rows using tdf[cols].sum(axis=1), the result is wrong. From the above table, the sum for the 1st row should be 18.0, but it is reported as 10, as below:
In [180]: tdf[cols].sum(axis=1).head()
Out[180]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
The problem seems to be caused by a specific record (row 13771), because when I exclude this row, the sum is calculated correctly:
In [182]: tdf.iloc[:13771][cols].sum(axis=1).head()
Out[182]:
0 18.0
1 18.0
2 34.0
3 17.0
4 35.0
dtype: float64
whereas, including it:
In [183]: tdf.iloc[:13772][cols].sum(axis=1).head()
Out[183]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
Gives the wrong result for the entire column.
The offending record is as follows:
In [196]: tdf[cols].iloc[13771]
Out[196]:
L1 1
L2 1
L3 NaN
L4 NaN
L5 NaN
W1 6
W2 0
W3
W4 NaN
W5 NaN
Name: 13771, dtype: object
In [197]: tdf[cols].iloc[13771].W3
Out[197]: ' '
In [198]: type(tdf[cols].iloc[13771].W3)
Out[198]: str
I'm running the following versions:
In [192]: sys.version
Out[192]: '3.4.3 (default, Nov 17 2016, 01:08:31) \n[GCC 4.8.4]'
In [193]: pd.__version__
Out[193]: '0.19.2'
In [194]: np.__version__
Out[194]: '1.12.0'
Surely a single poorly formatted record should not influence the sum of other records? Is this a bug or am I doing something wrong?
Help much appreciated!
The problem is the empty string: it makes the dtype of column W3 object (obviously string), so sum omits the column.
Solutions:
Try replacing the problematic empty string value with NaN and then casting to float:
tdf.loc[13771, 'W3'] = np.nan
tdf.W3 = tdf.W3.astype(float)
Or try replacing all empty strings with NaN in the subset cols:
tdf[cols] = tdf[cols].replace({'':np.nan})
#if necessary
tdf[cols] = tdf[cols].astype(float)
Another solution is to use to_numeric on the problematic column, which replaces all non-numeric values with NaN:
tdf.W3 = pd.to_numeric(tdf.W3, errors='coerce')
Or apply it generally to the columns in cols:
tdf[cols] = tdf[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
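A small self-contained sketch of the to_numeric route follows. The two-row frame is made up for illustration (it is not the real tennis data) and uses a whitespace string like the one shown in the question's output, which errors='coerce' also turns into NaN.

```python
import numpy as np
import pandas as pd

# Made-up sample: one stray whitespace string makes W3 an object column
tdf = pd.DataFrame({
    "L1": [4.0, 1.0],
    "L2": [2, 1],
    "W1": [6.0, 6.0],
    "W2": [6, 0],
    "W3": [np.nan, " "],
})
cols = ["L1", "L2", "W1", "W2", "W3"]
dtype_before = str(tdf["W3"].dtype)

# Coerce every column to numeric; anything unparseable becomes NaN
cleaned = tdf[cols].apply(pd.to_numeric, errors="coerce")

# NaN is skipped by default, so every numeric column contributes to the row sums
row_sums = cleaned.sum(axis=1)

print(dtype_before, "->", cleaned["W3"].dtype)
print(row_sums)
```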