I have a dataframe which, for the sake of a minimal example, I simplified to this:
df = pd.DataFrame({'pnl':[+10,+23,+15,-5,+20],'style':['obs','obs','obs','bf','bf']})
I would like to do the following:
Group the dataframe by style
Count the positive entries of pnl and divide by the total number of entries for that style.
For example, style 'bf' has 2 entries, one positive and one negative, so the result is 1/2 = 0.5.
This should yield the following result:
style win_rate
bf 0.5
obs 1.0
dtype: float64
I thought of getting a list of the groups, iterating over them and building a new df... But that seems to me like an antipattern. I am pretty sure there is an easier / more pythonic solution.
Thanks.
You can group the boolean series df['pnl'].gt(0) by df['style'] and take the mean:
In [14]: df['pnl'].gt(0).groupby(df['style']).mean()
Out[14]:
style
bf 0.5
obs 1.0
Name: pnl, dtype: float64
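If you prefer to start from a conventional groupby on the frame, an equivalent spelling (a minimal sketch; it works because the mean of a boolean series is exactly the fraction of True values):
df.groupby('style')['pnl'].apply(lambda s: s.gt(0).mean())  # same result as above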
You can try pd.crosstab, which uses groupby under the hood, to get the percentage of both positive and non-positive entries:
pd.crosstab(df['style'], df['pnl'].gt(0), normalize='index')
Output:
pnl False True
style
bf 0.5 0.5
obs 0.0 1.0
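If you only need the win rate itself, you can keep just the True column of the crosstab; win_rate here is just an illustrative name:
win_rate = pd.crosstab(df['style'], df['pnl'].gt(0), normalize='index')[True]  # share of positive pnl per style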
I am trying to count common string characters in sequential rows of a pandas series using a user-defined function, and to write the output into a new column. I figured out the individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x = 'd7e'
y = '8e0d'
s = 0
for i in y:
    b = str(i)
    if b not in x:
        s += 0
    else:
        s += 1
print(s)
The right result for these particular strings is 2.
Note: when I wrap this in def func(x, y):, something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use the df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0
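The same pairwise count can also be written without assign, e.g. by zipping the column with its shifted copy. A rough equivalent sketch, assuming the frame from the question and the usual pd/np imports:
shifted = df['Code'].shift(1)
df['R1SB'] = [
    np.nan if pd.isna(prev) else sum(ch in prev for ch in cur)  # chars of current row found in previous row
    for cur, prev in zip(df['Code'], shifted)
]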
I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the NaNs with the averages of the columns they are in?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the nan's directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
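For instance, the dict form would be:
df.fillna(df.mean().to_dict())  # same result as passing the Series directly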
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply per column: fill each column's NaN values with that column's mean.
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit once you deal with a DataFrame of 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() over the entire DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), where I ran it selectively, i.e. only on the variables which had NaN values.
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:  # only the columns with NaN values
    df[i].fillna(df[i].mean(), inplace=True)
#---df.isnull().any(axis=0) gives a True/False flag per column (a boolean Series),
#---which, applied to df.columns[], identifies the columns with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
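A vectorised sketch in the same spirit as Code 2 (fill only the columns that actually contain NaNs), which also avoids the Python-level loop:
nan_cols = df.columns[df.isnull().any(axis=0)]
df[nan_cols] = df[nan_cols].fillna(df[nan_cols].mean())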
Apologies for the long answer! Hope this helps!
If you want to impute missing values with the mean, going column by column, then each column is imputed with only that column's mean. This might be a little more readable:
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
# To read data from csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values
# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
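If you prefer, the fit and transform steps can be collapsed into one call; fit_transform is part of the same scikit-learn API:
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])  # fit the means and apply them in one step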
Directly use df.fillna(df.mean()) to fill all the null values with the mean.
If you want to fill the null values of a single column with that column's mean, you can use this (here Item_Weight is the column name); it assigns the column back with its null values filled by the column's mean:
df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))
If you want to fill the null values with some string instead (here Outlet_Size is the column name):
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use the .fillna() method:
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new column called nr_item_ave to store the column with the NaN values replaced by the mean value of the column.
Be careful when using the mean: if you have outliers, the median is more robust.
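A median-based variant of the same fill, for the outlier case just mentioned:
median_value = df['nr_items'].median()
df['nr_item_ave'] = df['nr_items'].fillna(median_value)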
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than the previous answers for the mean, but it can be shorter if you want to replace nulls with some other per-column function.
Using scikit-learn's SimpleImputer class:
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])
Note: in recent versions, the missing_values parameter changed from the string 'NaN' to np.nan, and SimpleImputer no longer takes an axis argument.
I use this method to fill missing values with the average of each column:
fill_mean = lambda col: col.fillna(col.mean())
df = df.apply(fill_mean, axis=0)
You can also use value_counts to get the most frequent value per column. This works across different datatypes:
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.
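Note that value_counts().index[0] is simply a most frequent value (the mode), so a roughly equivalent spelling, differing only in how ties are resolved, is:
df = df.apply(lambda x: x.fillna(x.mode().iloc[0]))  # mode() drops NaN by default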
Currently, I have written the below function for percent change calculation:
function pct_change(input::AbstractVector{<:Number})::AbstractVector{Number}
    result = [NaN]
    for i in 2:length(input)
        push!(result, (input[i] - input[i-1]) / abs(input[i-1]))
    end
    return result
end
This works as expected, but I wanted to know whether there is a built-in function for Julia DataFrames similar to pandas' pct_change that I can use directly, or any other better way or improvement I can make to my function above.
This is a very specific function and is not provided in DataFrames.jl, but rather in TimeSeries.jl. Here is an example:
julia> using TimeSeries, Dates
julia> ta = TimeArray(Date(2018, 1, 1):Day(1):Date(2018, 12, 31), 1:365);
julia> percentchange(ta);
(there are some more options to what should be calculated)
The drawback is that it accepts only TimeArray objects and that it drops periods for which the percent change cannot be calculated (whereas pandas retains them).
If you want your custom definition, consider denoting the first value as missing rather than NaN. Also, your function will not preserve the most accurate representation of the numbers (e.g. if you wanted to use BigFloat, or exact calculations with the Rational type, they would be converted to Float64). Here are example alternative implementations that avoid these problems:
function pct_change(input::AbstractVector{<:Number})
    res = @view(input[2:end]) ./ @view(input[1:end-1]) .- 1
    [missing; res]
end
or
function pct_change(input::AbstractVector{<:Number})
    [i == 1 ? missing : (input[i] - input[i-1]) / input[i-1] for i in eachindex(input)]
end
And now you have in both cases:
julia> pct_change(1:10)
10-element Array{Union{Missing, Float64},1}:
missing
1.0
0.5
0.33333333333333326
0.25
0.19999999999999996
0.16666666666666674
0.1428571428571428
0.125
0.11111111111111116
julia> pct_change(big(1):10)
10-element Array{Union{Missing, BigFloat},1}:
missing
1.0
0.50
0.3333333333333333333333333333333333333333333333333333333333333333333333333333391
0.25
0.2000000000000000000000000000000000000000000000000000000000000000000000000000069
0.1666666666666666666666666666666666666666666666666666666666666666666666666666609
0.1428571428571428571428571428571428571428571428571428571428571428571428571428547
0.125
0.111111111111111111111111111111111111111111111111111111111111111111111111111113
julia> pct_change(1//1:10)
10-element Array{Union{Missing, Rational{Int64}},1}:
missing
1//1
1//2
1//3
1//4
1//5
1//6
1//7
1//8
1//9
with proper values returned.
For a pandas.DataFrame df:
min max mean
a 0.0 2.300000e+04 6.450098e+02
b 0.0 1.370000e+05 1.651754e+03
c 218.0 1.221550e+10 3.975262e+07
d 1.0 5.060000e+03 2.727708e+02
e 0.0 6.400000e+05 6.560047e+03
I would like to format the display so that numbers show in the ":,.2f" format (that is, ##,###.##), without exponents.
I tried df.style.format("{:,.2f}"), which gives <pandas.io.formats.style.Styler object at 0x108b86f60>, which I have no idea what to do with.
Any lead please?
Try this, young Pandas apprentice:
pd.options.display.float_format = '{:,.2f}'.format
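Note that this sets a global display option. The Styler from the question, by contrast, only renders as HTML (e.g. in a Jupyter notebook); in a plain console you can instead format at output time, for example:
print(df.to_string(float_format='{:,.2f}'.format))  # format floats only for this printout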
There is a method to plot Series histograms, but is there a function to retrieve the histogram counts to do further calculations on top of it?
I keep using numpy's functions to do this and converting the result to a DataFrame or Series when I need this. It would be nice to stay with pandas objects the whole time.
If your Series is discrete, you can use value_counts:
In [11]: s = pd.Series([1, 1, 2, 1, 2, 2, 3])
In [12]: s.value_counts()
Out[12]:
2 3
1 3
3 1
dtype: int64
You can see that s.hist() is essentially equivalent to s.value_counts().plot().
If it is of floats, an awfully hacky solution could be to use groupby, binning by hand (here into bins of width 0.5):
s.groupby(lambda i: np.floor(2*s[i]) / 2).count()
Since hist and value_counts don't use the Series' index, you may as well treat the Series like an ordinary array and use np.histogram directly. Then build a Series from the result.
In [4]: s = Series(randn(100))
In [5]: counts, bins = np.histogram(s)
In [6]: Series(counts, index=bins[:-1])
Out[6]:
-2.968575 1
-2.355032 4
-1.741488 5
-1.127944 26
-0.514401 23
0.099143 23
0.712686 12
1.326230 5
1.939773 0
2.553317 1
dtype: int32
This is a really convenient way to organize the result of a histogram for subsequent computation.
To index by the center of each bin instead of the left edge, you could use bins[:-1] + np.diff(bins)/2.
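For example, reusing counts and bins from the session above:
Series(counts, index=bins[:-1] + np.diff(bins) / 2)  # index by bin centers instead of left edges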
If you know the number of bins you want, you can use pandas' cut function, which is now accessible via value_counts. Using the same random example:
s = pd.Series(np.random.randn(100))
s.value_counts(bins=5)
Out[55]:
(-0.512, 0.311] 40
(0.311, 1.133] 25
(-1.335, -0.512] 14
(1.133, 1.956] 13
(-2.161, -1.335] 8
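If you would rather see the bins in numeric order than sorted by count, sort the index:
s.value_counts(bins=5).sort_index()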
Based on this answer from a related question, you can get the bin edges and histogram counts as follows (note the loop must iterate over the patches of the axes object returned by hist):
s = pd.Series(np.random.randn(100))
ax = s.hist()
for rect in ax.patches:  # each rect is one histogram bar
    ((x0, y0), (x1, y1)) = rect.get_bbox().get_points()
    print(((x0, y0), (x1, y1)))