Pandas: Create a new column with random values based on a condition

I've tried reading similar questions before asking, but I'm still stumped.
Any help is appreciated.
Input:
I have a pandas dataframe with a column labeled 'radon' which has values in the range: [0.5, 13.65]
Output:
I'd like to create a new column where all radon values equal to 0.5 are changed to a random value between 0.1 and 0.5.
I tried this:
df['radon_adj'] = np.where(df['radon']==0.5, random.uniform(0, 0.5), df.radon)
However, I get the same random number for all values of 0.5.
I tried this as well. It creates random numbers, but the else statement does not copy the original values:
df['radon_adj'] = df['radon'].apply(lambda x: random.uniform(0, 0.5) if x == 0.5 else df.radon)

One way would be to create all the random numbers you might need before you select them using where:
>>> df = pd.DataFrame({"radon": [0.5, 0.6, 0.5, 2, 4, 13]})
>>> df["radon_adj"] = df["radon"].where(df["radon"] != 0.5, np.random.uniform(0.1, 0.5, len(df)))
>>> df
   radon  radon_adj
0    0.5   0.428039
1    0.6   0.600000
2    0.5   0.385021
3    2.0   2.000000
4    4.0   4.000000
5   13.0  13.000000
You could be a little smarter and only generate as many random numbers as you're actually going to need, but it probably took longer for me to type this sentence than you'd save. (It takes me 9 ms to generate ~1M numbers.)
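A minimal sketch of that idea, assuming the same df as above: draw one random number per matching row and assign only those positions.
# Only generate as many random numbers as there are 0.5 rows.
mask = df["radon"] == 0.5
df["radon_adj"] = df["radon"]
df.loc[mask, "radon_adj"] = np.random.uniform(0.1, 0.5, mask.sum())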
Your apply approach would work too if you used x instead of df.radon:
>>> df['radon_adj'] = df['radon'].apply(lambda x: random.uniform(0.1, 0.5) if x == 0.5 else x)
>>> df
   radon  radon_adj
0    0.5   0.242991
1    0.6   0.600000
2    0.5   0.271968
3    2.0   2.000000
4    4.0   4.000000
5   13.0  13.000000
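For what it's worth, the original np.where attempt only fails because random.uniform returns a single scalar that gets broadcast to every matching row; passing an array of draws (the same idea as the where example above) should work too:
df["radon_adj"] = np.where(df["radon"] == 0.5,
                           np.random.uniform(0.1, 0.5, len(df)),
                           df["radon"])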

Related

Pandas: take the minimum of two operations on two dataframes, while preserving index

I'm a beginner with Pandas. I've got two dataframes df1 and df2 of three columns each, labelled by some index.
I would like to get a third dataframe whose entries are
min( df1-df2, 1-df1-df2 )
for each column, while preserving the index.
I don't know how to do this on all the three columns at once. If I try e.g. np.min( df1-df2, 1-df1-df2 ) I get TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed, whereas min( df1-df2, 1-df1+df2 ) gives ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I can't use apply because I've got more than one dataframe. Basically, I would like to use something like subtract, but with the ability to define my own function.
Example: consider these two dataframes
df0 = pd.DataFrame([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2], [0.1, 0.3, 0.9]],
                   index=[2, 1, 3], columns=['px', 'py', 'pz'])
In [4]: df0
Out[4]:
    px   py   pz
2  0.1  0.2  0.3
1  0.3  0.1  0.2
3  0.1  0.3  0.9
and
df1 = pd.DataFrame([[0.9, 0.1, 0.9], [0.1, 0.2, 0.1], [0.3, 0.1, 0.8]],
                   index=[3, 1, 2], columns=['px', 'py', 'pz'])
    px   py   pz
3  0.9  0.1  0.9
1  0.1  0.2  0.1
2  0.3  0.1  0.8
my desired output is a new dataframe df, made up of three columns 'px', 'py', 'pz', whose entries are:
for j in range(1, 4):
    dfx[j-1] = min(df0['px'][j] - df1['px'][j], 1 - df0['px'][j] + df1['px'][j])
for df['px'], and similarly for 'py' and 'pz'.
    px   py   pz
1  0.2 -0.1  0.1
2 -0.2  0.1 -0.5
3 -0.8  0.2  0.0
I hope it's clear now! Thanks in advance!
pandas is smart enough to match up the columns and index values for you in a vectorized way. If you're looping over a dataframe, you're probably doing it wrong.
m1 = df0 - df1
m2 = 1 - (df0 + df1)
# Take the values from m1 where they're less than
# the corresponding value in m2; otherwise, take m2:
out = m1[m1.lt(m2)].combine_first(m2)
# Another method: Combine our two calculated frames,
# groupby the index, and take the minimum.
out = pd.concat([m1, m2]).groupby(level=0).min()
print(out)
# Output:
    px   py   pz
1  0.2 -0.1  0.1
2 -0.2  0.1 -0.5
3 -0.8  0.2 -0.8
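As a side note (my assumption, not part of the answer above): because m1 and m2 come out with identical indexes and columns, numpy's element-wise minimum gives the same result in one call:
import numpy as np

# Element-wise minimum of two already-aligned dataframes.
out = np.minimum(m1, m2)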

Perform multiple math operations on columns in df

I want to do operation [(b-a)/a] * 100 on a dataframe [i.e., percentage change from a reference value]. where a is my first column and b is all other columns of the dataframe.
I tried below steps and it is working but very messy !!
df = pd.DataFrame({'obj1': [1, 3, 4],
                   'obj2': [6, 9, 10],
                   'obj3': [2, 6, 8]},
                  index=['circle', 'triangle', 'rectangle'])
# first we subtract the first column from all columns - as that is the starting point: b - a
df_aftersub = df.sub(pd.Series(df.iloc[:, [0]].squeeze()), axis='index')
# second we divide the result by the first column to get the change: (b - a) / a
df_change = df_aftersub.div(pd.Series(df.iloc[:, [0]].squeeze()), axis='index')
# third we multiply by 100 to get the percent change: (b - a) / a * 100
df_final = df_change * 100
df_final
output needed
           obj1   obj2   obj3
circle      0.0  500.0  100.0
triangle    0.0  200.0  100.0
rectangle   0.0  150.0  100.0
How can I do this in fewer lines of code and, if possible, with fewer temporary dataframes (while keeping it simple to understand)?
First subtract the first column with DataFrame.sub and divide by it with DataFrame.div, then multiply by 100:
s = df.iloc[:, 0]
df_final = df.sub(s, axis=0).div(s, axis=0).mul(100)
print(df_final)
           obj1   obj2   obj3
circle      0.0  500.0  100.0
triangle    0.0  200.0  100.0
rectangle   0.0  150.0  100.0
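If you want it even shorter, the same result can be written as a single expression; this is just the algebraic rearrangement (b - a) / a = b / a - 1, not a different method:
# (b - a) / a * 100  ==  (b / a - 1) * 100
df_final = (df.div(df.iloc[:, 0], axis=0) - 1) * 100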

Binarize a continuous feature with NaNs Python

I have a pandas dataframe of 4000 rows and 35 features, in which some of the continuous features contain missing values (NaNs). For example, one of them (with 46 missing values) has a very left-skewed distribution and I would like to binarize it by choosing a threshold of 1.5 below which I would like to set it as the class 0 and above or equal to 1.5 as the class 1.
Like: X_original = [0.01,2.80,-1.74,1.34,1.55], X_bin = [0, 1, 0, 0, 1].
I tried doing: dataframe["bin"] = (dataframe["original"] > 1.5).astype(int).
However, I noticed that the missing values (NaNs) disappeared and they are encoded in the 0 class.
How could I solve this problem?
To the best of my knowledge there is no way to keep the missing values through a comparison, but you can do the following:
import pandas as pd
import numpy as np
X_original = pd.Series([0.01,2.80,-1.74, np.nan,1.55])
X_bin = X_original > 1.5
X_bin[X_original.isna()] = np.NaN
print(X_bin)
Output
0 0.0
1 1.0
2 0.0
3 NaN
4 1.0
dtype: float64
To keep the column as Integer (and also nullable), do:
X_bin = X_bin.astype(pd.Int8Dtype())
print(X_bin)
Output
0 0
1 1
2 0
3 <NA>
4 1
dtype: Int8
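Assuming a reasonably recent pandas with nullable dtypes, the compare-then-restore-NaN steps can likely be collapsed with Series.mask (a sketch, not tested against every version):
# Compare, cast to nullable Int8, then blank out the originally-missing rows.
X_bin = (X_original >= 1.5).astype("Int8").mask(X_original.isna())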
The best way I found to handle this issue was to use a list comprehension:
dataframe["Bin"] = [0 if el < 1.5 else 1 if el >= 1.5 else np.NaN for el in dataframe["Original"]]
Then I convert the float numbers to strings, leaving the np.NaN values as they are:
dataframe["Bin"] = dataframe["Bin"].replace([0.0, 1.0], ["0", "1"])

pandas using qcut on series with fewer values than quantiles

I have thousands of series (rows of a DataFrame) that I need to apply qcut on. Periodically there will be a series (row) that has fewer values than the desired quantile (say, 1 value vs 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking into 2 quantiles (of the same boundary value)
>>> s.quantile([0.5, 1])
0.5 5.0
1.0 5.0
dtype: float64
But when I apply .qcut() with an integer value for number of quantiles an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:
0 (4.999, 5.000]
1 NaN
2 NaN
Ok, this is a workaround which might work for you.
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0 (4.999, 5.0]
1 NaN
2 NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
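Since the question mentions applying this to thousands of rows, one way to use the workaround generally is a small helper that caps the number of bins at the number of distinct non-NaN values (a sketch; safe_qcut and df are hypothetical names, not pandas API):
import numpy as np
import pandas as pd

def safe_qcut(row, q):
    # Never ask qcut for more bins than there are distinct non-NaN values.
    n = row.dropna().nunique()
    if n == 0:
        return pd.Series(np.nan, index=row.index)
    return pd.qcut(row, min(q, n), duplicates='drop')

binned = df.apply(safe_qcut, q=2, axis=1)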
You can try filling your object/numeric columns with an appropriate placeholder ('null' for strings and 0 for numeric):
# fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)
# fill object cols with 'null'
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7. This worked for me.

pandas interpolate barycentric backward

I have series where the first value can be NaN.
I tried interpolate('barycentric', limit_direction='both'), but it does not work if the first value is NaN:
pd.Series([np.NaN, 1.5, 2]).interpolate('barycentric', limit_direction='both')
0 NaN
1 1.5
2 2.0
dtype: float64
Is there a simple way to make it guess that the first number should be 1? Or is there a reason why it doesn't do it? Other methods and directions don't seem to work.
Try it with the limit parameter set in a way that fits your data, e.g.:
(pd
 .Series([np.NaN, 1.5, 2])
 .interpolate(method="barycentric", limit=3, limit_direction="both"))
0 1.0
1 1.5
2 2.0
dtype: float64
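If the goal is specifically to extrapolate that leading value (the 1.0 above), another option is to call SciPy's barycentric interpolator directly on the known points and evaluate it at the missing positions; a sketch, assuming SciPy is installed:
import numpy as np
import pandas as pd
from scipy.interpolate import barycentric_interpolate

s = pd.Series([np.NaN, 1.5, 2])
known, missing = s.dropna(), s.index[s.isna()]
# Evaluate the interpolant at the missing index positions (it extrapolates at the edges).
s[missing] = barycentric_interpolate(known.index.values, known.values, missing.values)
print(s)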