How to convert a whole number stored as float into a string? - pandas

I have a column of float64 dtype:
Numbers = [1, 2.3, 3, 4.5, 5]
I need to convert this column to object dtype for a comparison.
When I convert the column with df['Numbers'].astype(str), whole numbers get a decimal part,
e.g. 1 becomes 1.0, so I can't get the expected output.
The expected output is 1, 2.3, 3, 4.5, 5.
Can someone help?

Use a custom lambda function that tests each value with float.is_integer:
df = pd.DataFrame({'Numbers':[1, 2.3, 3, 4.5, 5]})
df['new'] = df['Numbers'].apply(lambda x: str(int(x)) if x.is_integer() else str(x))
Another alternative is to test whether the float equals its integer conversion:
df['new'] = df['Numbers'].apply(lambda x: str(int(x)) if x == int(x) else str(x))
print (df)
Numbers new
0 1.0 1
1 2.3 2.3
2 3.0 3
3 4.5 4.5
4 5.0 5
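One caveat if the column can contain NaN: the second variant calls int(x) before comparing, and int(float('nan')) raises ValueError, whereas float.is_integer() simply returns False for NaN. A minimal sketch that wraps the first variant in a named helper (format_number is a hypothetical name chosen for illustration, not a pandas function):

```python
import pandas as pd


def format_number(x: float) -> str:
    """Render whole-valued floats without a trailing '.0'."""
    # is_integer() is False for NaN and inf, so those fall through to str(x)
    # instead of hitting int(), which would raise on them.
    return str(int(x)) if float(x).is_integer() else str(x)


df = pd.DataFrame({"Numbers": [1, 2.3, 3, 4.5, 5]})
df["new"] = df["Numbers"].map(format_number)
print(df["new"].tolist())
```

With this helper a missing value ends up as the string 'nan'; whether that is acceptable depends on the comparison you run afterwards.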

Related

dataframe groupby nth with same behaviour as first and last

In a dataframe, when performing groupby['col'].first() we get the first non-NaN value in each column (same for last).
I am trying to get the second non-NaN value and cannot find how. The only relevant function I found is groupby['col'].nth(1), but it just gives me the second row, including any NaNs it contains. groupby['col'].nth(1, dropna='any') doesn't do the job either, since it skips whole rows that contain NaNs and doesn't check each column separately.
example:
df = pd.DataFrame({
'A': [1, 1, 1, 1, 1],
'B': [np.nan, 2, 3, 4, 5],
'C': [np.nan, np.nan, 3, 4, 5]
}, columns=['A', 'B', 'C'])
first() behaviour:
df.groupby('A').first().reset_index()
results with:
A B C
0 1 2.0 3.0
on the other hand:
df.groupby('A').nth(0, dropna='any').reset_index()
gives:
A B C
0 1 3.0 3.0
Is there a way to get the same behaviour of first/last in the nth function so I can apply it also for second or any nth item?
You can use the generic aggregate method to filter each series with notna and then pick the index you want, for example:
df.groupby('A').aggregate(lambda x: x.array[pd.notna(x)][0])
Produces:
B C
A
1 2.0 3.0
Changing the index to 1 to get the second notna value gives:
B C
A
1 3.0 4.0
Of course that lambda is a bit naive because it will raise an IndexError if the array isn't long enough. A function like this should work:
def nth_notna(n):
    def inner(series):
        a = series.array[pd.notna(series)]
        if len(a) - 1 < n:
            return np.nan
        return a[n]
    return inner
Then df.groupby('A').aggregate(nth_notna(3)) will produce:
B C
A
1 5.0 NaN
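For completeness, here is a self-contained sketch of the same idea that can be run as-is. It uses Series.dropna() and .iloc instead of indexing series.array, which is a small substitution on my part rather than the code from the answer, and it assumes n is non-negative:

```python
import numpy as np
import pandas as pd


def nth_notna(n):
    """Build an aggregator returning the n-th (0-based) non-NaN value per column."""
    def inner(series):
        valid = series.dropna()
        # Fall back to NaN when the group has fewer than n + 1 valid values.
        return valid.iloc[n] if n < len(valid) else np.nan
    return inner


df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [np.nan, 2, 3, 4, 5],
    "C": [np.nan, np.nan, 3, 4, 5],
})

second = df.groupby("A").aggregate(nth_notna(1))
fourth = df.groupby("A").aggregate(nth_notna(3))
print(second)
print(fourth)
```

Each column is filtered on its own, so B and C can pick their values from different rows, which is the behaviour first() and last() have.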

Pandas interpolation type when method='index'?

The pandas documentation indicates that when method='index', the numerical values of the index are used. However, I haven't found any indication of the underlying interpolation method employed. It looks like it uses linear interpolation. Can anyone confirm this definitively or point me to where this is stated in the documentation?
It turns out the documentation is a bit misleading. Readers are likely to take
‘index’, ‘values’: use the actual numerical values of the index.
to mean "fill the NaN values with the numerical values of the index", which is not correct. It should be read as "interpolate linearly, using the actual numerical values of the index as the x-coordinates".
The difference between method='linear' and method='index' in the source code of pandas.DataFrame.interpolate is mainly in the following code:
if method == "linear":
    # prior default
    index = np.arange(len(obj.index))
    index = Index(index)
else:
    index = obj.index
So if you use the default RangeIndex as the index of the dataframe, the results of method='linear' and method='index' will be the same. If you specify a different index, however, the results will differ, as the following example shows:
import pandas as pd
import numpy as np
d = {'val': [1, np.nan, 3]}
df0 = pd.DataFrame(d)
df1 = pd.DataFrame(d, [0, 1, 6])
print("df0:\nmethod_index:\n{}\nmethod_linear:\n{}\n".format(df0.interpolate(method='index'), df0.interpolate(method='linear')))
print("df1:\nmethod_index:\n{}\nmethod_linear:\n{}\n".format(df1.interpolate(method='index'), df1.interpolate(method='linear')))
Outputs:
df0:
method_index:
val
0 1.0
1 2.0
2 3.0
method_linear:
val
0 1.0
1 2.0
2 3.0
df1:
method_index:
val
0 1.000000
1 1.333333
6 3.000000
method_linear:
val
0 1.0
1 2.0
6 3.0
As you can see, with index=[0, 1, 6] and val=[1.0, NaN, 3.0], the value interpolated at index 1 is 1.0 + (3.0 - 1.0) * (1 - 0) / (6 - 0) = 1.333333
Following the call path through the pandas source code (generic.py -> managers.py -> blocks.py -> missing.py), we can find the implementation that interpolates linearly using the actual numerical values of the index:
NP_METHODS = ["linear", "time", "index", "values"]
if method in NP_METHODS:
    # np.interp requires sorted X values, #21037
    indexer = np.argsort(inds[valid])
    result[invalid] = np.interp(
        inds[invalid], inds[valid][indexer], yvalues[valid][indexer]
    )
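To check this against the example, the np.interp call can be reproduced by hand. This is only a sketch of the equivalence for a single gap, not the pandas implementation itself; the variable names are mine:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=[0, 1, 6])

# method='index' uses the index labels 0, 1, 6 as x-coordinates.
by_index = s.interpolate(method="index")

# method='linear' ignores the labels and treats the points as equally spaced.
by_position = s.interpolate(method="linear")

# The same number computed directly: x=1 between the known points (0, 1.0) and (6, 3.0).
manual = np.interp(1, [0, 6], [1.0, 3.0])

print(by_index.loc[1], by_position.loc[1], manual)
```

The first and last values printed agree (about 1.333333), while the positional variant gives 2.0.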

Converting only specific columns in dataframe to numeric

I currently have a dataframe with n numeric-valued columns and three columns that hold datetime and string values. I want to convert all the columns except those three to numeric values, but am not sure what the best method is. Below is a sample dataframe (simplified):
df2 = pd.DataFrame(np.array([[1, '5-4-2016', 10], [1, '5-5-2016', 5], [2, '5-4-2016', 10],
                             [2, '5-5-2016', 7], [5, '5-4-2016', 8]]),
                   columns=['ID', 'Date', 'Number'])
I tried using something like (below) but was unsuccessful.
exclude = ['Date']
df = df.drop(exclude, 1).apply(pd.to_numeric, errors='coerce').combine_first(df)
The expected output: (essentially, the datatype of fields 'ID' and 'Number' change to floats while 'Date' stays the same)
ID Date Number
0 1.0 5-4-2016 10.0
1 1.0 5-5-2016 5.0
2 2.0 5-4-2016 10.0
3 2.0 5-5-2016 7.0
4 5.0 5-4-2016 8.0
Have you tried Series.astype()?
df['ID'] = df['ID'].astype(float)
df['Number'] = df['Number'].astype(float)
or for all columns besides date:
for col in [x for x in df.columns if x != 'Date']:
    df[col] = df[col].astype(float)
or
df[[x for x in df.columns if x != 'Date']].transform(lambda x: x.astype(float), axis=1)
You need to call to_numeric with the option downcast='float' if you want the result to be float; otherwise it will be int. You also need to join the result back to the non-converted columns of the original df2:
df2[exclude].join(df2.drop(columns=exclude).apply(pd.to_numeric, downcast='float', errors='coerce'))
Out[1815]:
Date ID Number
0 5-4-2016 1.0 10.0
1 5-5-2016 1.0 5.0
2 5-4-2016 2.0 10.0
3 5-5-2016 2.0 7.0
4 5-4-2016 5.0 8.0
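If you want to keep the original column order and get plain float64 (downcast='float' may give float32), a loop over the non-excluded columns is another option. A minimal sketch, assuming the same exclude list as in the question; the sample frame is built from strings directly, because np.array on mixed values turns every element into a string anyway:

```python
import pandas as pd

df2 = pd.DataFrame(
    [["1", "5-4-2016", "10"], ["1", "5-5-2016", "5"], ["2", "5-4-2016", "10"],
     ["2", "5-5-2016", "7"], ["5", "5-4-2016", "8"]],
    columns=["ID", "Date", "Number"],
)

exclude = ["Date"]

# Convert every column that is not excluded; unparseable values become NaN.
for col in df2.columns.difference(exclude):
    df2[col] = pd.to_numeric(df2[col], errors="coerce").astype("float64")

print(df2.dtypes)
print(df2)
```

Assigning column by column leaves Date untouched and in its original position, so no join is needed afterwards.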

pandas interpolate barycentric backward

I have a series where the first value can be NaN.
I tried interpolate('barycentric', limit_direction='both'), but it does not work if the first value is NaN:
pd.Series([np.nan, 1.5, 2]).interpolate('barycentric', limit_direction='both')
0 NaN
1 1.5
2 2.0
dtype: float64
Is there a simple way to make it guess that the first number should be 1? Or is there a reason why it doesn't do it? Other methods and directions don't seem to work.
Try it with limit parameter in a way that fits your data, e.g.:
(pd
.Series([np.nan, 1.5, 2])
.interpolate(method = "barycentric", limit = 3, limit_direction = "both"))
0 1.0
1 1.5
2 2.0
dtype: float64
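If that does not help with your pandas version, here is an alternative sketch that avoids interpolate altogether. Barycentric interpolation evaluates the unique polynomial through the known points, so the same polynomial can be fitted with np.polyfit and evaluated at the missing positions, including ones before the first valid value. Note that this swaps NumPy's polynomial fit in for the barycentric routine, assumes the positions 0, 1, 2, ... as x-coordinates, and is only sensible for a handful of points, since high-degree fits are numerically unstable:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.5, 2.0])

x = np.arange(len(s), dtype=float)   # positions used as x-coordinates
known = s.notna().to_numpy()         # boolean mask of the valid points

# A polynomial of degree (number of known points - 1) passes through all of them.
coeffs = np.polyfit(x[known], s.to_numpy()[known], deg=int(known.sum()) - 1)

filled = s.copy()
filled[~known] = np.polyval(coeffs, x[~known])
print(filled)
```

With the two known points (1, 1.5) and (2, 2.0) the fitted line has slope 0.5, so the value extrapolated at position 0 is 1.0 up to floating-point rounding.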

Pandas: Create a new column with random values based on conditional

I've tried reading similar questions before asking, but I'm still stumped.
Any help is appreciated.
Input:
I have a pandas dataframe with a column labeled 'radon' which has values in the range: [0.5, 13.65]
Output:
I'd like to create a new column where all radon values equal to 0.5 are changed to a random value between 0.1 and 0.5
I tried this:
df['radon_adj'] = np.where(df['radon']==0.5, random.uniform(0, 0.5), df.radon)
However, I get the same random number for all values of 0.5.
I tried this as well. It creates random numbers, but the else branch does not copy the original values:
df['radon_adj'] = df['radon'].apply(lambda x: random.uniform(0, 0.5) if x == 0.5 else df.radon)
One way would be to create all the random numbers you might need before you select them using where:
>>> df = pd.DataFrame({"radon": [0.5, 0.6, 0.5, 2, 4, 13]})
>>> df["radon_adj"] = df["radon"].where(df["radon"] != 0.5, np.random.uniform(0.1, 0.5, len(df)))
>>> df
radon radon_adj
0 0.5 0.428039
1 0.6 0.600000
2 0.5 0.385021
3 2.0 2.000000
4 4.0 4.000000
5 13.0 13.000000
You could be a little smarter and only generate as many random numbers as you're actually going to need, but it probably took longer for me to type this sentence than you'd save. (It takes me 9 ms to generate ~1M numbers.)
Your apply approach would work too if you used x instead of df.radon:
>>> df['radon_adj'] = df['radon'].apply(lambda x: random.uniform(0.1, 0.5) if x == 0.5 else x)
>>> df
radon radon_adj
0 0.5 0.242991
1 0.6 0.600000
2 0.5 0.271968
3 2.0 2.000000
4 4.0 4.000000
5 13.0 13.000000
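The "generate only as many random numbers as you need" variant mentioned above could look roughly like this. It is a sketch using a boolean mask and NumPy's Generator API; the seed is there only to make the run repeatable and is my assumption, not part of the question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded only for reproducibility

df = pd.DataFrame({"radon": [0.5, 0.6, 0.5, 2, 4, 13]})

# Rows that need a replacement value.
mask = df["radon"] == 0.5

# Start from a copy of the original column, then overwrite only the masked rows,
# drawing exactly one random number per masked row.
df["radon_adj"] = df["radon"].copy()
df.loc[mask, "radon_adj"] = rng.uniform(0.1, 0.5, size=int(mask.sum()))

print(df)
```

Because size matches the number of masked rows, each 0.5 gets its own draw and all other rows keep their original value.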