Converting only specific columns in dataframe to numeric - pandas

I currently have a dataframe with n numeric columns plus three columns holding datetime and string values. I want to convert all but those three columns to numeric values, but I'm not sure what the best method is. Below is a simplified sample dataframe:
df2 = pd.DataFrame(np.array([[1, '5-4-2016', 10], [1, '5-5-2016', 5],
                             [2, '5-4-2016', 10], [2, '5-5-2016', 7],
                             [5, '5-4-2016', 8]]),
                   columns=['ID', 'Date', 'Number'])
I tried using something like (below) but was unsuccessful.
exclude = ['Date']
df = df.drop(exclude, axis=1).apply(pd.to_numeric, errors='coerce').combine_first(df)
The expected output (essentially, the dtypes of 'ID' and 'Number' change to float while 'Date' stays the same):
    ID      Date  Number
0  1.0  5-4-2016    10.0
1  1.0  5-5-2016     5.0
2  2.0  5-4-2016    10.0
3  2.0  5-5-2016     7.0
4  5.0  5-4-2016     8.0

Have you tried Series.astype()?
df['ID'] = df['ID'].astype(float)
df['Number'] = df['Number'].astype(float)
or for all columns besides date:
for col in [x for x in df.columns if x != 'Date']:
    df[col] = df[col].astype(float)
or
cols = [x for x in df.columns if x != 'Date']
df[cols] = df[cols].transform(lambda x: x.astype(float))
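As a side note, astype also accepts a dict mapping columns to dtypes, so the loop can be collapsed into one call (a minor variation on the above, assuming every non-Date column is castable to float):
df = df.astype({col: float for col in df.columns if col != 'Date'})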

You need to call to_numeric with downcast='float' if you want floats; otherwise the integer-like strings become int. You also need to join the result back to the non-converted columns of the original df2:
df2[exclude].join(df2.drop(exclude, axis=1).apply(pd.to_numeric, downcast='float', errors='coerce'))
Out[1815]:
       Date   ID  Number
0  5-4-2016  1.0    10.0
1  5-5-2016  1.0     5.0
2  5-4-2016  2.0    10.0
3  5-5-2016  2.0     7.0
4  5-4-2016  5.0     8.0
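Note that the join moves the excluded columns to the front. If the original column order matters, you can reindex afterwards; a small addition that is not part of the original answer:
out = df2[exclude].join(df2.drop(exclude, axis=1).apply(pd.to_numeric, downcast='float', errors='coerce'))
out = out[df2.columns]  # restore the original ID, Date, Number order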

Related

How to remove all types of NaN from the dataframe?

I have a data frame, shown below. I want to merge the column values into one column, excluding NaN values.
Image 1:
When I am using the code
df3["Generation"] = df3[df3.columns[5:]].apply(lambda x: ','.join(x.dropna()), axis=1)
I am getting results like this.
Image 2:
I suspect that these columns are of type string; thus, they are not affected by x.dropna().
Here is a small example I made that reproduces results similar to yours:
df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2], 'b': [1, 1, np.nan, None]}).astype(str)
df.apply(lambda x: ','.join(x.dropna()), axis=1)
0    nan,1.0
1    nan,1.0
2    1.0,nan
3    2.0,nan
dtype: object
-----------------
# a simple string comparison solves the problem
df.apply(lambda x: ','.join(x[x != 'nan']), axis=1)
0    1.0
1    1.0
2    1.0
3    2.0
dtype: object
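If the columns can mix real NaN values with the literal string 'nan' (an assumption; the question does not say which you have), it is safer to guard against both in one pass:
df.apply(lambda x: ','.join(v for v in x if pd.notna(v) and v != 'nan'), axis=1)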

Replace NaN values of pandas.DataFrame based on values of other columns (according to formula)

Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace all NaN values in a with the corresponding values of b**2, and then set b to NaN (i.e. shift the NaN values over while applying an operation).
Desired result:
     a    b
0    1    5
1  100  NaN
2    3   15
How is it possible with pandas?
You can get the rows you want to change using df['a'].isnull(). Then you can use that to update the columns with loc.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
df.loc[change, ['a', 'b']] = [df.loc[change, 'b']**2, np.nan]
print(df)
Note that the change variable only exists to avoid repeating df['a'].isnull() on both sides of the assignment. You could inline that expression to do this in one line, but I think that looks cluttered.
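For reference, the inlined one-liner described above would look like this (identical result, just denser):
df.loc[df['a'].isnull(), ['a', 'b']] = [df.loc[df['a'].isnull(), 'b']**2, np.nan]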
Result:
       a     b
0    1.0   5.0
1  100.0   NaN
2    3.0  15.0

How to convert a whole number stored as a float into a string?

I have a column of float64 dtype:
Numbers = [1, 2.3, 3, 4.5, 5]
Now I have to convert this column to the object data type for a comparison.
When I convert the column using df['Numbers'].astype(str), whole numbers gain a decimal: e.g. 1 becomes 1.0, which is not the output I expect.
The expected output is 1, 2.3, 3, 4.5, 5.
Can someone help?
Use a custom lambda function that tests float.is_integer:
df = pd.DataFrame({'Numbers':[1, 2.3, 3, 4.5, 5]})
df['new'] = df['Numbers'].apply(lambda x: str(int(x)) if x.is_integer() else str(x))
An alternative is to test whether the integer equals the float:
df['new'] = df['Numbers'].apply(lambda x: str(int(x)) if x == int(x) else str(x))
print (df)
   Numbers  new
0      1.0    1
1      2.3  2.3
2      3.0    3
3      4.5  4.5
4      5.0    5
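Another option, not used in the answer above, is general-format string formatting, which drops a trailing .0 automatically (note that it switches to scientific notation for very large or very small values):
df['new'] = df['Numbers'].map('{:g}'.format)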

pandas using qcut on series with fewer values than quantiles

I have thousands of series (rows of a DataFrame) that I need to apply qcut to. Periodically there will be a series (row) with fewer values than the desired number of quantiles (say, 1 value vs 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking it into 2 quantiles (with the same boundary value):
>>> s.quantile([0.5, 1])
0.5    5.0
1.0    5.0
dtype: float64
But when I apply .qcut() with an integer value for the number of quantiles, an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output has the 5.0 assigned to a single bin, with the NaNs preserved:
0    (4.999, 5.000]
1               NaN
2               NaN
OK, here is a workaround that might work for you:
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0    (4.999, 5.0]
1             NaN
2             NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
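To apply this across thousands of rows, you could wrap the idea in a small helper (safe_qcut is a hypothetical name; this is only a sketch of the workaround above):
import numpy as np
import pandas as pd

def safe_qcut(s, q):
    # cap the number of bins at the number of non-NaN values,
    # falling back to 1 bin (an all-NaN row may still need special handling)
    n = max(len(s.dropna()), 1)
    return pd.qcut(s, min(q, n), duplicates='drop')

safe_qcut(pd.Series([5, np.nan, np.nan]), 2)  # one bin, NaNs preserved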
You can try filling your object/numeric columns with an appropriate placeholder ('null' for strings and 0 for numerics):
# fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)
# fill object cols with 'null'
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7; this worked for me.

pandas interpolate barycentric backward

I have series where the first value can be NaN.
I tried interpolate('barycentric', limit_direction='both'), but it doesn't work when the first value is NaN:
pd.Series([np.nan, 1.5, 2]).interpolate('barycentric', limit_direction='both')
0    NaN
1    1.5
2    2.0
dtype: float64
Is there a simple way to make it guess that the first number should be 1? Or is there a reason why it doesn't? Other methods and limit directions don't seem to work either.
Try it with the limit parameter set in a way that fits your data, e.g.:
(pd
 .Series([np.nan, 1.5, 2])
 .interpolate(method="barycentric", limit=3, limit_direction="both"))
0    1.0
1    1.5
2    2.0
dtype: float64
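If you would rather make the extrapolation explicit instead of relying on pandas' limit handling, SciPy's barycentric interpolator can be called directly on the known points (a sketch, assuming a numeric RangeIndex):
import numpy as np
import pandas as pd
from scipy.interpolate import barycentric_interpolate

s = pd.Series([np.nan, 1.5, 2])
known = s.dropna()
missing = s.index[s.isna()]
# fit the interpolating polynomial through the known points and
# evaluate it at the missing positions (extrapolates to 1.0 at index 0)
s[missing] = barycentric_interpolate(known.index.values, known.values, missing.values)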