I have problems replacing inf and -inf with NaN in pandas

I found a code on the web to replace inf and -inf with np.nan, however it did not work on my computer.
df = pd.DataFrame({"A" : [4.6, 5., np.inf]})
new_dataframe = a_dataframe.replace([np.inf, -np.inf], np.nan)
df
My output
A
0 4.6
1 5.0
2 inf
Does somebody know a solution?

Import pandas and numpy, then assign the result of df.replace([np.inf, -np.inf], np.nan) back to the dataframe. Calling df.replace([np.inf, -np.inf], np.nan) on its own does not update the dataframe; the result needs to be assigned with = for the change to take effect.
Also, for some reason the code you provided uses new_dataframe and a_dataframe, which have nothing to do with df. Try the code below.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [4.6, 5., np.inf]})
df = df.replace([np.inf, -np.inf], np.nan)
print(df)
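If you prefer not to reassign, replace also accepts inplace=True, which mutates the dataframe directly. A minimal sketch of that variant:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [4.6, 5., np.inf]})
# Modify df in place instead of assigning the result back
df.replace([np.inf, -np.inf], np.nan, inplace=True)
print(df)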

Related

Save a pandas dataframe containing numpy arrays

I have a dataframe with a column full of numpy arrays.
A B C
0 1.0 0.000000 [[0. 1.],[0. 1.]]
1 2.0 0.000000 [[85. 1.],[52. 0.]]
2 3.0 0.000000 [[5. 1.],[0. 0.]]
3 1.0 3.333333 [[0. 1.],[41. 0.]]
4 2.0 3.333333 [[85. 1.],[0. 21.]]
The problem is that when I save it as a CSV file and load it in another Python file, the numpy column is read back as text.
I tried to convert the column with np.fromstring() or np.loadtxt(), but it doesn't work.
Example of an array after pd.read_csv():
"[[ 85. 1.]\n [ 52. 0. ]]"
Thanks
You can try .to_json():
output = pd.DataFrame([
    {'a': 1, 'b': np.arange(4)},
    {'a': 2, 'b': np.arange(5)}
]).to_json()
But you will only get lists back when reloading with
df = pd.read_json(output)
Turn them back into numpy arrays with:
df['b'] = [np.array(v) for v in df['b']]
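Putting those pieces together, a self-contained sketch of the round trip (the JSON string is handed to pd.read_json via StringIO, since recent pandas versions deprecate passing literal JSON strings directly):
import numpy as np
import pandas as pd
from io import StringIO
# Serialize a dataframe whose 'b' column holds numpy arrays
output = pd.DataFrame([
    {'a': 1, 'b': np.arange(4)},
    {'a': 2, 'b': np.arange(5)}
]).to_json()
# Reload: 'b' comes back as plain Python lists
df = pd.read_json(StringIO(output))
# Convert the lists back into numpy arrays
df['b'] = [np.array(v) for v in df['b']]
print(type(df['b'][0]))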
The code below should work. I used another question to solve it; there's a bit more explanation in there: Convert a string with brackets to numpy array
import pandas as pd
import numpy as np
from ast import literal_eval
# Recreating DataFrame
data = np.array([0, 1, 0, 1, 85, 1, 52, 0, 5, 1, 0, 0, 0, 1, 41, 0, 85, 1, 0, 21], dtype='float')
data = data.reshape((5,2,2))
write_df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 1.0, 2.0],
                         'B': [0, 0, 0, 3+1/3, 3+1/3],
                         'C': data.tolist()})
# Saving DataFrame to CSV
fpath = 'D:\\Data\\test.csv'
write_df.to_csv(fpath)
# Reading DataFrame from CSV
read_df = pd.read_csv(fpath)
# literal_eval converts the string to a nested list
# np.array can convert this nested list directly into an array
def makeArray(rawdata):
    parsed = literal_eval(rawdata)
    return np.array(parsed)
# Applying the function row-wise, there could be a more efficient way
read_df['C'] = read_df['C'].apply(makeArray)
Here is an ugly solution.
import pandas as pd
import numpy as np
### Create dataframe
a = [1.0, 2.0, 3.0, 1.0, 2.0]
b = [0.000000,0.000000,0.000000,3.333333,3.333333]
c = [np.array([[0., 1.], [0., 1.]]),
     np.array([[85., 1.], [52., 0.]]),
     np.array([[5., 1.], [0., 0.]]),
     np.array([[0., 1.], [41., 0.]]),
     np.array([[85., 1.], [0., 21.]])]
df = pd.DataFrame({"a":a,"b":b,"c":c})
#### Save to csv
df.to_csv("to_trash.csv")
df = pd.read_csv("to_trash.csv")
### Bad string manipulation that could be done better with regex
df["c"] = ("np.array(" + (df
                          .c
                          .str.split()
                          .str.join(' ')
                          .str.replace(" ", ",")
                          .str.replace(",,", ",")
                          .str.replace("[,", "[", regex=False)
                          ) + ")").apply(lambda x: eval(x))
The best solution I found is using Pickle files.
You can save your dataframe as a pickle file.
import pickle
import cv2
import pandas as pd
img = cv2.imread('img1.jpg')
# Wrap the array in a list so the whole image is stored in a single object cell
data = pd.DataFrame({'img': [img]})
data.to_pickle('dataset.pkl')
Then you can read it back as a pickle file:
with open('dataset.pkl', "rb") as openfile:
    df_file = pickle.load(openfile)
Let me know if it worked.
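For completeness, pandas can also read the pickle back directly, without opening the file yourself; a small sketch assuming the file written above:
import pandas as pd
# read_pickle returns the dataframe that to_pickle wrote
df_file = pd.read_pickle('dataset.pkl')
print(df_file.head())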

Setting dataframe style per column doesn't work

pandas==1.2.4 and python==3.7
This doesn't change the formatting on column A:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.uniform(0, 1, 9).reshape(3, -1), columns=list('ABC'))
df.style.format({"A": '{:.1f}'})
print(df)
This works, however:
df['A'] = df['A'].map('{:.1f}'.format)
print(df)
So does this:
pd.set_option('display.float_format','{:.1f}'.format)
print(df)
Am I using the feature correctly?

How to fill NaN's in a pandas dataframe with random 1's and 0's

How to replace the NaN values in a pandas dataframe with random 0's and 1's?
df.fillna(random.randint(0,1))
seems to fill the NaN's in certain columns with all 1's or all 0's
#Creating a dummy Dataframe
import pandas as pd
import numpy as np
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [100, np.nan, 27000, np.nan]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
# Replacing NaN in a particular column
a = df.Price.isnull()
rand_int = np.random.randint(2, size=a.sum())
df.loc[a, 'Price'] = rand_int
print(df)
# For the entire dataframe
for i in df:
    a = df[i].isnull()
    rand_int = np.random.randint(2, size=a.sum())
    df.loc[a, i] = rand_int
print(df)
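If you would rather avoid the explicit loop, a vectorized sketch (reusing the same cars dataframe from above): build a random 0/1 Series aligned with the index and let fillna pick values only at the NaN positions.
import pandas as pd
import numpy as np
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [100, np.nan, 27000, np.nan]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
# Random 0/1 values for every row; fillna only uses them where Price is NaN
random_fill = pd.Series(np.random.randint(0, 2, size=len(df)), index=df.index)
df['Price'] = df['Price'].fillna(random_fill)
print(df)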

DataFrame.apply unintuitively changes int to float, breaking an index lookup

Problem description
The column 'a' has type integer, not float. The apply function should not change the type just because the dataframe has another, unrelated float column.
I understand why it happens: apply builds a Series from each row and picks the most suitable common dtype for it. I still consider it unintuitive that I select a group of columns and apply a function that only works on ints, not on floats, and the moment I remove one unrelated column I get an exception, because now all columns are numeric and every int has become a float.
>>> import pandas as pd
# This works.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': ['', '', '']}).apply(lambda row: row['a'], axis=1)
0 1
1 2
2 3
dtype: int64
# Here we also expect 1, 2, 3, as above.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: row['a'], axis=1)
0 1.0
1 2.0
2 3.0
# Why floats?!?!?!?!?!
# It's an integer column:
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})['a'].dtype
dtype('int64')
Expected Output
0 1
1 2
2 3
dtype: int64
Specifically, in my problem I am trying to use the value inside the apply function to look up an element in a list. I need this to be performant, and recasting to int inside the apply is too slow.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: myList[row['a']], axis=1)
This is the only source I could find describing the same problem: https://github.com/pandas-dev/pandas/issues/23230
It seems like your underlying problem is to index a list by the values in one of your DataFrame columns. This can be done by converting your list to an array, after which you can index it normally:
Sample Data
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 0, 3], 'b': ['', '', '']})
myList = ['foo', 'bar', 'baz', 'boo']
Code:
np.array(myList)[df.a.to_numpy()]
#array(['bar', 'foo', 'boo'], dtype='<U3')
Or if you want the Series:
pd.Series(np.array(myList)[df.a.to_numpy()], index=df.index)
#0 bar
#1 foo
#2 boo
#dtype: object
Alternatively with a list comprehension this is:
[myList[i] for i in df.a]
#['bar', 'foo', 'boo']
You are getting caught by pandas upcasting: certain operations result in an upcast column dtype. The 0.24 docs describe this under "gotchas": https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#gotchas
The examples below show a few operations where this happens.
import pandas as pd
import numpy as np
print(pd.__version__)
# float64 is the default dtype of an empty dataframe.
df = pd.DataFrame({'a': [], 'b': []})['a'].dtype
print(df)
try:
    df['a'] = [1, 2, 3, 4]
except TypeError as te:
    # good, the default dtype is float64
    print(te)
print(df)
# even if 'default' is changed, this is a surprise
# because referring to all columns does convert to float
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# creates an index, "a" is float type
df.loc["a", "col1":"col2"] = np.int64(0)
print(df.dtypes)
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# not upcast
df.loc[:"col1"] = np.int64(0)
print(df.dtypes)
Taking a shot at a performant answer that works around such upcasting behavior:
import pandas as pd
import numpy as np
print(pd.__version__)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
df['a'] = df['a'].apply(lambda row: row+1)
df['b'] = df['b'].apply(lambda row: row+1)
print(df)
print(df['a'].dtype)
print(df['b'].dtype)
dtypes are preserved.
0.24.2
a b
0 2 1.0
1 3 1.0
2 4 1.0
int64
float64
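Applying the same per-column idea back to the original list lookup (a sketch, reusing myList from the question's example): selecting the 'a' column first keeps its int64 dtype, so its values can index the list directly.
import pandas as pd
myList = ['foo', 'bar', 'baz', 'boo']
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
# Series.apply never sees the float column, so 'a' stays int64
df['lookup'] = df['a'].apply(lambda v: myList[v])
print(df)
print(df['a'].dtype)  # int64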

pandas dropna is not removing nan when using np.where

I have this function
import pandas as pd
import numpy as np
from shapely.geometry import Point, Polygon
def test(e, n):
    polygon = Polygon([(340, 6638), (340, 6614), (375, 6620), (374, 6649)])
    point_instance = Point((e, n))
    a = polygon.contains(point_instance)
    val = np.where(a, 0, np.nan)
    return pd.Series([val])
I want to apply the above function to my dataframe and then remove the NaN values.
def testData(filename):
    df = pd.read_csv(filename)
    df['check'] = df.apply(lambda x: test(x['E'], x['N']), axis=1)
    # I tried both of these and neither deletes the nan values
    df.dropna(axis=0, how='any', inplace=True)
    df1 = df.dropna(axis=0, how='any', subset=['check'])
However, if I save the data to a file and then use dropna, it works.
Sample dataframe
Id,E,N
1,5,8
2,6,9
3,7,10
This is the output I am getting
Id E N check
1 5 8 nan
2 6 9 nan
3 7 10 nan
It seems that using np.nan inside np.where creates a conflicting data type, and for that reason pandas dropna didn't work.
I fixed it by using pandas map inside my function:
a = pd.Series(polygon.contains(point_instance))
val = a.map({True: 0, False: np.nan})
return val
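An alternative sketch that avoids np.where entirely: return a plain float from the function, so the 'check' column holds ordinary NaN values that dropna recognises (assuming the same polygon test as above).
import numpy as np
from shapely.geometry import Point, Polygon
def test(e, n):
    polygon = Polygon([(340, 6638), (340, 6614), (375, 6620), (374, 6649)])
    point_instance = Point((e, n))
    # Return a plain scalar (0.0 or np.nan), not a 0-d numpy array
    return 0.0 if polygon.contains(point_instance) else np.nan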