I found code on the web to replace inf and -inf with np.nan, however it did not work on my computer.
df = pd.DataFrame({"A" : [4.6, 5., np.inf]})
new_dataframe = a_dataframe.replace([np.inf, -np.inf], np.nan)
df
My output
A
0 4.6
1 5.0
2 inf
Does somebody know a solution?
Import pandas and numpy, then assign the result of df.replace([np.inf, -np.inf], np.nan) back to df. Calling df.replace([np.inf, -np.inf], np.nan) on its own does not update the dataframe; the result needs to be assigned with = for the change to take effect.
Also, the code you provided uses new_dataframe and a_dataframe, which have nothing to do with df. Try the code below.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [4.6, 5., np.inf]})
df = df.replace([np.inf, -np.inf], np.nan)
print(df)
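If you would rather modify df in place, replace also accepts inplace=True:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
print(df)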
Related
I have a dataframe with a column full of numpy arrays.
A B C
0 1.0 0.000000 [[0. 1.],[0. 1.]]
1 2.0 0.000000 [[85. 1.],[52. 0.]]
2 3.0 0.000000 [[5. 1.],[0. 0.]]
3 1.0 3.333333 [[0. 1.],[41. 0.]]
4 2.0 3.333333 [[85. 1.],[0. 21.]]
The problem is, when I save it as a CSV file and load it in another Python file, the numpy column is read back as text.
I tried to transform the column with np.fromstring() or np.loadtxt(), but it doesn't work.
Example of an array after pd.read_csv():
"[[ 85. 1.]\n [ 52. 0. ]]"
Thanks
You can try .to_json():
output = pd.DataFrame([
    {'a': 1, 'b': np.arange(4)},
    {'a': 2, 'b': np.arange(5)}
]).to_json()
But you will only get plain lists back when reloading with
df = pd.read_json(output)
Turn them back into numpy arrays with:
df['b'] = [np.array(v) for v in df['b']]
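For completeness, a self-contained sketch of the round trip (recent pandas versions want a file-like object for read_json, hence the StringIO wrapper):
import io
import numpy as np
import pandas as pd
output = pd.DataFrame([
    {'a': 1, 'b': np.arange(4)},
    {'a': 2, 'b': np.arange(5)}
]).to_json()
df = pd.read_json(io.StringIO(output))    # 'b' comes back as plain lists
df['b'] = [np.array(v) for v in df['b']]  # restore numpy arrays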
The code below should work. I used another question to solve it; there's a bit more explanation in there: Convert a string with brackets to numpy array. Note that this works because the arrays were saved with tolist(), which writes comma-separated Python literals that literal_eval can parse.
import pandas as pd
import numpy as np
from ast import literal_eval
# Recreating DataFrame
data = np.array([0, 1, 0, 1, 85, 1, 52, 0, 5, 1, 0, 0, 0, 1, 41, 0, 85, 1, 0, 21], dtype='float')
data = data.reshape((5,2,2))
write_df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 1.0, 2.0],
                         'B': [0, 0, 0, 3 + 1/3, 3 + 1/3],
                         'C': data.tolist()})
# Saving DataFrame to CSV
fpath = 'D:\\Data\\test.csv'
write_df.to_csv(fpath)
# Reading DataFrame from CSV
read_df = pd.read_csv(fpath)
# literal_eval converts the string to a nested list
# np.array can convert this nested list directly into an array
def makeArray(rawdata):
    parsed = literal_eval(rawdata)
    return np.array(parsed)
# Applying the function row-wise; there could be a more efficient way
read_df['C'] = read_df['C'].apply(makeArray)
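A quick sanity check after the round trip, assuming the code above has run:
print(type(read_df['C'].iloc[0]))  # <class 'numpy.ndarray'>
print(read_df['C'].iloc[0].shape)  # (2, 2)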
Here is an ugly solution.
import pandas as pd
import numpy as np
### Create dataframe
a = [1.0, 2.0, 3.0, 1.0, 2.0]
b = [0.000000, 0.000000, 0.000000, 3.333333, 3.333333]
c = [np.array([[0., 1.], [0., 1.]]),
     np.array([[85., 1.], [52., 0.]]),
     np.array([[5., 1.], [0., 0.]]),
     np.array([[0., 1.], [41., 0.]]),
     np.array([[85., 1.], [0., 21.]])]
df = pd.DataFrame({"a": a, "b": b, "c": c})
### Save to csv and read back
df.to_csv("to_trash.csv")
df = pd.read_csv("to_trash.csv")
### Bad string manipulation that could be done better with regex
df["c"] = ("np.array("+(df
.c
.str.split()
.str.join(' ')
.str.replace(" ",",")
.str.replace(",,",",")
.str.replace("[,", "[", regex=False)
)+")").apply(lambda x: eval(x))
The best solution I found is to use pickle files.
You can save your dataframe as a pickle file:
import cv2
import pandas as pd
img = cv2.imread('img1.jpg')
data = pd.DataFrame({'img': [img]})  # wrap the image array in a list so it is stored as a single cell
data.to_pickle('dataset.pkl')
Then you can read it back as a pickle file:
import pickle
# ref_path is the directory where dataset.pkl was saved
with open(ref_path + 'dataset.pkl', "rb") as openfile:
    df_file = pickle.load(openfile)
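Note that pandas can also read the pickle back directly, without the pickle module:
df_file = pd.read_pickle(ref_path + 'dataset.pkl')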
Let me know if it worked.
pandas==1.2.4 and python==3.7
This doesn't change the formatting on column A:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.uniform(0, 1, 9).reshape(3, -1), columns=list('ABC'))
df.style.format({"A": '{:.1f}'})
print(df)
This works, however:
df['A'] = df['A'].map('{:.1f}'.format)
print(df)
So does this:
pd.set_option('display.float_format','{:.1f}'.format)
print(df)
Am I using the feature correctly?
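df.style returns a Styler object, and a Styler only affects rendered output, such as the HTML table a notebook displays; it never modifies df itself, which is why print(df) still shows full precision. A small sketch for pandas 1.2, where Styler.render() returns the styled HTML (later versions renamed this to to_html):
styled = df.style.format({"A": '{:.1f}'})
html = styled.render()  # the 1-decimal formatting exists only in this HTML string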
How to replace the NaN values in a pandas dataframe with random 0's and 1's?
df.fillna(random.randint(0,1))
seems to fill the NaN's in certain columns with all 1's or all 0's
df.fillna(random.randint(0, 1)) evaluates random.randint(0, 1) once, so the same single value is reused for every NaN; that is why whole columns come out all 0's or all 1's. Instead, draw an array of random integers sized to the number of missing values:
# Creating a dummy Dataframe
import pandas as pd
import numpy as np
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [100, np.nan, 27000, np.nan]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
# Replacing NaN in a particular column
a = df.Price.isnull()
rand_int = np.random.randint(2, size=a.sum())
df.loc[a, 'Price'] = rand_int
print(df)
# For the entire dataframe
for i in df:
    a = df[i].isnull()
    rand_int = np.random.randint(2, size=a.sum())
    df.loc[a, i] = rand_int
print(df)
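A vectorized alternative, as a sketch: build a same-shaped frame of random 0/1 values and let fillna pick from it only where data is missing (fillna accepts a DataFrame as the value):
rand_df = pd.DataFrame(np.random.randint(2, size=df.shape),
                       index=df.index, columns=df.columns)
df = df.fillna(rand_df)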
Problem description
The column 'a' has type integer, not float. The apply function should not change the type just because the dataframe has another, unrelated float column.
I understand why it happens: pandas detects the most suitable common type for the Series it builds from each row. I still find it unintuitive: I select a group of columns and apply a function that only works on ints, not on floats, and when I remove one unrelated column I suddenly get an exception, because now all the columns are numeric and all the ints became floats.
>>> import pandas as pd
# This works.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': ['', '', '']}).apply(lambda row: row['a'], axis=1)
0 1
1 2
2 3
dtype: int64
# Here we also expect 1, 2, 3, as above.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: row['a'], axis=1)
0 1.0
1 2.0
2 3.0
dtype: float64
# Why floats?!?!?!?!?!
# It's an integer column:
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})['a'].dtype
dtype('int64')
Expected Output
0 1
1 2
2 3
dtype: int64
Specifically, in my problem I am trying to use the value inside the apply function to get an element from a list. I need this to be performant, and recasting to int inside the apply is too slow.
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]}).apply(lambda row: myList[row['a']], axis=1)
https://github.com/pandas-dev/pandas/issues/23230
This is the only source I could find describing the same problem.
It seems like your underlying problem is to index a list by the values in one of your DataFrame columns. This can be done by converting the list to an array, after which you can index normally:
Sample Data
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 0, 3], 'b': ['', '', '']})
myList = ['foo', 'bar', 'baz', 'boo']
Code:
np.array(myList)[df.a.to_numpy()]
#array(['bar', 'foo', 'boo'], dtype='<U3')
Or if you want the Series:
pd.Series(np.array(myList)[df.a.to_numpy()], index=df.index)
#0 bar
#1 foo
#2 boo
#dtype: object
Alternatively, with a list comprehension:
[myList[i] for i in df.a]
#['bar', 'foo', 'boo']
You are getting caught by pandas upcasting. Certain operations will result in an upcast column dtype. The [0.24 docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#gotchas) describe this.
The examples below show operations where the upcast is encountered.
import pandas as pd
import numpy as np
print(pd.__version__)
# float64 is the default dtype of an empty dataframe.
df = pd.DataFrame({'a': [], 'b': []})
print(df['a'].dtype)  # float64
# Even when the dtype is set explicitly, there is a surprise:
# setting a value on a new row upcasts the columns to float.
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# creates the index label "a"; both columns become float type
df.loc["a", "col1":"col2"] = np.int64(0)
print(df.dtypes)
df = pd.DataFrame(columns=["col1", "col2"], dtype=np.int64)
# assigning without creating a new row does not upcast
df.loc[:"col1"] = np.int64(0)
print(df.dtypes)
Taking a shot at a performant answer that works around such upcasting behavior:
import pandas as pd
import numpy as np
print(pd.__version__)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
df['a'] = df['a'].apply(lambda x: x + 1)
df['b'] = df['b'].apply(lambda x: x + 1)
print(df)
print(df['a'].dtype)
print(df['b'].dtype)
dtypes are preserved.
0.24.2
a b
0 2 1.0
1 3 1.0
2 4 1.0
int64
float64
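Applied to the original lookup problem, the same per-column trick avoids the upcast entirely (reusing the myList from the earlier answer):
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0., 0., 0.]})
myList = ['foo', 'bar', 'baz', 'boo']
df['c'] = df['a'].apply(lambda i: myList[i])  # 'a' stays int64, no float cast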
I have this function
import pandas as pd
import numpy as np
from shapely.geometry import Point, Polygon
def test(e, n):
    polygon = Polygon([(340, 6638), (340, 6614), (375, 6620), (374, 6649)])
    point_instance = Point((e, n))
    a = polygon.contains(point_instance)
    val = np.where(a, 0, np.nan)
    return pd.Series([val])
I want to apply the above function to my dataframe and then remove the NaN values:
def testData(filename):
    df = pd.read_csv(filename)
    df['check'] = df.apply(lambda x: test(x['E'], x['N']), axis=1)
    # I tried both of these and neither deletes the nan values
    df.dropna(axis=0, how='any', inplace=True)
    df1 = df.dropna(axis=0, how='any', subset=['check'])
However, if I save the data to a file first and then use dropna, it works.
Sample dataframe
Id,E,N
1,5,8
2,6,9
3,7,10
This is the output I am getting
Id E N check
1 5 8 nan
2 6 9 nan
3 7 10 nan
It seems that using np.nan inside np.where creates a conflicting datatype: np.where returns a 0-dimensional numpy array rather than a plain float, so the 'check' column ends up holding array objects that pandas does not recognize as missing.
For that reason, pandas dropna didn't work.
I fixed it by using pandas map inside my function:
    a = pd.Series(polygon.contains(point_instance))
    val = a.map({True: 0, False: np.nan})
    return val
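Assembled, the fixed function looks like this; the key change is that the 'check' column now holds a real float NaN, which dropna recognizes:
def test(e, n):
    polygon = Polygon([(340, 6638), (340, 6614), (375, 6620), (374, 6649)])
    point_instance = Point((e, n))
    a = pd.Series(polygon.contains(point_instance))
    val = a.map({True: 0, False: np.nan})
    return val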