Parquet issue with inferring schema on int column containing Null - pandas

I am reading an S3 key and converting it into Parquet using pandas. Before converting to Parquet, I type-cast the columns so that pyarrow can infer the schema correctly.
The snippet looks something like this:
df = pd.read_csv(io.BytesIO(s3.get_object(Bucket=s3_bucket, Key=s3_key)['Body'].read()), sep='\t', error_bad_lines=False, warn_bad_lines=True)
df['col_name'] = df['col_name'].astype('int')
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(table, buf, compression='snappy')
So far so good.
The problem is that when the int column has null values, pandas will of course read it as an object. Is there any way to typecast it to 'int'? One option could be to do fillna(0) (or 99999) first and then typecast. That worked, but Null and 0 (or 99999) have different meanings in that column.
So, any idea how to typecast it to int, or how else I can modify the code above to handle this situation?

From the pandas documentation:
Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype
Since version 0.24 there are nullable integer extension types which are capable of holding missing values. Typecast to dtype="Int64".
You can find more information under
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
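As a minimal sketch of what that cast does (the values here are made up), the missing entry survives as <NA> instead of forcing a float column:
import pandas as pd

# A column built with a None becomes float64 by default, because NaN is a float
s = pd.Series([1, 2, None])
print(s.dtype)            # float64

# Casting to the nullable extension type keeps the missing value as <NA>
print(s.astype('Int64'))  # dtype: Int64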
EDIT: The proposed workaround in Arrow is
import pandas as pd
import pyarrow as pa
def from_pandas(df):
    """Cast Int64 to object before 'serializing'"""
    for col in df:
        if isinstance(df[col].dtype, pd.Int64Dtype):
            df[col] = df[col].astype('object')
    return pa.Table.from_pandas(df)

def to_pandas(tbl):
    """After 'deserializing', recover the correct int type"""
    df = tbl.to_pandas(integer_object_nulls=True)
    for col in df:
        if (pa.types.is_integer(tbl.schema.field_by_name(col).type) and
                pd.api.types.is_object_dtype(df[col].dtype)):
            df[col] = df[col].astype('Int64')
    return df
df = pd.Series([0, 1, None, 2, 822215679726100500], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 822215679726100500], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 15], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 15], dtype='int16', name='x').to_frame()
df2 = to_pandas(from_pandas(df))
df2.dtypes
All credit to Thomas Buhrmann.

Related

Why does pandas make a float64 out of an integer?

I set an integer as a cell value, but pandas makes a float64 out of it. Why, and how can I prevent that?
>>> df = pandas.DataFrame()
>>> df.loc['X', 'Y'] = 1
>>> df
     Y
X  1.0
I think you shouldn't start with an empty DataFrame and then use df.loc[] to create columns.
E.g. start with a df with dtypes set
df = pandas.DataFrame({"x":[1,2,3]}, dtype=np.int64)
or if you need to adjust the dtype of a column, then call the astype() method, e.g.
df["x"] = df["x"].astype("float")

I need to return a value from a DataFrame cell as a variable, not a Series

I have the following issue:
When I use the .loc function it returns a Series, not a single value with no index, and I need to do some math operations with the selected cells. The code that I am using is:
import pandas as pd
data = [[82,1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns = ['Ah-Step', 'State'])
df['Ah-Step'].loc[df['State']==2]+ df['Ah-Step'].loc[df['State']==3]
Using .values[0] will do what the OP wants.
Assuming one wants to obtain the value 30, the following will do the job:
value = df.loc[df['State'] == 2, 'Ah-Step'].values[0]
print(value)
[Out]: 30.0
So, in the OP's specific case, the operation 30 + 3.7 could be done as follows:
df.loc[df['State'] == 2, 'Ah-Step'].values[0] + df['Ah-Step'].loc[df['State']==3].values[0]
[Out]: 33.7
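An equivalent way to pull out the scalar, sketched here with the question's data, is positional indexing with .iloc[0] on the filtered Series:
import pandas as pd

data = [[82, 1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns=['Ah-Step', 'State'])

# .iloc[0] on the filtered Series also returns the bare scalar
a = df.loc[df['State'] == 2, 'Ah-Step'].iloc[0]
b = df.loc[df['State'] == 3, 'Ah-Step'].iloc[0]
print(a + b)              # 33.7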

Assert an integer is in list on pandas series

I have a DataFrame with two pandas Series as follows:
   value accepted_values
0      1    [1, 2, 3, 4]
1      2    [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (it took around 27 seconds on a 1-million-row DataFrame):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values[0] in values[1]

are_in_accepted_values = df[["value", "accepted_values"]].apply(
    check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all value in accepted_values")
I think if you create a DataFrame from the list column, you can compare it with DataFrame.eq and then test whether at least one value per row matches with DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
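To see what the intermediate frame and the row-wise comparison look like, here is a small sketch on the question's two-row example:
import pandas as pd

df = pd.DataFrame({"value": [1, 2], "accepted_values": [[1, 2, 3, 4], [5, 6, 7, 8]]})

# Expand the list column so each accepted value gets its own cell
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
print(df1.eq(df["value"], axis=0).any(axis=1))  # 0: True, 1: False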
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
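A rough harness to reproduce the comparison could look like the sketch below (the data layout is an assumption and timings will vary by machine); every value is placed inside its accepted list so neither variant can short-circuit early:
import time

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "value": np.random.randint(1, 5, n),
    "accepted_values": [[1, 2, 3, 4]] * n,
})

start = time.perf_counter()
all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
print("to_numpy loop:", time.perf_counter() - start)

start = time.perf_counter()
values = df["value"].values
accepted_values = df["accepted_values"].values
all(s in e for s, e in np.column_stack([values, accepted_values]))
print("column_stack loop:", time.perf_counter() - start)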

How to make PyTables handle string columns

I have a DataFrame that has a few float columns and a few string columns. All columns contain NaN. The string columns hold either strings or NaN, which appears to have type float. When I try df.to_hdf to store the DataFrame, I get the following warning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['operation', 'snl_datasource_period', 'ticker', 'cusip', 'end_fisca_perio_date', 'fiscal_period', 'finan_repor_curre_code', 'earni_relea_date', 'finan_perio_begin_on']]
How can I work around it?
You can fill each column's missing values with a type-appropriate placeholder. E.g.
import pandas as pd
import numpy as np
col1 = [1.0, np.nan, 3.0]
col2 = ['one', np.nan, 'three']
df = pd.DataFrame(dict(col1=col1, col2=col2))
df['col1'] = df['col1'].fillna(0.0)
df['col2'] = df['col2'].fillna('')
df.to_hdf('eg.hdf', 'eg')
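If you want to confirm the file round-trips cleanly (reusing the eg.hdf file from the snippet above), a quick check could be:
# Read the file back; col1 should come back as float64 and col2 as object (strings)
df_back = pd.read_hdf('eg.hdf', 'eg')
print(df_back.dtypes)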

Replace None with NaN in pandas dataframe

I have table x:
website
0 http://www.google.com/
1 http://www.yahoo.com
2 None
I want to replace python None with pandas NaN. I tried:
x.replace(to_replace=None, value=np.nan)
But I got:
TypeError: 'regex' must be a string or a compiled regular expression or a list or dict of strings or regular expressions, you passed a 'bool'
How should I go about it?
You can use DataFrame.fillna or Series.fillna which will replace the Python object None, not the string 'None'.
import pandas as pd
import numpy as np
For dataframe:
df = df.fillna(value=np.nan)
For column or series:
df.mycol.fillna(value=np.nan, inplace=True)
Here's another option:
df.replace(to_replace=[None], value=np.nan, inplace=True)
The following line replaces the string 'None' (not the Python object) with NaN:
df['column'].replace('None', np.nan, inplace=True)
If you use df.replace([None], np.nan, inplace=True), it changes all datetime objects with missing data to object dtype. So you may end up with broken queries unless you change them back to datetime, which can be taxing depending on the size of your data.
If you want to use this method, you can first identify the object dtype fields in your df and then replace the None:
obj_columns = list(df.select_dtypes(include=['object']).columns.values)
df[obj_columns] = df[obj_columns].replace([None], np.nan)
This solution is straightforward because it can replace the value in all the columns easily.
You can use a dict:
import pandas as pd
import numpy as np
df = pd.DataFrame([[None, None], [None, None]])
print(df)
0 1
0 None None
1 None None
# replacing
df = df.replace({None: np.nan})
print(df)
0 1
0 NaN NaN
1 NaN NaN
It's an old question, but here is a solution for multiple columns:
values = {'col_A': 0, 'col_B': 0, 'col_C': 0, 'col_D': 0}
df.fillna(value=values, inplace=True)
For more options, check the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
df['Col_name'].replace("None", np.nan, inplace=True)