How to make pydata handle string columns - pandas

I have a DataFrame that has a few columns with floats and a few columns with strings. All columns contain NaN values. The string columns hold either strings or NaN, which appears to have type float. When I try df.to_hdf to store the DataFrame, I get the following error:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['operation', 'snl_datasource_period', 'ticker', 'cusip', 'end_fisca_perio_date', 'fiscal_period', 'finan_repor_curre_code', 'earni_relea_date', 'finan_perio_begin_on']]
How can I work around it?

You can fill each column with the appropriate missing value. E.g.
import pandas as pd
import numpy as np
col1 = [1.0, np.nan, 3.0]
col2 = ['one', np.nan, 'three']
df = pd.DataFrame(dict(col1=col1, col2=col2))
df['col1'] = df['col1'].fillna(0.0)
df['col2'] = df['col2'].fillna('')
df.to_hdf('eg.hdf', 'eg')
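If you need the missing values back after reading the file, a minimal sketch (using the sentinel values 0.0 and '' assumed in the example above) could be:
import numpy as np
import pandas as pd
df2 = pd.read_hdf('eg.hdf', 'eg')
# turn the sentinels back into NaN
df2['col1'] = df2['col1'].replace(0.0, np.nan)
df2['col2'] = df2['col2'].replace('', np.nan)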

Related

dtype definition for pandas dataframe with columns of VARCHAR or String

I have some data in a dictionary that needs to go into a pandas DataFrame.
The DataFrame is later written to a PostgreSQL table using sqlalchemy, and I would like to get the right column types.
Hence, I specify the dtypes for the DataFrame:
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
          "forretningsområde": sqlalchemy.types.String(length=40),
          "forretningsproces": sqlalchemy.types.INTEGER(),
          "id_namespace": sqlalchemy.types.String(length=100),
          "id_lokalId": sqlalchemy.types.String(length=36),
          "kommunekode": sqlalchemy.types.INTEGER(),
          "registreringFra": sqlalchemy.types.DateTime()}
Later I use df = pd.DataFrame(item_lst, dtype=dtypes), where item_lst is a list of dictionaries.
Regardless of whether I use String(8), String(length=8) or VARCHAR(8) in the dtype definition, the result of pd.DataFrame(item_lst, dtype=dtypes) is always TypeError: object of type '(String or VARCHAR)' has no len().
How do I have to define the dtype to overcome this error?
Instead of forcing data types when the DataFrame is created, let pandas infer the data types (just df = pd.DataFrame(item_lst)) and then use your dtypes dict with to_sql() when you push your DataFrame to the database, like this:
from pprint import pprint
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite://")
item_lst = [{"forretningshændelse": "foo"}]
df = pd.DataFrame(item_lst)
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   forretningshændelse  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes
None
"""
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8)}
df.to_sql("tbl", engine, index=False, dtype=dtypes)
insp = sqlalchemy.inspect(engine)
pprint(insp.get_columns("tbl"))
"""
[{'autoincrement': 'auto',
  'default': None,
  'name': 'forretningshændelse',
  'nullable': True,
  'primary_key': 0,
  'type': VARCHAR(length=8)}]
"""
I believe you are confusing the dtypes within the DataFrame with the dtypes of the SQL table itself.
You probably don't need to manually specify the datatypes in pandas itself, but if you do, here's how.
Spoiler alert: the pandas.DataFrame documentation states that only a single dtype may be specified, so you will need some loops or manual per-column work to get different types.
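For illustration, a minimal sketch of setting pandas dtypes per column (the item_lst sample and the pandas dtype mapping below are assumptions for the sketch, not the SQL types from the question):
import pandas as pd
item_lst = [{"forretningsproces": "1", "kommunekode": "101"}]
df = pd.DataFrame(item_lst)
# loop over a per-column mapping, since DataFrame(..., dtype=...) accepts only a single dtype
pandas_dtypes = {"forretningsproces": "int64", "kommunekode": "int64"}
for col, dt in pandas_dtypes.items():
    df[col] = df[col].astype(dt)
print(df.dtypes)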
To solve your problem:
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("connection_string")
df = pd.DataFrame(item_lst)
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
          "forretningsområde": sqlalchemy.types.String(40),
          "forretningsproces": sqlalchemy.types.INTEGER(),
          "id_namespace": sqlalchemy.types.String(100),
          "id_lokalId": sqlalchemy.types.String(36),
          "kommunekode": sqlalchemy.types.INTEGER(),
          "registreringFra": sqlalchemy.types.DateTime()}
with engine.connect() as connection:
    df.to_sql("table_name", if_exists="replace", con=connection, dtype=dtypes)
Tip: Avoid using special characters in names while coding in general; it only makes maintaining the code harder at some point :). I assumed you're creating a new SQL table and not appending to an existing one, otherwise the column types would already be defined.
Happy Coding!

How can I get an interpolated value from a Pandas data frame?

I have a simple Pandas data frame with two columns, 'Angle' and 'rff'. I want to get an interpolated 'rff' value based on entering an Angle that falls between two Angle values (i.e. between two index values) in the data frame. For example, I'd like to enter 3.4 for the Angle and then get an interpolated 'rff'. What would be the best way to accomplish that?
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # Sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
s = s.set_index('Angle')  # Set 'Angle' as index
print(s)
result = s.at[3.0, "rff"]
print(result)
You may use numpy:
import numpy as np
np.interp(3.4, s.index, s.rff)
#59.6
You could use numpy for this:
import numpy as np
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # Sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
print(np.interp(3.4, s.Angle, s.rff))
# 59.6
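An alternative sketch that stays within pandas (not taken from the original answers) is to insert the target angle into the index and interpolate on it:
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]
s = pd.DataFrame(data, columns=['Angle', 'rff']).set_index('Angle')
# add 3.4 to the index, interpolate linearly on the index values, then look it up
result = s.reindex(s.index.union([3.4])).interpolate(method='index').at[3.4, 'rff']
print(result)
# 59.6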

Parquet issue with inferring schema on int column containing Null

I am reading an S3 key and converting it into Parquet using pandas. Before converting to Parquet, I type-cast the columns so that pyarrow can infer the schema correctly.
The snippet looks something like this:
import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# s3 is a boto3 S3 client; s3_bucket and s3_key are defined elsewhere
df = pd.read_csv(io.BytesIO(s3.get_object(Bucket=s3_bucket, Key=s3_key)['Body'].read()), sep='\t', error_bad_lines=False, warn_bad_lines=True)
df['col_name'] = df['col_name'].astype('int')
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(table, buf, compression='snappy')
So far so good.
The problem is that when the int column has a null value, pandas treats it as an object, of course. Is there any way to typecast it to 'int'? One way would be to do fillna(0) (or 99999) first and then do the typecast. That worked, but null and 0 (or 99999) have different meanings in that column.
So, any idea how to typecast it to int? Or anything I can do to modify the code above to handle this situation?
From the pandas documentation:
Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype
Since version 0.24 there are extended integer types which are capable of holding missing values. Typecast to dtype="Int64"
You can find more information at
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
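Applied to a toy version of the question's column (the sample values here are made up), a minimal sketch would be:
import pandas as pd
df = pd.DataFrame({'col_name': [1, None, 3]})
# 'Int64' (capital I) is the nullable integer dtype; missing values stay as <NA>
df['col_name'] = df['col_name'].astype('Int64')
print(df['col_name'].dtype)
Newer pyarrow versions handle this nullable dtype directly in pa.Table.from_pandas; the workaround below was proposed for versions that did not.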
EDIT: The proposed workaround in Arrow is
import pandas as pd
import pyarrow as pa
def from_pandas(df):
    """Cast Int64 to object before 'serializing'"""
    for col in df:
        if isinstance(df[col].dtype, pd.Int64Dtype):
            df[col] = df[col].astype('object')
    return pa.Table.from_pandas(df)

def to_pandas(tbl):
    """After 'deserializing', recover the correct int type"""
    df = tbl.to_pandas(integer_object_nulls=True)
    for col in df:
        if (pa.types.is_integer(tbl.schema.field_by_name(col).type) and
                pd.api.types.is_object_dtype(df[col].dtype)):
            df[col] = df[col].astype('Int64')
    return df
df = pd.Series([0, 1, None, 2, 822215679726100500], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 822215679726100500], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 15], dtype='Int64', name='x').to_frame()
# df = pd.Series([0, 1, 3, 2, 15], dtype='int16', name='x').to_frame()
df2 = to_pandas(from_pandas(df))
df2.dtypes
All credit to Thomas Buhrmann.

applying random distribution in each row of data frame

I have the following DataFrame:
import numpy as np
import pandas as pd
import scipy as sc
import scipy.stats as sct
d = {'col1': [1, 2, 5, 0.6], 'col2': [3, 4, 1, 0.8]}
df = pd.DataFrame(data=d)
I want to add two new columns to that DataFrame, where the elements of the new columns are Poisson random draws with means taken from col1 and col2.
I used the following code to generate the new columns (col3 and col4):
df['col3'] = int(sct.poisson.rvs(df.col1, size=1))
df['col4'] = int(sct.poisson.rvs(df.col2, size=1))
This is the closest example to my DataFrame, which is quite large: it contains 3,800,000 rows.
I can generate the columns using a for loop, but that takes too long.
How can I generate the Poisson draws based on the DataFrame without using a loop?
Thanks
Zep
Try just using:
df['col3'] = sct.poisson.rvs(df.col1)
df['col4'] = sct.poisson.rvs(df.col2)
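If scipy is not a requirement, a similar vectorized sketch with plain numpy (an alternative, not part of the original answer) would be:
import numpy as np
import pandas as pd
d = {'col1': [1, 2, 5, 0.6], 'col2': [3, 4, 1, 0.8]}
df = pd.DataFrame(data=d)
# np.random.poisson broadcasts over the per-row means, so no Python loop is needed
df['col3'] = np.random.poisson(df['col1'])
df['col4'] = np.random.poisson(df['col2'])
print(df)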

Replace None with NaN in pandas dataframe

I have table x:
website
0 http://www.google.com/
1 http://www.yahoo.com
2 None
I want to replace python None with pandas NaN. I tried:
x.replace(to_replace=None, value=np.nan)
But I got:
TypeError: 'regex' must be a string or a compiled regular expression or a list or dict of strings or regular expressions, you passed a 'bool'
How should I go about it?
You can use DataFrame.fillna or Series.fillna which will replace the Python object None, not the string 'None'.
import pandas as pd
import numpy as np
For dataframe:
df = df.fillna(value=np.nan)
For column or series:
df.mycol.fillna(value=np.nan, inplace=True)
Here's another option:
df.replace(to_replace=[None], value=np.nan, inplace=True)
The following line replaces None with NaN:
df['column'].replace('None', np.nan, inplace=True)
Note that if you use df.replace([None], np.nan, inplace=True), it changes all datetime objects with missing data to object dtype. So you may end up with broken queries unless you change them back to datetime, which can be taxing depending on the size of your data.
If you want to use this method, you can first identify the object dtype fields in your df and then replace the None:
obj_columns = list(df.select_dtypes(include=['object']).columns.values)
df[obj_columns] = df[obj_columns].replace([None], np.nan)
This solution is straightforward because it can replace the value in all the columns easily.
You can use a dict:
import pandas as pd
import numpy as np
df = pd.DataFrame([[None, None], [None, None]])
print(df)
      0     1
0  None  None
1  None  None
# replacing
df = df.replace({None: np.nan})
print(df)
    0   1
0 NaN NaN
1 NaN NaN
It's an old question, but here is a solution for multiple columns:
values = {'col_A': 0, 'col_B': 0, 'col_C': 0, 'col_D': 0}
df.fillna(value=values, inplace=True)
For more options, check the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
DataFrame['Col_name'].replace("None", np.nan, inplace=True)
(Note that this variant replaces the string "None", not the None object.)