The type of <field> is not a SQLAlchemy type with pandas to_sql to an Oracle database

I have a pandas dataframe that has several categorical fields.
SQLAlchemy throws an exception: "The type of <field> is not a SQLAlchemy type".
I've tried converting the object fields back to string, but get the same error.
dfx = pd.DataFrame()
for col_name in df.columns:
    if df[col_name].dtype == 'object':
        dfx[col_name] = df[col_name].astype('str').copy()
    else:
        dfx[col_name] = df[col_name].copy()
    print(col_name, dfx[col_name].dtype)
dfx.to_sql('results', con=engine, dtype=my_dtypes, if_exists='append', method='multi', index=False)
The new dfx seems to have the same categoricals despite being built as a new DataFrame with .copy().
Also, as a side note, why does to_sql() generate a CREATE TABLE with CLOBs?

No need to use the copy() function here, and you should not have to convert from 'object' to 'str' either.
Are you writing to an Oracle database? The default output type for text data (including 'object') is CLOB. You can get around it by specifying the dtype to use. For example:
import pandas as pd
from sqlalchemy import types, create_engine
from sqlalchemy.exc import InvalidRequestError

conn = create_engine(...)

testdf = pd.DataFrame({'pet': ['dog', 'cat', 'mouse', 'dog', 'fish', 'pony', 'cat'],
                       'count': [2, 6, 12, 1, 45, 1, 3],
                       'x': [105.3, 98.7, 112.4, 3.6, 48.9, 208.9, -1.7]})

test_types = dict(zip(
    testdf.columns.tolist(),
    (types.VARCHAR(length=20), types.Integer(), types.Float())))

try:
    testdf.to_sql(name="test", schema="myschema",
                  con=conn,
                  if_exists='replace',  # or 'append'
                  index=False,
                  dtype=test_types)
    print("Wrote final input dataset to table myschema.test")
except (ValueError, InvalidRequestError):
    print("Could not write to table 'test'.")
If you are not writing to Oracle, please specify your target database - perhaps someone else with experience in that DBMS can advise you.

What @eknumbat said is absolutely correct. For AWS Redshift, you can do the following. Note that you can find all of the SQLAlchemy datatypes here: https://docs.sqlalchemy.org/en/14/core/types.html
import pandas as pd
from sqlalchemy.types import VARCHAR, INTEGER, FLOAT
from sqlalchemy import create_engine

conn = create_engine(...)

testdf = pd.DataFrame({'pet': ['dog', 'cat', 'mouse', 'dog', 'fish', 'pony', 'cat'],
                       'count': [2, 6, 12, 1, 45, 1, 3],
                       'x': [105.3, 98.7, 112.4, 3.6, 48.9, 208.9, -1.7]})

test_types = {'pet': VARCHAR, 'count': INTEGER, 'x': FLOAT}

testdf.to_sql(name="test",
              schema="myschema",
              con=conn,
              if_exists='replace',
              index=False,
              dtype=test_types)
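If you want to verify what to_sql actually created, one way (a sketch, not part of the original answer) is SQLAlchemy's inspector:
from sqlalchemy import inspect

insp = inspect(conn)
for col in insp.get_columns("test", schema="myschema"):
    print(col["name"], col["type"])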

dtype definition for pandas dataframe with columns of VARCHAR or String

I have some data in dictionaries that needs to go into a pandas dataframe.
The dataframe is later written to a PostgreSQL table using sqlalchemy, and I would like to get the right column types.
Hence, I specify the dtypes for the dataframe:
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
"forretningsområde": sqlalchemy.types.String(length=40),
"forretningsproces": sqlalchemy.types.INTEGER(),
"id_namespace": sqlalchemy.types.String(length=100),
"id_lokalId": sqlalchemy.types.String(length=36),
"kommunekode": sqlalchemy.types.INTEGER(),
"registreringFra": sqlalchemy.types.DateTime()}
Later I use df = pd.DataFrame(item_lst, dtype=dtypes), where item_lst is a list of dictionaries.
Independent of whether I use String(8), String(length=8), or VARCHAR(8) in the dtype definition, the result of pd.DataFrame(item_lst, dtype=dtypes) is always the error object of type 'String' has no len() (or 'VARCHAR', respectively).
How do I have to define the dtype to overcome this error?
Instead of forcing data types when the DataFrame is created, let pandas infer the data types (just df = pd.DataFrame(item_lst)) and then use your dtypes dict with to_sql() when you push your DataFrame to the database, like this:
from pprint import pprint
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("sqlite://")
item_lst = [{"forretningshændelse": "foo"}]
df = pd.DataFrame(item_lst)
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   forretningshændelse  1 non-null      object
dtypes: object(1)
memory usage: 136.0+ bytes
None
"""
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8)}
df.to_sql("tbl", engine, index=False, dtype=dtypes)
insp = sqlalchemy.inspect(engine)
pprint(insp.get_columns("tbl"))
"""
[{'autoincrement': 'auto',
  'default': None,
  'name': 'forretningshændelse',
  'nullable': True,
  'primary_key': 0,
  'type': VARCHAR(length=8)}]
"""
I believe you are confusing the dtypes within the DataFrame with the dtypes on the SQL table itself.
You probably don't need to manually specify the datatypes in pandas itself but if you do, here's how.
Spoiler alert: the pandas.DataFrame documentation states that only a single dtype can be passed to the constructor, so you will need some loops or per-column work to get different types.
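For completeness, a minimal sketch of that per-column work (my addition, not part of the original answer): DataFrame.astype accepts a dict mapping column names to pandas dtypes, which avoids an explicit loop. The target dtypes here are hypothetical.
import pandas as pd

df = pd.DataFrame(item_lst)                    # item_lst as in the question
df = df.astype({"forretningsproces": "int64",  # hypothetical per-column pandas dtypes
                "kommunekode": "int64"})
df["registreringFra"] = pd.to_datetime(df["registreringFra"])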
To solve your problem:
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("connection_string")
df = pd.DataFrame(item_lst)
dtypes = {"forretningshændelse": sqlalchemy.types.String(length=8),
          "forretningsområde": sqlalchemy.types.String(40),
          "forretningsproces": sqlalchemy.types.INTEGER(),
          "id_namespace": sqlalchemy.types.String(100),
          "id_lokalId": sqlalchemy.types.String(36),
          "kommunekode": sqlalchemy.types.INTEGER(),
          "registreringFra": sqlalchemy.types.DateTime()}

with engine.connect() as conn:
    df.to_sql("table_name", if_exists="replace", con=conn, dtype=dtypes)
Tip: avoid using special characters in identifiers in general; it only makes maintaining code harder at some point :). I assumed you're creating a new SQL table and not appending; otherwise the column types would already be defined.
Happy Coding!

snowflake.connector SQL compilation error invalid identifier from pandas dataframe

I'm trying to ingest a df I created from a JSON response into an existing table (the table is currently empty because I can't seem to get this to work).
The df looks something like the below table:
index  clicks_affiliated
0      3214
1      2221
but I'm seeing the following error:
snowflake.connector.errors.ProgrammingError: 000904 (42000): SQL
compilation error: error line 1 at position 94
invalid identifier '"clicks_affiliated"'
The column names in Snowflake match the columns in my dataframe.
This is my code:
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas, pd_writer
from pandas import json_normalize
import requests

df_norm = json_normalize(json_response, 'reports')
# I've tried also adding the below line (and removing it) but I see the same error
df = df_norm.reset_index(drop=True)

def create_db_engine(db_name, schema_name):
    engine = URL(
        account="ab12345.us-west-2",
        user="my_user",
        password="my_pw",
        database="DB",
        schema="PUBLIC",
        warehouse="WH1",
        role="DEV"
    )
    return engine

def create_table(out_df, table_name, idx=False):
    url = create_db_engine(db_name="DB", schema_name="PUBLIC")
    engine = create_engine(url)
    connection = engine.connect()
    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )
    except ConnectionError:
        print("Unable to connect to database!")
    finally:
        connection.close()
        engine.dispose()
    return True

print(df.head)
create_table(df, "reporting")
So... it turns out I needed to change my columns in my dataframe to uppercase
I've added this after the dataframe creation to do so and it worked:
df.columns = map(lambda x: str(x).upper(), df.columns)
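For context (my understanding, not stated in the original answer): Snowflake folds unquoted identifiers to upper case, while pd_writer quotes the DataFrame's column names verbatim, so lower-case names only match a table whose columns were created as quoted lower-case identifiers. An equivalent way to upper-case the labels:
df.columns = [str(c).upper() for c in df.columns]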

Parse CSV with far future dates to Parquet

I’m trying to read a CSV into Pandas, and then write it to Parquet. The challenge is that the CSV has a date column with a value of 3000-12-31, and apparently Pandas has no way to store that value as an actual date. Because of that, PyArrow fails to read the date value.
An example file and code to reproduce the issue:
test.csv
t
3000-12-31
import pandas as pd
import pyarrow as pa
df = pd.read_csv("test.csv", parse_dates=["t"])
schema = pa.schema([pa.field("t", pa.date64())])
table = pa.Table.from_pandas(df, schema=schema)
This gives a (somewhat unhelpful) error:
TypeError: an integer is required (got type str)
What's the right way to do this?
Pandas datetime columns (which use the datetime64[ns] data type) indeed cannot store such dates.
One possible workaround is to convert the strings to datetime.datetime objects in an object-dtype column; pyarrow should then be able to accept them to create a date column.
This conversion could, e.g., be done with dateutil:
>>> import dateutil
>>> df['t'] = df['t'].apply(dateutil.parser.parse)
>>> df
                     t
0  3000-12-31 00:00:00
>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]
or, if you use a fixed format, using datetime.datetime.strptime is probably more reliable:
>>> import datetime
>>> df['t'] = df['t'].apply(lambda s: datetime.datetime.strptime(s, "%Y-%m-%d"))
>>> table = pa.Table.from_pandas(df, schema=schema)
>>> table
pyarrow.Table
t: date64[ms]
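Follow-up (my addition): once the Table is built, writing it out to Parquet, which was the original goal, is straightforward:
import pyarrow.parquet as pq

pq.write_table(table, "test.parquet")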

Converting DataFrame into sql

I am using the following code to convert my pandas DataFrame into SQL, but I get the error below, although my dtype is float64 for this particular column.
I have tried to convert my dtype to str, but this did not work.
import sqlite3
import pandas as pd

# create db file
db = conn = sqlite3.connect('example.db')

# convert my df data to sql
df.to_sql('users', con=db, if_exists='replace')
InterfaceError: Error binding parameter 1214 - probably unsupported type.
However, when I check parameter 1214, i.e. column 1214 in my df, the column has a float64 dtype. I don't understand how to solve this problem.
Double-check your data types, as SQLite supports only a limited number of data types (see https://www.sqlite.org/datatype3.html). My guess would be to use a float dtype (so try astype('float') on the offending column).
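A minimal sketch of that suggestion, assuming the offending column can be identified from the error (the column name here is hypothetical):
import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')

# force the offending column to plain floats before writing
df['problem_column'] = df['problem_column'].astype('float')
df.to_sql('users', con=conn, if_exists='replace', index=False)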

psycopg2: can't adapt type 'numpy.int64'

I have a dataframe with the dtypes shown below, and I want to insert the dataframe into a Postgres DB, but it fails due to the error can't adapt type 'numpy.int64'.
id_code         int64
sector          object
created_date    float64
updated_date    float64
How can I convert these types to native Python types, such as from int64 (which is essentially 'numpy.int64') to a classic int that would then be acceptable to Postgres via the psycopg2 client?
data['id_code'].astype(np.int) defaults to int64.
It is nonetheless possible to convert from one numpy type to another (e.g. from int to float):
data['id_code'].astype(float)
changes to
dtype: float64
The bottom line is that psycopg2 doesn't seem to understand numpy datatypes; if anyone has ideas on how to convert them to classic types, that would be helpful.
Updated: Insertion to DB
def insert_many():
    """Add data to the table."""
    sql_query = """INSERT INTO classification(
                       id_code, sector, created_date, updated_date)
                   VALUES (%s, %s, %s, %s);"""
    data = pd.read_excel(fh, sheet_name=sheetname)
    data_list = list(data.to_records())
    conn = None
    try:
        conn = psycopg2.connect(db)
        cur = conn.cursor()
        cur.executemany(sql_query, data_list)
        conn.commit()
        cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()
Add the following somewhere in your code:
import numpy
from psycopg2.extensions import register_adapter, AsIs

def addapt_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

def addapt_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)

register_adapter(numpy.float64, addapt_numpy_float64)
register_adapter(numpy.int64, addapt_numpy_int64)
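A quick way to check the effect (my addition; 'db' stands for the connection parameters from the question): after registering the adapters, binding a numpy value no longer raises "can't adapt type".
import psycopg2

conn = psycopg2.connect(db)                     # 'db' as in the question
cur = conn.cursor()
cur.execute("SELECT %s", (numpy.int64(42),))    # previously raised "can't adapt type 'numpy.int64'"
print(cur.fetchone())
cur.close()
conn.close()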
Same problem here; I successfully solved it after transforming the Series to an ndarray of int.
You can try the following:
data['id_code'].values.astype(int)
--
Update:
If the values include NaN, it still fails.
It seems that psycopg2 can't handle the np.int64 format, so the following works for me:
import numpy as np
import psycopg2
from psycopg2.extensions import register_adapter, AsIs

psycopg2.extensions.register_adapter(np.int64, psycopg2._psycopg.AsIs)
I'm not sure why your data_list contains NumPy data types, but the same thing happens to me when I run your code. Here is an alternative way to construct data_list so that integers and floats end up as their native Python types:
data_list = [list(row) for row in data.itertuples(index=False)]
Alternate approach
I think you could accomplish the same thing in fewer lines of code by using pandas to_sql:
import sqlalchemy
import pandas as pd

engine = sqlalchemy.create_engine("postgresql://username@hostname/dbname")
data = pd.read_excel(fh, sheet_name=sheetname)
data.to_sql('classification', engine, if_exists='append', index=False)
I had the same issue and fixed it using:
df = df.convert_dtypes()