I have developed a Python script that uses the pandas module to write an Excel file.
When I execute print(df1.columns), the dtype is reported as 'object'.
I then use the same Excel file to load a Teradata table with a TPT script, and I get the error below:
FILE_READER[1]: TPT19108 Data Format 'DELIMITED' requires all 'VARCHAR/JSON/JSON BY NAME/CLOB BY NAME/BLOB BY NAME/XML BY NAME/XML/CLOB' schema.
The schema definition used in the TPT script is:
DEFINE SCHEMA Teradata__DATA
DESCRIPTION 'SCHEMA OF Teradata data'
(
Issue_Key VARCHAR(255),
Log_Date VARDATE(10) FORMATIN ('YYYY-MM-DD') FORMATOUT ('YYYY-MM-DD'),
User_Name VARCHAR(255),
Time_Spent NUMBER(10,2)
);
Please help me resolve this failure. The error might be due to a datatype mismatch or to the delimiter being defined as TAB. Please also suggest any other reason that could cause this failure.
CODE
import pandas as pd

df = pd.read_excel('Time_Log_Source_2019-05-30.xlsx', sheet_name='Sheet1', dtype=str)
print("Column headings:")
print(df.columns)
# Keep only the required columns
df = pd.DataFrame(df, columns=['Issue Key', 'Log Date', 'User', 'Time Spent(Sec)'])
# Keep only the YYYY-MM-DD part of the date
df['Log Date'] = df['Log Date'].str[:10]
# Convert seconds to hours
df['Time Spent(Sec)'] = df['Time Spent(Sec)'].astype(int)/3600
print(df)
df.to_excel("Time_Log_Source_2019-05-30_output.xlsx")
df1 = pd.read_excel('Time_Log_Source_2019-05-30_output.xlsx', sheet_name='Sheet1',dtype=str)
df1['Issue Key'] = df1['Issue Key'].astype('str')
df1['Log Date'] = df1['Log Date'].astype('str')
df1['User'] = df1['User'].astype('str')
df1['Time Spent(Sec)'] = df1['Time Spent(Sec)'].astype('str')
df1.to_excel("Time_Log_Source_2019-05-30_output.xlsx",startrow=0, startcol=0, index=False)
print(type(df1['Time Spent(Sec)']))
print(df.columns)
print(df1.columns)
Result
Index([u'Issue Key', u'Log Date', u'User', u'Time Spent(Sec)'], dtype='object')
Index([u'Issue Key', u'Log Date', u'User', u'Time Spent(Sec)'], dtype='object')
A TPT Schema describes fields in client-side records, not columns in the database table. You would need to change the schema to say the (input) Time_Spent is VARCHAR.
But TPT does not natively read .xlsx files. Consider using to_csv instead of to_excel.
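A rough sketch of the pandas side (reusing the file name from the question; adjust as needed): write a tab-delimited text file that the DELIMITED format can read, and change Time_Spent in the schema to VARCHAR as described above.
# Write a TAB-delimited flat file (all values are already strings) for TPT's DELIMITED format
df1.to_csv("Time_Log_Source_2019-05-30_output.txt", sep="\t", index=False, header=False)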
Related
I tried to copy a parquet file to a table in Azure Synapse by using Polybase T-SQL. Here is an example:
data = [["Jean", 15, "Tennis"], ["Jane", 20, "Yoga"], ["Linda", 35, "Yoga"], ["Linda", 35, "Tennis"]]
columns = ["Name", "Age", "Sport"]
df = spark.createDataFrame(data, columns)
Then I save the dataframe as a parquet file by partitioning by the column "Sport":
df\
.write\
.option("header", True)\
.partitionBy("Sport")\
.format("parquet")\
.mode("overwrite")\
.save("/mnt/test_partition_polybase.parquet")
Then I use Polybase T-SQL for copying from a parquet file to a table in Synapse:
IF OBJECT_ID(N'EVENTSTORE.TEST') IS NOT NULL BEGIN DROP EXTERNAL TABLE EVENTSTORE.TEST END
CREATE EXTERNAL TABLE EVENTSTORE.TEST(
[Name] NVARCHAR(250),
[Age] BIGINT,
[SPORT] NVARCHAR(250)
)
WITH (
    DATA_SOURCE = [my_data_source],
    LOCATION = N'test_partition_polybase.parquet',
    FILE_FORMAT = [SynapseParquetFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
)
I get the error:
External file access failed due to internal error: 'File
test_partition_polybase.parquet/Sport=Tennis/part-00000-tid-5109481329839379912-631db0ad-cd52-4f9e-acf6-76828a8aa4eb-67-1.c000.snappy.parquet:
HdfsBridge::CreateRecordReader - Unexpected error encountered creating
the record reader: HadoopExecutionException: Column count mismatch.
Source file has 2 columns, external table definition has 3 columns
It's because the partitioned column "Sport" is not in the partitioned file. How can I solve this issue?
I finally solved my problem by creating a duplicated column:
from pyspark.sql import functions as F

df\
.withColumn("part_Sport", F.col("Sport"))\
.write\
.option("header", True)\
.partitionBy("part_Sport")\
.format("parquet")\
.mode("overwrite")\
.save("/mnt/test_partition_polybase.parquet")
I don't know if it is the best solution but that's the only one I have found.
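As a quick sanity check (a sketch only, assuming the same mount path as above), you can read one partition directory back and confirm that the original Sport column is now physically present in the data files, so it matches the three-column external table definition:
# The duplicated "Sport" column should now appear in the file schema
# alongside Name and Age (part_Sport exists only as the partition directory).
df_check = spark.read.parquet("/mnt/test_partition_polybase.parquet/part_Sport=Tennis")
df_check.printSchema()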
Goal
I'm trying to use pandas DataFrame.to_sql() to send a large DataFrame (>1M rows) to an MS SQL server database.
Problem
The command is significantly slower on one particular DataFrame, taking about 130 sec to send 10,000 rows. In contrast, a similar DataFrame takes just 7 sec to send the same number of rows. The latter DataFrame actually has more columns, and more data as measured by df.memory_usage(deep=True).
Details
The SQLAlchemy engine is created via
engine = create_engine('mssql+pyodbc://@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server', fast_executemany=True)
The to_sql() call is as follows:
df[i:i+chunksize].to_sql(table, conn, index=False, if_exists='replace')
where chunksize = 10000.
I've attempted to locate the bottleneck via cProfile, but this only revealed that nearly all of the time is spent in pyodbc.Cursor.executemany.
Any tips for debugging would be appreciated!
Cause
The performance difference is due to an issue in pyodbc: passing None values to SQL Server INSERT statements while using the fast_executemany=True option results in slowdowns.
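To confirm that missing values are indeed the culprit (a quick diagnostic sketch; df here is the slow DataFrame from the question), count the missing values per column:
# Columns with many NaN/None values are the ones that trigger the
# pyodbc slowdown under fast_executemany.
print(df.isna().sum())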
Solution
We can pack the values as JSON and use OPENJSON (supported on SQL Server 2016+) instead of fast_executemany. This solution resulted in a 30x performance improvement in my application! Here's a self-contained example, based on the documentation here, but adapted for pandas users.
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({'First Name': ['Homer', 'Ned'], 'Last Name': ['Simpson', 'Flanders']})
rows_as_json = df.to_json(orient='records')
server = '<server name>'
db = '<database name>'
table = '<table name>'
engine = create_engine(f'mssql+pyodbc://<user>:<password>@{server}/{db}')
sql = f'''
INSERT INTO {table} ([First Name], [Last Name])
SELECT [First Name], [Last Name] FROM
OPENJSON(?)
WITH (
[First Name] nvarchar(50) '$."First Name"',
[Last Name] nvarchar(50) '$."Last Name"'
)
'''
cursor = engine.raw_connection().cursor()
cursor.execute(sql, rows_as_json)
cursor.commit()
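For the large (>1M row) DataFrame from the question, you would presumably send the JSON in chunks rather than as one giant string; a rough sketch (the chunk size is arbitrary):
chunksize = 10000
raw = engine.raw_connection()
cursor = raw.cursor()
for start in range(0, len(df), chunksize):
    # Serialize only the current slice of rows to JSON and insert it
    cursor.execute(sql, df[start:start + chunksize].to_json(orient='records'))
raw.commit()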
Alternative Workarounds
Export data to CSV and use an external tool to complete the transfer (for example, the bcp utility).
Artificially replace values that would be converted to None with a non-empty filler value, add a helper column that flags which rows were changed, run to_sql() as normal, and then reset the filler values to NULL with a separate query based on the helper column (see the sketch below).
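A minimal sketch of that second workaround; the column name 'amount' and the sentinel value are made up for illustration:
FILLER = -999999  # sentinel that never occurs in the real data (assumption)

df['amount_was_null'] = df['amount'].isna()   # helper column flagging changed rows
df['amount'] = df['amount'].fillna(FILLER)
df.to_sql(table, conn, index=False, if_exists='replace')

# Then reset the sentinel back to NULL on the server side, e.g.:
# UPDATE <table> SET amount = NULL WHERE amount_was_null = 1;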
Acknowledgement
Many thanks to @GordThompson for pointing me to the solution!
I am trying to export a pandas df to SQL server using the following code:
dtypedict={"column1": sqlalchemy.types.VARCHAR(length=50),
"column2": sqlalchemy.types.VARCHAR(length=50),
'column3': sqlalchemy.types.VARCHAR(length=50),
'column4': sqlalchemy.types.INTEGER(),
'column5': sqlalchemy.types.Date(),
'column6': sqlalchemy.types.VARCHAR(length=50),
'column7': sqlalchemy.types.VARCHAR(length=50),
'column8': sqlalchemy.types.VARCHAR(length=50),
'column9': sqlalchemy.types.VARCHAR(length=50)}
staging.columns = staging.columns.str.replace(' ','_')
staging.fillna('', inplace=True)
server = 'SV'
database = 'DB'
username = 'U'
password = 'PW'
cnxn = 'DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(cnxn))
staging.to_sql('Staging_table', schema='dbo', con=engine, chunksize=50, index=False, if_exists='replace', dtype=dtypedict, method='multi')
For some reason, I keep getting the following error:
('22018', "[22018] [Microsoft][ODBC SQL Server Driver][SQL Server]Conversion failed when converting the nvarchar value 'No EQ' to data type int. (245) (SQLExecDirectW)")
This 'No EQ' value appears in column2. I do not understand why the function attempts to convert an nvarchar value to int when the specified dtype is varchar. Examining the column type in Azure Data Studio, varchar is also correctly shown as the SQL type for column2. The df dtype is object.
Unfortunately I have to use the old SQL server driver and cannot download a recent ODBC one due to permission restrictions.
Does anyone have suggestions on how to fix this?
I double-checked whether the column names in the dict are correct, and they are; everything resolves to True:
col_list = staging.columns.to_list()
for col in col_list:
    for key in dtypedict.items():
        if col in key:
            print('True for ' + col)
It looks like columns that contained both integer and string values were misinterpreted as int columns, either because of the object dtype of the df or somewhere in the to_sql process, even though the dtypes were specified via the dict.
I converted all of the mixed-dtype columns with .astype(str), and now to_sql works for all columns!
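A minimal sketch of that conversion, reusing the names from the question (which columns actually hold mixed values is an assumption here):
# Force the mixed int/str columns to plain strings before calling to_sql
for col in ['column2', 'column3']:  # assumed to be the mixed columns
    staging[col] = staging[col].astype(str)
staging.to_sql('Staging_table', schema='dbo', con=engine, chunksize=50,
               index=False, if_exists='replace', dtype=dtypedict, method='multi')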
I'd like to parse a dataframe into two pre-defined columns in an SQL table. The SQL schema is:
abc(varchar(255))
def(varchar(255))
With a dataframe like so:
df = pd.DataFrame(
    [
        [False, False],
        [True, True],
    ],
    columns=["ABC", "DEF"],
)
And the SQL query is like so:
with conn.cursor() as cursor:
    string = "INSERT INTO {0}.{1}(abc, def) VALUES (?,?)".format(db, table)
    cursor.execute(string, (df["ABC"]), (df["DEF"]))
    cursor.commit()
So that the query (string) looks like so:
'INSERT INTO my_table(abc, def) VALUES (?,?)'
This creates the following error message:
pyodbc.Error: ('HY004', '[HY004] [Cloudera][ODBC] (11320) SQL type not supported. (11320) (SQLBindParameter)')
So I try a direct query (not via Python) in the Impala editor:
INSERT INTO my_table(abc, def) VALUES ('Hey','Hi');
And it produces this error message:
AnalysisException: Possible loss of precision for target table 'my_table'. Expression ''hey'' (type: STRING) would need to be cast to VARCHAR(255) for column 'abc'
How come I cannot insert even simple strings like "Hi" into my table? Is my schema set up incorrectly, or is something else going on?
STRING type in Impala has a size limit of 2GB.
VARCHAR's length is whatever you define it to be, but not more than 64KB.
Thus there is a potential of data loss if you implicitly convert one into another.
By default, literals are treated as type STRING. So, in order to insert a literal into VARCHAR field you need to CAST it appropriately.
INSERT INTO my_table(abc, def) VALUES (CAST('Hey' AS VARCHAR(255)),CAST('Hi' AS VARCHAR(255)));
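Back in Python, the same idea can be applied to the original insert. A rough sketch only (it assumes the Cloudera ODBC driver accepts parameter markers inside CAST, and it converts the boolean DataFrame values to strings first):
# One parameter tuple per row, with booleans converted to strings
rows = [(str(a), str(d)) for a, d in zip(df["ABC"], df["DEF"])]
string = ("INSERT INTO {0}.{1}(abc, def) "
          "VALUES (CAST(? AS VARCHAR(255)), CAST(? AS VARCHAR(255)))").format(db, table)
with conn.cursor() as cursor:
    cursor.executemany(string, rows)
    cursor.commit()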
I'm trying to load data from a pandas DataFrame into a BigQuery table. The DataFrame has a column of dtype datetime64[ns], and when I try to store the df using load_table_from_dataframe(), I get
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table [table name]. Field computation_triggered_time has changed type from DATETIME to TIMESTAMP.
The table has a schema which reads
CREATE TABLE `[table name]` (
...
computation_triggered_time DATETIME NOT NULL,
...
)
In the DataFrame, computation_triggered_time is a datetime64[ns] column. When I read the original DataFrame from CSV, I convert it from text to datetime like so:
df['computation_triggered_time'] = \
    pd.to_datetime(df['computation_triggered_time']).values.astype('datetime64[ms]')
Note:
The .values.astype('datetime64[ms]') part is necessary because load_table_from_dataframe() uses PyArrow to serialize the df and that fails if the data has nanosecond-precision. The error is something like
[...] Casting from timestamp[ns] to timestamp[ms] would lose data
This looks like a problem with Google's google-cloud-python package; can you report the bug there? https://github.com/googleapis/google-cloud-python
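In the meantime, one possible workaround (a sketch only, and it may depend on the library version; client and table_id are assumed to be your existing BigQuery client and fully-qualified table name) is to pass an explicit schema so the client treats the column as DATETIME instead of the auto-detected TIMESTAMP:
from google.cloud import bigquery

# Declare the column type explicitly so the datetime64[ns] dtype is not
# auto-mapped to TIMESTAMP during serialization.
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("computation_triggered_time", "DATETIME")],
)
client.load_table_from_dataframe(df, table_id, job_config=job_config).result()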