Goal
I'm trying to use pandas DataFrame.to_sql() to send a large DataFrame (>1M rows) to an MS SQL server database.
Problem
The command is significantly slower on one particular DataFrame, taking about 130 sec to send 10,000 rows. In contrast, a similar DataFrame takes just 7 sec to send the same number of rows. The latter DataFrame actually has more columns, and more data as measured by df.memory_usage(deep=True).
Details
The SQLAlchemy engine is created via
engine = create_engine('mssql+pyodbc://@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server', fast_executemany=True)
The to_sql() call is as follows:
df[i:i+chunksize].to_sql(table, conn, index=False, if_exists='replace')
where chunksize = 10000.
I've attempted to locate the bottleneck via cProfile, but this only revealed that nearly all of the time is spent in pyodbc.Cursor.executemany.
Any tips for debugging would be appreciated!
Cause
The performance difference is due to an issue in pyodbc: passing None values as parameters to SQL Server INSERT statements with the fast_executemany=True option results in significant slowdowns.
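To confirm which columns are responsible, it helps to check for missing values before calling to_sql(). A minimal, self-contained sketch (the DataFrame below is a toy stand-in for the real one; NaN/None cells are what get sent to pyodbc as None parameters):
import numpy as np
import pandas as pd
# Toy DataFrame standing in for the real one; column 'b' contains NaN,
# which pandas hands to pyodbc as None and which triggers the slow path.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1.0, np.nan, 3.0]})
# Columns with any missing values are the ones to look at.
null_counts = df.isna().sum()
print(null_counts[null_counts > 0])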
Solution
We can pack the values as JSON and use OPENJSON (supported on SQL Server 2016+) instead of fast_executemany. This solution resulted in a 30x performance improvement in my application! Here's a self-contained example, based on the documentation here, but adapted for pandas users.
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({'First Name': ['Homer', 'Ned'], 'Last Name': ['Simpson', 'Flanders']})
rows_as_json = df.to_json(orient='records')
server = '<server name>'
db = '<database name>'
table = '<table name>'
engine = create_engine(f'mssql+pyodbc://<user>:<password>@{server}/{db}')
sql = f'''
INSERT INTO {table} ([First Name], [Last Name])
SELECT [First Name], [Last Name]
FROM OPENJSON(?)
WITH (
    [First Name] nvarchar(50) '$."First Name"',
    [Last Name] nvarchar(50) '$."Last Name"'
)
'''
cursor = engine.raw_connection().cursor()
cursor.execute(sql, rows_as_json)
cursor.commit()
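Since the original DataFrame has more than 1M rows, the JSON payload can get large; the same idea can be applied chunk by chunk. A sketch reusing df, engine, and sql from the example above (the chunk size is illustrative):
chunksize = 10000
conn = engine.raw_connection()
cursor = conn.cursor()
for start in range(0, len(df), chunksize):
    # serialize only the current slice and send it in one OPENJSON insert
    chunk_json = df.iloc[start:start + chunksize].to_json(orient='records')
    cursor.execute(sql, chunk_json)
conn.commit()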
Alternative Workarounds
Export data to CSV and use an external tool to complete the transfer (for example, the bcp utility).
Artificially replace the values that would be converted to None with a non-empty filler value, add a helper column indicating which rows were changed, run to_sql() as normal, then reset the filler values back to NULL with a separate query keyed on the helper column (a rough sketch follows below).
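A rough sketch of that second workaround, with purely illustrative names (the filler value, helper column, and table name are all placeholders):
import numpy as np
import pandas as pd
FILLER = -999999  # placeholder value; must not collide with real data
df = pd.DataFrame({'val': [1.0, np.nan, 3.0]})   # toy stand-in for the real DataFrame
df['val_was_null'] = df['val'].isna()            # helper column flagging replaced rows
df['val'] = df['val'].fillna(FILLER)
# df.to_sql('my_table', conn, index=False, if_exists='append')   # as in the question
# ...then restore the NULLs with a separate query, e.g.:
# UPDATE my_table SET val = NULL WHERE val_was_null = 1;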
Acknowledgement
Many thanks to @GordThompson for pointing me to the solution!
Related
I have a dataframe that contains 391 columns and a number of rows. I am trying to push this to a database via pyodbc and using the following command:
cursor = conn.cursor()
cursor.fast_executemany = True
cursor.executemany(
    f"INSERT INTO db.tble({', '.join(df.columns.tolist())}) VALUES ({('?,' * len(df.columns))[:-1]})",
    list(df.itertuples(index=False, name=None))
)
cursor.commit()
I would have thought this method would be dynamic for a dataframe of any size yet I get the following error:
ProgrammingError: ('Expected 0 parameters, supplied 391', 'HY000')
I am struggling to understand this, since the syntax looks correct and ? placeholders are used instead of %s as in other answers. Can someone please help?
Thanks
I once wrote a piece of code, where I wanted to create the insert statement dynamically based on number of columns in the data frame:
here is how the insert query would be passed to the database:
INSERT INTO dbo.Table (column1,columns2,column3) VALUES (?,?,?)
Again, the column list and the '?' placeholders need to be built dynamically at runtime, based on the number of columns in the data frame.
I wrote the piece below to build a string of ?,?,? and concatenate it with the insert query, where:
df is the dataframe,
symbol_counter holds the number of additional placeholders needed (the list starts with one '?'),
sym_string is the final string, i.e. ?,?,?,...? matching the number of columns.
symbol = ['?']
symbol_counter = int(df.shape[1]) - 1   # one '?' is already in the list
for word in range(symbol_counter):
    symbol.insert(word, "?")            # add one more '?' per remaining column
sym_string = ','.join(symbol)           # e.g. '?,?,?' for a three-column frame
# and then concatenate this variable with the rest of the query as shown below
query = Variable_holding_first_partofthequery + " VALUES (" + sym_string + ")"
I know it's the long way around, but that's how I got it to work. Good luck!
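For reference, the same placeholder string can also be built in one line with str.join; a minimal sketch (the column names and table name below are made up):
import pandas as pd
df = pd.DataFrame({'col1': [1], 'col2': [2], 'col3': [3]})   # toy stand-in
placeholders = ','.join(['?'] * df.shape[1])                 # '?,?,?'
columns = ', '.join(f'[{c}]' for c in df.columns)
query = f"INSERT INTO dbo.Table ({columns}) VALUES ({placeholders})"
print(query)   # INSERT INTO dbo.Table ([col1], [col2], [col3]) VALUES (?,?,?)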
I am trying to extract and read the data from a SQL query.
Below is the sample data from SQL Developer:
target_name   expected_instances  environment  system_name       hostname
--------------------------------------------------------------------------
ORAUAT_host1  1                   UAT          ORAUAT_host1_sys  host1.sample.net
ORAUAT_host2  1                   UAT          ORAUAT_host1_sys  host2.sample.net
Normally I pass the system_name to the query (which has a bind variable for system_name) and get the data back as a list, but not the column names.
Is there a way in Python to retrieve the data along with the column names, and to reference values by column name, e.g. target_name[0] giving the value ORAUAT_host1? Please suggest. Thanks.
If what you want is to get the column names from the table you are querying, you can do something like this:
My example writes the result of the query to a CSV file:
import csv
import cx_Oracle

db = cx_Oracle.connect('user/pass@host:1521/service_name')
SQL = "select * from dual"
print(SQL)
cursor = db.cursor()
f = open(r"C:\dual.csv", "w")
writer = csv.writer(f, lineterminator="\n", quoting=csv.QUOTE_NONNUMERIC)
cursor.execute(SQL)
# this takes the column names
col_names = [row[0] for row in cursor.description]
writer.writerow(col_names)
for row in cursor:
    writer.writerow(row)
f.close()
The way to get the column names is via the description attribute of the cursor object:
Cursor.description
This read-only attribute is a sequence of 7-item sequences. Each of
these sequences contains information describing one result column:
(name, type, display_size, internal_size, precision, scale, null_ok).
This attribute will be None for operations that do not return rows or
if the cursor has not had an operation invoked via the execute()
method yet.
The type will be one of the database type constants defined at the
module level.
https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#
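To address the other half of the question, referencing values by column name, one option is to zip cursor.description with each fetched row. A minimal sketch (the connection string and query are placeholders; substitute the real ones):
import cx_Oracle
db = cx_Oracle.connect('user/pass@host:1521/service_name')   # placeholder credentials
cursor = db.cursor()
cursor.execute("select * from dual")                         # substitute the real query
col_names = [d[0] for d in cursor.description]
rows = [dict(zip(col_names, r)) for r in cursor]
# Each value can now be referenced by column name, e.g. rows[0]['TARGET_NAME']
# against the query from the question (Oracle returns column names in upper case).
print(rows)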
I am using R to handle large datasets (largest dataframe 30,000,000 x 120). These are stored in Azure Data Lake Storage as parquet files, and we need to query them daily and restore them in a local SQL database. Parquet files can be read without loading the data into memory, which is handy. However, creating SQL tables from parquet files is more challenging, as I'd prefer not to load the data into memory.
Here is the code I used. Unfortunately, this is not a perfect reprex, as the SQL database needs to exist for this to work.
# load packages
library(tidyverse)
library(arrow)
library(sparklyr)
library(DBI)
# Create test data
test <- data.frame(matrix(rnorm(20), nrow=10))
# Save as parquet file
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(test, parquet_path)
# Load main table
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
test <- spark_read_parquet(sc, name = "test_main", path = parquet_path, memory = FALSE, overwrite = TRUE)
# Save into SQL table
DBI::dbWriteTable(conn = connection,
name = DBI::Id(schema = "schema", table = "table"),
value = test)
Is it possible to write to a SQL table without loading the parquet files into memory?
I lack experience with T-SQL bulk import and export, but this is likely where you'll find your answer.
library(arrow)
library(DBI)
test <- data.frame(matrix(rnorm(20), nrow=10))
f <- tempfile(fileext = '.parquet')
write_parquet(test, f)
#Upload table using bulk insert
dbExecute(connection,
  paste0("
    BULK INSERT [database].[schema].[table]
    FROM '", gsub('\\\\', '/', f), "'
    WITH (FORMAT = 'PARQUET');
  ")
)
Here I use T-SQL's own BULK INSERT command.
Disclaimer: I have not yet used this command in T-SQL, so it may be riddled with errors. For example, I can't see a place to specify snappy compression within the documentation, although it can be specified if one instead defines a custom file format with CREATE EXTERNAL FILE FORMAT.
Now the above only inserts into an existing table. For your specific case, where you'd like to create a new table from the file, you would likely be looking more for OPENROWSET using CREATE TABLE AS [select statement].
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')
dbExecute(connection,
  paste0("
    CREATE TABLE MySqlTable
    AS
    SELECT *
    FROM OPENROWSET(
      BULK '", f, "', FORMAT = 'PARQUET'
    ) WITH (
      ", column_definition, "
    );
  ")
)
where column_defs would be a named list or vector giving the SQL data-type definition for each column. A (more or less) complete translation from R data types to T-SQL data types is available on the T-SQL documentation page (note that two very necessary translations, Date and POSIXlt, are not covered). Once again, disclaimer: my time in T-SQL did not get as far as BULK INSERT or similar.
I'm trying to write an SCollection to a partition in BigQuery using:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val date = LocalDate.parse("2017-06-21")
val col = sCollection.typedBigQuery[Blah](query)
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.ISO_LOCAL_DATE),
writeDisposition = WriteDisposition.WRITE_EMPTY,
createDisposition = CreateDisposition.CREATE_IF_NEEDED)
The error I get is
Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used.
How can I write to a partition? I don't see any options to specify partitions via either saveAsTypedBigQuery method so I was trying the Legacy SQL table decorators.
See: BigqueryIO Unable to Write to Date-Partitioned Table. You need to manually create the table. BQ IO cannot create a table and partition it.
Additionally, the "table decorators cannot be used" part was a complete red herring; it was the alphanumeric requirement I was missing.
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.BASIC_ISO_DATE),
writeDisposition = WriteDisposition.WRITE_APPEND,
createDisposition = CreateDisposition.CREATE_NEVER)
I am attempting to extract data from Cassandra into a specific partitioned Hive table using Spark 2.1.1 on Hadoop 2.7. To do this, I pull all the data from Cassandra into an RDD, which I transform into a DataFrame via rdd.toDF() and pass into the following function:
def writeToHive(ss: SparkSession, df: DataFrame): Unit = {
    df.createOrReplaceTempView(tablename)
    val cols = df.columns
    val schema = df.schema
    // logs 358
    LOG.info(s"""SELECT COUNT(*) FROM ${tablename}""")
    val outdf = ss.sql(s"""INSERT INTO TABLE ${db}.${t} PARTITION (date="${destPartition}") SELECT * FROM ${tablename}""")
    // Have also tried the lines below, but they yielded the same results
    // var dfInput_1 = dfInput.withColumn("region", lit(s"${destPartition}"))
    // dfInput_1.write.mode("append").insertInto(s"${db}.${t}")
    // logs 358
    LOG.info(s"""SELECT COUNT(*) FROM ${tablename}""")
    // logs 423
    LOG.info(s"""SELECT COUNT(*) FROM ${db}.${t} where date='${destPartition}'""")
}
When looking in Cassandra, there are indeed 358 rows in the table. I saw this post on Hortonworks https://community.hortonworks.com/questions/51322/count-msmatch-while-using-the-parquet-file-in-spar.html but there doesn't seem to be a solution. I have tried setting spark.sql.hive.metastorePartitionPruning to true, but no changes were seen in the row counts.
Would love any feedback as to why there is a discrepancy between the row counts. Thanks!
EDIT: bad data coming in.... should've seen that coming
Sometimes data contains non-UTF-8 characters, for example Japanese or Chinese text in another encoding. Check whether the data contains any such non-UTF-8 characters.
If that is the case, store the table in ORC format. By default the table format is text, and text doesn't support non-UTF-8 characters.