Fastest way of fetching data with parameters from a SQL database with pandas

I'm trying to fetch data from a financial database, and I need this data for several stocks.
Now I'm curious: what's the fastest way of doing this?
For example:
dsn_dwh = 'IvyDB_EU'

import pandas as pd
import pypyodbc as odbc

class dbEU(object):
    # Database Object to access ODBC database
    def __init__(self):
        self._db_connection = odbc.connect(r'DSN=' + dsn_dwh + ';')

    def query(self, query, params):
        return pd.read_sql(query, self._db_connection, params=params)

    def __del__(self):
        self._db_connection.close()

startdat = pd.to_datetime('12/01/2018')
enddat = pd.to_datetime('today')
securities = ['504880', '504881', '504882']

sql = """SELECT * FROM
    [IvyDB_EU].[dbo].[SECURITY_PRICE]
    WHERE [SecurityID] = ? AND Date >= ? AND Date <= ?
    """

df_all = pd.DataFrame()
for s in securities:
    params = [int(s)]
    params.append(startdat.strftime('%m/%d/%Y'))
    params.append(enddat.strftime('%m/%d/%Y'))
    db = dbEU()
    df = db.query(sql, params)
    df_all = df_all.append(df, ignore_index=True)

print(df_all)
Sure, one way would be to do this in the WHERE clause with [SecurityID] IN ('504880', '504881', '504882'), but that only works for this simple query. With larger nested queries I need to be able to find another way.
Thanks,
Marvin
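One option that keeps the query parameterized but needs only a single round trip is to build the list of ? placeholders for the IN clause dynamically from the securities list. This is a minimal sketch reusing the dbEU class and the variables defined above, not a benchmarked answer:

# Build one "?" per security so the IN clause stays parameterized.
placeholders = ', '.join('?' for _ in securities)
sql_in = """SELECT * FROM
    [IvyDB_EU].[dbo].[SECURITY_PRICE]
    WHERE [SecurityID] IN ({}) AND Date >= ? AND Date <= ?
    """.format(placeholders)
params = [int(s) for s in securities]
params += [startdat.strftime('%m/%d/%Y'), enddat.strftime('%m/%d/%Y')]
db = dbEU()
df_all = db.query(sql_in, params)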

Related

Writing a scalable INSERT statement using cx_Oracle

I am attempting to write a script that will allow me to insert values from an uploaded dataframe into a table inside an Oracle DB, but my issue lies with:
- too many columns to hard-code
- the columns aren't one-to-one
What I'm hoping for is a way to write out the columns, check whether they sync with the columns of my dataframe, and from there use an INSERT ... VALUES SQL statement to input the values from the dataframe into the ODS table.
So far these are the important parts of my script:
import pandas as pd
import cx_Oracle
import config

df = pd.read_excel("Employee_data.xlsx")

conn = None
try:
    conn = cx_Oracle.connect(config.username, config.password, config.dsn, encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
finally:
    cursor = conn.cursor()
    sql = "SELECT * FROM ODSMGR.EMPLOYEE_TABLE"
    cursor.execute(sql)
    data = cursor.fetchall()
    col_names = []
    for i in range(0, len(cursor.description)):
        col_names.append(cursor.description[i][0])
    # instead of using df.columns I use:
    rows = [tuple(x) for x in df.values]
which prints my ODS column names and lets me conveniently store the rows from the df in a list, but I'm at a loss for how to import these into the ODS. I found something like:
cursor.execute("insert into ODSMGR.EMPLOYEE_TABLE(col1,col2) values (:col1, :col2)", {":col1df":df, "col2df:df"})
but that would mean I'd have to hard-code everything, which wouldn't be scalable. I'm hoping I can get some sort of insight to help. It's just difficult since the columns aren't 1-to-1 and there is some compression/collapsing of columns from the DF to the ODS, but any help is appreciated.
NOTE: I've also attempted to use SQLAlchemy, but I am always given the error "ORA-12505: TNS:listener does not currently know of SID given in connect descriptor", which is really strange given that I am able to connect with cx_Oracle.
EDIT 1:
I was able to get a list of columns that share the same name; so after running:
import numpy as np
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
I get the list of columns that the two datasets share.
I also tried to use this as my engine:
from sqlalchemy import create_engine, types

engine = create_engine("oracle+cx_oracle://username:password@ODS-test.domain.com:1521/?ODS-Test")
dtyp = {c: types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}
df.to_sql('ODS.EMPLOYEE_TABLE', con=engine, dtype=dtyp, if_exists='append')
which has given me nothing but errors.
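For what it's worth, here is a hedged sketch of one way to avoid hard-coding the INSERT: build the column list and numbered bind placeholders from the common columns found above and pass the matching dataframe rows to cursor.executemany. It assumes the shared columns map one-to-one (the compressed/collapsed columns would still need explicit handling) and reuses the conn, df and a names from the snippets above:

common_cols = list(a)  # columns present in both the dataframe and the ODS table

# Build "INSERT INTO ... (COL1, COL2, ...) VALUES (:1, :2, ...)" dynamically.
col_list = ", ".join(common_cols)
bind_list = ", ".join(":{}".format(i + 1) for i in range(len(common_cols)))
insert_sql = "INSERT INTO ODSMGR.EMPLOYEE_TABLE ({}) VALUES ({})".format(col_list, bind_list)

# Rows as plain tuples, in the same column order as the statement.
rows = list(df[common_cols].itertuples(index=False, name=None))

cursor = conn.cursor()
cursor.executemany(insert_sql, rows)
conn.commit()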

How to select the latest sample per user as testing data?

My data is as below. I want to sort by the timestamp and use the latest sample of each userid as the testing data. How should I do the train/test split? What I have tried is using pandas to sort_values by timestamp and then groupby 'userid', but I only get a groupby object. What is the correct way to do that? Is PySpark a better tool?
After I get the dataframe of the testing data, how should I split the data? Obviously I cannot use sklearn's train_test_split.
You could do the following:
# Sort the data by time stamp
df = df.sort_values('timestamp')
# Group by userid and get the last entry from each group
test_df = df.groupby(by='userid', as_index=False).nth(-1)
# The rest of the values
train_df = df.drop(test_df.index)
You can do the following:
import pyspark.sql.functions as F
max_df = df.groupby("userid").agg(F.max("timestamp"))
# join it back to the original DF
df = df.join(max_df, on="userid")
train_df = df.filter(df["timestamp"] != df["max(timestamp)"])
test_df = df.filter(df["timestamp"] == df["max(timestamp)"])
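If ties on the maximum timestamp are possible, a window-function variant keeps exactly one row per user. A sketch, assuming the same Spark DataFrame df with userid and timestamp columns:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Rank each user's rows from newest to oldest.
w = Window.partitionBy("userid").orderBy(F.col("timestamp").desc())
ranked = df.withColumn("rn", F.row_number().over(w))
test_df = ranked.filter(F.col("rn") == 1).drop("rn")
train_df = ranked.filter(F.col("rn") > 1).drop("rn")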

How to speed up exporting dataframes to MS SQL Server? R is twice as fast

I am very well aware that there are many similar questions, and most answers suggest things like bulk inserts of CSVs (which would be more of a headache than exporting from R).
However, I am not aware of any comparison with R which addresses whether pandas.DataFrame.to_sql() can ever reach the same speed as R's DBI::dbWriteTable.
I run Anaconda on a Windows 10 machine. As a test, I am exporting 15 tables, each with 10,000 rows, 30 columns of random floats and 3 string columns.
The times are:
pyodbc without fast_executemany : 19 seconds
pyodbc with fast_executemany : 7 seconds
turbodbc: 13 seconds
R and DBI::dbWriteTable : 4 seconds
I tried pymssql but I couldn't get it to work (see comments in the code).
fast_executemany is a great improvement, but it still takes almost twice as long as R. The difference vs R is striking! And it also suggests there is nothing inherently wrong with the connection to the SQL server.
If the answer is: "no, R is much faster, period", could someone elaborate why? It would be very interesting to understand what causes the difference.
For gigantic tables I might think of exporting them to a format like parquet or feather, getting R to read them and then exporting to SQL from R. However, for mid-sized tables (anything between 30 and 200 MB of data) I would much rather find a better solution in Python.
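As an illustration of that parquet workaround (a sketch I have not benchmarked; the output path is hypothetical and pyarrow or fastparquet must be installed), the Python side would simply be:

import numpy as np
import pandas as pd

# A stand-in dataframe shaped like the test tables below.
tmp = pd.DataFrame(np.random.rand(10000, 30), columns=[f"c{i}" for i in range(30)])
tmp["a"] = "some long random text"
tmp.to_parquet(r"C:\temp\test_0.parquet", index=False)
# In R: arrow::read_parquet("C:/temp/test_0.parquet") followed by DBI::dbWriteTable().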
PS It's a SQL Server Express running on the very same PC from which I am running Python and R.
Edit
In answer to a comment, I have compared execution times with different sizes, and the ratio of R's time to pyodbc's seems fairly constant. Take it with a grain of salt - I am not a database expert, and there may well be database configurations or other factors I am missing which play a role. The bottom line remains that, in my case, R is almost twice as fast.
My code:
import numpy as np
import pandas as pd
import timeit
from sqlalchemy import create_engine #, MetaData, Table, select
import pymssql

ServerName = "myserver"
Database = "mydb"
params = '?driver=SQL+Server+Native+Client+11.0'

# we define the pyodbc connection
engine_pyo = create_engine('mssql+pyodbc://' + ServerName + '/' + Database + params,
                           encoding='latin1', fast_executemany=True)
conn_pyo = engine_pyo.connect()

# now the turbodbc
engine_turbo = create_engine('mssql+turbodbc://' + ServerName + '/' + Database + params, encoding='latin1')
conn_turbo = engine_turbo.connect()

# pymssql is installed but doesn't work.
# I get:
# connect() got an unexpected keyword argument 'driver'
#engine_pyms = create_engine('mssql+pymssql://' + ServerName + '/' + Database + params, encoding='latin1')
#conn_pyms = engine_pyms.connect()

sheets = 15
rows = int(10e3)

def create_data(sheets, rows):
    df = {}  # dictionary of dataframes
    for i in range(sheets):
        df[i] = pd.DataFrame(data=np.random.rand(rows, 30))
        df[i]['a'] = 'some long random text'
        df[i]['b'] = 'some more random text'
        df[i]['c'] = 'yet more text'
    return df

def data_to_sql(df, conn):
    for d in df:
        df[d].to_sql('Test ' + str(d), conn, if_exists='replace')

# NB: df is a dictionary containing dataframes - it is NOT a dataframe
# df[key] is a dataframe
df = create_data(sheets, rows)

rep = 1
n = 1
t_pyodbc = timeit.Timer("data_to_sql(df, conn_pyo)", globals=globals()).repeat(repeat=rep, number=n)
t_turbo = timeit.Timer("data_to_sql(df, conn_turbo)", globals=globals()).repeat(repeat=rep, number=n)
My R code:
library(tictoc)
library(readxl)
library(tidyverse)
library(dplyr)
library(dbplyr)
library(DBI)

tic("data creation")
rows = 10000
cols = 30
tables = 15
dataframes <- list()
for (i in 1:tables){
  newdf <- as_tibble( matrix(rnorm(rows * cols), rows, cols) )
  newdf <- mutate(newdf, a = 'some long random text')
  newdf <- mutate(newdf, b = 'some more random text')
  newdf <- mutate(newdf, c = 'yet more text')
  dataframes[[i]] <- newdf
}
toc()

tic("set up odbc")
con <- DBI::dbConnect(odbc::odbc(),
                      driver = "SQL Server",
                      server = "LTS-LNWS010\\SQLEXPRESS",
                      database = "CDL",
                      trusted_connection = TRUE)
toc()

tic("SQL export")
for(i in seq_along(dataframes)){
  DBI::dbWriteTable(con, paste("Test_", i), dataframes[[i]], overwrite = TRUE)
}
toc()

pandas gets stuck when trying to read from bigquery

I have a fairly large table in BigQuery (approx. 9M rows) and I would like to read it via pandas.
I've tried reading it using the pd.read_gbq() function, which works fine on small tables.
On the large table it gets stuck after 50 secs or so (logs show elapsed .. 50s) - without giving an error or anything.
My question is: how can I read that table using pandas (chunks?). Any conventions on scaling up these BigQuery reads would be helpful.
EDIT / resolution
Adding to Khan's answer, I ended up implementing chunks, writing 500,000 rows at a time to a file, then reading those files into a dataframe like so:
def download_gbq_table(self):
    if not os.path.exists(self.tmp_dir):
        os.makedirs(self.tmp_dir)
    increment = 100000
    intervals = list(range(0, self.table_size, increment))
    df = pd.DataFrame()
    for offset in intervals:
        query = f"select * from `<table_name>` limit {increment} offset {offset};"
        logger.info(f"running query: {query}")
        start_time = time.time()
        tmp_df = pd.read_gbq(query,
                             project_id=self.connection_parameters['project_id'],
                             private_key=self.connection_parameters['service_account'],
                             dialect='standard')
        df = pd.concat([df, tmp_df])
        logger.info(f'time took: {str(round(time.time() - start_time, 2))}')
        if len(df) % 500000 == 0:
            df.to_csv(os.path.join(self.tmp_dir, f'df_{str(offset + increment)}.csv'))
            df = pd.DataFrame()

def read_df_from_multi_csv(self):
    all_files = glob.glob(os.path.join(self.tmp_dir, "df_*"))
    df_list = []
    for f in all_files:
        start_time = time.time()
        df_list.append(pd.read_csv(f))
        logger.info(f'time took for reading {f}: {str(round(time.time() - start_time, 2))}')
    return pd.concat(df_list)
Pandas' read_gbq function currently does not provide a chunksize parameter (even though its counterpart to_gbq does).
Anyway, you can solve your problem by adding LIMIT and OFFSET to your SQL query and reading from BigQuery iteratively. Something along the lines of:
project_id = "xxxxxxxx"
increment = 100000
offsets = range(0, 9000000, increment)
query_str = "select * from `mydataset.mytable` limit {limit} offset {offset};"
for offset in offsets:
    query = query_str.format(limit=increment, offset=offset)
    df = pd.read_gbq(query, project_id)
    # -- do stuff with your df here..
Not sure if this existed back when the question was originally asked, but now you can use python-bigquery-sqlalchemy (link) to read data from BigQuery, which allows you to use the built-in chunking ability of pandas.read_sql(). You just create a SQLAlchemy connection engine using "bigquery://{project-name}" and pass that to con in pandas.read_sql().
For example:
from sqlalchemy.engine import create_engine
import pandas as pd
read_project = "my-cool-project"
query = f"""
select * from `{read_project}.cool-dataset.cooltable`
"""
bq_engine = create_engine(f"bigquery://{read_project}")
for df in pd.read_sql(query, con=bq_engine, chunksize=100_000):
    # do stuff with df...
    pass

SQL Server: parse a column's value into 5 columns

I have the following column in a table.
daily;1;21/03/2015;times;10
daily;1;01/02/2016;times;8
monthly;1;01/01/2016;times;2
weekly;1;21/01/2016;times;4
How can I parse this by the ; delimiter into different columns?
One way to do it would be to pull it into pandas, split on the semicolon, and put it back into SQL Server. See below for an example, which I tested.
TEST DATA SETUP
CODE
import sqlalchemy as sa
import urllib.parse
import pandas as pd
server = 'yourserver'
read_database = 'db_to_read_data_from'
write_database = 'db_to_write_data_to'
read_tablename = 'table_to_read_from'
write_tablename = 'table_to_write_to'
read_params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER="+server+";DATABASE="+read_database+";TRUSTED_CONNECTION=Yes")
read_engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % read_params)
write_params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER="+server+";DATABASE="+write_database+";TRUSTED_CONNECTION=Yes")
write_engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % write_params)
#Read from SQL into DF
Table_DF = pd.read_sql(read_tablename, con=read_engine)
#Delimit by semicolon
parsed_DF = Table_DF['string_column'].apply(lambda x: pd.Series(x.split(';')))
#write DF back to SQL
parsed_DF.to_sql(write_tablename,write_engine,if_exists='append')
RESULT
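As a side note, a slightly more compact variant of the split step (a sketch using the same string_column name from the code above) relies on the pandas str accessor; the resulting columns are numbered 0-4 and can be renamed as needed:

parsed_DF = Table_DF['string_column'].str.split(';', expand=True)
parsed_DF.columns = ['frequency', 'count', 'date', 'unit', 'value']  # hypothetical names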