Pandas to_sql() performance related to number of columns - pandas

I noticed some odd behaviour in a script of mine which uses pandas' to_sql function to insert large numbers of rows into one of my MSSQL servers.
The performance drops dramatically once the number of columns exceeds 10.
For example:
34484 rows x 10 columns => ~10k records per second
34484 rows x 12 columns => ~500 records per second
I use the fast_executemany flag when establishing the connection. Does anyone have any idea what is going on?
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s?charset=utf8" % params, fast_executemany=True)
sqlalchemy_connection = engine.connect()
....
df.to_sql(name='TEST', con=sqlalchemy_connection, if_exists='append', index=False)
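One thing that may be worth ruling out (an assumption, not something established in the post): with fast_executemany, text columns that to_sql creates as NVARCHAR(max) on SQL Server are a common cause of large slowdowns, and the effect grows with the number of such columns. A minimal sketch of an experiment, assuming the extra columns are strings; the VARCHAR length and chunk size below are placeholders:

from sqlalchemy.types import VARCHAR

# Hypothetical mitigation sketch: cap string columns at a fixed length and
# insert in bounded batches instead of one huge executemany call.
text_cols = df.select_dtypes(include='object').columns
dtype_map = {c: VARCHAR(255) for c in text_cols}   # 255 is a placeholder length

df.to_sql(
    name='TEST',
    con=sqlalchemy_connection,
    if_exists='append',
    index=False,
    dtype=dtype_map,   # avoids NVARCHAR(max) columns on the server
    chunksize=1000,    # placeholder batch size
)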

Pandas efficiently concat DataFrames returned from apply function

I have a pandas.Series of business dates called s_dates. I want to pass each of these dates (together with some other hyper-parameters) to a function called func_sql_to_df which formats an SQL-query and then returns a pandas.DataFrame. Finally, all of the DataFrames should be concatenated (appended) into a single pandas.DataFrame called df_summary where the business date is the identifier.
From here I need to do two things:
export df_summary to an Excel sheet or csv-file.
group df_summary by the dates and then apply another function called func_analysis to each column.
My attempt is something like this:
df_summary = pd.concat(list(
    s_dates.apply(func_sql_to_df, args=hyper_param)
))
df_summary.groupby('dates').apply(func_analysis)
# Export data
...
However, the first statement, where df_summary is defined, takes quite a long time. There are a total of 250 dates; the first couple of iterations take approximately 3 seconds each, but this increases to over 3 minutes after about 100 iterations (and keeps growing). All of the SQL-queries take more or less the same time to execute individually, and the resulting DataFrames all have the same number of observations.
I want to improve the performance of this setup, but I am already not using any explicit loops (only apply functions) and the SQL-query has already been optimized a lot. Any suggestions?
Update: If I am not mistaken, my attempt is actually the suggested solution as stated in the accepted answer to this post.
Update 2: My SQL-query looks something like this. I do not know if all the dates can be passed at once, as the conditions specified in the WHERE clause must hold for each passed value in dates.
select /*+ parallel(auto) */
    MY_DATE as EOD_DATE  -- These are all the elements in 'DATES' passed
    , Var2
    , Var3
    , ColA
    , ColB
    , ...
    , ColN
from Database1
where
    Var2 in (select Var2 from Database2 where update_time < MY_DATE)  -- Cond1
    and Var3 in (select Var3 from Database3 where EOD_DATE = MY_DATE)  -- Cond2
    and cond3
    and cond4
    ...
Running the query for any single date in dates on its own seems to take around 2-8 seconds. However, as mentioned, some of the iterations in the apply-function take more than 3 minutes.
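For context, func_sql_to_df is roughly of the following shape (sketched here; the module-level engine and the bind-parameter style are assumptions, not the original code):

import pandas as pd
from sqlalchemy import text

def func_sql_to_df(eod_date, hyper_param):
    # Sketch: bind one business date into the query and return a DataFrame.
    # 'engine' is assumed to be a module-level SQLAlchemy engine; hyper_param
    # would be bound the same way and is omitted here for brevity.
    query = text("""
        select :eod_date as EOD_DATE, Var2, Var3, ColA  -- abridged column list
        from Database1
        where Var2 in (select Var2 from Database2 where update_time < :eod_date)
          and Var3 in (select Var3 from Database3 where EOD_DATE = :eod_date)
    """)
    return pd.read_sql(query, con=engine, params={"eod_date": eod_date})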
Turns out that using pandas.concat(...) with a pandas.DataFrame.apply(...) as the argument, as in my setup above, is really slow. I compared against a plain for-loop, which gives roughly 10x faster performance.
# ~10x faster
dfs = []
for d in dates:
    dfs.append(func_sql_to_df(d, hyper_param))
df_summary = pd.concat(dfs)  # It is very important that the concat is outside the for-loop
This can even be run in parallel to get much better results:
# ~10x * n_jobs times faster
from joblib import Parallel, delayed

df_summary = pd.concat(
    Parallel(n_jobs=-1)(delayed(func_sql_to_df)(d, hyper_param) for d in dates)
)
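With df_summary assembled, the two remaining steps from the question (export and per-date analysis) are then short; a minimal sketch, where the file name is a placeholder and func_analysis is the per-column function mentioned in the question:

df_summary.to_csv('df_summary.csv', index=False)             # or df_summary.to_excel(...)
results = df_summary.groupby('dates').apply(func_analysis)   # per-date analysis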

Serialized Results too large PySpark Left Join

I have a problem with a final "left" join in a transformation.
It continually fails with the following error: "There is too much data being sent to the driver. 4.0 GiB of serialized data from 10700 tasks exceeds the limit of 4.0 GiB.", and I don't know how to fix it.
The final desired dataset contains 4.5 million rows and shouldn't be complicated to obtain. I have already disabled broadcast joins, but to no avail.
Edit, for more detail:
The join is on the last line of the code (which involves more or less heavy operations), and this is precisely where it stops. Without this join I can build both datasets (df and df2); it is when I execute the join that it returns the error.
df  --> ~2,500,000 rows, 3 columns, 24.5 MB. (Result of an F.explode of DATE for each ID)
df2 --> ~700,000 rows, 10 columns, 29.5 MB. (Result of a union of some datasets)
df_final = df.join(df2, ['ID', 'DATE'], 'left')
Please help me! Thank you!
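For reference, a sketch of the two configuration knobs the error message and the post refer to: the driver-side result size limit and automatic broadcast joins. The 8g value is an arbitrary placeholder, not a confirmed fix:

from pyspark.sql import SparkSession

# Sketch only: raise the driver result-size limit and disable automatic
# broadcast joins before running the transformation.
spark = (
    SparkSession.builder
    .config('spark.driver.maxResultSize', '8g')             # the error reports a 4.0 GiB limit being exceeded
    .config('spark.sql.autoBroadcastJoinThreshold', '-1')   # disable automatic broadcast joins
    .getOrCreate()
)

df_final = df.join(df2, ['ID', 'DATE'], 'left')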

Pandas run function only on subset of whole Dataframe

Let's say I have a DataFrame which has 200 values, prices for products. I want to run some operation on this DataFrame, like calculating the average price over the last 10 prices.
The way I understand it, right now pandas will go through every single row and calculate an average for each row, i.e. the first 9 rows will be NaN, and then for rows 10-200 it would calculate the average for each row.
My issue is that I need to do a lot of these calculations and performance is an issue. For that reason, I would want to run the average only on, say, the last 10 values (I don't need more), while keeping all of the values in the DataFrame. I.e. I don't want to get rid of those values or create a new DataFrame.
I essentially just want to do the calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this with a more "pandas-like" syntax.
Defining your df as follows, we get 200 random prices in the interval [0, 1000):
df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate you can set
    values, too.
    """
    return n + 10

df["price"].iloc[-12:] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns; you simply need to adjust your .apply(...) to take an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
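As a concrete illustration of that axis=1 variant, here is a minimal sketch; the second column and the row-wise function are made up for the example:

import numpy as np
import pandas as pd

# Hypothetical two-column frame; 'quantity' is invented for this example.
df = pd.DataFrame({
    "price": (np.random.rand(200) * 1000.).round(decimals=2),
    "quantity": np.random.randint(1, 10, size=200),
})

def row_total(row: pd.Series) -> float:
    """Row-wise function: total value of a position."""
    return row["price"] * row["quantity"]

# Apply row-wise (axis=1), but only over the last 10 rows.
totals = df.iloc[-10:].apply(row_total, axis=1)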
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)

print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67
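As an aside (not part of either answer above): if the full rolling average over all rows is ever needed, pandas has a vectorised primitive for it that avoids per-row Python calls, and the last-10 mean stays a cheap one-liner:

rolling_avg = df["Price"].rolling(window=10).mean()   # NaN for the first 9 rows
last_10_avg = df["Price"].iloc[-10:].mean()           # average of only the last 10 prices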

Apache Hbase - Fetching large row is extremely slow

I'm running an Apache HBase cluster on AWS EMR. I have a table with a single column family, 75,000 columns and 50,000 rows. I'm trying to get all the column values for a single row, and when the row is not sparse and has 75,000 values, the return time is extremely slow: it takes almost 2.5 seconds to fetch the data from the DB. I'm querying the table from a Lambda function running HappyBase.
import time

import happybase

# Connection setup was omitted from the original snippet; the host is a placeholder.
connection = happybase.Connection('<emr-master-host>')

start = time.time()
col = 'mycol'  # despite the name, this is used as the row key below
table = connection.table('mytable')
row = table.row(col)
end = time.time() - start
print("Time taken to fetch column from database:")
print(end)
What can I do to make this faster? This seems incredibly slow: the return payload is 75,000 key-value pairs and only ~2 MB. It should be much faster than 2 seconds; I'm looking for millisecond return times.
I have a BLOCKCACHE size of 8194kb, a BLOOMFILTER of type ROW, and SNAPPY compression enabled on this table.
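One measurement that may be worth taking (an assumption on my part, not something from the post): check how much of the 2.5 seconds is spent shipping all 75,000 cells through Thrift by fetching only a subset of columns, which HappyBase's row() call supports directly.

# Sketch only: restrict the fetch to the column qualifiers actually needed.
# 'cf' and the qualifier names below are placeholders, not the real schema.
needed = [b'cf:col1', b'cf:col2', b'cf:col3']
partial_row = table.row(col, columns=needed)

If the subset comes back in milliseconds, the cost is likely dominated by serialising and transferring the full row rather than by the HBase read itself.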

store matrix data in SQLite for fast retrieval in R

I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, in which case the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it, a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it's of dimensions 300,000 rows and 1,000 columns). Each row is now the unique id in the sqlite table.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table, probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but it was mostly to store tabular data where each column had a name. In this case each column is just one point of the data matrix; I guess I could just name them col1 ... col1000? Or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
            "CREATE TABLE School
             (SchID INTEGER,
              Location TEXT,
              Authority TEXT,
              SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
            "CREATE TABLE mymatrixdata
             (myid TEXT,
              col1 float,
              col2 float,
              .... etc.....
              col1000 float)")
I.e., I would have to type in col1 through col1000 manually, which doesn't sound very smart. This is where I am mostly stuck. Some code snippet would help me.
Then, I need to dump the text files into the SQLite database? Again, unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code for how to dump this into the database...
Finally,
How to fast retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But there id1, id2, id30 are hardcoded in the code, and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100; and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here; no need to code all 48 files, just a simple example would be great!
Note: I am using a Linux server, SQLite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m)) # write
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
                          order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set-based, and the order in which rows are stored is not guaranteed. We have used order by row_names to get the rows out in a specific order. If that is not good enough, then add a column giving the row index: 1, 2, 3, ....
REVISED based on comments.