JDBC going extremely slow after a few big SQL selects in Scala - sql

At the moment I'm trying to get a lot of data (26 million entries) out of a SQL database and put it into a LevelDB.
I'm using Scala and a JDBC connection pool. This is my query:
"SELECT ts.block_id, ts.sender, ts.fee, ts.recipient AS transaction_recipient, ts.amount AS transaction_amount, ts.type, tf.recipient AS transfer_recipient, tf.amount AS transfer_amount FROM transactions ts LEFT OUTER JOIN transfers tf ON (tf.transaction_id = ts.transaction_id) ORDER BY ts.block_id ASC LIMIT " + (x*sqlBatchSize) + "," + sqlBatchSize
Here sqlBatchSize is the number of entries I want to fetch in one SQL statement and x is the iteration counter. So effectively I'm asking SELECT ... LIMIT 0,10000, then LIMIT 10000,10000, then LIMIT 20000,10000 and so on.
I tried it with different batch sizes, from 5k to 10k, 100k, 500k, 5kkk and so on. I also tried starting from an offset, so that I only get the entries from the last half of the database, and similar variations.
I always have the same problem: the first query is fast, the second one is slower, and the following ones get slower and slower. It doesn't matter whether I request 10x 10k or 10x 500k entries, it's always the same.
Here is my code, shortened a bit:
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.sql.{Connection, DriverManager}
import org.apache.commons.dbcp2._
import com.google.common.primitives.Longs
import com.typesafe.config._
import scala.math.BigInt
import java.io.File
import java.util.Calendar
import org.iq80.leveldb
import org.iq80.leveldb.impl.Iq80DBFactory._
import org.iq80.leveldb.Options
object Datasource {
  // Load config
  val conf = ConfigFactory.load()
  val dbUrl = "jdbc:mysql://" + conf.getString("database.db_host") + ":" + conf.getString("database.db_port") + "/" + conf.getString("database.db_name")

  val connectionPool = new BasicDataSource()
  connectionPool.setUsername(conf.getString("database.db_user"))
  connectionPool.setPassword(conf.getString("database.db_password"))
  connectionPool.setDriverClassName("com.mysql.jdbc.Driver")
  connectionPool.setUrl(dbUrl)
  connectionPool.setInitialSize(3)
}
object main extends App {
  var durchlaufen = 1
  var x = 0
  var sqlBatchSize = 50000

  if (durchlaufen == 1) {
    try {
      var grenze = 10
      while (x < grenze) {
        val connection = Datasource.connectionPool.getConnection
        val rs = connection.createStatement.executeQuery("SELECT ts.block_id, ts.sender, ts.fee, ts.recipient AS transaction_recipient, ts.amount AS transaction_amount, ts.type, tf.recipient AS transfer_recipient, tf.amount AS transfer_amount FROM transactions ts LEFT OUTER JOIN transfers tf ON (tf.transaction_id = ts.transaction_id) ORDER BY ts.block_id ASC LIMIT " + (x*sqlBatchSize+10000000) + "," + sqlBatchSize)
        x += 1
        while (rs.next()) {
          // Do stuff
        }
        rs.close()
        connection.close()
      }
    }
  }
}
What I also tried is removing the connection close and reusing the same connection in all iterations of the while loop. Same result.
I checked the memory with VisualVM and there is one byte array that is very large and maxes out the RAM, but I have no clue what byte array that is.
Does anyone have an idea why the queries start to get extremely slow?
If I were requesting 10x 500k entries and the RAM filled up and things slowed down, okay, I could understand that. But if I request 10x 5k it slows down at the same rate as 10x 500k, only each individual request takes longer. In the end, the third query takes about double the time of the first, and the 7th query takes about 6-9 times longer than the first.

Related

Speeding up Sympy's solveset calculation for a large array of variables

I'm trying to create a parametrization of points in space relative to a specific point according to a specific inequality.
I'm doing it with SymPy's solveset method; the calculation should return an interval of the parameter t that represents all points between those in my dataframe.
Sadly, running solveset over 13 sets of values (i.e. 13 iterations) leads to execution times of over 20 seconds overall, and over 1 second of calculation time per set.
The code:
from sympy import *
from sympy import S
from sympy.solvers.solveset import solveset, solveset_real
import pandas as pd
import time
t=symbols('t',positive=True)
p1x,p1y,p2x,p2y=symbols('p1x p1y p2x p2y')
centerp=[10,10]
radius=5
data={'P1X':[0,1,2,3,1,2,3,1,2,3,1,2,3],'P1Y':[3,2,1,0,1,2,3,1,2,3,1,2,3],'P2X':[3,8,2,4,1,2,3,1,2,3,1,2,3],'P2Y':[3,9,10,7,1,2,3,1,2,3,1,2,3],'result':[0,0,0,0,0,0,0,0,0,0,0,0,0]}
df=pd.DataFrame(data)
parameterized_x=p1x+t*(p2x-p1x)
parameterized_y=p1y+t*(p2y-p1y)
start_whole_process=time.time()
overall_time=0
for index,row in df.iterrows():
    parameterized_x.subs([[p1x,row['P1X']],[p2x,row['P2X']]])
    parameterized_y.subs([[p1y,row['P1Y']],[p2y,row['P2Y']]])
    expr=sqrt((parameterized_x-centerp[0])**2+(parameterized_y-centerp[1])**2)-radius
    start=time.time()
    df.at[index,'result']=solveset(expr>=0,t,domain=S.Reals)
    end=time.time()
    overall_time=overall_time+end-start
end_whole_process=time.time()
I need to know whether there's a way to improve the calculation time, or maybe another package that can evaluate a specific inequality over large quantities of data without having to wait minutes upon minutes.
There is one big mistake in your current approach that needs to be fixed first. Inside your for loop you did:
parameterized_x.subs([[p1x,row['P1X']],[p2x,row['P2X']]])
parameterized_y.subs([[p1y,row['P1Y']],[p2y,row['P2Y']]])
expr=sqrt((parameterized_x-centerp[0])**2+(parameterized_y-centerp[1])**2)-radius
This is wrong: SymPy expressions cannot be modified in place; subs returns a new expression, and here its result is discarded. This leaves your expr exactly the same for each row, namely:
# sqrt((p1x + t*(-p1x + p2x) - 10)**2 + (p1y + t*(-p1y + p2y) - 10)**2) - 5
Then, solveset tries to solve that same expression on each row. Because the expression still contains the four p symbols in addition to t, solveset takes a long time trying to compute the solution, eventually producing the same answer for each row:
# ConditionSet(t, sqrt((p1x + t*(-p1x + p2x) - 10)**2 + (p1y + t*(-p1y + p2y) - 10)**2) - 5 >= 0, Complexes)
Remember: every operation you apply to a SymPy expression creates a new SymPy expression. So, the above code has to be modified to:
px_expr = parameterized_x.subs([[p1x,row['P1X']],[p2x,row['P2X']]])
py_expr = parameterized_y.subs([[p1y,row['P1Y']],[p2y,row['P2Y']]])
expr=sqrt((px_expr-centerp[0])**2+(py_expr-centerp[1])**2)-radius
In doing so, expr is different for each row, as expected. Then solveset computes a different solution for each row, and it is much, much faster.
Here is your full example:
from sympy import *
from sympy.solvers.solveset import solveset, solveset_real
import pandas as pd
import time
t=symbols('t',positive=True)
p1x,p1y,p2x,p2y=symbols('p1x p1y p2x p2y')
centerp=[10,10]
radius=5
data={'P1X':[0,1,2,3,1,2,3,1,2,3,1,2,3],'P1Y':[3,2,1,0,1,2,3,1,2,3,1,2,3],'P2X':[3,8,2,4,1,2,3,1,2,3,1,2,3],'P2Y':[3,9,10,7,1,2,3,1,2,3,1,2,3],'result':[0,0,0,0,0,0,0,0,0,0,0,0,0]}
df=pd.DataFrame(data)
parameterized_x=p1x+t*(p2x-p1x)
parameterized_y=p1y+t*(p2y-p1y)
start_whole_process=time.time()
overall_time=0
for index,row in df.iterrows():
    px_expr = parameterized_x.subs([[p1x,row['P1X']],[p2x,row['P2X']]])
    py_expr = parameterized_y.subs([[p1y,row['P1Y']],[p2y,row['P2Y']]])
    expr=sqrt((px_expr-centerp[0])**2+(py_expr-centerp[1])**2)-radius
    df.at[index,'result']=solveset(expr>=0,t,domain=S.Reals)
end_whole_process=time.time()
print("end_whole_process - start_whole_process", end_whole_process - start_whole_process)

psycopg2 copy_from Problems in Python 3

I'm new to Python (and coding) and bit off more than I can chew trying to use copy_from.
I am reading rows from a CSV, manipulating them a bit, then writing them into SQL. Using normal INSERT commands takes a very long time with hundreds of thousands of rows, so I want to use copy_from instead. (It does work with INSERT, though.)
The example at https://www.psycopg.org/docs/cursor.html#cursor.copy_from uses tabs as column separators and a newline at the end of each row, so I made each IO line accordingly:
43620929 2018-04-11 11:38:14 30263506 30263503 30262500 0 0 0 0 0 1000 1000 0
That's what the code below outputs with the first print statement:
def copyFromIO(thisOutput):
    print(thisOutput.getvalue())
    cursor.copy_from(thisOutput, 'hands_new')
    thisCommand = 'SELECT * FROM hands_new'
    cursor.execute(thisCommand)
    print(cursor.fetchall())
hands_new is an existing, empty SQL table. The second print statement is just [], so it isn't writing to the db. What am I getting wrong?
Obviously if it worked, I could make thisOutput much longer, with lots of rows instead of just the one.
I think I figured it out, so if anyone comes across this in the future for some reason:
The format of thisOutput was wrong; I had built it from smaller pieces, including adding '\t' etc. It works if instead I do:
copyFromIO(io.StringIO('43620929\t2018-04-11 11:38:14\t30263506\t30263503\t30262500\t0\t0\t0\t0\t0\t1000\t1000\t0\n'))
And I needed to specify the right columns in the copy_from command:
def copyFromIO(thisOutput):
    print(thisOutput.getvalue())
    thisCol = ('pkey', 'created', 'gameid', 'tableid', 'playerid', 'bet', 'pot',
               'isout', 'outround', 'rake', 'endstack', 'startstack', 'stppaid')
    cursor.copy_from(thisOutput, 'hands_new', columns=thisCol)
    thisCommand = 'SELECT * FROM hands_new'
    cursor.execute(thisCommand)
    print(cursor.fetchall())
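For what it's worth, here is a minimal sketch of how such a buffer could be assembled from plain Python tuples instead of hand-concatenating strings. It assumes the same cursor and copyFromIO function as above, and that no value is None (a None would have to be replaced with copy_from's null marker, \N by default):
import io

def rows_to_buffer(rows):
    # Join each row's values with tabs and end the row with a newline,
    # which is the default format copy_from expects.
    lines = ('\t'.join(str(value) for value in row) + '\n' for row in rows)
    return io.StringIO(''.join(lines))

rows = [
    (43620929, '2018-04-11 11:38:14', 30263506, 30263503, 30262500,
     0, 0, 0, 0, 0, 1000, 1000, 0),
    # ... more rows ...
]
copyFromIO(rows_to_buffer(rows))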

Key-value pairs in SQL table

I am building a Django app and I would like to show some statistics on the main page, like total number of transactions, percentage of successful transactions, daily number of active users etc.
For performance reasons I don't want to calculate these values in the view every time a user requests the main page. I thought of 2 possible solutions.
(1) Create a number of one-record tables
Create a table for each of the statistics, e.g.:
from django.db import models

class LastSuccessfulTransactionDate(models.Model):
    date = models.DateTimeField()

class TotalTransactionAmount(models.Model):
    total_amount = models.DecimalField(max_digits=8, decimal_places=2)

# ...
and make sure that only one record exists in each table.
(2) Create a table with key-value data
class Statistics(models.Model):
    key = models.CharField(max_length=100)
    value = models.TextField()
and save the data by doing:
from datetime import datetime
from decimal import Decimal
import base64
import pickle

statistics = {
    'last_successful_transaction_date': datetime(2010, 2, 3),
    'total_transaction_amount': Decimal('1234.56'),
}

for k, v in statistics.items():
    try:
        s = Statistics.objects.get(key=k)
    except Statistics.DoesNotExist:
        s = Statistics(key=k)
    s.value = base64.b64encode(pickle.dumps(v, pickle.HIGHEST_PROTOCOL)).decode()
    s.save()
and retrieve by:
for s in Statistics.objects.all():
    k = s.key
    v = pickle.loads(base64.b64decode(s.value.encode()))
    print(k, v)
In both cases the data would be updated every now and then by a cron job (the values don't have to be very accurate).
To me solution (2) looks better, because to display the main page I would need to get data from the Statistics table only, not from a number of one-record tables. Is there any recommended solution to this problem? Thanks
There is an even better solution if you are using PostgreSQL: JSONField. You can store the key-value pairs directly in the JSONField. Try it like this:
# model
from django.contrib.postgres.fields import JSONField  # in Django 3.1+ use models.JSONField
from django.db import models

class Statistics(models.Model):
    records = JSONField()

# usage
from datetime import datetime
from decimal import Decimal

statistics = {
    'last_successful_transaction_date': datetime(2010, 2, 3),
    'total_transaction_amount': Decimal('1234.56'),
}
Statistics.objects.create(records=statistics)

# Running query
Statistics.objects.filter(records__last_successful_transaction_date=datetime(2010, 2, 3))

s = Statistics.objects.filter(records__last_successful_transaction_date=datetime(2010, 2, 3)).first()
new_statistics = {
    'last_successful_transaction_date': datetime(2012, 2, 3),
    'total_transaction_amount': Decimal('1234.50'),
}
records = s.records
records.update(new_statistics)
s.records = records
s.save()
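One caveat, assuming the field's stock JSON encoder: datetime and Decimal values are not JSON-serializable by default, so the create() call above can raise a TypeError. A minimal sketch of one way around that is to give the field a custom encoder (DjangoJSONEncoder stores datetimes as ISO 8601 strings and Decimals as strings), bearing in mind that those values come back as strings when you read them:
from django.contrib.postgres.fields import JSONField
from django.core.serializers.json import DjangoJSONEncoder
from django.db import models

class Statistics(models.Model):
    # DjangoJSONEncoder serializes datetime as an ISO 8601 string and
    # Decimal as a string, so the statistics dict can be stored as-is.
    records = JSONField(encoder=DjangoJSONEncoder)
Alternatively, you can convert the values to plain JSON types (strings and floats) yourself before saving.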

Putting dbSendQuery into a function in R

I'm using RJDBC in RStudio to pull a set of data from an Oracle database into R.
After loading the RJDBC package I have the following lines:
drv = JDBC("oracle.jdbc.OracleDriver", classPath="C:/R/ojdbc7.jar", identifier.quote = " ")
conn = dbConnect(drv,"jdbc:oracle:thin:#private_server_info", "804301", "password")
rs = dbSendQuery(conn, statement= paste("LONG SQL QUERY TO SELECT REQUIRED DATA INCLUDING REQUEST FOR VARIABLE x"))
masterdata = fetch(rs, n = -1) # extract all rows
Run as part of the usual script, these lines always execute without fail; they can sometimes take a few minutes depending on variable x, e.g. resulting in 100K rows or 1M rows being pulled. masterdata ends up containing everything in a data frame.
I'm now trying to place all of the above into a function with one required argument, variable x, which is a text argument (a city name); this input is also part of the LONG SQL QUERY.
The function I wrote called Data_Grab is as follows:
Data_Grab = function(x) {
  drv = JDBC("oracle.jdbc.OracleDriver", classPath="C:/R/ojdbc7.jar", identifier.quote = " ")
  conn = dbConnect(drv,"jdbc:oracle:thin:#private_server_info", "804301", "password")
  rs = dbSendQuery(conn, statement= paste("LONG SQL QUERY TO SELECT REQUIRED DATA,
                                           INCLUDING REQUEST FOR VARIABLE x"))
  masterdata = fetch(rs, n = -1) # extract all rows
  return (masterdata)
}
My function appears to execute in seconds (no error is produced); however, I get just the 21 column headings for the data frame and the line
<0 rows> (or 0-length row.names)
Not sure what is wrong here; I obviously expect the function to still take minutes to execute since the data being pulled is large, but no actual data frame is being returned.
Help is appreciated!
If you want to parameterize your query to a JDBC database, try also using the gsubfn package. The code might look like this:
library(gsubfn)
library(RJDBC)
Data_Grab = function(x) {
  rd1 = x
  df <- fn$dbGetQuery(conn, "SELECT BLAH1, BLAH2
                             FROM TABLENAME
                             WHERE BLAH1 = '$rd1'")
  return(df)
}
Basically, you need to put a $ before the name of the variable that stores the parameter you wish to pass.

Print data through Spark SQL taking long time

I have 3 text files in HDFS which I am reading using Spark SQL and registering as tables. After that I am doing about 5-6 operations, including joins, group bys etc., and this whole process takes barely 6-7 seconds. (Source file size: 3 GB with almost 20 million rows.)
As the final step of my computation, I am expecting only 1 record in my final RDD, named acctNPIScr in the code snippet below.
My question is that when I try to print this RDD, either by registering it as a table and printing records from that table, or via acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println), it takes a very long time, almost 1.5 minutes, to print 1 record.
Can someone please help me figure out if I am doing something wrong in the printing? What is the best way to print the final result from a SchemaRDD?
.....
val acctNPIScr = sqlContext.sql("SELECT party_id, sum(npi_int)/sum(device_priority_new) as npi_score FROM AcctNPIScoreTemp group by party_id")
acctNPIScr.registerTempTable("AcctNPIScore")
val endtime = System.currentTimeMillis()
logger.info("Total sql Time :" + (endtime - st)) // this time is hardly 5 secs
println("start printing")
val result = sqlContext.sql("SELECT * FROM AcctNPIScore").collect().foreach(println)
//acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)
logger.info("Total printing Time :" + (System.currentTimeMillis() - endtime)) // print one record is taking almost 1.5 minute
The key here is that all Spark transformations are lazy; only actions cause the transformations to be evaluated.
// this time is hardly 5 seconds
This is because you haven't forced the transformations to be evaluated yet. That doesn't happen until map(t => "Score: " + t(1)).collect() runs: collect is an action, so your entire dataset (the joins and group bys over all 20 million rows) is processed at that point. That work, not the printing itself, is what takes the 1.5 minutes.