Print data through Spark SQL taking long time - sql

I have 3 text files in HDFS which I am reading using Spark SQL and registering as tables. After that I am doing almost 5-6 operations, including joins, group by, etc., and this whole process takes only 6-7 seconds (source file size: 3 GB, almost 20 million rows).
As a final step of my computation, I am expecting only 1 record in my final RDD, named acctNPIScr in the code snippet below.
My question is: when I try to print this RDD, either by registering it as a table and printing records from the table, or by this method - acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println) - it takes a very long time, almost 1.5 minutes, to print 1 record.
Can someone please help me see if I am doing something wrong in printing? What is the best way to print the final result from a SchemaRDD?
.....
val acctNPIScr = sqlContext.sql(""SELECT party_id, sum(npi_int)/sum(device_priority_new) as npi_score FROM AcctNPIScoreTemp group by party_id ")
acctNPIScr.registerTempTable("AcctNPIScore")
val endtime = System.currentTimeMillis()
logger.info("Total sql Time :" + (endtime - st)) // this time is hardly 5 secs
println("start printing")
val result = sqlContext.sql("SELECT * FROM AcctNPIScore").collect().foreach(println)
//acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)
logger.info("Total printing Time :" + (System.currentTimeMillis() - endtime)) // print one record is taking almost 1.5 minute

The key here is that all Spark transformations are lazy. Only actions cause the transformations to be evaluated.
// this time is hardly 5 seconds
This is because you haven't forced the transformations to be evaluated yet. That doesn't happen until map(t => "Score: " + t(1)).collect() runs. collect is an action, so your entire dataset is processed at that point.
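The same lazy-until-consumed behavior can be mimicked with plain Python generators - a sketch (not Spark code, and the per-row sleep is a made-up stand-in for real work) showing why the timing around the first call measures almost nothing:

```python
import time

def transform(rows):
    # Lazy "transformation": nothing runs until the result is consumed
    for r in rows:
        time.sleep(0.001)  # stand-in for expensive per-row work
        yield r * 2

start = time.time()
pipeline = transform(range(100))   # returns instantly, like sqlContext.sql(...)
setup_elapsed = time.time() - start

start = time.time()
result = list(pipeline)            # the "action": all work happens here, like collect()
action_elapsed = time.time() - start

print(setup_elapsed < action_elapsed)  # True: the cost lands on the action
```

So in the question above, the "5 secs" measured only query planning; the 1.5 minutes is the actual joins and aggregation being executed.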

Related

awswrangler.s3.read_parquet very slow for multi-layer partitioning

I have transformed a csv file into partitioned parquets on 2 layers (day, ID) with the following function:
awswrangler.s3.to_parquet(
    df=df,
    path=s3_path,
    dataset=True,
    use_threads=True,
    partition_cols=partition_key,
    boto3_session=session_s3
)
When I read the parquets I use the following:
my_filter = lambda x: x['day'] == mydate and x['ID'] == myID
df = awswrangler.s3.read_parquet(path=s3_path, dataset=True, partition_filter=my_filter)
The reading procedure takes 30 seconds and it's ok.
However, when I partition the files into 3 layers (day, hour, ID), the same call (i.e. filtering only by day and ID) takes over 3 minutes.
It seems that adding a layer significantly slowed down the reading procedure. I guess it's something related to parallel vs. sequential reading.
Has anybody had the same problem and can suggest a workaround?
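For reference, the partition_filter callback receives each partition's key values as a dict of strings, and awswrangler prunes every partition for which it returns False. A plain-Python sketch of that filter logic (the partition values below are made up for illustration):

```python
# Hypothetical partition keys, shaped like what awswrangler passes to partition_filter
partitions = [
    {'day': '2023-01-01', 'hour': '00', 'ID': 'A'},
    {'day': '2023-01-01', 'hour': '01', 'ID': 'A'},
    {'day': '2023-01-02', 'hour': '00', 'ID': 'B'},
]

mydate, myID = '2023-01-01', 'A'

# Same predicate as in the question, without the redundant True/False ternary
my_filter = lambda x: x['day'] == mydate and x['ID'] == myID

kept = [p for p in partitions if my_filter(p)]
print(len(kept))  # 2
```

Note the filter itself is cheap; with a third layer there are simply many more partitions (and S3 list calls) to enumerate before pruning, which is a plausible source of the slowdown.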

Apache HBase - Fetching large row is extremely slow

I'm running an Apache HBase cluster on AWS EMR. I have a table with a single column family, 75,000 columns and 50,000 rows. I'm trying to get all the column values for a single row, and when the row is not sparse and has 75,000 values, the return time is extremely slow - it takes almost 2.5 seconds to fetch the data from the DB. I'm querying the table from a Lambda function running Happybase.
import time
import happybase

connection = happybase.Connection('myhost')  # connection setup omitted in the original
table = connection.table('mytable')

start = time.time()
row_key = 'myrow'  # table.row() takes a row key
row = table.row(row_key)
end = time.time() - start
print("Time taken to fetch row from database:")
print(end)
What can I do to make this faster? This seems incredibly slow - the return payload is 75,000 key-value pairs and only ~2 MB. It should be much faster than 2 seconds; I'm looking for millisecond return times.
I have a BLOCKCACHE size of 8194 KB, a BLOOMFILTER of type ROW, and SNAPPY compression enabled on this table.
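One thing worth trying (an assumption, not something verified against this cluster) is splitting the fetch into column batches via the columns parameter of happybase's table.row, so no single RPC has to materialize all 75,000 cells at once. The batching logic itself, with the happybase call left as a comment since it needs a live cluster:

```python
def chunks(items, size):
    # Split a list of column names into batches of at most `size`
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical column names; real ones would come from the table's schema
all_columns = ['cf:c{}'.format(i) for i in range(75_000)]

batches = list(chunks(all_columns, 10_000))
print(len(batches))  # 8

# For each batch one would then call, and merge the partial results:
#   partial = table.row(row_key, columns=batch)
#   row.update(partial)
```

Batches could also be fetched concurrently from the Lambda, which may matter more than batch size given that the 2 MB payload itself is small.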

JDBC going extremely slow after a few big SQL selects in Scala

At the moment I'm trying to get a lot of data (26 million entries) out of a SQL database to put into LevelDB.
I'm using Scala and a connection pool from JDBC. This is my query:
"SELECT ts.block_id, ts.sender, ts.fee, ts.recipient AS transaction_recipient, ts.amount AS transaction_amount, ts.type, tf.recipient AS transfer_recipient, tf.amount AS transfer_amount FROM transactions ts LEFT OUTER JOIN transfers tf ON (tf.transaction_id = ts.transaction_id) ORDER BY ts.block_id ASC LIMIT " + (x*sqlBatchSize) + "," + sqlBatchSize
where sqlBatchSize is the number of entries I want to get in one SQL statement and x is the iteration count. So I'm simply asking SELECT ... LIMIT 0,10000, then 10000,10000, then 20000,10000, and so on.
I tried it with different sizes from 5k to 10k, 100k, 500k, 5kkk and so on. I also tried it with various start offsets, so that I only get the entries from the last half of the database, and similar variations.
I always have the same problem: the first query is fast, the second one is slower, and the next ones get slower and slower. It doesn't matter whether I request 10x 10k or 10x 500k entries - always the same.
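The LIMIT offset,size pattern described above has a known cost profile: MySQL must produce and discard all offset rows on every iteration, so the work per query grows with x. A plain-Python sketch of the pattern (the SELECT body is elided with "..." as in the question):

```python
sqlBatchSize = 10_000

def batch_query(x, batch=sqlBatchSize):
    # Builds the x-th page of an offset-paginated query
    offset = x * batch
    return "SELECT ... ORDER BY ts.block_id ASC LIMIT {},{}".format(offset, batch)

# Rows the server must scan past before returning anything, per iteration:
discarded = [x * sqlBatchSize for x in range(4)]
print(discarded)  # [0, 10000, 20000, 30000]
```

A common alternative is keyset pagination: remember the last block_id seen and ask for WHERE ts.block_id > last_seen ... LIMIT batch, so each query starts where the previous one ended instead of re-scanning from the top.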
Here is my code, shortened a bit:
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.sql.{Connection, DriverManager}
import org.apache.commons.dbcp2._
import com.google.common.primitives.Longs
import com.typesafe.config._
import scala.math.BigInt
import java.io.File
import java.util.Calendar
import org.iq80.leveldb
import org.iq80.leveldb.impl.Iq80DBFactory._
import org.iq80.leveldb.Options
object Datasource {
  // Load config
  val conf = ConfigFactory.load()
  val dbUrl = "jdbc:mysql://" + conf.getString("database.db_host") + ":" + conf.getString("database.db_port") + "/" + conf.getString("database.db_name")

  val connectionPool = new BasicDataSource()
  connectionPool.setUsername(conf.getString("database.db_user"))
  connectionPool.setPassword(conf.getString("database.db_password"))
  connectionPool.setDriverClassName("com.mysql.jdbc.Driver")
  connectionPool.setUrl(dbUrl)
  connectionPool.setInitialSize(3)
}
object main extends App {
  var durchlaufen = 1
  var x = 0
  var sqlBatchSize = 50000

  if (durchlaufen == 1) {
    try {
      var grenze = 10
      while (x < grenze) {
        val connection = Datasource.connectionPool.getConnection
        val rs = connection.createStatement.executeQuery("SELECT ts.block_id, ts.sender, ts.fee, ts.recipient AS transaction_recipient, ts.amount AS transaction_amount, ts.type, tf.recipient AS transfer_recipient, tf.amount AS transfer_amount FROM transactions ts LEFT OUTER JOIN transfers tf ON (tf.transaction_id = ts.transaction_id) ORDER BY ts.block_id ASC LIMIT " + (x*sqlBatchSize+10000000) + "," + sqlBatchSize)
        x += 1
        while (rs.next()) {
          // Do stuff
        }
        rs.close()       // close the ResultSet before the connection
        connection.close()
      }
    }
  }
}
I also tried removing the connection close and working on the same connection across all iterations of the while loop. Same result.
I checked the memory with VisualVM and there is one byte array that is very large and maxes out the RAM, but I have no clue which byte array it is.
Does anyone have an idea why the queries start to go extremely slow?
If requesting 10x 500k entries filled the RAM and slowed things down, okay, I could understand that. But if I request 10x 5k, it slows down at the same rate as 10x 500k; a single request just takes longer. In the end, the third query takes about double the time, and the 7th query takes about 6-9 times longer than the first.

What algorithm can I use to compute the number of, say, positive or negative postings seen until a certain time point?

I wish to check if my understanding and the proposed algorithm below are correct.
To calculate the number of positive postings I have seen until time point ti, I am proposing a loop as below:
sumofPi = 0
for x = 0 until x = ti
    sumofPi = sumofPi + Pi-1
I am not sure if this will work, but the idea is to be able to sum up the positive postings that come in within a certain time point in a data stream.
Thanks
The sequence seems fine as long as the events are indexed in order and you are comfortable losing events that happened at the same time but were indexed differently as a result of that limitation. You may also want to address posting-type filtering.
Your algorithm in Python:
# Sample data
postingevents = [1, 0, 1, 1, 0, 1]

# Algorithm:
sumofPi = 0
ti = 4
for i in range(0, ti):
    sumofPi += postingevents[i]
print(sumofPi)  # 3
Looks like you are dealing with time series.
For time series, I would suggest a rolling sum or rolling weighted averages; there's an example here.
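As a sketch of the rolling-sum idea in plain Python (a fixed-size sliding window; pandas' rolling(window).sum() computes the same thing):

```python
from collections import deque

def rolling_sum(values, window):
    # Running sum over a sliding window of the last `window` values
    buf, out, total = deque(), [], 0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the value that fell out of the window
        out.append(total)
    return out

print(rolling_sum([1, 0, 1, 1, 0, 1], 3))  # [1, 1, 2, 2, 2, 2]
```

Unlike the cumulative count above, each output only reflects the most recent window of events, which is usually what you want for a stream.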
Below are some Python code samples using loops and recursion, with a data sample (event indicator & epoch time stamp):
# Data sample:
postingevents = [1, 0, 1, 1, 0, 1]
postingti = [1497634668, 1497634669, 1497634697, 1497634697, 1497634714, 1497634718]
postings = [postingevents, postingti]

# All events preceding time stamp t. Events do not need to be ordered by time.
def sumpi_notordered(X, t):
    return sum([xv if yv <= t else 0 for (xv, yv) in zip(X[0], X[1])])

# Sum ordered events indexed by t, using recursion.
def sumpi_ordered(X, t):
    if t >= 1:
        return X[t] + sumpi_ordered(X, t - 1)
    else:
        return X[t]

print(sumpi_notordered(postings, 1497634697))  # 3
print(sumpi_ordered(postingevents, 3))  # 3

How to set a minimum random number in REBOL?

I'm executing some code and then waiting somewhere between 1 second and 1 minute. I'm currently using random 0:01:00 /seed, but what I really need is to be able to set a floor so that it waits between 30 seconds and 1 minute.
If you want 0:0:30 to be the minimum and 0:1:0 to be the maximum, try the formula:
0:0:29 + random 0:0:31
This formula yields a "discretely distributed (pseudo)random value". If you want a "continuously distributed (pseudo) random value", you can use (just in R3) the formula:
0:0:30 + random 30.0
R2 does not have a native support for "continuously distributed (pseudo)random values".
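For comparison, the same "floor + random span" arithmetic in Python (REBOL's random n yields 1..n inclusive, hence the 29/31 pair rather than 30/30):

```python
import random

def wait_seconds():
    # 29 + (1..31) -> uniformly 30..60 inclusive, mirroring 0:0:29 + random 0:0:31
    return 29 + random.randint(1, 31)

samples = [wait_seconds() for _ in range(1000)]
print(all(30 <= s <= 60 for s in samples))  # True
```

The general pattern is floor - 1 + random(span + 1) for an inclusive discrete range of floor..floor+span.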
Not my area of expertise, but:
00:00:30 + to time! (random 100% * (to integer! 00:00:30))
...appears to work, I think.
>> random/seed now/precise
>> t1: now wait 30 + random 30 difference now t1
== 0:00:39
How about the following:
0:00:30 + random 0:00:30
You could generate a whole number from 1 to 30 and subtract that number in seconds from 1 minute and 1 second.
(And about seeding: use it, but not before every call.)