Extensions for Computationally-Intensive Cypher queries - optimization

As a follow up to a previous question of mine, I want to find all 30 pathways that exist between two given nodes within a depth of 4. Something to the effect of this:
start startnode = node(1), endnode(1000)
match startnode-[r:rel_Type*1..4]->endnode
return r
limit 30;
My database contains ~50k nodes and 2M relationships.
Expectedly, the computation time for this query is very, very large; I even ended up with the following GC message in the message.log file: GC Monitor: Application threads blocked for an additional 14813ms [total block time: 182.589s]. This error keeps occuring, and blocks all threads for an indefinite period of time. Therefore, I am looking for a way to lower the computational strain of this query on the server by optimizing the query.
Is there any extension I could use to help optimize this query?

Give this one a try:
https://github.com/wfreeman/findpaths
You can query the extension like so:
.../findpathslen/1/1000/4/30
And it will give you a json response with the paths found. Hopefully that helps you.
The meat of it is here, using the built-in graph algorithm to find paths of a certain length:
#GET
#Path("/findpathslen/{id1}/{id2}/{len}/{count}")
#Produces(Array("application/json"))
def fof(#PathParam("id1") id1:Long, #PathParam("id2") id2:Long, #PathParam("len") len:Int, #PathParam("count") count:Int, #Context db:GraphDatabaseService) = {
val node1 = db.getNodeById(id1)
val node2 = db.getNodeById(id2)
val pathFinder = GraphAlgoFactory.pathsWithLength(Traversal.pathExpanderForAllTypes(Direction.OUTGOING), len)
val pathIterator = pathFinder.findAllPaths(node1,node2).asScala
val jsonMap = pathIterator.take(count).map(p => obj(p))
Response.ok(compact(render(decompose(jsonMap))), MediaType.APPLICATION_JSON).build()
}

Related

How to parallelize a REST API crawler in http4s & fs2?

I wrote a sequential REST API crawler in http4s & fs2 here:
https://gist.github.com/NicolasRouquette/656ed7a2d6984ce0995fd78a3aec2566
This is to query a REST API service to get a starting set of IDs, fetch elements for a batch of IDs and continue based on the cross-reference IDs found in these elements until there are no new IDs to fetch and return a map of all elements fetched.
This works; however, the performance is inadequate -- too slow!
Since I don't have access to the server, I tried experimenting with varying batch sizes, from 10, 50, 100, 200, 500 and even batching all IDs in a single query. Query time increases significantly with batch size.
At large sizes (500 and all), I even got HTTP 500 responses from the server.
I would like to experiment with batching parallel queries in a load-balancing fashion using a pool of threads; however, it is unclear to me how to do this based on the fs2 docs.
Can someone provide suggestions how to achieve this?
Regarding using http4s & fs2: Well, I found this library fairly easy to use for simple client-side programming. Given the emphasis on supporting tasks, streams, etc..., I presume that batching parallel queries should be doable somehow.
fs2.concurrent.join will allow you to run multiple streams concurrently. The specific section in the guide is available at https://github.com/functional-streams-for-scala/fs2/blob/v0.9.7/docs/guide.md#concurrency
For your use case you could take your queue of ids, chunk them, create a http task and then wrap it in a stream. You would then run this stream of streams concurrently with join and combine the results.
def createHttpRequest(ids: Seq[ID]): Task[(ElementMap, Set[ID])] = ???
def fetch(queue: Set[ID]): Task[(ElementMap, Set[ID])] = {
val resultStreams = Stream.emits(queue.toSeq)
.vectorChunkN(batchSize)
.map(createHttpRequest)
.map(Stream.eval)
val resultStream = fs2.concurrent.join(maxOpen)(resultStreams)
resultStream.runFold((Map.empty[ID, Element], Set.empty[ID])) {
case ((a, b), (_a, _b)) => (a ++ _a, b ++ _b)
}
}

Using multiple threads for DB updates results in higher write time per update

So I have a script that is supposed to update a giant table (Postgres). Since the table has about 150m rows and I want to complete this as fast as possible, using multiple threads seemed like a perfect answer. However, I'm seeing something very weird.
When I use a single thread, the write time to an update is much much lower than when I use multiple threads.
require 'sequel'
.....
DB = Sequel.connect(DB_CREDS)
queue = Queue.new
read_query = query = DB["
SELECT id, extra_fields
FROM objects
WHERE XYZ IS FALSE
"]
read_query.use_cursor(:rows_per_fetch => 1000).each do |row|
queue.push(row)
end
Up until this point, IMO it shouldn't matter because we're just reading stuff from the DB and it has nothing to do with writing. From here, I've tried two approaches. Single-threaded and Multi-threaded.
NOTE - This is not the actual UPDATE query that I want to execute, it's just a pseudo one for demonstration purposes. The actual query is a lot longer and plays with JSON and stuff so I can't really update the entire table using a single query.
Single-threaded
until queue.empty?
photo = queue.shift
id = photo[:id]
update_query = DB["
UPDATE objects
SET XYZ = TRUE
WHERE id = #{id}
"]
result = update_query.update
end
If I execute this, I see in my DB logs that each update query takes time less than 0.01 seconds
I, [2016-08-15T10:45:48.095324 #54495] INFO -- : (0.001441s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395179
I, [2016-08-15T10:45:48.103818 #54495] INFO -- : (0.008331s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395181
I, [2016-08-15T10:45:48.106741 #54495] INFO -- : (0.002743s) UPDATE
objects SET XYZ = TRUE WHERE id = 84395182
Multi-threaded
MAX_THREADS = 5
num_threads = 0
all_threads = []
until queue.empty?
if num_threads < MAX_THREADS
photo = queue.shift
num_threads += 1
all_threads << Thread.new {
id = photo[:id]
update_query = DB["
UPDATE photos
SET cv_tagged = TRUE
WHERE id = #{id}
"]
result = update_query.update
num_threads -= 1
Thread.exit
}
end
end
all_threads.each do |thread|
thread.join
end
Now, in theory it should be faster right? But each update takes about 0.5 seconds. I'm so surprised what that is the case.
I, [2016-08-15T11:02:10.992156 #54583] INFO -- : (0.414288s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498834
I, [2016-08-15T11:02:11.097004 #54583] INFO -- : (0.622775s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498641
I, [2016-08-15T11:02:11.097074 #54583] INFO -- : (0.415521s)
UPDATE objects
SET XYZ = TRUE
WHERE id = 119498826
Any ideas on -
Why this is happening?
How can I increase the update speed for multiple threads approach.
Have you configured Sequel so that it has a connection pool of 5 connections?
Have you considered doing multiple updates per call via an IN clause?
If you haven't done 1, you have N threads fighting over N-n connections, which equates to resource starvation, which is a classic concurrency issue.
Your example can be reduced to: DB[:objects].where(:XYZ=>false).update(:XYZ=>true)
I'm guessing your actual need is not that simple. But the same approach may still work. Instead of issuing a query per row, use a single query to update all related rows.
I went through something similar on a project ("import all history from a legacy database into a new one with completely different structure and organization"). Unless you managed to shoot yourself in the foot somewhere else, you have 2 basic bottlenecks to look for:
the database's disk IO
the ruby process' CPU
Some suggestions,
database IO: use DB transactions, update 1000 records per transaction (you can tweak the exact number but 1000 is usually good) - huge DB table usually means a lot of indexes too, every couple of update actions will trigger a REINDEX and AUTOVACUUM actions within the DB which will result in a significant drop of update speed, a transaction basically allows you to push a 1000 updated records without REINDEX and AUTOVACUUM and then perform both actions, the result is MUCH faster (something like an order of magnitude)
database IO: change indexes, drop every index you can live without during the update process, ideally you will have only 1 very streamlined index which allows unique row lookups for update purposes
ruby CPU: unless you are using JRuby or Rubinius, or REALLY paying the price of network latency to your DB, threads will do you no big benefit, use fork/processes (see GIL). You did a great job choosing Sequel over AR for this
ruby CPU: if you decide to go threads + JRuby with this don't forget to try and plug in jProfiler, it's amazing at tracing bottlenecks in Java and author of SideKiq swears it is amazing for JRuby too - unfortunately, afaik, there is no equivalent of jProfiler for C Ruby (there are profiling tools, but nowhere as useful)
After you implement these suggestions you know you did all you could when:
all of the CPUs on the Ruby box are on 100% load
the hard disk IO of the DB is on 100% throughput
Find this sweet spot and don't add additional ruby update threads/processes after that (or add more hardware) and that's that
PS check out https://github.com/ruby-concurrency/concurrent-ruby - it's a great parallelization lib

Spark : Data processing using Spark for large number of files says SocketException : Read timed out

I am running Spark in standalone mode on 2 machines which have these configs
500gb memory, 4 cores, 7.5 RAM
250gb memory, 8 cores, 15 RAM
I have created a master and a slave on 8core machine, giving 7 cores to worker. I have created another slave on 4core machine with 3 worker cores. The UI shows 13.7 and 6.5 G usable RAM for 8core and 4core respectively.
Now on this I have to process an aggregate of user ratings over a period of 15 days. I am trying to do this using Pyspark
This data is stored in hourwise files in day-wise directories in an s3 bucket, every file must be around 100MB eg
s3://some_bucket/2015-04/2015-04-09/data_files_hour1
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
a.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
get_tags function creates a redis connection each time of invocation and calls redis and yield a (user, item, rate) tuple
(The redis hash is stored in the 4core)
I have tweaked the settings for SparkConf to be at
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
.set("spark.executor.memory", "5g")
.set("spark.akka.timeout", "10000")
.set("spark.akka.frameSize", "1000")
.set("spark.task.cpus", "5")
.set("spark.cores.max", "10")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max.mb", "10")
.set("spark.shuffle.consolidateFiles", "True")
.set("spark.files.fetchTimeout", "500")
.set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days' of data (around 2.5hours) and completely gives up on 14 days'.
What needs to improve here?
Is this infrastructure insufficient in terms of RAM and cores (This is offline and can take hours, but it has got to finish in 5 hours or so)
Should I increase/decrease the number of partitions?
Redis could be slowing the system, but the number of keys is just too huge to make a one time call.
I am not sure where the task is failing, in reading the files or in reducing.
Should I not use Python given better Spark APIs in Scala, will that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
I could really use some help, thanks in advance
Here is what my main code looks like
def main(sc):
f=get_files()
a=sc.textFile(f, 15)
.coalesce(7*sc.defaultParallelism)
.map(lambda line: line.split(","))
.filter(len(line)>0)
.map(lambda line: (line[18], line[2], line[13], line[15])).map(scoring)
.map(lambda line: ((line[0], line[1]), line[2])).persist(StorageLevel.MEMORY_ONLY_SER)
b=a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
c=taggings.mapPartitions(get_tags)
c.saveAsTextFile("f")
a.unpersist()
b.unpersist()
The get_tags function is
def get_tags(partition):
rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
for element in partition:
user = element[0]
song = element[1]
rating = element[2]
tags = rh.hget(settings['REDIS_HASH'], song)
if tags:
tags = json.loads(tags)
else:
tags = scrape(song, rh)
if tags:
for tag in tags:
yield (user, tag, rating)
The get_files function is as:
def get_files():
paths = get_path_from_dates(DAYS)
base_path = 's3n://acc_key:sec_key#bucket/'
files = list()
for path in paths:
fle = base_path+path+'/file_format.*'
files.append(fle)
return ','.join(files)
The get_path_from_dates(DAYS) is
def get_path_from_dates(last):
days = list()
t = 0
while t <= last:
d = today - timedelta(days=t)
path = d.strftime('%Y-%m')+'/'+d.strftime('%Y-%m-%d')
days.append(path)
t += 1
return days
As a small optimization, I have created two separate tasks, one to read from s3 and get additive sum, second to read transformations from redis. The first tasks has high number of partitions since there are around 2300 files to read. The second one has much lesser number of partitions to prevent redis connection latency, and there is only one file to read which is on the EC2 cluster itself. This is only partial, still looking for suggestions to improve ...
I was in a similar usecase: doing coalesce on a RDD with 300,000+ partitions. The difference is that I was using s3a(SocketTimeoutException from S3AFileSystem.waitAysncCopy). Finally the issue was resolved by setting a larger fs.s3a.connection.timeout(Hadoop's core-site.xml). Hopefully you can get a clue.

Dynamically change the periodic interval of celery task at runtime

I have a periodic celery task running once per minute, like so:
#tasks.py
#periodic_task(run_every=(crontab(hour="*", minute="*", day_of_week="*")))
def scraping_task():
result = pollAPI()
Where the function pollAPI(), as you might have guessed from the name, polls an API. The catch is that the API has a rate limit that is undisclosed, and sometimes gives an error response, if that limit is hit. I'd like to be able to take that response, and if the limit is hit, decrease the periodic task interval dynamically (or even put the task on pause for a while). Is this possible?
I read in the docs about overwriting the is_due method of schedules, but I am lost on exactly what to do to give the behaviour I'm looking for here. Could anyone help?
You could try using celery.conf.update to update your CELERYBEAT_SCHEDULE.
You can add a model in the database that will store the information if the rate limit is reached. Before doing an API poll, you can check the information in the database. If there is no limit, then just send an API request.
The other approach is to use PeriodicTask from django-celery-beat. You can update the interval dynamically. I created an example project and wrote an article showing how to use dynamic periodic tasks in Celery and Django.
The example code that updates the task when the limit reached:
def scraping_task(special_object_id, larger_interval=1000):
try:
result = pollAPI()
except Exception as e:
# limit reached
special_object = ModelWithTask.objects.get(pk=special_object_id)
task = PeriodicTask.objects.get(pk=special_object.task.id)
new_schedule, created = IntervalSchedule.objects.get_or_create(
every=larger_inerval,
period=IntervalSchedule.SECONDS,
)
task.interval = new_schedule
task.save()
You can pass the parameters to the scraping_task when creating a PeriodicTask object. You will need to have an additional model in the database to have access to the task:
from django.db import models
from django_celery_beat.models import PeriodicTask
class ModelWithTask(models.Model):
task = models.OneToOneField(
PeriodicTask, null=True, blank=True, on_delete=models.SET_NULL
)
# create periodic task
special_object = ModelWithTask.objects.create_or_get()
schedule, created = IntervalSchedule.objects.get_or_create(
every=10,
period=IntervalSchedule.SECONDS,
)
task = PeriodicTask.objects.create(
interval=schedule,
name="Task 1",
task="scraping_task",
kwargs=json.dumps(
{
"special_obejct_id": special_object.id,
}
),
)
special_object.task = task
special_object.save()

How to avoid Hitting the 10 sec limit per user

We run multiple short queries in parallel, and hit the 10 sec limit.
According to the docs, throttling might occur if we hit a limit of 10 API requests per user per project.
We send a "start query job", and then we call the "getGueryResutls()" with timeoutMs of 60,000, however, we get a response after ~ 1 sec, we look for JOB Complete in the JSON response, and since it is not there, we need to send the GetQueryResults() again many times and hit the threshold, that is causing an error, not a slowdown. the sample code is below.
our questions are as such:
1. What is a "user" is it an appengine user, is it a user-id that we can put in the connection string or in the query itslef?
2. Is it really per API project of BigQuery?
3. What is the behavior?we got an error: "Exceeded rate limits: too many user/method api request limit for this user_method", and not a throttling behavior as the doc say and all of our process fails.
4. As seen below in the code, why we get the response after 1 sec & not according to our timeout? are we doing something wrong?
Thanks a lot
Here is the a sample code:
while (res is None or 'jobComplete' not in res or not res['jobComplete']) :
try:
res = self.service.jobs().getQueryResults(projectId=self.project_id,
jobId=jobId, timeoutMs=60000, maxResults=maxResults).execute()
except HTTPException:
if independent:
raise
Are you saying that even though you specify timeoutMs=60000, it is returning within 1 second but the job is not yet complete? If so, this is a bug.
The quota limits for getQueryResults are actually currently much higher than 10 requests per second. The reason the docs say only 10 is because we want to have the ability to throttle it down to that amount if someone is hitting us too hard. If you're currently seeing an error on this API, it is likely that you're calling it at a very high rate.
I'll try to reproduce the problem where we don't wait for the timeout ... if that is really what is happening it may be the root of your problems.
def query_results_long(self, jobId, maxResults, res=None):
start_time = query_time = None
while res is None or 'jobComplete' not in res or not res['jobComplete']:
if start_time:
logging.info('requested for query results ended after %s', query_time)
time.sleep(2)
start_time = datetime.now()
res = self.service.jobs().getQueryResults(projectId=self.project_id,
jobId=jobId, timeoutMs=60000, maxResults=maxResults).execute()
query_time = datetime.now() - start_time
return res
then in appengine log I had this:
requested for query results ended after 0:00:04.959110