What is the best way to communicate among multiple processes in Ubuntu?

I have three different machine learning models in Python. To improve performance, I run them in parallel in different terminals. They communicate and share data with one another through files: each model creates batches of files to make available to the others. All the processes run in parallel, but each depends on data prepared by another process. Once process A prepares a batch of data, it creates a file to signal to the other process that the data is ready; process B then starts processing it while simultaneously watching for the next batch. How can this large amount of data be shared with the next process without creating files? Is there a better way to communicate among these processes in Python without creating and deleting temporary files?
Thanks

You could consider running a small Redis instance: a very fast, in-memory data structure server.
It lets you share strings, lists, queues, hashes, atomic integers, sets, and sorted sets between processes very simply.
Because it is networked, you can share all of these data structures not only within a single machine but also across multiple machines.
Since it has bindings for C/C++, Python, bash, Ruby, Perl and so on, you can also use the shell, for example, to quickly inject commands or data into your app to change its behaviour, or to get debugging insight by looking at how variables are set.
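As a minimal sketch (assuming a Redis server on localhost:6379 and the redis-py package; the key name and payload format are made up for illustration), process A could push finished batches onto a Redis list and process B could block on that list, with no temporary files involved:
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Process A: push each finished batch onto a list used as a queue.
def publish_batch(batch_id, rows):
    r.rpush("batches:ready", json.dumps({"id": batch_id, "rows": rows}))

# Process B: block until a batch is available, then process it.
def consume_batches():
    while True:
        _key, payload = r.blpop("batches:ready")  # blocks; no polling, no temp files
        batch = json.loads(payload)
        handle_batch(batch)  # placeholder for your model code

def handle_batch(batch):
    print("processing batch", batch["id"], "with", len(batch["rows"]), "rows")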

Here's an example of how to do multiprocessing in Python 3. Instead of storing results in a file, the results are stored in a dictionary (see the output below).
from multiprocessing import Pool, cpu_count

def multi_processor(function_name):
    file_list = []
    # Test: put 6 strings in the list so your_function runs six times,
    # with up to 6 processes in parallel (assuming your CPU has enough cores)
    file_list.append("test1")
    file_list.append("test2")
    file_list.append("test3")
    file_list.append("test4")
    file_list.append("test5")
    file_list.append("test6")

    # Use the number of system processors minus 1
    pool = Pool(processes=cpu_count() - 1)
    pool.daemon = True

    results = {}
    # For every item in file_list, start a new process
    for aud_file in file_list:
        results[aud_file] = pool.apply_async(your_function, args=("arg1", "arg2"))

    # Wait for all processes to finish before proceeding
    pool.close()
    pool.join()

    # Results and any errors are returned
    return {key: result.get() for key, result in results.items()}

def your_function(arg1, arg2):
    try:
        print("put your stuff in this function")
        your_results = ""
        return your_results
    except Exception as e:
        return str(e)

if __name__ == "__main__":
    some_results = multi_processor("your_function")
    print(some_results)
The output is
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
{'test1': '', 'test2': '', 'test3': '', 'test4': '', 'test5': '', 'test6': ''}

Try using a SQLite database to share data instead of files.
I made this for this exact purpose:
https://pypi.org/project/keyvalue-sqlite/
You can use it like this:
from keyvalue_sqlite import KeyValueSqlite
DB_PATH = '/path/to/db.sqlite'
db = KeyValueSqlite(DB_PATH, 'table-name')
# Now use standard dictionary operators
db.set_default('0', '1')
actual_value = db.get('0')
assert '1' == actual_value
db.set_default('0', '2')
assert '1' == db.get('0')
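As a rough sketch of using this across two of your processes (the key names and the polling loop are illustrative assumptions, and it assumes get() returns None for a missing key; adjust if it raises instead):
import json
import time
from keyvalue_sqlite import KeyValueSqlite

DB_PATH = '/path/to/db.sqlite'

# Process A: publish a finished batch under a predictable key.
db_a = KeyValueSqlite(DB_PATH, 'batches')
db_a.set_default('batch-1', json.dumps([1, 2, 3]))

# Process B: poll until the key appears, then consume it.
db_b = KeyValueSqlite(DB_PATH, 'batches')
while db_b.get('batch-1') is None:  # assumption: missing key returns None
    time.sleep(0.5)
rows = json.loads(db_b.get('batch-1'))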

Related

Lego-EV3: How to fix EOFError when catching user-input via multiprocessing?

Currently, I am working with an EV3 Lego robot that is controlled by several neurons. I want to modify the code (running on Python 3) so that one can change certain parameter values on the fly via the shell (Ubuntu), in order to manipulate the robot's dynamics at any time (and multiple times). Here is a sketch of what I have achieved so far, based on a short example:
from multiprocessing import Process
from multiprocessing import SimpleQueue
import ev3dev.ev3 as ev3

class Neuron:
    ...  # (definitions of class variables and update functions)

def check_input(queue):
    while True:
        try:
            new_para = str(input("Type 'parameter=value': "))
            float(new_para[2:])   # checking for a float in the input
            var = new_para[0:2]
            if var == "k=":       # change parameter k
                queue.put(new_para)
            elif var == "g=":     # change parameter g
                queue.put(new_para)
            else:
                print("Error. Type 'k=...' or 'g=...'")
                queue.put(0)      # put anything in the queue
        except (ValueError, EOFError):
            print("New value is not a number. Try again!")

# (some neuron-specific initializations)

queue = SimpleQueue()
check = Process(target=check_input, args=(queue,))
check.start()

while True:
    if not queue.empty():
        cmd = queue.get()
        var = cmd[0]
        val = float(cmd[2:])
        if var == "k":
            Neuron.K = val
        elif var == "g":
            Neuron.g = val
    # (updating procedure for neurons, writing data to file)
Since I am new to multiprocessing, there are certainly some mistakes concerning locking, efficiency and so on, but the robot moves and the input prompt appears in the shell. However, the current problem is that it is actually impossible to enter any input:
> python3 controller_multiprocess.py
> Type 'parameter=value': New value is not a number. Try again!
> Type 'parameter=value': New value is not a number. Try again!
> Type 'parameter=value': New value is not a number. Try again!
> ... (and so on)
I know that this behaviour is caused by catching EOFError, because that is the error raised when the exception handler is removed (and the process crashes). Hence, the program just rushes through the try block and assumes, over and over again, that no input (i.e. an empty string) was made. Why does this happen? When the function is not run as a separate process, the program patiently waits for input as expected. And how can one fix or bypass this issue so that changing parameters becomes possible as intended?
Thanks in advance!

How to parallelize a REST API crawler in http4s & fs2?

I wrote a sequential REST API crawler in http4s & fs2 here:
https://gist.github.com/NicolasRouquette/656ed7a2d6984ce0995fd78a3aec2566
It queries a REST API service to get a starting set of IDs, fetches the elements for a batch of IDs, and continues based on the cross-reference IDs found in those elements until there are no new IDs to fetch, finally returning a map of all the elements fetched.
This works; however, the performance is inadequate -- too slow!
Since I don't have access to the server, I tried experimenting with varying batch sizes, from 10, 50, 100, 200, 500 and even batching all IDs in a single query. Query time increases significantly with batch size.
At large sizes (500 and all), I even got HTTP 500 responses from the server.
I would like to experiment with batching parallel queries in a load-balancing fashion using a pool of threads; however, it is unclear to me how to do this based on the fs2 docs.
Can someone provide suggestions on how to achieve this?
Regarding using http4s & fs2: Well, I found this library fairly easy to use for simple client-side programming. Given the emphasis on supporting tasks, streams, etc..., I presume that batching parallel queries should be doable somehow.
fs2.concurrent.join will allow you to run multiple streams concurrently. The specific section in the guide is available at https://github.com/functional-streams-for-scala/fs2/blob/v0.9.7/docs/guide.md#concurrency
For your use case you could take your queue of IDs, chunk them, create an HTTP task and then wrap it in a stream. You would then run this stream of streams concurrently with join and combine the results.
def createHttpRequest(ids: Seq[ID]): Task[(ElementMap, Set[ID])] = ???

def fetch(queue: Set[ID]): Task[(ElementMap, Set[ID])] = {
  val resultStreams = Stream.emits(queue.toSeq)
    .vectorChunkN(batchSize)
    .map(createHttpRequest)
    .map(Stream.eval)

  val resultStream = fs2.concurrent.join(maxOpen)(resultStreams)

  resultStream.runFold((Map.empty[ID, Element], Set.empty[ID])) {
    case ((a, b), (_a, _b)) => (a ++ _a, b ++ _b)
  }
}

Spark: Data processing for a large number of files fails with SocketException: Read timed out

I am running Spark in standalone mode on 2 machines with these configurations:
500 GB disk, 4 cores, 7.5 GB RAM
250 GB disk, 8 cores, 15 GB RAM
I have created a master and a slave on the 8-core machine, giving 7 cores to the worker, and another slave on the 4-core machine with 3 worker cores. The UI shows 13.7 GB and 6.5 GB of usable RAM for the 8-core and 4-core machines respectively.
On this setup I have to compute an aggregate of user ratings over a period of 15 days. I am trying to do this using PySpark.
The data is stored in hourly files inside day-wise directories in an S3 bucket, and every file is around 100 MB, e.g.
s3://some_bucket/2015-04/2015-04-09/data_files_hour1
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
a.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
The get_tags function creates a Redis connection on each invocation, calls Redis, and yields (user, item, rate) tuples.
(The Redis hash is stored on the 4-core machine.)
I have tweaked the SparkConf settings as follows:
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
.set("spark.executor.memory", "5g")
.set("spark.akka.timeout", "10000")
.set("spark.akka.frameSize", "1000")
.set("spark.task.cpus", "5")
.set("spark.cores.max", "10")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max.mb", "10")
.set("spark.shuffle.consolidateFiles", "True")
.set("spark.files.fetchTimeout", "500")
.set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days' worth of data (around 2.5 hours) and completely gives up on 14 days' worth.
What needs to be improved here?
Is this infrastructure insufficient in terms of RAM and cores? (This is an offline job and can take hours, but it has to finish in 5 hours or so.)
Should I increase or decrease the number of partitions?
Redis could be slowing the system down, but the number of keys is far too large to fetch in a single call.
I am not sure where the task is failing: in reading the files or in reducing.
Should I stop using Python, given the better Spark APIs in Scala? Would that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
I could really use some help, thanks in advance
Here is what my main code looks like
def main(sc):
    f = get_files()
    a = (sc.textFile(f, 15)
         .coalesce(7 * sc.defaultParallelism)
         .map(lambda line: line.split(","))
         .filter(lambda line: len(line) > 0)
         .map(lambda line: (line[18], line[2], line[13], line[15]))
         .map(scoring)
         .map(lambda line: ((line[0], line[1]), line[2]))
         .persist(StorageLevel.MEMORY_ONLY_SER))
    b = a.reduceByKey(lambda x, y: x + y).map(aggregate)
    b.persist(StorageLevel.MEMORY_ONLY_SER)
    c = b.mapPartitions(get_tags)
    c.saveAsTextFile("f")
    a.unpersist()
    b.unpersist()
The get_tags function is
def get_tags(partition):
    rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
    for element in partition:
        user = element[0]
        song = element[1]
        rating = element[2]
        tags = rh.hget(settings['REDIS_HASH'], song)
        if tags:
            tags = json.loads(tags)
        else:
            tags = scrape(song, rh)
        if tags:
            for tag in tags:
                yield (user, tag, rating)
The get_files function is:
def get_files():
    paths = get_path_from_dates(DAYS)
    base_path = 's3n://acc_key:sec_key@bucket/'
    files = list()
    for path in paths:
        fle = base_path + path + '/file_format.*'
        files.append(fle)
    return ','.join(files)
The get_path_from_dates(DAYS) is
def get_path_from_dates(last):
    days = list()
    t = 0
    while t <= last:
        d = today - timedelta(days=t)
        path = d.strftime('%Y-%m') + '/' + d.strftime('%Y-%m-%d')
        days.append(path)
        t += 1
    return days
As a small optimization, I have created two separate jobs: one to read from S3 and compute the additive sum, and a second to read the transformations from Redis. The first job has a high number of partitions since there are around 2300 files to read. The second one has a much smaller number of partitions to limit Redis connection latency, and there is only one file to read, which is on the EC2 cluster itself. This only partially helps; I am still looking for suggestions to improve it.
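A rough sketch of that two-job split (the intermediate path, partition count, and use of saveAsPickleFile/pickleFile are assumptions for illustration, not taken from the question):
# Job 1: read from S3 with many partitions, aggregate, and write the result to storage.
b = a.reduceByKey(lambda x, y: x + y).map(aggregate)
b.saveAsPickleFile("hdfs:///tmp/aggregated_scores")  # hypothetical intermediate path

# Job 2: reload with far fewer partitions so each partition reuses one Redis
# connection for many records, then write the tagged output.
scores = sc.pickleFile("hdfs:///tmp/aggregated_scores").coalesce(16)
scores.mapPartitions(get_tags).saveAsTextFile("hdfs:///tmp/tagged_scores")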
I had a similar use case: calling coalesce on an RDD with 300,000+ partitions. The difference is that I was using s3a (SocketTimeoutException from S3AFileSystem.waitAysncCopy). The issue was finally resolved by setting a larger fs.s3a.connection.timeout (in Hadoop's core-site.xml). Hopefully this gives you a clue.
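If you do switch to s3a, that Hadoop property can also be passed through SparkConf instead of editing core-site.xml; a hedged sketch (the timeout value is illustrative, in milliseconds):
conf = (SparkConf()
        .setAppName(APP_NAME)
        # spark.hadoop.* keys are copied into the Hadoop Configuration,
        # so this mirrors setting fs.s3a.connection.timeout in core-site.xml
        .set("spark.hadoop.fs.s3a.connection.timeout", "500000"))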

Use Multiple DBs With One Redis Lua Script?

Is it possible to have one Redis Lua script hit more than one database? I currently have information of one type in DB 0 and information of another type in DB 1. My normal workflow is doing updates on DB 1 based on an API call along with meta information from DB 0. I'd love to do everything in one Lua script, but can't figure out how to hit multiple dbs. I'm doing this in Python using redis-py:
lua_script(keys=some_keys,
           args=some_args,
           client=some_client)
Since the client implies a specific db, I'm stuck. Ideas?
It is usually a bad idea to put related data in different Redis databases. There is almost no benefit compared to defining namespaces through key naming conventions (no extra granularity regarding security, persistence, expiration management, etc.). And a major drawback is that clients have to manually handle selection of the correct database, which is error-prone for clients targeting multiple databases at the same time.
Now, if you still want to use multiple databases, there is a way to make it work with redis-py and Lua scripting.
redis-py does not define a wrapper for the SELECT command (normally used to switch the current database) because of the underlying thread-safe connection pool implementation. But nothing prevents you from calling SELECT from a Lua script.
Consider the following example:
$ redis-cli
SELECT 0
SET mykey db0
SELECT 1
SET mykey db1
The following script displays the value of mykey in the 2 databases from the same client connection.
import redis

pool = redis.ConnectionPool(host='localhost', port=6379, db=0)
r = redis.Redis(connection_pool=pool)

lua1 = """
redis.call("select", ARGV[1])
return redis.call("get", KEYS[1])
"""
script1 = r.register_script(lua1)

lua2 = """
redis.call("select", ARGV[1])
local ret = redis.call("get", KEYS[1])
redis.call("select", ARGV[2])
return ret
"""
script2 = r.register_script(lua2)

print(r.get("mykey"))
print(script2(keys=["mykey"], args=[1, 0]))
print(r.get("mykey"), "ok")
print()
print(r.get("mykey"))
print(script1(keys=["mykey"], args=[1]))
print(r.get("mykey"), "misleading !!!")
Script lua1 is naive: it just selects a given database before returning the value. Its usage is misleading, because after its execution the current database associated with the connection has changed. Don't do this.
Script lua2 is much better. It takes the target database and the current database as parameters. It makes sure that the current database is reactivated before the end of the script, so that subsequent commands applied on the connection still run in the correct database.
Unfortunately, there is no command to discover the current database from within the Lua script, so the client has to provide it systematically. Please note that the Lua script must reset the current database at the end no matter what happens (even in case of an earlier error), which makes complex scripts cumbersome and awkward.
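As a rough sketch of that defensive pattern (an assumption, not taken from the answer above), one option is to use redis.pcall for the inner command so that an error cannot skip the final SELECT; any error reply is simply passed back to the client:
lua3 = """
redis.call("select", ARGV[1])
-- redis.pcall traps errors and returns them as a reply table,
-- so execution always reaches the SELECT below
local ret = redis.pcall("get", KEYS[1])
redis.call("select", ARGV[2])  -- restore the caller's database
return ret
"""
script3 = r.register_script(lua3)
print(script3(keys=["mykey"], args=[1, 0]))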

How to process a long web request in Heroku?

https://devcenter.heroku.com/articles/request-timeout
A timeout error fires after 30 seconds, according to their documentation.
I'm uploading and parsing a CSV file to save to my database. One of those files is 1.7MB in size and has 37000 rows.
This takes quite a while to process, certainly more than 30 seconds.
What can I do in these cases? What options do I have?
require 'csv'

class DatabaseImporterController < ApplicationController
  def index
  end

  def import
    # Receive the uploaded CSV file and import it into the database.
    csv_file = params[:csv_file].tempfile
    i = 0
    CSV.foreach(csv_file) do |row|
      # Structure of the CSV file: Year, Make, Model, Trim
      if i > 0 then
        make = Make.find_or_create_by_name(row[1])
        model = make.model.create(:year => row[0], :name => row[2], :trim => row[3])
      end
      i += 1
    end
    redirect_to :action => 'list'
  end

  def list
    @models = Model.all
  end
end
Instead of processing the CSV file in the controller, have the controller push a notification onto a queue with the location of the uploaded file, and have a worker dyno handle the processing.
You'll pay a little bit more, especially if you're trying to stick with the free single dyno tier, but this is a scalable design (which is why I'd imagine there's a 30 second timeout on HTTP processing).
An alternative is to push the data directly into a table and execute a stored procedure asynchronously. This pushes the work off to Postgres, off the HTTP thread, and may bring your request under the 30-second limit, though with larger files you may breach the cap anyway.
Before you bother restructuring your entire application you'll want to run a test to ensure that Heroku has not disabled libpq-asynch.
The big cost in your code above is Make.find_or_create_by_name, which, per your example input, issues 37,000 separate SELECTs and possibly an INSERT for each row in your CSV. If libpq-asynch is not an option, you'll have to create a stored procedure that performs this work in batches of 100 or 1000 rows at a time, so that your controller code isn't making so many round trips to the database. Postgres supports arrays in the classical ordinal-index style as well as arrays of row types, so this is actually much less painful than it sounds.