Redis database gives "Connection refused" error after finishing almost all of a task - redis

I'm trying to parse some data that I have stored in a Redis database (on my local machine, accessed via the default port 6379). Essentially, the task is to iterate over about 10K hash structures in the database, calculate a new field from the fields currently in the hash, then write that new field back to the database so that I don't need to do the calculation again.
My script starts up fine, connects to the database, and makes it through about 9500 of the hashes before crashing with a "redis.exceptions.ConnectionError: Error 111 connecting localhost:6379. Connection refused." error. I've rebooted the EC2 instance I'm running it on several times and every time it crashes in the same place.
Any idea what might be going on? Why would redis work for some of the data set but then crash?
EDIT: Here's the output of the execution. It runs for about 3 and a half minutes before dying.
$ sudo python parser.py
Added 0 out of 10378 to dictionary: 22:48:53
Added 100 out of 10378 to dictionary: 22:48:54
Added 200 out of 10378 to dictionary: 22:48:55
Added 300 out of 10378 to dictionary: 22:48:57
Added 400 out of 10378 to dictionary: 22:48:58
Added 500 out of 10378 to dictionary: 22:49:00
...
Added 9000 out of 10378 to dictionary: 22:51:16
Added 9100 out of 10378 to dictionary: 22:51:30
Added 9200 out of 10378 to dictionary: 22:51:44
Added 9300 out of 10378 to dictionary: 22:52:00
Added 9400 out of 10378 to dictionary: 22:52:15
Added 9500 out of 10378 to dictionary: 22:52:17
Traceback (most recent call last):
File "parser.py", line 180, in <module>
buildDictionary(force=True)
File "parser.py", line 123, in buildDictionary
addPostToDict(postid)
File "parser.py", line 92, in addPostToDict
comments = [contentFromId(commentid) for commentid in commentids]
File "parser.py", line 72, in contentFromId
content = db.hget(contentid, keyword)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1539, in hget
return self.execute_command('HGET', name, key)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 464, in execute_command
connection.send_command(*args)
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 334, in send_command
self.send_packed_command(self.pack_command(*args))
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 316, in send_packed_command
self.connect()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 253, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting localhost:6379. Connection refused.

You can set a timeout and a memory limit in Redis to ensure it handles long-running connections and timeouts.
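For example, a minimal sketch using redis-py's CONFIG SET; the values shown are illustrative assumptions, not recommendations:
import redis

r = redis.StrictRedis(host="localhost", port=6379)

# Cap memory usage and close idle client connections; tune these for your instance.
r.config_set("maxmemory", "256mb")
r.config_set("maxmemory-policy", "allkeys-lru")  # evict keys instead of refusing writes
r.config_set("timeout", 300)                     # drop idle connections after 300 seconds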

So the answer turned out to be that my Redis database got too big for my tiny little EC2 instance's memory. As Itamar points out in the comments, Redis stores everything in memory. Once my job added too many items to the database and filled up all available memory, Redis refused all further requests to add to the database.
Things I observed that might help you diagnose this:
Later jobs started taking longer and longer (you can see in the EDIT to the question that it takes 1-2 seconds per 100 jobs at the beginning, but ~15 seconds per 100 jobs at the end)
I lowered the maximum amount of memory that my Redis store could use and it started failing earlier
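To confirm that memory pressure is the cause, you can watch Redis memory usage while the job runs; a minimal sketch with redis-py (the output keys come from INFO memory):
import redis

r = redis.StrictRedis(host="localhost", port=6379)

# used_memory_human grows as the job writes; maxmemory_human is 0B when no limit is set.
info = r.info("memory")
print("used: %s / limit: %s" % (info["used_memory_human"], info.get("maxmemory_human", "0B")))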

Related

What is the recommended architecture for using amazon neptune in a scalable way?

I am building an application backed by a Neptune database. Because I want the application to be scalable, I am using AWS Lambda + API gateway to build a REST API to interact with the database. This seems to be a reasonable idea based on the fact that this use case is documented in the Neptune docs.
The Neptune docs recommend reusing the websocket connection to the database across the entire execution context of the function, which is what I am doing at the moment. The docs also recommend resetting the connection and retrying upon errors (see here), which I am also using. However, I am seeing exceptions every now and then (perhaps every 20 requests on average). One of the exceptions I get is
ConnectionResetError: Cannot write to closing transport
which seems to be the same as this issue.
The other one is:
Traceback (most recent call last):
File "/var/task/chalice/app.py", line 1685, in _get_view_function_response
response = view_function(**function_args)
File "/var/task/app.py", line 57, in resource
return Resource(app.current_request, g).process()
File "/var/task/backoff/_sync.py", line 94, in retry
ret = target(*args, **kwargs)
File "/var/task/chalicelib/handlers/resource.py", line 106, in get
values = resources.valueMap().with_(WithOptions.tokens).toList()
File "/var/task/gremlin_python/process/traversal.py", line 57, in toList
return list(iter(self))
File "/var/task/gremlin_python/process/traversal.py", line 47, in __next__
self.traversal_strategies.apply_strategies(self)
File "/var/task/gremlin_python/process/traversal.py", line 548, in apply_strategies
traversal_strategy.apply(traversal)
File "/var/task/gremlin_python/driver/remote_connection.py", line 63, in apply
remote_traversal = self.remote_connection.submit(traversal.bytecode)
File "/var/task/gremlin_python/driver/driver_remote_connection.py", line 60, in submit
results = result_set.all().result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/var/task/gremlin_python/driver/resultset.py", line 90, in cb
f.result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/var/lang/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/var/task/gremlin_python/driver/connection.py", line 82, in _receive
data = self._transport.read()
File "/var/task/gremlin_python/driver/aiohttp/transport.py", line 104, in read
raise RuntimeError("Connection was already closed.")
RuntimeError: Connection was already closed.
In case it is relevant, I am using gremlinpython==3.5.1
It seems to me that these issues are all ultimately a consequence of using AWS Lambda, namely due to the mismatch between the longevity of websocket connections and the ephemeral nature of Lambda execution contexts. The question then is: am I doing the wrong thing by trying to use AWS Lambda for my API? Would it be more appropriate to set up an EC2 instance and deal with the scalability in some other way?
P.S. Previously I did create and close a connection in every function execution (as previously recommended in the Neptune docs), which did work fine but was naturally slow.
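For reference, a minimal sketch of the connection-reuse pattern described above, assuming gremlinpython; the endpoint URL is a placeholder and error handling is omitted:
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Created once per Lambda execution context (module scope), reused across invocations.
conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

def handler(event, context):
    # Reuses the module-level connection instead of opening a new websocket per request.
    return g.V().limit(1).valueMap().toList()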
The latest version of Neptune only supports Gremlin 3.4.11 (https://docs.aws.amazon.com/neptune/latest/userguide/engine-releases-1.0.5.1.html). I would start by using gremlin-python 3.4.11 and see if that resolves your issue. Gremlin-python 3.5 replaced Tornado with aiohttp (ref) for websocket connections, and I suspect that change may be causing a slight change in behavior that a future release supporting Gremlin 3.5 will address.
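For example, pinning the client in your Lambda bundle (assuming pip-based packaging):
pip install gremlinpython==3.4.11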
I wonder whether the 'Connection was already closed' error message is not being treated as a retriable error by the retry logic?
What happens if you add this error message to the list of retriable_error_msgs in the Python example in the docs?
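As an illustration only (not the official Neptune sample), one way to treat that message as retriable with the backoff library that already appears in your traceback; the list of messages and the queried traversal are assumptions:
import backoff

# Error messages we choose to treat as retriable.
RETRIABLE_ERROR_MSGS = [
    "Connection was already closed.",
    "Cannot write to closing transport",
]

def give_up(exc):
    # Retry only when the exception message matches a known retriable message.
    return not any(msg in str(exc) for msg in RETRIABLE_ERROR_MSGS)

@backoff.on_exception(backoff.expo, (RuntimeError, ConnectionResetError),
                      max_tries=5, giveup=give_up)
def run_query(g):
    return g.V().limit(10).valueMap().toList()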

Streaming job failure: StateSchemaNotCompatible issue

My streaming job is now failing with the error below. The job worked fine for almost two months; it is a completely stateless transformation that just appends new rows to the destination Delta table. Before streaming, I manually provide the schema for the CSV files, and I have verified that the streaming job's schema and the downstream table's schema match perfectly, including the data types.
I'm not sure why I'm getting the error below even though the transformation is stateless. Any help would be appreciated.
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
File "/databricks/spark/python/pyspark/sql/utils.py", line 195, in call
raise e
File "/databricks/spark/python/pyspark/sql/utils.py", line 192, in call
self.func(DataFrame(jdf, self.sql_ctx), batch_id)
File "<command-422857213447422>", line 2, in write_to_managed_table
print(f"inside foreachBatch for batch_id:{batchId}, rows in passed dataframe: {micro_batch_df.count()}")
File "/databricks/spark/python/pyspark/sql/dataframe.py", line 670, in count
return int(self._jdf.count())
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/sql/utils.py", line 110, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o433.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 28 in stage 13792.0
failed 4 times, most recent failure: Lost task 28.3 in stage 13792.0 (TID 752198)
(10.139.64.13 executor 45):
org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema
doesn't match to the schema for existing state! Please note that Spark allow difference of
field name: check count of fields and data type of each field.
There might be a problem with the CSV files; one of them could be corrupted.
You can skip corrupted records by setting the "mode" option to "PERMISSIVE" or "DROPMALFORMED".
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamReader.html#csv(path:String):org.apache.spark.sql.DataFrame
spark.read.format("csv")
  .option("header", "true")
  .option("path", "your.csv")
  .option("mode", "DROPMALFORMED")
  .schema(csvSchema)
  .load()
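Since the failing job is a streaming one, the same option can be set on the streaming reader as well; a minimal PySpark sketch, assuming a source directory path and the csvSchema from your job:
stream_df = (spark.readStream
    .format("csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")   # drop corrupted records instead of failing
    .schema(csvSchema)                 # streaming CSV sources require an explicit schema
    .load("/path/to/csv/source"))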

A worker that holds delayed data gets killed in dask distributed

I'm having trouble debugging an error I got from dask distributed.
I am using loc to take a small dataframe out of a large one (~100M rows) and converting it to a dask array.
I only run into this issue when I want to grab the data from the cluster.
Along the way pandas prints a warning pointing to the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Finally, when my data is ready to be taken out with
ddf2local = client.gather(chunkDelayedDB.result())
the worker that holds the delayed data dies and I get the following message. On a single machine with 32 GB it works fine.
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/client.py", line 244, in _result
result = await self.client._gather([self])
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/client.py", line 1761, in _gather
response = await future
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/client.py", line 1813, in _gather_remote
response = await self.scheduler.gather(keys=keys)
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/core.py", line 748, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/core.py", line 547, in send_recv
raise exc.with_traceback(tb)
File "/home/admin/mypy36/lib/python3.6/site-packages/distributed/core.py", line 403, in handle_comm
File "/home/admin/mypy36/lib/python3.6/site-packages/distributed/scheduler.py", line 2561, in gather
KeyError: 'delayedDB-42a099cc4ddd139877f60c67295ab09a'
I tracked the memory of the worker that gets killed, and it is fine until the data is gathered (it also held a lot of data).
I would greatly appreciate any help.

redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first

I've recently started to use Redis and RQ to run background processes. I built a Dash app which works fine on Heroku and used to work locally as well. Recently, I tried to test the same app locally again and I keep getting the following error - although I'm using exactly the same code hosted on Heroku:
redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first.
In my requirements.txt and virtual env on Ubuntu 18.04 I have redis v.3.0.1, rq 0.13.0
When I run redis-server on my terminal I see that Redis 4.0.9 is used (that's also confusing to me).
I've spent two days googling for a solution, to no avail.
Does anyone have an idea of what might have happened and how to solve this error?
Here is the full relevant traceback:
File "/home/tom/dashenv/pb101_models/pages/cumulative_culture.py", line 1026, in stop_or_start_update
job = q.fetch_job(job_id)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/rq/queue.py", line 142, in fetch_job
self.remove(job_id)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/rq/queue.py", line 186, in remove
return self.connection.lrem(self.key, 1, job_id)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/client.py", line 1580, in lrem
return self.execute_command('LREM', name, count, value)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/client.py", line 754, in execute_command
connection.send_command(*args)
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/connection.py", line 619, in send_command
self.send_packed_command(self.pack_command(*args))
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/connection.py", line 659, in pack_command
for arg in imap(self.encoder.encode, args):
File "/home/tom/dashenv/dash/lib/python3.6/site-packages/redis/connection.py", line 124, in encode
"byte, string or number first." % typename)
redis.exceptions.DataError: Invalid input of type: 'NoneType'. Convert to a byte, string or number first.
Thanks in advance for any suggestion/hint.
All best,
Tom
Check this link: redis 3.0
It says that redis-py no longer accepts NoneType values.
Try json.dumps(None); that worked for me.
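A minimal sketch of that workaround applied to the traceback above; the guard around job_id is an assumption about where the None comes from, and redis_conn is a placeholder connection:
import json

# Guard against passing None to Redis: fetch_job() ends up calling LREM with the raw value.
job = q.fetch_job(job_id) if job_id is not None else None

# If you genuinely need to store a null-like value, serialise it first:
redis_conn.set("my_key", json.dumps(None))  # stores the string "null"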

Connection reset by peer error in MongoDb on bulk insert

I am trying to insert 500 documents by doing a bulk insert in pymongo and I get this error:
File "/usr/lib64/python2.6/site-packages/pymongo/collection.py", line 306, in insert
continue_on_error, self.__uuid_subtype), safe)
File "/usr/lib64/python2.6/site-packages/pymongo/connection.py", line 748, in _send_message
raise AutoReconnect(str(e))
pymongo.errors.AutoReconnect: [Errno 104] Connection reset by peer
I looked around and found here that this happens because the size of the inserted documents exceeds 16 MB, so according to that the size of the 500 documents should be over 16 MB. So I checked the size of the 500 documents (Python dictionaries) like this:
size = 0
for d in dicts:
    size += d.__sizeof__()  # size of each dict object itself, not of the values it holds
print size
This gives me 502920, which is about 500 KB, way less than 16 MB. Then why do I get this error?
I know I am calculating the size of Python dictionaries, not BSON documents, and MongoDB takes BSON documents, but that can't turn 500 KB into 16+ MB. Moreover, I don't know how to convert a Python dict into a BSON document.
My MongoDB version is 2.0.6 and pymongo version is 2.2.1
EDIT
I can do a bulk insert with 150 documents and that's fine, but above 150 documents this error appears.
This Bulk Inserts bug has been resolved, but you may need to update your pymongo version:
pip install --upgrade pymongo
The error occurs because the bulk inserted documents have an overall size greater than 16 MB.
My method of calculating the size of the dictionaries was wrong.
When I manually inspected each key of the dictionaries, I found that one key had a value of about 300 KB. That did make the overall size of the documents in the bulk insert more than 16 MB (500 * 300+ KB > 16 MB). But I still don't know how to calculate the size of a dictionary without manually inspecting it. Can someone please suggest a way?
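One way to measure what MongoDB will actually receive is to encode each dict to BSON and sum the encoded sizes; a sketch assuming pymongo's bundled bson module and the dicts list from the question:
from bson import BSON

# Encode each document to BSON and add up the encoded sizes (in bytes).
total = sum(len(BSON.encode(doc)) for doc in dicts)
print total  # compare against the 16 MB limit from the error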
Just had the same error and got around it by creating my own small bulks like this:
region_list = []
region_counter = 0
write_buffer = 1000

# loop through regions
for region in source_db.region.find({}, region_column):
    region_counter += 1  # up _counter
    region_list.append(region)
    # save bulk if we're at the write buffer
    if region_counter == write_buffer:
        result = user_db.region.insert(region_list)
        region_list = []
        region_counter = 0

# if there is a rest, also save that
if region_counter > 0:
    result = user_db.region.insert(region_list)
Hope this helps
NB: small update, from pymongo 2.6 on, PyMongo will auto-split lists based on the max transfer size: "The insert() method automatically splits large batches of documents into multiple insert messages based on max_message_size"