A worker holding delayed data gets killed in dask.distributed - pandas

I am having trouble debugging an error I get from dask.distributed.
I am using loc to take a small dataframe out of a large one (~100M rows) and converting it to a dask array.
I only run into this issue when I try to grab the data from the cluster.
Along the way pandas emits a SettingWithCopyWarning: "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy"
Finally, when my data is ready, I take it out with:
ddf2local = client.gather(chunkDelayedDB.result())
The worker that holds the data dies and I get the following message. On a single machine with 32 GB it works fine:
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/client.py", line 244, in _result
result = await self.client._gather([self])
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/client.py", line 1761, in _gather
response = await future
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/client.py", line 1813, in _gather_remote
response = await self.scheduler.gather(keys=keys)
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/core.py", line 748, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/home/yousef/pipy/lib/python3.6/site-packages/distributed/core.py", line 547, in send_recv
raise exc.with_traceback(tb)
File "/home/admin/mypy36/lib/python3.6/site-packages/distributed/core.py", line 403, in handle_comm
File "/home/admin/mypy36/lib/python3.6/site-packages/distributed/scheduler.py", line 2561, in gather
KeyError: 'delayedDB-42a099cc4ddd139877f60c67295ab09a'
I tracked the memory of the worker being killed and it is fine until the data is gathered (it also held a lot of data). I would greatly appreciate any help.
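For reference, here is a minimal, self-contained version of the pattern (a sketch only: it uses a local Client, and build_small_df is a stand-in for my real .loc selection):
import pandas as pd
from dask import delayed
from distributed import Client

client = Client()  # local cluster just for illustration

def build_small_df():
    # stand-in for the .loc selection on the ~100M-row dataframe
    return pd.DataFrame({"x": range(10)})

chunkDelayedDB = delayed(build_small_df)()
future = client.compute(chunkDelayedDB)  # schedule the work on the cluster
ddf2local = client.gather(future)        # pull the finished result locally
print(ddf2local.head())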

Related

What is the recommended architecture for using Amazon Neptune in a scalable way?

I am building an application backed by a Neptune database. Because I want the application to be scalable, I am using AWS Lambda + API Gateway to build a REST API to interact with the database. This seems to be a reasonable approach, given that this use case is documented in the Neptune docs.
The Neptune docs recommend reusing the websocket connection to the database across the entire execution context of the function, which is what I am doing at the moment. The docs also recommend resetting the connection and retrying upon errors (see here), which I am also doing. However, I am seeing exceptions every now and then (perhaps every 20 requests on average). One of the exceptions I get is
ConnectionResetError: Cannot write to closing transport
which seems to be the same as this issue.
The other one is:
Traceback (most recent call last):
File "/var/task/chalice/app.py", line 1685, in _get_view_function_response
response = view_function(**function_args)
File "/var/task/app.py", line 57, in resource
return Resource(app.current_request, g).process()
File "/var/task/backoff/_sync.py", line 94, in retry
ret = target(*args, **kwargs)
File "/var/task/chalicelib/handlers/resource.py", line 106, in get
values = resources.valueMap().with_(WithOptions.tokens).toList()
File "/var/task/gremlin_python/process/traversal.py", line 57, in toList
return list(iter(self))
File "/var/task/gremlin_python/process/traversal.py", line 47, in __next__
self.traversal_strategies.apply_strategies(self)
File "/var/task/gremlin_python/process/traversal.py", line 548, in apply_strategies
traversal_strategy.apply(traversal)
File "/var/task/gremlin_python/driver/remote_connection.py", line 63, in apply
remote_traversal = self.remote_connection.submit(traversal.bytecode)
File "/var/task/gremlin_python/driver/driver_remote_connection.py", line 60, in submit
results = result_set.all().result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/var/task/gremlin_python/driver/resultset.py", line 90, in cb
f.result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/var/lang/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/var/task/gremlin_python/driver/connection.py", line 82, in _receive
data = self._transport.read()
File "/var/task/gremlin_python/driver/aiohttp/transport.py", line 104, in read
raise RuntimeError("Connection was already closed.")
RuntimeError: Connection was already closed.
In case it is relevant, I am using gremlinpython==3.5.1.
It seems to me that these issues are all ultimately a consequence of using AWS Lambda, namely of the mismatch between the longevity of websocket connections and the ephemeral nature of Lambda execution contexts. The question then is: am I doing the wrong thing by trying to use AWS Lambda for my API? Would it be more appropriate to set up an EC2 instance and deal with the scalability in some other way?
P.S. Previously I created and closed a connection in every function execution (as the Neptune docs used to recommend), which worked fine but was naturally slow.
The latest version of Neptune only supports Gremlin 3.4.11 (https://docs.aws.amazon.com/neptune/latest/userguide/engine-releases-1.0.5.1.html). I would start by using gremlin-python 3.4.11 and see if that resolves your issue. Gremlin-python 3.5 replaced Tornado with AIOHTTP (ref) for websocket connections, and I suspect that change may be causing a slight change in behavior that a future release supporting Gremlin 3.5 will address.
I wonder whether the 'Connection was already closed' error message is not being treated as a retriable error by the retry logic.
What happens if you add this error message to the list of retriable_error_msgs in the Python example in the docs?
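For instance (a sketch only; the list name retriable_error_msgs comes from the docs' example, but the helper below is hypothetical):
retriable_error_msgs = [
    'ConnectionResetError',
    'Connection was already closed',  # the message from the traceback above
]

def is_retriable(exc):
    # hypothetical helper: retry only when the exception text
    # matches one of the known transient error messages
    msg = str(exc)
    return any(pattern in msg for pattern in retriable_error_msgs)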

Streaming job failure - State Schema not Compatible issue

My streaming job is now failing with the error below. The job worked fine for almost 2 months; the transformation is completely stateless and just appends the new rows to the destination Delta table. Before streaming, I manually provide the schema for the CSV files, and I have verified that the streaming job's schema and the downstream table's schema match perfectly, including the data types.
I am not sure why I am getting the error below even though the transformation is stateless. Any help would be appreciated.
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
File "/databricks/spark/python/pyspark/sql/utils.py", line 195, in call
raise e
File "/databricks/spark/python/pyspark/sql/utils.py", line 192, in call
self.func(DataFrame(jdf, self.sql_ctx), batch_id)
File "<command-422857213447422>", line 2, in write_to_managed_table
print(f"inside foreachBatch for batch_id:{batchId}, rows in passed dataframe: {micro_batch_df.count()}")
File "/databricks/spark/python/pyspark/sql/dataframe.py", line 670, in count
return int(self._jdf.count())
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/sql/utils.py", line 110, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o433.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 28 in stage 13792.0
failed 4 times, most recent failure: Lost task 28.3 in stage 13792.0 (TID 752198)
(10.139.64.13 executor 45):
org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema
doesn't match to the schema for existing state! Please note that Spark allow difference of
field name: check count of fields and data type of each field.
There might be a problem with the CSV file; it could be corrupted.
You can ignore this csv file by setting the "mode" option to "PERMISSIVE" or "DROPMALFORMED".
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE : sets other fields to null when it meets a corrupted record. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/DataStreamReader.html#csv(path:String):org.apache.spark.sql.DataFrame
spark.read.format("csv")
  .option("header", "true")
  .option("path", "your.csv")
  .option("mode", "DROPMALFORMED")
  .schema(csvSchema)
  .load()
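For completeness, a self-contained PySpark version of the same idea (the schema fields here are placeholders, not your real columns):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# placeholder schema; substitute your real column names and types
csvSchema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])

df = (spark.read.format("csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")  # silently drops rows that fail to parse
      .schema(csvSchema)
      .load("your.csv"))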

How does "on-negotiation-needed" work when trying to stream using gstreamer webrtc?

How does the WebRTC pipeline get any information about its peers?
Is this what the on_negotiation_needed callback does?
def start_pipeline(self):
    self.pipe = Gst.parse_launch(PIPELINE_DESC)
    self.webrtc = self.pipe.get_by_name('sendrecv')
    self.webrtc.connect('on-negotiation-needed', self.on_negotiation_needed)
    self.webrtc.connect('on-ice-candidate', self.send_ice_candidate_message)
    self.webrtc.connect('pad-added', self.on_incoming_stream)
    self.pipe.set_state(Gst.State.PLAYING)
I see that it has the on_negotiation_needed callback, but it's unclear where the element variable comes from. I looked here: http://blog.nirbheek.in/2018/02/gstreamer-webrtc.html and here: https://github.com/centricular/gstwebrtc-demos and I am still confused as to how this negotiation works. From what I understand there are 2 (or more) peers and both of them must connect to the signaling server, then one of them has to create an offer.
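For what it's worth, the offer-creation pattern in the demos looks roughly like this (my stripped-down sketch of the demo code, not something I have verified end to end); it suggests GStreamer passes the webrtcbin itself as the element argument when it emits the signal:
def on_negotiation_needed(self, element):
    # element is the webrtcbin that emitted 'on-negotiation-needed'
    promise = Gst.Promise.new_with_change_func(self.on_offer_created, element, None)
    element.emit('create-offer', None, promise)

def on_offer_created(self, promise, element, _):
    # runs once the offer is ready; set it locally, then send its SDP to the peer
    promise.wait()
    reply = promise.get_reply()
    offer = reply.get_value('offer')
    element.emit('set-local-description', offer, Gst.Promise.new())
    # offer.sdp.as_text() would then go to the peer via the signaling server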
I await the message from (I assume) the GStreamer webrtcbin on the signaling server:
print(websocket.remote_address)
# get message from client
message = await asyncio.wait_for(websocket.recv(), 3000)
and I get this error when the pipeline starts:
('192.168.11.138', 44120)
Error in connection handler
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 674, in transfer_data
message = yield from self.read_message()
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 742, in read_message
frame = yield from self.read_data_frame(max_size=self.max_size)
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 815, in read_data_frame
frame = yield from self.read_frame(max_size)
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 884, in read_frame
extensions=self.extensions,
File "/usr/local/lib/python3.6/dist-packages/websockets/framing.py", line 99, in read
data = yield from reader(2)
File "/usr/lib/python3.6/asyncio/streams.py", line 672, in readexactly
raise IncompleteReadError(incomplete, n)
asyncio.streams.IncompleteReadError: 0 bytes read on a total of 2 expected bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/websockets/server.py", line 169, in handler
yield from self.ws_handler(self, path)
File "signaling_server.py", line 34, in signaling
message = await asyncio.wait_for(websocket.recv(), 3000)
File "/usr/lib/python3.6/asyncio/tasks.py", line 358, in wait_for
return fut.result()
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 434, in recv
yield from self.ensure_open()
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 646, in ensure_open
) from self.transfer_data_exc
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1006 (connection closed abnormally [internal]), no reason
I cannot speak for Python (unfortunately, I cannot get the GStreamer Python bindings to work on Windows); however, the demo works from C# (I just checked).
First, connect with your browser to https://webrtc.nirbheek.in/ and note the 'Our id' value.
Your Python GStreamer application should connect to wss://webrtc.nirbheek.in:8443 and use the id value from the browser.
The browser will get the test image stream from GStreamer, and the GStreamer application will get the webcam image from the browser.
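From the Python side, that handshake would look roughly like this (a sketch based on the demo's simple signaling protocol; '1234' is a made-up id for our side and peer_id is the browser's 'Our id'):
import asyncio
import ssl
import websockets

async def connect(peer_id):
    sslctx = ssl.create_default_context()
    async with websockets.connect('wss://webrtc.nirbheek.in:8443', ssl=sslctx) as ws:
        await ws.send('HELLO 1234')            # register our own (made-up) id
        assert (await ws.recv()) == 'HELLO'    # server acknowledges
        await ws.send('SESSION %s' % peer_id)  # ask to pair with the browser
        print(await ws.recv())                 # expect 'SESSION_OK'

asyncio.get_event_loop().run_until_complete(connect('4242'))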
HTH, Tom

POSKeyError during migration

Migration from Plone 3.3.2 to Plone 4.2.1 fails with a POSKeyError. I've tried the recipes from this article: http://plonechix.blogspot.com/2009/12/definitive-guide-to-poskeyerror.html.
I've run the error_finder snippet, but it didn't give me any exceptions. I've also tried to fetch the object in the debugger using app.mysite._p_jar[p64(oid)], also with no success; it fails with the same error.
How can I delete the broken object, or at least get more info about it (e.g. its class name or location)?
Full traceback:
POSKeyError('\x00\x00\x00\x00\x00\x0ey=',)
(Also, the following error occurred while attempting to render the standard error message, please see the event log for full details:
An operation previously failed, with traceback:
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/ZPublisher/Publish.py", line 443, in publish_module
environ, debug, request, response)
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/ZPublisher/Publish.py", line 237, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/ZPublisher/Publish.py", line 134, in publish
transactions_manager.commit()
File "/Users/makmak/Plone/buildout-cache/eggs/Zope2-2.13.16-py2.7.egg/Zope2/App/startup.py", line 301, in commit
transaction.commit()
File "/Users/makmak/Plone/buildout-cache/eggs/transaction-1.1.1-py2.7.egg/transaction/_manager.py", line 89, in commit
return self.get().commit()
File "/Users/makmak/Plone/buildout-cache/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 336, in commit
t, v, tb = self._saveAndGetCommitishError()
File "/Users/makmak/Plone/buildout-cache/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 329, in commit
self._commitResources()
File "/Users/makmak/Plone/buildout-cache/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 443, in _commitResources
rm.commit(self)
File "/Users/makmak/Plone/buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/Connection.py", line 572, in commit
oid, serial, transaction)
File "/Users/makmak/Plone/buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/BaseStorage.py", line 416, in checkCurrentSerialInTransaction
committed_tid = self.getTid(oid)
File "/Users/makmak/Plone/buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/FileStorage/FileStorage.py", line 770, in getTid
with self._lock:
File "/Users/makmak/Plone/buildout-cache/eggs/ZODB3-3.10.5-py2.7-macosx-10.4-x86_64.egg/ZODB/FileStorage/FileStorage.py", line 403, in _lookup_pos
raise POSKeyError(oid)
POSKeyError: 0x0e793d
You can use fsrefs.py to find the bad object.
A very short article on using it is: http://nathanvangheem.com/news/fixing-broken-zodb-object-references
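If you want to do roughly the same thing by hand, the idea looks like this (a sketch only; the storage path is an assumption and the oid comes from your traceback):
from ZODB.FileStorage import FileStorage
from ZODB.serialize import referencesf
from ZODB.utils import p64, u64

storage = FileStorage('var/filestorage/Data.fs', read_only=True)
missing = p64(0x0e793d)  # the oid from the POSKeyError above

# walk every data record and report the ones that reference the missing oid
for txn in storage.iterator():
    for record in txn:
        if record.data and missing in referencesf(record.data):
            print('0x%x references the missing oid' % u64(record.oid))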
I believe this is the same issue I just ran into, which happens if a savepoint that included adding an object to the catalog is rolled back. I think this is a bug in the ZODB, but you can work around it by addressing whatever is rolling back a savepoint; in this case, that is the migration of files and images to blobs. So if you fix whatever is keeping those files or images from successfully migrating to blobs (or just delete them), the migration should succeed.

Error while loading data in BigQuery via command line: Updated

I ran an overnight job using a script.
It worked for a lot of tables, and then some 4 hours back (approximately 7 am IST) it started behaving weirdly.
Now even single commands give the same error:
bq load --max_bad_records=10 tbl163.a_V3_14Jun2012 a_V3_14Jun2012.log.gz ../schema/analyze.schema
Error:
BigQuery error in load operation: Could not connect with BigQuery server, http
response status: 502
Update: I have received the following error just now
You have encountered a bug in the BigQuery CLI. Please send an email to bigquery-team@google.com to report this, with the following information:
========================================
== Platform ==
CPython:2.7.3:Linux-3.2.0-25-virtual-x86_64-with-Ubuntu-12.04-precise
== bq version ==
v2.0.6
== Command line ==
['/usr/local/bin/bq', 'load', '--max_bad_records=10', 'vizvrm299.analyze_VIZVRM299_26Jun2012', 'analyze_VIZVRM299_26Jun2012.log.gz', '../schema/analyze.schema']
== Error trace ==
File "build/bdist.linux-x86_64/egg/bq.py", line 614, in RunSafely
self.RunWithArgs(*args, **kwds)
File "build/bdist.linux-x86_64/egg/bq.py", line 791, in RunWithArgs
job = client.Load(table_reference, source, schema=schema, **opts)
File "build/bdist.linux-x86_64/egg/bigquery_client.py", line 1473, in Load
upload_file=upload_file, **kwds)
File "build/bdist.linux-x86_64/egg/bigquery_client.py", line 1228, in ExecuteJob
job_id=job_id)
File "build/bdist.linux-x86_64/egg/bigquery_client.py", line 1214, in RunJobSynchronously
upload_file=upload_file, job_id=job_id)
File "build/bdist.linux-x86_64/egg/bigquery_client.py", line 1208, in StartJob
projectId=project_id).execute()
File "build/bdist.linux-x86_64/egg/bigquery_client.py", line 184, in execute
return super(BigqueryHttp, self).execute(**kwds)
File "build/bdist.linux-x86_64/egg/apiclient/http.py", line 644, in execute
_, body = self.next_chunk(http)
File "build/bdist.linux-x86_64/egg/apiclient/http.py", line 708, in next_chunk
raise ResumableUploadError("Failed to retrieve starting URI.")
========================================
Unexpected exception in load operation: Failed to retrieve starting URI.
Just to close out this question, are you still observing these errors? This looks like a network configuration issue, not directly an issue with the bq command line tool. However, AFAIK, bq doesn't provide functions for resuming job insertion if there is a problem with the network.
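Since bq won't resume the job on its own, one workaround is a crude client-side retry loop around the load command (a sketch; the command is the one from the question and the retry counts are just examples):
import subprocess
import time

cmd = ['bq', 'load', '--max_bad_records=10',
       'tbl163.a_V3_14Jun2012', 'a_V3_14Jun2012.log.gz',
       '../schema/analyze.schema']

for attempt in range(5):
    if subprocess.call(cmd) == 0:
        break  # load succeeded
    time.sleep(2 ** attempt)  # exponential backoff before retrying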