Reading shared global dataframe in dask - pandas

I'm parallelizing a function over 32 cores and am having trouble accessing a shared dataframe dask_paths. All the code works correctly when I remove the line dask_paths[dask_paths['od'] == (o,d)].compute() (and the lines that depend on it). Strangely, if I compute this for some fixed o, d outside of the distributed code and then use that result, I get what I want (for that o, d). So it really is the access of dask_paths inside the parallel computation that is failing. I am following the logic given in the dask docs for "embarrassingly parallelizable for loops". Moreover, I used to use get_group on a global pd.DataFrame grouped for this logic, and that suffered from the same problem of accessing a global object: even though the serial version finishes in a couple of seconds, the computation stalls before giving a cryptic error message (given at the bottom). I don't know why this is.
Note that dask_paths is a Dask dataframe. This is about the most basic pattern for parallelizing with dask, so I'm not sure why it's failing. I am working in a Vertex AI Jupyter notebook on Google Cloud. There is no error trace, because the program simply stalls. All the data/dataframes are defined in the global environment of the notebook in the cells above and work fine. The Vertex AI notebook has 16 vCPUs and 100 GB RAM and runs on a Google Cloud VM. There is no reading or writing to any files, so that's not the issue.
# dask_paths['od'] takes on values like (100, 150)
# popular takes the form [[20, 25], [100, 150], [67, 83], ...]
# and is 2000 elements long; every element is a list of length 2
def main():
    def pop2unique(pop):
        df = dask_paths[dask_paths['od'] == (pop[0], pop[1])].compute()
        return df['last'].sum()

    lzs = []
    ncores = 32
    dask_client.cluster.scale(10)
    futures = dask_client.map(pop2unique, popular[:10])  # this stalls
    results = dask_client.gather(futures)
And dask_paths takes the form
index     od          last
605096    (22, 25)    10
999336    (103, 88)   31
1304512   (101, 33)   9
1432383   (768, 21)   16
The client being used everywhere is given by
from dask.distributed import Client, progress
dask_client = Client(threads_per_worker=4, n_workers=8)
The error message I get is long and cryptic:
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-9y17gy_r', purging
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 33
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 35
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 33', 'time': 1658860560.8544912}
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 36
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 31
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 35', 'time': 1658860560.8563268}
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 36', 'time': 1658860560.8576422}
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 34
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 31', 'time': 1658860560.8595085}
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 32
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 34', 'time': 1658860560.8609138}
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860560.862359}
distributed.nanny - ERROR - Failed while trying to start worker process: Unexpected response from register: {'status': 'error', 'message': 'name taken, 36', 'time': 1658860560.8576422}
distributed.nanny - ERROR - Failed while trying to start worker process: Unexpected response from register: {'status': 'error', 'message': 'name taken, 31', 'time': 1658860560.8595085}
distributed.nanny - ERROR - Failed while trying to start worker process: Unexpected response from register: {'status': 'error', 'message': 'name taken, 35', 'time': 1658860560.8563268}
distributed.nanny - ERROR - Failed while trying to start worker process: Unexpected response from register: {'status': 'error', 'message': 'name taken, 33', 'time': 1658860560.8544912}
distributed.nanny - ERROR - Failed while trying to start worker process: Unexpected response from register: {'status': 'error', 'message': 'name taken, 34', 'time': 1658860560.8609138}
distributed.nanny - ERROR - Failed while trying to start worker process: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860560.862359}
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.7/asyncio/tasks.py:623> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860560.862359}")>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860560.862359}
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.7/asyncio/tasks.py:623> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 36', 'time': 1658860560.8576422}")>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 36', 'time': 1658860560.8576422}
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.7/asyncio/tasks.py:623> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 35', 'time': 1658860560.8563268}")>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 35', 'time': 1658860560.8563268}
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.7/asyncio/tasks.py:623> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 34', 'time': 1658860560.8609138}")>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 34', 'time': 1658860560.8609138}
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.7/asyncio/tasks.py:623> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 33', 'time': 1658860560.8544912}")>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 33', 'time': 1658860560.8544912}
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.7/asyncio/tasks.py:623> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 31', 'time': 1658860560.8595085}")>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 31', 'time': 1658860560.8595085}
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-xd5jxrin', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-w_fmefrs', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-djg8ki4m', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-ho1hw10b', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-mbdw10vg', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-whk890cp', purging
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 32
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860564.692771}
distributed.nanny - ERROR - Failed while trying to start worker process: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860564.692771}
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x7f82cf7eba10>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py:310> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860564.692771}")>)
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 348, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860564.692771}

The errors you are seeing might not be related to your workflow - maybe a version conflict or similar.
However, you are mixing dask API paradigms. You have created a dask dataframe - which knows how to partition operations for dask to compute - but then chosen to create tasks manually yourself. This is a bad idea. Dask tasks should generally operate on one partition of a normal data structure (in this case, a pandas dataframe), not on a dask collection (in this case, the dask dataframe). It may be (I am not sure) that the attempt to serialise the dask dataframe and deserialise it on the workers is what is causing them to fail to start properly.
Your workflow at first glance looks like a full shuffle, but in fact it parallelises OK, because you can groupby within each partition and then sum the results.
def per_partition_op(df):
    # df here is one pandas partition of the dask dataframe
    return df.groupby("od")["last"].sum()

df2 = df.map_partitions(per_partition_op)
At this point, you can just compute and work with the partials series, since it should already be of a manageable size:
partials = df2.compute()
results = partials.groupby(level=0).sum()
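To tie this back to the question's pop2unique, here is a minimal sketch of the final lookup, assuming results is the pandas Series of per-od sums produced above and popular is the question's list of [o, d] pairs (the values shown are placeholders):

# Convert the per-od sums to a plain dict to avoid tuple-indexing ambiguity,
# then look up each popular (o, d) pair; missing pairs default to 0.
lookup = results.to_dict()
popular = [[20, 25], [100, 150], [67, 83]]  # placeholder pairs
sums = {tuple(pop): lookup.get(tuple(pop), 0) for pop in popular}

This way the heavy work happens once, partition by partition, on the workers, and only a small Series ever comes back to the client, instead of filtering the whole dask dataframe once per pair.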

Related

(Scrapy-Redis) Error caught on signal handler: <bound method RedisMixin.spider_idle of spider

I have three machines on Azure. One is a redis server; the others are crawlers. But they showed this message after a few hours. Did anyone encounter this situation before? Thanks.
2023-02-04 10:58:14 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method RedisMixin.spider_idle of <Spider at 0x29ca19e9580>>
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\connection.py", line 812, in read_response
response = self._parser.read_response(disable_decoding=disable_decoding)
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\connection.py", line 318, in read_response
raw = self._buffer.readline()
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\connection.py", line 249, in readline
self._read_from_socket()
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\connection.py", line 192, in _read_from_socket
data = self._sock.recv(socket_read_size)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\utils\signal.py", line 30, in send_catch_log
response = robustApply(receiver, signal=signal, sender=sender, *arguments, **named)
File "C:\ProgramData\Anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy_redis\spiders.py", line 205, in spider_idle
if self.server is not None and self.count_size(self.redis_key) > 0:
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\commands\core.py", line 2604, in llen
return self.execute_command("LLEN", name)
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\client.py", line 1258, in execute_command
return conn.retry.call_with_retry(
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\retry.py", line 49, in call_with_retry
fail(error)
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\client.py", line 1262, in <lambda>
lambda error: self._disconnect_raise(conn, error),
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\client.py", line 1248, in _disconnect_raise
raise error
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\retry.py", line 46, in call_with_retry
return do()
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\client.py", line 1259, in <lambda>
lambda: self._send_command_parse_response(
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\client.py", line 1235, in _send_command_parse_response
return self.parse_response(conn, command_name, **options)
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\client.py", line 1275, in parse_response
response = connection.read_response()
File "C:\ProgramData\Anaconda3\lib\site-packages\redis\connection.py", line 818, in read_response
raise ConnectionError(f"Error while reading from {hosterr}" f" : {e.args}")
redis.exceptions.ConnectionError: Error while reading from X.X.X.X:X : (10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)
I'd appreciate any clue/suggestion/solution.

The security token included in the request is expired, when I try to update credentials

I am using Celery with SQS as a broker, and I am trying to renew my credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) before they expire. The first time I run the task the result is success, but after 15 minutes it expires even though the credentials have been renewed. The function to update the credentials is as follows:
import os
import boto3
from celery import Celery
from kombu.utils.url import safequote


def update_aws_credentials():
    role_info = {
        'RoleArn': f"arn:aws:iam::{os.environ['AWS_ACCOUNT_NUMER']}:role/my_role_execution",
        'RoleSessionName': 'roleExecution',
        'DurationSeconds': 900
    }
    sts_client = boto3.client('sts', region_name='eu-central-1')
    credentials = sts_client.assume_role(**role_info)

    aws_access_key_id = credentials["Credentials"]['AccessKeyId']
    aws_secret_access_key = credentials["Credentials"]['SecretAccessKey']
    aws_session_token = credentials["Credentials"]["SessionToken"]

    os.environ["AWS_ACCESS_KEY_ID"] = aws_access_key_id
    os.environ["AWS_SECRET_ACCESS_KEY"] = aws_secret_access_key
    os.environ["AWS_DEFAULT_REGION"] = 'eu-central-1'
    os.environ["AWS_SESSION_TOKEN"] = aws_session_token

    return aws_access_key_id, aws_secret_access_key


def get_celery(aws_access_key_id, aws_secret_access_key):
    broker = f"sqs://{safequote(aws_access_key_id)}:{safequote(aws_secret_access_key)}@"
    backend = 'redis://redis-service:6379/0'
    celery = Celery("my_task", broker=broker, backend=backend)
    celery.conf["broker_transport_options"] = {
        'polling_interval': 30,
        'region': 'eu-central-1',
        'predefined_queues': {
            "my_queue": {
                'url': f"https://sqs.eu-central-1.amazonaws.com/{os.environ['AWS_ACCOUNT_NUMER']}/my_queue"
            }
        }
    }
    celery.conf["task_default_queue"] = "my_queue"
    return celery


def refresh_sqs_credentials():
    access, secret = update_aws_credentials()
    return get_celery(access, secret)
Running refresh_sqs_credentials, new credentials are created:
celery = worker.refresh_sqs_credentials()
And then I run my task with celery:
task = celery.send_task('my_task.code_of_my_task', args=[content], task_id=task_id)
All tasks that I run before 15 minutes finish successfully, but after 15 minutes the error is the following:
[2021-12-14 14:08:15,637] ERROR in app: Exception on /tasks/run [POST]
Traceback (most recent call last):
File "/api/app.py", line 87, in post
task = celery.send_task('glgt_ap35080_dev_sqs_runalgo.allocation_alg_task', args=[content], task_id=task_id)
File "/usr/local/lib/python3.6/site-packages/celery/app/base.py", line 717, in send_task
amqp.send_task_message(P, name, message, **options)
File "/usr/local/lib/python3.6/site-packages/celery/app/amqp.py", line 547, in send_task_message
**properties
File "/usr/local/lib/python3.6/site-packages/kombu/messaging.py", line 178, in publish
exchange_name, declare,
File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 525, in _ensured
return fun(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/kombu/messaging.py", line 200, in _publish
mandatory=mandatory, immediate=immediate,
File "/usr/local/lib/python3.6/site-packages/kombu/transport/virtual/base.py", line 605, in basic_publish
return self._put(routing_key, message, **kwargs)
File "/usr/local/lib/python3.6/site-packages/kombu/transport/SQS.py", line 294, in _put
c.send_message(**kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 337, in _api_call
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 656, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ExpiredToken) when calling the SendMessage operation: The security token included in the request is expired
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/local/lib/python3.6/site-packages/flask_restplus/api.py", line 325, in wrapper
resp = resource(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/flask/views.py", line 88, in view
return self.dispatch_request(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/flask_restplus/resource.py", line 44, in dispatch_request
resp = meth(*args, **kwargs)
File "/api/app.py", line 90, in post
abort(500)
File "/usr/local/lib/python3.6/site-packages/werkzeug/exceptions.py", line 774, in abort
return _aborter(status, *args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/werkzeug/exceptions.py", line 755, in __call__
raise self.mapping[code](*args, **kwargs)
werkzeug.exceptions.InternalServerError: 500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
10.142.95.217 - - [14/Dec/2021 14:08:15] "POST /tasks/run HTTP/1.1" 500 -
I'm storing the credentials in environment variables, and I don't understand why they expire after 15 minutes. Can someone help me, please?
The versions of the packages used are:
boto3==1.14.54
celery==5.0.0
kombu==5.0.2
pycurl==7.43.0.6
Thank you

use AWS_PROFILE in pandas.read_parquet

I'm testing this locally where I have a ~/.aws/config file.
~/.aws/config looks something like:
[profile a]
...
[profile b]
...
I also have an AWS_PROFILE environment variable set to "a".
I would like to read a file that is accessible with profile b using pandas.
I am able to access it through s3fs by doing:
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(profile="b")
fs.get("BUCKET/FILE.parquet", "FILE.parquet")
pd.read_parquet("FILE.parquet")
However, if I try to pass this to pd.read_parquet using storage_options I get a PermissionError: Forbidden.
pd.read_parquet(
    "s3://BUCKET/FILE.parquet",
    storage_options={"profile": "b"},
)
Full traceback below:
Traceback (most recent call last):
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 233, in _call_s3
out = await method(**additional_kwargs)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/aiobotocore/client.py", line 154, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
return impl.read(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
return self.api.parquet.read_table(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/parquet.py", line 1672, in read_table
dataset = _ParquetDatasetV2(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/parquet.py", line 1504, in __init__
if filesystem.get_file_info(path_or_paths).is_file:
File "pyarrow/_fs.pyx", line 438, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/_fs.pyx", line 1004, in pyarrow._fs._cb_get_file_info
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/fs.py", line 226, in get_file_info
info = self.fs.info(path)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 72, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in sync
raise result[0]
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 20, in _runner
result[0] = await coro
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 911, in _info
out = await self._call_s3(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 252, in _call_s3
raise translate_boto_error(err)
PermissionError: Forbidden
Note: there is an old question somewhat related to this but it didn't help: How to read parquet file from s3 using dask with specific AWS profile
You just need to add the following argument to the function:
storage_options=dict(profile='your_profile_name')
Hence the read statement is:
pd.read_parquet("s3://your_bucket",storage_options=dict(profile='your_profile_name'))

tensorflow-data-validation doesn't work on large datasets with apache-beam direct runner because of grpc timeout

I'm running into a problem with tensorflow-data-validation using the Apache Beam direct runner to generate statistics from some large datasets (over 400 GB).
All the workers stopped working after an error message of "Keepalive watchdog fired. Closing transport.", which appears to be a grpc keepalive timeout.
E0804 17:49:07.419950276 44806 chttp2_transport.cc:2881] ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
2020-08-04 17:49:07 local_job_service.py : INFO Worker: severity: ERROR timestamp { seconds: 1596563347 nanos: 420487403 } message: "Python sdk harness failed: \nTraceback (most recent call last):\n File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 158, in main\n sdk_pipeline_options.view_as(ProfilingOptions))).run()\n File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 213, in run\n for work_request in self._control_stub.Control(get_responses()):\n File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 416, in __next__\n return self._next()\n File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 706, in _next\n raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog timeout\"\n\tdebug_error_string = \"{\"created\":\"#1596563347.420024732\",\"description\":\"Error received from peer ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive watchdog timeout\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent call last):\n File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 158, in main\n sdk_pipeline_options.view_as(ProfilingOptions))).run()\n File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 213, in run\n for work_request in self._control_stub.Control(get_responses()):\n File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 416, in __next__\n return self._next()\n File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 706, in _next\n raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog timeout\"\n\tdebug_error_string = \"{\"created\":\"#1596563347.420024732\",\"description\":\"Error received from peer ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive watchdog timeout\",\"grpc_status\":14}\"\n>\n" log_location: "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161" thread: "MainThread"
Traceback (most recent call last):
File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globalse
File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 248, in <module>
main(sys.argv)
File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 158, in main
sdk_pipeline_options.view_as(ProfilingOptions))).run()
File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 213, in run
for work_request in self._control_stub.Control(get_responses()):
File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py", line 706, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "keepalive watchdog timeout"
debug_error_string = "{"created":"#1596563347.420024732","description":"Error received from peer ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive watchdog timeout","grpc_status":14}"

Auth Issue with AirFlow and BigQuery?

I'm trying to get a simple dummy job going in Airflow for BigQuery, but I'm running into what I think might be auth issues, though I'm not quite sure.
My DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 1, 1),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

with DAG('my_bq_dag', schedule_interval=timedelta(days=1),
         default_args=default_args) as dag:
    bq_extract_one_day = BigQueryOperator(
        task_id='my_bq_task1',
        bql='SELECT 666 as msg',
        destination_dataset_table='airflow.msg',
        write_disposition='WRITE_TRUNCATE',
        bigquery_conn_id='bigquery_default'
    )
Then when I try to test it with:
#airflow-server:~/$ airflow test my_bq_dag my_bq_task1 2017-01-01
I get:
[2017-03-09 17:06:05,629] {__init__.py:36} INFO - Using executor LocalExecutor
[2017-03-09 17:06:05,735] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-03-09 17:06:05,764] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
[2017-03-09 17:06:06,091] {models.py:154} INFO - Filling up the DagBag from /home/user/myproject/airflow/dags
[2017-03-09 17:06:06,385] {models.py:1196} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 2
--------------------------------------------------------------------------------
[2017-03-09 17:06:06,386] {models.py:1219} INFO - Executing <Task(BigQueryOperator): my_bq_task1> on 2017-01-01 00:00:00
[2017-03-09 17:06:06,396] {bigquery_operator.py:55} INFO - Executing: SELECT 666 as msg
[2017-03-09 17:06:06,425] {discovery.py:810} INFO - URL being requested: POST https://www.googleapis.com/bigquery/v2/projects/myproject/jobs?alt=json
[2017-03-09 17:06:06,425] {client.py:570} INFO - Attempting refresh to obtain initial access_token
[2017-03-09 17:06:06,426] {models.py:1286} ERROR - []
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1245, in run
result = task_copy.execute(context=context)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/bigquery_operator.py", line 59, in execute
cursor.run_query(self.bql, self.destination_dataset_table, self.write_disposition, self.allow_large_results, self.udf_config)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 207, in run_query
return self.run_with_configuration(configuration)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 437, in run_with_configuration
.insert(projectId=self.project_id, body=job_data) \
File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 140, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 722, in execute
body=self.body, headers=self.headers)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 572, in new_request
self._refresh(request_orig)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 842, in _refresh
self._do_refresh_request(http_request)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 869, in _do_refresh_request
body = self._generate_refresh_request_body()
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 1549, in _generate_refresh_request_body
assertion = self._generate_assertion()
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 1677, in _generate_assertion
private_key, self.private_key_password), payload)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/_openssl_crypt.py", line 117, in from_string
pkey = crypto.load_privatekey(crypto.FILETYPE_PEM, parsed_pem_key)
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/crypto.py", line 2583, in load_privatekey
_raise_current_error()
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/_util.py", line 48, in exception_from_error_queue
raise exception_type(errors)
Error: []
[2017-03-09 17:06:06,428] {models.py:1298} INFO - Marking task as UP_FOR_RETRY
[2017-03-09 17:06:06,428] {models.py:1327} ERROR - []
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 15, in <module>
args.func(args)
File "/usr/local/lib/python2.7/dist-packages/airflow/bin/cli.py", line 352, in test
ti.run(force=True, ignore_dependencies=True, test_mode=True)
File "/usr/local/lib/python2.7/dist-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1245, in run
result = task_copy.execute(context=context)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/bigquery_operator.py", line 59, in execute
cursor.run_query(self.bql, self.destination_dataset_table, self.write_disposition, self.allow_large_results, self.udf_config)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 207, in run_query
return self.run_with_configuration(configuration)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/bigquery_hook.py", line 437, in run_with_configuration
.insert(projectId=self.project_id, body=job_data) \
File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 140, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 722, in execute
body=self.body, headers=self.headers)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 572, in new_request
self._refresh(request_orig)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 842, in _refresh
self._do_refresh_request(http_request)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 869, in _do_refresh_request
body = self._generate_refresh_request_body()
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 1549, in _generate_refresh_request_body
assertion = self._generate_assertion()
File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 1677, in _generate_assertion
private_key, self.private_key_password), payload)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/_openssl_crypt.py", line 117, in from_string
pkey = crypto.load_privatekey(crypto.FILETYPE_PEM, parsed_pem_key)
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/crypto.py", line 2583, in load_privatekey
_raise_current_error()
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/_util.py", line 48, in exception_from_error_queue
raise exception_type(errors)
OpenSSL.crypto.Error: []
I've been trying to just get a simple job to write to a table in my BQ project, partly using this post as a guide: https://medium.com/google-cloud/airflow-for-google-cloud-part-1-d7da9a048aa4#.5qclla82t