Reading shared global dataframe in dask - pandas
I'm parallelizing a function over 32 cores and am having trouble accessing a shared dataframe dask_paths. All the code works correctly when I remove the line dask_paths[dask_paths['od'] == (o, d)].compute() (and the lines that depend on it). Strangely, if I compute this for some fixed o, d outside the distributed code and then use that result, I get what I want (for that o, d). So it really is the access to dask_paths inside the parallel computation that is failing. I am using the logic given in the dask docs for "embarrassingly parallelizable for loops". I previously used get_group on a global, grouped pd.DataFrame for the same logic, and it suffered from the same problem with globals: even though the serial version runs in a couple of seconds, the parallel computation stalls before giving a cryptic error message (shown at the bottom). I don't know why this is.
Note that dask_paths is a dask.dataframe. This is the most basic parallelization pattern in dask, so I'm not sure why it's failing. I am working in a Vertex AI Jupyter notebook on Google Cloud. There is no error trace, because the program simply stalls. All the data/dataframes are defined in the global environment of the notebook in the cells above and work fine. The Vertex AI notebook has 16 vCPUs and 100 GB of RAM and runs on a Google Cloud VM. There is no reading from or writing to any files, so that's not the issue.
# dask_paths['od'] takes on values like (100, 150)
# popular takes the form [[20, 25], [100, 150], [67, 83], ...]
# and has 2000 elements; every element is a list of length 2
def main():
    def pop2unique(pop):
        df = dask_paths[dask_paths['od'] == (pop[0], pop[1])].compute()
        return df['last'].sum()

    lzs = []
    ncores = 32
    dask_client.cluster.scale(10)
    futures = dask_client.map(pop2unique, popular[:10])  # this stalls
    results = dask_client.gather(futures)
And dask_paths takes the form
index      od          last
605096     (22, 25)    10
999336     (103, 88)   31
1304512    (101, 33)   9
1432383    (768, 21)   16
The client being used everywhere is given by
from dask.distributed import Client, progress
dask_client = Client(threads_per_worker=4, n_workers=8)
The error message I get is long and cryptic:
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-9y17gy_r', purging
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 33
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 35
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 33', 'time': 1658860560.8544912}
[... the identical "Worker tried to connect with a duplicate name" warning, "Failed to start worker" traceback and "Failed while trying to start worker process" error repeat for workers 31, 32, 34, 35 and 36 ...]
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.7/asyncio/tasks.py:623> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860560.862359}")>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860560.862359}
[... the identical "Task exception was never retrieved" traceback repeats for workers 31, 33, 34, 35 and 36 ...]
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-xd5jxrin', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-w_fmefrs', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-djg8ki4m', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-ho1hw10b', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-mbdw10vg', purging
distributed.diskutils - INFO - Found stale lock file and directory '/home/jupyter/dask-worker-space/worker-whk890cp', purging
distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 32
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860564.692771}
distributed.nanny - ERROR - Failed while trying to start worker process: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860564.692771}
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x7f82cf7eba10>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py:310> exception=ValueError("Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860564.692771}")>)
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 348, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 334, in start
response = await self.instantiate()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 417, in instantiate
result = await self.process.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 694, in start
msg = await self._wait_until_connected(uid)
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 812, in _wait_until_connected
raise msg["exception"]
File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 884, in run
await worker
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 279, in _
await self.start()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1584, in start
await self._register_with_scheduler()
File "/opt/conda/lib/python3.7/site-packages/distributed/worker.py", line 1299, in _register_with_scheduler
raise ValueError(f"Unexpected response from register: {response!r}")
ValueError: Unexpected response from register: {'status': 'error', 'message': 'name taken, 32', 'time': 1658860564.692771}
The errors you are seeing might not be related to your workflow - maybe a version conflict or something similar.
However, you are mixing dask API paradigms. You have created a dask dataframe - which knows how to split operations into per-partition pieces for dask to compute - but then chosen to create tasks manually yourself. This is a bad idea. Dask tasks should generally operate on one partition of an ordinary data structure (in this case, a pandas dataframe), not on a dask collection (in this case, the dask dataframe). It may be (I am not sure) that the attempt to serialise the dask dataframe and deserialise it on the workers is what is causing them to fail to start properly.
Your workflow at first glance looks like a full shuffle, but it actually parallelises fine, because you can groupby within each partition and then sum the partial results:
def per_partition_op(df):
    # df is one pandas partition of the dask dataframe
    return df.groupby("od")["last"].sum()

df2 = dask_paths.map_partitions(per_partition_op)
At this point you can just compute and work with the series of partial results, since it should already be of a manageable size:
partials = df2.compute()
results = partials.groupby(level=0).sum()
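For completeness, here is a minimal sketch (assuming dask_paths and popular are defined exactly as in the question) that lets dask run the whole groupby aggregation itself and only looks up the popular (o, d) pairs locally afterwards, so nothing has to be shipped to the workers by hand:
# sketch: aggregate once on the dask dataframe, then look up each pair locally
totals = dask_paths.groupby("od")["last"].sum().compute()  # pandas Series indexed by the od tuples
totals = totals.to_dict()
results = [totals.get((pop[0], pop[1]), 0) for pop in popular]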