s3fs seems to fail from time to time when reading from an S3 bucket in an AWS Lambda function within a VPC. I am using s3fs==0.4.0 and pandas==1.0.1.
import s3fs
import pandas as pd

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    s3_file = event['Records'][0]['s3']['object']['key']
    # Raise the class-level timeouts (in seconds) before creating the filesystem
    s3fs.S3FileSystem.connect_timeout = 1800
    s3fs.S3FileSystem.read_timeout = 1800
    with s3fs.S3FileSystem(anon=False).open(f"s3://{bucket}/{s3_file}", 'rb') as f:
        data = pd.read_json(f)
The stacktrace is the following:
Traceback (most recent call last):
File "/var/task/urllib3/connection.py", line 157, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/var/task/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/var/task/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/task/botocore/httpsession.py", line 263, in send
chunked=self._chunked(request.headers),
File "/var/task/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/var/task/urllib3/util/retry.py", line 376, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/var/task/urllib3/packages/six.py", line 735, in reraise
raise value
File "/var/task/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/var/task/urllib3/connectionpool.py", line 376, in _make_request
self._validate_conn(conn)
File "/var/task/urllib3/connectionpool.py", line 994, in _validate_conn
conn.connect()
File "/var/task/urllib3/connection.py", line 300, in connect
conn = self._new_conn()
File "/var/task/urllib3/connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPSConnection object at 0x7f4d578e3ed0>: Failed to establish a new connection: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/task/botocore/endpoint.py", line 200, in _do_get_response
http_response = self._send(request)
File "/var/task/botocore/endpoint.py", line 244, in _send
return self.http_session.send(request)
File "/var/task/botocore/httpsession.py", line 283, in send
raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://my_bucket.s3.eu-west-1.amazonaws.com/?list-type=2&prefix=my_folder%2Fsomething%2F&delimiter=%2F&encoding-type=url"
Has anyone faced this issue? Why would it fail only sometimes? Is there an s3fs configuration that could help with this specific issue?
It turned out there was no problem with s3fs at all. Our Lambda function was configured with two subnets within the VPC: one worked normally, but the other was not allowed to access S3 resources. Whenever a Lambda instance was spawned in the second subnet, it could not connect at all, which is why the failure was intermittent.
Fixing the issue was as easy as removing the second subnet.
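To find out which subnet is the problematic one, a rough heuristic is to check each subnet's route table for a NAT gateway or a VPC endpoint route. The sketch below is only an illustration: the function name is hypothetical, and it assumes its inputs are the SubnetIds list from the Lambda function's VpcConfig and the RouteTables list as returned by the EC2 describe_route_tables call. It ignores security groups, network ACLs and endpoint policies, any of which could also block S3.

```python
def subnets_without_s3_route(subnet_ids, route_tables):
    """Return the subnet IDs whose route table has no NAT gateway or VPC endpoint route.

    Heuristic only: `route_tables` is assumed to have the shape of the
    "RouteTables" list returned by EC2 describe_route_tables.
    """
    def has_route_to_s3(table):
        for route in table.get("Routes", []):
            gateway = route.get("GatewayId") or ""
            if route.get("NatGatewayId") or gateway.startswith("vpce-"):
                return True
        return False

    explicit = {}      # subnet ID -> explicitly associated route table
    main_table = None  # used by subnets without an explicit association
    for table in route_tables:
        for assoc in table.get("Associations", []):
            if assoc.get("Main"):
                main_table = table
            if "SubnetId" in assoc:
                explicit[assoc["SubnetId"]] = table

    suspicious = []
    for subnet_id in subnet_ids:
        table = explicit.get(subnet_id, main_table)
        if table is None or not has_route_to_s3(table):
            suspicious.append(subnet_id)
    return suspicious
```

Any subnet this returns is a candidate for either removal from the Lambda configuration, as done above, or for a route to S3.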
You could also use boto3, which is supported by AWS, to get the JSON from S3.
import json
import boto3

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    s3_resource = boto3.resource('s3')
    file_object = s3_resource.Object(bucket, key)
    # Read the object body and parse it as JSON
    json_content = json.loads(file_object.get()['Body'].read())
I am using Celery with SQS as a broker, and I am trying to renew my credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) before they expire. The first time I run the task it succeeds, but after 15 minutes the token expires even though the credentials have been renewed. The function that updates the credentials is as follows:
import os
import boto3
from celery import Celery
from kombu.utils.url import safequote

def update_aws_credentials():
    role_info = {
        'RoleArn': f"arn:aws:iam::{os.environ['AWS_ACCOUNT_NUMER']}:role/my_role_execution",
        'RoleSessionName': 'roleExecution',
        'DurationSeconds': 900
    }
    sts_client = boto3.client('sts', region_name='eu-central-1')
    credentials = sts_client.assume_role(**role_info)
    aws_access_key_id = credentials["Credentials"]['AccessKeyId']
    aws_secret_access_key = credentials["Credentials"]['SecretAccessKey']
    aws_session_token = credentials["Credentials"]["SessionToken"]
    os.environ["AWS_ACCESS_KEY_ID"] = aws_access_key_id
    os.environ["AWS_SECRET_ACCESS_KEY"] = aws_secret_access_key
    os.environ["AWS_DEFAULT_REGION"] = 'eu-central-1'
    os.environ["AWS_SESSION_TOKEN"] = aws_session_token
    return aws_access_key_id, aws_secret_access_key

def get_celery(aws_access_key_id, aws_secret_access_key):
    broker = f"sqs://{safequote(aws_access_key_id)}:{safequote(aws_secret_access_key)}#"
    backend = 'redis://redis-service:6379/0'
    celery = Celery(f"my_task", broker=broker, backend=backend)
    celery.conf["broker_transport_options"] = {
        'polling_interval': 30,
        'region': 'eu-central-1',
        'predefined_queues': {
            f"my_queue": {
                'url': f"https://sqs.eu-central-1.amazonaws.com/{os.environ['AWS_ACCOUNT_NUMER']}/my_queue"
            }
        }
    }
    celery.conf["task_default_queue"] = f"my_queue"
    return celery

def refresh_sqs_credentials():
    access, secret = update_aws_credentials()
    return get_celery(access, secret)
Running refresh_sqs_credentials creates new credentials:
celery = worker.refresh_sqs_credentials()
And then I run my task with celery:
task = celery.send_task('my_task.code_of_my_task', args=[content], task_id=task_id)
All tasks that I run within the first 15 minutes finish successfully, but after 15 minutes I get the following error:
[2021-12-14 14:08:15,637] ERROR in app: Exception on /tasks/run [POST]
Traceback (most recent call last):
File "/api/app.py", line 87, in post
task = celery.send_task('glgt_ap35080_dev_sqs_runalgo.allocation_alg_task', args=[content], task_id=task_id)
File "/usr/local/lib/python3.6/site-packages/celery/app/base.py", line 717, in send_task
amqp.send_task_message(P, name, message, **options)
File "/usr/local/lib/python3.6/site-packages/celery/app/amqp.py", line 547, in send_task_message
**properties
File "/usr/local/lib/python3.6/site-packages/kombu/messaging.py", line 178, in publish
exchange_name, declare,
File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 525, in _ensured
return fun(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/kombu/messaging.py", line 200, in _publish
mandatory=mandatory, immediate=immediate,
File "/usr/local/lib/python3.6/site-packages/kombu/transport/virtual/base.py", line 605, in basic_publish
return self._put(routing_key, message, **kwargs)
File "/usr/local/lib/python3.6/site-packages/kombu/transport/SQS.py", line 294, in _put
c.send_message(**kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 337, in _api_call
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 656, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ExpiredToken) when calling the SendMessage operation: The security token included in the request is expired
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/local/lib/python3.6/site-packages/flask_restplus/api.py", line 325, in wrapper
resp = resource(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/flask/views.py", line 88, in view
return self.dispatch_request(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/flask_restplus/resource.py", line 44, in dispatch_request
resp = meth(*args, **kwargs)
File "/api/app.py", line 90, in post
abort(500)
File "/usr/local/lib/python3.6/site-packages/werkzeug/exceptions.py", line 774, in abort
return _aborter(status, *args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/werkzeug/exceptions.py", line 755, in __call__
raise self.mapping[code](*args, **kwargs)
werkzeug.exceptions.InternalServerError: 500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
10.142.95.217 - - [14/Dec/2021 14:08:15] "POST /tasks/run HTTP/1.1" 500 -
I'm storing the credentials in environment variables, and I don't understand why they expire after 15 minutes. Can someone help me, please?
The versions of the packages used are:
boto3==1.14.54
celery==5.0.0
kombu==5.0.2
pycurl==7.43.0.6
Thank you
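The post does not say what ultimately fixed this. One plausible cause, stated here purely as an assumption, is that a Celery app built with the first set of temporary credentials keeps being reused after they expire (DurationSeconds is 900, i.e. 15 minutes). A minimal sketch of a holder that rebuilds a resource shortly before its expiry could look like the following; ExpiringResource, its parameters and the two-minute margin are hypothetical and not part of Celery, kombu or boto3.

```python
from datetime import datetime, timedelta, timezone


class ExpiringResource:
    """Cache a resource and rebuild it shortly before it expires.

    `factory` is a caller-supplied function returning a tuple of
    (resource, expiration), where expiration is a timezone-aware datetime.
    """

    def __init__(self, factory, margin=timedelta(minutes=2), now=None):
        self._factory = factory
        self._margin = margin
        # The clock is injectable so the refresh logic can be tested offline
        self._now = now or (lambda: datetime.now(timezone.utc))
        self._resource = None
        self._expires_at = None

    def get(self):
        # Rebuild on first use, or once we are inside the safety margin
        if self._resource is None or self._now() >= self._expires_at - self._margin:
            self._resource, self._expires_at = self._factory()
        return self._resource
```

With the code above, the factory could call assume_role, build the Celery app with get_celery, and return it together with credentials["Credentials"]["Expiration"], which the assume_role response includes as a datetime. The request handler would then call holder.get().send_task(...) instead of keeping one app object for the lifetime of the process.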
I'm testing this locally where I have a ~/.aws/config file.
~/.aws/config looks something like:
[profile a]
...
[profile b]
...
I also have the AWS_PROFILE environment variable set to "a".
I would like to use pandas to read a file that is accessible with profile b.
I am able to access it through s3fs by doing:
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(profile="b")
# Download the file locally, then read the local copy
fs.get("BUCKET/FILE.parquet", "FILE.parquet")
pd.read_parquet("FILE.parquet")
However, if I try to pass this to pd.read_parquet using storage_options I get a PermissionError: Forbidden.
pd.read_parquet(
"s3://BUCKET/FILE.parquet",
storage_options={"profile": "b"},
)
Full traceback below:
Traceback (most recent call last):
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 233, in _call_s3
out = await method(**additional_kwargs)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/aiobotocore/client.py", line 154, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
return impl.read(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
return self.api.parquet.read_table(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/parquet.py", line 1672, in read_table
dataset = _ParquetDatasetV2(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/parquet.py", line 1504, in __init__
if filesystem.get_file_info(path_or_paths).is_file:
File "pyarrow/_fs.pyx", line 438, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/_fs.pyx", line 1004, in pyarrow._fs._cb_get_file_info
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/fs.py", line 226, in get_file_info
info = self.fs.info(path)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 72, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in sync
raise result[0]
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 20, in _runner
result[0] = await coro
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 911, in _info
out = await self._call_s3(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 252, in _call_s3
raise translate_boto_error(err)
PermissionError: Forbidden
Note: there is an old question somewhat related to this but it didn't help: How to read parquet file from s3 using dask with specific AWS profile
You just need to add the following argument to the function:
storage_options=dict(profile='your_profile_name')
Hence the read statement is:
pd.read_parquet("s3://your_bucket", storage_options=dict(profile='your_profile_name'))
I have an Anaconda environment with Selenium installed. When I try to run my script, I get this error:
Traceback (most recent call last):
File "c:\Users\Nick\Desktop\Code\product-scraper\sephora-scraper\scraper.py", line 31, in <module>
ChromeDriverManager().install(), options=options)
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\webdriver_manager\chrome.py", line 34, in install
driver_path = self._get_driver_path(self.driver)
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\webdriver_manager\manager.py", line 21, in _get_driver_path
driver_version = driver.get_version()
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\webdriver_manager\driver.py", line 40, in get_version
return self.get_latest_release_version()
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\webdriver_manager\driver.py", line 63, in get_latest_release_version
resp = requests.get(f"{self._latest_release_url}_{self.browser_version}")
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\requests\sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Nick\anaconda3\envs\web-scraper\lib\site-packages\requests\adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='chromedriver.storage.googleapis.com', port=443): Max retries exceeded with url: /LATEST_RELEASE_88.0.4324 (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))
I'm new to Anaconda, so I don't know what else to provide. Please leave a comment if I need to add anything and I will add it right away. Thanks.
Try adding these paths to your PATH environment variable:
..\Anaconda3
..\Anaconda3\scripts
..\Anaconda3\Library\bin
You might need to restart Windows after setting up the environment path.
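To check whether the change took effect, a small diagnostic along these lines may help. It is only a sketch: both function names are hypothetical, and the Anaconda root passed in is whatever your installation actually uses.

```python
import ntpath  # Windows path rules, regardless of where this runs


def ssl_module_available():
    """Return True if the interpreter can import the ssl module."""
    try:
        import ssl  # noqa: F401
    except ImportError:
        return False
    return True


def missing_conda_path_entries(path_value, conda_root):
    """Return the expected Anaconda directories that are absent from a PATH string."""
    expected = [
        conda_root,
        ntpath.join(conda_root, "Scripts"),
        ntpath.join(conda_root, "Library", "bin"),
    ]

    def normalize(entry):
        # Case-insensitive comparison, tolerant of "/" and trailing separators
        return ntpath.normcase(ntpath.normpath(entry))

    present = {normalize(entry) for entry in path_value.split(";") if entry}
    return [entry for entry in expected if normalize(entry) not in present]
```

Inside the environment, calling ssl_module_available() and missing_conda_path_entries(os.environ["PATH"], r"C:\Users\Nick\anaconda3") (after importing os) would show whether the SSL module loads and which directories are still missing.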
I have the following (simplified) code:
import os
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.pdf995.com/samples/pdf.pdf', ]

    def parse(self, response):
        save_path = 'test'
        file_name = 'test.pdf'
        self.save_page(response, save_path, file_name)

    def save_page(self, response, save_dir, file_name):
        os.makedirs(save_dir, exist_ok=True)
        with open(os.path.join(save_dir, file_name), 'wb') as afile:
            afile.write(response.body)
When I run it, I get this error:
[scrapy.core.scraper] ERROR: Error downloading <GET http://www.pdf995.com/samples/pdf.pdf>
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "C:\Python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1278, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://www.pdf995.com/samples/pdf.pdf>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "C:\Python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 53, in process_response
spider=spider)
File "C:\Python36\lib\site-packages\scrapy_beautifulsoup\middleware.py", line 16, in process_response
return response.replace(body=str(BeautifulSoup(response.body, self.parser)))
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 79, in replace
return cls(*args, **kwargs)
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 20, in __init__
self._set_body(body)
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 55, in _set_body
"Response body must be bytes. "
TypeError: Response body must be bytes. If you want to pass unicode body use TextResponse or HtmlResponse.
Do I need to introduce a middleware or something to handle this? This looks like it should be valid, at least by other examples.
Note: at the moment I'm not using a pipeline because in my real spider I have a lot of checks: whether the related item has been scraped, whether this PDF belongs to the item, and whether a PDF with a custom name has already been downloaded. And as mentioned, many samples did what I'm doing here, so I thought it would be easier and would work.
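The issue is caused by your scrapy_beautifulsoup\middleware.py, which replaces the response body with return response.replace(body=str(BeautifulSoup(response.body, self.parser))). str(...) produces a text string, but the body of a plain Response must be bytes, hence the TypeError.
You need to correct that middleware, and that should fix the issue.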
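One way to correct it, sketched here as an assumption rather than as the package's actual code, is to run BeautifulSoup only on HTML responses and to encode the result back to bytes. The class and helper names below are hypothetical.

```python
def is_html(content_type):
    """True when a raw Content-Type header value (bytes or None) denotes HTML."""
    if not content_type:
        return False
    mime = content_type.split(b";")[0].strip().lower()
    return mime in (b"text/html", b"application/xhtml+xml")


def to_body_bytes(markup, encoding="utf-8"):
    """Render markup as text, then encode it so it is a valid Response body."""
    return str(markup).encode(encoding)


class HtmlOnlySoupMiddleware:
    """Downloader middleware sketch: leave non-HTML responses (e.g. PDFs) untouched."""

    parser = "html.parser"

    def process_response(self, request, response, spider):
        if not is_html(response.headers.get("Content-Type")):
            return response
        # Imported lazily so non-HTML downloads never depend on bs4
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(response.body, self.parser)
        return response.replace(body=to_body_bytes(soup, response.encoding))
```

With this in place of the original entry in DOWNLOADER_MIDDLEWARES, the PDF response reaches parse() unchanged and response.body can be written to disk as in the spider above.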
I am new to Python and would like to know an alternative way of doing the following.
I am having an issue with Paramiko's exec_command...
Following is the code:
import paramiko

sshdell = paramiko.SSHClient()
sshdell.set_missing_host_key_policy(paramiko.AutoAddPolicy())
sshdell.connect('ip', port=22, username='user', password='pwd')
stdin, stdout, stderr = sshdell.exec_command("ping 4.2.2.2 interface X1")
ping_check = stdout.readlines()
for line in ping_check:
    print(line)
The following error is thrown:
Traceback (most recent call last):
File "delltest.py", line 36, in <module>
stdin,stdout,stderr = sshdell.exec_command("ping 4.2.2.2 interface X1")
File "C:\python35\lib\site-packages\paramiko\client.py", line 441, in exec_command
chan.exec_command(command)
File "C:\python35\lib\site-packages\paramiko\channel.py", line 60, in _check
return func(self, *args, **kwds)
File "C:\python35\lib\site-packages\paramiko\channel.py", line 234, in exec_command
self._wait_for_event()
File "C:\python35\lib\site-packages\paramiko\channel.py", line 1161, in _wait_for_event
raise e
paramiko.ssh_exception.SSHException: Channel closed.
Please suggest an alternative, as my device may not support the exec_command() function.
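If the device really does reject exec requests, one commonly used alternative is an interactive shell channel obtained with SSHClient.invoke_shell(). The helper below is a minimal sketch under that assumption: run_in_shell is a hypothetical name, the idle timeout is a guess, and the device's prompts or paging are not handled.

```python
import time


def run_in_shell(channel, command, idle_seconds=2.0, poll_interval=0.1):
    """Send a command over an interactive shell channel and collect its output.

    Reading stops once no new data has arrived for `idle_seconds`.
    `channel` is expected to behave like the object returned by
    paramiko's SSHClient.invoke_shell().
    """
    channel.send((command + "\n").encode())
    chunks = []
    last_data = time.monotonic()
    while time.monotonic() - last_data < idle_seconds:
        if channel.recv_ready():
            chunks.append(channel.recv(65535))
            last_data = time.monotonic()
        else:
            time.sleep(poll_interval)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Usage with the client from the question (not executed here):
# chan = sshdell.invoke_shell()
# print(run_in_shell(chan, "ping 4.2.2.2 interface X1"))
```

The output will usually include the login banner and the echoed command, so some trimming may be needed before parsing the ping result.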