Error downloading PDF files - scrapy

I have the following (simplified) code:
import os
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.pdf995.com/samples/pdf.pdf', ]

    def parse(self, response):
        save_path = 'test'
        file_name = 'test.pdf'
        self.save_page(response, save_path, file_name)

    def save_page(self, response, save_dir, file_name):
        os.makedirs(save_dir, exist_ok=True)
        with open(os.path.join(save_dir, file_name), 'wb') as afile:
            afile.write(response.body)
When I run it, I get this error:
[scrapy.core.scraper] ERROR: Error downloading <GET http://www.pdf995.com/samples/pdf.pdf>
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "C:\Python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1278, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://www.pdf995.com/samples/pdf.pdf>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "C:\Python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 53, in process_response
spider=spider)
File "C:\Python36\lib\site-packages\scrapy_beautifulsoup\middleware.py", line 16, in process_response
return response.replace(body=str(BeautifulSoup(response.body, self.parser)))
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 79, in replace
return cls(*args, **kwargs)
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 20, in __init__
self._set_body(body)
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 55, in _set_body
"Response body must be bytes. "
TypeError: Response body must be bytes. If you want to pass unicode body use TextResponse or HtmlResponse.
Do I need to introduce a middleware or something else to handle this? This looks like it should be valid, at least judging by other examples.
Note: at the moment I'm not using a pipeline because in my real spider I have a lot of checks on whether the related item has been scraped, validation that the PDF belongs to the item, and a check of the PDF's custom name to see whether it was already downloaded. And, as mentioned, many samples do what I'm doing here, so I thought it would be simpler and would work.

The issue is caused by your own scrapy_beautifulsoup\middleware.py, which runs every response through return response.replace(body=str(BeautifulSoup(response.body, self.parser))). BeautifulSoup turns the PDF body into a str, and a plain (non-text) Response refuses a unicode body, hence the TypeError shown above.
You need to correct that middleware so it only touches HTML responses, and that should fix the issue.
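A minimal sketch of such a guard, assuming the middleware looks roughly like the one in the traceback (the class name and constructor here are assumptions):
from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse


class BeautifulSoupMiddleware:
    def __init__(self, parser='html.parser'):
        self.parser = parser

    def process_response(self, request, response, spider):
        # Only run BeautifulSoup over HTML responses; binary responses such as
        # the PDF above pass through untouched, which avoids the
        # "Response body must be bytes" TypeError.
        if not isinstance(response, HtmlResponse):
            return response
        return response.replace(body=str(BeautifulSoup(response.body, self.parser)))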

External ID not found

Odoo Server Error
Traceback (most recent call last):
File "/home/odoo/src/odoo/14.0/odoo/tools/cache.py", line 85, in lookup
r = d[key]
File "/home/odoo/src/odoo/14.0/odoo/tools/func.py", line 71, in wrapper
return func(self, *args, **kwargs)
File "/home/odoo/src/odoo/14.0/odoo/tools/lru.py", line 34, in getitem
a = self.d[obj]
KeyError: ('ir.model.data', <function IrModelData.xmlid_lookup at 0x7f5794273a60>, 'account.account_invoices_without_payment')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/odoo/src/odoo/14.0/addons/web/controllers/main.py", line 2121, in report_download
response = self.report_routes(reportname, docids=docids, converter=converter, context=context)
File "/home/odoo/src/odoo/14.0/odoo/http.py", line 532, in response_wrap
response = f(*args, **kw)
File "/home/odoo/src/odoo/14.0/addons/web/controllers/main.py", line 2056, in report_routes
pdf = report.with_context(context)._render_qweb_pdf(docids, data=data)[0]
File "/home/odoo/src/odoo/14.0/addons/account/models/ir_actions_report.py", line 41, in _render_qweb_pdf
invoice_reports = (self.env.ref('account.account_invoices_without_payment'), self.env.ref('account.account_invoices'))
File "/home/odoo/src/odoo/14.0/odoo/api.py", line 511, in ref
return self['ir.model.data'].xmlid_to_object(xml_id, raise_if_not_found=raise_if_not_found)
File "/home/odoo/src/odoo/14.0/odoo/addons/base/models/ir_model.py", line 1944, in xmlid_to_object
t = self.xmlid_to_res_model_res_id(xmlid, raise_if_not_found)
File "/home/odoo/src/odoo/14.0/odoo/addons/base/models/ir_model.py", line 1928, in xmlid_to_res_model_res_id
return self.xmlid_lookup(xmlid)[1:3]
File "", line 2, in xmlid_lookup
File "/home/odoo/src/odoo/14.0/odoo/tools/cache.py", line 90, in lookup
value = d[key] = self.method(*args, **kwargs)
File "/home/odoo/src/odoo/14.0/odoo/addons/base/models/ir_model.py", line 1921, in xmlid_lookup
raise ValueError('External ID not found in the system: %s' % xmlid)
ValueError: External ID not found in the system:
account.account_invoices_without_payment
The error occurs when I try to print an invoice. It happens even if I choose an empty print template. Any help? Thanks.
In my opinion, you should check the ir_model_data table for a record with module 'account' and name 'account_invoices_without_payment'. If you can find it, update the account module. If you can't find it, insert a new record into ir_model_data with that module and name, and with res_id set to the id of the account_invoices_without_payment view in ir_ui_view.
Maybe that helps.
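A minimal sketch of that check from an Odoo shell (./odoo-bin shell -d your_db, where your_db is a placeholder for your database name):
# Look up the external ID in ir.model.data; an empty recordset means it is missing.
rec = env['ir.model.data'].search([
    ('module', '=', 'account'),
    ('name', '=', 'account_invoices_without_payment'),
])
print(rec.res_id if rec else 'external ID is missing')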
Please upgrade the account module and make sure that you are using the correct database. You can set db-filter to choose the database correctly.
After checking in Settings -> External IDs, I found that somehow this external ID had been deleted for an unknown reason. I opened a new database, compared the two, and confirmed that this was the case; then I created a new external ID according to the new database's values.

use AWS_PROFILE in pandas.read_parquet

I'm testing this locally where I have a ~/.aws/config file.
~/.aws/config looks some thing like:
[profile a]
...
[profile b]
...
I also have an AWS_PROFILE environment variable set to "a".
I would like to use pandas to read a file that is accessible with profile b.
I am able to access it through s3fs by doing:
import s3fs
fs = s3fs.S3FileSystem(profile="b")
fs.get("BUCKET/FILE.parquet", "FILE.parquet")
pd.read_parquet("FILE.parquet")
However, if I try to pass this to pd.read_parquet using storage_options I get a PermissionError: Forbidden.
pd.read_parquet(
    "s3://BUCKET/FILE.parquet",
    storage_options={"profile": "b"},
)
full Traceback below
Traceback (most recent call last):
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 233, in _call_s3
out = await method(**additional_kwargs)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/aiobotocore/client.py", line 154, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
return impl.read(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
return self.api.parquet.read_table(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/parquet.py", line 1672, in read_table
dataset = _ParquetDatasetV2(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/parquet.py", line 1504, in __init__
if filesystem.get_file_info(path_or_paths).is_file:
File "pyarrow/_fs.pyx", line 438, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/_fs.pyx", line 1004, in pyarrow._fs._cb_get_file_info
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/pyarrow/fs.py", line 226, in get_file_info
info = self.fs.info(path)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 72, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in sync
raise result[0]
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/fsspec/asyn.py", line 20, in _runner
result[0] = await coro
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 911, in _info
out = await self._call_s3(
File "/home/ray/local/bin/anaconda3/envs/main/lib/python3.8/site-packages/s3fs/core.py", line 252, in _call_s3
raise translate_boto_error(err)
PermissionError: Forbidden
Note: there is an old question somewhat related to this but it didn't help: How to read parquet file from s3 using dask with specific AWS profile
You just need to add the following argument to the function:
storage_options=dict(profile='your_profile_name')
Hence the read statement is:
pd.read_parquet("s3://your_bucket",storage_options=dict(profile='your_profile_name'))

s3fs timeout issue on an AWS Lambda function within a VPN

s3fs seems to fail from time to time when reading from an S3 bucket using an AWS Lambda function within a VPN. I am using s3fs==0.4.0 and pandas==1.0.1.
import s3fs
import pandas as pd


def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    s3_file = event['Records'][0]['s3']['object']['key']
    s3fs.S3FileSystem.connect_timeout = 1800
    s3fs.S3FileSystem.read_timeout = 1800
    with s3fs.S3FileSystem(anon=False).open(f"s3://{bucket}/{s3_file}", 'rb') as f:
        data = pd.read_json(f)
The stacktrace is the following:
Traceback (most recent call last):
File "/var/task/urllib3/connection.py", line 157, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/var/task/urllib3/util/connection.py", line 84, in create_connection
raise err
File "/var/task/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/task/botocore/httpsession.py", line 263, in send
chunked=self._chunked(request.headers),
File "/var/task/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/var/task/urllib3/util/retry.py", line 376, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/var/task/urllib3/packages/six.py", line 735, in reraise
raise value
File "/var/task/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/var/task/urllib3/connectionpool.py", line 376, in _make_request
self._validate_conn(conn)
File "/var/task/urllib3/connectionpool.py", line 994, in _validate_conn
conn.connect()
File "/var/task/urllib3/connection.py", line 300, in connect
conn = self._new_conn()
File "/var/task/urllib3/connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPSConnection object at 0x7f4d578e3ed0>: Failed to establish a new connection: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/task/botocore/endpoint.py", line 200, in _do_get_response
http_response = self._send(request)
File "/var/task/botocore/endpoint.py", line 244, in _send
return self.http_session.send(request)
File "/var/task/botocore/httpsession.py", line 283, in send
raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://my_bucket.s3.eu-west-1.amazonaws.com/?list-type=2&prefix=my_folder%2Fsomething%2F&delimiter=%2F&encoding-type=url"
Has someone faced this same issue? Why would it fail only sometimes? Is there a s3fs configuration that could help for this specific issue?
Actually there was no problem at all with s3fs. It turned out we were using a Lambda function with two subnets within the VPC; one was working normally, but the other wasn't allowed to access S3 resources, so whenever a Lambda instance was spawned on the second subnet it couldn't connect at all.
Fixing this issue was as easy as removing the second subnet.
You could also use boto3, which is supported by AWS, to get the JSON from S3.
import json
import boto3


def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    s3_resource = boto3.resource('s3')
    file_object = s3_resource.Object(bucket, key)
    json_content = json.loads(file_object.get()['Body'].read())

sqlalchemy.orm.exc.DetachedInstanceError depending on Pyramid webapp serving method?

This error is thrown only on the production Apache2 server, and not at all when using the pserve method to test the project locally:
sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <My_Table at 0x7fc82076d2d0> is not bound to a Session; lazy load operation of attribute 'my_table_relationship' cannot proceed
The crashing code in the template is:
<tal:x repeat="oMot_cn oMot_mcnii_cni.r_mcnii_cp.r_newphonets">
This kind of chained-relationship pattern had been working well so far; I am just trying to load the whole query object at launch time of the Pyramid app, instead of loading a new one each time a request is handled by the view/controller...
It keeps crashing in production while it does not crash under pserve dev testing, so it's hard to debug.
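For reference, a rough sketch of this kind of launch-time loading (simplified, not my actual code; the import path, settings key, and session handling are placeholders):
from pyramid.config import Configurator
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from .models import My_Table  # placeholder import path for the model


def main(global_config, **settings):
    engine = create_engine(settings['sqlalchemy.url'])
    session = sessionmaker(bind=engine)()
    # Rows loaded here are bound to this startup session.  Once it is closed
    # (or its transaction ends), touching a lazy relationship such as
    # r_newphonets on one of these instances raises DetachedInstanceError.
    preloaded_rows = session.query(My_Table).all()
    session.close()

    config = Configurator(settings=settings)
    config.registry.settings['preloaded_rows'] = preloaded_rows
    return config.make_wsgi_app()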
Whole traceback on production server:
mod_wsgi (pid=15170): Exception occurred processing WSGI script '/var/local/env_py3/env_py3_2an/MyProject/myproject.wsgi'.
Traceback (most recent call last):
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/chameleon/template.py", line 170, in render
self._render(stream, econtext, rcontext)
File "tp_lesson_T_ex_mots_from_audio_765c6cd23ec517b3893410ce969cf2d9.py", line 584, in render
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/chameleon/py26.py", line 5, in lookup_attr
return getattr(obj, key)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/sqlalchemy/orm/attributes.py", line 233, in __get__
return self.impl.get(instance_state(instance), dict_)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/sqlalchemy/orm/attributes.py", line 579, in get
value = self.callable_(state, passive)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/sqlalchemy/orm/strategies.py", line 479, in _load_for_state
(orm_util.state_str(state), self.key)
sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <Caract_phonet at 0x7fc82076d2d0> is not bound to a Session; lazy load operation of attribute 'r_newphonets' cannot proceed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/fanstatic/publisher.py", line 219, in __call__
return self.app(environ, start_response)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/fanstatic/injector.py", line 64, in __call__
response = request.get_response(self.app)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/webob/request.py", line 1296, in send
application, catch_exc_info=False)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/webob/request.py", line 1260, in call_application
app_iter = application(self.environ, start_response)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/pyramid/router.py", line 251, in __call__
response = self.invoke_subrequest(request, use_tweens=True)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/pyramid/router.py", line 227, in invoke_subrequest
response = handle_request(request)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/pyramid/tweens.py", line 21, in excview_tween
response = handler(request)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/pyramid_tm-0.7-py3.2.egg/pyramid_tm/__init__.py", line 82, in tm_tween
reraise(*exc_info)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/pyramid_tm-0.7-py3.2.egg/pyramid_tm/compat.py", line 13, in reraise
raise value
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/pyramid_tm-0.7-py3.2.egg/pyramid_tm/__init__.py", line 63, in tm_tween
response = handler(request)
File "/var/local/env_py3/env_py3_2an/lib/python3.2/site-packages/pyramid/router.py", line 161, in handle_request
response = view_callable(context, request)
Is there any missing PasteDeploy configuration that would make it crash the same way locally as it does once it's pushed to the production server?

Error saving crawled page using file_urls and ITEM_PIPELINES: Missing scheme in request url: h

I'm trying to have scrapy download a copy of each page it crawls, but when I run my spider the log contains entries like
2016-06-20 15:39:12 [scrapy] ERROR: Error processing {'file_urls': 'http://example.com/page',
'title': u'PageTitle'}
Traceback (most recent call last):
File "c:\anaconda3\envs\scrapy\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\anaconda3\envs\scrapy\lib\site-packages\scrapy\pipelines\media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "c:\anaconda3\envs\scrapy\lib\site-packages\scrapy\pipelines\files.py", line 365, in get_media_requests
return [Request(x) for x in item.get(self.files_urls_field, [])]
File "c:\anaconda3\envs\scrapy\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
self._set_url(url)
File "c:\anaconda3\envs\scrapy\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
Other questions on SO about this error seem to relate to problems with the start_urls, but my start URL is fine: the spider crawls across the site, it just doesn't save the page to my specified FILES_STORE.
I populate file_urls using item['file_urls'] = response.url
Do I need to specify the URL a different way?
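Judging from the traceback above, where get_media_requests builds [Request(x) for x in item.get(self.files_urls_field, [])], the pipeline iterates over the field, so a bare string is consumed character by character and the first request becomes Request('h'). A sketch of the likely fix is to store the URL in a list:
# FilesPipeline expects file_urls to be a list of URLs; wrap the single URL.
item['file_urls'] = [response.url]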