Scrapyd with Polipo and Tor - scrapy

UPDATE: I am now running this command:
scrapyd-deploy <project_name>
And getting this error:
504 Connect to localhost:8123 failed: General SOCKS server failure
I am trying to deploy my Scrapy spider through scrapyd-deploy. The following is the command I use:
scrapyd-deploy -L <project_name>
I get the following error message:
Traceback (most recent call last):
File "/usr/local/bin/scrapyd-deploy", line 269, in <module>
main()
File "/usr/local/bin/scrapyd-deploy", line 74, in main
f = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not found
The following is my scrapy.cfg file:
[settings]
default = <project_name>.settings
[deploy:<project_name>]
url = http://localhost:8123
project = <project_name>
eggs_dir = eggs
logs_dir = logs
items_dir = items
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5
http_port = 8123
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
I am running Tor and Polipo, with the Polipo proxy listening on http://localhost:8123. I can perform a wget and download that page without any problems. The proxy is working correctly: I can connect to the internet and so on. Please ask if you need more clarification.
Thanks!

urllib2.HTTPError: HTTP Error 404: Not found
The URL is not being reached.

Anything interesting in /var/log/polipo/polipo.log? What does tail -100 /var/log/polipo/polipo.log show?

Apparently this happened because I forgot to run the main command. It is easy to miss, because it is mentioned on the Overview page of the documentation rather than on the Deployment page. The command is:
scrapyd
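In other words, the scrapyd server has to be running before scrapyd-deploy is invoked against it; with a default configuration it listens on http://localhost:6800. Roughly:
scrapyd                        # start the server and leave it running (e.g. in another terminal)
scrapyd-deploy <project_name>  # then deploy from the directory containing scrapy.cfg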

504 Connect to localhost:8123 failed: General SOCKS server failure
You're asking Polipo to connect to localhost:8123; Polipo passes the request to Tor, which returns a failure result that Polipo dutifully passes back to you ("General SOCKS server failure").
url = http://localhost:8123
This is certainly not what you meant.
http_port = 8123
I'm also pretty sure you didn't want to run scrapyd on the same port as Polipo.
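A minimal sketch of how the two could be kept apart, assuming scrapyd stays on its default port 6800 while Polipo keeps 8123 (adjust the port to wherever your scrapyd instance actually listens, and keep the scrapyd settings under a [scrapyd] section):
[deploy:<project_name>]
url = http://localhost:6800/
project = <project_name>

[scrapyd]
http_port = 6800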

Related

Why am I getting a 503 from the Amadeus airport API?

I am using the Amadeus Python 3 client, and I get a 503 when calling this method:
response = amadeus.reference_data.locations.airports.get(
    longitude=long,
    latitude=lat)
Stack trace:
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/amadeus/reference_data/locations/_airports.py", line 25, in get
return self.client.get(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/amadeus/mixins/http.py", line 40, in get
return self.request('GET', path, params)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/amadeus/mixins/http.py", line 110, in request
return self._unauthenticated_request(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/amadeus/mixins/http.py", line 126, in _unauthenticated_request
return self.__execute(request)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/amadeus/mixins/http.py", line 152, in __execute
response._detect_error(self)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/amadeus/mixins/parser.py", line 16, in _detect_error
self.__raise_error(error, client)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/amadeus/mixins/parser.py", line 67, in __raise_error
raise error
amadeus.client.errors.ServerError: [503]
Is the API down at the moment?
A 503 is a server error. If the API has a status page, check that; if not, reach out to whoever runs it, if that is known. Beyond that, only those with access to the servers can tell you why.
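Since a 503 is usually transient, the only real client-side mitigation is to retry with a short backoff. A minimal sketch, assuming the amadeus SDK's ResponseError exposes the underlying HTTP response; the credentials and coordinates below are placeholders:
import time
from amadeus import Client, ResponseError  # ServerError (the 503) subclasses ResponseError

amadeus = Client(client_id='YOUR_KEY', client_secret='YOUR_SECRET')

for attempt in range(3):
    try:
        response = amadeus.reference_data.locations.airports.get(
            longitude=2.55, latitude=49.0)
        break
    except ResponseError as error:
        # Retry only on 503s, with exponential backoff; re-raise anything else
        status = getattr(error.response, 'status_code', None)
        if status == 503 and attempt < 2:
            time.sleep(2 ** attempt)
            continue
        raise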

Zappa packaged lambda error ..botocore.exceptions.SSLError: SSL validation failed for <s3 file> [Errno 2] No such file or directory

I am running an AWS Lambda service packaged using Zappa.io.
The service is running; however, it is not able to reach the S3 file due to an SSL error.
I get the error below while trying to access remote_env from an S3 bucket:
[1592935276008] [DEBUG] 2020-06-23T18:01:16.8Z b8374974-f820-484a-bcc3-64a530712769 Exception received when sending HTTP request.
Traceback (most recent call last):
File "/var/task/urllib3/util/ssl_.py", line 336, in ssl_wrap_socket
context.load_verify_locations(ca_certs, ca_cert_dir)
FileNotFoundError: [Errno 2] No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/runtime/botocore/httpsession.py", line 254, in send
urllib_response = conn.urlopen(
File "/var/task/urllib3/connectionpool.py", line 719, in urlopen
retries = retries.increment(
File "/var/task/urllib3/util/retry.py", line 376, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/var/task/six.py", line 703, in reraise
raise value
File "/var/task/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/var/task/urllib3/connectionpool.py", line 376, in _make_request
self._validate_conn(conn)
File "/var/task/urllib3/connectionpool.py", line 996, in _validate_conn
conn.connect()
File "/var/task/urllib3/connection.py", line 352, in connect
self.sock = ssl_wrap_socket(
File "/var/task/urllib3/util/ssl_.py", line 338, in ssl_wrap_socket
raise SSLError(e)
urllib3.exceptions.SSLError: [Errno 2] No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/runtime/botocore/endpoint.py", line 200, in _do_get_response
http_response = self._send(request)
File "/var/runtime/botocore/endpoint.py", line 244, in _send
return self.http_session.send(request)
File "/var/runtime/botocore/httpsession.py", line 281, in send
raise SSLError(endpoint_url=request.url, error=e)
botocore.exceptions.SSLError: SSL validation failed for ....... [Errno 2] No such file or directory
My Environment
Zappa version used: 0.51.0
Operating System and Python version: Ubuntu, Python 3.8
Output of pip freeze
appdirs==1.4.3
argcomplete==1.11.1
boto3==1.14.8
botocore==1.17.8
CacheControl==0.12.6
certifi==2019.11.28
cffi==1.14.0
cfn-flip==1.2.3
chardet==3.0.4
click==7.1.2
colorama==0.4.3
contextlib2==0.6.0
cryptography==2.9.2
distlib==0.3.0
distro==1.4.0
docutils==0.15.2
durationpy==0.5
Flask==1.1.2
Flask-Cors==3.0.8
future==0.18.2
h11==0.9.0
hjson==3.0.1
html5lib==1.0.1
httptools==0.1.1
idna==2.8
ipaddr==2.2.0
itsdangerous==1.1.0
Jinja2==2.11.2
jmespath==0.10.0
kappa==0.6.0
lockfile==0.12.2
mangum==0.9.2
MarkupSafe==1.1.1
msgpack==0.6.2
packaging==20.3
pep517==0.8.2
pip-tools==5.2.1
placebo==0.9.0
progress==1.5
pycparser==2.20
pydantic==1.5.1
PyMySQL==0.9.3
pyOpenSSL==19.1.0
pyparsing==2.4.6
python-dateutil==2.6.1
python-slugify==4.0.0
pytoml==0.1.21
PyYAML==5.3.1
requests==2.22.0
retrying==1.3.3
s3transfer==0.3.3
six==1.14.0
starlette==0.13.4
text-unidecode==1.3
toml==0.10.1
tqdm==4.46.1
troposphere==2.6.1
typing-extensions==3.7.4.2
urllib3==1.25.8
uvloop==0.14.0
webencodings==0.5.1
websockets==8.1
Werkzeug==0.16.1
wsgi-request-logger==0.4.6
zappa==0.51.0
My zappa_settings.json:
{
    "dev": {
        "app_function": "main.app",
        "aws_region": "us-west-2",
        "profile_name": "default",
        "project_name": "d3c",
        "runtime": "python3.8",
        "keep_warm": false,
        "cors": true,
        "s3_bucket": "my-lambda-deployables",
        "remote_env": "<my remote s3 file>"
    }
}
I have confirmed that my S3 file is accessible from my local Ubuntu machine; however, it does not work on AWS.
This seems to be related to an open issue on Zappa.
I had the same issue with my Zappa deployment.
I tried all possible options but nothing was working; after trying different suggestions, the following steps worked for me:
I copied python3.8/site-packages/botocore/cacert.pem to my lambda folder.
I set the "REQUESTS_CA_BUNDLE" environment variable to /var/task/cacert.pem.
/var/task is where AWS Lambda extracts your zipped-up code to.
How to set environment variables in Zappa
I updated my Zappa function and everything worked fine.
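For reference, this is roughly what that could look like in zappa_settings.json. It is a sketch only: I am assuming the aws_environment_variables key (native Lambda environment variables), so the variable is already set when the handler first touches S3; Zappa's environment_variables key is the other option. The bucket and remote_env values are the ones from the question:
{
    "dev": {
        "s3_bucket": "my-lambda-deployables",
        "remote_env": "<my remote s3 file>",
        "aws_environment_variables": {
            "REQUESTS_CA_BUNDLE": "/var/task/cacert.pem"
        }
    }
}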
I fixed this by adding the cert path to the environment (Python):
os.environ['REQUESTS_CA_BUNDLE'] = os.path.join('/etc/ssl/certs/','ca-certificates.crt')
Edit: Sorry, the issue was not really fixed with the above code, but I found a hacky workaround by adding verify=False for all SSL requests:
boto3.client('s3', verify=False)

JSONDecodeError with Scrapy: Expecting value: line 1 column 1 (char 0)

I am using requests in order to fetch and parse some data scraped using Scrapy with Scrapyrt (real time scraping).
This is how I do it:
# pass spider to requests parameters
params = {
    'spider_name': spider,
    'start_requests': True
}

# scrape items
response = requests.get('http://scrapyrt:9080/crawl.json', params)
print('RESPONSE JSON', response.json())
data = response.json()
As per the ScrapyRT documentation, with the 'start_requests' parameter set to True, the spider automatically requests URLs and passes the responses to the parse method, which is the default method used for parsing requests.
start_requests
type: boolean
optional
Whether spider should execute Scrapy.Spider.start_requests method. start_requests are executed by default when you run Scrapy Spider normally without ScrapyRT, but this method is NOT executed in API by default. By default we assume that spider is expected to crawl ONLY url provided in parameters without making any requests to start_urls defined in Spider class. start_requests argument overrides this behavior. If this argument is present API will execute start_requests Spider method.
But the setup is not working. Log:
[2019-05-19 06:11:14,835: DEBUG/ForkPoolWorker-4] Starting new HTTP connection (1): scrapyrt:9080
[2019-05-19 06:11:15,414: DEBUG/ForkPoolWorker-4] http://scrapyrt:9080 "GET /crawl.json?spider_name=precious_tracks&start_requests=True HTTP/1.1" 500 7784
[2019-05-19 06:11:15,472: ERROR/ForkPoolWorker-4] Task project.api.routes.background.scrape_allmusic[87dbd825-dc1c-4789-8ee0-4151e5821798] raised unexpected: JSONDecodeError('Expecting value: line 1 column 1 (char 0)',)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/src/app/project/api/routes/background.py", line 908, in scrape_allmusic
print ('RESPONSE JSON',response.json())
File "/usr/lib/python3.6/site-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
The error was due to a bug in Twisted 19.2.0, a Scrapyrt dependency, which assumed the response to be of the wrong type.
Once I installed Twisted==18.9.0, it worked.
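Independent of the Twisted pin, it also helps to check the HTTP status before decoding, so a 500 from ScrapyRT surfaces as an HTTP error instead of a confusing JSONDecodeError. A minimal sketch along the lines of the code in the question:
import requests

params = {'spider_name': 'precious_tracks', 'start_requests': True}
response = requests.get('http://scrapyrt:9080/crawl.json', params=params)

# Fail fast on 4xx/5xx instead of trying to JSON-decode an error page
response.raise_for_status()
data = response.json()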

bq cli tool unable to connect to server

I'm trying to use the bq tool in my Docker container, but I am having difficulty connecting. I have run gcloud init; also, I can query in the web UI and outside of the Docker container with bq.
bq query --format=prettyjson --nouse_legacy_sql 'select count(*) from `bigquery-public-data.samples.shakespeare`'
BigQuery error in query operation: Cannot contact server. Please try again.
Traceback: Traceback (most recent call last):
  File "/usr/lib/google-cloud-sdk/platform/bq/bigquery_client.py", line 886, in BuildApiClient
    response_metadata, discovery_document = http.request(discovery_url)
  File "/usr/lib/google-cloud-sdk/platform/bq/third_party/oauth2client_4_0/transport.py", line 176, in new_request
    redirections, connection_type)
  File "/usr/lib/google-cloud-sdk/platform/bq/third_party/oauth2client_4_0/transport.py", line 283, in request
    connection_type=connection_type)
  File "/usr/lib/google-cloud-sdk/platform/bq/third_party/httplib2/__init__.py", line 1626, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/lib/google-cloud-sdk/platform/bq/third_party/httplib2/__init__.py", line 1368, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/lib/google-cloud-sdk/platform/bq/third_party/httplib2/__init__.py", line 1288, in _conn_request
    conn.connect()
  File "/usr/lib/google-cloud-sdk/platform/bq/third_party/httplib2/__init__.py", line 1082, in connect
    raise SSLHandshakeError(e)
SSLHandshakeError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:726)
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
The output of pip freeze is:
airflow==1.8.0
alembic==0.8.10
asn1crypto==0.24.0
BigQuery-Python==1.14.0
boto==2.49.0
cachetools==2.1.0
certifi==2018.10.15
cffi==1.11.5
chardet==3.0.4
Click==7.0
croniter==0.3.25
cryptography==2.3.1
dill==0.2.8.2
docutils==0.14
enum34==1.1.6
filechunkio==1.8
Flask==0.11.1
Flask-Admin==1.4.1
Flask-Cache==0.13.1
Flask-Login==0.2.11
flask-swagger==0.2.13
Flask-WTF==0.12
funcsigs==1.0.0
future==0.15.2
futures==3.2.0
gitdb2==2.0.5
GitPython==2.1.11
google-api-core==1.5.1
google-api-python-client==1.5.5
google-auth==1.5.1
google-auth-httplib2==0.0.3
google-auth-oauthlib==0.2.0
google-cloud-bigquery==1.6.0
google-cloud-core==0.28.1
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
gunicorn==19.3.0
httplib2==0.11.3
idna==2.7
ipaddress==1.0.22
itsdangerous==1.1.0
JayDeBeApi==1.1.1
Jinja2==2.8.1
JPype1==0.6.3
lockfile==0.12.2
lxml==3.8.0
Mako==1.0.7
Markdown==2.6.11
MarkupSafe==1.0
MySQL-python==1.2.5
numpy==1.15.3
oauth2client==2.0.2
oauthlib==2.1.0
ordereddict==1.1
pandas==0.18.1
pandas-gbq==0.7.0
protobuf==3.6.1
psutil==4.4.2
psycopg2==2.7.5
pyasn1==0.4.4
pyasn1-modules==0.2.2
pycparser==2.19
Pygments==2.2.0
pyOpenSSL==18.0.0
python-daemon==2.1.2
python-dateutil==2.7.5
python-editor==1.0.3
python-nvd3==0.14.2
python-slugify==1.1.4
pytz==2018.7
PyYAML==3.13
requests==2.20.0
requests-oauthlib==1.0.0
rsa==4.0
setproctitle==1.1.10
six==1.11.0
smmap2==2.0.5
SQLAlchemy==1.2.12
tabulate==0.7.7
thrift==0.9.3
Unidecode==1.0.22
uritemplate==3.0.0
urllib3==1.24
Werkzeug==0.14.1
WTForms==2.2.1
zope.deprecation==4.3.0

How to disable or change the path of ghostdriver.log?

The question is straightforward, but some context may help.
I'm trying to deploy Scrapy while using Selenium and PhantomJS as the downloader. The problem is that it keeps saying permission denied when trying to deploy, so I want to change the path of ghostdriver.log or just disable it. Looking at phantomjs -h and the GhostDriver GitHub page I couldn't find the answer, and my friend Google let me down as well.
$ scrapy deploy
Building egg of crawler-1370960743
'build/scripts-2.7' does not exist -- can't clean it
zip_safe flag not set; analyzing archive contents...
tests.fake_responses.__init__: module references __file__
Deploying crawler-1370960743 to http://localhost:6800/addversion.json
Server response (200):
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/scrapyd/webservice.py", line 18, in render
return JsonResource.render(self, txrequest)
File "/usr/lib/pymodules/python2.7/scrapy/utils/txweb.py", line 10, in render
r = resource.Resource.render(self, txrequest)
File "/usr/lib/python2.7/dist-packages/twisted/web/resource.py", line 216, in render
return m(request)
File "/usr/lib/pymodules/python2.7/scrapyd/webservice.py", line 66, in render_POST
spiders = get_spider_list(project)
File "/usr/lib/pymodules/python2.7/scrapyd/utils.py", line 65, in get_spider_list
raise RuntimeError(msg.splitlines()[-1])
RuntimeError: IOError: [Errno 13] Permission denied: 'ghostdriver.log'
When using the PhantomJS driver, add the following parameter:
driver = webdriver.PhantomJS(service_log_path='/var/log/phantomjs/ghostdriver.log')
Related code below; it would be nice to have an option to turn off logging, though it seems that's not supported:
selenium/webdriver/phantomjs/service.py
class Service(object):
    """
    Object that manages the starting and stopping of PhantomJS / Ghostdriver
    """

    def __init__(self, executable_path, port=0, service_args=None, log_path=None):
        """
        Creates a new instance of the Service

        :Args:
         - executable_path : Path to PhantomJS binary
         - port : Port the service is running on
         - service_args : A List of other command line options to pass to PhantomJS
         - log_path: Path for PhantomJS service to log to
        """
        self.port = port
        self.path = executable_path
        self.service_args = service_args
        if self.port == 0:
            self.port = utils.free_port()
        if self.service_args is None:
            self.service_args = []
        self.service_args.insert(0, self.path)
        self.service_args.append("--webdriver=%d" % self.port)
        if not log_path:
            log_path = "ghostdriver.log"
        self._log = open(log_path, 'w')
#Reduce logging level
driver = webdriver.PhantomJS(service_args=["--webdriver-loglevel=SEVERE"])
#Remove logging
import os
driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
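Putting the two together for the scrapyd case, a minimal sketch (assuming an older Selenium release that still ships the PhantomJS driver): either point the log at a directory the scrapyd user can write to, or discard it entirely, and keep only severe messages.
import os
from selenium import webdriver

# Discard the GhostDriver log (or use a writable path such as
# '/var/log/phantomjs/ghostdriver.log') and log only SEVERE messages
driver = webdriver.PhantomJS(
    service_log_path=os.devnull,
    service_args=["--webdriver-loglevel=SEVERE"],
)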