How to set up a Scrapy proxy with authorization? - scrapy

My middlewares settings:
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "111.11.11.111:1111"
        request.headers['Proxy - Authorization'] = basic_auth_header('login', 'password')
My settings:
DOWNLOADER_MIDDLEWARES = {
    'my_project.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
After launching, I get an error:
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy 217.29.53.106:51725 [{'status': 407, 'reason': b'Proxy Authentication Required'}]
What is the reason, and how do I fix it? (I am using valid HTTPS proxies.)

Try changing the header name to be Proxy-Authorization
request.headers['Proxy-Authorization'] = basic_auth_header('login', 'password')
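For reference, a minimal sketch of the middleware with that change applied (the proxy address and credentials are placeholders; adding a scheme prefix to the proxy URL is also generally expected by Scrapy's proxy handling):
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Include the scheme so the downloader knows how to reach the proxy.
        request.meta['proxy'] = "https://111.11.11.111:1111"
        # Header name must be exactly 'Proxy-Authorization' (no spaces).
        request.headers['Proxy-Authorization'] = basic_auth_header('login', 'password')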

proxies = {
    'http': 'http://{user}:{password}@{host}:{port}',
    'https': 'https://{user}:{password}@{host}:{port}',
}
yield scrapy.Request(url=url, meta={'proxy': proxies['https']})
Doesn't this work?

Related

How to set CORS header in cloudflare workers?

I'm using Cloudflare Workers to create a reverse proxy, but I can't embed the proxied resources on my main domain because it gives a CORS error:
Access to image at 'https://example.workers.dev/96384873_p0.png' from origin 'https://example.com' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.
here's the code in workers
addEventListener("fetch", event => {
  let url = new URL(event.request.url);
  url.hostname = "i.pximg.net";
  let request = new Request(url, event.request);
  event.respondWith(
    fetch(request, {
      headers: {
        'Referer': 'https://www.pixiv.net/',
        'User-Agent': 'Cloudflare Workers'
      }
    })
  );
});
How can I fix the CORS error?
You should read up on CORS to understand why you are receiving this error, then the fix should be straightforward (setting additional headers) - https://web.dev/cross-origin-resource-sharing/
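As a starting point, here is a minimal sketch of the worker adding the header itself; the allowed origin below is an assumption, adjust it (or use '*') to match your setup:
addEventListener("fetch", event => {
  event.respondWith(handle(event.request));
});

async function handle(request) {
  let url = new URL(request.url);
  url.hostname = "i.pximg.net";
  const upstream = await fetch(new Request(url, request), {
    headers: { 'Referer': 'https://www.pixiv.net/' }
  });
  // Copy the upstream response so its headers become mutable.
  const response = new Response(upstream.body, upstream);
  // Assumption: only https://example.com needs access; use '*' to allow any origin.
  response.headers.set('Access-Control-Allow-Origin', 'https://example.com');
  return response;
}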

Cannot POST request using service account key file in Python, getting 'Invalid IAP credentials: Unable to parse JWT', '401 Status Code'

I am trying to send a POST request to a Google App Engine service with a JSON body accompanied by an authorization token. I am generating the access token from a local service account key JSON file. The code below generates a credential, but the authorization is ultimately rejected. I have also tried different approaches already, including writing the request in Postman with a Bearer token in the header, and as a plain cURL command. But whatever I try, I get a 401 authentication error. I need to make sure whether the problem is on my side or with the service. I have explored every piece of documentation available, but no luck.
from google.auth.transport import requests
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

CREDENTIAL_SCOPES = ["https://www.googleapis.com/auth/cloud-platform"]
CREDENTIALS_KEY_PATH = 'my-local-service-account-key-file.json'

# the example service url I am trying to hit with requests
url = 'https://test.appspot.com/submit'
headers = {"Content-Type": "application/json"}

# example data I am sending with the request body
payload = {
    "key1": "value 1",
    "key2": "value 2"
}

credentials = service_account.Credentials.from_service_account_file(
    CREDENTIALS_KEY_PATH,
    scopes=CREDENTIAL_SCOPES
)
credentials.refresh(requests.Request())
authed_session = AuthorizedSession(credentials)

response = authed_session.request('POST',
                                  url,
                                  headers=headers,
                                  data=payload)

# adding some debug lines for your help
print(response.text)
print(response.status_code)
print(response.headers)
Getting the Output:
Invalid IAP credentials: Unable to parse JWT
401
{'X-Goog-IAP-Generated-Response': 'true', 'Date': 'Mon, 03 May 2021 06:52:11 GMT', 'Content-Type': 'text/html', 'Server': 'Google Frontend', 'Content-Length': '44', 'Alt-Svc': 'h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"'}
IAP expects a JWT (OpenID Connect (OIDC)) token in the Authorization header, while your method attaches an access token to the Authorization header instead. Take a look at the code snippet below to make a request to an IAP-secured resource.
Your code needs to be something like the following:
from google.auth.transport.requests import Request
from google.oauth2 import id_token
import requests


def make_iap_request(url, client_id, method='GET', **kwargs):
    """Makes a request to an application protected by Identity-Aware Proxy.

    Args:
      url: The Identity-Aware Proxy-protected URL to fetch.
      client_id: The client ID used by Identity-Aware Proxy.
      method: The request method to use
              ('GET', 'OPTIONS', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE')
      **kwargs: Any of the parameters defined for the request function:
                https://github.com/requests/requests/blob/master/requests/api.py
                If no timeout is provided, it is set to 90 by default.

    Returns:
      The page body, or raises an exception if the page couldn't be retrieved.
    """
    # Set the default timeout, if missing
    if 'timeout' not in kwargs:
        kwargs['timeout'] = 90

    # Obtain an OpenID Connect (OIDC) token from metadata server or using service
    # account.
    open_id_connect_token = id_token.fetch_id_token(Request(), client_id)

    # Fetch the Identity-Aware Proxy-protected URL, including an
    # Authorization header containing "Bearer " followed by a
    # Google-issued OpenID Connect token for the service account.
    resp = requests.request(
        method, url,
        headers={'Authorization': 'Bearer {}'.format(
            open_id_connect_token)}, **kwargs)
    if resp.status_code == 403:
        raise Exception('Service account does not have permission to '
                        'access the IAP-protected application.')
    elif resp.status_code != 200:
        raise Exception(
            'Bad response from application: {!r} / {!r} / {!r}'.format(
                resp.status_code, resp.headers, resp.text))
    else:
        return resp.text
Note: The above method works with implicit credentials. Set the path to your service account key in the environment by running export GOOGLE_APPLICATION_CREDENTIALS=my-local-service-account-key-file.json, then run the Python code from the same terminal.
Take a look at this link for more info.
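For completeness, a hypothetical usage sketch of the function above; the client ID and URL are placeholders, use the OAuth client ID shown on your IAP settings page and your own service URL:
# Hypothetical values: replace with your IAP OAuth client ID and service URL.
IAP_CLIENT_ID = '1234567890-abcdefg.apps.googleusercontent.com'

body = make_iap_request(
    'https://test.appspot.com/submit',
    IAP_CLIENT_ID,
    method='POST',
    json={"key1": "value 1", "key2": "value 2"},  # forwarded to requests.request()
)
print(body)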

Why can't I use Scrapy with an HTTPS proxy?

I wrote a simple test to validate an HTTPS proxy in Scrapy, but it didn't work.
class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        if response.status == 200:
            print(response.text)
The middlewares file looks like this:
class DynamicProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://183.159.88.182:8010'
And the settings file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'requestTest.middlewares.DynamicProxyDownloaderMiddleware': 100
}
When using the requests library, the HTTPS proxy works, but after switching to Scrapy it fails, which confuses me. Does anybody know why?
the log file:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.baidu.com/> (failed 1 times): TCP connection timed out: 10060
the proxy address is https://183.159.88.182:8010

How to HTTP 'Keep-Alive' in Python3.2 with urllib

I am trying to keep an HTTP connection alive with urllib.request in Python 3.2.3, using this code:
handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)
opener.addheaders = [("connection", "keep-alive"), ("Cookie", cookie_value)]
r = opener.open(url)
But if I watch the connection with Wireshark, I see a header with "Connection: close", although the Cookie is set:
Host: url
Cookie: cookie-value
Connection: close
What do I have to do to set the header to Connection: keep-alive?
If you need something more automatic than plain http.client, this might help, though it's not threadsafe.
from http.client import HTTPConnection, HTTPSConnection
import select

connections = {}

def request(method, url, body=None, headers={}, **kwargs):
    scheme, _, host, path = url.split('/', 3)
    h = connections.get((scheme, host))
    if h and select.select([h.sock], [], [], 0)[0]:
        h.close()
        h = None
    if not h:
        Connection = HTTPConnection if scheme == 'http:' else HTTPSConnection
        h = connections[(scheme, host)] = Connection(host, **kwargs)
    h.request(method, '/' + path, body, headers)
    return h.getresponse()

def urlopen(url, data=None, *args, **kwargs):
    resp = request('POST' if data else 'GET', url, data, *args, **kwargs)
    assert resp.status < 400, (resp.status, resp.reason, resp.read())
    return resp
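A hypothetical usage sketch of the helpers above (the URL and cookie value are placeholders); note that each response should be read fully before the cached connection is reused:
# Placeholder URL and cookie value for illustration.
resp = urlopen('http://example.com/page', headers={'Cookie': 'session=abc123'})
print(resp.status, resp.read()[:100])

# A second request to the same host reuses the connection cached in `connections`.
resp2 = urlopen('http://example.com/other', headers={'Cookie': 'session=abc123'})
print(resp2.status, resp2.read()[:100])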
I keep the connection alive by using http.client:
import http.client
conn = http.client.HTTPConnection(host, port)
conn.request(method, url, body, headers)
The headers argument takes a dict, and the body can still be built with urllib.parse.urlencode.
So you can set the Cookie header through http.client.
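For instance, a minimal sketch along these lines (host, path, and cookie value are placeholders):
import http.client
import urllib.parse

conn = http.client.HTTPConnection('example.com', 80)
body = urllib.parse.urlencode({'q': 'test'})
headers = {
    'Cookie': 'session=abc123',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
}
conn.request('POST', '/search', body, headers)
resp = conn.getresponse()
print(resp.status, resp.read()[:100])

# The same connection object can send further requests while the server keeps the socket open.
conn.request('GET', '/', headers={'Cookie': 'session=abc123'})
print(conn.getresponse().status)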
reference:
official reference

How to set different scrapy-settings for different spiders?

I want to enable an HTTP proxy for some spiders and disable it for other spiders.
Can I do something like this?
# settings.py
proxy_spiders = ['a1', 'b2']

if spider in proxy_spiders:  # how to get the spider name ???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'myproject.middlewares.ProxyMiddleware': 410,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
If the code above doesn't work, is there any other suggestion?
A bit late, but since release 1.0.0 there is a new feature in Scrapy that lets you override settings per spider, like this:
class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": 'http://127.0.0.1:8123',
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None}}

class MySpider2(scrapy.Spider):
    name = "my_spider2"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None}}
There is a new and easier way to do this.
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'SOME_SETTING': 'some value',
    }
I use Scrapy 1.3.1
You can use settings.overrides within the spider.py file.
Example that works:
from scrapy.conf import settings
settings.overrides['DOWNLOAD_TIMEOUT'] = 300
For you, something like this should also work
from scrapy.conf import settings
settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
You can define your own proxy middleware, something straightforward like this:
from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

class ConditionalProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)
Then define the attribute use_proxy = True in the spiders for which you want the proxy enabled. Don't forget to disable the default proxy middleware and enable your modified one.
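A minimal sketch of how this could be wired up (module paths and the priority value are placeholders; note that HttpProxyMiddleware picks up the proxy from the http_proxy/https_proxy environment variables):
# settings.py -- disable the stock proxy middleware, enable the conditional one
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
    'myproject.middlewares.ConditionalProxyMiddleware': 750,
}

# myspider.py -- only spiders carrying this attribute get the proxy applied
class ProxiedSpider(scrapy.Spider):
    name = 'proxied_spider'
    use_proxy = True
    start_urls = ['http://example.com/']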
Why not use two projects rather than only one?
Let's name these two projects proj1 and proj2. In proj1's settings.py, put these settings:
HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
In proj2's settings.py, put these settings:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}