Scrapy: How to scrape an HTTPS site through an SSL proxy

I have an SSL proxy server and I want to scrape an HTTPS site. That is, the connection between Scrapy and the proxy is encrypted, and the proxy then opens a connection to the website.
After some debugging I found that Scrapy currently handles the situation as follows:
If the site is HTTP, it uses ScrapyProxyAgent, which sends a client hello and then sends a connect request for the website to the proxy.
If the site is HTTPS, it uses a TunnelingAgent, which does not send a client hello to the proxy, and hence the connection is terminated.
What I need is to tell Scrapy to first establish a connection via ScrapyProxyAgent and then use a TunnelingAgent, but I'm not sure how to do that.
I tried to create an https handler for DOWNLOAD_HANDLERS, but I'm not an expert in this area:
class MyHTTPDownloader(HTTP11DownloadHandler):
    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        timeout = request.meta.get('download_timeout') or self._connectTimeout
        bindaddress = request.meta.get('bindaddress')
        proxy = request.meta.get('proxy')
        agent = ScrapyProxyAgent(reactor, proxyURI=to_bytes(proxy, encoding='ascii'),
                                 connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
        proxyHost = to_unicode(proxyHost)
        url = urldefrag(request.url)[0]
        method = to_bytes(request.method)
        headers = TxHeaders(request.headers)
        omitConnectTunnel = b'noconnect' in proxyParams
        proxyConf = (proxyHost, proxyPort,
                     request.headers.get(b'Proxy-Authorization', None))
        if request.body:
            bodyproducer = _RequestBodyProducer(request.body)
        elif method == b'POST':
            bodyproducer = _RequestBodyProducer(b'')
        else:
            bodyproducer = None
        start_time = time()
        tunnelingAgent = TunnelingAgent(reactor, proxyConf,
                                        contextFactory=self._contextFactory, connectTimeout=timeout,
                                        bindAddress=bindaddress, pool=self._pool)
        agent.request(method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
I need to establish the tunnel after the proxy agent has connected.
Is that even possible?
Thanks in advance.
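The order of operations described in the question (TLS handshake with the proxy first, then a CONNECT request for the target site sent inside that encrypted channel) can be sketched outside Scrapy with the Python standard library. This is only an illustration of the sequence, not Scrapy's or Twisted's API; the function names build_connect_request, connect_succeeded, and open_tunnel are hypothetical, and the sketch assumes the proxy presents a certificate that the default trust store accepts.

```python
import socket
import ssl


def build_connect_request(target_host, target_port, proxy_authorization=None):
    """Build the CONNECT request that asks the proxy to open a tunnel."""
    lines = [
        f"CONNECT {target_host}:{target_port} HTTP/1.1",
        f"Host: {target_host}:{target_port}",
    ]
    if proxy_authorization:
        lines.append(f"Proxy-Authorization: {proxy_authorization}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def connect_succeeded(status_line):
    """Return True if the proxy answered the CONNECT with status 200."""
    parts = status_line.split(None, 2)
    return len(parts) >= 2 and parts[0].startswith(b"HTTP/") and parts[1] == b"200"


def open_tunnel(proxy_host, proxy_port, target_host, target_port, timeout=10.0):
    """Handshake with the proxy over TLS, then request a tunnel to the target."""
    raw = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    try:
        # Step 1: TLS handshake with the proxy itself (the "client hello").
        context = ssl.create_default_context()
        tls_to_proxy = context.wrap_socket(raw, server_hostname=proxy_host)
    except Exception:
        raw.close()
        raise
    try:
        # Step 2: send CONNECT inside the encrypted channel.
        tls_to_proxy.sendall(build_connect_request(target_host, target_port))
        reader = tls_to_proxy.makefile("rb")
        status_line = reader.readline()
        if not connect_succeeded(status_line):
            raise ConnectionError(f"proxy refused CONNECT: {status_line!r}")
        # Skip the remaining response headers up to the blank line.
        while reader.readline() not in (b"\r\n", b"\n", b""):
            pass
    except Exception:
        tls_to_proxy.close()
        raise
    return tls_to_proxy
```

The returned socket carries raw bytes to the target. Fetching an HTTPS page through it would need a second TLS layer negotiated with the target inside the tunnel, which this sketch leaves out.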

Related

S3 presigned URL: Generate on server, use from client?

Is it possible to generate an S3 presigned URL in a Lambda function and return that URL to a client, so the client can use it to do an unauthenticated HTTP PUT?
I'm finding that S3 is unexpectedly closing my HTTPS connection when I try to PUT to the URL I get from the lambda function, and I don't know if it's because the server and client are different entities.
Can I do what I want to do? Is there a secret step I'm missing here?
EDIT: per Anon Coward's request, the server-side code is:
presigned_upload_parts = []
for part in range(num_parts):
    resp = s3.generate_presigned_url(
        ClientMethod='upload_part',
        Params={
            'Bucket': os.environ['USER_UPLOADS_BUCKET'],
            'Key': asset_id,
            'UploadId': s3_upload_id,
            'PartNumber': part
        }
    )
    presigned_upload_parts.append({"part": part, "url": resp})
return custom_http_response_wrapper(presigned_upload_parts)
The client-side code is:
for idx, part in enumerate(urls):
    startByte = idx * bytes_per_part
    endByte = min(filesize, ((idx + 1) * bytes_per_part))
    f.seek(startByte, 0)
    bytesBuf = f.read(endByte - startByte)
    print(f"Buffer is type {type(bytesBuf)} with length {len(bytesBuf):,}")
    print(f"Part {str(idx)}: bytes {startByte:,} to {endByte:,} as {part['url']}")
    #resp = requests.post(part['url'], data = bytesBuf, headers = self.get_standard_headers())
    resp = requests.put(
        url = part['url'],
        data = bytesBuf
    )
The error I'm getting is:
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
The presigned URL looks like:
https://my-bucket-name.s3.amazonaws.com/my/item/key?uploadId=yT2W....iuiggs-&partNumber=0&AWSAccessKeyId=ASIAR...MY&Signature=i6duc...Mmpc%3D&x-amz-security-token=IQoJ...%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F...SWHC&Expires=1657135314
There was a bug in my code somewhere. I ran the code under WSL as a test, and in the Linux environment got a more friendly error that helped me find and fix a minor bug, and now it's running as expected in the Windows environment. Whether that's because of the bugfix or some other environmental change I'll never know.
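One detail worth checking in code like this: S3 numbers multipart parts from 1 to 10,000, while range(num_parts) in the server snippet starts at 0 (visible as partNumber=0 in the URL above). Whether that was the bug the author fixed is not stated. The sketch below only shows one way to compute 1-based part numbers together with the byte range for each part; part_ranges is a hypothetical helper name, not part of boto3.

```python
def part_ranges(filesize, bytes_per_part):
    """Return (part_number, start_byte, end_byte) tuples for a multipart upload.

    Part numbers start at 1; end_byte is exclusive.
    """
    if bytes_per_part <= 0:
        raise ValueError("bytes_per_part must be positive")
    ranges = []
    start = 0
    part_number = 1
    while start < filesize:
        end = min(filesize, start + bytes_per_part)
        ranges.append((part_number, start, end))
        start = end
        part_number += 1
    return ranges
```

The server could then pass part_number as 'PartNumber' when generating each URL, and the client could seek to start_byte and read end_byte - start_byte bytes for the matching part.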

socket http lua to set timeout

I am trying to write a function that calls a REST API using LuaSocket's socket.http module.
I tried to set the timeout as shown below, but when I run this function the timeout has no effect. How should I set the timeout?
local http = require "socket.http"
local socket = require "socket"
local ltn12 = require "ltn12"
local respbody = {}
http.request {
    method = req_method,
    url = req_url,
    source = ltn12.source.string(req_body),
    headers = {
        ["Content-Type"] = req_content_type,
        ["content-length"] = string.len(req_body),
        ["Host"] = host,
    },
    sink = ltn12.sink.table(respbody),
    create = function()
        local req_sock = socket.tcp()
        req_sock:settimeout(3, 't')
        return req_sock
    end,
}
You may want to check lua-http. I use it to call REST APIs and it works like a charm. I am not an expert but, as far as I can tell, it is a good Lua HTTP implementation.
Setting a two-second timeout is as simple as:
local http_client = require "http.client"
local myconnection = http_client.connect {
    host = "myrestserver.domain.com";
    timeout = 2;
}
Full documentation in here.
If I adapt the example to my requirements, would it look like this? Correct me if I'm wrong.
local http_client = require "http.client"
local req_body = "key1=value1&key2=value2"
local myconnection = http_client.connect {
    method = "POST";
    url = "myrestserver.domain.com/api/example";
    host = "myrestserver.domain.com";
    source = req_body;
    headers = {
        ["Content-Type"] = "application/x-www-form-urlencoded",
        ["content-length"] = string.len(req_body),
    },
    timeout = 2;
}
LuaSocket's http module implicitly applies http.TIMEOUT to the socket object, so a timeout set in the create function is overridden; set http.TIMEOUT instead.
Also remember that a socket timeout is not the same as a request timeout.
A socket timeout applies to each operation independently. In the simple case, you can wait up to timeout seconds for the connection, and then each read operation can take up to timeout seconds. Because the HTTP client reads the response line by line, you get timeout seconds for each header, plus timeout seconds for each body chunk. There may also be redirections, each of which is a separate HTTP request/response. If you use TLS, there is also a handshake after the connection, which takes several send/receive operations.
I have not used the lua-http module and do not know how its timeout is implemented.
If I really need to limit the total request time, I prefer to use a module like cURL.
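The difference between a per-operation socket timeout and an overall request timeout can be illustrated with a small Python sketch (Python rather than Lua, purely as an illustration). It assumes a plain blocking socket; recv_all_with_deadline is a hypothetical helper that shrinks the per-read timeout on every iteration so that the total time spent reading stays under one deadline.

```python
import socket
import time


def recv_all_with_deadline(sock, total_timeout, bufsize=4096):
    """Read until EOF, enforcing one overall deadline instead of a per-read timeout."""
    deadline = time.monotonic() + total_timeout
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("overall deadline exceeded")
        # Each recv may only wait for the time still left before the deadline.
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(bufsize)
        except socket.timeout:
            raise TimeoutError("overall deadline exceeded") from None
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)
```

With a fixed per-read timeout of N seconds, a peer that sends one byte every N-1 seconds would never trigger a timeout; with the deadline approach the whole read fails once total_timeout has elapsed.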

Using Domain mfp 8 server return "request time out" using real mobile device?

Image of the console error
The mobile app connects successfully when it uses the MFP server's IP address with port 9080. When it uses the domain name instead of the IP address, the MFP 8 server responds with the error message "The Request time out" and the response text "undefined".
Using IP Address: mfpclient properties file:
wlServerProtocol = http
wlServerHost = **.**.**.78
wlServerPort = 9080
wlServerContext = /mfp/
testWebResourcesChecksum = false
ignoredFileExtensions = png, jpg, jpeg, gif, mp4, mp3
wlPlatformVersion = 8.0.0.00-20190910-142437
wlSecureDirectUpdatePublicKey =
languagePreferences = en
wlBuildId = 8.0.0.00-20190910-142437
Using Domain: mfpclient properties file:
wlServerProtocol = https
wlServerHost = www.domainname.com
wlServerPort = 443
wlServerContext = /mfp/
testWebResourcesChecksum = false
ignoredFileExtensions = png, jpg, jpeg, gif, mp4, mp3
wlPlatformVersion = 8.0.0.00-20190910-142437
wlSecureDirectUpdatePublicKey =
languagePreferences = en
wlBuildId = 8.0.0.00-20190910-142437
The output whenever I use the domain is a request timeout error.
Please update the question with the MobileFirst API request that is timing out.
A REQUEST_TIMEOUT error occurs when a request made by the device does not get a response from the MobileFirst server within the stipulated timeout period. For OAuth calls, this timeout is 10 seconds. Possible causes:
i. The server is not accessible on the IP address/port specified in the mfpclient.properties file.
ii. A timeout is set in WLResourceRequest and the adapter response is delayed by more than the timeout value.
iii. The server is taking too long to respond. Check your backend logic.
iv. DNS resolution may not be completing within 10 seconds.
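Two of the causes above (the server not being reachable on the configured host/port, and slow DNS resolution) can be checked from a machine on the same network as the device. The following is a minimal sketch using only the Python standard library; check_endpoint is a hypothetical helper, and the host and port you pass in would be the wlServerHost and wlServerPort values from mfpclient.properties.

```python
import socket
import time


def check_endpoint(host, port, timeout=10.0):
    """Return (dns_seconds, connected) for a TCP endpoint.

    dns_seconds is how long name resolution took; connected says whether a
    TCP connection to the first resolved address succeeded within timeout.
    """
    started = time.monotonic()
    # Raises socket.gaierror if the name cannot be resolved at all.
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    dns_seconds = time.monotonic() - started

    family, socktype, proto, _canonname, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            connected = True
        except OSError:
            connected = False
    return dns_seconds, connected
```

For example, check_endpoint("www.domainname.com", 443) would show whether the domain resolves quickly and whether anything is listening on port 443. It does not verify the TLS certificate or the /mfp/ context path.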

Python-twisted client connection failover

I am writing a TCP proxy with the Twisted framework and need simple client failover: if the proxy cannot connect to one backend, it should connect to the next one in the list. Until now I have used
reactor.connectTCP(host, port, factory) for the proxy, but it does not raise an error if it cannot connect. How can I detect that the connection failed and try another host, or should I use some other connection method?
You can use a deferred to do that
from twisted.internet import defer, reactor
from twisted.internet.protocol import ClientFactory

class MyClientFactory(ClientFactory):
    protocol = ClientProtocol  # your own Protocol subclass

    def __init__(self, request):
        self.request = request
        self.deferred = defer.Deferred()

    def handleReply(self, command, reply):
        # Handle the reply
        self.deferred.callback(0)

    def clientConnectionFailed(self, connector, reason):
        # Fire the errback chain so the next backend can be tried
        self.deferred.errback(reason)

def send(_, host, port, msg):
    factory = MyClientFactory(msg)
    reactor.connectTCP(host, port, factory)
    return factory.deferred

# Start with the first backend; each errback runs only if the previous attempt failed.
d = send(None, host1, port1, msg1)
d.addErrback(send, host2, port2, msg2)
# ...
d.addBoth(lambda _: print("finished"))
If the first attempt fails, the next errback is triggered and tries the next backend; otherwise the chain skips straight to the print function.
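The same failover idea, stripped of Twisted, can be sketched with blocking sockets from the Python standard library. This swaps the Deferred chain for a plain loop, so it is an illustration of the logic rather than a drop-in replacement for a reactor-based proxy; connect_first_available is a hypothetical helper name.

```python
import socket


def connect_first_available(backends, timeout=3.0):
    """Try each (host, port) in order and return the first connected socket."""
    last_error = None
    for host, port in backends:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            # Connection refused, timed out, unresolvable host, and so on:
            # remember the error and move on to the next backend.
            last_error = exc
    raise ConnectionError(f"no backend reachable: {last_error}")
```

The list order defines the failover priority, and the error from the last failed attempt is included in the exception raised when every backend is down.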

net/http.rb:560:in `initialize': getaddrinfo: Name or service not known (SocketError)

@@timestamp = nil

def generate_oauth_url
  @@timestamp = timestamp
  url = CONNECT_URL + REQUEST_TOKEN_PATH + "&oauth_callback=#{OAUTH_CALLBACK}&oauth_consumer_key=#{OAUTH_CONSUMER_KEY}&oauth_nonce=#{NONCE}&oauth_signature_method=#{OAUTH_SIGNATURE_METHOD}&oauth_timestamp=#{@@timestamp}&oauth_version=#{OAUTH_VERSION}"
  puts url
  url
end

def sign(url)
  Base64.encode64(HMAC::SHA1.digest((NONCE + url), OAUTH_CONSUMER_SECRET)).strip
end

def get_request_token
  url = generate_oauth_url
  signed_url = sign(url)
  request = Net::HTTP.new((CONNECT_URL + REQUEST_TOKEN_PATH),80)
  puts request.inspect
  headers = { "Authorization" => "Authorization: OAuth oauth_nonce = #{NONCE}, oauth_callback = #{OAUTH_CALLBACK}, oauth_signature_method = #{OAUTH_SIGNATURE_METHOD}, oauth_timestamp=#{@@timestamp}, oauth_consumer_key = #{OAUTH_CONSUMER_KEY}, oauth_signature = #{signed_url}, oauth_version = #{OAUTH_VERSION}" }
  request.post(url, nil,headers)
end

def timestamp
  Time.now.to_i
end
I am trying to do what oauth does in an attempt to understand how to use the Authorization headers. I am also getting the following error. I am trying to connect to the linkedin API.
/usr/lib/ruby/1.8/net/http.rb:560:in 'initialize': getaddrinfo: Name or service not known (SocketError)
I would really appreciate it if someone could nudge me in the right direction.
"Name or service not known" is a socket-level error which usually points to either an invalid IP address/DNS hostname, or an unregistered port name (e.g. telnet the.host.name service where service is not a registered service name.)
Check that CONNECT_URL holds a valid URL.
EDIT: I'm not a Ruby programmer, but I wouldn't mind betting that Net::HTTP.new requires a hostname (e.g. www.facebook.com) as the first argument, not a complete URL (e.g. www.facebook.com/login.php?method=oauth).
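The host-versus-URL distinction applies to most HTTP client libraries: the connection is opened to a hostname and port, and the path and query string are sent afterwards as part of the request. As an illustration in Python (not Ruby), the sketch below splits a full URL into the pieces a connection needs; split_host_and_path is a hypothetical helper and the URLs in the usage note are placeholders.

```python
from urllib.parse import urlsplit


def split_host_and_path(url):
    """Split a full URL into (hostname, port, path_with_query)."""
    parts = urlsplit(url)
    if not parts.hostname:
        # Without a scheme such as "http://", the host ends up in the path.
        raise ValueError(f"URL has no host: {url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.hostname, port, path
```

For "https://api.example.com/uas/oauth/requestToken?x=1" this yields the host "api.example.com", port 443, and path "/uas/oauth/requestToken?x=1". Only the host and port belong in the connection constructor; the path goes into the request call.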
You also get this error when you have no internet connection since a DNS lookup is often the first thing that happens when establishing a TCP connection using a hostname.
Unplug your network cable and try:
Socket.getaddrinfo("www.example.com", "http")
# => SocketError: getaddrinfo: nodename nor servname provided, or not known