I am using the following library with Scrapy in order to make requests through rotating proxies.
This presumably stopped working and my own IP is being used instead. So I am wondering if there is a fallback or if I accidentally changed the config.
My settings look like this:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = '/Users/user/test_crawl/proxy_list.txt'
PROXY_MODE = 0
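(For reference, PROXY_MODE selects how scrapy-proxies picks a proxy; the values below are taken from the project's README.)

# scrapy-proxies PROXY_MODE values (per the project README):
#   0 - pick a random proxy from PROXY_LIST for every request
#   1 - pick one proxy from the list and reuse it for every request
#   2 - use the fixed proxy given in the CUSTOM_PROXY setting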
The proxy list:
http://147.30.82.195:8080
http://168.183.187.238:8080
The traceback:
[scrapy.proxies] DEBUG: Proxy user pass not found
2018-12-27 14:23:20 [scrapy.proxies] DEBUG: Using proxy
<http://168.183.187.238:8080>, 2 proxies left
2018-12-27 14:23:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/file.htm> (referer: https://www.example.com)
The "Proxy user pass not found" DEBUG output should be fine, as it only indicates that I am not authenticating with a user/password.
The logfile on the example.com server shows my IP directly instead of the proxy IP.
This used to work, so I am wondering how to get it back working.
scrapy-proxies expects proxies to have a password. If the password is empty, it ignores the proxy.
It should probably fail loudly, as it does when there are no proxies at all, but instead it silently does nothing, which results in no proxy being configured and your own IP being used instead.
I would say you should report the issue upstream, but the project seems dead. So, unless you are willing to fork the project and fix the issue yourself, you are out of luck.
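If you do fork it, the fix is small; alternatively, since you only use unauthenticated proxies, you could drop scrapy_proxies and register a minimal middleware of your own. The following is a sketch under that assumption (SimpleRandomProxy is a hypothetical name, not part of any library):

import random

class SimpleRandomProxy(object):
    # Pick a random proxy per request; credentials are not required.
    def __init__(self, proxy_list_path):
        # PROXY_LIST entries are assumed to look like http://host:port
        with open(proxy_list_path) as f:
            self.proxies = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('PROXY_LIST'))

    def process_request(self, request, spider):
        # Unlike scrapy_proxies, set the proxy even when there is no user:pass.
        if 'proxy' not in request.meta:
            request.meta['proxy'] = random.choice(self.proxies)

Register it in DOWNLOADER_MIDDLEWARES in place of scrapy_proxies.RandomProxy (keeping HttpProxyMiddleware enabled), and the target server's logs should show the proxy IP again.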
Related
I'm trying to run Jibri as part of a Jitsi Meet installation (all on one server) behind a reverse SSL proxy. Jitsi works out of the box, but as soon as Jibri tries to log in to the session to record it, the corresponding Chrome session times out. Here's an excerpt from the jibri log:
2021-04-04 09:09:42.546 FINE: [890] org.jitsi.jibri.selenium.pageobjects.CallPage.visit() Visiting url https://example.com/room#config.iAmRecorder=true&config.externalConnectUrl=null&config.startWithAudioMuted=true&config.startWithVideoMuted=true&interfaceConfig.APP_NAME="Jibri"&config.analytics.disabled=true&config.p2p.enabled=false&config.prejoinPageEnabled=false&config.requireDisplayName=false
2021-04-04 09:09:42.633 FINE: [890] org.jitsi.jibri.selenium.pageobjects.CallPage.apply() Not joined yet: APP is not defined
...
2021-04-04 09:10:12.945 INFO: [890] org.jitsi.jibri.selenium.JibriSelenium.onSeleniumStateChange() Transitioning from state Starting up to Error: FailedToJoinCall SESSION Failed to join the call
2021-04-04 09:10:12.947 INFO: [890] org.jitsi.jibri.service.impl.FileRecordingJibriService.onServiceStateChange() File recording service transitioning from state Starting up to Error: FailedToJoinCall SESSION Failed to join the call
The reverse proxy is configured to watch out for this login string on port 443 (normal SSL traffic per the URL above) and forward this to the Jitsi instance. Prosody accepts the request on its http-bind interface but then the invocation times out.
As the web server logs are inconclusive: Where / what logs can I check to see what happens afterwards? I can see jicofo picking up the invocation but don't know what happens afterwards (Jicofo 2021-04-04 09:09:42.130 INFO: [461] org.jitsi.jicofo.recording.jibri.JibriSession.log() Updating status from JIBRI: <iq to='focus#auth.example.com/focus647288887711795' from='jibribrewery#internal.auth.example.com/jibri-nickname' id='5iurC-49012' type='result'><jibri xmlns='http://jitsi.org/protocol/jibri' status='pending'/></iq> for room#conference.example.com)?
More than happy to provide more info as required.
I created a spider that scrapes data from Yelp using Scrapy. All requests go through the Crawlera proxy. The spider gets a URL to scrape, sends a request, and scrapes the data. This worked fine up until the other day, when I started getting a 502 None response. The 502 None response appears
after execution of this line:
r = self.req_session.get(url, proxies=self.proxies, verify='../secret/crawlera-ca.crt').text
The traceback:
2020-11-04 14:27:55 [urllib3.connectionpool] DEBUG: https://www.yelp.com:443 "GET /biz/a-dog-in-motion-arcadia HTTP/1.1" 502 None
So, it seems that the spider cannot reach the URL because the connection is closed.
I have checked the meaning of 502 in the Scrapy and Crawlera documentation, and it refers to the connection being refused or closed, the domain being unavailable, and similar things.
I have debugged the code around where the issue is happening, and everything is up to date.
If anyone has ideas or knowledge about this, I would love to hear it, as I am stuck. What could actually be the issue here?
NOTE: Yelp URLs work normally when I open them in browser.
The website can tell from the headers of your request that you are a "scraper" and not a human user.
You should send a different header with the request, so that the scraped website thinks you are browsing with a regular browser.
For more info, refer to the scrapy documentation.
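For example, building on the request line from the question (the User-Agent string below is only an illustration; any recent browser UA works):

# Sketch: send a browser-like User-Agent so the site treats the request as
# coming from a regular browser. The UA string is illustrative.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/86.0.4240.111 Safari/537.36'),
}
r = self.req_session.get(url, headers=headers, proxies=self.proxies,
                         verify='../secret/crawlera-ca.crt').text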
Some pages are not available in some countries, which is why using proxies is recommended. I tried to fetch the URL and the connection was successful:
2020-11-05 02:50:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2020-11-05 02:50:40 [scrapy.core.engine] INFO: Spider opened
2020-11-05 02:50:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/biz/a-dog-in-motion-arcadia> (referer: None)
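If you are making the requests through Scrapy itself, a per-request proxy can be set via request.meta, e.g. (the proxy URL below is a placeholder):

# Sketch: inside a spider callback, route a single Scrapy request through a
# proxy via request.meta. 'http://host:port' stands in for a real proxy.
yield scrapy.Request(
    'https://www.yelp.com/biz/a-dog-in-motion-arcadia',
    meta={'proxy': 'http://host:port'},
    callback=self.parse,
)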
I'm trying to set up automated testing with PhantomJS, Behat and Sahi on my vagrant machine.
I'm getting the following output, when trying to run a test with behat:
[Behat\SahiClient\Exception\ConnectionException]
Exception has been thrown in "afterStep" hook, defined in FeatureContext::afterStep()
Connection time limit reached
Here is my userdata.properties:
# dirs. Relative paths are relative to userdata dir. Separate directories with semi-colon
scripts.dir=scripts;
# default log directory.
logs.dir=logs
# Directory where auto generated ssl certificates are stored
certs.dir=certs
# Use external proxy server for http
ext.http.proxy.enable=false
ext.http.proxy.host=
ext.http.proxy.port=
ext.http.proxy.auth.enable=false
ext.http.proxy.auth.name=kamlesh
ext.http.proxy.auth.password=password
# Use external proxy server for https
ext.https.proxy.enable=false
ext.https.proxy.host=
ext.https.proxy.port=
ext.https.proxy.auth.enable=false
ext.https.proxy.auth.name=kamlesh
ext.https.proxy.auth.password=password
# There is only one bypass list for both secure and insecure.
ext.http.both.proxy.bypass_hosts=localhost|127.0.0.1|*.internaldomain.com
# Mark this property true to disable the proxy alert
proxy_alert.disabled=false
And my browser_types.xml:
<browserTypes>
    <browserType>
        <name>phantomjs</name>
        <displayName>PhantomJS</displayName>
        <icon>safari.png</icon>
        <path>/usr/bin/phantomjs</path>
        <options>--ignore-ssl-errors=yes --proxy=localhost:9999 --ssl-protocol=any /usr/local/sahi/phantomjs-sahi.js</options>
        <processName>phantomjs</processName>
        <capacity>100</capacity>
        <force>true</force>
    </browserType>
</browserTypes>
behat.yml:
default:
  extensions:
    Behat\MinkExtension\Extension:
      javascript_session: sahi
      browser_name: phantomjs
      goutte: ~
      sahi:
        host: localhost
        port: 9999
Sahi run output:
--------
SAHI_HOME: ..
SAHI_USERDATA_DIR: ../userdata
SAHI_EXT_CLASS_PATH:
--------
Sahi properties file = /usr/local/sahi/config/sahi.properties
Sahi user properties file = /usr/local/sahi/userdata/config/userdata.properties
Added shutdown hook.
>>>> Sahi OS v5.0 started. Listening on port: 9999
>>>> Configure your browser to use this server and port as its proxy
>>>> Browse any page and CTRL-ALT-DblClick on the page to bring up the Sahi Controller
-----
Reading browser types from: /usr/local/sahi/userdata/config/browser_types.xml
-----
I've tried reinstalling a bunch of stuff, tried playing around with the ports, processes, proxy settings, nothing.
Your vagrant machine comes with an empty database, or none at all. So when you try to connect to your app, e.g. log in with some known user, it will crash because it won't find that user!
All the best ;)
The browserType settings changed in Sahi Pro v4.3.2: there is no force tag anymore, so please check your configuration.
https://sahipro.com/docs/using-sahi/sahi-headless-execution-with-phantomjs.html#Documentation since Sahi Pro V4.3.2
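In other words, on Sahi Pro v4.3.2+ the entry from the question would simply drop the force tag, e.g.:

<browserTypes>
    <browserType>
        <name>phantomjs</name>
        <displayName>PhantomJS</displayName>
        <icon>safari.png</icon>
        <path>/usr/bin/phantomjs</path>
        <options>--ignore-ssl-errors=yes --proxy=localhost:9999 --ssl-protocol=any /usr/local/sahi/phantomjs-sahi.js</options>
        <processName>phantomjs</processName>
        <capacity>100</capacity>
    </browserType>
</browserTypes>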
I want to implement a proxy server that intercepts both http and https requests. I came across libmproxy (http://mitmproxy.org/doc/scripting/libmproxy.html), which is SSL-capable. I started with the simplest possible proxy, which just prints the headers of all requests and responses and forwards them to clients and servers normally.
#!/usr/bin/env python
from libmproxy import controller, proxy
import os

class Master(controller.Master):
    def __init__(self, server):
        controller.Master.__init__(self, server)
        self.stickyhosts = {}

    def run(self):
        try:
            return controller.Master.run(self)
        except KeyboardInterrupt:
            self.shutdown()

    def handle_request(self, msg):
        print "handle request.................................................."
        print msg.headers
        msg.reply()

    def handle_response(self, msg):
        print "handle response................................................."
        print msg.headers
        msg.reply()

config = proxy.ProxyConfig(
    cacert=os.path.expanduser("~/.mitmproxy/mitmproxy-ca.pem")
)
server = proxy.ProxyServer(config, 1234)
m = Master(server)
m.run()
Then I configured the http and ssl proxy in Firefox to 127.0.0.1 port 1234. http seems to work fine, as I can see all the headers being printed out. However, when the browser sends https requests, the proxy server does not print anything at all, and the browser displays a "the connection was interrupted" error.
Further investigation reveals that the https requests go through the proxy server but not through controller.Master. I see that proxy.ProxyHandler.establish_ssl() is being called when there is an https request, but the request does not go through controller.Master.handle_request(). Even though establish_ssl() is called, the browser does not seem to get any response back. I tested this with https://www.google.com.
First, how can I make proxy.ProxyHandler work properly with https requests/responses? Second, how can I modify controller.Master so that it can intercept https requests? I'm also open to other tools on top of which I can build a custom http/https proxy server.
You need to install the mitmproxy CA in the browser you are testing with.
Please see details here ("Installing the mitmproxy CA" section):
http://mitmproxy.org/doc/ssl.html
This solved the problem for me.
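If you are unsure which file to import, libmproxy generates its CA material under ~/.mitmproxy/ on first run (on recent versions the browser-facing certificate is mitmproxy-ca-cert.pem; older versions may only have mitmproxy-ca.pem, which bundles key and cert). A trivial sketch to list what is there, with the path assumed from the question's own config:

# List the CA files libmproxy generated under ~/.mitmproxy/.
import os, glob
for path in glob.glob(os.path.expanduser("~/.mitmproxy/*")):
    print path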
I'm trying to get the Python code working that I found on:
http://developer.android.com/google/gcm/ccs.html
I've changed the first 2 rows with (I think) the correct data.
The project number and API key are fake; it's just to show you how they almost look.
import sys, json, xmpp
SERVER = ('gcm.googleapis.com', 5235)
USERNAME = '489713985816'
PASSWORD = 'AIzd237jjN_iT7yRxLWiHRreqax45XaMJQ6VJ98'
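For context, the connection/auth portion of the sample on that page continues roughly like this (reproduced from the linked CCS docs; the project number serves as the XMPP username and the API key as the password):

client = xmpp.Client('gcm.googleapis.com', debug=['socket'])
client.connect(server=SERVER, secure=1, use_srv=False)
auth = client.auth(USERNAME, PASSWORD)
if not auth:
    print 'Authentication failed!'
    sys.exit(1)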
I've created a google api project (tried it with 2 different projects).
Activated GCM.
Copied the following:
Project Number: 489713985816
API key : AIzd237jjN_iT7yRxLWiHRreqax45XaMJQ6VJ98
Tried the code with a key for server apps and a key for browser apps, both with and without a specific IP address.
When I execute the code with #python ccs.py I get the following result (note the "not whitelisted" error at the end).
If that is my problem, how do I get my project whitelisted?
Invalid debugflag given: socket
DEBUG:
DEBUG: Debug created for /usr/lib/python2.7/dist-packages/xmpp/client.py
DEBUG: flags defined: socket
DEBUG: socket start Plugging <xmpp.transports.TCPsocket instance at 0x1ea2950>
into <xmpp.client.Client instance at 0x1ea27a0>
DEBUG: socket start Successfully connected to remote
host ('gcm.googleapis.com', 5235)
DEBUG: socket sent <?xml version='1.0'?>
<stream:stream xmlns="jabber:client" to="gcm.googleapis.com" version="1.0"
xmlns:stream="http://etherx.jabber.org/streams" >
DEBUG: socket got
<stream:stream from="gcm.googleapis.com" id="FD82304ADA8C8019" version="1.0"
xmlns:stream="http://etherx.jabber.org/streams" xmlns="jabber:client">
<stream:features>
<mechanisms xmlns="urn:ietf:params:xml:ns:xmpp-sasl">
<mechanism>X-OAUTH2</mechanism>
<mechanism>X-GOOGLE-TOKEN</mechanism>
<mechanism>PLAIN</mechanism>
</mechanisms>
</stream:features>
DEBUG: socket sent <auth xmlns="urn:ietf:params:xml:ns:xmpp-sasl"
mechanism="PLAIN">MjgzMVqTl9p\nVDdUTZWSjk4\n</auth>
DEBUG: socket got <failure xmlns="urn:ietf:params:xml:ns:xmpp-sasl">
<temporary-auth-failure/>
<text xmlns="urn:ietf:params:xml:ns:xmpp-stanzas">
Project 489713985816 not whitelisted.</text>
</failure>
</stream:stream>
Authentication failed!
You might want to try the following guide: http://www.androidhive.info/2012/10/android-push-notifications-using-google-cloud-messaging-gcm-php-and-mysql/
I was having the same problem that you are, but following this guide helped me get my push notifications through without having to sign up to be whitelisted.
After 3 months of waiting, I've just received an email from a Google employee.
My GCM whitelist request has been approved.
Thank you Ashish.
Now, let the fun begin!
In the documentation, it is mentioned several times that to use upstream messaging, you need to ask for authorization (to be whitelisted).
You can do that here: https://services.google.com/fb/forms/gcm/
You can still use the old "Cloud to device" messaging. You can read more about this, including links to a sample project here.