Build protocol mechanism in twisted Site - twisted

I am trying to understand how and when protocols are created for HTTP requests in the Site factory. Although I have a basic understanding of how factories and protocols work, my confusion arises because I see multiple protocols being created for a single request from the browser. Below are the sample code and the output.
import sys
from twisted.web.server import Site
from twisted.web.static import File
from twisted.internet import reactor
from twisted.python import log

log.startLogging(sys.stdout)

class GFactory(Site):
    protmade = 0

    def __init__(self, resource):
        Site.__init__(self, resource)

    def buildProtocol(self, addr):
        GFactory.protmade += 1
        print "Building Protocol in factory" + str(GFactory.protmade)
        print addr
        p = Site.buildProtocol(self, addr)
        return p

resource = File('./temp')
factory = GFactory(resource)
reactor.listenTCP(8888, factory)
reactor.run()
Log from a single request to http://localhost:8888/ through Chrome:
2016-05-02 01:38:07+0530 [-] Log opened.
2016-05-02 01:38:07+0530 [-] GFactory starting on 8888
2016-05-02 01:38:07+0530 [-] Starting factory <__main__.GFactory instance at 0x11bac75f0>
2016-05-02 01:38:19+0530 [-] Building Protocol in factory1
2016-05-02 01:38:19+0530 [-] IPv4Address(TCP, '127.0.0.1', 64913)
2016-05-02 01:38:19+0530 [-] Building Protocol in factory2
2016-05-02 01:38:19+0530 [-] IPv4Address(TCP, '127.0.0.1', 64914)
2016-05-02 01:38:19+0530 [-] Building Protocol in factory3
2016-05-02 01:38:19+0530 [-] IPv4Address(TCP, '127.0.0.1', 64915)
2016-05-02 01:38:19+0530 [-] "127.0.0.1" - - [01/May/2016:20:08:19 +0000] "GET / HTTP/1.1" 200 800 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
My queries are the following:
1) As shown above, three different protocols are created for requests from three different ports. Is this expected behaviour? Shouldn't there be one protocol per request?
2) If I have to take some action when the request expires or the protocol is closed, which protocol should I look into?
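On the first query: Twisted calls buildProtocol once per TCP connection, not once per HTTP request, and Chrome commonly opens extra speculative connections, which would explain the three calls for one page load. On the second query, the usual per-connection hook is connectionLost on the protocol instance the factory returns. Below is a minimal stand-in sketch of that wrapping pattern; StubProtocol is a hypothetical placeholder for the real protocol Site.buildProtocol would return, used here only so the sketch runs without Twisted:

```python
class StubProtocol:
    """Stand-in for the protocol object Site.buildProtocol would return."""
    def connectionLost(self, reason):
        return "base cleanup"

def wrap_connection_lost(proto, addr, closed_log):
    """Wrap proto.connectionLost so each connection close is recorded,
    as one might do inside an overridden Site.buildProtocol."""
    original = proto.connectionLost
    def connectionLost(reason):
        closed_log.append((addr, reason))  # note which connection closed and why
        return original(reason)            # still run the normal cleanup
    proto.connectionLost = connectionLost
    return proto

closed = []
proto = wrap_connection_lost(StubProtocol(), "127.0.0.1:64913", closed)
proto.connectionLost("client disconnected")
```

In a real factory, the same wrapping would be applied to the object returned by Site.buildProtocol(self, addr) before returning it.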

Related

Geckodriver with webdriverIo giving session already started error

I am trying to run automation on the Firefox browser. I am passing the following set of options:
const firefoxOptions = {
    capabilities: {
        browserName: "firefox"
    },
    services: [
        "geckodriver"
    ]
}
I also have geckodriver running in a separate terminal tab, which gives the following logs on execution:
mahimakh#88665a46834a Downloads % ./geckodriver --port=4444
1667519366314 geckodriver INFO Listening on 127.0.0.1:4444
1667519377223 mozrunner::runner INFO Running command: "/Applications/Firefox.app/Contents/MacOS/firefox-bin" "--marionette" "-foreground" "-no-remote" "-profile" "/var/folders/gv/yctv5ytx7yd7xbhcw5qz4v_40000gs/T/rust_mozprofilecZIetT"
2022-11-03 16:49:37.780 plugin-container[26937:1559502] nil host used in call to allowsSpecificHTTPSCertificateForHost
2022-11-03 16:49:37.781 plugin-container[26937:1559502] nil host used in call to allowsAnyHTTPSCertificateForHost:
2022-11-03 16:49:37.785 plugin-container[26937:1559502] nil host used in call to allowsSpecificHTTPSCertificateForHost
2022-11-03 16:49:37.785 plugin-container[26937:1559502] nil host used in call to allowsAnyHTTPSCertificateForHost:
2022-11-03 16:49:37.785 plugin-container[26937:1559507] nil host used in call to allowsSpecificHTTPSCertificateForHost
2022-11-03 16:49:37.785 plugin-container[26937:1559507] nil host used in call to allowsAnyHTTPSCertificateForHost:
1667519377786 Marionette INFO Marionette enabled
1667519377819 Marionette INFO Listening on port 51739
Read port: 51739
console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at /var/folders/gv/yctv5ytx7yd7xbhcw5qz4v_40000gs/T/rust_mozprofilecZIetT/search.json.mozlz4", (void 0)))
2022-11-03 23:49:44.448 plugin-container[26991:1560037] nil host used in call to allowsSpecificHTTPSCertificateForHost
2022-11-03 23:49:44.448 plugin-container[26991:1560037] nil host used in call to allowsAnyHTTPSCertificateForHost:
2022-11-03 23:49:44.472 plugin-container[26991:1560037] nil host used in call to allowsSpecificHTTPSCertificateForHost
2022-11-03 23:49:44.472 plugin-container[26991:1560037] nil host used in call to allowsAnyHTTPSCertificateForHost:
2022-11-03 23:49:44.473 plugin-container[26991:1560041] nil host used in call to allowsSpecificHTTPSCertificateForHost
2022-11-03 23:49:44.473 plugin-container[26991:1560041] nil host used in call to allowsAnyHTTPSCertificateForHost:
I also have a Selenium standalone server running locally:
16:55:35.665 INFO - Selenium build info: version: '3.5.3', revision: 'a88d25fe6b'
16:55:35.666 INFO - Launching a standalone Selenium Server
2022-11-03 16:55:35.695:INFO::main: Logging initialized #351ms to org.seleniumhq.jetty9.util.log.StdErrLog
16:55:35.756 INFO - Driver class not found: com.opera.core.systems.OperaDriver
16:55:35.797 INFO - Driver provider class org.openqa.selenium.ie.InternetExplorerDriver registration is skipped:
registration capabilities Capabilities [{ensureCleanSession=true, browserName=internet explorer, version=, platform=WINDOWS}] does not match the current platform MAC
16:55:35.797 INFO - Driver provider class org.openqa.selenium.edge.EdgeDriver registration is skipped:
registration capabilities Capabilities [{browserName=MicrosoftEdge, version=, platform=WINDOWS}] does not match the current platform MAC
16:55:35.829 INFO - Using the passthrough mode handler
2022-11-03 16:55:35.855:INFO:osjs.Server:main: jetty-9.4.5.v20170502
2022-11-03 16:55:35.878:WARN:osjs.SecurityHandler:main: ServletContext#o.s.j.s.ServletContextHandler#6f1fba17{/,null,STARTING} has uncovered http methods for path: /
2022-11-03 16:55:35.882:INFO:osjsh.ContextHandler:main: Started o.s.j.s.ServletContextHandler#6f1fba17{/,null,AVAILABLE}
2022-11-03 16:55:35.908:INFO:osjs.AbstractConnector:main: Started ServerConnector#5e853265{HTTP/1.1,[http/1.1]}{0.0.0.0:4444}
2022-11-03 16:55:35.909:INFO:osjs.Server:main: Started #564ms
16:55:35.909 INFO - Selenium Server is up and running
On starting the execution, Firefox opens and performs a set of tasks, but initially it gives the following error:
2022-11-03T23:49:41.942Z INFO webdriver: COMMAND status()
2022-11-03T23:49:41.943Z INFO webdriver: [GET] http://localhost:4444/status
2022-11-03T23:49:41.945Z INFO webdriver: RESULT { message: 'Session already started', ready: false }
ERROR: checkStatus failed, Session Status Missing "build" Object!
1) Validate Retail Website Title
Later it fails with a "webdriver is null" exception! Can anyone help me understand what I am doing wrong here?

API Umbrella is not accessible in docker giving 502 error

I have set up API Umbrella on my Ubuntu 20 cloud VM.
When I try to access it, I get a 502 Bad Gateway error.
Obviously, the routing is failing for some reason.
The output of /var/log/api-umbrella/nginx/current is the following:
2022-09-01T06:08:19.57992 starting nginx...
2022-09-01T06:08:27.48168 2022/09/01 06:08:27 [error] 319#0: *13 [lua] elasticsearch_setup.lua:106: create_aliases(): failed to create elasticsearch index: Unsuccessful response: {"error":{"root_cause":[{"type":"index_already_exists_exception","reason":"already exists","index":"api-umbrella-logs-v1-2022-09"}],"type":"index_already_exists_exception","reason":"already exists","index":"api-umbrella-logs-v1-2022-09"},"status":400}, context: ngx.timer
2022-09-01T06:21:45.17055 2022/09/01 06:21:45 [warn] 318#0: *39756 using uninitialized "x_api_umbrella_request_id" variable while logging request, client: 192.241.213.X, server: mydomain.city, request: "GET / HTTP/1.1", host: "150.230.240.y:443"
2022-09-01T06:32:42.70713 2022/09/01 06:32:42 [error] 318#0: *72162 connect() failed (111: Connection refused) while connecting to upstream, client: 185.14.196.Z, server: mydomain.city, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:14009/", host: "mydomain.city"

Splash freezes with "Timing out client: IPv4Address"

I am running scrapy-splash to scrape data from one website.
Regularly (randomly) Splash freezes with the following logs:
splash-service_1 | 2020-07-16 08:49:35.119333 [-] "172.31.0.4" - - [16/Jul/2020:08:49:34 +0000] "POST /execute HTTP/1.1" 200 266018 "-" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
splash-service_1 | 2020-07-16 08:50:10.012973 [-] Timing out client: IPv4Address(type='TCP', host='172.31.0.4', port=51970)
splash-service_1 | 2020-07-16 08:50:10.858080 [-] Timing out client: IPv4Address(type='TCP', host='172.31.0.4', port=51978)
splash-service_1 | 2020-07-16 08:50:16.873014 [-] Timing out client: IPv4Address(type='TCP', host='172.31.0.4', port=51974)
splash-service_1 | 2020-07-16 08:50:17.547947 [-] Timing out client: IPv4Address(type='TCP', host='172.31.0.4', port=51966)
splash-service_1 | 2020-07-16 08:50:18.037436 [-] Timing out client: IPv4Address(type='TCP', host='172.31.0.4', port=51976)
splash-service_1 | 2020-07-16 08:50:29.064655 [-] Timing out client: IPv4Address(type='TCP', host='172.31.0.4', port=51932)
splash-service_1 | 2020-07-16 08:50:35.119997 [-] Timing out client: IPv4Address(type='TCP', host='172.31.0.4', port=51968)
How can I find out the reason for that? Why might it get stuck?
P.S. I run it with args={"lua_source": self.lua_script_navigate, "timeout": 60000}
Refer to Splash's HTTP API documentation for the timeout argument:
timeout : float : optional
A timeout (in seconds) for the render (defaults to 30).
By default, maximum allowed value for the timeout is 90 seconds. To override it start Splash with --max-timeout command line option.
For example, here Splash is configured to allow timeouts up to 5
minutes:
$ docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300
If you have not started Splash with --max-timeout, your lua_script is aborted after 30 s, even when you set a higher timeout in args.
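Also note that the timeout argument is in seconds, so the args above ask for 60000 seconds. A minimal sketch of corrected args, assuming the Splash server was started with the default 90-second cap (the lua_source value is a stand-in, not the real script):

```python
# Splash's `timeout` argument is in SECONDS; the value must stay at or
# below the server's --max-timeout (90 by default).
splash_args = {
    "lua_source": "-- your Lua script here",  # stand-in for the real script
    "timeout": 90,  # 90 seconds, not 90000 milliseconds
}
```

To allow longer renders, raise --max-timeout on the container (as in the docker run example above) and set timeout accordingly.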

How to pass SSO using selenium2+phantomjs webdriver?

I am using Selenium 2 (selenium-java:3.0.1) and phantomjs-2.1.1-linux-x86_64. What I am trying to do is get to a page that requires SSO. When using a browser to access the website, it pops up a login dialog to input the username and password.
When using wget to fetch the URL, it stops at the auth step.
test#ubu-test:wget https://www.example.com/details
--2016-11-15 05:18:02-- https://www.example.com/details
Resolving www.example.com (www.example.com)... 10.20.30.40
Connecting to www.example.com (www.example.com)|10.20.30.40|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /detaisl [following]
--2016-11-15 05:18:02-- https://www.example.com/login
Reusing existing connection to www.example.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://sso.example.com/idp/SSO.saml2?SAMLRequest=...something not related... [following]
--2016-11-15 05:18:02-- https://sso.example.com/idp/SSO.saml2?SAMLRequest=...something not related...
Resolving sso.example.com (sso.example.com)... 11.22.33.44
Connecting to sso.example.com (sso.example.com)|11.22.33.44|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://sso.example.com/idp/95wM4/resumeSAML20/idp/SSO.ping [following]
--2016-11-15 05:18:02-- https://sso.example.com/idp/95wM4/resumeSAML20/idp/SSO.ping
Reusing existing connection to sso.example.com:443.
HTTP request sent, awaiting response... 401 Unauthorized
Username/Password Authentication Failed.
When using selenium2 (selenium-java:3.0.1) and phantomjs-2.1.1-linux-x86_64 with the following code
WebDriver webdriver = new PhantomJSDriver();
webdriver.get("https://www.example.com/details");
System.out.println(webdriver.getCurrentUrl());
System.out.println(webdriver.getTitle());
System.out.println(webdriver.getPageSource());
The output is:
Nov 15, 2016 5:36:07 AM org.openqa.selenium.phantomjs.PhantomJSDriverService <init>
INFO: executable: ./phantomjs-2.1.1-linux-x86_64/bin/phantomjs
Nov 15, 2016 5:36:07 AM org.openqa.selenium.phantomjs.PhantomJSDriverService <init>
INFO: port: 27571
Nov 15, 2016 5:36:07 AM org.openqa.selenium.phantomjs.PhantomJSDriverService <init>
INFO: arguments: [--webdriver=27571, --webdriver-logfile=./phantomjsdriver.log]
Nov 15, 2016 5:36:07 AM org.openqa.selenium.phantomjs.PhantomJSDriverService <init>
INFO: environment: {}
[INFO - 2016-11-15T13:36:07.904Z] GhostDriver - Main - running on port 27571
[INFO - 2016-11-15T13:36:08.113Z] Session [76cd8ec0-ab38-11e6-bceb-256426a0974e] - page.settings - {"XSSAuditingEnabled":false,"javascriptCanCloseWindows":true,"javascriptCanOpenWindows":true,"javascriptEnabled":true,"loadImages":true,"localToRemoteUrlAccessEnabled":false,"userAgent":"Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1","webSecurityEnabled":true}
[INFO - 2016-11-15T13:36:08.113Z] Session [76cd8ec0-ab38-11e6-bceb-256426a0974e] - page.customHeaders: - {}
[INFO - 2016-11-15T13:36:08.113Z] Session [76cd8ec0-ab38-11e6-bceb-256426a0974e] - Session.negotiatedCapabilities - {"browserName":"phantomjs","version":"2.1.1","driverName":"ghostdriver","driverVersion":"1.2.0","platform":"linux-unknown-64bit","javascriptEnabled":true,"takesScreenshot":true,"handlesAlerts":false,"databaseEnabled":false,"locationContextEnabled":false,"applicationCacheEnabled":false,"browserConnectionEnabled":false,"cssSelectorsEnabled":true,"webStorageEnabled":false,"rotatable":false,"acceptSslCerts":false,"nativeEvents":true,"proxy":{"proxyType":"direct"}}
[INFO - 2016-11-15T13:36:08.115Z] SessionManagerReqHand - _postNewSessionCommand - New Session Created: 76cd8ec0-ab38-11e6-bceb-256426a0974e
about:blank
<html><head></head><body></body></html>
So it seems it just opens an about:blank page and nothing more. Is there a way to input the username and password into the popup dialog and continue the access?
Use the following code for handling such errors; most of the time these are SSL handshake errors, and I suggest handling them like this:
DesiredCapabilities caps = new DesiredCapabilities();
caps.setJavascriptEnabled(true);
caps.setCapability("takesScreenshot", true);
caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, "<path-to-phantomjs>/phantomjs.exe");
// Handle SSL errors and disable logs. Note: calling setCapability with
// PHANTOMJS_CLI_ARGS twice would overwrite the first value, so pass both
// flags in a single array.
caps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
        new String[]{"--ignore-ssl-errors=true", "--webdriver-loglevel=NONE"});
driver = new PhantomJSDriver(caps);
If you are using WebDriverWait, I suggest disabling logs from PhantomJS.

Scrapy redirects to homepage for some urls

I am new to the Scrapy framework and am currently using it to extract articles from multiple 'Health & Wellness' websites. For some of the requests, Scrapy redirects to the homepage (this behavior is not observed in a browser). Below is an example:
Command:
scrapy shell "http://www.bornfitness.com/blog/page/10/"
Result:
2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-19 21:32:15+0530 [default] INFO: Spider opened
2015-06-19 21:32:15+0530 [default] DEBUG: Redirecting (301) to http://www.bornfitness.com/> from http://www.bornfitness.com/blog/page/10/>
2015-06-19 21:32:16+0530 [default] DEBUG: Crawled (200) http://www.bornfitness.com/> (referer: None)
Note that the page number in the URL (10) is a two-digit number. I don't see this issue with URLs with a single-digit page number (8, for example).
Result:
2015-06-19 21:43:15+0530 [default] INFO: Spider opened
2015-06-19 21:43:16+0530 [default] DEBUG: Crawled (200) http://www.bornfitness.com/blog/page/8/> (referer: None)
When you have trouble replicating browser behavior with Scrapy, you generally want to look at what is being communicated differently when your browser talks to the website compared with when your spider does. Remember that a website is (almost always) designed to interact with web browsers, not to be nice to web crawlers.
For your situation, if you look at the headers being sent with your scrapy request, you should see something like:
In [1]: request.headers
Out[1]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'gzip,deflate',
'Accept-Language': 'en',
'User-Agent': 'Scrapy/0.24.6 (+http://scrapy.org)'}
If you examine the headers sent by a request for the same page by your web browser, you might see something like:
Request Headers
GET /blog/page/10/ HTTP/1.1
Host: www.bornfitness.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36
DNT: 1
Referer: http://www.bornfitness.com/blog/page/11/
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Cookie: fealty_segment_registeronce=1; ... ... ...
Try changing the User-Agent in your request. This should allow you to get around the redirect.
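A minimal sketch of that fix, assuming Scrapy's standard settings mechanism; the User-Agent string is just an example copied from the browser headers above:

```python
# settings.py (or custom_settings on the spider): send a browser-like
# User-Agent instead of Scrapy's default "Scrapy/0.24.6 (+http://scrapy.org)".
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
)
```

Alternatively, the same header can be set per request via the headers argument of scrapy.Request if only some requests need it.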