I am using wget to download a URL that could be used on Linux, OS X, or Windows. My question is whether server behavior could be affected by the user-agent string (the -U option). According to this MS link, a web server can use this information to provide content that is tailored for your specific browser. According to the Apache docs (access control section), you can use these directives to deny access to a particular browser (User-Agent). So I am wondering whether I need to download the links with a different user agent for each OS, or whether one download would suffice.
Is this actually done in practice? I tried a bunch of servers but did not really see different behavior across user agents.
There are sites that prevent scraping by returning an error response when they detect you're hitting their servers with an automation tool instead of a browser, and the user agent is one of the signals used to detect that difference.
Other than that, not much useful can be said, since we don't know which sites you want to target, which HTTP server they run, or what code runs on top of it.
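If you want to check a specific URL yourself, one rough way to do it (sketched below in Python with the requests library; the URL and user-agent strings are just placeholders, and wget -U plus a diff of the outputs would work equally well) is to fetch the same resource with several User-Agent values and compare the bodies:

    import hashlib
    import requests

    URL = "https://example.com/some/file"  # placeholder: the link you plan to download

    USER_AGENTS = {
        "wget":    "Wget/1.21.3",
        "linux":   "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        "macos":   "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "windows": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    }

    for name, ua in USER_AGENTS.items():
        resp = requests.get(URL, headers={"User-Agent": ua})
        digest = hashlib.sha256(resp.content).hexdigest()[:12]
        # Identical status codes and digests across user agents suggest the server
        # does not vary this resource by User-Agent.
        print(f"{name:8} status={resp.status_code} bytes={len(resp.content)} sha256={digest}")

If the status codes and digests match across all of them, one download is almost certainly enough for that URL.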
My server has links to other servers, and I have a relationship with the managers of those servers. I want to be sure that links to PDF files make the client browser prompt the user to SAVE the file, rather than open it directly in the browser. I don't believe I need to change the HTTP headers on my own server; instead, I need to ask the admins of the associated servers to change THEIR HTTP headers to "allow cross origin" when they receive requests with my site as the referrer. Is this correct? It's not easy to find this answer: most results for this type of query say "go to your browser settings and change how PDFs are handled", but I need a solution where, apart from users who HAVE set their browser as their OS default PDF viewer, the PDF files download so they can be opened in a sophisticated and powerful PDF renderer.
I tried some experiments on two servers I have direct control over and it seemed to work, but now I need to engage with the other server admins, and I want to be sure I'm asking them to alter their HTTP header config without bothering them excessively: I don't want to run a lot of "experiments" with them; I want to be confident that what I'm asking them to do or change is correct.
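One lightweight way to see what a given partner server currently sends for a PDF URL, so the "experiments" stay on my side, is something like this (a Python sketch with requests; the URL is a placeholder), which prints the response headers commonly involved in display-versus-save behavior plus the cross-origin header mentioned above:

    import requests

    PDF_URL = "https://partner.example.org/docs/report.pdf"  # placeholder for a linked PDF

    # HEAD keeps the check lightweight; some servers only send full headers on GET.
    resp = requests.head(PDF_URL, allow_redirects=True)

    # Headers commonly involved in whether a browser displays a file or saves it,
    # plus the cross-origin header mentioned in the question.
    for header in ("Content-Type", "Content-Disposition", "Access-Control-Allow-Origin"):
        print(f"{header}: {resp.headers.get(header, '<not sent>')}")

That gives a concrete before/after snapshot to share with the other admins instead of running repeated live experiments with them.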
I can easily watch and embed any stream running on Ant Media Server with the help of the embed URL, but it also seems that anyone else with the stream information can use that URL on their websites too.
I tried using the CORS filter, but it seems a little complicated and didn't work.
How can I easily prevent my streams from being embedded on unauthorized websites/domains?
For workaround solutions in Ant Media Server (v2.4.3 or older versions), please check here.
In v2.5.0 and above, you can allow selected domains through a single property file to let them embed the iframe code.
To allow only specific domains to embed the iframe code, edit the /usr/local/antmedia/webapps/app-name/WEB-INF/red5-web.properties file and add the below setting.
settings.contentSecurityPolicyHeaderValue=frame-ancestors 'self' https://allow-domain-name;
If you would like to allow multiple domains, then it should be like this.
settings.contentSecurityPolicyHeaderValue=frame-ancestors 'self' https://domain1 https://domain2;
After making the changes, restart the server with sudo service antmedia restart.
'self' is required so the stream can still play on the AMS dashboard panel itself. With this setting, streams cannot be embedded via iframe code on any website other than the allowed domains.
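If you want to confirm the setting took effect after the restart, a quick check (sketched in Python; the host, app name, and stream id are placeholders) is to fetch the play page and look at the Content-Security-Policy response header:

    import requests

    # Placeholder embed/play URL for a stream on your Ant Media Server instance.
    EMBED_URL = "https://your-ams-host:5443/app-name/play.html?id=stream1"

    resp = requests.get(EMBED_URL)
    csp = resp.headers.get("Content-Security-Policy", "<header not present>")

    # Expect something like: frame-ancestors 'self' https://domain1 https://domain2
    print("Content-Security-Policy:", csp)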
I have been developing a crawling script for a number of news websites and using Scrapy to handle the logic.
When I run my script on an Ubuntu web server (Digital Ocean, if that helps), a lot of the websites that return 200 on my local machine return 417 instead.
I was wondering how I should fix this, if it is a problem at all. I'm actually not quite sure whether it is affecting the final output, but it seems like it has been.
Some of my own research has turned up:
http://www.checkupdown.com/status/E417.html. I've tried adding an Expect header to my requests, which hasn't worked
I've heard that it might be a problem with HTTP 1.1 vs 1.0? EDIT: Nope. Scrapy's HTTPDownloaderHandler automatically chooses 1.1 if it is available
417 (Expectation Failed) is the error a web server returns when it cannot meet the expectation carried in the request's Expect header, most commonly Expect: 100-continue.
This looks like a Scrapy bug or, more likely, a misconfiguration.
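To see what is actually going on, one debugging approach (just a sketch; the start URL is a placeholder) is to stop Scrapy from filtering out the 417 responses so you can log the status and the request headers that were actually sent:

    import scrapy

    class DebugSpider(scrapy.Spider):
        name = "debug_417"
        start_urls = ["https://news.example.com/"]  # placeholder for a failing site

        # By default Scrapy's HttpErrorMiddleware drops non-2xx responses before
        # parse() runs; listing 417 here lets us inspect them.
        handle_httpstatus_list = [417]

        def parse(self, response):
            self.logger.info("status=%s url=%s", response.status, response.url)
            self.logger.info("request headers sent: %s",
                             response.request.headers.to_unicode_dict())
            self.logger.info("response headers: %s",
                             response.headers.to_unicode_dict())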
It seems your public IP address was either already banned, or got banned while you were scraping, by the web server of the page you want to scrape. For the first situation you can reboot your instance to get a new public IP (at least this works on Amazon). For the second scenario, here are some tips from the official documentation to avoid it:
rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
use download delays (2 or higher). See DOWNLOAD_DELAY setting.
if possible, use Google cache to fetch pages, instead of hitting the sites directly
use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
Additionally, you can reduce the concurrent requests settings in your spider; that worked for me once. A minimal sketch combining a few of these ideas follows below.
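Here is that sketch, split into a tiny rotating user-agent middleware and the related settings (the user-agent strings, module path, and values are illustrative, not taken from the docs):

    # middlewares.py -- a tiny rotating user-agent downloader middleware
    import random

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    class RotateUserAgentMiddleware:
        def process_request(self, request, spider):
            # Pick a different well-known browser user agent for each request.
            request.headers["User-Agent"] = random.choice(USER_AGENTS)


    # settings.py -- illustrative values, tune per site
    COOKIES_ENABLED = False                 # some sites use cookies to spot bots
    DOWNLOAD_DELAY = 2                      # seconds between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 2      # reduce per-site concurrency
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.RotateUserAgentMiddleware": 400,  # hypothetical path
    }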
I want to capture all the network calls from Web Driver in Java. I am not doing any UI testing, just testing JS execution and the requests and responses of some network calls.
I tried using Browser Mob, as is suggested in most forums, but I need it to work across all browsers. It worked flawlessly with Firefox, but I was facing some issues with the others. The Safari driver doesn't even support a Proxy capability.
I don't want to use Fiddler, as it involves some manual steps around invoking it and storing the calls, whereas Browser Mob, being an in-code proxy, can be integrated in a smoother fashion.
I also tried using the RC-like package included in the Selenium standalone server package. But I have some HTTPS calls and some nested iframes across domains; I am particularly interested in a cross-domain POST call, and it doesn't work out that well. Also, people keep saying it's not recommended to use that package.
So I had an idea for a solution: use a standalone proxy server running on a machine. Using host entries, we'll point Web Driver at the proxy instead of the actual server. The proxy will record all the incoming calls and route them to the actual server host. Later, I can make a request to the proxy, which will return all the calls it intercepted. I am not sure whether that's still called a proxy or a router.
I came across TCPmon, but it's no longer being supported. Does anyone know some similar tools that could run on Unix systems or any alternate solutions?
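For what it's worth, mitmproxy is one Unix-friendly candidate for exactly this role (my suggestion, not something mentioned above): it can sit as the standalone recording proxy and be scripted with Python addons. A sketch of an addon that records every intercepted call to a JSON file (the output path is a placeholder) which can be parsed afterwards:

    # record_calls.py -- run with: mitmdump -s record_calls.py
    # Records every intercepted request/response pair so they can be retrieved later.
    import json

    OUTPUT = "captured_calls.json"  # placeholder output path
    calls = []

    def response(flow):
        # Called by mitmproxy once a response has been received for a request.
        calls.append({
            "method": flow.request.method,
            "url": flow.request.url,
            "status": flow.response.status_code,
            "request_headers": dict(flow.request.headers),
            "response_headers": dict(flow.response.headers),
        })
        with open(OUTPUT, "w") as fh:
            json.dump(calls, fh, indent=2)

The resulting JSON can then be deserialized on the Java side in whatever format you choose.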
We modified the Fiddler rules script to include a new exec action. If you use their native script editor, it also provides auto-complete features, and we were able to get around it comfortably. The syntax is similar to that of JavaScript.
The Fiddler package comes with an ExecActions.exe, which can be used to pass console arguments to a running Fiddler instance from the command prompt.
The code we wrote processed all the sessions captured by Fiddler and wrote them to a file in a custom JSON format; we later used GSON to deserialize it.
Please let me know, if you want further details.
I'm wondering how I can track down the culprit: what is NOT being transmitted over SSL on my website? It's blowing my mind, because I use relative URLs or explicitly specify https:// for all links, images, etc.
Any ideas/tools to find out what the issue is?
Thanks.
If you mean that some resources are transferred over HTTP without encryption, you can check for this in Chrome's Developer Tools under the Resources tab; that should tell you which parts come from where. Look for the ones whose addresses start with http://.
Alternatively, use Fiddler: by default it won't decrypt HTTPS connections, so you'll see CONNECT requests for HTTPS and GET/POST requests for plain HTTP; those are your culprits.
For those, like myself, who run into this issue, I suggest a few tips while designing your website.
Always use relative paths whenever possible, e.g. "images/someimage.png" instead of absolute paths like http://someDomainName/images/someimage.png, and so on. Any single one of these will cause the browser to throw that warning at you.
When linking to external content (Google/other ads, JavaScript sources such as jQuery, or any other media), make sure you use an https:// link if one is available. In my case, I had one tiny image for a link to an external site, but they did not offer an https link to the image, so I simply downloaded it and put it in my images folder. Problem solved.
The Chrome resources list is a very helpful tool; I'm not sure if Firefox has something similar in its toolbox. Another method, if you have shell/command-line access, is to use grep to search your files for "http:". This will most often show anything that links to non-secure content (a script version of this check is sketched below).
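If grep isn't available (say, on a Windows host), the same check is easy to script; a rough Python sketch that walks a web root (the path and extensions are placeholders) and flags lines containing http:// references:

    import os

    WEB_ROOT = "/var/www/mysite"  # placeholder: wherever your site files live
    EXTENSIONS = (".html", ".php", ".css", ".js")

    for dirpath, _, filenames in os.walk(WEB_ROOT):
        for name in filenames:
            if not name.endswith(EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="ignore") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if "http://" in line:
                        # Any hit here is a candidate for the mixed-content warning.
                        print(f"{path}:{lineno}: {line.strip()}")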