I discovered yesterday that Scrapy respects the robots.txt file by default (ROBOTSTXT_OBEY = True).
If I request a URL with scrapy shell url and get a response, does that mean the URL is not protected by robots.txt?
According to the docs, it's enabled by default only when you create a project using the scrapy startproject command; otherwise it defaults to False.
https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots
To answer your question: yes, the scrapy shell command does respect the robots.txt configuration defined in settings.py. If ROBOTSTXT_OBEY = True, trying to use scrapy shell on a protected URL will return a None response.
You can also test it by passing the robots.txt setting via the command line:
scrapy shell https://www.netflix.com --set="ROBOTSTXT_OBEY=True"
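Independent of Scrapy, you can also sanity-check whether robots.txt blocks a given URL with Python's standard urllib.robotparser (a minimal sketch; the Netflix URL is just the example from above):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.netflix.com/robots.txt")
rp.read()

# True means robots.txt allows the given user agent to fetch the URL;
# False means the URL is disallowed ("protected") for that agent.
print(rp.can_fetch("*", "https://www.netflix.com/"))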
I have followed the steps below to upload a file in JMeter, but it didn't work. It throws "Sorry, an error occurred while trying to execute your request. Please try again". I have attached screenshots for more details.
Enabled Use multipart/form-data
Copied the file to be uploaded into the /bin directory
Tried with "Use multipart/form-data" both checked and unchecked, but no luck
In my HTTP request I pass action_id=1203 as a query parameter, and in the Form Parameters I pass the other parameters like msgId, fieldId, etc. But as you can see from the screenshot, when I execute the request all of my form parameters end up in the single "msgId" key, and I don't know why.
These are the headers I pass
My request with query and form parameters
The file upload tab of the HTTP request
After execution the request failed with this output; here it passes all form params in the single "msgId" key
F12 network capture of the web page's form parameters (checked manually on the web it works fine; the problem is in my JMeter request)
Just record the file upload using JMeter's HTTP(S) Test Script Recorder and it will generate the relevant HTTP Request sampler and HTTP Header Manager configuration, which can later be correlated/parameterized.
The only thing you need to do is copy the file you're uploading into the "bin" folder of your JMeter installation before recording. The file path can be changed to whatever you want afterwards.
Also, according to JMeter Best Practices, you should always be using the latest version of JMeter, so consider upgrading to JMeter 5.5 (or whatever the latest stable version available at the JMeter Downloads page is) as soon as possible.
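If you want to cross-check the request shape outside JMeter first (this is not the recorder approach, just a sanity check), a multipart upload where each form field travels as its own part looks roughly like this with Python's requests library; the endpoint, file name and values are placeholders, only action_id, msgId and fieldId come from the question:
import requests

url = "https://example.com/upload"              # placeholder endpoint
params = {"action_id": "1203"}                  # query string parameter
data = {"msgId": "12345", "fieldId": "678"}     # each form field is its own part
files = {"file": ("report.pdf", open("report.pdf", "rb"), "application/pdf")}

# requests builds the multipart/form-data body itself, one part per field/file
resp = requests.post(url, params=params, data=data, files=files)
print(resp.status_code)
print(resp.text[:200])
If this works but the recorded JMeter request doesn't, compare the generated sampler's query, form and file sections against these parts.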
I am following this tutorial https://about.gitlab.com/2016/04/11/tutorial-securing-your-gitlab-pages-with-tls-and-letsencrypt/
The instructions for the next step are:
Make sure your web server displays the following content at
http://YOURDOMAIN.org/.well-known/acme-challenge/5TBu788fW0tQ5EOwZMdu1Gv3e9C33gxjV58hVtWTbDM
before continuing:
5TBu788fW0tQ5EOwZMdu1Gv3e9C33gxjV58hVtWTbDM.ewlbSYgvIxVOqiP1lD2zeDKWBGEZMRfO_4kJyLRP_4U
#
# output omitted
#
Press ENTER to continue
The tutorial uses Jekyll, but I don't use a static HTML generator like Jekyll; the files are all static HTML. I created the exact path under the root folder: /.well-known/acme-challenge/PukY0bbiH3nRfciQ4IzwTDIXFn4G5sZ5I-LkMz3-KHE.html
But after the pipeline jobs are done, I am still getting a 404. What's the problem here?
I had the same problem yesterday and found the solution; I hope it is not too late to share it with you. According to this tutorial, the ".well-known" folder should be under the "public" folder.
Let's Encrypt needs to be able to access an HTML file at the following path using the browser:
http://YOURDOMAIN.org/.well-known/acme-challenge/5TBu788fW0tQ5EOwZMdu1Gv3e9C33gxjV58hVtWTbDM
To do this, you must create an "index.html" file at the path below inside your GitLab repository:
public/.well-known/acme-challenge/5TBu788fW0tQ5EOwZMdu1Gv3e9C33gxjV58hVtWTbDM/index.html
In the "index.html" file you should put only the following sentence:
5TBu788fW0tQ5EOwZMdu1Gv3e9C33gxjV58hVtWTbDM.ewlbSYgvIxVOqiP1lD2zeDKWBGEZMRfO_4kJyLRP_4U
Important: do not put any HTML tags, just the plain text above.
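Before pressing ENTER at the Let's Encrypt prompt, you can also double-check that GitLab Pages is actually serving the file, for example with a small Python check like this (a sketch only; YOURDOMAIN.org and the token are the example values from the tutorial output quoted above):
import urllib.request

token = "5TBu788fW0tQ5EOwZMdu1Gv3e9C33gxjV58hVtWTbDM"
expected = token + ".ewlbSYgvIxVOqiP1lD2zeDKWBGEZMRfO_4kJyLRP_4U"
url = "http://YOURDOMAIN.org/.well-known/acme-challenge/" + token

# Fetch the challenge URL and compare the body with what the client expects
with urllib.request.urlopen(url) as resp:
    body = resp.read().decode().strip()

print("challenge OK" if body == expected else "mismatch: " + repr(body))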
Then just continue following the tutorial. Good luck.
I have several projects running in scrapyd and they all use the same pipeline, so how can I add this pipeline to every scheduled spider by default, without adding anything to the curl request, only by having a flag in the default_scrapyd.conf file?
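For reference, a minimal sketch of what "the same pipeline" typically looks like when shared across Scrapy projects: a common module enabled through ITEM_PIPELINES in each project's settings.py. The module and class names below are hypothetical placeholders; whether default_scrapyd.conf can switch this on globally is exactly the open question here.
# shared_pipelines.py -- hypothetical module importable by every project
class CleanItemsPipeline:
    def process_item(self, item, spider):
        # common processing shared by all projects
        return item

# settings.py of each project -- enable the shared pipeline
ITEM_PIPELINES = {
    "shared_pipelines.CleanItemsPipeline": 300,
}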
I have a question about caching versions of imported files (via web requests) in Google Chrome:
Let's say I have script.js, whose URL is:
http://www.getscripts.com/script.js (the URL contents after "http://" are arbitrary, because Tampermonkey imports over the HTTP protocol)
If I import the script in Tampermonkey using @require, I want to use a query string for its version to avoid caching.
Caching versions:
Let's say that I first @require the 1st "version" of the script (I created it and inserted the initial content) by giving the @require a URL of http://www.getscripts.com/script.js?v=1, so I pass the version v=1 as a query string in the URL, and that the script file of version v=1 was not already cached.
I make some changes to the code of script.js, and the script that the URL provides also gets updated (I use surge.sh).
Then I change my @require URL to http://www.getscripts.com/script.js?v=2, so I pass the version v=2 as a query string in the URL.
Then I make some more changes to the code, make sure the URL serves the updated file, and give the @require my initial URL with v=1: http://www.getscripts.com/script.js?v=1
Question:
The script file that will be returned (via an HTTP request): will it be version 1 or version 2?
What I'm trying to do is force a download of the new version of my script file after I update the script's code, since Tampermonkey caches script files without re-downloading them unless something changes in the URL of the @require (which is what makes the HTTP request).
This was solved by forcing the browser to download a new version of the script by adding a version parameter to the script's URL, as suggested by wOxxOm above.
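For completeness, a throwaway deploy helper along those lines (purely illustrative, not part of Tampermonkey; the file name and the ?v= pattern are assumptions) that bumps the version parameter on the @require line so Tampermonkey sees a new URL and re-downloads the script:
import re
from pathlib import Path

USERSCRIPT = Path("main.user.js")   # hypothetical userscript file

text = USERSCRIPT.read_text()

def bump(match):
    # increment the number after "?v=" on the @require line
    return match.group(1) + str(int(match.group(2)) + 1)

new_text = re.sub(r"(@require\s+\S+\?v=)(\d+)", bump, text)
USERSCRIPT.write_text(new_text)
print("Bumped @require version.")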
I have a text file "data.txt", and based on input to an HTML form I want to display a single line from that file. My result is delivered by a CGI script which needs to access data.txt, but I don't want a user to be able to type "data.txt" into their web browser and see the whole file. Is there a simple way to make "data.txt" readable by the CGI script but not accessible by loading it in the browser?
I'm using standard Apache on Ubuntu. I believe the suexec module can do this, but I'm hoping for a simpler solution just using fancy permissions, chowns, etc. Thanks.
Store your data file outside of the web server's file tree (for Apache, check the DocumentRoot).
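For example, a minimal CGI sketch that reads the file from a location outside DocumentRoot; the path, query parameter name and line-number scheme below are assumptions, not your actual setup:
#!/usr/bin/env python3
# data.txt lives outside DocumentRoot, e.g. /srv/myapp/, so the browser can
# never request it directly; only this script (run by Apache) reads it.
import os
from urllib.parse import parse_qs

DATA_FILE = "/srv/myapp/data.txt"   # assumed location outside the web root

qs = parse_qs(os.environ.get("QUERY_STRING", ""))
line_no = int(qs.get("line", ["1"])[0])

with open(DATA_FILE) as f:
    lines = f.readlines()

print("Content-Type: text/plain")
print()
if 1 <= line_no <= len(lines):
    print(lines[line_no - 1].rstrip())
else:
    print("No such line")
The file only needs to be readable by the user Apache runs the CGI script as; since it is not under DocumentRoot, no URL maps to it.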