Saving cookies between scrapy scrapes

I'm collecting data from a site on a daily basis. Each day I run scrapy, and the first request always gets redirected to the site's homepage, seemingly because scrapy doesn't have any cookies set yet. After the first request, however, scrapy receives the cookie and from then on works just fine.
This, however, makes it very difficult for me to use tools like "scrapy view" with any particular URL, because the site will always redirect to the homepage, and that's what scrapy will open in my browser.
Can scrapy save the cookie so I can use it on all subsequent scrapes? Can I also use it with scrapy view, etc.?

There is no built-in mechanism to persist cookies between scrapy runs, but you can build it yourself (the source code below just demonstrates the idea and is not tested):
Step 1: Writing the cookies to a file.
Get the cookie from the 'Set-Cookie' response header in your parse function, then serialize it into a file.
Several ways to do this are explained here: Access session cookie in scrapy spiders
I prefer the direct approach:
# in your parse method ...
# get cookies from the 'Set-Cookie' response headers
# (scrapy headers are bytes, and a cookie value may itself contain '=')
raw = b";".join(response.headers.getlist('Set-Cookie')).decode('utf-8')
cookies = {}
for cookie in raw.split(";"):
    name, _, value = cookie.strip().partition("=")
    cookies[name] = value
# (cookie attributes such as Path will also end up in the dict; filter them out if needed)
# serialize cookies
# ...
Ideally, this should be done with the last response your scraper receives. Serialize the cookies that come with each response into the same file, overwriting the cookies you serialized while processing previous responses.
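A minimal sketch of the serialization step, using JSON and a hypothetical cookies.json path:

import json

def save_cookies(cookies, path="cookies.json"):
    # overwrite the previous snapshot with the latest cookies
    with open(path, "w") as f:
        json.dump(cookies, f)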
Step 2: Reading and using cookies from file
To use the cookies after loading them from the file, you just have to pass them into the first Request you make as the 'cookies' parameter:
def start_requests(self):
    old_cookies = deserialize_cookies(xyz)  # load the cookies saved by the previous run
    return [Request(url, cookies=old_cookies, ...)]
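And a matching sketch for the loader, again assuming the hypothetical cookies.json path from above:

import json

def deserialize_cookies(path="cookies.json"):
    # read back the dict written by save_cookies
    with open(path) as f:
        return json.load(f)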

Related

Vue: How do I make the <a> tag carry the request header?

How do I make the tag carry the request header? I use the <a> tag to download, and I need to carry a token in the request header.
When you use an <a> tag to download files or link to any document, it is generally not possible to add extra headers; browsers will only send the typical ones. To solve this problem, the following are alternative solutions.
Your token must be a query parameter in the URL so that the back-end server can read it.
Or you can use cookies to save the token, and the browser will ensure that the cookies are sent with your request automatically. (For security, ensure that your cookie is HttpOnly and rejects cross-origin requests.)
Alternatively, if you are not really downloading the file but simply trying to show it in the browser, you can use XHR or fetch, where you are free to manipulate headers.
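To illustrate the cookie option, here is a minimal sketch, assuming a Flask back-end and a hypothetical /login route, of storing the token in an HttpOnly cookie so the browser attaches it to the <a> download request automatically:

from flask import Flask, make_response

app = Flask(__name__)

@app.route("/login")
def login():
    resp = make_response("logged in")
    # HttpOnly hides the token from page scripts;
    # SameSite=Strict keeps it off cross-site requests
    resp.set_cookie("token", "<your-token>", httponly=True, samesite="Strict")
    return resp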

How to run multiple http requests on single click in IntelliJ http file instead of running them individually

I have 10 http requests in http file and want to run them on single click instead of clicking each request and see the output. Subsequent requests use output from previous requests so I want to run them serially in automated way.
At the moment, IDEA's text-based HTTP client supports variable substitution and allows you to write some simple JavaScript to access the response.
All you have to do is create an .http file with all your requests defined in sequence. Then, after each request, you can add a JavaScript block to fill the variables that the next request(s) require. Example:
### getting json and setting a variable retrieved from the response body
GET http://httpbin.org/json

> {%
client.global.set("title", response.body["slideshow"]["title"]);
%}

### making a request using the previously set variable as the body
POST http://httpbin.org/anything
Content-Type: text/plain

{{title}}
The next step is running all requests at once, which IDEA supports with a run action that executes all requests in the file in order.
You could use "Run/Debug Configurations": create a few separate files and set up their order in the configuration window. But this quickly becomes a big task, and you cannot add waits or set up timeouts for requests.
I guess you should use JMeter for real work.
You could also try setting up simple curl requests on a Linux-based OS.
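If you want the same serial flow fully automated outside the IDE, here is a minimal sketch using Python's requests library (my assumption, not part of IDEA's tooling), chaining the two httpbin calls from the example above:

import requests

# first request: fetch JSON and pull out the value the next request needs
first = requests.get("http://httpbin.org/json")
title = first.json()["slideshow"]["title"]

# second request: use that value as the plain-text body
second = requests.post(
    "http://httpbin.org/anything",
    data=title,
    headers={"Content-Type": "text/plain"},
)
print(second.status_code)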

URL Redirection in import.io

Hi, I am working on the URL http://www.goodtoknow.co.uk/recipes/healthy?page=1&cost_range=any&total_time=any&skill_level=any&tags%5B0%5D=Healthy&tags%5B1%5D=Healthy and creating an extractor. But the URL gets automatically redirected to http://www.goodtoknow.co.uk/recipes/healthy in import.io. I want to create an extractor for the first URL. Is that possible? Is it happening because the page requires cookies, which import.io does not support?
If you examine the network requests using Chrome or any other web debugger, you can see that the website is calling a second URL for the recipe data:
http://www.goodtoknow.co.uk/recipes/search?q=&page=1&cost_range=any&total_time=any&skill_level=any&tags[0]=Healthy&tags[1]=Healthy&_=1458727079183
This URL does not redirect without cookies, and you can set the page number manually.
Try training the extractor on this URL and see if it avoids the redirect.
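As a quick sanity check outside import.io, a minimal sketch, assuming Python's requests library, that pages through the endpoint manually:

import requests

# the search endpoint found in the network debugger, with the page number templated
# (the trailing "_" timestamp in the original URL is a cache-buster and is omitted here)
base = ("http://www.goodtoknow.co.uk/recipes/search"
        "?q=&page={page}&cost_range=any&total_time=any"
        "&skill_level=any&tags[0]=Healthy&tags[1]=Healthy")

for page in range(1, 4):
    resp = requests.get(base.format(page=page))
    print(page, resp.status_code)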

Capture the URL in the javascript initiated HTTP GET/POST requests using selenium

I'm using selenium for automating a procedure I frequently use on a site.
When I click a specific element on the site, it runs some complex JavaScript code that eventually downloads a CSV file using an HTTP GET request.
I can see that the URL of this request looks like: www.somesite.com/somepage.php?token=RAPO09834HROLQ340HGIE309W&....
My question is: how can I get the token in this URL with selenium? (I need it for executing other HTTP GET requests to extract more data from the site.)
I am using the Firefox driver on Windows.
I tried searching all the HTML, JS, and cookies I get from this site; the token is not there. (It's probably generated by the JavaScript code before it makes the HTTP request.)
I understand this is some kind of session ID token, as all the JavaScript-generated HTTP requests use the same token during my session.

Expires cache for same script across multiple pages

I have a script that is used across multiple pages on my site. I want to set the expires header so that browsers cache it and it doesn't get downloaded every time. That's ok and I understand how to do that, but I don't quite know how the browser works.
Does the browser cache it according to its path, and is it then smart enough to know that any page requesting the script should use the cached version? Or is there an association between the script and the page, so that it would have to be cached against each page?
In the browser cache, there is no connection between the cached URL and the requesting page. Browser cache keys contain the path and sometimes the query string (see "Is it the filename or the whole URL used as a key in browser caches?").
That's why Google recommends using their Libraries API: if every page that requires a specific version of jQuery pointed the browser to fetch the library from Google, the browser would fetch it only once for www.xyz.com and then re-use it from its cache for www.abc.com.
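For the expires part, a minimal sketch, assuming a Flask server and a hypothetical app.js, of serving the script with a long-lived cache header; every page that references this URL shares the single cache entry:

from flask import Flask, make_response, send_file

app = Flask(__name__)

@app.route("/static/app.js")
def script():
    resp = make_response(send_file("app.js"))
    # one cache entry, keyed by this URL, reused by every page that references it
    resp.headers["Cache-Control"] = "public, max-age=31536000"
    return resp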