How to serialize and persist a scrapy Request for later use? - scrapy

I'm writing a downloader middleware able to reschedule any request to be recrawled n days later. To give you a rough idea, here is what the request to be rescheduled would look like:
Request(
    url,
    headers={...},
    meta={
        'schedule_recrawl_on': <timestamp>
    },
    dont_filter=False,
    callback=self.parse_item
)
My idea is to serialize the request with pickle, persist it somewhere, then have these requests deserialized and injected into the scheduler some time later.
However, serializing with pickle isn't straightforward because the object references an external method (callback=self.parse_item) that is defined on the spider class.
There is a warning about this in the docs but no clear solution.
Has anyone solved a similar issue? Maybe using another serialization approach?

You can use the Request.to_dict() method and the request_from_dict() function to serialize and deserialize Scrapy requests. Pass the spider as an argument when deserializing, so that the callback name can be resolved back to a spider method.
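As a minimal sketch (assuming Scrapy 2.6+, where these helpers live on Request and in scrapy.utils.request; request and spider stand for whatever objects your middleware has in hand):

import pickle

from scrapy.utils.request import request_from_dict

# Serialize: the callback is stored by name in the dict, so it pickles cleanly.
serialized = pickle.dumps(request.to_dict(spider=spider))

# ... persist `serialized` somewhere, then n days later ...

# Deserialize: passing the spider lets Scrapy resolve the callback name
# back to the bound method (e.g. self.parse_item).
restored = request_from_dict(pickle.loads(serialized), spider=spider)

The restored request can then be re-injected into the scheduler in whatever way your middleware plans to do that.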

Related

Request URI too long on Spartacus services

I've been trying to make use of the service.getNavigation() method, but apparently the request URI is too long, which causes this error:
Request-URI Too Long
The requested URL's length exceeds the capacity limit for this server.
Is there a spartacus config that can resolve this issue?
Or is this supposed to be handled in the cloud (ccv2) config?
Not sure which service you're talking about specifically or what data you're passing to it. For starters, please read this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/414
Additionally, it would benefit everyone if you could say something about the service you're using and the data you're trying to pass/get.
The navigation component fires a request for all componentIds. If you have a navigation with a lot of (root?) elements, the resulting HTTP GET request might exceed the maximum URL length allowed by the given client or server.
The initial implementation of loading components was actually done by a POST request, but the impression was that we would not need to support requests with so many components. I guess we were wrong.
Luckily, the legacy POST-based request is still in the code base: it's OccCmsComponentAdapter.findComponentsByIdsLegacy.
The easiest way for you to use this code is to provide a CustomOccCmsComponentAdapter that extends OccCmsComponentAdapter. Then you can override the findComponentsByIds method and simply call super.findComponentsByIdsLegacy, passing in a copy of the arguments.
A cleaner way would be to override the CmsComponentConnector and delegate the load directly to adapter.findComponentsByIdsLegacy. I would not start there, as it's more complicated; do a POC with the first suggested approach.
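A rough sketch of the first approach might look like this (the method and type signatures are assumptions based on the description above; check them against OccCmsComponentAdapter in your Spartacus version):

import { Injectable } from '@angular/core';
import { Observable } from 'rxjs';
import { CmsComponent, OccCmsComponentAdapter, PageContext } from '@spartacus/core';

@Injectable()
export class CustomOccCmsComponentAdapter extends OccCmsComponentAdapter {
  // Delegate to the legacy POST-based implementation to avoid overly long GET URLs.
  findComponentsByIds(ids: string[], pageContext: PageContext): Observable<CmsComponent[]> {
    return super.findComponentsByIdsLegacy([...ids], pageContext);
  }
}

You would then register it so it replaces the default adapter, e.g. with a provider along the lines of { provide: CmsComponentAdapter, useClass: CustomOccCmsComponentAdapter } (verify the exact provider token for your setup).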

Cloning a Request with the already downloaded response

From within a scraper's parse callback, I wish to clone a request along with its response object and change its callback.
The behavior I'm expecting is that this will generate a request whose callback is executed immediately, skipping the download step, since it already has the original response object.
Is it possible to put new requests into the queue without ending the current iteration in the callback?
Furthermore, is it possible to generate a new request object for other spiders within the crawler?
Just call the other callback directly with a copy of the response and yield what it produces; since no new request is scheduled, the download step is skipped:
def parse(self, response):
    # reuse the already downloaded response with a different callback
    yield from self.another_function(response.copy())

def another_function(self, response):
    # here comes your logic
    ...
Request-related data is available in response.request.

Accessing the built request details in Karate

Just like how the response information can be accessed through response, responseHeaders, etc., is there any way to access the request information? I noticed that request information is not available through variables. Are there any workarounds to access this information?
I understand that we build the request ourselves in the test scenario using the Given, When steps, so it may sound redundant. The reason I'm looking for this is that I would like to access the complete request details Karate would have built from our test definition. The idea is to make this information available to a Java class that can be called through the Java interop. More specifically, I'm trying to build a Swagger request and response validator to be used from Karate.
The workaround I am using is to explicitly create variables like apipath and apimethod and use them with path and method. This does the job, but one still has to ensure that these variables are explicitly set. It would be cleaner if whatever request Karate built were just accessible through a variable.
Please raise a feature request. We can look at making this available as karate.request or similar.

How is XHR a viable alternative to asynchronous module definition?

I'm learning about the case for asynchronous module definition (AMD) from here but am not quite clear about the following:
It is tempting to use XMLHttpRequest (XHR) to load the scripts. If XHR is used, then we can massage the text above -- we can do a regexp to find require() calls, make sure we load those scripts, then use eval() or script elements that have their body text set to the text of the script loaded via XHR.
XHR is using ajax or something to make a call to grab a resource from the database, correct? What do the eval() or script elements have to do with this? An example would be very helpful.
That part of RequireJS' documentation is explaining why using XHR rather than doing what RequireJS does is problematic.
XHR is using ajax or something to make a call to grab a resource from the database, correct?
XHR is what allows you to make an Ajax call. jQuery's $.ajax for instance creates an XHR instance for you and uses it to perform the query. How the server responds depends on how the server is designed. Most of the servers I've developed won't use a database to answer a request made to a URL that corresponds to a JavaScript file. The file is just read from the file system and sent back to the client.
What do the eval() or script elements have to do with this?
Once the request is over, what you have is a string that contains JavaScript. You've fetched the code of your module, but presumably you also want to execute it. eval is one way to do it, but it has the disadvantages mentioned in the documentation. Another way would be to create a script element whose body is the code you've fetched and then insert this script into the DOM, but this also has issues, as explained in the documentation you refer to.
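A rough illustration of the pattern the quoted passage describes (the URL is a placeholder, and there is no error handling):

var xhr = new XMLHttpRequest();
xhr.open('GET', '/scripts/my-module.js'); // placeholder URL for the module source
xhr.onload = function () {
  var code = xhr.responseText; // a string of JavaScript

  // Either of the following executes the fetched code; they are alternatives.

  // Option 1: eval the string (with the drawbacks mentioned in the docs)
  // eval(code);

  // Option 2: set the body text of a new script element and insert it into the DOM
  var script = document.createElement('script');
  script.text = code;
  document.head.appendChild(script);
};
xhr.send();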

No callback function in cross-domain JSON file

I'm trying to use cross-domain JSONP. I have done this before using the callback function in the JSON file from the other domain. I'm looking at an example JSON data file that Google uses in one of its tutorials:
http://earthquake.usgs.gov/earthquakes/feed/geojsonp/2.5/week -- here, obviously, the callback function is eqfeed_callback. In the JSON file I'm trying to use, there is no callback function that kicks everything off; there is just a bracket [. The file starts off like:
[{"Address":"4441 Van Nuys Blvd","City":"Sherman Oaks" ...
and ends like:
}]
What should I do? Is there another way to get at the data without a callback function? I can't edit this file; it's a service that I have a subscription to.
Thanks.
If it's not your server, and the server doesn't support JSONP, there's no way you can force it to return JSONP. You could try adding ?callback=callback to your URL to see if that convinces the server to wrap the data in a callback, but if it doesn't, you're out of luck.
Well, almost. There is actually a really dirty hack that you shouldn't use, which is to override JavaScript's standard Array constructor to assign the contents of the array to a global variable. But that's pretty hideous and I strongly advise against it.
Better ask the maintainer of the service if they're willing to support JSONP. Or better yet, add a CORS header.
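For reference, JSONP only works when the server wraps the data in a call to a function you control. If the service did honour a callback parameter, consuming it would look roughly like this (the URL and parameter name are placeholders, not the actual service):

// the function the JSONP response is expected to call
function handleListings(data) {
  console.log(data.length + ' listings received');
}

// load the feed via a script tag, asking the server to wrap the data
var script = document.createElement('script');
script.src = 'https://example.com/listings.json?callback=handleListings';
document.body.appendChild(script);

// the server would then have to respond with something like:
//   handleListings([{"Address":"4441 Van Nuys Blvd","City":"Sherman Oaks", ...}])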