From within a scraper's parse callback, I want to clone a request along with its response object and change its callback.
The behavior I'm expecting is that this generates a new request whose callback is executed without the download step, since it already has the original response object.
Is it possible to put new requests into the queue without ending the current iteration in the callback?
Furthermore, is it possible to generate a new request object for other spiders within the crawler?
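Just call the other callback directly with a copy of the response:
    new_response = response.copy()
    # another_function must be a generator (or return an iterable)
    yield from self.another_function(new_response)

    def another_function(self, response):
        # your logic goes here
        ...
The request-related data is available in response.request.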
I am learning Karate DSL in order to determine if it is a viable automation solution for our new API.
We have a unique environment in which we have a REST API to test, but also use REST services to perform other required actions while the original request is awaiting a response. The REST calls perform robotic actions to manipulate hardware, query our servers, etc.
We need the ability to send a REST request, perform various other REST requests (with assertions) while awaiting a response to the first request. Then, finally assert that the original request gets the correct response payload based on the actions performed between the first request and its response.
Rough example:
Feature: Async test

  Background:
    * def defaultAssertion = { success: true }
    Given url 'http://foo/'

  Scenario: Foo test
    Given path 'start'                 <- start long running call
    When method get
    And request { externalId: 'id1'}
    Given path 'robot-action'          <- perform another call that resolves immediately
    When method get
    Then status 200
    * match response contains deep defaultAssertion
    Then status 200                    <- somehow assert on first requests' response
    * match response contains deep defaultAssertion
Obviously the example above does not work, but I am hoping we can structure our tests similarly.
I know tests can run in parallel, but I am not sure how to encapsulate them as "one test" vs "multiple tests" and control order (which is required for this to work properly).
There is documentation on Async behavior, but I found it difficult to follow. If anyone can provide more context on how to implement the example I would greatly appreciate it.
Any suggestions would be warmly welcomed and examples would be fantastic. Thanks all!
Actually being able to wait for an HTTP call to complete and do other things in the meantime is something Karate cannot do by default. This question has me stumped and it has never been asked for before.
The only way I think we can achieve it in Karate is to create a thread and then run a feature from there. There is a Java API to call a feature, but when you are doing all that, maybe you are better off using some hand-crafted Java code to make an HTTP request. Then your case aligns well with the example mentioned here: https://twitter.com/getkarate/status/1417023536082812935
So to summarize, here is what I think is the best plan of action:
- For this "special" REST call, be prepared to write Java code to make that HTTP call on a separate thread.
- Call that Java code at the start of your Karate test (using Java interop).
  - You can pass the karate JS object instance
  - so that you can call karate.listen() when the long-running job is done.
  - Actually, instead of the above, just use a CompletableFuture and a Java method call (just like the example linked above).
- Now that step won't block and you can do anything you want.
- After you have done all the other work, use the listen keyword or call a Java method on the helper you used at the start of your test (just like the linked example).
That should be it ! If you make a good case for it, we can build some of this into Karate, and I will think over it as well.
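The shape of that plan (start the slow call without blocking, do the other work, then collect the first result) can be sketched outside Karate. This is only a Python analogue that uses concurrent.futures in place of Java's CompletableFuture; slow_call and robot_action are hypothetical stand-ins for the real HTTP calls, and none of it is Karate API.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_call(external_id):
    # Hypothetical stand-in for the long-running REST request.
    time.sleep(0.2)
    return {"success": True, "externalId": external_id}


def robot_action():
    # Hypothetical stand-in for a call that resolves immediately.
    return {"success": True}


def run_async_test():
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the long-running call on a separate thread; this does not block.
        pending = pool.submit(slow_call, "id1")

        # Do the intermediate work and assert on it while the first call runs.
        intermediate = robot_action()
        assert intermediate["success"] is True

        # Finally wait (with a timeout) for the first call and return its payload.
        return pending.result(timeout=5)
```

In the Karate case the equivalent helper would be written in Java and called through Java interop, as described above.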
I'm writing a downloader middleware able to reschedule any request to be recrawled n days later. To give you a rough idea, here is what the request to be rescheduled would look like:
Request(
    url,
    headers={...},
    meta={
        'schedule_recrawl_on': <timestamp>
    },
    dont_filter=False,
    callback=self.parse_item
)
My idea is to serialize the request with pickle, persist it somewhere, then have these requests deserialized and injected into the scheduler some time later.
However, serializing with pickle isn't easy because the object references a method, callback=self.parse_item, which is defined on the spider class.
There is a warning about this in the docs but no clear solution.
Has anyone solved a similar issue? Maybe using another serialization approach?
You can use the Request.to_dict() method and the request_from_dict() function to serialize and deserialize Scrapy requests. You pass the spider as an argument when deserializing.
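To see why passing the spider solves the pickling problem, here is a minimal sketch of the underlying idea: store the callback by name and resolve it on the spider again when loading. The Request and Spider classes and the to_dict/from_dict helpers below are simplified stand-ins written for illustration, not Scrapy's own implementation, though as far as I understand Scrapy works on a similar principle.

```python
import json


class Request:
    # Minimal stand-in for a request object; NOT Scrapy's Request class.
    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class Spider:
    # Hypothetical spider that owns the callback.
    def parse_item(self, response):
        return {"parsed": response}


def to_dict(request, spider):
    # Replace the bound method with its name, checking it belongs to the spider.
    name = None
    if request.callback is not None:
        name = request.callback.__name__
        if getattr(spider, name, None) != request.callback:
            raise ValueError(f"{name!r} is not a method of the spider")
    return {"url": request.url, "callback": name, "meta": request.meta}


def from_dict(data, spider):
    # Resolve the stored name back into a bound method on the given spider.
    callback = getattr(spider, data["callback"]) if data["callback"] else None
    return Request(data["url"], callback=callback, meta=data["meta"])


def round_trip(request, spider):
    # The dict holds only plain data, so it survives JSON (or pickle).
    return from_dict(json.loads(json.dumps(to_dict(request, spider))), spider)
```

The resulting dict contains no reference to the spider instance, so it can be persisted and turned back into a request n days later by whichever spider instance is running then.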
I'm improving a spider I wrote a few months ago. I'm trying to make it smarter so that it downloads only new information from the website. To that end I am adding code to the downloader middleware to check whether a URL ID has already been visited. Besides the URL, which I can get fairly easily with request.url, I need to pass an Item from the spider: the Date Last Updated.
The idea is to compare both values (URL and Date Last Updated) with the ones in the database (a regular CSV file): if both are the same, drop the request; if both are missing or the Last Updated date doesn't match, proceed with the request.
The problem is that I don't know how to pass the Item from the spider to the middleware. I can see that in the Pipelines module an object is passed to the class; I tried to add it to the middleware class, but it doesn't work.
Any ideas how to pass an Item or any other variable from the Spider to the Middleware module?
Usually you pass any additional info in the request meta, either as request.meta['my_thing'] = ... or as an argument, yield Request(url, meta={'my_thing': ...}), which all middlewares further along the chain will be able to access. For your case, however, I'd recommend using either Scrapy's built-in cache middleware with the dummy policy or one of these two modules, which do exactly what you have in mind:
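If you do want to roll your own check, the comparison from the question can be sketched as two small functions. The CSV layout (url,last_updated per row) and the function names are assumptions made for illustration. Inside a downloader middleware's process_request you would read the date from request.meta (for example a hypothetical request.meta['last_updated'] key) and raise scrapy.exceptions.IgnoreRequest when should_drop returns True.

```python
import csv
import io


def load_seen(csv_text):
    # Parse "url,last_updated" rows into a {url: last_updated} lookup.
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def should_drop(url, last_updated, seen):
    # Drop only when the URL is known AND its stored date matches;
    # unknown URLs and changed dates go through.
    return url in seen and seen[url] == last_updated
```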
https://github.com/TeamHG-Memex/scrapy-crawl-once
https://github.com/scrapy-plugins/scrapy-deltafetch
I have followed the guide here to create a Postman mock for a Postman collection. The mock seems to have been created successfully, but I have no idea how to use the mock service.
I've been given a url for the mock, but how do I specify one of my requests? If I issue a GET request to https://{{mockid}}.mock.pstmn.io I get the following response:
{
  "error": {
    "name": "mockRequestNotFoundError",
    "message": "We were unable to find any matching requests for the mock path (i.e. undefined) in your collection."
  }
}
According to the same guide mentioned above, the URL to "run the mock" is https://{{mockId}}.mock.pstmn.io/{{mockPath}}, but what exactly is mockPath?
Within my collection I have plenty of folders, and inside one of these folders I have a request with an example response. How do I access this example response through the mock? Thanks for all help in advance!
Here's the Postman Pro API, which doesn't mention much more than creating and reading mocks.
I had the same issue seeing an irrelevant error but finally I found the solution. Unfortunately I cannot find a reference in Postman website. But here is my solution:
When you create a Mock server you define your first request (like GET api/v1/about). So the Mock server will be created but even when you obtain your API key and put it in the header of request (as x-api-key) it still returns an error. It doesn't make sense but it turned out that defining the request is not enough. For me it only started returning a response when I added an Example for the request.
So I suggest that for each request you create, you also create at least one example. The request you send will be matched against the examples you have created and the matched response will be returned. You can define the body, headers and HTTP status code of the example response.
I have no Pro Postman subscription and it worked for me using my free subscription.
[Screenshot: menu for adding an example or selecting one of them for editing]
[Screenshot: UI for defining the example (see body, headers and status)]
[Screenshot: how to go back to the request page]
[Screenshot: the correct reply I get based on my example]
If your request in the example is a GET on api.domain.com/api/foo, then the mockPath is /api/foo and your mock endpoint is a GET call to https://{{mockid}}.mock.pstmn.io/api/foo.
The HTTP request method and the pathname constitute a mock.
For ease of use, the mock server is designed to be used on top of collections. The request in each example is used as is, along with the response attached to it. The name of the folder or collection is not part of the pathname and is not factored in anywhere when using a mock. Mocking a collection means mocking all the examples within your collection. An example is a tuple of request and response.
An optional response status code if specified lets you fetch the appropriate response for the same path. This can be specified with the x-mock-response-code header. So passing x-mock-response-code as 404 will return the example that matches the pathname and has a response with status code of 404.
Currently if there are examples with the same path but different domains, and mock is unable to distinguish between them it will deterministically return the first one.
Also, if you have several examples for the same query:
Mock requests accept another optional header, x-mock-response-code, which specifies which integer response code your returned response should match. For example, 500 will return only a 500 response. If this header is not provided, the closest match of any response code will be returned.
Optional headers like x-mock-response-name or x-mock-response-id allow you to further specify the exact response you want by the name or by the uid of the saved example respectively.
Here's the documentation for more details.
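The matching rules described above can be illustrated with a small sketch. This is a simplified model for illustration only, not Postman's actual matching algorithm: the match_example function and the dictionary layout of an example are assumptions, and when no header is given it simply returns the first example with the same method and path.

```python
def match_example(examples, method, path, headers=None):
    headers = headers or {}
    # Folder and collection names play no part: only method and path are compared.
    candidates = [e for e in examples if e["method"] == method and e["path"] == path]

    # Optional headers narrow the candidates further.
    code = headers.get("x-mock-response-code")
    if code is not None:
        candidates = [e for e in candidates if e["status"] == int(code)]
    name = headers.get("x-mock-response-name")
    if name is not None:
        candidates = [e for e in candidates if e["name"] == name]

    # Simplification: return the first remaining example, or None if nothing matches.
    return candidates[0] if candidates else None
```

A None result here corresponds to the mockRequestNotFoundError case: no saved example has the requested method and path.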
{{mockPath}} is simply the path for your request. You should start by adding an example for any of your requests.
Example:
Request: https://www.google.com/path/to/my/api
After adding your mock server, you can access your examples at:
https://{{mockId}}.mock.pstmn.io/path/to/my/api
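The mapping from an example's request URL to the mock URL can be written as a short helper. This is a sketch under assumptions: mock_url is a hypothetical name, the mock id passed in is a placeholder, and the example URL is expected to include its scheme so that the path can be parsed out reliably.

```python
from urllib.parse import urlsplit


def mock_url(mock_id, example_url):
    # Keep only the path of the example request; the original host is discarded.
    path = urlsplit(example_url).path or "/"
    return f"https://{mock_id}.mock.pstmn.io{path}"
```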
I am attempting to get the parameters for a POST request sent via the AFNetworking pod. However, I can't seem to get them. I am looping through the active operations with the below:
for (AFHTTPRequestOperation *operation in manager.operationQueue.operations) {
    NSLog(@"%@", operation.request.URL.path);
}
However, I can't get the parameters. I've tried using operation.request.URL.parameterString, but since it is POST, the string is null. Anyone know how to get these? I'd like to collect them so that I can cancel requests that are specific to the path and parameters sent, ensuring I'll get down to just the single request I need to cancel.
I ended up going about this a different way. I created a class that handled all the AFNetworking path calls. In this class is a dictionary which stored an integer id for the call and the operation. As the operations complete or fail, they are removed. The integer id is passed back to the calling object, allowing for unique access for canceling requests or polling.
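The bookkeeping described above can be sketched as a small registry. It is written in Python purely to show the data structure; OperationRegistry and its method names are hypothetical, and in the real app the stored values would be the AFNetworking operations, which are cancelled with their own cancel method.

```python
import itertools
import threading


class OperationRegistry:
    def __init__(self):
        self._ids = itertools.count(1)
        self._operations = {}
        self._lock = threading.Lock()

    def add(self, operation):
        # Store the operation and hand its integer id back to the caller.
        with self._lock:
            op_id = next(self._ids)
            self._operations[op_id] = operation
        return op_id

    def finished(self, op_id):
        # Call from the success/failure handlers to forget the operation.
        with self._lock:
            self._operations.pop(op_id, None)

    def cancel(self, op_id):
        # Cancel exactly one operation; returns False if it is already gone.
        with self._lock:
            operation = self._operations.pop(op_id, None)
        if operation is None:
            return False
        operation.cancel()
        return True

    def is_active(self, op_id):
        # Lets the calling object poll whether its request is still in flight.
        with self._lock:
            return op_id in self._operations
```

Because each caller holds a unique id, there is no need to compare paths or POST parameters to find the one request to cancel.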