How does scrapy-splash filter duplicates? - scrapy

When using scrapy-splash library to render JS. We add its custom DUPEFILTER_CLASS to the settings.py file.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Seems like this is used to filter requests in order not to send much requests and speed up the process. But, what is the base for filtering requests when using scrapy-splash? is it the url?

Duplicates are detected using the splash_request_fingerprint function. From looking at the code and issue 900 (still open) , the url is taken into account, but you have the option of passing a meta parameter to the request if you want to differentiate it from some other request with the same url. But we have to look at scrapy.utils.request:request_fingerprint because this too is called.
What is part of the fingerprint:
the url of the request
the request method (source and keep_fragments is set to True)
the request's body
What's not part of the fingerprint:
the http request headers (since include_headers is None by default)
url fragments by default are not used to compute the fingerprint, unless request.meta.splash.args contains the key url
It's useful to follow issue 900 in order to keep up to date. In the later comments , some recipes and examples for using/customizing fingerprinting are starting to emerge.

Related

How to use a Postman Mock Server

I have followed the guide here to create a postman mock for a postman collection. The mock seem to be successfully created, but I have no idea how to use the mock service.
I've been given a url for the mock, but how do I specify one of my requests? If I issue a GET request to https://{{mockid}}.mock.pstmn.io I get the following response:
{
"error": {
"name": "mockRequestNotFoundError",
"message": "We were unable to find any matching requests for the mock path (i.e. undefined) in your collection."
}
}
According to the same guide mentioned above the following url to "run the mock" https://{{mockId}}.mock.pstmn.io/{{mockPath}} but what exactly is mockPath?
Within my collection I have plenty of folders, and inside one of these folders I have a request with an example response. How do I access this example response through the mock? Thanks for all help in advance!
Here's the Postman Pro API, which doesnt mention a lot more than just creating reading mocks.
I had the same issue seeing an irrelevant error but finally I found the solution. Unfortunately I cannot find a reference in Postman website. But here is my solution:
When you create a Mock server you define your first request (like GET api/v1/about). So the Mock server will be created but even when you obtain your API key and put it in the header of request (as x-api-key) it still returns an error. It doesn't make sense but it turned out that defining the request is not enough. For me it only started returning a response when I added an Example for the request.
So I suggest for each request that you create, also create at least one example. The request you send will be matched with the examples you have created and the matched response will be returned. You can define body, headers and the HTTP status code of the example response..
I have no Pro Postman subscription and it worked for me using my free subscription.
Menu for adding an example or selecting one of them for editing:
UI for defining the example (See body, headers and status) :
How to go back to the request page:
Here is the correct reply I get based on my example:
If you request in the example is a GET on api.domain.com/api/foo then the mockPath is /api/foo and your mock endpoint is a GET call to https://{{mockid}}.mock.pstmn.io/api/foo.
The HTTP request methods and the the pathname as shown in the image below constitute a mock.
For ease of use the mock server is designed to be used on top of collections. The request in the examples is used as is along with response attached to it. The name of the folder or collection is not a part of the pathname and is not factored in anywhere when using a mock. Mocking a collection means mocking all the examples in within your collection. An example is a tuple of request and response.
An optional response status code if specified lets you fetch the appropriate response for the same path. This can be specified with the x-mock-response-code header. So passing x-mock-response-code as 404 will return the example that matches the pathname and has a response with status code of 404.
Currently if there are examples with the same path but different domains, and mock is unable to distinguish between them it will deterministically return the first one.
Also if you have several examples for the same query :
Mock request accept another optional header, x-mock-response-code, which specifies which integer response code your returned response should match. For example, 500 will return only a 500 response. If this header is not provided, the closest match of any response code will be returned.
Optional headers like x-mock-response-name or x-mock-response-id allow you to further specify the exact response you want by the name or by the uid of the saved example respectively.
Here's the documentation for more details.
{{mockPath}} is simply the path for your request. You should start by adding an example for any of your requests.
Example:
Request: https://www.google.com/path/to/my/api
After adding your mock server, you can access your examples at:
https://{{mockId}}.mock.pstmn.io/path/to/my/api

jMeter issue when using Cookie manager and Regular expression extractor

So basically I need to extract an auth token from header response of 1st http request and then use the extracted data in 2nd (and all the following) http requests cookies.
The issue here is, that I have cookie manager set for the whole controller and instead of getting actual data I get the name of variable in my cookie ".authToken=${auth}".
I am guessing the reason is that the variable is not declared when the test reaches Cookie manager, but I would expect jmeter to be smart enough to declare the variable when it gets to the regular expression extractor.
Structure
Thread
Cache Manager
Cookie Manager (Cookie Policy:compatibility; Implementation:HC3)
Controller
Http Request
Regular expression extractor
Http request (I need to use value extracted above in Request Cookie here)
Http request (I need to use the same value in Request Cookie here)
Http request (I need to use the same value in Request Cookie here)
.....
Details:
All the http requests are recorded with implementation HttpClient3.1
Pretty sure I have everything configured correctly as in variable names, regular expression since it works in a very specific case:
The only time it seemed to work correctly was when I had Cookie manager inside the http request and disabled the 'main' Cookie manager (the one for the whole controller). Then it got extracted correctly, but that would be really silly workaround for such a basic requirement and also I have many http requests (over 100) where I need to use the extracted value.
Jmeter doesn't need to use the variable before it's declared by the regular expression extractor, I made sure that the domain is correct and it gets used for the first time after it should have been extracted.
Another workaround I thought of would be having separate threads, have them linked and send the variable in between them, launching the next one once the data gets extracted, but that seems a little bit too drastic.
What I tried:
Splitting http requests into 2 different controllers and using 2 different Cookie managers - got "${auth}" instead of some value
Defining user variable above controller and then using "Apply to: Jmeter Variable" option - again got just string "${auth}" instead of some value.
Moving the Cookie manager to a position after the http request which is used for the extraction - again "${auth}" instead of some value
Setting different cookie's policy (not all of them, but few)
Setting "CookieManager.save.cookies=true" in jmeter.properties (and still have on true)
Any help/ideas are appreciated. I have been trying to figure this out for about an hour and I think I must be missing something very simple.
Alright, finally got this resolved after roughly 2 hours.
Thanks to this article, I was able to do what I needed
https://capacitas.wordpress.com/2013/06/11/thats-the-way-the-cookie-crumbles-jmeter-style-part-2/
In nutshell: You need to use beanshell pre-processor and add the cookie manually
Here is the beanshell script in case the site dies:
import org.apache.jmeter.protocol.http.control.CookieManager;
import org.apache.jmeter.protocol.http.control.Cookie;
CookieManager manager = sampler.getCookieManager();
Cookie cookie = new Cookie("CookieName", vars.get("YourExtractedVariable"), "Domain", "Path", false, 0);
manager.add(cookie);

How to force dispatcher cache urls with get parameters

As I understood after reading these links:
How to find out what does dispatcher cache?
http://docs.adobe.com/docs/en/dispatcher.html
The Dispatcher always requests the document directly from the AEM instance in the following cases:
If the HTTP method is not GET. Other common methods are POST for form data and HEAD for the HTTP header.
If the request URI contains a question mark "?". This usually indicates a dynamic page, such as a search result, which does not need to be cached.
The file extension is missing. The web server needs the extension to determine the document type (the MIME-type).
The authentication header is set (this can be configured)
But I want to cache url with parameters.
If I once request myUrl/?p1=1&p2=2&p3=3
then next request to myUrl/?p1=1&p2=2&p3=3 must be served from dispatcher cache, but myUrl/?p1=1&p2=2&p3=3&newParam=newValue should served by CQ for the first time and from dispatcher cache for subsequent requests.
I think the config /ignoreUrlParams is what you are looking for. It can be used to white list the query parameters which are used to determine whether a page is cached / delivered from cache or not.
Check http://docs.adobe.com/docs/en/dispatcher/disp-config.html#Ignoring%20URL%20Parameters for details.
It's not possible to cache the requests that contain query string. Such calls are considered dynamic therefore it should not be expected to cache them.
On the other hand, if you are certain that such request should be cached cause your application/feature is query driven you can work on it this way.
Add Apache rewrite rule that will move the query string of given parameter to selector
(optional) Add a CQ filter that will recognize the selector and move it back to query string
The selector can be constructed in a way: key_value but that puts some constraints on what could be passed here.
You can do this with Apache rewrites BUT it would not be ideal practice. You'll be breaking the pattern that AEM uses.
Instead, use selectors and extensions. E.g. instead of server.com/mypage.html?somevalue=true, use:
server.com/mypage.myvalue-true.html
Most things you will need to do that would ever get cached will work this way just fine. If you give me more details about your requirements and what you are trying to achieve, I can help you perfect the solution.

Jmeter : How to test a website to render a page regardless of the content

I have a requirement where the site only needs to respond to the user within certain seconds, regardless of the contents.
Now there is a option in Jmeter in HTTP Proxy Server -> URL Patterns to exclude and then to start recording.
Here I can specify gif, css or other content to ignore. However before starting the recording I have to be aware of what are the various contents that are going to be there.
Is there any specific parameter to pass to Jmeter or any other tool which takes care about loading the page only and I can assert the response code of that page and no the other contents of the page are recorded.
Thanks.
Use the standard HTTP Request sampler with DISABLED (not checked) option Retrieve All Embedded Resources from HTML Files (set via sampler's control panel):
"It also lets you control whether or not JMeter parses HTML files for
images and other embedded resources and sends HTTP requests to
retrieve them."
NOTE: You may also define the same setting via HTTP Request Defaults.
NOTE: See also "Response size calculation" in the same HTTP Request article.
Add assertions to your http samplers:
Duration Assertion: to tests if response was received within a defined amount of time;
Response Assertion: to ensure that request was successfull,
e.g.
Response Field to Test = Response Code
Pattern Matching Rules = Equals
Patterns to Test = 200
You want to run test that would ignore resources after certain number of seconds?
I don't understand, what are you trying to accomplish by doing that?
Users will still receive those resources when they request your url, so your tests wont be accurate.
I don't mean any disrespect, but is it possible that you misunderstood the requirements?
I assume that the requirement is to load all the resources in certain number of seconds, not to cut off the ones that fail to fit in that time?

How to link request in ISAPI extension to response in ISAPI filter?

I'm building sort of a http sniffer for IIS6, for that I'm using both ISAPI filter and ISAPI extension.
Extension - to read the request.
Filter - to read the response.
The reason i'm using extension is that I don't want to force the user to change to IIS5 Compatibility Mode and therefore can't subscribe to SF_NOTIFY_READ_RAW_DATA.
The thing is, when I read the response, I want to link it to the request, so I need to give a unique identifier to the request, and use it when reading the response.
I have read that there used to be an option to call ServerSupportFunction with SF_REQ_GET_CONNID, but that's not supported in IIS6.
Also I have read that a possible solution is to append customer header and then remove it - that would probably work, but seems less elegant than I hoped to implement.
Is there any way to get the connection ID (connID in EXTENSION_CONTROL_BLOCK) in the filter?
appreciate your response,
Sagiv
I had the same problem a few months ago.
I did the following to solve the problem:
On HttpFilterProc (ISAPI Filter) I looked for notification SF_NOTIFY_PREPROC_HEADERS.
I then injected my own header with a GUID to the request.
On HttpExtensionProc (ISAPI Extension) I read my header and extract the GUID.
I then read the request content and connected it with the GUID.
On OnSendRawData (ISAPI Filter) I read the (chunked) response content and again connect it with the GUID.
This way I have both the request content (from the Extension) and the response content (from the filter) linked!