How to add a url suffix before performing a callback in scrapy - scrapy

I have a crawler that works just fine in collecting the urls I am interested in. However, before retrieving the content of these urls (i.e. the ones that satisfy rule no 3), I would like to update them, i.e. add a suffix - say '/fullspecs' - on the right-hand side. That means that, in fact, I would like to retrieve and further process - through callback function - only the updated ones. How can I do that?
rules = (
    Rule(LinkExtractor(allow=('something1'))),
    Rule(LinkExtractor(allow=('something2'))),
    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')), callback='parse_archive'),
)

You can set the process_value parameter of the link extractor (it is an argument of LinkExtractor, not of Rule) to lambda x: x + '/fullspecs', or to a named function if you want to do something more complex.
You'd end up with:
Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                   process_value=lambda x: x + '/fullspecs'),
     callback='parse_archive')
See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor
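If the lambda becomes too limited, a named function can be passed as process_value to the link extractor instead. The sketch below is one possible version, using only the standard library; the name add_suffix and its handling of trailing slashes and query strings are assumptions for illustration, not part of the original answer.

```python
from urllib.parse import urlsplit, urlunsplit

SUFFIX = "/fullspecs"  # the suffix from the question


def add_suffix(url, suffix=SUFFIX):
    """Append `suffix` to the URL path, keeping query string and fragment.

    Hypothetical helper: it is meant to be handed to the link extractor
    as process_value=add_suffix.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")      # avoid '//fullspecs'
    if not path.endswith(suffix):      # do not add the suffix twice
        path += suffix
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

It would then be wired in as LinkExtractor(allow=('something3'), deny=('something4', 'something5'), process_value=add_suffix).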

Related

How can I access value in sequence type?

There are the following attributes in client_output
weights_delta = attr.ib()
client_weight = attr.ib()
model_output = attr.ib()
client_loss = attr.ib()
After that, I turned client_output into a sequence with
a = tff.federated_collect(client_output) and round_model_delta = tff.federated_map(selecting_fn, a), and I declared
@tff.tf_computation() # append
def selecting_fn(a):
    #TODO
    return round_model_delta
When averaging on the server, I want to average weights_delta over only those clients that have a small loss value. So I tried to access it via a.weights_delta, but it doesn't work.
tff.federated_collect returns a tff.SequenceType placed at tff.SERVER, which you can manipulate the same way a client dataset is usually handled inside a method decorated with tff.tf_computation.
Note that you have to use the tff.federated_collect operator in the scope of a tff.federated_computation. What you probably want to do[*] is pass it into a tff.tf_computation, using the tff.federated_map operator. Once inside the tff.tf_computation, you can think of it as a tf.data.Dataset object and everything in the tf.data module is available.
[*] I am guessing. A more detailed explanation of what you would like to achieve would be helpful.
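The selection logic itself, independent of TFF, might look like the plain-Python sketch below. It is not TFF code: ClientOutput here is a simplified stand-in for the attrs class in the question, and average_low_loss_deltas and the cut-off k are hypothetical names. Inside a tff.tf_computation the same steps would have to be expressed with tf.data operations on the sequence.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClientOutput:
    # Simplified stand-in for the question's client_output attributes.
    weights_delta: List[float]
    client_weight: float
    client_loss: float


def average_low_loss_deltas(outputs: Sequence[ClientOutput], k: int) -> List[float]:
    """Weighted average of weights_delta over the k clients with the smallest loss."""
    selected = sorted(outputs, key=lambda o: o.client_loss)[:k]
    total_weight = sum(o.client_weight for o in selected)
    dim = len(selected[0].weights_delta)
    return [
        sum(o.client_weight * o.weights_delta[i] for o in selected) / total_weight
        for i in range(dim)
    ]
```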

Query parameter handling in karate framework

Is there an easy way to handle a huge set of query params like the one below? I would also like to know how I can parameterise some of the values at run time.
http://154.213.196.243:7941/v1/banking/Jumio/callback?callBackType=NetVerifyId&jumioIdScanReference=123abcde-1244-8571-3454-abcd12345567&merchantIdScanReference=66a9ff2e-d8ec-e811-a956-000d3ab3f117&verificationStatus=APPROVED_VERIFIED&idScanStatus=SUCCESS&id+ScanSource=API&idCheckDataPositions=OK&idCheckDocumentValidation=OK&idCheckHologram=OK&idCheckMRZcode=OK&idCheckMicroprint=OK&idCheckSecurityFeatures=OK&idCheckSignature=OK&transactionDate=2018-11-20T20%3A53%3A25.797Z&callbackDate=2018-11-20T20%3A53%3A25.797Z&idType=DRIVING_LICENSE&idCountry=GBR&idScanImage+=https%3A%2F%2Fnetverify.com%2Frecognition%2Fv1%2Fidscan%2F123abcde-1244-8571-3454-abcd12345567%2Ffront&idFirstName=ILARIA&idLastName=FURS&idDob=1976-12-23&idExpiry=2025-12-31&personalNumber=123456789&clientIp=xxx.xxx.xxx.xxx&idAddress=%7B%22country%22%3A%22USA%22%2C%20%22stateCode%22%3A%22US-OH%22%7D&idNumber=P12345&idStatus=TESTER961260SS9DL54&identityVerification=%7B%22similarity%22%3A%22MATCH%22%2C%22validity%22%3Atrue%7D HTTP/1.1
Yes. Read the docs: https://github.com/intuit/karate#param
For example:
* param callBackType = 'NetVerifyId'
and so on. Also look at params, where you can set up all keys as one single JSON and parameterize them if needed; there are multiple possibilities: https://github.com/intuit/karate#params
See this example as well: dynamic-params.feature
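The idea behind params, one key/value object with selected values overridden at run time, can be illustrated outside Karate as well. The following is a Python analogy, not Karate syntax; BASE_PARAMS holds only a handful of the keys from the URL above, and build_callback_url and the localhost base URL are hypothetical.

```python
from urllib.parse import urlencode

# A few of the fixed keys from the question's URL, kept as one mapping.
BASE_PARAMS = {
    "callBackType": "NetVerifyId",
    "verificationStatus": "APPROVED_VERIFIED",
    "idScanStatus": "SUCCESS",
    "idType": "DRIVING_LICENSE",
    "idCountry": "GBR",
}


def build_callback_url(base_url, **overrides):
    """Merge run-time overrides into the base params and URL-encode them."""
    params = {**BASE_PARAMS, **overrides}
    return base_url + "?" + urlencode(params)


url = build_callback_url(
    "http://localhost:8080/v1/banking/Jumio/callback",  # placeholder host
    idCountry="USA",
    transactionDate="2018-11-20T20:53:25.797Z",
)
```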

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question; I couldn't find anything in the docs.
Currently my workflow looks something like this. I'm taking a number of input files created as part of this workflow and summarizing them.
Is there a way to avoid the manual regex step that parses the wildcards out of the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but I'm unsure how to guarantee a consistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward when this information must be in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
expand uses itertools.product from the standard library. Hence, you could write
from itertools import product
product(config["chromosomes"], cross_ids)
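As a small illustration of why this removes the regex step: itertools.product (the standard-library home of product) yields the (chrom, cross) pairs in a fixed order, so each pair can be kept next to the path built from it. The chromosome names, cross ids and the helper expected_inputs below are made-up stand-ins, and the sketch assumes the wildcards are passed to expand in the same chrom, cross order.

```python
from itertools import product

chromosomes = ["2L", "2R"]        # stand-in for config["chromosomes"]
cross_ids = ["c1", "c2", "c3"]    # stand-in for cross_ids


def expected_inputs(chromosomes, cross_ids, pattern="output/{chrom}/{cross}.in"):
    """Pair every (chrom, cross) combination with the path built from it."""
    return [
        ((chrom, cross), pattern.format(chrom=chrom, cross=cross))
        for chrom, cross in product(chromosomes, cross_ids)
    ]


for (chrom, cross), path in expected_inputs(chromosomes, cross_ids):
    # chrom and cross are available directly; no filename parsing needed
    print(chrom, cross, path)
```

Inside the run block, zip(product(config["chromosomes"], cross_ids), input.files) would give the same pairing without rebuilding the paths.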

Read response body in Apache mod_lua

I'm prototyping a simple "output" filter with Apache + mod_lua. How can I read the response body from Lua after the other native output filters have been applied? For example, can I get the actual response that will be sent to the client?
The manual has some good guidance on this:
http://httpd.apache.org/docs/current/mod/mod_lua.html#modifying_buckets
Modifying contents with Lua filters
Filter functions implemented via LuaInputFilter or LuaOutputFilter are designed as three-stage non-blocking functions using coroutines to suspend and resume a function as buckets are sent down the filter chain. The core structure of such a function is:
function filter(r)
    -- Our first yield is to signal that we are ready to receive buckets.
    -- Before this yield, we can set up our environment, check for conditions,
    -- and, if we deem it necessary, decline filtering a request altogether:
    if something_bad then
        return -- This would skip this filter.
    end
    -- Regardless of whether we have data to prepend, a yield MUST be called here.
    -- Note that only output filters can prepend data. Input filters must use the
    -- final stage to append data to the content.
    coroutine.yield([optional header to be prepended to the content])
    -- After we have yielded, buckets will be sent to us, one by one, and we can
    -- do whatever we want with them and then pass on the result.
    -- Buckets are stored in the global variable 'bucket', so we create a loop
    -- that checks if 'bucket' is not nil:
    while bucket ~= nil do
        local output = mangle(bucket) -- Do some stuff to the content
        coroutine.yield(output) -- Return our new content to the filter chain
    end
    -- Once the buckets are gone, 'bucket' is set to nil, which will exit the
    -- loop and land us here. Anything extra we want to append to the content
    -- can be done by doing a final yield here. Both input and output filters
    -- can append data to the content in this phase.
    coroutine.yield([optional footer to be appended to the content])
end
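The three-stage control flow may be easier to follow in a small model. The sketch below is a Python generator analogy, not mod_lua code: output_filter and run_filter are hypothetical names, and the bucket is passed in with send() rather than through a global variable as in mod_lua.

```python
def output_filter(transform, header="", footer=""):
    """Model of the three stages: ready/prepend, per-bucket work, append."""
    bucket = yield header              # stage 1: signal readiness, optional header
    while bucket is not None:          # stage 2: one bucket at a time
        bucket = yield transform(bucket)
    yield footer                       # stage 3: optional footer


def run_filter(gen, buckets):
    """Drive the generator the way a filter chain would feed it buckets."""
    out = [next(gen)]                  # run up to the first yield
    for b in buckets:
        out.append(gen.send(b))        # hand over a bucket, collect the result
    out.append(gen.send(None))         # no more buckets: collect the footer
    return "".join(out)


body = run_filter(output_filter(str.upper, "<", ">"), ["ab", "cd"])
```

In this model, reading the response body amounts to inspecting or collecting each bucket in stage 2 before yielding it on.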

How can I dry this Rails Controller Action further

Our application uses a number of environments so we can experiment with settings without breaking things. In a typical controller action, I have something like this:
def some_action
  ...
  if @foo.development_mode == 'Production'
    @settings = SomeHelper::Production.lan(bar)
  elsif @foo.development_mode == 'Beta'
    @settings = SomeHelper::Beta.lan(nas)
  elsif @foo.development_mode == 'Experimental'
    @settings = SomeHelper::Experimental.lan(nas)
  end
  ...
end
Since we have dozens of these, I figured I could try to DRY things up with something like this:
@settings = "SomeHelper::#{@foo.development_mode}.lan(bar)"
Which obviously doesn't work - I just get the string:
"SomeHelper::Production.lan(bar)"
How can I reduce this down, or do I have to stick with what I've got?
If your concern is that you're ending up with a String rather than the object, you can use String#constantize (Rails only; with standard Ruby you'd have to implement this yourself; it uses Object.const_get(String)).
Another option would be .const_get (e.g. Object.const_get(x) where x is your string), but it doesn't, on its own, nest correctly, so you would have to split at "::", etc.
There's also the option of using eval to evaluate the String.
But note: eval should be used with great care (it's powerful), or not at all.
Edit:
This means that instead of:
@settings = "SomeHelper::#{@foo.development_mode}.lan(bar)"
You could run:
@settings = "SomeHelper::#{@foo.development_mode}".constantize.lan(bar)
Useful Sources:
http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-constantize
http://www.ruby-forum.com/topic/183112
http://blog.grayproductions.net/articles/eval_isnt_quite_pure_evil
In the first case, @settings receives the result of the method SomeHelper::Production.lan(bar); in the second, @settings just gets a string. You could use Object#send to fire the method, or eval, but this wouldn't be my first choice.
It looks like you might be reinventing a wheel -- Rails already has the concept of "environments" pretty well wired into everything -- they are defined in config/environments. You set the environment when you launch the server, and can test it with, for example, Rails.env.production?. To create a new environment, just copy the environment file of the one closest to it, e.g. copy production.rb to beta.rb and edit as necessary, then test Rails.env.beta?, for example.
But this still leaves you testing which one you are in all over the place. Instead, you can add to the config hash (e.g. config.some_helper.lan = 'bar') and assign that value to @settings directly. You have to make sure there's either a default or it's defined in all environments, but I think this is probably the right approach ... not knowing exactly what you aim to accomplish.