Two consecutive yields, only the first work - scrapy

I have this piece of code that only executes the first yield's callback and not the next one. I have tried reordering them and it gives the same result:
Only the first yield callback gets executed.
for j in range(totalOrderPages): # the code gets in the loop
productURI = feedUrl % (productId, j + 1)
print "Got in the loop" # this gets printed
yield response.follow(productURI, self.parse_orders, meta={'pid': productId, 'categories': categories})
yield response.follow(first_page, self.parse_product, meta={'pid': productId, 'categories': categories})
Is there anything in Python or scrapy that prevents 2 consecutive yields?
Second question:
I'm trying to debug this using pdb.set_trace() but when I try to execute yield from the debugging console, it give the yield outside function error.
Does anyone know how can we debug yields?
Thank you.

Without knowing more details, like the redirection behaviour of the specific site or the contents of the variables (feedUrl, productURI, first_page, etc), I would say that some requests are being dropped by the Dupefilter (https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class).
I'd recommend you to enable the DEBUG logging level and setting DUPEFILTER_DEBUG=True, and check the logs to see if that's the case.
You can force requests to bypass the Dupefilter by adding dont_filter=True when calling response.follow.
If this doesn't solve your issue, please share your crawl logs so we can have more information to debug the issue. Happy scraping!

Related

WebSphere wsadmin testConnection error message

I'm trying to write a script to test all DataSources of a WebSphere Cell/Node/Cluster. While this is possible from the Admin Console a script is better for certain audiences.
So I found the following article from IBM https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.multiplatform.doc/ae/txml_testconnection.html which looks promising as it describles exactly what I need.
After having a basic script like:
ds_ids = AdminConfig.list("DataSource").splitlines()
for ds_id in ds_ids:
AdminControl.testConnection(ds_id)
I experienced some undocumented behavior. Contrary to the article above the testConnection function does not always return a String, but may also throw a exception.
So I simply use a try-catch block:
try:
AdminControl.testConnection(ds_id)
except: # it actually is a com.ibm.ws.scripting.ScriptingException
exc_type, exc_value, exc_traceback = sys.exc_info()
now when I print the exc_value this is what one gets:
com.ibm.ws.scripting.ScriptingException: com.ibm.websphere.management.exception.AdminException: javax.management.MBeanException: Exception thrown in RequiredModelMBean while trying to invoke operation testConnection
Now this error message is always the same no matter what's wrong. I tested authentication errors, missing WebSphere Variables and missing driver classes.
While the Admin Console prints reasonable messages, the script keeps printing the same meaningless message.
The very weird thing is, as long as I don't catch the exception and the script just exits by error, a descriptive error message is shown.
Accessing the Java-Exceptions cause exc_value.getCause() gives None.
I've also had a look at the DataSource MBeans, but as they only exist if the servers are started, I quickly gave up on them.
I hope someone knows how to access the error messages I see when not catching the Exception.
thanks in advance
After all the research and testing AdminControl seems to be nothing more than a convinience facade to some of the commonly used MBeans.
So I tried issuing the Test Connection Service (like in the java example here https://www.ibm.com/support/knowledgecenter/en/SSEQTP_8.5.5/com.ibm.websphere.base.doc/ae/cdat_testcon.html
) directly:
ds_id = AdminConfig.list("DataSource").splitlines()[0]
# other queries may be 'process=server1' or 'process=dmgr'
ds_cfg_helpers = __wat.AdminControl.queryNames("WebSphere:process=nodeagent,type=DataSourceCfgHelper,*").splitlines()
try:
# invoke MBean method directly
warning_cnt = __wat.AdminControl.invoke(ds_cfg_helpers[0], "testConnection", ds_id)
if warning_cnt == "0":
print = "success"
else:
print "%s warning(s)" % warning_cnt
except ScriptingException as exc:
# get to the root of all evil ignoring exception wrappers
exc_cause = exc
while exc_cause.getCause():
exc_cause = exc_cause.getCause()
print exc_cause
This works the way I hoped for. The downside is that the code gets much more complicated if one needs to test DataSources that are defined on all kinds of scopes (Cell/Node/Cluster/Server/Application).
I don't need this so I left it out, but I still hope the example is useful to others too.

Twisted deferreds block when URI is the same (multiple calls from the same browser)

I have the following code
# -*- coding: utf-8 -*-
# 好
##########################################
import time
from twisted.internet import reactor, threads
from twisted.web.server import Site, NOT_DONE_YET
from twisted.web.resource import Resource
##########################################
class Website(Resource):
def getChild(self, name, request):
return self
def render(self, request):
if request.path == "/sleep":
duration = 3
if 'duration' in request.args:
duration = int(request.args['duration'][0])
message = 'no message'
if 'message' in request.args:
message = request.args['message'][0]
#-------------------------------------
def deferred_activity():
print 'starting to wait', message
time.sleep(duration)
request.setHeader('Content-Type', 'text/plain; charset=UTF-8')
request.write(message)
print 'finished', message
request.finish()
#-------------------------------------
def responseFailed(err, deferred):
pass; print err.getErrorMessage()
deferred.cancel()
#-------------------------------------
def deferredFailed(err, deferred):
pass; # print err.getErrorMessage()
#-------------------------------------
deferred = threads.deferToThread(deferred_activity)
deferred.addErrback(deferredFailed, deferred) # will get called indirectly by responseFailed
request.notifyFinish().addErrback(responseFailed, deferred) # to handle client disconnects
#-------------------------------------
return NOT_DONE_YET
else:
return 'nothing at', request.path
##########################################
reactor.listenTCP(321, Site(Website()))
print 'starting to serve'
reactor.run()
##########################################
# http://localhost:321/sleep?duration=3&message=test1
# http://localhost:321/sleep?duration=3&message=test2
##########################################
My issue is the following:
When I open two tabs in the browser, point one at http://localhost:321/sleep?duration=3&message=test1 and the other at http://localhost:321/sleep?duration=3&message=test2 (the messages differ) and reload the first tab and then ASAP the second one, then the finish almost at the same time. The first tab about 3 seconds after hitting F5, the second tab finishes about half a second after the first tab.
This is expected, as each request got deferred into a thread, and they are sleeping in parallel.
But when I now change the URL of the second tab to be the same as the one of the first tab, that is to http://localhost:321/sleep?duration=3&message=test1, then all this becomes blocking. If I press F5 on the first tab and as quickly as possible F5 on the second one, the second tab finishes about 3 seconds after the first one. They don't get executed in parallel.
As long as the entire URI is the same in both tabs, this server starts to block. This is the same in Firefox as well as in Chrome. But when I start one in Chrome and another one in Firefox at the same time, then it is non-blocking again.
So it may not neccessarily be related to Twisted, but maybe because of some connection reusage or something like that.
Anyone knows what is happening here and how I can solve this issue?
Coincidentally, someone asked a related question over at the Tornado section. As you suspected, this is not an "issue" in Twisted but rather a "feature" of web browsers :). Tornado's FAQ page has a small section dedicated to this issue. The proposed solution is appending an arbitrary query string.
Quote of the day:
One dev's bug is another dev's undocumented feature!

jmeter stop current iteration

I am facing the following problem:
I have multiple HTTP Requests in my testplan.
I want every request to be repeated 4 times if they fail.
I realized that with a BeanShell Assertion, and its already working fine.
My problem is, that I don't want requests to be executed if a previous Request failed 5 times,
BUT I also dont want the thread to end.
I just want the current thread iteration to end,
so that the next iteration of the thread can start again with the 1st request (if the thread is meant to be repeated).
How do I realize that within the BeanShell Assertion?
Here is just a short extract of my code where i want the solution to have
badResponseCounter is being increased for every failed try of the request, this seems to work so far. Afterwards, the variable gets resetted.
if (badResponseCounter = 5) {
badResponseCounter = 0;
// Stop current iteration
}
I already checked the API, methods like setStopTest() or setStopThread() are given, but nothing for quitting the current iteration. I also need the preference "continue" in the thread group, as otherwise the entire test will stop after 1 single request failed.
Any ideas of how to do this?
In my opinion the easiest way is using following combination:
If Controller to check ${JMeterThread.last_sample_ok} and badResponseCounter variables
Test Action Sampler as a child of If Controller configured to "Go to next loop iteration"
Try this.
ctx.setRestartNextLoop(true);
if the thread number is 2, i tried to skip. I get the below result as i expected (it does not call b-2). It does not kill the thread either.

How do I debug a Delayed::Worker.work_off that doesn't return success or failure

I am testing my Delayed::Job using Rspec.
In my rspec_controller:
it "queues up delayed job and fires" do
setup
expect {
post :create, {:job => valid_attributes}
}.to change(Delayed::Job, :count).by(2)
Delayed::Worker.new.work_off.should == [2,0]
end
Delayed::Job.count passes as expected, but Delayed::Worker.new.work_off returns as [0,0], indicating there are 0 successes and 0 failures when there are 2 jobs.
How should I debug to find out why work_off doesn't fire the jobs.
Edit: The 2 jobs that are supposed to run, have their run_at set into the future. Does work_off fire off jobs that are not meant to be immediate?
Although this could be an older question, there's one parameter that's not much documented, try using
Delayed::Worker.new(quiet: false).work_off
to debug the result of your background jobs, this could help you to find out if the fact that they're supposed to run in the future is messing with the assert itself.
EDIT: Don't forget to take off the "quiet:false" when you're done, otherwise your tests will always output the results of the background jobs.
The construct
Delayed::Worker.new.work_off
immediately processes everything that is in the DJ queue, and in the same thread as the caller (it doesn't spawn a separate worker thread). But this doesn't explain why you're not getting [2, 0] for a result.
To answer your original question 'How should I debug to find out why work_off doesn't fire the jobs?', I suggest you use the callback hooks to trace the lifecycle of the jobs. Add a comment if you need to be shown how to do that... :)

Rails 3.2.2 log files unordered, requests intertwined

I recollect getting log files that were nicely ordered, so that you could follow one request, then the next, and so on.
Now, the log files are, as my 4 year old says "all scroggled up", meaning that they are no longer separate, distinct chunks of text. Loggings from two requests get intertwined/mixed up.
For instance:
Started GET /foobar
...
Completed 200 OK in 2ms (Views: 0.4ms | ActiveRecord: 0.8ms)
Patient Load (wait, that's from another request that has nothing to do with foobar!)
[ blank space ]
Something else
This is maddening, because I can't tell what's happening within one single request.
This is running on Passenger.
I tried to search for the same answer but couldn't find any good info. I'm not sure if you should fix server or rails code.
If you want more info about the issue here is the commit that removed old way of logging https://github.com/rails/rails/commit/04ef93dae6d9cec616973c1110a33894ad4ba6ed
If you value production log readability over everything else you can use the
PassengerMaxInstancesPerApp 1
configuration. It might cause some scaling issues. Alternatively you could stuff something like this in application.rb:
process_log_filename = Rails.root + "log/#{Rails.env}-#{Process.pid}.log"
log_file = File.open(process_log_filename, 'a')
Rails.logger = ActiveSupport::BufferedLogger.new(log_file)
Yep!, they have made some changes in the ActiveSupport::BufferedLogger so it is not any more waiting until the request has ended to flush the logs:
http://news.ycombinator.com/item?id=4483390
https://github.com/rails/rails/commit/04ef93dae6d9cec616973c1110a33894ad4ba6ed
But they have added the ActiveSupport::TaggedLogging which is very funny and you can stamp every log with any kind of mark you want.
In your case could be good to stamp the logs with the request UUID like this:
# config/application.rb
config.log_tags = [:uuid]
Then even if the logs are messed up you still can follow which of them correspond to the request you are following up.
You can make more funny things with this feature to help you in your logs study:
How to log user_name in Rails?
http://zogovic.com/post/21138929607/running-time-in-rails-logs
Well, for me the TaggedLogging solution is a no go, I can live with some logs getting lost if the server crashes badly, but I want my logs to be perfectly ordered. So, following advice from the issue comments I'm applying this to my app:
# lib/sequential_logs.rb
module ActiveSupport
class BufferedLogger
def flush
#log_dest.flush
end
def respond_to?(method, include_private = false)
super
end
end
end
# config/initializers/sequential_logs.rb
require 'sequential_logs.rb'
Rails.logger.instance_variable_get(:#logger).instance_variable_get(:#log_dest).sync = false
As far as I can say this hasn't affected my app, it is still running and now my logs make sense again.
They should add some quasi-random reqid and write it in every line regarding one single request. This way you won't get confused.
I haven't used it, but I believe Lumberjack's unit_of_work method may be what you're looking for. You call:
Lumberjack.unit_of_work do
yield
end
And all logging done either in that block or in the yielded block are tagged with a unique ID.