Scrapy: drop item from scraper - scrapy

I would like to drop an item from the spider itself instead of adding this scraper's particular dropping logic to the pipeline, since it is a specific case.
Scrapy has the DropItem exception, which is handled nicely by the pipeline, but it produces an error if it is raised from the spider:
#...
raise DropItem('Item dropped ' + self.id())
Output:
2019-11-13 13:27:27 [scrapy.core.scraper] ERROR: Spider error processing <GET http://domain.tld/> (referer: http://domain.tld/referer)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/core/core/spiders/my_spider.py", line 46, in parse get.photos())
scrapy.exceptions.DropItem: Item dropped 35
Is there a more elegant way to handle this situation?
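For comparison, the obvious workaround is to simply not yield the item from the callback; a minimal sketch, where build_item and the photos check are hypothetical stand-ins for my real logic:

def parse(self, response):
    item = self.build_item(response)  # hypothetical item-building helper
    if not item.get('photos'):
        # log and skip instead of raising DropItem from the spider
        self.logger.info('Item dropped %s', self.id())
        return
    yield item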

Related

Debugging: "Message: no such element: Unable to locate element:"

I am learning Python, so please bear with me.
I adopted LinkedIn-Easy-Apply-Bot from: https://github.com/voidbydefault/LinkedIn-Easy-Apply-Bot
The bot works perfectly fine on my test account, but when I change the email ID/password to my real account (with everything else being the same), I start getting these errors:
Traceback (most recent call last):
File "/home/me/Documents/LinkedIn-Easy-Apply-Bot/linkedineasyapply.py", line 124, in apply_jobs
job_results = self.browser.find_element_by_class_name("jobs-search-results")
File "/home/me/PycharmProjects/Better-LinkedIn-EasyApply-Bot/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 766, in find_element_by_class_name
return self.find_element(by=By.CLASS_NAME, value=name)
File "/home/me/PycharmProjects/Better-LinkedIn-EasyApply-Bot/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 1251, in find_element
return self.execute(Command.FIND_ELEMENT, {
File "/home/me/PycharmProjects/Better-LinkedIn-EasyApply-Bot/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 430, in execute
self.error_handler.check_response(response)
File "/home/me/PycharmProjects/Better-LinkedIn-EasyApply-Bot/venv/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".jobs-search-results"}
(Session info: chrome=102.0.5005.61)
I have tried deleting chromedriver to ensure version conflicts do not exist, adding time.sleep(5) at line 124 (above), and using driver.implicitly_wait(10). Unfortunately, the error persists with my real account.
Note that there are no issues with my real account when it is used manually: I am able to apply to all sorts of jobs, whether Easy Apply or otherwise, and the bot works 100% fine on my test account, so the elements the code is looking for do exist.
Please help in fixing the problem.
Thanks.
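For reference, the explicit-wait variant of the failing lookup would look something like this (a sketch; the class name is taken from the traceback above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the results pane instead of failing immediately
job_results = WebDriverWait(self.browser, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "jobs-search-results"))
)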

Odoo v13 : Could not uninstall crm App : Record does not exist or has been deleted (Record: ir.model.fields(9311,), User: 1)

Odoo Version : 13.0.20210614
Way to reproduce: in Apps, CRM app > Uninstall
Behavior: cannot uninstall; the following error is raised (image 1 attached):
('Record does not exist or has been deleted (Record: ir.model.fields(9311,), User: 1)', None)
The same bug has been reported several times but is still not fixed:
https://github.com/odoo/odoo/issues/38008
How can I deal with this in order to uninstall the CRM app?
**************** TRACEBACK **************
2021-06-18 14:21:52,779 6 INFO samadeva-oerp-brstaging-2702918 odoo.addons.base.models.ir_module: ALLOW access to module.module_uninstall on ['sale_crm', 'crm_enterprise', 'crm_sms', 'website_crm', 'website_crm_sms', 'mass_mailing_crm', 'crm'] to user __system__ #1 via 86.243.106.83
2021-06-18 14:21:52,800 6 WARNING samadeva-oerp-brstaging-2702918 odoo.modules.loading: Transient module states were reset
2021-06-18 14:21:52,801 6 ERROR samadeva-oerp-brstaging-2702918 odoo.modules.registry: Failed to load registry
Traceback (most recent call last):
File "/home/odoo/src/odoo/odoo/api.py", line 745, in get
def get(self, record, field, default=NOTHING):
value = self._data[field][record._ids[0]]
KeyError: 9311
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/odoo/src/odoo/odoo/fields.py", line 1037, in __get__
value = env.cache.get(record, self)
File "/home/odoo/src/odoo/odoo/api.py", line 751, in get
raise CacheMiss(record, field)
odoo.exceptions.CacheMiss: ('ir.model.fields(9311,).model', None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/odoo/src/odoo/odoo/modules/registry.py", line 86, in new
odoo.modules.load_modules(registry._db, force_demo, status, update_module)
File "/home/odoo/src/odoo/odoo/modules/loading.py", line 494, in load_modules
Module.browse(modules_to_remove.values()).module_uninstall()
File "<decorator-gen-61>", line 2, in module_uninstall
File "/home/odoo/src/odoo/odoo/addons/base/models/ir_module.py", line 73, in check_and_log
return method(self, *args, **kwargs)
File "/home/odoo/src/odoo/odoo/addons/base/models/ir_module.py", line 478, in module_uninstall
self.env['ir.model.data']._module_data_uninstall(modules_to_remove)
File "/home/odoo/src/odoo/odoo/addons/base/models/ir_model.py", line 1898, in _module_data_uninstall
model = self.pool.get(ir_field.model)
File "/home/odoo/src/odoo/odoo/fields.py", line 1050, in __get__
_("(Record: %s, User: %s)") % (record, env.uid),
odoo.exceptions.MissingError: ('Record does not exist or has been deleted (Record: ir.model.fields(9311,), User: 1)', None)
My inspection of the cause of this error (triggered by clicking uninstall on the crm module) shows that the database table ir_model_data had a record (fk: res_id=9311) pointing to the table ir_model_fields, where the corresponding pk is missing (no record with pk id=9311). To be able to uninstall the CRM app, the only solution I found, after searching for hours to solve it the Odoo way, was to delete the "orphan" record in ir_model_data. Because it was not allowed to do that using odoo-bin shell, I had to fire the deletion by putting this line at the end of a def_buttonchangestatus Python function clickable in the UI:
self.env['ir.model.data'].search([('res_id','=',9311)],limit=1).unlink()
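Note that res_id alone is not unique across models, so a slightly safer variant (an assumption on my part, not something I tested further) would also filter on the model column:

# Hypothetical, more targeted cleanup: only unlink ir_model_data rows
# that point at the missing ir.model.fields record 9311.
orphans = self.env['ir.model.data'].search([
    ('model', '=', 'ir.model.fields'),
    ('res_id', '=', 9311),
])
orphans.unlink()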

How does "on-negotiation-needed" work when trying to stream using gstreamer webrtc?

How does the WebRTC pipeline get any information about its peers?
Is this what the on_negotiation_needed callback does?
def start_pipeline(self):
    self.pipe = Gst.parse_launch(PIPELINE_DESC)
    self.webrtc = self.pipe.get_by_name('sendrecv')
    self.webrtc.connect('on-negotiation-needed', self.on_negotiation_needed)
    self.webrtc.connect('on-ice-candidate', self.send_ice_candidate_message)
    self.webrtc.connect('pad-added', self.on_incoming_stream)
    self.pipe.set_state(Gst.State.PLAYING)
I see that it has the on_negotiation_needed callback, but it's unclear where the element variable comes from. I looked here: http://blog.nirbheek.in/2018/02/gstreamer-webrtc.html and here: https://github.com/centricular/gstwebrtc-demos, and I am still confused as to how this negotiation works. From what I understand, there are 2 (or more) peers, both of which must connect to the signaling server, and then one of them has to create an offer.
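For reference, in the gstwebrtc-demos the callback looks roughly like this (a sketch from the demo code, which also shows that GStreamer passes the emitting webrtcbin itself as the element argument):

def on_negotiation_needed(self, element):
    # 'element' is the webrtcbin that emitted the signal; ask it to
    # create an SDP offer and handle the result in on_offer_created
    promise = Gst.Promise.new_with_change_func(self.on_offer_created, element, None)
    element.emit('create-offer', None, promise)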
I await the message from (I assume) the GStreamer webrtcbin on the signaling server:
print(websocket.remote_address)
# get message from client
message = await asyncio.wait_for(websocket.recv(), 3000)
and I get this error when the pipeline starts:
('192.168.11.138', 44120)
Error in connection handler
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 674, in transfer_data
message = yield from self.read_message()
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 742, in read_message
frame = yield from self.read_data_frame(max_size=self.max_size)
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 815, in read_data_frame
frame = yield from self.read_frame(max_size)
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 884, in read_frame
extensions=self.extensions,
File "/usr/local/lib/python3.6/dist-packages/websockets/framing.py", line 99, in read
data = yield from reader(2)
File "/usr/lib/python3.6/asyncio/streams.py", line 672, in readexactly
raise IncompleteReadError(incomplete, n)
asyncio.streams.IncompleteReadError: 0 bytes read on a total of 2 expected bytes
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/websockets/server.py", line 169, in handler
yield from self.ws_handler(self, path)
File "signaling_server.py", line 34, in signaling
message = await asyncio.wait_for(websocket.recv(), 3000)
File "/usr/lib/python3.6/asyncio/tasks.py", line 358, in wait_for
return fut.result()
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 434, in recv
yield from self.ensure_open()
File "/usr/local/lib/python3.6/dist-packages/websockets/protocol.py", line 646, in ensure_open
) from self.transfer_data_exc
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1006 (connection closed abnormally [internal]), no reason
I cannot speak for Python (unfortunately, I cannot make the GStreamer Python bindings work on Windows); however, the demo works from C# (I just checked).
First you should connect with your browser to https://webrtc.nirbheek.in/ and get the 'Our id' value.
Your Python GStreamer application should connect to wss://webrtc.nirbheek.in:8443 and use the Id value from the browser.
The browser will get the test image stream from GStreamer, and the GStreamer application will get the webcam image from the browser.
HTH, Tom

FDMEE. Exception when I try to load data by fdmAPI.executeQuery

I'm trying to retrieve data during the AftLoad event:
RSet = fdmAPI.executeQuery("SELECT * FROM HYP_CST.dbo.STAGING", [])
while RSet.next():
    do_something()
and I get the following exception:
FATAL [AIF]: Error in CommData.loadData
Traceback (most recent call last):
File "<string>", line 4746, in loadData
File "<string>", line 445, in executeScript
File "F:\FDMEE/data/scripts/event/AftLoad.py", line 54, in <module>
while RSet.next():
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
The query outputs some rows.
Any help would be appreciated. Thanks.
Jython reported a misleading error stack. The actual cause of the error was inside the procedure do_something(), which is not reflected in the stack.
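To illustrate the kind of bug that was hiding there: concatenating a NULL column value (None in Jython) with a string raises exactly this TypeError, even though the stack blames the while loop. A hypothetical sketch (the NAME column is made up):

def do_something():
    name = RSet.getString("NAME")  # getString returns None for SQL NULL
    fdmAPI.logInfo("Row: " + (name or ""))  # guard before concatenating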

Can I run scrapy spider with different setting in different process(parallel)?

I have defined one spider with name='myspider' whose behavior differs according to the settings. I want to run this spider as different instances in different processes (in parallel); is that possible?
I checked the source code; it seems the SpiderLoader just walks the spiders module, so I can only run one spider with a given name at a time.
The running code looks like this:
for item in items:
    settings = get_project_settings()
    settings.set('item', item)
    settings.set('DEFAULT_REQUEST_HEADERS', item.get('request_header'))
    process = CrawlerProcess(settings)
    process.crawl("myspider")
    process.start()
And, of course, this error shows up:
Traceback (most recent call last):
File "/home/xuanqi/workspace/github/foolcage/fospider/fospider/main.py", line 44, in <module>
process.start() # the script will block here until the crawling is finished
File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 280, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1194, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1174, in startRunning
ReactorBase.startRunning(self)
File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Thanks in advance for any help!
Settings cannot be changed at runtime.
I suggest you use spider arguments to pass different variables to the spider.
process = CrawlerProcess(settings)
process.crawl("myspider", request_headers='specified headers...')
process.start()
To do this, you have to override the __init__ function of your spider to accept these variables, and pass the request_headers to every Request object you use in the spider.
def __init__(self, **kw):
    super(MySpider, self).__init__(**kw)
    self.headers = kw.get('request_headers')
...
yield scrapy.Request(url='http://www.example.com', headers=self.headers)
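As for the ReactorNotRestartable traceback itself: it comes from calling process.start() inside the loop, because Twisted's reactor cannot be restarted. A sketch of scheduling every crawl before starting the reactor once (settings are still shared; only the spider arguments differ per crawl):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for item in items:
    # each crawl gets its own spider instance with its own arguments
    process.crawl("myspider", request_headers=item.get('request_header'))
process.start()  # blocks until all scheduled crawls finish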