How to fix scrapy rules when only one rule is followed - scrapy

This code is not working:
name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]
rules = (
#categories
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[#id="body-column-main"]//div[contains(#class,"fl")]'),unique=True)),
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[#id="ItemResultList"]/div/div/div/a'),unique=True),callback='parse_item'),
Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'),unique=True)),
)
The first rule is getting responses, but the second rule is not working.
I'm sure that the second rule's XPath is correct (I've tried it in scrapy shell). I also tried adding a callback to the first rule, selecting the path of the second rule ('//div[@id="ItemResultList"]/div/div/div/a') and issuing a Request, and that works correctly.
I also tried a workaround: using a BaseSpider instead of a CrawlSpider, but it only issues the first request and never calls the callback.
How should I fix that?

The order of the rules is important. According to the scrapy docs for CrawlSpider rules:
If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
If I follow the first link in http://uae.souq.com/ae-en/shop-all-categories/c/, i.e. http://uae.souq.com/ae-en/antique/l/, the items you want to follow are within this structure
<div id="body-column-main">
<div id="box-ads-souq-1340" class="box-container ">...
<div id="box-results" class="box-container box-container-none ">
<div class="box box-style-none box-padding-none">
<div class="bord_b_dash overhidden hidden-phone">
<div class="item-all-controls-wrapper">
<div id="ItemResultList">
<div class="single-item-browse fl width-175 height-310 position-relative">
<div class="single-item-browse fl width-175 height-310 position-relative">
...
So, the links you target with the 2nd Rule are in <div> elements that have "fl" in their class. They also match the first rule, which extracts all links under '//div[@id="body-column-main"]//div[contains(@class,"fl")]', and therefore they will NOT be parsed with parse_item.
Simple solution: try putting your 2nd rule before the "categories" Rule (unique=True is the default for SgmlLinkExtractor):
name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]
rules = (
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[#id="ItemResultList"]/div/div/div')), callback='parse_item'),
#categories
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[#id="body-column-main"]//div[contains(#class,"fl")]'))),
Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'))),
)
Another option is to change your first rule for category pages to a more restrictive XPath that does not exist in the individual category pages, such as '//div[@id="body-column-main"]//div[contains(@class,"fl")]//ul[@class="refinementBrowser-mainList"]'.
You could also define a regex for the category pages and use the allow parameter in your Rules.
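For example, the category page seen above follows a URL pattern like http://uae.souq.com/ae-en/antique/l/, so an allow-based set of rules could look like the sketch below. The regex is an assumption derived from that single URL, so verify it against a few more category links before relying on it:
rules = (
    # item links first, so they win over the broader category rule
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div')), callback='parse_item'),
    # category pages matched by URL pattern instead of by XPath
    # (regex is an assumption based on URLs like /ae-en/antique/l/)
    Rule(SgmlLinkExtractor(allow=(r'/ae-en/[^/]+/l/',))),
    # pagination
    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+',))),
)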

Related

Can hx-select-oob update several targets?

If I understand the docs correctly, the following is allowed:
<div>
<div id="alert"></div>
<button hx-get="/info"
hx-select="#info-details"
hx-swap="outerHTML"
hx-select-oob="#alert">
Get Info!
</button>
</div>
This replaces the whole <button> (due to the hx-swap="outerHTML" directive) with the content of #info-details from the response, via hx-select="#info-details", and
defines a 2nd target via hx-select-oob="#alert", which takes the content of #alert from the response.
Is there any way hx-select-oob can have multiple targets?
Like hx-select-oob="#alert,#second-target"?
Or how to manage updating multiple targets at once?
If your goal is to return HTML content and copy the same content to multiple targets, you can simply use a class selector instead of an ID.
If you have multiple pieces of content returned, then you need to use hx-swap-oob.

How to build an Angular directive that doesn't generate containing tags?

I'm having a little difficulty with this, and I'm hoping someone has some ideas. It's a very complex situation, but I'll try to simplify and nutshell it here.
Let's say that I have a directive that generates several elements, for example, a label and a textbox. This directive will need to generate different content in different scenarios.
The problem with a directive that contains more than one element is that you can't use "replace: true" unless you wrap everything in one element; "replace" doesn't like multiple root elements. If you don't use the "replace" option, then the tag that invokes the directive becomes the containing element.
I don't want all these extra tags. It's breaking my CSS and generating unnecessary content. So instead of this:
<div ng-repeat="s in something">
<my-directive>
<label><input>
</my-directive>
<my-directive>
<label><input>
</my-directive>
<my-directive>
<label><input>
</my-directive>
</div><!-- End repeat -->
I want this:
<div ng-repeat="s in something">
<label><input>
<label><input>
<label><input>
</div><!-- End repeat -->
I hope this question makes sense.

Scrapy, javascript form, not crawling next page

I am having an issue. I am using scrapy to extract data from HTML tables that are displayed after a form search. The problem is that it will not continue crawling to the next page. I have tried multiple combinations of rules, and I understand that it is not recommended to override the default parse logic in CrawlSpider. I have found many answers that fix other people's issues, but I have not been able to find a solution in which a form POST must occur first. Looking at my code, I see that it requests the start URL, then POSTs to search.do, and the results are returned as an HTML-formatted results page, where the parsing begins. Here is my code (I have replaced the real URL with nourl.com):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request
from EMD.items import EmdItem

class EmdSpider(CrawlSpider):
    name = "emd"
    start_urls = ["https://nourl.com/methor"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div//div//div//span[@class="pagelinks"]/a[@href]'))),
        Rule(SgmlLinkExtractor(allow=('')), callback='parse_item')
    )

    def parse_item(self, response):
        url = "https://nourl.com/methor-app/search.do"
        payload = {"county": "ANDERSON"}
        return FormRequest(url, formdata=payload, callback=self.parse_data)

    def parse_data(self, response):
        print response
        sel = Selector(response)
        items = sel.xpath('//td').extract()
        print items
I have left allow = ('') blank because I have tried so many combinations of it. Also, my XPath leads to this:
<div align="center">
<div id="bg">
<!--
Main Container
-->
<div id="header2"></div>
<!--
Content
-->
<div id="content">
<!--
Hidden/Accessible Headers
-->
<h1 class="hide"></h1>
<!--
InstanceBeginEditable name="Content"
-->
<h2></h2>
<p align="left"></p>
<p id="printnow" align="center"></p>
<p align="left"></p>
<span class="pagebanner"></span>
<span class="pagelinks">
[First/Prev]
<strong></strong>
,
<a title="Go to page 2" href="/methor-app/results.jsp?d-49653-p=2"></a>
,
<a title="Go to page 3" href="/methor-app/results.jsp?d-49653-p=3"></a>
[
/
]
</span>
I have checked with multiple tools, and my XPath correctly points to the URLs of the next pages, but my output in the command prompt only grabs data from the first page. I have seen a couple of tutorials where the code contains a yield statement, but I am not sure what it does other than "tell the function that it will be used again later without losing its data". Any ideas would be helpful. Thank you!!!
It may be because you need to select the actual URL in your rule, not just the <a> node. [...] in XPath is used to make a condition, not to select something. Try:
//span[@class="pagelinks"]/a/@href
Also a few comments:
How did you find this HTML? Beware of tools that generate XPath expressions for you: HTML retrieved with a browser and with scrapy may differ, because scrapy doesn't handle Javascript (which can be used to generate the page you're looking at), and some browsers also try to sanitize HTML.
It may not be the case here, but the mention of a "javascript form" in a scrapy question spooked me. You should always check that the content of response.body is what you expect.
The //div//div//div prefix is redundant: the double slash already means "any descendant, at any depth", so it adds nothing to the selection. That's also why //span[@class="pagelinks"] on its own might do the trick.
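Putting the last point into practice, the pagination rule could be simplified along these lines. This is only a sketch, keeping the rest of the spider as posted and assuming the pagination links are the only <a> elements inside the pagelinks span:
rules = (
    # follow the pagination links; the link extractor pulls the href
    # out of every <a> it finds inside the restricted region
    Rule(SgmlLinkExtractor(restrict_xpaths=('//span[@class="pagelinks"]',))),
    # everything else is handed to parse_item
    Rule(SgmlLinkExtractor(allow=('',)), callback='parse_item'),
)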

How do I select a particular dynamic div, using Selenium when I don't have a unique id or name?

Only the content of the div is unique. So, in the following dynamically generated html, only "My Article-1245" is unique:
<div class="col-md-4 article">
<h2>
My Article-1245
Delete
Edit
</h2>
<p>O ephemeral text! Here today, gone tomorrow. Not terribly important, but necessary</p>
</div>
How do I select the edit/delete link of this specific div, using Selenium? assertText/verifyText requires an element locator, but I do not have any unique id/name (out of my control). There will be many such div blocks, with other content text, all dynamically generated.
Any help would be appreciated.
If the text 'My Article' appears each time, you may use the following:
//For Delete
driver.findElement(By.xpath("//h2[contains(text(),'My Article-')]/a[text()='Delete']"));
//For Edit
driver.findElement(By.xpath("//h2[contains(text(),'My Article-')]/a[text()='Edit']"));
Hope it meets your requirement :)
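If you happen to be driving Selenium from Python rather than Java, the same lookups would look roughly like this. This is a sketch: the page URL is hypothetical and, as above, it assumes Delete and Edit are <a> elements inside the <h2>:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com/articles")  # hypothetical page containing the markup above

# locate the Delete and Edit links of the article whose heading contains "My Article-"
delete_link = driver.find_element(By.XPATH, "//h2[contains(text(),'My Article-')]/a[text()='Delete']")
edit_link = driver.find_element(By.XPATH, "//h2[contains(text(),'My Article-')]/a[text()='Edit']")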
Matching by text is always a bad automated testing concept. If you want to keep clean and reliable test scripts, then :
Contact your web dev to add unique identifiers to the elements
Suck it up, and create selectors based on what's there.
You are able to create a CSS selector based on what you want.
What you should do is create the selector using parent-child relationships:
driver.findElement(By.cssSelector("div.article:nth-child(X) a[href^='delete']"));
As I am ignorant of your app, this also assumes that all of your article classes are under the same parent. You would substitute X with the number of the div you want to refer to, e.g.:
<div id="someparent">
<div class="...article" />
<div class="...article" />
...
</div>

How to get an article description or excerpt within Expression Engine

I saw in Expression Engine I can use {embed:title} and {site_name} variables, but now I need a variable to pull an excerpt or description of the article itself. Is there such a variable/tag?
ExpressionEngine tags are based solely on custom fields which you yourself have defined. So in the field group for your "articles" channel, you'll have some fields, maybe {article_summary}, {article_body}, {article_image}, etc. To display your summary, just use {article_summary} in your template.
I'm assuming that you're coming from something like WordPress maybe, where every piece of content has the_content() and the_excerpt() ... aside from a handful of global variables, and some fields which are universal to all entries (like {title}, {entry_date}, etc), ExpressionEngine isn't like that. You define what fields you use for each channel - you have complete control.
Here is the actual code you have to include in your EE template.
{exp:channel:entries channel="article" limit="5" dynamic="no"}
<div class="home_thumb">
<h1>{title}</h1>
<div class="home_thumb_img">
<img src="{article_image}">
{if article_content}
<p>{article_content}</p>
{/if}
</div>
</div>
{/exp:channel:entries}