Scrapy crawl web with many duplicated element class name - scrapy

I'm new to Scrapy and trying to crawl a web page, but the HTML consists of many DIVs with duplicated class names, e.g.
<section class= "pi-item pi-smart-group pi-border-color">
<section class="pi-smart-group-head">
<h3 class = "pi-smart-data-label pi-data-label pi-secondary-font pi-item-spacing">
</section>
<section class= "pi-smart-group-body">
<div class="pi-smart-data-value pi-data-value pi-font pi-item-spacing">
</div>
</section>
</section>
My problem is that this structure repeats for many other elements, so when I use response.css I get multiple elements that I didn't want.
(Basically I want to crawl the Pokemon information, e.g. "Types", "Species" and "Ability" of each Pokemon, from https://pokemon.fandom.com/wiki/Bulbasaur . I have already collected the URL for every Pokemon but am stuck getting the information from each Pokemon's page.)

I have tried to do this Scrapy project for you and got the results. The issue I see is that you have used CSS selectors. You can scrape with those, but it is far more effective to use XPath selectors; you have more versatility to select the specific tags you want. Here is the code I wrote for you. Bear in mind, this code is just something I did quickly to get your results. It works, but I did it this way so it is easy for you to understand since you are new to Scrapy. Please let me know if this is helpful.
import scrapy

class PokemonSpiderSpider(scrapy.Spider):
    name = 'pokemon_spider'
    start_urls = ['https://pokemon.fandom.com/wiki/Bulbasaur']

    def parse(self, response):
        pokemon_type = response.xpath("(//div[@class='pi-data-value pi-font'])[1]/a/@title")
        pokemon_species = response.xpath('//div[@data-source="species"]//div/text()')
        pokemon_abilities = response.xpath('//div[@data-source="ability"]/div/a/text()')
        yield {
            'pokemon type': pokemon_type.extract(),
            'pokemon species': pokemon_species.extract(),
            'pokemon abilities': pokemon_abilities.extract()
        }

You can use an XPath expression that matches on the label text:
abilities = response.xpath('//h3[a[.="Abilities"]]/following-sibling::div[1]/a/text()').getall()
species = response.xpath('//h3[a[.="Species"]]/following-sibling::div[1]/text()').get()
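If it helps to see the two approaches side by side, here is a rough sketch of a spider that combines the positional Type selector from the first answer with the label-based expressions above. The spider name and output keys are placeholders, and the selectors are not re-verified against the live page:

import scrapy

class PokemonSpider(scrapy.Spider):
    # Rough sketch only; name and output keys are placeholders.
    name = "pokemon_infobox"
    start_urls = ["https://pokemon.fandom.com/wiki/Bulbasaur"]

    def parse(self, response):
        yield {
            # Type row, using the positional selector from the first answer.
            "types": response.xpath(
                "(//div[@class='pi-data-value pi-font'])[1]/a/@title").getall(),
            # Species and Abilities, anchored on their visible infobox labels.
            "species": response.xpath(
                '//h3[a[.="Species"]]/following-sibling::div[1]/text()').get(),
            "abilities": response.xpath(
                '//h3[a[.="Abilities"]]/following-sibling::div[1]/a/text()').getall(),
        }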

Related

XPath following for the next thumbnail - optimized solution? (for a Selenium automated solution)

At the demo store we have a list of thumbnails as given below:
<ul class="product_list grid row">
<li class="ajax_block_product....">
<div class="product-container">
<!-- at this level ist the thumbnail number 1 -->
</div>
</li>
<li class="ajax_block_product....">
<div class="product-container">
<!-- at this level ist the thumbnail number 2 -->
</div>
</li>
<li class="ajax_block_product....">
<div class="product-container">
<!-- at this level ist the thumbnail number 3 -->
</div>
</li>
</ul>
I can navigate through thumbnails using the xpath mentioned below:
thumbnail number 1 = //div[@class="product-container"]
thumbnail number 2 = //div[@class="product-container"]/following::div[@class="product-container"][1]
thumbnail number 3 = //div[@class="product-container"]/following::div[@class="product-container"][1]/following::div[@class="product-container"][1]
Though the above XPaths are working fine for me, they are not an optimized solution.
Update 1: The objective is to keep the XPath in a fixed, "closed" form in the web page object library, for use by automated tests.
Get all the thumbnails in a list and then navigate through them. Use the CSS selector below for that. I'm not sure which language you are using; I wrote this using Java.
List<WebElement> products = driver.findElements(By.cssSelector("div.product-container"));
for (WebElement product : products) {
    String productName = product.findElement(By.cssSelector(".product-name")).getText();
    String productPrice = product.findElement(By.cssSelector(".right-block .price")).getText();
    ...
}
Or using the XPath mentioned by you:
List<WebElement> allThumbnails = driver.findElements(By.xpath("//div[@class='product-container']"));
for (WebElement thumbnail : allThumbnails) {
    // Use relative XPaths (leading dot) so the search stays inside this thumbnail.
    String productName = thumbnail.findElement(By.xpath(".//a[@class='product-name']")).getText();
    String productPrice = thumbnail.findElement(By.xpath(".//div[@class='right-block']//span[@class='price product-price']")).getText();
    ...
}
UPDATED
As per your comment, if it is required to reference a fixed element in your XPath, then using indexes would be the right approach.
There are a total of 7 products present on the page whose URL you shared, so you can write the XPaths like this:
thumbnail number 1 = //ul[@class='product_list grid row']/li[1]
thumbnail number 2 = //ul[@class='product_list grid row']/li[2]
...
thumbnail number 7 = //ul[@class='product_list grid row']/li[7]
If all the products share the same class you can just have Selenium pick up all matching elements, as opposed to explicitly stating one after the other like you're currently trying.
You've not said what language you're using, but here's what it'd look like in C#:
// Setup your web driver...
var thumbnails = driver.FindElements(By.XPath("//div[@class=\"product-container\"]"));
foreach (var thumbnail in thumbnails)
{
// Do work
}
Or
for(int i = 0; i < thumbnails.Count; i++)
{
// Do work by index
}
Note the plural FindElements: it returns a collection of all matching elements. Selenium does a fairly good job of matching methods across languages, so that should point you in the right direction at least.
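If you happen to be driving Selenium from Python rather than Java or C#, the same pick-up-everything approach would look roughly like this. The demo-store URL is a placeholder, and the inner .product-name / .right-block .price selectors are taken from the Java answer above, not re-verified:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/demo-store")  # placeholder for the demo store URL

# One call collects every thumbnail container; no need to chain following:: axes.
thumbnails = driver.find_elements(By.CSS_SELECTOR, "div.product-container")

for thumbnail in thumbnails:
    name = thumbnail.find_element(By.CSS_SELECTOR, ".product-name").text
    price = thumbnail.find_element(By.CSS_SELECTOR, ".right-block .price").text
    print(name, price)

driver.quit()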

Scrapy returning same first row data in each row instead of separate data for each row

I have written a simple scraper using Scrapy, but it keeps returning the first instance of the target data instead of the correct data for each row. In this case, it returns the first link for all scraped jobs from the Indeed website, instead of the correct link for each job.
I've tried both using (div) and avoiding (.//div) absolute paths, as well as using [0] at the end of the line. Without [0], it returns all data from all rows in each cell.
Link to an example of the source data:
https://www.indeed.co.uk/jobs?as_and=a&as_phr=&as_any=&as_not=IT+construction&as_ttl=Project+Manager&as_cmp=&jt=contract&st=&salary=%C2%A330K-%C2%A3460K&radius=25&fromage=2&limit=50&sort=date&psf=advsrch
Target data is href="/rc/clk?jk=56e4f5164620b6da&fccid=6920a3604c831610&vjs=3"
Target data from page
<div class="title">
<a target="_blank" id="jl_56e4f5164620b6da" href="/rc/clk?jk=56e4f5164620b6da&fccid=6920a3604c831610&vjs=3" onmousedown="return rclk(this,jobmap[0],1);" onclick=" setRefineByCookie(['radius', 'jobtype', 'salest']); return rclk(this,jobmap[0],true,1);" rel="noopener nofollow" title="Project Manager" class="jobtitle turnstileLink " data-tn-element="jobTitle">
<b>Project</b> <b>Manager</b></a>
Here's my code
def parse(self, response):
    titles = response.css('div.jobsearch-SerpJobCard')
    items = []
    for title in titles:
        item = ICcom4Item()
        home_url = ("http://www.indeed.co.uk")
        item['role_title_link'] = titles.xpath('div[@class="title"]/a/@href').extract()[0]
        items.append(item)
    return items
I just need the correct link from each job to appear. All help welcome!
The problem is in the line below:
item['role_title_link'] = titles.xpath('div[@class="title"]/a/@href').extract()[0]
Instead of titles.xpath, you should use title.xpath, like below:
item['role_title_link'] = title.xpath('div[@class="title"]/a/@href').extract()[0]
Then, your code will scrape the link for each job, as you want.
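For completeness, the corrected loop would look roughly like this sketch (ICcom4Item and the field name come from your code; extract_first() is just a convenience that returns None instead of raising an IndexError when a card has no link):

def parse(self, response):
    for title in response.css('div.jobsearch-SerpJobCard'):
        item = ICcom4Item()
        # Query relative to the current card (title), not the whole list (titles).
        item['role_title_link'] = title.xpath('div[@class="title"]/a/@href').extract_first()
        yield item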

Getting a specific part of a website with Beautiful Soup 4

I got the basics down of finding stuff with Beautiful Soup 4. However, right now I am stuck with a specific problem. I want to scrape the "2DKT94P" from the data-oid of the below code:
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
<div class="listitem relative js-listitem ">
Any pointers on how I might do this? I would also appreciate a pointer for an advanced tutorial that covers this, and/or a link on where I would have been able to find this in the official documentation because I failed to recognize the correct part...
Thanks in advance!
You should locate the div tag using its class attribute, then get its data-oid attribute:
div = soup.find("div", class_="js-object")
oid = div['data-oid']
If your data is well formatted, you can do it this way:
from bs4 import BeautifulSoup

example = """
<div class="js-object listitem_wrap " data-estateid="45784882" data-oid="2DKT94P">
    <div class="listitem relative js-listitem ">2DKT94P DIV</div>
</div>
<div>other div</div>"""

soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find(attrs={"data-oid": "2DKT94P"})
print(RandomDIV.get_text().strip())
Outputs:
2DKT94P DIV
Find more info about find or find_all with attributes here.
Or via select:
RandomDIV = soup.select("div[data-oid='2DKT94P']")
print (RandomDIV[0].get_text().strip())
Find more about select.
EDIT:
Totally misunderstood the question. If you want to search only for data-oid, you can do it like this:
soup = BeautifulSoup(example, "html.parser")
RandomDIV = soup.find_all(lambda tag: [t for t in tag.attrs if t == 'data-oid'])
for div in RandomDIV:
    # data-oid
    print(div["data-oid"])
    # text
    print(div.text.strip())
Learn more here.

Scrapy, No Errors, Spider closes after crawling

for restaurant in response.xpath('//div[@class="listing"]'):
    restaurantItem = RestaurantItem()
    restaurantItem['name'] = response.css(".title::text").extract()
    yield restaurantItem

next_page = response.css(".next > a::attr('href')")
if next_page:
    url = response.urlJoin(next_page[0].extract())
    yield scrapy.Request(url, self.parse)
I fixed all the errors that it was giving me. Now I am getting no errors; the spider just closes after crawling the start_url, and the for loop never gets executed.
When you try to find an element this way:
response.xpath('//div[@class="listing"]')
You are saying you want to find a div that literally has only "listing" as its class:
<div class="listing"></div>
But this doesn't exist anywhere in the DOM; what actually exists is something like the following:
<div class="listing someOtherClass"></div>
To select the above element you have to say that the class attribute contains a certain value but may contain more. Like this:
response.xpath('//div[contains(@class, "listing")]')
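With that change, and querying each name relative to the current restaurant rather than the whole response, the loop would look roughly like this sketch (RestaurantItem is taken from your code):

def parse(self, response):
    for restaurant in response.xpath('//div[contains(@class, "listing")]'):
        restaurantItem = RestaurantItem()
        # Select the title inside this restaurant, not from the whole page.
        restaurantItem['name'] = restaurant.css(".title::text").get()
        yield restaurantItem

    next_page = response.css(".next > a::attr(href)").get()
    if next_page:
        # Note: the Response method is urljoin (lowercase j), not urlJoin.
        yield scrapy.Request(response.urljoin(next_page), self.parse)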

Using Scrapy to scrape data after form submit

I'm trying to scrape content from a listing detail page that can only be viewed by clicking the 'view' button, which triggers a form submit. I am new to both Python and Scrapy.
Example markup
<li><h3>Abc Widgets</h3>
<form action="/viewlisting?id=123" method="post">
<input type="image" src="/images/view.png" value="submit" >
</form>
</li>
My solution in Scrapy is to extract the form actions and then use Request to fetch the page, with a callback to parse it for the desired content. However, I have hit a few issues.
I'm getting the following error: "request url must be str or unicode"
Secondly, when I hardcode a URL to overcome the above issue, it seems my parsing function is returning what looks like a list.
Here is my code - with the real URLs redacted:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from wfi2.items import Wfi2Item

class ProfileSpider(Spider):
    name = "profiles"
    allowed_domains = ["wfi.com.au"]
    start_urls = ["http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=WA",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=VIC",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=QLD",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NSW",
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=TAS"
                  "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NT"
                  ]

    def parse(self, response):
        hxs = Selector(response)
        forms = hxs.xpath('//*[@id="area-managers"]//*/form')
        for form in forms:
            action = form.xpath('@action').extract()
            print "ACTION: ", action
            #request = Request(url=action,callback=self.parse_profile)
            request = Request(url=action,callback=self.parse_profile)
            yield request

    def parse_profile(self, response):
        hxs = Selector(response)
        profile = hxs.xpath('//*[@class="contentContainer"]/*/text()')
        print "PROFILE", profile
I'm getting the following error "request url must be str or unicode"
Please have a look at the Scrapy documentation for extract(). It says: "Serialize and return the matched nodes as a list of unicode strings" (bold added by me).
The first element of the list is probably what you want. So you could do something like:
request = Request(url=response.urljoin(action[0]), callback=self.parse_profile)
secondly when I hardcode a URL to overcome the above issue it seems my
parsing function is returning what looks like a list
According to the documentation of xpath, it's a SelectorList. Add extract() to the xpath call and you'll get a list of text tokens. Eventually you will want to clean up and join the elements of that list before further processing.
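Putting both points together, parse and parse_profile might look roughly like this sketch, reusing your selectors; joining with a space is just one possible way to clean up the text tokens:

def parse(self, response):
    for form in response.xpath('//*[@id="area-managers"]//*/form'):
        action = form.xpath('@action').extract_first()
        if action:
            # Request needs a single absolute URL string, not a list.
            yield Request(url=response.urljoin(action), callback=self.parse_profile)

def parse_profile(self, response):
    # extract() returns a list of text strings; join them into one blob.
    tokens = response.xpath('//*[@class="contentContainer"]/*/text()').extract()
    profile = " ".join(t.strip() for t in tokens if t.strip())
    self.logger.info("PROFILE %s", profile)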