Select an "ol" HTML tag with Watir Webdriver - selenium

I want to create a loop to get the text from several li elements with Watir.
Here is the HTML I'm trying to scrape:
<ol class="shared-connections">
<li class="small-result">
<a class="img-link" href="http://www.google.com"></a>
</li>
<li class="small-result">
<a class="img-link" href="http://www.google.com"></a>
</li>
</ol>
I'm trying to get the href value of the links with a loop, but I can't get the loop to start with this code:
@browser.ol(class: "shared-connections").lis(class: "small-result").each do |connection|
  p "is this working?"
end
The "ol" tag prevents the loop from working and gives me this error:
/Library/Ruby/Gems/2.0.0/gems/watir-webdriver-0.9.1/lib/watir-webdriver/elements/element.rb:536:in `assert_element_found': unable to locate element, using {:class=>"shared-connections", :tag_name=>"ol"} (Watir::Exception::UnknownObjectException)
Any idea how to get "ol" to work with Watir? Thanks!

It seems your script did not open the right page.
The code itself works. I saved the HTML you provided to a file, a.html,
and checked it in irb:
pp File.readlines('a.html')
["<ol class=\"shared-connections\">",
" <li class=\"small-result\">",
" <a class=\"img-link\" href=\"http://www.google.com\"></a>",
" </li>",
" <li class=\"small-result\">",
" <a class=\"img-link\" href=\"http://www.google.com\"></a>",
" </li>",
"</ol>"]
Then
b = Watir::Browser.new :chrome
b.goto 'file://' + Dir.pwd + '/a.html'
b.ol(class: "shared-connections").lis(class: "small-result").each do |connection|
  p "is this working?"
end
"is this working?"
"is this working?"
=> [#<Watir::LI:0x604055b6f2097db6 located=false selector={element: (webdriver element)}>, #<Watir::LI:0x..f5826cdc74313e1a located=false selector={element: (webdriver element)}>]
You can verify which page your script actually loaded by inspecting @browser.html.

Related

Data scraping by selenium p tag

I searched a lot online but couldn't find an example similar to the one below. I'm trying to pull text from a web page. The first p tag has no location line; the second one does. When pulling data, I can only get the contents of the p tag that has the location row, not the contents of the other p tag. How can I pull the data from both the first and the second p tag?
HTML codes of Page Source:
<div class=" col-md-8">
<p>
<i class='fa fa-home main-color'></i> ORHAN MAH.İBRAHİM CAD. NO:35
<br>
<i class='fa fa-phone main-color'></i>
<a class="gri" href="tel:0508-2920344">0508-2920344 </a>
<br />
<i class='fa fa-clock-o main-color'></i>
<span class="red">19.01.2022</span>
</p>
<p>
<i class='fa fa-home main-color'></i> HAZAN MAH.ÖKTEM CAD. NO:13/B
<br>
<i class='fa fa-phone main-color'></i>
<a class="gri" href="tel:0584 837 23 70">0584 837 23 70 </a>
<br>
<i class="fa fa-map-marker main-color"></i>
<a class="gri" href="https://www.google.com/maps?q=35.554433,25.887766" target="_blank">Haritada</a>
<br />
<i class='fa fa-clock-o main-color'></i>
<span class="red">20.01.2022</span>
</p>
</div>
Here is the selenium code I used to pull the data from the HTML source above:
item = browser.find_elements_by_class_name("col-md-10")
urls = browser.find_elements_by_xpath("//div[@class=' col-md-10']/p/a[2]")
for i in zip(item,urls):
    try:
        address = i[0].find_element_by_css_selector("p").text.split("\n")[:2]
    except:
        address = None
    try:
        phone = i[0].find_element_by_xpath("//a[@class='gri'][1]").text
    except:
        phone = None
    print(address)
    print(phone)
    try:
        url = i[1].get_attribute('href').replace("https://www.google.com/maps?q=","")
    except:
        url = None
    try:
        date = i[0].find_element_by_xpath("//span[@class='red'][1]").text
    except:
        date = None
    print(url)
    print(date)
Use the XPath //div[@class=' col-md-8']/p. This returns both p tags.
You can then loop over the p tags and apply whatever string operations you need to each one's text.
The first p block has no location section; the second p block does. For the first p block, I want to print None in place of the location. When I try to pull the data with zip_longest, the location is not retrieved correctly.
#1.p tag block
ORHAN MAH.İBRAHİM CAD. NO:35
0508-2920344
19.01.2022
#2.p tag block
HAZAN MAH.ÖKTEM CAD. NO:13/B
0584 837 23 70
Haritada
20.01.2022
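One way to get None for a missing location is to build one record per p block instead of zipping separate lists. The following is only a sketch, not the approach from the answer above: it assumes the page source is available as a string (for example from browser.page_source) and parses it with the standard library's HTMLParser. The class name ParagraphRecords and the field names are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphRecords(HTMLParser):
    """Collect one record per <p> block; fields that are absent stay None."""

    def __init__(self):
        super().__init__()
        self.records = []     # finished records, one per <p>
        self._current = None  # record being filled, or None outside a <p>
        self._field = None    # field that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._current = {"address": None, "phone": None,
                             "location": None, "date": None}
        elif self._current is None:
            return
        elif tag == "a":
            href = attrs.get("href") or ""
            if href.startswith("tel:"):
                self._field = "phone"
            elif "maps?q=" in href:
                # Keep only the coordinates after "maps?q=".
                self._current["location"] = href.split("maps?q=", 1)[1]
        elif tag == "span" and attrs.get("class") == "red":
            self._field = "date"

    def handle_data(self, data):
        text = data.strip()
        if self._current is None or not text:
            return
        if self._field:
            self._current[self._field] = text
            self._field = None
        elif self._current["address"] is None:
            # The first bare text inside the <p> is the address line.
            self._current["address"] = text

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "p" and self._current is not None:
            self.records.append(self._current)
            self._current = None


# Shortened copy of the markup from the question.
page_source = """<div class=" col-md-8">
<p>
<i class='fa fa-home main-color'></i> ORHAN MAH.İBRAHİM CAD. NO:35
<br>
<a class="gri" href="tel:0508-2920344">0508-2920344 </a>
<br />
<span class="red">19.01.2022</span>
</p>
<p>
<i class='fa fa-home main-color'></i> HAZAN MAH.ÖKTEM CAD. NO:13/B
<br>
<a class="gri" href="tel:0584 837 23 70">0584 837 23 70 </a>
<br>
<a class="gri" href="https://www.google.com/maps?q=35.554433,25.887766" target="_blank">Haritada</a>
<br />
<span class="red">20.01.2022</span>
</p>
</div>"""

parser = ParagraphRecords()
parser.feed(page_source)
parser.close()
for record in parser.records:
    print(record)
```

Because every record starts with all four fields set to None, a p block without a map link simply keeps location as None, and the fields of different blocks can never get shifted against each other.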

Find specific data-availability class

I have the following webpage source:
<li class="available" data-availability="homeDelivery">
<i class="icon-tick"></i> FREE delivery
</li>
I want to print "Free delivery" to the screen if data-availability == homeDelivery.
I tried with the below code but I get no match.
result = soup.find_all("option", {"data-availability": 'homeDelivery'})
print(result)
Any ideas? Thank you!
You should be looking for the <li> tag, not <option>.
Try this:
from bs4 import BeautifulSoup
sample = """<li class="available" data-availability="homeDelivery">
<i class="icon-tick"></i> FREE delivery
</li>"""
result = BeautifulSoup(sample, "html.parser").find_all("li", {"data-availability": 'homeDelivery'})
print([i.getText(strip=True) for i in result])
Output:
['FREE delivery']

Scrape text under div tag that is in quotes

Trying to scrape this part: "Lounge, Showers, Lockers"
https://i.stack.imgur.com/k5mzg.png
<div class="CourseAbout-otherFacilities more">
<h3 class="CourseAbout-otherFacilities-title">Available Facilities</h3> " Lounge, Showers, Lockers "
</div>
Website:
https://www.golfadvisor.com/courses/16929-black-at-bethpage-state-park-golf-course
response.css('.CourseAbout-foodAndBeverage.more::text').get() command returns " \n "
Thank you
There are three text elements in your target div (matched by your CSS expression):
<div class="CourseAbout-otherFacilities more">FIRST<h3
<h3 class="CourseAbout-otherFacilities-title">SECOND</h3>
</h3>THIRD</div>
By using .get() you're telling Scrapy to return only the first match.
I recommend using an XPath expression here instead and matching the element by its text:
//h3[.="Available Facilities"]/following-sibling::text()[1]
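To see why the first match is only whitespace, it can help to look at the text nodes one by one. This is a rough illustration using the standard library's ElementTree rather than Scrapy itself; it assumes a simplified, well-formed copy of the markup from the question, with the quotes around the text dropped (they appear to come from the browser's inspector view).

```python
import xml.etree.ElementTree as ET

# Simplified copy of the target div from the question (an assumption).
fragment = """<div class="CourseAbout-otherFacilities more">
  <h3 class="CourseAbout-otherFacilities-title">Available Facilities</h3> Lounge, Showers, Lockers
</div>"""

div = ET.fromstring(fragment)
h3 = div.find("h3")

# Text node directly inside <div>, before the <h3>: whitespace only.
before_h3 = div.text
# Text node right after </h3>; ElementTree stores it as the element's "tail".
after_h3 = h3.tail.strip()

print(repr(before_h3))
print(after_h3)
```

The tail of the h3 here is the same text node that following-sibling::text()[1] selects in the XPath expression.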

How to select specific text to scrape

I'm trying to scrape the following HTML. I want to get just the Some Header part, not the additional info.
<li class="media">
<div class="media-body">
<h4> Some Header <span class="label label-info"> additional Info </span> </h4> Address info
<br>
</div> </li>
I'm trying the following:
val li: Elements = ul.select("li")
val list: Elements = li.select("a")
val headers: Elements = list.select("h4")
Then, when I try to get the inner text via headers.text(), I get both Some Header and additional Info.
How can I only scrape the Some Header part?
You are close to the solution. You are probably looking for ownText:
String s = "<li class=\"media\"> \n" +
" <div class=\"media-body\"> \n" +
" <h4> Some Header <span class=\"label label-info\"> additional Info </span> </h4> Address info\n" +
" <br> \n" +
" </div> </li>";
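Document document = Jsoup.parse(s);
// The sample markup has no <a> element, so select the <h4> inside the <li> directly.
Elements headers = document.select("li h4");
// ownText() returns only the element's own text, excluding child elements such as the <span>.
System.out.println(headers.first().ownText());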
Output:
Some Header

Scrapy CSV export shows the same data in all rows

I'm trying to scrape the following html code:
<ul class="results-list" id="search-results">
<li>
<h3 class="name">First John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
<li>
<h3 class="name">Second John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
</ul>
When I run my spider, I get 2 rows, containing the same information. I have name,email,phone columns and for example in the name column for both I would get:
First John,Second John.
My Scrapy code is the following:
people = response.xpath('//ul[@class="results-list"]/li')
for person in people:
    item = SpiderItem()
    item['Name'] = person.xpath(
        '//h3/text()').extract()
    item['Email'] = person.xpath(
        '//div[@class="details"]/a/@href').extract()
    item['Phone'] = person.xpath(
        '//div[@class="details"]/span[@class="phone"]/text()').extract()
    yield item
However when I run scrapy crawl MySpider -o output.csv I get the same information in all rows.
You are using absolute paths in your XPath expressions. A leading // searches the whole document, even when called on person, so every item gets the data of all people. Make the paths relative to the current li with .// instead:
for person in people:
    item = SpiderItem()
    item['Name'] = person.xpath(
        './/h3/text()').extract_first()
    item['Email'] = person.xpath(
        './/div[@class="details"]/a/@href').extract_first()
    item['Phone'] = person.xpath(
        './/div[@class="details"]/span[@class="phone"]/text()').extract_first()
    yield item
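The difference between the two kinds of path can be reproduced without Scrapy. This is a small sketch using the standard library's ElementTree as a stand-in for Scrapy's selectors (an assumption made so it runs anywhere). ElementTree does not accept a leading // on an element, so the document-wide search is imitated by searching from the root.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the list from the question.
html = """<ul class="results-list" id="search-results">
  <li><h3 class="name">First John</h3>
      <div class="details"><span class="phone">999999999</span></div></li>
  <li><h3 class="name">Second John</h3>
      <div class="details"><span class="phone">999999999</span></div></li>
</ul>"""

root = ET.fromstring(html)
people = root.findall("li")

# Searching from the document root ignores which <li> we are in,
# which is the effect of a leading "//" in the original spider.
global_names = [[h3.text for h3 in root.iter("h3")] for _ in people]

# Searching relative to each <li> (".//") yields only that person's name.
relative_names = [[h3.text for h3 in person.findall(".//h3")]
                  for person in people]

print(global_names)
print(relative_names)
```

The first list repeats both names for every row, which matches the duplicated CSV output described in the question; the second list has one name per row.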