I'm trying to scrape the following HTML. I want to get just the Some Header part and not the additional info.
<li class="media">
<div class="media-body">
<h4> Some Header <span class="label label-info"> additional Info </span> </h4> Address info
<br>
</div> </li>
I'm trying the following:
val li: Elements = ul.select("li")
val list: Elements = li.select("a")
val headers: Elements = list.select("h4")
and then when I try to get the inner text via headers.text(), I get both Some Header and additional Info.
How can I scrape only the Some Header part?
You are almost there. You are probably looking for ownText():
String s = "<li class=\"media\"> \n" +
" <div class=\"media-body\"> \n" +
" <h4> Some Header <span class=\"label label-info\"> additional Info </span> </h4> Address info\n" +
" <br> \n" +
" </div> </li>";
Document document = Jsoup.parse(s);
Elements items = document.select("li");
// The sample HTML has no <a>, so select the <h4> directly; ownText() skips the <span>.
System.out.println(items.select("h4").first().ownText());
Output:
Some Header
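For readers working in Python rather than Java, a rough analogue of the "own text" idea can be sketched with the standard library's ElementTree. This is not jsoup's API, only an illustration: ElementTree's `.text` holds just the text before the first child element, whereas jsoup's ownText() also includes text that follows child elements. The fragment below is only the `<h4>` from the question, since ElementTree needs well-formed XML.

```python
import xml.etree.ElementTree as ET

# Only the <h4> from the question; ElementTree requires well-formed XML.
fragment = '<h4> Some Header <span class="label label-info"> additional Info </span> </h4>'
h4 = ET.fromstring(fragment)

# .text is the text before the first child element, so the <span> is excluded.
header = (h4.text or "").strip()

# itertext() walks all descendants, which mirrors what a plain text() call returns.
full_text = " ".join("".join(h4.itertext()).split())

print(header)
print(full_text)
```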
I searched a lot online but couldn't find an example similar to the one below. I'm trying to pull text from a web page. The first p tag has no location line; the second one does. When pulling data, I can only get the contents of the p tag that has the location row, not the other one. How can I pull the data inside both the first and the second p tag?
HTML codes of Page Source:
<div class=" col-md-8">
<p>
<i class='fa fa-home main-color'></i> ORHAN MAH.İBRAHİM CAD. NO:35
<br>
<i class='fa fa-phone main-color'></i>
<a class="gri" href="tel:0508-2920344">0508-2920344 </a>
<br />
<i class='fa fa-clock-o main-color'></i>
<span class="red">19.01.2022</span>
</p>
<p>
<i class='fa fa-home main-color'></i> HAZAN MAH.ÖKTEM CAD. NO:13/B
<br>
<i class='fa fa-phone main-color'></i>
<a class="gri" href="tel:0584 837 23 70">0584 837 23 70 </a>
<br>
<i class="fa fa-map-marker main-color"></i>
<a class="gri" href="https://www.google.com/maps?q=35.554433,25.887766" target="_blank">Haritada</a>
<br />
<i class='fa fa-clock-o main-color'></i>
<span class="red">20.01.2022</span>
</p>
</div>
Here is the selenium code I used to pull the data from the HTML source above:
item = browser.find_elements_by_class_name("col-md-10")
urls = browser.find_elements_by_xpath("//div[@class=' col-md-10']/p/a[2]")
for i in zip(item, urls):
    try:
        address = i[0].find_element_by_css_selector("p").text.split("\n")[:2]
    except:
        address = None
    try:
        phone = i[0].find_element_by_xpath("//a[@class='gri'][1]").text
    except:
        phone = None
    print(address)
    print(phone)
    try:
        url = i[1].get_attribute('href').replace("https://www.google.com/maps?q=","")
    except:
        url = None
    try:
        date = i[0].find_element_by_xpath("//span[@class='red'][1]").text
    except:
        date = None
    print(url)
    print(date)
Use the XPath //div[@class=' col-md-8']/p. This returns both p tags.
You can then loop over the p tags and apply whatever string operations you need to each one.
The first p block has no location section; the second one does. For the first block, I want to print None in place of the location. When I try to pull the data with zip_longest, the location does not come through correctly.
#1.p tag block
ORHAN MAH.İBRAHİM CAD. NO:35
0508-2920344
19.01.2022
#2.p tag block
HAZAN MAH.ÖKTEM CAD. NO:13/B
0584 837 23 70
Haritada
20.01.2022
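One way to get None for a missing location is to process each p block on its own and look the map link up inside that block, rather than zipping two separately collected lists. The sketch below illustrates the idea with the standard library's ElementTree instead of Selenium, on a well-formed copy of the markup from the question (bare `<br>` tags closed). The helper name `parse_block` and the dictionary keys are hypothetical, and the location is taken from the link's href, as in the question's code, rather than from the link text.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the two <p> blocks (ElementTree cannot parse a bare <br>).
PAGE = """
<div class="col-md-8">
  <p>
    <i class="fa fa-home main-color"></i> ORHAN MAH.İBRAHİM CAD. NO:35
    <br/>
    <i class="fa fa-phone main-color"></i>
    <a class="gri" href="tel:0508-2920344">0508-2920344 </a>
    <br/>
    <i class="fa fa-clock-o main-color"></i>
    <span class="red">19.01.2022</span>
  </p>
  <p>
    <i class="fa fa-home main-color"></i> HAZAN MAH.ÖKTEM CAD. NO:13/B
    <br/>
    <i class="fa fa-phone main-color"></i>
    <a class="gri" href="tel:0584 837 23 70">0584 837 23 70 </a>
    <br/>
    <i class="fa fa-map-marker main-color"></i>
    <a class="gri" href="https://www.google.com/maps?q=35.554433,25.887766" target="_blank">Haritada</a>
    <br/>
    <i class="fa fa-clock-o main-color"></i>
    <span class="red">20.01.2022</span>
  </p>
</div>
"""


def parse_block(p):
    """Extract one <p> block; location stays None when there is no map link."""
    home = p.find("i")  # the first <i> is the address icon; its tail is the address
    address = (home.tail or "").strip() if home is not None else None
    phone = location = None
    for a in p.findall("a"):
        href = a.get("href", "")
        if href.startswith("tel:"):
            phone = (a.text or "").strip()
        elif "maps?q=" in href:
            location = href.split("maps?q=", 1)[1]
    date = p.findtext("span[@class='red']")
    return {"address": address, "phone": phone, "location": location, "date": date}


rows = [parse_block(p) for p in ET.fromstring(PAGE.strip()).findall("p")]
for row in rows:
    print(row)
```

Because every lookup is scoped to a single p element, a block without a map link cannot borrow the link of another block.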
Trying to scrape this part: "Lounge, Showers, Lockers"
https://i.stack.imgur.com/k5mzg.png
<div class="CourseAbout-otherFacilities more">
<h3 class="CourseAbout-otherFacilities-title">Available Facilities</h3> " Lounge, Showers, Lockers "
</div>
Website:
https://www.golfadvisor.com/courses/16929-black-at-bethpage-state-park-golf-course
response.css('.CourseAbout-foodAndBeverage.more::text').get() command returns " \n "
Thank you
There are three text nodes in your target div (the one matched by your CSS expression):
<div class="CourseAbout-otherFacilities more">FIRST
<h3 class="CourseAbout-otherFacilities-title">SECOND</h3>
THIRD</div>
By using .get() you're telling Scrapy to return only the first match, which here is the whitespace before the <h3>.
I recommend using an XPath expression instead and matching the element by its text:
//h3[.="Available Facilities"]/following-sibling::text()[1]
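To see why the first match is only whitespace, and why the text right after the heading is the one that matters, here is a small sketch using the standard library's ElementTree rather than Scrapy. ElementTree has no following-sibling axis, so the sketch reads the `.tail` of the `<h3>`, which is the text node immediately after its closing tag. The markup is the snippet from the question, without the quotation marks that the browser inspector displays around the text.

```python
import xml.etree.ElementTree as ET

html = """<div class="CourseAbout-otherFacilities more">
<h3 class="CourseAbout-otherFacilities-title">Available Facilities</h3> Lounge, Showers, Lockers
</div>"""

div = ET.fromstring(html)

# The first text node of the div: just the whitespace before <h3>.
first_text = div.text

# Find the heading by its text, then read the text node that follows it.
h3 = next(e for e in div.iter("h3") if e.text == "Available Facilities")
facilities = (h3.tail or "").strip()

print(repr(first_text))
print(facilities)
```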
I'm trying to scrape the following html code:
<ul class="results-list" id="search-results">
<li>
<h3 class="name">First John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
<li>
<h3 class="name">Second John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
</ul>
When I run my spider, I get 2 rows, containing the same information. I have name,email,phone columns and for example in the name column for both I would get:
First John,Second John.
My Scrapy code is the following:
people = response.xpath('//ul[@class="results-list"]/li')
for person in people:
    item = SpiderItem()
    item['Name'] = person.xpath(
        '//h3/text()').extract()
    item['Email'] = person.xpath(
        '//div[@class="details"]/a/@href').extract()
    item['Phone'] = person.xpath(
        '//div[@class="details"]/span[@class="phone"]/text()').extract()
    yield item
However when I run scrapy crawl MySpider -o output.csv I get the same information in all rows.
Your XPath expressions are absolute: a leading // searches the whole document, not just the current person. Make them relative with a leading dot:
for person in people:
    item = SpiderItem()
    item['Name'] = person.xpath(
        './/h3/text()').extract_first()
    item['Email'] = person.xpath(
        './/div[@class="details"]/a/@href').extract_first()
    item['Phone'] = person.xpath(
        './/div[@class="details"]/span[@class="phone"]/text()').extract_first()
    yield item
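The difference between searching the whole document and searching relative to the current node can be reproduced outside Scrapy. The sketch below uses the standard library's ElementTree on a trimmed copy of the HTML from the question; the variable names `wrong` and `right` are only for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<ul class="results-list" id="search-results">
  <li><h3 class="name">First John</h3>
      <div class="details"><span class="phone">999999999</span></div></li>
  <li><h3 class="name">Second John</h3>
      <div class="details"><span class="phone">999999999</span></div></li>
</ul>""")

# Searching from the document root on every iteration: each row gets both names.
wrong = [[h.text for h in doc.iter("h3")] for _ in doc.findall("li")]

# Searching relative to each <li> (note the leading dot): one name per row.
right = [li.findtext(".//h3") for li in doc.findall("li")]

print(wrong)
print(right)
```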
I want to create a for loop to get the text from several lis with Watir.
Here is the HTML I'm trying to scrape:
<ol class="shared-connections">
<li class="small-result">
<a class="img-link" href="http://www.google.com"></a>
</li>
<li class="small-result">
<a class="img-link" href="http://www.google.com"></a>
</li>
</ol>
I'm trying to get the href value of the links with a loop, but I can't get the loop to start with this code:
@browser.ol(class: "shared-connections").lis(class: "small-result").each do |connection|
  p "is this working?"
end
The "ol" tag prevents the loop from working and gives me this error:
/Library/Ruby/Gems/2.0.0/gems/watir-webdriver-0.9.1/lib/watir-webdriver/elements/element.rb:536:in `assert_element_found': unable to locate element, using {:class=>"shared-connections", :tag_name=>"ol"} (Watir::Exception::UnknownObjectException)
Any idea how to get "ol" to work with Watir? Thanks!
It seems your code didn't open the right page.
The code itself works. I created a file, a.html, with the HTML you provided.
In irb:
pp File.readlines('a.html')
["<ol class=\"shared-connections\">",
" <li class=\"small-result\">",
" <a class=\"img-link\" href=\"http://www.google.com\"></a>",
" </li>",
" <li class=\"small-result\">",
" <a class=\"img-link\" href=\"http://www.google.com\"></a>",
" </li>",
"</ol>"]
Then
b = Watir::Browser.new :chrome
b.goto 'file://' + Dir.pwd + '/a.html'
b.ol(class: "shared-connections").lis(class: "small-result").each do |connection|
  p "is this working?"
end
"is this working?"
"is this working?"
=> [#<Watir::LI:0x604055b6f2097db6 located=false selector={element: (webdriver element)}>, #<Watir::LI:0x..f5826cdc74313e1a located=false selector={element: (webdriver element)}>]
You can check which page is actually loaded by inspecting @browser.html
<div id="div_12_1_1_1_3_1_2_1_1_1_2" class="Quantity CoachView CoachView_show" data-eventid="" data-viewid="qty" data-config="config12" data-bindingtype="Decimal" data-binding="local.priceBreak.quantity" data-type="com.ibm.bpm.coach.Snapshot_a30ea40f_cb24_4729_a02e_25dc8e12dcab.Quantity">
<div class="w-decimal w-group clearfix">
<div class="p-label-container span4">
<div class="p-fields-container controls-row span8 l-input fixed-units">
<input id="div_12_1_1_1_3_1_2_1_1_1_2-in" class="p-field span8" type="text" maxlength="16">
<input id="div_12_1_1_1_3_1_2_1_1_1_2-iu" class="p-unit span4" type="text" maxlength="2" style="display: none;">
<select class="p-unit span4" style="display: none;"></select>
<div class="p-unit span4">CM</div>
<div class="p-help-block"></div>
</div>
<div class="p-fields-container span8 l-output" style="display: none;">
</div>
</div>
<div id="div_12_1_1_1_3_1_2_1_1_1_3" class="Quantity CoachView CoachView_show" data-eventid="" data-viewid="Quantity2" data-config="config73" data-bindingtype="Integer" data-binding="local.priceBreak.numberDeliveries" data-type="com.ibm.bpm.coach.Snapshot_a30ea40f_cb24_4729_a02e_25dc8e12dcab.Quantity">
How can I click the text box whose id is "div_12_1_1_1_3_1_2_1_1_1_2-in"?
In some scenarios the id changes to "div_5_1_1_1_3_1_2_1_1_1_2-in".
I have tried the following:
driver.findElement(By.xpath("//div/input[ends-with(@id,'__1_1_1_3_1_2_1_1_1_2-in')]")).sendKeys("98989998989");
but it is not working.
Output:
org.openqa.selenium.InvalidSelectorException: The given selector //div/input[ends-with(@id,'__1_1_1_3_1_2_1_1_1_2-in')] is either invalid or does not result in a WebElement. The following error occurred:
InvalidSelectorError: Unable to locate an element with the xpath expression //div/input[ends-with(@id,'__1_1_1_3_1_2_1_1_1_2-in')] because of the following error:
[Exception... "The expression is not a legal expression." code: "51" nsresult: "0x805b0033 (NS_ERROR_DOM_INVALID_EXPRESSION_ERR)" location: "file:///C:/Users/SUNIL~1.WAL/AppData/Local/Temp/anonymous4157273428687139624webdriver-profile/extensions/fxdriver@googlecode.com/components/driver_component.js Line: 5956"]
Command duration or timeout: 41 milliseconds
For documentation on this error, please visit: http://seleniumhq.org/exceptions/invalid_selector_exception.html
Build info: version: '2.37.0', revision: 'a7c61cb', time: '2013-10-18 17:15:02'
You can try the following cssSelector:
driver.findElement(By.cssSelector("div.fixed-units > input[id$='in']")).sendKeys("98989998989");
If the 'in' text is always present, you can also use XPath. You can try //input[contains(@id, '-in')]
ends-with is an XPath 2.0 function, and none of the five major browsers actually supports XPath 2.0.
Your options are either to use other methods as already suggested, or for an XPath 1.0 solution you could use:
//div/input[substring(@id, string-length(@id) - 22) = '_1_1_1_3_1_2_1_1_1_2-in']
Although it's ugly, really.
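The offset 22 comes from the suffix length: the suffix has 23 characters, and XPath's substring() is 1-based, so the start position is string-length minus 23 plus 1. As a sanity check, here is a small Python sketch that builds the same predicate for any suffix and emulates substring() on the two ids from the question. Both helper names (`xpath_substring`, `ends_with_xpath`) are hypothetical, and the builder assumes the suffix contains no single quote.

```python
def xpath_substring(s, start):
    """Emulate XPath 1.0 substring(s, start) for an integer, 1-based start."""
    return s[max(start - 1, 0):]


def ends_with_xpath(attr, suffix):
    """Build an XPath 1.0 predicate equivalent to ends-with(@attr, suffix)."""
    offset = len(suffix) - 1
    return f"substring(@{attr}, string-length(@{attr}) - {offset}) = '{suffix}'"


suffix = "_1_1_1_3_1_2_1_1_1_2-in"
predicate = ends_with_xpath("id", suffix)

# Check the arithmetic against both ids mentioned in the question.
ids = ("div_12_1_1_1_3_1_2_1_1_1_2-in", "div_5_1_1_1_3_1_2_1_1_1_2-in")
matches = [
    xpath_substring(i, len(i) - (len(suffix) - 1)) == suffix for i in ids
]

print(predicate)
print(matches)
```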
I actually used "starts-with"...but I see you have multiple that start with "div".
If your elements all stay in the same place on the page and aren't subject to change, try this out. Here's some Java code:
By by = By.xpath("(//*[starts-with(@" + attributeName + ", '" + attributeValue + "')])[" + n + "]");
In your case, it would look like this:
By by = By.xpath("(//*[starts-with(@id, 'div')])[2]");
What this will do is pick the second element that starts with "div" in the DOM.
It's a bit of a hack...but it might work out for you.