I have some HTML I'm trying to scrape while learning Selenium. What I need are the values Fashion, Long, and so on.
I've tried:
style = driver.find_element_by_xpath("//strong[text()='Style:']/following::strong").text
style = driver.find_element_by_xpath("//strong[text()='Style:']/following-sibling::strong").text
style = driver.find_element_by_xpath("//strong[contains(.,'Style:')] /preceding-sibling::strong").text
and everything in between.
<div class="xxkkk20">
<strong>Style:</strong> Fashion <br>
<strong>Shirt Length:</strong> Long <br>
<strong>Collar:</strong> Scoop Neck <br>
<strong>Material:</strong> Polyester <br>
<strong>Pattern Type:</strong> Floral,Skulls <br>
<strong>Embellishment:</strong> Lace <br>
<strong>Thickness:</strong> Standard <br>
<strong>Fabric Stretch:</strong> High Stretch <br>
<strong>Seasons:</strong> Summer <br>
<strong>Weight:</strong> 0.1700kg <br>
<strong>Package Contents:</strong> 1 x Tank Top
</div>
You can collect all of the matching elements in a list, iterate over that list, and read each web element's text attribute to extract its text.
from selenium.webdriver.common.by import By
all_elements = driver.find_elements(By.XPATH, "//div[@class='xxkkk20']/strong")
for ele in all_elements:
    print(ele.text)
Update 1 :
keys = driver.find_elements(By.XPATH, "//strong")
for key in keys:
    print(key.get_attribute('innerHTML'))
# find_element (singular) returns one element, which has a .text attribute
pairs = driver.find_element(By.XPATH, '//div[@class="xxkkk20"]').text.split("\n")
for pair in pairs:
    texts = pair.split()
    print(texts[1])
Explanation:
First you get all the texts inside the parent div element.
Then you split it by \n according to <br> elements there.
Now you should actually have pairs of texts like Style: Fashion.
And since you want to get the second values only you need to split each pair and get the second substring.
I assume there is a space between the first and the second text in each pair string.
If there are no spaces, or a label itself contains a space, you can split by : instead, so it will look like this:
pairs = driver.find_element(By.XPATH, '//div[@class="xxkkk20"]').text.split("\n")
for pair in pairs:
    texts = pair.split(":")
    print(texts[1])
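The splitting step described above can be sketched without a browser. This is a minimal sketch, assuming the div's text arrives as one "Label: Value" pair per line; parse_pairs and the sample string are illustrative names, not part of Selenium.

```python
def parse_pairs(div_text: str) -> dict:
    """Turn 'Label: Value' lines into a {label: value} dict."""
    pairs = {}
    for line in div_text.splitlines():
        # partition splits on the first colon only, so labels that
        # contain spaces ("Shirt Length") stay intact
        label, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            pairs[label.strip()] = value.strip()
    return pairs


# Stand-in (assumption) for:
# driver.find_element(By.XPATH, "//div[@class='xxkkk20']").text
sample = (
    "Style: Fashion\n"
    "Shirt Length: Long\n"
    "Pattern Type: Floral,Skulls\n"
    "Package Contents: 1 x Tank Top"
)
result = parse_pairs(sample)
print(result["Shirt Length"])
```

Keeping the labels as dictionary keys makes it possible to look up a single value, such as Style, without relying on line order.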
Try this:
textValues = driver.find_element_by_xpath('//div[@class="xxkkk20"]').text.split("\n")
for txt in textValues:
    print(txt.split(":")[1].strip())
Related
Using Selenium 4.8 in .NET 6, I have the following html structure to parse.
<ul class="search-results">
<li>
<a href=//to somewhere>
<span class="book-desc">
<div class="book-title">some title</div>
<span class="book-author">some author</span>
</span>
</a>
</li>
</ul>
I need to find and click on the right li where the book-title matches my variable input (ideally ignore sentence case too) AND the book author also matches my variable input. So far I'm not getting that xpath syntax correct. I've tried different variations of something along these lines:
var matchingBooks = driver.FindElements(By.XPath($"//li[.//span[@class='book-author' and text()='{b.Authors}' and @class='book-title' and text()='{b.Title}']]"));
then I check if matchingBooks has a length before clicking on the first element. But matchingBooks is always coming back as 0.
class="book-author" belongs to span while class="book-title" belongs to div child element.
Also, the text may be surrounded by extra whitespace, so it's better to use contains instead of an exact equality check.
So, instead of "//li[.//span[@class='book-author' and text()='{b.Authors}' and @class='book-title' and text()='{b.Title}']]" please try this:
"//li[.//span[@class='book-author' and(contains(text(),'{b.Authors}'))] and .//div[@class='book-title' and(contains(text(),'{b.Title}'))]]"
UPD
The following XPath should work. This is an example of a specific XPath I tried, and it worked for the blood of the fold search input: "//li[.//span[@class='book-author' and(contains(text(),'anima'))] and .//div[@class='book-title' and(contains(text(),'Coloring'))]]"
Also, I guess you should click on the a element inside the li, not on the li itself. So, try to click the following element:
"//li[.//span[@class='book-author' and(contains(text(),'{b.Authors}'))] and .//div[@class='book-title' and(contains(text(),'{b.Title}'))]]//a"
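The question also asked about ignoring letter case. XPath 1.0 has no lower-case function, but translate() can map upper-case ASCII letters to lower-case before contains() is applied. Below is a minimal sketch in Python that only builds the XPath string; ci_contains and book_link_xpath are hypothetical helper names, and the sketch assumes the inputs contain no single quotes and only ASCII letters need case folding.

```python
import string

UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase


def ci_contains(node_expr: str, needle: str) -> str:
    """Build an XPath 1.0 test: node_expr contains needle, ignoring ASCII case."""
    if "'" in needle:
        # Assumption: quoting of apostrophes is out of scope for this sketch
        raise ValueError("this sketch does not escape single quotes")
    return (
        f"contains(translate({node_expr}, '{UPPER}', '{LOWER}'), "
        f"'{needle.lower()}')"
    )


def book_link_xpath(author: str, title: str) -> str:
    """XPath for the <a> inside the <li> whose author and title both match."""
    return (
        "//li[.//span[@class='book-author' and "
        + ci_contains("normalize-space(.)", author)
        + "] and .//div[@class='book-title' and "
        + ci_contains("normalize-space(.)", title)
        + "]]//a"
    )


xp = book_link_xpath("Anima", "Coloring")
print(xp)
```

The resulting string can be passed to By.XPath in the same way as the expressions above; normalize-space(.) is used here so that surrounding whitespace in the page does not matter.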
I've recently come across an issue.
I need to find a div tag on a page, that contain specific text. The problem is, that text is divided into two parts by an inner link tag, so that an HTML tree would look like:
<html>
<...>
<div>
start of div text - part 1
<a/>
end of div text - part 2
</div>
<...>
</html>
To uniquely identify that div tag I'd need two parts of div text. Naturally, I would come up with something like this XPath:
.//div[contains(text(), 'start of div text') and contains(text(), 'end of div text')]
However, it doesn't work, the second part can not be found.
What would be the best approach to describe this kind of tag uniquely?
Try the XPath below to match the required div by its two text nodes:
//div[normalize-space(text())="start of div text - part 1" and normalize-space(text()[2])="end of div text - part 2"]
You were almost there. You simply need to replace the text() with . as follows:
//div[contains(., 'start of div text') and contains(., 'end of div text')]
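The reason text() fails here is that in XPath 1.0, when a node-set is passed to a string function such as contains(), only the first node in document order is used. The difference can be illustrated with Python's standard library, as a rough sketch: div.text stands in for the first text node, and the joined itertext() stands in for the string value that . refers to. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    "<div>start of div text - part 1<a/>end of div text - part 2</div>"
)

# Roughly what contains(text(), ...) sees: the first text node only
first_text_node = div.text

# Roughly what contains(., ...) sees: all descendant text, concatenated
string_value = "".join(div.itertext())

print("end of div text" in first_text_node)
print("end of div text" in string_value)
```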
This should work:
//div[contains(text(), 'start of div text') and contains(./a/text(), 'end of div text')]
Well if you have HTML DOM tree like this :
<div id="container" class="someclass">
<div>
start of div text - part 1
<a/>
end of div text - part 2
</div>
</div>
then to extract the div text, you can write an XPath like this:
//div[@id='container']/child::div
P.S.: Writing an XPath that matches on the exact text you are trying to find is not a good practice.
If all you want is the div element of those child text elements, then you could isolate a piece of unique content from "part 1" and try the following:
//*[contains(., 'part 1')]/parent::div
This way you wouldn't have to think about the div's attributes.
However, this is usually not best practice. Ideally, you should use the following Xpath in most cases:
//div[@id='some id' and contains(., 'part 1')]
I would like to collect Japanese articles found through Google. To extract the Japanese sentences, I run the following code on the tag that contains the most Japanese words.
texts = mostTag.xpath('<<path>>/text()').extract()
text = ''
for s in texts:
    text += s
However, this code has a problem when the article has a link between sentences, as below.
<div class="sample">
<p>
"A"
B
"C"
</p>
</div>
In this case, my program gets AC, but what I want is ABC. I would appreciate it if anyone could tell me how to get the sentence as 'ABC'.
You can try to use string():
text = mostTag.xpath('string(//div[@class="sample"])').extract_first()
Or use html2text
How can I find text in following HTML:
style="background-color: transparent;">
<a href="/">Home</a>
<a id="brea dcrumbs-790" class=" main active mainactive" href="/products">Products</a>
<a href="/products/fruit-and-creme-curds">Fruit & Crème Curds</a>
Crème Banana Curd
</li>
</ul>
</div>
This is the HTML for a breadcrumb: the first three items are links and the fourth is the page name. I want to read the page name (Crème Banana Curd) from the breadcrumb. But since it is not inside any node of its own, how can I catch it?
If the text isn't present inside any tag, then it is present in the body tag.
So you can use something like the following to identify it:
html/body/text()
The question is vague without the full HTML source, but you may still try the solution below, storing the text in a variable:
var breadcrumb = FindElement(By.XPath(".//*[@id='brea dcrumbs-790']/following-sibling::a")).Text;
use the below code:
WebElement elem = driver.findElement(By.xpath("//*[contains(text(),'Crème Banana Curd')]"));
elem.getText();
hope this will help you.
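One thing to keep in mind for the breadcrumb question: Selenium locators return elements, not bare text nodes, so the page name has to be reached through a neighbouring element. The idea can be sketched with Python's standard library, where the text that follows an element is exposed as that element's tail. The snippet below is a hypothetical, well-formed stand-in for the breadcrumb markup (the enclosing li is an assumption), not the asker's real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cleaned-up stand-in for the breadcrumb markup in the question
snippet = (
    "<li>"
    '<a href="/">Home</a>'
    '<a href="/products">Products</a>'
    '<a href="/products/fruit-and-creme-curds">Fruit &amp; Crème Curds</a>'
    "Crème Banana Curd"
    "</li>"
)

li = ET.fromstring(snippet)
last_link = li[-1]  # the final <a> child of the <li>
# The bare text after the last link is stored as that link's tail
page_name = (last_link.tail or "").strip()
print(page_name)
```

With Selenium itself, a comparable approach would be to read the text of the parent element and remove the link texts from it, which is left here as an untested suggestion.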
Looking to do something relatively straightforward, I'm scraping text which so far I have had no problem grabbing, but I need to keep the <br> tags because white space analysis is an important part of the dataset.
Is there a way to keep the <br> tags so I can turn them into \n\r later on?
Example:
<p>
<span>Some text.</br></span>
<a>Some more text.<br></a>
<span>Some more more text.<br></span>
</p>
I need : Some text.<br>Some more text.<br>Some more more text.<br>
Right now I get: Some text. Some more text. Some more more text.
Advice?
The only way is to get the HTML of your selection: all you have to do is change the column type from Text to HTML. There is no way to get only the text plus the <br> tags.
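If the selection is exported as HTML, the text plus <br> markers can be recovered afterwards in a separate step. Here is a minimal sketch using Python's standard html.parser; BrKeepingExtractor and text_with_br are hypothetical names, and the sketch assumes that whitespace around each text chunk is not significant.

```python
from html.parser import HTMLParser


class BrKeepingExtractor(HTMLParser):
    """Collect text content, emitting a literal <br> for every br tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("<br>")

    def handle_endtag(self, tag):
        # The malformed </br> from the question is reported as an end tag
        if tag == "br":
            self.parts.append("<br>")

    def handle_startendtag(self, tag, attrs):
        # <br/> would otherwise trigger both handlers above
        if tag == "br":
            self.parts.append("<br>")

    def handle_data(self, data):
        text = data.strip()  # assumption: surrounding whitespace is noise
        if text:
            self.parts.append(text)


def text_with_br(html: str) -> str:
    parser = BrKeepingExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = """<p>
<span>Some text.</br></span>
<a>Some more text.<br></a>
<span>Some more more text.<br></span>
</p>"""
print(text_with_br(sample))
```

Replacing the "<br>" literal in the handlers with a newline sequence would give the line-break form directly.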