How do I get sentences including links? - scrapy

I would like to collect Japanese articles returned by a Google search. I am trying to extract the Japanese sentences, so I run the following code against the tag that contains the most Japanese words:
texts = mostTag.xpath('<<path>>/text()').extract()
text = ''
for s in texts:
    text += s
But this code has a problem when the article has a link between sentences, as below.
<div class="sample">
  <p>
    "A"
    <a href="...">B</a>
    "C"
  </p>
</div>
In this case, my program gets AC, but what I want is ABC. I would appreciate it if anyone could tell me how to get the sentence as 'ABC'.

You can try to use string():
text = mostTag.xpath('string(//div[@class="sample"])').extract_first()
Or use html2text
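Alternatively, a minimal sketch (assuming mostTag is the scrapy Selector wrapping the <div class="sample"> block) could select every descendant text node instead of only the direct children, which keeps the link text as well:
texts = mostTag.xpath('.//text()').extract()  # descendant text nodes, including the link's text
text = ''.join(t.strip() for t in texts if t.strip())
# With the sample HTML above, B from inside the <a> tag is now part of the result.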

Related

Removing HTML Tags from Big Query

I have a column in my table which stores a paragraph like the one below:
<p><img src="https://mywebsite.com/medias/NH2xcoUOfANfFb6l4xNgOFch3dc4TvoX2XBnI6to.jpg" alt="" width="250" height="33"></p><p><span style="font-size: 16pt; font-family: Mali, cursive; font-weight: 500;">My beautiful text is here. Show me without tags, please.</span> </p>
I want to remove all the HTML tags and, if possible, replace each HTML image with the text (Image).
So my expected output would look like this:
(Image) My beautiful text is here. Show me without tags, please.
OR just
My beautiful text is here. Show me without tags, please.
Thank you so much.
Try the naive approach below:
select html,
  regexp_replace(
    regexp_replace(
      regexp_replace(html,
        r'<img [^<>]*>', r'(Image) '),
      r'(&)([^&;]*)(;)', r'<\2>'
    ), r'\<[^<>]*\>', ''
  ) as text
from your_table
If applied to the sample data in your question, the output text is (Image) My beautiful text is here. Show me without tags, please.
As you can see, the first step is to replace the image tag with the (Image) text, the second step is to address HTML entities by enclosing them in <...> (for example, &nbsp; becomes <nbsp>), and the final step is to remove everything between and including < and >.
Note: the above is a simplistic approach and might not work for more complex HTML.
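If you want to sanity-check the same three regex steps outside BigQuery, a rough Python sketch (re.sub standing in for REGEXP_REPLACE, with the html value taken from the sample paragraph in the question) might look like this:
import re

html = ('<p><img src="https://mywebsite.com/medias/NH2xcoUOfANfFb6l4xNgOFch3dc4TvoX2XBnI6to.jpg" '
        'alt="" width="250" height="33"></p><p><span style="font-size: 16pt; font-family: Mali, '
        'cursive; font-weight: 500;">My beautiful text is here. Show me without tags, please.</span> </p>')

step1 = re.sub(r'<img [^<>]*>', '(Image) ', html)   # replace image tags with the (Image) marker
step2 = re.sub(r'(&)([^&;]*)(;)', r'<\2>', step1)   # wrap HTML entities such as &amp; in <...>
step3 = re.sub(r'<[^<>]*>', '', step2)              # drop everything between and including < and >
print(step3.strip())  # (Image) My beautiful text is here. Show me without tags, please.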

How can I search a Beautiful Soup tree to get the tag path to a text match?

I would like to search a Beautiful Soup element for a text match and return the sequence of tags that lead to the element containing that text.
For example, if at soup.html.head.meta there is text “Hello everybody”, I would like to search on “soup.head” for “Hello everybody” and return the result “soup.html.head.meta”.
Is there a good way to do this and if there is not a simple way, is there a good workaround for quickly finding out where certain known text is located?
Example:
I retrieved the HTML source code from this URL with wget: https://www.gitpod.io/docs/context-urls
I created a Beautiful Soup object from this document like so:
soup = bs4.BeautifulSoup(doc, 'html.parser')
The method soup.html.head.get_text() returns
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGitpod
Contexts\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
I know that somewhere in the head element there is some text, "Gitpod Contexts". I would like to know the nearest enclosing element so I can delete everything except that element, because I am trying to prune the Beautiful Soup object down to just the elements that contain text myself, rather than calling get_text() over the entire object and automatically pulling the text out.
Example 2
A simpler demonstration would be this:
<html>
<body>
<p>
Hello!
</p>
<p>
Goodbye!
</p>
</body>
</html>
The function:
html.returnLocationOf("Hello!")
returns:
html.body.p
I don't know enough about Beautiful Soup to know how it would specify "the second p" for "Goodbye!" but I imagine it could be incorporated as a method somehow.
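There is no built-in returnLocationOf, but a minimal sketch along these lines (the helper name return_location_of is made up here; it finds the matching text node with a string filter and walks up its parents) gives the dotted tag path for the second example:
import bs4

def return_location_of(soup, text):
    # Find the text node containing the target, then collect the names of its ancestors.
    node = soup.find(string=lambda s: text in s)
    if node is None:
        return None
    names = [p.name for p in node.parents if p.name != '[document]']
    return '.'.join(reversed(names))

doc = '<html><body><p>Hello!</p><p>Goodbye!</p></body></html>'
soup = bs4.BeautifulSoup(doc, 'html.parser')
print(return_location_of(soup, 'Hello!'))  # html.body.p
To distinguish the second <p> (for "Goodbye!") you would need to record sibling positions as well, since the tag names alone do not capture that.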

How to get following-sibling or child of HTML with xpath

I have some HTML I'm trying to scrape while trying to learn Selenium. What I need are the words Fashion, Long, and so on.
I've tried:
style = driver.find_element_by_xpath("//strong[text()='Style:']/following::strong").text
style = driver.find_element_by_xpath("//strong[text()='Style:']/following-sibling::strong").text
style = driver.find_element_by_xpath("//strong[contains(.,'Style:')] /preceding-sibling::strong").text
and everything in between.
<div class="xxkkk20">
<strong>Style:</strong> Fashion <br>
<strong>Shirt Length:</strong> Long <br>
<strong>Collar:</strong> Scoop Neck <br>
<strong>Material:</strong> Polyester <br>
<strong>Pattern Type:</strong> Floral,Skulls <br>
<strong>Embellishment:</strong> Lace <br>
<strong>Thickness:</strong> Standard <br>
<strong>Fabric Stretch:</strong> High Stretch <br>
<strong>Seasons:</strong> Summer <br>
<strong>Weight:</strong> 0.1700kg <br>
<strong>Package Contents:</strong> 1 x Tank Top
</div>
You can store all of them in a list as shown below, iterate over the list to get each web element, and finally apply the text method to extract the text.
from selenium.webdriver.common.by import By

all_elements = driver.find_elements(By.XPATH, "//div[@class='xxkkk20']/strong")
for ele in all_elements:
    print(ele.text)
Update 1:
keys = driver.find_elements(By.XPATH, "//strong")
for key in keys:
    print(key.get_attribute('innerHTML'))
pairs = driver.find_element(By.XPATH, '//div[@class="xxkkk20"]').text.split("\n")
for pair in pairs:
    texts = pair.split()
    print(texts[1])
Explanation:
First, you get all the text inside the parent div element.
Then you split it by \n, which corresponds to the <br> elements there.
Now you should have pairs of texts like Style: Fashion.
And since you only want the second values, you need to split each pair and take the second substring.
I assume there is a space between the first and the second text in each pair string.
In case there are no spaces there, you can instead split by : so it will look like this:
pairs = driver.find_element(By.XPATH, '//div[@class="xxkkk20"]').text.split("\n")
for pair in pairs:
    texts = pair.split(":")
    print(texts[1])
Try this
textValues = driver.find_element_by_xpath('//div[@class="xxkkk20"]').text.split("\n")
for txt in textValues:
    print(txt.split(":")[1].strip())
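Note that Selenium's find_element returns element nodes only, not bare text nodes, so following-sibling::text() cannot be used with it directly. As an alternative sketch, lxml (parsing the snippet from the question) shows how following-sibling::text()[1] picks up each value:
from lxml import html

doc = html.fromstring('''
<div class="xxkkk20">
  <strong>Style:</strong> Fashion <br>
  <strong>Shirt Length:</strong> Long <br>
</div>
''')

for label in doc.xpath('//div[@class="xxkkk20"]/strong'):
    value = label.xpath('following-sibling::text()[1]')[0].strip()  # the text node right after the label
    print(label.text.rstrip(':'), '->', value)  # Style -> Fashion, Shirt Length -> Long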

How to read a text when it is not in any HTML tag

How can I find text in following HTML:
style="background-color: transparent;">
<-a hre f="/">Home<-/ a>
<-a id="brea dcrumbs-790" class=" main active mainactive" href="/products">Products<-/a>
<-a href="/products/fruit-and-creme-curds">Fruit & Crème Curds<-/a>
Crème Banana Curd
<-/li>
<-/ul>"
</div>
This is the HTML for a breadcrumb: the first three entries are links and the fourth is the page name. I want to read the page name (Crème Banana Curd) from the breadcrumb, but since it is not inside any node of its own, how can I get it?
If the text isn't present inside any tag, then it is present in the body tag,
so you can use something like the below to identify it:
html/body/text()
Though the question seems vague without a proper HTML source, you may still try the solution below, storing the text in a variable:
var breadcrumb = FindElement(By.XPath(".//*[@id='breadcrumbs-790']/following-sibling::a")).Text;
use the below code:
WebElement elem = driver.findElement(By.xpath("//*[contains(text(),'Crème Banana Curd')]"));
elem.getText();
hope this will help you.
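If you specifically want only the bare Crème Banana Curd text node (without the link texts that getText() on the parent would include), a minimal sketch in Python, assuming the breadcrumb links and the trailing page name share the same parent element, could ask the browser for the last child node directly:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com/products/fruit-and-creme-curds')  # hypothetical URL

# The page name is the last (text) child of the breadcrumb container,
# so read its textContent via JavaScript instead of an element locator.
parent = driver.find_element(By.XPATH, "//a[@id='breadcrumbs-790']/..")
page_name = driver.execute_script('return arguments[0].lastChild.textContent;', parent).strip()
print(page_name)  # Crème Banana Curd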

Need to keep <br> in text block tags while using import.io

Looking to do something relatively straightforward: I'm scraping text, which so far I have had no problem grabbing, but I need to keep the <br> tags because whitespace analysis is an important part of the dataset.
Is there a way to keep the <br> tags so I can turn them into \n\r later on?
Example:
<p>
<span>Some text.</br></span>
<a>Some more text.<br></a>
<span>Some more more text.<br></span>
</p>
I need : Some text.<br>Some more text.<br>Some more more text.<br>
Right now I get: Some text. Some more text. Some more more text.
Advice?
The only way is to get the HTML format of your selection; all you have to do is change the column type from Text to HTML. There is no way to get only the text plus the <br> tags.
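Once you have the HTML column, a minimal post-processing sketch in Python (the \n\r conversion and tag stripping are assumptions about what you do downstream) could turn the <br> tags into line breaks before removing the rest of the markup:
import re
from bs4 import BeautifulSoup

html = '<p><span>Some text.</br></span><a>Some more text.<br></a><span>Some more more text.<br></span></p>'

with_breaks = re.sub(r'<\s*/?\s*br\s*/?\s*>', '\n\r', html, flags=re.IGNORECASE)  # <br>, </br>, <br/> all become \n\r
text = BeautifulSoup(with_breaks, 'html.parser').get_text()  # strip the remaining tags
print(repr(text))  # 'Some text.\n\rSome more text.\n\rSome more more text.\n\r'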