Extract text from an HTML string with Scrapy

Here is the html string in question.
<div class="def ddef_d db">a <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/book" title="book">book</a> of grammar <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/rule" title="rules">rules</a>: </div>
With BeautifulSoup, this code
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text
gets me
a book of grammar rules:
which is exactly what I want.
With scrapy, how do I get the same result?
from scrapy import Selector
sel = Selector(text=htmltxt)
sel.css('.ddef_d::text').getall()
this code gets me
['a ', ' of grammar ', ': ']
How should I fix it?

You can use this code to get all the text inside the div and its children:
text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)
Your selector returns only the text nodes that are direct children of the div, but part of the text is located inside child elements (a). That is why you have to add a space before ::text: it makes the selector include the text of descendant elements in the result.

Related

BS4 - Replacing text content, preserving tags

I have an HTML document that uses the text-transform style property to change case. When I see that style, I'd like to change all text to which it applies, retaining the HTML tags.
I have a partial solution that replaces the tag entirely. The approach that seems like it ought to be correct gives me AttributeError: 'NoneType' object has no attribute 'next_element'
Example:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
# works, but replaces all text, removing the HTML tags
for node in soup.find_all(attrs={'style': upper_patt}):
    node.replace_with(node.text.upper())
# does not work, throws AttributeError error
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    for txt in node.strings:
        txt.replace_with(txt.upper())
Seems like you want to change the inner text to uppercase for all the children of an element with text-transform: uppercase.
Instead of altering the result of find_all, loop over the children text with node.findChildren(text=True) of the result, and use replace_with() to change the text:
from bs4 import BeautifulSoup, NavigableString, Tag
import re
html = '''
<div style="text-transform: uppercase;">
Foo0
<font>Foo0</font>
<div>Foo1
<div>Foo2</div>
</div>
</div>
'''
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(attrs={'style': upper_patt}):
    # text=True selects the text nodes (not the tags) below the styled element
    for child in node.findChildren(recursive=True, text=True):
        child.replace_with(child.upper())
print(soup)
Prints:
<div style="text-transform: uppercase;">
FOO0
<font>FOO0</font>
<div>FOO1
<div>FOO2</div>
</div>
</div>
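A likely reason for the AttributeError in the question's second loop: node.strings is a generator that walks the tree lazily, so replacing a string while iterating over it cuts the link to the next element. Here is a minimal sketch, assuming that is the cause, which copies the strings into a list before modifying anything. The HTML is a shortened, hypothetical variant of the example above.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the question's markup.
html = ('<div style="text-transform: uppercase;">Foo0 '
        '<font>Foo1</font><div>Foo2</div></div>')
upper_patt = re.compile(r'(?i)text-transform:\s*uppercase')

soup = BeautifulSoup(html, 'html.parser')
for node in soup.find_all(attrs={'style': upper_patt}):
    # list() takes a snapshot of the text nodes first, so replacing
    # them afterwards cannot disturb the traversal.
    for txt in list(node.strings):
        txt.replace_with(txt.upper())

result = str(soup)
print(result)
```

The tags stay in place because only the text nodes are swapped out, never the elements that contain them.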

Pandas web scraping(Beautiful soup) find in tag with class, another tag with a link. Then following the link inside href

I tried to find the 'td' tag with a specific attribute, and then find the 'a' tag inside that 'td' tag:
for row in bs4.find_all('<td class="series-column"'):
    for link in bs4.find_all('a'):
        if link.has_attr('href') and (link.has_attr('class') == 'formatted-title external-link result-url'):
            print(link.attrs['href'])
On the screenshot you see html for this page
Your bs4.find_all('<td class="series-column"') is wrong. You have to supply tag name and attributes you want to find, for example bs4.find_all('td', class_='series-column'). Or use CSS selector:
from bs4 import BeautifulSoup
txt = '''
<td class="series-column">
<a class="formatted-title external-link result-url" href="//knoema.com/...">link text</a>
</td>'''
soup = BeautifulSoup(txt, 'html.parser')
for link in soup.select('td.series-column a.formatted-title.external-link.result-url'):
    print(link['href'])
Prints:
//knoema.com/...
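The find_all('td', class_='series-column') form mentioned above can be sketched as follows. Two details differ from the question's loop: the inner search runs on each matched cell rather than on the whole document, and the class check compares class names instead of comparing the boolean returned by has_attr() with a string. The table wrapper, the second cell, and the example.com hrefs are hypothetical values added for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the hrefs and the second cell are made up.
txt = '''
<table><tr>
<td class="series-column">
<a class="formatted-title external-link result-url" href="//example.com/dataset">link text</a>
</td>
<td class="other-column"><a href="//example.com/ignored">other</a></td>
</tr></table>'''

soup = BeautifulSoup(txt, 'html.parser')
wanted = {'formatted-title', 'external-link', 'result-url'}

hrefs = []
for cell in soup.find_all('td', class_='series-column'):
    # Search inside the matched cell only; href=True skips <a> tags without an href.
    for link in cell.find_all('a', href=True):
        # link.get('class') is a list of class names for HTML documents.
        if wanted.issubset(link.get('class', [])):
            hrefs.append(link['href'])

print(hrefs)  # ['//example.com/dataset']
```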

How to use BeautifulSoup to get content inside over-line tags

I would like to extract the content("_The_important_content_") from an HTML snippet as follows:
<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
My code is just:
for i in soup.findAll('div', class_="a:2 c:gray m:da"):
    print(i.text)
But because the "class" value contains newline characters and is spread over multiple lines, BeautifulSoup cannot match it and the code returns nothing. How can I specify the correct class value and get the content?
There are many tags with the same "class" value and other "class" value but I want to extract the contents from the tags with that specific "class" value.
Try this:
html='''
<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
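# ~= matches one whitespace-separated class name, so the newlines inside the attribute do not matter
item = soup.select('div[class~="a:2"][class~="c:gray"][class~="m:da"]')[0].text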
print(item.strip())
Result:
_The_important_content_
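Another way to look at it: for HTML documents, BeautifulSoup splits the class attribute on whitespace into a list of class names, so the newlines can be sidestepped by comparing sets of names. This is a minimal sketch under that assumption. The helper name has_exact_classes and the second div are hypothetical, and the empty-string filter is only a precaution for versions that might keep blank entries.

```python
from bs4 import BeautifulSoup

html = '''<div
class="
a:2
c:gray
m:da
"
>
_The_important_content_
</div>
<div class="a:2 c:red">other</div>'''

soup = BeautifulSoup(html, 'html.parser')
wanted = {'a:2', 'c:gray', 'm:da'}

def has_exact_classes(tag):
    # Hypothetical helper: True for a div whose class names are exactly `wanted`.
    classes = {c for c in tag.get('class', []) if c}
    return tag.name == 'div' and classes == wanted

texts = [div.get_text(strip=True) for div in soup.find_all(has_exact_classes)]
print(texts)  # ['_The_important_content_']
```

Unlike a string comparison, the set comparison ignores both the order of the class names and the whitespace between them.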

Getting inner tag text whilst using a filter in BeautifulSoup

I have:
... html
<div id="price">$199.00</div>
... html
How do I get the $199.00 text? Using
soup.findAll("div",id="price",text=True)
does not work, as I get all the inner text from the whole document.
Find the div tag, and use its text attribute to get the text inside the tag.
>>> from bs4 import BeautifulSoup
>>>
>>> html = '''
... <html>
... <body>
... <div id="price">$199.00</div>
... </body>
... </html>
... '''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('div', id='price').text
'$199.00'
You are SO close to making it work.
(1) How to search for and locate the tag you are interested in:
Let's take a look at how to use the find_all function:
find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
name="div": the name of the tag to match.
attrs={"id":"price"}: a dictionary of the attributes the tag must have.
recursive: a flag that controls whether the search descends into all descendants or looks at direct children only.
text: can be used, together with regular expressions, to search by the text that the tags contain.
limit: the maximum number of results to return; limit=1 makes find_all stop at the first match, like find, although the result is still a list.
In your case, here is a list of commands that locate the tags using different arguments:
>>> # in case you have multiple interesting DIVs I am using find_all here
>>> html = '''<html><body><div id="price">$199.00</div><div id="price">$205.00</div></body></html>'''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.find_all(attrs={"id":"price"}))
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
>>> # This is a bit funky, but sometimes using text is extremely helpful
>>> # because text is what a human actually sees, so it is more reliable
>>> import re
>>> tags = [text.parent for text in soup.find_all(text=re.compile(r'\$'))]
>>> print(tags)
[<div id="price">$199.00</div>, <div id="price">$205.00</div>]
There are many different ways to locate your elements; you just need to ask yourself which one is the most reliable way to locate a given element.
For more information about find and find_all, see the Beautiful Soup documentation.
(2) How to get the text of a tag:
tag.text returns a Unicode string (str in Python 3); if you need bytes, convert it with tag.text.encode('utf-8').
tag.string will also work when the tag contains a single string.
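The difference between tag.text and tag.string shows up once a tag has more than one child. Here is a small sketch; the second div with id="total" is a hypothetical addition used only to show the contrast.

```python
from bs4 import BeautifulSoup

# The "total" div is made up to show a tag with mixed children.
html = ('<html><body><div id="price">$199.00</div>'
        '<div id="total">Total: <b>$205.00</b></div></body></html>')
soup = BeautifulSoup(html, 'html.parser')

price = soup.find('div', id='price')
total = soup.find('div', id='total')

# A tag with a single string child: .text and .string agree.
price_text = price.text      # '$199.00'
price_string = price.string  # '$199.00'

# A tag with several children: .text concatenates all descendant text,
# while .string is None because there is no single string to return.
total_text = total.text      # 'Total: $205.00'
total_string = total.string  # None

print(price_text, price_string, total_text, total_string)
```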

Beautiful Soup - how to get href

I can't seem to be able to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:
<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>
This is what I thought should work
href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]
.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag. Also, the text= attribute is for matching NavigableStrings, rather than elements.
The following does what you want, I think:
import re
from bs4 import BeautifulSoup
html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''
soup = BeautifulSoup(html, 'html.parser')
for t in soup.find_all(string=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        # The first sibling is the whitespace string, the second is the <a>
        print(s.next_sibling.next_sibling['href'])
... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that, and then find the first <a> element within it.
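The ID-based lookup suggested in the last sentence could look like this. It is a minimal sketch, assuming the div's id is always id_Website as in the snippet, and it uses a cleaned-up copy of the markup without the stray closing tags.

```python
from bs4 import BeautifulSoup

html = '''<div id='id_Website'>
<strong>Website:</strong>
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

href = None
container = soup.find('div', id='id_Website')
if container is not None:
    # find() returns the first matching descendant, or None if there is none.
    link = container.find('a', href=True)
    if link is not None:
        href = link['href']

print(href)  # http://google.com
```

The two None checks keep the lookup from raising when a page lacks the div or the link.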