Extracting <br> records from HTML parsed data using BeautifulSoup

I have online text data like this:
plain_text= "<a href="/url?q=https://www.aarnoldmovingcompany.com/contact-us/&sa=U&ved=0ahUKEwCgAMAA&usg=AOvVaw1pasRFOwk">
</b> Moving Louisville - Headquarters.<br>
commercial moving services nationwide. Visit our website today to learn more!<br><div class="osl">
<br>
5200 Interchange Way Louisville, KY 40229.<br>
... <b> A. Arnold</b>"
I am trying to extract the text separated by the <br> tags in this text, so the output would look like:
commercial moving services nationwide. Visit our website today to learn more
5200 Interchange Way Louisville, KY 40229.
This doesn't work for me:
soup=BeautifulSoup(plain_text,"lxml")
out=soup.find_all('br')
It just gives me:
[<br/>,
<br/>]

You can use next_sibling; check the code below.
from bs4 import BeautifulSoup
text = """<a href="/url?q=https://www.aarnoldmovingcompany.com/contact-us/&sa=U&ved=0ahUKEwCgAMAA&usg=AOvVaw1pasRFOwk">
</b> Moving Louisville - Headquarters.<br>
commercial moving services nationwide. Visit our website today to learn more!<br><div class="osl">
<br>
5200 Interchange Way Louisville, KY 40229.<br>
... <b> A. Arnold</b>"""
soup = BeautifulSoup(text,'lxml')
name = soup.br.next_sibling
address = name.next.next.text.strip()
print(name, '\n', address)
Output
commercial moving services nationwide. Visit our website today to learn more!
5200 Interchange Way Louisville, KY 40229.
... A. Arnold

Related

Passing nested values to class methods in scrapy

I'm new to web scraping, please pardon the possible vagueness in my terminology :|
A snippet of an HTML page that I'm trying to write a spider for:
<h3>2019 General Meetings</h3>
<p><strong>Group 20:</strong> <br />Wednesday, June 5, 9 a.m. <br /> Bank & Trust, 10000 E. Western Ave.</p>
<p>Wednesday, July 11, 9 a.m. <br />Bank & Trust, 10000 E. Western Ave.</p>
<p><strong>Group 20:</strong> <br />Monday, July 8, 9 a.m.<br />Hubbard, 1740 W. 199th St.</p>
<p> </p></div>
The logic I'm trying to follow is:
I have the <h3>, which is the "top level" (or at least I consider it to be). There are other h3's on the page, so I need to make sure only this <h3> gets passed to the following parsers.
For the above, I'm using
response_items = response.xpath("//h3[contains(#h3, 'General Meetings')]")
And I think I have it working. (But needs more testing to make certain.)
I need to pass each of the <p> elements to a respective parser within the class, and each parser should return a required piece of information about the meeting, e.g. _parser_date will return the date, _parser_address will return the address, and so on.
I'm coming up short on finding the correct scrapy/xpath syntax for this. Following https://docs.scrapy.org/en/latest/topics/selectors.html, I can't get this to work quite well.
I'm particularly interested in having each parser "pick up" on a pattern within the <p>'s it's going to parse: if it's a date pattern, format it and return it; if it's a location pattern, and so on.
I'm trying to avoid using re, unless you'd advise it's the right thing to do here.
Any insights would be most welcome,
Thank you.
This should work:
for p_node in response.xpath("//h3[contains(., 'General Meetings')]/following-sibling::p[position() < last()]"):
    address = p_node.xpath('./text()[last()]').get()
    date = p_node.xpath('./text()[last() - 1]').get()
I used position() < last() to skip the last empty <p>, and I'm also parsing the data from the end.
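If you want to keep the separate per-field parser methods from the question, a rough sketch (untested against the real page; the helper names _parse_date and _parse_address are only illustrative, and the methods would live on your spider class) could look like this:
def parse(self, response):
    p_nodes = response.xpath(
        "//h3[contains(., 'General Meetings')]"
        "/following-sibling::p[position() < last()]")
    for p_node in p_nodes:
        yield {
            'date': self._parse_date(p_node),
            'address': self._parse_address(p_node),
        }

def _parse_date(self, p_node):
    # assumes the date is the second-to-last text node inside the <p>
    return p_node.xpath('./text()[last() - 1]').get(default='').strip()

def _parse_address(self, p_node):
    # assumes the address is the last text node inside the <p>
    return p_node.xpath('./text()[last()]').get(default='').strip()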

Replace part of serialized data "resets" data to standard settings

I'm managing a couple of hundred websites and need to change part of some serialized data.
It's a WordPress child theme, and the data lives inside the theme's "Options" settings.
Using this script
UPDATE wp_e4e5_options
SET option_value = REPLACE(option_value, 'Copyright | by <a href="https://company.com"', ' ')
on this stored value:
a:1:{s:3:"copyright";s:17:"Copyright | by <a href="https://company.com"";}
I was certain it would just find that part of the serialized data and replace it. But it doesn't.
It resets the setting to the theme's standard settings. Even when I edit the value manually using Adminer.php in the table, it resets.
I'm aware that this might be in the wrong forum, since it's WordPress-related, but I believe it's SQL that's the issue here.
So my question is:
If I edit it manually using Adminer.php (a simple version of phpMyAdmin), it resets all the settings back to standard. How can I edit only part of the serialized data, and only the part shown above?
What makes it "reset" to standard settings?
UPDATE:
Thanks to @Kaperto I now have this working code, which gave me a new issue.
UPDATE wp_e4e51a4870_options
SET option_value = REPLACE(option_value, 's:173:"© Copyright - Company name [nolink] | by <a href="https://company-name.com" target="_blank" rel="nofollow">Company Name</a>";', 's:40:"© Copyright - Company name [nolink]";')
The problem is that it's going to be used as a code snippet with ManageWP, looping through several hundred websites which all have different company names. So the first part of the string is unique, but the rest, after the pipe |, is the same on all sites.
So I somehow need to do this:
1. Find the whole series where this string is included: | by <a href="https://company-name.com" target="_blank" rel="nofollow">Company Name</a>"
2. Get the whole series: s:173:"© Copyright - Company name [nolink] | by <a href="https://company-name.com" target="_blank" rel="nofollow">Company Name</a>
3. Replace it with a new series with an updated character count, since the company name is different on each site.
Is this even achievable with pure SQL commands?
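For context on why the REPLACE "resets" things: PHP's serialized format stores the byte length of every string (the 173 in s:173:"…"). Once the text changes and the length prefix no longer matches, WordPress fails to unserialize the option and silently falls back to the defaults. A safer pattern than raw SQL string replacement is to unserialize, edit, and reserialize so the lengths are recomputed. A minimal Python sketch, assuming the third-party phpserialize package and the value read from the wp_*_options table as in the question:
import re
import phpserialize

def strip_credit(option_value):
    # option_value: raw bytes stored in option_value for the theme's options row
    data = phpserialize.loads(option_value, decode_strings=True)
    # cut everything from " | by <a ..." to the end; the company name before
    # the pipe can differ from site to site, so only the tail is matched
    data['copyright'] = re.sub(r'\s*\|\s*by <a href=.*$', '', data['copyright'])
    # dumps() rewrites the s:N: byte-length prefixes correctly
    return phpserialize.dumps(data)
Doing the same in pure SQL would mean recomputing the s:N: byte counts inside the query, which is possible but fragile. Most people run this kind of rewrite through a serialization-aware tool such as WP-CLI's search-replace, or a small script like the sketch above.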

Webscraping: Crawling Pages and Storing Content in DataFrame

The following code can be used to reproduce a web scraping task for three given example URLs:
Code:
import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup
# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)
def scrape_content(url):
    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html, "lxml")
    # Get page title
    title = soup.find("meta", attrs={"property": "og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class": "section-content"}).find_all('p')
    print(title)
    for p in content:
        p = p.get_text(strip=True)
        print(p)
Apply scraping to each url:
urls['url'].apply(scrape_content)
Out:
Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report
0 None
1 None
2 None
Name: url, dtype: object
Problems:
The code currently only outputs content for the first paragraph of every page. I'd like to get data for every p in the given selector.
For the final data, I need a data frame that contains the url, title, and content. Therefore, I'd like to know how I can write the scraped information into a data frame.
Thank you for your help.
Your problem is in this line:
content = soup.find("div", {"class":"section-content"}).find_all('p')
find_all() is getting all the <p> tags, but only within the result of .find(), which just returns the first element that meets the criteria. So you're getting all the <p> tags in the first div.section-content only. It's not exactly clear what the right criteria are for your use case, but if you just want all the <p> tags you can use:
content = soup.find_all('p')
Then you can make scrape_content() merge the <p> tag text and return it along with the title:
content = '\r'.join([p.get_text(strip=True) for p in content])
return title, content
Outside the function, you can build the dataframe:
url_list = urls['url'].tolist()
results = [scrape_content(url) for url in url_list]
title_list = [r[0] for r in results]
content_list = [r[1] for r in results]
df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})
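Putting those pieces together, a revised version of the question's function might look like this (a sketch; whether you keep the div.section-content filter or grab every <p> on the page depends on what you actually need):
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_content(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    title = soup.find("meta", attrs={"property": "og:title"})["content"].strip()
    # join the text of every <p> on the page into one string
    paragraphs = soup.find_all('p')
    content = '\r'.join(p.get_text(strip=True) for p in paragraphs)
    return title, content

# urls is the DataFrame built from the links dict in the question
url_list = urls['url'].tolist()
results = [scrape_content(url) for url in url_list]
df = pd.DataFrame({'url': url_list,
                   'title': [r[0] for r in results],
                   'content': [r[1] for r in results]})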

Google rich snippet: how to display a job list?

How can I make a Google rich snippet display a format like the following?
A rich snippet displaying "Job Title", "Company", "Location", and "Posted":
glassdoor jobs - Computerworld
25+ items - 5158+ glassdoor jobs available on Computerworld.
Job Title Company Location Posted.
Senior Software Engineer ... Riverbed Technology Sunnyvale, CA Aug 09.
Senior Java Software Engineer Glassdoor.com Manhattan, NY Aug 17.
Does it use microdata, microformats, or RDFa?
Or do I need to write a specific HTML structure?
I know about the JobPosting microdata type, but I think this format suits me better.
Thanks for your help!
it's "bulleted snippets".
※ Use a consistent structure, whatever it is.
※ Keep extraneous code to a minimum.
※ Test removing your META description or setting it to “”.
http://moz.com/blog/how-do-i-get-googles-bulleted-snippets
http://insidesearch.blogspot.tw/2011/08/new-snippets-for-list-pages.html

Is it possible to use beautiful soup to extract multiple types of items?

I've been looking at the documentation and it doesn't cover this issue. I'm trying to extract all text and all links, but not separately. I want them interleaved to preserve context, so I end up with an interleaved list of text and links. Is this even possible with BeautifulSoup?
Yes, this is definitely possible.
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.example.com")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
    print a
Breaking this code snippet down, you are making a request for a website (in this case example.com) and parsing the response back with BeautifulSoup. Your requirements were to find all links and text and keep the context. The output of the above code will look like this:
<img src="/_img/iana-logo-pageheader.png" alt="Homepage" />
Domains
Numbers
Protocols
About IANA
RFC 2606
About
Presentations
Performance
Reports
Domains
Root Zone
.INT
.ARPA
IDN Repository
Protocols
Number Resources
Abuse Information
Internet Corporation for Assigned Names and Numbers
iana@iana.org
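Note that urllib2 and that import style are Python 2 / BeautifulSoup 3. If you want text and links genuinely interleaved in document order, a sketch with the current bs4 package (Python 3, any page with a <body>; the URL is just a placeholder) is to walk the parse tree and record both strings and <a> tags as you meet them:
from urllib.request import urlopen
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(urlopen("http://www.example.com"), "html.parser")
interleaved = []
for node in soup.body.descendants:
    if isinstance(node, NavigableString) and node.strip():
        # plain text, in the order it appears on the page
        interleaved.append(("text", node.strip()))
    elif getattr(node, "name", None) == "a":
        # a link, kept in place between the surrounding text
        interleaved.append(("link", node.get("href")))
print(interleaved)
Text inside a link will show up both as the link entry and as its own text node; skip strings whose parent is an <a> if that duplication matters for your use case.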