BeautifulSoup: Parse HTML Table if it contains a keyword

I have this html file: https://www.sec.gov/Archives/edgar/data/706688/000119312512154452/d292519ddef14a.htm
And about a thousand more like this, all filed by different firms that use different html formats.
I am interested in one table in that whole document, the beneficial holders table. I want to parse that out using BeautifulSoup.
I am able to parse out all tables in the document, but not the one I need. If I had a list of keywords like "Beneficial","Holders","Ownership" etc, how would I extract only the tables that contain any of the words in the list?

You can do something like this, then use an if statement to match against your keywords:
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.sec.gov/Archives/edgar/data/'
                   '706688/000119312512154452/d292519ddef14a.htm')
soup = BeautifulSoup(req.content, 'html.parser')
tables = soup.find_all('table')
table = tables[3]  # the 4th table on the page
print(table.text)
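To filter by keywords directly, here is a minimal sketch that keeps only the tables whose text contains any keyword, assuming a case-insensitive substring match is good enough for your filings:
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.sec.gov/Archives/edgar/data/'
                   '706688/000119312512154452/d292519ddef14a.htm')
soup = BeautifulSoup(req.content, 'html.parser')

keywords = ['beneficial', 'holders', 'ownership']  # lowercased keyword list
matching = [t for t in soup.find_all('table')
            if any(k in t.get_text().lower() for k in keywords)]
for t in matching:
    print(t.get_text()[:200])  # preview each candidate table
Since your thousand filings use different formats, you may still get several candidate tables per document; requiring two or more keyword hits is one way to narrow the list further.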


Problem with getting a table from a website with pandas

I'm trying to get a table from a website with pd.read_html:
import requests
import pandas as pd
url = 'https://www.macrotrends.net/stocks/charts/AAPL/apple/financial-ratios?freq=Q'
html = requests.get(url).content
df_list = pd.read_html(html)
However, no table is found:
ValueError: No tables found
Is there a quick remedy here? I'm not experienced with web scraping.
Thank you
Note: As mentioned in my comment, there is no HTML table in the response; the content is generated dynamically by JavaScript. The alternatives are to extract the info from the JavaScript itself or to render the page with Selenium.
Another alternative, if you would like to stay with pandas.read_html(), is to request the specific resources for the historic metrics, which are stored in plain tables, and process them the way you like:
import requests
import pandas as pd

url = 'https://www.macrotrends.net/stocks/charts/AAPL/apple/'
metrics = ['current-ratio', 'ebit-margin']
for m in metrics:
    html = requests.get(url + m).content
    print(pd.read_html(html)[0])  # or process/store for your needs
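If you go the Selenium route instead, a minimal sketch could look like this (it assumes a local Chrome/chromedriver setup, and the fixed sleep is a crude placeholder; a WebDriverWait on a concrete element would be more robust):
import time
import pandas as pd
from selenium import webdriver

url = 'https://www.macrotrends.net/stocks/charts/AAPL/apple/financial-ratios?freq=Q'
driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get(url)
time.sleep(5)  # give the JavaScript time to render the table
df_list = pd.read_html(driver.page_source)
driver.quit()
print(df_list[0])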

Web Scraping: Crawling Pages and Storing Content in a DataFrame

The following code reproduces a web scraping task for three example URLs:
Code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/',
                 'https://www.apple.com/business/',
                 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)

def scrape_content(url):
    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html, "lxml")
    # Get page title
    title = soup.find("meta", attrs={"property": "og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class": "section-content"}).find_all('p')
    print(title)
    for p in content:
        p = p.get_text(strip=True)
        print(p)
Apply scraping to each url:
urls['url'].apply(scrape_content)
Out:
Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report
0 None
1 None
2 None
Name: url, dtype: object
Problems:
The code currently only outputs content for the first paragraph of every page. I'd like to get the data for every p in the given selector.
For the final data, I need a data frame that contains the url, title, and content. So I'd like to know how I can write the scraped information into a data frame.
Thank you for your help.
Your problem is in this line:
content = soup.find("div", {"class":"section-content"}).find_all('p')
find_all() is getting all the <p> tags, but only within the result of .find(), which returns just the first element that meets the criteria. So you're getting all the <p> tags inside the first div.section-content. It's not exactly clear what the right criteria are for your use case, but if you just want all the <p> tags you can use:
content = soup.find_all('p')
Then you can make scrape_content() merge the <p> tag text and return it along with the title:
content = '\r'.join([p.get_text(strip=True) for p in content])
return title, content
Outside the function, you can build the dataframe:
url_list = urls['url'].tolist()
results = [scrape_content(url) for url in url_list]
title_list = [r[0] for r in results]
content_list = [r[1] for r in results]
df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})
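Putting the pieces together, a sketch of the revised function and the final frame (same libraries as above; urls is the DataFrame from the question, and the '\r' join is just one way to separate paragraphs):
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_content(url):
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    title = soup.find("meta", attrs={"property": "og:title"})["content"].strip()
    # join all paragraph texts on the page into one string
    content = '\r'.join(p.get_text(strip=True) for p in soup.find_all('p'))
    return title, content

url_list = urls['url'].tolist()  # urls as defined in the question
results = [scrape_content(url) for url in url_list]
df = pd.DataFrame({'url': url_list,
                   'title': [r[0] for r in results],
                   'content': [r[1] for r in results]})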

Combining separate tags in beautiful soup for find methods

I have some html tags that I've selected in Beautiful Soup based on some criteria. I'd like to be able to run further queries (e.g. find() or find_all()) on these tags; however, I haven't been able to find a method that allows this, since they are all separate entities.
I would combine the tags from the beginning. Combine the initial queries into one by using the ability to pass a list to the find_all() method, and then search each result. Note that find_all() returns a list of tags, so the second search is a loop over it. Here is an example that will return all the links in a table cell, table header, or div:
cells = soup.find_all(["td", "th", "div"])
links = [a for cell in cells for a in cell.find_all("a")]
Link to the documentation about lists: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-list
If your initial query is complicated you can bundle it in a function: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function
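For instance, a sketch of the function approach (the "data" class is a made-up criterion here, and soup is your already-parsed document):
def initial_query(tag):
    # hypothetical criterion: cells, headers, or divs that carry a "data" class
    return tag.name in ("td", "th", "div") and "data" in (tag.get("class") or [])

links = [a for t in soup.find_all(initial_query) for a in t.find_all("a")]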
Rather than combining everything in one line, I would prefer to separate the steps. For example, I would first use find_all() to capture the candidates in a list, then use find() on each one to narrow down the search:
containers = soup.find_all("tag", {"class": "tag class name"})
for container in containers:
    what_i_want = container.find("another tag")
Hope this helps.

BeautifulSoup not getting entirety of extracted class

I am trying to extract data from craigslist using BeautifulSoup. As a preliminary test, I wrote the following:
import urllib2
from bs4 import BeautifulSoup, NavigableString
link = 'http://boston.craigslist.org/search/jjj/index100.html'
print link
soup = BeautifulSoup(urllib2.urlopen(link).read())
print soup
x=soup.body.find("div",class_="content")
print x
Upon printing soup, I can see the entire webpage. However, upon trying to find something more specific such as the class called "content", it prints None. I know that the class exists in the page source as I looked on my own browser, but for some reason, it is not finding it in the BeautifulSoup parsing. Any ideas?
Edit:
I also added in the following to see what would happen:
print soup.body.article
When I do so, it prints out some information between the article tags, but not all. Is it possible that when I am using the find function, it is somehow skipping some information? I'm really not sure why this is happening when it prints the whole thing for the general soup, but not when I try to find particulars within it.
The find method on the BeautifulSoup instance (your soup variable) does not search the same tree as the find method on a Tag (your soup.body).
This:
soup.body.find("div", class_="content")
only searches the subtree the parser placed under the body tag. With malformed HTML, the parser may have attached the div you want somewhere outside of body.
If you call find on the BeautifulSoup instance, it does what you want and searches the whole document:
soup.find("div",class_="content")

Is it possible to use beautiful soup to extract multiple types of items?

I've been looking at documentation and they don't cover this issue. I'm trying to extract all text and all links, but not separately. I want them interleaved to preserve context. I want to end up with an interleaved list of text and links. Is this even possible with BeautifulSoup?
Yes, this is definitely possible.
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.example.com")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
    print a
Breaking this code snippet down: you make a request for a website (in this case www.example.com) and parse the response with BeautifulSoup. Your requirement was to find all links and text and keep the context. The output of the above code will look like this:
<img src="/_img/iana-logo-pageheader.png" alt="Homepage" />
Domains
Numbers
Protocols
About IANA
RFC 2606
About
Presentations
Performance
Reports
Domains
Root Zone
.INT
.ARPA
IDN Repository
Protocols
Number Resources
Abuse Information
Internet Corporation for Assigned Names and Numbers
iana#iana.org
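For the interleaving itself, here is a sketch in bs4/Python 3 idioms that walks the parse tree in document order and collects plain text and links together (the tuple format is just one choice):
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup(requests.get("http://www.example.com").content, "html.parser")

interleaved = []
for node in soup.body.descendants:
    if isinstance(node, Tag) and node.name == "a":
        # record the link with its href and anchor text
        interleaved.append(("link", node.get("href"), node.get_text(strip=True)))
    elif isinstance(node, NavigableString) and node.parent.name != "a":
        # record surrounding text, skipping strings already captured as link text
        text = node.strip()
        if text:
            interleaved.append(("text", text))

for item in interleaved:
    print(item)
Because descendants yields nodes in document order, the text and links stay interleaved exactly as they appear on the page.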