Is it possible to use Beautiful Soup to extract multiple types of items?

I've been looking at the documentation and it doesn't cover this. I'm trying to extract all text and all links, but not separately: I want them interleaved so the context is preserved, ending up with a single ordered list of text and links. Is this even possible with BeautifulSoup?

Yes, this is definitely possible.
import urllib2
import BeautifulSoup

# Fetch the page and parse it (this is the old BeautifulSoup 3 API;
# with bs4 the import would be `from bs4 import BeautifulSoup`)
request = urllib2.Request("http://www.example.com")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

# Print every <a> tag in document order
for a in soup.findAll('a'):
    print a
Breaking this code snippet down: it requests a page (in this case http://www.example.com) and parses the response with BeautifulSoup. Your requirement was to find all links and text and keep the context. The output of the above code will look roughly like this:
<img src="/_img/iana-logo-pageheader.png" alt="Homepage" />
Domains
Numbers
Protocols
About IANA
RFC 2606
About
Presentations
Performance
Reports
Domains
Root Zone
.INT
.ARPA
IDN Repository
Protocols
Number Resources
Abuse Information
Internet Corporation for Assigned Names and Numbers
iana#iana.org
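If you want the text and the links genuinely interleaved in document order (rather than just the <a> tags), one way is to walk the parse tree and collect both kinds of node as you go. A minimal sketch, using the newer bs4 imports and Python 3's urllib rather than the urllib2/BeautifulSoup 3 shown above:
from urllib.request import urlopen
from bs4 import BeautifulSoup, NavigableString, Tag

html = urlopen("http://www.example.com").read()
soup = BeautifulSoup(html, "html.parser")

items = []  # interleaved (kind, value) pairs in page order
for node in soup.body.descendants:
    if isinstance(node, Tag) and node.name == "a":
        items.append(("link", node.get("href")))
    elif isinstance(node, NavigableString):
        text = node.strip()
        # keep non-empty text that is not already inside an <a> tag
        if text and node.find_parent("a") is None:
            items.append(("text", text))

for kind, value in items:
    print(kind, value)
Each entry keeps its position relative to the surrounding text, which is the context the question asks to preserve.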


Webscraping: Crawling Pages and Storing Content in DataFrame

The following code can be used to reproduce the web scraping task for three example URLs:
Code:
import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup
# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)
def scrape_content(url):
    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html, "lxml")
    # Get page title
    title = soup.find("meta", attrs={"property": "og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class": "section-content"}).find_all('p')
    print(title)
    for p in content:
        p = p.get_text(strip=True)
        print(p)
Apply scraping to each url:
urls['url'].apply(scrape_content)
Out:
Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report
0 None
1 None
2 None
Name: url, dtype: object
Problems:
The code currently only outputs content for the first paragraph of every page. I'd like to get data for every p in the given selector.
For the final data, I need a data frame that contains the url, title, and content. So I'd also like to know how I can write the scraped information into a data frame.
Thank you for your help.
Your problem is in this line:
content = soup.find("div", {"class":"section-content"}).find_all('p')
find_all() does get all the <p> tags, but only inside the result of .find(), which returns just the first element that matches. So you're getting all the <p> tags in the first div.section-content. It's not entirely clear what the right criteria are for your use case, but if you just want all the <p> tags on the page you can use:
content = soup.find_all('p')
Then you can make scrape_content() merge the <p> tag text and return it along with the title:
content = '\r'.join([p.get_text(strip=True) for p in content])
return title, content
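Putting those pieces together, a revised scrape_content() might look like this (a sketch that keeps the question's requests/lxml setup and returns the values instead of printing them):
def scrape_content(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # Page title from the og:title meta tag
    title = soup.find("meta", attrs={"property": "og:title"})["content"].strip()
    # Join the text of every <p> tag on the page into one string
    content = '\r'.join(p.get_text(strip=True) for p in soup.find_all('p'))
    return title, content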
Outside the function, you can build the dataframe:
url_list = urls['url'].tolist()
results = [scrape_content(url) for url in url_list]
title_list = [r[0] for r in results]
content_list = [r[1] for r in results]
df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})

BeautifulSoup not getting entirety of extracted class

I am trying to extract data from craigslist using BeautifulSoup. As a preliminary test, I wrote the following:
import urllib2
from bs4 import BeautifulSoup, NavigableString
link = 'http://boston.craigslist.org/search/jjj/index100.html'
print link
soup = BeautifulSoup(urllib2.urlopen(link).read())
print soup
x=soup.body.find("div",class_="content")
print x
Upon printing soup, I can see the entire webpage. However, upon trying to find something more specific such as the class called "content", it prints None. I know that the class exists in the page source as I looked on my own browser, but for some reason, it is not finding it in the BeautifulSoup parsing. Any ideas?
Edit:
I also added in the following to see what would happen:
print soup.body.article
When I do so, it prints out some information between the article tags, but not all. Is it possible that when I am using the find function, it is somehow skipping some information? I'm really not sure why this is happening when it prints the whole thing for the general soup, but not when I try to find particulars within it.
The problem is not which object you call find on: find on a Tag (your soup.body) searches that tag's descendants recursively, just like find on the BeautifulSoup instance (your soup variable) searches the whole document, so the two should normally agree.
What is more likely happening is that Craigslist's markup is messy enough that the parser closes the <body> element early, so the div you want ends up outside soup.body even though it is still somewhere in the parsed tree. That would also explain why print soup.body.article only shows part of the page.
Calling find on the soup itself searches everything that was parsed:
soup.find("div", class_="content")
If that still returns None, tell BeautifulSoup to use a more forgiving parser, for example html5lib.
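A minimal sketch of that change with the parser named explicitly (html5lib is an assumption here; it builds the tree the way a browser would, but needs to be installed separately):
import urllib2
from bs4 import BeautifulSoup

link = 'http://boston.craigslist.org/search/jjj/index100.html'
html = urllib2.urlopen(link).read()

# html5lib is much more forgiving of malformed markup than the default parser
soup = BeautifulSoup(html, 'html5lib')

print(soup.find("div", class_="content"))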

How do I access the "See Also" Field in the Wiktionary API?

Many of the Wiktionary pages for Chinese Characters (Hanzi) include links at the top of the page to other similar-looking characters. I'd like to use the Wiktionary API to send a single character in the query and receive a list of similar characters as the response. Unfortunately, I can't seem to find any query that includes the "See Also" field. Is this kind of query possible?
The “see also” field is just a line of wiki code in the page source, and there is no way for the API to know that it's different from any other piece of text on the page.
If you are happy with using only the English version of Wiktionary, you can fetch the wikicode (index.php?title=太&action=raw) and then parse the result for the {{also}} template. In this case, the line you are looking for is {{also|大|犬}}.
To check if the template is used on the page at all, query the API for titles=太&prop=templates&tltemplates=Template:also
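A minimal sketch of that approach against the English Wiktionary (requests and the en.wiktionary.org host are assumptions; the regular expression is only a rough way to pull the template apart):
import re
import requests

def see_also(title):
    # Fetch the raw wikicode for the page, as described above
    url = "https://en.wiktionary.org/w/index.php"
    wikicode = requests.get(url, params={"title": title, "action": "raw"}).text
    # Look for the {{also|...}} template, e.g. {{also|大|犬}}
    match = re.search(r"\{\{also\|([^}]*)\}\}", wikicode)
    return match.group(1).split("|") if match else []

print(see_also("太"))  # expected to print something like ['大', '犬']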
Similar templates are available in other language editions of Wiktionary, in case you want to use sources other than the English one. The current list is:
br:Patrom:gwelet
ca:Plantilla:vegeu
cs:Šablona:Viz
de:Vorlage:Siehe auch
el:Πρότυπο:δείτε
es:Plantilla:desambiguación
eu:Txantiloi:Esanahi desberdina
fi:Malline:katso
fr:Modèle:voir
gl:Modelo:homo
id:Templat:lihat
is:Snið:sjá einnig
it:Template:Vedi
ja:テンプレート:see
no:Mal:se også
oc:Modèl:veire
pl:Szablon:podobne
pt:Predefinição:ver também
ru:Шаблон:Cf
sk:Šablóna:See
sv:Mall:se även
It has been suggested that the Wikidata project be expanded to cover Wiktionary. If and when that happens, you might be able to query the Wikidata API for that kind of thing!

SEO/Web Crawling Tool to Count Number of Headings (H1, H2, H3...)

Does anyone know of a tool or script that will crawl my website and count the number of headings on every page within my website? I would like to know how many pages in my website have more than 4 headings (h1). I have Screaming Frog, but it only counts the first two H1 elements. Any help is appreciated.
My Xidel can do that, e.g.:
xidel http://stackoverflow.com/questions/14608312/seo-web-crawling-tool-to-count-number-of-headings-h1-h2-h3 -e 'concat($url, ": ", count(//h1))' -f '//a[matches(@href, "http://[^/]*stackoverflow.com/")]'
The XPath expression in the -e argument tells it to count the h1 tags, and the -f option tells it which pages to follow.
This is such a specific task that I would just recommend you write it yourself. The simplest thing you need is an XPath selector to give you the h1/h2/h3 tags.
Counting the headings:
Pick any one of your favorite programming languages.
Issue a web request for a page on your website (Ruby, Perl, PHP).
Parse the HTML.
Invoke the XPath heading selector and count the number of elements that it returns.
Crawling your site:
Do steps 2 through 4 for all of your pages (you'll probably need a queue of pages that you want to crawl). If you want to crawl all of the pages, it's just a little more complicated:
Crawl your home page.
Select all anchor tags.
Extract the URL from each href and discard any URLs that don't point to your website.
Perform a URL-seen test: if you have seen it before, then discard, otherwise queue for crawling.
URL-Seen test:
The URL-seen test is pretty simple: just add all the URLs you've seen so far to a hash map. If you run into a URL that is in your hash map, then you can ignore it. If it's not in the hash map, then add it to the crawl queue. The key for the hash map should be the URL and the value should be some kind of a structure that allows you to keep statistics for the headings:
Key = URL
Value = struct{ h1Count, h2Count, h3Count...}
That should be about it. I know it seems like a lot, but it shouldn't be more than a few hundred lines of code!
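A rough sketch of those steps in Python (requests and BeautifulSoup are assumptions here; the answer deliberately leaves the language up to you):
import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_heading_counts(start_url):
    site = urlparse(start_url).netloc
    seen = {start_url}          # the URL-seen test: a set keyed by URL
    queue = deque([start_url])  # pages waiting to be crawled
    stats = {}                  # url -> {'h1': n, 'h2': n, 'h3': n}

    while queue:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")

        # Count the headings on this page
        stats[url] = {tag: len(soup.find_all(tag)) for tag in ("h1", "h2", "h3")}

        # Queue same-site links we have not seen before
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return stats

counts = crawl_heading_counts("http://www.example.com/")
print([url for url, c in counts.items() if c["h1"] > 4])
The last line answers the original question: it lists the pages with more than 4 h1 headings.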
I found a tool in Code Canyon: Scrap(e) Website Analyser: http://codecanyon.net/item/scrap-website-analyzer/3789481.
As you will see from some of my comments, there was a small amount of configuration, but it is working well so far.
Thanks BeniBela, I will also look at your solution and report back.
You might use the xPather Chrome extension or similar, and the XPath query:
count(//*[self::h1 or self::h2 or self::h3])
Thanks to:
SEO/Web Crawling Tool to Count Number of Headings (H1, H2, H3...)
https://devhints.io/xpath
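If you would rather run that query outside the browser, here is a small sketch with lxml (lxml and requests are assumptions; any XPath 1.0 engine will evaluate the same expression):
import requests
from lxml import html

tree = html.fromstring(requests.get("http://www.example.com/").content)
# count() returns a float in XPath 1.0
print(tree.xpath("count(//*[self::h1 or self::h2 or self::h3])"))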

CiteSeerX search API

Is there a way to access CiteSeerX programmatically (e.g. search by author and/or title)? Surprisingly I cannot find anything relevant; surely others too are trying to get scholarly article metadata without resorting to scraping?
EDIT: note that CiteSeerX supports OAI PMH, but that seems to be an API geared towards digital libraries keeping up to date with each other ("content dissemination") and does not specifically support search. Moreover the citeseer info on that page is very sparse and even says "Currently, there are difficulties with the OAI".
There is another SO question about CiteSeerX API (though not specifically search); the 2 answers do not resolve the problem (one talks about Mendeley, another piece of software, and the other says OAI-PMH implementations are free to offer extensions to the minimal spec).
Alternatively, can anyone suggest a good way to obtain citations from authors/titles programmatically?
As suggested by one of the commenters, I tried jabref first:
jabref -n -f "citeseer:title:(lessons from) author:(Brewer)"
However, jabref does not seem to handle the colons that the query string needs to contain, and it throws an error.
For search results, I ended up scraping the CiteSeerX results with Python's BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup  # the old `from BeautifulSoup import BeautifulSoup` works the same way

# Build the CiteSeerX search URL: title:(<title words>) author:(<last name>),
# sorted by citation count, documents only; author_last and title are the search inputs
url = "http://citeseerx.ist.psu.edu/search?q="
q = "title%3A%28{1}%29+author%3A%28{0}%29&submit=Search&sort=cite&t=doc"
url += q.format(author_last, title.replace(" ", "+"))

soup = BeautifulSoup(urllib2.urlopen(url).read())
result = soup.html.body("div", id="result_list")[0].div

# First hit: title, authors ("by A, B, ..."), and year
title = result.h3.a.string.strip()
authors = result("span", "authors")[0].string
authors = authors[len("by "):].strip()
date = result("span", "pubyear")[0].string.strip(", ")
It is possible to get a document ID from the results (the misleadingly-named "doi=..." part in the summary link URL) and then pass that to the CiteSeerX OAI engine to get Dublin Core XML (e.g. http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177); however that XML ends up containing multiple dc:date elements, which makes it less useful than the scrape output.
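For that second step, here is a small sketch of pulling the Dublin Core fields out of the OAI response (the element layout is inferred from the standard oai_dc format, so treat it as an assumption):
import urllib2
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # standard Dublin Core namespace

def citeseerx_dc_record(doc_id):
    # doc_id is the misleadingly named "doi=..." value from a result link,
    # e.g. "10.1.1.42.2177"
    url = ("http://citeseerx.ist.psu.edu/oai2?verb=GetRecord"
           "&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:" + doc_id)
    root = ET.fromstring(urllib2.urlopen(url).read())
    record = {}
    for field in ("title", "creator", "date"):
        # dc:date shows up more than once, hence the lists (as noted above)
        record[field] = [el.text for el in root.iter(DC + field)]
    return record

print(citeseerx_dc_record("10.1.1.42.2177"))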
Too bad CiteSeerX makes people resort to scraping in spite of all the open archives / open access rhetoric.