Getting news from WSJ using BeautifulSoup - beautifulsoup

I'm trying to scrape the main news from the WSJ website (https://www.wsj.com/news/economy specifically).
However, I don't understand why the code below returns the news from a side column titled "Most Popular News", since I'm selecting the div class "WSJTheme--headline--7VCzo7Ay", which seems to refer to the main news when I inspect the site.
I would very much appreciate any help getting the news from the main section of the page linked above. For example, two of the headlines I'm after are (right now) "Powell Says Low-Income Lending Rules Should Apply to All Firms" and "Treasury Expects to Borrow $1.3 Trillion in Second Half of Fiscal 2021". Thank you! Code below.
from bs4 import BeautifulSoup
import requests
from datetime import date, time, datetime, timedelta

url = 'https://www.wsj.com/news/economy'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.WSJTheme--headline--7VCzo7Ay '):
    headline = item.find('h2').get_text()
    link = item.find('a')['href']
    noticia = headline + ' - ' + link
    print(noticia)

There are two problems with your current code:
You need to specify the HTTP User-Agent header; otherwise, the website thinks that you're a bot and will block you.
You are searching for article headlines by searching for an <h2> tag, however, only the first article is under an <h2>, the others are under an <h3> tag. To select both <h2> and <h3> you can use a CSS selector: .select_one("h2, h3").
from bs4 import BeautifulSoup
import requests

url = "https://www.wsj.com/news/economy"

# Specify the `user-agent` in order not to be blocked
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")

for item in soup.select(".WSJTheme--headline--7VCzo7Ay"):
    # Articles might be under an `h2` or `h3`; use a CSS selector to select both
    headline = item.select_one("h2, h3").get_text()
    link = item.find("a")["href"]
    noticia = headline + " - " + link
    print(noticia)
Output (truncated):
Consumer Demand Drives U.S. Imports to Record High - https://www.wsj.com/articles/u-s-trade-deficit-widened-to-74-4-billion-in-march-11620132526
Who Would Pay Biden’s Corporate Tax Hike Is Key to Policy Debate - https://www.wsj.com/articles/who-would-pay-bidens-corporate-tax-increase-is-key-question-in-policy-debate-11620130284
Powell Says Low-Income Lending Rules Should Apply to All Firms - https://www.wsj.com/articles/powell-highlights-slower-recovery-for-low-wage-and-minority-workers-11620065926
Treasury Expects to Borrow $1.3 Trillion in Second Half of Fiscal 2021 - https://www.wsj.com/articles/treasury-expects-to-borrow-1-3-trillion-over-second-half-of-fiscal-2021-11620068646
Yellen to Appoint Senior Fed Official to Run Top Bank Regulator - https://www.wsj.com/articles/yellen-to-appoint-senior-fed-official-to-run-occ-11620057637
...
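If some headline blocks ever come back without an h2/h3 or without a link (an assumption about possible markup changes, not something observed in the output above), a slightly more defensive version of the loop might look like this:

for item in soup.select(".WSJTheme--headline--7VCzo7Ay"):
    heading = item.select_one("h2, h3")
    anchor = item.find("a")
    # Skip blocks that are missing either the heading or the link
    if heading is None or anchor is None or not anchor.has_attr("href"):
        continue
    print(heading.get_text(strip=True) + " - " + anchor["href"])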

Related

Webscraping: Crawling Pages and Storing Content in DataFrame

The following code can be used to reproduce a web scraping task for three example URLs:
Code:
import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup

# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)

def scrape_content(url):
    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html, "lxml")
    # Get page title
    title = soup.find("meta", attrs={"property": "og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class": "section-content"}).find_all('p')
    print(title)
    for p in content:
        p = p.get_text(strip=True)
        print(p)
Apply scraping to each url:
urls['url'].apply(scrape_content)
Out:
Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report
0 None
1 None
2 None
Name: url, dtype: object
Problems:
The code currently only outputs content from the first section of every page. I'd like to get data for every p in the given selector.
For the final data, I need a data frame that contains the url, title, and content. Therefore, I'd like to know how I can write the scraped information into a data frame.
Thank you for your help.
Your problem is in this line:
content = soup.find("div", {"class":"section-content"}).find_all('p')
find_all() is getting all the <p> tags, but only within the result of .find(), which just returns the first element that meets the criteria. So you're getting all the <p> tags in the first div.section-content. It's not exactly clear what the right criteria are for your use case, but if you just want all the <p> tags you can use:
content = soup.find_all('p')
Then you can make scrape_content() merge the <p> tag text and return it along with the title:
content = '\r'.join([p.get_text(strip=True) for p in content])
return title, content
Outside the function, you can build the dataframe:
url_list = urls['url'].tolist()
results = [scrape_content(url) for url in url_list]
title_list = [r[0] for r in results]
content_list = [r[1] for r in results]
df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})
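Put together, the revised scrape_content might look like this (a sketch, keeping the broad soup.find_all('p') selection from above; narrow the selector if you only want specific sections):

def scrape_content(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # Page title from the Open Graph meta tag
    title = soup.find("meta", attrs={"property": "og:title"})["content"].strip()
    # Join the text of every <p> tag on the page into one string
    content = '\r'.join(p.get_text(strip=True) for p in soup.find_all('p'))
    return title, content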

Having trouble with Python Web Scraper

I'm new to scraping and would love some help, or just a push in the right direction. I've tried using Scrapy but could not get it working at all.
What I'm trying to do is get the titles, episodes, and HTML5 video player links plus the different qualities (480p, 720p, etc.) from this page. I'm not sure how I'm meant to get the video srcs from the iframe elements, though.
As mentioned, any help would be appreciated.
Thanks.
I don't have previous experience with Scrapy, but I'm in the middle of a Python Web Scraping project myself. I'm using BeautifulSoup for scraping.
I've written part of the code - this gets all of the titles, episodes, thumbnails, and loads the link to the new page for further processing. If you're having more troubles, leave a message ;)
from bs4 import BeautifulSoup
from urllib import request

url = "http://getanime.to/recent"
h = {'User-Agent': 'Mozilla/5.0'}
req = request.Request(url, headers=h)
data = request.urlopen(req)
soup = BeautifulSoup(data)
# print(soup.prettify()[:1000])  # For testing purposes - should print out the first 1000 characters of the HTML document

links = soup.find_all('a', class_="episode-release")

for link in links:
    # Get required info from this link
    thumbnail = link.find('div', class_="thumbnail")["style"]
    thumbnail = thumbnail[22:len(thumbnail)-3]
    title = link.find('div', class_="title-text").contents[0].strip()
    episode = link.find('div', class_="super-block").span.contents[0]
    href = link["href"]
    # print(thumbnail, title, episode, href)  # For testing purposes

    # Load the link to this episode for further processing
    req2 = request.Request(href, headers=h)
    data2 = request.urlopen(req2)
    soup2 = BeautifulSoup(data2)
    vid_sources = soup2.find('ul', class_="dropdown-menu dropdown-menu--top video-sources")
    # TODO repeat the above process to find all video sources
Edit: for clarification, the above code is for Python 3.
(posting as another answer, since comments remove linebreaks):
Sure, happy to help ;) You're very much on the right track, so keep at it. I am wondering why you're using find_all('iframe'), since I couldn't find any examples with multiple iframes, but it'll work just as well I guess. If you know there's only one, it saves some time to use soup.find().
Using type(iframexx) shows me that it is a list containing the actual data we want. Then
for iframe in iframexx:
    print(type(iframe))
    print(iframe)
    print(iframe["data-src"])
allowed me to get the data-src.
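Putting the two answers together, a hedged sketch of pulling the player URLs for one episode page, reusing the request/BeautifulSoup imports from the first answer (the data-src attribute and page structure are assumed from the snippets above, not re-verified):

def get_player_urls(episode_href, headers):
    # Load the episode page, as in the first answer
    req = request.Request(episode_href, headers=headers)
    soup = BeautifulSoup(request.urlopen(req), "html.parser")
    urls = []
    for iframe in soup.find_all('iframe'):
        # `data-src` is assumed to hold the player URL (see the second answer);
        # fall back to `src` if it is missing
        player_url = iframe.get("data-src") or iframe.get("src")
        if player_url:
            urls.append(player_url)
    return urls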

Scrapy Crawling but not Scraping

I am scraping a set of ~10,000 links in the same domain, all with identical structure, using the scrapy runspider command. A significant share of pages (~40% to 50%), seemingly at random, are crawled but not scraped, because in my parse method I evaluate a particular element on the page and scrape the other elements based on it. For some reason (more on this below), that element evaluates incorrectly for some of the URLs. To fix this, I want to call my parse method for these URLs repeatedly, up to a maximum of say 5 times, until it evaluates correctly (hoping that within 5 runs the page will respond correctly to the condition; otherwise I assume the element genuinely evaluates as wrong). How do I code this (partial code below)?
Possible reason for the above behaviour: my links are of the form www.example.com/search_term/, which are actually pages generated dynamically after entering "search_term" on www.example.com. So my guess is that in several cases Scrapy gets the response before the page www.example.com/search_term/ is fully generated. Maybe the ideal solution is to use a webdriver, but that would be too complex for me at this stage. As long as I get 95% of pages scraped, I am happy.
Relevant Code below (sanitised for readability without leaving out any details):
class mySpider(scrapy.Spider):
    name = "spidername"

    def start_requests(self):
        urls = [url1, ... url10000]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers={
                "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})

    def parse(self, response):
        if (value of particular_item in page == 10):
            yield {'someitem':
                response.xpath('/html/body/div').extract()}
        else:
            <<Once again call this parse function with the same url up to a maximum of 5 times - Need help in writing the code here>>
Your XPath requires that the body of the HTML you are parsing has a div as first element:
<html>
<body>
<div>...
Are you sure every site looks that way? Without any information on what you are trying to scrape, I cannot give you more advice.
Alternatively you can try another solution where you extract all the divs from the website:
for div in response.xpath('//div').extract():
    yield {'div': div}
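As for re-issuing the request up to five times when the check fails (the part marked as needing help in the question), here is a minimal sketch of one common approach: carry a retry counter in request.meta and pass dont_filter=True so Scrapy's duplicate filter does not drop the repeated URL. The check itself is the question's placeholder, represented here by a hypothetical particular_item_is_valid helper.

MAX_RETRIES = 5  # the limit mentioned in the question

def parse(self, response):
    # Placeholder for the question's check on the particular element
    if particular_item_is_valid(response):
        yield {'someitem': response.xpath('/html/body/div').extract()}
    else:
        retries = response.meta.get('retries', 0)
        if retries < MAX_RETRIES:
            # Re-request the same URL; dont_filter=True stops Scrapy's
            # duplicate filter from discarding the repeated request
            yield scrapy.Request(
                url=response.url,
                callback=self.parse,
                headers=response.request.headers,
                meta={'retries': retries + 1},
                dont_filter=True,
            )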

citeseerx search api

Is there a way to access CiteSeerX programmatically (e.g., search by author and/or title)? Surprisingly, I cannot find anything relevant; surely others are also trying to get scholarly article metadata without resorting to scraping?
EDIT: note that CiteSeerX supports OAI PMH, but that seems to be an API geared towards digital libraries keeping up to date with each other ("content dissemination") and does not specifically support search. Moreover the citeseer info on that page is very sparse and even says "Currently, there are difficulties with the OAI".
There is another SO question about CiteSeerX API (though not specifically search); the 2 answers do not resolve the problem (one talks about Mendeley, another piece of software, and the other says OAI-PMH implementations are free to offer extensions to the minimal spec).
Alternatively, can anyone suggest a good way to obtain citations from authors/titles programmatically?
As suggested by one of the commenters, I tried jabref first:
jabref -n -f "citeseer:title:(lessons from) author:(Brewer)"
However, jabref does not seem to handle the colons within the query string and throws an error.
For search results, I ended up scraping the CiteSeerX results with Python's BeautifulSoup:
url = "http://citeseerx.ist.psu.edu/search?q="
q = "title%3A%28{1}%29+author%3%28{0}%29&submit=Search&sort=cite&t=doc"
url += q.format (author_last, title.replace (" ", "+"))
soup = BeautifulSoup (urllib2.urlopen (url).read ())
result = soup.html.body ("div", id = "result_list") [0].div
title = result.h3.a.string.strip ()
authors = result ("span", "authors") [0].string
authors = authors [len ("by "):].strip ()
date = result ("span", "pubyear") [0].string.strip (", ")
It is possible to get a document ID from the results (the misleadingly-named "doi=..." part in the summary link URL) and then pass that to the CiteSeerX OAI engine to get Dublin Core XML (e.g. http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177); however that XML ends up containing multiple dc:date elements, which makes it less useful than the scrape output.
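For completeness, a small sketch of pulling that Dublin Core record once you have the document ID, using the stdlib ElementTree parser rather than BeautifulSoup (the identifier below is just the example from the GetRecord link above):

import urllib2
import xml.etree.ElementTree as ET

# The example identifier from the GetRecord link above
oai_url = ("http://citeseerx.ist.psu.edu/oai2?verb=GetRecord"
           "&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177")

root = ET.fromstring(urllib2.urlopen(oai_url).read())
DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace

titles = [e.text for e in root.iter(DC + "title")]
# Several dc:date elements may come back, as noted above
dates = [e.text for e in root.iter(DC + "date")]
print(titles)
print(dates)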
Too bad CiteSeerX makes people resort to scraping in spite of all the open archives / open access rhetoric.

Is it possible to use beautiful soup to extract multiple types of items?

I've been looking at the documentation and it doesn't cover this issue. I'm trying to extract all text and all links, but not separately. I want them interleaved to preserve context, so I end up with an interleaved list of text and links. Is this even possible with BeautifulSoup?
Yes, this is definitely possible.
import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.iana.org")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

for a in soup.findAll('a'):
    print a
Breaking this code snippet down, you are making a request for a website (in this case iana.org) and parsing the response back with BeautifulSoup. Your requirements were to find all links and text and keep the context. The output of the above code will look like this:
<img src="/_img/iana-logo-pageheader.png" alt="Homepage" />
Domains
Numbers
Protocols
About IANA
RFC 2606
About
Presentations
Performance
Reports
Domains
Root Zone
.INT
.ARPA
IDN Repository
Protocols
Number Resources
Abuse Information
Internet Corporation for Assigned Names and Numbers
iana#iana.org
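If you want the text and the links genuinely interleaved rather than just the <a> tags, here is a small sketch of one way to walk the document in order (written against the newer bs4 and requests packages rather than the BeautifulSoup 3 import above; the URL is the same example site):

from bs4 import BeautifulSoup, NavigableString
import requests

response = requests.get("http://www.iana.org")
soup = BeautifulSoup(response.content, "html.parser")

interleaved = []
for node in soup.body.descendants:
    if isinstance(node, NavigableString):
        text = node.strip()
        # Keep non-empty text that is not already inside a link
        if text and node.find_parent('a') is None:
            interleaved.append(text)
    elif node.name == 'a' and node.get('href'):
        # Record the link itself (anchor text plus href) in document order
        interleaved.append((node.get_text(strip=True), node['href']))

for item in interleaved:
    print(item)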