Webscraping: Crawling Pages and Storing Content in DataFrame - pandas

Following code can be used to reproduce a web scraping task for three given example urls:
Code:
import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup
# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)
def scrape_content(url):
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html,"lxml")
# Get page title
title = soup.find("meta",attrs={"property":"og:title"})["content"].strip()
# Get content from paragraphs
content = soup.find("div", {"class":"section-content"}).find_all('p')
print(title)
for p in content:
p = p.get_text(strip=True)
print(p)
Apply scraping to each url:
urls['url'].apply(scrape_content)
Out:
Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report
0 None
1 None
2 None
Name: url, dtype: object
Problems:
The code currently only outputs content for the first paragraph of every page. I like to get data for every p in the given selector.
For the final data, I need a data frame that contains the url, title, and content. Therefore, I like to know how I can write the scraped information into a data frame.
Thank you for your help.

Your problem is in this line:
content = soup.find("div", {"class":"section-content"}).find_all('p')
find_all() is getting all the <p> tags, but only in the results .find() - which just returns the first example which meets the criteria. So you're getting all the <p> tags in the first div.section_content. It's not exactly clear what the right criteria are for your use case, but if you just want all the <p> tags you can use:
content = soup.find_all('p')
Then you can make scrape_urls() merge the <p> tag text and return it along with the title:
content = '\r'.join([p.get_text(strip=True) for p in content])
return title, content
Outside the function, you can build the dataframe:
url_list = urls['url'].tolist()
results = [scrape_url(url) for url in url_list]
title_list = [r[0] for r in results]
content_list = [r[1] for r in results]
df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})

Related

Struggling to import analyst share price to GoogleSheets

I am trying to create a column that imports the analyst price target from TipRanks website.
I uploaded two images:
Image 1: you can see the cell that I want to import.
Image 2: you can see my function that doesn't work.
What should I change in order to get this live info?
Thanks.
The site you are checking is actually "javascript" generated thus import functions won't properly work on them.
To check, just try to import the whole site data. If it returns a javascript function, then it is javascript generated.
Sample (tipranks.com)
What you can do is actually try to find other sites that provide the same data.
I did find one with the same data you are looking for, 50.38 for csiq. Link is "https://www.marketwatch.com/investing/stock/csiq/analystestimates". And since data is shown as table, it would be easier to import using importhtml.
Cell formula is:
=INDEX(IMPORTHTML("https://www.marketwatch.com/investing/stock/csiq/analystestimates", "table", 5), 2, 2)
Sample output:
The table is the fifth one in the DOM, and INDEX(table, 2, 2) means getting the 2nd row 2nd column of the table.
If the site is no good for you, you can try finding other sites that would suit your needs. And then use either importhtml or importxml depending on the site structure.
When you inspect the network when the website is loading you will see that the prices come when calling the forecast endpoint https://www.tipranks.com/stocks/tsla/forecast. This in turn returns an html response which is probably generated with Javascript on the client because they use React on the frontend, but you can still see the preview in the Network tab of the browser dev tools.
You can then copy the preview in VSCode and prettify it, to try and pin point the span holding the price. Of course it won't be exact science, because the html tags are generated with some media queries, but you will get close enough to some extent.
After you get the xml path but you get an empty error, you can delete some tags until you get some text. Use search in google sheets to search for highest price label, and than continue adding tags until you get the desired value.
Here is what I managed to get:
Lowest price target:
=importxml("https://www.tipranks.com/stocks/snow/forecast", "/html/body/div[1]/div[1]/div[4]/div[1]/div[2]/div[1]/div[4]/div[2]/div[2]/div[4]/div[1]/div[1]/div[5]/span[2]")
Average price target:
=importxml("https://www.tipranks.com/stocks/snow/forecast", "/html/body/div[1]/div[1]/div[4]/div[1]/div[2]/div[1]/div[4]/div[2]/div[2]/div[4]/div[1]/div[1]/div[3]/span[2]")
Highest price target:
=importxml("https://www.tipranks.com/stocks/snow/forecast", "/html/body/div[1]/div[1]/div[4]/div[1]/div[2]/div[1]/div[4]/div[2]/div[2]/div[4]/div[1]/div[1]/div[1]/span[2]")
In time these methods might change depending on their development process, but you could use the above steps to update the script.
P.S. I wasn't satisfied with the marketwatch analyst price targets. I think the wisdom of the crowd is better on tipranks.
Try this one. Works perfectly fine on my personal Stock Portfolio on Google Sheets:
Lowest Price Target:
=importxml(CONCATENATE("https://www.tipranks.com/stocks/", A1,"/forecast"), "//*[#class='colorpurple-dark ml3 mobile_fontSize7 laptop_ml0']")
Average Price Target:
=importxml(CONCATENATE("https://www.tipranks.com/stocks/", A4,"/forecast"), "//*[#class='colorgray-1 ml3 mobile_fontSize7 laptop_ml0']")
Highest Price Target:
=importxml(CONCATENATE("https://www.tipranks.com/stocks/", A4,"/forecast"), "//*[#class='colorpale ml3 mobile_fontSize7 laptop_ml0']")

Invoke a different Spider after page is parsed on another Spider

This has been addressed to some extent here and here
But I'd like to ask here before doing any of what's suggested there because I don't really like any of the approaches.
So basically, I'm trying to scrape Steam games. As you may know, Steam has a link where you can access the whole reviews for a game, an example:
https://steamcommunity.com/app/730/reviews/?browsefilter=toprated&snr=1_5_100010_
You can ignore snr and browsefilter query params there.
Anyhow, I have created a single Spider that will crawl the list of games here and works pretty well:
https://store.steampowered.com/search/?sort_by=Released_DESC
But now, for each game I want to retrieve all reviews.
Originally I created a new Spider that deals with the infinte scroll in the page that has the whole set of reviews for a game, but obviously that spider needs the URL where those reviews live.
So basically what I'm doing now is scrape all games pages and store the URL with reviews for each game in a txt file that is then passed as parameter to the second spider. But I don't like this because it forces me to do a 2-step process and besides, I need to map the results of the second spider to the results of the first one somehow (this reviews belong to this game, etc)
So my questions are:
Would it be best to send the results of scraping the game page (and thus the URL with All reviews) to the second spider, or at least the URL and then fetch all reviews for each game using the second spider? This will be O(N*M) in terms of performance, being N number of games and M number of reviews per game, maybe just because of this, having 2 spiders is worth it...thoughts?
Can I actually invoke a Spider from another Spider? From what I've read in Scrapy documentation, doesn't look like it. I can probably move everything to one spider but will look awful and it doesn't adhere to the single-responsability principle...
Why don't you use a different parse procedure?
https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns
def parse(self, response):
# follow links to author pages
for href in response.css('.author + a::attr(href)'):
yield response.follow(href, self.parse_author)
# follow pagination links
for href in response.css('li.next a::attr(href)'):
yield response.follow(href, self.parse)
def parse_author(self, response):
def extract_with_css(query):
return response.css(query).get(default='').strip()
yield {
'name': extract_with_css('h3.author-title::text'),
'birthdate': extract_with_css('.author-born-date::text'),
'bio': extract_with_css('.author-description::text'),
}
# follow pagination links
for href in response.css('li.next a::attr(href)'):
yield response.follow(href, self.parse_author)
And add the needed values with the meta tag:
https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta
example in
Is it possible to pass a variable from start_requests() to parse() for each individual request?

Having trouble with Python Web Scraper

i'm new to scrapping and would love some help or just a push along in the right direction. I've currently tried using scrapy but could not get it working at all.
What i'm trying to do is get the titles, episode and html 5 video player link's + different qualities (480p, 720p,etc..) from this page. I'm not sure how i'm meant to get the video src's from the iframe elements though.
As mentioned before any help would be very helpful.
Thanks.
I don't have previous experience with Scrapy, but I'm in the middle of a Python Web Scraping project myself. I'm using BeautifulSoup for scraping.
I've written part of the code - this gets all of the titles, episodes, thumbnails, and loads the link to the new page for further processing. If you're having more troubles, leave a message ;)
from bs4 import BeautifulSoup
from urllib import request
url = "http://getanime.to/recent"
h = {'User-Agent': 'Mozilla/5.0'}
req = request.Request(url, headers=h)
data = request.urlopen(req)
soup = BeautifulSoup(data)
# print(soup.prettify()[:1000]) # For testing purposes - should print out the first 1000 characters of the HTML document
links = soup.find_all('a', class_="episode-release")
for link in links:
# Get required info from this link
thumbnail = link.find('div', class_="thumbnail")["style"]
thumbnail = thumbnail[22:len(thumbnail)-3]
title = link.find('div', class_="title-text").contents[0].strip()
episode = link.find('div', class_="super-block").span.contents[0]
href = link["href"]
# print(thumbnail, title, episode, href) # For testing purposes
# Load the link to this episode for further processing
req2 = request.Request(href, headers=h)
data2 = request.urlopen(req2)
soup2 = BeautifulSoup(data2)
vid_sources = soup2.find('ul', class_="dropdown-menu dropdown-menu--top video-sources")
# TODO repeat the above process to find all video sources
Edit: the above code is for python3. For clarification.
(posting as another answer, since comments remove linebreaks):
Sure, happy to help ;) you're very much on the right track, so keep at it. I am wondering why you're using find_all('iframe'), since I couldn't find any examples with multiple iframe's, but it'll work just as well I guess. If you know there's only one, it saves some time to use soup.find().
Using type(iframexx) shows me that it points to a list which contains the actual data we want. Then
for iframe in iframexx:
print(type(iframexx))
ifr = iframexx[0]
print(ifr)
print(ifr["data-src"])
allowed me to get the data-src.

citeseerx search api

Is there a way to access CiteSeerX programmatically (e.g. search by author and/or title?) Surprisingly I cannot find anything relevant; surely others too are trying to get scholarly article metadata without resorting to scraping?
EDIT: note that CiteSeerX supports OAI PMH, but that seems to be an API geared towards digital libraries keeping up to date with each other ("content dissemination") and does not specifically support search. Moreover the citeseer info on that page is very sparse and even says "Currently, there are difficulties with the OAI".
There is another SO question about CiteSeerX API (though not specifically search); the 2 answers do not resolve the problem (one talks about Mendeley, another piece of software, and the other says OAI-PMH implementations are free to offer extensions to the minimal spec).
Alternatively, can anyone suggest a good way to obtain citations from authors/titles programmatically?
As suggested by one of the commenters, I tried jabref first:
jabref -n -f "citeseer:title:(lessons from) author:(Brewer)"
However jabref seems to not realize that the query string needs to contain colons and so throws an error.
For search results, I ended up scraping the CiteSeerX results with Python's BeautifulSoup:
url = "http://citeseerx.ist.psu.edu/search?q="
q = "title%3A%28{1}%29+author%3%28{0}%29&submit=Search&sort=cite&t=doc"
url += q.format (author_last, title.replace (" ", "+"))
soup = BeautifulSoup (urllib2.urlopen (url).read ())
result = soup.html.body ("div", id = "result_list") [0].div
title = result.h3.a.string.strip ()
authors = result ("span", "authors") [0].string
authors = authors [len ("by "):].strip ()
date = result ("span", "pubyear") [0].string.strip (", ")
It is possible to get a document ID from the results (the misleadingly-named "doi=..." part in the summary link URL) and then pass that to the CiteSeerX OAI engine to get Dublin Core XML (e.g. http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177); however that XML ends up containing multiple dc:date elements, which makes it less useful than the scrape output.
Too bad CiteSeerX makes people resort to scraping in spite of all the open archives / open access rhetoric.

Is it possible to use beautiful soup to extract multiple types of items?

I've been looking at documentation and they don't cover this issue. I'm trying to extract all text and all links, but not separately. I want them interleaved to preserve context. I want to end up with an interleaved list of text and links. Is this even possible with BeautifulSoup?
Yes, this is definitely possible.
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.example.com")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
print a
Breaking this code snippet down, you are making a request for a website (in this case Google.com) and parsing the response back with BeautifulSoup. Your requirements were to find all links and text and keep the context. The output of the above code will look like this:
<img src="/_img/iana-logo-pageheader.png" alt="Homepage" />
Domains
Numbers
Protocols
About IANA
RFC 2606
About
Presentations
Performance
Reports
Domains
Root Zone
.INT
.ARPA
IDN Repository
Protocols
Number Resources
Abuse Information
Internet Corporation for Assigned Names and Numbers
iana#iana.org