How to access all text from a website, including the a tag? - beautifulsoup

I'm trying to extract all the article text from the following site:
https://www.phonearena.com/reviews/Samsung-Galaxy-S9-Plus-Review_id4494
I tried findAll(text=True), but it extracts a lot of useless information.
So I tried findAll(text=True, recursive=False), but that ignores the text inside certain nested tags. What's the most effective way to extract the article text in this case?

The site loads its body content with JavaScript after the initial HTTP response, so by the time requests has fetched the page the article text isn't in the HTML yet. You need to simulate a real browser visit; the Python Selenium WebDriver module makes that possible.
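For illustration, a minimal Python sketch of that Selenium + BeautifulSoup approach could look like the following; note that the class name used to locate the article container is only a guess and needs checking against the real page markup:
# Render the page in a real browser, then parse the full DOM with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get("https://www.phonearena.com/reviews/Samsung-Galaxy-S9-Plus-Review_id4494")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# get_text() walks every descendant tag, so text inside <a> tags is included.
article = soup.find("div", class_="article-body")  # class name is an assumption
print(article.get_text(separator=" ", strip=True) if article else "article container not found")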

Related

script to check entire website to figure out if there are any pages which are taking more time to load

Can we have a script that crawls through the entire website and finds any pages that take a long time to load (some pages under a particular category were noticeably slow), using Selenium WebDriver or JMeter?
For JMeter you can use the HTML Link Parser configuration element for this purpose. From the documentation:
Spidering Example
Consider a simple example: let's say you wanted JMeter to "spider" through your site, hitting link after link parsed from the HTML returned from your server (this is not actually the most useful thing to do, but it serves as a good example). You would create a Simple Controller, and add the "HTML Link Parser" to it. Then, create an HTTP Request, and set the domain to ".*", and the path likewise. This will cause your test sample to match with any link found on the returned pages. If you wanted to restrict the spidering to a particular domain, then change the domain value to the one you want. Then, only links to that domain will be followed.
More information on the above approach and a couple more options: How to Spider a Site with JMeter - A Tutorial
Remember that JMeter is not a browser, hence it doesn't execute JavaScript, so your results may not be precise enough: JMeter doesn't measure the time required to actually render the page.
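If you go the Selenium WebDriver route mentioned in the question instead, a rough Python sketch of the crawl-and-time idea might look like this (the start URL is a placeholder, and driver.get() only waits for the load event, so render time is still approximate):
# Crawl internal links once each and record how long every page takes to load.
import time
from urllib.parse import urljoin, urlparse
from selenium import webdriver
from selenium.webdriver.common.by import By

START_URL = "https://example.com"  # placeholder start page
domain = urlparse(START_URL).netloc

driver = webdriver.Chrome()
to_visit, seen, timings = [START_URL], set(), {}

while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)

    start = time.time()
    driver.get(url)  # blocks until the page's load event fires
    timings[url] = time.time() - start

    # collect same-domain links to visit next
    for a in driver.find_elements(By.CSS_SELECTOR, "a[href]"):
        href = urljoin(url, a.get_attribute("href") or "")
        if urlparse(href).netloc == domain:
            to_visit.append(href)

driver.quit()
# slowest pages first
for url, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{secs:.2f}s  {url}")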

Scrape a part of website and notify on change

The website of my university unfortunately does not provide feeds, but they keep publishing information there that is important for me (deadlines, exam dates etc.) as links to PDFs in a certain section of the site.
How can I regularly scrape that section and be notified of changes (Growl, mail or something similar)?
Normally I would use wget to mirror the site, but how do I extract only parts of it?
Is there a CLI tool that can extract the XHTML via XPath or similar?
Try this:
wget --spider --server-response http://example.com
This will print the response headers, which may include a Content-Length value. If it changes, you can notify yourself.
Edit: if it changes, you can download the whole HTML file and grep for a PDF link or whatever else you are looking for (maybe for "<div id='news'>(.*?)</div>").
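As a rough Python illustration of that download-and-compare idea (the URL and the id of the news div are placeholders for the real page), a script like this could be run from cron and trigger a mail or Growl notification when the hash changes:
# Fetch one section of the page, hash it, and compare against the last run.
import hashlib
import requests
from bs4 import BeautifulSoup

URL = "https://example.edu/announcements"  # placeholder
STATE_FILE = "last_hash.txt"

html = requests.get(URL, timeout=30).text
section = BeautifulSoup(html, "html.parser").find("div", id="news")  # placeholder id
digest = hashlib.sha256(section.encode_contents()).hexdigest() if section else ""

try:
    previous = open(STATE_FILE).read()
except FileNotFoundError:
    previous = ""

if digest != previous:
    with open(STATE_FILE, "w") as f:
        f.write(digest)
    print("Section changed - send your notification here")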
Mmm... you should take a look at QueryPath. QueryPath makes it easy to parse HTML. What if the HTML structure changes? What if you want specific elements of the page? QueryPath does the hard work for you. Do you like jQuery? QueryPath is like the jQuery of PHP.
See: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP
See: http://querypath.org/
You might be interested in looking at Pjscrape (disclaimer: this is my project). It's a web-scraping tool built on PhantomJS, giving you full jQuery access to the page in a headless Webkit browser context. It makes it very easy to pull semi-structured data from webpages via the command line, particularly if the page you're scraping has a consistent structure for new elements.
For example, you can pull all the course titles from this course catalog with the following code:
pjs.addScraper(
    // the page you're scraping
    'http://www.ischool.berkeley.edu/courses/catalog',
    // selector for elements you want to pull text from
    '.views-row .views-field-title'
);
// suppress STDOUT logging
pjs.config('log', 'none');
Running this from the command line gives you JSON to STDOUT by default:
~> phantomjs /path/to/pjscrape.js my_script.js
["W10. Introduction to Information","24. Freshman Seminar", ...]
So it would be pretty simple to run this script on a regular basis, capture the output in a file, and then alert you when the new output doesn't match the previous scrape. You can also write your own scraper functions, so there's a lot of flexibility for more complex scraping if a simple selector won't do the trick.

Pass a string to various websites

I have a product code which I need to enter into 6 different websites in order to pull different information about the product from each of them. Is there a way to save this product code in some sort of variable, pass it into each website's input box, and have all the information returned from each one automatically? I really have no idea where to start with this, so if anyone can brainstorm a few ideas to get me moving, that would be great.
In order to get what you are planning:
You need a script which visits the specified website;
then, on the website, you can get the input element by its tag.
For instance, in JavaScript:
// grab the first <input> element on the page
var textBox = document.getElementsByTagName('input')[0];
This gives you a reference to the text field. Entering the text can then be done as follows:
textBox.value = "any string";
Once you have done this, you will have to retrieve the results from the page, based on the website's layout.
So if you can describe your task in more detail, you will get a better response.
Assuming you're talking about using an ordinary GUI browser, the best you can do is copy the code to your system clipboard and paste it into each page in the browser.
If you're talking about programmatic web access like wget or curl, it depends on what language you are writing your script in.
You have to create a web request for each website and find a way to parse the response, which will be HTML.
Have a look at HttpWebRequest; you can find lots of examples on the internet that show how to create an HTTP POST to a website.
http://www.terminally-incoherent.com/blog/2008/05/05/send-a-https-post-request-with-c/
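The same idea, sketched in Python with the requests library for illustration (the URLs and form field names below are placeholders; every real site will use its own):
# POST one product code to several sites and keep the raw HTML for per-site parsing.
import requests

product_code = "ABC-123"  # the value you would otherwise paste by hand
sites = {
    "site1": ("https://example-shop-1.com/search", "q"),      # (url, form field name)
    "site2": ("https://example-shop-2.com/lookup", "code"),
}

results = {}
for name, (url, field) in sites.items():
    resp = requests.post(url, data={field: product_code}, timeout=30)
    results[name] = resp.text  # parse this HTML per site afterwards

for name, html in results.items():
    print(name, len(html), "bytes of HTML received")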

How might I, given a URL, use the FireShot API to take a screenshot, upload it to Imgur, and return some output (e.g. Markdown)?

I am looking for a way to use the FireShot API with JS so that, given a URL (or perhaps a list of them), it takes a screenshot, uploads it to Imgur, and returns the URLs (or perhaps something like Markdown) for quick use in forums.
Method 1: Open new window
I tried opening the URL in a new window, but found that I can't control that page with JS due to cross-domain problems. The same goes for iframes.
Method 2: simple $.get()
A simple $.get() won't work because of the same cross-domain issues, I guess?
http://jsfiddle.net/t6aeq/
// attempt to fetch the page HTML directly (blocked by the same-origin policy)
$.get($url.val(), function(data) {
    console.log(data);
});
Via PHP "Proxy"
So I tried creating a simple PHP script that gets the HTML of the URL and returns it to my JS (using file_get_contents($url)). But some sites, like Microsoft's, detect that I am using automated methods and return an error page of sorts. I also can't seem to find a way to use jQuery to query that returned HTML for link[rel=stylesheet], script, style and body to append to the head and a div respectively. I posted about that in another question.
A new Idea: Embed scripts on browser level
So I thought a way of getting around this is to use iMacros or Greasemonkey or something to insert scripts into pages at the browser level instead? Any guidance or tips on how I can do that? Also, I'd prefer a pure JS/PHP method if available, so users are not limited to using browser plugins/scripts (though I will be the only user for now).
It suddenly came to my mind that this may not work because the FireShot API key and Imgur are limited to the domain? Any solutions?
You might be able to inject the FireShot script using Greasemonkey. But first, use GM_xmlhttpRequest() to fetch an API key for that page's domain from the "Create FireShot API Key" page.
Note that GM_xmlhttpRequest() does not have the same cross-domain issues that $.get() has.
However, at this point you might be better off just writing your own Firefox add-on. Maybe start with FireShot's code for ideas. Also see the Screengrab add-on.

scraping dynamic content

I am working on a web-scraping project. Does anybody have an idea how to scrape dynamic content?
Dynamic content driven by the query string is similar to static content, but dynamic content loaded by some event of a control within the same page is where I am stuck, because in this case the page URL stays the same.
I am using C#.
Thanks in advance.
Your question is rather general.
I'm not sure what you mean by an event of a control, but as long as the browser generates an HTTP request you can catch it using tools like Firebug for Firefox or the developer tools built into Google Chrome, and see what is actually being sent to the server. So-called AJAX requests are nothing other than standard HTTP requests; it's just that the web page is not reloaded as a whole.
Based on that information and the page source, it is possible to figure out how to create a range of requests that simulate user interaction with the dynamic elements on the page.
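For illustration, once you have captured such an AJAX request in the browser's network tools, replaying it can look roughly like this Python sketch (the URL, parameters and header are placeholders copied from the captured request; in C# the equivalent would be an HttpWebRequest/HttpClient call):
# Replay a captured AJAX request directly instead of driving the page UI.
import requests

ajax_url = "https://example.com/api/items"        # placeholder, copied from the network tab
params = {"page": 2, "category": "phones"}        # placeholder query parameters
headers = {"X-Requested-With": "XMLHttpRequest"}  # some servers check for this header

resp = requests.get(ajax_url, params=params, headers=headers, timeout=30)
content_type = resp.headers.get("Content-Type", "")
print(resp.json() if "json" in content_type else resp.text)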