Mobile Site not giving correct Data - Beautiful Soup - beautifulsoup

I'm trying to get product details from the following website.
Baby Shampoo
Specifically the TCIN:# and product details.
But this information is not showing up in the page when I parse it.
A simple line like:
spans = soup.find_all("span", {"class" : "list-value"})
is turning up no results, and when I go even more basic to:
print(soup.prettify())
I see the page print out but none of the details are in the page. I am not seeing any iframes on the page, and can't figure out why the data is not showing.
I even attempted to adjust my headers in the request:
headers = { 'User-Agent': 'Mozilla/5.0 (Linux; <Android Version>; <Build Tag etc.>) AppleWebKit/<WebKit Rev> (KHTML, like Gecko) Chrome/<Chrome Rev> Mobile Safari/<WebKit Rev>'}
and also:
headers = { 'User-Agent': 'Mozilla/5.0'}
but neither of these changes the results. Any ideas what could be happening, and where this data could be located?
Thanks,
Mike

The details are not in the HTML you downloaded because the page loads them afterwards with a separate request. If you watch the network requests in Chrome Developer Tools or Firefox Firebug, you can see every HTTP GET and POST request the page makes, and then find the one that contains the information you need. Make sure the Network panel is enabled and Preserve Log is checked before loading the page in the browser. In your case, the information is fetched by this GET request: http://tws.target.com/productservice/services/item_service/v1/by_itemid?id=13197674&callback=browseCallback
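
The callback=browseCallback parameter suggests the response is JSONP (JSON wrapped in a function call), so it probably needs unwrapping before it can be parsed. Here is a minimal sketch of that, assuming the body looks like browseCallback({...}); the "tcin" key in the usage note is only a placeholder, and the endpoint may no longer respond:

```python
import json
import re
from urllib.request import Request, urlopen

# URL copied from the answer above; the service may have changed or gone away.
ITEM_URL = (
    "http://tws.target.com/productservice/services/item_service/v1/by_itemid"
    "?id=13197674&callback=browseCallback"
)


def strip_jsonp(text):
    """Parse a JSONP body such as 'browseCallback({...});' into Python data.

    Falls back to parsing the text as plain JSON when no wrapper is found.
    """
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


def fetch_item(url=ITEM_URL):
    """Download the item data and return it as parsed JSON (makes a network call)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return strip_jsonp(response.read().decode("utf-8"))
```

For example, strip_jsonp('browseCallback({"tcin": "13197674"});') returns a dict you can index directly, with no Beautiful Soup needed for this part.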

How to add "From" request header in Firefox-Selenium?

I’m looking for a way to add the “From” request header to disclose an email address of the requesting user.
I am using Selenium with Firefox, alternatively I can switch to PhantomJS or Chrome.
It could be some preference to set in selenium.webdriver.FirefoxProfile. I checked Firefox’s about:config documentation, but can’t find any indication of how to implement this header. Any help is appreciated, especially since this issue is difficult to google given the name of the header.
I still haven’t found an interface for doing this in Selenium itself. There is, however, a workaround that intercepts the request using selenium-wire:
from seleniumwire import webdriver

def from_request_header_interceptor(request):
    # Remove any existing value first so the header is not duplicated
    del request.headers['From']
    request.headers['From'] = 'email@example.com'

# kwargs: whatever options you normally pass to the Firefox driver
driver = webdriver.Firefox(**kwargs)
driver.request_interceptor = from_request_header_interceptor
driver.get('https://www.httpbin.org/headers')
Among the headers httpbin echoes back is "From": "email@example.com". Seems to work.

Scrapy - Javascript rendering

I would like to get some data from here:
https://www.drivy.com/location-voiture/liege/mitsubishi-colt-359699?address=Gare+de+Li%C3%A8ge-Guillemins&city_display_name=&country_scope=BE&distance=200&end_date=2019-05-27&end_time=06%3A00&latitude=50.6251&longitude=5.5659&start_date=2019-05-26&start_time=06%3A00
I'm searching for the ID of the owner of the car. This ID is in the a tag of class car_owner_section. For the page above it is the number in the href attribute, like this: "/users/1228276". The issue is that this link is apparently rendered by JavaScript, and I absolutely want to avoid scrapy-splash. Does anyone have an idea how to find this ID? It should be somewhere in a JSON response, I guess, but I've searched for days now and found nothing.
I tested it in the Scrapy shell, and the response contains the link you are looking for, without using Splash. You might want to check your settings.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
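
If the link is in the response as described above, pulling out the ID is a small step. A minimal sketch, assuming the href looks like "/users/1228276" as in the question; the CSS selector mentioned in the comment is a guess at the markup, not something confirmed here:

```python
import re


def owner_id_from_href(href):
    """Return the numeric user ID from an href like '/users/1228276', or None."""
    match = re.search(r"/users/(\d+)", href or "")
    return match.group(1) if match else None


# In the Scrapy shell this might be used roughly like so (selector is an assumption):
#   href = response.css(".car_owner_section a::attr(href)").get()
#   owner_id_from_href(href)
```

Accepting None as input means a missing link gives None back instead of raising an exception in the middle of a crawl.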

Phantomas does not render a page whose URL contains a hash mark

I'm testing a web page with Phantomas, but I ran into a problem when the URL contains a hash mark, such as 'http://bookstore2.shuqireader.com/route.php?sq_pg_param=bsbc&ver=151011#!/bid/3379630/'.
The screenshot of this page taken by Phantomas is completely blank, but it works perfectly with PhantomJS alone.
I installed Phantomas with 'npm install'.
phantomas http://bookstore2.shuqireader.com/route.php?sq_pg_param=bsbc&ver=151011#!/bid/3379630/ --screenshot=saveimg.png
saveimg.png is completely blank.
var webPage = require('webpage');
var page = webPage.create();
page.customHeaders = {
    "User-Agent": "Mozilla/5.0 (Linux; U; Android 4.0; en-us; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
};
page.open('http://bookstore2.shuqireader.com/route.php?sq_pg_param=bsbc&ver=151011#!/bid/3379630/', function (status) {
    if (status == "success") {
        page.render('saveimg.png');
    }
    phantom.exit();
});
With the PhantomJS script above, saveimg.png renders normally.
Is it a bug?
Not the answer you want, but generally all these NPM wrappers of PhantomJS suck for various reasons: the authors generally only handle their own specific use case, and the packages fail others who have slightly different needs.
Usually they fail for performance reasons (no problem if you are OK with maxing out a CPU on every request), but as you see, sometimes you'll be caught by situations the authors didn't code for.
You are much better off just launching phantomjs.exe yourself as a child process. Another alternative is to use the API at http://api.PhantomJsCloud.com (disclosure: I made it).
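
To illustrate the child-process route, here is a rough sketch in Python (any language that can spawn a process would do). It assumes a phantomjs binary on your PATH and a hypothetical render.js script that reads the URL and output file from its command-line arguments; neither comes from the question.

```python
import subprocess


def build_phantomjs_command(script_path, url, output_path, binary="phantomjs"):
    """Build the argument list: phantomjs <script> <url> <output>."""
    return [binary, script_path, url, output_path]


def render_page(script_path, url, output_path, timeout=60):
    """Run PhantomJS as a child process; return True if it exited with code 0."""
    command = build_phantomjs_command(script_path, url, output_path)
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return completed.returncode == 0
```

Passing the arguments as a list rather than one shell string means characters like & and # in the URL reach PhantomJS untouched instead of being interpreted by the shell.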

How to get project badge via GitHub API?

I have built a resume page for myself that lists all my projects using the GitHub API. Some of the projects are documentation, which have an rtfd build-passing badge; some are Python projects, which have travis-ci and pep-lint badges.
Now I want to display the badges alongside the projects. How can I get them with the API?
My page is here: http://gh.windrunner.info/resume/#/github
You could also use a different API with https://github-shields.com/
See "How to embed live Github PR status in your blogs & docs"
Consider the PR https://github.com/cloudfoundry/bosh/pull/715.
The URL doesn't indicate if the PR is open/merged/closed.
The cloudfoundry/bosh/pull/715 portion of the URL is copied directly into the following base URL:
https://github-shields.com/github/ + cloudfoundry/bosh/pull/715 + .svg
Opened as a link, the resulting URL redirects to the PR:
https://github-shields.com/github/cloudfoundry/bosh/pull/715.svg
Used as an image URL, it renders a status badge for cloudfoundry/bosh/pull/715.
Awesome, it was merged!
For the status of a project, the OP kxxoling reports in the comments having found shields.io:
https://img.shields.io/badge/<SUBJECT>-<STATUS>-<COLOR>.svg
The site documents how to get a status badge.
If no badge has been set up for a project, it returns an "inaccessible" badge, as with https://img.shields.io/travis/kxxoling/z42-doc.svg
For projects like https://github.com/kxxoling/z42-doc (which does have a badge in it), you need to fetch the README and then search through it for possible badges. Without knowing what language you'd prefer to use, I'm going to write some pseudo-code.
First you need to retrieve the README that GitHub identified as the one to render on the repository's home page. You can do this with:
GET /repos/kxxoling/z42-doc/readme
Host: api.github.com
Accept: application/vnd.github.v3.raw
If instead you'd rather parse HTML, change "raw" to "html" in the last header, e.g.,
GET /repos/kxxoling/z42-doc/readme
Host: api.github.com
Accept: application/vnd.github.v3.html
With the contents of the README, now you just need to parse it for links or directives that are specific to the mark-up languages you chose for your READMEs. You can parse them out with regular expressions or an HTML/XML parsing library of your choosing (if you're retrieving the rendered content from GitHub).
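
As one possible way to do this in Python, here is a sketch that fetches the raw README and pulls out image URLs that look like badges. The BADGE_HOSTS list is my own guess at which hosts count as badge providers, and the regular expressions only cover Markdown images and reStructuredText image directives, so adjust both to your READMEs.

```python
import re
from urllib.request import Request, urlopen

README_URL = "https://api.github.com/repos/{owner}/{repo}/readme"

# Assumption: hosts treated as badge providers; extend as needed.
BADGE_HOSTS = ("img.shields.io", "travis-ci.org", "readthedocs.org")


def fetch_readme(owner, repo):
    """Fetch the raw README text via the GitHub API (makes a network call)."""
    request = Request(
        README_URL.format(owner=owner, repo=repo),
        headers={
            "Accept": "application/vnd.github.v3.raw",
            "User-Agent": "badge-finder-example",
        },
    )
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def find_badges(readme_text):
    """Return image URLs in the README that point at a known badge host."""
    # Markdown images: ![alt](url)
    urls = re.findall(r"!\[[^\]]*\]\(([^)\s]+)", readme_text)
    # reStructuredText images: .. image:: url
    urls += re.findall(r"\.\.\s+image::\s*(\S+)", readme_text)
    return [url for url in urls if any(host in url for host in BADGE_HOSTS)]
```

Usage would be something like find_badges(fetch_readme("kxxoling", "z42-doc")), and the resulting URLs can go straight into img tags on the resume page.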

Authorization VK.COM with QWebView

I am trying to load the authorization page for VK.COM, but keep getting a white browser window. In that case loadFinished(bool). Example code:
QWebView* view = new QWebView;
view->load (QUrl ("https://oauth.vk.com/authorize?client_id=1234567&scope=wall,offline&redirect_uri=http://oauth.vk.com/blank.html&display=page&response_type=token"));
view->show ();
If I change the web address (to vk.com, for example), the site is displayed normally. I don't understand why load() doesn't work normally with a query to the VK API. I am using Qt 5.0.2.
The same code works with Qt 4.
This often happens when the server returns anything other than a 200 status code. In your case the status code is "401 - Unauthorized".
This link might be helpful for you: http://www.qtcentre.org/threads/37122-Detecting-finished-download-of-HTML-content-for-QWebView