Beginner Help - Scraping Ecommerce websites - scrapy

I am new to Scrapy and web crawling. I am trying to scrape data from e-commerce sites in India, but so far without success.
I am trying to pull the text out of the following hyperlink:
<a ng-href="/pd/40011505/figaro-extra-virgin-olive-oil-1-ltr/?nc=cl-prod-list&t_pg=&t_p=&t_s=cl-prod-list&t_pos=1&t_ch=desktop" ng-click="vm.pushToGoogleAnalytics('_trackEvent','item-clicked','custom-page',sectionModel.display_pos+' | '+sectionModel.pageinternalName+' | LNPD | '+sectionModel.internalName+' | '+vm.selectedProduct.sku+' | '+vm.selectedProduct.p_desc +' | '+ vm.itemposition,1)" class="ng-binding" style="text-align: left;" data-original-title="" data-trigger="focus" uib-tooltip="Extra Virgin Olive Oil" data-sectioninteractionplower="{"EventName":"ItemClicked", "CustomPageGroup" : "", "CustomPage":"", "ScreenInPageContext" : "cl-prod-list", "ScreenInPagePosition":"1",
"SectionItemName":"", "SectionItemPosition":"1"}" ng-bind="vm.selectedProduct.p_desc" href="/pd/40011505/figaro-extra-virgin-olive-oil-1-ltr/?nc=cl-prod-list&t_pg=&t_p=&t_s=cl-prod-list&t_pos=1&t_ch=desktop" css="1">Extra Virgin Olive Oil</a>
XPath/CSS selectors are not working for me.
I would appreciate any help.

To get the text from the <a> tag, I would use the following CSS selector:
resp.css('a::text').extract()
Response:
['Extra Virgin Olive Oil']
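To check offline what text an anchor like this carries, here is a minimal sketch that uses only Python's standard-library html.parser rather than Scrapy. It assumes the anchor text is already present in the downloaded HTML. The class name AnchorText and the trimmed SNIPPET (most attributes and the query string removed) are illustrative and not part of the original post.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect an (href, text) pair for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._href = None     # href of the anchor currently being read
        self._parts = []      # text fragments seen inside that anchor
        self._inside = False  # True while between <a> and </a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append((self._href, "".join(self._parts).strip()))
            self._inside = False


# Trimmed copy of the anchor from the question (assumption: shortened for readability).
SNIPPET = (
    '<a class="ng-binding" '
    'href="/pd/40011505/figaro-extra-virgin-olive-oil-1-ltr/">'
    "Extra Virgin Olive Oil</a>"
)

parser = AnchorText()
parser.feed(SNIPPET)
parser.close()
print(parser.links)
```

If the same check against the HTML that the spider actually receives yields an empty list, the text is probably not in the raw response at all, which would also explain a selector that matches nothing.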


Beautiful Soup and Requests - add missing </td></tr> to one line of HTML code

I am currently coding in Python using Beautiful Soup. The website I am trying to extract data from is http://xml.coverpages.org/country3166.html
On the whole, I can get everything I want working. I am extracting the country code and country name from the HTML using the <tr> tag. This is for a project I have set myself.
The problem is that the source HTML is missing some closing tags for one of the countries (Moldova); see below. This means that when I loop through the rows, my code stops doing what I need at Moldova.
<tr valign=top><td>MA</td><td>Morocco</td></tr>
<tr valign=top><td>MC</td><td>Monaco</td></tr>
<tr valign=top><td>MD</td><td>Moldova, Republic of
<tr valign=top><td>MG</td><td>Madagascar</td></tr>
Thanks
I know I could just create a new text file and amend it manually, but is there anything I can do with Beautiful Soup to fix this? My plan was to iterate through each line until Moldova is found and then append </td></tr> to the end. Is there a more efficient way?
When I inspect the source you linked, the HTML seems fine, so the problem is probably in the way you are scraping the data.
A small example where we search for each tr, get its children (2x td), and parse those as code and country to show a list:
import urllib3
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser
http = urllib3.PoolManager()
response = http.request('GET', 'http://xml.coverpages.org/country3166.html')
soup = BeautifulSoup(response.data, 'html.parser')

# Each <tr> holds two cells: the country code and the country name
for tr in soup.find_all("tr"):
    children = tr.findChildren()
    code = children[0].getText()
    country = children[1].getText()
    print(code, country)
Will output:
AD Andorra
AE United Arab Emirates
AF Afghanistan
AG Antigua & Barbuda
AI Anguilla
AL Albania
AM Armenia
AN Netherlands Antilles
AO Angola
AQ Antarctica
AR Argentina
AS American Samoa
... and many more, including Moldova and beyond
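If a row really does lack its closing tags, as in the snippet from the question, one option is to treat the next <tr> as the implicit end of the previous row. The following is a minimal sketch of that idea using only the standard-library html.parser instead of Beautiful Soup. The names RowCollector and parse_rows are hypothetical, and the sample rows are copied from the question.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a tuple of cell texts
        self._cells = None      # cells of the row being read, or None
        self._in_cell = False   # True while inside an open <td>

    def _finish_row(self):
        if self._cells:
            self.rows.append(tuple(cell.strip() for cell in self._cells))
        self._cells = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> implicitly ends a row that was never closed.
            self._finish_row()
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._finish_row()

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data


def parse_rows(markup):
    collector = RowCollector()
    collector.feed(markup)
    collector.close()
    collector._finish_row()  # flush a final row that has no </tr>
    return collector.rows


SAMPLE = """
<tr valign=top><td>MA</td><td>Morocco</td></tr>
<tr valign=top><td>MC</td><td>Monaco</td></tr>
<tr valign=top><td>MD</td><td>Moldova, Republic of
<tr valign=top><td>MG</td><td>Madagascar</td></tr>
"""

for code, country in parse_rows(SAMPLE):
    print(code, country)
```

This avoids editing the HTML by hand: the Moldova row is closed when the Madagascar row starts, so the loop continues past it.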

How to find the text which is outside of a tag using selenium?

I want to retrieve all anchor tags which are labeled as "Working" from the following code.
<div class="A">
Profile 1
Working
<br/>
Profile 2
<br/>
Profile 3
Working
<br/>
</div>
Here I want to retrieve the anchor tags for "facebook" and "linkedin", as they are labeled "Working".
The following attempts did not work properly:
//div[@class='A']//a//following-sibling::text()
//div[@class='A'][contains(text(),'Working')]
Could anyone provide an XPath to achieve this?
Try the expression below:
//div[@class='A']/text()[normalize-space()="Working"]/preceding-sibling::a[1]
Which returns
Profile 1
Profile 3
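The same selection can be tried outside the browser. Below is a minimal sketch using Python's standard-library xml.etree.ElementTree, which does not support text() or preceding-sibling but exposes the text that follows an element as its tail. The href values in SAMPLE are placeholders (the original links are not shown in the question), and links_with_label is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: each profile name is wrapped in an <a>; the hrefs are placeholders.
SAMPLE = """
<div class="A">
  <a href="https://example.com/profile1">Profile 1</a>
  Working
  <br/>
  <a href="https://example.com/profile2">Profile 2</a>
  <br/>
  <a href="https://example.com/profile3">Profile 3</a>
  Working
  <br/>
</div>
"""


def links_with_label(markup, label):
    """Return the text of every <a> that is directly followed by `label`."""
    root = ET.fromstring(markup.strip())
    # In ElementTree, the text after an element's closing tag is its `tail`.
    return [a.text for a in root.iter("a") if (a.tail or "").strip() == label]


print(links_with_label(SAMPLE, "Working"))
```

This only works on well-formed markup; for real pages fetched through Selenium, the XPath above is the more direct route.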

How to read a text when it is not in any HTML tag

How can I find text in the following HTML:
style="background-color: transparent;">
<a href="/">Home</a>
<a id="breadcrumbs-790" class=" main active mainactive" href="/products">Products</a>
<a href="/products/fruit-and-creme-curds">Fruit & Crème Curds</a>
Crème Banana Curd
</li>
</ul>"
</div>
This is the HTML for a breadcrumb: the first three items are links and the fourth is the page name. I want to read the page name (Crème Banana Curd) from the breadcrumb, but since it is not inside any node of its own, how can I capture it?
If the text isn't inside any tag, then it sits directly in the body tag.
So you can use something like the following to locate it:
html/body/text()
Though the question is somewhat vague without the full HTML source, you may still try the solution below, which stores the text in a variable:
var breadcrumb = FindElement(By.XPath(".//*[@id='breadcrumbs-790']/following-sibling::a")).Text;
Use the code below:
WebElement elem = driver.findElement(By.xpath("//*[contains(text(),'Crème Banana Curd')]"));
elem.getText();
Hope this helps.
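Another way to look at the problem: the page name is the loose text that follows the last link inside the breadcrumb item. Here is a minimal Python sketch of that idea using the standard-library xml.etree.ElementTree, not Selenium. The surrounding <ul>/<li> structure is an assumption (the question's snippet is truncated), the id and class attributes are left out, and current_page_name is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: the breadcrumb is a single <li> holding three links plus loose text.
BREADCRUMB = """
<ul>
  <li>
    <a href="/">Home</a>
    <a href="/products">Products</a>
    <a href="/products/fruit-and-creme-curds">Fruit &amp; Crème Curds</a>
    Crème Banana Curd
  </li>
</ul>
"""


def current_page_name(markup):
    """Return the text that follows the last <a> inside the breadcrumb <li>."""
    root = ET.fromstring(markup.strip())
    links = root.findall(".//li/a")
    if not links:
        return None
    # Text after an element's closing tag is stored in its `tail`.
    return (links[-1].tail or "").strip()


print(current_page_name(BREADCRUMB))
```

In a browser-driven test, a comparable approach would be to read the full text of the breadcrumb item and strip off the text of its links.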

SEO: Similar Title & H1 wording across most pages on a site

On a site that sells electronics, suppose the Title and H1 tag of every category page contained the words "Shop for" followed by the category.
That is, the page that pertained to laptops would have in its tags:
<title> Shop for Laptops </title>
<h1> Shop for Laptops </h1>
...and the page that pertained to cameras would have in its tags:
<title> Shop for Cameras </title>
<h1> Shop for Cameras </h1>
// etc..
Would the fact that "Shop for" was in the title and H1 tag of every category page hinder SEO? Would it be better if there was some variance to the words "Shop for" across the site?
The right format for the title is "The most important keyword | 2nd important keyword | 3rd important keyword". For example, if we want to sell a camera, we can set the page title to "Canon EOS 5D Mark III - SLR Camera - Your Website Name". Here "Canon EOS 5D Mark III" is the product name, "SLR Camera" is the category name, and the third keyword is your website name. We use the symbols "-" or "|" to separate the keywords. Lastly, remove words that users do not type into search engines, e.g., "search for".
Yes, I agree with @Cythilya.
That is good, and the title should look like this:
"The most important keyword | 2nd important keyword | 3rd important keyword"

(FitNesse, Xebium, Selenium IDE) How to use Xebium format to click on a label?

I'm very new to Xebium.
I cannot use
| ensure | do | click | on | id=text |
to click on a checkbox, because it is a label and the id appears to be hidden.
So, is there any way to click via the label?
Thank you for your suggestions :)
You can use XPath to find your desired element. If you have the following structure of your form:
<form action="target.html">
<label for="male">Male</label>
<input type="checkbox" name="sex" id="male" value="male"><br />
<label for="female">Female</label>
<input type="checkbox" name="sex" id="female" value="female"><br />
<br/>
<input type="submit" value="Submit">
</form>
then you could use the following Xebium command to click on the label whose text is Female:
| ensure | do | click | on | xpath=(//label[contains(text(),'Female')]) |
You can also use the following command if you want to click the check box that is coupled to a specific label:
| ensure | do | click | on | xpath=(//input[contains(preceding-sibling::label[1]/text(),'Male')]) |
The [1] restricts the match to the nearest preceding label. If your label tag comes after your input tag, change preceding-sibling to following-sibling.
Note: you can try this out on the w3schools XPath examples. That example uses radio buttons instead of check boxes. Since the example sits in an iframe, you first have to move into the frame with selectFrame|iframeResult.
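Since each label in the form points at its input through the for attribute, the coupling can also be resolved by id instead of by sibling position. The following is a minimal Python sketch of that lookup using the standard-library xml.etree.ElementTree, not Xebium. FORM is the form above rewritten as well-formed XML (self-closing input tags), and input_for_label is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The form from the answer, with <input> tags self-closed so it parses as XML.
FORM = """
<form action="target.html">
  <label for="male">Male</label>
  <input type="checkbox" name="sex" id="male" value="male"/><br/>
  <label for="female">Female</label>
  <input type="checkbox" name="sex" id="female" value="female"/><br/>
  <br/>
  <input type="submit" value="Submit"/>
</form>
"""


def input_for_label(form_markup, label_text):
    """Find the <input> whose id matches the `for` attribute of the given label."""
    root = ET.fromstring(form_markup.strip())
    for label in root.iter("label"):
        if (label.text or "").strip() == label_text:
            target_id = label.get("for")
            return root.find(f".//input[@id='{target_id}']")
    return None


checkbox = input_for_label(FORM, "Female")
print(checkbox.get("id"), checkbox.get("value"))
```

Matching on the for/id pair does not depend on whether the label comes before or after its input.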