Extracting currency values using Scrapy/XPath

I am trying to get currency values from some HTML using Scrapy. The code is:
links = hxs.select('//a[@class="product-image"]/div[@class="price-box"]//span[@class="price"]/text()').extract()
And the HTML:
<div>
<span>
<sub>
<li class="item first">
<a href="http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd/pitch-perfect-dvd.html" title="Image for Pitch Perfect" class="product-image">
<span class="exclusive-star">
</span>
<img src="http://www.xtra-vision.ie/media/catalog/product/cache/3/small_image/124x173/5b02ab93946615b958c913185aae2414/i/w/iws_5167c10c906b57.33524324.JPG.jpg" alt="Image for Pitch Perfect" />
<h2 class="product-name">Pitch Perfect</h2>
<div class="price-box">
<span class="regular-price" id="product-price-5174">
<span class="price">
€15
<sub class="price-bit">.99</sub>
</span>
</span>
</div>
</a>
</li>
</sub>
</span>
</div>
The resulting price I get is \u20ac15\t\t\t\t\t\t
Is there some way I can extract 15.99 from this HTML using XPath?

I used a combination of XPath and Python, so this might not be quite what you were after; the Python part is mainly there to get rid of the extraneous tabs at the end of the "price".
# text nodes directly inside the price span (the euro part, plus whitespace)
price = hxs.select('//span[@class="price"]/text()').extract()
# text of the nested <sub> (the cents part)
pricebit = hxs.select('//span[@class="price"]/sub[@class="price-bit"]/text()').extract()
# both are lists, so + concatenates them
totalprice = price + pricebit
totalstr = ''.join(totalprice).replace('\t','')
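The same idea can be sketched with only the Python standard library instead of Scrapy's selector. This is a minimal illustration, assuming a trimmed, well-formed copy of the price markup from the question; the names fragment, raw, compact and amount are made up for the example. It collects every text node inside the price span (including the nested sub), removes all whitespace, and drops the leading euro sign.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the price markup from the question (assumed well-formed).
fragment = """
<div class="price-box">
  <span class="regular-price" id="product-price-5174">
    <span class="price">
      \u20ac15
      <sub class="price-bit">.99</sub>
    </span>
  </span>
</div>
"""

root = ET.fromstring(fragment.strip())

# ElementTree supports the [@attr='value'] predicate, which is enough here.
span = root.find(".//span[@class='price']")

# itertext() yields the span's own text and the text of nested elements.
raw = "".join(span.itertext())

# split() with no arguments drops tabs, newlines and spaces alike.
compact = "".join(raw.split())

# Remove the currency symbol to leave just the number as a string.
amount = compact.lstrip("\u20ac")
print(amount)
```

Removing whitespace in Python rather than inside the XPath avoids being left with a space between "15" and ".99", which is what plain whitespace normalization of the span's text would produce.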

How to clean up pulled data from BeautifulSoup, Pandas, Python

Hello everyone, I have pulled the information I want using BeautifulSoup, but I can't seem to get it printed out correctly to send to pandas and Excel.
html_f ='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret"> </span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''
My code used to pull the data I want:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_f, 'html.parser')
for child in soup.findAll('li', class_='list-group-item')[0]:
    print(child.text)
This is what it pulls, but it prints with a lot of extra spacing:
07/01/2022 Date
Comment
[1] Comments
Ideally, I only need the top portion (date and File Date) printed out, but at the very least I need help getting it into a list format like:
07/01/2022 Date
Comment
[1] Comments
To get your information printed as expected in your question, you could use stripped_strings and iterate over its elements:
for e in soup.find_all('li', class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)
Note: In new code, use find_all() instead of the old findAll() syntax.
Example
html='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret">
</span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all('li', class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)
Output
07/01/2022 Date
Comment
[1] Comments
Since you mention pandas: you could also pick out each piece of information, clean it up, and append it to a list of dicts:
import pandas as pd
data = []
for e in soup.find_all('li', class_='list-group-item'):
    data.append({
        'date': e.p.text.strip().replace(' Date', ''),
        'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip()
    })
pd.DataFrame(data)
or
data = [{
'date':soup.select_one('li.list-group-item .text-primary').text.strip().replace(' Date',''),
'comment':soup.select_one('li.list-group-item .tyler-toggle-container br').next_sibling.strip()
}]
Output
date        comment
07/01/2022  [1] Comments
Here is my attempt:
doc='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret">
</span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')
text=[' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
print(text)
Output:
['07/01/2022, Comments']
Alternatively, either of these should work:
text=' '.join([' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]).strip()
#Or
text= [' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
# text holds one string per <li>; split the first one into its two parts
date, comment = text[0].split(', ')
final_text = date + ',' + comment
final_list = [date, comment]  # if you want a list

How to get loop variable suffixed in v-text to get the exact key from data?

I am not able to append the loop variable i (from i in passengers.count) to the key in v-text="passengers.details.title_i".
I tried wrapping it like this: v-text="passengers.details.title_[i]", but it doesn't work.
<div v-if="_.size(passengers.details) > 0" v-for="i in passengers.count" :key="i" style="margin-bottom: 15px;">
<h4>Passenger <span v-text="i"></span></h4>
<ul class="list-inline">
<li><label>Title:</label> <span v-text="passengers.details.title_${i}"></span></li>
<li><label>First Name:</label> <span v-text="passengers.details.first_name_${i}"></span></li>
</ul>
</div>
I need v-text="passengers.details.title_[i]" to be read as
v-text="passengers.details.title_1"
You need to use square brackets around the whole property name, not just the number on the end.
You could use backticks:
<span v-text="passengers.details[`title_${i}`]"></span>
Or just concatenation:
<span v-text="passengers.details['title_' + i]"></span>
Not sure why you're using v-text, using {{ ... }} would seem simpler:
<span>{{ passengers.details[`title_${i}`] }}</span>

How to write an XPath using the > operator?

I am trying to add to the cart the item whose price is greater than $35, using the XPath below:
//div[@class='m-product-mini']//span[contains(text()>'$35.00')]
With this XPath I am unable to identify the price value. The HTML is below.
<div class="m-product-mini">
<div data-id="EF_TLR04-1A-P_EF_TLR04-1A">
<!-- main-image -->
<div class="m-product-mini-image">
Quick view
<a href="/bouquet/stunning-statement-bouquet/p_ef_tlr04-1a?skuId=EF_TLR04-1A&zipMin=">
</a>
</div>
<span class="m-product-mini-merchandising-icon">
<img src="new.jpg" alt="New Flower Arrangement by Florence's Flowers & Gifts">
</span>
<h2 class="m-category-flower-link-h2">Stunning Statement Bouquet</h2>
<span>$36.99</span> <span class="priceTag-discount"></span>
</div>
</div>
Try the XPath below to get the required output:
//div[@class='m-product-mini']//span[number(substring-after(text(), '$')) > 35]
Note that you need to:
- get rid of the "$" sign, which is what substring-after(text(), '$') does
- convert the result to a number so it can be compared, which is what number() does
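For comparison, the same two steps (strip the "$", convert to a number, then compare) can be done on the Python side once the span elements have been collected. This is only a sketch using the standard library on a simplified, well-formed copy of the markup; price_of, fragment, matches and prices are hypothetical names, and in a real Selenium script the spans would come from the driver rather than from ElementTree.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the product markup from the question.
fragment = """
<div class="m-product-mini">
  <div data-id="EF_TLR04-1A-P_EF_TLR04-1A">
    <h2 class="m-category-flower-link-h2">Stunning Statement Bouquet</h2>
    <span>$36.99</span> <span class="priceTag-discount"></span>
  </div>
</div>
"""


def price_of(span):
    """Return the span's price as a float, or None if it holds no price."""
    text = (span.text or "").strip()
    if not text.startswith("$"):
        return None
    try:
        # Mirrors number(substring-after(text(), '$')) in the XPath.
        return float(text[1:])
    except ValueError:
        return None


root = ET.fromstring(fragment.strip())

# Keep only spans whose numeric price is above the threshold.
matches = [s for s in root.iter("span") if (price_of(s) or 0) > 35]
prices = [price_of(s) for s in matches]
print(prices)
```

The empty priceTag-discount span is skipped because it has no text, just as number() yields NaN for it in the XPath version and the comparison fails.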

Web scraping Linkedin Job posts using Python, Selenium & Phantomjs

Some LinkedIn job posts contain a see more button that expands the whole job description:
https://www.linkedin.com/jobs/view/401243784/?refId=3024203031501300167509&trk=d_flagship3_search_srp_jobs
I tried to expand it using element.click(), but the source I get after expansion contains some placeholder divs instead of the original div. How can I scrape those hidden texts?
This is what I get from driver.page_source
<div class="jobs-ghost-placeholder jobs-ghost-placeholder--medium jobs-ghost-placeholder--thin mb2"></div>
<div class="jobs-ghost-placeholder jobs-ghost-placeholder--x-small jobs-ghost-placeholder--thin mb2"></div>
<div class="jobs-ghost-placeholder jobs-ghost-placeholder--small jobs-ghost-placeholder--thin"></div>
Instead of the source I get from chrome inspect:
<div id="ember7189" class="jobs-description-details pt5 ember-view"> <h3 class="jobs-box__sub-title js-formatted-exp-title">Seniority Level</h3>
<p class="jobs-box__body js-formatted-exp-body">Associate</p>
<!---->
<h3 class="jobs-box__sub-title js-formatted-industries-title">Industry</h3>
<ul class="jobs-box__list jobs-description-details__list js-formatted-industries-list">
<li class="jobs-box__list-item jobs-description-details__list-item">Real Estate</li>
<li class="jobs-box__list-item jobs-description-details__list-item">Information Technology and Services</li>
</ul>
<h3 class="jobs-box__sub-title js-formatted-employment-status-title">Employment Type</h3>
<p class="jobs-box__body js-formatted-employment-status-body">Full-time</p>
<h3 class="jobs-box__sub-title js-formatted-job-functions-title">Job Functions</h3>
<ul class="jobs-box__list jobs-description-details__list js-formatted-job-functions-list">
<li class="jobs-box__list-item jobs-description-details__list-item">Information Technology</li>
<li class="jobs-box__list-item jobs-description-details__list-item">Project Management</li>
<li class="jobs-box__list-item jobs-description-details__list-item">Product Management</li>
</ul>
</div>
I also tried different values for the wait, such as WebDriverWait(driver, 3), but in vain.
Code:
employment_type = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, 'div.jobs-description__details>div.jobs-description-details>p.js-formatted-employment-status-body'))).text
This raises a timeout exception, because the page only contains the jobs-ghost-placeholder divs rather than the element described by the CSS selector.

Clarification on relative xpath

I want an XPath that locates element 2 relative to element 1 in the HTML below. Can I build it using following-sibling?
Any help appreciated.
<div id="leads_data" class="uk-width-large-8-12 uk-container-center">
<div class="md-card-list leads_cards_list">
<ul class="">
<li class="item-shown" style="margin: 40px -20px; min-height: 90px;">
<div class="md-card-list-item-menu" data-uk-dropdown="{mode:'click',pos:'bottom-right'}">
<span>12:35:22 PM</span>
<a class="md-btn md-btn-flat md-btn-wave waves-effect waves-button" onclick="showModal(446)" data-uk-modal="{ center:true }" href="#lead_info">Edit</a>
</div>
<div class="md-card-list-item-sender">
<span>
<i class="material-icons">face</i>
CDYCUQYGNKYNFNFCWOUO
</span>
</div>
<div class="md-card-list-item-sender">
<span>
<i class="material-icons">call</i>
+91-1119771441
</span>
</div>
<div class="md-card-list-item-subject">
<span>
<i class="material-icons">email</i>
rnkntsoorh@jd.com
</span>
</div>
<div class="md-card md-card-list-item-content-wrapper uk-margin-top" onclick="showModal(446)" style="display: block; opacity: 1;">
<div class="md-card-content leads_card_content">
<div class="uk-text-large uk-margin-remove">
<b>Remarks:</b>
Some remarks of TSYNTORSAC
</div>
</div>
</div>
<br/>
</li>
I want to start from the div containing the text "CDYCUQYGNKYNFNFCWOUO" and locate the div containing the text "Some remarks of TSYNTORSAC".
You can use any of the XPaths below:
//span[contains(.,'CDYCUQYGNKYNFNFCWOUO')]/parent::div/following::div[contains(.,'Some remarks of TSYNTORSAC')][last()]
or
//span[contains(.,'CDYCUQYGNKYNFNFCWOUO')]/parent::div/following::div[contains(.,'Some remarks of TSYNTORSAC') and contains(@class, 'uk-text-large uk-margin-remove')]
or
//span[contains(.,'CDYCUQYGNKYNFNFCWOUO')]/parent::div/following-sibling::div[contains(.,'Some remarks of TSYNTORSAC')]/descendant::div[contains(@class, 'uk-text-large uk-margin-remove')]
or
//span[contains(.,'CDYCUQYGNKYNFNFCWOUO')]/parent::div/following-sibling::div[contains(.,'Some remarks of TSYNTORSAC')]/div/div
Hope this helps :)
In addition to the ways described by Saurabh gaur, you can also use following-sibling with a position to locate the desired element.
.//span[text()='CDYCUQYGNKYNFNFCWOUO']/../following-sibling::div[position()=3]//div[contains(.,'Some remarks of TSYNTORSAC')]
or
.//span[text()='CDYCUQYGNKYNFNFCWOUO']/../following-sibling::div[position()=3]/div/div[contains(.,'Some remarks of TSYNTORSAC')]
Hope this helps
You could simply get the element containing the desired text instead of using following:
//li[div[normalize-space(.)='face CDYCUQYGNKYNFNFCWOUO']]/div[last()]
Try this:
//div[contains(@class,'md-card-list-item-sender') and contains(.,'CDYCUQYGNKYNFNFCWOUO')]/following::div[contains(@class,'md-card-list-item-content') and descendant::b[contains(.,'Remarks')]]
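The "find the common li, then go down to the remarks div" approach can also be sketched in plain Python. Note that the standard library's ElementTree does not implement the following or following-sibling axes, so this example walks the tree by hand rather than running any of the XPaths above; it assumes a trimmed, well-formed copy of the markup, and text_of, fragment and remarks are illustrative names.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the list item from the question.
fragment = """
<ul>
  <li class="item-shown">
    <div class="md-card-list-item-sender">
      <span><i class="material-icons">face</i> CDYCUQYGNKYNFNFCWOUO</span>
    </div>
    <div class="md-card-list-item-sender">
      <span><i class="material-icons">call</i> +91-1119771441</span>
    </div>
    <div class="md-card md-card-list-item-content-wrapper uk-margin-top">
      <div class="md-card-content leads_card_content">
        <div class="uk-text-large uk-margin-remove">
          <b>Remarks:</b>
          Some remarks of TSYNTORSAC
        </div>
      </div>
    </div>
  </li>
</ul>
"""


def text_of(element):
    """All text inside the element, with runs of whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(fragment.strip())
remarks = None
for li in root.iter("li"):
    # Element 1: a direct child div of the li that contains the name.
    if any("CDYCUQYGNKYNFNFCWOUO" in text_of(d) for d in li.findall("div")):
        # Element 2: the remarks div somewhere below the same li.
        remarks = li.find(".//div[@class='uk-text-large uk-margin-remove']")
        break

print(text_of(remarks))
```

Anchoring on the shared li keeps the lookup tied to one list item, which matters when the page contains several leads with similar markup.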