How to remove redundant space in BeautifulSoup output - beautifulsoup

I intend to scrape a website using BeautifulSoup. I'm working on the following HTML:
html = """
<div id="article-body" itemprop="articleBody">
<p>
<span class="quote down bgQuote" data-channel="/quotes/zigman/170324/composite" data-bgformat="">
<a class="qt-chip trackable" data-fancyid="XNYSStockSLB" href="/investing/stock/slb?mod=MW_story_quote" data-track-mod="MW_story_quote">
SLB,
<span class="bgPercentChange">-3.04%</span>
</a>
</span>
reported late Thursday
higher third-quarter profit that beat targets and sales only slightly below estimates
. Schlumberger’s results came a day after rival Halliburton Co.
<span class="quote down bgQuote" data-channel="/quotes/zigman/228631/composite" data-bgformat="">
<a class="qt-chip trackable" data-fancyid="XNYSStockHAL" href="/investing/stock/hal?mod=MW_story_quote" data-track-mod="MW_story_quote">
HAL,
<span class="bgPercentChange">-0.66%</span>
</a> """
I want plain text without any redundant whitespace. I followed the answer by Twig, but SLB and -3.04%, and likewise HAL and -0.66%, still end up on separate lines. My desired output would look like:
SLB, -3.04% reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates. Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66% also posted higher-than-expected profit.
Here is my code:
from bs4 import BeautifulSoup
import re
newsText = BeautifulSoup(html, 'lxml')
text = list(newsText.stripped_strings)
finalText = "\n\n".join(text) if text else ""
re.sub(r'[\ \n]{2,}', '', finalText)  # note: the result is discarded, so finalText is unchanged
print(finalText)
Thanks in advance.

soup = BeautifulSoup(html, 'lxml')
text = soup.get_text(strip=True, separator=' ')
print(text)
out:
SLB, -3.04% reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates . Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66%
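If the extracted text still contains stray line breaks, or a space before punctuation (as in "estimates ." above), one option is to post-process the string. This is only a minimal sketch, assuming the input is whatever get_text() returned; normalize_whitespace is a hypothetical helper name, not part of BeautifulSoup.

```python
import re


def normalize_whitespace(text):
    """Collapse whitespace runs and drop spaces before punctuation."""
    # str.split() with no arguments splits on any run of spaces, tabs or newlines.
    collapsed = " ".join(text.split())
    # Remove the space left in front of punctuation such as "estimates ."
    return re.sub(r"\s+([.,;:!?])", r"\1", collapsed)


raw = "SLB,\n-3.04%\n reported  late Thursday\nbelow estimates\n. Schlumberger"
clean = normalize_whitespace(raw)
print(clean)
```

The punctuation rule is deliberately simple; it would also remove the space in something like "a , b", which may or may not be what you want.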

Related

Unable to loop through navigable string with BeautifulSoup CSS Selector

I would like to extract the content inside the p tag below.
<section id="abstractSection" class="row">
<h3 class="h4">Abstract<span id="viewRefPH" class="pull-right hidden"></span>
</h3>
<p> Variation of the (<span class="ScopusTermHighlight">EEG</span>), has functional and. behavioural effects in sensory <span class="ScopusTermHighlight">EEG</span> We can interpret our. Individual <span class="ScopusTermHighlight">EEG</span> text to extract <span class="ScopusTermHighlight">EEG</span> power level.</p>
</section>
A one-line Selenium call such as the one below
document_abstract = WebDriverWait(self.browser, 20).until(
    EC.visibility_of_element_located((By.XPATH, '//*[@id="abstractSection"]/p'))).text
can easily extract the p tag content and gives the following output:
Variation of the EEG, has functional and. behavioural effects in sensoryEEG. We can interpret our. Individual EEG text to extract EEG power level.
Nevertheless, I would like to use BeautifulSoup for speed reasons.
The following BeautifulSoup code, which refers to the CSS selector (i.e., #abstractSection), was tested:
from bs4 import BeautifulSoup as soup
url = r'scopus_offilne_specific_page.html'
with open(url, 'r', encoding='utf-8') as f:
    page_soup = soup(f, 'html.parser')
home = page_soup.select_one('#abstractSection').next_sibling
for item in home:
    for a in item.find_all("p"):
        print(a.get_text())
However, the interpreter returns the following error:
AttributeError: 'str' object has no attribute 'find_all'
Also, since Scopus requires a login ID, the above problem can be reproduced using the offline HTML, which is accessible via this link.
May I know what I did wrong? I'd appreciate any insight.
Thanks to this OP, the problem described above can apparently be solved simply as below:
document_abstract=page_soup.select('#abstractSection > p')[0].text
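For anyone wondering why the original loop failed: next_sibling of the section is most likely the whitespace text that follows the closing tag, and iterating over a string yields single characters, which have no find_all. A small sketch of that, using a shortened stand-in for the Scopus markup (the html variable below is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<section id="abstractSection" class="row">
<h3 class="h4">Abstract</h3>
<p> Variation of the <span class="ScopusTermHighlight">EEG</span> signal.</p>
</section>
<p>footer</p>
"""

page_soup = BeautifulSoup(html, "html.parser")
section = page_soup.select_one("#abstractSection")

# The node right after </section> is the newline text, not a tag.
sibling = section.next_sibling
sibling_is_text = isinstance(sibling, str)

# Selecting the paragraph directly avoids walking siblings at all.
paragraph = page_soup.select_one("#abstractSection > p")
abstract = " ".join(paragraph.get_text().split())
print(sibling_is_text, abstract)
```

select_one returns None when nothing matches, so on a real page it is worth checking the result before calling get_text().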

How do I parse the data in these HTML tags?

I am a Python newbie. I was trying to get some data from my school's site. Below is the code I wrote to scrape only the news items. It works, but I want the title, date and paragraph to be on separate lines. I feel something is missing in my code, but I can't put my finger on it. I'd appreciate your help.
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = urlopen("http://www.kibabiiuniversity.ac.ke")
soup = BeautifulSoup(page, "html.parser")
for i in soup.findAll("div", {"class": "blog-thumbnail-inside"}):
    print(i.get_text())
    print("----------" * 20)
And here is the html tag structure of the page I'm trying to scrape.
<div class="blog-thumbnail-inside">
<h2 class="blog-thumbnail-title post-widget-title-color gdl-title">
<a href="http://www.kibabiiuniversity.ac.ke">
Completion of fees & collection of exam cards.
</a>
</h2>
<div class="blog-thumbnail-info post-widget-info-color gdl-divider">
<div class="blog-thumbnail-date">Posted on 09 Jan 2017</div>
</div>
<div class="blog-thumbnail-context">
<div class="blog-thumbnail-content">
Download the information on fee payment and collection of exam cards..
</div>
</div>
</div>
for i in soup.findAll("div", {"class": "blog-thumbnail-inside"}):
    print(i.get_text('\n'))  # you can specify a string to be used to join the bits of text together
    print("----------" * 20)
out:
Final Undergraduate Examination Timetable for Semester 1 2016/2017
Posted on 11 Jan 2017
Download Undergraduate Timetable
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Vacancies for Administrative and Teaching Positions
Posted on 11 Jan 2017
Kibabii University is a fully fledged public institution of higher education and research in Kenya with a student population of 6400 and staff population of 346. The University seeks to appoint innovative individuals with experience and excellent credentials
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
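If you need the title, date and summary as separate values rather than one joined string, another option is to look each one up by its class. This is a rough sketch, assuming the class names from the HTML in the question; the html sample and the extract_items helper are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="blog-thumbnail-inside">
  <h2 class="blog-thumbnail-title"><a href="#"> Completion of fees </a></h2>
  <div class="blog-thumbnail-date">Posted on 09 Jan 2017</div>
  <div class="blog-thumbnail-content"> Download the information. </div>
</div>
"""

FIELD_CLASSES = ("blog-thumbnail-title", "blog-thumbnail-date", "blog-thumbnail-content")


def extract_items(markup):
    """Return one [title, date, content] list per news block."""
    soup = BeautifulSoup(markup, "html.parser")
    items = []
    for block in soup.find_all("div", class_="blog-thumbnail-inside"):
        fields = []
        for css_class in FIELD_CLASSES:
            node = block.find(class_=css_class)
            # A missing field becomes an empty string instead of raising.
            fields.append(node.get_text(strip=True) if node else "")
        items.append(fields)
    return items


for title, date, content in extract_items(html):
    print(title, date, content, sep="\n")
```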

What's wrong with my structured data?

It's been more than 4 months since our rich snippets suddenly disappeared. Some errors were reported in GWT; I corrected everything and the errors are now decreasing (only 5 left). Here is my code:
<section class="c-center" itemscope itemtype="http://schema.org/Product">
<div>
<h1><span itemprop="name">Product name</span> <span itemprop="brand" class="brand">Brand of product</span></h1>
<div id="reviews" itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
<div class="rating">
<meta itemprop="ratingValue" content="4.8" />
<meta itemprop="ratingCount" content="56" />
<div class="fill" style="width:96%"></div>
<div class="stars"></div>
</div>
<div class="rating-info">
Based on 56 reviews - Write a review
</div>
</div>
</div>
<div id="img">
<img src="/link-to-image.jpg" alt="Img alt" itemprop="image" />
</div>
<div id="info">
<meta itemprop="url" content="site.com/link-of-product/">
<div id="price-container" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="priceCurrency" content="EUR">
<meta itemprop="gtin13" content="1234567899999">
<span class="price" itemprop="price">19,95 €</span> <del>28,50 €</del> -
<span class="stock"><link itemprop="availability" href="http://schema.org/InStock">Available</span>
</div>
</div>
</section>
Here are my questions:
1- Is there anything wrong?
2- I've seen in many posts that the currency should not be in itemprop="price", but Google's examples do include it. What should I do?
3- Should I use ratingCount or reviewCount?
4- Some products exist in different sizes with different prices. Is it recommended to include an AggregateOffer with the lowest and highest price?
Thanks a lot
How does it appear visually?
The structured data linter shows a typical snippet that looks good and has a star rating, and there are no errors in Google's tool. A few things stand out:
url has no protocol; set it to a full URL such as http://yoursite.com/page1
price should be a number only, which could well be affecting search results; currency is a separate field, so it should not be embedded in price as well
use <meta> to give your price with a full stop as the decimal separator rather than a comma, and write large values as 1234567.89, not 1,234,567.89 or 1.234.567,89, but display the price as you normally would
price info from http://schema.org/
Use the priceCurrency property (with ISO 4217 codes e.g. "USD") instead of including ambiguous symbols such as '$' in the value.
Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator.
Note that both RDFa and Microdata syntax allow the use of a "content=" attribute for publishing simple machine-readable values alongside more human-friendly formatting.
Use values from 0123456789 (Unicode 'DIGIT ZERO' (U+0030) to 'DIGIT NINE' (U+0039)) rather than superficially similar Unicode symbols.
Google actually gives this example on its policies page:
<span itemprop="priceCurrency" content="USD">$</span><span itemprop="price">119.99</span>
previous Offer price: you could include structured data for the expired Offer price inside <del>, with priceValidUntil set to a date in the past; the current price can also have an expiry date.
consider setting itemCondition to http://schema.org/NewCondition
image URLs: full URLs seem to be preferred over relative paths. Your /link-to-image.jpg is interpreted as http://example.com/link-to-image.jpg, not http://site.com/link-to-image.jpg, in the testing tool. I'm unsure whether the same happens when testing directly from the URL, but it seems best not to be ambiguous.
lastly, use a shopping search tool, including Google Shopping, to search for a best seller and see whether it can be found by price, brand, availability, etc. If competitor sites appear first, you can check their URLs in the structured data tester to see if you are missing anything.
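To illustrate the price advice, here is a small sketch of turning a displayed price such as "19,95 €" into the machine-readable form for a content attribute. It assumes the display string uses '.' for thousands and ',' for decimals; machine_price and price_meta are hypothetical helper names, not part of any schema.org tooling.

```python
from decimal import Decimal


def machine_price(display_price):
    """Convert a display price such as '1.234,50 €' to '1234.50'."""
    # Keep only ASCII digits and separators; this drops the currency symbol.
    kept = "".join(ch for ch in display_price if ch in "0123456789,.")
    # Assumption: '.' groups thousands and ',' marks the decimal point.
    normalized = kept.replace(".", "").replace(",", ".")
    # Decimal validates the number and keeps trailing zeros such as '28.50'.
    return str(Decimal(normalized))


def price_meta(display_price):
    """Build a meta tag that carries the machine-readable price."""
    return '<meta itemprop="price" content="%s">' % machine_price(display_price)


print(price_meta("19,95 €"))
```

The visible text can then stay as "19,95 €" while the currency goes in priceCurrency.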

Google Rich Snippets Not Working

I'm working on a website for a friend (www.texasfriendlydds.com) and am trying to give them an edge with the Rich Snippets that Google allegedly loves. It's a defensive driving school with 10 locations in the Austin area. I've placed the schema.org code within the address of each location, but when searching 'defensive driving austin' I do not see any of the locations listed. I have ten copies of the following code, one per location (with a different address in each):
<div itemscope itemtype="http://schema.org/LocalBusiness">
<span itemprop="name">Texas Friendly Defensive Driving</span><br />
<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">13201 Ranch Road 620</span><br />
<span itemprop="addressLocality">Austin</span> <span itemprop="addressRegion">TX</span> <span itemprop="postalCode">78750</span>
</div>
<div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
<span itemprop="ratingValue">4.6</span> stars - based on <span itemprop="reviewCount">24</span> reviews
</div>
Free meal w/ <span itemprop="priceRange">$40 tuition</span><br /><br />
<meta itemprop="openingHours" content="Thursdays 3:30pm - 9:30pm"><b>Thursdays 3:30pm - 9:30pm</b><br />
</div>
In addition, at the bottom of the page, I aggregate all the reviews in an attempt to get rich snippet star ratings in organic search, to no avail. I've compared my code directly with the following site:
- http://www.microdatagenerator.com/aggregate-rating-schema-generator/
They were exactly the same (minus the values). You can find their snippets by Googling 'aggregate rating schema' and find the 2nd listing with rich snippet stars and 956 ratings. At one point I read that you need to show proof of your ratings, but this site doesn't do that and they have them.
I've used the Google Structured Data Testing Tool (https://developers.google.com/structured-data/testing-tool/) and everything comes out peachy. So why in the world am I not seeing any results from this?
We (Google) don't accept rich snippets for homepages; rich snippet annotations should be placed on leaf pages.

Retrieve 5th item's price value using selenium webdriver

Say I have multiple price quotes from multiple retailers. How do I retrieve the 5th value from a particular retailer, say Target or Walmart? I can get to the 5th entry using the matching image logo, but how do I retrieve the value?
I'm adding the HTML to make things clearer. I need to retrieve the ratePrice value (198).
<div id="rate-297" class="rateResult standardResult" vendor="15">
<div class="rateDetails">
<h4>Standard Goods
<br>
<img src="http://walmart.com/walmart/ZEUSSTAR999.jpg">
</h4>
<p>
<span class="vendorPart-380">
<img alt="Walmart" src="/cb2048547924/icons/15.gif">
<br>
<strong>
<br>
MNC
</span>
</p>
</div>
<div class="ratePrice">
<h3>
$198
<sup>49</sup>
</h3>
<p>
<strong>$754.49</strong>
<br>
</p>
<a class="button-select" href="https://www.walmart.com/us/order/95134/2013-05-14-10-00/95134/2013-05-17-10-00/297"> </a>
</div>
</div>
If you could provide some HTML it would help. Speaking generally from what you're asking you'd get a locator to the price div or whatever HTML element and then get its text using something like:
_driver.FindElement(locator_of_element).Text
The trick is understanding the HTML in order to target the 5th element. So if you can find the row that has the 5th entry, then it's simply a matter of finding the price div in that row and getting its text.
EDIT based on more info provided by OP in comments
Using the HTML you provided (which isn't well formed, by the way: it is missing a closing strong tag, an a tag, etc.), I'd say do the following:
_driver.FindElement(By.XPath("//div[@class='ratePrice'][5]/h3")).Text
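One caveat on the index: in XPath, a [5] placed directly after the predicate counts among the children of each parent, so if every result block holds just one ratePrice div, a form like (//div[@class='ratePrice'])[5] may be what you need to count across the whole page. The same idea can be tried offline. Below is a rough sketch using Python's standard xml.etree.ElementTree as a stand-in for the browser; the sample markup is a cleaned-up, made-up version of the HTML above, and build_sample and nth_price are hypothetical helpers.

```python
import xml.etree.ElementTree as ET


def build_sample(count):
    """Build a well-formed stand-in for the results page (made-up data)."""
    rows = []
    for i in range(1, count + 1):
        rows.append(
            f'<div id="rate-{i}" class="rateResult standardResult" vendor="15">'
            f'<div class="rateDetails"><h4>Standard Goods</h4></div>'
            f'<div class="ratePrice"><h3>${100 + i}<sup>49</sup></h3></div>'
            f'</div>'
        )
    return ET.fromstring("<results>" + "".join(rows) + "</results>")


def nth_price(root, vendor, n):
    """Return the whole-dollar part of the n-th (1-based) price for a vendor."""
    path = f".//div[@vendor='{vendor}']/div[@class='ratePrice']/h3"
    headings = root.findall(path)  # all matches, in document order
    # .text is the text before the first child, i.e. before <sup>.
    return headings[n - 1].text.strip()


root = build_sample(6)
fifth = nth_price(root, "15", 5)
print(fifth)
```

Filtering on the vendor attribute first is what narrows the list to one retailer; the value "15" is taken from the sample row in the question and is only assumed to identify Walmart.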