HOW DO I PARSE THE DATA IN THESE HTML TAGS?


I am a Python newbie. I was trying to get some data from my school's site. Below is the code I wrote to scrape only the news items. It works, but I want the title, date and paragraph to each be on a new line. I feel there is something missing in my code, but I can't put my finger on it. I need your help, guys.
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://www.kibabiiuniversity.ac.ke")
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly to avoid a warning
for i in soup.findAll("div", {"class": "blog-thumbnail-inside"}):
    print(i.get_text())
    print("----------" * 20)
And here is the HTML tag structure of the page I'm trying to scrape:
<div class="blog-thumbnail-inside">
  <h2 class="blog-thumbnail-title post-widget-title-color gdl-title">
    <a href="http://www.kibabiiuniversity.ac.ke">
      Completion of fees & collection of exam cards.
    </a>
  </h2>
  <div class="blog-thumbnail-info post-widget-info-color gdl-divider">
    <div class="blog-thumbnail-date">Posted on 09 Jan 2017</div>
  </div>
  <div class="blog-thumbnail-context">
    <div class="blog-thumbnail-content">
      Download the information on fee payment and collection of exam cards..
    </div>
  </div>
</div>

for i in soup.findAll("div", {"class": "blog-thumbnail-inside"}):
    # The first argument to get_text() is the string used to join the pieces of text.
    print(i.get_text('\n'))
    print("----------" * 20)
Output:
Final Undergraduate Examination Timetable for Semester 1 2016/2017
Posted on 11 Jan 2017
Download Undergraduate Timetable
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Vacancies for Administrative and Teaching Positions
Posted on 11 Jan 2017
Kibabii University is a fully fledged public institution of higher education and research in Kenya with a student population of 6400 and staff population of 346. The University seeks to appoint innovative individuals with experience and excellent credentials
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
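If the goal is to work with the title, date and summary as separate values rather than one block of text, each piece can also be selected by its class. This is only a minimal sketch, assuming the page markup matches the structure shown in the question; the `parse_news` helper and the `SAMPLE_HTML` constant are names made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the structure shown in the question.
SAMPLE_HTML = """
<div class="blog-thumbnail-inside">
  <h2 class="blog-thumbnail-title post-widget-title-color gdl-title">
    <a href="http://www.kibabiiuniversity.ac.ke">
      Completion of fees &amp; collection of exam cards.
    </a>
  </h2>
  <div class="blog-thumbnail-info post-widget-info-color gdl-divider">
    <div class="blog-thumbnail-date">Posted on 09 Jan 2017</div>
  </div>
  <div class="blog-thumbnail-context">
    <div class="blog-thumbnail-content">
      Download the information on fee payment and collection of exam cards..
    </div>
  </div>
</div>
"""


def parse_news(html):
    """Return a list of (title, date, content) tuples, one per news item."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.find_all("div", class_="blog-thumbnail-inside"):
        title = block.find("h2", class_="blog-thumbnail-title")
        date = block.find("div", class_="blog-thumbnail-date")
        content = block.find("div", class_="blog-thumbnail-content")
        # get_text(strip=True) trims the whitespace around each piece;
        # a missing element becomes an empty string instead of raising.
        items.append(tuple(
            part.get_text(strip=True) if part else ""
            for part in (title, date, content)
        ))
    return items


for title, date, content in parse_news(SAMPLE_HTML):
    print(title)
    print(date)
    print(content)
    print("-" * 40)
```

Keeping the three fields apart makes it easy to print them on separate lines, as asked, or to write them to a CSV file later.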

Related

Beautiful Soup and Requests - add missing </td></tr> to one line of HTML code

I am currently coding in Python using Beautiful Soup. The website I am trying to extract data from is http://xml.coverpages.org/country3166.html
On the whole I can get everything I want working. I am extracting the country code and country name from the HTML using the <tr> tag. This is for a project I have set myself.
The problem is that the source HTML is missing some closing tags on one of the countries (Moldova). See below. This means when I loop through my code it stops doing what I need at Moldova.
<tr valign=top><td>MA</td><td>Morocco</td></tr>
<tr valign=top><td>MC</td><td>Monaco</td></tr>
<tr valign=top><td>MD</td><td>Moldova, Republic of
<tr valign=top><td>MG</td><td>Madagascar</td></tr>
I know I could just create a new text file and amend it manually, but is there anything I can do with Beautiful Soup to fix this? My plan was to iterate through each line until Moldova is found and then append </td></tr> to the end. Is there a more efficient way?
Thanks

When I inspect the source you've linked, the HTML seems fine, so there's probably a mistake in the way you are scraping the data.
A small example where we search for each tr, get its children (2x td), and parse those as code and country to show a list:
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
response = http.request('GET', 'http://xml.coverpages.org/country3166.html')
soup = BeautifulSoup(response.data, 'html.parser')
for tr in soup.findAll("tr"):
    # Each row is expected to hold two cells: the code and the country name.
    childs = tr.findChildren()
    code = childs[0].getText()
    country = childs[1].getText()
    print(code, country)
This will output:
AD Andorra
AE United Arab Emirates
AF Afghanistan
AG Antigua & Barbuda
AI Anguilla
AL Albania
AM Armenia
AN Netherlands Antilles
AO Angola
AQ Antarctica
AR Argentina
AS American Samoa
... and many more, including Moldova and beyond
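If a row really is missing its closing tags in the HTML you receive, one option is to treat every opening `<tr>` as the end of the previous row, so an unclosed row cannot swallow the rows after it. The following is a rough sketch of that idea using only the standard library's `html.parser` module rather than Beautiful Soup; the `RowCollector` class is a hypothetical helper, and the sample rows are the ones quoted in the question.

```python
from html.parser import HTMLParser

SAMPLE_ROWS = """
<tr valign=top><td>MA</td><td>Morocco</td></tr>
<tr valign=top><td>MC</td><td>Monaco</td></tr>
<tr valign=top><td>MD</td><td>Moldova, Republic of
<tr valign=top><td>MG</td><td>Madagascar</td></tr>
"""


class RowCollector(HTMLParser):
    """Collect <td> text row by row, tolerating missing </td> and </tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        if self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # a new <tr> implicitly ends the previous row
            self._row = []
        elif tag == "td" and self._row is not None:
            self._close_cell()  # a new <td> implicitly ends the previous cell
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._row is not None:
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()       # flush a row left open at the end of input


parser = RowCollector()
parser.feed(SAMPLE_ROWS)
parser.close()
for code, country in parser.rows:
    print(code, country)
```

With this approach there is no need to patch the file by hand or to append `</td></tr>` to the Moldova line before parsing.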

How to remove redundant space in BeautifulSoup output

I intend to scrape a website using BeautifulSoup. I'm working on the following HTML :
html = """
<div id="article-body" itemprop="articleBody">
<p>
<span class="quote down bgQuote" data-channel="/quotes/zigman/170324/composite" data-bgformat="">
<a class="qt-chip trackable" data-fancyid="XNYSStockSLB" href="/investing/stock/slb?mod=MW_story_quote" data-track-mod="MW_story_quote">
SLB,
<span class="bgPercentChange">-3.04%</span>
</a>
</span>
reported late Thursday
higher third-quarter profit that beat targets and sales only slightly below estimates
. Schlumberger’s results came a day after rival Halliburton Co.
<span class="quote down bgQuote" data-channel="/quotes/zigman/228631/composite" data-bgformat="">
<a class="qt-chip trackable" data-fancyid="XNYSStockHAL" href="/investing/stock/hal?mod=MW_story_quote" data-track-mod="MW_story_quote">
HAL,
<span class="bgPercentChange">-0.66%</span>
</a> """
I want to get plain text without any redundant whitespace. I followed the answer by Twig, but SLB and -3.04%, and also HAL and -0.66%, still end up on different lines. My desired output would be:
SLB, -3.04% reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates. Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66% also posted higher-than-expected profit.
Here is my code:
from bs4 import BeautifulSoup
import re

newsText = BeautifulSoup(html, "html.parser")
text = list(newsText.stripped_strings)
finalText = "\n\n".join(text) if text else ""
re.sub(r'[\ \n]{2,}', '', finalText)  # note: the result is discarded, so finalText is unchanged
print(finalText)
I am very thankful in advance.

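from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# strip=True trims each text fragment; separator=' ' joins the fragments with one space.
text = soup.get_text(strip=True, separator=' ')
print(text)
Output: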
SLB, -3.04% reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates . Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66%
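The joined text still has a space before the full stop ("estimates . Schlumberger"), because the period begins its own text fragment in the HTML. If that matters, a small post-processing step can tidy it up. This is a minimal sketch using only the standard library; the `tidy` helper name is an assumption, and removing every space before punctuation is a simplification that may not suit all texts.

```python
import re


def tidy(text):
    """Collapse runs of whitespace and drop spaces before punctuation."""
    # Newlines, tabs and repeated spaces all become a single space.
    text = " ".join(text.split())
    # "estimates ." becomes "estimates."
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


raw = "SLB,\n-3.04% reported late Thursday   estimates . Schlumberger's results"
print(tidy(raw))
```

Calling `tidy(text)` on the string returned by `get_text()` leaves the words untouched and only changes the spacing.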

What's wrong with my structure data?

Our rich snippets suddenly disappeared more than 4 months ago. Some errors were reported in GWT (Google Webmaster Tools); I corrected everything and the errors are now decreasing (only 5 left). Here is my code:
<section class="c-center" itemscope itemtype="http://schema.org/Product">
<div>
<h1><span itemprop="name">Product name</span> <span itemprop="brand" class="brand">Brand of product</span></h1>
<div id="reviews" itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
<div class="rating">
<meta itemprop="ratingValue" content="4.8" />
<meta itemprop="ratingCount" content="56" />
<div class="fill" style="width:96%"></div>
<div class="stars"></div>
</div>
<div class="rating-info">
Based on 56 reviews - Write a review
</div>
</div>
</div>
<div id="img">
<img src="/link-to-image.jpg" alt="Img alt" itemprop="image" />
</div>
<div id="info">
<meta itemprop="url" content="site.com/link-of-product/">
<div id="price-container" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="priceCurrency" content="EUR">
<meta itemprop="gtin13" content="1234567899999">
<span class="price" itemprop="price">19,95 €</span> <del>28,50 €</del> -
<span class="stock"><link itemprop="availability" href="http://schema.org/InStock">Available</span>
</div>
</div>
</section>
Here are my questions:
1. Is there anything wrong?
2. I've seen in many posts that the currency should not be inside itemprop="price", but Google's examples do include it! What should I do?
3. Should I use ratingCount or reviewCount?
4. Some products exist in different sizes with different prices. Is it recommended to include an AggregateOffer with the lowest and highest price?
Thanks a lot

How does it appear visually?
The structured data linter shows a typical snippet which looks good and has a star rating, and there are no errors in Google's tool. A few things stand out:
url has no protocol; set it to a full URL such as http://yoursite.com/page1
price should be a number only, which could well be affecting search results; currency is a separate field, so it should not be embedded in price as well
use <meta> to give your price with a full stop as the decimal separator, not a comma, and write large values as 1234567.89, not 1,234,567.89 or 1.234.567,89, but display it as you would normally
Price info from http://schema.org/:
Use the priceCurrency property (with ISO 4217 codes e.g. "USD") instead of including ambiguous symbols such as '$' in the value.
Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator.
Note that both RDFa and Microdata syntax allow the use of a "content=" attribute for publishing simple machine-readable values alongside more human-friendly formatting.
Use values from 0123456789 (Unicode 'DIGIT ZERO' (U+0030) to 'DIGIT NINE' (U+0039)) rather than superficially similar Unicode symbols.
Google actually gives this example on its policies page:
<span itemprop="priceCurrency" content="USD">$</span><span itemprop="price">119.99</span>
previous Offer price: you could mark up the <del> price as structured data for the expired Offer, with priceValidUntil set to a date in the past; the current price can also have an expiry date
consider setting itemCondition to http://schema.org/NewCondition
image URLs: full URLs seem to be preferred over relative paths. Your /link-to-image.jpg is interpreted as http://example.com/link-to-image.jpg, not http://site.com/link-to-image.jpg, in the testing tool. I'm unsure whether the same happens when testing directly from the URL, but it seems best not to be ambiguous
lastly, use a shopping search tool, including Google Shopping, to search for a best seller and see if it can be found by price, brand, availability etc. If competitor sites appear first, you can even run their URLs through the structured data tester to see if you are missing anything
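The price and URL points above could be handled when the page is generated. The sketch below uses only the standard library; `normalize_price` and `absolute_url` are hypothetical helper names, and the price function assumes a European-style display string in which the comma is the decimal separator.

```python
from decimal import Decimal
from urllib.parse import urljoin


def normalize_price(display_price):
    """Turn a display price such as '19,95 €' into '19.95'.

    Assumes ',' is the decimal separator and that '.' or spaces are only
    used to group thousands; other formats need their own handling.
    """
    digits = "".join(ch for ch in display_price if ch in "0123456789,")
    # Decimal validates the result and raises on malformed input.
    return str(Decimal(digits.replace(",", ".")))


def absolute_url(page_url, link):
    """Resolve a relative link against the URL of the page it appears on."""
    return urljoin(page_url, link)


print(normalize_price("19,95 €"))
print(normalize_price("1.234.567,89 €"))
print(absolute_url("http://site.com/link-of-product/", "/link-to-image.jpg"))
```

The normalized value would go into the `content` attribute of the price element, while the visible text keeps the local formatting.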

Google Rich Snippets Not Working

I'm working on a website for a friend (www.texasfriendlydds.com) and am trying to give them an edge with the Rich Snippets that Google allegedly loves. It's a defensive driving school with 10 locations in the Austin area. I've placed the schema.org code within the address of each location, but when searching for 'defensive driving austin' I do not see any of the locations listed. I have 10 copies of the following code, one per location (with a different address for each):
<div itemscope itemtype="http://schema.org/LocalBusiness">
<span itemprop="name">Texas Friendly Defensive Driving</span><br />
<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">13201 Ranch Road 620</span><br />
<span itemprop="addressLocality">Austin</span> <span itemprop="addressRegion">TX</span> <span itemprop="postalCode">78750</span>
</div>
<div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
<span itemprop="ratingValue">4.6</span> stars - based on <span itemprop="reviewCount">24</span> reviews
</div>
Free meal w/ <span itemprop="priceRange">$40 tuition</span><br /><br />
<meta itemprop="openingHours" content="Thursdays 3:30pm - 9:30pm"><b>Thursdays 3:30pm - 9:30pm</b><br />
</div>
In addition, at the bottom of the page, I aggregate all the reviews in an attempt to get rich snippet star ratings in organic search, to no avail. I've compared my code directly with the following site:
- http://www.microdatagenerator.com/aggregate-rating-schema-generator/
They were exactly the same (minus the values). You can find their snippets by Googling 'aggregate rating schema' and find the 2nd listing with rich snippet stars and 956 ratings. At one point I read that you need to show proof of your ratings, but this site doesn't do that and they have them.
I've used the Google Structured Data Testing Tool (https://developers.google.com/structured-data/testing-tool/) and everything comes out peachy. So why in the world am I not seeing any results from this?

We (Google) don't accept rich snippets for homepages; rich snippet annotations should be placed on leaf pages.

Is there any tool to automatically capture screenshots of running ad campaigns? [closed]

The objective would be to monitor a bunch of websites and automatically take screenshots when a specific ad tag (previously defined, belonging to a new campaign) is called.

automatically take screenshots when a specific ad tag is called
I'm not really sure what you mean when you say that a specific ad tag is "called." What do you mean when you say that?
If I understand you correctly, you seem to want to capture an ad if it's displayed on a website. If that's the correct understanding then you can just save the HTML of an ad and "replay" it later. You do have to check for certain tags in the HTML in order to detect if an ad is showing.
So let's look at how things work with Google ads. Let's say you search for the keyword "dentist": http://www.google.com/search?hl=en&q=dentist
First you have to detect that an ad exists and the most common way this is indicated on Google is with the <div id="tads"> div. If you look at the HTML, the div tag will look like this:
<div id="tads" class="c" style="margin:0 0 14px;padding-top:2px;padding-right:36px;padding-bottom:2px;padding-left:8px;min-height:0">
<h2 style="font-size:11px;font-weight:normal;margin:0 -35px 0 0;padding:1px 4px 0 0;text-align:right">
<ol style="padding:3px 0" onmouseover="return true">
<li class="taf">
// collapsed html for the purposes of this example
<li class="tam">
// collapsed html for the purposes of this example
<li class="tal">
// collapsed html for the purposes of this example
</ol>
<div id="topDnsPrefetchHints">
</div>
If you take all the HTML in the parent node, plop it in the middle of some <html></html> tags and save it to a file, you will have essentially made a copy of the ad. When you open the saved page you will see the ad (although it will not have any of the pretty formatting):
<html>
<head>
</head>
<body>
<span style="margin-right:0" id="taw"> <div></div> <div style="margin:0 0 14px;padding-top:2px;padding-right:36px;padding-bottom:2px;padding-left:8px;min-height:0" id="tads" class="c"><h2 style="font-size:11px;font-weight:normal;margin:0 -35px 0 0;padding:1px 4px 0 0;text-align:right">Ads<span> - <a onclick="google.x(this,function(){if(google.wta){google.wta.toggleLightbox(this,'0CAYQJw','0CAgQKQ',380);}});return false;" id="wtata" class="gl wtall" href="javascript:void(0)">Why these ads?</a><span style="display:none" class="wtalbc">These ads are based on your current search terms.<br><div style="padding-top:12px">Visit Google’s <a class="std gl wtaal" href="#">Ads Preferences Manager</a> to learn more or opt out.</div></span></span></h2><ol style="padding:3px 0" onmouseover="return true"><li class="taf"><div sig="1W8" cved="0CA8Qhw0wAQ" pved="0CA0QhQ0wAQ" class="vsc vsta"><h3 onmouseover="document.getElementById('topDnsPrefetchHints').innerHTML='<link rel="dns-prefetch" href="//www.1800dentist.com">';">Find a Local <b>Dentist</b> - 24/7 Online Appointment Booking.</h3><div class="vspib" aria-label="Result details" role="button" tabindex="0"><div class="vspii"><div class="vspiic"></div></div></div><div><div class="kv"><cite>www.1800<b>dentist</b>.com/Chicago<b>Dentist</b></cite></div></div><div><div style="display:none" id="poAs0p1" class="esc slp">You +1'd this publicly. <a class="fl" href="#">Undo</a></div></div><span class="ac">Call or Visit 1-800-<b>DENTIST</b>® Today! 
</span><div><div style="margin-bottom:0px;margin-top:4px" class="oslk osi">Find Gentle Pain-Free Dentists - Find a Dentist Online for Free</div></div></div></li><li class="tam"><div sig="nxV" cved="0CBYQhw0wAg" pved="0CBQQhQ0wAg" class="vsc vsta"><h3 onmouseover="document.getElementById('topDnsPrefetchHints').innerHTML='<link rel="dns-prefetch" href="//dentalsalon.com">';"><b>DENTIST</b> at Dental Salon - Super Convenient</h3><div class="vspib" aria-label="Result details" role="button" tabindex="0"><div class="vspii"><div class="vspiic"></div></div></div><div><div class="kv"><cite>www.dentalsalon.com/Great-Reviews</cite></div></div><div><div style="display:none" id="poAs0p2" class="esc slp">You +1'd this publicly. <a class="fl" href="#">Undo</a></div></div><span class="ac">Open 7 Days and Evenings - Affordable - 12 <b>Dentists</b> </span><div><div style="margin-bottom:0px;margin-top:0px">Suite 800, 939 W North Ave, Chicago, IL - 1 (312) 642-3370 - <a google.adping('t',2,11)"="" &&="" class="flonmousedown="google.adPing" href="http://maps.google.com/maps?hl=en&um=1&ie=UTF-8&daddr=Suite+800,+939+W+North+Ave,+Chicago,+IL&f=d&saddr=&iwstate1=dir:to&fb=1&geocode=3891923684064231596,41.910331,-87.652690&sa=X&ei=NnM5T9nSC-r20gGymeXFAg&ved=0CBMQmxA">Directions</a></div></div></div></li><li class="tal"><div sig="ohs" cved="0CCAQhw0wAw" pved="0CB4QhQ0wAw" class="vsc vsta"><h3 onmouseover="document.getElementById('topDnsPrefetchHints').innerHTML='<link rel="dns-prefetch" href="//www.BigSmileDental.com">';">Big Smile Dental - $1 Exam & X-Rays or Free Whitening</h3><div class="vspib" aria-label="Result details" role="button" tabindex="0"><div class="vspii"><div class="vspiic"></div></div></div><div><div class="kv"><cite>www.bigsmiledental.com</cite></div></div><div><div style="display:none" id="poAs0p3" class="esc slp">You +1'd this publicly. 
<a class="fl" href="#">Undo</a></div></div><span class="ac">As Seen on FoxNews (773)772-8400 </span><div><div style="margin-bottom:0px;margin-top:4px" class="oslk osi">Porcelain Veneers - Invisalign - Teeth Whitening - Dental Implants</div></div></div></li></ol><div id="topDnsPrefetchHints"><link href="//www.1800dentist.com" rel="dns-prefetch"></div></div> </span>
</body>
</html>
Finally, if you want the pretty formatting, then reference Google's style sheet and now you have an almost identical copy of the ad that Google is displaying.
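The detect-and-save steps above can be sketched in a few lines. This is only an illustration, assuming Beautiful Soup is installed and that the ad container is still marked with id="tads"; it saves that div itself rather than its parent node. The `extract_ad` helper, the cut-down `PAGE_HTML` sample and the output file name are made up for the example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Cut-down stand-in for a results page; real pages are far larger.
PAGE_HTML = """
<html><body>
<div id="res">organic results</div>
<div id="tads" class="c"><ol><li class="taf">Find a Local Dentist</li></ol></div>
</body></html>
"""


def extract_ad(page_html, ad_id="tads"):
    """Return a standalone HTML document holding the ad block, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    ad = soup.find("div", id=ad_id)
    if ad is None:
        return None  # no ad is showing on this page
    # str(tag) serialises the element together with all of its children.
    return "<html>\n<head>\n</head>\n<body>\n" + str(ad) + "\n</body>\n</html>\n"


snapshot = extract_ad(PAGE_HTML)
if snapshot is not None:
    # Write the copy to the system temp directory so it can be replayed later.
    path = os.path.join(tempfile.gettempdir(), "ad_snapshot.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(snapshot)
```

A different `ad_id` (or a different lookup altogether) would be needed for each advertiser, since each one marks up its ads differently.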
Things to note:
Every advertiser (Google, Bing, Yahoo, etc.) has a different format, so you may have to do the above for every advertiser you want to monitor.
Even within a single advertiser, there may be variations on the ad depending on the browser that is displaying it (Google displays ads differently on iPhones than on Android phones).
Finally, since I already work for a company that provides the ad capturing service (and more), I might as well plug our solutions.
We have several examples of how the ads look once captured.
