Beautiful Soup and Requests - add missing </td></tr> to one line of HTML code - beautifulsoup

I am writing Python code with Beautiful Soup. The website I am trying to extract data from is http://xml.coverpages.org/country3166.html
On the whole I can get everything I want working. I am extracting the country code and country name from the HTML using the <tr> tag. This is for a personal project.
The problem is that the source HTML is missing some closing tags for one of the countries (Moldova); see below. As a result, my loop stops doing what I need when it reaches Moldova.
<tr valign=top><td>MA</td><td>Morocco</td></tr>
<tr valign=top><td>MC</td><td>Monaco</td></tr>
<tr valign=top><td>MD</td><td>Moldova, Republic of
<tr valign=top><td>MG</td><td>Madagascar</td></tr>
Thanks
I know I could just save the page to a text file and amend it manually, but is there anything I can do with Beautiful Soup to fix this? My plan was to iterate through each line until Moldova is found and then append </td></tr> to the end. Is there a more efficient way?

If I inspect the source you've linked, the HTML seems fine; there is probably a mistake in the way you are scraping the data.
A small example where we search for each tr, get its children (2x td), and parse those as code and country to show a list:
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
response = http.request('GET', 'http://xml.coverpages.org/country3166.html')
soup = BeautifulSoup(response.data, 'html.parser')

# Each table row holds two cells: the country code and the country name
for tr in soup.findAll("tr"):
    childs = tr.findChildren()
    code = childs[0].getText()
    country = childs[1].getText()
    print(code, country)
Will output:
AD Andorra
AE United Arab Emirates
AF Afghanistan
AG Antigua & Barbuda
AI Anguilla
AL Albania
AM Armenia
AN Netherlands Antilles
AO Angola
AQ Antarctica
AR Argentina
AS American Samoa
... and many more, including Moldova and beyond
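If the copy of the page you fetched really does lack the closing tags, the line-by-line repair described in the question can be done with plain string handling before the text is handed to a parser. The following is only a sketch: the function name `close_open_rows` is made up for illustration, and it assumes every table row starts on its own line, as in the excerpt from the question.

```python
def close_open_rows(html_text):
    """Append '</td></tr>' to any table-row line that lacks its closing tags."""
    fixed = []
    for line in html_text.splitlines():
        stripped = line.rstrip()
        # Only touch lines that open a row but do not close it
        if stripped.lstrip().startswith("<tr") and not stripped.endswith("</tr>"):
            stripped += "</td></tr>"
        fixed.append(stripped)
    return "\n".join(fixed)


sample = """<tr valign=top><td>MC</td><td>Monaco</td></tr>
<tr valign=top><td>MD</td><td>Moldova, Republic of
<tr valign=top><td>MG</td><td>Madagascar</td></tr>"""

repaired = close_open_rows(sample)
print(repaired)
```

The repaired string can then be passed to BeautifulSoup in place of the raw response body.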

Related

How do I get sentences including links?

I would like to collect Japanese articles found through a Google search. To extract the Japanese sentences, I run the following code to get the text of the tag containing the most Japanese words.
texts = mostTag.xpath('<<path>>/text()').extract()
text = ''
for s in texts:
    text += s
But this code has a problem when the article has a link between sentences, as below.
<div class="sample">
<p>
"A"
B
"C"
</p>
</div>
In this case my program gets AC, but what I want is ABC. I would appreciate it if anyone could tell me how to get the sentence as 'ABC'.
You can try to use string():
text = mostTag.xpath('string(//div[@class="sample"])').extract_first()
Alternatively, use html2text.
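The difference between `text()` and `string()` is that the first selects only text nodes that are direct children of the element, while the second concatenates every text node underneath it. The sketch below shows the same distinction outside Scrapy, using only the standard library. The markup is an assumption: the question's sample is rewritten on one line with B wrapped in a placeholder link, since the link itself is not visible in the post.

```python
import xml.etree.ElementTree as ET

# Assumed markup: B sits inside a link between the two sentences
markup = '<div class="sample"><p>A<a href="#">B</a>C</p></div>'
p = ET.fromstring(markup).find("p")

# Text nodes that are direct children of <p>, comparable to the XPath 'p/text()'
direct = (p.text or "") + "".join(child.tail or "" for child in p)

# Every text node under <p> in document order, comparable to 'string(p)'
full = "".join(p.itertext())

print(direct)  # AC
print(full)    # ABC
```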

How do I parse the data in these HTML tags?

I am a Python newbie. I was trying to get some data from my school's site. Below is the code I wrote to scrape only the news items. It works, but I want the title, date and paragraph to be on separate lines. I feel there is something missing in my code, but I can't put my finger on it. I would appreciate your help.
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = urlopen("http://www.kibabiiuniversity.ac.ke")
soup = BeautifulSoup(page)
for i in soup.findAll("div", {"class": "blog-thumbnail-inside"}):
    print (i.get_text())
    print ("----------" *20)
And here is the html tag structure of the page I'm trying to scrape.
<div class="blog-thumbnail-inside">
<h2 class="blog-thumbnail-title post-widget-title-color gdl-title">
<a href="http://www.kibabiiuniversity.ac.ke">
Completion of fees & collection of exam cards.
</a>
</h2>
<div class="blog-thumbnail-info post-widget-info-color gdl-divider">
<div class="blog-thumbnail-date">Posted on 09 Jan 2017</div>
</div>
<div class="blog-thumbnail-context">
<div class="blog-thumbnail-content">
Download the information on fee payment and collection of exam cards..
</div>
</div>
</div>
for i in soup.findAll("div", {"class": "blog-thumbnail-inside"}):
    print (i.get_text('\n')) #You can specify a string to be used to join the bits of text together
    print ("----------" *20)
out:
Final Undergraduate Examination Timetable for Semester 1 2016/2017
Posted on 11 Jan 2017
Download Undergraduate Timetable
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Vacancies for Administrative and Teaching Positions
Posted on 11 Jan 2017
Kibabii University is a fully fledged public institution of higher education and research in Kenya with a student population of 6400 and staff population of 346. The University seeks to appoint innovative individuals with experience and excellent credentials
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
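Note that `get_text('\n')` also joins the whitespace-only strings that sit between tags, so blank lines can appear in the result. A minimal sketch of one way around that, using `stripped_strings` to keep only the non-empty pieces. The markup here is a shortened, assumed version of the structure in the question (trimmed class lists, a placeholder `#` link and shortened text), not the live page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one news item on the page
html = """
<div class="blog-thumbnail-inside">
  <h2 class="blog-thumbnail-title"><a href="#">Completion of fees</a></h2>
  <div class="blog-thumbnail-date">Posted on 09 Jan 2017</div>
  <div class="blog-thumbnail-content">Download the information on fee payment.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all("div", class_="blog-thumbnail-inside"):
    # stripped_strings yields each text fragment with surrounding whitespace removed
    lines = list(item.stripped_strings)
    print("\n".join(lines))
    print("-" * 40)
```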

How to remove redundant space in BeautifulSoup output

I intend to scrape a website using BeautifulSoup. I'm working on the following HTML :
html = """
<div id="article-body" itemprop="articleBody">
<p>
<span class="quote down bgQuote" data-channel="/quotes/zigman/170324/composite" data-bgformat="">
<a class="qt-chip trackable" data-fancyid="XNYSStockSLB" href="/investing/stock/slb?mod=MW_story_quote" data-track-mod="MW_story_quote">
SLB,
<span class="bgPercentChange">-3.04%</span>
</a>
</span>
reported late Thursday
higher third-quarter profit that beat targets and sales only slightly below estimates
. Schlumberger’s results came a day after rival Halliburton Co.
<span class="quote down bgQuote" data-channel="/quotes/zigman/228631/composite" data-bgformat="">
<a class="qt-chip trackable" data-fancyid="XNYSStockHAL" href="/investing/stock/hal?mod=MW_story_quote" data-track-mod="MW_story_quote">
HAL,
<span class="bgPercentChange">-0.66%</span>
</a> """
I want to get plain text without any redundant space. I followed the answer by Twig, but SLB and -3.04%, and also HAL and -0.66%, are still placed on different lines. My desired output would be:
SLB, -3.04% reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates. Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66% also posted higher-than-expected profit.
This is my code:
import urllib2
from bs4 import BeautifulSoup
import re
newsText = soap(html)
text = list(newsText.stripped_strings)
finalText = "\n\n".join(text) if descriptions else ""
re.sub(r'[\ \n]{2,}', '', finalText)
print finalText
Thank you in advance.
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text(strip=True, separator=' ')
print(text)
out:
SLB, -3.04% reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates . Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66%
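The output above still has a space before the full stop ("estimates ."). Also, the `re.sub` call in the question has no effect on what is printed, because `re.sub` returns a new string rather than modifying its argument. A small sketch of a whitespace clean-up step that could be applied to the extracted text; the helper name `normalize_spaces` and the shortened sample string are made up for illustration.

```python
import re


def normalize_spaces(text):
    """Collapse runs of whitespace and remove spaces before punctuation."""
    collapsed = " ".join(text.split())
    # Drop whitespace that directly precedes a punctuation mark
    return re.sub(r"\s+([.,;:!?])", r"\1", collapsed)


raw = "SLB,\n   -3.04%\n reported late Thursday\n estimates\n . Schlumberger's results"
cleaned = normalize_spaces(raw)
print(cleaned)
```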

How to read a text when it is not in any HTML tag

How can I find text in following HTML:
style="background-color: transparent;">
<a href="/">Home</a>
<a id="brea dcrumbs-790" class=" main active mainactive" href="/products">Products</a>
<a href="/products/fruit-and-creme-curds">Fruit & Crème Curds</a>
Crème Banana Curd
</li>
</ul>
</div>
This is the HTML for a breadcrumb: the first three items are links and the fourth is the page name. I want to read the page name (Crème Banana Curd) from the breadcrumb, but since it is not inside any node of its own, how can I capture it?
If the text isn't inside any other tag, then it sits directly in the body tag.
So you can use something like the following to locate it:
html/body/text()
The question is somewhat vague without the proper HTML source, but you may still try the solution below, which stores the text in a variable:
var breadcrumb = FindElement(By.XPath(".//*[@id='brea dcrumbs-790']/following-sibling::a")).Text;
Use the code below:
WebElement elem = driver.findElement(By.xpath("//*[contains(text(),'Crème Banana Curd')]"));
elem.getText();
Hope this helps.
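Text that follows the last link without a tag of its own is still a text node of the parent element, so it can be reached relative to that last link. The sketch below illustrates the idea with the Python standard library rather than Selenium. It assumes a well-formed fragment with the missing `<ul>` and `<li>` opening tags filled in as guesses; a real HTML page would need an HTML parser or a browser driver instead.

```python
import xml.etree.ElementTree as ET

# Assumed, well-formed version of the breadcrumb from the question
markup = """<ul><li>
<a href="/">Home</a>
<a href="/products">Products</a>
<a href="/products/fruit-and-creme-curds">Fruit &amp; Crème Curds</a>
Crème Banana Curd
</li></ul>"""

li = ET.fromstring(markup).find("li")
last_link = li.findall("a")[-1]

# The untagged page name is the text that trails the last link
page_name = (last_link.tail or "").strip()
print(page_name)
```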

WebMatrix Nested WebGrid VB code

I have looked at the code in both the following posts:
formatting in razor nested webgrid, replied to by nemesv in October 2011 and
Razor Nested WebGrid, replied to by Chad Moran in April 2011.
They both seem to be close to my problem but the code is C# based, I believe, and I am having difficulty converting it to VB. I am also not sure they are exactly where I am at. I am particularly bemused by the following line, because of the two equals signs and double reference to subGrid.
WebGrid subGrid = subGrid = new WebGrid(item.SubItems)
I am also not sure whether topGrid and subGrid are just generic names, used for the purpose of illustration, or whether they are key words.
As a very relevant aside I will mention that this point in my web page project has held me up for five years now (I am not exaggerating - I just stopped working on the project for two years because of it). I have tried using ASP in VWD and now Grid View in WebMatrix and I hope I will not fail again.
Database record
Fields: Publisher_Name, Publisher_City, Series_Published, No_of_Series
Record Example: Price Stern Sloan, Baltimore, JKLMNO, 6
My two planned grid names
Publishers_Grid (top)
Series_Grid (sub)
What I am trying to do
For each of the characters in the string JKLMNO, access a second table, where each letter is the primary key for a record in that table.
Retrieve the value of the field, Back_Cover_Image, in that second table, which will be the file name or, at least, the unique part of the file name, for the image to be displayed.
If I go with the partial unique bit of file name approach, build the full file names for the images. And then -
Display as a second web grid row, the images thus pointed at, in the record example, that would be 6 images.
Thus I would end up, for the example record, something like the following (I have used XX to stand for an image): -
Price Stern Sloan Baltimore XX XX XX XX XX XX
I certainly hope I am not wasting the valuable time of experts who I greatly admire. I'm just trying to achieve something that seems quite simple to me, having originally been a PL/1 programmer, 30 years ago, and a great user of nested arrays within that language, but I just can't work out the syntax in VB, Razor and WebMatrix.
I look forward to some constructive answers, and please do use VB.
My WebMatrix page so far
@Code
    Layout = "~/Shared/Layouts/_Layout.vbhtml"
    Dim HWB_Database As Database = Database.Open("How_and_Why_Wonder_Books")
    Dim HWB_Publishers_All_sqlCommand = "SELECT * FROM Publishers ORDER BY Publisher_Code"
    Dim Publishers_Data = HWB_Database.Query(HWB_Publishers_All_sqlCommand)
    Dim Publishers_Grid = New WebGrid(Publishers_Data)
End Code
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>How and Why Wonder Books - Publishers</title>
</head>
<body>
<div style="margin-left: 100px">
<p style="Width: 1020px; border-width: 1px" class="InstructionsHeader">
Click on a publisher to see the list of titles produced under that imprint, click on a thumbnail to see details of that series type.
</p>
</div>
<br>
<br>
<div id="Publishers_Grid_Display">
@Publishers_Grid.GetHtml(columns:= Publishers_Grid.Columns(
    Publishers_Grid.Column("Publisher_Name"),
    Publishers_Grid.Column("Place_of_Publication"),
    Publishers_Grid.Column("Series_Published")
    )
)
</div>
Thank you for your promptings and encouragement.
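Independent of the grid syntax, the lookup steps described under "What I am trying to do" can be sketched in a language-neutral way. The example below is in Python, not VB, and is only an illustration of the logic: the dictionary stands in for the second table, and the image names, the `Images` folder and the `.jpg` extension are all invented placeholders.

```python
# Hypothetical stand-in for the second table: series letter -> Back_Cover_Image
back_cover_images = {
    "J": "j_back", "K": "k_back", "L": "l_back",
    "M": "m_back", "N": "n_back", "O": "o_back",
}


def image_files_for(series_published, image_table, folder="Images", ext=".jpg"):
    """Return one image path per series letter, skipping unknown letters."""
    return [
        f"{folder}/{image_table[letter]}{ext}"
        for letter in series_published
        if letter in image_table
    ]


# The example record from the question
publisher = {
    "Publisher_Name": "Price Stern Sloan",
    "Publisher_City": "Baltimore",
    "Series_Published": "JKLMNO",
    "No_of_Series": 6,
}

files = image_files_for(publisher["Series_Published"], back_cover_images)
print(publisher["Publisher_Name"], publisher["Publisher_City"], *files)
```

In the page itself, the equivalent of `files` would be built per publisher row and rendered as the second grid row of images.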