Jsoup simple selector code help needed please - selector

I'm having a hard time extracting the data I want from what should be a very simple page.
For example, I have no problem collecting my data from this simple HTML:
<HTML>
<TABLE>
<TABLE WIDTH=100%><TR class=FSS-data-row-highlight>
<TD> Evgeni Malkin, Pit (C/RW)</TD>
<TD class=FSS-data-right> 6 pts in last 2 GP </TD>
</TR>
</TABLE>
</TABLE>
</HTML>
What I need is the 'Evgeni Malkin 6 pts in last 2' string, and extracting it from that snippet works fine. But when I connect to the whole page, it returns nothing. I guess it is because there are tables within tables, but I can't figure out how to proceed. Here is my code:
Document doc = Jsoup.connect("http://forecaster.thehockeynews.com/hockeynews/hockey/statistics.cgi?mlb&mode=hotnot/").get();
Elements scanYearplace = doc.select("tr.FSS-data-row-highlight td");
String yearplace = scanYearplace.text();
In fact I need to grab the info for all the other players too, but it would be a start if I could get that one to work.
Any suggestions?
Thanks in advance!

--- see update below ---
Please be aware that this is fragile: any change to the site could break it. You'll also want to add some error checking. I didn't see the "6 pts in last 2 GP" text you quoted above, but you can grab whatever stat you want with this code; just change the index in stats.get(4).
Document doc = Jsoup.connect("http://forecaster.thehockeynews.com/hockeynews/hockey/statistics.cgi?mlb&mode=hotnot/").get();
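for (Element e : doc.select(".FSS-data-row")) {
    // The player name is the link inside the left-aligned cell (missing in the header row)
    Element td = e.select("td.FSS-data-left > a").first();
    String name = (td != null ? td.text() : null);
    Elements stats = e.select(".FSS-data-right");
    // Index 4 is only safe when the row has at least five stat cells
    String goals = (stats.size() > 4 ? stats.get(4).text() : null);
    System.out.println(name + ":" + goals);
}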
Sample output:
null:null
J. Benn:13
P. Sharp:20
P. Marleau:17
T. Oshie:14
The first null:null is because that is like the header row on the page.
-----UPDATE-----
The URL you had in your post pointed to the wrong page. Here is updated code to get what I think you want.
Document doc = Jsoup.connect("http://forecaster.thehockeynews.com/hockeynews/hockey/statistics.cgi?&mode=hotnot").get();
for (Element e : doc.select("tr.FSS-data-row-highlight")) {
    Element tdname = e.select("td > a").first();
    String name = (tdname != null ? tdname.text() : null);
    Element tdstat = e.select("td.FSS-data-right").first();
    // first() returns null when the row has no matching cell
    String stat = (tdstat != null ? tdstat.text() : null);
    System.out.println(name + ":" + stat);
}
Sample output:
Mathieu Perreault:5 pts in last 2 GP 
Mikhail Grabovski:5 pts in last 2 GP 
James Neal:4 pts in last 2 GP 
Kris Versteeg:4 pts in last 2 GP 
Evgeni Malkin:12 pts in last 6 GP 
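For anyone who wants to try the row-by-row idea without hitting the site, here is a minimal offline sketch. It is written in Python with only the standard library (not Jsoup), it runs against the small HTML sample from the question rather than the live page, and the class name HighlightRows is made up for illustration. It shows the same approach as the answer: pick each highlighted row first, then read its cells, so the nested tables do not matter.

```python
from html.parser import HTMLParser

# The sample markup from the question, nested tables included.
SAMPLE = """<HTML>
<TABLE>
<TABLE WIDTH=100%><TR class=FSS-data-row-highlight>
<TD> Evgeni Malkin, Pit (C/RW)</TD>
<TD class=FSS-data-right> 6 pts in last 2 GP </TD>
</TR>
</TABLE>
</TABLE>
</HTML>"""


class HighlightRows(HTMLParser):
    """Collect the cell texts of every <tr class="FSS-data-row-highlight">."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per matching row
        self._in_row = False    # inside a highlighted <tr>
        self._in_cell = False   # inside a <td> of such a row

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr":
            self._in_row = "FSS-data-row-highlight" in classes
            if self._in_row:
                self.rows.append([])
        elif tag == "td" and self._in_row:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._in_row = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


parser = HighlightRows()
parser.feed(SAMPLE)
rows = [[cell.strip() for cell in row] for row in parser.rows]
for row in rows:
    print(":".join(row))
```

Because each row is collected separately, a row with a missing cell simply produces a shorter list instead of shifting the values of the rows after it.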

Related

How to scrape a data table that has multiple pages using selenium?

I'm extracting NBA stats from my Yahoo fantasy account. Below is the code that I wrote in a Jupyter notebook using Selenium. Each page shows 25 players, and there are 720 players in total. I wrote a for loop that scrapes the players in increments of 25 instead of one by one.
for k in range (0,725,25):
    Players = driver.find_elements_by_xpath('//tbody/tr/td[2]/div/div/div/div/a')
    Team_Position = driver.find_elements_by_xpath('//span[@class= "Fz-xxs"]')
    Games_Played = driver.find_elements_by_xpath('//tbody/tr/td[7]/div')
    Minutes_Played = driver.find_elements_by_xpath('//tbody/tr/td[11]/div')
    FGM_A = driver.find_elements_by_xpath('//tbody/tr/td[12]/div')
    FTM_A = driver.find_elements_by_xpath('//tbody/tr/td[14]/div')
    Three_Points = driver.find_elements_by_xpath('//tbody/tr/td[16]/div')
    PTS = driver.find_elements_by_xpath('//tbody/tr/td[17]/div')
    REB = driver.find_elements_by_xpath('//tbody/tr/td[18]/div')
    AST = driver.find_elements_by_xpath('//tbody/tr/td[19]/div')
    ST = driver.find_elements_by_xpath('//tbody/tr/td[20]/div')
    BLK = driver.find_elements_by_xpath('//tbody/tr/td[21]/div')
    TO = driver.find_elements_by_xpath('//tbody/tr/td[22]/div')
    NBA_Stats = []
    for i in range(len(Players)):
        players_stats = {'Name': Players[i].text,
                         'Position': Team_Position[i].text,
                         'GP': Games_Played[i].text,
                         'MP': Minutes_Played[i].text,
                         'FGM/A': FGM_A[i].text,
                         'FTM/A': FTM_A[i].text,
                         '3PTS': Three_Points[i].text,
                         'PTS': PTS[i].text,
                         'REB': REB[i].text,
                         'AST': AST[i].text,
                         'ST': ST[i].text,
                         'BLK': BLK[i].text,
                         'TO': TO[i].text}
    driver.get('https://basketball.fantasysports.yahoo.com/nba/28951/players?status=ALL&pos=P&cut_type=33&stat1=S_AS_2021&myteam=0&sort=AR&sdir=1&count=' + str(k))
The browser goes through the pages one by one until it's done. When I print out the results, only 1 player has been scraped. What did I do wrong?
(A picture of my code and the printed results was attached.)
It's hard to see what the issue is without looking at the original page (can you provide a URL?). However, looking at this:
next = driver.find_element_by_xpath('//a[@id = "yui_3_18_1_1_1636840807382_2187"]')
"1636840807382" looks like a JavaScript timestamp, so I would guess that the ID you've hardcoded there is dynamically generated and the element "yui_3_18_1_1_1636840807382_2187" no longer exists.

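One way to check the guess in the answer above is to decode the 13-digit number as milliseconds since the Unix epoch. The sketch below uses only the Python standard library; the helper name embedded_timestamp is made up for illustration, and treating any 13-digit run as a timestamp is an assumption, not something the page guarantees.

```python
import re
from datetime import datetime, timezone


def embedded_timestamp(element_id):
    """Return the UTC datetime for a 13-digit millisecond epoch found in element_id, or None."""
    # Look for a run of exactly 13 digits that is not part of a longer number.
    match = re.search(r"(?<!\d)(\d{13})(?!\d)", element_id)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)) / 1000, tz=timezone.utc)


stamp = embedded_timestamp("yui_3_18_1_1_1636840807382_2187")
print(stamp.isoformat())
```

The number decodes to 13 November 2021, around 22:00 UTC, which fits an ID generated when the page was loaded. An ID like that changes on every visit, so a locator built on stable text or structure is likely to last longer than one built on the ID.
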
beautifulsoup: get text (including html tags) between two different tags (</h3> and <h2>)

I am trying to scrape an HTML file structured as follows using BeautifulSoup. Basically, each unit consists of:
one <h2></h2>
one <h3></h3>
more than one <p></p>
Something like the following:
<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>
<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>
<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>
<h2>.....
I want to get text including the <p> tags between </h3> and the next <h2>. The result should be:
for row #1:
<p>text1-1</p>
<p>text1-2</p>
for row #2:
<p>text2-1</p>
<p>text2-2</p>
for row #3:
<p>text3-1</p>
Here is what I tried so far:
num_h2 = len(soup.find_all('h2'))
for i in range(0,num_h2):
    print('---------')
    print(i)
    p_string = ''
    sibling = soup.find_all('h3')[i].find_next_sibling('p').getText()
    if sibling:
        p_string += sibling
    else:
        break
    print(p_string)
The problem with this solution is that it only shows the content of the first <p> under each unit. I do not know how to find out how many <p> tags there are so that I can loop over them. Also, is there a better way to do this than using find_next_sibling()?
Maybe CSS selectors can help:
for s in soup.select('h3'):
    # Walk the tags that follow each <h3> and stop at the next <h2>
    for ns in s.find_next_siblings():
        if ns.name == "h2":
            break
        if ns.name == "p":
            print(ns)
Output:
<p>text1-1</p>
<p>text1-2</p>
<p>text2-1</p>
<p>text2-2</p>
<p>text3-1</p>
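The output above is one flat run of paragraphs, while the question asks for them per row. Here is a rough sketch of the grouping step, written with only the Python standard library (html.parser) so it runs without BeautifulSoup installed. The class name ParagraphGroups is made up, and the sketch keeps only the text inside each <p>, so nested tags would be lost. With BeautifulSoup the same idea would be to append each matching sibling to a list per <h3> instead of printing it.

```python
from html.parser import HTMLParser

HTML = """<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>
<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>
<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>"""


class ParagraphGroups(HTMLParser):
    """Group <p> elements by the <h3> that precedes them."""

    def __init__(self):
        super().__init__()
        self.groups = []       # one list of "<p>...</p>" strings per <h3>
        self._open = False     # True between </h3> and the next <h2>
        self._buffer = None    # text of the <p> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._open = False           # a new unit starts, stop collecting
        elif tag == "h3":
            self.groups.append([])       # start a new row
        elif tag == "p" and self._open:
            self._buffer = ""

    def handle_endtag(self, tag):
        if tag == "h3":
            self._open = True            # collect from </h3> onwards
        elif tag == "p" and self._buffer is not None:
            self.groups[-1].append("<p>" + self._buffer + "</p>")
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer += data


collector = ParagraphGroups()
collector.feed(HTML)
for number, group in enumerate(collector.groups, start=1):
    print("row #%d:" % number)
    print("\n".join(group))
```

Keeping one list per <h3> also answers the "how many <p> are there" part of the question: the length of each list is the count for that unit.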

ExpertPDF - How to know page number based on content in HTML

Suppose I have an HTML document that has some headings and text, like:
Heading 1
text......
Heading 2
text.....
Heading 3
text.....
Now I have to print this template to PDF. In the printout I have to add an index page that lists each heading with its page number, so the printout should look like this:
Heading 1 ....... 1 [page number]
Heading 2 ....... 2
Heading 3 ....... 3
Heading 1
text......
Heading 2
text.....
Heading 3
text.....
So what I want to know is how to find the page number for a piece of text in the HTML, i.e. which page Heading 1 ends up on, and likewise for the others.
Any suggestions or ideas would be really appreciated.
pdfConverter.PdfFooterOptions.PageNumberTextFontSize = 10;
pdfConverter.PdfFooterOptions.ShowPageNumber = true;
It's done inside the body of this method:
private void AddFooter(PdfConverter pdfConverter)
{
    string thisPageURL = HttpContext.Current.Request.Url.AbsoluteUri;
    string headerAndFooterHtmlUrl = thisPageURL.Substring(0, thisPageURL.LastIndexOf('/')) + "/HeaderAndFooterHtml.htm";
    // enable footer
    pdfConverter.PdfDocumentOptions.ShowFooter = true;
    // set the footer height in points
    pdfConverter.PdfFooterOptions.FooterHeight = 60;
    // write the page number
    pdfConverter.PdfFooterOptions.TextArea = new TextArea(0, 30, "This is page &p; of &P; ",
        new System.Drawing.Font(new System.Drawing.FontFamily("Times New Roman"), 10, System.Drawing.GraphicsUnit.Point));
    pdfConverter.PdfFooterOptions.TextArea.EmbedTextFont = true;
    pdfConverter.PdfFooterOptions.TextArea.TextAlign = HorizontalTextAlign.Right;
    // set the footer HTML area
    pdfConverter.PdfFooterOptions.HtmlToPdfArea = new HtmlToPdfArea(headerAndFooterHtmlUrl);
    pdfConverter.PdfFooterOptions.HtmlToPdfArea.EmbedFonts = cbEmbedFonts.Checked;
}
See this page for more details
http://www.expertpdf.net/expertpdf-html-to-pdf-converter-headers-and-footers/
This is actually a pretty tricky problem which ExpertPDF would have to provide specific functionality to make possible.
My solution (not ExpertPDF) was to calculate the layout of the PDF first, collect the text to be used in the index for each page, and then calculate the layout of the index page(s). After that I can number the pages (including the index pages) and then update the page numbers in the index. This is the only way to handle template pages which span multiple pages themselves, index text which wraps onto more than a single line, and indexes which span multiple pages.
Create a TextElement:
TextElement te = new TextElement(xPos, yPos, width, "Page &p; of &P;", footerFont);
footerTemplate.AddElement(te);
The library will automatically replace the &p; and &P; tokens.
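The multi-pass approach described above (lay out the body, lay out the index, then shift the page numbers) can be sketched roughly as follows. This is plain Python and has nothing to do with the ExpertPDF API. The function name build_index and the "lines per page" model are assumptions made for illustration: it pretends every section is a known number of lines, sections follow each other without page breaks, and each index entry takes one line.

```python
import math


def build_index(sections, lines_per_page, index_lines_per_page):
    """sections is a list of (heading, line_count) pairs.

    Returns (index_pages, entries), where entries is a list of
    (heading, page_number) with page numbers counted from the first
    page of the whole document, index pages included.
    """
    # Pass 1: lay out the body and note the page each heading starts on,
    # counted from the first body page.
    body_entries = []
    lines_used = 0
    for heading, line_count in sections:
        body_entries.append((heading, lines_used // lines_per_page + 1))
        lines_used += line_count

    # Pass 2: lay out the index itself to learn how many pages it needs.
    index_pages = math.ceil(len(sections) / index_lines_per_page)

    # Pass 3: the index sits in front of the body, so shift every number.
    entries = [(heading, page + index_pages) for heading, page in body_entries]
    return index_pages, entries


index_pages, entries = build_index(
    [("Heading 1", 50), ("Heading 2", 30), ("Heading 3", 10)],
    lines_per_page=40,
    index_lines_per_page=40,
)
for heading, page in entries:
    print(heading, ".......", page)
```

The point of doing it in passes is that the index length is not known until the headings are known, and the final page numbers are not known until the index length is.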

Html Agility Pack: scraping between a broken tag

I need to scrape a p tag which has an h3 tag after it but does not have a closing p tag. It looks like this:
<script ad>asdasdasd</script>
<p>Translation companies are
-----------------------
-----------------------
<h3 class="this_class">mind blown site</h3>
There is no </p> tag, so I cannot parse it completely. I have a few questions:
1) Can this be parsed using an Html Agility Pack XPath query?
2) I have a function that finds the text between two strings (getbetween), but I have a doubt: if I use "asdasdasd" and " as the delimiters, is it guaranteed that VB.NET will pick the script tag just above the h3? There are 2-3 identical "asdasdasd" lines.
3) Is there any other method you are aware of?
(I had to format this as code so the HTML does not get mangled.)
Regards,
It might be a good idea to post some more "real" html to really help you, at least the tags between the h3 and the p.
Anyway, this should get you the p-Tag from the h3-Tag.
HtmlDocument doc = new HtmlDocument();
doc.Load(... //Load the Html...
//Either of these lines will do
HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[@class='this_class']/preceding-sibling::p");
//HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[contains(text(),'mind blown site')]/preceding-sibling::p");
string pInnerHtml = pNode.NextSibling.InnerHtml; //Has the text "Translation companies are...."
So in general, to get all the nodes from the opening p tag to the start of a tag you don't want, you could do this:
var p = doc.DocumentNode.SelectSingleNode("//p");
var h3 = p.SelectSingleNode("following-sibling::h3[@class='this_class']");
var following = new List<string>();
for (var current = p.NextSibling; current != h3; current = current.NextSibling)
{
    following.Add(current.InnerText);
}
var innerText = String.Concat(following);
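On question 2 (repeated start markers in a string-based "get between"), one common way to avoid picking the wrong occurrence is to find the end marker first and then search backwards for the nearest start marker. The sketch below is Python, not the asker's VB.NET getbetween function; the name get_between and the choice of "<p>" and the h3 opening tag as delimiters are assumptions for illustration.

```python
def get_between(text, start, end):
    """Return the text between the first `end` and the closest `start` before it, or None."""
    end_pos = text.find(end)
    if end_pos == -1:
        return None
    # rfind searches backwards from end_pos, so earlier duplicates are ignored.
    start_pos = text.rfind(start, 0, end_pos)
    if start_pos == -1:
        return None
    return text[start_pos + len(start):end_pos]


html = (
    "<script ad>asdasdasd</script>\n"
    "<p>first block\n"
    "<script ad>asdasdasd</script>\n"
    "<p>Translation companies are\n"
    '<h3 class="this_class">mind blown site</h3>'
)
snippet = get_between(html, "<p>", '<h3 class="this_class">')
print(snippet.strip())
```

A forward search for the start marker would have returned everything from "first block" onwards; anchoring on the end marker keeps only the block directly above the h3.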

How parsing works

I am trying the sample code for the piracy report.
The line of code:
for incident in soup('td', width="90%"):
searches the soup for a td element with the attribute width="90%", correct? It invokes the __init__ method of the BeautifulStoneSoup class, which eventually invokes SGMLParser.__init__(self).
Am I correct with the class flow above?
The soup looks like this in the report now:
<td class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" ><p>22.09.2010: 0236 UTC: Posn: 03:49.9N – 006:54.6E: Off Bonny River: Nigeria.<p/>
<p>About 21 armed pirates in three crafts boarded a pipe layer crane vessel undertow. All crew locked themselves in accommodations. Pirates were able to take one crewmember as hostage. Master called Nigerian naval vessel in vicinity. Later pirates released the crew and left the vessel. All crew safe.<p/></td>
There is no width markup in the text. I changed the line of code that is searching:
for incident in soup('td', class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"):
It appears that class is a reserved word, maybe?
How do I get the current example code to run, and has more changed in the application than just the HTML output?
The URL I am using:
urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
There must be a better way....
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
soup = BeautifulSoup(page)
soup.find("table",{"class" : "fabrikTable"})
list1 = soup.table.findAll('p', limit=50)
i = 0
imax = 0
for item in list1 :
    imax = imax + 1
while i < imax:
    Itime = list1[i]
    i = i + 2
    Incident = list1[i]
    i = i + 1
    Inext = list1[i]
    print "Time ", Itime
    print "Incident", Incident
    print " "
    i = i + 1
class is a reserved word and will not work with that method.
This method works but does not return the list:
soup.find("tr", { "class" : "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" })
And I confirmed the class flow for the parse.
The example will run, but the HTML must be parsed with different methods because the width='90%' is no longer in the HTML.
Still working on the proper methods; will post back when I get it working.
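To illustrate why the class= keyword form fails while the dictionary form used with soup.find above is accepted, here is a small sketch that does not need BeautifulSoup at all. The function search is a made-up stand-in that only echoes the filters it receives; it is not part of any library.

```python
import keyword

NARRATION = "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"


def search(tag, attrs=None, **kwargs):
    """Hypothetical stand-in for a search call; it only reports the filters it was given."""
    filters = dict(attrs or {})
    filters.update(kwargs)
    return tag, filters


# "class" is a Python keyword, so it cannot be written as a keyword argument name.
is_keyword = keyword.iskeyword("class")

try:
    compile('search("td", class="x")', "<example>", "eval")
    keyword_form_compiles = True
except SyntaxError:
    keyword_form_compiles = False

# Passing the attribute inside a dictionary avoids the keyword problem.
result = search("td", {"class": NARRATION})
print(is_keyword, keyword_form_compiles, result)
```

The error comes from Python's own syntax rules before the library is ever called, which is why putting the attribute in a dictionary works. Passing the same dictionary to findAll instead of find should return every matching element rather than only the first one.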