regex using vb.net - vb.net

I have this HTML code:
<div class="name">
<span id="businessNumOnMap2" class="resultNumberOnMap" style="display:none;">2</span>
<span>
Bangsar Seafood Garden Restaurant</span><span id="phoneSpan1"></span>
<script type="text/javascript">
var d=document.getElementById('phoneSpan1');d.innerHTML+='0';d.innerHTML+='3';d.innerHTML+=0?'8':'-';d.innerHTML+=1?'2':'1';d.innerHTML+='2';d.innerHTML+=1?'8':'1';d.innerHTML+=0?'0':'2';d.innerHTML+='2';d.innerHTML+=0?'4':'5';d.innerHTML+='5';d.innerHTML+=1?'5':'0';
</script>
</div>
I start my regex with this: <div class="name"[^>]*>[\s\S]+?</div>
Then I remove the HTML tags with this: <[^>]*>
However, the output is Bangsar Seafood Garden Restaurant <script type = "text/javascript"> ...</script><div>
What I want is only Bangsar Seafood Garden Restaurant. Can anyone help me?

If all you want is the business name, can't you just do a text search for class="businessName" and then take everything between the next > and <? It might break if the HTML changes, but you run the same risk with a regex.
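A minimal sketch of one way to end up with only the name, written in Python rather than VB.NET. The three patterns use ordinary regex syntax and should carry over to .NET's Regex class, but that is an assumption and untested here. The idea is to delete each script element together with its body before stripping tags, because the tag-stripping pattern <[^>]*> only removes the tags themselves and leaves whatever text sits between them. The helper name business_name is made up for illustration, the script body is shortened, and removing the hidden number span is an extra step the question did not ask for.

```python
import re

# Shortened copy of the markup from the question (script body abbreviated).
html = '''<div class="name">
<span id="businessNumOnMap2" class="resultNumberOnMap" style="display:none;">2</span>
<span>
Bangsar Seafood Garden Restaurant</span><span id="phoneSpan1"></span>
<script type="text/javascript">
var d=document.getElementById('phoneSpan1');d.innerHTML+='0';
</script>
</div>'''


def business_name(fragment):
    """Return the visible text of the fragment (hypothetical helper)."""
    # 1. Remove whole <script>...</script> blocks, body included.
    no_script = re.sub(r"<script\b[^>]*>[\s\S]*?</script>", "", fragment,
                       flags=re.IGNORECASE)
    # 2. Remove the hidden map-number span, contents included.
    no_hidden = re.sub(
        r'<span[^>]*class="resultNumberOnMap"[^>]*>[\s\S]*?</span>',
        "", no_script)
    # 3. Strip the remaining tags and collapse whitespace.
    text = re.sub(r"<[^>]*>", " ", no_hidden)
    return " ".join(text.split())


print(business_name(html))
```

If the markup gets more complicated than this, an HTML parser is likely to be more robust than any of these patterns.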

Related

Use GetElementsByClass to find all <div> elements by class name, nested inside a <p> element

I am creating a parser using Jsoup in Kotlin.
I need to get the inner text of a tag with class "ptrack-content" inside the tag with class "titleCard-synopsis".
When I call getElementsByClass on an element returned by an earlier getElementsByClass call, I get 0 elements.
Code:
class NetlifxHtmlParser {
val html = """
<div class="titleCardList--metadataWrapper">
<div class="titleCardList-title"><span class="titleCard-title_text">Map Her</span><span><span class="duration ellipsized">50m</span></span></div>
<p class="titleCard-synopsis previewModal--small-text">
<div class="ptrack-content">A hidden map rocks Hartley High as the students' sexcapades are publicly exposed. Caught as the culprit, Amerie becomes an instant social pariah.</div>
</p>
</div>
<div class="titleCardList--metadataWrapper">
<div class="titleCardList-title"><span class="titleCard-title_text">Renaissance Titties</span><span><span class="duration ellipsized">50m</span></span></div>
<p class="titleCard-synopsis previewModal--small-text">
<div class="ptrack-content">Amerie, the new outcast, receives a party invitation that gives her butterflies. But when she manages to show up, a bitter surprise awaits.</div>
</p>
</div>
""".trimIndent()
    fun parseEpisode() {
        val doc = Jsoup.parseBodyFragment(html)
        val titleCards = doc.getElementsByClass("titleCard-synopsis")
        println("Episode: count titleCard = > ${titleCards.count()}") // 2
        titleCards.forEachIndexed { index, element ->
            val ptrack = element.getElementsByClass("ptrack-content")
            println("Episode: count ptrack = > ${ptrack.count()}") // 0 !!
            println("inner html = > ${ptrack.html()}") // null string !!
        }
    }
}
In the code above, I first extract the tags with class titleCard-synopsis.
For that, I use doc.getElementsByClass("titleCard-synopsis"), which returns 2 elements.
Then, for each titleCard element, I call getElementsByClass("ptrack-content") to extract the elements with class ptrack-content,
which returns an empty list.
Why is this happening?
My goal is to extract the description text for each title, which is stored in the tags nested inside the p tag with class titleCard-synopsis.
If I query "ptrack-content" directly on the document, it works fine, but that is a general class used in many places in the full HTML source (this is only a snippet).
Also note that calling html() on the result (ptrack.html()) does not give me the inner div; it returns an empty string.
Please guide me to resolve the issue!
TL;DR
I need to get the inner text of a tag with class "ptrack-content" inside the tag with class "titleCard-synopsis".
I'm not really familiar with Kotlin, but this should produce the desired output:
val doc = Jsoup.parseBodyFragment(html)
val result = doc.select(".titleCard-synopsis + .ptrack-content")
result.forEachIndexed { index, element ->
    println("${element.html()}")
}
This is an interesting problem!
You basically have invalid HTML, and jsoup is smart enough to auto-correct it for you. As a result, your HTML structure gets altered and your query no longer works.
This is the error:
<p class="titleCard-synopsis previewModal--small-text">
<div class="ptrack-content">A hidden map rocks Hartley High as the students' sexcapades are publicly exposed. Caught as the culprit, Amerie becomes an instant social pariah.</div>
</p>
You can't nest a <div> element inside a <p> element like that.
Paragraphs are block-level elements, and notably will automatically close if another block-level element is parsed before the closing </p> tag. [Source: <p>: The Paragraph element]
Also, look at Nesting block level elements inside the <p> tag... right or wrong?
This is how jsoup parses your tree:
<html>
<head></head>
<body>
<div class="titleCardList--metadataWrapper">
<div class="titleCardList-title">
<span class="titleCard-title_text">Map Her</span><span><span class="duration ellipsized">50m</span></span>
</div>
<p class="titleCard-synopsis previewModal--small-text"></p>
<div class="ptrack-content">
A hidden map rocks Hartley High as the students' sexcapades are publicly exposed. Caught as the culprit, Amerie becomes an instant social pariah.
</div>
<p></p>
</div>
<div class="titleCardList--metadataWrapper">
<div class="titleCardList-title">
<span class="titleCard-title_text">Renaissance Titties</span><span><span class="duration ellipsized">50m</span></span>
</div>
<p class="titleCard-synopsis previewModal--small-text"></p>
<div class="ptrack-content">
Amerie, the new outcast, receives a party invitation that gives her butterflies. But when she manages to show up, a bitter surprise awaits.
</div>
<p></p>
</div>
</body>
</html>
As you can see, elements with class titleCard-synopsis have no children with class ptrack-content.
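To make the auto-closing rule concrete, here is a small Python sketch built on the standard library's html.parser. It is not jsoup and not a full HTML5 tree builder: it only models the one rule quoted above (a block-level start tag closes a still-open <p>) and records which element ends up as the parent of each classed element. The class name ParentRecorder, the CLOSES_OPEN_P set, and the shortened markup are all made up for illustration.

```python
from html.parser import HTMLParser

# Illustrative subset of start tags that implicitly close an open <p>.
CLOSES_OPEN_P = {"div", "p", "ul", "ol", "table", "section", "h1", "h2"}


class ParentRecorder(HTMLParser):
    """Records the class attribute of each classed element's effective parent."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of (tag, class attribute)
        self.parent_class = {}   # first class of element -> parent's class attribute

    def handle_starttag(self, tag, attrs):
        # The rule under discussion: a block-level start tag closes an open <p>.
        if (tag in CLOSES_OPEN_P and self.open_elements
                and self.open_elements[-1][0] == "p"):
            self.open_elements.pop()
        css = dict(attrs).get("class") or ""
        if css:
            parent = self.open_elements[-1][1] if self.open_elements else ""
            self.parent_class[css.split()[0]] = parent
        self.open_elements.append((tag, css))

    def handle_endtag(self, tag):
        # Close the nearest matching open element. A stray end tag (such as the
        # leftover </p>) is simply ignored in this sketch; a real parser turns
        # it into an empty <p></p>, as the jsoup dump above shows.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                del self.open_elements[i:]
                return


# Shortened version of the markup from the question.
html = """
<div class="titleCardList--metadataWrapper">
<p class="titleCard-synopsis previewModal--small-text">
<div class="ptrack-content">A hidden map rocks Hartley High.</div>
</p>
</div>
"""

recorder = ParentRecorder()
recorder.feed(html)
print(recorder.parent_class["ptrack-content"])
```

The recorded parent of ptrack-content is the metadataWrapper div rather than the paragraph, which is why a lookup scoped to titleCard-synopsis finds nothing while a sibling or document-level query does.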

How to create an Xpath in a tricky section of document (for me) for the purpose of using with Selenium Basic in VBA

OK, so I mentioned Selenium Basic as that is the use of the XPath and I believe Selenium Basic uses Selenium version 2 so maybe it won't be able to understand some/all answers that might require the latest Selenium. But someone might take that into account if necessary.
There are dynamic classes at play here.
Criteria for selection.
1. Class starting with 'NextToJump__eventWrapper' (the outer one) must be used.
2. Class starting with 'NextToJump__venue' must contain text = 'Ballarat'
3. Class starting with 'NextToJump__race' (and/or span) must contain text = 'Race 2'
I need to be able to click on the <a> tag that contains Points 2 and 3.
The best that I've been able to do (and checked) using ChroPath in Chrome Devtools is...
//div[starts-with(@class,'NextToJump__eventWrapper')]//descendant::*[contains(text(),'Ballarat')]
But note that there are 2 cases of Point 2 in the HTML but only 1 case that satisfies Points 2 and 3.
Thanks
<div class="NextToJump__eventWrapper--13zZJ">
<div>
<div class="NextToJump__raceEvent--bfMON" data-testid="next-to-jump-item">
<a class="Link__link--9x4YY" href="/racing-betting/greyhound-racing/crayford-am/20200708/race-1-1801951-58544404">
<div class="NextToJump__iconWrapper--1yG60"></div>
<div class="NextToJump__eventDetail--CUzdX">
<div class="NextToJump__venue--1jwWA">Ballarat</div>
<div class="NextToJump__race--3JydR"><span>Race 1</span></div>
</div>
<div class="NextToJump__countdown--EG8mR"><span class="Countdown__countdown--4vRpD Countdown__imminent--2yc2K">52s</span></div>
</a>
</div>
<div class="NextToJump__raceEvent--bfMON" data-testid="next-to-jump-item">
<a class="Link__link--9x4YY active" href="/racing-betting/greyhound-racing/rockhampton/20200708/race-4-1799474-58466521" aria-current="page">
<div class="NextToJump__iconWrapper--1yG60"></div>
<div class="NextToJump__eventDetail--CUzdX">
<div class="NextToJump__venue--1jwWA">Rockhampton</div>
<div class="NextToJump__race--3JydR"><span>Race 4</span></div>
</div>
<div class="NextToJump__countdown--EG8mR"><span class="Countdown__countdown--4vRpD Countdown__imminent--2yc2K">2m 52s</span></div>
</a>
</div>
<div class="NextToJump__raceEvent--bfMON" data-testid="next-to-jump-item">
<a class="Link__link--9x4YY" href="/racing-betting/greyhound-racing/ballarat/20200708/race-4-1799454-58465201">
<div class="NextToJump__iconWrapper--1yG60"></div>
<div class="NextToJump__eventDetail--CUzdX">
<div class="NextToJump__venue--1jwWA">Ballarat</div>
<div class="NextToJump__race--3JydR"><span>Race 2</span></div>
</div>
<div class="NextToJump__countdown--EG8mR"><span class="Countdown__countdown--4vRpD Countdown__imminent--2yc2K">5m 52s</span></div>
</a>
</div>
</div>
</div>
The xpath expression you need to use to select your target <a> tag is long and convoluted, but that's life....
[formatted for ease of reading, but you can use that in one line]
//a
  [ancestor::div[starts-with(@class,'NextToJump__eventWrapper')]]
  [.//div[.="Ballarat"]
    [starts-with(@class,'NextToJump__venue-')]
    [./following-sibling::div[.="Race 2"]
      [starts-with(@class,'NextToJump__race-')]
    ]
  ]
Edit:
In "plain English":
Find an <a> node which meets ALL these conditions: (i) it has an ancestor (not necessarily a parent) node which is a <div> whose class attribute value starts with NextToJump__eventWrapper; and (ii) it has a <div> descendant (not just a child) node which has Ballarat as its text AND whose class attribute value starts with NextToJump__venue-, where that <div> descendant itself has a following sibling which is a <div> with the text Race 2 AND a class attribute value starting with NextToJump__race-...
Yes, the word "plain" doesn't really fit here, but that's the closest I could get. I like xpath, and it's very powerful, but sometimes it's very hard to follow... As an aside, it would have been somewhat less cryptic if xquery was used instead of straight xpath.
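For readers who find the conditions easier to follow as code, here is a rough Python sketch of the same three checks using the standard library's ElementTree. ElementTree's XPath subset has no starts-with(), so this is the logic of the expression written as plain Python, not the XPath itself, and it has nothing to do with how Selenium Basic evaluates it. The markup is a trimmed copy of the question's HTML, the href values are shortened hypothetical placeholders, and find_link, class_starts and text_of are made-up helper names.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the markup; hrefs are hypothetical placeholders.
snippet = """
<div class="NextToJump__eventWrapper--13zZJ">
  <div>
    <a href="/ballarat/race-1">
      <div class="NextToJump__eventDetail--CUzdX">
        <div class="NextToJump__venue--1jwWA">Ballarat</div>
        <div class="NextToJump__race--3JydR"><span>Race 1</span></div>
      </div>
    </a>
    <a href="/rockhampton/race-4">
      <div class="NextToJump__eventDetail--CUzdX">
        <div class="NextToJump__venue--1jwWA">Rockhampton</div>
        <div class="NextToJump__race--3JydR"><span>Race 4</span></div>
      </div>
    </a>
    <a href="/ballarat/race-2">
      <div class="NextToJump__eventDetail--CUzdX">
        <div class="NextToJump__venue--1jwWA">Ballarat</div>
        <div class="NextToJump__race--3JydR"><span>Race 2</span></div>
      </div>
    </a>
  </div>
</div>
"""


def class_starts(el, prefix):
    """True if the element's class attribute value starts with prefix."""
    return el.get("class", "").startswith(prefix)


def text_of(el):
    """All text inside the element, trimmed (close to XPath's string value)."""
    return "".join(el.itertext()).strip()


def find_link(root, venue, race):
    """Return the first <a> that satisfies all three criteria, else None."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for wrapper in root.iter("div"):
        if not class_starts(wrapper, "NextToJump__eventWrapper"):
            continue  # criterion 1: must sit inside the outer wrapper
        for link in wrapper.iter("a"):
            for div in link.iter("div"):
                # criterion 2: a venue div with the wanted text
                if not (class_starts(div, "NextToJump__venue-")
                        and text_of(div) == venue):
                    continue
                siblings = list(parent_of[div])
                later = siblings[siblings.index(div) + 1:]
                # criterion 3: a following sibling race div with the wanted text
                for sib in later:
                    if (sib.tag == "div"
                            and class_starts(sib, "NextToJump__race-")
                            and text_of(sib) == race):
                        return link
    return None


link = find_link(ET.fromstring(snippet), "Ballarat", "Race 2")
print(link.get("href"))
```

Only the third link passes both the venue and the race check; the first Ballarat entry fails on the race text, which is the case the single-condition XPath in the question could not rule out.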

Not finding the Correct xpath

I'm trying to write a Python script to get some information from the Google products listed at the top right of the results page (usually 6 pictures with price and seller).
I am using Python, PhantomJS and Selenium
Doing a google search for "red shoe" I want my script to return the prices. I get stuck in the step where I try to even find the element containing the products. Am I missing something with my xpath?
import time
from selenium import webdriver

def getTopSongs(object):
    print("Working YETI")
    browser = webdriver.PhantomJS('c:/projects/phantomjs/phantomjs.exe')
    browser.get('http://google.com/search?q=red+shoe')
    time.sleep(5)
    title = browser.find_element_by_xpath('//div[contains(@class, "pla-unit")]/text()[contains(., "red")]/following::b').text
In Google's page, the element I want sits under several nested divs:
<div id="rhs">
...
<div class="_Pwb">
<div class="_Ohb">
<div style="width:109px" class="pla-unit">
<div class="_PD">
<div class="pla-unit-img-container">
<div class="_Z5">
<div class="_vT"><a href="http://www.somewebsite.com">
<span class="rhsl4">Nina 'Forbes' Peep Toe Pump <b>Red</b> R...</span>
<span class="rhsg3 rhsl5">Nina 'Forbes' Peep Toe Pum...</span>
<span class="rhsg4">Nina 'Forbes' Peep Toe Pu...</span></a>
</div>
<div class="_QD"><b>$78.95</b></div>
<div class="_mC">
<span class="rhsl4 a">Nordstrom</span>
<span class="rhsg3 rhsl5 a">Nordstrom</span>
<span class="rhsg4 a">Nordstrom</span>
</div>
</div>
*Update:
I added more HTML. In this example I am looking to get the text ($78.95) and (Nordstrom).
*Update
To clarify,
<div id="rhs">
is an unique element
There are however multiple (6) elements of:
<div style="width:109px" class="pla-unit">
The elements under each category have the same name and follow the same structure and substructures
ie, there are 6
<div class="_PD">
<div class="pla-unit-img-container">
<div class="_Z5">
<div class="_vD">
<div class="_QD">
<div class="_mC">
and so on.
The main objective is to get all of the elements but for purposes of debugging I was asking help to get the first one.
The xpath for a price unit using XPathChecker on Firefox is:
id('rhs_block')/x:div[1]/x:div/x:div/x:div/x:div[1]/x:div[1]/x:div[2]/x:div[2]/x:b
You can use ancestor:: to go back up then following-sibling:: to get elements at the same level that follow it.
I haven't tried this but give it a shot:
title = browser.find_element_by_xpath('//div[contains(@class, "pla-unit")]/text()[contains(., "red")]/ancestor::div/following-sibling::div[1]').text
Then to get to your div with class '_mC', you just change:
following-sibling::div[1]
to
following-sibling::div[2]
and get the text from the spans under that.
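As a way to check the relative lookups without a browser, here is a rough sketch that runs them against a trimmed, well-formed copy of the HTML from the question using Python's standard ElementTree. It is a stand-in for the Selenium call, not a replacement for it: ElementTree only supports a small XPath subset, and the live Google markup may differ from the snippet. Note that it matches class='pla-unit' exactly, since contains(@class, "pla-unit") would also match pla-unit-img-container. The variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the structure shown in the question.
fragment = """
<div id="rhs">
  <div style="width:109px" class="pla-unit">
    <div class="_PD">
      <div class="_Z5">
        <div class="_vT"><a href="http://www.somewebsite.com">
          <span class="rhsl4">Nina 'Forbes' Peep Toe Pump <b>Red</b> R...</span></a>
        </div>
        <div class="_QD"><b>$78.95</b></div>
        <div class="_mC">
          <span class="rhsl4 a">Nordstrom</span>
          <span class="rhsg3 rhsl5 a">Nordstrom</span>
          <span class="rhsg4 a">Nordstrom</span>
        </div>
      </div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(fragment)

results = []
# One iteration per product unit; every lookup below is relative to that unit.
for unit in root.findall(".//div[@class='pla-unit']"):
    price = unit.findtext(".//div[@class='_QD']/b")      # text of the first <b>
    seller = unit.findtext(".//div[@class='_mC']/span")  # text of the first <span>
    results.append((price, seller))

print(results)
```

Looping over the units and querying relative to each one keeps price and seller paired, which matters once all six units are being read instead of just the first.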

Soundcloud waveform nodes

I was reading an article on SoundCloud today about their waveforms and how they generate them by converting the highest volume point into a number between 0 and 1.
After that I opened a track on SoundCloud with the Chrome console open and went through the Network tab (all files), but no file returned an array of data for generating the HTML5 waveform. So my question is: how do they do it without requesting the data?
Interesting question :) I'm no expert at HTML5's canvas, but I'm sure it has to do with that.
If you look at the DOM you'll see a structure like this:
<div class="sound__body">
<div class="sound__waveform">
<div class="waveform loaded">
<div class="waveform__layer waveform__scene">
<canvas aria-hidden="true" class="g-box-full sceneLayer" width="453" height="60"></canvas>
<canvas aria-hidden="true" class="g-box-full sceneLayer waveformCommentsNode loaded" width="453" height="60"></canvas>
<canvas aria-hidden="true" class="g-box-full sceneLayer" width="453" height="60"></canvas>
</div>
<div class="commentPlaceholder g-z-index-content">...</div>
<div class="commentPopover darkText smallAvatar small">...</div>
</div>
</div>
</div>
On my page I have four sounds. In my network panel I also have four requests like this:
https://wis.sndcdn.com/iGZOEq0vuemr_m.png
They are being sent as JSON, not as PNG!
And contain stuff like:
{"width":1800,"height":140,"samples":
[111,116,118,124,121,121,116,103,119,120,118,118,119,123,128,128,119,119,119,120,117,116,123,127,124,119,115,120,120,121,120,120,121,121,117,116,117,120,123,119,121,125,128,126,122,99,119,120,121,117,122,120,125,125,134,135,130,126,122,123,120,124,126,124,114,111,119,120,120,118,119,132,133,128,127,
...much more
...much more
122,120,125,125,134,135,130]}
I'm pretty sure this is the data being used to draw the waveform using canvas.
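As a rough illustration of how such a payload could be turned into values between 0 and 1, here is a small Python sketch that divides each sample by the declared height. That SoundCloud's player scales the samples this way is an assumption on my part; the payload below is only the first few samples from the response above, and normalize is a made-up helper name.

```python
import json

# First few samples of the response shown above (truncated excerpt).
payload = '{"width":1800,"height":140,"samples":[111,116,118,124,121,121,116,103]}'


def normalize(waveform_json):
    """Scale each sample to the 0..1 range by dividing by the stated height."""
    data = json.loads(waveform_json)
    height = data["height"]
    return [sample / height for sample in data["samples"]]


levels = normalize(payload)
print(levels)
```

Each resulting value could then be used as the relative bar height when drawing onto one of the canvas elements.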
As far as I understand the process:
SoundCloud creates an image directly after the upload.
You can access it via the tracks endpoint.
SC.get('/tracks/159966669', function(sound) {
  $('#result').append('<img src="' + sound.waveform_url + '"/>');
});
I.e. http://jsfiddle.net/iambnz/fzm4mckd/
Then they use a script like this one, written by (former) SoundCloud devs: http://waveformjs.org. It converts the image into floats.
Example call:
http://www.waveformjs.org/w?url=https%3A%2F%2Fw1.sndcdn.com%2FzVjqZOwCm71W_m.png&callback=callback_json1
Example response (extract)
callback_json1([0.07142857142857142,0.5428571428571428,0.7857142857142857,0.65,0.6142857142857143,0.6357142857142857,0.5428571428571428,0.6214285714285714,0.6357142857142857,0.6571428571428571,0.6214285714285714,0.5285714285714286,0.6642857142857143,0.5714285714285714,0.5,0.5,0.6,0.4857142857142857,0.4785714285714286,0.5714285714285714,0.6642857142857143,0.6071428571428571,0.6285714285714286,0.5928571428571429,0.6357142857142857,0.6428571428571429,0.5357142857142857,0.65,0.5857142857142857,0.5285714285714286,0.55,0.6071428571428571,0.65,0.6142857142857143,0.5928571428571429,0.6428571428571429,...[....]
See example here, more detailed on waveform.js
HTML
<div class="example-waveform" id="example2">
<canvas width="550" height="50"></canvas>
</div>
JS
SC.get('/tracks/159966669', function(sound) {
  var waveform = new Waveform({
    container: document.getElementById("example2"),
    innerColor: "#666666"
  });
  waveform.dataFromSoundCloudTrack(sound);
});
http://jsfiddle.net/iambnz/ro1481ga/
See docs here: http://waveformjs.org/#endpoint
I hope this will help you a bit.

really could use some assistance with variables declarations for my assignment

OK, here is what I got from you. Can you check if this is right? Going by what the book says, I can't edit too much of it, so I'm guessing it has to stay somewhat in this format. Hope you can help.
<!DOCTYPE html>
<html>
<head>
<script type="text/javascript">
/* <![CDATA[ */
/* ]]> */
document.getElementById( news ) .innerHTML='newsItem1';
var newsItem1 = "L'AQUILA, ITALY (AP) - L'Aquila's chief prosecutor announced an investigation into allegations of shoddy construcation as workers continued to scour the rubble for people still missing after a devastating earthquake five days ago. http://in.reuters.com/article/idUSWBT01103020090411;
var newsItem2 = "WASHINGTON (Reuters) - President Barack Obama said on Friday the recession-hit US ecomony was showing 'glimmers of hope' despite remaining under strain and promised further steps in coming weeks to tackle the finicial crisis. http://in.reuters.com/article/idUSWBT01103020090411";
var newsItem3 = "(eWeek.com) - Apple is close to hitting 1 billion downloads from its App Store and plans on prize giveaway for whoever downloads the billionth application that includes a MacBook Pro and an iPod Touch. http://www.eweek.com/c/a/application-development/eweek-newsbreak-april-13-2009/";
var newsItem4 = "ALTANTA (AP) - Chipper Jones drove in two runs, including a tiebreaking single, and the Atlanta Braves beat Washington 8-5 on Sunday to hand the Nationals their sixth straight loss to start the season. http://www.newsvine.com/_news/2009/04/11/nationals-8-5?category=sports";
</script>
</head>
<body>
<form action="" name="newsHeadlines" method="get">
</form>
<table style="border: 0; width: 100%">
<tr valign="top">
<td>
<select name="headline" multiple="multiple"
style="height: 93px">
<option onclick="document.newsHeadlines.news.value=newItem1">Investigation of building standards in quake zone</option>
<option onclick="document.newsHeadlines.news.value=newsItem2">Obama sees signs of economic progress</option>
<option onclick="document.newsHeadlines.news.value=newsItem3">Apple App Downloads Approach 1 Billion</option>
<option onclick="document.newsHeadlines.news.value=newsItem4">Jones, Braves beat winless Nationals 8-5</option>
</select>
</td>
<td>
<textarea id="news" name="news" cols="50" rows="10"
style="background-color: transparent"></textarea>
</td>
</tr>
</table>
</body>
</html>
Can someone help me fix the problem? Every time I click on "Investigation of building standards in quake zone", nothing shows in the textarea I created.
Although it might be too late for your assignment, I thought I should point out some errors in the code and try to implement what you intend, i.e. displaying a different news item when the corresponding select option is clicked.
document.getElementById( 'news' ).innerHTML='newsItem1';
should run after the textarea with id 'news' has been declared; otherwise getElementById returns null. Note also that 'newsItem1' in quotes is the literal string, not the variable, and that the newsItem1 string in your script is missing its closing double quote before the semicolon, which is a syntax error that stops the whole script from running.
In <option onclick="document.newsHeadlines.news.value=newItem1">,
there's a typo (newItem1 should be newsItem1) and the onclick attribute value is not correct: the textarea sits outside the newsHeadlines form, so document.newsHeadlines.news does not refer to it. To correctly select the textarea and display the intended text in it, use the following: onclick="document.getElementById('news').value=newsItem1". Use similar values for the rest of the options.