Beautiful Soup find string and then the nth element down - beautifulsoup

In Beautiful Soup is it possible to search for a text string and then from there find the nth element down?
Update
I am trying to target the following code field to grab its text. I tried soup.find and find_all, but I have other pages to target that are slightly different, so I need something really robust.
My Plan
Go to page
Find the string Model name
Find nth element down, in this case the next anchor tag
My code to find string
model_name=soup.find(text='Model name')
print(model_name)

OK, I got it; it's actually really simple. The solution I found gets a little messy, but it works.
All you need is the next operator, so the code looks like this:
model_name=soup.find(text='Model name').next.text
Chain as many next operators as it takes to reach the target.
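A less position-dependent variant of the same idea is to jump straight to the next anchor tag with find_next instead of counting next steps. The sketch below is only an illustration: the HTML snippet and the X100 value are made up, and string= is the newer spelling of the text= argument used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
html = """
<table>
  <tr><td><b>Model name</b></td><td><a href="/models/x100">X100</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the text node, then the first <a> that follows it in document order
label = soup.find(string="Model name")
link = label.find_next("a") if label is not None else None
model_name = link.get_text(strip=True) if link is not None else None
print(model_name)
```

Because find_next searches forward for a tag name, this should keep working when the number of nodes between the label and the link varies from page to page.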

Related

Using XPath In Cycle For

I'm trying to solve a simple problem in Excel VBA that I can't figure out.
I have a For loop and an XPath that matches 24 elements, and I want to get the text of each one, but when I run it I'm told the XPath is wrong. I understand that putting i inside the XPath string is what breaks it. I tried a different approach, such as
Name.Add findApp.FindElementByXPath("(//span[@class='offer-item-title'])["&"i]").Text
but nothing seems to work. Can someone help me solve this? Thank you so much :)
Code:
For i = 0 To 23
    Name.Add findApp.FindElementByXPath("(//span[@class='offer-item-title'])[i]").Text
Next i
XPath doesn't know that i is the name of a VB variable, it thinks it is the name of an element in your source document.
You can construct an expression like this:
FindElementByXPath("(//span[@class='offer-item-title'])[" & i & "]")
Or better, but I don't know if VBA offers the capability, is to pass a parameter into the XPath expression -- ideally you only want to compile the XPath expression once, rather than repeating the compilation 23 times, because compiling it typically takes 100 times longer than executing it.
But for this particular example, it would be better to construct an expression that reads everything you want in one go, rather than making 24 separate calls. Incidentally, XPath indexing starts at one, so the call with i=0 will select nothing. Given that this is XPath 1.0, you can do
(//span[@class='offer-item-title'])[position() <= 24]
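Both points can be sketched in Python with the standard library's xml.etree.ElementTree, which supports only a subset of XPath. This is not the VBA/Selenium API from the question: the sample document and the indexed_xpath helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document standing in for the scraped page
items = "".join(
    f"<span class='offer-item-title'>Offer {n}</span>" for n in range(1, 31)
)
root = ET.fromstring(f"<div>{items}</div>")

# One query for every matching element, then keep the first 24
titles = [
    span.text
    for span in root.findall(".//span[@class='offer-item-title']")[:24]
]

# If an index really has to be spliced in, build the string from the
# variable's value and remember that XPath positions start at 1
def indexed_xpath(i):
    return f"(//span[@class='offer-item-title'])[{i}]"

print(len(titles), titles[0], titles[-1])
print(indexed_xpath(1))
```

Fetching the whole node list once and slicing it in the host language avoids both the repeated queries and the off-by-one risk of a loop that starts at 0.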

Removing Comment and Doctype from BS4 parser when running through soup()

I'm trying to remove the Comments and Doctype of a BS4 instance by doing the following:
for elt in soup.find_all(text=lambda text: isinstance(text, (Doctype, Comment))):
    elt.extract()
But right after that, I loop over all the items, one by one, to run some custom processing.
I feel like I'm doing the loop twice, which is not great in terms of performance.
But when I try to do:
soup = BeautifulSoup(message, features='html.parser')
for tag in soup():
    if isinstance(tag, (Comment, Doctype)):
        tag.extract()
It doesn't work, because tag is always a bs4.element.Tag.
Is there a way to loop over all elements via for tag in soup() and remove the comments and the doctype?
Thank you in advance!
Perhaps I am not understanding your concern about efficiency. You say that the first snippet works, but because you then have to loop over everything again, you worry the result is inefficient.
Let's put the algorithm hat on. Iterating over a list is O(n); in your case n is the number of objects in the soup. You then iterate over the objects again (minus the removed ones), so the total is 2*O(n), which is still O(n) in the worst case.
The solution you propose is to remove the nodes inside a loop. But unless you also perform your processing within that same loop, you still end up with the same two loops, and it is no more efficient, since you are basically re-implementing the removal.
But to reply specifically to your question, read as "How do I navigate through soup tags and objects?": my suggestion would be to traverse the soup object with descendants, children, next, and previous; see the official documentation.
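As a rough sketch of the single-pass variant: soup() is shorthand for soup.find_all(), which returns only Tag objects, whereas soup.descendants also yields comments and the doctype. The message string and the seen_tags list below are placeholders for the real input and the custom processing.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype, Tag

# Hypothetical input
message = "<!DOCTYPE html><html><body><!-- note --><p>Hello</p></body></html>"
soup = BeautifulSoup(message, "html.parser")

seen_tags = []
# list() snapshots the nodes so extract() cannot disturb the iteration
for node in list(soup.descendants):
    if isinstance(node, (Comment, Doctype)):
        node.extract()
    elif isinstance(node, Tag):
        seen_tags.append(node.name)  # placeholder for custom processing

print(seen_tags)
print(str(soup))
```

Materializing the generator with list() costs one extra list, but it avoids modifying the tree while the generator is still walking it.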

VB expressions to help search through scraped data in UiPath

I have made a process in UiPath that reads PDFs and scrapes their text. I am struggling to come up with a regular expression that I can use to search for a PO number. The scraped text is fairly unstructured, so my best bet is to search for a run of digits preceded by 'PO' with no space, for example "PO1234567890". I will be setting a variable so the system knows that no PO number was found if the search comes up empty. Any reference material would be welcome, as I am a beginner in VB. Thanks!
I have researched this and cannot find a way to do the type of search I want.
I expect to be able to find "PO1234567890" without letting a bare "PO" match. So I need to search for "PO" followed by at least two digits and any further digits, with no whitespace.
Just try the following:
Dim poRegex As New System.Text.RegularExpressions.Regex("PO[0-9]+")
Dim poMatches As System.Text.RegularExpressions.MatchCollection = poRegex.Matches(SearchString)
' poMatches.Count is 0 when no PO number was found
The regex string PO[0-9]+ means:
PO followed by at least one digit.
If you want to require a minimum number of digits, for example three, use PO[0-9]{3}[0-9]*, which means:
PO followed by three digits and then as many more digits as it can match.
If you need help using the regex matches, just ask.
Hope it helps!
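The same pattern, together with the "not found" fallback the question asks for, can be sketched in Python's re module (not the .NET Regex class used above). The find_po_number helper and the sample strings are made up for illustration.

```python
import re

# "PO" immediately followed by one or more digits
PO_PATTERN = re.compile(r"PO[0-9]+")

def find_po_number(text):
    """Return the first PO number in text, or None if there is none."""
    match = PO_PATTERN.search(text)
    return match.group(0) if match else None

print(find_po_number("Ref: PO1234567890 due 01/02"))
print(find_po_number("PO Box 12, no order number"))
```

Returning None for the no-match case gives the calling workflow a single value to test before it carries on.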

selecting first span value from a group in selenium

I want to select the first instance of an element on a page that contains many such elements, whose ID will not always be the same.
For example, visit http://www.sbobet.com/euro, which lists many sports and odds; I want to click on the first odds.
The HTML structure would be like this:
I want to click on this first span value and proceed with a test case.
Any help on how to achieve this?
There are two approaches to the problem:
1. If you are sure you will always need only the first instance:
driver.FindElementsByClassName("OddsR")[0];
If not, you still have the whole collection of elements and can access any of them by index.
2. Alternatively, you can first identify the closest enclosing div and then use the same snippet as above:
driver.FindElementsByClassName("OddsR")[0];
This is the better approach if the page is somewhat dynamic.
Use the @class attribute. If the OddsR class you are interested in is the first one on the page, just use Driver.FindElement(By.ClassName("OddsR")). WebDriver will pick the first occurrence, no matter how many there are.
I have checked your link and I agree with alecxe: you should probably start with the div. But I would suggest a simpler selector:
css = "div.MarketBd span.OddsR"
The above selector will always point to the first span of "OddsR" class within div of "MarketBd" class.
Thanks for the response.
I am finally able to click on the element with this XPath:
"//span[@class='OddsR']"
This clicks on the first occurrence of an 'OddsR' value without needing an index.

Split and find specific text?

OK, so I've made an HttpWebRequest and I display the source of the result in a RichTextBox. Now say I have this in the RichTextBox:
<p>Short URL: <code>http://URL.me/u/eywnp</code></p>
How would I go about getting just the "http://URL.me/u/eywnp"? I've tried Split but it didn't work; I guess I'm doing it wrong.
NOTE: the URL will be different every time.
Split isn’t the right tool for the job. It will result in a rather complex piece of code that’s quite brittle (meaning it will break as soon as there’s the slightest change in the input).
For a robust, well-written solution you need to parse the HTML properly. Luckily there exist canned solutions for that: The HtmlAgilityPack library.
Dim doc As New HtmlDocument()
doc.LoadHtml(yourCode)
Dim result = doc.DocumentNode.SelectNodes("//a[@href]")(0).GetAttributeValue("href", "")
The only complicated part here is the string "//a[@href]". This is an XPath string. XPath is a mini-language used to address elements in an HTML or XML document. XPath strings are conceptually similar to file paths (like C:\Users\foo\Documents\file.txt) but with a slightly different syntax.
The XPath simply selects all the <a> elements having a href attribute from your document. Then you can grab the first of that collection and retrieve the href attribute’s value.
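For comparison, the same "parse, then read the attribute" idea can be sketched with Python's standard-library html.parser instead of HtmlAgilityPack. The FirstHrefParser class is made up for this example, and the sample input assumes the URL is wrapped in an <a> element, which the snippet in the question does not show.

```python
from html.parser import HTMLParser

class FirstHrefParser(HTMLParser):
    """Remember the href of the first <a> element that has one."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

# Sample input modeled on the question; the <a> wrapper is an assumption
source = (
    '<p>Short URL: <code><a href="http://URL.me/u/eywnp">'
    "http://URL.me/u/eywnp</a></code></p>"
)

parser = FirstHrefParser()
parser.feed(source)
print(parser.href)
```

Because the parser reacts to tags rather than character offsets, changes in surrounding whitespace or attribute order should not break the extraction.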
Thanks for all your help. I did find a solution, and I used:
Dim iStartIndex, iEndIndex As Integer
With RichTextBox1.Text
    iStartIndex = .IndexOf("<p>Short URL: <code><a href=") + 29
    iEndIndex = .IndexOf(""">", iStartIndex)
    Clipboard.SetText(.Substring(iStartIndex, iEndIndex - iStartIndex))
End With
It works perfectly so far.