Removing Comment and Doctype from BS4 parser when running through soup() - beautifulsoup

I'm trying to remove the Comments and Doctype of a BS4 instance by doing the following:
from bs4 import Comment, Doctype

for elt in soup.find_all(text=lambda text: isinstance(text, (Doctype, Comment))):
    elt.extract()
But right after, I loop over all the items, one by one, to run some custom processing.
I feel like it's doing the loop twice, which is not great in terms of performances.
But when I try to do :
soup = BeautifulSoup(message, features='html.parser')
for tag in soup():
    if isinstance(tag, (Comment, Doctype)):
        tag.extract()
It doesn't work because tag is always a bs4.element.Tag: calling soup() is shorthand for soup.find_all(), which only returns tags, never comments or the doctype.
Is there a way to loop over all elements via for tag in soup() and remove the comments and the doctype?
Thank you in advance!

Perhaps I am not understanding your concern with efficiency. You say that the first snippet works, but since you have to loop over the elements again afterwards, it might seem like an inefficient solution.
Let's put the algorithm hat on. Iterating over a list is O(n); in your case, n is the number of soup objects. Then you re-iterate over the objects (minus the removed tags), so 2*O(n) => O(n) in the worst case.
The solution you propose is to remove the tags within a loop. But then, unless you perform the filtering within that loop, you will still end up with the same two loops, which is no more efficient, since you are basically re-implementing the tag removal.
But to reply specifically to your question, as in "How do I navigate through soup tags and objects?", my suggestion would be to traverse the soup object with descendants, children, next, and previous. Here is the official documentation.
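For example, a single pass over descendants can combine the removal with whatever custom per-element work you need (a minimal sketch; process() is a hypothetical stand-in for your own logic):

from bs4 import BeautifulSoup, Comment, Doctype

soup = BeautifulSoup(message, features='html.parser')

# Materialize the generator first so extracting nodes does not
# disturb the traversal mid-iteration.
for node in list(soup.descendants):
    if isinstance(node, (Comment, Doctype)):
        node.extract()    # drop comments and the doctype
    else:
        process(node)     # hypothetical custom processing

This still visits every node exactly once, which matches the O(n) bound above.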

Related

How can I find an element by a non-unique resource-id?

I am testing an application which uses non-unique resource-ids for its elements.
Is there any way to find such elements with an XPath like
//*[@resource-id='non-unique-id'][2]
I mean the second element with the same resource-id.
I'd recommend avoiding XPath in mobile automation, since it is the most time-consuming strategy for finding elements.
If you don't have any other anchors for your elements but you are confident in their order, you can stick to the following approach: the Appium driver can return a list of elements matching the same locator. In the case of the Page Object model, you can do it this way:
@AndroidFindBy(uiAutomator = "resourceIdMatches(\".*whatever\")")
private List<MobileElement> elements;
so, once your page is initialized, you can access an element by index:
elements.get(1).click();
or, in the case of manual management, you can do it this way:
List<MobileElement> elements = driver.findElements(MobileBy.AndroidUIAutomator("resourceIdMatches(\".*whatever\")"));
elements.get(3).click();
Hope this helps.
As far as my understanding goes, you need to select the second element matching the path mentioned: //*[@resource-id='non-unique-id']
To do that, you need to first grab all the elements with the same non-unique resource-id and then get() the one you want. So your code should be:
driver.findElements(By.xpath("//*[@resource-id='non-unique-id']")).get(1).click();
The index of any list starts at 0, so the second element is accessed with index 1.
Hope this helps.
Try the following approach:
(//*[@resource-id='non-unique-id'])[2]
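For completeness, a minimal sketch of using that locator from the Python Selenium bindings (Appium's Python client exposes the same find_element call); note that XPath positional indexes are 1-based, unlike the 0-based List.get() above:

from selenium.webdriver.common.by import By

# (…)[2] selects the second match in document order; XPath indexing
# starts at 1, whereas Java's List.get() starts at 0.
driver.find_element(By.XPATH,
                    "(//*[@resource-id='non-unique-id'])[2]").click()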
HTML with non-unique ids is not a valid HTML document.
So, for the sake of future testability, ask the developers to fix the ids.

Capybara/Selenium: Speeding up Dropdown Selecting?

So this is a bit of a performance question regarding Selenium Webdriver (Chromedriver) and Capybara.
I have some react-select dropdowns with quite a bit of data in them. For some reason the react-selects take a VERY VERY long time to pick out the option in them. The code is pretty simple and I grabbed it from here: https://github.com/JedWatson/react-select/issues/832
But it basically comes down to:
page.find('.Select-control').click
page.find('.Select-option', text: 'the text').click
Thing is, this works fine, but it takes an extremely long time (upwards of a minute per dropdown). Now, in Capybara's defense, these dropdowns have a LOT of options to select from, so I thought selecting the top-most item would be the fastest, but that doesn't seem to affect it.
Does Capybara/Selenium hold the options in a differently sorted list somewhere or something? I'd assume selecting a top option in the dropdown would be faster, but it doesn't seem to be.
Generally, when using the text option, find first finds all the elements that match the locator with the current selector type. In your case the selector type defaults to :css and the locator is .Select-option, so Capybara will find every element with the class Select-option and then go through each of them, comparing the text (and checking visibility) to see what matches. It has to compare all of them to make sure the selector isn't ambiguous.
One way to speed that up would be to use first with a minimum option
page.first('.Select-option', text: 'the text', minimum: 1).click
which can skip some of the text and visibility checking since it doesn't have to worry about ambiguous elements. Another solution would be to skip the text option altogether and write it into an XPath along the lines of
page.find(:xpath, XPath.css('.Select-option')[XPath.string.n.is('the text')]).click # Haven't verified this is 100% correct but it should be close
If you're doing this a lot in your app you may want to consider creating a custom selector for this
Capybara.add_selector(:react_option) do
  xpath do |locator|
    XPath.css('.Select-option')[XPath.string.n.is(locator)]
  end
  # You can add other filters in here - see https://github.com/teamcapybara/capybara/blob/master/lib/capybara/selector.rb
end
which would then allow you to do
page.find(:react_option, 'the text').click
Note, if you can limit the element types it will also make the query more efficient, so if all of the elements are <li> elements you might want to do something like
XPath.css('li.Select-option')[XPath.string.n.is(locator)]

Difference and when to use text=true and get_text()

I am learning the bs4 library. Can someone please explain the difference between text=True and get_text(), and when to use each?
When used with find_all or find, text=True matches text: on its own, find_all(text=True) returns every text string in the document, and combined with a tag name it restricts the search to tags that directly contain a string. get_text(), on the other hand, is a method you call on a tag you have already found, and it returns the text inside it.
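A quick contrast (a minimal sketch with made-up HTML):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p><p></p><b>world</b>", "html.parser")

# text=True on its own returns the text strings themselves.
print(soup.find_all(text=True))        # ['hello', 'world']

# Combined with a name, it keeps only tags that directly contain a string.
print(soup.find_all("p", text=True))   # [<p>hello</p>]

# get_text() is called on a tag you already found and extracts its text.
print(soup.find("b").get_text())       # world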

Beautiful Soup find string and then the nth element down

In Beautiful Soup is it possible to search for a text string and then from there find the nth element down?
Update
I am trying to target the following code field to grab the text. I tried a soup find and find_all; however, I have other pages I want to target that are only slightly different, so I need something really robust.
My Plan
Go to page
Find the string Model name
Find nth element down, in this case the next anchor tag
My code to find string
model_name = soup.find(text='Model name')
print(model_name)
OK, I got it; it's actually really simple. The solution I found gets a little messy, but it works.
All you have to use is the next attribute, so the code looks like this:
model_name = soup.find(text='Model name').next.text
adding as many next attributes as needed until you reach the target.
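If the target is specifically the next anchor tag, find_next can jump straight to it instead of chaining .next calls; a minimal sketch with made-up markup:

from bs4 import BeautifulSoup

html = "<div><span>Model name</span><i>:</i><a href='/models/x1'>X1</a></div>"
soup = BeautifulSoup(html, 'html.parser')

# find_next('a') walks forward from the matched string to the next
# <a> tag, however many nodes sit in between.
model_name = soup.find(text='Model name').find_next('a').text
print(model_name)   # X1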

Add spaces between words in spaceless string

I'm on OS X, and in Objective-C I'm trying to convert
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasible before I actually attempt to write the system, only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to perform a search through all the possible words that make up the collection of letters.
Basically, you chop the first letter off your letter collection and add it to the current word you are forming. If it makes a word (e.g. via a dictionary lookup), add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, you have a full sentence. But you don't have to stop there; keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    if (l.empty())
        return;
    add first letter from l to w;
    if w in dictionary
    {
        add w to s;
        FindWords(sentences, s, empty word, l);
        remove w from s;
    }
    FindWords(sentences, s, w, l);
    put last letter from w back onto l;
}
There are, of course, a number of optimizations you could perform to make it go faster, for instance checking whether the current word is the stem (prefix) of any word in the dictionary, so dead ends can be abandoned early. But this is the basic approach that will give you all possible sentences.
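A minimal runnable sketch of the same recursion in Python, with a toy dictionary and none of the pruning optimizations:

def find_sentences(letters, dictionary, prefix=None):
    """Return every way of splitting `letters` into dictionary words."""
    if prefix is None:
        prefix = []
    if not letters:                      # all letters consumed: full sentence
        return [" ".join(prefix)]
    sentences = []
    word = ""
    for i, ch in enumerate(letters):     # grow the current word letter by letter
        word += ch
        if word in dictionary:           # it makes a word: recurse on the rest
            sentences += find_sentences(letters[i + 1:], dictionary,
                                        prefix + [word])
    return sentences

words = {"bob", "ate", "a", "tea", "green", "apple"}
print(find_sentences("bobateagreenapple", words))
# ['bob a tea green apple', 'bob ate a green apple']

Note that it finds both readings mentioned in the next answer, which is exactly why a ranking step is needed on top of the enumeration.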
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool over the PDF and correct its output than it would to correct what this system might give you, let alone to program the system itself.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up the most characters (preferably all of them) and then favor the ones with the longest words, because 2-, 3-, or 4-character words can often come up by chance from leftover characters. Most of the time this provides the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with 50,000 words).
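As a rough illustration of that ranking idea (my own sketch, not the CodeProject code), you can score each candidate split first by how many characters it consumes, then by the sum of squared word lengths so that longer words win:

def score(words):
    # First criterion: total characters consumed (prefer complete splits).
    # Second criterion: squared lengths reward splits built from longer words.
    return (sum(len(w) for w in words),
            sum(len(w) ** 2 for w in words))

candidates = [
    ["bob", "ate", "a", "green", "apple"],   # all 17 chars, long words
    ["bob", "at", "ea", "green", "apple"],   # all 17 chars, shorter words
    ["bob", "ate", "agree"],                 # only 11 chars consumed
]
best = max(candidates, key=score)
print(" ".join(best))   # bob ate a green apple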