Need to keep <br> in text block tags while using import.io - import.io

I'm trying to do something relatively straightforward. I'm scraping text, which so far I have had no problem grabbing, but I need to keep the <br> tags because whitespace analysis is an important part of the dataset.
Is there a way to keep the <br> tags so I can turn them into \n\r later on?
Example:
<p>
<span>Some text.</br></span>
<a>Some more text.<br></a>
<span>Some more more text.<br></span>
</p>
I need : Some text.<br>Some more text.<br>Some more more text.<br>
Right now I get: Some text. Some more text. Some more more text.
Advice?

The only way is to get the HTML of your selection: change the column type from Text to HTML. There is no way to get only the text plus the <br> tags.
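Once the column is exported as HTML, the <br> tags can be turned into line breaks in a post-processing step. The following is a minimal sketch using only Python's standard-library html.parser. The names BrToNewline and html_to_text_keep_breaks are hypothetical, chosen for illustration, and the sketch assumes each exported cell holds an HTML fragment like the one in the question.

```python
from html.parser import HTMLParser


class BrToNewline(HTMLParser):
    """Collect text content, emitting a newline for every <br> variant."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_startendtag(self, tag, attrs):
        # Self-closing <br/>: emit one newline, not two.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # The question's sample contains a stray </br>; treat it as a break too.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text_keep_breaks(fragment):
    """Strip all tags from fragment but keep <br> as a newline character."""
    parser = BrToNewline()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


sample = (
    "<p><span>Some text.</br></span>"
    "<a>Some more text.<br></a>"
    "<span>Some more more text.<br></span></p>"
)
result = html_to_text_keep_breaks(sample)
```

If the exported HTML is spread over several lines, the whitespace between tags is kept as text as well, so it may need normalizing before the analysis.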

Related

Removing HTML Tags from Big Query

I have a column in my table that stores a paragraph like the one below:
<p><img src="https://mywebsite.com/medias/NH2xcoUOfANfFb6l4xNgOFch3dc4TvoX2XBnI6to.jpg" alt="" width="250" height="33"></p><p><span style="font-size: 16pt; font-family: Mali, cursive; font-weight: 500;">My beautiful text is here. Show me without tags, please.</span> </p>
I want to remove all the HTML tags and, if possible, replace each HTML image with the text (Image).
So my expected output would look like this:
(Image) My beautiful text is here. Show me without tags, please.
OR just
My beautiful text is here. Show me without tags, please.
Thank you so much.
Try the naive approach below:
select html,
  regexp_replace(
    regexp_replace(
      regexp_replace(html, r'<img [^<>]*>', r'(Image) '),
      r'(&)([^&;]*)(;)', r'<\2>'
    ),
    r'\<[^<>]*\>', ''
  ) as text
from your_table
Applied to the sample data in your question, the output is:
(Image) My beautiful text is here. Show me without tags, please.
As you can see, the first step replaces the image tag with the text (Image). The second step handles HTML entities by enclosing them in <...>, so that, for example, &nbsp; becomes <nbsp>. The final step removes everything between and including < and >.
Note: this is a simplistic approach and might not work for more complex HTML.
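To experiment with the three substitutions outside BigQuery, they can be mirrored in Python. This is only a sketch: it uses Python's re module rather than BigQuery's regular expression engine (for patterns this simple the two should behave alike, but that is an assumption), and the function name strip_html is hypothetical.

```python
import re


def strip_html(html):
    """Apply the same three substitutions as the query above, in order."""
    # Step 1: replace each <img ...> tag with the literal text "(Image) ".
    text = re.sub(r"<img [^<>]*>", "(Image) ", html)
    # Step 2: wrap HTML entities such as &nbsp; in <...> so step 3 removes them.
    text = re.sub(r"&([^&;]*);", r"<\1>", text)
    # Step 3: remove everything between and including < and >.
    return re.sub(r"<[^<>]*>", "", text)


sample = (
    '<p><img src="https://mywebsite.com/medias/NH2xcoUOfANfFb6l4xNgOFch3dc4TvoX2XBnI6to.jpg" '
    'alt="" width="250" height="33"></p>'
    '<p><span style="font-size: 16pt; font-family: Mali, cursive; font-weight: 500;">'
    "My beautiful text is here. Show me without tags, please.</span> </p>"
)
cleaned = strip_html(sample).strip()
```

One limitation worth knowing: step 2 treats any run from & to the next ; as an entity, so ordinary text such as "Fish & chips; peas" loses everything between the two characters.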

Selenium XPath find element where second text child element contains certain text (use contains on array item)

The page contains a multi-select dropdown (similar to the one below)
The HTML looks like this:
<div class="button-and-dropdown-div">
<button class="Multi-Select-Button">multi-select button</button>
<div class="dropdown-containing-options">
<label class="dropdown-item">
<input class="checkbox">
"
Name
"
</label>
<label class="dropdown-item">
<input class="checkbox">
"
Address
"
</label>
</div>
</div>
After testing in firefox developer tools, I was finally able to figure out the xPath needed in order to get the text for a certain label ...
The XPath statement below returns the text "Phone":
$x("(//label[@class='dropdown-item'])[4]/text()[2]")
Although the UI appears to show a single piece of text, each label element actually contains two text nodes. The first is always empty and the second contains the actual text (as shown in the image below when inspecting the element in the console window of the Firefox developer tools):
Question:
How do I modify the XPath shown above in order to use in Selenium's FindElement?
Driver.FindElement(By.XPath("?"));
I know how to use the contains function, but apparently not with more complex XPath statements. I was pretty sure one of the statements below would work, but neither did (the developer tools complain of a syntax error):
$x("(//label[@class='dropdown-item' and text()[2][contains(., 'Name')]]")
$x("(//label[@class='dropdown-item' and contains(text()[2], 'Name')]")
I am using the 'contains' in order to avoid white-space conflicts.
Additional note for learning purposes (useful for XPath debugging):
In case anyone new to XPath comes across this, I wanted to show what the data structure of these label objects looks like. You can explore the data structure of objects within your webpage by using the Console window in the Firefox developer tools (F12). As you can see, the label element contains three child nodes: a text node that is empty, then the input checkbox, then another text node that holds the actual text (not ideal). In the picture below, you can see the part of the webpage that corresponds to the label data structure.
If you are looking to find the element that contains "Name" given the HTML above, you can use
//label[@class='dropdown-item'][contains(.,'Name')]
I finally got it to work. The Firefox developer tools were correct that there was a syntax problem with the XPath strings: both attempts began with an opening parenthesis that was never closed.
The following XPath string finally returned the desired result:
$x("//label[@class='dropdown-item' and contains(text()[2], 'Name')]")
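The label structure described in the question (an empty text node, the checkbox, then the text node with the real content) can also be explored offline. The sketch below uses Python's standard-library xml.etree.ElementTree, which does not support text() or contains() in its XPath subset, so it reads the text nodes through the .text and .tail attributes instead. The markup is a well-formed stand-in for the page (the input elements are self-closed so that it parses as XML), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, XML-well-formed version of the dropdown markup from the question.
MARKUP = """
<div class="dropdown-containing-options">
  <label class="dropdown-item">
    <input class="checkbox"/>
    Name
  </label>
  <label class="dropdown-item">
    <input class="checkbox"/>
    Address
  </label>
</div>
"""


def label_text_nodes(label):
    """Return (text before the input, text after the input) for one label."""
    checkbox = label.find("input")
    return (label.text or "", checkbox.tail or "")


def find_label_containing(root, needle):
    """Return the first label whose second text node contains needle."""
    for label in root.iter("label"):
        _, after_input = label_text_nodes(label)
        if needle in after_input:
            return label
    return None


root = ET.fromstring(MARKUP)
labels = list(root.iter("label"))
first_node, second_node = label_text_nodes(labels[0])
```

Here first_node holds only whitespace and second_node holds "Name" surrounded by whitespace, which is why a substring match is more forgiving than an exact comparison.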

how to get text from text node without getting content of siblings

I have the following code:
<div>
<p>some paragraph</p>
some nasty text that I need
<span>something else</span>
</div>
Now I need to get only the text "some nasty text that I need". How can I do it using only XPath 1.0? Is it possible?
How to do it using only XPath 1.0? Is it possible?
Yes - and it's rather trivial:
/div/text()
I wonder why you did not try that? All other text nodes are either in a p or span element and should not cause you any trouble.
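To see which nodes /div/text() selects, the same idea can be reproduced with Python's standard library. This is a sketch under one stated assumption: xml.etree.ElementTree has no text() step in its XPath subset, so the helper below (a hypothetical name, direct_text_nodes) gathers the equivalent nodes from the element's .text and each child's .tail.

```python
import xml.etree.ElementTree as ET


def direct_text_nodes(element):
    """Text nodes that are direct children of element, like element/text()."""
    nodes = [element.text or ""]
    nodes.extend(child.tail or "" for child in element)
    return nodes


div = ET.fromstring(
    "<div>\n"
    "<p>some paragraph</p>\n"
    "some nasty text that I need\n"
    "<span>something else</span>\n"
    "</div>"
)
all_nodes = direct_text_nodes(div)
wanted = [node.strip() for node in all_nodes if node.strip()]
```

Note that the div has three direct text nodes here, two of which contain only whitespace. The XPath expression returns those as well, so trimming or filtering is usually still needed.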

How to verify text across HTML elements in Selenium

Given the following code, how would I verify the text within using Selenium?
<div class='my-text-block'>
<p>My first paragraph of text</p>
<p>My second paragraph of text</p>
</div>
I want to capture all the text in one verifyText statement:
My first paragraph of text
My second paragraph of text
Is it possible?
Since you've tagged this with selenium-webdriver, I'm assuming you want a code example, but because you haven't stated which language you're using, I'll give you a Python example. It should be easy to translate to a different language if needed.
from selenium.webdriver.common.by import By
assert driver.find_element(By.CLASS_NAME, "my-text-block").text == "What I expect it to be"
The text attribute of a WebElement object contains all the visible text within that element and all of its child elements.
And some lovely docs, of course.

dijit.InlineEditBox with highlighted html

I have some dijit.InlineEditBox widgets and now I need to add some search highlighting over them, so I return the results with a span with class="highlight" over the matched words. The resulting code looks like this :
<div id="title_514141" data-dojo-type="dijit.InlineEditBox"
data-dojo-props="editor:\'dijit.form.TextBox\', onFocus:titles.save_old_value,
onChange:titles.save_inline, renderAsHtml:true">Twenty Thousand Leagues <span
class="highlight">Under</span> the Sea</div>
This looks as expected, however, when I start editing the title the added span shows up. How can I make the editor remove the span added so only the text remains ?
In this particular case the titles of the books have no html in them, so some kind of full tag stripping should work, but it would be nice to find a solution (in case of short description field with a dijit.Editor widget perhaps) where the existing html is left in place and only the highlighting span is removed.
Also, if you can suggest a better way to do this (inline editing and word highlighting) please let me know.
Thank you !
How will this affect your displayed content in the editor? It rather depends on the contents you allow into the field - you will need a rich-text editor (huge footprint) to handle html correctly.
This regular expression trims away the tags (the g flag makes it replace every match, not only the first):
this.value = this.displayNode.innerHTML.replace(/<[^>]*>/g, '');
Here's a running example of the below code: fiddle
<div id="title_514141" data-dojo-type="dijit.InlineEditBox"
data-dojo-props="editor:\'dijit.form.TextBox\', onFocus:titles.save_old_value,
onChange:titles.save_inline, renderAsHtml:true">Twenty Thousand Leagues <span
class="highlight">Under</span> the Sea
<script type="dojo/method" event="onFocus">
// Strip every tag; the g flag replaces all matches, not just the first.
this.value = this.displayNode.innerHTML.
replace(/<[^>]*>/g, '');
this.inherited(arguments);
</script>
</div>
The renderAsHtml attribute only trims 'off one layer', so embedded HTML will still be HTML, as far as I know. With the above you should be able to 1) override the onFocus handling, 2) set the editable value yourself and 3) call the 'old' onFocus method.
Alternatively, since you have already set 'titles.save_*' in the props, use dojo/connect instead of dojo/method, but you need to get there first, so to speak.
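For the case raised in the question where existing HTML should stay in place and only the highlighting span should go, the pattern can be narrowed to that one span. The sketch below shows the idea in Python so it can be tried in isolation; the same pattern should carry over to a JavaScript replace call with the g flag. The function name unwrap_highlights is hypothetical, and the sketch assumes the highlight span wraps plain text only (no nested tags) and carries exactly the attribute class="highlight".

```python
import re

# Matches only a highlight span whose content is tag-free text.
HIGHLIGHT_SPAN = re.compile(r'<span\s+class="highlight"\s*>([^<]*)</span>')


def unwrap_highlights(html):
    """Remove highlight spans but keep their text and all other markup."""
    return HIGHLIGHT_SPAN.sub(r"\1", html)


title = 'Twenty Thousand Leagues <span class="highlight">Under</span> the Sea'
plain_title = unwrap_highlights(title)

described = '<b>Twenty</b> <span class="highlight">Under</span> <i>the Sea</i>'
kept_markup = unwrap_highlights(described)
```

Spans with other classes, or highlight spans that contain further tags, are left untouched by this pattern.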