How to tokenize html tags with spacy? - spacy

I need to tokenize HTML text with spaCy, or merge the tags back together after tokenization. They can be any HTML tags, e.g.:
<br> <br/> <br > <n class="ggg">
There is an example of tag merging in the documentation, but it doesn't work with all types of tags. If I write a rule like:
[{'ORTH': '<'}, {}, {'ORTH': '>'}]
It will join some tags:
<br><p>
Or split them apart like:
<
n
class="ggg
"
>
I have also tried to write a custom tokenizer, but I had problems with spaces.
I want every HTML tag to be a single, separate token, e.g.:
<br>
<br >
<n class="ggg">
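One possible approach, sketched here purely as an illustration (not taken from the question or any answer; the model name and sample text are assumptions): let the default tokenizer run, find the <...> spans with a regular expression over the raw text, and merge each span into a single token with Doc.retokenize:
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline works; this model name is an assumption
doc = nlp('Hello <br> world <n class="ggg"> bye')

# Find every <...> tag in the raw text and merge the matching span into one token.
with doc.retokenize() as retokenizer:
    for match in re.finditer(r"<[^<>]+>", doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None:  # None when the offsets don't align with token boundaries
            retokenizer.merge(span)

print([token.text for token in doc])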

IMHO, removing the HTML tags and converting to plain text is the correct way to go, rather than making html tags 'stop words', because some of those tags are actually valid words that can appear in text and should NOT be ignored (e.g., <body> vs body).
If you have a construct like
<span>word</span><span>word</span>
It renders as wordword in a user agent and should in fact be interpreted as a single word. For example, one might give you an HTML page containing something like:
<p><strong>S</strong>oup .... </p>
This obviously renders as 'Soup' and should be taken as the word soup and not as the words s and oup.
Now, if for whatever reason you must assume that any HTML tag boundary is a word separator (wrong, in most cases), you should do the following: use an HTML stream tokenizer, e.g., libxml2 and write handlers for startElement and characters only. The former should output a single space and the latter should output the characters as it gets them. This will convert your HTML input to plain text (just like an HTML tag remover would do), but also add a space after each element tag, so <span>word</span><span>word</span> would get converted to: "(space)word(space)word". This might add multiple spaces when nested tags are present, but you can easily deal with this when you split the cleaned-up text into words for further processing.
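As an illustration of that approach (my sketch; the answer mentions libxml2, but Python's standard-library html.parser exposes the same start-tag/character-data callbacks):
from html.parser import HTMLParser

class TagBoundarySpacer(HTMLParser):
    # Emit a space for every start tag and pass character data through unchanged.
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")   # each start tag becomes a word separator
    def handle_data(self, data):
        self.parts.append(data)  # keep the text content as-is
    def text(self):
        return "".join(self.parts)

p = TagBoundarySpacer()
p.feed('<span>word</span><span>word</span>')
print(p.text().split())  # ['word', 'word'] -- splitting also collapses the extra spaces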

Related

The separator between keywords in PDF meta data

I cannot find any "official" documentation on whether the keywords and keyword phrases in the metadata of a PDF file are to be separated by a comma or by a comma followed by a space.
The following example demonstrates the difference:
keyword,keyword phrase,another keyword phrase
keyword, keyword phrase, another keyword phrase
Any high-quality references?
The online sources I found are of low quality.
E.g., an Adobe press web page says "keywords must be separated by commas or semicolons", but in its example we see a semicolon followed by a space before the first keyword and a semicolon followed by a space between every two neighboring keywords. We don't see keyword phrases in the example.
The keywords metadata field is a single text field - not a list. You can choose whatever is visually pleasing to you. The search engine which operates on the keyword data may have other preferences, but I would imagine that either comma or semicolon would work with most modern search engines.
Reference: PDF 32000-1:2008 on page 550 at 1. Adobe; 2. The Internet Archive
ExifTool, for example, parses for comma-separated values, but if it does not find a comma it will split on whitespace:
# separate tokens in comma or whitespace delimited lists
my @values = ($val =~ /,/) ? split /,+\s*/, $val : split ' ', $val;
foreach $val (@values) {
    $et->FoundTag($tagInfo, $val);
}
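For illustration only, the same splitting logic expressed in Python (my sketch, not part of ExifTool; the function name is made up):
import re

def split_keywords(val):
    # split on commas when the value contains any, otherwise on whitespace
    return re.split(r",+\s*", val) if "," in val else val.split()

print(split_keywords("keyword,keyword phrase,another keyword phrase"))
print(split_keywords("keyword another keyword"))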
I don't have any "high-quality references", but when I generate a PDF using LaTeX I do it in the following way:
I add the following line to my main.tex:
\usepackage[a-1b]{pdfx}
Then I write a file main.xmpdata and add these lines:
\Title{My Title}
\Author{My Name}
\Copyright{Copyright \copyright\ 2018 "My Name"}
\Keywords{KeywordA\sep
KeywordB\sep
KeywordC}
\Subject{My Short Description}
After generating the PDF with pdflatex, I used a Python script based on "pdfminer.six" to extract the metadata:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fp = open('main.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)
if 'Metadata' in doc.catalog:
    metadata = resolve1(doc.catalog['Metadata']).get_data()
    print(metadata)  # The raw XMP metadata
The part with the keywords then looks like this:
...<rdf:Bag><rdf:li>KeywordA</rdf:li>\n <rdf:li>KeywordB...
Looking at the properties of "main.pdf" with "Adobe Acrobat Reader DC", I find the following entry in the Keywords section:
;KeywordA;KeywordB;KeywordC
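A side note of mine, not part of the original answer: the keywords Acrobat shows in its properties dialog may come from the document information dictionary rather than from the XMP packet; with pdfminer.six that dictionary is exposed as doc.info, roughly like this:
# doc is the PDFDocument created in the snippet above
for info in doc.info:               # list of document-information dictionaries
    print(info.get('Keywords'))     # raw /Keywords value, if the entry is present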
CommonLook claim to be "a global leader in electronic document accessibility, providing software products and professional services enabling faster, more cost-efficient, and more reliable processes for achieving compliance with the leading PDF and document accessibility standards, including WCAG, PDF/UA and Section 508."
They provide the following advice on PDF metadata:
Pro Tip: When you’re entering Keywords into the metadata, separate
them with semicolons as opposed to commas.
although they give no further reasoning as to why this is the preferred choice.

lxml clean breaks href attribute

import lxml.html.clean as clean
cleaner = clean.Cleaner(style=True, remove_tags=['div','span',], safe_attrs_only=['href',])
text = cleaner.clean_html('link')
print text
prints
link
how to get:
link
i.e href in normal encoding?
clean does the right thing -- the string in parentheses should be properly encoded, and the seemingly garbled thing is the proper encoding.
You might not know it, but Cyrillic domain names don't exist at the DNS level -- there's a complex system (IDNA, using Punycode) to map them onto "allowed" ASCII characters.
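A small illustration of that mapping (my addition; the Cyrillic label below is only an example, not the domain from the question):
# Non-ASCII domain labels are mapped to an ASCII "xn--..." (Punycode) form;
# this is the "complex system" referred to above.
label = "пример"
ascii_label = label.encode("idna")
print(ascii_label)                  # the xn--... Punycode form
print(ascii_label.decode("idna"))   # round-trips back to the original label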

Escape non HTML tags from plain text

I need to escape non-HTML tags to convert plain text into valid HTML. How can I do that?
I'm using Ruby, but I may use an external tool.
You can use CGI::escapeHTML to escape all special HTML characters so you can safely paste the content into HTML as text.
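To illustrate what such escaping does (my addition, shown with Python's standard-library html.escape rather than Ruby, since CGI::escapeHTML handles the same kind of special characters):
from html import escape

# &, <, > and quotes become entities, so the text can be embedded
# in HTML without being interpreted as markup.
print(escape('5 < 6 & <b> is "bold"'))
# 5 &lt; 6 &amp; &lt;b&gt; is &quot;bold&quot;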

multiline code block in markdown adds unwanted tabs

Today I'm implementing my page in nanoc (Haml templates) and I want to write some posts in Markdown, but when it comes to multiline code blocks something weird is happening - the second line of the code block gets additional tabs. I've tried multiple Markdown syntaxes, such as:
//double tab wrapping
line 1 is fine
line 2 is wrapping (don't know why!)
and
~~~
//tilde code wrapping
line 1 is fine
line 2 is wrapping
~~~
And both solutions give me a result something like this:
line 1 is fine
line 2 is wrapping
Inspecting the elements in the browser shows that there is no additional padding - this whitespace is definitely made of tabs.
Can someone help me with this? Maybe I'm doing something wrong?
When you use = in Haml to include the results of a script, Haml will re-indent the inserted text so that it matches the indentation of where it is included. For example, if you have Haml that looks something like this:
%html
  %body
    .foo
      = insert_something
and insert_something returns some HTML like this:
<p>
This is possibly generated from Markdown.
</p>
then the resulting HTML will look like this:
<html>
  <body>
    <div class='foo'>
      <p>
        This is possibly generated from Markdown.
      </p>
    </div>
  </body>
</html>
Note how the p element is indented to match its position in the document.
Normally this doesn't matter, because of the way whitespace in HTML is collapsed. However, there are HTML elements where whitespace is significant, in particular pre.
What it looks like is happening here is that your Markdown is generating something like
<pre><code>line 1 is fine
line 2 is wrapping
</code></pre>
and when it is included in your Haml file (I’m guessing you’re using Haml layouts with = yield to include the Markdown) it is being indented and the whitespace is showing up when you view the page. Note how the first line is immediately after the opening tags so there is no extra whitespace.
There are a couple of ways to fix this. If you set the :ugly option then Haml won’t re-indent blocks like this (sorry I don’t know how you set Haml options in Nanoc).
You could also use the find_and_preserve helper method. This replaces all newlines in whitespace-sensitive tags with the HTML entity &#x000A;, so that they won't be affected by the extra whitespace when indented:
= find_and_preserve(yield)
Haml provides an easy way to use find_and_preserve; ~ works the same as =, except that it runs find_and_preserve on the result, so you can do:
~ yield

How to handle new line in handlebar.js

I am using Handlebars.js in my Rails jQuery Mobile application.
I have a JSON-returned value data = "hi\n\n\n\n\nb\n\n\n\nhow r u"
which, when used in the .hbs file as {{data}}, shows up as "hi how r u" and not with the actual new lines inserted.
Please suggest a solution.
A pre tag helps me.
Handlebars doesn't mess with newlines in your data unless you have registered a helper which does something with them. A good way of dealing with newlines in HTML, without converting them to br tags, is to use the CSS property white-space when rendering the Handlebars template in HTML. You can set its value to pre-line.
Read the related documentation on MDN
Look at the source of the generated file - your newline characters are probably there, HTML simply does not render newline characters as new lines.
You can insert a linebreak with <br />
However, it looks like you're trying to format the position of your lines using newline characters, which technically should be done by wrapping your lines in <p> or <div> tags and styling with CSS.
Simply use the CSS property white-space and set the value as pre-line
For example:
<p style="white-space: pre-line">
{{text}}
</p>