How to customize stanfordNLP tokenizer to ignore asterisk character? - tokenize

I'm using the Stanford CoreNLP library's tokenizer as part of my project. For the following string
abc def *ghi
It gives the following tokens: abc, def, *, ghi
But I want the asterisk to stay attached to the word, as in abc, def, *ghi. How can I customize PTBTokenizer to achieve this?

Please see my answer for this question:
How to set delimiters for PTB tokenizer?
You can set the tokenizer to tokenize on white space only:
(command-line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");
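To illustrate what whitespace-only tokenization does to the string from the question, here is a minimal Python sketch. It is not CoreNLP's implementation, and `whitespace_tokenize` is a hypothetical name used only for this example.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace only (illustration, not CoreNLP code)."""
    # str.split() with no arguments splits on whitespace and nothing else,
    # so punctuation such as '*' stays attached to the neighbouring word.
    return text.split()


tokens = whitespace_tokenize("abc def *ghi")
print(tokens)
```

The trade-off is that every other punctuation mark also stays attached to its word, since nothing but whitespace separates tokens.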

Related

add regex to stop words in spacy

Is there a way to add regex patterns to stop words in spaCy or NLTK?
Regular expressions are patterns used to match character combinations in strings. So I think that, for spaCy or NLTK to recognize them as stop words, we would need to map many specific cases to a pattern.
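One way to get this effect without touching either library is to filter tokens after tokenization. The following is a minimal sketch that assumes the tokens are already available as plain strings. `STOP_WORDS`, `STOP_PATTERNS`, `is_stop` and `remove_stops` are hypothetical names, and the word list and patterns are placeholders, not spaCy or NLTK data.

```python
import re

# Placeholder literal stop words (assumption, not a library list).
STOP_WORDS = {"the", "a", "of"}

# Placeholder regex stop patterns: pure digits and URLs.
STOP_PATTERNS = [re.compile(r"\d+"), re.compile(r"https?://\S+")]


def is_stop(token):
    """Return True if the token is a literal stop word or fully matches a pattern."""
    lowered = token.lower()
    if lowered in STOP_WORDS:
        return True
    # fullmatch requires the whole token to match, not just a prefix.
    return any(p.fullmatch(lowered) for p in STOP_PATTERNS)


def remove_stops(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if not is_stop(t)]


print(remove_stops(["The", "price", "of", "42", "apples"]))
```

The same `is_stop` predicate could be applied to the token strings produced by any tokenizer.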

awk pattern to match an XML PI at the start of a line

I have an XML document containing a number of XML Processing Instructions which are of the form:
<?cpdoc something?>
I am trying to match them in awk with the pattern
/^\<\?cpdoc/
but it's not returning anything. If I remove the ^ anchor, it works (but I have other similar PIs which don't start a line which I don't want matched).
It looks as if it's being confused by the \<\? but why is it ignoring the line-start anchor?
Don't parse XML with regex, use a proper XML/HTML parser.
theory :
According to compiler theory, XML can't be parsed with regular expressions, which are based on finite state machines. Because of XML's hierarchical structure, you need a pushdown automaton and an LALR grammar handled by a tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint
xmlstarlet
saxon-lint (my own project)
Check: Using regular expressions with HTML tags
Example using xpath :
xmllint --xpath '//processing-instruction()' file.xml
Solution by OP and explanation by Ed Morton.
It works if the less-than is not escaped, as otherwise it's a word boundary. So instead of:
\<\?
I should use literal:
<\?
This is because we can't just go escaping any character and hoping for the best, we have to know which characters are metacharacters and then escape them if we want them treated as literal.
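The same idea (escape only the real metacharacter, here the question mark) can be sketched in Python's `re` module. This is an illustration in a different regex dialect, not awk, and `lines_starting_with_pi` and the sample text are made up for the example.

```python
import re

# '<' is an ordinary character and needs no escaping;
# '?' is a quantifier metacharacter, so it is escaped.
PI_AT_LINE_START = re.compile(r"^<\?cpdoc")


def lines_starting_with_pi(text):
    """Return the lines that begin with a <?cpdoc processing instruction."""
    # match() anchors at the start of each line we pass in.
    return [line for line in text.splitlines() if PI_AT_LINE_START.match(line)]


sample = "<?cpdoc something?>\n<p>text <?cpdoc inline?></p>\n<?other x?>"
print(lines_starting_with_pi(sample))
```

Only the first sample line is returned, because the second contains the PI mid-line and the third is a different PI.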

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters not words. The file also includes punctuation and western characters.
I am reading in an example file of traditional Chinese characters. The file contains a large sample of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split it into a list of characters, and check whether each character already exists in a list or dictionary of characters. If the character does not exist there yet, I will add it; if it does, I will increase the counter for that character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will take more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
I mean you want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator? :
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
To successfully split a line of traditional Chinese characters, I just had to know the proper syntax for handling encoded characters. Pretty basic:
my_new_list = list(unicode(LINES[0].decode('utf8')))
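The counting strategy described in the question can be sketched in Python 3, where iterating over a `str` already yields one character at a time, so no manual decoding step is needed once the file is opened with the right encoding. This is a minimal sketch using `collections.Counter` in place of the two parallel lists. `count_characters` is a hypothetical name, and skipping whitespace is an assumption about what should be counted.

```python
from collections import Counter


def count_characters(lines):
    """Count how often each non-whitespace character occurs in the lines."""
    counts = Counter()
    for line in lines:
        # Iterating a Python 3 str yields one Unicode character at a time.
        counts.update(ch for ch in line if not ch.isspace())
    return counts


# Two fragments taken from the sample text in the question.
sample = ["首映鼓掌10分鐘", "完場後獲全場鼓掌10分鐘"]
counts = count_characters(sample)
print(counts.most_common(5))
```

For a real file, the lines could come from `open('Chinese_example.txt', encoding='utf-8')`. Punctuation and western characters are counted too unless they are filtered out separately.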

Using $ as delimiter in StringTemplate from ANTLR rewriter grammars

I'm trying to write an ANTLR3 grammar that generates HTML output using StringTemplate. To avoid having to escape all the HTML tags in the template rules (e.g. \<p\><variable>\</p\>), I'd prefer to use dollar as the delimiter for StringTemplate (e.g. <p>$variable$</p>).
While the latter seems to be the default when StringTemplate is used on its own, the parser code generated by ANTLR always uses AngleBracketTemplateLexer when initializing StringTemplate.
How can I get ANTLR to generate code using DefaultTemplateLexer (i.e. the variant that uses dollar as the delimiter)?
Try setting the DefaultTemplateLexer.class in the StringTemplateGroup like this:
StringTemplateGroup group = new StringTemplateGroup(reader,
DefaultTemplateLexer.class);

Inserting special character in Redmine wiki page

I'm using Redmine and I'm trying to insert the special character | inside a table in a Redmine wiki page. I don't want this character to be parsed as a column separator.
I've achieved this by doing a <code>|</code> around this character, but I don't want to use the code tag, since this character will gain code attributes, namely the courier new font.
Is there a tag for displaying plain text and avoid the parsing from the Redmine wiki engine?
I'm reading the redmine wiki formatting documentation but it is very poor and points me to textile formatting which doesn't seem to include this special case.
I could not get the exclamation point to work, but this works for me.
<notextile>|</notextile>
The only way I found to overcome this problem is to insert the HTML code for the character I want to isolate. For instance, instead of typing an underscore and making the wiki think I'm starting an italic word, I have to put the HTML code for it:
&#95;
Example:
this is a &#95;test - _text comment here_
Without the underscore code (&#95;), the Redmine wiki engine will think that the italic starts at test, and this is the wrong result:
this is a test - text comment here
So, putting the ASCII code for the underscore corrects this problem. Unfortunately, this parsing is not very clever (yet I hope).
Here is a link for an ASCII code table with many symbols and characters:
http://www.ascii.cl/htmlcodes.htm
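Instead of looking each code up in a table, the numeric HTML code for a character can be computed from its code point. This is a minimal sketch. `to_html_entity` and `escape_chars` are hypothetical helper names, and whether the wiki renders the result as intended depends on the Redmine text formatter in use.

```python
def to_html_entity(ch):
    """Return the decimal HTML character reference for a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # ord() gives the code point, e.g. 95 for '_' and 124 for '|'.
    return "&#{};".format(ord(ch))


def escape_chars(text, chars="|_"):
    """Replace every character listed in `chars` with its HTML code."""
    return "".join(to_html_entity(c) if c in chars else c for c in text)


print(to_html_entity("_"))
print(escape_chars("a|b"))
```

Note that `escape_chars` replaces every occurrence, so underscores that are meant as italic markers would have to be left out of the input or restored afterwards.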