How to avoid splitting of word with case - asp.net-core

Below is the text that I want to split:
string content = "Tonight the warning from the Police commissioner as they crack down on anyone who leaves there house without a 'reasonable excuse'
(TAKE SOT)"
string[] contentList = content.split(' ');
I do not want to split the word i.e (TAKE SOT), if there is a text within parentheses and in an upper case then how to avoid splitting of the part.
Thanks

The Split method can take two parameters. The first one delimits the substrings in this instance; the second one is the maximum number of elements expected in the array.
The following code snippet works for me; you can refer to it.
string content = "Tonight the warning from the Police commissioner as they crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT)";
string[] contents = content.Split(" ", content.Substring(0, content.IndexOf("(")).Split(" ").Length);
I use the method Split(String, Int32, StringSplitOptions): everything before the "(" is split into words, and with the count computed above the final array element keeps "(TAKE SOT)" intact.

Later, after doing some research, I found a solution to the problem:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        // Matches an upper-case (possibly multi-word) phrase in parentheses, e.g. (TAKE SOT).
        string BEFORE_AND_AFTER_SOUND_EFFECT = @"\(([A-Z\s]*)\)";
        List<string> wordList = new List<string>();
        string content = "Tonight the warning from the Police commissioner as they (VO) crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT) INCUE:police will continue to be out there.";
        GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
        foreach (var item in wordList)
        {
            Console.WriteLine(item);
        }
    }

    private static void GetWords(string content, List<string> wordList, string BEFORE_AND_AFTER_SOUND_EFFECT)
    {
        if (content.Length > 0)
        {
            Match match = Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT);
            if (match.Success)
            {
                // Split the text before the match into ordinary words.
                string before = content.Substring(0, match.Index).Trim();
                if (before.Length > 0)
                {
                    foreach (var word in before.Split(' '))
                    {
                        wordList.Add(word);
                    }
                }
                // Keep the parenthesized phrase as a single item.
                wordList.Add(match.Value);
                // Recurse on the remaining content after the match.
                GetWords(content.Substring(match.Index + match.Length).TrimStart(), wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
            }
            else
            {
                wordList.Add(content);
            }
        }
    }
}
Thanks.

Related

I have synonym matching working EXCEPT in quoted phrases

Simple synonyms (wordA = wordB) are fine. When there are two or more synonyms (wordA = wordB = wordC ...), phrase matching only works for the first, unless the phrases have proximity modifiers.
I have a simple test case (it's delivered as an Ant project) which illustrates the problem.
Materials
You can download the test case here: mydemo.with.libs.zip (5MB)
That archive includes the Lucene 9.2 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo.zip (9KB)
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant rnsearch
Input
When indexing the documents, the following synonym list is used (permuted as necessary):
note,notes,notice,notification
subtree,sub tree,sub-tree
I have three documents, each containing a single sentence. The three sentences are:
These release notes describe a document sub tree in a simple way.
This release note describes a document subtree in a simple way.
This release notice describes a document sub-tree in a simple way.
Problem
I believe that any of the following searches should match all three documents:
release note
release notes
release notice
release notification
"release note"
"release notes"
"release notice"
"release notification"
As it happens, the first four searches are fine, but the quoted phrases demonstrate a problem.
The searches for "release note" and "release notes" match all three records, but "release notice" only matches one, and "release notification" does not match any.
However if I change the last two searches like so:
"release notice"~1
"release notification"~2
then all three documents match.
What appears to be happening is that the first synonym is being given the same index position as the term, the second synonym has the position offset by 1, the third offset by 2, etc.
I believe that all the synonyms should be given the same position so that all four phrases match without the need for proximity modifiers at all.
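To see what is happening, one can run a sentence through the analyzer and print each token with the position computed from its position increments. This is a minimal sketch (the field name "f" and the synonym file name are illustrative), using the MyAnalyzer shown in the edit below:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class ShowPositions {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new MyAnalyzer("synonyms.list");
        try (TokenStream ts = analyzer.tokenStream("f", "These release notes describe a document sub tree in a simple way.")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            int pos = -1;
            while (ts.incrementToken()) {
                // an increment of 0 stacks the token at the same position as the previous one
                pos += posInc.getPositionIncrement();
                System.out.println(term + " @ " + pos);
            }
            ts.end();
        }
    }
}
If the suspicion above is right, the second and third synonyms will print with positions offset by 1 and 2 instead of stacking at the same position as the original term.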
Edit, here's the source of my analyzer:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.FlattenGraphFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;

public class MyAnalyzer extends Analyzer {

    public MyAnalyzer(String synlist) {
        this.synlist = synlist;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new LowerCaseFilter(src);
        if (synlist != null) {
            result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
            result = new FlattenGraphFilter(result);
        }
        return new TokenStreamComponents(src, result);
    }

    private static SynonymMap getSynonyms(String synlist) {
        boolean dedup = Boolean.TRUE;
        SynonymMap synMap = null;
        SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
        int cnt = 0;
        try {
            BufferedReader br = new BufferedReader(new FileReader(synlist));
            String line;
            try {
                while ((line = br.readLine()) != null) {
                    processLine(builder, line);
                    cnt++;
                }
            } catch (IOException e) {
                System.err.println(" caught " + e.getClass() + " while reading synonym list,\n with message " + e.getMessage());
            }
            System.out.println("Synonym load processed " + cnt + " lines");
            br.close();
        } catch (Exception e) {
            System.err.println(" caught " + e.getClass() + " while loading synonym map,\n with message " + e.getMessage());
        }
        if (cnt > 0) {
            try {
                synMap = builder.build();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
        return synMap;
    }

    private static void processLine(SynonymMap.Builder builder, String line) {
        boolean keepOrig = Boolean.TRUE;
        String[] terms = line.split(",");
        if (terms.length < 2) {
            System.err.println("Synonym input must have at least two terms on a line: " + line);
        } else {
            String word = terms[0];
            String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
            addSyns(builder, word, synonymsOfWord, keepOrig);
        }
    }

    private static void addSyns(SynonymMap.Builder builder, String word, String[] syns, boolean keepOrig) {
        CharsRefBuilder synset = new CharsRefBuilder();
        SynonymMap.Builder.join(syns, synset);
        CharsRef wordp = SynonymMap.Builder.join(word.split("\\s+"), new CharsRefBuilder());
        builder.add(wordp, synset.get(), keepOrig);
    }

    private String synlist;
}
The analyzer includes synonyms when it builds the index, and does not add synonyms when it is used to process a query.
For the "note", "notes", "notice", "notification" list of synonyms:
It is possible to build an index of the above synonyms so that every query listed in the question will find all three documents - including the phrase searches without the need for any ~n proximity searches.
I see there is a separate question for the other list of synonyms "subtree", "sub tree", "sub-tree" - so I will skip those here (I expect the below approach will not work for those, but I would have to take a closer look).
The solution is straightforward, and it's based on a realization that I was (in an earlier question) completely incorrect in an assumption I made about how to build the synonyms:
You can place multiple synonyms of a given word at the same position as the word, when building your indexed data. I incorrectly thought you needed to provide the synonyms as a list - but you can provide them one at a time as words.
Here is the approach:
My analyzer:
Analyzer analyzer = new Analyzer() {
    @Override
    protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream tokenStream = source;
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        tokenStream = new SynonymGraphFilter(tokenStream, getSynonyms(), ignoreSynonymCase);
        tokenStream = new FlattenGraphFilter(tokenStream);
        return new Analyzer.TokenStreamComponents(source, tokenStream);
    }
};
The getSynonyms() method used by the above analyzer, using the note,notes,notice,notification list:
private SynonymMap getSynonyms() {
    // de-duplicate rules when loading:
    boolean dedup = Boolean.TRUE;
    // include original word in index:
    boolean includeOrig = Boolean.TRUE;
    String[] synonyms = {"note", "notes", "notice", "notification"};
    // build a synonym map where every word in the list is a synonym
    // of every other word in the list:
    SynonymMap.Builder synMapBuilder = new SynonymMap.Builder(dedup);
    for (String word : synonyms) {
        for (String synonym : synonyms) {
            if (!synonym.equals(word)) {
                synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig);
            }
        }
    }
    SynonymMap synonymMap = null;
    try {
        synonymMap = synMapBuilder.build();
    } catch (IOException ex) {
        System.err.print(ex);
    }
    return synonymMap;
}
I looked at the indexed data by using org.apache.lucene.codecs.simpletext.SimpleTextCodec, to generate human-readable indexes (just for testing purposes):
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
This allowed me to see where the synonyms were inserted into the indexed data. So, for example, taking the word note, we see the following indexed entries:
term note
  doc 0
    freq 1
    pos 2
  doc 1
    freq 1
    pos 2
  doc 2
    freq 1
    pos 2
So, that tells us that all three documents contain note at token position 2 (the 3rd word).
And for notification we see exactly the same data:
term notification
  doc 0
    freq 1
    pos 2
  doc 1
    freq 1
    pos 2
  doc 2
    freq 1
    pos 2
We see this for all the words in the synonym list, which is why all 8 queries return all 3 documents.
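As a final sanity check (a sketch of my own, not part of the test case; the index path and the field name "contents" are illustrative), the phrase variants can be run directly against that index:
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class PhraseCheck {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (String word : new String[] {"note", "notes", "notice", "notification"}) {
                // an exact phrase query, with no ~n proximity modifier
                TopDocs hits = searcher.search(new PhraseQuery("contents", "release", word), 10);
                System.out.println("\"release " + word + "\": " + hits.totalHits.value + " hits");
            }
        }
    }
}
With the synonyms stacked at the same position as shown above, each of the four phrases should report 3 hits.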

Multiple synonym matching is not working the way I intend or expect, what am I doing wrong?

I am indexing technical documentation and incorporating synonyms at index time, so that users can search with a number of alternative patterns. But only some synonyms seem to be getting into the map.
I have a text file synonyms.list which contains a series of lines like so:
note,notes,notice
subtree,sub-tree,sub tree
My analyzer and synonym map builder (I've removed try and catch wrappers to save space, but they aren't the problem):
public class TechAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
        result = new SynonymGraphFilter(result, getSynonyms(getSynonymsList()), Boolean.TRUE);
        result = new FlattenGraphFilter(result);
        return new TokenStreamComponents(src, result);
    }

    private static SynonymMap getSynonyms(String synlist) {
        boolean dedup = Boolean.TRUE;
        SynonymMap synMap = null;
        SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
        int cnt = 0;
        BufferedReader br = new BufferedReader(new FileReader(synlist));
        String line;
        while ((line = br.readLine()) != null) {
            processLine(builder, line);
            cnt++;
        }
        br.close();
        if (cnt > 0) {
            synMap = builder.build();
        }
        return synMap;
    }

    private static void processLine(SynonymMap.Builder builder, String line) {
        boolean keepOrig = Boolean.TRUE;
        String[] terms = line.split(",");
        if (terms.length > 1) {
            String word = terms[0];
            String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
            for (String syn : synonymsOfWord) {
                addPair(builder, word, syn, keepOrig);
            }
        }
    }

    private static void addPair(SynonymMap.Builder builder, String word, String syn, boolean keepOrig) {
        CharsRef synp = SynonymMap.Builder.join(syn.split("\\s+"), new CharsRefBuilder());
        CharsRef wordp = new CharsRef(word);
        builder.add(wordp, synp, keepOrig);
        // builder.add(synp, wordp, keepOrig); // ? do I need this??
    }
}
I'm not splitting the word in addPair() because (at the moment, anyway) the first term on every line of synonyms.list must be a word, not a phrase.
My first question relates to the comment at the bottom of addPair(): if I am adding (word, synonym) to the map, do I also need to add (synonym, word)? Or is the map commutative? I can't tell, because of the problem I'm having, which is the basis of the next question.
So... the technical documentation being indexed contains some documents which refer to "release notes", and some which refer to "release notices". There are also points described as a "release note". So I would like a search for any of "release note", "release notes", or "release notice" to match all three alternatives.
My code doesn't seem to enable this. If I index a single file which refers to "release notes" I can inspect the generated index with luke and I can see that the index only ever contains one synonym, not two. The same position in the index might have "note" and "notes", or "notes" and "notice", depending on the order of the words in the synonyms.list text file, but it will never have "note", "notes" and "notice".
Obviously I'm not building the map correctly, but the documentation hasn't helped me see what I am doing wrong.
If you've read this far, and can see the flaw in my code, please help me see it too!
Thanks, etc.

Apache PDFBox replace text results in few character missed

Trying to use Apache PDFBox version 2.0.2 for a text replacement (with the code below) produces output in which a few of the characters are not displayed, mostly upper-case characters. For example, with a replacement of "ABCDEFGHIJKLMNOPQRSTUVWXYZ" the output appears in the PDF as "ABCDEF HIJKLM OP RST W Y ". Is this some bug? Or is there a workaround to handle these characters?
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.commons.lang3.StringUtils;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.common.PDStream;

public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }
    PDPageTree pages = document.getDocumentCatalog().getPages();
    for (PDPage page : pages) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand and that is the string to display so lets update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            string = StringUtils.replaceOnce(string, searchString, replacement);
                            cosString.setValue(string.getBytes());
                        }
                    }
                }
            }
        }
        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
        out.close();
    }
    return document;
}
Quoting from
https://pdfbox.apache.org/2.0/migration.html
Why was the ReplaceText example removed?
The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.
You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.
======================================================================
Your description suggests that the initial file has been using a font subset, that is missing the characters G, N, Q, V and Y.
And no, there is no easy workaround. You would have to delete the text you don't want from the content stream, and then append a new content stream with the text you want with a new font at the correct place.
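For the "append" half of that, here is a minimal sketch (the font, coordinates, text and file names are illustrative; deleting the old text and finding the correct position are left out):
import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class AppendText {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            PDPage page = document.getPage(0);
            // AppendMode.APPEND keeps the existing (already edited) content stream
            try (PDPageContentStream cs = new PDPageContentStream(
                    document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                cs.beginText();
                // a full standard-14 font, so no glyphs are missing from a subset
                cs.setFont(PDType1Font.HELVETICA, 12);
                cs.newLineAtOffset(72, 700); // where the removed text used to be
                cs.showText("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
                cs.endText();
            }
            document.save("output.pdf");
        }
    }
}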
P.S. the current PDFBox version is 2.0.7, not 2.0.2.

Lucene: how to preserve whitespaces etc when tokenizing stream?

I am trying to perform a "translation" of sorts of a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary, and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stopwords etc. from the input, so that the output is formatted in the same way as the input instead of ending up as a bare stream of translations. So if my input is
Term1: Term2 Stopword! Term3
Term4
then I want the output to look like
Term1': Term2' Stopword! Term3'
Term4'
(where Termi' is translation of Termi) instead of simply
Term1' Term2' Term3' Term4'
Currently I am doing the following:
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
        PatternAnalyzer.WHITESPACE_PATTERN,
        false,
        WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}
but this, of course, loses all the whitespace etc. How can I modify this to be able to re-insert it into the output? Thanks much!
============ UPDATE!
I tried splitting the original stream into "words" and "non-words". It seems to work fine. Not sure whether it's the most efficient way, though:
public ArrayList<Token> splitToWords(String sIn)
{
    if (sIn == null || sIn.length() == 0) {
        return null;
    }
    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue;
        }
        TokenType type = TokenType.NONWORD;
        if (curIsLetter == true) {
            type = TokenType.WORD;
        }
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    TokenType type = TokenType.NONWORD;
    if (curIsLetter == true) {
        type = TokenType.WORD;
    }
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));
    return list;
}
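And a sketch of how I intend to use it, assuming a hypothetical translate() dictionary lookup and that Token exposes its text and type through getText() and getType():
// translate WORD tokens, pass NONWORD tokens (whitespace, punctuation,
// line breaks) through unchanged, so the original layout is preserved
StringBuilder out = new StringBuilder();
for (Token t : splitToWords(input)) {
    if (t.getType() == TokenType.WORD) {
        out.append(translate(t.getText())); // dictionary lookup
    } else {
        out.append(t.getText());
    }
}
System.out.println(out);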
Well it doesn't really lose whitespace, you still have your original text :)
So I think you should make use of OffsetAttribute, which contains startOffset() and endOffset() of each term into your original text. This is what Lucene uses, for example, to highlight snippets of search results from the original text.
I wrote up a quick test (uses EnglishAnalyzer) to demonstrate:
The input is:
Just a test of some ideas. Let's see if it works.
The output is:
just a test of some idea. let see if it work.
// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
    String input = "Just a test of some ideas. Let's see if it works.";
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
    StringBuilder output = new StringBuilder(input);
    // in some cases, the analyzer will make terms longer or shorter.
    // because of this we must track how much we have adjusted the text so far
    // so that the offsets returned will still work for us via replace()
    int delta = 0;
    TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String term = termAtt.toString();
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        output.replace(delta + start, delta + end, term);
        delta += (term.length() - (end - start));
    }
    ts.close();
    System.out.println(output.toString());
}

searching a list object

I have a list:
Dim list As New List(Of String)
with the following items:
290-7-11
1255-7-12
222-7-11
290-7-13
What's an easy and fast way to check whether a duplicate of "first block" plus "-" plus "second block" is already in the list? For example, the prefix 290-7 appears twice: in 290-7-11 and 290-7-13.
I am using .NET 2.0
If you only want to know if there are duplicates but don't care what they are...
The easiest way (assuming exactly two dashes).
Boolean hasDuplicatePrefixes = list
    .GroupBy(i => i.Substring(0, i.LastIndexOf('-')))
    .Any(g => g.Count() > 1);
The fastest way (at least for large sets of strings).
HashSet<String> hashSet = new HashSet<String>();
Boolean hasDuplicatePrefixes = false;
foreach (String item in list)
{
    String prefix = item.Substring(0, item.LastIndexOf('-'));
    if (hashSet.Contains(prefix))
    {
        hasDuplicatePrefixes = true;
        break;
    }
    else
    {
        hashSet.Add(prefix);
    }
}
If there are cases with more than two dashes, use the following. This will still fail with a single dash.
String prefix = item.Substring(0, item.IndexOf('-', item.IndexOf('-') + 1));
In .NET 2.0 use Dictionary<TKey, TValue> instead of HashSet<T>.
Dictionary<String, Boolean> dictionary = new Dictionary<String, Boolean>();
Boolean hasDuplicatePrefixes = false;
foreach (String item in list)
{
    String prefix = item.Substring(0, item.LastIndexOf('-'));
    if (dictionary.ContainsKey(prefix))
    {
        hasDuplicatePrefixes = true;
        break;
    }
    else
    {
        dictionary.Add(prefix, true);
    }
}
If you don't care about readability and speed, use an array instead of a list, and if you are a real fan of regular expressions, then you can do the following, too.
Boolean hasDuplicatePrefixes = Regex.IsMatch(
    String.Join("#", list), @".*(?:^|#)([0-9]+-[0-9]+-).*#\1");
Do you want to stop the user from adding it?
If so, a Hashtable with the key as first block-second block could be of use.
If not, LINQ is the way to go.
But it will have to traverse the list to check.
How big can this list be?
EDIT: I don't know if Hashtable has a generic version.
You could also use SortedDictionary, which can take generic arguments.
If your list contains only strings, then you can simply make a method that takes the string you want to find along with the list:
Boolean isStringDuplicated(String find, List<String> list)
{
    if (list == null)
        throw new System.ArgumentNullException("Given list is null.");
    int count = 0;
    foreach (String s in list)
    {
        if (s.Contains(find))
            count += 1;
        if (count == 2)
            return true;
    }
    return false;
}
If your numbers have a special significance in your program, don't be afraid to use a class to represent them instead of sticking with strings. Then you would have a place to write all the custom functionality you want for said numbers.