Lucene: how to preserve whitespace etc. when tokenizing a stream?

I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stopwords etc. from the input so that the output is formatted in the same way as the input instead of ending up as a stream of translations. So if my input is
Term1: Term2 Stopword! Term3
Term4
then I want the output to look like
Term1': Term2' Stopword! Term3'
Term4'
(where Termi' is the translation of Termi) instead of simply
Term1' Term2' Term3' Term4'
Currently I am doing the following:
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
        PatternAnalyzer.WHITESPACE_PATTERN,
        false,
        WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}
but this, of course, loses all the whitespace etc. How can I modify it to re-insert them into the output? Thanks much!
============ UPDATE!
I tried splitting the original stream into "words" and "non-words". It seems to work fine. Not sure whether it's the most efficient way, though:
public ArrayList<Token> splitToWords(String sIn)
{
    if (sIn == null || sIn.length() == 0) {
        return null;
    }
    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue;
        }
        TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));
    return list;
}
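(Token and TokenType above are simple helpers whose definitions are not shown; a minimal sketch of what they might look like, assuming a plain value class:)
// Hypothetical helper types assumed by splitToWords(); a sketch, not the
// actual definitions from the post.
enum TokenType { WORD, NONWORD }

class Token {
    final String text;
    final TokenType type;

    Token(String text, TokenType type) {
        this.text = text;
        this.type = type;
    }
}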

Well it doesn't really lose whitespace, you still have your original text :)
So I think you should make use of OffsetAttribute, which contains the startOffset() and endOffset() of each term in your original text. This is what Lucene uses, for example, to highlight snippets of search results from the original text.
I wrote up a quick test (uses EnglishAnalyzer) to demonstrate:
The input is:
Just a test of some ideas. Let's see if it works.
The output is:
just a test of some idea. let see if it work.
// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
    String input = "Just a test of some ideas. Let's see if it works.";
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
    StringBuilder output = new StringBuilder(input);
    // in some cases, the analyzer will make terms longer or shorter.
    // because of this we must track how much we have adjusted the text so far
    // so that the offsets returned will still work for us via replace()
    int delta = 0;
    TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String term = termAtt.toString();
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        output.replace(delta + start, delta + end, term);
        delta += (term.length() - (end - start));
    }
    ts.close();
    System.out.println(output.toString());
}
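To adapt this to the translation use case from the question, one could look each term up before splicing it back in. A minimal sketch with the same offset bookkeeping, assuming a hypothetical Map<String, String> dictionary that falls back to the original term when there is no entry:
// A sketch, not the answer's code: same offset technique, but each analyzed
// term is looked up in a hypothetical Map<String, String> dictionary.
public String translate(String input, Analyzer analyzer, Map<String, String> dictionary) throws IOException {
    StringBuilder output = new StringBuilder(input);
    int delta = 0;
    TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String term = termAtt.toString();
        // fall back to the original term when the dictionary has no entry
        String translated = dictionary.containsKey(term) ? dictionary.get(term) : term;
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        output.replace(delta + start, delta + end, translated);
        delta += (translated.length() - (end - start));
    }
    ts.close();
    return output.toString();
}
Note that the lookup key is the analyzed form of the term (lowercased, possibly stemmed), so the dictionary would need to be keyed accordingly.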

Related

How to avoid splitting of word with case

Below is the text that I want to split:
string content = "Tonight the warning from the Police commissioner as they crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT)";
string[] contentList = content.Split(' ');
I do not want to split the text that is within parentheses and in upper case, i.e. (TAKE SOT). How can I avoid splitting that part?
Thanks
The Split method can take two parameters. The first one delimits the substrings in this instance; the second one is the maximum number of elements expected in the array.
The following code snippet works for me; you can refer to it.
string content = "Tonight the warning from the Police commissioner as they crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT)";
string[] contents = content.Split(" ", content.Substring(0, content.IndexOf("(")).Split(" ").Length);
This uses the Split(String, Int32, StringSplitOptions) overload: everything before the first parenthesis is split on spaces, and the parenthesized part is left intact in the final element.
Later, after doing some research, I found a solution to the problem:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string BEFORE_AND_AFTER_SOUND_EFFECT = @"\(([A-Z\s]*)\)";
        List<string> wordList = new List<string>();
        string content = "Tonight the warning from the Police commissioner as they (VO) crack down on anyone who leaves there house without a 'reasonable excuse' (TAKE SOT) INCUE:police will continue to be out there.";
        GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
        foreach (var item in wordList)
        {
            Console.WriteLine(item);
        }
    }

    private static void GetWords(string content, List<string> wordList, string BEFORE_AND_AFTER_SOUND_EFFECT)
    {
        if (content.Length > 0)
        {
            if (Regex.IsMatch(content, BEFORE_AND_AFTER_SOUND_EFFECT))
            {
                Match match = Regex.Match(content, BEFORE_AND_AFTER_SOUND_EFFECT);
                int textLength = content.Substring(0, match.Index).Length;
                int matchLength = match.Length;
                // words before the parenthesized match
                string[] words = content.Substring(0, match.Index - 1).Split(' ');
                foreach (var word in words)
                {
                    wordList.Add(word);
                }
                // the parenthesized match itself, kept whole
                wordList.Add(content.Substring(match.Index, match.Length));
                content = content.Substring((textLength + matchLength) + 1); // remaining content
                GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
            }
            else
            {
                wordList.Add(content);
                content = content.Substring(content.Length); // remaining content
                GetWords(content, wordList, BEFORE_AND_AFTER_SOUND_EFFECT);
            }
        }
    }
}
Thanks.

What is the right way to get term positions in a Lucene document?

The example in this question and some others I've seen on the web use the postings method of a TermVector to get term positions. Copied from the example in the linked question:
IndexReader ir = obtainIndexReader();
Terms tv = ir.getTermVector( doc, field );
TermsEnum terms = tv.iterator();
PostingsEnum p = null;
while( terms.next() != null ) {
    p = terms.postings( p, PostingsEnum.ALL );
    while( p.nextDoc() != PostingsEnum.NO_MORE_DOCS ) {
        int freq = p.freq();
        for( int i = 0; i < freq; i++ ) {
            int pos = p.nextPosition();  // Always returns -1!!!
            BytesRef data = p.getPayload();
            doStuff( freq, pos, data );  // Fails miserably, of course.
        }
    }
}
This code works for me, but what drives me mad is that the Terms type is where the position information is kept. All the documentation I've seen keeps saying that term vectors keep position data, yet there are no methods on this type to get that information!
Older versions of Lucene apparently had such a method, but as of at least version 6.5.1 that is no longer the case.
Instead I'm supposed to use the postings method and traverse the documents, but I already know which document I want to work on!
The API documentation does not say anything about postings returning only the current document (the one the term vector belongs to) but when I run it, I only get the current doc.
Is this the correct and only way to get position data from term vectors? Why such an unintuitive API? Is there a document that explains why the previous approach was changed in favour of this one?
Don't know about "right or wrong" but for version 6.6.3 this seems to work.
private void run() throws Exception {
    Directory directory = new RAMDirectory();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
    IndexWriter writer = new IndexWriter(directory, indexWriterConfig);

    Document doc = new Document();
    // Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES
    FieldType type = new FieldType();
    type.setStoreTermVectors(true);
    type.setStoreTermVectorPositions(true);
    type.setStoreTermVectorOffsets(true);
    type.setIndexOptions(IndexOptions.DOCS);
    Field fieldStore = new Field("tags", "foo bar and then some", type);
    doc.add(fieldStore);
    writer.addDocument(doc);
    writer.close();

    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);

    Term t = new Term("tags", "bar");
    Query q = new TermQuery(t);
    TopDocs results = searcher.search(q, 1);

    for ( ScoreDoc scoreDoc : results.scoreDocs ) {
        Fields termVs = reader.getTermVectors(scoreDoc.doc);
        Terms f = termVs.terms("tags");
        TermsEnum te = f.iterator();
        PostingsEnum docsAndPosEnum = null;
        BytesRef bytesRef;
        while ( (bytesRef = te.next()) != null ) {
            docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
            // for each term (iterator next) in this field (field)
            // iterate over the docs (should only be one)
            int nextDoc = docsAndPosEnum.nextDoc();
            assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
            final int fr = docsAndPosEnum.freq();
            final int p = docsAndPosEnum.nextPosition();
            final int o = docsAndPosEnum.startOffset();
            System.out.println("p=" + p + ", o=" + o + ", l=" + bytesRef.length + ", f=" + fr + ", s=" + bytesRef.utf8ToString());
        }
    }
}

Apache PDFBox replace text results in a few characters missed

Trying to use Apache PDFBox version 2.0.2 for a text replace (with the code below) produces an output where a few of the characters are not displayed, mostly upper-case characters. For example, replacing with "ABCDEFGHIJKLMNOPQRSTUVWXYZ" the output appears in the PDF as "ABCDEF HIJKLM OP RST W Y ". Is this a bug, or is there a workaround to handle these characters?
public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }
    PDPageTree pages = document.getDocumentCatalog().getPages();
    for (PDPage page : pages) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand, the string to display, so let's update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            string = StringUtils.replaceOnce(string, searchString, replacement);
                            cosString.setValue(string.getBytes());
                        }
                    }
                }
            }
        }
        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
        out.close();
    }
    return document;
}
Quoting from
https://pdfbox.apache.org/2.0/migration.html
Why was the ReplaceText example removed?
The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.
You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.
======================================================================
Your description suggests that the initial file uses a font subset that is missing the characters G, N, Q, V and Y.
And no, there is no easy workaround. You would have to delete the text you don't want from the content stream, and then append a new content stream with the text you want with a new font at the correct place.
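If you do go that route, here is a minimal sketch of just the appending step (PDFBox 2.0.x). The page, coordinates, text, and font here are placeholder assumptions, and removing the old text from the content stream is not shown:
// Append replacement text on top of a page (PDFBox 2.0.x).
// "page", the coordinates, the text, and the font are hypothetical placeholders.
try (PDPageContentStream cs = new PDPageContentStream(
        document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
    cs.beginText();
    cs.setFont(PDType1Font.HELVETICA, 12);  // a full font, not a subset
    cs.newLineAtOffset(100, 700);           // where the replaced text should appear
    cs.showText("replacement text");
    cs.endText();
}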
P.S. the current PDFBox version is 2.0.7, not 2.0.2.

How to get Document ids for Document Term Vector in Lucene

I am new to the Lucene world and don't have much working knowledge of the subject. I need to extract document term vectors, and I found the following code online: How to extract Document Term Vector in Lucene 3.5.0.
/**
 * Sums the term frequency vector of each document into a single term frequency map
 * @param indexReader the index reader, the document numbers are specific to this reader
 * @param docNumbers document numbers to retrieve frequency vectors from
 * @param fieldNames field names to retrieve frequency vectors from
 * @param stopWords terms to ignore
 * @return a map of each term to its frequency
 * @throws IOException
 */
private Map<String,Integer> getTermFrequencyMap(IndexReader indexReader, List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
        throws IOException {
    Map<String,Integer> totalTfv = new HashMap<String,Integer>(1024);
    for (Integer docNum : docNumbers) {
        for (String fieldName : fieldNames) {
            TermFreqVector tfv = indexReader.getTermFreqVector(docNum, fieldName);
            if (tfv == null) {
                // ignore empty fields
                continue;
            }
            String terms[] = tfv.getTerms();
            int termCount = terms.length;
            int freqs[] = tfv.getTermFrequencies();
            for (int t = 0; t < termCount; t++) {
                String term = terms[t];
                int freq = freqs[t];
                // filter out single-letter words and stop words
                if (StringUtils.length(term) < 2 ||
                        stopWords.contains(term)) {
                    continue; // stop
                }
                Integer totalFreq = totalTfv.get(term);
                totalFreq = (totalFreq == null) ? freq : freq + totalFreq;
                totalTfv.put(term, totalFreq);
            }
        }
    }
    return totalTfv;
}
I have created the index which resides in the following directory.
String indexDir = "C:\\Lucene\\Output\\";
Directory dir = FSDirectory.open(new File(indexDir));
IndexReader reader = IndexReader.open(dir);
My problem is that I do not know how to get the doc ids (the List<Integer> docNumbers) required by the above function. I have tried a couple of methods like
TermDocs docs = reader.termDocs();
but it did not work.
Lucene starts assigning ids from zero, and maxDoc() is the upper limit, so you can simply loop to get all ids, skipping deleted documents (Lucene marks them for deletion when you call deleteDocument):
for (int docNum = 0; docNum < reader.maxDoc(); docNum++) {
    if (reader.isDeleted(docNum)) {
        continue;
    }
    TermFreqVector tfv = reader.getTermFreqVector(docNum, "fieldName");
    ...
}
For this to work, you have to enable term vectors during indexing; see Field.TermVector.
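For example, a minimal sketch of what that could look like in Lucene 3.x, assuming an open IndexWriter named writer (the field name and text are placeholders):
// Hypothetical field setup; Field.TermVector.YES tells Lucene to store
// the term frequency vector for this field at index time.
Document doc = new Document();
doc.add(new Field("fieldName", "some text to index",
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
writer.addDocument(doc);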

WiX: Extracting Binary-string in Custom Action yields string like "???good data"

I just found a weird behavior when attempting to extract a string from the Binary table in the MSI.
I have a file containing Hello world; the data I get back is ???Hello world. (Literal question marks.)
Is this intended?
Will it always be exactly 3 characters at the beginning?
Sample code:
[CustomAction]
public static ActionResult CustomAction2(Session session)
{
    View v = session.Database.OpenView("SELECT `Name`,`Data` FROM `Binary`");
    v.Execute();
    Record r = v.Fetch();
    int datalen = r.GetDataSize("Data");
    System.IO.Stream strm = r.GetStream("Data");
    byte[] rawData = new byte[datalen];
    int res = strm.Read(rawData, 0, datalen);
    strm.Close();
    String s = System.Text.Encoding.ASCII.GetString(rawData);
    // s == "???Hello World"
    return ActionResult.Success;
}
Wild guess, but if you created the file using Notepad, couldn't that just be your byte order mark? The UTF-8 BOM is the three bytes EF BB BF, and Encoding.ASCII maps each non-ASCII byte to '?', which would explain exactly three leading question marks.
Try
String s = System.Text.Encoding.UTF8.GetString(rawData);
if (s.Length > 0 && s[0] == '\uFEFF')
{
    s = s.Substring(1);
}
instead of String s = System.Text.Encoding.ASCII.GetString(rawData);