I'm using Lucene.NET 4.8-beta00005.
I have a "name" field in my documents defined as follows:
doc.Add(CreateField(NameField, entry.Name.ToLower()));
writer.AddDocument(doc);
where CreateField is implemented as follows:
private static Field CreateField(string fieldName, string fieldValue)
{
return new Field(fieldName, fieldValue, new FieldType()
{
    IsIndexed = true,
    IsStored = true,
    IsTokenized = true,
    StoreTermVectors = true,
    StoreTermVectorPositions = true,
    StoreTermVectorOffsets = true,
    StoreTermVectorPayloads = true
});
}
The "name" field is assigned a StandardAnalyzer.
Then in my CustomScoreProvider I'm retrieving the terms from the term vector as follows:
private List<string> GetDocumentTerms(int doc, string fieldName)
{
var indexReader = m_context.Reader;
var termVector = indexReader.GetTermVector(doc, fieldName);
var termsEnum = termVector.GetIterator(null);
BytesRef termBytesRef;
termBytesRef = termsEnum.Next();
var documentTerms = new List<string>();
while (termBytesRef != null)
{
//removing trailing \0 (padded to 16 bytes)
var termText = Encoding.Default.GetString(termBytesRef.Bytes).Replace("\0", "");
documentTerms.Add(termText);
termBytesRef = termsEnum.Next();
}
return documentTerms;
}
Now I have a document where the value of the "name" field is "dan gertler diamonds ltd."
So the terms from the term vector I'm expecting are:
dan gertler diamonds ltd
But my GetDocumentTerms gives me the following terms:
dan diamonds gertlers ltdtlers
I'm using a StandardAnalyzer with the field, so I'm not expecting it to do much transformation to the original words in the field (and I did check this particular name with StandardAnalyzer).
What am I doing wrong here and how to fix it?
Edit: as a workaround for now, I'm extracting the terms manually with each field's Analyzer and storing them in a separate String field.
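Roughly like this (a minimal sketch; ExtractTerms is a placeholder name, and it assumes the same Analyzer instance that indexed the field):
private static List<string> ExtractTerms(Analyzer analyzer, string fieldName, string text)
{
    // Needs Lucene.Net.Analysis and Lucene.Net.Analysis.TokenAttributes.
    var terms = new List<string>();
    // Run the field's analyzer over the raw text, exactly as indexing would.
    using (var tokenStream = analyzer.GetTokenStream(fieldName, new StringReader(text)))
    {
        var termAttr = tokenStream.AddAttribute<ICharTermAttribute>();
        tokenStream.Reset();
        while (tokenStream.IncrementToken())
        {
            terms.Add(termAttr.ToString());
        }
        tokenStream.End();
    }
    return terms;
}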
Your garbled terms come from decoding the whole BytesRef.Bytes array: a BytesRef is a slice (Offset/Length) into a shared, reused buffer, so converting the entire array picks up leftover bytes from previously enumerated terms (which is how "gertler" becomes "gertlers"). Decode terms with BytesRef.Utf8ToString() instead. Also note that a term vector enumerates its terms in sorted order, not document order; if you want to get the terms in the correct order, you must also use the positional information. Test this code:
Terms terms = indexReader.GetTermVector(doc, fieldName);
if (terms != null)
{
var termIterator = terms.GetIterator(null);
BytesRef bytestring;
var documentTerms = new List<Tuple<int, string>>();
while ((bytestring = termIterator.Next()) != null)
{
var docsAndPositions = termIterator.DocsAndPositions(null, null, DocsAndPositionsFlags.OFFSETS);
docsAndPositions.NextDoc();
int position;
for(int left = docsAndPositions.Freq; left > 0; left--)
{
position = docsAndPositions.NextPosition();
documentTerms.Add(new Tuple<int, string>(position, bytestring.Utf8ToString()));
}
}
documentTerms.Sort((word1, word2) => word1.Item1.CompareTo(word2.Item1));
foreach (var word in documentTerms)
{
Console.WriteLine("{0} {1} {2}", fieldName, word.Item1, word.Item2);
}
}
This code also handles the situation where you have the same term (word) in more than one place.
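If you only need the terms and not their order, the minimal fix to the original GetDocumentTerms loop is to decode just the term's slice of the shared buffer:
while (termBytesRef != null)
{
    // Utf8ToString honors BytesRef.Offset and BytesRef.Length, so leftover
    // bytes from previously enumerated terms no longer leak into the string.
    documentTerms.Add(termBytesRef.Utf8ToString());
    termBytesRef = termsEnum.Next();
}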
OK, so I have been making a simple code editor in VB.NET for Go (for personal use).
I tried this code:
Dim tokens As String = "(break|default|func|interface|select|case|defer|go|map|struct|chan|else|goto|package|switch|const|fallthrough|if|range|type|continue|for|import|return|var)"
Dim rex As New Regex(tokens)
Dim mc As MatchCollection = rex.Matches(TextBox2.Text)
Dim StartCursorPosition As Integer = TextBox2.SelectionStart
For Each m As Match In mc
Dim startIndex As Integer = m.Index
Dim StopIndex As Integer = m.Length
TextBox2.[Select](startIndex, StopIndex)
TextBox2.SelectionColor = Color.FromArgb(0, 122, 204)
TextBox2.SelectionStart = StartCursorPosition
TextBox2.SelectionColor = Color.RebeccaPurple
Next
but I couldn't handle something like print statements. Say I want fmt.Println("Hello World") highlighted; that is not possible with this. Can anyone help me?
I want a simple solution that does proper syntax highlighting without glitching the text colors the way this current code does.
Here's code showing how to update the highlighting to cover strings and numbers as well.
You would need to tweak it further to support syntax like comments, etc.
private Regex BuildExpression()
{
string[] exprs = {
"(break|default|func|interface|select|case|defer|go|map|struct|chan|else|goto|package|switch|const|fallthrough|if|range|type|continue|for|import|return|var)",
#"([0-9]+\.[0-9]*(e|E)(\+|\-)?[0-9]+)|([0-9]+\.[0-9]*)|([0-9]+)",
"(\"\")|\"((((\\\\\")|(\"\")|[^\"])*\")|(((\\\\\")|(\"\")|[^\"])*))"
};
StringBuilder sb = new StringBuilder();
for (int i = 0; i < exprs.Length; i++)
{
string expr = exprs[i];
if ((expr != null) && (expr != string.Empty))
sb.Append(string.Format("(?<{0}>{1})", "_" + i.ToString(), expr) + "|");
}
if (sb.Length > 0)
sb.Remove(sb.Length - 1, 1);
RegexOptions options = RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.Compiled | RegexOptions.IgnoreCase;
return new Regex(sb.ToString(), options);
}
private void HighlightSyntax()
{
var colors = new Dictionary<int, Color>();
var expression = BuildExpression();
Color[] clrs = { Color.Teal, Color.Red, Color.Blue };
int[] intarray = expression.GetGroupNumbers();
foreach (int i in intarray)
{
var name = expression.GroupNameFromNumber(i);
if ((name != null) && (name.Length > 0) && (name[0] == '_'))
{
var idx = int.Parse(name.Substring(1));
if (idx < clrs.Length)
colors.Add(i, clrs[idx]);
}
}
foreach (Match match in expression.Matches(richTextBox1.Text))
{
int index = match.Index;
int length = match.Length;
richTextBox1.Select(index, length);
for (int i = 0; i < match.Groups.Count; i++)
{
if (match.Groups[i].Success)
{
if (colors.ContainsKey(i))
{
richTextBox1.SelectionColor = colors[i];
break;
}
}
}
}
}
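To avoid the color glitches while typing, call it with the caret saved and restored; a sketch (the TextChanged wiring is my own, not part of the code above):
private void richTextBox1_TextChanged(object sender, EventArgs e)
{
    int caret = richTextBox1.SelectionStart;
    // Reset everything to the default color first so stale highlighting
    // from earlier edits is cleared, then re-apply the syntax colors.
    richTextBox1.SelectAll();
    richTextBox1.SelectionColor = richTextBox1.ForeColor;
    HighlightSyntax();
    // Put the caret back where the user left it.
    richTextBox1.Select(caret, 0);
    richTextBox1.SelectionColor = richTextBox1.ForeColor;
}
For large documents you would also want to suspend painting (e.g. via WM_SETREDRAW) while recoloring.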
What we found during development of our Code Editor libraries is that regular-expression-based parsers are hard to adapt to fully support advanced syntax like contextual keywords (LINQ) or interpolated strings.
You might find a bit more information here:
https://www.alternetsoft.com/blog/code-parsing-explained
The most accurate syntax highlighting for VB.NET can be implemented using the Microsoft.CodeAnalysis API; it's the same API used internally by the Visual Studio text editor.
Below is sample code showing how to get classified spans for VB.NET code (every span contains a start/end position within the text and a classification type, i.e. keyword, string, etc.). These spans can then be used to highlight text inside a textbox.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Classification;
using Microsoft.CodeAnalysis.Host.Mef;
using Microsoft.CodeAnalysis.Text;
public class VBClassifier
{
private Workspace workspace;
private static string FileContent = @"
Public Sub Run()
Dim test as TestClass = new TestClass()
End Sub";
public void Classify()
{
var project = InitProject();
var doc = AddDocument(project, "file1.vb", FileContent);
var spans = Classify(doc);
}
protected IEnumerable<ClassifiedSpan> Classify(Document document)
{
var text = document.GetTextAsync().Result;
var span = new TextSpan(0, text.Length);
return Classifier.GetClassifiedSpansAsync(document, span).Result;
}
protected Document AddDocument(Project project, string fileName, string code)
{
var documentId = DocumentId.CreateNewId(project.Id, fileName);
ApplySolutionChanges(s => s.AddDocument(documentId, fileName, code, filePath: fileName));
return workspace.CurrentSolution.GetDocument(documentId);
}
protected virtual void ApplySolutionChanges(Func<Solution, Solution> action)
{
var solution = workspace.CurrentSolution;
solution = action(solution);
workspace.TryApplyChanges(solution);
}
protected MefHostServices GetRoslynCompositionHost()
{
IEnumerable<Assembly> assemblies = MefHostServices.DefaultAssemblies;
var compositionHost = MefHostServices.Create(assemblies);
return compositionHost;
}
protected Project CreateDefaultProject()
{
var solution = workspace.CurrentSolution;
var projectId = ProjectId.CreateNewId();
var projectName = "VBTest";
ProjectInfo projectInfo = ProjectInfo.Create(
projectId,
VersionStamp.Default,
projectName,
projectName,
LanguageNames.VisualBasic,
filePath: null);
ApplySolutionChanges(s => s.AddProject(projectInfo));
return workspace.CurrentSolution.Projects.FirstOrDefault();
}
protected Project InitProject()
{
var host = GetRoslynCompositionHost();
workspace = new AdhocWorkspace(host);
return CreateDefaultProject();
}
}
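The classified spans map straightforwardly onto a RichTextBox; here is a rough sketch (ApplySpans is my own helper, the color mapping is arbitrary, and only a few classification types are handled):
private static void ApplySpans(RichTextBox box, IEnumerable<ClassifiedSpan> spans)
{
    // Needs System.Drawing and System.Windows.Forms in addition to the usings above.
    foreach (var span in spans)
    {
        Color color;
        switch (span.ClassificationType)
        {
            case ClassificationTypeNames.Keyword: color = Color.Blue; break;
            case ClassificationTypeNames.StringLiteral: color = Color.Brown; break;
            case ClassificationTypeNames.Comment: color = Color.Green; break;
            default: continue; // leave everything else in the default color
        }
        // TextSpan carries the start offset and length within the document text.
        box.Select(span.TextSpan.Start, span.TextSpan.Length);
        box.SelectionColor = color;
    }
}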
Update:
Here's a Visual Studio project demonstrating both approaches:
https://drive.google.com/file/d/1LLuzy7yDFAE-v40I7EswECYQSthxheEf/view?usp=sharing
I am using the highlighting feature of Lucene to isolate matching terms for my query, but some of the matched terms are excessive.
I have some simple test cases which are delivered in an Ant project (download details below).
Materials
You can download the test case here: mydemo_with_libs.zip
That archive includes the Lucene 8.6.3 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo_without_libs.zip
The necessary libraries are: core, analyzers, queries, queryparser, highlighter, and memory.
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant synsearch
Input
I have provided a short synonym list which is used for indexing and analysing in the highlighting methods:
cope,manage
jobs,tasks
simultaneously,at once
and there is one document being indexed:
Queues are a useful way of grouping jobs together in order to manage a number of them at once. You can:
hold or release multiple jobs at the same time;
group multiple tasks (for the same event);
control the priority of jobs in the queue;
Eventually log all events that take place in a queue.
Use either job.queue or task.queue in specifications.
Process
When building the index I am storing the text field and using a custom analyzer. This is because (in the real world) the content I am indexing is technical documentation, so stripping out punctuation is inappropriate: so much of it may be significant in technical expressions. My analyzer uses a TechTokenFilter which breaks the stream up into tokens consisting of runs of word or digit characters, or individual characters which don't match that pattern.
Here's the relevant code for the analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != null && !synlist.isEmpty()) {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
and here's my filter:
public class TechTokenFilter extends TokenFilter {
private final CharTermAttribute termAttr;
private final PositionIncrementAttribute posIncAttr;
private final ArrayList<String> termStack;
private AttributeSource.State current;
private final TypeAttribute typeAttr;
public TechTokenFilter(TokenStream tokenStream) {
super(tokenStream);
termStack = new ArrayList<>();
termAttr = addAttribute(CharTermAttribute.class);
posIncAttr = addAttribute(PositionIncrementAttribute.class);
typeAttr = addAttribute(TypeAttribute.class);
}
@Override
public boolean incrementToken() throws IOException {
if (this.termStack.isEmpty() && input.incrementToken()) {
final String currentTerm = termAttr.toString();
final int bufferLen = termAttr.length();
if (bufferLen > 0) {
if (termStack.isEmpty()) {
termStack.addAll(Arrays.asList(techTokens(currentTerm)));
current = captureState();
}
}
}
if (!this.termStack.isEmpty()) {
String part = termStack.remove(0);
restoreState(current);
termAttr.setEmpty().append(part);
posIncAttr.setPositionIncrement(1);
return true;
} else {
return false;
}
}
public static String[] techTokens(String t) {
List<String> tokenlist = new ArrayList<String>();
String[] tokens;
StringBuilder next = new StringBuilder();
String token;
char minus = '-';
char underscore = '_';
char c, prec, subc;
// Boolean inWord = false;
for (int i = 0; i < t.length(); i++) {
prec = i > 0 ? t.charAt(i - 1) : 0;
c = t.charAt(i);
subc = i < (t.length() - 1) ? t.charAt(i + 1) : 0;
if (Character.isLetterOrDigit(c) || c == underscore) {
next.append(c);
// inWord = true;
}
else if (c == minus && Character.isLetterOrDigit(prec) && Character.isLetterOrDigit(subc)) {
next.append(c);
} else {
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
next.setLength(0);
}
if (Character.isWhitespace(c)) {
// shouldn't be possible because the input stream has been tokenized on
// whitespace
} else {
tokenlist.add(String.valueOf(c));
}
// inWord = false;
}
}
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
// next.setLength(0);
}
tokens = tokenlist.toArray(new String[0]);
return tokens;
}
}
Examining the index, I can see that it contains the separate terms I expect, including the synonym values. For example, the text at the end of the first line has produced the terms
of
them
at , simultaneously
once
.
You
can
:
and the text at the end of the third line has produced the terms
same
event
)
;
When the application performs a search, it analyzes the query without using the synonym list (because the synonyms are already in the index), but I have discovered that I need to include the synonym list when analyzing the stored text to identify the matching fragments.
Searches match the correct documents, but the code I have added to identify the matching terms over-performs. I won't show the whole search method here, but will focus on the code which lists matched terms:
public static void doSearch(IndexReader reader, IndexSearcher searcher,
Query query, int max, String synList) throws IOException {
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("\001", "\002");
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
Analyzer analyzer;
if (synList != null) {
analyzer = new MyAnalyzer(synList);
} else {
analyzer = new MyAnalyzer();
}
// Collect all the docs
TopDocs results = searcher.search(query, max);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits.value);
System.out.println("\nQuery: " + query.toString());
System.out.println("Matches: " + numTotalHits);
// Collect matching terms
HashSet<String> matchedWords = new HashSet<String>();
int start = 0;
int end = Math.min(numTotalHits, max);
for (int i = start; i < end; i++) {
int id = hits[i].doc;
float score = hits[i].score;
Document doc = searcher.doc(id);
String docpath = doc.get("path");
String doctext = doc.get("text");
try {
TokenStream tokens = TokenSources.getTokenStream("text", null, doctext, analyzer, -1);
TextFragment[] frag = highlighter.getBestTextFragments(tokens, doctext, false, 100);
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
String match = frag[j].toString();
addMatchedWord(matchedWords, match);
}
}
} catch (InvalidTokenOffsetsException e) {
System.err.println(e.getMessage());
}
System.out.println("matched file: " + docpath);
}
if (matchedWords.size() > 0) {
System.out.println("matched terms:");
for (String word : matchedWords) {
System.out.println(word);
}
}
}
Problem
While the correct documents are selected by these queries, and the fragments chosen for highlighting do contain the query terms, the highlighted pieces in some of the selected fragments extend over too much of the input.
For example, if the query is
+text:event +text:manage
(the first example in the test case) then I would expect to see 'event' and 'manage' in the highlighted list. But what I actually see is
event);
manage
Despite the highlighting process using an analyzer which breaks terms apart and treats punctuation characters as single terms, the highlight code is "hungry" and breaks on whitespace alone.
Similarly if the query is
+text:queeu~1
(my final test case) I would expect to only see 'queue' in the list. But I get
queue.
job.queue
task.queue
queue;
It is so nearly there... but I don't understand why the highlighted pieces are inconsistent with the index, and I don't think I should have to parse the list of matches through yet another filter to produce the correct list of matches.
I would really appreciate any pointers to what I am doing wrong or how I could improve my code to deliver exactly what I need.
Thanks for reading this far!
I managed to get this working by replacing the WhitespaceTokenizer and TechTokenFilter in my analyzer with a PatternTokenizer; the regular expression took a bit of work, but once I had it, all the matching terms were extracted with pinpoint accuracy.
The replacement analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != null && !synlist.isEmpty()) {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
private static final String tokenRegex = "(([\\w]+-)*[\\w]+)|[^\\w\\s]";
@Override
protected TokenStreamComponents createComponents(String fieldName) {
PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRegex), 0);
TokenStream result = new LowerCaseFilter(src);
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
I use LUCENE_30 for my search engine, but I cannot get fuzzy search to work. How can I make it work?
I tried using GetFuzzyQuery, but nothing happens. As far as I can see, it is not supported.
Here is my code:
if (searchQuery.Length < 3)
{
throw new ArgumentException("none");
}
FSDirectory dir = FSDirectory.Open(new DirectoryInfo(_indexFileLocation));
var searcher = new IndexSearcher(dir, true);
var analyzer = new RussianAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var query = MultiFieldQueryParser.Parse(Lucene.Net.Util.Version.LUCENE_30, searchQuery, new[] {"Title" }, new[] { Occur.SHOULD }, analyzer);
var hits = searcher.Search(query, 11110);
var dto = new PerformSearchResultDto();
dto.SearchResults = new List<SearchResult>();
dto.Total = hits.TotalHits;
for (int i = pagesize * page; i < hits.TotalHits && i < pagesize * page + pagesize; i++)
{
// Document doc = hits.Doc(i);
int docId = hits.ScoreDocs[i].Doc;
var doc = searcher.Doc(docId);
var result = new SearchResult();
result.Title = doc.Get("Title");
result.Type = doc.Get("Type");
result.Href = doc.Get("Href");
result.LastModified = doc.Get("LastModified");
result.Site = doc.Get("Site");
result.City = doc.Get("City");
//result.Region = doc.Get("Region");
result.Content = doc.Get("Content");
result.NoIndex = Convert.ToBoolean(doc.Get("NoIndex"));
dto.SearchResults.Add(result);
}
Fuzzy queries certainly are supported. See the FuzzyQuery class.
The query parser also supports fuzzy queries, simply with a tilde appended: misspeled~
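For example, against the 3.0 API in the question (a sketch; the 0.7f minimum similarity is an arbitrary choice, and searcher/analyzer are the objects from the code above):
// Programmatic fuzzy query on the "Title" field:
var fuzzy = new FuzzyQuery(new Term("Title", "misspeled"), 0.7f);
var fuzzyHits = searcher.Search(fuzzy, 10);
// Or let the query parser build it from the tilde syntax:
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Title", analyzer);
var parsed = parser.Parse("misspeled~0.7");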
At index time I am boosting certain documents in this way:
if (myCondition)
{
document.SetBoost(1.2f);
}
But at search time, documents with all the same qualities, where some pass and some fail myCondition, all end up having the same score.
And here is the search code:
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.Add(new TermQuery(new Term(FieldNames.HAS_PHOTO, "y")), BooleanClause.Occur.MUST);
booleanQuery.Add(new TermQuery(new Term(FieldNames.AUTHOR_TYPE, AuthorTypes.BLOGGER)), BooleanClause.Occur.MUST_NOT);
indexSearcher.Search(booleanQuery, 10);
Can you tell me what I need to do to get the documents that were boosted to get a higher score?
Many Thanks!
Lucene encodes boosts on a single byte (whereas a float is generally encoded on four bytes) using the SmallFloat#floatToByte315 method. As a consequence, there can be a big loss of precision when converting the byte back to a float.
In your case, SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(1.2f)) returns 1f because 1f and 1.2f are too close to each other. Try using a bigger boost so that your documents get different scores. (For example 1.25: SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(1.25f)) gives 1.25f.)
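You can observe the rounding directly; a minimal sketch, assuming the .NET port keeps the Java names as FloatToByte315/Byte315ToFloat on Lucene.Net.Util.SmallFloat:
// 1.2f collapses to 1f after the byte round-trip, while 1.25f survives:
var b12 = SmallFloat.FloatToByte315(1.2f);
var b125 = SmallFloat.FloatToByte315(1.25f);
Console.WriteLine(SmallFloat.Byte315ToFloat(b12));  // prints 1
Console.WriteLine(SmallFloat.Byte315ToFloat(b125)); // prints 1.25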
Here is the requested test program that was too long to post in a comment.
class Program
{
static void Main(string[] args)
{
RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer());
const string FIELD = "name";
for (int i = 0; i < 10; i++)
{
StringBuilder notes = new StringBuilder();
notes.AppendLine("This is a note 123 - " + i);
string text = notes.ToString();
Document doc = new Document();
var field = new Field(FIELD, text, Field.Store.YES, Field.Index.NOT_ANALYZED);
if (i % 2 == 0)
{
field.SetBoost(1.5f);
doc.SetBoost(1.5f);
}
else
{
field.SetBoost(0.1f);
doc.SetBoost(0.1f);
}
doc.Add(field);
writer.AddDocument(doc);
}
writer.Commit();
//string TERM = QueryParser.Escape("*+*");
string TERM = "T";
IndexSearcher searcher = new IndexSearcher(dir);
Query query = new PrefixQuery(new Term(FIELD, TERM));
var hits = searcher.Search(query);
int count = hits.Length();
Console.WriteLine("Hits - {0}", count);
for (int i = 0; i < count; i++)
{
var doc = hits.Doc(i);
Console.WriteLine(doc.ToString());
var explain = searcher.Explain(query, hits.Id(i)); // Explain wants the doc id, not the hit index
Console.WriteLine(explain.ToString());
}
}
}
I have already seen a few similar questions, but I still don't have an answer. I think I have a simple problem.
In the sentence
In this text, only Meta Files are important, and Test Generation.
Anything else is irrelevant
I want to index only Meta Files and Test Generation. That means I need an exact match.
Could someone please explain to me how to achieve this?
And here is the code:
Analyzer analyzer = new StandardAnalyzer();
Lucene.Net.Store.Directory directory = new RAMDirectory();
IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetMaxFieldLength(10000);
Document doc = new Document();
doc.Add(new Field("textFragment", text, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
iwriter.AddDocument(doc);
iwriter.Close();
IndexSearcher isearcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser("textFragment", analyzer);
foreach (DictionaryEntry de in OntologyLayer.OntologyLayer.HashTable)
{
List<string> buffer = new List<string>();
double weight = 0;
List<OntologyLayer.Term> list = (List<OntologyLayer.Term>)de.Value;
foreach (OntologyLayer.Term t in list)
{
Hits hits = null;
string label = t.Label;
string[] words = label.Split(' ');
int numOfWords = words.Length;
double wordWeight = 1 / (double)numOfWords;
double localWeight = 0;
foreach (string a in words)
{
try
{
if (!buffer.Contains(a))
{
Lucene.Net.Search.Query query = parser.Parse(a);
hits = isearcher.Search(query);
if (hits != null && hits.Length() > 0)
{
localWeight = localWeight + t.Weight * wordWeight * hits.Length();
}
buffer.Add(a);
}
}
catch (Exception ex)
{}
}
weight = weight + localWeight;
}
sbWeight.AppendLine(weight.ToString());
if (weight > 0)
{
string objectURI = (string)de.Key;
conceptList.Add(objectURI);
}
}
Take a look at Stupid Lucene Tricks: Exact Match, Starts With, Ends With.
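If by "exact match" you mean matching the multi-word labels as phrases, a smaller change may also work: quote each label so the parser builds a PhraseQuery instead of splitting it into independent term queries. A sketch against the loop above:
// Search the whole label as one phrase instead of looping over label.Split(' '):
Lucene.Net.Search.Query phraseQuery = parser.Parse("\"" + QueryParser.Escape(t.Label) + "\"");
Hits phraseHits = isearcher.Search(phraseQuery);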