I am trying to build an index with some facets, following the User Guide.
However, I run into a problem: the next line from the User Guide gives errors.
DocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxo);
Both DocumentBuilder and CategoryDocumentBuilder do not exist in lucene-facet.
I cannot find the API changes in the Jira issues. Does anyone have this working and care to share how it should be done?
I figured it out using the benchmark code as inspiration.
Indexing
Directory dir = FSDirectory.open(new File("index"));
Directory dir_taxo = FSDirectory.open(new File("index-taxo"));
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41)));
TaxonomyWriter taxo = new DirectoryTaxonomyWriter(dir_taxo, OpenMode.CREATE);
FacetFields ff = new FacetFields(taxo);

// for each document:
Document d = new Document();
List<CategoryPath> categories = new ArrayList<CategoryPath>();
for (all fields in doc)
{
    d.add(...);
}
for (all categories in doc)
{
    CategoryPath cp = new CategoryPath(field, value);
    categories.add(cp);
    taxo.addCategory(cp); // not sure if necessary
}
ff.addFields(d, categories);
writer.addDocument(d);
Searching:
Directory dir = FSDirectory.open(new File("index"));
Directory dir_taxo = FSDirectory.open(new File("index-taxo"));
final DirectoryReader indexReader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(indexReader);
TaxonomyReader taxo = new DirectoryTaxonomyReader(dir_taxo);
Query q = new TermQuery(new Term(...));
TopScoreDocCollector tdc = TopScoreDocCollector.create(1, true);
FacetSearchParams facetSearchParams = new FacetSearchParams(new CountFacetRequest(new CategoryPath("mycategory"), 10));
FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams, indexReader, taxo);
long ts = System.currentTimeMillis();
searcher.search(q, MultiCollector.wrap(tdc, facetsCollector));
List<FacetResult> res = facetsCollector.getFacetResults();
long te = System.currentTimeMillis();
for (FacetResult fr : res)
{
    for (FacetResultNode sr : fr.getFacetResultNode().getSubResults())
    {
        System.out.println(String.format("%s : %f", sr.getLabel(), sr.getValue()));
    }
}
System.out.println(String.format("Search took: %d millis", (te - ts)));
I'm not familiar with Lucene 4.1, only with version 2.9.
But when I create facets for my results I normally use Lucene.Net.Search.SimpleFacetedSearch.dll; below is sample code from my project.
Wouter
Dictionary<String, long> facetedResults = new Dictionary<String, long>();
try
{
SimpleFacetedSearch.MAX_FACETS = int.MaxValue;
SimpleFacetedSearch sfs = new SimpleFacetedSearch(indexReader, field);
SimpleFacetedSearch.Hits facetedHits = sfs.Search(query);
long totalHits = facetedHits.TotalHitCount;
for (int ihitsPerFacet = 0; ihitsPerFacet < facetedHits.HitsPerFacet.Count(); ihitsPerFacet++)
{
long hitCountPerGroup = facetedHits.HitsPerFacet[ihitsPerFacet].HitCount;
SimpleFacetedSearch.FacetName facetName = facetedHits.HitsPerFacet[ihitsPerFacet].Name;
if (hitCountPerGroup > 0)
facetedResults.Add(facetName.ToString(), hitCountPerGroup);
}
}
catch (Exception ex)
{
facetedResults.Add(ex.Message, -1);
}
I'm using iTextSharp to create a PDF document and then add it as an attachment to an email sent with SendGrid.
The code works locally, but after deploying the project to Azure this function stopped working. I tried to analyze the problem and I think the document is not being fully created or attached, possibly because of the connection. I can't pinpoint the exact issue. Any opinions or discussion are appreciated.
Action:
public async Task<IActionResult> GeneratePDF(int? id, string recipientEmail)
{
//if id valid
if (id == null)
{
return NotFound();
}
var story = await _db.Stories.Include(s => s.Child).Include(s => s.Sentences).ThenInclude(s => s.Image).FirstOrDefaultAsync(s => s.Id == id);
if (story == null)
{
return NotFound();
}
var webRootPath = _hostingEnvironment.WebRootPath;
var path = Path.Combine(webRootPath, "dump"); //folder name
try
{
using (System.IO.MemoryStream memoryStream = new System.IO.MemoryStream())
{
iTextSharp.text.Document document = new iTextSharp.text.Document(iTextSharp.text.PageSize.A4, 10, 10, 10, 10);
PdfWriter writer = PdfWriter.GetInstance(document, memoryStream);
document.Open();
string usedFont = Path.Combine(webRootPath + "\\fonts\\", "arial.TTF");
BaseFont bf = BaseFont.CreateFont(usedFont, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
iTextSharp.text.Font titleFont = new iTextSharp.text.Font(bf, 40);
iTextSharp.text.Font sentencesFont = new iTextSharp.text.Font(bf, 15);
iTextSharp.text.Font childNamewFont = new iTextSharp.text.Font(bf, 35);
PdfPTable T = new PdfPTable(1);
//Hide the table border
T.DefaultCell.BorderWidth = 0;
T.DefaultCell.HorizontalAlignment = 1;
T.DefaultCell.PaddingTop = 15;
T.DefaultCell.PaddingBottom = 15;
//Set RTL mode
T.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
//Add our text
if (story.Title != null)
{
T.AddCell(new iTextSharp.text.Paragraph(story.Title, titleFont));
}
if (story.Child != null)
{
if (story.Child.FirstName != null && story.Child.LastName != null)
{
T.AddCell(new iTextSharp.text.Phrase(story.Child.FirstName + story.Child.LastName, childNamewFont));
}
}
if (story.Sentences != null)
{
.................
}
document.Add(T);
writer.CloseStream = false;
document.Close();
byte[] bytes = memoryStream.ToArray();
var fileName = path + "\\PDF" + DateTime.Now.ToString("yyyyMMdd-HHMMss") + ".pdf";
using (FileStream fs = new FileStream(fileName, FileMode.Create))
{
fs.Write(bytes, 0, bytes.Length);
}
memoryStream.Position = 0;
memoryStream.Close();
//Send generated pdf as attchment
// Create the file attachment for this email message.
var attachment = Convert.ToBase64String(bytes);
var client = new SendGridClient(Options.SendGridKey);
var msg = new SendGridMessage();
msg.From = new EmailAddress(SD.DefaultEmail, SD.DefaultEmail);
msg.Subject = story.Title;
msg.PlainTextContent = "................";
msg.HtmlContent = "..................";
msg.AddTo(new EmailAddress(recipientEmail));
msg.AddAttachment("Story.pdf", attachment);
try
{
await client.SendEmailAsync(msg);
}
catch (Exception ex)
{
Console.WriteLine("{0} First exception caught.", ex);
}
//Remove from root
if (System.IO.File.Exists(fileName))
{
System.IO.File.Delete(fileName);
}
}
}
catch (FileNotFoundException e)
{
Console.WriteLine($"The file was not found: '{e}'");
}
catch (DirectoryNotFoundException e)
{
Console.WriteLine($"The directory was not found: '{e}'");
}
catch (IOException e)
{
Console.WriteLine($"The file could not be opened: '{e}'");
}
return RedirectToAction("Details", new { id = id });
}
Try editing the usedFont variable as below:
var usedFont = Path.Combine(webRootPath, @"\fonts\arial.TTF");
It turns out that the problem had nothing to do with iTextSharp. I did remote debugging following this article.
Two parts of the code were causing the problem.
First, for some reason the "dump" folder was not being created under the wwwroot folder on Azure, although it is locally. So I added these lines:
var webRootPath = _hostingEnvironment.WebRootPath;
var path = Path.Combine(webRootPath, "dump");
if (!Directory.Exists(path)) //Here
Directory.CreateDirectory(path);
Second, debugging showed that creating the file failed every time. I replaced the following lines:
using (FileStream fs = new FileStream(fileName, FileMode.Create))
{
fs.Write(bytes, 0, bytes.Length);
}
memoryStream.Position = 0;
memoryStream.Close();
With:
using (FileStream fs = new FileStream(fileName, FileMode.Create))
using (var binaryWriter = new BinaryWriter(fs))
{
binaryWriter.Write(bytes, 0, bytes.Length);
binaryWriter.Close();
}
memoryStream.Close();
Hope this post helps someone.
I am working on a secured PDF file that has an Excel file attached to it.
The following is the code I tried.
static void Main(string[] args)
{
Program pgm = new Program();
pgm.EmbedAttachments();
//pgm.ExtractAttachments(pgm.pdfFile);
}
private void ExtractAttachments(string _pdfFile)
{
try
{
if (!Directory.Exists(attExtPath))
Directory.CreateDirectory(attExtPath);
byte[] password = System.Text.ASCIIEncoding.ASCII.GetBytes("TFAER13052016");
//byte[] password = System.Text.ASCIIEncoding.ASCII.GetBytes("Password");
PdfDictionary documentNames = null;
PdfDictionary embeddedFiles = null;
PdfDictionary fileArray = null;
PdfDictionary file = null;
PRStream stream = null;
//PdfReader reader = new PdfReader(_pdfFile);
PdfReader reader = new PdfReader(_pdfFile, password);
PdfDictionary catalog = reader.Catalog;
documentNames = (PdfDictionary)PdfReader.GetPdfObject(catalog.Get(PdfName.NAMES));
if (documentNames != null)
{
embeddedFiles = (PdfDictionary)PdfReader.GetPdfObject(documentNames.Get(PdfName.EMBEDDEDFILES));
if (embeddedFiles != null)
{
PdfArray filespecs = embeddedFiles.GetAsArray(PdfName.NAMES);
for (int i = 0; i < filespecs.Size; i++)
{
i++; // the /Names array alternates [name, filespec], so jump to the filespec entry
fileArray = filespecs.GetAsDict(i);
file = fileArray.GetAsDict(PdfName.EF);
foreach (PdfName key in file.Keys)
{
stream = (PRStream)PdfReader.GetPdfObject(file.GetAsIndirectObject(key));
string attachedFileName = fileArray.GetAsString(key).ToString();
byte[] attachedFileBytes = PdfReader.GetStreamBytes(stream);
System.IO.File.WriteAllBytes(attExtPath + attachedFileName, attachedFileBytes);
}
}
}
else
throw new Exception("Unable to Read the attachment or There may be no Attachment");
}
else
{
throw new Exception("Unable to Read the document");
}
}
catch (Exception ex)
{
Console.WriteLine(ex.ToString());
Console.ReadKey();
}
}
private void EmbedAttachments()
{
try
{
if (File.Exists(pdfFile))
File.Delete(pdfFile);
Document PDFD = new Document(PageSize.LETTER);
PdfWriter writer;
writer = PdfWriter.GetInstance(PDFD, new FileStream(pdfFile, FileMode.Create));
PDFD.Open();
PDFD.NewPage();
PDFD.Add(new Paragraph("This is test"));
PdfFileSpecification pfs = PdfFileSpecification.FileEmbedded(writer, @"C:\PDFReader\1.xls", "11.xls", null);
//PdfFileSpecification pfs = PdfFileSpecification.FileEmbedded(writer, attFile, "11", File.ReadAllBytes(attFile), true);
writer.AddFileAttachment(pfs);
//writer.AddAnnotation(PdfAnnotation.CreateFileAttachment(writer, new iTextSharp.text.Rectangle(100, 100, 100, 100), "File Attachment", PdfFileSpecification.FileExtern(writer, "C:\\test.xml")));
//writer.Close();
PDFD.Close();
Program pgm=new Program();
using (Stream input = new FileStream(pgm.pdfFile, FileMode.Open, FileAccess.Read, FileShare.Read))
{
using (Stream output = new FileStream(pgm.epdfFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
PdfReader reader = new PdfReader(input);
PdfEncryptor.Encrypt(reader, output, true, "Password", "secret", PdfWriter.ALLOW_SCREENREADERS);
}
}
}
catch (Exception ex)
{
Console.WriteLine(ex.StackTrace.ToString());
Console.ReadKey();
}
}
}
The above code creates an encrypted PDF with an Excel attachment and also extracts that attachment again.
The real problem is with a file I received as a requirement document (which I cannot share); like my example, it also has an Excel attachment.
The above code works for the secured PDF I created myself, but not for the actual secured PDF.
While debugging, I found that the issue is with the following code
documentNames = (PdfDictionary)PdfReader.GetPdfObject(catalog.Get(PdfName.NAMES));
in which
catalog.Get(PdfName.NAMES)
returns NULL, whereas the file created by me gives the expected output.
Please guide me on the above.
TIA.
As mkl suggested, the file is attached as an annotation attachment. However, the ZipFile method used in the referenced example is no longer supported, so I put together the alternative code attached below.
public void ExtractAttachments(byte[] src)
{
PRStream stream = null;
string attExtPath = @"C:\PDFReader\Extract\";
if (!Directory.Exists(attExtPath))
Directory.CreateDirectory(attExtPath);
byte[] password = System.Text.ASCIIEncoding.ASCII.GetBytes("TFAER13052016");
PdfReader reader = new PdfReader(src, password);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
PdfArray array = reader.GetPageN(i).GetAsArray(PdfName.ANNOTS);
if (array == null) continue;
for (int j = 0; j < array.Size; j++)
{
PdfDictionary annot = array.GetAsDict(j);
if (PdfName.FILEATTACHMENT.Equals(
annot.GetAsName(PdfName.SUBTYPE)))
{
PdfDictionary fs = annot.GetAsDict(PdfName.FS);
PdfDictionary refs = fs.GetAsDict(PdfName.EF);
foreach (PdfName name in refs.Keys)
{
//zip.AddEntry(
// fs.GetAsString(name).ToString(),
// PdfReader.GetStreamBytes((PRStream)refs.GetAsStream(name))
//);
stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(name));
string attachedFileName = fs.GetAsString(name).ToString();
var splitname = attachedFileName.Split('\\');
if (splitname.Length != 1)
attachedFileName = splitname[splitname.Length - 1].ToString();
byte[] attachedFileBytes = PdfReader.GetStreamBytes(stream);
System.IO.File.WriteAllBytes(attExtPath + attachedFileName, attachedFileBytes);
}
}
}
}
}
Please let me know if it can be achieved in any other way.
Thanks!!!
I use LUCENE_30 for my search engine but I cannot get fuzzy search to work. How can I make it work?
I tried using GetFuzzyQuery but nothing happens. As far as I can see it is not supported.
Here is my code:
if (searchQuery.Length < 3)
{
throw new ArgumentException("none");
}
FSDirectory dir = FSDirectory.Open(new DirectoryInfo(_indexFileLocation));
var searcher = new IndexSearcher(dir, true);
var analyzer = new RussianAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var query = MultiFieldQueryParser.Parse(Lucene.Net.Util.Version.LUCENE_30, searchQuery, new[] {"Title" }, new[] { Occur.SHOULD }, analyzer);
var hits = searcher.Search(query, 11110);
var dto = new PerformSearchResultDto();
dto.SearchResults = new List<SearchResult>();
dto.Total = hits.TotalHits;
for (int i = pagesize * page; i < hits.TotalHits && i < pagesize * page + pagesize; i++)
{
// Document doc = hits.Doc(i);
int docId = hits.ScoreDocs[i].Doc;
var doc = searcher.Doc(docId);
var result = new SearchResult();
result.Title = doc.Get("Title");
result.Type = doc.Get("Type");
result.Href = doc.Get("Href");
result.LastModified = doc.Get("LastModified");
result.Site = doc.Get("Site");
result.City = doc.Get("City");
//result.Region = doc.Get("Region");
result.Content = doc.Get("Content");
result.NoIndex = Convert.ToBoolean(doc.Get("NoIndex"));
dto.SearchResults.Add(result);
}
Fuzzy queries certainly are supported. See the FuzzyQuery class.
The query parser also supports fuzzy queries, simply with a tilde appended: misspeled~
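For reference, here is a minimal sketch of both options; it assumes Lucene.Net 3.0.3 and reuses the searcher and analyzer from the question (the field name "Title" is also taken from the question):
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Programmatic fuzzy query: matches terms within the given minimum similarity of "misspeled".
var fuzzy = new FuzzyQuery(new Term("Title", "misspeled"), 0.5f);
var fuzzyHits = searcher.Search(fuzzy, 10);

// The same thing through the query parser: a trailing ~ (optionally followed by a similarity) makes the term fuzzy.
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Title", analyzer);
var parsedQuery = parser.Parse("misspeled~0.5");
var parsedHits = searcher.Search(parsedQuery, 10);
Note that fuzzy queries can be slow on large indexes, since they have to enumerate terms to find close matches.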
At index time I am boosting certain documents in this way:
if (myCondition)
{
document.SetBoost(1.2f);
}
But at search time, documents that are otherwise exactly the same, some passing and some failing myCondition, all end up with the same score.
And here is the search code:
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.Add(new TermQuery(new Term(FieldNames.HAS_PHOTO, "y")), BooleanClause.Occur.MUST);
booleanQuery.Add(new TermQuery(new Term(FieldNames.AUTHOR_TYPE, AuthorTypes.BLOGGER)), BooleanClause.Occur.MUST_NOT);
indexSearcher.Search(booleanQuery, 10);
Can you tell me what I need to do to get the documents that were boosted to get a higher score?
Many Thanks!
Lucene encodes boosts on a single byte (although a float is generally encoded on four bytes) using the SmallFloat#floatToByte315 method. As a consequence, there can be a big loss in precision when converting the byte back to a float.
In your case SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(1.2f)) returns 1f because 1f and 1.2f are too close to each other. Try using a bigger boost so that your documents get different scores. (For example, with 1.25, SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(1.25f)) gives 1.25f.)
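If you want to see the rounding for yourself, here is a quick check; it assumes the Lucene.Net port keeps the Java names (Lucene.Net.Util.SmallFloat with FloatToByte315/Byte315ToFloat):
using System;
using Lucene.Net.Util;

// 1.2f collapses back to 1.0f after the byte round-trip, so that boost is effectively lost.
Console.WriteLine(SmallFloat.Byte315ToFloat(SmallFloat.FloatToByte315(1.2f)));
// 1.25f survives the round-trip, so it actually changes the score.
Console.WriteLine(SmallFloat.Byte315ToFloat(SmallFloat.FloatToByte315(1.25f)));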
Here is the requested test program that was too long to post in a comment.
class Program
{
static void Main(string[] args)
{
RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer());
const string FIELD = "name";
for (int i = 0; i < 10; i++)
{
StringBuilder notes = new StringBuilder();
notes.AppendLine("This is a note 123 - " + i);
string text = notes.ToString();
Document doc = new Document();
var field = new Field(FIELD, text, Field.Store.YES, Field.Index.NOT_ANALYZED);
if (i % 2 == 0)
{
field.SetBoost(1.5f);
doc.SetBoost(1.5f);
}
else
{
field.SetBoost(0.1f);
doc.SetBoost(0.1f);
}
doc.Add(field);
writer.AddDocument(doc);
}
writer.Commit();
//string TERM = QueryParser.Escape("*+*");
string TERM = "T";
IndexSearcher searcher = new IndexSearcher(dir);
Query query = new PrefixQuery(new Term(FIELD, TERM));
var hits = searcher.Search(query);
int count = hits.Length();
Console.WriteLine("Hits - {0}", count);
for (int i = 0; i < count; i++)
{
var doc = hits.Doc(i);
Console.WriteLine(doc.ToString());
var explain = searcher.Explain(query, i);
Console.WriteLine(explain.ToString());
}
}
}
I have already seen a few similar questions, but I still don't have an answer. I think I have a simple problem.
In the sentence
In this text, only Meta Files are important, and Test Generation.
Anything else is irrelevant
I want to index only "Meta Files" and "Test Generation". That means that I need an exact match.
Could someone please explain how to achieve this?
And here is the code:
Analyzer analyzer = new StandardAnalyzer();
Lucene.Net.Store.Directory directory = new RAMDirectory();
IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
iwriter.SetMaxFieldLength(10000);
Document doc = new Document();
doc.Add(new Field("textFragment", text, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
iwriter.AddDocument(doc);
iwriter.Close();
IndexSearcher isearcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser("textFragment", analyzer);
foreach (DictionaryEntry de in OntologyLayer.OntologyLayer.HashTable)
{
List<string> buffer = new List<string>();
double weight = 0;
List<OntologyLayer.Term> list = (List<OntologyLayer.Term>)de.Value;
foreach (OntologyLayer.Term t in list)
{
Hits hits = null;
string label = t.Label;
string[] words = label.Split(' ');
int numOfWords = words.Length;
double wordWeight = 1 / (double)numOfWords;
double localWeight = 0;
foreach (string a in words)
{
try
{
if (!buffer.Contains(a))
{
Lucene.Net.Search.Query query = parser.Parse(a);
hits = isearcher.Search(query);
if (hits != null && hits.Length() > 0)
{
localWeight = localWeight + t.Weight * wordWeight * hits.Length();
}
buffer.Add(a);
}
}
catch (Exception ex)
{}
}
weight = weight + localWeight;
}
sbWeight.AppendLine(weight.ToString());
if (weight > 0)
{
string objectURI = (string)de.Key;
conceptList.Add(objectURI);
}
}
Take a look at Stupid Lucene Tricks: Exact Match, Starts With, Ends With.
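Independent of that article, if "exact match" simply means matching the whole phrase rather than its individual words, a quoted phrase already gets you most of the way with the code from the question: the parser then builds a PhraseQuery, so only documents where the words occur next to each other are hits. A minimal sketch, assuming the question's Lucene.Net 2.x API (isearcher, the textFragment field, and StandardAnalyzer come from the question):
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// StandardAnalyzer lowercases the terms, so the phrase is searched as "meta files";
// only documents containing those two words adjacently (in that order) match.
QueryParser phraseParser = new QueryParser("textFragment", new StandardAnalyzer());
Query phraseQuery = phraseParser.Parse("\"Meta Files\"");
Hits phraseHits = isearcher.Search(phraseQuery);
Console.WriteLine(phraseHits.Length());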