Lucene: extracting the sentence in which a word match occurs - lucene

I'm a newbie to Lucene. While learning it, I successfully indexed the files in a directory and ran a basic Lucene search to get the list of files that contain a particular word.
Now I'm trying to extract, from each file, the sentence in which the search word occurs.
I've searched a lot but couldn't figure it out.
Regards.

Thank you all for your response.
I was trying to build an index of the sentences in the files of a directory, not to retrieve the "relevant/best text fragment".
Here is how I solved the problem:
I used "two-level indexing": first index the files in a directory, then index the sentences in each file. This made my job much easier and faster.
Anyways, thanks again for the help :)
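For anyone who wants to see the idea of a sentence-level index in isolation, here is a minimal sketch in plain Python. It is not Lucene code and not the poster's implementation. The function names, the naive sentence splitter (split after ".", "!" or "?") and the in-memory dictionaries are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_sentence_index(files):
    # files: dict mapping a file name to its full text.
    # Returns (index, sentences):
    #   index maps a lowercased word to the set of (file name, sentence number) keys
    #   sentences maps each (file name, sentence number) key to the sentence text
    index = defaultdict(set)
    sentences = {}
    for name, text in files.items():
        for i, sentence in enumerate(split_sentences(text)):
            sentences[(name, i)] = sentence
            for word in re.findall(r"\w+", sentence.lower()):
                index[word].add((name, i))
    return index, sentences


def find_sentences(word, index, sentences):
    # Look the word up and return the matching sentences in a stable order.
    return [sentences[key] for key in sorted(index.get(word.lower(), ()))]
```

With something like this, a lookup returns the matching sentences directly instead of only the file names. In Lucene the equivalent would probably be to add one document per sentence, with the file name stored as a field.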

You're looking for the method
org.apache.lucene.search.highlight.Highlighter.getBestFragment
This method takes the tokens produced by analyzing the original text and returns the most relevant fragment of that text. Remember to trim the fragments if they are too long.

Related

Lucene query result is not correct when running official demo

I tried the official Lucene demo by running IndexFiles with the arguments -index . -docs . and the console output shows that pom.xml and the *.java and *.class files were added to the index.
Then I ran SearchFiles with the arguments -index . -query "lucene AND main". The console lists only IndexFiles.class, SearchFiles.class and IndexFiles.java, but not SearchFiles.java (which I think should be one of the results).
Your search results are correct (for the .java files, at least).
The sample code uses the StandardAnalyzer which, in turn, uses the StandardTokenizer.
The StandardTokenizer splits input text into tokens using the rules described in this document. For example, from section 4 of that document:
When you have text such as the following, in the source files
org.apache.lucene.analysis.Analyzer
this is tokenized as a single token. The rules do not place a word boundary at a period that sits between two letters.
Looking in the IndexFiles.java source file, there is the following text:
demonstrating simple Lucene indexing
This is tokenized into 4 separate tokens.
But in the SearchFiles.java source file, the text "lucene" only ever appears inside text such as org.apache.lucene.analysis.Analyzer, and therefore the single token lucene is never created.
Your query therefore does not find any hits in the SearchFiles.java document, because the query matches exact tokens. Both source files contain the token "main", but only IndexFiles.java contains the token "lucene".
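To make the token argument concrete, here is a rough sketch in plain Python. It does not call Lucene. The regular expression is only a simplified stand-in (my assumption) for the one word-break rule that matters here, namely that a period or apostrophe between letters or digits does not end a token. The real StandardTokenizer implements many more rules.

```python
import re

# Simplified approximation: runs of word characters, optionally joined by
# a single "." or "'" when more word characters follow.
TOKEN = re.compile(r"\w+(?:[.']\w+)*")


def tokenize(text):
    # Lowercase the tokens, since the StandardAnalyzer also lowercases them.
    return [token.lower() for token in TOKEN.findall(text)]


index_files_line = "demonstrating simple Lucene indexing"
search_files_line = "import org.apache.lucene.analysis.Analyzer;"

print(tokenize(index_files_line))
print(tokenize(search_files_line))
```

The first line yields a standalone "lucene" token. The second yields only "import" and one long dotted token, so a query for the exact token "lucene" has nothing to match in it.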
For the .class files, because these are compiled bytecode files, I would say they should not be indexed in the first place. Lucene works with text files, not binary files. Yes, the class files will contain fragments of text, but they will also typically contain unprintable control characters, which are not suitable to be indexed. I think indexing results could be unpredictable because of this.
You can explore the indexed data using Luke, which is bundled in the binary releases.

Pentaho - Spoon Decimal from Text File Input

I'm new to Pentaho and have a small problem with the Text file input step.
I currently need to write several data records to a database. In the files, the decimal numbers use a point as the decimal separator.
Pentaho is currently transforming the number 123.3659 € to 12.33 €.
Can someone help?
When you read the file, do you read it as CSV, Excel or something like that? If so, you can specify the format of the column so that the number is interpreted correctly (I think; I'm going from memory here). Changing the language setting used for the file might also work.
If the file just contains a string, you can use a step such as the string operator to replace the point with a comma.
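As a small illustration of why the column format matters, here is a plain Python sketch. It is not Pentaho code and does not reproduce Pentaho's parsing. The helper name parse_number and its two parameters are made up for this example. It only shows that the same text gives a different number depending on which character is treated as the decimal separator and which as the grouping separator.

```python
from decimal import Decimal


def parse_number(text, decimal_sep=".", grouping_sep=","):
    # Drop grouping separators first, then normalise the decimal separator to "."
    cleaned = text.strip().replace(grouping_sep, "")
    if decimal_sep != ".":
        cleaned = cleaned.replace(decimal_sep, ".")
    return Decimal(cleaned)


# Point as decimal separator: the value is read as intended.
print(parse_number("123.3659", decimal_sep=".", grouping_sep=","))

# Comma as decimal separator, point as grouping: the point is thrown away.
print(parse_number("123.3659", decimal_sep=",", grouping_sep="."))
```

So if the step assumes a comma-decimal format while the file uses points, the numbers come out wrong even though the file itself is fine.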
This problem can have various causes, but I think the following steps should solve it.
-First, add a "Replace in String" step;
-Then search for the dot and replace it with nothing, or with a comma if the number is a float;
Hope this helped!
Give feedback if so!
Have a good day!

U-SQL extracting files complete contents (extracting full source code from html files)

I've got a bunch of HTML files in my Data Lake Store and would like to get their full source code into a table (just one column with the code from all the files, the output format is not relevant to me, but probably tsv). I can't find a way to use the standard Extractors or anything on the web that works for me. Do I have to write a custom Extractor for that?
I've tried the Extractors.Tsv() and Extractors.Text() with a whole bunch of delimiters. I first tried:
@data =
EXTRACT source string
FROM "<MY DIRECTORY IN ADL>"
USING Extractors.Text(delimiter:'');
This didn't work, because it seems not to accept an empty delimiter. It also didn't work when I tried delimiters that don't occur in the HTML files.
Has anyone got an idea how to get this done? It seems to me that I am just stupid, so I hope someone here is a little smarter.
Even better than just the source code would be if I had the source code + filename in two columns, but I wanna start small.
Thank you!
@files =
EXTRACT FileName string,
Text string
FROM @"/somepath/{FileName}.html"
USING Extractors.Text(silent: true, delimiter: '`');
OUTPUT @files
TO "/somepath/Test.txt"
USING Outputters.Tsv(outputHeader: false, quoting: false);

Reading a large-single XML line to a variable using Batch Script

I have an XML file that contains only a single line, but the line is so long that it seems I can't store it in a variable.
What I want is this:
given tag1, tag2.....tag900, I want to put each tag on its own line, as follows:
tag1
tag2
tag3
......
tag900
Do not attempt to do this using native batch. It will be extremely difficult, and any solution will be very slow.
The problem is native batch cannot read lines > 8k, and batch does not have a good way to read partial lines.
There is a method that creates a test file that has size >= your file that consists of a single repeated character. A binary file compare ( FC /B ) is then done and the results are parsed character by character expressed as hex codes. It's a bit more complex than that, but I don't think you want to go there.
The only other option is to use SET /P to read in 1021 chars at a time, and then parse and piece things together. But this is unproven, and again, I don't think worth the effort.
If you want to use a native scripting language, then I suggest VBScript or JScript. (Perhaps PowerShell, but I don't really know much about its capabilities.)
You could download a Unix text processing tool like sed that has been ported to Windows.
I don't do much with XML, but I've got to believe there is a free tool geared specifically for XML that would make your job fairly easy.
Basically, use anything except batch! (this is coming from someone whose hobby is solving problems with batch)
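To show how little code this takes outside of batch, here is a rough Python sketch (Python is my own choice here, it is not one of the tools named above). It reads the input in fixed-size chunks, so the single huge line never has to fit in one variable, and writes a newline wherever one tag ends and the next begins. The function name split_tags and the rule "break between > and <" are assumptions about what the desired output looks like.

```python
import io


def split_tags(src, dst, chunk_size=4096):
    # src and dst are text streams (for example, files opened in text mode).
    prev_ended_with_gt = False
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Handle a "><" pair that was cut in half by the chunk boundary.
        if prev_ended_with_gt and chunk.startswith("<"):
            dst.write("\n")
        # Break between adjacent tags that lie inside this chunk.
        dst.write(chunk.replace("><", ">\n<"))
        prev_ended_with_gt = chunk.endswith(">")


# Small in-memory demonstration; real use would pass two open files instead.
out = io.StringIO()
split_tags(io.StringIO("<a>1</a><b>2</b><c/>"), out, chunk_size=3)
print(out.getvalue())
```

For real XML work a proper XML parser is the safer route. This only covers the "put each tag on its own line" reformatting.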

Taking random string from text file Cocoa?

I'm having trouble finding a good way to randomly pick a string from a text file (the strings are separated by line breaks).
Pretty much, I want to do a setStringValue:@"random string from file here";
Thanks in advance.
Use reservoir sampling if you want to avoid loading the complete file into memory at once. For a file just a few lines in length I'd just go with vodkhang's answer, though.
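In case the term is unfamiliar, here is a minimal sketch of reservoir sampling with a reservoir of size one. It is written in Python rather than Objective-C purely to show the algorithm. The function name random_line and the rng parameter are made up for this example.

```python
import random


def random_line(lines, rng=random):
    # Single pass over the lines: the n-th line replaces the current choice
    # with probability 1/n, so every line ends up equally likely.
    chosen = None
    for n, line in enumerate(lines, start=1):
        if rng.randrange(n) == 0:
            chosen = line.rstrip("\n")
    return chosen
```

A usage example would be passing an open file object, for instance random_line(open("strings.txt")) with a hypothetical file name, which reads one line at a time and never holds the whole file in memory. It returns None for an empty input.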
How about loading the whole file (if it is not too big) into an array, picking a random index, and using that index to get the string from the array?