Lucene Highlighter class: highlight different words in different colors

Probably most people reading the title who know a bit about Lucene won't need much further explanation. NB I use Jython but I think most Java users will understand the Java equivalent...
It's a classic thing to want to do: you have more than one term in your search string... in Lucene terms this returns a BooleanQuery. Then you use something like this code to highlight (NB I am a Lucene newbie, this is all closely tweaked from Net examples):
yellow_highlight = SimpleHTMLFormatter( '<b style="background-color:yellow">', '</b>' )
green_highlight = SimpleHTMLFormatter( '<b style="background-color:green">', '</b>' )
...
stream = FrenchAnalyzer( Version.LUCENE_46 ).tokenStream( "both", StringReader( both ) )
scorer = QueryScorer( fr_query, "both" )
fragmenter = SimpleSpanFragmenter(scorer)
highlighter = Highlighter( yellow_highlight, scorer )
highlighter.setTextFragmenter(fragmenter)
best_fragments = highlighter.getBestTextFragments( stream, both, True, 5 )
if best_fragments:
    for best_frag in best_fragments:
        print "=== best frag: %s, type %s" % ( best_frag, type( best_frag ))
        html_text += "&bull; %s<br>\n" % unicode( best_frag )
... and then the html_text is put in a JTextPane for example.
But how would you make the first word in your query highlight with a yellow background and the second word highlight with a green background? I have tried to understand the various classes in org.apache.lucene.search... to no avail. So my only way of learning was googling. I couldn't find any clues...

I asked this question four years ago... At the time I did manage to implement a solution using javax.swing.text.html.HTMLDocument. There's also the interface org.w3c.dom.html.HTMLDocument in the standard Java library. This way is hard work.
But for anyone interested there's a far simpler solution, which takes advantage of the fact that Lucene's SimpleHTMLFormatter returns about the simplest imaginable "marked up" piece of text: chosen words are highlighted with the HTML B tag. That's it. It's not even a "proper" HTML fragment, just a String with <B>s and </B>s in it.
A multi-word query generates a BooleanQuery... from which you can extract multiple TermQuerys by going booleanQuery.clauses() ... getQuery()
I'm working in Groovy. The colouring I want to apply is console codes, as per BASH (or Cygwin). Other types of colouring can be worked out on this model.
So you set up a map before to hold your "markup details":
def markupDetails = [:]
Then for each TermQuery, you call this, with the same text param each time, stipulating a different colour param for each term. NB I'm using Lucene 6.
def createHighlightAndAnalyseMarkup( TermQuery tq, String text, String colour ) {
    def termQueryScorer = new QueryScorer( tq )
    def termQueryHighlighter = new Highlighter( formatter, termQueryScorer )
    TokenStream stream = TokenSources.getTokenStream( fieldName, null, text, analyser, -1 )
    String[] frags = termQueryHighlighter.getBestFragments( stream, text, 999999 )
    // not sure under what circs you get > 1 fragment...
    assert frags.size() <= 1
    // NB you don't always get all terms in all returned LDocuments...
    if( frags.size() ) {
        String highlightedFrag = frags[ 0 ]
        Matcher boldTagMatcher = highlightedFrag =~ /<\/?B>/
        // pos tracks the offset in the text with the tags stripped out
        def pos = 0
        def previousEnd = 0
        while( boldTagMatcher.find()) {
            pos += boldTagMatcher.start() - previousEnd
            previousEnd = boldTagMatcher.end()
            markupDetails[ pos ] = boldTagMatcher.group() == '<B>'? colour : ConsoleColors.RESET
        }
    }
}
As I said, I wanted to colourise console output. The colour parameter in the method here is an ANSI console colour escape code. E.g. yellow is \033[0;33m. ConsoleColors.RESET is \033[0m and marks the place where each coloured bit of text stops.
... after you've finished doing this with all TermQuerys you will have a nice map telling you where individual colours begin and end. You work backwards from the end of the text so as to insert the "markup" at the right position in the String. NB here text is your original unmarked-up String:
markupDetails.sort().reverseEach{ pos, markup ->
    String firstPart = text.substring( 0, pos )
    String secondPart = text.substring( pos )
    text = firstPart + markup + secondPart
}
... at the end of which text contains your marked-up String: print to console. Lovely.
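For anyone who wants to try the offset bookkeeping without a Lucene setup, here is a minimal sketch of the same two steps in Python. It assumes the highlighted strings have already been produced (they are hard-coded below), and the function names markup_positions and apply_markup are made up for this illustration, not part of Lucene:

```python
import re

RESET = "\033[0m"
TAG = re.compile(r"</?B>")

def markup_positions(highlighted, colour):
    """Map offsets in the untagged text to the colour code or RESET."""
    positions = {}
    pos = 0           # offset in the text with the tags stripped out
    previous_end = 0  # end of the previous tag in the tagged text
    for match in TAG.finditer(highlighted):
        pos += match.start() - previous_end
        previous_end = match.end()
        positions[pos] = colour if match.group() == "<B>" else RESET
    return positions

def apply_markup(text, positions):
    """Insert the markup from the end backwards so earlier offsets stay valid."""
    for pos in sorted(positions, reverse=True):
        text = text[:pos] + positions[pos] + text[pos:]
    return text

text = "the cat sat on the mat"
details = {}
# One highlighted string per term, as one highlighter per TermQuery would give
details.update(markup_positions("the <B>cat</B> sat on the mat", "\033[0;33m"))
details.update(markup_positions("the cat sat on the <B>mat</B>", "\033[0;32m"))
coloured = apply_markup(text, details)
print(coloured)
```

Working backwards matters because each insertion shifts everything after it, so inserting from the front would invalidate the later offsets.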

Related

how to search for series of strings in a sentence in kotlin

I have an ArrayList of items. Each item has long strings for example
("The cat is in the hat","It's warm outside","It's cold outside")
what I am trying to do is search for a series of strings for example "It's outside" in any given order in the ArrayList above and it should find 2 of them.
This is what I tried:
fun clickItem(criteria: String) {
    productList = productListAll.filter { it: Data ->
        it.title.contains(criteria, ignoreCase = true)
    } as ArrayList<Data>
}
This works fine when the words I am looking for are in sequence. However, I am trying to get strings in any given order. Does anyone know how to accomplish that?

We can do this by splitting title and criteria on whitespace to get the words in each. Then we use containsAll() to check whether title contains all of the words from criteria. Additionally, we need to convert both of them to lowercase (or uppercase), so that the search is case-insensitive:
private val whitespace = Regex("\\s+")
fun clickItem(criteria: String): List<Data> {
    val criteriaWords = criteria.lowercase().split(whitespace).toSet()
    return productListAll.filter {
        it.title.lowercase().split(whitespace).containsAll(criteriaWords)
    }
}
Note that searching through text is not that trivial, so simple solutions will always be limited. For example, we won't find "it's" when searching for "it is", etc.
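To make the behaviour and its limits concrete, here is a rough sketch of the same word-set idea in Python, using the sample titles from the question. The function name matches_all_words is invented for this example:

```python
def matches_all_words(title, criteria):
    """True if every word of criteria appears as a whole word in title."""
    title_words = set(title.lower().split())
    criteria_words = set(criteria.lower().split())
    return criteria_words <= title_words  # subset test, like containsAll()

titles = ["The cat is in the hat", "It's warm outside", "It's cold outside"]
found = [t for t in titles if matches_all_words(t, "outside It's")]
print(found)
```

The word order in the criteria no longer matters, but matching is still on whole words only, so "it is" would not match a title containing "it's".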

Getting the name of the variable as a string in GD Script

I have been looking for a solution everywhere on the internet, but nowhere can I find a single script which lets me read the name of a variable as a string in Godot 3.1.
What I want to do:
Save path names as variables.
Compare the name of the path variable as a string to the value of another string and print the path value.
Eg -
var Apple = "mypath/folder/apple.png"
var myArray = ["Apple", "Pear"]
Function that compares the Variable name as String to the String -
if (myArray[myposition] == **the required function that outputs variable name as String**(Apple) :
print (Apple) #this prints out the path.
Thanks in advance!

I think your approach here might be a little oversimplified for what you're trying to accomplish. It basically seems to work out to if (array[apple]) == apple then apple, which doesn't really solve a programmatic problem. More complexity seems required.
First, you might have a function to return all of your icon names, something like this.
func get_avatar_names():
    var avatar_names = []
    var folder_path = "res://my/path"
    var avatar_dir = Directory.new()
    avatar_dir.open(folder_path)
    avatar_dir.list_dir_begin(true, true)
    while true:
        var avatar_file = avatar_dir.get_next()
        if avatar_file == "":
            break
        else:
            var avatar_name = avatar_file.trim_suffix(".png")
            avatar_names.append(avatar_name)
    return avatar_names
Then something like this back in the main function, where you have your list of names you care about at the moment, and for each name, check the list of avatar names, and if you have a match, reconstruct the path and do other work:
var some_names = ["Jim","Apple","Sally"]
var avatar_names = get_avatar_names()
for name in some_names:
    if avatar_names.has(name):
        var img_path = "res://my/path/" + name + ".png"
        # load images, additional work, etc...
That's the approach I would take here, hope this makes sense and helps.

I think the current answer is best for the approach you desire, but the performance is pretty bad with string comparisons.
I would suggest adding an enumeration for efficient comparisons. Unfortunately Godot does enums differently than this. It seems like your position is an int, so we can define a dictionary like this to search for the index and print it out with the int value.
var fruits = {0:"Apple",1:"Pear"}
func myfunc():
    var myposition = 0
    if fruits.has(myposition):
        print(fruits[myposition])
output: Apple
If your position was string based then an enum could be used with slightly less typing and different considerations.
reference: https://docs.godotengine.org/en/latest/tutorials/scripting/gdscript/gdscript_basics.html#enums
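A related idea, sketched here in Python rather than GDScript, is to stop relying on variable names altogether and key a dictionary by the name string. This is a different technique from the int-keyed dictionary above. The "Apple" path is the one from the question; the "Pear" path and the function name path_for are assumptions made up for the example:

```python
# Names are ordinary dictionary keys, so no variable-name lookup is needed
paths = {
    "Apple": "mypath/folder/apple.png",
    "Pear": "mypath/folder/pear.png",  # hypothetical path for illustration
}

def path_for(name):
    """Return the stored path for name, or None if the name is unknown."""
    return paths.get(name)

my_array = ["Apple", "Pear", "Plum"]
for name in my_array:
    print(name, "->", path_for(name))
```

GDScript dictionaries also accept string keys, so the same layout should carry over.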

Can't you just use the str() function to convert any data type to string?
var text = str(value)

How can I use the StreamWriteAsText() to write data of the Number type?

My ultimate goal is to write a file of image data and the time it was taken, for multiple times. This could be used to produce time vs intensity plots.
To do this, I am trying to write a 1D image to a file stream repeatedly in time using the ImageWriteImageDataToStream() function. I go about this by attaching a Listener object to the camera view I am reading out and this listener executes a function that writes the image to a file stream using ImageWriteImageDataToStream() every time the data changes (messagemap = "data_changed:MyFunctiontoExecute") .
My question is, is there a way to also write a time stamp to this same file stream?
All I can find is StreamWriteAsText(), which takes a String data type. Can I convert time which is a Number type to a String type?
Does anyone have a better way to do this?
My solution at the moment is to create a separate file at the same time and record the timing using WriteFile(), so not using a file stream.
//MyFunctiontoExecute, where Img is the 1D image at the current time
My_file_stream.StreamSetPos(2,0)
ImageWriteImageDataToStream(Img, My_file_stream, 0)
//Write the time to the same file
Number tmp_time = GetHighResTickCount() - start_time
My_file_stream.StreamSetPos(2,0)
My_file_stream.StreamWriteAsText(0,tmp_time) //does not work
//instead using a different file
WriteFile(My_extrafileID,tmp_time+"\n")

I think your concept of streaming is wrong. When you stream to a file, at the end of the toStream() commands, the stream-position is already at the end. So you don't set the position.
Your script essentially tells the computer to set the stream back to that starting position and then to write the text - overwriting the data.
You only need the 'StreamSetPos()' command when you want to jump over some sections during reading. This is useful when defining import-scripts for specific file formats, for example, or when extracting only specific sub-sets from a file.
If all you want to do is "stream-out some raw-data", you do exactly that: Just call the commands after each other:
void WriteDataPlusDateToStream( object fStream, image img, string dateStr )
{
    number endian = 0
    number encoding = 0
    img.ImageWriteImageDataToStream(fStream,endian)
    fStream.StreamWriteAsText(encoding,dateStr)
}
Similarly, you "stream-in" by following the same sequence:
void ReadDataPlusDateFromStream( object fStream, image img, string &dateStr )
{
    number endian = 0
    number encoding = 0
    img.ImageReadImageDataFromStream(fStream,endian)
    fStream.StreamReadTextLine(encoding,dateStr)
}
Two things are important here:
In ImageReadImageDataFromStream, it is the size and data-type of the image img that define how many bytes are read from the stream and how they are interpreted. Therefore img must have been pre-created with a fitting size and data-type.
In StreamReadTextLine, the stream will continue to read in as text until it encounters the end-of-line character (\n) or the end of the stream. Therefore make sure to write this end-of-line character when streaming-out. Alternatively, you can make sure that the strings are always of a specific size and then use StreamReadAsText with the appropriate length specified.
Using the two methods above, you can use the following test-script as a starting point:
void WriteDataPlusDateToStream( object fStream, image img, string dateStr )
{
    number endian = 0
    number encoding = 0
    img.ImageWriteImageDataToStream(fStream,endian)
    fStream.StreamWriteAsText(encoding,dateStr)
}
void ReadDataPlusDateFromStream( object fStream, image img, string &dateStr )
{
    number endian = 0
    number encoding = 0
    img.ImageReadImageDataFromStream(fStream,endian)
    fStream.StreamReadTextLine(encoding,dateStr)
}
void writeTest(string path)
{
    Result("\n Writing to :" + path )
    image testImg := RealImage("Test",4,100)
    string dateStr;
    number loop = 5;
    number doAutoClose = 1
    object fStream = NewStreamFromFileReference( CreateFileForWriting(path), doAutoClose )
    for( number i=0; i<loop; i++ )
    {
        testImg = icol * random()
        // The trailing "\n" is what StreamReadTextLine looks for when reading back
        dateStr = GetDate(1)+"#"+GetTime(1)+"|"+Format(GetHighResTickCount(),"%.f") + "\n"
        fStream.WriteDataPlusDateToStream(testImg,dateStr)
        sleep(0.33)
    }
}
void readTest(string path)
{
    Result("\n Reading from :" + path )
    image testImg := RealImage("Test",4,100)
    string dateStr;
    number doAutoClose = 1
    object fStream = NewStreamFromFileReference( OpenFileForReading(path), doAutoClose )
    while ( fStream.StreamGetPos() < fStream.StreamGetSize() )
    {
        fStream.ReadDataPlusDateFromStream(testImg,dateStr)
        result("\n time:"+dateStr)
        testImg.ImageClone().ShowImage()
    }
}
string path = "C:/test.dat"
ClearResults()
writeTest(path)
readTest(path)
Note that when streaming "binary data" like this, it is you who defines the file-format. You must make sure that the writing and reading code matches up.
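The same "you define the format" idea can be sketched outside of DM-script. The Python example below is only an illustration under assumptions of my own: each record is a fixed number of 4-byte little-endian floats followed by one newline-terminated text line, and the names write_record and read_record are made up. It uses an in-memory stream instead of a file:

```python
import io
import struct

def write_record(stream, samples, stamp):
    """Write the raw floats, then the time stamp as one text line."""
    stream.write(struct.pack(f"<{len(samples)}f", *samples))
    stream.write((stamp + "\n").encode("ascii"))

def read_record(stream, count):
    """Read back exactly what write_record wrote, in the same order."""
    raw = stream.read(4 * count)  # the reader must know the data size
    samples = list(struct.unpack(f"<{count}f", raw))
    stamp = stream.readline().decode("ascii").rstrip("\n")
    return samples, stamp

buffer = io.BytesIO()
write_record(buffer, [1.0, 2.5, -3.0], "12:00:01|1000")
write_record(buffer, [0.5, 0.25, 4.0], "12:00:02|2000")

buffer.seek(0)
size = len(buffer.getvalue())
records = []
while buffer.tell() < size:
    records.append(read_record(buffer, 3))
print(records)
```

As in the DM-script version, the reader only works because it knows the sample count and data type in advance and because every time stamp ends with a newline.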

Split text file into several parts by character

I apologise in advance if there is already an answer to this problem; if so please just link it (I have looked, btw! I just didn't find anything relating to my specific example) :)
I have a text (.txt) file which contains data in the form 1.10.100.0.200 where 1, 10, 100, 0 and 200 are numbers storing the map terrain layout of a game. This file has multiple lines of 1.10.100.0.200 where each line represents an item of terrain in the map.
Here is what I would like to know:
How do I find out how many lines there are, so I know how many items of terrain to create when I read the map file?
What is the method I should use to get each of 1, 10, 100, 0 and 200:
E.g. when I am translating the file into a map terrain at runtime I might use the terrainitem1.Location = New Point(x, y) or terrainitem1.Size = New Size(p, q) commands, where x, y, p and q are integers or doubles relating to the terrain's location or size. Where would I then find x, y etc. out of 1, 10, 100, 0 and 200, if say x is equal to 1, y to 10 and so on?
I am sorry if this isn't clear, please just ask me and I'll try to explain.
N.B. I am using VB.NET WinForms

There is no way to know how many lines a file has without opening the file and reading its contents.
You didn't indicate how far you've got on this. Do you know how to open a file?
Here's some basic code to do what you want. (Sorry, this is C# but the idea is the same in VB.)
using System;
using System.IO;

string line;
using (TextReader reader = File.OpenText(@"C:\filename.txt"))
{
    // Read each line from the file (until null is returned)
    while ((line = reader.ReadLine()) != null)
    {
        // Get each number in the line (as a string)
        string[] values = line.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
        // Convert each number to an integer
        int id = int.Parse(values[0]);
        int height = int.Parse(values[1]);
        int width = int.Parse(values[2]);
        int x = int.Parse(values[3]);
        int y = int.Parse(values[4]);
    }
}
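Both parts of the question (how many lines, and how to get at each number) come out of the same loop. Here is a short sketch of that in Python, assuming five dot-separated integers per line and reusing the field order from the code above. The function name parse_terrain and the second sample line are invented for the example:

```python
def parse_terrain(lines):
    """Turn lines like '1.10.100.0.200' into one dict per terrain item."""
    items = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines, e.g. a trailing newline
        item_id, height, width, x, y = (int(part) for part in line.split("."))
        items.append({"id": item_id, "height": height, "width": width, "x": x, "y": y})
    return items

sample = ["1.10.100.0.200\n", "2.5.50.30.40\n", "\n"]
terrain = parse_terrain(sample)
print(len(terrain), "terrain items")
print(terrain[0])
```

The number of terrain items is simply the length of the resulting list, so there is no need to count the lines in a separate pass.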

ALIGN_JUSTIFIED for iText list item

I want to set alignment to list items by writing this code -
ListItem alignJustifiedListItem =
new ListItem(bundle.getString(PrintKeys.AckProcess), normalFont8);
alignJustifiedListItem.setAlignment(Element.ALIGN_JUSTIFIED);
I see this doesn't make any change on alignment (defaulted as left aligned). Changing it to
alignJustifiedListItem.setAlignment(Element.ALIGN_JUSTIFIED_ALL); is actually working but then the last line of the content also expands (as mentioned in doc, as well)
I don't understand how the behaviour of setAlignment() can change when ListItem extends Paragraph. I don't see any overriding either.

Please take a look at the ListAlignment example.
In this example, I create a list with three list items of which I set the alignment to ALIGN_JUSTIFIED:
List list = new List(List.UNORDERED);
ListItem item = new ListItem(text);
item.setAlignment(Element.ALIGN_JUSTIFIED);
list.add(item);
text = "a b c align ";
for (int i = 0; i < 5; i++) {
    text = text + text;
}
item = new ListItem(text);
item.setAlignment(Element.ALIGN_JUSTIFIED);
list.add(item);
text = "supercalifragilisticexpialidocious ";
for (int i = 0; i < 3; i++) {
    text = text + text;
}
item = new ListItem(text);
item.setAlignment(Element.ALIGN_JUSTIFIED);
list.add(item);
document.add(list);
If you look at the result, you can see that the alignment works as expected:
I deliberately introduced a very long word such as "supercalifragilisticexpialidocious" to show you that all lines but the last are indeed justified.
Update:
In a comment, you claim that the alignment is wrong when you introduce the \ character, and you want me to fix iText. However, there is nothing to fix.
I have adapted the original example like this:
text = "a b c align ";
for (int i = 0; i < 5; i++) {
    text = text + "\\" + text;
}
item = new ListItem(text);
item.setAlignment(Element.ALIGN_JUSTIFIED);
list.add(item);
text = "supercalifragilisticexpialidocious ";
text = text + text;
text = text + text;
text = text + "\n" + text;
item = new ListItem(text);
item.setAlignment(Element.ALIGN_JUSTIFIED);
list.add(item);
In the first case, I have introduced the \ character. This didn't change anything about the behavior of the ListItem. In the second case, I introduced a newline character. The result was as expected: a line break was introduced and the last line of every "paragraph" delimited by the newline character was indeed not justified. That is what one would normally expect. I would introduce a bug if I changed this.
This is the screen shot of the result:
The introduction of the '\' character in the lines with "a b c align " doesn't have any effect on the alignment. The introduction of the newline halfway through the "supercalifragilisticexpialidocious " part breaks the list item into two parts. The final line of each part is not justified, which is the desired behavior.
If you do not want this behavior, you have to parse the content first and remove all newline characters (carriage return and line feed).
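As a rough illustration of that clean-up step (in Python, not iText code), the sketch below replaces each run of carriage returns and line feeds with a single space before the text would be handed to a list item. Replacing with a space rather than deleting outright is my own choice so that neighbouring words don't get glued together, and the function name strip_line_breaks is made up:

```python
import re

LINE_BREAKS = re.compile(r"[\r\n]+")

def strip_line_breaks(text):
    """Replace every run of CR/LF characters with a single space."""
    return LINE_BREAKS.sub(" ", text)

raw = "first line\r\nsecond line\nthird line"
print(strip_line_breaks(raw))
```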
Update:
In a new comment, you mention the '\' character as an escape character for the ''' character (actually the \' character). I have adapted the original example once more:
text = "a b c\' align ";
for (int i = 0; i < 5; i++) {
    text = text + text;
}
item = new ListItem(text);
item.setAlignment(Element.ALIGN_JUSTIFIED);
list.add(item);
The result looks like this:
The text is justified correctly. However, I can imagine that problems can occur if you handle Strings with escape characters incorrectly. In this case, the '\'' character was hardcoded. If you obtain the String from a database and you read that String incorrectly, then you can have strange results. Especially from my days as a PHP developer, I remember instances where a single quote ended up being stored like this '\\\'' in a database if you didn't watch out.
