Lucene - cannot query values with capital letters

I'd like to index and search lower-cased keywords. I've attached test code which, IMHO, clearly demonstrates my simple goal: I index two words, one with a capital letter, and then I search for and print them back one by one. For this I created an Analyzer that just converts keywords to lower case (KeywordAnalyzer doesn't lower-case and SimpleAnalyzer splits on non-letter characters). I use this analyzer for both the IndexWriter and the QueryParser. However, for some reason I can't get back words with capital letters, even when I search for the lower-cased word ("bye" in the example).
Expected output:
hello
Bye
Actual output:
hello
What's the problem?
I hope you don't mind that the code is in Scala. I'll gladly help you understand it in case it's not clear what the code does.
import org.apache.lucene.store.FSDirectory
import java.io.{Reader, File}
import org.apache.lucene.index._
import org.apache.lucene.document._
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.analysis.util.CharTokenizer
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.util.Version
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents
final class LcAnalyzer(lucVer: Version) extends Analyzer {
  def createComponents(fieldName: String, reader: Reader) =
    new TokenStreamComponents(new CharTokenizer(lucVer, reader) {
      def isTokenChar(c: Int) = true
      override def normalize(c: Int) = Character.toLowerCase(c)
    })
}
object LuceneTest {
  val LV = Version.LUCENE_43
  val F = "myf"
  val VALS = Seq("hello", "Bye")

  val indexDir = FSDirectory.open(new File("testindex"))
  val anlz = new LcAnalyzer(LV)

  def main(args: Array[String]) {
    writeData()
    val reader = DirectoryReader.open(indexDir)
    val searcher = new IndexSearcher(reader)
    val p = new QueryParser(LV, F, anlz)
    for (v <- VALS) {
      val hits = searcher.search(p.parse(F + ':' + v), 1).scoreDocs
      for (i <- 0 until hits.length) {
        val doc = searcher.doc(hits(i).doc)
        println(doc.get(F))
      }
    }
  }

  def writeData() {
    val writer = {
      val wc = new IndexWriterConfig(LV, anlz)
      val writer = new IndexWriter(indexDir, wc)
      writer.commit()
      writer
    }
    for (v <- VALS) {
      val doc = new Document
      doc.add(new StringField(F, v, Field.Store.YES))
      writer.addDocument(doc)
    }
    writer.commit()
    writer.close()
  }
}
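For what it's worth, a likely explanation (my own note, not part of the original post): StringField is indexed as a single verbatim token and never passes through the analyzer, so "Bye" keeps its capital letter in the index while the QueryParser lowercases the query to "bye". A minimal sketch of the fix, assuming the rest of the code stays the same, is to index the field as an analyzed TextField (or lower-case the value before adding it):

for (v <- VALS) {
  val doc = new Document
  // TextField is analyzed at index time, so LcAnalyzer lower-cases "Bye" to "bye"
  // on both the indexing and the query side (StringField bypasses the analyzer)
  doc.add(new TextField(F, v, Field.Store.YES))
  writer.addDocument(doc)
}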

Related

How to try every possible permutation in Kotlin

fun main() {
    var integers = mutableListOf(0)
    for (x in 1..9) {
        integers.add(x)
    }
    // for or while could be used in this instance
    var lowerCase = listOf("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z")
    var upperCase = listOf('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z')
    println(integers)
    println(lowerCase)
    println(upperCase)
    // Note that for the actual program, it is also vital that I use potential punctuation
    val passwordGeneratorKey1 = Math.random() * 999
    val passwordGeneratorKey2 = passwordGeneratorKey1.toInt()
    var passwordGeneratorL1 = lowerCase[(Math.random() * lowerCase.size).toInt()]
    var passwordGeneratorL2 = lowerCase[(Math.random() * lowerCase.size).toInt()]
    var passwordGeneratorL3 = lowerCase[(Math.random() * lowerCase.size).toInt()]
    var passwordGeneratorU1 = upperCase[(Math.random() * upperCase.size).toInt()]
    var passwordGeneratorU2 = upperCase[(Math.random() * upperCase.size).toInt()]
    var passwordGeneratorU3 = upperCase[(Math.random() * upperCase.size).toInt()]
    val password = passwordGeneratorKey2.toString() + passwordGeneratorL1 + passwordGeneratorL2 + passwordGeneratorL3 + passwordGeneratorU1 + passwordGeneratorU2 + passwordGeneratorU3
    println(password)
    // No, this isn't random, but it's pretty close to it
    // How do I now run through every possible combination of the lists lowerCase, integers, and upperCase?
}
How do I run through every possible permutation to eventually solve for the randomly generated password? This is in Kotlin.
I think you should append all the lists together and then draw from the combined list by random index; this way you ensure that the positions of digits, lowercase and uppercase characters are random too. You also don't need to write out all the characters: you can use ranges, which generate them for you.
import kotlin.random.Random

fun main() {
    val allChars = mutableListOf<Any>().apply {
        addAll(0..9)     // creates range from 0 to 9 and adds it to the list
        addAll('a'..'z') // creates range from a to z and adds it to the list
        addAll('A'..'Z') // creates range from A to Z and adds it to the list
    }
    val passwordLength = 9
    val password = StringBuilder().apply {
        for (i in 0 until passwordLength) {
            val randomCharIndex = Random.nextInt(allChars.size) // random index from 0 to lastIndex of the list
            val randomChar = allChars[randomCharIndex]          // select character from the list
            append(randomChar)                                  // append char to the password string builder
        }
    }.toString()
    println(password)
}
An even shorter solution can be achieved using list methods:
fun main() {
    val password = mutableListOf<Any>()
        .apply {
            addAll(0..9)     // creates range from 0 to 9 and adds it to the list
            addAll('a'..'z') // creates range from a to z and adds it to the list
            addAll('A'..'Z') // creates range from A to Z and adds it to the list
        }
        .shuffled()       // shuffle the list
        .take(9)          // take the first 9 elements from the list
        .joinToString("") // join them into a string
    println(password)
}
As others pointed out, there are less painful ways to generate the initial password in the format of 1 to 3 digits followed by 3 lowercase characters followed by 3 uppercase characters.
To brute-force this password, you will need to consider all 3-permutations of "a..z" and all 3-permutations of "A..Z". In both cases the number of such 3-permutations is 15600 = 26! / (26-3)! = 26 * 25 * 24. In the worst case you will have to examine 1000 * 15600 * 15600 combinations, half of that on average.
Probably doable in a few hours with the code below:
import kotlin.random.Random
import kotlin.system.exitProcess

val lowercaseList = ('a'..'z').toList()
val uppercaseList = ('A'..'Z').toList()
val lowercase = lowercaseList.joinToString(separator = "")
val uppercase = uppercaseList.joinToString(separator = "")

fun genPassword(): String {
    val lowercase = lowercaseList.shuffled().take(3)
    val uppercase = uppercaseList.shuffled().take(3)
    return (listOf(Random.nextInt(0, 1000)) + lowercase + uppercase).joinToString(separator = "")
}

/**
 * Generate all K-sized permutations of str of length N. The number of such permutations is:
 * N! / (N-K)!
 *
 * For example: perm(2, "abc") = [ab, ac, ba, bc, ca, cb]
 */
fun perm(k: Int, str: String): List<String> {
    val nk = str.length - k
    fun perm(str: String, accumulate: String): List<String> {
        return when (str.length == nk) {
            true -> listOf(accumulate)
            false -> {
                str.flatMapIndexed { i, c ->
                    perm(str.removeRange(i, i + 1), accumulate + c)
                }
            }
        }
    }
    return perm(str, "")
}

fun main() {
    val password = genPassword().also { println(it) }
    val all3LowercasePermutations = perm(3, lowercase).also { println(it) }.also { println(it.size) }
    val all3UppercasePermutations = perm(3, uppercase).also { println(it) }.also { println(it.size) }
    for (i in 0..999) {
        println("trying $i")
        for (l in all3LowercasePermutations) {
            for (u in all3UppercasePermutations) {
                if ("$i$l$u" == password) {
                    println("found: $i$l$u")
                    exitProcess(0)
                }
            }
        }
    }
}

NoSuchElementException from java.util.Scanner

I have no idea what the error is, and I am having a hard time adapting to this language. Any help is appreciated, thank you very much.
Error:
Exception in thread "main" java.util.NoSuchElementException
at java.base/java.util.Scanner.throwFor(Scanner.java:937)
at java.base/java.util.Scanner.next(Scanner.java:1594)
at java.base/java.util.Scanner.nextInt(Scanner.java:2258)
at java.base/java.util.Scanner.nextInt(Scanner.java:2212)
at Packing.<init>(Packing.kt:100)
at PackingKt.main(Packing.kt:7)
at PackingKt.main(Packing.kt)
My code:
import java.io.InputStream
import java.util.Scanner

fun main() {
    val input = Scanner(InputStream.nullInputStream())
    val packing1 = Packing(input)
    val packing2 = Packing(input)
    val packing3 = Packing(input)
    var total = 0
    var min = 0
    val combinations = ArrayList<String>()
    for (a in 1..3) {
        for (b in 1..3) {
            for (c in 1..3) {
                // here is a piece of code
            }
        }
        combinations.sort()
        println("${combinations.get(0)} $min")
    }
}

class Packing {
    var brownBottles = 0
    var greenBottles = 0
    var clearBottles = 0

    constructor(input: Scanner) {
        brownBottles = input.nextInt() // this is line 100
        greenBottles = input.nextInt()
        clearBottles = input.nextInt()
    }
}
The idea is to enter values from the console that initialize the variables of my objects.
I would just use
val input = Scanner(System.`in`)
If you enter 9 integers in the console, the initialization of the three Packing objects should work. The nullInputStream() makes no sense to me; it's not possible to read from the console with that.
Also, the combinations list is empty, so accessing it here throws an exception:
println("${combinations.get(0)} $min")

How to update Lucene Spellchecker indexes without reindexing?

I have a Lucene SpellChecker indexing implementation like so:
def buildAutoSuggestIndex(path: Path): SpellChecker = {
  val config = new IndexWriterConfig(new CustomAnalyzer())
  val dictionary = new PlainTextDictionary(path)
  val directory = FSDirectory.open(path.getParent)
  val spellChecker = new SpellChecker(directory)
  val jw = new JaroWinklerDistance()
  jw.setThreshold(jaroWinklerThreshold)
  spellChecker.setStringDistance(jw)
  spellChecker.indexDictionary(dictionary, config, true)
  spellChecker
}
I need to update these SpellChecker dictionaries, i.e. index new entries without reindexing the whole index. Is there any way to update SpellChecker indexes?
SpellChecker.indexDictionary(...) already avoids reindexing terms right here:
terms: while ((currentTerm = iter.next()) != null) {
  String word = currentTerm.utf8ToString();
  int len = word.length();
  if (len < 3) {
    continue; // too short we bail but "too long" is fine...
  }
  if (!isEmpty) {
    for (TermsEnum te : termsEnums) {
      if (te.seekExact(currentTerm)) {
        continue terms;
      }
    }
  }
  // ok index the word
  Document doc = createDocument(word, getMin(len), getMax(len));
  writer.addDocument(doc);
seekExact will return true if the term is already contained, in which case the document with the n-grams for the term is not added (continue terms;).
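In other words, you should be able to call indexDictionary again on the same spellchecker directory with a dictionary that contains the new entries; terms already indexed are skipped and only the new words get n-gram documents. A minimal sketch under that assumption, reusing the names from the question (CustomAnalyzer and the dictionary path are placeholders from the original code):

// Sketch: incrementally add new entries to an existing SpellChecker index.
def addNewEntries(spellChecker: SpellChecker, newEntriesPath: Path): Unit = {
  val config = new IndexWriterConfig(new CustomAnalyzer())
  val newDictionary = new PlainTextDictionary(newEntriesPath)
  // existing terms are skipped inside indexDictionary (the "continue terms" above),
  // so only the new entries are analyzed and added; the last argument only controls
  // whether the spellchecker index is fully merged afterwards
  spellChecker.indexDictionary(newDictionary, config, false)
}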

Lucene custom scoring for numeric fields

I would like to have, in addition to standard term search with tf-idf similarity over a text content field, scoring based on the "similarity" of numeric fields. This similarity would depend on the distance between the value in the query and in the document (e.g. Gaussian with m = [user input], s = 0.5).
I.e. let's say documents represent people, and a person document has two fields:
description (full text)
age (numeric).
I want to find documents like
description:(x y z) age:30
but with age acting not as a filter, but as part of the score (for a person of age 30 the multiplier will be 1.0, for a 25-year-old 0.8, etc.).
Can this be achieved in a sensible manner?
EDIT: Finally I found out this can be done by wrapping ValueSourceQuery and TermQuery with CustomScoreQuery. See my solution below.
EDIT 2: Since Lucene versions change fast, I just want to add that this was tested on Lucene 3.0 (Java).
Okay, so here's a (bit verbose) proof of concept as a full JUnit test. I haven't tested its efficiency for a large index yet, but from what I've read it should probably perform well after a warm-up, provided there's enough RAM available to cache the numeric fields.
package tests;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.IntFieldSource;
import org.apache.lucene.search.function.ValueSourceQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import junit.framework.TestCase;

public class AgeAndContentScoreQueryTest extends TestCase
{
    public class AgeAndContentScoreQuery extends CustomScoreQuery
    {
        protected float peakX;
        protected float sigma;

        public AgeAndContentScoreQuery(Query subQuery, ValueSourceQuery valSrcQuery, float peakX, float sigma) {
            super(subQuery, valSrcQuery);
            this.setStrict(true); // do not normalize score values from ValueSourceQuery!
            this.peakX = peakX;   // age for which the age-relevance is best
            this.sigma = sigma;
        }

        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore){
            // subQueryScore is the tf-idf score from the content query
            float contentScore = subQueryScore;

            // valSrcScore is the value of the date-of-birth field, represented as a float
            // let's convert the age value to a gaussian-like age relevance score
            float x = (2011 - valSrcScore); // age
            float ageScore = (float) Math.exp(-Math.pow(x - peakX, 2) / (2 * sigma * sigma));

            float finalScore = ageScore * contentScore;

            System.out.println("#contentScore: " + contentScore);
            System.out.println("#ageValue: " + (int) valSrcScore);
            System.out.println("#ageScore: " + ageScore);
            System.out.println("#finalScore: " + finalScore);
            System.out.println("+++++++++++++++++");

            return finalScore;
        }
    }

    protected Directory directory;
    protected Analyzer analyzer = new WhitespaceAnalyzer();
    protected String fieldNameContent = "content";
    protected String fieldNameDOB = "dob";

    protected void setUp() throws Exception
    {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();

        // indexed documents
        String[] contents = {"foo baz1", "foo baz2 baz3", "baz4"};
        int[] dobs = {1991, 1981, 1987}; // date of birth

        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++)
        {
            Document doc = new Document();
            doc.add(new Field(fieldNameContent, contents[i], Field.Store.YES, Field.Index.ANALYZED)); // store & index
            doc.add(new NumericField(fieldNameDOB, Field.Store.YES, true).setIntValue(dobs[i]));      // store & index
            writer.addDocument(doc);
        }
        writer.close();
    }

    public void testSearch() throws Exception
    {
        String inputTextQuery = "foo bar";
        float peak = 27.0f;
        float sigma = 0.1f;

        QueryParser parser = new QueryParser(Version.LUCENE_30, fieldNameContent, analyzer);
        Query contentQuery = parser.parse(inputTextQuery);

        ValueSourceQuery dobQuery = new ValueSourceQuery(new IntFieldSource(fieldNameDOB));
        // or: FieldScoreQuery dobQuery = new FieldScoreQuery(fieldNameDOB, Type.INT);

        CustomScoreQuery finalQuery = new AgeAndContentScoreQuery(contentQuery, dobQuery, peak, sigma);

        IndexSearcher searcher = new IndexSearcher(directory);
        TopDocs docs = searcher.search(finalQuery, 10);

        System.out.println("\nDocuments found:\n");
        for (ScoreDoc match : docs.scoreDocs)
        {
            Document d = searcher.doc(match.doc);
            System.out.println("CONTENT: " + d.get(fieldNameContent));
            System.out.println("D.O.B.: " + d.get(fieldNameDOB));
            System.out.println("SCORE: " + match.score);
            System.out.println("-----------------");
        }
    }
}
This can be achieved using Solr's FunctionQuery
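For example (my own sketch, not from the original answer), Solr's boost query parser can multiply the text query's score by a function of the numeric field; recip(abs(sub(age,30)),1,10,10) evaluates to 1.0 at age 30 and decays as the age moves away from it (a reciprocal decay rather than a true Gaussian):

// Sketch using SolrJ (an assumption; any HTTP client sending the same query string works)
import org.apache.solr.client.solrj.SolrQuery

val query = new SolrQuery()
query.setQuery("{!boost b=recip(abs(sub(age,30)),1,10,10)}description:(x y z)")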

How to use ScalaQuery to insert a BLOB field?

I used ScalaQuery and Scala.
If I have an Array[Byte] object, how do I insert it into the table?
object TestTable extends BasicTable[Test]("test") {
  def id = column[Long]("mid", O.NotNull)
  def extInfo = column[Blob]("mbody", O.Nullable)
  def * = id ~ extInfo <> (Test, Test.unapply _)
}

case class Test(id: Long, extInfo: Blob)
Can I define the column as def extInfo = column[Array[Byte]]("mbody", O.Nullable) instead? How do I operate (UPDATE, INSERT, SELECT) on the BLOB-typed field?
BTW: no ScalaQuery tag
Since the BLOB field is nullable, I suggest changing its Scala type to Option[Blob], for the following definition:
object TestTable extends Table[Test]("test") {
  def id = column[Long]("mid")
  def extInfo = column[Option[Blob]]("mbody")
  def * = id ~ extInfo <> (Test, Test.unapply _)
}

case class Test(id: Long, extInfo: Option[Blob])
You can use a raw, nullable Blob value if you prefer, but then you need to use orElse(null) on the column to actually get a null value out of it (instead of throwing an Exception):
def * = id ~ extInfo.orElse(null) <> (Test, Test.unapply _)
Now for the actual BLOB handling. Reading is straightforward: you just get a Blob object in the result, which is implemented by the JDBC driver, e.g.:
Query(TestTable) foreach { t =>
  println("mid=" + t.id + ", mbody = " +
    Option(t.extInfo).map { b => b.getBytes(1, b.length.toInt).mkString })
}
If you want to insert or update data, you need to create your own BLOBs. A suitable implementation for a stand-alone Blob object is provided by JDBC's RowSet feature:
import javax.sql.rowset.serial.SerialBlob
TestTable insert Test(1, null)
TestTable insert Test(2, new SerialBlob(Array[Byte](1,2,3)))
Edit: And here's a TypeMapper[Array[Byte]] for Postgres (whose BLOBs are not yet supported by ScalaQuery):
implicit object PostgresByteArrayTypeMapper extends
    BaseTypeMapper[Array[Byte]] with TypeMapperDelegate[Array[Byte]] {
  def apply(p: BasicProfile) = this
  val zero = new Array[Byte](0)
  val sqlType = java.sql.Types.BLOB
  override val sqlTypeName = "BYTEA"
  def setValue(v: Array[Byte], p: PositionedParameters) {
    p.pos += 1
    p.ps.setBytes(p.pos, v)
  }
  def setOption(v: Option[Array[Byte]], p: PositionedParameters) {
    p.pos += 1
    if (v eq None) p.ps.setBytes(p.pos, null) else p.ps.setBytes(p.pos, v.get)
  }
  def nextValue(r: PositionedResult) = {
    r.pos += 1
    r.rs.getBytes(r.pos)
  }
  def updateValue(v: Array[Byte], r: PositionedResult) {
    r.pos += 1
    r.rs.updateBytes(r.pos, v)
  }
  override def valueToSQLLiteral(value: Array[Byte]) =
    throw new SQueryException("Cannot convert BYTEA to literal")
}
I'm just posting updated code for Scala and ScalaQuery; maybe it will save somebody some time:
object PostgresByteArrayTypeMapper extends
    BaseTypeMapper[Array[Byte]] with TypeMapperDelegate[Array[Byte]] {
  def apply(p: org.scalaquery.ql.basic.BasicProfile) = this
  val zero = new Array[Byte](0)
  val sqlType = java.sql.Types.BLOB
  override val sqlTypeName = "BYTEA"
  def setValue(v: Array[Byte], p: PositionedParameters) {
    p.pos += 1
    p.ps.setBytes(p.pos, v)
  }
  def setOption(v: Option[Array[Byte]], p: PositionedParameters) {
    p.pos += 1
    if (v eq None) p.ps.setBytes(p.pos, null) else p.ps.setBytes(p.pos, v.get)
  }
  def nextValue(r: PositionedResult) = {
    r.nextBytes()
  }
  def updateValue(v: Array[Byte], r: PositionedResult) {
    r.updateBytes(v)
  }
  override def valueToSQLLiteral(value: Array[Byte]) =
    throw new org.scalaquery.SQueryException("Cannot convert BYTEA to literal")
}
and then usage, for example:
...
// defining a column
def content = column[Array[Byte]]("page_Content")(PostgresByteArrayTypeMapper)
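And a hedged usage sketch (the Page table and its column names are made up for illustration; they are not part of the original answer):

// Hypothetical table using the custom type mapper; names are illustrative only
case class Page(id: Long, content: Array[Byte])

object PageTable extends Table[Page]("pages") {
  def id = column[Long]("page_id")
  def content = column[Array[Byte]]("page_Content")(PostgresByteArrayTypeMapper)
  def * = id ~ content <> (Page, Page.unapply _)
}

// inserting and reading the byte array then works like any other column
PageTable insert Page(1L, "hello".getBytes("UTF-8"))
Query(PageTable) foreach { p => println(p.id + " -> " + p.content.length + " bytes") }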