Find all available values for a field in lucene .net - lucene

If I have a field x, that can contain a value of y, or z etc, is there a way I can query so that I can return only the values that have been indexed?
Example
x available settable values = test1, test2, test3, test4
Item 1 : Field x = test1
Item 2 : Field x = test2
Item 3 : Field x = test4
Item 4 : Field x = test1
Performing required query would return a list of:
test1, test2, test4

I've implemented this before as an extension method:
public static class ReaderExtentions
{
public static IEnumerable<string> UniqueTermsFromField(
this IndexReader reader, string field)
{
var termEnum = reader.Terms(new Term(field));
do
{
var currentTerm = termEnum.Term();
if (currentTerm.Field() != field)
yield break;
yield return currentTerm.Text();
} while (termEnum.Next());
}
}
You can use it very easily like this:
var allPossibleTermsForField = reader.UniqueTermsFromField("FieldName");
That will return you what you want.
EDIT: I was skipping the first term above, due to some absent-mindedness. I've updated the code accordingly to work properly.

TermEnum te = indexReader.Terms(new Term("fieldx"));
do
{
Term t = te.Term();
if (t==null || t.Field() != "fieldx") break;
Console.WriteLine(t.Text());
} while (te.Next());

You can use facets to return the first N values of a field if the field is indexed as a string or is indexed using KeywordTokenizer and no filters. This means that the field is not tokenized but just saved as it is.
Just set the following properties on a query:
facet=true
facet.field=fieldname
facet.limit=N //the number of values you want to retrieve

I think a WildcardQuery searching on field 'x' and value of '*' would do the trick.

I once used Lucene 2.9.2 and there I used the approach with the FieldCache as described in the book "Lucene in Action" by Manning:
String[] fieldValues = FieldCache.DEFAULT.getStrings(indexReader, fieldname);
The array fieldValues contains all values in the index for field fieldname (Example: ["NY", "NY", "NY", "SF"]), so it is up to you now how to process the array. Usually you create a HashMap<String,Integer> that sums up the occurrences of each possible value, in this case NY=3, SF=1.
Maybe this helps. It is quite slow and memory consuming for very large indexes (1.000.000 documents in index) but it works.

Related

Combining Two List in Kotlin with Index

There is a data class as fruits.
data class Fruits(
val code: String, //Unique
val name: String
)
The base list indexed items with boolean variable is as below.
val indexList: MutableList<Boolean> = MutableList(baseFruitList.size) { false }
Now the Favourite Indexed list is as below
val favList: MutableList<Boolean> = MutableList(favFruitList.size) { true}
I want a combined full list which basically has the fav item indicated as true.
Ex:
baseFruitList = {[FT1,apple],[FT2,grapes],[FT3,banana],[FT4,mango],[FT5,pears]}
favList = {[FT2,grapes],[FT4,mango]}
The final index list should have
finalIndexed = {false,true,false,true,false}
How can we achieve in Kotlin, without iterating through each element.
You can do
val finalIndexed = baseFruitList.map { it in favList }
assuming, like #Tenfour04 is asking, that name is guaranteed to be a specific value (including matching case) for a specific code (since that combination is how a data class matches another, e.g. for checking if it's in another list)
If you can't guarantee that, this is safer:
val finalIndexed = baseFruitList.map { fruit ->
favList.any { fav.code == fruit.code }
}
but here you have to iterate over all the favs (at least until you find a match) looking to see if one has the code.
But really, if code is the unique identifier here, why not just store those in your favList?
favList = listOf("FT2", "FT4") // or a Set would be more efficient, and more correct!
val finalIndexed = baseFruitList.map { it.code in favList }
I don't know what you mean about "without iterating through each element" - if you mean without an explicit indexed for loop, then you can use these simple functions like I have here. But there's always some amount of iteration involved. Sets are always an option to help you minimise that

How can I generate schema from text file? (Hadoop-Pig)

Somehow i got filename.log which looks like for example (tab separated)
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
because the value of key column may differ i cannot define schema when i load text like
a = load 'filename.log' as (Name:chararray,Age:int);
Neither do i want to call column by position like
b = foreach a generate $0,$1;
What I want to do is, from only that filename.log, to make it possible to call each value by key, for example
a = load 'filename.log' using PigStorage('\t');
b = group b by Name;
c = foreach b generate group, COUNT(b);
dump c;
for that purpose, i wrote some Java UDF which seperate key:value and get value for every field in tuple as below
public class SPLITALLGETCOL2 extends EvalFunc<Tuple>{
#Override
public Tuple exec(Tuple input){
TupleFactory mTupleFactory = TupleFactory.getInstance();
ArrayList<String> mProtoTuple = new ArrayList<String>();
Tuple output;
String target=input.toString().substring(1, input.toString().length()-1);
String[] tokenized=target.split(",");
try{
for(int i=0;i<tokenized.length;i++){
mProtoTuple.add(tokenized[i].split(":")[1]);
}
output = mTupleFactory.newTupleNoCopy(mProtoTuple);
return output;
}catch(Exception e){
output = mTupleFactory.newTupleNoCopy(mProtoTuple);
return output;
}
}
}
How should I alter this method to get what I want? or How should I write other UDF to get there?
Whatever you do, don't use a tuple to store the output. Tuples are intended to store a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or even exist, or that there won't be more) you should use a bag. Bags are unordered sets of tuples. They can have any number of tuples in them as long as the tuples have the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:
{(Name,Foo),(Age,Bar)}
{(Age,25),(Name,Jim)}
{(Name,Bob)}
{(Age,30),(Name,Roger),(Hair Color,Brown)}
{(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
However, it sounds like you really want a map:
myudf.py
#outputSchema('M:map[]')
def mapize(the_input):
out = {}
for kv in the_input.split(' '):
k, v = kv.split(':')
out[k] = v
return out
myscript.pig
register '../myudf.py' using jython as myudf ;
A = LOAD 'filename.log' AS (total:chararray) ;
B = FOREACH A GENERATE myudf.mapize(total) ;
-- Sample usage, grouping by the name key.
C = GROUP B BY M#'Name' ;
Using the # operator you can pull out all values from the map using the key you give. You can read more about maps here.

How to index field with comma-separated text value - hibernate search

I am implementing a book search function based on hibernate search3.2.
Book object contains a field called authornames. Authornames value is a list of names and comma is the splitter, say "John Will, Robin Rod, James Timerberland"
#Field(index = org.hibernate.search.annotations.Index.UN_TOKENIZED,store=Store.YES)
#FieldBridge(impl=CollectionToCSVBridge.class)
private Set<String> authornames;
I need each of names to be UN_TOKENIZED, so that user search book by single author name: John Will, Robin Rod or James Timerberland.
I used Luke to check indexs, and value in authornames field is stored as "John Will, Robin Rod, James Timerberland", but I can not get result by querying "authornames:John Will"
Anybody can tell me how can I do it?
I gues CollectionToCSVBridge is concatenating all names with a ", " in a larger string.
You should keep them separate instead and add each element individually to the index:
#Override
public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
if ( value == null ) {
return;
}
if ( !( value instanceof Collection ) ) {
throw new IllegalArgumentException( "This FieldBridge only supports collections." );
}
Collection<?> objects = (Collection<?>) value;
for ( Object object : objects ) {
luceneOptions.addFieldToDocument( name, objectToString( object ), document ); // in your case objectToString could do just a #toString
}
}
See also https://forum.hibernate.org/viewtopic.php?f=9&t=1015286&start=0

Get a value from array based on the value of others arrays (VB.Net)

Supposed that I have two arrays:
Dim RoomName() As String = {(RoomA), (RoomB), (RoomC), (RoomD), (RoomE)}
Dim RoomType() As Integer = {1, 2, 2, 2, 1}
I want to get a value from the "RoomName" array based on a criteria of "RoomType" array. For example, I want to get a "RoomName" with "RoomType = 2", so the algorithm should randomize the index of the array that the "RoomType" is "2", and get a single value range from index "1-3" only.
Is there any possible ways to solve the problem using array, or is there any better ways to do this? Thank you very much for your time :)
Note: Code examples below using C# but hopefully you can read the intent for vb.net
Well, a simpler way would be to have a structure/class that contained both name and type properties e.g.:
public class Room
{
public string Name { get; set; }
public int Type { get; set; }
public Room(string name, int type)
{
Name = name;
Type = type;
}
}
Then given a set of rooms you can find those of a given type using a simple linq expression:
var match = rooms.Where(r => r.Type == 2).Select(r => r.Name).ToList();
Then you can find a random entry from within the set of matching room names (see below)
However assuming you want to stick with the parallel arrays, one way is to find the matching index values from the type array, then find the matching names and then find one of the matching values using a random function.
var matchingTypeIndexes = new List<int>();
int matchingTypeIndex = -1;
do
{
matchingTypeIndex = Array.IndexOf(roomType, 2, matchingTypeIndex + 1);
if (matchingTypeIndex > -1)
{
matchingTypeIndexes.Add(matchingTypeIndex);
}
} while (matchingTypeIndex > -1);
List<string> matchingRoomNames = matchingTypeIndexes.Select(typeIndex => roomName[typeIndex]).ToList();
Then to find a random entry of those that match (from one of the lists generated above):
var posn = new Random().Next(matchingRoomNames.Count);
Console.WriteLine(matchingRoomNames[posn]);

uniqueness of composite id with null component

I am running into problems using unique constraints.
The following combinations are allowed
A.name B.name
foo NULL
foo bar
foo bar1
foo1 bar
It should not be possible to create a new A with same name, only if it has a different B.
With the constraints below it is possible to create
A.name B.name
foo NULL
foo NULL
Because NULL seems not to have effect on unique.
Any hints how to fix this?
class A {
String name
static belongsTo = [b:B]
static constraints = {
name(unique:'b')
b(nullable:true)
}
}
class B {
String name
static hasMany = [as:A]
name(unique:true)
}
In the database structure, could you set the columns to NOT NULL DEFAULT 0 or similar, and then treat the zeros the same as you otherwise would the NULLs? Since the column is for names, there's likely to be no digits in the values anyway right?
I'm not entirely sure, but I think this will work:
name(unique:['b', 'name'])
Looking at the code for the unique constraint, it seems feasible. The constraint definitely lets you pass in a list of things to compare the uniqueness to. It calls this the uniquenessGroup. Then, when validating, it iterates over this list. Take a look starting at line 137 here: http://www.docjar.com/html/api/org/codehaus/groovy/grails/orm/hibernate/validation/UniqueConstraint.java.html
The code looks like this:
if(shouldValidate) {
Criteria criteria = session.createCriteria( constraintOwningClass )
.add( Restrictions.eq( constraintPropertyName, propertyValue ) );
if( uniquenessGroup != null ) {
for( Iterator it = uniquenessGroup.iterator(); it.hasNext(); ) {
String propertyName = (String) it.next();
criteria.add(Restrictions.eq( propertyName,
GrailsClassUtils.getPropertyOrStaticPropertyOrFieldValue(target, propertyName)));
}
}
return criteria.list();
}
So it depends on whether the GrailsClassUtils.getPropertyOrStaticPropertyOrFieldValue call will retrieve a property in the same class. Which based on the name it seems like it should.
I'm curious to know if it works for you.