I'm new to Lucene and trying to understand how to use it with a simpler scoring function.
I have objects in my dataset with 5-10 terms attached to each of them. Lucene uses TF-IDF similarity by default to rank the objects.
TF-IDF does not make sense here because my data does not have varying term frequencies. How can I change the default scoring function so that results are ranked by the number of overlapping keywords?
Doc1 = {system engineering artificial intelligence}
Doc2 = {architecture logic programming}
Doc3 = {system architecture engineering}
For the query Query = {system architecture}, I want a ranking where Doc3 is ranked higher than Doc1 and Doc2.
I would suggest something like this:
Query query = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("text", "system")), Occur.SHOULD)
    .add(new TermQuery(new Term("text", "architecture")), Occur.SHOULD)
    .build();
In this case Doc3 will be ranked higher than Doc1 and Doc2, but because the clauses are SHOULD clauses, any other document matching at least one of the terms will be returned and ranked as well.
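The ranking asked for above can be written down as a plain overlap count. The sketch below is ordinary Java, not Lucene API; the class and method names are made up for illustration. It only shows the target ordering that a scoring function ignoring term frequency would have to reproduce.

```java
import java.util.*;

// Hypothetical helper, not part of Lucene: ranks documents by keyword overlap.
final class OverlapRanking {
    // Score = number of distinct query terms that also occur in the document.
    static int overlap(Set<String> docTerms, Set<String> queryTerms) {
        int shared = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) shared++;
        }
        return shared;
    }

    // Returns document names by descending overlap; documents sharing no term are dropped.
    static List<String> rank(Map<String, Set<String>> docs, Set<String> queryTerms) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : docs.entrySet()) {
            if (overlap(e.getValue(), queryTerms) > 0) names.add(e.getKey());
        }
        // The sort is stable, so tied documents keep the iteration order of the map.
        names.sort((a, b) -> Integer.compare(
                overlap(docs.get(b), queryTerms), overlap(docs.get(a), queryTerms)));
        return names;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("Doc1", new HashSet<>(Arrays.asList("system", "engineering", "artificial", "intelligence")));
        docs.put("Doc2", new HashSet<>(Arrays.asList("architecture", "logic", "programming")));
        docs.put("Doc3", new HashSet<>(Arrays.asList("system", "architecture", "engineering")));
        Set<String> query = new HashSet<>(Arrays.asList("system", "architecture"));
        System.out.println(rank(docs, query)); // prints [Doc3, Doc1, Doc2]
    }
}
```

Doc3 shares two terms with the query while Doc1 and Doc2 share one each, so Doc3 comes first and the other two tie.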
I am searching an index on three of its fields ("name", "addr" and "fullname"), using a DisjunctionMaxQuery to rank the results by the maximum score across the three fields. When hits have the same score, Lucene ranks them by doc ID (the lower doc ID comes first).
But I don't want to rank by doc ID in that case. I would like to rank by field: if hits have the same score, the hit whose (max) score comes from the field "name" should come before a hit whose score comes from another field.
I think a custom Collector and HitQueue is a good idea, and overriding PriorityQueue.lessThan could change the ranking in the priority queue. Unfortunately, ScoreDoc carries too little information, and it's hard to find out which field produced the max score for each hit.
Does anyone know how to solve this?
The simplest approach would be to boost the fields you want to come first in a tie slightly more than the others:
DisjunctionMaxQuery query = new DisjunctionMaxQuery(0f);
Query subQueryOne = new TermQuery(new Term("one", searchterm));
subQueryOne.setBoost(1.2f);
Query subQueryTwo = new TermQuery(new Term("two", searchterm));
subQueryTwo.setBoost(1.1f);
Query subQueryThree = new TermQuery(new Term("three", searchterm));
subQueryThree.setBoost(1.0f);
query.add(subQueryOne);
query.add(subQueryTwo);
query.add(subQueryThree);
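To see why slightly different boosts break the tie, here is a small sketch in plain Java (not Lucene code). It assumes, as a simplification, that a boost acts as a plain multiplier on the field's raw score and that the tie-breaker is 0, so the final score is the best boosted field score. The class name and the sample scores are invented for illustration.

```java
// Hypothetical model of a dismax score with tie-breaker 0; not Lucene's actual scorer.
final class BoostTieBreak {
    // Final score = maximum over fields of (raw field score * field boost).
    static double maxScore(double[] fieldScores, double[] boosts) {
        double best = 0.0;
        for (int i = 0; i < fieldScores.length; i++) {
            best = Math.max(best, fieldScores[i] * boosts[i]);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] boosts = {1.2, 1.1, 1.0};   // fields "one", "two", "three"
        double[] hitA = {0.5, 0.0, 0.0};     // matches only field "one"
        double[] hitB = {0.0, 0.5, 0.0};     // matches only field "two", same raw score
        double[] hitC = {0.0, 0.52, 0.0};    // matches field "two" with a slightly better raw score
        System.out.println("A beats B: " + (maxScore(hitA, boosts) > maxScore(hitB, boosts)));
        System.out.println("A beats C: " + (maxScore(hitA, boosts) > maxScore(hitC, boosts)));
    }
}
```

Hits A and B would tie without boosts, and the boost puts A first. Note the side effect shown by hit C: a hit with a slightly higher raw score in a lower-boosted field can also end up behind A, so the boosts should be kept as close together as the score distribution allows.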
I have two arrays that I'm trying to compare at several levels. Both have the same structure, with 3 "columns".
The first column contains the polygon's ID, the second an area type, and the third the percentage of each area type for the polygon.
So, for many rows, it will compare, for example, ID : 1 Type : aaa % : 100
But for some elements, I have many rows for the same ID. For example, I'll have ID 2, Type aaa, 25% --- ID 2, type bbb, 25% --- ID 2, type ccc, 50%. And in the second array, I'll have ID 2, Type aaa, 25% --- ID 2, type bbb, 10% --- ID 2, type eee, 38% --- ID 2, type fff, 27%.
here's a visual example..
So, my function has to compare these two arrays and send me an email if there are differences.
(I won't show you the real code because there are 811 lines.) The first "if" condition is
If array1.id = array2.id Then
    If array1.type = array2.type Then
        If array1.percent = array2.percent Then
            zone_verification = True
        Else
            zone_verification = False
        End If
    End If
End If
The problem is that there are more than 50 000 rows in each array. So when I run the function, for each "array1.id" it searches through 50 000 rows in array2. 50 000 searches for 50 000 rows... it takes a very long time to run!
I'm looking for a way to make it run faster. How could I make the search more specific? Example: I have many rows with id "2" in array1. If there are also rows with id "2" in array2, find them, push all the rows where array2.id = 2 into a "sub array" or something like that, and search only in those rows. So I'll have just X rows in array1 to compare with X rows in array2, not with 50 000. And when each "id 2" in array1 is done, do the same thing for "id 4", and for "id 5"...
Hope it's clear. It's almost the first time I've used VB.NET, and I have this big function to get running.
Thanks
EDIT
Here's what I wanna do.
I have two different layers in a geospatial database. Both layers have the same structure. They are a "spatial join" of the land parcels (55 000), and the land use layer. The first layer is the current one, and the second layer is the next one we'll use after 2015.
So I have, for each "land parcel", the percentage of each land use. For a "land parcel" (ID 7580-80-2532), I can have 50% of farming use (TYPE FAR-23) and 50% of residential use (RES-112). In the first array, I'll have 2 rows with the same ID (7580-80-2532), but each one will have a different type (FAR-23, RES-112) and a different %.
In the second layer, the municipal zoning (land use) has changed. So the same "land parcel" will now be 40% residential use (RES-112), 20% commercial (COM-54) and 40% a new farming use (FAR-33).
So, I wanna know if there are some differences. Some land parcels will be exactly the same. Some parcels will keep the same land use, but not the same percentage of each. But for some land parcel, there will be more or less land use types with different percentage of each.
I want this script to compare these two layers and send me an email when there are differences between these two layers for the same land parcel ID.
The script is already working, but it takes too much time.
The problem is, I think, that the script goes through all of array2 for each row in array1.
What I want is, when there is more than one row with the same ID in array1, to take only the rows with this ID in both arrays.
Maybe if I order them by ID, I could write a condition, something like "once you have found what you're looking for, stop searching as soon as you hit a different value"?
It's hard to explain clearly because I've only been using VB since last week... And English isn't my first language! ;)
If you just want to find out if there are any differences between the first and second array, you could do:
Dim diff = New HashSet(of Polygon)(array1)
diff.SymmetricExceptWith(array2)
diff will contain any Polygon which is unique to array1 or array2. If you want to do other types of comparisons, maybe you should explain what you're trying to do exactly.
UPDATE:
You could use grouping and lookups like this:
'Create lookup with first array, for fast access by ID
Dim lookupByID = array1.ToLookup(Function(p) p.id)
'Loop through each group of items with same ID in array2
For Each secondArrayValues in array2.GroupBy(Function(p) p.id)
    Dim currentID As Integer = secondArrayValues.Key 'Current ID is the grouping key
    'Retrieve values with same ID in array1
    'Use a hashset to easily compare for equality
    Dim firstArrayValues As New HashSet(of Polygon)(lookupByID(currentID))
    'Check for differences between the two sets of data, for this ID
    If Not firstArrayValues.SetEquals(secondArrayValues) Then
        'Data has changed, do something
        Console.WriteLine("Differences for ID " & currentID)
    End If
Next
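The same group-then-compare idea, sketched in Java rather than VB.NET purely for illustration. It assumes each row is a string triple {id, type, percent}; the class and method names are hypothetical. Unlike a loop over only one array's groups, this version walks the union of IDs, so a parcel that exists in just one of the two arrays is also reported.

```java
import java.util.*;

// Hypothetical sketch: one pass to group each array by ID, then one set comparison per ID.
final class ParcelDiff {
    // Groups rows {id, type, percent} into id -> set of "type:percent" entries.
    static Map<String, Set<String>> groupById(String[][] rows) {
        Map<String, Set<String>> byId = new HashMap<>();
        for (String[] row : rows) {
            byId.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1] + ":" + row[2]);
        }
        return byId;
    }

    // Returns the IDs whose entries differ, including IDs present in only one array.
    static Set<String> changedIds(String[][] rows1, String[][] rows2) {
        Map<String, Set<String>> g1 = groupById(rows1);
        Map<String, Set<String>> g2 = groupById(rows2);
        Set<String> allIds = new TreeSet<>(g1.keySet());
        allIds.addAll(g2.keySet());
        Set<String> changed = new TreeSet<>();
        for (String id : allIds) {
            // A missing group is null, which never equals an existing set.
            if (!Objects.equals(g1.get(id), g2.get(id))) changed.add(id);
        }
        return changed;
    }

    public static void main(String[] args) {
        String[][] array1 = {{"1", "aaa", "100"}, {"2", "aaa", "25"}, {"2", "bbb", "25"}, {"2", "ccc", "50"}};
        String[][] array2 = {{"1", "aaa", "100"}, {"2", "aaa", "25"}, {"2", "bbb", "10"},
                             {"2", "eee", "38"}, {"2", "fff", "27"}};
        System.out.println(changedIds(array1, array2)); // prints [2]
    }
}
```

Each row is touched once while grouping, so the work grows roughly linearly with the number of rows instead of with the product of the two array sizes.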
I am answering this question based on the first part that you wrote (that is, without the EDIT section). A complete answer would describe a good algorithm, but I suggest using a database's capabilities, because databases are optimized for this kind of query.
Put all the records into two DB tables, which takes O(n) time. If the records are static, you don't need to repeat this step every time.
Table 1
id type percent
Table 2
id type percent
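Then use a DB query, something like this:
select t1.id, t1.type, t1.percent from table1 t1 left join table2 t2 on t1.id = t2.id and t1.type = t2.type and t1.percent = t2.percent where t2.id is null
(This lists the rows of table1 that have no identical row in table2; swap the tables to check the other direction. Better queries are possible. What I am trying to say is: hand control to the DB to perform this operation.)
Retrieve the result in your code and perform the necessary operation.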
EDIT
1) You can sort them in O(n log n) time based on ID + type + percent and then perform a binary search.
2) Store the records of the first array in a hash map with an appropriate key (ID only, or ID + type).
This will take O(n) time, and searching, if the key is chosen well, will take constant time.
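A variation on option 1 that needs no binary search at all: once both arrays are sorted on ID + type + percent, they can be walked in lockstep with two indexes, which also matches the "stop searching when you hit a different value" idea from the question. This is only a sketch in Java under the assumption that rows are string triples {id, type, percent}; the names are hypothetical and the percent column is compared as text.

```java
import java.util.*;

// Hypothetical sketch of a sort-then-merge comparison of two row arrays.
final class SortedMergeDiff {
    // Orders rows by id, then type, then percent (all compared as strings).
    static final Comparator<String[]> ROW_ORDER = (a, b) -> {
        for (int i = 0; i < 3; i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    // Returns the IDs that own at least one row without an identical partner in the other array.
    static Set<String> changedIds(String[][] rows1, String[][] rows2) {
        String[][] a = rows1.clone();   // sort copies so the inputs stay untouched
        String[][] b = rows2.clone();
        Arrays.sort(a, ROW_ORDER);
        Arrays.sort(b, ROW_ORDER);
        Set<String> changed = new TreeSet<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            int c = ROW_ORDER.compare(a[i], b[j]);
            if (c == 0) { i++; j++; }                    // identical row on both sides
            else if (c < 0) changed.add(a[i++][0]);      // row only in the first array
            else changed.add(b[j++][0]);                 // row only in the second array
        }
        while (i < a.length) changed.add(a[i++][0]);     // leftovers have no partner
        while (j < b.length) changed.add(b[j++][0]);
        return changed;
    }

    public static void main(String[] args) {
        String[][] array1 = {{"2", "ccc", "50"}, {"1", "aaa", "100"}, {"2", "aaa", "25"}, {"2", "bbb", "25"}};
        String[][] array2 = {{"2", "fff", "27"}, {"2", "bbb", "10"}, {"1", "aaa", "100"},
                             {"2", "aaa", "25"}, {"2", "eee", "38"}};
        System.out.println(changedIds(array1, array2)); // prints [2]
    }
}
```

The two sorts cost O(n log n) and the walk itself is a single linear pass over both arrays.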
You need to define a structure to store this data. We'll store all the data in a LandParcel class, which will have a HashSet<ParcelData>
public class ParcelData
{
    public ParcelType Type { get; set; } // This can be an enum, string, etc.
    public int Percent { get; set; }
    // Redefine Equals and GetHashCode conveniently
}
public class LandParcel
{
    public ID Id { get; set; } // Whatever the type of the ID is...
    public HashSet<ParcelData> Data { get; set; }
}
Now you have to build your data structure, with something like this:
Dictionary<ID, LandParcel> data1 = new Dictionary<ID, LandParcel>();
foreach (var item in array1)
{
    LandParcel p;
    if (!data1.TryGetValue(item.id, out p))
        data1[item.id] = p = new LandParcel { Id = item.id, Data = new HashSet<ParcelData>() };
    // Can this data be repeated?
    p.Data.Add(new ParcelData { Type = item.type, Percent = item.percent });
}
You do the same with a data2 dictionary for the second array. Now you iterate over all items in data2 and compare each one with the item that has the same ID in data1.
foreach (var parcel2 in data2.Values)
{
    LandParcel parcel1;
    // TryGetValue avoids an exception when the ID does not exist in data1
    if (!data1.TryGetValue(parcel2.Id, out parcel1) || !parcel1.Data.SetEquals(parcel2.Data))
    {
        // You have different parcels
    }
}
(Now that I look at it, we are practically doing a small database query here, kind of smelly code ...)
Sorry for the C# code; I don't feel so comfortable with VB, but it should be fairly straightforward to translate.
Hi, I have indexed simple documents with 2 fields:
1. profileId as long
2. profileAttribute as long.
I need to know how many profileIds have a certain set of attributes.
for example i index:
doc1: profileId:1 , profileAttribute = 55
doc2: profileId:1 , profileAttribute = 57
doc3: profileId:2 , profileAttribute = 55
and I want to know how many profiles have both attributes 55 and 57.
In this example the answer is 1, because only profile ID 1 has both attributes.
Thanks in advance for your help.
You can search for profileAttribute:(57 OR 55), then iterate over the results and collect, for each profileId, the attributes that matched. The profiles you want are those for which every requested attribute was found; counting them gives the answer.
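The counting step over the matched documents could look roughly like the following. It is a plain Java sketch with no Lucene calls: it assumes the matched documents have already been read into {profileId, profileAttribute} pairs, and the class and method names are hypothetical.

```java
import java.util.*;

// Hypothetical post-processing of search hits; reading the hits from the index is not shown.
final class ProfileAttributeCount {
    // rows are {profileId, profileAttribute} pairs, one per matched document.
    static int countProfilesWithAll(long[][] rows, Set<Long> required) {
        Map<Long, Set<Long>> seen = new HashMap<>();
        for (long[] row : rows) {
            if (required.contains(row[1])) {
                seen.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
            }
        }
        int count = 0;
        for (Set<Long> attrs : seen.values()) {
            // attrs only ever holds required attributes, so equal size means all were found.
            if (attrs.size() == required.size()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[][] docs = {{1, 55}, {1, 57}, {2, 55}};
        Set<Long> required = new HashSet<>(Arrays.asList(55L, 57L));
        System.out.println(countProfilesWithAll(docs, required)); // prints 1
    }
}
```

Collecting only the profileIds into a single set would count profile 2 as well, since it matches the OR query through attribute 55 alone; tracking the matched attributes per profile avoids that.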
But you need to know that Lucene will perform poorly at this compared to, say, a RDBMS. This is because Lucene is an inverted index, meaning it is very good at retrieving the top documents that match a query. It is however not very good at iterating over the stored fields of a large number of documents.
However, if profileId is single-valued and indexed, you can get its values from Lucene's FieldCache, which avoids costly disk accesses. The drawback is that the FieldCache will use a lot of memory (depending on the size of your index) and take time to load every time you (re-)open your index.
If changing the index format is acceptable, this solution can be improved by making profileIds unique, so that your index has the following format:
doc1: profileId: [1], profileAttribute: [55, 57]
doc2: profileId: [2], profileAttribute: [55]
The difference is that profileIds are unique and profileAttribute is now a multi-valued field. To count the number of profileIds for a given set of profileAttribute values, you now only need to query for the required attributes as mandatory clauses, e.g. profileAttribute:(55 AND 57), and use a TotalHitCountCollector.
My fields in lucene are product_name, type and sub_types.
I am querying on type with abc, which gives me the products whose type is abc.
These abc-type products have sub_types pqr and xyz.
I can get total count of the xyz type using TopScoreDocCollector.getTotalHits().
But I want to get the count for each of the sub_types, i.e. pqr and xyz.
How can I get it?
Any reply would be of great help for me.
Thanks in advance.
One way to do this is to create a filter based on your abc query, and then use that filter to constrain results for the sub-type queries.
IndexSearcher searcher = ...; // searcher to use
int nDocs = 100; // number of docs to retrieve
QueryParser parser = ...; // query parser to use
Query typeQuery = parser.parse("type:abc");
Filter f = new CachingWrapperFilter(new QueryWrapperFilter(typeQuery));
Query subtypeQuery = parser.parse("sub_types:xyz");
TopDocs results = searcher.search(subtypeQuery, f, nDocs);
// results.totalHits is the number of abc documents whose sub-type is xyz
Another thought: if you know up-front which sub-type you're interested in, you can simply add both a type and a sub-type to the query: +type:abc +sub_types:xyz.
Finally, you might consider using Solr to index your data if you have these kinds of queries.
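For reference, the result the question is after is a count per sub-type among the documents of one type. The sketch below computes it over an in-memory list in plain Java, with no Lucene or Solr calls, just to make the expected output concrete. The class name and the sample products are made up, and each document is assumed to be a {product_name, type, sub_types} triple.

```java
import java.util.*;

// Hypothetical illustration of per-sub-type counts; not a Lucene or Solr API.
final class SubTypeCounts {
    // Counts, for documents of the given type, how many fall into each sub-type.
    static Map<String, Integer> countSubTypes(String[][] docs, String type) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] doc : docs) {
            if (doc[1].equals(type)) {
                counts.merge(doc[2], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"p1", "abc", "pqr"},
            {"p2", "abc", "xyz"},
            {"p3", "abc", "xyz"},
            {"p4", "def", "xyz"},
        };
        System.out.println(countSubTypes(docs, "abc")); // prints {pqr=1, xyz=2}
    }
}
```

With the filter approach above, each entry of this map would correspond to one filtered search, one per sub-type value.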