read all document by using particular category name using alfresco search.luceneSearch or search.lib.js - lucene

Category Name
|
Geograpy (8)
Study Db (18)
i am implement my own advance search in alfresco. i need to read all files which related with particular category.
example:
if there is 20 file under geograpy, lucene query should read particular document under search key word "banana".
Further explanation -
I am using search.lib.js to search. I would like to analyze the result to find out to which category the documents belong to. For example I would like to know how many documents belong to the category under Languages and the subcategories. I experimented with the Classification API but I don't get the result I want. Any Idea how to go through the result to get the category name of each document?
is there any simple method like node.properties["cm:creator"]?
thanks
janaka

I think you should specify more your question:
Are you using cm:content or a customized content?
Are you going to search the keyword inside the content of the file? or are you going to search the keyword in a specific metadata(s)?
Do you want to create a webscript (java or javascript)?
One thing to take in consideration:
if you use +PATH:"cm:generalclassifiable/...." for the categorization in your lucene queries, the performance will be slow (following my experince)

You can use for example the next query to find all nodes at any depth below /cm:Languages:
var results = search.luceneSearch("+PATH:\"cm:generalclassifiable/cm:Languages//*\");
Take a look to this url: https://wiki.alfresco.com/wiki/Search#Path_Queries
Once you have all the elements, you can loop all, and get to which category below. Of course you need to create some counter per each category/subcategory:
for(i = 0; i < results.length; i++){
var node = results[i];
var categoryNodeRef = node.properties["cm:categories"];
var categoryDesc = categoryNodeRef.properties["cm:description"];
var categoryName = categoryNodeRef.properties["cm:name"];
}
This is not exactly the solution, but can be a useful idea to start.
Sorry if it's not what you're asking for, I have just arrived from my holidays.

Related

Retrieve a document class-description symbolicName without fetching

I'm triying to retrieve a ClassDescription symbolicName of an IDocument object. It seems that i have to fetch its ClassDescription even if I just want the symbolicName.
Is there a way to do it ? I just want to avoid doing a fetch for every browsed document...
(Also IDocument.GetClassName doesn't help, it returns "Document")
I finally found a way, by making an SQL SELECT request retrieving the classDescription ID (which is not the symbolicName ID, but rather an "internal" one) :
Select This, d.Id, d.ClassDescription
From Document d
where d.Id = ID
It seems to be lighter than a line like document.fetch(classDescription) (pseudo call) cause it should just retrieves the ID.
I thought it worth mentioning a problem regarding the accepted answer.
There are times that doing a query would be "lighter" however I believe you are missing something involving fetching a document.
FileNet's fetchInstance command can take in a PropertyFilter.
In your case you could do something along the lines of:
PropertyFilter pf = new PropertyFilter();
pf.AddIncludeProperty(new FilterElement(null, null, null, "ClassDescription", null));
doc = Factory.Document.FetchInstance(os, new Id("doc.ID()"), pf);
You would probably want to look at your original fetch of this document and make sure to specify the full list of property filters at that point.
See Working With Documents

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the most simple way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the most simple solution.
Lucene's fuzzy queries and based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It is a out of box, and very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default, it is using the LevensteinDistance, but you could provide your own customized Edit Distance.

Apache solr - more like this score

I have a small index with ~1000 documents with only two fields:
- id (string)
- content (text_general)
I noticed that when I do MLT search by id for similar content, the original document(which id is the searched id) have a score 5.241327.
There is 1:1 duplicated document and for the duplicated content it is returning score = 1.5258181. Why? Why it is not 5.241327 when it is 100% duplicate.
Another question is can I in any way to get similarity documents by content by passing some text in the query.
Example:
/mlt/?q=content:Some encoded long text&mlt.fl=content
I am trying to check if there is similar content uploaded and the check must be performed at new content upload time.
It might be worth to try some different parameters. I also use MLT on only one field, I use the following parameters:
'mlt.boost': 'true',
'mlt.fl': 'my_field_name',
'mlt.maxqt': 1000,
'mlt.mindf': '0',
'mlt.mintf': '0',
'qt': 'mlt',
'rows': '10'
See http://wiki.apache.org/solr/MoreLikeThis for an explanation of the parameters. I think with a small index mindf might be important and I see the default mintf (term frequency) is 2, so I assume an ID is only one term, so this is probably ignored!
First, how does Solr More-Like-This works?
A regular Solr query is conducted (e.g. "?q=content:Some encoded long text&.....".
For each document returned by the above query, More-Like-This conduct More like this query...
So, the first result set "response", is just like any Solr query results set.
The More-Like-This appears below and start with something like that (Json format):
"moreLikeThis":{
"57375":{"numFound":18155,"start":0,"docs":["
For an explanation about More Like This algorithm, please read that:
http://blog.brattland.no/node/18
and: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
If you didn't solved the problem yet, please let me know and I will guide you through.

MOSS 2007: What is the source of "Directories"?

I'm trying to generate a new SharePoint list item directly using SQL server. What's stopping me is damn tp_DirName column. I have no ideas how to create this value.
Just for instance, I have selected all tasks from AllUserData, and there are possible values for the column: 'MySite/Lists/Task', 'Lists/Task' and even 'MySite/Lists/List2'.
MySite is the FullUrl value from Webs table. I can obtain it. But what about 'Lists/Task' and '/Lists/List2'? Where they are stored?
If try to avoid SQL context, I can formulate it the following way: what is the object, that has such attribute as '/Lists/List2'? Where can I set it up in GUI?
Just a FYI. It is VERY not supported to try and write directly to SharePoint's SQL Tables. You should really try and write something that utilizes the SharePoint Object Model. Writing to the SharePoint database directly mean Microsoft will not support the environment.
I've discovered, that [AllDocs] table, in contrast to its title, contains information about "directories", that can be used to generate tp_DirName. At least, I've found "List2" and "Task" entries in [AllDocs].[tp_Leaf] column.
So the solution looks like this -- concatenate the following 2 components to get tp_DirName:
[Webs].[FullUrl] for the web, containing list, containing item.
[AllDocs].[tp_Leaf] for the list, containing item.
Concatenate the following 2 components to get tp_Leaf for an item:
(Item count in the list) + 1
'_.000'
Regards,
Well, my previous answer was not very useful, though it had a key to the magic. Now I have a really useful one.
Whatever they said, M$ is very liberal to the MOSS DB hackers. At least they provide the following documents:
http://msdn.microsoft.com/en-us/library/dd304112(PROT.13).aspx
http://msdn.microsoft.com/en-us/library/dd358577(v=PROT.13).aspx
Read? Then, you know that all folders are listed in the [AllDocs] table with '1' in the 'Type' column.
Now, let's look at 'tp_RootFolder' column in AllLists. It looks like a folder id, doesn't it? So, just SELECT the single row from the [AllDocs], where Id = tp_RootFolder and Type = 1. Then, concatenate DirName + LeafName, and you will know, what the 'tp_DirName' value for a newly generated item in the list should be. That looks like a solid rock solution.
Now about tp_LeafName for the new items. Before, I wrote that the answer is (Item count in the list) + 1 + '_.000', that corresponds to the following query:
DECLARE #itemscount int;
SELECT #itemscount = COUNT(*) FROM [dbo].[AllUserData] WHERE [tp_ListId] = '...my list id...';
INSERT INTO [AllUserData] (tp_LeafName, ...) VALUES(CAST(#itemscount + 1 AS NVARCHAR(255)) + '_.000', ...)
Thus, I have to say I'm not sure that it works always. For items - yes, but for docs... I'll inquire into the question. Leave a comment if you want to read a report.
Hehe, there is a stored procedure named proc_AddListItem. I was almost right. MS people do the same, but instead of (count + 1) they use just... tp_ID :)
Anyway, now I know THE SINGLE RIGHT answer: I have to call proc_AddListItem.
UPDATE: Don't forget to present the data from the [AllUserData] table as a new item in [AllDocs] (just insert id and leafname, see how SP does it itself).

Detailed information in Lucene/Solr results

After having performed a search in Lucene/Solr without having specified a field, how can I know in which fields of a result document the search string was found (and how often)?
You could use Query Highlighting.
Try setting debugQuery=on. See this example.
As mentioned, use debugQuery=true. The response will then include an "explain" section. By default, this will give you some awful formatted text that looks like this:
0.69102794 = (MATCH) weight(body:arrai^1.5 in 6357), product of:
0.46610788 = queryWeight(body:arrai^1.5), product of:
1.5 = boost
5.591044 = idf(docFreq=55709, maxDocs=5492855)
0.055577915 = queryNorm
1.4825494 = (MATCH) fieldWeight(body:arrai in 6357), product of:
2.828427 = tf(termFreq(body:arrai)=8)
5.591044 = idf(docFreq=55709, maxDocs=5492855)
0.09375 = fieldNorm(field=body, doc=6357)
For each match in each field, you will get a block like this that explains how SOLR computed the relevancy of this document to your query. What you're asking about (how many matches in this document's field) SOLR calls term frequency "tf". You can see this on the 7th line of the output i pasted above. In this line, SOLR is telling you that it found 8 matches for arrai in the field called "body".
The other lines stand for things like inverse document frequency-"idf" (how rare the matched term is) and fieldNorm, which relates to how short the document's field is relative to the match. You can learn about these here: http://wiki.apache.org/solr/SolrRelevancyFAQ
FYI if you need this "explain" information in a structured format instead of clumsy text you can pass this parameter with your query: debug.explain.structured=true However, its still pretty hard to use = )