Indexing file paths or URIs in Lucene

Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents if their query terms contain a path or URI segment.
For example, if the path is
C:\home\user\research\whitepapers\analysis\detail.txt
I'd like the user to be able to find it by querying for path:whitepapers.
Likewise, if the URI is
http://www.stackoverflow.com/questions/ask
A query containing uri:questions would retrieve it.
Do I need to use a special analyzer for these fields, or will StandardAnalyzer do the job? Will I need to do any pre-processing of these fields? (To replace the forward slashes or backslashes with spaces, for example?)
Suggestions welcome!

You can use StandardAnalyzer.
I tested this by adding the following function to Lucene's TestStandardAnalyzer.java:
public void testBackslashes() throws Exception {
    // Windows path: split on backslashes, file name kept whole
    assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
        new String[]{"c", "home", "user", "research", "whitepapers", "analysis", "detail.txt"});
    // URI: split on slashes, domain name kept whole
    assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask",
        new String[]{"http", "www.stackoverflow.com", "questions", "ask"});
}
This unit test passed using Lucene 2.9.1. You may want to try it with your specific Lucene distribution. I guess it does what you want, while keeping domain names and file names unbroken. Did I mention that I like unit tests?
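If the analyzer in your Lucene version splits these values differently, the pre-processing mentioned in the question is a possible fallback. The following is only a minimal sketch in plain Java with no Lucene dependency. PathSegments and its methods are hypothetical names, and the choice of '/', '\' and ':' as separators is an assumption made for this example. It produces the same segments as the test above, joined with spaces so that a whitespace-based analyzer would see one token per segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: splits paths and URIs into segments.
final class PathSegments {

    // Splits on runs of '/', '\' and ':' and lowercases each segment.
    static List<String> split(String pathOrUri) {
        List<String> segments = new ArrayList<>();
        for (String part : pathOrUri.split("[/\\\\:]+")) {
            if (!part.isEmpty()) {
                segments.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return segments;
    }

    // Joins the segments with single spaces, ready to be stored in an indexed field.
    static String toIndexableText(String pathOrUri) {
        return String.join(" ", split(pathOrUri));
    }

    public static void main(String[] args) {
        System.out.println(toIndexableText("C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt"));
        System.out.println(toIndexableText("http://www.stackoverflow.com/questions/ask"));
    }
}
```

The trade-off is that dots are left alone, so file names and domain names stay in one piece, as in the test above. Add '.' to the separator set if users should also be able to match on "stackoverflow" or "detail" alone.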

Can I get a list of Wikimedia files filtered by a regex?

I am looking to find all images by Kawahara Keiga from Wikimedia.
The filenames usually contain the strings "RMNH.ART" and "Kawahara Keiga" - see:
https://en.wikipedia.org/wiki/File:Naturalis_Biodiversity_Center_-_RMNH.ART.5_-_Carcinoplax_longimana_(De_Haan,_1833)_-_Kawahara_Keiga.jpg
https://en.wikipedia.org/wiki/File:Naturalis_Biodiversity_Center_-_RMNH.ART.537_-_Halieutaea_stellata_-_Kawahara_Keiga_-_Siebold_Collection.jpg
https://en.wikipedia.org/wiki/File:Naturalis_Biodiversity_Center_-_RMNH.ART.256_-_Hemitrygon_akajei_(M%C3%BCller_%26_Henle,_1841)_-_Kawahara_Keiga_-_Siebold_Collection.jpg
Is it possible to query a Wikimedia API and get a list of files filtered by "contains" or a regex or similar?
Answering your specific question, you can use:
https://commons.wikimedia.org/w/api.php?action=query&list=search&srsearch=RMNH.ART&srnamespace=6&srlimit=500&format=json
Alternatively though, since the images are categorised already, you could use this instead:
https://commons.wikimedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Kawahara_Collection_at_Naturalis_Biodiversity_Center&cmlimit=500&format=json
Both of these return only the first 500 files. To get the rest, you will need to add &sroffset=500 to the search query or &cmcontinue to the category query. Admittedly, I'm not quite sure how the second one works.
The docs for both of these are at https://www.mediawiki.org/wiki/API:Search and https://www.mediawiki.org/wiki/API:Categorymembers
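On continuation: when more results are available, the API response includes a "continue" object, and its key/value pairs (cmcontinue for categorymembers) are sent back as extra parameters on the next request. The sketch below only builds the request URL and does no networking or JSON parsing. CommonsQueryUrl and its method are hypothetical names, and the token in main is a made-up placeholder, not a real continuation value.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that builds categorymembers request URLs for Wikimedia Commons.
final class CommonsQueryUrl {
    static final String ENDPOINT = "https://commons.wikimedia.org/w/api.php";

    // 'continuation' holds the pairs copied from the "continue" object of the
    // previous response; pass an empty map for the first page.
    static String categoryMembers(String categoryTitle, Map<String, String> continuation) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "query");
        params.put("list", "categorymembers");
        params.put("cmtitle", categoryTitle);
        params.put("cmlimit", "500");
        params.put("format", "json");
        params.putAll(continuation);

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // Percent-encode values so characters such as ':' and '|' are safe in the URL.
            query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return ENDPOINT + "?" + query;
    }

    public static void main(String[] args) {
        String category = "Category:Kawahara_Collection_at_Naturalis_Biodiversity_Center";
        // First page: no continuation parameters.
        System.out.println(categoryMembers(category, Map.of()));
        // Next page: placeholder token standing in for the value returned by the API.
        System.out.println(categoryMembers(category, Map.of("cmcontinue", "page|TOKEN|1")));
    }
}
```

A client would repeat the request with the latest "continue" values until a response arrives without a "continue" object.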

Searching Alfresco Share site members with Lucene / FTS

Is it possible to search Alfresco Share site members with lucene or fts-alfresco? For example, I would like to find all the site members with lastname "Smith".
Additionally, is it possible to search users that have certain permissions to a site folder or document?
You cannot search site members directly using Lucene, because the index does not contain any site membership data. What you need to do is use SiteService to get that information. You can use either of these methods. The second one returns a Map, so it may be more relevant.
org.alfresco.service.cmr.site.SiteService
listMembers(String shortName, String nameFilter, String roleFilter, boolean collapseGroups, SiteService.SiteMembersCallback callback)
or
listMembers(String shortName, String nameFilter, String roleFilter, int size)
You first need to fetch all site members using the SiteService API and then iterate over them to pick out the users you need.
I am not sure whether you can do this with Lucene, but if you want to find users, the webscript below is useful.
In the URL below, the nf parameter (nf=NameOfUser) filters by user name. If you do not specify nf, all users are returned.
http://localhost:8080/share/proxy/alfresco/api/sites/demo/memberships?size=250&nf=te&authorityType=USER
For more details on the webscript above, see the following URL.
http://localhost:8080/alfresco/service/script/org/alfresco/repository/site/membership/memberships.get
Yes, it is possible to search Alfresco Share site members with fts-alfresco, because site members belong to an Alfresco group.
For example, the following query returns members of the SWSDP site:
PATH:"/sys:system/sys:authorities/cm:GROUP_site_swsdp//*" AND TYPE:"cm:person"

Completely custom path with YII?

I have various products with their own set paths. Eg:
electronics/mp3-players/sony-hg122
fitness/devices/gymboss
I want to be able to access URLs in this format. For example:
http://www.mysite.com/fitness/devices/gymboss
http://www.mysite.com/electronics/mp3-players/sony-hg122
My strategy was to override the "init" function of the SiteController in order to catch the paths and then direct it to my own implementation of a render function. However, this doesn't allow me to catch the path.
Am I going about it the wrong way? What would be the correct strategy to do this?
** EDIT **
I figure I have to make use of the URL manager. But how do I dynamically add path formats if they are all custom in a database?
Eskimo's setup is a good solid approach for most Yii systems. However, for yours, I would suggest creating a custom UrlRule to query your database:
http://www.yiiframework.com/doc/guide/1.1/en/topics.url#using-custom-url-rule-classes
Note: the URL rules are parsed on every single Yii request, so be careful in there. If you aren't efficient, you can rapidly slow down your site. By default rules are cached (if you have a cache setup), but I don't know if that applies to dynamic DB rules (I would think not).
In your URL manager configuration (protected/config/main.php), set urlFormat to path, and optionally set showScriptName to false (this hides the index.php part of the URL):
'urlManager' => array(
    'urlFormat' => 'path',
    'showScriptName' => false,
    // 'rules' => array(...) goes here
),
Next, in your rules, you could set up something like:
'catalogue/<category_url:.+>/<product_url:.+>' => 'product/view',
This routes any request with a structure like catalogue/electronics/ipods to actionView in ProductController. You can then access the category_url and product_url portions of the URL like so:
$_GET['category_url'];
$_GET['product_url'];
The rule works like this: any URL that starts with the word catalogue (directly after your domain name), followed by a category segment (category_url) and a product segment (product_url), is directed to that controller/action.
You will notice that in my example I am preceding the category and product with the word catalogue. Obviously you could replace this with whatever you prefer or leave it out altogether. The reason I have put it in becomes clear if you consider the following URL:
http://mywebsite.com/site/about
If you left out the 'catalogue' portion of the URL and defined your rule only as:
'<category_url:.+>/<product_url:.+>' => 'product/view',
the URL manager would see the site portion of the URL as the category_url value, and the about portion as the product_url. To prevent this, you can either keep the catalogue portion of the URL or define rules for the non-catalogue pages (i.e. define a rule for site/about).
Rules are evaluated top to bottom, and only the first matching rule is applied. You can add as many rules as you need for as many different URL structures as you need.
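The effect of rule order can be illustrated outside Yii. The following is a minimal Java sketch of first-match routing, not Yii code: FirstMatchRouter is a hypothetical class, and the plain regular expressions stand in for Yii's rule patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical router: tries rules in the order they were added and stops at the first match.
final class FirstMatchRouter {
    private final List<Map.Entry<Pattern, String>> rules = new ArrayList<>();

    void add(String regex, String route) {
        rules.add(Map.entry(Pattern.compile(regex), route));
    }

    // Returns the route of the first rule that matches the whole path, or null if none does.
    String resolve(String path) {
        for (Map.Entry<Pattern, String> rule : rules) {
            if (rule.getKey().matcher(path).matches()) {
                return rule.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FirstMatchRouter router = new FirstMatchRouter();
        router.add("site/about", "site/about");   // specific rule first
        router.add(".+/.+", "product/view");      // catch-all rule last
        System.out.println(router.resolve("site/about"));
        System.out.println(router.resolve("fitness/devices/gymboss"));
    }
}
```

With the specific rule first, site/about resolves to itself and the product path falls through to product/view. If the two add calls are swapped, the catch-all rule also claims site/about, which is the problem described above.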
I hope this gets you on the right path. Feel free to comment with any questions or clarifications you need.

match any part of a URL in lucene

At present I am using PrefixQuery. It works, but only for prefix matches. For example, if my URL is
http://xyz.com then it finds http://xyz.com and http://xyz.com/service/...
but it cannot find http://www.xyz.com or http://xyz.co.in. I want to search on any part of the URL. My code is:
Term term = new Term("URL", siteUrl.toLowerCase());
Query query1 = new PrefixQuery(term);
booleanQuery.add(query1,BooleanClause.Occur.MUST);
You can use a WildcardQuery. But you need to know that it has bad performance, especially with queries with a leading wildcard (not because it has been poorly implemented but because of how Lucene internally stores its term dictionary).
Can't your use-case be solved by using a custom analyzer?

Lucene query that eliminates xml tags in full text search

In Alfresco I need to write a Lucene query that excludes the XML tags from the content while searching.
For example, if the file try.xml is searched by content, the search should not match its XML tags.
try.xml
<sample>This is an example</sample>
If I give the search text as "sample" it should not return the file name "try.xml".
So how could I achieve this?
Edit
I have tried the query below, with no change.
#cm\:name:"try*" -TEXT:"<*>" +TEXT:"sample"
What's wrong with the above query? I tried to match file names that start with "try", exclude the text inside tags, and search for the text "sample".
By default Alfresco treats XML files as plain text and indexes the xml tags as words, that's why they can be found via full text search. XML content is handled by the StringExtractingContentTransformer in Alfresco which converts text/xml to text/plain before indexing it.
To check which transformers are registered in your Alfresco installation you can check
http://localhost:8080/alfresco/service/mimetypes?mimetype=text/xml#text/xml
To prevent the XML markup from being indexed, you have to write a special transformer that strips out the XML tags. See http://wiki.alfresco.com/wiki/Content_Transformations for an introduction to content transformation in Alfresco. The easiest way would be to integrate a command-line utility that converts the XML file into text, or you could implement a Java class that does the transformation.
There's no standard way to do what you need. Here's an excerpt from the official documentation:
Wild card queries
Wildcard queries using * and ? are support as terms and phrases. For tokenized fields the pattern match can not be exact as all the non token characters (whitespace, punctuation, etc) will have been lost and treated as equal.
Basically, angle brackets are stripped out by default. You would need to hack the indexing and query parsing processes to get the behavior you want.
Could you not just exclude the xml mimetype? (See http://wiki.alfresco.com/wiki/Search#Finding_nodes_by_content_mimetype for the syntax)
I guess you might want to exclude html too (so you'd exclude text/html and text/xml), that'd prevent you getting any nodes in your results that contain xml tags.