MarkLogic generic language support - indexing

As per the documentation:
Because the generic language support only stems words to themselves, queries in these languages will not include variations of words based on their meanings in the results.
xdmp:document-insert("/lang/test.xml", <el xml:lang="fr">chats</el>)
Given that I do not have a license for the French language, will the word
"chats" be stemmed to "chat" in the term list, or will it be indexed as "chats", i.e. without any variations?

You can easily check this yourself using cts:stem:
cts:stem("chats", "fr")
Without the necessary license it simply returns "chats".
HTH!

No license for French means no stemming for French.
I'm actually kind of confused by the question. If it did work without the license, then what would the license get you?

Related

Lucene query-time boosting culture code

I'm using the Lucene.Net implementation packaged with the Kentico CMS. The site that we're indexing has articles in various languages. If a user is viewing the Japanese version of the site (for example) and runs a search for 'VPN', we'd like them to see Japanese articles about VPN first, but also see other language articles in the results.
I'm trying to achieve this with query-time boosting of the _culture field. Since we're using the standard analyzer (we really don't want to change that), and the standard analyzer treats hyphens as whitespace, I thought I'd try appending '(_culture:jp)^4' to the user's query. Judging from the Luke tool's Explain output, that isn't doing anything to boost the documents with 'jp' in the field. What gives?
I've also tried:
_culture:"en-jp"
_culture:en AND _culture:jp
_culture:"en jp"
Update: It's something with the field. There's another field in the index named 'documentculture' that contains the same data (don't know why). But when I try '(documentculture:jp)^4', it works as I expect. That solves my problem, but I still have an academic question of how the fields are different.
Even though the standard analyzer ignores hyphens, I don't believe it will treat the two parts of your culture code as separate terms. Therefore, under normal circumstances, a wildcard would help you here. For example, the query vpn (_culture:en*)^4 would boost all documents with a culture starting with en.
However, in your case you want to match the end of the term, and Lucene query syntax doesn't support wildcards at the start of terms by default (according to this reference), since leading wildcards are expensive to evaluate. Therefore I think you're going to have to consider changing the analyzer you're using. I generally find the Whitespace analyzer fits my needs best. I've just tried your scenario using the Whitespace analyzer and found that vpn (_culture:en-jp)^4 will give you what you need.
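For reference, a minimal sketch of that setup, assuming Lucene 3.6 and a hypothetical default field named "content" (adapt the field names to whatever Kentico actually uses):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class CultureBoost {
    public static Query buildQuery(String userQuery) throws ParseException {
        // WhitespaceAnalyzer keeps "en-jp" as a single term, so the
        // boost clause can match the _culture field exactly.
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        QueryParser parser = new QueryParser(Version.LUCENE_36, "content", analyzer);
        return parser.parse(userQuery + " (_culture:en-jp)^4");
    }
}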
I understand if you don't accept this answer though since you stated you didn't want to change the analyzer!

List of reserved words in Rails 3

Is there a good, authoritative list of reserved words for Rails 3?
Candidates:
http://oldwiki.rubyonrails.org/rails/pages/ReservedWords, but it seems a bit out of date and specific to Rails 2.
http://cheat.errtheblog.com/s/rails_reserved_words (but there seems no authority to this - it could just grow and grow...)
http://latheesh.com/2010/02/02/rails-reserved-words/
Background: I'm maintaining a long-serving Rails app, and it has plenty of usages of reserved words (judging by http://oldwiki.rubyonrails.org/rails/pages/ReservedWords, which seems to apply to Rails 2). However, none of these are actually interfering with current activity (the app works, and the specs & features I'm slowly adding all pass). As time passes I'd like to remove those usages of reserved words, but I don't want to bother if some reserved words are no longer really reserved. So while a longer list might be good for NEW Rails apps, I need stronger justification for budget spend than "it was listed on a webpage at some point".
Maybe the nature of Rails is that you can't find an authoritative list, only "things that didn't work for me at some point"...
This seems like a pretty comprehensive list. You are correct, though: the very nature of programming means that this list will be ever-changing. I will do some research... maybe there is a site that lists reserved words based on the version of Rails you are using. If not, maybe someone should get working on it :D
Here's a good collaborative list of reserved words in Rails: https://reservedwords.herokuapp.com/

Lucene to bring cheeseburger when searching for burger

I would like a Lucene document that contains the word cheeseburger to come up when a user searches for burger. I see that I will probably need a custom analyzer to break this compound word into cheese and burger. However, breaking words may also bring irrelevant results.
Ex: if, when indexing production, we index product and ion as well, then when a user searches for ion, documents containing production will come up, which is not relevant.
So a simple word breaker won't cut it. I need a way of knowing that cheeseburger is associated with burger and cheese, but that production is not associated with ion.
Is there a more intelligent process to achieve this?
Does this have a name, just as stemming is the name for reducing words to their root form?
Depending on how accurate you want your synonymy to be, you might need to look into approaches such as Latent Semantic Analysis (LSA) and its variants, such as LDA. A simpler approach would be to use an ontology such as WordNet to augment your searches. A WordNet Lucene index is available. However, if your scenario includes domain-specific vocabulary, then you might need to generate a "mapping" ontology.
You should look at DictionaryCompoundWordTokenFilter, which uses a brute-force algorithm to split compound nouns based on a dictionary.
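A minimal sketch of an analyzer built around that filter, assuming Lucene 3.6's contrib analyzers (the two-word dictionary here is made up; in practice you would load a real word list):

import java.io.Reader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.util.Version;

public class CompoundAnalyzer extends Analyzer {
    // Only words in this dictionary are split out of compounds, so
    // "cheeseburger" yields cheese and burger, while "production" is
    // left alone unless you add "product" or "ion" to the list.
    private static final Set<String> DICTIONARY =
        new HashSet<String>(Arrays.asList("cheese", "burger"));

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        // Emits the original token plus any dictionary words found inside it.
        return new DictionaryCompoundWordTokenFilter(Version.LUCENE_36, stream, DICTIONARY);
    }
}

Because the splitting is dictionary-driven, you control exactly which subwords get indexed, which addresses the production/ion concern in the question.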
In most cases you can simply use wildcard queries with a leading wildcard: *burger. You only have to enable support for leading wildcards on your query parser:
QueryParser parser = new QueryParser(Version.LUCENE_36, defaultField, analyzer);
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*burger");
Take care:
Leading wildcards might slow your search down.
If you need a more specific solution, I would suggest going with stemming. It's really a matter of finding the right analyzer.
There are stemming implementations for several languages e.g. the SnowballAnalyzer (http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/snowball/SnowballAnalyzer.html).
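For example, a fragment assuming Lucene 3.6 (note that stemming normalizes inflected forms but will not split compounds):

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

// Stems "burgers" -> "burger" at index and query time, but leaves
// "cheeseburger" as a single term, so it won't match "burger" on its own.
SnowballAnalyzer analyzer = new SnowballAnalyzer(Version.LUCENE_36, "English");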
Best regards,
Chris
Getting associations by looking at the word is not going to scale to other words. For example, you cannot know "whopper" is associated with burger and "big-mac" is associated with cheese just by looking at the words. To make the search aware of the associations, you probably need a database of associations like "A is a B" or "A contains B". (As Mikos has mentioned, I think WordNet provides such a database.) Then, when you see B in a query, you translate the query so that it also searches for A.
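A rough sketch of that query-translation step against Lucene 3.x, with a tiny hand-built table standing in for a real association database like WordNet (the field name "body" and the table contents are invented):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class AssociationExpander {
    // Hypothetical "A contains B" table: query term -> terms that contain it.
    private static final Map<String, List<String>> ASSOCIATIONS =
        new HashMap<String, List<String>>();
    static {
        ASSOCIATIONS.put("burger", Arrays.asList("cheeseburger", "hamburger", "whopper"));
    }

    public static BooleanQuery expand(String field, String term) {
        // Always search for the literal term...
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
        // ...and OR in every term known to contain it.
        List<String> related = ASSOCIATIONS.get(term);
        if (related != null) {
            for (String r : related) {
                query.add(new TermQuery(new Term(field, r)), BooleanClause.Occur.SHOULD);
            }
        }
        return query;
    }
}

Usage would be something like expand("body", "burger"), which matches documents containing burger, cheeseburger, hamburger, or whopper.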
I think the underlying question is -- how big is the collection you are indexing? If you are indexing some collection where all of the synonyms and related words are already known, then the index can just include the synonyms and related words directly, like 'cheeseburger' including the related words 'cheese' and 'burger'. (An approach successfully used in the LOINC standard medical terms Lucene index.)
If you are trying to solve the general problem for a whole human language (English, Chinese, etc.) then you have to move to some kind of semantic analysis as mentioned above.
It might be useful to talk with the subject matter experts of the area you are indexing to see how they search for terms -- what synonyms/related words do they use, do they have defined lists of synonyms/related words, do they need/use stemming, etc. This should give you some idea as to which approach (direct synonym/related-word inclusion or semantic analysis) you need to pursue.

L10N: Trusted test data for Locale-Specific Sorting

I'm working on an internationalized database application that supports multiple locales in a single instance. When international users sort data in the applications built on top of the database, the database theoretically sorts the data using a collation appropriate to the locale associated with the data the user is viewing.
I'm trying to find sorted lists of words that meet two criteria:
the sorted order follows the collation rules for the locale
the words listed will allow me to exercise most / all of the specific collation rules for the locale
I'm having trouble finding such trusted test data. Are such sort-testing datasets currently available, and if so, what / where are they?
"words.en.txt" is an example text file containing American English text:
Andrew
Brian
Chris
Zachary
I am planning on loading the list of words into my database in randomized order, and checking to see if sorting the list conforms to the original input.
Because I am not fluent in any language other than English, I do not know how to create sample datasets like the following one in French (call it "words.fr.txt"):
cote
côte
coté
côté
The French prefer diacritical marks to be ordered right to left. If you sorted that list using code-point order, it would likely come out like this (which is an incorrect collation):
cote
coté
côte
côté
Thank you for the help,
Chris
Here's what I found.
The Unicode Common Locale Data Repository (CLDR) is pretty much the authority on collations for international text. I was able to find several lists of words conforming to the CLDR rules in the ICU Project's ICU Demonstration - Locale Explorer tool. It turns out that ICU (International Components for Unicode) uses CLDR rules to help solve common internationalization problems. It's a great library; check it out.
In some cases, it was useful to construct some nonsense terms by reverse-engineering the CLDR rules directly. Search engines available in the United States were not suited for finding foreign terms with the case/diacritic/other nuances I was interested in for this testing (in retrospect, I wonder if international search engines would have been better-suited for this task).
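If it helps anyone doing the same thing: the expected orderings can also be generated programmatically with ICU4J, which ships the CLDR collation data. A small sketch (the class name is made up; note that current CLDR keeps the right-to-left accent rule for Canadian French, while plain "fr" may sort accents left to right depending on the CLDR version):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

public class CollationCheck {
    public static void main(String[] args) {
        // fr-CA retains the traditional backwards (right-to-left) accent ordering.
        Collator collator = Collator.getInstance(ULocale.CANADA_FRENCH);
        List<String> words = Arrays.asList("côté", "cote", "coté", "côte");
        Collections.sort(words, collator); // ICU Collator implements Comparator
        System.out.println(words); // expected: [cote, côte, coté, côté]
    }
}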

Where can I find/learn industry-standard SQL conventions?

I work at a company that has some very non-standardized SQL conventions (they were written by Delphi developers years ago). Where is the best place to find the industry-standard SQL convention definitions that are most commonly used?
In his book "SQL Programming Style," Joe Celko suggests a number of conventions, for example that a collection (e.g. a table) should be named in the plural, while a scalar data element (e.g. a column) should be named in the singular.
He cites ISO/IEC 11179-4 as a standard for metadata naming, which supports this guideline.
there aren't any
if there were, they'd be obsolete
if they're not obsolete, you won't like them
if you like them, they're insufficient
if they're sufficient, no one else will like them
seriously, strive for readability, i.e. use meaningful field and table names; nothing else is really necessary
(OK, some common prefixes like usp, udf, and udt may be useful, but not required)
This is the best one I've ever seen... Naming Conventions
However, standards should really be all about clarity, simplicity, and ease of adoption among your team.
There shouldn't be a bunch of incredibly strict naming guidelines; the standard should focus on style.
The point is not to torment developers, it is to create a congruent style throughout the system so that it is easy to move from one section to another.
There are no exact industry-wide SQL standards. The best option is to google for SQL standards, because several knowledgeable people have posted some rather good, extensive, and complete documents on the subject. Read through them and absorb the items that apply to your environment.
Here you can find a lot of rules for better SQL / SQL Server.
Yes, it's the company that I am working for, but they are good!
And these rules come from experience with client projects.
Have a look and take what you like!