Regarding those per-field formats in Lucene

Everyone:
I have just started studying Lucene's source code.
I'm confused about the per-field formats, e.g. in the package "org.apache.lucene.codecs.perfield".
There are many formats in Lucene's codecs.
But there are only 3 per-field formats: "DocValues", "KnnVectors" and "Postings".
So, what is the purpose of those three per-field formats?
Why are "DocValues", "KnnVectors" and "Postings" so special that they need per-field formats?
For example, I've studied the "KnnVectors" format a little.
The "PerFieldKnnVectorsFormat.FieldsWriter" actually uses the "Lucene94HnswVectorsFormat".
But why do we have this kind of structure?
Thanks & Regards

Related

Multiple headers, footers and details with FileHelpers

I would like to write / read a text file using the FileHelpers library.
However, I am unsure how to proceed when the file has several headers, footers and details.
The structure of my file is as follows:
FileHeader
AHeader
ADetail
ADetail
ADetail
AFooter
BHeader
BDetail
BDetail
BFooter
CHeader
CDetail
CDetail
CDetail
CDetail
CFooter
FileFooter
Can anyone suggest a possible way to solve this?
You can use the MultiRecordEngine to read or write a file with many different layouts.
http://www.filehelpers.net/example/Advanced/MultiRecordEngine/
Out of the box, using FileHelpers for a format that complex would be difficult.
FileHelpers provides two methods of handling multiple record types: the master/detail engine and the multi-record engine.
Unfortunately, it is likely that you need both for your format, and it would be hard to combine them without some further coding.
To be clear:
the MasterDetailEngine caters for the header/footer situation, but it currently supports only one detail type and only one level of nesting;
the MultiRecordEngine allows multiple record types, but it treats each row as an unrelated record, so the hierarchy (that is, which detail record belongs to which master record) would be hard to determine. A sketch of the multi-record approach follows.
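For example, reading a file with mixed record types via the MultiRecordEngine looks roughly like this minimal sketch (the record layouts and the "AH" line prefix are hypothetical and would have to mirror the actual file spec):

using System;
using FileHelpers;

// Hypothetical layouts -- the real fields must match the file spec.
[DelimitedRecord(";")]
public class AHeader { public string Tag; public string Name; }

[DelimitedRecord(";")]
public class ADetail { public string Tag; public string Value; }

public class Program
{
    // Called once per raw line to decide which record type to materialize.
    private static Type CustomSelector(MultiRecordEngine engine, string recordLine)
    {
        return recordLine.StartsWith("AH") ? typeof(AHeader) : typeof(ADetail);
    }

    public static void Main()
    {
        var engine = new MultiRecordEngine(typeof(AHeader), typeof(ADetail))
        {
            RecordSelector = new RecordTypeSelector(CustomSelector)
        };
        // The engine returns a flat object[]; re-attaching each detail to the
        // header it follows is the "further coding" mentioned above.
        object[] records = engine.ReadFile("input.txt");
        Console.WriteLine(records.Length + " records read");
    }
}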

Apache Solr: Conditional block

I am reading columns from HBase and indexing them in Solr using a Morphlines file. Some field values will be in either English or German. Is there a way to specify the type of the field as "text_english_german", and inside the definition of "text_english_german" do a conditional check to see whether the value is English or German, then use the language-specific stemmer filter factory for indexing and querying the data?
Thanks,
Kishore
With a slightly different approach, you could define two fields:
text_en
text_de
Each of them would have language-specific text analysis configured. Then, you can use the language auto-detection UpdateRequestProcessor [1]. There are a lot of parameters with which you can tune the behaviour of this component.
[1] https://wiki.apache.org/solr/LanguageDetection
[2] https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing
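As a sketch, the detection-plus-mapping setup in solrconfig.xml could look like the following (the input field name "text" and the chain name are assumptions; with langid.map enabled, the processor routes the input field to text_en or text_de, which then get the language-specific analysis):

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- "text" is an assumed input field name -->
    <str name="langid.fl">text</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.whitelist">en,de</str>
    <!-- map the input field to text_en / text_de based on the detected language -->
    <bool name="langid.map">true</bool>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>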

Fulltext Solr statistical search

Suppose I have a number of documents indexed with Solr 4.0. Each has two fields: a unique ID and a text DATA field. The DATA field contains a few paragraphs of text. Could anyone advise what kind of analyzers/parsers I should use, and how to build a statistical query to get a sorted list of the most frequently used words across the DATA fields of all documents?
For the most frequent terms, look into the TermsComponent and the StatsComponent.
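For instance, assuming the /terms request handler is registered in solrconfig.xml (the example configs ship with one) and a core named collection1, a TermsComponent request sorted by frequency could look like this, with DATA being the field from the question:

http://localhost:8983/solr/collection1/terms?terms.fl=DATA&terms.sort=count&terms.limit=20

This returns the top 20 terms of the DATA field, ordered by document frequency.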
Besides the answers mentioned here, you can use the "HighFreqTerms" class: it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command-line application which lets you see the top terms for any field, either by document frequency or by total term frequency (the -t option).
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which has been committed and is in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
For the ID field, use an analyzer based on the keyword tokenizer; it will treat the entire content of the field as a single token.
For the DATA field, use a language-specific analyzer. Notice that there's a possibility to auto-detect the language of the text (patch).
I'm not sure whether it's possible to find the most frequent words with Solr alone, but if you can use Lucene itself, pay attention to this question. My own suggestion is to use the HighFreqTerms class from the Luke project.
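If calling it programmatically is an option, here is a minimal sketch, assuming a recent Lucene where the misc module exposes getHighFreqTerms with a comparator (the index path and the field name "DATA" are placeholders):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;
import org.apache.lucene.store.FSDirectory;

public class TopTerms {
    public static void main(String[] args) throws Exception {
        // Open the index and ask the misc module for the top 20 terms of one field.
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {
            TermStats[] stats = HighFreqTerms.getHighFreqTerms(
                    reader, 20, "DATA", new HighFreqTerms.DocFreqComparator());
            for (TermStats t : stats) {
                System.out.println(t.termtext.utf8ToString()
                        + " docFreq=" + t.docFreq
                        + " totalTermFreq=" + t.totalTermFreq);
            }
        }
    }
}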

Definitive country list for e-commerce applications

I am looking for a list of countries for use in the development of an e-commerce app including:
Country Name,
Country Code,
Language,
etc.
While only the country name (and probably the country code) are really necessary, some of the other info may be nice (as long as there isn't too much!). I used to have a good list but I can't find it anymore.
Thanks.
ISO (the International Organization for Standardization) maintains a list of countries here.
Check ISO 3166-1 for country codes.
There are 3 sets of country codes:
ISO 3166-1 Alpha-2
ISO 3166-1 Alpha-3
ISO 3166-1 Numeric
For the list of all 3 sets, see: http://en.wikipedia.org/wiki/ISO_3166-1
Wikipedia, as usual, has a pretty definitive list. It's not in a readily computer-readable form, but it shouldn't take long to massage it, if you can't find anything else:
ISO_3166-1 (country codes)
List of sovereign states (countries and their languages)
There are several options for "country code", but ISO is probably best.
Language will be difficult because there are often many per country. Take Switzerland, where I am: do you support French and German only? Do you include Italian (many sites don't)? And what about the handful of Romansh speakers?
http://cldr.unicode.org/ - the Unicode Common Locale Data Repository, a standard multi-language database that includes a country list and other localizable data.
Here is the curated and regularly updated country list data available in CSV and JSON formats (ISO 3166-1):
https://datahub.io/core/country-list
If you need more comprehensive country codes, e.g., ISO3166-1-Alpha-2, ISO3166-1-Alpha-3, ITU dialing codes, ISO 4217 currency codes, and many others:
https://datahub.io/core/country-codes
There is a data set for language codes as well, covering ISO language codes (639-1 and 639-2) and IETF language tags:
https://datahub.io/core/language-codes
For more reference data you can take a look at:
https://datahub.io/awesome/reference-data
Good luck!

XML Schema testing a value and applying extra restrictions

I have the following case:
All boats have a boat type, like shark, yacht and so on. I need to register the boat type and also how many feet long the boat is, but this is where the problem arises. If the user enters shark, I need to validate that the length is between 15 and 30 feet; if he enters yacht, it needs to be between 30 and 60, for instance.
Any help on this?
<boat>
  <type>shark</type>
  <foot>18</foot> <!-- validates -->
</boat>
<boat>
  <type>shark</type>
  <foot>14</foot> <!-- fails -->
</boat>
<boat>
  <type>AnyOtherBoat</type>
  <foot>14</foot> <!-- validates, since it's another type of boat than shark and yacht -->
</boat>
Help appreciated! Thanks.
Schematron ("a language for making assertions about patterns found in XML documents") might be able to do what you need. It allows specifying additional rules which cannot be expressed within a regular XML schema definition (XSD, RelaxNG).
Here are some articles to get you started:
Schematron on Wikipedia
Schematron: XML Structure Validation Language Using Patterns in Trees
Improving XML Document Validation with Schematron
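For this specific case, a Schematron rule set might look roughly like the following sketch (the element names follow the XML above; the shark/yacht bounds are the ones from the question, and any other boat type simply has no length constraint):

<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- sharks must be 15-30 feet -->
    <rule context="boat[type='shark']">
      <assert test="number(foot) &gt;= 15 and number(foot) &lt;= 30">A shark must be between 15 and 30 feet.</assert>
    </rule>
    <!-- yachts must be 30-60 feet -->
    <rule context="boat[type='yacht']">
      <assert test="number(foot) &gt;= 30 and number(foot) &lt;= 60">A yacht must be between 30 and 60 feet.</assert>
    </rule>
  </pattern>
</schema>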
To answer your question: no, you can't do that in XML Schema 1.0.
Firstly, you can't use values to select which constraints apply (though you could for elements like <shark>).
Secondly, you can't do arithmetic tests (but you can use a regex to specify the permissible strings, so you might be able to hack it).