Solr - Storing offsets and positions of numericValues - lucene

I would like to know whether it is possible to store the offsets, positions, and frequencies of numeric values (int, float, double) in Solr. For terms, we have character and token attributes on which offsets can be set, but for numeric values stored as Trie or Sortable fields, is it possible to set offsets or attributes in the same way?
I have considered payloads and payload filters, but I cannot tell which would be best for this, or whether it is possible to perform range queries on payload values.
Otherwise, there is also IndexOptions, which allows setting DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS on the field. But again, I am not sure whether this applies to anything other than terms/characters. Another candidate, NumericTermAttribute, does not have functions for setting positions or attributes.
It would be fine to store the numeric values as terms and search and sort them as sortable strings, but why would I do that when a Trie field is more efficient?

Related

STRING type or SSTRING element for a text field in table? Pros and cons

I need to create a Z table to store reasons for modifications of a certain custom object.
In the UI, the user will pick a reason ID and then optionally fill a text box. The table will have more or less the fields below:
key objectID
key changeReasonID
changedOn
changedBy
comments
My question concerns the comments field. I read the documentation about the limitations of STRING and SSTRING, but it's not clear to me whether a STRING type field used in a transparent table has a limited length or not.
Even if the length is not limited (at least by the DB), I'm not sure whether this approach is a good idea. Would you recommend CHAR/SSTRING types with a fixed length instead?
My system is running on an MSSQL database.
Strings have unlimited length, both in ABAP structures/tables, and in the database.
Most databases will store only a pointer in this column that points to the real CLOB value which is stored in a different memory segment. As a result, they restrict the usage of these columns, and may not allow you to use them as a key or index.
If I remember correctly, ABAP supports a maximum of 16 string fields per structure, which naturally limits its use cases. Also consider that ABAP structures have a maximum size.
For your case, if the comment will remain the only long field, and if you are actually fine with storing unlimited input (consider any security constraints), string sounds like a reasonable option.
If you are unsure what the future will bring, or want to be on the safe side regarding security, you might want to opt for sstring or simply a long char instead.

Objectbox notIn with non numeric values

We are trying to build a query that checks that a non-numeric value (a UUID) is not in a collection of UUIDs, but the available implementations only accept int and long arrays.
Is there any way to do this? We have tried using filter, but it has to be applied after the find method, which is not correct here: this is a paginated query, so filtering afterwards would lose some results.

Keen IO mixed property values (integers as strings)

Since Keen is not strongly typed, I've noticed it is possible to send data of different types into the same property. For instance, some events may have a property whose value is a String (sent surrounded by quotes), and some whose value is an integer (sent without quotes). In the case of mathematical operations, what is the expected behavior?
Our comparator will only compute mathematical operations on numbers. If you have a property whose values are mixed, the operation will only apply to the numbers; strings will be ignored. You can see the values in your property by running a select_unique query with that property as the target_property, then (if you're using the Explorer) selecting JSON from the drop-down in the top-right. Any values you see there that are surrounded by quotes will be ignored by a mathematical query type (minimum, maximum, median, average, percentile, and sum).
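To make the "strings are ignored" behavior concrete, here is a minimal Python sketch of that rule. It is only an illustration of the behavior described above, not Keen's implementation; the function names and the sample values are hypothetical.

```python
from numbers import Number

def numeric_only(values):
    """Keep real numbers; drop strings (and bools, which Python counts as ints)."""
    return [v for v in values if isinstance(v, Number) and not isinstance(v, bool)]

def average(values):
    """Average over the numeric values only; None if there are no numbers."""
    nums = numeric_only(values)
    return sum(nums) / len(nums) if nums else None

# Hypothetical property values: two sent as numbers, two sent as strings.
mixed_values = [10, "20", 30, "forty"]
print(numeric_only(mixed_values))
print(average(mixed_values))
```

Under this model the quoted "20" does not contribute to the average, which is why sending integers without quotes matters.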
If you are just starting out, and you know you want to be able to do mathematical operations on this property, we recommend making sure that you always send integers as numbers (without quotes). If you really want to keep your dataset clean, you can even start a new collection once you've made sure you are no longer sending any strings.
Yes, you're correct: Keen can accept data of different types as the value for your properties. An example of Keen's lenient data typing is that a property such as VisitorID can contain both numbers (e.g. 14558) and strings (e.g. "14558").
This article from the Keen site is useful for seeing where you can check data types: https://keen.io/docs/data-collection/data-modeling-guide-200/#check-for-data-type-mismatch

Lucene - Expected behavior when indexing multiple occurrences of a token within a field

Let's say that I'm indexing the string value "useridA;useridB,userdidC,useridA,useridA".
The field is set to ANALYZED and uses a custom CharTokenizer which looks for a boundary comma char.
What is the expected behavior in the index, given that the token "useridA" occurs multiple times within the same field?
Will it just re-index the same value and take up the same space as if there had been just one occurrence?
At the most basic level, Lucene is an "inverted term index": it stores term -> docID. So if a term occurs many times, it will only be recorded once.
Obviously, this is a huge simplification. Positional information will also be stored, depending on the TermVector value used when adding the field (you will need this to use phrase and slop queries).
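As a rough illustration of the term -> docID idea with positions, here is a toy inverted index in Python. It is a simplified model for intuition only and does not reflect Lucene's actual storage format; build_index and the document ID are hypothetical.

```python
from collections import defaultdict

def build_index(docs, sep=","):
    """Map each term to {doc_id: [positions]}, splitting on a boundary char."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.split(sep)):
            index[token].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "useridA;useridB,userdidC,useridA,useridA"}
index = build_index(docs)

# The repeated token has a single entry for the document,
# with one position recorded per occurrence.
print(index["useridA"])
print(sorted(index))
```

Note that splitting only on commas leaves "useridA;useridB" as one token, so the standalone "useridA" occurs twice here, not three times.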
Depending on your use case, I'd suggest you de-dupe the list, either when indexing or by just using a HashSet<string> for that property of whatever your class is.
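A minimal sketch of de-duplicating before indexing, written in Python rather than with a HashSet. The helper name dedupe_tokens is hypothetical; unlike a plain set, this version keeps the order of first occurrence.

```python
def dedupe_tokens(value, sep=","):
    """Drop repeated tokens, keeping the first occurrence of each in order."""
    # dict.fromkeys preserves insertion order, so it acts as an ordered set.
    return sep.join(dict.fromkeys(value.split(sep)))

raw = "useridA;useridB,userdidC,useridA,useridA"
print(dedupe_tokens(raw))
```

The cleaned string can then be passed to the analyzer in place of the raw value.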

Any bad effect if I use TEXT data-type to store a number

Is there any bad effect if I use the TEXT data type to store an ID number in a database?
I do something like:
CREATE TABLE GenData ( EmpName TEXT NOT NULL, ID TEXT PRIMARY KEY);
And actually, if I want to store a date value I usually use TEXT data-type. If this is a wrong way, what is its disadvantage?
I am using PostgreSQL.
Storing numbers in a text column is a very bad idea. You lose a lot of advantages when you do that:
You can't prevent storing invalid numbers (e.g. 'foo').
Sorting will not work the way you want it to ('10' is "smaller" than '2').
It confuses everybody looking at your data model.
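The first two points can be reproduced outside the database. The following Python sketch uses plain string comparison, which is an assumption standing in for the column's text ordering (the actual order in PostgreSQL also depends on collation); is_valid_number is a hypothetical helper.

```python
ids_as_text = ["10", "2", "9"]

# Text compares character by character, so "10" sorts before "2".
text_order = sorted(ids_as_text)

# Converting to integers gives the order most people expect.
numeric_order = sorted(int(s) for s in ids_as_text)

def is_valid_number(s):
    """A text column accepts anything; a numeric type rejects values like 'foo'."""
    try:
        int(s)
        return True
    except ValueError:
        return False

print(text_order)
print(numeric_order)
print(is_valid_number("foo"))
```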
"If I want to store a date value I usually use TEXT data-type"
That is another very bad idea, mainly for the same reasons you shouldn't store a number in a text column. In addition to completely wrong dates ('foo'), you can't prevent "invalid" dates either (e.g. February 31st). And then there is the sorting issue, the comparison with > and <, and the date arithmetic...
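The same kind of validation and arithmetic that a real date type provides can be seen in a small Python sketch. This only illustrates the idea with Python's datetime module, not PostgreSQL's date type; parse_iso_date is a hypothetical helper.

```python
from datetime import date

def parse_iso_date(text):
    """Return a date for valid YYYY-MM-DD input, or None for anything else."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None

# As plain text, all three values would be stored without complaint.
print(parse_iso_date("2023-02-28"))
print(parse_iso_date("2023-02-31"))
print(parse_iso_date("foo"))

# With real dates, comparison and arithmetic work directly.
gap = date(2023, 3, 1) - date(2023, 2, 28)
print(gap.days)
```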
I really don't recommend using text for dates.
Look at all the functions you are missing with text
If you want to use them, you have to cast, and that only causes problems if some of the stored dates happen to be invalid, because with text there is no validation.
In addition to what the other answers already provided:
text is also subject to COLLATION and encoding, which may complicate portability and data interchange between platforms. It also slows down sorting operations.
Concerning the storage size of text: an integer occupies 4 bytes (and is subject to padding for data alignment). text or varchar occupies 1 byte plus the actual string, which is 1 byte per ASCII character in UTF-8, or more for special characters. Most likely, text will be much bigger.
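As a back-of-the-envelope check of those figures, the sketch below compares a 4-byte integer with the 1-byte-plus-string estimate given above. It is a rough model that takes the numbers in this answer as assumptions and ignores alignment padding; text_size is a hypothetical helper, not a PostgreSQL function.

```python
import struct

# Size of a 4-byte integer, as stated in the answer.
INT_SIZE = struct.calcsize("<i")

def text_size(s):
    """Estimated size of a short text value: 1 header byte plus UTF-8 bytes."""
    return 1 + len(s.encode("utf-8"))

print(INT_SIZE)
print(text_size("1234567890"))  # ten ASCII digits
print(text_size("é"))           # a non-ASCII character takes more than 1 byte
```

By this estimate, the text form grows with the number of digits, while the integer stays at a fixed width.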
It depends on what operations you are going to do on the data.
If you are going to be doing a lot of arithmetic on numeric data, it makes more sense to store it as some kind of numeric data type. Also, if you plan on sorting the data in numerical order, it really helps if the data is stored as a number.
When stored as text, "11" comes ahead of "9" because "1" comes ahead of "9". If this isn't what you want, don't use text.
On the other hand, it often makes sense to store strings of digits, such as zip codes, social security numbers, or phone numbers, as text.