Lucene field from TokenStream with stored values - lucene

I have a field which needs to come from a token stream; it cannot be instantiated with a string and then analyzed into tokens. For example, I might want to combine the data from multiple columns (in my RDBMS) into a single Lucene field, but I want to analyze each column in its own way. So I cannot simply concatenate them into a single string and then analyze the result.
The problem I am running into now is that fields created from token streams cannot be stored, which makes sense in the general case since the stream may not have an obvious string representation. However, I know the string representation, and I would like to store that.
I tried adding the same field twice, once with it being stored and having string data and once with it coming from a token stream, but it seems that this can't be done. Apart from some hack like adding a field with a name of "myfield__stored" is there a way to do this?
I am using 2.9.2.

I found a way. You can sneak it in by instantiating it as a normal field but calling SetTokenStream later:
Field f = new Field(Name, StringValue, Store, Analyzed, TV);
f.SetTokenStream(TokenStreamValue);
Because the reader/string value is only indexed if the token stream value is null, the token stream value will be indexed. The store methods look at string/reader regardless of token stream, so it will be this value which is stored.
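The precedence described above can be modelled in a few lines. This is only a toy sketch in Python, not the Lucene API: `ToyField`, `tokens_to_index` and `value_to_store` are hypothetical names, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class ToyField:
    """Hypothetical stand-in for a field; this is not the Lucene class."""
    name: str
    string_value: str
    stored: bool = True
    token_stream: Optional[Iterable[str]] = None  # set after construction


def tokens_to_index(field: ToyField, analyze: Callable[[str], List[str]]) -> List[str]:
    # The string value is analyzed only when no token stream was supplied.
    if field.token_stream is not None:
        return list(field.token_stream)
    return analyze(field.string_value)


def value_to_store(field: ToyField) -> Optional[str]:
    # Storing looks at the string value and ignores the token stream.
    return field.string_value if field.stored else None


f = ToyField("myfield", "Alice Smith; 42 Main St")
f.token_stream = ["alice", "smith", "42", "main", "st"]  # pre-analyzed per column

indexed = tokens_to_index(f, analyze=str.split)
stored = value_to_store(f)
```

Here `indexed` holds the tokens from the stream, while `stored` is still the original string.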

Related

STRING type or SSTRING element for a text field in table? Pros and cons

I need to create a Z table to store reasons for modifications of a certain custom object.
In the UI, the user will pick a reason ID and then optionally fill a text box. The table will have more or less the fields below:
key objectID
key changeReasonID
changedOn
changedBy
comments
My doubt is with the comments field. I read the documentation about the limitations of STRING and SSTRING, but it's not clear to me if a STRING type field used in a transparent table has a limited length or not.
Even if the length is not limited (at least by the DB), I'm not sure whether this approach is a good idea. Would you recommend CHAR/SSTRING types with a fixed length instead?
My system is running on an MSSQL database.
Strings have unlimited length, both in ABAP structures/tables, and in the database.
Most databases will store only a pointer in this column that points to the real CLOB value which is stored in a different memory segment. As a result, they restrict the usage of these columns, and may not allow you to use them as a key or index.
If I remember correctly, ABAP supports a maximum of 16 string fields per structure, which naturally limits its use cases. Also consider that ABAP structures have a maximum size.
For your case, if the comment will remain the only long field, and if you are actually fine with storing unlimited input (--> security constraints?), string sounds like a reasonable option.
If you are unsure what the future will bring, or to be on the safe side regarding security, you might want to opt for sstring or simply a long char instead.

What happens if I send integers to a BigQuery field "string"?

One of the columns I send (in my code) to BigQuery contains integers. I added the columns to BigQuery, but I was too hasty and added them as type string.
Will they be automatically converted? Or will the data be totally corrupted (= I cannot trust at all the resulting string)?
Data shouldn't be automatically converted as this would destroy the purpose of having a table schema.
What I've seen people do is save a whole JSON line as a string and then process that string inside BigQuery. Other than that, if you try to save values that do not match the field's schema definition, you should see an error being thrown.
If you need to change a table schema's definition, you can check this tutorial on updating a table schema.
Actually, BigQuery automatically converted the integers I sent into strings, so my table is populated correctly.

Lucene - Expected behavior when indexing multiple occurrences of a token within a field

Let's say that I'm indexing a string value "useridA;useridB,userdidC,useridA,useridA"
The field is set to ANALYZED and uses a custom CharTokenizer which looks for a boundary comma char.
What is the expected behavior in the index, as the token "useridA" occurs multiple times within the same field?
Will it just re-index the same value and take up the same space as if there had been just one occurrence?
At the most basic level, Lucene is an "inverted term index": it stores term->docID. So if a term occurs many times in a field, the mapping from that term to the document is only recorded once.
Obviously this is a huge simplification. Positional information is also stored for each occurrence by default (you need this to use phrase and slop queries), and term vectors can be stored in addition, depending on the TermVector value used when adding the field.
Depending on your use case, I'd suggest you de-dupe the list either when indexing, or just use a HashSet<string> for that property of whatever your class is.
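As a rough illustration of the idea, here is a toy inverted index in Python. It is a sketch, not how Lucene lays out its files: `tokenize` and `add_document` are hypothetical helpers, and the input is simplified to use commas only.

```python
from collections import defaultdict
from typing import Dict, List


def tokenize(value: str, boundary: str = ",") -> List[str]:
    # Split on the boundary character only, like the comma tokenizer in the question.
    return [tok for tok in value.split(boundary) if tok]


def add_document(index: Dict[str, Dict[int, List[int]]], doc_id: int, value: str) -> None:
    # Layout: term -> {doc_id -> [positions of each occurrence]}
    for position, term in enumerate(tokenize(value)):
        index[term][doc_id].append(position)


raw = "useridA,useridB,useridC,useridA,useridA"

index: Dict[str, Dict[int, List[int]]] = defaultdict(lambda: defaultdict(list))
add_document(index, 1, raw)

# Plain dicts for inspection: one entry per distinct term.
postings = {term: dict(docs) for term, docs in index.items()}

# De-duplicating before indexing, keeping first-seen order.
unique_tokens = list(dict.fromkeys(tokenize(raw)))
```

In this model "useridA" gets a single entry for document 1 with three positions, and de-duplicating up front removes the extra positions as well.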

Dynamic type cast in select query

I have totally rewritten my question because of inaccurate description of the problem!
We have to store a lot of different information about a specific region. For this we need a flexible data structure which does not limit the possibilities for the user.
So we've created a key-value table for this additional data, which is described by a meta table that contains the datatype of the value.
We already use this information for queries over our REST API, where we automatically wrap the requested field in a cast.
SQL Fiddle
We return this data together with information from other tables as a JSON object. We convert the corresponding rows from the data table with array_agg and json_object into a JSON object:
...
CASE
WHEN count(prop.name) = 0 THEN '{}'::json
ELSE json_object(array_agg(prop.name), array_agg(prop.value))
END AS data
...
This works very well. The problem is that if we store something like a floating point number in this field, we get back a string representation of that number:
e.g. 5.231 returns as "5.231"
Now we would like to CAST this number in our select statement into the right data type so that the JSON result is correctly formatted. We have all the information we need, so we tried the following:
SELECT
json_object(array_agg(data.name),
-- here I cast the value into the right datatype!
-- results in an error
array_agg(CAST(value AS datatype))) AS data
FROM data
JOIN (
SELECT name, datatype
FROM meta)
AS info
ON info.name = data.name
The error message is as follows:
ERROR: type "datatype" does not exist
LINE 3: array_agg(CAST(value AS datatype))) AS data
^
Query failed
PostgreSQL said: type "datatype" does not exist
So is it possible to dynamically cast to the PostgreSQL type named in the text of the datatype column, in order to return a well-formatted JSON object?
First, that's a terrible abuse of SQL, and ought to be avoided in practically all scenarios. If you have a scenario where this is legitimate, you probably already know your RDBMS so intimately, that you're writing custom indexing plugins, and wouldn't even think of asking this question...
If you tell us what you're actually trying to do, there's about a 99.9% chance we can tell you a better way to do it.
Now with that disclaimer aside:
This is not possible without using dynamic SQL. In PostgreSQL, you can accomplish this by building the statement as a string and running it with EXECUTE inside a PL/pgSQL function or DO block, which you can read about in the manual. (EXECUTE IMMEDIATE is the embedded SQL, ECPG, form of the same idea.)
Note, however, that even using this method, the result for every row fetched in the same query must have the same data type. In other words, you can't expect that row 1 will have a data type of VARCHAR, and row 2 will have INT. That is completely impossible.
The problem you have is that json_object creates an object out of a string array for the keys and another string array for the values. So if you feed your JSON objects into this function, it will always return an error.
So the first problem is that you have to use a JSON or JSONB column for the values, or convert the values to json with to_json(). Note that if value is a text column, to_json() yields a JSON string, so a number stored as text has to be cast to a numeric type first to come out as a JSON number.
The second problem is that you need another function to create your JSON object, because you want to feed it a string for each key and a json value for each value. For this there is an aggregate function called json_object_agg.
Then your output should be like the one you expected! Here the full query:
SELECT
json_object_agg(data.name, to_json(data.value)) AS data
FROM data
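If the cast cannot be done per row inside the query, the same idea can be applied in application code after fetching the rows. The following is a minimal sketch under assumptions: the datatype names in `CASTERS` and the sample rows are hypothetical, and each row is taken to be (name, value as text, datatype from the meta table).

```python
import json
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical datatype names; the real meta table may use different ones.
CASTERS: Dict[str, Callable[[str], Any]] = {
    "integer": int,
    "float": float,
    "boolean": lambda text: text.lower() in ("true", "t", "1"),
    "text": str,
}


def rows_to_json(rows: List[Tuple[str, str, str]]) -> str:
    # Cast each text value with the caster named by its datatype;
    # unknown datatypes fall back to leaving the value as a string.
    typed = {name: CASTERS.get(datatype, str)(value) for name, value, datatype in rows}
    return json.dumps(typed, sort_keys=True)


rows = [
    ("area", "5.231", "float"),
    ("population", "1200", "integer"),
    ("label", "north", "text"),
]
result = rows_to_json(rows)
```

With these sample rows, "area" and "population" are emitted as JSON numbers and "label" stays a string.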

Luke - Where are my field values?

I've used Luke like four times per year for the past three years. I only break it out when I need it. One concept I've never understood is why only certain fields' values are displayed. I can query these "empty" fields for expected values and get the expected results, but Luke never displays these. I assume I'm missing something fundamental and obvious, but it's not so obvious to me.
Example Search tab:
Example Documents tab:
When a program creates a Lucene Document, it might tell Lucene whether to store the value of the field or not. See, for example, the stored argument to the StringField constructor. If the value is not stored then it can be searched on, but the original bytes of the value are not saved in the index, since they are not required nor used by the search.
A typical pattern with, say, http://www.elasticsearch.org/ is to store the original JSON in a single field and not to store the fields that are actually indexed. That way the application working with the retrieved data can use its native data format and does not have to be aware of Lucene and its flat key-value Document.
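The difference between indexed and stored can be sketched with a toy index in Python. This is not Lucene or Luke code: `ToyIndex` and its methods are hypothetical, and the whitespace tokenization is a simplification.

```python
from typing import Dict, List, Set, Tuple


class ToyIndex:
    """Hypothetical model: every field is searchable, only stored ones are retrievable."""

    def __init__(self) -> None:
        self.postings: Dict[Tuple[str, str], Set[int]] = {}
        self.stored: Dict[int, Dict[str, str]] = {}

    def add(self, doc_id: int, fields: List[Tuple[str, str, bool]]) -> None:
        # Each field is (name, value, stored flag); every field is indexed.
        self.stored[doc_id] = {}
        for name, value, stored in fields:
            for term in value.lower().split():
                self.postings.setdefault((name, term), set()).add(doc_id)
            if stored:
                self.stored[doc_id][name] = value

    def search(self, name: str, term: str) -> List[int]:
        return sorted(self.postings.get((name, term.lower()), set()))

    def document(self, doc_id: int) -> Dict[str, str]:
        # Only stored values can be shown; unstored fields are absent here.
        return self.stored[doc_id]


idx = ToyIndex()
idx.add(1, [
    ("title", "Lucene in Action", True),
    ("body", "Indexing and searching text", False),
])

hits = idx.search("body", "searching")
doc = idx.document(1)
```

Searching the unstored "body" field finds document 1, but the retrieved document only contains "title", which matches what a viewer of stored values would show.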