How to get the original entity value from Wit.Ai?

I was wondering if there is a way to also return the original value of an entity from Wit.Ai.
For instance, my entity "states" correctly maps the misspelled, lowercase word massachusets to Massachusetts, but it does not return the original word, so I cannot easily tag the incorrect word.
{
  "msg_id": "a6ac0938-d92c-45f4-8c41-4ca990d83415",
  "_text": "What is the temperature in massachusets?",
  "entities": {
    "states": [
      {
        "confidence": 0.7956227869227184,
        "type": "value",
        "value": "Massachusetts"
      }
    ]
  }
}
I would really appreciate it if you know how I can accomplish that with Wit.Ai.
Thanks

Keep the search strategy of "states" as free-text & keywords. This way you can extract the original word from the message. Declaring it as a keyword matches it to the closest known value and returns that keyword, whereas free-text returns the original word from the message.
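If changing the search strategy is not an option, a client-side workaround is to fuzzy-match the resolved value back onto the words of _text. A minimal sketch in Python, assuming the response shape shown in the question and a placeholder access token (the /message endpoint with Bearer auth is Wit.Ai's standard HTTP API):

import difflib
import requests

WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder, not a real token

resp = requests.get(
    "https://api.wit.ai/message",
    params={"q": "What is the temperature in massachusets?"},
    headers={"Authorization": "Bearer " + WIT_TOKEN},
)
data = resp.json()

# words of the original utterance, stripped of trailing punctuation
words = [w.strip("?,.!") for w in data["_text"].split()]

for entity in data["entities"].get("states", []):
    resolved = entity["value"]  # e.g. "Massachusetts"
    # find the original (possibly misspelled) word closest to the resolved value
    match = difflib.get_close_matches(
        resolved.lower(), [w.lower() for w in words], n=1, cutoff=0.6
    )
    if match:
        print("resolved %r from original word %r" % (resolved, match[0]))

This would tag massachusets as the source of Massachusetts without retraining the entity.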

You may have to train Wit each time to do so, by highlighting "Massachusetts" as a resolved value. This will make Wit understand that you do not agree with the autocorrection.

Related

How to get the equivalent of combining [contains] and [in] operators in the same query?

So I have a field that's a multi-choice on the Directus back end; when the JSON comes out of the API it's a one-dimensional array, like so:
"field_name": [
"",
"option 6",
"option 11",
""
]
(btw I have no idea why all these fields produce those blank values, but that's a matter for another day)
I am trying to make an interface on the front end where you can select one or more of these values, and the result will come back if ANY of them are found for that record. Think of it like a tag list: if the item has just one of the values, it should be returned.
I can use the [contains] operator to find if it has one of the values I'm looking for, but I can only pass a single value, whereas I need all records that have either optionX OR optionY OR optionZ. I would basically need a combination of [contains] and [in] to achieve what I'm trying to do. Is there a way to achieve this?
I've also tried setting the [logical] operator to OR, but then that screws up the other filters that need to be included as AND (or I'm doing something wrong). Not to mention the query gets completely unruly.
Help?
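For what it's worth, the intended ANY-match ("tag list") semantics look like this when sketched client-side in Python. Field and option names are taken from the question; the records are assumed to be already fetched from the API:

# sample records shaped like the API output above
records = [
    {"id": 1, "field_name": ["", "option 6", "option 11", ""]},
    {"id": 2, "field_name": ["", "option 3"]},
]
selected = ["option 6", "option 11"]  # values picked in the front-end interface

def matches_any(record, wanted):
    # drop the stray empty strings before comparing
    values = {v for v in (record.get("field_name") or []) if v}
    return bool(values & set(wanted))

print([r["id"] for r in records if matches_any(r, selected)])  # -> [1]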

Cloudant - Lucene range search using numbers stored as text

I have a number of documents in Cloudant that have an ID field of type string. The ID can be a simple string, like "aaa" or "bbb", or a number stored as text, e.g. "111", "222", etc. I need to be able to full-text search on the above field, but I have encountered some problems.
Assuming that I have two documents, with ID="aaa" and ID="111", each of the following queries:
ID:aaa
ID:"aaa"
ID:[aaa TO zzz]
ID:["aaa" TO "zzz"]
returns the first document, as expected
ID:111
returns nothing, but
ID:"111"
returns the second document, so at least there is a way to retrieve it.
Unfortunately, when searching for range:
ID:[111 TO 999]
ID:["111" TO "999"]
I get no results, and I have no idea what to do to get around this problem. Is there any special syntax for such a case?
UPDATE:
Index function:
function(doc) {
  if (!doc.ID) return;
  index("ID", doc.ID, { index: 'not_analyzed_no_norms', store: true });
}
Changing index to analyzed doesn't help. Analyzer itself is keyword, but changing to standard doesn't help either.
UPDATE 2
Just to add some more context, because I think I missed one key point. The field I'm indexing will be searched using ranges, and both the min and max values can be provided by the user. So it is possible that one of them will be a number stored as a string, while the other will be standard non-numeric text. For example: search all documents where ID >= "11" and ID <= "foo".
Assuming that the database contains documents with ID "1", "5", "alpha", "beta", "gamma", this query should return "5", "alpha", "beta". Please note that "5" should actually be returned, because the string "5" is greater than the string "11".
Our team just came up with a workaround. We managed to get proper results by appending some arbitrary character, e.g. 'a', to the upper range value, and by introducing an additional search term to exclude documents having an ID between the upper range value and the upper range value + 'a'.
When searching for a range
ID:[X TO Y]
the actual query would be
(ID:[X TO Ya] AND -ID:{Y TO Ya])
For example, to find documents having an ID between 23 and 758, we execute
(ID:[23 TO 758a] AND -ID:{758 TO 758a]).
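Building these rewritten queries by hand is error-prone, so a small helper can do it mechanically. A Python sketch of the workaround described above (the function name is made up; 'a' is the arbitrary padding character):

def exclusive_string_range(field, lo, hi, pad="a"):
    # [lo TO hi+pad] keeps the query a string range; the negated
    # {hi TO hi+pad] term removes the spurious extra interval while
    # keeping hi itself included.
    return "({f}:[{lo} TO {hi}{p}] AND -{f}:{{{hi} TO {hi}{p}])".format(
        f=field, lo=lo, hi=hi, p=pad
    )

print(exclusive_string_range("ID", "23", "758"))
# -> (ID:[23 TO 758a] AND -ID:{758 TO 758a])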
First of all, I would suggest using the keyword analyzer, so you can control tokenization during both indexing and search.
"analyzer": "keyword",
"index": "function(doc){\n if(!doc.ID) return;\n index(\"ID\", doc.ID, {store:true });\n}
To retrieve your document with ID "111", use the following range query:
curl -X GET "http://.../facetrangetest/_design/ddoc/_search/f?q=ID:\[111%20TO%20A\]"
If you use the query q=ID:\[111%20TO%20999\], Cloudant search, seeing numbers on both sides of the range, will interpret it as a NumericRangeQuery; and since your ID of "111" is a string, it will not be part of the results returned. Including a string in the query, [111%20TO%20A], will make Cloudant interpret it as a range query on strings.
You can get both docs returned like this:
q=ID:["111" TO "CCC"]
Here's a working live example:
https://rajsingh.cloudant.com/facetrangetest/_design/ddoc/_search/f?q=ID:[%22111%22%20TO%20%22CCC%22]
I found something quirky. It seems that range queries on strings only work if at least one of the range values is non-numeric. Querying on ID:["111" TO "555"] doesn't return anything either, so maybe this is resolving to a numeric query somehow? Could be a bug.
This could also be achieved using regular expressions in queries. Something like this:
curl -X POST "https://.../facetrangetest/_design/ddoc/_search/f" -d '{"q":"ID:/<23-758>/"}' | jq .
This regular expression means: retrieve all documents with an ID field from 23 to 758. Slashes (/ /) are used to enclose a regular expression; the interval is enclosed inside <>.

Can JSON Schema support constraints on array items at specific indexes

A good schema language will allow a high degree of control on value constraints.
My quick impression of JSON Schema, however, is that one cannot go beyond specifying that an item must be an array with a single allowable type; one cannot apparently specify, for example, that the first item must be of one type, and the item at the second index of another type. Is this view mistaken?
Yes, it can be done. Here is an example of an array with the first three item types specified:
{
  "type": "array",
  "items": [
    { "type": "number" },
    { "type": "string" },
    { "type": "integer" }
  ]
}
When you validate against the schema, the 1st, 2nd and 3rd items each need to match their specified type.
If you have more than three items in your array, the extra ones don't have a specified type, so they won't fail validation; an array with fewer than three items will also validate, as long as the type of each item present is correct.
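To see this behavior concretely, here is a minimal sketch using the Python jsonschema package. The $schema keyword pins draft-07, where tuple validation uses the array form of items (in draft 2020-12 the same idea is spelled prefixItems):

from jsonschema import ValidationError, validate

schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "array",
    "items": [
        {"type": "number"},
        {"type": "string"},
        {"type": "integer"},
    ],
}

validate([1.5, "hello", 3], schema)        # passes: each position matches its type
validate([1.5, "hello"], schema)           # passes: shorter arrays are allowed
validate([1.5, "hello", 3, None], schema)  # passes: extra items are unconstrained here

try:
    validate([1.5, 2, 3], schema)          # fails: index 1 should be a string
except ValidationError as err:
    print(err.message)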
Source, and a good read I found last week when I started with JSON Schema: Understanding JSON Schema (array section on page 24 of the PDF)
PS: English is not my first language; let me know of any mistakes in spelling, punctuation or grammar.

What's the denominator for ElasticSearch scores?

I have a search which has multiple criterion.
Each criterion (grouped by should) has a different weighted score.
ElasticSearch returns a list of results, each with a score, which seems arbitrary to me because I can't find a denominator for that score.
My question is - how can I represent each score as a ratio?
Dividing each score by max_score would not work since it'll show the best match as a 100% match with the search criteria.
The _score calculation depends on the combination of queries used. For instance, a simple query like:
{ "match": { "title": "search" }}
would use Lucene's TFIDFSimilarity, combining:
term frequency (TF): how many times does the term search appear in the title field of this document? The more often, the higher the score
inverse document frequency (IDF): how many times does the term search appear in the title field of all documents in the index? The more often, the lower the score
field norm: how long is the title field? The longer the field, the lower the score. (Shorter fields like title are considered to be more important than longer fields like body.)
a query normalization factor (this can be ignored)
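Taken together, these are the ingredients of Lucene's classic practical scoring function (TFIDFSimilarity); roughly, in LaTeX:

\text{score}(q,d) = \text{queryNorm}(q) \cdot \text{coord}(q,d) \cdot \sum_{t \in q} \Big( \text{tf}(t,d) \cdot \text{idf}(t)^2 \cdot \text{boost}(t) \cdot \text{norm}(t,d) \Big)

Note that nothing in this sum has a fixed upper bound, which is why the raw _score has no natural denominator.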
On the other hand, a bool query like this:
"bool": {
"should": [
{ "match": { "title": "foo" }},
{ "match": { "title": "bar" }},
{ "match": { "title": "baz" }}
]
}
would calculate the _score for each clause that matches, add them together, then divide by the total number of clauses (and once again apply the query normalization factor). For example, if two of the three clauses match with scores 1.2 and 0.8, the combined score before normalization would be (1.2 + 0.8) / 3 ≈ 0.67.
So it depends entirely on what queries you are using.
You can get a detailed explanation of how the _score was calculated by adding the explain parameter to your query:
curl localhost:9200/_search?explain -d '
{
"query": ....
}'
My question is - how can I represent each score as a ratio?
Without understanding what you want your query to do, it is impossible to answer this. Depending on your use case, you could use the function_score query to implement your own scoring algorithm.
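For illustration only, here is a sketch of a function_score query that discards the TF/IDF score and substitutes a bounded 0-1 value computed from a hypothetical 0-10 rating field (the index name and field are made up; the script_score syntax matches the 1.x/2.x era of the curl example above), sent via Python's requests library:

import requests

query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "search"}},
            # "replace" discards _score instead of multiplying into it
            "boost_mode": "replace",
            # hypothetical 0-10 "rating" field, scaled into a 0-1 ratio
            "script_score": {"script": "doc['rating'].value / 10.0"},
        }
    }
}

resp = requests.post("http://localhost:9200/my_index/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])  # now a ratio in [0, 1]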

Evaluate column value into rows

I have a column whose value is a JSON array. For example:
[{"att1": "1", "att2": "2"}, {"att1": "3", "att2": "4"}, {"att1": "5", "att2": "6"}]
What I would like is to provide a view where each element of the JSON array is transformed into a row, and the attributes of each JSON object into columns. Keep in mind that the JSON array doesn't have a fixed size.
Any ideas on how I can achieve this?
A stored procedure with a lexer to run against the string? Anything else, like trying a variable in the SQL or using a regexp, will be tricky, I imagine.
If you need it for client-side viewing only, can you use a JSON decode library (json_decode() if you are on PHP) and then build markup from that?
But if you're gonna use it for any DB work at all, I reckon it shouldn't be stored as JSON.
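If client-side decoding is acceptable, the expansion itself is straightforward. A Python sketch using only the standard library, with the sample value from the question (for what it's worth, MySQL 8.0+ can do this server-side with JSON_TABLE):

import json

# the column value from the question
raw = '[{"att1": "1", "att2": "2"}, {"att1": "3", "att2": "4"}, {"att1": "5", "att2": "6"}]'

objs = json.loads(raw)

# collect attribute names across all objects, since the array has no fixed size
columns = sorted({key for obj in objs for key in obj})

print(columns)  # ['att1', 'att2']
for obj in objs:
    print(tuple(obj.get(c) for c in columns))  # one row per array element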