putting exact matches first in elasticsearch multimatch query - lucene

it sure seems that there's no easy way to do this ... how can i ensure that certain fields in my multi match query going to actually be boosted correctly so that exact matches show up at the top?
i honestly seem to have tried this a multitude of ways, but maybe someone knows the answer ...
in my movie and music database, i'm trying to search multiple fields at once, but ensure that exact matches make it to the top and that certain fields such as title and artist name have more boost.
here's the main portion of my query...
"query": {
"bool": {
"should": [
{
"multi_match": {
"type": "phrase_prefix",
"query": "brave",
"max_expansions": 10,
"fields": [
"title^3",
"artists.name^2",
"starring.name^2",
"credits.name",
"tracks^0.1"
]
}
}
],
"minimum_number_should_match": 1
}
}
as you see, the query is 'brave'. it just so happens there's a movie called brave. perfect, i want it at the top - since not only is it an exact match, but the match is in the title. however, there's a popular song called 'brave' from sara bareilles which ends up on top. why?
i've tried every analyzer known to man, custom and otherwise, and i've tried changing the 'type' parameter to every other permutation (phrase, best_fields, cross_fields, most_fields), and it just doesn't seem to honor the fact that i'm effectively trying to either promote 'title' and 'artists.name' and 'starring.name' and DEMOTE 'tracks'.
is there any way i can ensure all exact matches show up at the top (especially in title, etc) followed by expansions, etc?
any suggestions would be helpful.
EDIT
the analyzer i'm currently using which seems to work better than others is a custom one i call 'nameAnalyzer' which is made up of a 'lowercase' filter and 'keyword' tokenizer only.
here's some example documents in the order in which they're appearing in the results:
fields": {
"title": [
"Brave"
],
"credits.name": [
"Kelly MacDonald",
"Emma Thompson",
"Billy Connolly",
"Julie Walters",
"Kevin McKidd",
"Craig Ferguson",
"Robbie Coltrane"
],
"starring.name": [
"Emma Thompson",
"Julie Walters",
"Billy Connolly",
"Kevin Mckidd",
"Kelly Macdonald"
]
,
fields": {
"credits.name": [
"Hilary Weeks",
"Scott Wiley",
"Sarah Sample",
"Debra Fotheringham",
"Dustin Christensen",
"Russ Dixon"
],
"title": [
"Say Love"
],
"artists.name": [
"Hilary Weeks"
],
"tracks": [
"Say Love",
"Another Second Chance",
"It's A Good Day",
"Brave",
"I Found Me",
"Hero",
"Tell Me",
"Where I Am",
"Better Promises",
"Even When"
]
,
fields": {
"title": [
"Brave Little Toaster"
],
"credits.name": [
"Randy Bennett",
"Jim Jackman",
"Randy Cook",
"Judy Toll",
"Jon Lovitz",
"Tim Stack",
"Timothy E. Day",
"Thurl Ravenscroft",
"Deanna Oliver",
"Phil Hartman",
"Jonathon Benair",
"Joe Ranft"
],
"starring.name": [
"Jon Lovitz",
"Thurl Ravenscroft",
"Tim Stack",
"Timothy E. Day",
"Deanna Oliver"
]
},
"fields": {
"title": [
"Braveheart"
],
"credits.name": [
"Bernard Horsfall",
"Martin Dempsey",
"James Robinson",
"Robert Paterson",
"Alan Tall",
"Rupert Vansittart",
"Donal Gibson",
"Malcolm Tierney",
"Sandy Nelson",
"Sean Lawlor"
],
"starring.name": [
"Brendan Gleeson",
"Sophie Marceau",
"Mel Gibson",
"Patrick Mcgoohan",
"Catherine Mccormack"
]
}
maybe someone knows why the second title ... (in this case not sara bareilles as i said before, but) Hillary Weeks - who has a track called 'brave' ... why is it before title 'braveheart' and 'brave little toaster'?
EDIT AGAIN
to further complicate the situation, what if i had a 'rank' field that was a part of my document? i'm finding it very difficult to add that to my _score field using a script score function...
"functions": [
{
"script_score": {
"script": "_score * 1/ doc['rank'].value"
}
}
]

Related

How to get new Search Engine results in the past 24h using a SERP API?

Assume I am in possession of a SERP API, which given a keyword, returns me the Google results of that keyword in JSON format (for example: https://serpapi.com/):
{
"organic_results": [
{
"position": 1,
"title": "Coffee - Wikipedia",
"link": "https://en.wikipedia.org/wiki/Coffee",
"displayed_link": "https://en.wikipedia.org › wiki › Coffee",
"snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. From the coffee fruit, the seeds are ...",
"sitelinks":{/*snip*/}
,
"rich_snippet":
{
"bottom":
{
"extensions":
[
"Region of origin: Horn of Africa and ‎South Ara...‎",
"Color: Black, dark brown, light brown, beige",
"Introduced: 15th century"
]
,
"detected_extensions":
{
"introduced_th_century": 15
}
}
}
,
"about_this_result":
{
"source":
{
"description": "Wikipedia is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system. Individual contributors, also called editors, are known as Wikipedians.",
"source_info_link": "https://en.wikipedia.org/wiki/Wikipedia",
"security": "secure",
"icon": "https://serpapi.com/searches/6165916694c6c7025deef5ab/images/ed8bda76b255c4dc4634911fb134de53068293b1c92f91967eef45285098b61516f2cf8b6f353fb18774013a1039b1fb.png"
}
,
"keywords":
[
"coffee"
]
,
"languages":
[
"English"
]
,
"regions":
[
"the United States"
]
}
,
"cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:U6oJMnF-eeUJ:https://en.wikipedia.org/wiki/Coffee+&cd=4&hl=en&ct=clnk&gl=us",
"related_pages_link": "https://www.google.com/search?q=related:https://en.wikipedia.org/wiki/Coffee+Coffee"
},
/* Results 2,3,4... */
]}
What is a good way to get new results from the past 24h? I added the &tbs=qdr:d query parameter, which only shows the results from the past day. That's a good first step.
The 2nd step is to filter out only relevant results. When there are no relevant results, Google shows this message box:
What is their algorithm to show this box?
Idea 1: "grep -i {exact_keywords}"
For example, if I search a keyword like "Alexander Pope", the 24h Google query might return results about the pope, written by a guy called Alexander. That's not super relevant. My naive idea is to grep (case insensitive) the exact keyword "Alexander Pope".
But that might leave out some good results.
Any other ideas?

Handling multiple rows returned by IMPORTJSON script on GoogleSheets

I am trying to populate a google sheet using an API. But the API has more than one row to be returned for a single query. Following is the JSON returned by API.
# https://api.dictionaryapi.dev/api/v2/entries/en/ABANDON
[
{
"word": "abandon",
"phonetics": [
{
"text": "/əˈbændən/",
"audio": "https://lex-audio.useremarkable.com/mp3/abandon_us_1.mp3"
}
],
"meanings": [
{
"partOfSpeech": "transitive verb",
"definitions": [
{
"definition": "Cease to support or look after (someone); desert.",
"example": "her natural mother had abandoned her at an early age",
"synonyms": [
"desert",
"leave",
"leave high and dry",
"turn one's back on",
"cast aside",
"break with",
"break up with"
]
},
{
"definition": "Give up completely (a course of action, a practice, or a way of thinking)",
"example": "he had clearly abandoned all pretense of trying to succeed",
"synonyms": [
"renounce",
"relinquish",
"dispense with",
"forswear",
"disclaim",
"disown",
"disavow",
"discard",
"wash one's hands of"
]
},
{
"definition": "Allow oneself to indulge in (a desire or impulse)",
"example": "they abandoned themselves to despair",
"synonyms": [
"indulge in",
"give way to",
"give oneself up to",
"yield to",
"lose oneself in",
"lose oneself to"
]
}
]
},
{
"partOfSpeech": "noun",
"definitions": [
{
"definition": "Complete lack of inhibition or restraint.",
"example": "she sings and sways with total abandon",
"synonyms": [
"uninhibitedness",
"recklessness",
"lack of restraint",
"lack of inhibition",
"unruliness",
"wildness",
"impulsiveness",
"impetuosity",
"immoderation",
"wantonness"
]
}
]
}
]
}
]
By using the following calls via IMPORTJSON,
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/phonetics/text", "noHeaders")
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/meanings/partOfSpeech", "noHeaders")
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/meanings/definitions/definition", "noHeaders")
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/meanings/definitions/synonyms", "noHeaders")
=ImportJSON(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2), "/meanings/definitions/example", "noHeaders")
I am able to get the following in GoogleSheets,
Whereas, the actual output according to JSON should be,
As you can see a complete row is being overwritten. How can this be fixed?
EDIT
Following is the link to sheet for viewing only.
I believe your goal as follows.
You want to achieve the bottom image in your question on Google Spreadsheet.
Unfortunately, I couldn't find the method for directly retrieving the bottom image using ImportJson. So in this answer, I would like to propose a sample script for retrieving the values you expect using Google Apps Script. I thought that creating a sample script for directly achieving your goal might be simpler rather than modifying ImportJson.
Sample script:
function SAMPLE(url) {
var res = UrlFetchApp.fetch(url, {muteHttpExceptions: true});
if (res.getResponseCode() != 200) return res.getContentText();
var obj = JSON.parse(res.getContentText());
var values = obj[0].meanings.reduce((ar, {partOfSpeech, definitions}, i) => {
definitions.forEach(({definition, example, synonyms}, j) => {
var v = [definition, Array.isArray(synonyms) ? synonyms.join(",") : synonyms, example];
var phonetics = obj[0].phonetics[i];
ar.push(j == 0 ? [(phonetics ? phonetics.text : ""), partOfSpeech, ...v] : ["", "", ...v]);
});
return ar;
}, []);
return values;
}
When you use this script, please put =SAMPLE(CONCATENATE("https://api.dictionaryapi.dev/api/v2/entries/en/"&$A2)) to a cell as the custom formula.
Result:
When above script is used, the following
Note:
In this sample script, when the structure of the JSON object is changed, it might not be able to be used. So please be careful this.
References:
Class UrlFetchApp
Custom Functions in Google Sheets

Searching within an array in kibana

I am pushing my logs to elasticsearch which stores a typical doc as-
{
"_index": "logstash-2014.08.11",
"_type": "machine",
"_id": "2tSlN1P1QQuHUkmoJfkmnQ",
"_score": null,
"_source": {
"category": "critical log with list",
"app_name": "attachment",
"stacktrace_array": [
"this is the first line",
"this is the second line",
"this is the third line",
"this is the fourth line",
],
"#timestamp": "2014-08-11T13:30:51+00:00"
},
"sort": [
1407763851000,
1407763851000
]
}
Kibana makes searching substrings very easy. For example searching for "critical" in the dashboard will fetch all logs with the word critical in any string mapped value.
How do i go about searching for something like "second line" which is a string nested in an array within my doc?
It would be a simple field:<search_term> query, like -
"query": {
"query_string": {
"query": "stacktrace_array:*second line*"
}
...
So in layman terms, for Kibana dashboard, put your search query like so -
stacktrace_array:*second line*

How to get list of statements for a given Wikidata ID?

The only thing I managed to do is this link:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q568&format=jsonfm
But this produces lots of useless data. What I need is to get all the statements for the given item, but I can't see any of the statements in the query above.
here it will be:
{ "instance of" : "chemical element",
"element symbol" : "Li",
"atomic number" : 3,
"oxidation state" : 1,
"subclass of" : ["chemical element", "alkali metal"]
// etc...
}
Is there an API for this or must I scrape the web page?
The information you want is in your query, except it's hard to decode. For example, this:
"P246": [
{
"id": "q568$E47B8CE7-C91D-484A-9DA4-6153F132997D",
"mainsnak": {
"snaktype": "value",
"property": "P246",
"datatype": "string",
"datavalue": {
"value": "Li",
"type": "string"
}
},
"type": "statement",
"rank": "normal",
"references": …
}
]
means that the “element symbol” (property P246) is “Li”. So, you will need to read all the properties from your query and then find out the name for each of the properties you found.
To get just the statements, you could also use action=wbgetclaims, but it's in the same format as above.

Freebase search_api and excluding results by specified type

is anyone know, how to exclude some topics with specified type(s) using search api and mql?
For example i'm try to find all topics "Voodoo People", and exclude only those, that have composition and release types, and sort result by score desc: http://tinyurl.com/3tjkb7y.
Sorting work perfect, but i can't find functionality for excluding :(
I'm try to use mql_filter: http://tinyurl.com/644xkow, but releases still there.
And one more question: i see in type_strict param possible values: "all", "any", "should". But there is no value "not" or "not in". Is needed result can be obtained in any other way?
The syntax that you're looking for is "optional" : "forbidden". In your query that would look like this:
[{
"search": {
"query": "Voodoo People",
"score": null,
"mql_filter": [{
"type": {
"id": "/music/release",
"optional": "forbidden"
}
}]
},
"name": null,
"id": null,
"type": [],
"/common/topic/notable_for": {
},
"limit": 15,
"sort": "-search.score"
}]​