Text search in aggregation using pymongo - mongodb-query

I have a collection named users; it has the following attributes:
{
    "_id": "937a04d3f516443e87abe8308a1fe83e",
    "username": "andy",
    "full_name": "andy white",
    "image": "https://example.com/xyz.jpg",
    ... etc
}
I want to run a text search on full_name and username using the aggregation pipeline, so that when a user searches for any 3 letters, the most relevant full_name or username results are returned, sorted by relevance.
I have already created a text index on username and full_name, and then I tried the query from the link below:
https://www.mongodb.com/docs/manual/tutorial/text-search-in-aggregation/#return-results-sorted-by-text-search-score
pipeline_stage = [
    {"$match": {"$text": {"$search": "whit"}}},
    {"$sort": {"score": {"$meta": "textScore"}}},
    {"$project": {"username": 1, "full_name": 1, "image": 1}}
]
stages = [*pipeline_stage]
users = users_db.aggregate(stages)
but I am getting the error below:
pymongo.errors.OperationFailure: FieldPath field names may not start with '$'. Consider using $getField or $setField., full error: {'ok': 0.0, 'errmsg': "FieldPath field names may not start with '$'. Consider using $getField or $setField.", 'code': 16410, 'codeName': 'Location16410', '$clusterTime': {'clusterTime': Timestamp(1657811022, 14), 'signature': {'hash': b'a\xb4rem\x02\xc3\xa2P\x93E\nS\x1e\xa6\xaa\xb0\xb1\x85\xb5', 'keyId': 7062773414158663703}}, 'operationTime': Timestamp(1657811022, 14)}
I also tried the link below (my query is also below), but I am getting full-text search results; it is not working for a partial text search:
https://www.mongodb.com/docs/manual/tutorial/text-search-in-aggregation/#match-on-text-score
pipeline_stage = [
    {"$match": {"$text": {"$search": search_key}}},
    {"$project": {"full_name": 1, "score": {"$meta": "textScore"}}},
]
Any help will be appreciated.
Note: I want to do a partial text search, with the most relevant records sorted at the top.
Thanks

Your $project stage is incorrect; it should be:
pipeline_stage = [
    {"$match": {"$text": {"$search": "and"}}},
    {"$sort": {"score": {"$meta": "textScore"}}},
    {"$project": {"username": "$username", "full_name": "$full_name", "image": "$image"}}
]
Also note that if you use an English-language text index, stop words like "and" are not indexed.
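Putting that together in pymongo, a minimal sketch (the connection string, database name, and search term are placeholders; it assumes the text index on username and full_name already exists):
from pymongo import MongoClient

# Placeholder connection and names; adjust to your deployment
client = MongoClient("mongodb://localhost:27017")
users_db = client["mydb"]["users"]

search_key = "white"  # $text matches whole (stemmed) terms, not arbitrary 3-letter fragments
pipeline_stage = [
    {"$match": {"$text": {"$search": search_key}}},
    {"$sort": {"score": {"$meta": "textScore"}}},
    {"$project": {"username": "$username", "full_name": "$full_name", "image": "$image"}},
]

for user in users_db.aggregate(pipeline_stage):
    print(user)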

Related

Google Docs API for creating an invoice containing a table with a variable number of rows

I have a template file for my invoice with a table containing a sample row, but I want to add more rows dynamically based on a given array's size, and write the cell values from the array...
Template's photo
I've been struggling for almost 3 days now.
Is there any easy way to accomplish that?
Here's the template file: Link to the Docs file (template)
And here's a few sample arrays of input data to be replaced in the Template file:
[
[
"Sample item 1s",
"Sample Quantity 1",
"Sample price 1",
"Sample total 1"
],
[
"Sample item 2",
"Sample Quantity 2",
"Sample price 2",
"Sample total 2"
],
[
"Sample item 3",
"Sample Quantity 3",
"Sample price 3",
"Sample total 3"
],
]
Now, the length of the parent array can vary depending on the number of items in the invoice, and that's the only problem that I'm struggling with.
And... yeah, this is a duplicate question; I've found another question on the same topic, but looking at the answers and comments, everyone is saying that they don't understand the question, whereas it looks perfectly clear to me.
Google Docs Invoice template with dynamically items row from Google Sheets
I think the person who asked that question has already given up on it. :(
By the way, I am using the API for PHP (Google API Client Library for PHP), and the code for replacing dummy text in a Google Docs document with the actual data is given below:
public function replaceTexts(array $replacements, string $document_id) {
    # code...
    $req = new Docs\BatchUpdateDocumentRequest();
    // var_dump($replacements);
    // die();
    foreach ($replacements as $replacement) {
        $target = new Docs\SubstringMatchCriteria();
        $target->text = "{{" . $replacement["targetText"] . "}}";
        $target->setMatchCase(false);
        $req->setRequests([
            ...$req->getRequests(),
            new Docs\Request([
                "replaceAllText" => [
                    "replaceText" => $replacement["newText"],
                    "containsText" => $target
                ]
            ]),
        ]);
    }
    return $this->docs_service->documents->batchUpdate(
        $document_id,
        $req
    );
}
A possible solution would be the following:
First, prep the document by removing every row from the table apart from the title row.
Get the full document tree from the Google Docs API.
This is a simple call with the document id:
$doc = $service->documents->get($documentId);
Traverse the document object returned to get to the table, and then find the location of the right cell. This can be done by looping through the elements in the body object until one with the right table field is found. Note that this may not necessarily be the first one, since in your template the section with the {{CustomerName}} placeholder is also a table. So you may have to find a table whose first cell has the text value "Item", as in the sketch below.
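A rough sketch of that lookup (the helper name is made up, and the getter-style calls are assumptions based on the generated PHP client classes, so verify them against your client version):
// Hypothetical helper: find the items table by inspecting the text of its first cell
function findItemsTable(Docs\Document $doc): ?Docs\StructuralElement {
    foreach ($doc->getBody()->getContent() as $element) {
        $table = $element->getTable();
        if ($table === null) {
            continue;
        }
        $firstCell = $table->getTableRows()[0]->getTableCells()[0];
        $text = "";
        foreach ($firstCell->getContent() as $cellElement) {
            $paragraph = $cellElement->getParagraph();
            if ($paragraph === null) {
                continue;
            }
            foreach ($paragraph->getElements() as $pe) {
                if ($pe->getTextRun() !== null) {
                    $text .= $pe->getTextRun()->getContent();
                }
            }
        }
        if (trim($text) === "Item") {
            return $element; // $element->getStartIndex() gives the tableStartLocation index
        }
    }
    return null;
}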
Add a new row to the table. This is done by creating a request with the shape:
[
    'insertTableRow' => [
        'tableCellLocation' => [
            'rowIndex' => 1,
            'columnIndex' => 1,
            'tableStartLocation' => [
                'index' => 177
            ]
        ]
    ]
]
The tableStartLocation->index element is the start index of the table the row is inserted into, i.e. body->content[i]->table->startIndex. Send the request.
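For instance, reusing the batchUpdate pattern from the question (the index value 177 is just the illustrative number from above, and the surrounding class context with $docs_service and $document_id is assumed to match replaceTexts()):
$req = new Docs\BatchUpdateDocumentRequest();
$req->setRequests([
    new Docs\Request([
        'insertTableRow' => [
            'tableCellLocation' => [
                'rowIndex' => 1,
                'columnIndex' => 1,
                'tableStartLocation' => ['index' => 177]
            ]
        ]
    ])
]);
$this->docs_service->documents->batchUpdate($document_id, $req);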
Repeat steps 2 and 3 to get the updated $doc object, and then access the newly created cell i.e. body->content[i]->table->tableRows[j]->tableCells[k]->content->paragraph->elements[l]->startIndex.
Send a request to update the text content of the cell at the location of the startIndex from step 5 above, i.e.
[
    'insertText' => [
        'location' => [
            'index' => 206,
        ],
        'text' => 'item_1'
    ]
]
Repeat step 5 but access the next cell. Note that after each update you need to fetch an updated version of the document object because the indexes change after inserts.
To be honest, this approach is pretty cumbersome, and it's probably more efficient to insert all the data into a spreadsheet and then embed the spreadsheet into your Docs document. Information on that can be found here: How to insert an embedded sheet via Google Docs API?
As a final note, I created a copy of your template and used the "Try this method" feature in the API documentation to validate my approach so some of the PHP syntax may be a bit off, but I hope you get the general idea.

How to make a Dynamic/Optional Filter (parameters) in a MongoDB query (Jasper Studio)

I'm creating a web application and it's working perfectly, but at the end the user needs to create a report from its data.
On the report page I created some text boxes where users will type values for filtering. Those text boxes could be empty, in which case I need to return everything from the DB, or some parameters could be filled. Keep in mind that I need to pass the text box contents as parameters to JasperServer, where they will be used in the query.
An example of input data is:
txtName = empty (null),
txtCity = 'Belo Horizonte'
It should generate a report with all records of people who live in Belo Horizonte, no matter the name.
I made it in SQL and it works perfectly. Then I tried to use the same logic in Mongo but it doesn't work. I have tried $lt, $gt, $lte, $gte, $exists, $ne and a bunch of other operators, and I was not able to make it work properly.
SQL:
select * from myfirstreports
where ($P{city} is null or cidade =$P{city})
AND ($P{name} is null or nome =$P{name})
Mongo:
{
'collectionName' : 'myfirstreports',
'findFields' :
{
'nome': 1, 'numeros': 1, 'vulgo': 1, 'cidade': 1,
'usuResponsavelCadastro': 1, 'created_at': 1
},
findQuery :
{
$and: [
{$or:[{ $P{city}: {$eq: null}}, {'cidade': $P{city}}]},
{ $or:[{$P{name}: {'$eq': null}}, {'nome': $P{name}}]}
]
}
}
I used the following expressions:
$P{city}.equals(null)? "{ }" : "{'cidade': '$P!{city}'}"//Need to create a non prompting parameter
$P{name}.equals(null)? "{ }": "{'nome': '$P!{name}'}"
The $P!{...} parameters allow me to build the query as a string and pass it to the JasperSoft report.
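For example, with txtName empty and txtCity = 'Belo Horizonte', the two expressions above evaluate to "{ }" and "{'cidade': 'Belo Horizonte'}", so the query that actually reaches Mongo would look roughly like this (a sketch, keeping the structure from the question):
{
    'collectionName' : 'myfirstreports',
    'findFields' :
    {
        'nome': 1, 'numeros': 1, 'vulgo': 1, 'cidade': 1,
        'usuResponsavelCadastro': 1, 'created_at': 1
    },
    findQuery :
    {
        $and: [
            { },
            { 'cidade': 'Belo Horizonte' }
        ]
    }
}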

Cloudant Search field was indexed without position data; cannot run PhraseQuery

I have the following Cloudant search index
"indexes": {
"search-cloud": {
"analyzer": "standard",
"index": "function(doc) {
if (doc.name) {
index("keywords", doc.name);
index("name", doc.name, {
"store": true,
"index": false
});
}
if (doc.type === "file" && doc.keywords) {
index("keywords", doc.keywords);
}
}"
}
}
For some reason when I search for specific phrases, I get an error:
Search failed: field "keywords" was indexed without position data; cannot run PhraseQuery (term=FIRSTWORD)
So if I search for FIRSTWORD SECONDWORD, it looks like I am getting an error on the first word.
NOTE: This does not happen to every search phrase I do.
Does anyone know why this would be happening?
doc.name and doc.keywords are just strings.
doc.name is usually something like "2004/04/14 John Doe 1234 Document Folder"
doc.keywords is usually something random like "testing this again"
And the reason why I am storing name and keywords under the keywords index is that I want anyone to be able to search keywords or name by just typing one string value. Let me know if this is not the best practice.
Likely the problem is that some of your documents contain a keywords field with string values, while other documents contain a keywords field with a different type, probably an array. I believe that this scenario would result in the error that you received. Can you double check that all of the values for your keywords fields are, in fact, strings?
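One way to confirm and avoid this is to only index string values (or to index array elements individually). A sketch of a guarded index function, assuming the same document shapes as in the question:
"index": "function(doc) {
    if (doc.name) {
        index("keywords", doc.name);
        index("name", doc.name, { "store": true, "index": false });
    }
    if (doc.type === "file" && doc.keywords) {
        if (typeof doc.keywords === "string") {
            index("keywords", doc.keywords);
        } else if (Array.isArray(doc.keywords)) {
            doc.keywords.forEach(function(kw) {
                if (typeof kw === "string") { index("keywords", kw); }
            });
        }
    }
}"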

Fuzzy Like This on Attachment Returns Nothing on Partial Word

I have my mapping like this:
{
"doc": {
"mappings": {
"mydocument": {
"properties": {
"file": {
"type": "attachment",
"path": "full",
"fields": {
"file": {
"type": "string",
"store": true,
"term_vector": "with_positions_offsets"
},
"author": {
...
When I search for a complete word I get the result:
"query": {
"fuzzy_like_this" : {
"fields" : ["file"],
"like_text" : "This_is_something_I_want_to_search_for",
"max_query_terms" : 12
}
},
"highlight" : {
"number_of_fragments" : 3,
"fragment_size" : 650,
"fields" : {
"file" : { }
}
}
But if I change the search term to "This_is_something_I_want" I get nothing. What am I missing?
To implement a partial match, we must first understand what fuzzy_like_this does, and then decide what you want partial matching to return. fuzzy_like_this performs 2 key functions.
First, the like_text will be analyzed using the default analyzer. All the resulting tokens will then be used to find documents based on term frequency, or tf-idf.
This typically means that the input term will be split on spaces and lowercased. This_is_something_I_want will therefore be tokenized to this_is_something_i_want. Unless you have files with this exact term, no documents will match.
Secondly, all terms will be fuzzified. Fuzzy searches score terms based on how many character changes need to be made to a word to match another word. For instance, to get from bat to hat we need to make 1 character change.
In our case, to get from this_is_something_i_want to this_is_something_i_want_to_search_for, we would need to make 14 character changes (adding _to_search_for). Standard fuzzy search only allows for 3 character changes when working with terms longer than 5 or 6 characters. Increasing the fuzzy limit to 14 would, however, produce severely skewed results.
So neither of these functions will help produce the results you seek.
Here is what I can suggest:
You can implement an analyzer that splits on underscores, similar to this. The tokens produced will then be ['this', 'is', 'something', 'i', 'want'], which can correctly be matched against the sample case.
Alternatively, if all you want is a document that starts with the specified text, you can use a phrase prefix query instead of fuzzy_like_this; a sketch follows below. Documentation here
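A minimal sketch of that phrase prefix query (the field name comes from the mapping above; max_expansions is an assumed tuning value, and the exact syntax may differ between Elasticsearch versions):
"query": {
    "match_phrase_prefix": {
        "file": {
            "query": "This_is_something_I_want",
            "max_expansions": 10
        }
    }
}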

How to prevent Facet Terms from tokenizing

I am using Facet Terms to get all the unique values and their count for a field. And I am getting wrong results.
term: web
Count: 1191979
term: misc
Count: 1191979
term: passwd
Count: 1191979
term: etc
Count: 1191979
While the actual result should be:
term: WEB-MISC /etc/passwd
Count: 1191979
Here is my sample query:
{
"facets": {
"terms1": {
"terms": {
"field": "message"
}
}
}
}
If reindexing is an option, it would be best to change the mapping and mark this field as not_analyzed:
"your_field" : { "type": "string", "index" : "not_analyzed" }
You can use multi field type if keeping an analyzed version of the field is desired:
"your_field" : {
"type" : "multi_field",
"fields" : {
"your_field" : {"type" : "string", "index" : "analyzed"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
}
This way, you can continue using your_field in the queries, while running facet searches using your_field.untouched.
Alternatively, if this field is stored, you can use a script field facet instead:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_fields.your_field.value"
}
}
}
As the last resort, if this field is not stored, but record source is stored in the index, you can try this:
"facets" : {
"term" : {
"terms" : {
"script_field" : "_source.your_field"
}
}
}
The first solution is the most efficient. The last solution is the least efficient and may take a lot of time on a large index.
I also hit this same issue today while doing a terms aggregation in a recent Elasticsearch version. After some googling and partial understanding, I found out how this indexing works (it is actually quite simple).
Queries can find only terms that actually exist in the inverted index
When you index the following string
"WEB-MISC /etc/passwd"
it will be passed to an analyzer. The analyzer might tokenize it into
"WEB", "MISC", "etc" and "passwd"
with position details. These tokens might then be filtered to lowercase, such as
"web", "misc", "etc" and "passwd"
So, after indexing, the search query can only see the above 4 tokens, not the complete string "WEB-MISC /etc/passwd". For your requirement, the following are the options you can use:
1. Change the default analyzer used by Elasticsearch.
2. If analysis is not needed, just turn off the analyzer by setting 'not_analyzed' for the fields you need.
3. To make already-indexed data searchable this way, re-indexing is the only option.
I have briefly explained this problem and proposed two solutions here.
I have talked about multiple approaches here.
One is the use of not_analyzed to preserve the string as it is. But since that has the drawback of being case sensitive, a better approach is to use the keyword tokenizer + lowercase filter, as sketched below.
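A sketch of such an analyzer (the analyzer name is made up, and the mapping follows the older string-type syntax used in the first answer; adjust it for your Elasticsearch version):
"settings": {
    "analysis": {
        "analyzer": {
            "lowercase_keyword": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"]
            }
        }
    }
},
"mappings": {
    "your_type": {
        "properties": {
            "message": {
                "type": "string",
                "analyzer": "lowercase_keyword"
            }
        }
    }
}
With this in place, a terms facet or aggregation on message returns the whole lowercased value, e.g. web-misc /etc/passwd, instead of individual tokens.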