MongoDB using $and with slice and a match - mongodb-query

I'm using @Query from the Spring Data package and I want to query on the last element of an array in a document.
For example the data structure could be like this:
{
name : 'John',
scores: [10, 12, 14, 16]
},
{
name : 'Mary',
scores: [78, 20, 14]
},
So I've built a query, however it is complaining that "error message 'unknown operator: $slice' on server"
The $slice part of the query, when run separately, is fine:
db.getCollection('users').find({}, {scores: { $slice: -1 }})
However as soon as I combine it with a more complex check, it gives the error as mentioned.
db.getCollection('users').find({"$and":[{ } , {"scores" : { "$slice" : -1}} ,{"scores": "16"}]})
This query would return the list of users who had a last score of 16, in my example John would be returned but not Mary.
I've put it into a standard mongo query (to debug things); however, ideally I need it to go into a Spring Data @Query construct. The two should be fairly similar.
Is there any way of doing this without resorting to hand-cranked Java calls? I don't see much documentation for @Query, other than that it takes standard queries.
The post linked in the comments refers to aggregate; how does that work with @Query? Also, one of the main answers there uses $where, which is inefficient.

The general way forward with this problem is, unfortunately, to change the data. Although @Veeram's response is correct, it means that you do not hit indexes. This is an issue when you have very large data sets, because response times keep getting worse as the data grows. It's something $where and $arrayElemAt cannot help you with: they have to pre-process the data, and that means a full collection scan. We analysed several queries with these constructs and they all involved a "COLLSCAN".
The solution is ideally to create a field that contains the last item, for instance:
{
name : 'John',
scores: [10, 12, 14, 16],
lastScore: 16
},
{
name : 'Mary',
scores: [78, 20, 14],
lastScore: 14
}
You could create a listener to maintain this as follows:
@Component
public class ScoreListener extends AbstractMongoEventListener<Scores>
You then get the ability to sniff the data and make any updates:
@Override
public void onBeforeConvert(BeforeConvertEvent<Scores> event) {
// process any score and set lastScore
}
Don't forget to update your indexes (!):
@CompoundIndex(name = "lastScore", def = "{"
+ "'lastScore': 1"
+ " }")
Although this has the disadvantage of slightly duplicating data, in our current Mongo version (3.4) it really is the only way of doing this while still using indexes in the search. The speed difference was dramatic: from nearly a minute of response time down to milliseconds.
In Mongo 3.6 there may be better ways of doing this; however, we are fixed on 3.4, so this has to be our solution.
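To make the denormalisation concrete, here is a minimal sketch in plain Python, using dicts in place of real documents and no database driver. The helper name with_last_score is hypothetical; it only illustrates what the listener would compute before saving, and why the lookup then becomes a simple equality match on one field.

```python
from typing import Any


def with_last_score(doc: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of doc whose lastScore mirrors the final element of scores."""
    scores = doc.get("scores") or []
    updated = dict(doc)
    # An empty or missing array has no last element, so store None.
    updated["lastScore"] = scores[-1] if scores else None
    return updated


users = [
    {"name": "John", "scores": [10, 12, 14, 16]},
    {"name": "Mary", "scores": [78, 20, 14]},
]
users = [with_last_score(u) for u in users]

# Stand-in for the filter {"lastScore": 16}: a plain equality match on one field.
matches = [u["name"] for u in users if u["lastScore"] == 16]
print(matches)  # ['John']
```

Because the match is now against a scalar field, the server can answer it from an index on lastScore instead of inspecting every array.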

Related

Solr: indexing nested JSON files + some fields independent of UniqueKey (need new core?)

I am working on an NLP project and I have a large amount of text data to index with Solr. I have already created an initial index (Solr core) with the fields title, authors, publication date, and abstract. There is an ID that is unique to each article (PMID). Since then, I have extracted more information from the dataset and I am stuck on how to incorporate this new information into the existing index. I don't know how to approach the problem and I would appreciate suggestions.
The new information is currently stored in JSON files that look like this:
{id: {entity: [[33, 39, 0, subj], [103, 115, 1, obj], ...],
another_entity: [[88, 95, 0, subj], [444, 449, 1, obj], ...],
...},
another id,
...}
where the integers are the character span and the index of the sentence the entity appears in.
Is there a way to have something like subfields in Solr? Since the id is the same as the unique key in the main index I was thinking of adding a field entities, but then this field would need to have its own subfields start character, end character, sentence index, dependency tag. I have come across Nested Child Documents and I am considering changing the structure of the extracted information to:
{id: {entity: [{start:33, end:39, sent_idx:0, dep_tag:'subj'},
{start:103, end:115, sent_idx:1, dep_tag:'obj'}, ...],
another_entity: [{}, {}, ...],
...},
another id,
...}
Having keys for the nested values, I should be able to use the methods linked above, though I am still unsure whether I am on the right track here. Is there a better way to approach this? All fields should be searchable. I am familiar with Python, and so far I have been using the subprocess library to post documents to Solr from a Python script:
sp.Popen(f"./post -c {core_name} {json_path}", shell=True, cwd=SOLR_BIN_DIR)
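The restructuring described above, from positional lists to keyed objects, could be sketched as follows. This only reshapes the JSON; it does not produce any particular Solr nested-document layout, and the PMID "12345" and the entity name "aspirin" are made-up sample values.

```python
import json

# Field names taken from the proposed structure in the question.
KEYS = ("start", "end", "sent_idx", "dep_tag")


def to_keyed(extracted):
    """Convert {pmid: {entity: [[start, end, sent_idx, dep_tag], ...]}} to keyed objects."""
    return {
        pmid: {
            entity: [dict(zip(KEYS, mention)) for mention in mentions]
            for entity, mentions in entities.items()
        }
        for pmid, entities in extracted.items()
    }


# Hypothetical sample in the original positional format.
sample = {"12345": {"aspirin": [[33, 39, 0, "subj"], [103, 115, 1, "obj"]]}}
keyed = to_keyed(sample)
print(json.dumps(keyed, indent=2))
```

Keeping the key names in one tuple means a later change to the field set only touches one line.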
Additionally, I want to index some information that is not linked to a specific PMID (does not have the same unique key), so I assume I need to create a new Solr core for it? Does it mean I have to switch to SolrCloud mode? So far I have been using a simple, single core.
Example of such information (abbreviations and the respective long form - also stored in a JSON file):
{"IEOP": "immunoelectroosmophoresis",
"ELISA": "enzyme-linked immunosorbent assay",
"GAGs": "glycosaminoglycans",
...}
I would appreciate any input - thank you!
S.

Return inserted id with TypeORM & NestJS raw query: await connection.manager.query(`INSERT INTO

I'm looking to return the id or better yet, all information that was inserted, using a raw query with TypeORM and NestJS. Example as follows:
await connection.manager.query(`INSERT INTO...`)
When assigning the query to a constant and console logging it below, it does not yield any helpful information:
OkPacket {
fieldCount: 0,
affectedRows: 1,
insertId: 0,
serverStatus: 2,
warningCount: 1,
message: '',
protocol41: true,
changedRows: 0
}
As you can see, it returns no pertinent information, the insertId above is obviously incorrect, and it returns this every time, regardless of the actual parameters of the query.
I know with more typical TypeORM queries you can use .returning(['name_of_column_you_want_returned']).execute()
and it will return the relevant information just fine. Is there any way to do this with a raw query? Thank you!
tl;dr You're getting the raw MariaDB driver response (OkPacket) from the INSERT command, and you'd need a separate SELECT query to see the data.
You're using the TypeORM EntityManager, and the docs don't mention a return value. Looking at the source code for query, the return type is any. Since it's a raw query, it probably returns an object based on the type of database you're using rather than having a standard format.
In this case, you're using MariaDB, which returned an OkPacket. Here's the documentation:
https://mariadb.com/kb/en/ok_packet/
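The insert-then-select pattern can be illustrated outside TypeORM. The sketch below uses Python's built-in sqlite3 module as a stand-in for MariaDB, with a hypothetical users table; it shows the shape of the approach, not the TypeORM or MariaDB driver API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets rows be converted to dicts
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
)

# Step 1: the raw INSERT only reports metadata, such as the generated key.
cur = conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
new_id = cur.lastrowid

# Step 2: a follow-up SELECT fetches the row that was actually stored.
row = conn.execute("SELECT id, name FROM users WHERE id = ?", (new_id,)).fetchone()
inserted = dict(row)
print(inserted)  # {'id': 1, 'name': 'alice'}
```

The second query is what returns the stored values, including any defaults the database filled in.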

Slick plain sql query with pagination

I have something like this, using Akka, Alpakka + Slick
Slick
.source(
sql"""select #${onlyTheseColumns.mkString(",")} from #${dbSource.table}"""
.as[Map[String, String]]
.withStatementParameters(rsType = ResultSetType.ForwardOnly, rsConcurrency = ResultSetConcurrency.ReadOnly, fetchSize = batchSize)
.transactionally
).map( doSomething )...
I want to update this plain SQL query so that it skips the first N rows.
But that is very DB specific.
Is it possible to get the pagination bit generated by Slick? [For type-safe queries one just does a drop, filter, take.]
PS: I don't have the schema, so I cannot go the type-safe way; I just want all tables as Map and to filter, drop, etc. on them.
PS2: At the Akka level, flow.drop works, but it's slow because it still consumes the rows.
Cheers
Since you are using plain SQL, you have to provide workable SQL in the code snippet yourself. Plain SQL may not be type-safe, but it is flexible.
BTW, the most efficient way is to have the database skip the first N rows, for example with limit in MySQL.
Depending on your database engine, you could use something like
val page = 1
val pageSize = 10
val query = sql"""
select #${onlyTheseColumns.mkString(",")}
from #${dbSource.table}
limit #${pageSize + 1}
offset #${pageSize * (page - 1)}
"""
The pageSize + 1 part tells you whether a next page exists: if more than pageSize rows come back, there is one.
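The same "fetch one extra row" idea can be tried in isolation. This is a minimal sketch using Python's sqlite3 module and a hypothetical items table, assuming an engine that accepts LIMIT ... OFFSET; the extra row is used only as a signal and is not returned to the caller.

```python
import sqlite3


def fetch_page(conn, page, page_size):
    """Return (rows, has_next) for a 1-based page number."""
    offset = page_size * (page - 1)
    rows = conn.execute(
        "SELECT n FROM items ORDER BY n LIMIT ? OFFSET ?",
        (page_size + 1, offset),  # ask for one row more than needed
    ).fetchall()
    has_next = len(rows) > page_size
    return [r[0] for r in rows[:page_size]], has_next


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (n INTEGER)")
conn.executemany("INSERT INTO items (n) VALUES (?)", [(i,) for i in range(1, 26)])

first = fetch_page(conn, 1, 10)
last = fetch_page(conn, 3, 10)
print(first)  # ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], True)
print(last)   # ([21, 22, 23, 24, 25], False)
```

An ORDER BY is included because OFFSET-based paging is only stable when the row order is deterministic.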
I want to update this plain sql query with skipping the first N-th element. But that is very DB specific.
As you're concerned about changing the SQL for different databases, I suggest you abstract away that part of the SQL and decide what to do based on the Slick profile being used.
If you are working with multiple database products, you've probably already abstracted away from any specific profile, perhaps using JdbcProfile. In that case you could place your "skip N elements" helper in a class and use the active slickProfile to decide on the SQL to use. (As an alternative, you could of course check via some other means, such as an environment variable you set.)
In practice that could be something like this:
case class Paginate(profile: slick.jdbc.JdbcProfile) {
// Return the correct LIMIT/OFFSET SQL for the current Slick profile
def page(size: Int, firstRow: Int): String =
if (profile.isInstanceOf[slick.jdbc.H2Profile]) {
s"LIMIT $size OFFSET $firstRow"
} else if (profile.isInstanceOf[slick.jdbc.MySQLProfile]) {
s"LIMIT $firstRow, $size"
} else {
// And so on... or a default
// Danger: I've no idea if the above SQL is correct - it's just placeholder
???
}
}
Which you could use as:
// Import your profile
import slick.jdbc.H2Profile.api._
val paginate = Paginate(slickProfile)
val action: DBIO[Seq[Int]] =
sql""" SELECT cols FROM table #${paginate.page(100, 10)}""".as[Int]
In this way, you get to isolate (and control) RDBMS-specific SQL in one place.
To make the helper more usable, and as slickProfile is implicit, you could instead write:
def page(size: Int, firstRow: Int)(implicit profile: slick.jdbc.JdbcProfile) =
// Logic for deciding on SQL goes here
I feel obliged to comment that using a splice (#$) in plain SQL opens you to SQL injection attacks if any of the values are provided by a user.
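One common mitigation for spliced identifiers is to validate them before they reach the SQL string. The sketch below shows the idea in plain Python; the function name safe_select and the allow-list of tables are hypothetical, and this is not part of Slick.

```python
import re

# Conservative pattern for a bare SQL identifier (no quoting, no dots).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_select(table, columns, allowed_tables):
    """Build a SELECT only from a known table and well-formed column names."""
    if table not in allowed_tables:
        raise ValueError(f"table not allowed: {table!r}")
    for column in columns:
        if not IDENTIFIER.fullmatch(column):
            raise ValueError(f"invalid column name: {column!r}")
    return f"select {','.join(columns)} from {table}"


print(safe_select("users", ["id", "name"], {"users"}))  # select id,name from users

try:
    safe_select("users", ["id; drop table users"], {"users"})
except ValueError as err:
    print(err)
```

Values, as opposed to identifiers, are better passed as bind parameters than checked this way.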

How to get parents by IDs when using polymorphic associations

I have a many-to-many relation table site_sections with the following columns:
id
site_id
section_id
which is used as a join table between the sections and sites tables. So one site has many sections and a section is available in many sites.
Sites table has the following columns:
id
store_number
The site_sections table is used in a polymorphic association with the parameters table.
I'd like to find all the parameters corresponding to the site sections for a specific site by its store_number. Is it possible to pass in an array of site_sections.id to SQL using the IN clause, something like this:
Parameter.where("parent_id IN (" + [1, 2, 3, 4] + ") and parent_type ='com.models.SiteSection'");
where [1, 2, 3, 4] should be an array of IDs from the site_sections table, or is there a better solution?
Your solution is correct:
Site aSite = Site.findFirst("store_number=?", STORE_NUMBER);
List<SiteSection> siteSections= SiteSection.where("site_id=?", aSite.getId()).include(Parameter.class);
for (SiteSection siteSection : siteSections) {
List<Parameter> siteParams = siteSection.getAll(Parameter.class);
for (Parameter siteParam : siteParams) { ... }
}
In addition, by using the include(), you are also avoiding an N+1 problem: http://javalite.io/lazy_and_eager#improve-efficiency-with-eager-loading
However there can be a catch. If you have a very large number of parameters, you will be using a lot of memory, since include() loads all results into heap at once. If your result sets are relatively small, you are saving resources by running a single query. If your result sets are large, you are wasting heap space.
See docs: LazyList#include()
Side note: use aSite.getId() or aSite.getLongId() instead of aSite.get("id")
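For the IN-clause variant the question asks about, the usual approach is to generate one placeholder per ID rather than concatenating the array into the string. The sketch below shows that in Python with sqlite3 rather than ActiveJDBC; the parent_id and parent_type columns come from the question, while the id column and the sample rows are assumptions.

```python
import sqlite3


def parameters_for(conn, parent_ids, parent_type):
    """Fetch parameters whose parent is one of parent_ids, using bound values."""
    if not parent_ids:
        return []  # "IN ()" is not valid SQL, so short-circuit
    placeholders = ", ".join("?" for _ in parent_ids)
    sql = (
        "SELECT id, parent_id FROM parameters "
        f"WHERE parent_id IN ({placeholders}) AND parent_type = ? ORDER BY id"
    )
    return conn.execute(sql, [*parent_ids, parent_type]).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parameters (id INTEGER PRIMARY KEY, parent_id INTEGER, parent_type TEXT)"
)
conn.executemany(
    "INSERT INTO parameters VALUES (?, ?, ?)",
    [
        (1, 1, "com.models.SiteSection"),
        (2, 2, "com.models.SiteSection"),
        (3, 9, "com.models.SiteSection"),
        (4, 1, "com.models.Other"),
    ],
)

found = parameters_for(conn, [1, 2, 3, 4], "com.models.SiteSection")
print(found)  # [(1, 1), (2, 2)]
```

Only the placeholder list is built dynamically; the IDs themselves are always passed as bound values.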

BigQuery UDF Internal Error

We have a simple UDF in BigQuery that keeps failing with the following error:
Query Failed
Error: An internal error occurred and the request could not be completed.
The query was simply trying to use UDF to perform a SHA256.
SELECT
input AS title,
input_sha256 AS title_sha256
FROM
SHA256(
SELECT
title AS input
FROM
[bigquery-public-data:hacker_news.stories]
GROUP BY
input
)
LIMIT
1000
The in-line UDF is pasted below. However, I cannot post the full UDF because Stack Overflow complains about too much code in the post. The full UDF can be seen in this gist.
function sha256(row, emit) {
emit(
{
input: row.input,
input_sha256: CryptoJS.SHA256(row.input).toString(CryptoJS.enc.Hex)
}
);
}
bigquery.defineFunction(
'SHA256', // Name of the function exported to SQL
['input'], // Names of input columns
[
{'name': 'input', 'type': 'string'},
{'name': 'input_sha256', 'type': 'string'}
],
sha256 // Reference to JavaScript UDF
);
Not sure if it helps, but the Job-ID is
bigquery:bquijob_7fd3b51c_153c058dc7c
Looks like there is a similar issue at:
https://code.google.com/p/google-bigquery/issues/detail?id=478
Short answer - this is an issue related to memory allocation that I uncovered via my own testing and fixed today, but it will take a little while to flow out to production.
Slightly longer answer - we just rolled out a fix today for an issue where users were hitting "out of memory" errors when scaling up their UDFs over a larger number of rows, even though the UDF would succeed on smaller numbers of rows. The queries that were hitting that condition are now running fine on our internal / test trees. However, since public BigQuery hosts have much higher traffic loads, the JavaScript engine that executes the UDFs (V8) behaves somewhat differently in production than it does in internal trees. Specifically, there's a new memory allocation error that some of the previously OOMing jobs are now hitting, which we couldn't observe until the queries ran on a fully-loaded tree.
It's a minor error with a quick fix, but we'd ideally let it flow through our regular testing and QA cycle. This should put the fix in production in about a week, assuming nothing else goes wrong with the candidate. Would that be acceptable for you?
I am re-using the answer box to provide the full query text. It works if you uncomment LIMIT 40.
SELECT input, input_sha256 FROM JS(
(
SELECT title AS input
FROM [bigquery-public-data:hacker_news.stories]
GROUP BY input
//LIMIT 40
),
input,
"[ {'name': 'input', 'type': 'string'}, {'name': 'input_sha256', 'type': 'string'} ] ",
"function(row, emit) {
var CryptoJS=CryptoJS||function(h,s){var f={},g=f.lib={},q=function(){},m=g.Base={extend:function(a){q.prototype=this;var c=new q;a&&c.mixIn(a);c.hasOwnProperty('init')||(c.init=function(){c.$super.init.apply(this,arguments)});c.init.prototype=c;c.$super=this;return c},create:function(){var a=this.extend();a.init.apply(a,arguments);return a},init:function(){},mixIn:function(a){for(var c in a)a.hasOwnProperty(c)&&(this[c]=a[c]);a.hasOwnProperty('toString')&&(this.toString=a.toString)},clone:function(){return this.init.prototype.extend(this)}}, r=g.WordArray=m.extend({init:function(a,c){a=this.words=a||[];this.sigBytes=c!=s?c:4*a.length},toString:function(a){return(a||k).stringify(this)},concat:function(a){var c=this.words,d=a.words,b=this.sigBytes;a=a.sigBytes;this.clamp();if(b%4)for(var e=0;e<a;e++)c[b+e>>>2]|=(d[e>>>2]>>>24-8*(e%4)&255)<<24-8*((b+e)%4);else if(65535<d.length)for(e=0;e<a;e+=4)c[b+e>>>2]=d[e>>>2];else c.push.apply(c,d);this.sigBytes+=a;return this},clamp:function(){var a=this.words,c=this.sigBytes;a[c>>>2]&=4294967295<< 32-8*(c%4);a.length=h.ceil(c/4)},clone:function(){var a=m.clone.call(this);a.words=this.words.slice(0);return a},random:function(a){for(var c=[],d=0;d<a;d+=4)c.push(4294967296*h.random()|0);return new r.init(c,a)}}),l=f.enc={},k=l.Hex={stringify:function(a){var c=a.words;a=a.sigBytes;for(var d=[],b=0;b<a;b++){var e=c[b>>>2]>>>24-8*(b%4)&255;d.push((e>>>4).toString(16));d.push((e&15).toString(16))}return d.join('')},parse:function(a){for(var c=a.length,d=[],b=0;b<c;b+=2)d[b>>>3]|=parseInt(a.substr(b, 2),16)<<24-4*(b%8);return new r.init(d,c/2)}},n=l.Latin1={stringify:function(a){var c=a.words;a=a.sigBytes;for(var d=[],b=0;b<a;b++)d.push(String.fromCharCode(c[b>>>2]>>>24-8*(b%4)&255));return d.join('')},parse:function(a){for(var c=a.length,d=[],b=0;b<c;b++)d[b>>>2]|=(a.charCodeAt(b)&255)<<24-8*(b%4);return new r.init(d,c)}},j=l.Utf8={stringify:function(a){try{return decodeURIComponent(escape(n.stringify(a)))}catch(c){throw 
Error('Malformed UTF-8 data');}},parse:function(a){return n.parse(unescape(encodeURIComponent(a)))}}, u=g.BufferedBlockAlgorithm=m.extend({reset:function(){this._data=new r.init;this._nDataBytes=0},_append:function(a){'string'==typeof a&&(a=j.parse(a));this._data.concat(a);this._nDataBytes+=a.sigBytes},_process:function(a){var c=this._data,d=c.words,b=c.sigBytes,e=this.blockSize,f=b/(4*e),f=a?h.ceil(f):h.max((f|0)-this._minBufferSize,0);a=f*e;b=h.min(4*a,b);if(a){for(var g=0;g<a;g+=e)this._doProcessBlock(d,g);g=d.splice(0,a);c.sigBytes-=b}return new r.init(g,b)},clone:function(){var a=m.clone.call(this); a._data=this._data.clone();return a},_minBufferSize:0});g.Hasher=u.extend({cfg:m.extend(),init:function(a){this.cfg=this.cfg.extend(a);this.reset()},reset:function(){u.reset.call(this);this._doReset()},update:function(a){this._append(a);this._process();return this},finalize:function(a){a&&this._append(a);return this._doFinalize()},blockSize:16,_createHelper:function(a){return function(c,d){return(new a.init(d)).finalize(c)}},_createHmacHelper:function(a){return function(c,d){return(new t.HMAC.init(a, d)).finalize(c)}}});var t=f.algo={};return f}(Math);
(function(h){for(var s=CryptoJS,f=s.lib,g=f.WordArray,q=f.Hasher,f=s.algo,m=[],r=[],l=function(a){return 4294967296*(a-(a|0))|0},k=2,n=0;64>n;){var j;a:{j=k;for(var u=h.sqrt(j),t=2;t<=u;t++)if(!(j%t)){j=!1;break a}j=!0}j&&(8>n&&(m[n]=l(h.pow(k,0.5))),r[n]=l(h.pow(k,1/3)),n++);k++}var a=[],f=f.SHA256=q.extend({_doReset:function(){this._hash=new g.init(m.slice(0))},_doProcessBlock:function(c,d){for(var b=this._hash.words,e=b[0],f=b[1],g=b[2],j=b[3],h=b[4],m=b[5],n=b[6],q=b[7],p=0;64>p;p++){if(16>p)a[p]= c[d+p]|0;else{var k=a[p-15],l=a[p-2];a[p]=((k<<25|k>>>7)^(k<<14|k>>>18)^k>>>3)+a[p-7]+((l<<15|l>>>17)^(l<<13|l>>>19)^l>>>10)+a[p-16]}k=q+((h<<26|h>>>6)^(h<<21|h>>>11)^(h<<7|h>>>25))+(h&m^~h&n)+r[p]+a[p];l=((e<<30|e>>>2)^(e<<19|e>>>13)^(e<<10|e>>>22))+(e&f^e&g^f&g);q=n;n=m;m=h;h=j+k|0;j=g;g=f;f=e;e=k+l|0}b[0]=b[0]+e|0;b[1]=b[1]+f|0;b[2]=b[2]+g|0;b[3]=b[3]+j|0;b[4]=b[4]+h|0;b[5]=b[5]+m|0;b[6]=b[6]+n|0;b[7]=b[7]+q|0},_doFinalize:function(){var a=this._data,d=a.words,b=8*this._nDataBytes,e=8*a.sigBytes; d[e>>>5]|=128<<24-e%32;d[(e+64>>>9<<4)+14]=h.floor(b/4294967296);d[(e+64>>>9<<4)+15]=b;a.sigBytes=4*d.length;this._process();return this._hash},clone:function(){var a=q.clone.call(this);a._hash=this._hash.clone();return a}});s.SHA256=q._createHelper(f);s.HmacSHA256=q._createHmacHelper(f)})(Math);
(function(){var h=CryptoJS,j=h.lib.WordArray;h.enc.Base64={stringify:function(b){var e=b.words,f=b.sigBytes,c=this._map;b.clamp();b=[];for(var a=0;a<f;a+=3)for(var d=(e[a>>>2]>>>24-8*(a%4)&255)<<16|(e[a+1>>>2]>>>24-8*((a+1)%4)&255)<<8|e[a+2>>>2]>>>24-8*((a+2)%4)&255,g=0;4>g&&a+0.75*g<f;g++)b.push(c.charAt(d>>>6*(3-g)&63));if(e=c.charAt(64))for(;b.length%4;)b.push(e);return b.join('')},parse:function(b){var e=b.length,f=this._map,c=f.charAt(64);c&&(c=b.indexOf(c),-1!=c&&(e=c));for(var c=[],a=0,d=0;d< e;d++)if(d%4){var g=f.indexOf(b.charAt(d-1))<<2*(d%4),h=f.indexOf(b.charAt(d))>>>6-2*(d%4);c[a>>>2]|=(g|h)<<24-8*(a%4);a++}return j.create(c,a)},_map:'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/='}})();
emit( { input: row.input, input_sha256: CryptoJS.SHA256(row.input).toString(CryptoJS.enc.Hex) } );
}"
)
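To check the UDF's output independently of BigQuery, the expected digests can be computed locally. This is a small sketch using Python's hashlib; it assumes the input string is hashed as UTF-8 bytes, which is what the inlined CryptoJS code does for string arguments. The helper name sha256_hex is hypothetical.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 of the UTF-8 bytes of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest = sha256_hex("abc")
print(digest)  # ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

Comparing a handful of titles this way confirms whether the UDF returns correct hashes on the small inputs where it does succeed.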