CosmosDB: Is it a good practice to use ORDER BY on the property that is also used in a range filter?

When I ran the query below in Cosmos DB Explorer on the Azure portal, several hundred RUs were consumed according to Query Stats.
select * from c where c.name = "john" and c._ts > 0
But after I added order by c._ts to the query above, only roughly 20 RUs were consumed.
According to this similar question, this behavior is expected.
(But I don't really understand why the range filter alone is not enough to avoid looking at unnecessary index entries.)
So is it good practice to use ORDER BY on a property that is also used in a range filter?

There is no guarantee that an ORDER BY query will use a range index, although it normally does.
The best way to ensure a good index hit, and thus consistently lower RU consumption, is to use a composite index like the one below. Adjust the other properties as needed, but note the /_ts entry in there as well.
This information can be found in the documentation here
{
    "automatic": true,
    "indexingMode": "consistent",
    "includedPaths": [
        {
            "path": "/*"
        }
    ],
    "excludedPaths": [],
    "compositeIndexes": [
        [
            {
                "path": "/foodGroup",
                "order": "ascending"
            },
            {
                "path": "/_ts",
                "order": "ascending"
            }
        ]
    ]
}
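For the query in the question, the analogous index would pair the equality property with /_ts. As a sketch (assuming the question's property name, so adjust for your container), the compositeIndexes section would become:

"compositeIndexes": [
    [
        {
            "path": "/name",
            "order": "ascending"
        },
        {
            "path": "/_ts",
            "order": "ascending"
        }
    ]
]

With that in place, select * from c where c.name = "john" and c._ts > 0 order by c._ts is exactly the filter-plus-ORDER BY shape that composite indexes are designed to serve.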

Related

How to automatically break down a SQL-like query with many joins into discrete, independent steps?

Note: This is a learning exercise to learn how to implement a SQL-like relational database. This is just one thin slice of a question in the overall grand vision.
I have the following query, given a test database with a few hundred records:
select distinct "companies"."name"
from "companies"
inner join "projects" on "projects"."company_id" = "companies"."id"
inner join "posts" on "posts"."project_id" = "projects"."id"
inner join "comments" on "comments"."post_id" = "posts"."id"
inner join "addresses" on "addresses"."company_id" = "companies"."id"
where "addresses"."name" = 'Address Foo'
and "comments"."message" = 'Comment 3/3/2/1';
The query here is somewhat unrealistic, but it demonstrates the point: it has a few joins, so that I can figure out how to rewrite it as sequential steps.
The first part of the question (which I think I've partially figured out) is: how do you write these joins as a sequence of independent steps, with the output of one fed into the input of the next? Also, is there more than one way to do it?
// step 1
let companies = select('companies')
// step 2
let projects = join(companies, select('projects'), 'id', 'company_id')
// step 3
let posts = join(projects, select('posts'), 'id', 'project_id')
// step 4
let comments = join(posts, select('comments'), 'id', 'post_id')
// step 5
let finalPosts = posts.filter(post => !!comments.find(comment => comment.post_id === post.id))
// step 6
let finalProjects = projects.filter(project => !!posts.find(post => post.project_id === project.id))
// step 7, could also be run in parallel to step 2 potentially
let addresses = join(companies, select('addresses'), 'id', 'company_id')
// step 8
let finalCompanies = companies.filter(company => {
  return !!posts.find(post => post.company_id === company.id)
    && !!addresses.find(address => address.company_id === company.id)
})
These filters could probably be optimized further using indexes of some sort, but that is beside the point, I think. This just demonstrates that there seem to be about 8 steps needed to find the companies we are looking for.
The main question is, how do you automatically figure out the steps from the SQL query?
I am not asking about how to parse the SQL query into an AST. Assume we have some sort of object structure we are dealing with, like an AST, to start.
How would the SQL query have to be structured in object form so that it leads to these 8 steps? I would like to specify a query (using a custom JSON-like syntax, not SQL) and have it divided into these steps, so the query can be performed in parts (divide and conquer, so to speak, for learning how to implement distributed databases). But I don't see how to go from an SQL-like syntax to the 8 steps. Can you show how that might be done?
Here is the full code for the demo, which you can run with psql postgres -f test.sql. The result should be "Company 3".
Basically, I am looking for a high-level algorithm (it doesn't even need to be code) that describes how you would break down an AST-like object representation of a SQL query into the actual planned steps of the query.
My algorithm looks like this in my head:
1. Represent the SQL query as an object tree.
2. Convert the object tree into steps.
I am not really sure how (1) should be structured, and even if we had some structure, I'm not sure how to use it to accomplish (2). I'm looking for more details on the implementation of these steps, mainly step (2).
My "object structure" for step 1 would be something like this:
const query = {
  select: {
    distinct: true,
    columns: ['companies.name'],
    from: ['companies'],
  },
  joins: [
    {
      type: 'inner join',
      table: 'projects',
      left: 'projects.company_id',
      right: 'companies.id',
    },
    ...
  ],
  conditions: [
    {
      left: 'addresses.name',
      op: '=',
      right: 'Address Foo'
    },
    ...
  ]
}
I am not sure how useful that is, but it doesn't relate to steps at all. At a high level, what kind of code would I have to write to convert that sort of object structure into steps? One potential avenue is to do a topological sort on the joins. But then you need to combine that with the select and conditions somehow; I'm not sure how you would even begin to programmatically know which step should come before which, or even what the steps are. Maybe if I could somehow break it into known "chunks", it would be simple to apply a topological sort after that, but then the question remains: how do you get the chunks from the object structure / SQL?
Basically, I have been reading about the theory behind "query planning and optimization", but don't know how to apply it in this regard. How did this site do it?
One aspect is breaking at least the WHERE conditions into conjunctive normal form (CNF).
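For step (2), here is a minimal sketch of one workable approach, written against the question's query object (the plan function and the step shapes are hypothetical, and optimization is ignored entirely): take the joins left to right, giving a left-deep plan, and push each condition down as soon as its table is available.

// Hypothetical sketch: turn the query object into an ordered list of steps.
// Joins are taken in the order written (a left-deep plan); each WHERE
// condition is applied as soon as the table it references has been joined in.
function plan(query) {
  const steps = [];
  const available = new Set(query.select.from); // tables in the working set
  steps.push({ op: 'scan', table: query.select.from[0] });

  const pending = [...query.conditions];
  function applyReadyConditions() {
    for (let i = pending.length - 1; i >= 0; i--) {
      const table = pending[i].left.split('.')[0];
      if (available.has(table)) {
        steps.push({ op: 'filter', condition: pending[i] });
        pending.splice(i, 1);
      }
    }
  }
  applyReadyConditions();

  for (const join of query.joins) {
    steps.push({ op: 'scan', table: join.table });
    steps.push({ op: 'join', left: join.left, right: join.right });
    available.add(join.table);
    applyReadyConditions();
  }

  steps.push({ op: 'project', columns: query.select.columns, distinct: query.select.distinct });
  return steps;
}

Run over the question's query object, this produces a step list close to the eight steps written out by hand: scans and joins in order, filters as soon as they apply, and a final distinct projection. Join reordering, where real optimizers spend their effort, then becomes a matter of permuting query.joins before this loop based on estimated costs.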
Implementing joins is a huge topic which is probably out of scope for a StackOverflow answer.
If you're looking for practical information about how joins are implemented, I would suggest:
The Join Operation section of Use The Index, Luke for different types of join implementation.
Section 7 of The SQLite Query Optimizer Overview, which covers joins. Also worth reading is the SQLite source code, which is about as small as a practical SQL implementation gets.
The output of EXPLAIN in PostgreSQL, which gives very detailed information about how it has planned the query; the operators are explained in Operator Optimization Information.

AWS Config Advanced Query SQL Syntax

I am trying to use AWS Config Advanced Query to generate a report against a specific rule I have created.
SELECT
    configuration.targetResourceId,
    configuration.targetResourceType,
    configuration.complianceType,
    configuration.configRuleList
WHERE
    configuration.configRuleList.configRuleName = 'aws_config-requiredtags-rule'
    AND configuration.complianceType = 'NON_COMPLIANT'
Results look similar to this:
[
    0: {
        "configRuleName": "aws_configrequiredtags-rule",
        "configRuleArn": "arn:aws:config:us-east-2:123456789:config-rule/config-rule-dl6gsy",
        "configRuleId": "config-rule-dl6gsy",
        "complianceType": "COMPLIANT"
    },
    1: {
        "configRuleName": "eaws_config-instanceinvpc-rule",
        "configRuleArn": "arn:aws:config:us-east-2:123456789:config-rule/config-rule-dc4f1x",
        "configRuleId": "config-rule-dc4f1x",
        "complianceType": "NON-COMPLIANT"
    }
]
While this query produces results, it matches the rule name and the compliance type independently, so I am not getting only the results where 'aws_config-requiredtags-rule' itself is NON_COMPLIANT.
I am pretty much a novice with SQL, but I hope there is a way to specify that I only want to see NON_COMPLIANT results for a specific rule.
thanks,
This is a limitation of the AWS Config Service - and a pretty big one IMO. When you filter on properties within arrays, those filters are treated like OR operations instead of AND. There doesn't seem to be a good way of performing meaningful queries for individual rules.
From the docs:
When querying against multiple properties within an array of objects, matches are computed against all the array elements
...
The first condition configuration.configRuleList.complianceType = 'non_compliant' is applied to ALL elements in R.configRuleList, because R has a rule (rule B) with complianceType = ‘non_compliant’, the condition is evaluated as true. The second condition configuration.configRuleList.configRuleName is applied to ALL elements in R.configRuleList, because R has a rule (rule A) with configRuleName = ‘A’, the condition is evaluated as true. As both conditions are true, R will be returned.
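To see why, here is a minimal, hypothetical configRuleList for the record R in that quote. Each condition matches some element of the array, so R is returned even though no single element satisfies both:

"configRuleList": [
    {
        "configRuleName": "A",
        "complianceType": "compliant"
    },
    {
        "configRuleName": "B",
        "complianceType": "non_compliant"
    }
]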

Slick plain sql query with pagination

I have something like this, using Akka, Alpakka + Slick
Slick
  .source(
    sql"""select #${onlyTheseColumns.mkString(",")} from #${dbSource.table}"""
      .as[Map[String, String]]
      .withStatementParameters(rsType = ResultSetType.ForwardOnly, rsConcurrency = ResultSetConcurrency.ReadOnly, fetchSize = batchSize)
      .transactionally
  ).map( doSomething )...
I want to update this plain SQL query to skip the first N elements.
But that is very DB specific.
Is it possible to have Slick generate the pagination bit, like the drop, filter, take one would use on type-safe queries?
ps: I don't have the schema, so I cannot go the type-safe way; I just want all tables as Map, and to filter, drop, etc. on them.
ps2: at the Akka level, flow.drop works, but it's not optimal/slow, because it still consumes the rows.
Cheers
Since you are using plain SQL, you have to provide workable SQL in the code snippet yourself. Plain SQL is not type-safe, but it is flexible.
By the way, the most efficient approach is to skip the first N rows in the database itself, for example with LIMIT in MySQL.
Depending on your database engine, you could use something like:
val page = 1
val pageSize = 10

val query = sql"""
  select #${onlyTheseColumns.mkString(",")}
  from #${dbSource.table}
  limit #${pageSize + 1}
  offset #${pageSize * (page - 1)}
"""
The pageSize + 1 part tells you whether a next page exists: if more than pageSize rows come back, drop the extra row and show a "next" link.
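For example (a sketch; rows stands for the materialized result of running the query above):

val rows: Seq[Map[String, String]] = ??? // the fetched result, at most pageSize + 1 rows
val pageRows = rows.take(pageSize)       // the rows to actually return
val hasNext  = rows.size > pageSize      // the extra row signals another page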
I want to update this plain SQL query to skip the first N elements. But that is very DB specific.
As you're concerned about changing the SQL for different databases, I suggest you abstract away that part of the SQL and decide what to do based on the Slick profile being used.
If you are working with multiple database products, you've probably already abstracted away from any specific profile, perhaps using JdbcProfile. In that case you could place your "skip N elements" helper in a class and use the active slickProfile to decide on the SQL to use. (As an alternative, you could of course check via some other means, such as an environment value you set.)
In practice that could be something like this:
case class Paginate(profile: slick.jdbc.JdbcProfile) {
  // Return the correct LIMIT/OFFSET SQL for the current Slick profile
  def page(size: Int, firstRow: Int): String =
    if (profile.isInstanceOf[slick.jdbc.H2Profile]) {
      s"LIMIT $size OFFSET $firstRow"
    } else if (profile.isInstanceOf[slick.jdbc.MySQLProfile]) {
      s"LIMIT $firstRow, $size"
    } else {
      // And so on... or a default
      // Danger: I've no idea if the above SQL is correct - it's just placeholder
      ???
    }
}
Which you could use as:
// Import your profile
import slick.jdbc.H2Profile.api._

val paginate = Paginate(slickProfile)

val action: DBIO[Seq[Int]] =
  sql"""SELECT cols FROM table #${paginate.page(100, 10)}""".as[Int]
In this way, you get to isolate (and control) RDBMS-specific SQL in one place.
To make the helper more usable, and as slickProfile is implicit, you could instead write:
def page(size: Int, firstRow: Int)(implicit profile: slick.jdbc.JdbcProfile) =
  // Logic for deciding on SQL goes here
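Filled in the same way as the class-based version above (still just a sketch, with the same caveat about the placeholder SQL):

def page(size: Int, firstRow: Int)(implicit profile: slick.jdbc.JdbcProfile): String =
  if (profile.isInstanceOf[slick.jdbc.H2Profile]) s"LIMIT $size OFFSET $firstRow"
  else if (profile.isInstanceOf[slick.jdbc.MySQLProfile]) s"LIMIT $firstRow, $size"
  else ??? // other profiles, or a sensible default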
I feel obliged to comment that using a splice (#$) in plain SQL opens you to SQL injection attacks if any of the values are provided by a user.

Should paging be zero indexed within an API?

When implementing a REST API with parameters for paging, should paging be zero-indexed or start at 1? The parameters would be Page and PageSize.
For me, it makes sense to start at 1, since we are talking about pages.
There's no standard for it. Just have a look around: there are hundreds of thousands of APIs using different approaches.
Most of the APIs I know use one of the following approaches for pagination:
offset and limit or
page and size
Both can be 0 or 1 indexed. Which is better? That's up to you.
Just pick the one that fits your needs and document it properly.
Additionally, you could provide some links in the response payload to make the navigation easier between the pages.
Consider, for example, you are reading data from page 2. So, provide a link for the previous page (page 1) and for the next page (page 3):
{
  "data": [
    ...
  ],
  "paging": {
    "previous": "http://api.example.com/foo?page=1&size=10",
    "next": "http://api.example.com/foo?page=3&size=10"
  }
}
And remember, always make an API you would love to use.
True, there's no standard for this.
I find that older Microsoft-based products (like DAO for Visual Basic 6, Visual C++ 6, and similar) tended to start their pagination from 1, while many other tech stacks use 0. Gradually I find that more and more libraries are using 0 instead of 1.
Why is this? Mathematically speaking, it's easier to map a pageIndex starting from 0 to a rowNumber in a DB or array. Suppose you have a dataset of 100 records fetched from a table in a DB, and you want to send the second page (with pageSize = 10, say). With pageIndex starting from 0, you only need to write:
startRowNumber = pageIndex * pageSize;
return dataSet.slice(startRowNumber, startRowNumber + pageSize);
This works because in most DBs and languages, arrays/lists are 0-indexed. And even if your REST API language uses 1-indexed arrays, you would still have a problem mapping a 1-indexed pageIndex to record IDs. For example: suppose you have a dataset indexed 1..100 (not 0..99), and you want to send the 11th to 20th records as the second page (here pageSize = 10 and pageIndex = 2, because in this case you start with 1). Then you need the formula:
((pageIndex - 1) * pageSize) + 1; // to get the number 11
You can see that 0-indexed paging is easier for developers.
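A short sketch contrasting the two conventions over a 0-indexed array (hypothetical data):

// 100 records, stored 0-indexed as in most languages and DB drivers
const data = Array.from({ length: 100 }, (_, i) => i + 1);
const pageSize = 10;

// 0-indexed pages: the second page is page 1
const zeroIndexed = page => data.slice(page * pageSize, (page + 1) * pageSize);

// 1-indexed pages: the second page is page 2
const oneIndexed = page => data.slice((page - 1) * pageSize, page * pageSize);

zeroIndexed(1); // records 11..20
oneIndexed(2);  // records 11..20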
1-indexed pagination makes more sense to human users, because we start with 1 when counting everything.

Boosting individual elasticsearch indices to have preference in results

I am trying to boost certain indices in my Elasticsearch query. Right now, my query looks like this.
var query = {
  "query": {
    "query_string": {
      "fields": ["FirstName", "LastName"],
      "query": "Hank Hill",
      "default_operator": "AND"
    }
  }
};

var boosted_indices = {
  "index_A": 1.0,
  "index_B": 1.0,
  "index_C": 10.0
};

if (boosted_indices) {
  query["indices_boost"] = boosted_indices;
}
// stringify and send query in an http.get request
I know that my query without boosting any indices works as I expect. However, I am still getting a lot of results from "index_A" in my query results, rather than the heavily boosted index_C. I know that there should be a similar number of matching results in A and C, so the issue must be that I am not boosting the query correctly.
Did I set up my query JSON incorrectly? The tutorial I linked did not give much context.
One other thing I noticed: the _score field is null for all of the returned documents. Might this have something to do with my documents not being boosted according to the index they came from?
I hope you are not using the sort parameter in the query. That would explain why _score is null and why you are not getting the expected results: when Elasticsearch sorts on a field, it skips computing scores by default, and indices_boost works by scaling _score.
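If that is the case, here is a sketch of two options, reusing the query from the question (indices_boost influences ranking only through _score):

// Option 1: no "sort" clause - results come back ranked by _score,
// which indices_boost multiplies per index.
var query = {
  "query": {
    "query_string": {
      "fields": ["FirstName", "LastName"],
      "query": "Hank Hill",
      "default_operator": "AND"
    }
  },
  "indices_boost": {
    "index_A": 1.0,
    "index_B": 1.0,
    "index_C": 10.0
  }
};

// Option 2: if you must keep a "sort", add "track_scores": true so that
// _score is still computed and returned (it won't change the sort order).
// query["sort"] = [ ... ];
// query["track_scores"] = true;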
Does this help?