How to implement a backend for GraphQL connections?

In GraphQL the recommended way to paginate is to use connections as described here. I understand the reasons and advantages of this approach, but I need advice on how to implement it.
The server side of the application works on top of a SQL database (Postgres in my case). Some of the GraphQL connection fields have an optional argument to specify sorting. Now, knowing the sorting columns and a cursor from the GraphQL query, how can I build an SQL query? Of course it should be efficient: if there is a SQL index for the combination of sorting columns, it should be used.
The problem is that SQL doesn't know anything like GraphQL cursors; we can't tell it to select all rows after a certain row. There is just WHERE, OFFSET and LIMIT. From my point of view it seems I first need to select a single row based on the cursor, and then build a second SQL query that uses the values of the sorting columns from that row in a complicated WHERE clause. I'm not sure whether the database would use an index in that case.
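Something like the following sketch is what I have in mind. It is only a rough illustration of the idea, not working code: the things table, the bar sort column and the id tie-breaker are made-up names, and I assume the cursor can be decoded back into the last returned row's (bar, id) values.

const { Pool } = require('pg');
const pool = new Pool();

// Fetch the next page after the row identified by the cursor values.
// As far as I can tell, Postgres can answer this row-value comparison
// from a composite index on (bar, id), but that is exactly what I am unsure about.
async function thingsAfter(cursorBar, cursorId, limit) {
  const { rows } = await pool.query(
    `SELECT id, bar
       FROM things
      WHERE (bar, id) > ($1, $2)
      ORDER BY bar, id
      LIMIT $3`,
    [cursorBar, cursorId, limit]
  );
  return rows;
}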
What bothers me is that I could not find any article on this topic. Does that mean an SQL database is not usually used when implementing a GraphQL server? What database should be used then? How are GraphQL queries on connection fields usually translated into queries for the underlying database?
EDIT: This is more or less what I came up with myself. The problem is how to extend it to support sorting as well and how to implement it efficiently using database indexes.

The trick here is that, as the server implementer, the cursor can be literally any value you want encoded as a string. Most examples I've seen have been base64-encoded for a bit of opacity, but it doesn't have to be. (Try base64-decoding the cursors from the Star Wars examples in your link, for example.)
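For example, a pair of tiny helpers along these lines gives you opaque-looking cursors (just a sketch; the names are made up and you can encode whatever value you like):

// Hypothetical helpers: base64-encode/decode whatever cursor value you choose.
const encodeCursor = (value) => Buffer.from(String(value)).toString('base64');
const decodeCursor = (cursor) => Buffer.from(cursor, 'base64').toString('utf8');

console.log(encodeCursor(2));      // "Mg=="
console.log(decodeCursor('Mg==')); // "2"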
Let's say your GraphQL schema looks like
enum ThingColumn { FOO BAR }
input ThingFilter {
  foo: Int
  bar: Int
}
type Query {
  things(
    filter: ThingFilter,
    sort: ThingColumn,
    first: Int,
    after: String
  ): ThingConnection
}
Your first query might be
query {
  things(filter: { foo: 1 }, sort: BAR, first: 2) {
    edges {
      node { bar }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}
This on its own could fairly directly translate into an SQL query like
SELECT bar FROM things WHERE foo=1 ORDER BY bar ASC LIMIT 2;
Now as you iterate through each item you can just use a string version of its offset as its cursor; that's totally allowed by the spec.
{
  "data": {
    "things": {
      "edges": [
        { "node": { "bar": 17 } },
        { "node": { "bar": 42 } }
      ],
      "pageInfo": {
        "endCursor": "2",
        "hasNextPage": true
      }
    }
  }
}
Then when the next query says after: "2", you can turn that back into an SQL OFFSET and repeat the query.
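As a sketch of that round trip (assuming a Node resolver with a node-postgres client available as db on the context; the resolver and argument names are illustrative, not part of the question):

// Hypothetical resolver: the cursor is just the stringified offset.
async function resolveThings(parent, { filter, first, after }, { db }) {
  const offset = after ? parseInt(after, 10) : 0;
  const { rows } = await db.query(
    'SELECT id, bar FROM things WHERE foo = $1 ORDER BY bar ASC LIMIT $2 OFFSET $3',
    [filter.foo, first, offset]
  );
  return {
    edges: rows.map((row) => ({ node: row })),
    pageInfo: {
      endCursor: String(offset + rows.length),
      // Cheap heuristic; fetching first + 1 rows or a COUNT(*) is more precise.
      hasNextPage: rows.length === first,
    },
  };
}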
If you're trying to build a generic GraphQL interface that gets translated into reasonably generic SQL queries, it's impossible to create indexes such that every query is "fast". As in other cases, you need to figure out what your common and/or slow queries are and CREATE INDEX as needed. You might be able to limit the options in your schema to things you know you can index:
type Other {
  things(first: Int, after: String): ThingConnection
}
query OtherThings($id: ID!, $cursor: String) {
  node(id: $id) {
    ... on Other {
      things(first: 100, after: $cursor) { ...FromAbove }
    }
  }
}
SELECT * FROM things WHERE other_id=? ORDER BY id LIMIT ?;
CREATE INDEX things_other ON things(other_id);

Related

How many SQL database calls are made when you do a deeply nested GraphQL query?

I know that with GraphQL you are expected to implement the backend handlers for the queries. So if you are using PostgreSQL, you might have a query like this:
query {
  authors {
    id
    name
    posts {
      id
      title
      comments {
        id
        body
        author {
          id
          name
        }
      }
    }
  }
}
The naive solution would be to do something like this:
const resolvers = {
  Query: {
    authors: () => {
      // somewhat realistic sql pseudocode
      return knex('authors').select('*')
    },
  },
  Author: {
    posts: (author) => {
      return knex('posts').where('author_id', author.id)
    },
  },
  Post: {
    comments: (post) => {
      return knex('comments').where('post_id', post.id)
    },
  },
};
However, this would be a pretty big problem. It would essentially do the following:
Make 1 query for all authors.
For each author, make a query for all posts (the N+1 query problem).
For each post, make a query for all comments (N+1 again).
So it's like a fan-out of queries. If there were 20 authors, each with 20 posts, that would be 21 db calls. If each post also had 20 comments, that would be 421 db calls! 20 authors resolve to 400 posts, which resolve to 8,000 comments. This isn't how you would really do it, but it demonstrates the point: 1 -> 20 -> 400 db calls.
If we add the comments.author calls, that's another 8,000 db calls (one for each comment)!
How would you batch this into let's say 3 db calls (1 for each type)? Is that what optimized GraphQL query resolvers do essentially? Or what is the best that can be done for this situation?
This is the GraphQL N+1 loading issue.
Basically there are two ways to solve it (for simplicity, assume it only needs to load the authors and their posts):
Use the DataLoader pattern. The idea is to defer the actual loading of each author's posts to a later point, so that the posts for N authors can be batch-loaded together with a single SQL query (a minimal sketch follows below). It also provides a caching feature to further improve performance within the same request.
Use "look ahead pattern" (A Java example is described at here) . Basically its idea is that when resolving the authors , you just look ahead to see if the query includeS the posts or not in the sub fields. If yes , you can then use a SQL join to get the authors together with its post in a single SQL.
Also, to prevent a malicious client from making a request that retrieves a very big graph, some GraphQL servers will analyse the query and impose a depth limit on it.
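A minimal sketch of the DataLoader approach for the Author.posts step, reusing the knex setup from the question (in a real server you would normally create the loader per request, so its cache is scoped to that request):

const DataLoader = require('dataloader');
const knex = require('knex')({ client: 'pg', connection: process.env.DATABASE_URL });

// Batch function: receives every author id requested in the current tick
// and must return results in the same order as the incoming keys.
const postsByAuthor = new DataLoader(async (authorIds) => {
  const rows = await knex('posts').whereIn('author_id', authorIds);
  return authorIds.map((id) => rows.filter((row) => row.author_id === id));
});

const resolvers = {
  Author: {
    // Every Author.posts resolver in the same request shares one SQL query.
    posts: (author) => postsByAuthor.load(author.id),
  },
};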

How to achieve generic Audit.NET json data processing?

I am using the Audit.NET library to log Entity Framework actions into a database (currently everything goes into one AuditEventLogs table), where the JsonData column stores the data in the following JSON format:
{
  "EventType":"MyDbContext:test_database",
  "StartDate":"2021-06-24T12:11:59.4578873Z",
  "EndDate":"2021-06-24T12:11:59.4862278Z",
  "Duration":28,
  "EntityFrameworkEvent":{
    "Database":"test_database",
    "Entries":[
      {
        "Table":"Offices",
        "Name":"Office",
        "Action":"Update",
        "PrimaryKey":{
          "Id":"40b5egc7-46ca-429b-86cb-3b0781d360c8"
        },
        "Changes":[
          {
            "ColumnName":"Address",
            "OriginalValue":"test_address",
            "NewValue":"test_address"
          },
          {
            "ColumnName":"Contact",
            "OriginalValue":"test_contact",
            "NewValue":"test_contact"
          },
          {
            "ColumnName":"Email",
            "OriginalValue":"test_email",
            "NewValue":"test_email2"
          },
          {
            "ColumnName":"Name",
            "OriginalValue":"test_name",
            "NewValue":"test_name"
          },
          {
            "ColumnName":"OfficeSector",
            "OriginalValue":1,
            "NewValue":1
          },
          {
            "ColumnName":"PhoneNumber",
            "OriginalValue":"test_phoneNumber",
            "NewValue":"test_phoneNumber"
          }
        ],
        "ColumnValues":{
          "Id":"40b5egc7-46ca-429b-86cb-3b0781d360c8",
          "Address":"test_address",
          "Contact":"test_contact",
          "Email":"test_email2",
          "Name":"test_name",
          "OfficeSector":1,
          "PhoneNumber":"test_phoneNumber"
        },
        "Valid":true
      }
    ],
    "Result":1,
    "Success":true
  }
}
My team and I have one main goal to achieve:
Being able to create a search page where administrators can tell
who made a change
what they changed
when the change happened
They can give a time period to reduce the number of audit records, and here comes the interesting part:
There should be an input text field that lets them search within the values of the "ColumnValues" section.
The problems I encountered:
Even if I map the JSON structure into relational rows, I am unable to search in every column while keeping things generic.
If I don't map it, I could search the JSON string with the MSSQL LIKE function, but with a few hundred thousand records the query takes an eternity to finish, so that is probably not the way to go.
Keeping things generic is important, so we don't need to modify the audit search page every time we create or modify an entity.
I only know MSSQL, but is it possible that storing the audit logs in a document-oriented database like Cosmos DB (or anything else, it was just an example) would solve my problem? Or can I reach the desired behaviour using a relational database like MSSQL?
Looks like you're asking for an opinion; in that case, I would strongly recommend a document-oriented DB.
CosmosDB could be a great option since it supports SQL queries.
There is an extension to log to CosmosDB from Audit.NET: Audit.AzureCosmos
A sample query:
SELECT c.EventType, e.Table, e.Action, ch.ColumnName, ch.OriginalValue, ch.NewValue
FROM c
JOIN e IN c.EntityFrameworkEvent.Entries
JOIN ch IN e.Changes
WHERE ch.ColumnName = "Address" AND ch.OriginalValue = "test_address"
Here is a nice post with lots of examples of complex SQL queries on CosmosDB.

Elasticsearch query context vs filter context

I am a little bit confused by Elasticsearch Query DSL's query context and filter context. I have the two queries below. Both queries return the same result, but the first one calculates a score and the second one does not. Which one is more appropriate?
1st query:
curl -XGET 'localhost:9200/xxx/yyy/_search?pretty' -d'
{
  "query": {
    "bool": {
      "must": {
        "terms": { "mcc": ["5045","5499"] }
      },
      "must_not": {
        "term": { "maximum_flag": false }
      },
      "filter": {
        "geo_distance": {
          "distance": "500",
          "location": "40.959334, 29.082142"
        }
      }
    }
  }
}'
2nd query:
curl -XGET 'localhost:9200/xxx/yyy/_search?pretty' -d'
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "maximum_flag": true } },
        { "terms": { "mcc": ["5045","5499"] } },
        {
          "geo_distance": {
            "distance": "500",
            "location": "40.959334, 29.082142"
          }
        }
      ]
    }
  }
}'
Thanks,
In the official guide you have a good explanation:
Query context
A query clause used in query context answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a _score representing how well the document matches, relative to other documents.
Query context is in effect whenever a query clause is passed to a query parameter, such as the query parameter in the search API.
Filter context
In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g.
Does this timestamp fall into the range 2015 to 2016?
Is the status field set to "published"?
Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in the bool query, the filter parameter in the constant_score query, or the filter aggregation.
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-filter-context.html
Regarding your case, we would need more information, but taking into account that you are looking for exact values, a filter would suit it better.
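For example, since all three conditions here are exact-value or geo filters, they can all go into filter context. A sketch of that request from Node (assuming Node 18+ for the built-in fetch, and the same endpoint and field names as in the question; the distance unit is made explicit):

async function search() {
  const body = {
    query: {
      bool: {
        filter: [
          { terms: { mcc: ['5045', '5499'] } },
          { term: { maximum_flag: true } },
          { geo_distance: { distance: '500m', location: '40.959334, 29.082142' } },
        ],
      },
    },
  };

  const res = await fetch('http://localhost:9200/xxx/yyy/_search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  console.log(await res.json());
}

search().catch(console.error);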
The first query calculates a score because the terms clause is placed directly under must instead of being wrapped in filter; a clause written directly in query context like that results in a score being calculated.
In the second query the clauses are inside filter, which changes their context from query context to filter context. In filter context no score is calculated (by default a _score of 1 is assigned to all matching documents).
You can find more details about query behavior in this article:
https://towardsdatascience.com/deep-dive-into-querying-elasticsearch-filter-vs-query-full-text-search-b861b06bd4c0

How to create elasticsearch index alias that excludes specific fields

I'm using Elasticsearch's index aliases to create restricted views on a more complete index to support a legacy search application. This works well. But I'd also like to exclude certain sensitive fields from the returned results (they contain email addresses, and we want to preclude harvesting).
Here's what I have:
PUT full-index/_alias/restricted-index-alias
{
  "_source": {
    "exclude": [ "field_with_email" ]
  },
  "filter": {
    "term": { "indexflag": "noindex" }
  }
}
This works for queries (I don't see field_with_email), and the filter term works (I get a restricted index) but I still see the field_with_email in query results from the index alias.
Is this supposed to work?
(I don't want to exclude from _source in the mapping, as I'm also using partial updates; these are easier if the entire document is available in _source.)
No, it is not supposed to work, and the documentation doesn't suggest that it should work.

MongoDB Update / Upsert Question - Schema Related

I have a problem representing data in MongoDB. I was using this schema design, where a combination of date and word is unique.
{'date': '2-1-2011',
 'word': 'word1',
 'users': ['user1', 'user2', 'user3', 'user4']}
{'date': '1-1-2011',
 'word': 'word2',
 'users': ['user1', 'user2']}
There are a fixed number of dates, approximately 200; potentially 100k+ words for each date; and 100k+ users.
I inserted records with an algorithm like so:
while records exist:
message, user, date = pop a record off a list
words = set(tokenise(message))
for word in words:
collection1.insert({'date':date, 'word':word}, {'user':user})
collection2.insert('something similar')
collection3.insert('something similar again')
collection4.insert('something similar again')
However, this schema resulted in extremely large collections and terrible performance. I am inserting different information into each of the four collections, so it is an extremely large number of operations on the database.
I'm considering representing the data in a format like so, where the words and users arrays are sets.
{'date': '26-6-2011',
 'words': {
     'word1': ['user1', 'user2'],
     'word2': ['user1'],
     'word3': ['user1', 'user2', 'user3']}}
The idea behind this was to cut down on the number of database operations, so that for each loop of the algorithm I perform just one update per collection. However, I am unsure how to perform an update / upsert on this, because with each loop of the algorithm I may need to insert a new word, a new user, or both.
Could anyone recommend either a way to update this document, or could anyone suggest an alternative schema?
Thanks
Upsert is well suited for dynamically extending documents. Unfortunately I only found it to work properly if you have an atomic modifier operation in your update object, like the $addToSet here (mongo shell code):
db.words is empty. Add the first document for a given date with an upsert.
var query = { 'date' : 'date1' }
var update = { $addToSet: { 'words.word1' : 'user1' } }
db.words.update(query,update,true,false)
check object.
db.words.find();
{ "_id" : ObjectId("4e3bd4eccf7604a2180c4905"), "date" : "date1", "words" : { "word1" : [ "user1" ] } }
Now add some more users to the first word, plus another word, in one update.
var update = { $addToSet: { 'words.word1' : { $each : ['user2', 'user4', 'user5'] }, 'words.word2': 'user3' } }
db.words.update(query,update,true,false)
again, check object.
db.words.find()
{ "_id" : ObjectId("4e3bd7e9cf7604a2180c4907"), "date" : "date1", "words" : { "word1" : [ "user1", "user2", "user4", "user5" ], "word2" : [ "user3" ] } }
I'm using MongoDB to insert 105 million records with ~10 attributes each. Instead of updating this dataset with changes, I just delete and re-insert everything. I found this method to be faster than individually touching each row to see if it was one that I needed to update. You will get better insert speeds if you create JSON-formatted text files and use MongoDB's mongoimport tool.
format your data into JSON txt files (one file per collection)
mongoimport each file and specify the collection you want it inserted into
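A sketch of step 1 in Node (file and collection names are made up): mongoimport's default input format is one JSON document per line, so you can stream the documents out and then import the file.

const fs = require('fs');

// Example documents for one collection; in practice these come from your pipeline.
const docs = [
  { date: '26-6-2011', words: { word1: ['user1', 'user2'], word2: ['user1'] } },
  { date: '27-6-2011', words: { word1: ['user3'] } },
];

const out = fs.createWriteStream('words.json');
for (const doc of docs) {
  out.write(JSON.stringify(doc) + '\n'); // one document per line
}
out.end();

// Step 2, from the shell:
//   mongoimport --db mydb --collection words --file words.json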