Optimizing GraphQL resolvers for SQL databases in a service-oriented architecture

My company has a service-oriented architecture. My app's GraphQL server therefore has to call out to other services to fulfill the data requests from the frontend.
Let's imagine my GraphQL schema defines the type User. The data for this type comes from two sources:
A user account service that exposes a REST endpoint for fetching a user's username, age, and friends.
A SQL database used just by my app to store User-related data that is only relevant to my app: favoriteFood, favoriteSport.
Let's assume that the user account service's endpoint automatically returns the username and age, but you have to pass the query parameter friends=true in order to retrieve the friends data because that is an expensive operation.
Given that background, the following query presents a couple of optimization challenges in the getUser resolver:
query GetUser {
  getUser {
    username
    favoriteFood
  }
}
Challenge #1
When the getUser resolver makes the request to the user account service, how does it know whether or not it needs to ask for the friends data as well?
Challenge #2
When the resolver queries my app's database for additional user data, how does it know which fields to retrieve from the database?
The only solution I can find to both challenges is to inspect the query via the info argument, the fourth argument passed to every resolver. This will allow the resolver to find out whether friends should be requested in the REST call to the user account service, and to build the correct SELECT query to retrieve the needed data from my app's database.
Is this the correct approach? It seems like a use-case that GraphQL implementations must be running into all the time and therefore I'd expect to encounter a widely accepted solution. However, I haven't found many articles that address this, nor does a widely used NPM module appear to exist (graphql-parse-resolve-info is part of PostGraphile but only has ~12k weekly downloads, while graphql-fields has ~18.5k weekly downloads).
I'm therefore concerned that I'm missing something fundamental about how this should be done. Am I? Or is inspecting the info argument the correct way to solve these optimization challenges? In case it matters, I am using Apollo Server.

If you want to modify your resolver based on the requested selection set, there's really only one way to do that and that's to parse the AST of the requested query. In my experience, graphql-parse-resolve-info is the most complete solution for making that parsing less painful.
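For illustration, here's a minimal sketch of what that parsing looks like with graphql-parse-resolve-info (the User type name matches the schema described in the question; everything else is a placeholder):

const { parseResolveInfo } = require('graphql-parse-resolve-info');

const resolvers = {
  Query: {
    async getUser(parent, args, context, info) {
      // Parse the query AST into a tree of the requested fields.
      const parsedInfo = parseResolveInfo(info);
      // fieldsByTypeName is keyed by GraphQL type name ("User" here).
      const requestedFields = Object.keys(parsedInfo.fieldsByTypeName.User);
      // For the query in the question, requestedFields is
      // ['username', 'favoriteFood'].
      // ...fetch only what those fields require (see the sketch further down)...
    },
  },
};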
I imagine this isn't as common an issue as you'd think, because most folks fall into one of two groups:
Users of frameworks or libraries like PostGraphile, Hasura, Prisma, Join Monster, etc., which take care of optimizations like these for you (at least on the database side).
Users who are not concerned about overfetching on the server side and just request all columns regardless of the selection set.
In the latter case, fields that represent associations are given their own resolvers, so those subsequent calls to the database won't be fired unless they are actually requested. DataLoader is then used to help batch all these extra calls to the database. Ditto for fields that end up calling some other data source, like a REST API.
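As a rough sketch of that pattern, a DataLoader-backed friends resolver might look like this (batchFetchFriends is a hypothetical data-access helper, and in a real app you'd create the loader per request in the context rather than as a module global):

const DataLoader = require('dataloader');

// Collects every .load(id) call made in one tick and batches them
// into a single round trip to the data source.
const friendsLoader = new DataLoader(async (userIds) => {
  const rows = await batchFetchFriends(userIds); // hypothetical helper
  // DataLoader requires one result per key, in the same order as the keys.
  return userIds.map((id) => rows.filter((row) => row.userId === id));
});

const resolvers = {
  User: {
    friends: (user) => friendsLoader.load(user.id),
  },
};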
In this particular case, DataLoader would not be much help to you. The best approach is to have a single resolver for getUser that fetches the user details from both the database and the REST endpoint. You can then, as you're already planning, adjust those calls (or skip them altogether) based on the requested fields. This can be cumbersome, but it will work as expected.
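Building on the parsing sketch above, that single resolver might look roughly like this. The account-service URL, context.db, and args.id are all assumptions standing in for your real wiring:

async function getUser(parent, args, context, info) {
  const parsedInfo = parseResolveInfo(info);
  const requested = new Set(Object.keys(parsedInfo.fieldsByTypeName.User));

  // Challenge #1: only ask the account service for friends when needed.
  const friendsParam = requested.has('friends') ? '?friends=true' : '';
  const accountPromise = fetch(
    `https://accounts.internal/users/${args.id}${friendsParam}` // hypothetical URL
  ).then((res) => res.json());

  // Challenge #2: only SELECT the app-specific columns that were requested.
  // Filtering against a fixed whitelist keeps the query injection-safe.
  const appColumns = ['favoriteFood', 'favoriteSport']
    .filter((column) => requested.has(column));
  const dbPromise = appColumns.length
    ? context.db
        .query(`SELECT ${appColumns.join(', ')} FROM users WHERE id = $1`, [args.id])
        .then((result) => result.rows[0])
    : Promise.resolve({});

  const [account, appData] = await Promise.all([accountPromise, dbPromise]);
  return { ...account, ...appData };
}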
The alternative to this approach is to simply fetch everything, but use caching to reduce the number of calls to your database and REST API. This way, you'll fetch the complete user each time, but you'll do so from memory unless the cache is invalidated or expires. This is more memory-intensive, and cache invalidation is always tricky, but it does simplify your resolver logic significantly.
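A minimal sketch of that caching alternative, assuming a simple in-memory TTL cache (a shared store like Redis would follow the same shape; the URL and db handle are again placeholders):

const cache = new Map(); // userId -> { user, expiresAt }
const TTL_MS = 60 * 1000;

async function getFullUser(userId, context) {
  const hit = cache.get(userId);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.user; // served from memory: no REST or DB call
  }
  // Cache miss: fetch the complete user, regardless of the selection set.
  const [account, appData] = await Promise.all([
    fetch(`https://accounts.internal/users/${userId}?friends=true`) // hypothetical URL
      .then((res) => res.json()),
    context.db
      .query('SELECT favoriteFood, favoriteSport FROM users WHERE id = $1', [userId])
      .then((result) => result.rows[0]),
  ]);
  const user = { ...account, ...appData };
  cache.set(userId, { user, expiresAt: Date.now() + TTL_MS });
  return user;
}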

Related

API design pattern to be integrated both by own web app and other systems

This backend will be consumed by an ad-hoc frontend application, but it will also be integrated with other systems, so we will expose an API for them.
When designing the REST API, I see that there is ONE database table (call it table A) that can join many other tables, let's say about 10 to 20 of them.
Now, my strategy would be to build routes in my backend that "reason" according to the ad-hoc frontend we have.
So if there is a page in the frontend (let's call it page1) that needs rows from table A but also fields from, say, 3 other joined tables, then I would create a route in the backend called maybe "page1" which returns rows from table A along with the rows from those 3 other tables.
This is of course an ordinary way to build a backend. But since it will also be used by other systems, somebody could argue that those systems may have no need for the route "page1"; their frontends may never build a "page1".
So according to people here, it would be better to build the API more agnostically. Instead of creating the route "page1", I should build it according to HATEOAS. If I understand that principle, instead of letting my ad-hoc frontend request the resource "page1", it would request "pageForTableA", and the resource "pageForTableA" would then return the possible tables to be joined.
In this case, for my frontend's page1, I would need to make 4 subsequent requests to the server, instead of the single request I could make if there were a "page1" resource in the backend.
What do you think?
I also see a third strategy. I don't know if there is a name for this pattern, but it would work this way:
A resource in the backend returns only rows from table A, BUT the route also takes an argument: an array with the names of all the other tables the caller wants to include.
So if the frontend calls:
getTableA(array('tableB', 'tableD', 'tableF'))
Then the resource would include/join tables B, D and F. In short: the API resource lets the frontend decide what it wants delivered (see the sketch below).
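A minimal sketch of that third strategy in a Node/Express backend (the route, table names, and data-access helper are hypothetical; note the whitelist so callers can't inject arbitrary join targets):

const express = require('express');
const app = express();

// Only these tables may be joined in, whatever the caller asks for.
const JOINABLE = new Set(['tableB', 'tableD', 'tableF']);

// e.g. GET /tableA?include=tableB,tableD
app.get('/tableA', async (req, res) => {
  const requested = (req.query.include || '').split(',').filter(Boolean);
  const joins = requested.filter((table) => JOINABLE.has(table));
  const rows = await fetchTableAWithJoins(joins); // hypothetical data access
  res.json(rows);
});

app.listen(3000);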
Which of these 3 strategies do you think is best? Or are there others that could be taken into consideration?
You need to architect your API in such a way that consumers don't need to know how the data is stored in the underlying data store.
Furthermore, if you want to allow consumers to decide which fields to project in the response, you could let them specify those fields using some query string format.
BTW, maybe you should avoid re-inventing the wheel. There's a standard called the Open Data Protocol (OData) which already defines a lot of what you require in your API, and since it was made by Microsoft, it has deep support in .NET.
Check this tutorial (Create an OData v4 Endpoint Using ASP.NET Web API 2.2) to get more in touch with OData.
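To give a flavor of what OData provides out of the box: field projection and joins are expressed with the standard $select and $expand query options, so a request against a hypothetical TableA entity set might look like:

GET /odata/TableA?$select=Field1,Field2&$expand=TableB

That single, agnostic endpoint covers both the "page1" use case and the other consumers, without custom per-page routes.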

Best HTTP method for a stateless REST API service call

I want to make a REST API that does spellchecking on text that is passed in, without storing any of the text on the server.
The call would probably look something like example.com/api/v1/spelling/mistakes, with an optional query param for locale and a list of the mistakes as the return value.
What would be the best HTTP method to use, given that the text passed in would be too large for a GET? Neither POST, PUT nor PATCH seems to map reasonably to the intended purpose, and there don't seem to be any other suitable matches among the less commonly used methods either.
What is the best HTTP method to use for a "translation"-like REST API service, taking and returning large amounts of data?
I would say this is a POST. It could have been a GET if the data had been posted previously: if the data had been "posted" somewhere else first, you could use a GET and pass the address (URI) or ID of that "posted" data to the API as a param. But because we are both "posting" the data and retrieving information about it in the same call, I would say this is a POST.

Granted, the data being posted has a short lifespan, but it is still being posted. If the data being posted were instead a customer order, it would still be a POST; the data would just be persisted somewhere. The only difference here is the short period of time the data exists for. And in future iterations of your API, you might actually want to keep that data and refer back to it with some ID, so by using POST you allow for future enhancements as well.
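For example, the exchange might look like this (the JSON shape of the response is just an assumption):

POST /api/v1/spelling/mistakes?locale=en-US HTTP/1.1
Host: example.com
Content-Type: text/plain

Teh quick brown fox jmps over the lazy dog.

HTTP/1.1 200 OK
Content-Type: application/json

{"mistakes": [{"word": "Teh", "offset": 0}, {"word": "jmps", "offset": 20}]}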
By the way, as a precaution, be careful with the memory footprint of these calls. I can see this being very memory-intensive if the data being passed grows large and the API becomes very popular. Not a show stopper, but something to consider when designing it.
Hope that helps alleviate what I call REST anxiety when designing an API.

Storing Online Users in Elixir

I am working on a chatroom (all-to-all) application in Elixir using an OTP GenServer. As a first phase, it receives messages from a JS client as users register with their names. Now, I'm just not sure of the best approach for storing these names on my Elixir server and sending regular updates to clients with the list of online users: in-memory or database storage? Please suggest the best approach.
I agree with bitwalker that ETS is a good fit.
Here's a short summary of what I did in production. It wasn't a chat server, but a server push with a couple of thousand users connecting via long polling. Pushed data was divided into some 50 categories, and users were able to choose which ones they wanted. At peak times the server pushed new messages every 2 seconds and processed > 2000 reqs/sec.
Essentially, I kept a gen_server for each user, where I held pending messages and the user's configuration (basically a list of selected channels). This was beneficial with long polling, since the user's data is decoupled from the user's requests: the data remains while requests are transient. However, I think this approach is also good for permanent connections such as websockets, since there might still be occasional disconnections, and keeping the user's data in a more stable process gives you a chance of resuming after a reconnect.
Obviously, when a request arrives, you need to find the user-specific process, and for this, ETS is a good fit, since you don't have a single-process bottleneck. Instead of manually working with ETS, I'd recommend using gproc in conjunction with via tuples. Basically, when starting a user's gen_server, you can provide name: {:via, :gproc, {:n, :l, key}}, where key is some custom key (an arbitrary term) you build from your internal user id (:n and :l indicate a unique name on the local node). You can then use that same via tuple when issuing calls/casts, and gen_server will use gproc to find the corresponding process.
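A minimal sketch of that registration pattern, assuming the gproc dependency is in place (the module name and key shape are made up):

defmodule Chat.UserSession do
  use GenServer

  # Each user's process is registered under a unique local gproc name.
  defp via(user_id), do: {:via, :gproc, {:n, :l, {:user_session, user_id}}}

  def start_link(user_id) do
    GenServer.start_link(__MODULE__, user_id, name: via(user_id))
  end

  # Callers address the process by user id; gproc resolves the pid.
  def push(user_id, message), do: GenServer.cast(via(user_id), {:push, message})

  def init(user_id), do: {:ok, %{user_id: user_id, pending: []}}

  def handle_cast({:push, message}, state) do
    {:noreply, %{state | pending: [message | state.pending]}}
  end
end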
Finally, you need some timeout/disconnect logic to clean up user processes. In my case, I simply terminated a user's process if there was no activity from the web layer (no end-user came for data in some time). Gproc automatically removes entries for terminated processes from its internal ETS table. It's probably best to supervise user processes with a temporary restart strategy.
I realize all of this is still a bit vague, but I hope it makes some sense. Keep in mind that this is not the ultimate pattern (there's no such thing of course), but I think it's a reasonable first attempt.
You may also want to take a look at the Phoenix web framework, which has an interesting pub-sub facility in the form of Topics. I haven't tried this out myself yet, but it seems interesting, and may even simplify some of the stuff I discussed above, or at least help with pushing notifications from the chatroom to all users.
Sounds like a good use case for ETS.
A simpler approach might be to use an Agent to store the online-users information, but it depends quite a lot on what you need from the storage mechanism you choose.
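For completeness, a minimal Agent-based sketch (the OnlineUsers name and the set-of-names shape are assumptions):

# Start once, e.g. under your supervision tree.
{:ok, _pid} = Agent.start_link(fn -> MapSet.new() end, name: OnlineUsers)

# Register/unregister users and read the current list.
username = "alice"
Agent.update(OnlineUsers, &MapSet.put(&1, username))
Agent.update(OnlineUsers, &MapSet.delete(&1, username))
online = Agent.get(OnlineUsers, &MapSet.to_list/1)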

Should an API assign and return a reference number for newly created resources?

I am building a RESTful API where users may create resources on my server using POST requests and later reference them via GET requests, etc. One thing I've had trouble deciding on is what IDs these resources should have. I know that there are many ways to do what I'm trying to accomplish, but I'd like to go with a design that follows industry conventions and best design practices.
Should my API decide on the ID for each newly created resource (it would most likely be the primary key for the resource assigned by the database)? Or should I allow users to assign their own reference numbers to their resources?
If I do assign a reference number to each new resource, how should it be returned to the client? The API has some endpoints which allow for bulk item creation, so would I need to list out all of the newly created resources in every response?
I'm conflicted because allowing the user to specify their own IDs is obviously a can of worms: I'd need to verify each ID hasn't been taken, and it makes database queries a lot weirder, as I'd be joining on reference number and user ID rather than on a foreign key. On the other hand, if I assign IDs to each resource, clients have to build some type of response parser and are forced to follow my imposed conventions.
Why not do both? Let the user create their own reference, and you create your own UID. If users have to log in, you can use their reference plus their user ID as a unique key. I would also return the created UID; if it's not needed, the client can simply ignore it.
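For example, a bulk-create response could echo the client's reference next to the server-assigned UID so the client can correlate the results (the field names here are just an assumption):

{
  "created": [
    {"clientReference": "order-1", "id": 1042},
    {"clientReference": "order-2", "id": 1043}
  ]
}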
It wasn't practical (for me) to develop both of the above methods in my application, so I took a leap of faith and allowed the user to choose their own IDs. I quickly found that this complicated development so much that it would have added weeks to my development time and resulted in much more complex and slower DB queries. So, early on in the project, I went back and made it so that I just assign IDs for all created resources.
Life is simple now.
Other popular APIs that I looked at, such as the Instagram API, also assign IDs to created resources, which is especially important if you have millions of users who can interact with each other's resources.

CakePHP: model-based permissions?

Struggling with a decision on how best to handle Client-level authentication with the following model hierarchy:
Client -> Store -> Product (Staff, EquipmentItem, etc.)
...where Client hasMany Stores, Store hasMany Products (hasMany Staff, hasMany EquipmentItem, etc.)
I've set up a HABTM relationship between User and Client, which is straightforward and accessible through the Auth session or a static method on the User model if necessary (see afterFind description below).
Right now, I'm waffling between two approaches. The first is to evaluate the results in each model's afterFind callback, checking the queried model's relationship to Client against the Clients that the current User is a member of: if the current model is Client, check the id; if it is Store, check Store.clientid; and if it is Product, get the parent Store from Item.storeid and check Store.clientid accordingly.
However, to keep in line with proper MVC, I return true or false from afterFind and then have to check the return value in the calling action. This is OK, but I can't think of a way to determine whether Model->find (or Model->read, etc.) is returning false because of an invalid id in the find or because of Client permissions in afterFind; it also means I'd have to modify every action.
The other method I've been playing with is to evaluate the request in app_controller.beforeFilter: by breaking the request down into controller/action/id, I can query the appropriate model(s) and evaluate the fields against the Auth.User.clients array to determine whether the User has access to the requested Client. This seems OK, but doesn't leave me any way (AFAIK) to handle /controller/index, and it seems logical that the index results should reflect Client membership.
Both approaches share a flaw: a lengthy list of conditional "rules" I need to break down to determine where the current model/action/id sits in the context of the client. All in all, both feel a little brittle and convoluted to me.
Is there a 3rd option I'm not looking at?
This sounds like a job for Cake's ACL. There is a bit of a learning curve, but once you figure it out, this method is very powerful and flexible.
Cake's ACLs (Access Control Lists) allow you to match users to controllers down to the CRUD (Create, Read, Update, Delete) level. Why use it?
1) The code is already there for you to use. The AuthComponent already has it built in.
2) It is powerful and integrated, allowing you to control permissions for every action in your site.
3) You will be able to find help from other cake developers who have already used it.
4) Once you get it set up the first time, it will be much easier and faster to implement full-site permissions on any other application.
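As a rough illustration (not a drop-in recipe), a per-request check in app_controller.php might look something like this; the ARO/ACO node paths depend entirely on how you build your ACL tree:

class AppController extends Controller {
    var $components = array('Auth', 'Acl');

    function beforeFilter() {
        // Ask the ACL tree whether the logged-in user may run this action.
        $aro = array('model' => 'User', 'foreign_key' => $this->Auth->user('id'));
        $aco = 'controllers/' . $this->name . '/' . $this->action;
        if (!$this->Acl->check($aro, $aco)) {
            $this->redirect($this->Auth->loginAction);
        }
    }
}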
Here are a few links:
http://bakery.cakephp.org/articles/view/how-to-use-acl-in-1-2-x
http://book.cakephp.org/view/171/Access-Control-Lists
http://blog.jails.fr/cakephp/index.php?post/2007/08/15/AuthComponent-and-ACL
Or you could just google for CakePHP ACL.