Optimal Way to Match Policies in the PDP in a Distributed Environment (XACML)

Hi, I have gone through many use cases regarding XACML, but I don't know the best way to load policies into the PDP. As I understand the PDP workflow defined by OASIS, when an incoming request reaches the PDP, the PDP is responsible for matching the corresponding policies against that request.
Since the PDP is going to match each and every policy, think about a scenario where I have 10,000 policies stored in a distributed environment. What will happen then? Matching is going to consume more and more time, and that is not an efficient way of matching policies.
I need some clarification on these issues:
How do I distribute the policies across different servers?
If I distribute the policies across different servers, how will my PDP recognize and fetch the corresponding policy from the particular server?
What is the best way for the PDP to recognize the exact policy to match against the incoming request?

The syntactical way to handle situations where you have a huge number of policies (say, 10,000) is to use the "Target" element available at the PolicySet, Policy and Rule levels as judiciously as possible, so as to prune the decision tree as early as possible.
So suppose you know that, of the 10,000 policies, only 1,000 are for finance department operations; you could add a resource category attribute "dept-focus" and prune the tree by checking
target: resource.dept-focus == "finance"
Once that gets you into a pruned tree, and if you know that the finance department policies relate to 5 different applications (plus maybe some common department-wide policies), you could then prune further using an "app-id" attribute, and so on.
Of course, for this to work the PEPs need to add the appropriate values of these attribute IDs to the XACML request.
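As an illustration, the request a PEP sends might carry those extra attributes like the sketch below. This is only an approximation of the shape used by the XACML JSON profile, and the attribute IDs "dept-focus" and "app-id" are the made-up examples from above, so adjust both to whatever your PDP actually accepts.
// Hypothetical XACML request (roughly JSON-profile shaped) as built by a PEP.
const xacmlRequest = {
  Request: {
    AccessSubject: {
      Attribute: [{ AttributeId: "subject-id", Value: "alice" }],
    },
    Resource: {
      Attribute: [
        { AttributeId: "resource-id", Value: "invoice-4711" },
        { AttributeId: "dept-focus", Value: "finance" }, // prunes to the finance policy tree
        { AttributeId: "app-id", Value: "invoicing" },   // prunes further to one application
      ],
    },
    Action: {
      Attribute: [{ AttributeId: "action-id", Value: "view" }],
    },
  },
};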
A more deployment-centric solution would be to split the 10,000 policies into smaller chunks that are then deployed to separate PDP groups. Different vendors use different names for this concept (where it is available); Axiomatics, for whom I work, calls these authorization domains.
So you could deploy all the policies associated with the finance department to a set of PDPs that make up the "Finance" authorization domain, another set into the R&D domain, and so on. The PDPs within a domain can of course be load-balanced. Management can become a bit hairy if there is no central controller for these domains, something like the Axiomatics Services Manager.
Hope that helps.

Yes, if there are 10,000 policies stored, matching them can take considerable time.
If you think about horizontal scaling, you can still look at the following:
Caching all policies or their Target elements
Keeping the Target simple (just a string match)
Matching policies in parallel with multiple threads
I agree that, for 10,000 policies, we may need to look at vertical scaling. I assume you have defined policies based on the applications, so the application ID can be the Target element of your policies (it can be anything that helps to create a policy collection). To answer your questions:
Policies can be distributed based on the Target element (i.e. based on the application), so different servers hold different policies depending on the application ID. Basically, it is like one PDP per application. (The idea is to group the policies in some way that lets you distribute them into separate PDPs.)
There can be a central PDP hub: once a request is received, it checks the application ID and routes the message to the relevant PDP. It does not even have to be a PDP itself; it could be a router (such as an ESB) that looks at some parameter in the request and sends it to the relevant PDP.
As mentioned, it is better to have a central server that routes the requests.
Also, if you want parallel evaluation of distributed policies, that can be done with a PDP hub as well. Say you distribute your policies into 10 PDPs behind a PDP hub: when the hub receives a request, it sends the request to all 10 PDPs, and the 10 PDPs evaluate their policies in parallel. Once the responses are received, the hub aggregates the results of the 10 PDPs and sends the final result to the PEP.
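As a rough sketch of that hub idea (not tied to any product; the PDP endpoint URLs and the response field used here are assumptions, and a real deployment would more likely speak the XACML REST/JSON profiles), the hub fans the request out to every PDP in parallel and combines the decisions with a simplified deny-overrides rule:
// Minimal PDP-hub sketch: fan out to several PDPs in parallel, then combine their decisions.
const pdpEndpoints = [
  "https://pdp-finance.example.com/authorize", // hypothetical per-domain PDPs
  "https://pdp-rnd.example.com/authorize",
];
type Decision = "Permit" | "Deny" | "NotApplicable" | "Indeterminate";
async function evaluate(xacmlRequest: unknown): Promise<Decision> {
  // Send the same request to every PDP at the same time.
  const decisions = await Promise.all(
    pdpEndpoints.map(async (url): Promise<Decision> => {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(xacmlRequest),
      });
      const body = await res.json();
      return body.decision as Decision; // the response shape is an assumption
    })
  );
  // Simplified deny-overrides: any Deny wins, then any Permit, otherwise NotApplicable.
  if (decisions.includes("Deny")) return "Deny";
  if (decisions.includes("Permit")) return "Permit";
  return "NotApplicable";
}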


Optimizing GraphQL resolvers for SQL databases and in service-oriented architectures

My company has a service-oriented architecture. My app's GraphQL server therefore has to call out to other services to fulfill the data requests from the frontend.
Let's imagine my GraphQL schema defines the type User. The data for this type comes from two sources:
A user account service that exposes a REST endpoint for fetching a user's username, age, and friends.
A SQL database used just by my app to store User-related data that is only relevant to my app: favoriteFood, favoriteSport.
Let's assume that the user account service's endpoint automatically returns the username and age, but you have to pass the query parameter friends=true in order to retrieve the friends data because that is an expensive operation.
Given that background, the following query presents a couple optimization challenges in the getUser resolver:
query GetUser {
  getUser {
    username
    favoriteFood
  }
}
Challenge #1
When the getUser resolver makes the request to the user account service, how does it know whether or not it needs to ask for the friends data as well?
Challenge #2
When the resolver queries my app's database for additional user data, how does it know which fields to retrieve from the database?
The only solution I can find to both challenges is to inspect the query in the resolver via the fourth info argument that the resolver receives. This will allow it to find out whether friends should be requested in the REST call to the user account service, and it will be able to build the correct SELECT query to retrieve the needed data from my app's database.
Is this the correct approach? It seems like a use-case that GraphQL implementations must be running into all the time and therefore I'd expect to encounter a widely accepted solution. However, I haven't found many articles that address this, nor does a widely used NPM module appear to exist (graphql-parse-resolve-info is part of PostGraphile but only has ~12k weekly downloads, while graphql-fields has ~18.5k weekly downloads).
I'm therefore concerned that I'm missing something fundamental about how this should be done. Am I? Or is inspecting the info argument the correct way to solve these optimization challenges? In case it matters, I am using Apollo Server.
If you want to modify your resolver based on the requested selection set, there's really only one way to do that and that's to parse the AST of the requested query. In my experience, graphql-parse-resolve-info is the most complete solution for making that parsing less painful.
I imagine this isn't as common an issue as you'd think, because most folks fall into one of two groups:
Users of frameworks or libraries like PostGraphile, Hasura, Prisma, Join Monster, etc., which take care of optimizations like these for you (at least on the database side).
Users who are not concerned about overfetching on the server-side and just request all columns regardless of the selection set.
In the latter case, fields that represent associations are given their own resolvers, so those subsequent calls to the database won't be fired unless they are actually requested. Data Loader is then used to help batch all these extra calls to the database. Ditto for fields that end up calling some other data source, like a REST API.
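For example, a per-request DataLoader for a REST-backed friends field might look roughly like the sketch below; the userAccountService client is a placeholder for your own wrapper around the user account service.
import DataLoader from "dataloader";
// Placeholder client for the user account service described in the question.
declare const userAccountService: {
  getUsers(ids: string[], opts: { friends: boolean }): Promise<{ id: string; friends: string[] }[]>;
};
const friendsLoader = new DataLoader<string, string[]>(async (userIds) => {
  // One batched call for all requested users instead of one call per resolved field.
  const users = await userAccountService.getUsers([...userIds], { friends: true });
  // DataLoader expects results in the same order as the input keys.
  return userIds.map((id) => users.find((u) => u.id === id)?.friends ?? []);
});
const resolvers = {
  User: {
    // This resolver only fires when the query actually selects `friends`.
    friends: (user: { id: string }) => friendsLoader.load(user.id),
  },
};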
In this particular case, Data Loader would not be much help to you. The best approach is to have a single resolver for getUser that fetches the user details from the database and the REST endpoint. You can then, as you're already planning, adjust those calls (or skip them altogether) based on the requested fields. This can be cumbersome, but will work as expected.
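Here is a hedged sketch of that single getUser resolver, inspecting info directly and ignoring fragments and aliases for brevity (graphql-parse-resolve-info would handle those for you). The db and userAccountService clients, the column names, and the ctx.userId lookup are placeholders, not part of your actual schema.
import { FieldNode, GraphQLResolveInfo } from "graphql";
// Placeholder data sources; swap in your own clients.
declare const db: {
  selectUser(id: string, columns: string[]): Promise<Record<string, unknown>>;
};
declare const userAccountService: {
  getUser(id: string, opts: { friends: boolean }): Promise<Record<string, unknown>>;
};
const APP_DB_FIELDS = new Set(["favoriteFood", "favoriteSport"]);
const resolvers = {
  Query: {
    async getUser(_parent: unknown, _args: unknown, ctx: { userId: string }, info: GraphQLResolveInfo) {
      // Collect the field names requested directly under getUser.
      const selections = info.fieldNodes[0].selectionSet?.selections ?? [];
      const requested = selections
        .filter((s): s is FieldNode => s.kind === "Field")
        .map((s) => s.name.value);
      // Challenge #1: only ask the account service for friends if they were requested.
      const wantsFriends = requested.includes("friends");
      const accountData = await userAccountService.getUser(ctx.userId, { friends: wantsFriends });
      // Challenge #2: only SELECT the app-specific columns that were requested.
      const dbColumns = requested.filter((f) => APP_DB_FIELDS.has(f));
      const appData = dbColumns.length ? await db.selectUser(ctx.userId, dbColumns) : {};
      return { ...accountData, ...appData };
    },
  },
};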
The alternative to this approach is to simply fetch everything, but use caching to reduce the number of calls to your database and REST API. This way, you'll fetch the complete user each time, but you'll do so from memory unless the cache is invalidated or expires. This is more memory-intensive, and cache invalidation is always tricky, but it does simplify your resolver logic significantly.

Yodlee APIs: ContentServiceInfos versus SiteInfos

There appear to be two lines of APIs for adding, authenticating and aggregating sites. Which one you start with depends on which version of the documentation/SDK set your rep started you off on, or where in the SDK Guide you began implementing.
Path #1 starts at
ContentServiceTraversal, which allows for the retrieval of all ContentServiceInfo entries (by container type, such as BANK)
ItemManagementService is used to add these items
Refresh is done through RefreshService (most of these APIs do not contain the word Site)
Path #2 starts at
SiteTraversalService, which allows for the retrieval of all SiteInfo entries (no apparent support for a container type filter)
SiteAccountManagementService is used to add these items
Refresh is done through RefreshService (these APIs all contain the word Site).
From the best that I can tell, the aforementioned APIs have a lot of duplicated functionality. I have noticed certain APIs that exist on one branch and not the other, but usually these are minor differences (e.g. the things you are able to filter by).
I started off with ContentServiceInfo because the documentation and samples that our rep initially gave us started there. Additionally, this API provided greater granularity from the start (e.g. simply being able to filter by container type, since we were pretty much only interested in Bank and Processor sites, which I do not believe you support).
My questions are:
Do the two branches of API do the exact same thing?
Do they mostly behave the same way?
Do they back-end to the exact same
System
Data store
Scraper?
Is one line of API supposed to be deprecated sooner in the future than another?
Does one line of API have more future in terms of actually adding new or augmenting existing functionality?
Site-level addition was introduced in the Yodlee APIs to overcome the fact that, even though a user might have bank, credit card, loan and rewards accounts at the same end site, the user had to provide credentials for each of these containers. The site-level addition APIs try to add all these containers with only one set of credentials. That is the only difference between container-based addition and site-based addition.
To answer your questions:
Do the two branches of API do the exact same thing?
Do they mostly behave the same way?
If you mean the aggregation functionality, yes. Except for the fact that site-level addition adds/refreshes all the containers (bank, credit card, loan, rewards) while container-level addition can add/refresh only one container per API call, all the other behavior remains the same.
Do they back-end to the exact same
System
Data store
Scraper?
If you are referring to the Yodlee data-gathering components, yes.
Is one line of API supposed to be deprecated sooner in the future than another?
No. These two sets of APIs cater to different needs. If you are a company that relies solely on credit card data, using site-level addition would be overkill, as the aggregation will take longer; it makes more sense to use container-based addition. There is also the factor of backward compatibility, which rules out deprecation of the APIs.

Client and server-side validation with RESTful APIs

Let's assume I have a POST /orders operation that takes as input a collection of order items. An order can't contain more than 50 items, but where do I perform this validation?
Validating the order size in both the client and the server would be redundant, and would increase the maintenance cost if I decide to change the order size limit.
Validating it only in the server would prevent clients from "failing fast" (i.e., you add a thousand items to the order and are informed of the limit only when completing it).
I'm assuming client-only validation is not an option, as the API may have other clients.
The problem gets more complicated if I have dynamic validation rules. Suppose retail customers can have 50 items in their orders, but wholesale customers can have 500. Should the API expose an operation so clients can fetch the current validation rules?
You have to do both, but differently.
To guarantee valid operations, all critical validation must happen on the server/web service side. The client side UI is just that - a user interface to make interacting with the web service convenient for a person. Once the web service is stable and secure, create a default method to pass web service errors through the client to the user. After that, features in the UI layer are usability issues and should be based on testing (even if that is informal testing by watching over a user's shoulder or listening to feedback.)
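As a small illustration of that split, here is an Express-style sketch of the server-side check; the route, the 50-item limit and the error payload are assumptions used only to make the idea concrete.
import express from "express";
const MAX_ORDER_ITEMS = 50; // the single source of truth lives on the server
const app = express();
app.use(express.json());
app.post("/orders", (req, res) => {
  const items: unknown[] = req.body?.items ?? [];
  // Critical validation happens here, regardless of what the client already checked.
  if (items.length > MAX_ORDER_ITEMS) {
    return res.status(422).json({
      error: "ORDER_TOO_LARGE",
      message: `An order cannot contain more than ${MAX_ORDER_ITEMS} items.`,
      limit: MAX_ORDER_ITEMS, // lets a generic client surface the rule to the user
    });
  }
  // ...create the order...
  return res.status(201).json({ id: "new-order-id" }); // placeholder response
});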
I agree with what was said before.
That said, I think if you can predict almost every situation a user may run into, you can also add client-side validation.
As per your example about wholesale/retail, you could first create a drop-down that asks the client to choose whether they are wholesale or retail, and then apply the 500/50 rule to the input box based on that choice.
The obvious problem is that if your API is released to other developers, they may not be aware of the 50/500 rule, and that is where I agree with the previous answer about critical validation happening on the server. If you are building the API for your own use, you could go either way, because you are aware of the input restrictions. It will also save quite a bit on server costs if the app is very big (validation on the server will be taxing).
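On the dynamic-rules part of the question, one option is to let clients fetch the current limit rather than hard-coding the 50/500 rule. A minimal sketch, where the route, the payload shape and the getCustomerType helper are all hypothetical:
import express, { Request } from "express";
const app = express();
// Hypothetical helper that derives the customer type from the authenticated user.
declare function getCustomerType(req: Request): "retail" | "wholesale";
// Clients call this first so they can fail fast with the right limit.
app.get("/orders/validation-rules", (req, res) => {
  res.json({ maxOrderItems: getCustomerType(req) === "wholesale" ? 500 : 50 });
});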

The REST way to check/uncheck, like/unlike, or favorite/unfavorite a resource

Currently I am developing an API, and within that API I want signed-in users to be able to like/unlike or favorite/unfavorite two resources.
My "Like" model (it's a Ruby on Rails 3 application) is polymorphic and belongs to two different resources:
/api/v1/resource-a/:id/likes
and
/api/v1/resource-a/:resource_a_id/resource-b/:id/likes
The thing is: I am in doubt about which way to choose to make my resources as RESTful as possible. I have already tried the following two ways of implementing the like/unlike structure in my URLs:
Case A: (like/unlike being the member of the "resource")
PUT /api/v1/resource/:id/like maps to Api::V1::ResourceController#like
PUT /api/v1/resource/:id/unlike maps to Api::V1::ResourceController#unlike
and Case B: ("likes" is a resource of its own)
POST /api/v1/resource/:id/likes maps to Api::V1::LikesController#create
DELETE /api/v1/resource/:id/likes maps to Api::V1::LikesController#destroy
In both cases I already have a user session, so I don't have to mention the id of the corresponding "like"-record when deleting/"unliking".
I would like to know how you guys have implemented such cases!
Update April 15th, 2011: With "session" I mean HTTP Basic Authentication header being sent with each request and providing encrypted username:password combination.
I think the fact that you're maintaining application state on the server (user session that contains the user id) is one of the problems here. It's making this a lot more difficult than it needs to be and it's breaking a REST's statelessness constraint.
In Case A, you've given URIs to operations, which again is not RESTful. URIs identify resources and state transitions should be performed using a uniform interface that is common to all resources. I think Case B is a lot better in this respect.
So, with these two things in mind, I'd propose something like:
PUT /api/v1/resource/:id/likes/:userid
DELETE /api/v1/resource/:id/likes/:userid
We also have the added benefit that a user can only register one 'Like' (they can repeat that 'Like' as many times as they like, and since the PUT is idempotent it has the same result no matter how many times it's performed). DELETE is also idempotent, so if an 'Unlike' operation is repeated many times for some reason then the system remains in a consistent state. Of course you can implement POST in this way, but if we use PUT and DELETE we can see that the rules associated with these verbs seem to fit our use-case really well.
I can also imagine another useful request:
GET /api/v1/resource/:id/likes/:userid
That would return details of a 'Like', such as the date it was made or the ordinal (i.e. 'This was the 50th like!').
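A minimal Express-style sketch of those idempotent endpoints is below; the in-memory likes set is just a stand-in for the polymorphic Like model (in the question's Rails app this would be routes plus a LikesController).
import express from "express";
const app = express();
// In-memory stand-in for the Like records, keyed as "resourceId:userId".
const likes = new Set<string>();
// PUT is idempotent: liking the same resource twice leaves exactly one 'Like'.
app.put("/api/v1/resource/:id/likes/:userid", (req, res) => {
  likes.add(`${req.params.id}:${req.params.userid}`);
  res.sendStatus(204);
});
// DELETE is idempotent: unliking something you never liked still leaves a consistent state.
app.delete("/api/v1/resource/:id/likes/:userid", (req, res) => {
  likes.delete(`${req.params.id}:${req.params.userid}`);
  res.sendStatus(204);
});
// GET returns details of a single 'Like' (404 if it does not exist).
app.get("/api/v1/resource/:id/likes/:userid", (req, res) => {
  const key = `${req.params.id}:${req.params.userid}`;
  if (!likes.has(key)) return res.sendStatus(404);
  return res.json({ resource: req.params.id, user: req.params.userid });
});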
Case B is better, and here is a good example from the GitHub API.
Star a repo
PUT /user/starred/:owner/:repo
Unstar a repo
DELETE /user/starred/:owner/:repo
You are in effect defining a "like" resource, a fact that a user resource likes some other resource in your system. So in REST, you'll need to pick a resource name scheme that uniquely identifies this fact. I'd suggest (using songs as the example):
/like/user/{user-id}/song/{song-id}
Then PUT establishes a liking, and DELETE removes it. GET of course finds out if someone likes a particular song. And you could define GET /like/user/{user-id} to see a list of the songs a particular user likes, and GET /like/song/{song-id} to see a list of the users who like a particular song.
If you assume the user name is established by the existing session, as #joelittlejohn points out, and is not part of the like resource name, then you're violating REST's statelessness constraint and you lose some very important advantages. For instance, a user can only get their own likes, not their friends' likes. Also, it breaks HTTP caching, because one user's likes are indistinguishable from another's.

eCommerce Third Party API Data Best Practice

What would be best practice for the following situation? I have an ecommerce store that pulls down inventory levels from a distributor. Should the site call the third-party API every time a user loads a product detail page, to get the most up-to-date data? Or should the site call the third-party API, store that data in its own system for a certain amount of time, and update it periodically?
To me it seems obvious that it should be updated every time the product detail page is loaded, but what about high-traffic ecommerce stores? Are completely different solutions used in that case?
In this case I would definitely cache the results from the distributor's site for some period of time, rather than hitting them every time you get a request. However, I would not simply use a blanket 5 minute or 30 minute timeout for all cache entries. Instead, I would use some heuristics. If possible, for instance if your application is written in a language like Python, you could attach a simple script to every product which implements the timeout.
This way, if it is an item that is requested infrequently, or one that has a large amount in stock, you could cache for a longer time.
# Per-product rule: expire the cached stock level early for popular or low-stock items.
if product.popularityrating > 8 or product.lastqtyinstock < 20:
    cache.expire(productnum)
    distributor.checkstock(productnum)
This gives you flexibility that you can call on if you need it. Initially, you can set all the rules to something like:
cache.expireover("3m",productnum)
distributor.checkstock(productnum)
In actual fact, the script would probably not include the checkstock function call, because that would be in the main app, but it is included here for context. If Python seems too heavyweight to include just for this small amount of flexibility, then have a look at Tcl, which was specifically designed for this type of job. Both can be embedded easily in C, C++, C# and Java applications.
Actually, there is another solution. Your distributor keeps the product catalog on their servers and gives you access to it via an Open Catalog Interface (OCI). When a user wants to place an order, they get redirected in-place to the distributor's catalog, choose items, and then transfer the selection back to your shop.
It is widely used in SRM (Supplier Relationship Management) branch.
It depends on many factors: the traffic to your site, how often the inventory levels change, the business impact of displaying outdated data, how often the suppliers allow you to call their API, their API's SLA in terms of availability and performance, and so on.
Once you have these answers, there are of course many possibilities here. For example, for a low-traffic site where getting the inventory right is important, you may want to call the 3rd-party API on every call, but revert to some alternative behavior (such as using cached data) if the API does not respond within a certain timeout.
Sometimes, well-designed APIs will include hints as to the validity period of the data. For example, some REST-over-HTTP APIs support the various HTTP cache-control headers, which can be used to specify a validity period, or to only retrieve data if it has changed since the last request.
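For example, if the supplier's API sends ETag and Cache-Control headers, a client can honour the advertised validity period and revalidate instead of re-downloading. This is only a sketch: the endpoint URL and its header support are assumptions.
// Conditional GET sketch: respect Cache-Control max-age, then revalidate with If-None-Match.
type CacheEntry = { etag: string | null; body: unknown; fetchedAt: number; maxAgeMs: number };
const inventoryCache = new Map<string, CacheEntry>();
async function getInventory(productId: string): Promise<unknown> {
  const url = `https://distributor.example.com/inventory/${productId}`; // hypothetical endpoint
  const cached = inventoryCache.get(url);
  // Still fresh according to max-age: no network call at all.
  if (cached && Date.now() - cached.fetchedAt < cached.maxAgeMs) return cached.body;
  const res = await fetch(url, {
    headers: cached?.etag ? { "If-None-Match": cached.etag } : {},
  });
  // 304 Not Modified: reuse the cached body, just refresh its timestamp.
  if (res.status === 304 && cached) {
    cached.fetchedAt = Date.now();
    return cached.body;
  }
  const maxAge = /max-age=(\d+)/.exec(res.headers.get("cache-control") ?? "");
  const entry: CacheEntry = {
    etag: res.headers.get("etag"),
    body: await res.json(),
    fetchedAt: Date.now(),
    maxAgeMs: maxAge ? Number(maxAge[1]) * 1000 : 0,
  };
  inventoryCache.set(url, entry);
  return entry.body;
}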