RESTful API for large "mash up" data pulls?

RESTful API for large "mash up" data pulls? - api

I have recently been tasked with a small project of setting up a periodic (somewhere between daily & weekly) data dump from an internal database to a 3rd party product. This project dovetails nicely with my company's desire (one which I share) to start standing up a formal service layer/API over the top of our data.
My personal preference is that those APIs should take on the form of RESTful endpoints - however, now I have what I think is a big design question - let me explain...
Looking at the data pull in question, it's hardly complicated. If I were just going to construct a one-off query, it would conceptually look a little something like:
select o.order_num, o.order_date, p.product_description, sr.sales_rep_name
from order o, line_item li, product p, sales_rep sr
where li.order_num = o.order_num
and li.product_id = p.product_id
and sr.sales_rep_id = o.sales_rep_id
and o.order_date >= [some arbitrary date]
Flipping my brain into "Resource Mode", I can think about how to convert this basic data model into URI's/payloads without too much trouble:
GET /orders/123
{
"order_num": 59324,
"order_date": "2014-07-07",
"sales_rep_uri": "/salesRep/34",
"line_items_uri": "/order/123/lineItems"
}
Getting more information about the sales rep:
GET /salesRep/34
{
"sales_rep_name": "Jane Doe",
"open_orders_uri": "/salesRep/34/orders"
}
Getting more information about line items:
GET /orders/123/lineItems
{
"line_items": [
{"order_uri": "/order/123", "product_uri": "/products/68"},
{"order_uri": "/order/123", "product_uri": "/products/99"}
]
}
And so on. I'm not saying it's a perfect API, I'm just trying to demonstrate it's not exactly rocket science to go about thinking how you might express the data model in a nicely normalized, resource-oriented type of way via RESTful URIs. But that is exactly where the design question comes into play...
On one hand, I can crank out a query to solve the problem very easily, but the very nature of queries requires the various domain concepts to be tightly coupled (in other words, utilizing joins to bring all of the normalized data together into one nice, custom-purpose de-normalized structure).
On the other hand, going through the mental process of thinking out a RESTful API leads me right back down that road of keeping things nicely compartmentalized - e.g. asking for "Order 123" shouldn't send me back this huge graph where I can see the full product description, the sales rep's phone number, etc, etc. The concept of a full blown HATEOAS-level RESTful API dictates consumers should be making subsequent GETs to drill down for that kind of detail only as-needed.
My question boils down to this: solving this use case seems really easy to do with a direct query and really difficult to do against a nice & tidy RESTful API (I'm picturing the literally 1000's of individual GETs it would take for me to assemble a weeks worth of data vs the few seconds it would take for the query to run). Is there some elegant subtlety of good RESTful design that I don't understand that would prevent me from seeing a good solution, or am I trying to fit a round peg into a square hole (i.e. REST is not good at pulling big data batches across multiple resources)?

I'm just going to throw this out there as a potential solution:
Conceptually, I treat the results of this query as a resource unto itself - like "orderReport".
Treating this as it's own resource, the API could behave something like:
GET /orderReport/[some arbitrary date]
You could then send back either a 201 Created (if the query is relatively quick running) with a location header like Location: /orderReport/[GUID]. Alternatively, if the query takes a while to run (I honestly don't know if it does or not off the top of my head), you could send back a 202 Accepted with a location header of Location: /orderReport/[GUID]/status.
You could then do follow up GETs against those URLs to get either the report status (200 OK if still processing without error, 201 Created with location header pointing to the report URL if its done) or the report itself.
There's nothing to say the report data couldn't also incorporate HATEOAS in addition to the data strictly needed to fulfill the use case requirements, like:
{
[
{
"order_num": 123,
"order_uri": "/orders/123",
"order_date": 2014-07-03,
"product_description": "widget",
"sales_rep_name": "Jane Doe",
"sales_rep_uri": "/salesRep/34"
},
{
"order_num": 456,
"order_uri": "/orders/456",
"order_date": 2014-07-04,
"product_description": "gadget",
"sales_rep_name": "Frank Smith",
"sales_rep_uri": "/salesRep/53"
}
]
}

Related

REST GET mehod: Can return a list of enriched resources?

I have a doubt when I'm designing a REST API.
Consider I have a Resource "Customer" with two elements in my server, like this:
[
{
name : "Mary",
description : "An imaginary woman very tall."
},
{
name : "John",
description : "Just a guy."
}
]
And I want to make an endpoint, that will accept a GET request with a query. The query will provide a parameter with a value that will make an algorithm count how many occurrences for this text are there in all of its parameters.
So if we throw this request:
GET {baseURL}/customers?letters=ry
I should get something like
[
{
name : "Mary",
description : "An imaginary woman very tall.",
count : 3
},
{
name : "John",
description : "Just a guy.",
count : 0
}
]
Count parameter can not be included in the resource scheme as will depend on the value provided in the query, so the response objects have to be enriched.
I'm not getting a list of my resource but a modified resource.
Although it keeps the idempotent condition for GET Method, I see it escapes from the REST architecture concept (even the REST beyond CRUD).
Is it still a valid endpoint in a RESTful API? or should I create something like a new resource called "ratedCustomer"?

REST GET mehod: Can return a list of enriched resources?
TL;DR: yes.
Longer answer...
A successful GET request returns a representation of a single resource, identified by the request-target.
The fact that the information used to create the representation of the resource comes from multiple entities in your domain model, or multiple rows in your database, or from reports produced by other services... these are all implementation details. The HTTP transfer of documents over a network application doesn't care.
That also means that we can have multiple resources that include the same information in their representations. Think "pages in wikipedia" that duplicate each others' information.
Resource identifiers on the web are semantically opaque. All three of these identifiers are understood to be different resources
/A
/A?enriched
/B
We human beings looking at these identifiers might expect /A?enriched to be semantically closer to /A than /B, but the machines don't make that assumption.
It's perfectly reasonable for /A?enriched to produce representations using a different schema, or even a different content-type (as far as the HTTP application is concerned, it's perfectly reasonable that /A be an HTML document and /A?enriched be an image).
Because the machines don't care, you've got additional degrees of freedom in how you design both you resources and your resource identifiers, which you can use to enjoy additional benefits, including designing a model that's easy to implement, or easy to document, or easy to interface with, or easy to monitor, or ....
Design is what we do to get more of what we want than we would get by just doing it.

Should the response body of GET all parent resource return a list of child resource?

Please bear with me if the title is a bit confusing, I will try my best to explain my question below.
Say I have the following two endpoints
api/companies (returns a list of all companies like below)
[{name: "company1", id: 1}, {name: "company2", id: 2}]
api/companies/{companyeId}/employees (returns a list of all employees for a specific company like below)
[{name: "employee1", id: 1}, {name: "employee2", id: 2}]
What the client side needs is a list of companies, each one of which has a list of employees. The result should looks like this:
[
{
name: "company1",
id: 1,
employees: [ {name: "employee1", id: 1}, {name: "employee2", id: 2} ]
},
{
name: "company2",
id: 2,
employees: [ {name: "employee3", id: 3}, {name: "employee4", id: 4} ]
},
]
There are two ways I can think of to do this:
Get a list of company first and loop through the company list to
make a api call for each company to get its list of employees. (I'm wondering if this is a better way of design because of HATEOAS principle if I understand correctly? Because the smallest unit of resource of api/companies is company but not employees so client is expected to discover companies as the available resource but not employees.)
a REST client should then be able to use server-provided links dynamically to discover all the available actions and resources it needs
Return a list of employees inside each company object and then return a list of companies through api/companies. Maybe add a query parameter to this endpoint called responseHasEmployees which is a boolean default to be false, so when user make a GET through api/companies?responseHasEmployees=true, the response body will have a list of employees inside each company object.
So my question is, which way is a better way to achieve the client side's goal? (Not necessarily has to be the above two.)
Extra info that might be helpful: companies and employees are stored in different tables, and employees table has a company_fk column.

Start by asking yourself a couple of questions:
Is this a common scenario?
Is it logical to request data in this way?
If so, it might make sense to make data available in this way.
Next, do you already have api calls that pass variables implemented?
Based on your HATEOAS principle, you probably shouldn't. Clients shouldn't need to know, or understand, variable values in your url.
If not, stay away from it. Make it as clean to the client side as possible. You could make a third distinct api "api/companiesWithEmployees" This fits your HATEOAS principle, the client doesn't need to know anything about parameters or other workings of the api, only that they will get "Companies with Employees".
Also, the cost is minimal; an additional method in the code base. It's simpler for the client side at a low cost.
Next think about some of the developmental consequences:
Are you opening the door to more specific api requests?
Are you able to maintain a hard line on data you want accessible through the api?
Are you able to maintain your HATEOAS principle in that the clients know everything they need to know based on the api url?
Next incorporate scenarios like this into future api design:
Can you preemptively make similar api calls available? ie (Customers and Orders, would you simply make a single api call available that gets the two related to each other?)
Ultimately, my answer to your question would be to go ahead and make this a new api call. The overhead for setting up, testing, and maintaining this particular change seem extremely small, and the likelihood of data being requested in this way appears high.

I assume that the client you build is going to have an interface to view a list of companies where there will be an option to view employees of the company. So it is best to do it by pull on demand and not load the whole data at once.
If you can consider a property of your resource as a sub-resource, do not add the whole sub-resource data into the main resource API. You may include a referral link which can be used by the client to fetch the sub-resource data.
Here, in your case,
Main-Resource - Companies
Sub-Resource - Employees
Company name, contact number, address - These are properties of the company object and not the sub-resource of a company, whereas, employees can be very well considered as sub-resource.

RESTful API - Correct behaviour when spurious/not requested parameters are passed in the request

We are developing a RESTful api that accepts query parameters in the request in the form of JSON encoded data.
We were wondering what is the correct behaviour when non requested/not expected parameters are passed along with the required ones.
For example, we may require that a PUT request on a given endpoint have to provide exactly two values respectively for the keys name and surname:
{
"name": "Jeff",
"surname": "Atwood"
}
What if a spurious key is passed too, like color in the example below?
{
"name": "Jeff",
"surname": "Atwood",
"color": "red"
}
The value for color is not expected, neither documented.
Should we ignore it or reject the request with a BAD_REQUEST 400 status error?
We can assert that the request is bad because it doesn't conform to the documentation. And probably the API user should be warned about it (She passed the value, she'll expects something for that.)
But we can assert too that the request can be accepted because, as the required parameters are all provided, it can be fulfilled.

Having used RESTful APIs from numerous vendors over the years, let me give you a "users" perspective.
A lot of times documentation is simply bad or out of date. Maybe a parameter name changed, maybe you enforce exact casing on the property names, maybe you have used the wrong font in your documentation and have an I which looks exactly like an l - yes, those are different letters.
Do not ignore it. Instead, send an error message back stating the property name with an easy to understand message. For example "Unknown property name: color".
This one little thing will go a long ways towards limiting support requests around consumption of your API.
If you simply ignore the parameters then a dev might think that valid values are being passed in while cussing your API because obviously the API is not working right.
If you throw a generic error message then you'll have dev's pulling their hair out trying to figure out what's going on and flooding your forum, this site or your phone will calls asking why your servers don't work. (I recently went through this problem with a vendor that just didn't understand that a 404 message was not a valid response to an incorrect parameter and that the documentation should reflect the actual parameter names used...)
Now, by the same token I would expect you to also give a good error message when a required parameter is missing. For example "Required property: Name is missing".
Essentially you want to be as helpful as possible so the consumers of your API can be as self sufficient as possible. As you can tell I wholeheartedly disagree with a "gracious" vs "stern" breakdown. The more "gracious" you are, the more likely the consumers of your API are going to run into issues where they think they are doing the right thing but are getting unexpected behaviors out of your API. You can't think of all possible ways people are going to screw up so enforcing a strict adherence with relevant error messages will help out tremendously.

If you do an API design you can follow two path: "stern" or "gracious".
Stern means: If you do anything I didn't expect I will be mad at you.
Gracious means: If I know what you want and can fulfil it I will do it.
REST allows for a wonderful gracious API design and I would try to follow this path as long as possible and expect the same of my clients. If my API evolves I might have to add additional parameters in my responses that are only relevant for specific clients. If my clients are gracious to me they will be able to handle this.
Having said that I want to add that there is a place for stern API design. If you are designing in an sensitive domain (e.g. cash transactions) and you don't want to leave room for any misunderstanding between the client and server. Imagine the following POST request (valid for your /account/{no}/transaction/ API):
{ amount: "-100", currency : "USD" }
What would you do with the following (invalid API request)?
{ amount: "100", currency : "USD", type : "withdrawal" }
If you just ignore the "type" attribute, you will deposit 100 USD instead of withdrawing them. In such a domain I would follow a stern approach and show no grace whatsoever.
Be gracious if you can, be stern if you must.
Update:
I totally agree with #Chris Lively's answer that the user should be informed. I disagree that it should always be an error case even the message is non-ambiguous for the referenced resource. Doing it otherwise will hinder reuse of resource representations and require repackaging of semantically identical information.

It depends on your documentation.. how strict you want to be .. But commonly speaking, Just ignore it. Most other servers also ignore request parameters it didn't understand.
Example taken from my previous post
Extra Query parameters in the REST API Url
"""Google ignore my two extra parameters here https://www.google.com/#q=search+for+something&invalid=param&more=stuff"""

Imagine I have the following JSON schema:
{
"frequency": "YEARLY",
"date": 23,
"month": "MAY",
}
The frequency attribute accepts "WEEKLY", "MONTHLY" and "YEARLY" value.
The expected payload for "WEEKLY" frequency value is:
{
"frequency": "WEEKLY",
"day": "MONDAY",
}
And the expected payload for "MONTHLY" frequency value is:
{
"frequency": "MONTHLY",
"date": 23,
}
Give the above JSON schema, typically I will have need a POJO containing frequency, day, date, and month fields for deserialization.
If the received payload is:
{
"frequency": "MONTHLY",
"day": "MONDAY",
"date": 23,
"year": 2018
}
I will throw an error on "day" attribute because I will never know the intention of the sender:
frequency: "WEEKLY" and day: "MONDAY" (incorrect frequency value entered), or
frequency: "MONTHLY" and date: 23
For the "year" attribute, I don't really have choice.
Even if I wish to throw an error for that attribute, I may not be able to.
It's ignored by the JSON serialization/deserialization library as my POJO has no such attribute. And this is the behavior of GSON and it makes sense given the design decision.
Navigating the Json tree or the target Type Tree while deserializing
When you are deserializing a Json string into an object of desired type, you can either navigate the tree of the input, or the type tree of the desired type. Gson uses the latter approach of navigating the type of the target object. This keeps you in tight control of instantiating only the type of objects that you are expecting (essentially validating the input against the expected "schema"). By doing this, you also ignore any extra fields that the Json input has but were not expected.
As part of Gson, we wrote a general purpose ObjectNavigator that can take any object and navigate through its fields calling a visitor of your choice.
Extracted from GSON Design Document

Just ignore them.
Do not give the user any chance to reverse engineer your RESTful API through your error messages.
Give the developers the neatest, clearest, most comprehensive documentation and parse only parameters your API need and support.

I will suggest that you ignore the extra parameters. Reusing API is a game changer in the integration world. What if the same API can be used by other integration but with slightly extra parameters?
Application A expecting:
{
"name": "Jeff",
"surname": "Atwood"
}
Application B expecting:
{
"name": "Jeff",
"surname": "Atwood",
"color": "red"
}
Simple get application application A to ignore "color" will do the job rather to have 2 different API to handle that.

How can I speed my Entity Framework code?

My SQL and Entity Framework knowledge is a somewhat limited. In one Entity Framework (4) application, I notice it takes forever (about 2 minutes) to complete one of my method calls. The first queries do not take much time, but when I loop through the Entity Framework objects returned by the queries, even though I am only reading (not modifying) the data I supposedly got, it takes forever to complete the nested loops, even though there are only dozens of entries in each list and a few levels of looping.
I expect the example below could be re-written with a fancier query that could probably include all of the filtering I am doing in my loops with some SQL words I don't really know how to use, so if someone could show me what the equivalent SQL expression would be, that would be extremely educational to me and probably solve my current performance problem.
Moreover, since other parts of this and other applications I develop often want to do more complex computations on SQL data, I would also like to know a good way to retrieve data from Entity Framework to local memory objects that do not have huge delays in reading them. In my LINQ-to-SQL project there was a similar performance problem, and I solved it by refactoring the whole application to load all SQL data into parallel objects in RAM, which I had to write myself, and I wonder if there isn't a better way to either tell Entity Framework to not keep doing whatever high-latency communication it is doing, or to load into local RAM objects.
In the example below, the code gets a list of food menu items for a member (i.e. a person) on a certain date via a SQL query, and then I use other queries and loops to filter out the menu items on two criteria: 1) If the member has a rating of zero for any group id which the recipe is a member of (a many-to-many relationship) and 2) If the member has a rating of zero for the recipe itself.
Example:
List<PFW_Member_MenuItem> MemberMenuForCookDate =
(from item in _myPfwEntities.PFW_Member_MenuItem
where item.MemberID == forMemberId
where item.CookDate == onCookDate
select item).ToList();
// Now filter out recipes in recipe groups rated zero by the member:
List<PFW_Member_Rating_RecipeGroup> ExcludedGroups =
(from grpRating in _myPfwEntities.PFW_Member_Rating_RecipeGroup
where grpRating.MemberID == forMemberId
where grpRating.Rating == 0
select grpRating).ToList();
foreach (PFW_Member_Rating_RecipeGroup grpToExclude in ExcludedGroups)
{
List<PFW_Member_MenuItem> rcpsToRemove = new List<PFW_Member_MenuItem>();
foreach (PFW_Member_MenuItem rcpOnMenu in MemberMenuForCookDate)
{
PFW_Recipe rcp = GetRecipeById(rcpOnMenu.RecipeID);
foreach (PFW_RecipeGroup group in rcp.PFW_RecipeGroup)
{
if (group.RecipeGroupID == grpToExclude.RecipeGroupID)
{
rcpsToRemove.Add(rcpOnMenu);
break;
}
}
}
foreach (PFW_Member_MenuItem rcpToRemove in rcpsToRemove)
MemberMenuForCookDate.Remove(rcpToRemove);
}
// Now filter out recipes rated zero by the member:
List<PFW_Member_Rating_Recipe> ExcludedRecipes =
(from rcpRating in _myPfwEntities.PFW_Member_Rating_Recipe
where rcpRating.MemberID == forMemberId
where rcpRating.Rating == 0
select rcpRating).ToList();
foreach (PFW_Member_Rating_Recipe rcpToExclude in ExcludedRecipes)
{
List<PFW_Member_MenuItem> rcpsToRemove = new List<PFW_Member_MenuItem>();
foreach (PFW_Member_MenuItem rcpOnMenu in MemberMenuForCookDate)
{
if (rcpOnMenu.RecipeID == rcpToExclude.RecipeID)
rcpsToRemove.Add(rcpOnMenu);
}
foreach (PFW_Member_MenuItem rcpToRemove in rcpsToRemove)
MemberMenuForCookDate.Remove(rcpToRemove);
}

You can use EFProf http://www.hibernatingrhinos.com/products/EFProf to track see exactly what EF is sending to SQL. It can also show you how many queries you are sending and how many unique queries. It also provides you some analysis of each query (e.g. is it unbound etc). Entity Framework with its navigation properties, it is quite easy to not realize you are making a db request. When you are in a loop, and have a navigation property, you get in to the N + 1 problem.

You could use the Keyword Virtual on your List parts of your model if you are using code first to enable proxying, that way you will not have to get all the data back at once, only as you need it.
Also consider NoTracking for read only data
context.bigTable.MergeOption = MergeOption.NoTracking;

Using Magento API to get Products

I'm using the Magento API to get product data for products from a certain category from another domain. I have made the API call etc... The code I'm currently using to get the product data looks like this:
$productList = $client->call($session, 'catalog_category.assignedProducts', 7);
foreach ($productList as $product){
$theProduct = array();
$theProduct['info'] = $client->call($session, 'catalog_product.info', $product['sku']);
$allProducts[] = $theProduct;
}
The code works fine, but it goes extremely slow. When I add the image call to the loop it takes about 50 seconds for the page to load, and that's for a site with only 5 products. What I want to know is the following:
Is the code above correct and it's just Magento's API script is very slow?
Is the code above not the best way of doing what I need?
Could there be any other factors making this go so slow?
Any help would be much appreciated. At least if I know I'm using the code right I can look at other avenues.
Thanks in advance!
================= EDIT =================
Using multicall suggested by Matthias Zeis, the data arrives much quicker. Here's the code I used:
$apicalls = array();
$i = 0;
$productList = $client->call($session, 'catalog_category.assignedProducts', 7);
foreach ($productList as $product){
$apicalls[$i] = array('catalog_product.info', $product['product_id']);
$i++;
}
$list = $client->multiCall($session, $apicalls);
This now works much quicker than before! The next issue I've found is that the catalog_product_attribute_media.list call doesn't seem to work in the same way, even though the products all have images set.
The error I'm getting in the var_dump is:
Requested image not exists in product images' gallery.
Anybody know why this may now be happening? Thanks again in advance.

1. Is the code above correct and it's just Magento's API script is very slow?
Your code is correct, but the script is slow because (a) the SOAP API is not blazingly fast and (b) you are doing seperate calls for every single product.
2. Is the code above not the best way of doing what I need?
If you use the SOAP v1 API or XML-RPC, you can test multiCall. At first, call catalog_category.assignedProducts to fetch the product ids. Collect the product ids and execute a multiCall call. That should cut the waiting time down quite a bit.
Unfortunately, Magento doesn't provide a nice solution out of the box to deliver the data like you need it. I recommend that you implement your own custom API call.
Use a product collection model:
$collection = Mage::getModel('catalog/product')->getCollection();
This will get you a Mage_Catalog_Model_Resource_Product_Collection object which can be used to filter, sort, paginate, ... your product list. Iterate over the collection and build an array containing the data you need. You also can generate thumbnails for your products directly while building the data array:
foreach ($products as $product) {
$data[$product->getSku()] = array(
/* the attributes no need ... */
'small_image' => Mage::helper('catalog/image')->init($product, 'image')
->constrainOnly(true)
->keepAspectRatio(true)
->keepFrame(false)
->resize(100,150)
->__toString(),
/* some more attributes ... */
);
}
This should give you quite a performance improvement.
But of course this only is the tip of the iceberg. If this solution is not fast enough for you, avoid SOAP and bypass a part of the Magento stack by building your own API. This doesn't have to be a complex solution: it could be a simple PHP script with HTTP Basic Authentication which parses the URL for filter criteria etc., includes app/Mage.php and calls Mage::app() to initialise the Magento framework. The benefit is that you have the comfort of using Magento classes but you don't have to go through the whole routing process.
Not to forget, you may cache the results because I could imagine that you will show the same products to quite a few visitors on the other domain. Even caching for a few minutes may help your server.
3. Could there be any other factors making this go so slow?
There may be some reasons why the calls are that slow on your server - but without knowing the volume of your data, your server hardware and the customisations you have done, even a best guess won't be that good.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas