Using HTTP for a REST API: automatically cacheable?

I was wondering: to make a "RESTful API" you need to satisfy the six architectural constraints listed below:
http://en.wikipedia.org/wiki/Representational_state_transfer#Architectural_constraints
Is it safe to state that when you are creating a REST API over the HTTP protocol, the "cacheable" constraint is automatically satisfied, because HTTP already provides a cache system "out of the box" through HTTP headers? http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
So there's no need to worry about that anymore?
This may sound like a stupid question, but I want to be sure. :-)
Kind regards!
K.

Let me expand a bit on the challenges of creating correct caching logic:
Typically, the backend of the API is a database holding all kinds of little pieces of information.
The typical presentation within a REST API can be an accumulated view (say, a user's activity log containing a list of the last user actions within a portal, something along those lines).
Now, in order to know whether your API URL /user/123/activity has changed (after the timestamp the client is sending you in the If-Modified-Since header), you would have to check whether there have been any additional activities after the last request. The overhead of doing that might be the same as simply fetching the result again. So, in a lot of cases, people just don't bother, which is a shame, as proper caching can have a huge impact on web app performance.
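Whether that check is cheap depends entirely on your schema. As a minimal sketch, assuming a hypothetical activities table with an indexed created_at column (these names are invented for illustration), the freshness test can sometimes be reduced to a single MAX() lookup instead of rebuilding the whole view:
```python
import sqlite3
from datetime import datetime, timezone

def activity_modified_since(conn: sqlite3.Connection, user_id: int,
                            if_modified_since: datetime) -> bool:
    # One indexed MAX() lookup instead of rebuilding the whole activity log.
    row = conn.execute(
        "SELECT MAX(created_at) FROM activities WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if row[0] is None:  # user has no activity at all
        return False
    # Assumes timestamps are stored as ISO 8601 strings in UTC.
    last_change = datetime.fromisoformat(row[0]).replace(tzinfo=timezone.utc)
    return last_change > if_modified_since
```
Whether this is actually cheaper than refetching depends on your indexes and on how accumulated the view really is, which is exactly the trade-off described above.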
Maybe this gives a bit more detail,
Jan

You are correct: HTTP already gives you the means to identify cacheable elements, but as your API will be generated by some server-side logic, you will still need to make sure the code "behind" your API sets the right HTTP headers and, in an ideal world, is ready and able to react to If-Modified-Since requests.
Creating a reliable Last-Modified timestamp, as well as checking against it reliably, is actually quite a feat ;-)
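To make that concrete, here is a rough sketch of the "ideal world" behaviour using Flask; the route and the get_last_activity_time / render_activity_log helpers are hypothetical stand-ins for your own code, not anything from the original question:
```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/user/<int:user_id>/activity")
def activity(user_id):
    # get_last_activity_time must return an aware UTC datetime truncated to
    # whole seconds (HTTP dates have one-second resolution) for the
    # comparison below to behave.
    last_modified = get_last_activity_time(user_id)
    ims = request.if_modified_since  # parsed by Werkzeug, or None
    if ims is not None and last_modified <= ims:
        return "", 304               # the client's cached copy is still fresh
    resp = jsonify(render_activity_log(user_id))
    resp.last_modified = last_modified  # sets the Last-Modified header
    return resp
```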
Hope this helps a bit,
Jan

What kind of information can be collected about "website third parties"?

I have collected all the requests made by websites, with the aim of identifying third parties through the requests a website makes. I used Selenium WebDriver to do that.
These requests can be made by the JavaScript present in the source code of the website, can be triggered dynamically by the web page through advertisements, or can be initiated by services such as Google, DoubleClick, or Facebook. These requests help to track the data that is being shared by these websites, with or without the user's consent.
You can see an example of the requests made when the browser loads www.focuscamera.com/ in this Excel file:
https://drive.google.com/file/d/16wNA0dFUehrjPww31TAIj8GZUZ05LsIU/view?usp=sharing
My questions are:
1. Which kind of HTTP header field can be used for my analysis if I intend to gather information about third parties? My goal is to distinguish and differentiate third-party behavior.
For example, the Content-Length field in a request indicates the size of the entity body. So does a request with a higher Content-Length mean that the third party received and collected more data?
2. What exactly does Content-Length indicate? What exactly does the "HTTP request body" contain?
3. Are there any other HTTP header fields that I can use to distinguish and differentiate third-party behavior? (A list of the fields I collect can be found in Sheet 1 of the Excel file shared above.)
4. Is there any other information on the internet that I can use to distinguish and differentiate third-party behavior? For example, I use cookiepedia.co.uk to find out what kind of service a third party provides: functionality, performance, or targeting/advertising.
It sounds like you may be reinventing the wheel here. Take a look at https://webbkoll.dataskydd.net; it provides lots of security and privacy analysis for any site you like. You can also generate nice visual request maps using https://requestmap.webperf.tools.
Try using that tool on sites like wired.com and forbes.com to see how spectacularly bad it can get!
To answer your questions specifically:
Headers are not massively useful on their own (it's the request itself that's more interesting), but the important ones from a privacy perspective are Referer and Set-Cookie. Content-Length does indeed tell you how big the request body is; it will always be 0 on a GET request and so is usually omitted. Large POST requests indicate that more data is being transmitted, but that may be down to inefficiency rather than anything else.
Content-Length indicates the length (in bytes) of the body of a request, most commonly a POST. An HTTP request body can contain any kind of data: text, images, video, audio, or other formatted data.
There are some, but most headers are functional rather than semantic, concerned with making the request actually work. It's more interesting that requests happen at all than what they contain.
You can't necessarily tell what kind of service a third party is providing from the requests themselves, but the domains they go to are more interesting. For example, anything going to doubleclick.com is going to be ad- and tracking-related because of what that domain is known to be used for (Webbkoll cites these as "known trackers"). So you're correct that sites like Cookiepedia can help you find out what a particular service does.
The divisions between functional/performance/profiling are mostly made up by ad companies to excuse their behaviour. You can't tell what they are using data for, only whether they are receiving data and what data they are receiving (because you can see what's in the requests they make using browser developer tools). To clarify: a site could receive your full name and address but do absolutely nothing with it, and you can't tell that from looking at the data that's sent. In privacy terms, it's always best to assume the worst (because ad companies absolutely cannot be trusted!), so if they are receiving data, assume it will be abused.
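If it helps, here is a small sketch of the kind of per-domain aggregation suggested above; the row fields "url" and "content_length" are assumptions about your Selenium export, not a standard format:
```python
from collections import defaultdict
from urllib.parse import urlparse

def bytes_sent_per_domain(requests_log):
    """Sum request body sizes per destination domain, largest first."""
    totals = defaultdict(int)
    for row in requests_log:
        domain = urlparse(row["url"]).hostname or "unknown"
        totals[domain] += int(row.get("content_length") or 0)
    # Big totals suggest more data transmitted, but (as noted above) that can
    # just as easily be inefficiency as deliberate data collection.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```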

Moderate content using HTTP DELETE

For a platform using a mostly-RESTful HTTP API to moderate many types of content, I am wondering if having clients call DELETE on the same endpoint they used to create the content makes sense.
The API would identify the client as either the content's creator, a platform moderator, or a regular user.
In the case of the first two, the content would be immediately deleted, but in the case of the regular user, the content would be flagged for review and essentially be deleted only for that user.
This is as opposed to POSTing to /flag and /remove endpoints for each type of content, which requires additional routes and other overhead.
Update: The real question here is:
Does it make sense to use HTTP DELETE to moderate content in the way described? Will that lead to future complications?
I'm assuming clients created the content by a PUT request to an endpoint of their choice.
From the client viewpoint, I don't see any obvious problems with the approach. In fact, this is exactly how DELETE is intended to be used in remote authoring applications, but there are some minor issues that depend on how much information you want the clients to have.
Do you want the regular user to know his resource is flagged for deletion, or do you want that to be completely transparent? If the former, the DELETE request should return 202 Accepted and some description of the status, and a further GET request might inform the client of the pending deletion in some way. If you don't care about that, you can simply return 404 Not Found or 410 Gone, but then you might have to deal with the possibility of the client creating new content at the same endpoint while the deletion is still pending. That may or may not be a problem, depending on your implementation of the PUT semantics.
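As a sketch of those role-dependent DELETE semantics (current_role, hard_delete and mark_for_review are hypothetical stand-ins for your auth and storage layers):
```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/posts/<post_id>", methods=["DELETE"])
def delete_post(post_id):
    role = current_role(post_id)      # "creator", "moderator" or "user"
    if role in ("creator", "moderator"):
        hard_delete(post_id)
        return "", 204                # deleted outright
    mark_for_review(post_id)          # hidden for this user, flagged for mods
    # 202 tells the client the request was accepted but not yet acted upon;
    # return 404/410 here instead if the flagging should stay invisible.
    return jsonify({"status": "pending review"}), 202
```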

Consuming a REST API endpoint from a resource ID

Let's consider the following flow for a RESTful API:
API root
|
v
user list
|
v
user details
|
v
user messages
Suppose I have a client to consume the API, and I want to retrieve messages from a user with ID 42.
From what I've been studying, my client is not supposed to know how to "build" urls, and it should follow the links given by the API.
How should I do to retrieve messages for the user with ID 42?
The only way I can think of is to "walk" the whole API from its root to the user's messages, which doesn't look very pretty or efficient to me.
E.g.:
1 - GET / and get the link to the list of users
2 - GET /user/?id=42 and get the link to details of the user with the ID 42
3 - GET /user/42/ and get the link to user 42 list of messages
4 - GET /user/42/messages/ and finally get the user messages
Did I get something wrong? Is this the right way according to Roy Fielding's paper?
Or is it ok to just assume the messages url is "/user/{id}/messages/" and make the request directly?
Use URL templates in your API root. Let the client consume the API root at runtime. It should look for a URL template named something like "user-messages" with the value "/user/{userid}/messages/". Then let the client substitute "42" for "{userid}" in the template and do a GET on the resulting URL. You can add as many of these URL templates as you want for all of the required, often-used use cases.
The difference between this solution and a "classic" web API is the late binding of URLs: the client reads the API root with its templates at runtime - as opposed to compiling the client with the knowledge of the URL templates.
Take a look at the HAL media type for some information about URL templates: http://stateless.co/hal_specification.html
I wrote this piece here some time ago to explain the benefits of hypermedia: http://soabits.blogspot.dk/2013/12/selling-benefits-of-hypermedia.html
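As a minimal client-side sketch of that late binding (the root document shape and the "user-messages" template name are assumptions for illustration; a real client should use an RFC 6570 URI Template library rather than str.replace):
```python
import requests

def get_user_messages(api_root: str, user_id: str):
    # Fetch the root document at runtime instead of hard-coding URLs.
    root = requests.get(api_root).json()
    # e.g. root["templates"] == {"user-messages": "/user/{userid}/messages/"}
    template = root["templates"]["user-messages"]
    url = api_root.rstrip("/") + template.replace("{userid}", user_id)
    return requests.get(url).json()

messages = get_user_messages("https://api.example.com/", "42")
```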
I believe your real concern is whether you should implement HATEOAS or not. As it's an integral part of the REST constraints, the recommendation is that each entity should have a link to the child entities it encompasses. In your case, the API root should show the list of users, with each user having a link (/root/users/{id}) to the corresponding user's details. Each user-details entity will contain a link to the list of messages (/root/users/{id}/messages), which in turn encompasses links to the actual message details as well (/root/users/{id}/messages/{messageId}). This concept is extremely useful (and thus part of the constraints) because the client doesn't need to know the URL where your entity is exposed. For example, if your users were on http://users.abc.com/rest/users/{id} but your messages were on http://messages.abc.com/rest/{userId}/messages/{messageId}, the user entity that encompasses the list of messages would already have a link embedded pointing to the right resource on a different server.
That being said, I haven't actually seen many REST implementations out there where HATEOAS is used widely (I must admit I don't have too much experience, but enough to give an opinion). In most cases the resources are almost always on the same server (environment) and the paths to resources are almost always relative to the root URL. Thus, it doesn't make much sense for clients to parse the embedded links out of the object when they can generate one themselves, especially when a client wants to access a resource directly (viewing a message directly, without fetching the user entity, when you already know the messageId).
In the end, it all depends on how close you want your REST implementation to be to the specification and what kind of clients you are going to have. My 2 cents: if you have time, implement REST with HATEOAS and feel proud about it :). There are libraries out there that will make HATEOAS somewhat transparent to your REST implementation (I believe Spring has one, although it's not very mature; you can look at it here). If you are like me and don't have much time to go that route, I think you can continue with a normal REST implementation without HATEOAS and your clients will still be OK with it (or so I hope!)
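For illustration, one possible shape for such an entity with embedded links; the key names ("links", "rel", "href") and the domains are illustrative, not prescribed by REST:
```python
user_details = {
    "id": "42",
    "name": "Jane Doe",
    "links": [
        {"rel": "self",     "href": "https://users.abc.com/rest/users/42"},
        # The messages link can point at an entirely different server, and a
        # hypermedia client follows it without knowing how it was built:
        {"rel": "messages", "href": "https://messages.abc.com/rest/42/messages/"},
    ],
}

messages_url = next(link["href"] for link in user_details["links"]
                    if link["rel"] == "messages")
```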
Hope this helps!
I found this article about hackable URLs: Avoid hackable URLs.
There is a very interesting discussion about the topic of this question in the comments section.

How RESTful is using subdomains as resource identifiers?

We have a single-page app (AngularJs) which interacts with the backend using REST API. The app allows each user to see information about the company the user works at, but not any other company's data. Our current REST API looks like this:
domain.com/companies/123
domain.com/companies/123/employees
domain.com/employees/987
NOTE: All ids are GUIDs, hence the last end-point doesn't have company id, just the employee id.
We recently started enforcing the requirement that each user has access exclusively to information related to the company where the user works. This means that on the backend we need to track who the logged-in user is (which is a simple auth problem) as well as determine which company's information is being accessed. The latter is not easy to determine from our REST API calls, because some of them do not include a company id, such as the last one shown above.
We decided that instead of tracking company ID in the UI and sending it with each request, we would put it in the subdomain. So, assuming that ACME company has id=123 our API would change as follows:
acme.domain.com
acme.domain.com/employees
acme.domain.com/employees/987
This makes identifying the company very easy on the backend and requires minor changes to REST calls from our single-page app. However, my concern is that it breaks the RESTfulness of our API. This may also introduce some CORS problems, but I don't have a use case for it now.
I would like to hear your thoughts on this and how you dealt with this problem in the past.
Thanks!
In a similar application, we did put the 'company id' into the path (every company-specific path), not as a subdomain.
I wouldn't care a jot about whether some terminology enthusiast thought my design was "RESTful" or not, but I can see several disadvantages to using domains, mostly stemming from the fact that the world tends to assume that the domain identifies "the server" and the path is how you find an item on that server. There is a certain amount of extra stuff you'll have to deal with when using multiple domains that you wouldn't with paths:
HTTPS - you'd need a wildcard certificate instead of a simple one
DNS - you're either going to have wildcard DNS entries, or your application management is now going to involve DNS management
All the CORS stuff you mention, which may or may not be a headache in your specific application; anything making "same domain" assumptions about security policy is going to be affected.
Of course, if you want lots of isolation between companies, and effectively you would be as happy running a separate server for each company, then it's not a bad design. I can't see it's more or less RESTful, as that's just a matter of viewpoint.
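For what it's worth, the server-side tenant lookup is small either way. A sketch of resolving the company from the Host header, with an invented lookup table standing in for your company registry:
```python
COMPANIES = {"acme": 123}  # hypothetical subdomain -> company id mapping

def company_from_host(host: str):
    # "acme.domain.com:443" -> "acme"; strip the port and lower-case first.
    subdomain = host.split(":")[0].lower().split(".")[0]
    return COMPANIES.get(subdomain)  # None means unknown tenant: reject
```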
There is nothing "unrestful" in using subdomains. URIs in REST are opaque, meaning that you don't really care about what the URI is, but only about the fact that every single resource in the system can be identified and referenced independently.
Also, in a RESTful application, you never compose URLs manually; you traverse the hypermedia links you find at the API endpoint and in the returned responses. Since you don't need to manually compose URIs, from the REST point of view it makes no difference what they look like. Having a URI such as
//domain.com/ABGHTYT12345H
would be as RESTful as
//domain.com/companies/acme/employees/123
or
//domain.com/acme/employees/smith-charles
or
//acme.domain.com/employees/123
All of those are equally RESTful.
But... I like to think of usable APIs, and when it comes to usability, having readable, meaningful URLs is a must for me. Following conventions is also a good idea. In your particular case there is nothing un-RESTful about the route, but it is unusual to find that kind of behaviour in an API, so it might not be best practice. Also, as someone pointed out, it might complicate your development (not specifically the CORS part, though; that one is easily solved by sending a few HTTP headers, as sketched below).
So, even though I can't see anything non-REST in your proposal, the conventions elsewhere argue against subdomains in an API.
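For completeness, the "few HTTP headers" in question are the standard CORS response headers. A minimal Flask sketch, where the allowed origin is an example value you would replace with your own app's origin:
```python
from flask import Flask

app = Flask(__name__)

@app.after_request
def add_cors_headers(resp):
    # Allow the single-page app on another origin to call this API.
    resp.headers["Access-Control-Allow-Origin"] = "https://app.domain.com"
    resp.headers["Access-Control-Allow-Methods"] = "GET, POST, PUT, DELETE"
    resp.headers["Access-Control-Allow-Headers"] = "Authorization, Content-Type"
    return resp
```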

Strategies for dealing with split application

To deal with recent growth, our application has been split across two sets of separate infrastructure. Approximately half of our customers are on set 1 and the other half are on set 2.
Both sets have different urls (api1.ourdomain.com and api2.ourdomain.com).
The problem is that clients accidentally use the wrong URL and then wonder why they get error messages.
Other than user education, are there any other strategies for dealing with this mess?
Is it possible to redirect requests to the correct endpoint?
Thanks.
I don't think your question is detailed enough to provide meaningful feedback. There are obviously several factors that could easily contribute to a recommendation.
Does your application make use of user profiles (or a similar construct)? If so you might consider associating a primary URI for each user in their profile and include logic in your application to interrogate the profile for each request and redirect if a user goes to the wrong URI.
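A sketch of that profile-based redirect, assuming a hypothetical home_cluster_for helper backed by the user profile; a 307 redirect preserves the method and body, so POSTs survive it:
```python
from flask import Flask, redirect, request

app = Flask(__name__)
MY_BASE = "https://api1.ourdomain.com"

@app.before_request
def route_to_home_cluster():
    user = getattr(request, "user", None)  # set by your auth layer (assumed)
    if user is None:
        return None
    home = home_cluster_for(user)          # e.g. "https://api2.ourdomain.com"
    if home != MY_BASE:
        # Bounce the request, keeping path and query string intact.
        return redirect(home + request.full_path.rstrip("?"), code=307)
    return None
```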
Is this an authorization issue? If so you might consider including some basic authorization routing that provides a custom 403 page with the proper URL.
If you could provide additional detail I think we could be more helpful.