Purge cache in Cloudflare using API

What would be best practice for refreshing content that is already cached by Cloudflare?
We have a few APIs that generate JSON, and we cache the responses.
Once in a while the JSON needs to be updated, and what we do right now is purge it via the API:
https://api.cloudflare.com/client/v4/zones/dcbcd3e49376566e2a194827c689802d/purge_cache
Later on, when a user hits a page that needs the JSON, it gets cached again.
But in our case we have 100+ JSON files that we purge at once, and we would rather push the new content into Cloudflare's cache ourselves instead of waiting for users (to avoid a bad experience for them).
Right now I am considering pinging (via HTTP request) the needed JSON endpoints right after we have purged the cache.
My question is whether that is the right way, and whether Cloudflare already has an API to do what we need.
Thanks.

Currently, the purge API is the recommended way to invalidate cached content on-demand.
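The "purge, then re-warm" flow from the question could be sketched like this. The zone ID is the one from the question; the API token and URL list are placeholders, and `fetchFn` is injectable only so the flow can be exercised without the network:

```javascript
// Purge a batch of JSON URLs from Cloudflare's cache, then request each
// one so the edge re-caches a fresh copy before real users arrive.
const ZONE_ID = "dcbcd3e49376566e2a194827c689802d";

// Body for POST /zones/:zone_id/purge_cache; "files" purges by exact URL.
function buildPurgeBody(urls) {
  return JSON.stringify({ files: urls });
}

async function purgeAndWarm(urls, apiToken, fetchFn = fetch) {
  await fetchFn(
    `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: buildPurgeBody(urls),
    }
  );
  // "Ping" each endpoint so the next real visitor gets a cache hit.
  await Promise.all(urls.map((u) => fetchFn(u)));
}
```

Note that the purge-by-URL endpoint caps how many files a single call accepts, so with 100+ files you may need to split the list into batches.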
Another approach for your scenario could be to look at Workers and Workers KV, combined with the Cloudflare API. You could have:
1. A Worker that reads the JSON from KV and returns it to the user.
2. When you have a new version of the JSON, an API call that creates/updates the value stored in KV.
This setup can perform very well, since the Worker code in (1) runs in every Cloudflare datacenter and returns quickly to users. It is also important to note that KV is "eventually consistent" storage, so feasibility depends on your specific application.
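A minimal sketch of the Worker half, assuming a KV namespace bound as `JSON_STORE` and keys that mirror the request path (the binding name and key scheme are invented):

```javascript
// Map an incoming request URL to its KV key, e.g.
// "https://example.com/data/report.json" -> "/data/report.json".
function keyFromUrl(url) {
  return new URL(url).pathname;
}

// Worker-style handler: read the JSON from KV and return it. `env` is
// the bindings object Cloudflare passes in; JSON_STORE is the assumed
// KV namespace binding.
const worker = {
  async fetch(request, env) {
    const body = await env.JSON_STORE.get(keyFromUrl(request.url));
    if (body === null) {
      return new Response("Not found", { status: 404 });
    }
    return new Response(body, {
      headers: { "Content-Type": "application/json" },
    });
  },
};
```

Publishing a new version then becomes a single KV write via the API instead of a purge, and readers in every datacenter pick it up once KV propagates.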

Related

Cache data from external API in an orchestrated manner

I am building an application which uses the Amazon MWS API.
The API has limits for how frequently you can hit it.
I am looking for a tool that can act as a reverse-proxy, save the MWS API responses, and eventually masquerade as the MWS API without ever hitting it, returning only responses from the cache.
Some tools do this, but what I need is a bit more complicated.
Say I request a report from Amazon MWS:
1. I'll call RequestReport.
2. I'll get a ReportRequestId back.
3. I'll start polling GetReportRequestList to find out the current status of the report request. The report request will likely go through the statuses SUBMITTED then DONE, but it could also be set to ERROR or CANCELLED.
4. When the report request status returned by GetReportRequestList is DONE, I can finally call GetReport and get the data.
The behavior from step 3 is what I'm trying to replicate.
This external API cache should be able to produce different results for the same request: the first response should yield SUBMITTED and then the second response should yield DONE.
I should be able to easily configure these flows as I wish, setting the responses I want for the 1st, 2nd, nth request.
I would like this tool to require minimal configuration: I don't want to configure routes or anything; I want it to automatically cache everything and then return everything from the cache, never flushing it.
Also, I need this level of control over what's returned in a response, depending on the count of requests done up to that point.
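The "nth request gets the nth response" behaviour described above can be sketched as a request-counting dispatcher. Every name here is invented, and a real tool would key the count on the full request signature rather than a bare string:

```javascript
// flows maps a request signature to the ordered list of responses to
// serve; once the list is exhausted, the last response repeats (so a
// report stays DONE on every later poll).
function createSequencedMock(flows) {
  const counts = new Map(); // signature -> number of times seen

  return function respond(signature) {
    const n = (counts.get(signature) || 0) + 1;
    counts.set(signature, n);
    const flow = flows[signature] || [];
    return flow[Math.min(n, flow.length) - 1];
  };
}

// Example flow for the polling step: two SUBMITTED responses, then DONE.
const mock = createSequencedMock({
  GetReportRequestList: ["SUBMITTED", "SUBMITTED", "DONE"],
});
```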

how would I expose 200k+ records via an API?

What would be the best option for exposing 220k records to third-party applications?
a Salesforce-style 'bulk API', independent of the standard API to maintain availability
server-side pagination
a callback to an FTP-generated file?
webhooks?
This bulk transfer will have to happen once a day or so. Any other suggestions are welcome!
How are the 220k records being used?
Must serve it all at once
Not ideal for human consumers of this endpoint without special GUI considerations and communication.
A. I think that using a 'bulk API' would be marginally better than reading a file of the same data. (Not 100% sure on this.) Opening and interpreting a file might take a little bit more time than directly accessing data provided in an endpoint's response body.
Can send it in pieces
B. If only a small amount of data is needed at once, then server-side pagination should be used and allows the consumer to request new batches of data as desired. This reduces unnecessary server load by not sending data without it being specifically requested.
C. If all of it needs to be received during a user-session, then find a way to send the consumer partial information along the way. Often users can be temporarily satisfied with partial data while the rest loads, so update the client periodically with information as it arrives. Consider AJAX Long-Polling, HTML5 Server Sent Events (SSE), HTML5 Websockets as described here: What are Long-Polling, Websockets, Server-Sent Events (SSE) and Comet?. Tech stack details and third party requirements will likely limit your options. Make sure to communicate to users that the application is still working on the request until it is finished.
Can send less data
D. If the third party applications only need to show updated records, could a different endpoint be created for exposing this more manageable (hopefully) subset of records?
E. If the end-result is displaying this data in a user-centric application, then maybe a manageable amount of summary data could be sent instead? Are there user-centric applications that show 220k records at once, instead of fetching individual ones (or small batches)?
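Option B above, server-side pagination, usually boils down to translating page/per-page query parameters into a LIMIT/OFFSET pair. A minimal sketch (parameter names and caps are arbitrary):

```javascript
// Clamp and translate ?page=&per_page= into limit/offset for the query.
function pageParams(query, defaults = { page: 1, perPage: 500 }) {
  const page = Math.max(1, parseInt(query.page, 10) || defaults.page);
  const perPage = Math.min(
    1000, // hard cap so a consumer cannot request everything at once
    Math.max(1, parseInt(query.per_page, 10) || defaults.perPage)
  );
  return { limit: perPage, offset: (page - 1) * perPage };
}
```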
I would use a streaming API. This is an API that does a "select * from table" and then streams the results to the consumer. You do this using a for loop to fetch and output the records. This way you never use much memory, and as long as you frequently flush the output, the web server will not close the connection and you will support any size of result set.
I know this works as I (shameless plug) wrote the mysql-crud-api that actually does this.
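A sketch of that loop in Node.js, with a generator standing in for the batched database cursor (all names are illustrative):

```javascript
// Stand-in for a batched "select * from table" cursor read.
function* fetchRowsInBatches(total, batchSize) {
  for (let i = 0; i < total; i += batchSize) {
    yield Array.from(
      { length: Math.min(batchSize, total - i) },
      (_, j) => ({ id: i + j })
    );
  }
}

// Write each record as soon as it is fetched, so memory use stays flat
// no matter how many rows there are. `out` is anything with write(),
// e.g. a Node.js http response.
function streamRecords(out, batches) {
  out.write("[");
  let first = true;
  for (const batch of batches) {
    for (const row of batch) {
      out.write((first ? "" : ",") + JSON.stringify(row));
      first = false;
    }
  }
  out.write("]");
}

// Wiring it up (not started here):
// require("http").createServer((req, res) => {
//   res.writeHead(200, { "Content-Type": "application/json" });
//   streamRecords(res, fetchRowsInBatches(220000, 1000));
//   res.end();
// }).listen(3000);
```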

Does Amazon S3 send invalidation signals to CloudFront?

When I update a file on S3 and I have CloudFront enabled, does S3 send an invalidation signal to CloudFront? Or do I need to send it myself after updating the file?
I can't seem to see an obvious answer in the documentation.
S3 doesn't send any invalidation information to CloudFront. By default CloudFront will hold information up to the maximum time specified by the Cache Control headers that were set when it retrieved the data from the origin (it may remove items from its cache earlier if it feels like it).
You can invalidate cache entries by creating an invalidation batch. This will cost you money: the first 1,000 requests a month are free, but beyond that it costs $0.005 per request. If you were invalidating 1,000 files a day it would cost you about $145 a month (unless you can make use of the wildcard feature). You can of course trigger this in response to an S3 event using an AWS Lambda function.
Another approach is to use a different path when the object changes (in effect a generational cache key). Similarly, you could append a query parameter to the URL and change that parameter when you want CloudFront to fetch a fresh copy (to do this you'll need to tell CloudFront to use query string parameters; by default it ignores them).
Another option, if you only make infrequent (but large) changes, is to simply create a new CloudFront distribution.
As far as I know, all CDNs work like this.
It's why you generally use something like foo-x.y.z.ext to version assets on a CDN. I wouldn't use foo.ext?x.y.z because there was something about certain browsers and proxies never caching assets with a ?QUERY_STRING.
In general you may want to check this out:
https://developers.google.com/speed/docs/best-practices/caching
It contains lots of best practices and goes into details what to do and how it works.
In regard to S3 and Cloudfront, I'm not super familiar with the cache invalidation, but what Frederick Cheung mentioned is all correct.
Some providers also allow you to clear the cache directly but because of the nature of a CDN these changes are almost never instant. Another method is to set a smaller TTL (expiration headers) so assets will be refreshed more often. But I think that defeats the purpose of a CDN as well.
In our case (Edgecast), cache invalidation is possible (a manual process) and free of charge, but we rarely do this because we version our assets accordingly.

Best way to store data between two request

I need a bit of theoretical advice. Here is my situation: I have a search system which returns a list of found items, but the user is allowed to display only a particular number of items per page. When the first request is sent to my WCF service, the service gets the whole list and tests whether it is longer than the number of items the user is allowed to get. If it isn't, there is no problem and the service returns the whole list. If it is, I need to let the user choose which page to display, so I let the JavaScript know that the user should choose a page, the "page number" dialog is shown, and the user sends a second request with the page number. Based on this request the web service selects the relevant items and sends them back. So what I need to do is store the whole list on the server between the first and second request, and I'd appreciate any idea how to store it. I was thinking about session state, but I don't know whether it is possible to set a timeout on one particular entry (e.g. Session["list"]); the list is used only once and can have thousands of items, so I don't want to keep it on the server too long.
P.S. I can't use standard pagination; the scenario has to be exactly as described above.
Thanks
This sounds like a classic use-case for memcached. It is a network based key-value store for storing temporary values. Unlike in-memory state, it can be used to share temporary cached values among servers (say you have multiple nodes), and it is a great way to save state across requests (avoiding the latency that would be caused by using cookies, which are transmitted to/from the server on each http request).
The basic approach is to create a unique ID for each request, and associate it with a particular (set of) memcached key for that user's requests. You then save this unique ID in a cookie (or similar mechanism).
A warning, though: the storage is volatile, so values can be lost at any point. In practice this is not frequent, and memcached evicts entries using an LRU queue. More details: http://code.google.com/p/memcached/wiki/NewOverview
http://memcached.org/
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
I'm not a .net programmer, but there appear to be implementations:
http://code.google.com/p/memcached/wiki/Clients
.Net memcached client: https://sourceforge.net/projects/memcacheddotnet (.Net 2.0 memcached client)
EnyimMemcached: http://www.codeplex.com/EnyimMemcached (client developed in .NET 2.0 with performance and extensibility in mind; supports consistent hashing)
Memcached Providers: http://www.codeplex.com/memcachedproviders
BeIT Memcached Client (optimized C# 2.0): http://code.google.com/p/beitmemcached
jehiah: http://jehiah.cz/projects/memcached-win32

Streaming API vs Rest API?

The canonical example here is Twitter's API. I understand conceptually how the REST API works; essentially it's just a query to their server for your particular request, to which you then receive a response (JSON, XML, etc.), great.
However I'm not exactly sure how a streaming API works behind the scenes. I understand how to consume it: for example, with Twitter you listen for a response, and from the response you listen for data, in which the tweets come in chunks. You build up the chunks in a string buffer and wait for a line feed, which signifies the end of a tweet. But what are they doing to make this work?
Let's say I had a bunch of data and I wanted to set up a streaming API locally for other people on the net to consume (just like Twitter). How is this done, and with what technologies? Is this something Node.js could handle? I'm just trying to wrap my head around what they are doing to make this thing work.
Twitter's streaming API is essentially a long-running request that's left open; data is pushed into it as and when it becomes available.
The repercussion of that is that the server has to be able to deal with lots of concurrent open HTTP connections (one per client). A lot of existing servers don't manage that well; for example, Java servlet engines assign one thread per request, which can (a) get quite expensive and (b) quickly hit the normal max-threads setting and prevent subsequent connections.
As you guessed, the Node.js model fits the idea of a streaming connection much better than, say, a servlet model does. Both requests and responses are exposed as streams in Node.js, but they don't occupy an entire thread or process, which means you can continue pushing data into a stream for as long as it remains open without tying up excessive resources (although this is subjective). In theory you could have a lot of concurrent open responses connected to a single process and only write to each one when necessary.
If you haven't looked at it already the HTTP docs for Node.js might be useful.
I'd also take a look at technoweenie's Twitter client to see what the consumer end of that API looks like with Node.js, the stream() function in particular.