How do we make APIs with slow dependencies faster?

Recently I attended two different job interviews, and one of the questions they asked went something like this:
1- You need to create an API that depends on some microservices that are very slow; some of them take a couple of seconds (let's say 2 seconds) to respond. We have to do our best to make the API reliable in terms of latency. What would you do to make this system fast?
2- This led to follow-up questions, such as: if I choose to cache some data, what do I do to avoid serving stale cache entries? For example, what if I cached the user's personal info and they just updated their profile?
3- Finally, if it is not a read but a write operation, how do we keep services that take a long time from impacting the user experience?
How would you answer these questions?

The question is a little vague but I'll try and throw a couple of solutions out there.
Before jumping into the cache, I would first ask questions about the data set. For instance, how large is this data set and how often does the data set change? If the data set isn't large, you can probably store all of it in memory indefinitely and on updates, you can update individual records in the cache.
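As a rough illustration of that idea, here is a minimal sketch assuming an in-process, dict-backed cache with a TTL; the function names (get_user, fetch_user_from_service, write_user_to_service) are hypothetical stand-ins for the slow microservice calls:

    import time

    # Hypothetical in-process cache: user_id -> (cached_value, expiry_timestamp).
    _cache = {}
    CACHE_TTL_SECONDS = 300  # assumption: five minutes of staleness is acceptable


    def get_user(user_id, fetch_user_from_service):
        """Return the cached user if still fresh, otherwise call the slow service and cache the result."""
        entry = _cache.get(user_id)
        if entry is not None:
            value, expires_at = entry
            if time.time() < expires_at:
                return value
        value = fetch_user_from_service(user_id)  # the slow (~2 s) downstream call
        _cache[user_id] = (value, time.time() + CACHE_TTL_SECONDS)
        return value


    def update_user(user_id, new_profile, write_user_to_service):
        """Write-through update: persist the change, then refresh the cache so readers never see stale data."""
        write_user_to_service(user_id, new_profile)
        _cache[user_id] = (new_profile, time.time() + CACHE_TTL_SECONDS)

The update_user path is also what answers the second question: overwrite or invalidate the cached record at the moment the profile changes, rather than waiting for a TTL to expire.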
Of course, when we say we store it in a cache, we also have to keep data retrieval in mind. If the data has to be retrieved in many different ways and the data set is large, caching may not be as great a solution. That addresses the first and second questions as far as possible without further information from the interviewer; this is really where you need to tease out requirements to see if you're on the right track.
Now finally for the third question, I think the interviewer is trying to get you to write asynchronously to something like a queuing mechanism, which lets the user get a quick response while your system takes its time processing the write. A follow-up question here may be about how long a write can take to be processed, and that will lead to a series of more domain-specific questions. Again, you'll have to dig into the requirements to see what kind of trade-offs the interviewer wants you to make, because there is no silver bullet.
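For the write path, here is a minimal sketch using an in-process queue and a background worker purely as a stand-in for a real message broker (RabbitMQ, SQS, Kafka, and so on); the handler and job names are assumptions:

    import queue
    import threading
    import uuid

    write_queue = queue.Queue()


    def process_write(job):
        # Placeholder for the slow downstream call (e.g. a 2-second microservice write).
        pass


    def worker():
        """Background worker: drains the queue and performs the slow writes at its own pace."""
        while True:
            job = write_queue.get()
            try:
                process_write(job)
            finally:
                write_queue.task_done()


    def handle_write_request(payload):
        """API handler: enqueue the work and respond immediately (think HTTP 202 Accepted)."""
        job_id = str(uuid.uuid4())
        write_queue.put({"id": job_id, "payload": payload})
        return {"status": "accepted", "job_id": job_id}


    # One worker thread here; a real system would use a durable broker and multiple consumers.
    threading.Thread(target=worker, daemon=True).start()

In a real deployment the queue would be durable, and the client would get back a job id (or a callback) it can use to find out when the write has actually been applied.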

Related

Good practice to fetch detail api data in react-redux app

What's the best practice for fetching detail data in a React app when you are dealing with multiple master-detail views?
For example, if you have
- /rest/departments API, which returns the list of departments
- /rest/departments/:departmentId/employees API, which returns all employees within a department.
To fetch all departments I use:
componentDidMount() {
    this.props.dispatch(fetchDepartments());
}
but then I'll need logic to fetch all employees per department. Would it be a good idea to call the employee action creator for each department from the department reducer logic?
Dispatching employee actions in the render method does not look like a good idea to me.
Surely it is a bad idea to call an employee action creator inside the department reducer, as reducers should be pure functions; you should do it in your fetchDepartments action creator.
Anyway, if you need to get all the employees for every department (not just the selected one), it is not ideal to make many API calls: if possible, I would ask the backend developers for an endpoint that returns the array of departments and, for each department, an embedded array of employees, if the numbers aren't too big of course...
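If the backend can provide that, a hypothetical combined response (the endpoint and field names here are assumptions, not part of the question) might look roughly like this, shown as a plain data literal:

    # Hypothetical response of GET /rest/departments?embed=employees
    departments = [
        {
            "id": 1,
            "name": "Engineering",
            "employees": [
                {"id": 10, "name": "Alice"},
                {"id": 11, "name": "Bob"},
            ],
        },
        {"id": 2, "name": "Sales", "employees": []},
    ]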
Big old "It depends"
This is something where, in the end, you will need to pick an approach and see how it works out with your specific data and user needs. It also depends on network issues, such as latency. In a very nicely networked environment, such as the top-3 insurance company I was a net admin for, you can achieve very low-latency network calls; in such a case, multiple network requests behave very differently than they would over a typical home internet connection. Even then, you have to consider a wide range of possibilities. And you ALWAYS need to consider your end goals.
(Not to get too deep into the technical aspects, but latency can fairly accurately be described as the time you spend waiting for a network request to actually start sending data. A classic example of where this matters is online first-person shooter gaming: you click shoot, the data is not transmitted as quickly as you would like because the network is still waiting to send it, and then you die. A classic example where bandwidth matters more than latency is downloading or uploading large files: if you have to wait a second or two for the data to start moving, but once it moves you can download a GB in seconds, then oh well, I'll take it.)
Currently, I have our website making multiple calls to load dynamic menus and dynamic content. It is very small data, done in three separate calls over the internet. It's "ok", but I would not say it is "good". Since users are waiting for all of it before they can even start, I might as well throw it all into a single network call. Also, if two calls go OK and the third chokes a bit, the user may start to navigate, then more menus pop in, and it is not ideal. This is why, regardless, you have to think about your specific needs and what range of use cases is likely to apply. (I am currently re-writing the entire site anyway.)
As a previous (in my opinion "good") answer stated, it probably makes sense to have the whole data set shot to you in one gulp. It appears to me this is an internal, or at least commercial app, with decent network and much more importantly, no risk of losing customers because your stuff did not load super fast.
That said, if things do not work out well with that, especially if you are talking about large data sets, then consider a lazy-loading architecture. For example, your user cannot get to an employee until they see the departments. So it may be OK, depending on your network and data size, to load departments, and then, once that returns, initiate an asynchronous load of the employee data. The employee data is then loading while your user browses the department names.
A huge question you may want to clarify is whether or not any employee list data is rendered WITH the departments. In one of my cases, I have a work order system that I load after login, but lazily, and when it is loaded it throws a badge on the Work Order menu to show how many are outstanding. Since I do not have a lot of orders, it is basically a one-second wait. No biggie. It is not as if the user has to wait for it to load to begin work. If you wanted a badge per department, then it may get weird: if you load by department, you could have multiple badges popping in randomly. In this case it may cause user confusion, and it is probably a good choice to load everything in one large chunk. If the user has to wait anyway, it may mean one less call from a user asking "is it ok that it is doing this?". Especially with software for the workplace, it is more acceptable to have to wait for an initial load at the beginning of the work day.
To be clear, with all of these complications to consider, it is extremely important that you develop with the best coding practices you can manage. That way, you can code one solution, and if it does not meet your performance or user needs, it is not a nightmare to change. In the general case with small data, I would just load it all in one big gulp to start and, if there are problems with load times, complicate it from there. Complicating code from the beginning for no clearly needed reason is a good way to clutter it to the point of being completely unwieldy to maintain.
On a third note, if you are dealing with enterprise size data sets, that is a whole different thing. Then you have to deal with pagination, and yes it gets a bit more complicated.
Regards,
DB
I'm not sure what fetchDepartments does exactly but I'd ensure the actual fetch request is executed from a Redux middleware. By doing it from middleware, you can fingerprint / cache / debounce all your requests and make a single one across the app no matter how many components request the thing.
In general, middleware is the best place to handle asynchronous side effects.

Is RavenDB Right for my Situation?

I have an interesting situation where I'm near the end of an evaluation period for a RavenDB prototype for use with a project at our company. The reason it's interesting is that 99.99% of the time, I believe it fits Raven's sweet spot; it repeatedly queries for new data, often, and in small batches (< 1000 documents at a time).
However, we do have an initial load period, where we need to load two days' worth of data, which can be 3 million (or more) records in some cases.
A diagram would help here, but in short: the Transfer Service is responsible for getting the correct data out of three production databases and storing it in RavenDB. The WCF service will query this data and make it available to its clients.
Once we do the initial load of millions of records/documents into RavenDB, we'll rarely have to do that again.
As an initial load test, on a machine with 4GB RAM and two processors, it took just over 23 minutes to read the initial data. In this case, it was only about 1.28 million records. I eliminated all async operations from this initial load, because I wanted each read to not be interfered with by other read operations. I found the best results this way.
To accomplish all this, I had to change some settings that aren't recommended to be changed:
I had to increase the request timeout:
documentStore.JsonRequestFactory.ConfigureRequest += (e, x) => ((HttpWebRequest)x.Request).Timeout = ravenTimeoutInMilliseconds;
In the Raven.Server.exe.config, I had to increase the page size (to int.MaxValue):
<add key="Raven/MaxPageSize" value="2147483647"/>
And in my retrieval methods, I had to use Take(int.MaxValue):
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
Remember this is all for that one-time, initial load. After that, it's many queries, quickly, and often. I should also note that each document is self-contained in RavenDB. There are no relationships to manage.
Knowing all this, is RavenDB a good fit?
A good fit for what?
Full text search? Yes. Background aggregations (map/reduce ones)? Yes. Easy replication and sharding, say scaling? Yes...
Ad-hoc reporting? No. Support for probably thousands of third party tools? No...
If you're talking about performance, you probably want to look at Oren's latest post on that. His numbers are quite similar to yours: http://ayende.com/blog/154913/ravendb-amp-freedb-an-optimization-story
From what I understand of your question, you need to "prep" the WCF web-service. To do this you read 1.2M docs from RavenDB (in about 23 mins) and hold them in memory, so the WCF service can then serve queries from them, is this right? Or am I missing something?
Why not get the WCF service to send its queries to Raven one at a time? I.e., for each query it gets from a client, ask RavenDB to do the query for it?
From what you've told us in the comments on the other answers, I believe the only good way to serve the WCF clients fast enough is to actually store everything in memory, so just the way you do it now.
Whether RavenDB is a good fit for that situation depends on whether your data model benefits in other ways from the document-oriented nature. So, if you have dynamic data that would require some kind of EAV in a relational database and lots of joins, then RavenDB will probably be a very good solution. However, if you just need something you can throw flat data into, then I would go with a relational database here. In terms of licensing costs and ease of use, you might also want to take a look at PostgreSQL, as it is a really awesome database that comes completely free.

Serialization or SQLite?

I'm making a patient database program using Visual C#. It will have forms and will consist of 3 tabs with information about the patient. It will also have add, save, previous, next buttons and a search function. The most important thing is that each record will have around 60 items/columns/attributes, and the number of records could reach 50k-100k or more.
Now my question is, which is better for my program? Should I use SQLite or serialization/deserialization?
Thanks
The "database" word in the question strongly suggests that just serialization/deserialization isn't enough. Of course if you can fit all of your data into memory and you're happy to perform all the querying yourself, it could work - but you'll need to consider the cost of potentially reading everything into memory on startup, and possibly writing everything out whenever you change anything.
A database does sound like a better fit to me, to be honest. Whether SQLite is the most appropriate database for you or not is a different question though.
Having said all of this, for the C# in Depth website I keep all the information about comments / errata in a simple XML file, which is loaded lazily and saved every time I make a change. It works well, it's easy to manage, and the file is human readable in source control when I want it. However, I have vastly fewer records than you, and they're much simpler too. I don't have any search requirements - I just need to list everything and fetch by ID. My guess is that your needs are rather more complex, hence my recommendation to use a database.

Database structure for modelling one-way follower system like Twitter [duplicate]

I am designing an app that would involve users 'following' each other's activity, in the twitter sense, but I am not very experienced with database/query design/efficiency. Are there best practices for managing this, pitfalls to avoid, etc.? I gather this can create a very large load on the db if not done properly (or maybe even then?).
If it makes a difference it is likely that people will 'follow' only a relatively small number of people (but a person may have many followers). However this is not certain, and I wouldn't want to count on it.
Any advice gratefully received. Thanks.
Pretty simple and easy to do with full normalisation. If you have a table of users, each with a unique ID, you would have a TABLE_FOLLOWERS table with the columns USERID and FOLLOWERID, which describes all the followers for each user as a many-to-many relationship between users.
Even with millions of associations, on a half-decent database server this will perform well and fast as long as you are using a good database (i.e., not MS Access).
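As a rough sketch of that schema, here it is in SQLite purely for illustration (driven from Python; the table and column names follow the answer above and are not prescriptive):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # illustration only; any relational database works the same way
    conn.executescript("""
    CREATE TABLE users (
        userid INTEGER PRIMARY KEY,
        name   TEXT NOT NULL
    );

    -- One row per follow relationship: followerid follows userid.
    CREATE TABLE table_followers (
        userid     INTEGER NOT NULL REFERENCES users(userid),
        followerid INTEGER NOT NULL REFERENCES users(userid),
        PRIMARY KEY (userid, followerid)
    );

    -- Index for the reverse lookup: "who does this person follow?"
    CREATE INDEX idx_followers_by_follower ON table_followers(followerid);
    """)

    # All followers of user 42:
    followers = conn.execute(
        "SELECT followerid FROM table_followers WHERE userid = ?", (42,)
    ).fetchall()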
The model is fairly simple. The problem is in the size of the Subscription table; if there are 1 million users, and each subscribes to 1000, then the Subscription table has 1 billion rows.
That depends on how many users you expect to need to support; how many followers you expect users to have; and what sort of funding/development-effort you expect to have access to should your answers to the previous questions prove optimistic.
For a small scale project I would likely ignore the database, design the application as a simple object model with User objects that maintain a List[followers]. Keep it all in RAM for normal operation and use an ORM to persist to a database periodically (probably postgresql or mysql).
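A minimal sketch of that small-scale, in-memory approach, with plain objects and a JSON snapshot standing in for a real ORM (all names are illustrative):

    import json


    class User:
        def __init__(self, user_id, name):
            self.user_id = user_id
            self.name = name
            self.followers = []  # list of User objects following this user

        def follow(self, other):
            """Record that this user follows `other`."""
            if self not in other.followers:
                other.followers.append(self)


    # Everything lives in RAM for normal operation...
    alice = User(1, "alice")
    bob = User(2, "bob")
    bob.follow(alice)  # bob now appears in alice.followers


    # ...and is periodically flushed to durable storage (a real app would use an ORM here).
    def persist(users, path="users.json"):
        snapshot = [
            {"id": u.user_id, "name": u.name, "followers": [f.user_id for f in u.followers]}
            for u in users
        ]
        with open(path, "w") as fh:
            json.dump(snapshot, fh)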
For a larger project I would not be using a relational database at all; but exactly what I would use would depend on the specific details of the project.
If you are only trying to spike the concept, go with the ORM approach; but, keep in mind it won't scale.
You probably should read http://highscalability.com/ and its articles on how this is managed by the big sites.

Most optimized way to store crawler states?

I'm currently writing a web crawler (using the python framework scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind and, basically, stores links when they get scheduled, and marks them as 'processed' once they actually are.
Thus, I'm able to fetch those links when resuming the spider (obviously there is a little bit more stored than just a URL: a depth value, the domain the link belongs to, etc.), and so far everything works well.
Right now, I've just been using a MySQL table to handle that storage, mostly for fast prototyping.
Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean using a very simple and lightweight system that can still handle a large amount of data written in a short time.
For now, it should be able to handle the crawling of a few dozen domains, which means storing a few thousand links a second...
Thanks in advance for suggestions
The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time costs for storage. Upon restarting, you re-read the log and rebuild the memory structures that you were also building on the fly as you were appending to the log in the first place.
Your specific application could be further optimized since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So, your log file can be buffered and doesn't need to be obsessively fsync'ed.
I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites you could probably just keep a set with all their URLs, no need for bloom filters or anything fancy) -- if it didn't, you might have to keep in memory only a set of recent entries, and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I'm not going into excruciating details about these options since it does not appear you will require them.
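A rough sketch of that approach, assuming a newline-delimited JSON log and an in-memory set of seen URLs (the field layout is illustrative, not scrapy-specific):

    import json

    LOG_PATH = "crawl_state.log"


    def append_event(log_file, url, depth, domain, processed):
        """Append one state change as a JSON line; sequential appends keep disk seeks to a minimum."""
        log_file.write(json.dumps({
            "url": url, "depth": depth, "domain": domain, "processed": processed,
        }) + "\n")
        # Deliberately no flush()/fsync() on every write: losing the last few entries
        # after a crash just means re-crawling a handful of URLs.


    def rebuild_state(path=LOG_PATH):
        """Replay the log on restart and rebuild the in-memory structures."""
        pending, processed = {}, set()
        try:
            with open(path) as fh:
                for line in fh:
                    event = json.loads(line)
                    if event["processed"]:
                        processed.add(event["url"])
                        pending.pop(event["url"], None)
                    else:
                        pending[event["url"]] = event
        except FileNotFoundError:
            pass  # first run: nothing to replay
        return pending, processed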
There was a talk at PyCon 2009 that you may find interesting, Precise state recovery and restart for data-analysis applications by Bill Gribble.
Another quick way may be to use pickle to serialize your application state to disk.
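A minimal sketch of the pickle approach (the shape of the state object is just an example):

    import pickle

    STATE_PATH = "crawler_state.pkl"


    def save_state(state, path=STATE_PATH):
        """Snapshot the whole in-memory state to disk in one go."""
        with open(path, "wb") as fh:
            pickle.dump(state, fh, protocol=pickle.HIGHEST_PROTOCOL)


    def load_state(path=STATE_PATH):
        """Restore the snapshot on resume, or start fresh if none exists."""
        try:
            with open(path, "rb") as fh:
                return pickle.load(fh)
        except FileNotFoundError:
            return {"pending": [], "processed": set()}


    # Example: save on pause, load again on resume.
    save_state({"pending": [{"url": "http://example.com", "depth": 0}], "processed": set()})
    state = load_state()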