Efficient way to get CIO-1? - httprequest

I was assigned a task where I receive a bunch of employees and I need to get their CIO-1 (meaning their boss's boss's boss... all the way up to the person whose manager is the CIO). But I don't know if I'm doing it in an efficient way. This is my algorithm:
For every given employee, perform an API request to Microsoft Graph to get their manager. Then perform another API request to get that manager's manager... and repeat until I reach the one whose manager is our CIO. This means if I were given 500 employees, I would be performing an HTTP request inside a for loop, with another loop inside it to go up the "manager chain."
Is this ok? Would Microsoft Graph cut me off because I would be performing many, many queries in a short amount of time?

This is more of an algorithm question than a Graph question, really. But yes, at the end of the day you do need to query the data.
However, some approaches are smarter than others.
Rather than building a loop that queries each initial employee's boss, with another loop inside it to get the next-level boss and so on, you could do it a different way.
First, make sure you leverage batch queries to minimize the number of round trips.
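For illustration, here is a minimal TypeScript sketch of Graph's JSON batching endpoint ($batch), which accepts up to 20 sub-requests per call. The function name and plumbing are my own; you'd add proper auth and error handling around it.

// Sketch only: one $batch call resolving the manager of up to 20 users
// at once (20 is Graph's documented per-batch request limit).
// "accessToken" is assumed to come from your usual auth flow.
async function getManagersBatch(
  accessToken: string,
  userIds: string[],
): Promise<Map<string, string>> {
  const body = {
    requests: userIds.slice(0, 20).map((id, i) => ({
      id: String(i),
      method: "GET",
      url: `/users/${id}/manager?$select=id`,
    })),
  };
  const res = await fetch("https://graph.microsoft.com/v1.0/$batch", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  const { responses } = await res.json();
  const managerOf = new Map<string, string>();
  for (const r of responses) {
    // Each sub-response echoes the id we assigned to its sub-request.
    if (r.status === 200) managerOf.set(userIds[Number(r.id)], r.body.id);
  }
  return managerOf;
}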
Second, I think it's safe to assume that at some point some sets of employees are going to have the same boss. Rather than querying multiple times for that boss's boss, make sure you only do it once.
There are a couple of different ways of doing that: maintaining a tree, keeping an index, always applying a distinct over your data sets...
Lastly, keep in mind that a loop nested inside a loop gives you O(n²)-style complexity, which is generally worse than two subsequent loops (one after the other). So flattening your algorithm will help, and it will also make it easier to build batches.
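Putting those ideas together, a sketch of the flattened version: resolve the whole "frontier" one level at a time, deduplicating ids and memoizing every manager lookup so nobody is queried twice. It reuses the getManagersBatch sketch above, and assumes you know the CIO's id up front.

// Sketch only: walks all chains level by level instead of nesting loops.
async function getCioMinusOne(
  accessToken: string,
  employees: string[],
  cioId: string,
): Promise<Map<string, string>> {
  const managerOf = new Map<string, string>(); // memoized lookups
  const result = new Map<string, string>();    // employee -> their CIO-1
  // frontier maps each employee to the ancestor we are currently at.
  let frontier = new Map<string, string>();
  for (const e of employees) frontier.set(e, e);
  while (frontier.size > 0) {
    // Deduplicate, and resolve only ids never seen before, in batches of 20.
    const unresolved = [...new Set(frontier.values())].filter(
      (id) => !managerOf.has(id),
    );
    for (let i = 0; i < unresolved.length; i += 20) {
      const found = await getManagersBatch(
        accessToken,
        unresolved.slice(i, i + 20),
      );
      for (const [id, mgr] of found) managerOf.set(id, mgr);
    }
    const next = new Map<string, string>();
    for (const [employee, current] of frontier) {
      const mgr = managerOf.get(current);
      if (mgr === undefined) continue;                   // chain broken: skip
      if (mgr === cioId) result.set(employee, current);  // found their CIO-1
      else next.set(employee, mgr);                      // climb one level
    }
    frontier = next;
  }
  return result;
}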
On the throttling part, make sure you go through the throttling guidance. Yes, you might experience it; however, there are a couple of ways to optimize.
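On that note, when Graph does throttle you it replies with HTTP 429 and a Retry-After header, so a tiny wrapper like this sketch keeps the rest of your code oblivious:

// Retries a request when throttled (HTTP 429), honoring Retry-After.
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  maxRetries = 5,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const seconds = Number(res.headers.get("Retry-After") ?? "1");
    await new Promise((resolve) => setTimeout(resolve, seconds * 1000));
  }
}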

Related

Good practice to fetch detail api data in react-redux app

What's the best practice for fetching detail data in a React app when you are dealing with multiple master-detail views?
For example, if you have
- a /rest/departments API which returns the list of departments
- a /rest/departments/:departmentId/employees API which returns all employees within a department.
To fetch all departments I use:
componentDidMount() {
  this.props.dispatch(fetchDepartments());
}
but then I'll need logic to fetch all employees per department. Would it be a good idea to call the employee action creator for each department in the department reducer logic?
Dispatching employee actions in the render method does not look like a good idea to me.
Surely it is a bad idea to call an employee action creator inside the department reducer, as reducers should be pure functions; you should do it in your fetchDepartments action creator.
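For example, with redux-thunk (assuming that's what powers fetchDepartments; the API helpers below are made-up placeholders), the chained dispatch lives in the action creator:

// Hypothetical HTTP helpers - replace with your real ones.
declare function fetchDepartmentsApi(): Promise<Array<{ id: string }>>;
declare function fetchEmployeesApi(departmentId: string): Promise<unknown[]>;

export function fetchDepartments() {
  return async (dispatch: (action: unknown) => void) => {
    const departments = await fetchDepartmentsApi(); // GET /rest/departments
    dispatch({ type: "DEPARTMENTS_RECEIVED", departments });
    // Fan out the per-department fetches here, in the action creator -
    // never from a reducer, which must stay pure.
    for (const dept of departments) {
      dispatch(fetchEmployees(dept.id));
    }
  };
}

export function fetchEmployees(departmentId: string) {
  return async (dispatch: (action: unknown) => void) => {
    // GET /rest/departments/:departmentId/employees
    const employees = await fetchEmployeesApi(departmentId);
    dispatch({ type: "EMPLOYEES_RECEIVED", departmentId, employees });
  };
}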
Anyway, if you need to get all the employees for every department (not just the selected one), it is not ideal to make many API calls: if possible, I would ask the backend developers for an endpoint that returns the array of departments and, for each department, an embedded array of employees, provided the numbers aren't too big, of course...
Big old "It depends"
This is something where, in the end, you will need to pick a way and see how it works out with your specific data and user needs. It also depends somewhat on network issues, such as latency. In a very well-networked environment, such as the top-3 insurance company I was a net admin for, you can achieve super-low-latency network calls; there, multiple network requests behave very differently than they would over a home internet connection. Even then, you have to consider a wide range of possibilities. And you ALWAYS need to consider your end goals.
(Not to get too far down into the technical aspects, but latency can fairly accurately be defined as "the time you are waiting for a network request to actually start sending data". A classic example of where this is important is online first-person shooter gaming: you click shoot, the data is not transmitted as fast as you would like because the network is waiting to send it, and then you die. A classic example where bandwidth is more useful than latency is downloading or uploading large files: if I have to wait a second or two for the actual data to start moving, but once it moves I can download a GB in seconds, then oh well, I'll take it.)
Currently, I have our website making multiple calls to load dynamic menus and dynamic content. It is very small data, done in three separate calls, over the internet. It's "ok", but I would not say it is "good". Since users are waiting for all of it to even start, I might as well throw it all in a single network call. Also, if two calls go ok and then the third chokes a bit, the user may start to navigate, then more menus pop in, which is not ideal. This is why, regardless, you have to think about your specific needs and what range of use cases may likely apply. (I am currently re-writing the entire site anyway.)
As a previous (in my opinion "good") answer stated, it probably makes sense to have the whole data set shot to you in one gulp. It appears to me this is an internal, or at least commercial, app with a decent network and, much more importantly, no risk of losing customers because your stuff did not load super fast.
That said, if things do not work out well with that, especially if you are talking about large data sets, then consider a lazy-loading architecture. For example, your user cannot get to an employee until they see the departments. So it may be ok, depending on your network and data size, to load departments, and then after that returns, initiate an asynchronous load of the employee data. The employee data is then being loaded while your user browses the department names.
A huge question you may want to clarify is whether any employee list data is rendered WITH the departments. In one of my cases, I have a work order system that I load after login, but lazily, and when it is loaded it throws a badge on the Work Order menu to show how many are outstanding. Since I do not have a lot of orders, it is basically a one-second wait. No biggie. It is not as if the user has to wait for it to load to begin work. If you wanted a badge per department, then it may get weird: if you load by department, you could have multiple badges popping in randomly. That may cause user confusion, and it is probably a good choice to load it in one large chunk. If the user has to wait anyway, it may produce one less call from a user asking "is it ok that it is doing this?". Especially with software for the workplace, it is more acceptable to have to wait for an initial load at the beginning of the work day.
To be clear, with all of these complications to consider, it is extremely important that you develop with as good software coding practices as you are able. That way, you can code one solution, and if it does not meet your performance or user needs, it is not a nightmare to make a change. In the general case with small data, I would just load it in one big gulp to start, and if there are problems with load times, complicate it from there. Complicating code from the beginning for no clearly needed reason is a good way to clutter your code to the point of making it completely unwieldy to maintain.
On a third note, if you are dealing with enterprise-size data sets, that is a whole different thing. Then you have to deal with pagination, and yes, it gets a bit more complicated.
Regards,
DB
I'm not sure what fetchDepartments does exactly but I'd ensure the actual fetch request is executed from a Redux middleware. By doing it from middleware, you can fingerprint / cache / debounce all your requests and make a single one across the app no matter how many components request the thing.
In general, middleware is the best place to handle asynchronous side effects.
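A rough sketch of what such a middleware could look like (the API_FETCH action shape is invented for illustration):

// Deduplicates concurrent fetches of the same URL: every component can
// dispatch the same request and only one HTTP call goes out.
const inFlight = new Map<string, Promise<unknown>>();

const dedupeFetchMiddleware =
  (store: { dispatch: (a: unknown) => void }) =>
  (next: (a: unknown) => unknown) =>
  (action: { type: string; url?: string }) => {
    if (action.type !== "API_FETCH" || !action.url) return next(action);
    const url = action.url;
    let pending = inFlight.get(url);
    if (!pending) {
      pending = fetch(url)
        .then((res) => res.json())
        .finally(() => inFlight.delete(url)); // allow later refetches
      inFlight.set(url, pending);
    }
    pending.then((data) =>
      store.dispatch({ type: "API_FETCH_DONE", url, data }),
    );
    return next(action);
  };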

data population and manipulation in frontend vs. backend

I have a very general question. For example, I have an employee table which includes name, address, age, gender, dept, etc. When a user wants to see overall employee information, I do not need to extract all the columns from the DB. I want to show the overall employee information first by grouping employees' names per dept. Then a user can select one particular employee if he/she has more interest.
To implement this, which approach would be better?
1) Provide two different APIs which produce two different data results via different queries. Even though I have to call two different APIs with this approach, it seems efficient.
2) Provide one API which produces one data result including all the employee-related columns, even though the overall employee view does not need the full detail. Once I get this data, I can re-format the employee information by manipulating the already-extracted data on the front-end side according to different needs.
Normally, which approach do developers take between 1) and 2)? I think 1) makes sense, but the API becomes very specialized rather than generalized. Is it desirable to manipulate data received from the back end (RESTful) on the front-end side (Angular 2)?
Which approach is preferred: creating a relatively larger number of specialized APIs, or manipulating the data on the front end after fetching it all at once? If there are criteria for choosing, what should I consider?
Is this the correct way to think about it? If someone has some idea about it, could you give some guidance?
That's a very interesting discussion. If you need a very optimized application, then you should go with the first option. Also, if the employee table has a column that can be accessed only by a privileged user, you can imagine that client-side filtering won't stop a malicious user from seeing it. In that case you must implement the first option as well.
The drawback is that at some point you could have too many APIs doing very similar things, making it confusing and unproductive.
About the second option: as you pointed out, having generalized APIs is a good thing, so in order to speed up the development process and have a cleaner set of APIs, you can afford to waste a little network traffic in return.
You need to balance your needs.
Is it optimization of resources, like speed in terms of database queries and network requests?
Or is it development time? Do you have enough time to implement all these things and optimizations?

DBI/SQL performance on SELECT via different methods?

I've been looking around and haven't been able to find a good answer to this question. Is there a real performance difference between the following DBI methods?
fetchrow_arrayref vs
selectall_arrayref vs
fetchall_arrayref
It probably doesn't matter when making a single SELECT call that will give a smallish (<50 records) resultset, but what about making multiple SELECT statements all in a row? What about if the resultsets are huge (i.e. thousands of records)?
Thanks.
The key question to ask yourself is whether you need to keep all the returned rows around in memory.
If you do then you can let the DBI fetch them all for you - it will be faster than writing the equivalent code yourself.
If you don't need to keep all the rows in memory, in other words, if you can process each row in turn, then using fetchrow_arrayref in a loop will typically be much faster.
The reason is that the DBI goes to great lengths to reuse the memory buffers for each row. That can be a significant saving in effort. You can see the effect on this slide, although the examples don't directly match your question. You can also see from that slide the importance of benchmarking. It can be hard to tell where the balance lies.
If you can work on a per-row basis, then binding columns can yield a useful performance gain by reducing the work of accessing the values in the fetched rows.
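A minimal sketch of that pattern in Perl (assuming an already-connected $dbh):

# Row-at-a-time fetching with bound columns; DBI writes each fetched
# value straight into $id/$name, so no per-row copies are made.
my $sth = $dbh->prepare("SELECT id, name FROM employees");
$sth->execute;
$sth->bind_columns(\my ($id, $name));
while ($sth->fetch) {    # fetch() is the fastest row-at-a-time call
    print "$id: $name\n";
}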
You also asked about "huge" results. (I doubt "thousands of records" would be a problem on modern machines unless the rows themselves were very 'large'.) Clearly processing a row at a time is preferable for very large result sets. Note that some databases default to streaming all the results to the driver which then buffers them in a compact form and returns the rows one by one as your perl code (or a DBI method) fetches them. Again, benchmark and test for yourself.
If you need more help, the dbi-users mailing list is a good place to ask. You don't need to subscribe.
The difference between the fetchrow* and fetchall* methods is the location of the loop code in the call stack. That is, fetchall* implies the fetch loop, while fetchrow* implies that you will write your own loop.
The difference between the fetch* and select* methods is that one requires you to manually prepare() and execute() the query, while the other does that for you. The time differences will come from how efficient your code is compared to DBI's.
My reading has shown that the main differences between methods are between *_arrayref and *_hashref, where the *_hashref methods are slower due to the need to look up the hash key names in the database's metadata.

What's the best way to store/calculate user scores?

I am looking to design a database for a website where users will be able to gain points (reputation) for performing certain activities and am struggling with the database design.
I am planning to keep records of the things a user does so they may have 25 points for an item they have submitted, 1 point each for 30 comments they have made and another 10 bonus points for being awesome!
Clearly all the data will be there, but it seems like a lot of querying to get the total score for each user, which I would like to display next to their username (in the form of a level). For example, a query to the submitted items table to get the scores for each item from that user, a query to the comments table, etc. If all this needs to be done for every user mentioned on a page... LOTS of queries!
I had considered keeping a score in the user table, which would seem a lot quicker to look up, but I've had it drummed into me that storing data that can be calculated from other data is BAD!
I've seen a lot of sites that do similar things (even stack overflow does similar) so I figure there must be a "best practice" to follow. Can anyone suggest what it may be?
Any suggestions or comments would be great. Thanks!
I think that this is definitely a great question. I've had to build systems that have similar behavior to this--especially when the table with the scores in it is accessed pretty often (like in your scenario). Here's my suggestion to you:
First, create some tables like the following (I'm using SQL Server best practices, but name them however you see fit):
UserAccount
-Guid (PK)
-FirstName
-LastName
-EmailAddress

UserAchievement
-Guid (PK)
-UserAccountGuid (FK)
-Name
-Score
Once you've done this, go ahead and create a view that looks something like the following (no, I haven't verified this SQL, but it should be a good start):
SELECT [UserAccount].[Guid] AS UserGuid,
       [UserAccount].[FirstName] AS FirstName,
       [UserAccount].[LastName] AS LastName,
       SUM([UserAchievement].[Score]) AS TotalPoints
FROM [UserAccount]
INNER JOIN [UserAchievement]
    ON [UserAccount].[Guid] = [UserAchievement].[UserAccountGuid]
GROUP BY [UserAccount].[Guid],
         [UserAccount].[FirstName],
         [UserAccount].[LastName]
-- Group on the Guid as well, so two users sharing a name are not merged.
-- (ORDER BY is not allowed inside a SQL Server view; sort when querying it.)
I know you've mentioned some concern about performance and a lot of queries, but if you build out a view like this, you won't ever need more than one. I recommend not making this a materialized view; instead, just index your tables so that the lookups that you need (essentially, UserAccountGuid) will enable fast summation across the table.
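For example (SQL Server syntax; adjust names to taste), a covering index on the foreign key keeps that SUM cheap:

-- Covering index: seeks on UserAccountGuid and reads Score straight
-- from the index leaf, so the SUM never touches the base table.
CREATE NONCLUSTERED INDEX IX_UserAchievement_UserAccountGuid
    ON [UserAchievement] ([UserAccountGuid])
    INCLUDE ([Score]);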
I will add one more point--if your UserAccount table gets huge, you may consider a slightly more intelligent query that would incorporate the names of the accounts you need to get roll-ups for. This will make it possible not to return huge data sets to your web site when you're only showing, you know, 3-10 users' information on the page. I'd have to think a bit more about how to do this elegantly, but I'd suggest staying away from "IN" statements since this will invoke a linear search of the table.
For very high read/write ratios, denormalizing is a very valid option. You can use an indexed view and the data will be kept in sync declaratively (so you never have to worry about there being bad score data). The downside is that it IS kept in sync, so the updates to the score total are a synchronous part of committing the score action. This would normally be quite fast, but it is a design decision. If you denormalize yourself, you can choose to have some kind of delayed-update system.
Personally I would go with an indexed view for starting, and then later you can replace it fairly seamlessly with a concrete table if your needs dictate.
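For reference, an indexed view over the tables above might look like this sketch (SQL Server requires SCHEMABINDING and a COUNT_BIG(*) column when grouping, and Score must be declared NOT NULL for the SUM):

-- Sketch of an indexed view; the clustered index is what makes SQL Server
-- maintain the totals synchronously with every score change.
CREATE VIEW dbo.UserScoreTotals
WITH SCHEMABINDING
AS
SELECT UserAccountGuid,
       SUM(Score)   AS TotalPoints,
       COUNT_BIG(*) AS RowCnt      -- required by indexed views with GROUP BY
FROM dbo.UserAchievement
GROUP BY UserAccountGuid;
GO
CREATE UNIQUE CLUSTERED INDEX IX_UserScoreTotals
    ON dbo.UserScoreTotals (UserAccountGuid);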
In the past we've always used some sort of nightly or periodic cron job to calculate the current score and save it in the database - sort of like a persistent view of the SUM on the activities table. Like most "best practices", they are simply guidelines, and it's often better and more practical to deviate from a specific hard-nosed practice in very specific areas.
Plus it's not really that much of a deviation if you use the cron job, as it's better viewed as a cache stored in the database.
If you have a separate scores table, you could update it each time an item is submitted or a comment is posted by a user. You could do this using a trigger or within the site's code.
The user scores would be updated continuously, and could be quickly queried for display.
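A sketch of the trigger route (it assumes a denormalized Score column added to UserAccount; a real version would also handle UPDATE and DELETE):

-- Keeps UserAccount.Score in sync as achievements are inserted.
CREATE TRIGGER trg_UserAchievement_Insert
ON UserAchievement
AFTER INSERT
AS
BEGIN
    UPDATE ua
    SET ua.Score = ua.Score + s.Delta
    FROM UserAccount AS ua
    JOIN (SELECT UserAccountGuid, SUM(Score) AS Delta
          FROM inserted
          GROUP BY UserAccountGuid) AS s
        ON s.UserAccountGuid = ua.Guid;
END;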

BASIC Object-Relation Mapping question asked by a noob

I understand that, in the interest of efficiency, when you query the database you should only return the columns that are needed and no more.
But given that I like to use objects to store the query result, this leaves me in a dilemma:
If I only retrieve the column values that I need in a particular situation, I can only partially populate the object. This, I feel, leaves my object in a non-ideal state where only some of the properties and methods are available. Later, if a situation arises where I would like to reuse the object but find that the new situation requires a different but overlapping set of columns, I am faced with a choice.
Should I reuse the existing SQL and add to the list of selected columns the additional fields that are required by the new situation so that the same query and object mapping logic can be reused for both? Or should I create another method that results in the execution of a slightly different SQL which results in the populating of only those object properties that were returned by the 2nd query?
I strongly suspect that there is no magic answer and that the answer really "depends" on the situation, but I guess I am looking for general advice. In general, my approach has been either to return all columns from the queried table, or to add the additional columns to the query as they are needed while reusing the same SQL (and mapping code) - that is, until performance becomes a concern. In general, I find that unless you are retrieving a large number of rows - and I usually am not - the cost of adding additional columns to the output does not have a noticeable effect on performance, and the savings in development time and the simplified API that result are a good trade-off.
But how do you deal with this situation when performance does become a factor? Do you create methods like
Employees.GetPersonalInfo
Employees.GetLittleMorePersonlInfoButMinusSalary
etc, etc etc
Or do you somehow end up creating an API where the user of your API has to specify which columns/properties he wants populated/returned, thereby adding complexity and making your API less friendly/easy to use?
Let's say you want to get Employee info. How many objects would typically be involved?
1) an Employee object
2) An Employees collection object containing one Employee object for each Employee row returned
3) An object, such as EmployeeQueries, that contains methods such as "GetHiredThisWeek", which returns an Employees collection of 0 or more records.
I realize all of this is very subjective, but I am looking for suggestions on what you have found works best for you.
I would say make your application correct first, then worry about performance in this case.
You could be optimizing away your queries only to realize you won't use that query anyway. Create the most generalized queries that your entire app can use, and then as you are confident things are working properly, look for problem areas if needed.
It is likely that you won't have a great need for huge performance up front. Some people say the lazy programmers are the best programmers. Don't over-complicate things up front, make a single Employee object.
If you find a need to optimize, you'll create a method/class, or however your ORM library does it. This should be an exception to the rule; only do it if you have reason to do so.
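As a sketch of that "generalize first, specialize later" idea (queryRows is a hypothetical stand-in for whatever data-access helper you use):

// Hypothetical plumbing; swap in your real data-access layer.
declare function queryRows<T>(sql: string): Promise<T[]>;

interface Employee {
  id: number;
  firstName: string;
  lastName: string;
  salary: number;
}

// The one generalized query the whole app starts with.
function getEmployees(): Promise<Employee[]> {
  return queryRows<Employee>(
    "SELECT id, first_name AS firstName, last_name AS lastName, salary FROM employees",
  );
}

// Added later, only if profiling shows the wide query actually hurts.
function getEmployeeNames(): Promise<
  Pick<Employee, "id" | "firstName" | "lastName">[]
> {
  return queryRows<Pick<Employee, "id" | "firstName" | "lastName">>(
    "SELECT id, first_name AS firstName, last_name AS lastName FROM employees",
  );
}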
...the cost of adding additional columns to the output does not have a noticeable effect on performance...
Right. I don't quite understand what "new situation" could arise, but either way, it would be a much better idea (IMO) to get all the columns rather than run multiple queries. There isn't much of a performance penalty at all for getting more columns than you need (the results will take more RAM, but that shouldn't be a big issue; besides, hardware is cheap). Also, you'd save yourself quite a bit of development time.
As for the second part of your question, it's really up to you. As an example, Rails takes more of a "usability first, performance last" approach, but that may not be what you want. It just depends on your needs. If you're willing to sacrifice a little usability for performance, by all means, go for it. I would.
If you are using your objects in a "row at a time" CRUD-type application, then by all means copy all the columns into your object; the extra overhead is minimal, and your object becomes truly reusable for any program wanting row access to the table.
However, if your SQL is doing a complex join or returning a large set of rows, then request precisely and only what you want. You pay two performance penalties otherwise: one, handling each column each time will eat up CPU for no benefit; and two, most DBMS systems have a bag of tricks for optimising queries (such as index-only access) which can only be used if you specify precisely which columns you want.
There is no reuse issue in most of these cases, as scan/search processes tend to be very specific to a particular use case.