I'm looking for some (more than intro-level) database optimizations, not just for Rails but for SQL in general.
In Rails, how/when do you start using these?
counter_cache
caching ids in a table (user has_many roles, so the user table gets a role_ids serialized column, since every request requires fetching the roles).
Are these types of things necessary, or when are they necessary, given that you can also just store these things (roles for user in this example) in memcached?
Right now I have an app where, on every request, a User loads its has_many Role objects and maybe a few preferences (where preferences live in another table). Is it good/bad to cache those preference/role ids in the user table?
Thanks for the tips!
Caching is a good thing if you need it and a terrible thing if you don't. Is your app showing strain? Have you benchmarked it? Have you isolated the bottleneck to the point where you could select an optimization if you had a cartload of them? The reason my answer is a question is that no two scaling issues have the same solution so without a much clearer idea where your app is hung up, it's hard to say whether caching will help.
On a side note, unless the principal function of your app queries users and roles, you will probably find the traffic down that path is not particularly dense. Typically you hit it once, squirrel the information away in the session and off you go. If you put a lot of work into denormalizing and caching, you may not find it pays the dividends you expect. And don't forget you now have to test for the cache-hit and cache-miss scenarios to make sure everything functions identically.
And more directly to your question:
memcached is a good LRU cache. There are other good KV stores like Redis and Tokyo Cabinet.
Counter caching is low-hanging fruit if (and only if) you have a frequently accessed counter. For example, if you need to know how many friends a user has, but don't need to know anything about the friends, you can just denormalize friends.count.
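Outside Rails, the same idea can be kept up by hand. Below is a minimal sketch using the knex query builder (knex, the table names and the friends_count column are all assumptions for illustration, not anything from the question): insert the association and bump the denormalized counter in one transaction. In Rails itself, the counter_cache option on the association does this bookkeeping for you.

// Hedged sketch of a hand-rolled counter cache; `knex` is a configured knex instance.
async function addFriend(knex, userId, friendId) {
  await knex.transaction(async (trx) => {
    await trx('friendships').insert({ user_id: userId, friend_id: friendId });
    // Keep the denormalized count in sync with the real row count.
    await trx('users').where({ id: userId }).increment('friends_count', 1);
  });
}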
I'm not sure what caching IDs would buy you unless the roles remained pretty static. You have to develop a strategy for freshening them when the roles change and ... well, then you have to test the cache-hit and cache-miss scenarios.
My experience is that the biggest bang you can get for your optimization buck is caching pages. After that, fragments, then down to database-level and application-level tweaking.
Have you profiled your app?
What's the best practice to fetch detail data in a React app when you are dealing with multiple master-detail views?
For example, if you have
- /rest/departments API which returns the list of departments
- /rest/departments/:departmentId/employees API which returns all employees within a department.
To fetch all departments I use:
componentDidMount() {
  this.props.dispatch(fetchDepartments());
}
but then I'll need logic to fetch all employees per department. Would it be a good idea to call the employee action creator for each department in the department reducer logic?
Dispatching employee actions in the render method does not look like a good idea to me.
Surely it is a bad idea to call an employee action creator inside the department reducer, as reducers should be pure functions; you should do it in your fetchDepartments action creator.
Anyway, if you need to get all the employees for every department (not just the selected one), it is not ideal to make many API calls: if possible, I would ask the backend developers for an endpoint that returns the array of departments and, for each department, an embedded array of employees, if the numbers aren't too big of course...
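As a rough illustration of both points, here is a sketch of a fetchDepartments thunk that does the work in the action creator and leans on a single combined endpoint (redux-thunk, the ?embed=employees query parameter and the action type names are all assumptions for this example, not part of the question):

// Hedged sketch; assumes the redux-thunk middleware is installed.
export function fetchDepartments() {
  return async (dispatch) => {
    dispatch({ type: 'DEPARTMENTS_REQUEST' });
    try {
      // One round trip returning departments with their employees embedded.
      const res = await fetch('/rest/departments?embed=employees');
      const departments = await res.json();
      dispatch({ type: 'DEPARTMENTS_SUCCESS', payload: departments });
    } catch (err) {
      dispatch({ type: 'DEPARTMENTS_FAILURE', error: err.message });
    }
  };
}

The reducer then only stores the payload; it never dispatches anything itself.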
Big old "It depends"
This is something where, in the end, you will need to pick an approach and see how it works out with your specific data and user needs. It also depends somewhat on network issues, such as latency. In a very nicely networked environment, such as the top-3 insurance company I was a net admin for, you can achieve super low latency network calls. In such a case, multiple network requests behave very differently than they would over a typical home internet connection. Even then, you have to consider a wide range of possibilities. And you ALWAYS need to consider your end goals.
(Not to get too down in the technical aspects, but latency can fairly accurately be defined as "the time you are waiting for a network request to actually start sending data". A classic example of where this can be important is online first person shooter gaming. You click shoot, and the data is not transmitted as fast as you would like since the network is waiting to send the data, then you die. A classic example where bandwidth is more useful than latency is downloading or uploading large files. If you have to wait a second or two for the actual data to move, but when it moves you can download a GB in seconds, then oh well, I'll take it.)
Currently, I have our website making multiple calls to load dynamic menus and dynamic content. It is very small data. It is done in three separate calls. On the internet. It's "ok", but I would not say that it is "good". Since users are waiting for all of it to even start, I might as well throw it all in a single network call. Also, if two calls go ok but the third chokes a bit, the user may start to navigate, then more menus pop in, and it is not ideal. This is why, regardless, you have to think about your specific needs and what range of possible use cases may likely apply. (I am currently re-writing the entire site anyway.)
As a previous (in my opinion "good") answer stated, it probably makes sense to have the whole data set shot to you in one gulp. It appears to me this is an internal, or at least commercial app, with decent network and much more importantly, no risk of losing customers because your stuff did not load super fast.
That said, if things do not work out well with that, especially if you are talking large data sets, then consider a lazy loading architecture. For example, your user cannot get to an employee until they see the departments. So it may be ok, depending on your network and large data size, to load departments, and then after it returns initiate an asynchronous load of the employee data. The employee data is now being loaded while your user browses the department names.
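To make that ordering concrete, here is a small sketch in plain JavaScript (the render callbacks are placeholders, and it assumes each department object has an id field; only the /rest/... paths come from the question): departments load and render first, then the employee requests run in the background while the user browses.

// Hedged sketch of the lazy-loading order described above.
async function loadScreen(renderDepartments, renderEmployees) {
  const departments = await (await fetch('/rest/departments')).json();
  renderDepartments(departments); // the user can start browsing here

  // Kick off the employee loads after the departments have rendered.
  Promise.all(
    departments.map((d) =>
      fetch(`/rest/departments/${d.id}/employees`).then((r) => r.json())
    )
  )
    .then(renderEmployees)
    .catch((err) => console.error('employee load failed', err));
}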
A huge question you may want to clarify is whether or not any employee list data is rendered WITH the departments. In one of my cases, I have a work order system that I load after login, but lazily, and when it is loaded it throws a badge on the Work Order menu to show how many are outstanding. Since I do not have a lot of orders, it is basically a one second wait. No biggie. It is not like the user has to wait for it to load to begin work. If you wanted a badge per department, then it may get weird. You could, if you load by department, have multiple badges popping in randomly. In this case, it may cause user confusion, and it is probably a good choice to load it in one large chunk. If the user has to wait anyway, it may mean one less call from a user asking "is it ok that it is doing this?". Especially with software for the workplace, it is more acceptable to have to wait for an initial load at the beginning of the work day.
To be clear, with all of these complications to consider, it is extremely important that you develop with the best coding practices you can manage. This way, you can code one solution, and if it does not meet your performance or user needs, it is not a nightmare to make a change. In the general case with small data, I would just load it in one big gulp to start, and if there are problems with load times, complicate it from there. Complicating code from the beginning for no clearly needed reason is a good way to clutter your code up to the point of making it completely unwieldy to maintain.
On a third note, if you are dealing with enterprise size data sets, that is a whole different thing. Then you have to deal with pagination, and yes it gets a bit more complicated.
Regards,
DB
I'm not sure what fetchDepartments does exactly but I'd ensure the actual fetch request is executed from a Redux middleware. By doing it from middleware, you can fingerprint / cache / debounce all your requests and make a single one across the app no matter how many components request the thing.
In general, middleware is the best place to handle asynchronous side effects.
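For example, a bare-bones middleware that dedupes identical in-flight requests could look roughly like this (the API_FETCH action shape is an invented convention for this sketch, and real code would also want caching, invalidation and error handling):

// Hedged sketch of a request-deduplicating Redux middleware.
const inFlight = new Map(); // url -> pending promise

export const apiMiddleware = (store) => (next) => (action) => {
  if (action.type !== 'API_FETCH') {
    return next(action);
  }
  // Reuse the pending promise if the same URL is already being fetched.
  if (!inFlight.has(action.url)) {
    const promise = fetch(action.url)
      .then((res) => res.json())
      .finally(() => inFlight.delete(action.url));
    inFlight.set(action.url, promise);
  }
  return inFlight.get(action.url).then((data) =>
    store.dispatch({ type: action.onSuccess, payload: data })
  );
};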
I'm working on an ASP.net web application that uses SQL as a database back-end. One issue that I have is that it sometimes takes a while to get my DBA to create or modify tables in the database which under no circumstance am I allowed to modify on my own.
Here is something that I do when I expect users to upload files with their data.
Suppose the user uploads a new record for a table called Student_Records. The user uploads a record with fname Bob and lname Smith. The record is assigned primary key 123. The user also uploads two files: attendance_record.pdf and homework_record.pdf. Let's suppose that I have a network share, \\foo\bar, where the files are saved.
One way of handling this situation would be to have a table Student_Records_Files that associates the uploaded files with key 123 (Bob Smith's record). However, since I have trouble getting tables created, I've gone and done something different: when I save the files on the server, I call them 123_attendance_record.pdf and 123_homework_record.pdf. That way, I can easily identify which table record each file is associated with without having to create a new SQL table. I am, in essence, using the file system itself as a join table (obviously, the file system is a type of database).
In my code for retrieving the files, I scan the directory \\foo\bar and look for files that begin with each primary key number from Student_Records.
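In outline (a simplified sketch, not my actual ASP.NET code - shown here in Node-flavoured JavaScript just to illustrate the lookup), it amounts to:

// Hedged sketch: list every file on the share that belongs to a record.
const fs = require('fs');
const path = require('path');

function filesForRecord(recordId, shareDir = '\\\\foo\\bar') {
  const prefix = `${recordId}_`;
  return fs
    .readdirSync(shareDir)
    .filter((name) => name.startsWith(prefix))
    .map((name) => path.join(shareDir, name));
}

// filesForRecord(123) -> ['\\\\foo\\bar\\123_attendance_record.pdf', ...]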
It seems to work very well, but is it good practice?
There is nothing wrong with using the file system to store files. It's what it is used for.
There are a few things to keep in mind though.
I would consider a better method of storing the files - perhaps a directory for each record, rather than simply prefixing the filename with the record id.
Ensure that the file store is resilient and backed up with the same regularity as your database. If your database is configured to give you a backup every 10 minutes, but your file store only does a backup every day (or worse week) then you might be in for a world of pain.
Also consider what would happen if the user uploads two documents that have the same name.
First of all, I think it's a bad practice, in general, to design your architecture based on how responsive your DBA is. Any given compromise based on this approach may or may not be a big deal, but over time it will result in a poorly designed system.
Second, making the file name this critical seems dangerous to me; there's no protection against a person or application modifying the filename without realizing its importance.
Third, one of the advantages of having a table to maintain the join between the person and the file is that you can add additional data, such as: when was the file uploaded, what is the MIME type, has the file been read by anyone through the system, is this file a newer version of a previous file, etc. etc. Metadata can be very powerful, and the filesystem offers only limited ways to store it.
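To make that concrete, here is a sketch of what such a join table could carry, expressed with the knex query builder purely for illustration (the table name follows the Student_Records_Files idea from the question; the referenced id column and every field beyond the key and filename are examples of possible metadata, not something prescribed):

// Hedged sketch of a file-association table with room for metadata.
async function createStudentRecordsFiles(knex) {
  await knex.schema.createTable('Student_Records_Files', (t) => {
    t.increments('id');
    t.integer('student_record_id').notNullable()
      .references('id').inTable('Student_Records');
    t.string('file_path').notNullable();   // where the file actually lives
    t.string('mime_type');                 // e.g. application/pdf
    t.timestamp('uploaded_at').defaultTo(knex.fn.now());
    t.integer('version').defaultTo(1);     // supersedes an earlier upload?
    t.boolean('read_by_reviewer').defaultTo(false);
  });
}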
There are really two questions here. One is, given that for administrative reasons you cannot get changes made to the database schema, is it acceptable to devise some workaround. To that I'd have to say yes. What else can you do? In theory, if it takes two weeks to get the DBA to make a schema change for you, then this two weeks should be added to any deadline that you are given. In practice, this almost never happens. I've often worked places where some paperwork or whatever required two weeks before I could even begin work, and then I'd be given two weeks and one day to do the project. Sometimes you just have to put it together with rubber bands and bandaids.
Two is, is it a good idea to build a naming convention into file names and use this to identify files and their relationship to other data. I've done this at times and it's generally worked for me, though I have a perhaps irrational emotional feeling that it's not a good idea.
On the plus side, (a) By building information into a file name, you make it easy for both the computer and a human being to identify file associations. (Human readable as long as the naming convention is straightforward enough, anyway.) (b) By eliminating the separate storage of a link, you eliminate the possibility of a bad link. A file with the appropriate name may not exist, of course, but a database record with appropriate keys may not exist, or the file reference in such a record may be null or invalid. So it seems to solve one problem there without creating any new problems.
Potential minuses are: (a) You may have characters in the key that are not legal in file names. You may be able to just strip such characters out, or this may cause duplicates. The only safe thing to do is to escape them in some way, which is a pain. (b) You may exceed the legal length of a file name. Not as much of an issue as it was in the bad old 8.3 days. (c) You can't share files. If a database record points to a file, then two db records could point to the same file. If you must make two copies of a file, not only does this waste disk space, but it also means that if the file is updated, you must be sure to update all copies. If in your application it would make no sense to share files, then this isn't an issue.
You have to manage the files in some way, but you had to do that anyway.
I really can't think of any overriding minuses. As I say, I've done this on occasion and didn't run into any particular problems. I'm interested in seeing others' responses.
I think it is not good practice, because you are making your working application very dependent on specific implementation details, which will make it pretty hard to maintain in the future, or for other people who later need access to your code/API.
Now whether you should do this or not is a whole different question. If you are really taking that much of a performance hit and it is significantly easier to work with what you have, then I would say go ahead and break the rules. Ideally it's good to follow best-practice methods, but sometimes you have to bend the rules a little to make things work.
First, why is this a table change as opposed to a data change? Once you have the tables set up you should only need to update rows in that table every time that a user adds new files. If you have to put up with this one-time, two-week delay then bite the bullet and just get it done right.
Second, instead of trying to work around the problem why don't you try to fix the problem? Why is the process of implementing table changes so slow? Are you at least able to work on a development database (in which you have control to test and try out these changes)? Even if it's your own laptop you can at least continue on with development. Work with your manager, the DBA, and whoever else you need to, in order to improve the process. Would it help to speed things up if your scripts went through a formal testing process before you handed them off to the DBA so that he doesn't need to test the scripts, etc. himself?
Third, if this is a production database then you should probably be building in this two-week delay into your development cycle. You know that it takes two weeks for the DBA to review and implement changes in production, so make sure that if you have a deadline for releasing functionality that you have enough lead time for it.
Building this kind of "data" into a filename has inherent problems as others have pointed out. You have no relational integrity guarantees and the "data" can be changed without knowledge of the rest of the application/database.
It's best to keep everything in the database.
Network file I/O is spotty at best. In addition, it's slower than DB I/O.
If the DBA is difficult about getting small changes into the database, you may be dealing with:
A political control issue. Maybe he just knows DB stuff and is threatened when he perceives others moving in on his turf. Whatever his reasons, you need to GET WORK DONE. Period. Document all the extra time / communication / work you need to do for each small change and take that up with management. If the first level of management is unwilling to see things your way (it does not matter what their reasons are), escalate the issue to the next level of management. In the past, I've gotten results this way. It was more of a political territory problem than a technical problem. The DBA eventually gave up and gave me full access to the TEST system, BUT he also stipulated that I would need to learn his testing process, naming conventions, his DB standards and practices, his way of testing, etc. I was game. I would also need to fix any database problems arising from changes I introduced. This was fair, and I got to wear the DBA hat in addition to the developer hat. I got the freedom I needed and he got one less thing to worry about.
A process issue. Maybe the DBA needs to put every small DB change you submit through a gauntlet of testing and performance analysis. Maybe he has a highly normalized DB schema and, because he has the big picture, he needs to normalize or denormalize your requested DB changes to fit into the existing schema. Ask to work with him. Ask him for a full DB design diagram. Get a good sense of his DB design philosophy, and implement your DB changes with that philosophy in mind. Show that you understand that he's trying to keep the DB in good order (understand normalization, relational constraints, check constraints). Give him less to worry about. He needs to trust that you will not muck up his database.
Either way, accumulate all the small changes into a lengthy script and submit them to the DBA. This way, you won't have to wait for each small change to go through all of his process / testing. In addition, you're giving him a bigger-picture view of your development planning (one that is in step with his DB design philosophy) instead of just the play-by-play.
I am designing an app that would involve users 'following' each other's activity, in the twitter sense, but I am not very experienced with database/query design/efficiency. Are there best practices for managing this, pitfalls to avoid, etc.? I gather this can create a very large load on the db if not done properly (or maybe even then?).
If it makes a difference it is likely that people will 'follow' only a relatively small number of people (but a person may have many followers). However this is not certain, and I wouldn't want to count on it.
Any advice gratefully received. Thanks.
Pretty simple and easy to do with full normalisation. If you have a table of users, each with a unique ID, you would have a TABLE_FOLLOWERS table with the columns USERID and FOLLOWERID, which would describe all the followers for each user as a one-to-many relationship.
Even with millions of associations, on a half-decent database server this will perform well and fast as long as you are using a good database (i.e., not MS Access).
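For concreteness, here is a sketch of that table and the two obvious lookups, written with the knex query builder (knex, the connection string, and the users table with an id column are assumptions for the example; the join-table and column names follow the answer):

// Hedged sketch; npm install knex pg, connection details are placeholders.
const knex = require('knex')({ client: 'pg', connection: process.env.DATABASE_URL });

async function createFollowersTable() {
  await knex.schema.createTable('TABLE_FOLLOWERS', (t) => {
    t.integer('USERID').notNullable().references('id').inTable('users');
    t.integer('FOLLOWERID').notNullable().references('id').inTable('users');
    t.primary(['USERID', 'FOLLOWERID']); // one row per follow association
  });
}

// Everyone who follows a given user.
const followersOf = (userId) =>
  knex('TABLE_FOLLOWERS').where({ USERID: userId }).pluck('FOLLOWERID');

// Everyone a given user follows.
const followingOf = (followerId) =>
  knex('TABLE_FOLLOWERS').where({ FOLLOWERID: followerId }).pluck('USERID');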
The model is fairly simple. The problem is in the size of the Subscription table; if there are 1 million users, and each subscribes to 1000, then the Subscription table has 1 billion rows.
That depends on how many users you expect to need to support; how many followers you expect users to have; and what sort of funding/development-effort you expect to have access to should your answers to the previous questions prove optimistic.
For a small-scale project I would likely ignore the database, design the application as a simple object model with User objects that maintain a List[followers], keep it all in RAM for normal operation, and use an ORM to persist to a database periodically (probably PostgreSQL or MySQL).
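A toy version of that object model, sketched in plain JavaScript (the answer doesn't name a language, and the persistence step is left out):

// Hedged sketch of the in-RAM model; an ORM/persistence job would flush
// the `users` map to PostgreSQL or MySQL periodically, as described above.
class User {
  constructor(id, name) {
    this.id = id;
    this.name = name;
    this.followers = new Set(); // ids of users who follow this user
    this.following = new Set(); // ids of users this user follows
  }
}

const users = new Map(); // id -> User, the whole graph kept in RAM

function follow(followerId, followedId) {
  users.get(followedId).followers.add(followerId);
  users.get(followerId).following.add(followedId);
}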
For a larger project I would not be using a relational database at all; but exactly what I would use would depend on the specific details of the project.
If you are only trying to spike the concept, go with the ORM approach; but, keep in mind it won't scale.
You should probably read http://highscalability.com/ and its articles on how this is managed by the big sites.
Our software currently runs on MySQL. The data of all tenants is stored in the same schema. Since we are using Ruby on Rails we can easily determine which data belongs to which tenant. However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
So far I have seen three options:
Multi-Database (each tenant gets its own - nearly the same as 1 server per customer)
Multi-Schema (not available in MySQL, each tenant gets its own schema in a shared database)
Shared Schema (our current approach, maybe with additional identifying record on each column)
Multi-Schema is my favourite (considering costs). However creating a new account and doing migrations seems to be quite painful, because I would have to iterate over all schemas and change their tables/columns/definitions.
Q: Multi-Schema seems to be designed to have slightly different tables for each tenant - I don't want this. Is there any RDBMS which allows me to use a multi-schema multi-tenant solution, where the table structure is shared between all tenants?
P.S. By multi I mean something like ultra-multi (10,000+ tenants).
However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
This is unfortunate, as customers sometimes suffer from a misconception that only physical isolation can offer enough security.
There is an interesting MSDN article, titled Multi-Tenant Data Architecture, which you may want to check. This is how the authors addressed the misconception towards the shared approach:
A common misconception holds that only physical isolation can provide an appropriate level of security. In fact, data stored using a shared approach can also provide strong data safety, but requires the use of more sophisticated design patterns.
As for technical and business considerations, the article makes a brief analysis on where a certain approach might be more appropriate than another:
The number, nature, and needs of the tenants you expect to serve all affect your data architecture decision in different ways. Some of the following questions may bias you toward a more isolated approach, while others may bias you toward a more shared approach.
How many prospective tenants do you expect to target? You may be nowhere near being able to estimate prospective use with authority, but think in terms of orders of magnitude: are you building an application for hundreds of tenants? Thousands? Tens of thousands? More? The larger you expect your tenant base to be, the more likely you will want to consider a more shared approach.
How much storage space do you expect the average tenant's data to occupy? If you expect some or all tenants to store very large amounts of data, the separate-database approach is probably best. (Indeed, data storage requirements may force you to adopt a separate-database model anyway. If so, it will be much easier to design the application that way from the beginning than to move to a separate-database approach later on.)
How many concurrent end users do you expect the average tenant to support? The larger the number, the more appropriate a more isolated approach will be to meet end-user requirements.
Do you expect to offer any per-tenant value-added services, such as per-tenant backup and restore capability? Such services are easier to offer through a more isolated approach.
UPDATE: Further to the update about the expected number of tenants.
That expected number of tenants (10k) should exclude the multi-database approach, for most, if not all scenarios. I don't think you'll fancy the idea of maintaining 10,000 database instances, and having to create hundreds of new ones every day.
From that parameter alone, it looks like the shared-database, single-schema approach is the most suitable. The fact that you'll be storing just about 50 MB per tenant, and that there will be no per-tenant add-ons, makes this approach even more appropriate.
The MSDN article cited above mentions three security patterns that tackle security considerations for the shared-database approach:
Trusted Database Connections
Tenant View Filter
Tenant Data Encryption
When you are confident in your application's data safety measures, you will be able to offer your clients a Service Level Agreement that provides strong data safety guarantees. In your SLA, apart from the guarantees, you could also describe the measures that you would take to ensure that data is not compromised.
UPDATE 2: Apparently the Microsoft guys moved / made a new article regarding this subject, the original link is gone and this is the new one: Multi-tenant SaaS database tenancy patterns (kudos to Shai Kerer)
Below is a link to a white-paper on Salesforce.com about how they implement multi-tenancy:
http://www.developerforce.com/media/ForcedotcomBookLibrary/Force.com_Multitenancy_WP_101508.pdf
They have 1 huge table w/ 500 string columns (Value0, Value1, ... Value500). Dates and Numbers are stored as strings in a format such that they can be converted to their native types at the database level. There are meta data tables that define the shape of the data model which can be unique per tenant. There are additional tables for indexing, relationships, unique values etc.
Why the hassle?
Each tenant can customize their own data schema at run-time without having to make changes at the database level (alter table etc). This is definitely the hard way to do something like this but is very flexible.
My experience (albeit SQL Server) is that multi-database is the way to go, where each client has their own database. So although I have no mySQL or Ruby On Rails experience, I'm hoping my input might add some value.
The reasons why include:
data security/disaster recovery. Each company's data is stored entirely separately from the others, giving a reduced risk of data being compromised (thinking of things like introducing a code bug that means something mistakenly looks at another client's data when it shouldn't), and minimising the potential loss to one client if one particular database gets corrupted, etc. The perceived security benefits to the client are even greater (added bonus side effect!)
scalability. Essentially you'd be partitioning your data out to enable greater scalability - e.g. databases can be put on to different disks, you could bring multiple database servers online and move databases around easier to spread the load.
performance tuning. Suppose you have one very large client and one very small. Usage patterns, data volumes etc. can vary wildly. You can tune/optimise easier for each client should you need to.
I hope this does offer some useful input! There are more reasons, but my mind went blank. If it kicks back in, I'll update :)
EDIT:
Since I posted this answer, it's become clear that we're talking about 10,000+ tenants. My experience is with hundreds of large-scale databases - I don't think 10,000 separate databases is going to be very manageable, so I'm now not favouring the multi-db approach for your scenario. Especially as it's now clear you're talking about small data volumes for each tenant!
Keeping my answer here anyway, as it may be of some use to other people in a similar boat (with fewer tenants).
As you mention, the one-database-per-tenant approach is an option and does have some larger trade-offs with it. It can work well at smaller scale, such as single digits or low tens of tenants, but beyond that it becomes harder to manage - not just the migrations, but also just keeping the databases up and running.
The per-schema model isn't only useful when each tenant needs a unique schema, though running migrations across all tenants still becomes difficult, and at thousands of schemas Postgres can start to have trouble.
A more scalable approach is to have tenants randomly distributed, stored in the same database, but across different logical shards (or tables). Depending on your language there are a number of libraries that can help with this. If you're using Rails there is a library to enforce the tenancy, acts_as_tenant; it helps ensure your tenant queries only pull back that tenant's data. There's also a gem, apartment - though it uses the schema model, it does help with the migrations across all schemas. If you're using Django there are a number of options, and one of the more popular ones works across schemas. All of these help more at the application level. If you're looking for something more at the database level directly, Citus focuses on making this type of sharding for multi-tenancy work more out of the box with Postgres.
Assume Hibernate for the ORM.
I'm not sure how to ask this. I want to build an application that can replace part of another. For example, say I have an application with various modules, called the "big" app. This application may handle HR, financial, purchases, skill sets, etc. But maybe, for whatever reason, I don't like the skill set module, but I like the rest of the application. I want to build an app that uses the same database that the rest of the "big" app uses but use my software as the front end for that piece.
I could build my app and have it hit the database directly with no ORM. My question is: is there an advantage to using an ORM here? I'm thinking there is, because if the "big" app goes away and another app is purchased, we could continue to use my version of the skill set module, since I am using Hibernate instead of hitting things directly. I'm still learning, but I thought that my application used objects that I named, and that in the case I just described I'd have to change only my mapping files and/or very little of my code.
Here is another example. I have a legacy application and legacy database. It uses database X. I decide that I no longer like the old terminal emulator application that is used to get the data and that I want a graphical version. I can use Hibernate with my application, and when I finally decide to get rid of the legacy database and change to the latest Oracle or SQL Server, I can do so with minimal headache? Or is my database going to change so much that it wouldn't have mattered anyway (I'm suggesting that upon changing to a new database, more information will want to be captured)?
I was hoping for comments, if I am misunderstanding why hibernate/ORM might or might not be a benefit.
Thank you.
I do not think you will get a huge benefit from Hibernate if the database schema changes to something completely different; you might have to change more than just your mapping - especially if more "structure" is added to the database (tables, columns and other schema things). That said, if the database stays structured mostly the same way, but let's say just the column names and table names change and a couple of tables are merged or something like that, you can get by with just changing your mapping.
But I would really recommend using Hibernate for database agnosticism; that is a pretty easy path.
And then, even though it doesn't exactly help you if your entire database is changed, it has such an incredible amount of other strengths that I would choose it over direct DB access most of the time.
Lastly, you could think about using a service layer such as the repository pattern that abstracts away the data access, so the business logic of your application wouldn't need to change if the database changes.
Switching from one DBMS to another (ala Oracle to SQL Server) is one thing that using an ORM would certainly make much easier.
As for switching from one "big app" to another "big app", I doubt if using an ORM would help that much. It's likely that the database structure and business logic would be different enough that you would find yourself rewriting lots of code anyways.
You can generate domain objects with Hibernate Tools; if you do that, it will be painless and fast. However, if you write all the objects by hand you will die. I think it's a good idea to rewrite part of the app and get to know Hibernate better.
I think it's generally a bad idea to make any decision based on the unknowns versus the knowns. Whether you're deciding on a data access/persistence strategy, what car to buy, or what college to go to, you should put the most weight on the things you know you want today, rather than worrying about what may or may not happen tomorrow.
So when considering ORMs, I wouldn't worry too much about things such as apps "going away" or DBMSs changing (unless that's either already been talked about, or there's a history of this in your company). I'm not saying that these things will never happen, but rather that they should take a back seat to the generally much more important considerations of maintainability, performance, and developer productivity.
So in short, choose an ORM based on its ability to solve the problems and satisfy the requirements that you have today.