Web app OO concept confusion - SQL

This is a concept question, regarding "best practice" and "efficient use" of resources.
Specifically, it deals with large data sets in a DB and online web applications, and with moving from a procedural processing approach to a more object-oriented approach.
Take a "list" page, found in almost all CRUD aspects of the application. The list displays a company, address and contact. For the sake of argument, and "proper" RDBM, assume we've normalized the data such that a company can have multiple addresses, and contacts.
- for our scenario, lets say I have a list of 200 companies, each with 2-10 addresses, each address has a contact. i.e. any franchise where the 'store' is named 'McDonalds', but there may be multiple addresses by that 'name').
TABLES
companies
addresses
contacts
To this point, I'd make a single DB call and use joins to pull back ALL my data, loop over the data and output each line. Some grouping would be done at the application layer to display things in a friendly manner. (This seems like the most efficient way, as the RDBMS did the heavy lifting and there was a minimum of network calls: one to the DB, one from the DB, one HTTP request, one HTTP response.)
Another way of doing this, if you couldn't group at the application layer, is to query for the company list, loop over that, and inside the loop make separate DB call(s) for the address and contact. Less efficient, because you're making multiple DB calls.
Now - the question, or sticking point.... Conceptually...
If I have a company object, an address object and a contact object, it seems that in order to achieve the same result you would call a 'getCompanies' method that would return a list, loop over the list, and call 'getAddress' for each, and likewise a 'getContact', passing in the company ID etc.
In a web app - this means A LOT more traffic from the application layer to the DB for the data, and a lot of smaller DB calls, etc. - it seems SERIOUSLY less effective.
If you then move a fair amount of this logic to the client side, for an AJAX application, you're incurring network traffic ON TOP of the increased internal network overhead.
Can someone please comment on the best ways to approach this? Maybe it's a conceptual thing.
Someone suggested that a 'gateway' is what you use when you access these large data sets, as opposed to smaller, more granular object data - but this doesn't really help my understanding, and I'm not sure it's accurate.

Of course getting everything you need at once from the database is the most efficient. You don't need to give that up just because you want to write your code as an OO model. Basically, you get all the results from the database first, then translate the tabular data into a hierarchical form to fill objects with. "getCompanies" could make a single database call joining addresses and contacts, and return "company" objects that contain populated lists of "addresses" and "contacts". See Object-relational mapping.
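As a rough illustration of that mapping step, here is a minimal Go sketch. The table names, column names and struct fields are assumptions for the example (not taken from the question), and it assumes some database/sql driver is registered.

package listpage

import "database/sql"

// Contact, Address and Company are placeholder types for this sketch.
type Contact struct{ Name, Phone string }

type Address struct {
	Street  string
	Contact Contact
}

type Company struct {
	ID        int64
	Name      string
	Addresses []Address
}

// getCompanies issues ONE joined query and folds the flat rows into a
// Company -> Address -> Contact hierarchy, so the objects come back fully
// populated without a per-company round trip.
func getCompanies(db *sql.DB) ([]*Company, error) {
	rows, err := db.Query(`
		SELECT c.id, c.name, a.street, ct.name, ct.phone
		FROM companies c
		JOIN addresses a ON a.company_id = c.id
		JOIN contacts ct ON ct.address_id = a.id
		ORDER BY c.id`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var result []*Company
	byID := map[int64]*Company{} // group rows belonging to the same company
	for rows.Next() {
		var id int64
		var cname, street, ctName, ctPhone string
		if err := rows.Scan(&id, &cname, &street, &ctName, &ctPhone); err != nil {
			return nil, err
		}
		co, ok := byID[id]
		if !ok {
			co = &Company{ID: id, Name: cname}
			byID[id] = co
			result = append(result, co)
		}
		co.Addresses = append(co.Addresses, Address{
			Street:  street,
			Contact: Contact{Name: ctName, Phone: ctPhone},
		})
	}
	return result, rows.Err()
}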

I've dealt with exactly this issue many times. The first and MOST important thing to remember is: don't optimize prematurely. Optimize your code for readability, the DRY principle, etc., then come back and fix things that are "slow".
However, specific to this case, rather than iteratively getting the addresses for each company one at a time, pass a list of all the company IDs to the fetcher, get all the addresses for all those company IDs, then cache that list of addresses in a map. When you need to fetch an address by addressID, fetch it from that local cache. This is called an IdentityMap. However, like I said, I don't recommend recoding the flow for this optimization until needed. Most often there are 10 things on a page, not 100, so you are saving only a few milliseconds by changing the "normal" flow for the optimized flow.
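A hedged Go sketch of that batched lookup (the table and column names are again made up; pq.Array assumes the PostgreSQL lib/pq driver, so on other databases you would build an IN (...) list instead):

package identitymap

import (
	"database/sql"

	"github.com/lib/pq" // assumed PostgreSQL driver, used here only for array binding
)

// Address is a deliberately tiny placeholder type for this sketch.
type Address struct {
	CompanyID int64
	Street    string
}

// loadAddressMap fetches the addresses for a whole page of companies in ONE
// query and returns them keyed by company ID. Subsequent per-company lookups
// are served from this local map instead of hitting the database again.
func loadAddressMap(db *sql.DB, companyIDs []int64) (map[int64][]Address, error) {
	rows, err := db.Query(
		`SELECT company_id, street FROM addresses WHERE company_id = ANY($1)`,
		pq.Array(companyIDs))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	byCompany := make(map[int64][]Address)
	for rows.Next() {
		var a Address
		if err := rows.Scan(&a.CompanyID, &a.Street); err != nil {
			return nil, err
		}
		byCompany[a.CompanyID] = append(byCompany[a.CompanyID], a)
	}
	return byCompany, rows.Err()
}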
Of course, once you've done this 20 times, writing code in the "optimized flow" becomes more natural, but you also have the experience of when to do it and when not to.

Related

MariaDB data separation in public and private, database design

I am working at a company that merged with another company a while ago.
There we have several business units that are basically equivalent: one in Europe and one in China. We already had an in-house MariaDB database, which we want to start sharing.
The problem is that there are different GDPR regulations and contracts that prohibit sharing certain data across sites. So what I can't do is replicate data across sites and then just hide it from the user in the frontend. The private data has to stay at the facility it belongs to.
So my idea was to split each table we have now that contains possibly sensitive information into two tables:
say, table_contracts_private and table_contracts_public.
This would still seem pretty doable with basic database replication, replicating the public tables across sites. But how would you go about publishing private data? Also, how would I best combine private and public data? Just using a view?
I just could not find any good mechanisms for this, especially because we would also like to avoid data duplication; the private entries would need to be removed and replaced by the public ones, which would also entail changing all referencing IDs.
Is this a possible application of sharding?
I'd be really grateful if someone could point me in the right direction, or if someone has a demo project with similar requirements that I could check out.
Cheers
Is this a possible application of sharding?
I wouldn't think so. Sharding is a performance optimization method. What you need is to support legal constraints. Those are two very different problems.
I think you are on the right track. I call this a "walled garden" approach. You create a database with all non-PII information, using ids only. Nothing that directly identifies people: their addresses, phones, credit cards, and so on. This can be tricky. In some jurisdictions, combinations of demographics can be PII.
Some of those ids then refer to another database where you store all the sensitive information; this is the "walled garden". I would recommend that this second database be on a separate server. It has a very restricted access list. And this is where you implement requirements for things like "forgetting" a customer.
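A rough Go sketch of what that split might look like from the application side, assuming two separate connections and made-up table names (business data in the public/replicated database, person details in the walled garden):

package walledgarden

import "database/sql"

// Repos holds two connections: PublicDB contains only non-PII data and opaque
// ids, GardenDB is the separate, access-restricted server those ids refer to.
type Repos struct {
	PublicDB *sql.DB
	GardenDB *sql.DB
}

type Contract struct {
	ID        int64
	SiteCode  string // non-sensitive business data, safe to replicate
	PersonRef int64  // opaque id; only resolvable inside the walled garden
}

// Contracts reads from the public database only, so it can run at any site
// against the replicated copy without ever touching PII.
func (r *Repos) Contracts(siteCode string) ([]Contract, error) {
	rows, err := r.PublicDB.Query(
		`SELECT id, site_code, person_ref FROM contracts_public WHERE site_code = ?`,
		siteCode)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []Contract
	for rows.Next() {
		var c Contract
		if err := rows.Scan(&c.ID, &c.SiteCode, &c.PersonRef); err != nil {
			return nil, err
		}
		out = append(out, c)
	}
	return out, rows.Err()
}

// ContactEmail resolves a person reference to an email address. Only code
// with access to the garden database can do this; "forget me" requests are
// handled by deleting the row here, leaving the opaque id in place.
func (r *Repos) ContactEmail(personRef int64) (string, error) {
	var email string
	err := r.GardenDB.QueryRow(
		`SELECT email FROM person_private WHERE person_id = ?`, personRef).Scan(&email)
	return email, err
}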
In any case, the point is that sharding is not the right approach. You want an application redesign with privacy and security as the top priorities. Happily, this is not actually that hard to implement, although if the databases are changing, you may need periodic auditing. For instance, in one database I worked with, we discovered that "coupon codes" sometimes contained unencrypted email addresses. Arrgggh!

How can blockchains be used in audit trails?

I'm currently trying to figure out how to use blockchain in audit trails and potentially in accounting (and if they actually make sense). Both Deloitte and EY mention them.
I somehow cannot understand how this could be of benefit for audits and/or accounting.
To my understanding, to make use of the power of blockchains you need multiple users. Only one user means you cannot validate the integrity, since all blocks of that user could be compromised (if one block of a user's blockchain got changed, maybe all of the following blocks were changed too, making it impossible to detect the modification). This means blockchains only make sense if you can share them with different users?
Data, and thus blockchains, however aren't always shared between multiple users. In accounting you often only have one "user"/"owner" of the data. Sure, you could create multiple users in one company, but there wouldn't be any benefit since they are in one location (the company) and potentially all compromised. Or if the admin wants to change something, he could easily modify all users, making it useless for audits.
To make it work you would need different partners (supplier/customer) to share the information with. In that case, however, you might only have two users sharing the same blockchain (depending on legal regulations in your country), and then again, who do you trust if one of the two doesn't validate?
Deloitte mentions that they can be used for files. Again, I don't see the benefit, since you would need multiple users AND files might get compressed with a different algorithm over time, rendering them invalid (the useful information didn't change but the block will still be invalid). Or is this not an issue in your experience? To me it seems it could be a problem.
The same goes for all the internal data which may be important for audits, from my point of view. Which company would like to share that information with independent users? Or is it only intended for "public"/"shared" data?
To identify a modification of one block in a blockchain, the user would have to validate every single block (every hash in the header of a block needs to be compared to the data of the previous block). In terms of accounting, a blockchain could be all transactions of one account during one fiscal year. This however could easily be thousands of transactions. Wouldn't this be very slow to validate?
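To make that validation step concrete, here is a minimal Go sketch of walking a hash chain (a simplified stand-in for real block headers): each block carries the hash of its predecessor, and verification just recomputes and compares those hashes.

package auditchain

import (
	"bytes"
	"crypto/sha256"
)

// Block is a simplified stand-in for a block header: a payload plus the hash
// of the previous block.
type Block struct {
	Data     []byte
	PrevHash [32]byte
}

// hashBlock hashes the previous-hash field together with the payload, so any
// change to an earlier block changes every hash after it.
func hashBlock(b Block) [32]byte {
	return sha256.Sum256(append(b.PrevHash[:], b.Data...))
}

// Verify walks the chain and checks that each block's PrevHash matches the
// recomputed hash of the block before it.
func Verify(chain []Block) bool {
	for i := 1; i < len(chain); i++ {
		want := hashBlock(chain[i-1])
		if !bytes.Equal(want[:], chain[i].PrevHash[:]) {
			return false
		}
	}
	return true
}

Recomputing SHA-256 over a few thousand small records like this typically takes only milliseconds on current hardware, so the hash comparison itself is rarely the bottleneck; the trust questions raised above are the harder part.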
Maybe I'm misunderstanding the point in terms of audit trails but as long as the users are not independent data can always be modified making it useless for audits. And you need a critical mass to share the blockchain with.
First of all, I think it's necessary to understand the power of blockchain. It gives us the chance to create decentralized databases, i.e. databases that are not controlled by an authority. Also, the data in a blockchain is immutable and permanent, i.e. it cannot be modified or deleted. Thanks to this you achieve a unique decentralized registry in a distributed network, for example for audit trails.
It's true that it makes no sense if you use it only inside your company. But what if you use it among different companies? Each one could encrypt its data, so the rest of the companies couldn't see it. However, all the data would be stored at all the companies, so no one could change it. Moreover, you can have more than one user (node) for each company.
Nowadays there are many implementations of blockchain, each one with a different objective. To better understand the power of blockchain, I suggest you watch the video where the new version (v1.0) of Hyperledger Fabric is explained.

Data modeling Issue for Moqui custom application

We are working on a custom project management application on top of the Moqui framework. Our requirement is that we need to inform the developers associated with a project of any changes in a ticket, through email.
Currently we are using the WorkEffortParty entity to store all parties associated with the project, and then the PartyContactMech entity to store their email addresses. We need to iterate through WorkEffortParty and PartyContactMech every time to fetch all the email addresses to which we need to send emails for changes in tickets.
To avoid these iterations, we are now thinking of adding a feature for comma-separated email addresses at the project level. The project admin can add email addresses of associated parties, or a mailing list address, to which he needs to send email notifications for ticket changes.
For this requirement, we studied the data model but didn't find the right place to store this information. Do we need to extend an entity for this, or is there a best practice for this? This requirement is very useful in any project management application. We appreciate any help on this data modeling problem.
The best practice is to use existing data model elements as they are available. Having a normalized data model involves more work in querying data, but also more flexibility in addressing a wide variety of requirements without changes to the data structures.
In this case with a joined query you can get a list of email addresses in a single query based on the project's workEffortId. If you are dealing with massive data and message volumes there are better solutions than denormalizing source data, but I doubt that's the case... unless you're dealing with more than thousands of projects and millions of messages per day the basic query and iterate approach will work just fine.
If you need to go beyond that, the easiest approach with Moqui is to use a DataDocument and DataFeed to send updates on the fly to ElasticSearch, and then use it for your high-volume queries and filtering (with arbitrarily complex filtering requirements, etc.).
Your question is way too open to answer directly, data modeling is a complex topic and without good understanding of context and intended usage there are no good answers. In general it's best to start with a data model based on decades of experience and used in a large number of production systems. The Mantle UDM is one such model.

Handling paging with changing sort orders

I'm creating a RESTful web service (in Golang) which pulls a set of rows from the database and returns it to a client (smartphone app or web application). The service needs to be able to provide paging. The only problem is this data is sorted on a regularly changing "computed" column (for example, the number of "thumbs up" or "thumbs down" a piece of content on a website has), so rows can jump around page numbers in between a client's request.
I've looked at a few PostgreSQL features that I could potentially use to help me solve this problem, but nothing really seems to be a very good solution.
Materialized Views: to hold "stale" data which is only updated every once in a while. This doesn't really solve the problem, as the data would still jump around if the user happens to be paging through the data when the Materialized View is updated.
Cursors: created for each client session and held between requests. This seems like it would be a nightmare if there are a lot of concurrent sessions at once (which there will be).
Does anybody have any suggestions on how to handle this, either on the client side or database side? Is there anything I can really do, or is an issue such as this normally just remedied by the clients consuming the data?
Edit: I should mention that the smartphone app is allowing users to view more pieces of data through "infinite scrolling", so it keeps track of its own list of data client-side.
This is a problem without a perfectly satisfactory solution because you're trying to combine essentially incompatible requirements:
Send only the required amount of data to the client on-demand, i.e. you can't download the whole dataset then paginate it client-side.
Minimise amount of per-client state that the server must keep track of, for scalability with large numbers of clients.
Maintain different state for each client
This is a "pick any two" kind of situation. You have to compromise; accept that you can't keep each client's pagination state exactly right, accept that you have to download a big data set to the client, or accept that you have to use a huge amount of server resources to maintain client state.
There are variations within those that mix the various compromises, but that's what it all boils down to.
For example, some people will send the client some extra data, enough to satisfy most client requirements. If the client exceeds that, then it gets broken pagination.
Some systems will cache client state for a short period (with short-lived unlogged tables, tempfiles, or whatever), but expire it quickly, so if the client isn't constantly asking for fresh data it gets broken pagination.
Etc.
See also:
How to provide an API client with 1,000,000 database results?
Using "Cursors" for paging in PostgreSQL
Iterate over large external postgres db, manipulate rows, write output to rails postgres db
offset/limit performance optimization
If PostgreSQL count(*) is always slow how to paginate complex queries?
How to return sample row from database one by one
I'd probably implement a hybrid solution of some form, like:
Using a cursor, read and immediately send the first part of the data to the client.
Immediately fetch enough extra data from the cursor to satisfy 99% of clients' requirements. Store it to a fast, unsafe cache like memcached, Redis, BigMemory, EHCache, whatever under a key that'll let me retrieve it for later requests by the same client. Then close the cursor to free the DB resources.
Expire the cache on a least-recently-used basis, so if the client doesn't keep reading fast enough they have to go get a fresh set of data from the DB, and the pagination changes.
If the client wants more results than the vast majority of its peers, pagination will change at some point as you switch to reading direct from the DB rather than the cache or generate a new bigger cached dataset.
That way most clients won't notice pagination issues and you don't have to send vast amounts of data to most clients, but you won't melt your DB server either. However, you need a big boofy cache to get away with this. Whether it's practical depends on whether your clients can cope with pagination breaking - if it's simply not acceptable to break pagination, then you're stuck with doing it DB-side with cursors, temp tables, copying the whole result set at first request, etc. It also depends on the data set size and how much data each client usually requires.
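A hedged Go sketch of that hybrid flow, using a TTL-expired in-memory map as a stand-in for memcached/Redis (names and sizes are made up; a real version would also cap the cache size and evict least-recently-used entries):

package paging

import (
	"sync"
	"time"
)

const (
	pageSize      = 50
	prefetchPages = 20              // extra pages fetched up front to satisfy most clients
	cacheTTL      = 2 * time.Minute // short expiry: slow clients refetch and see the new order
)

type row = map[string]any // placeholder for whatever one result row is

type cachedResult struct {
	rows    []row
	expires time.Time
}

// Pager serves pages from a per-client snapshot so the ordering stays stable
// while the client keeps scrolling, without holding a DB cursor open.
type Pager struct {
	mu    sync.Mutex
	cache map[string]cachedResult        // keyed by client/session id
	fetch func(limit int) ([]row, error) // runs the sorted query once per snapshot
}

func NewPager(fetch func(limit int) ([]row, error)) *Pager {
	return &Pager{cache: make(map[string]cachedResult), fetch: fetch}
}

func (p *Pager) Page(clientID string, page int) ([]row, error) {
	p.mu.Lock()
	defer p.mu.Unlock()

	c, ok := p.cache[clientID]
	if !ok || time.Now().After(c.expires) {
		rows, err := p.fetch((prefetchPages + 1) * pageSize) // one pass, then release DB resources
		if err != nil {
			return nil, err
		}
		c = cachedResult{rows: rows, expires: time.Now().Add(cacheTTL)}
		p.cache[clientID] = c
	}

	start := page * pageSize
	if start >= len(c.rows) {
		// Past the prefetch window: a real implementation would fall back to
		// querying the DB directly, accepting that the ordering may have shifted.
		return nil, nil
	}
	end := start + pageSize
	if end > len(c.rows) {
		end = len(c.rows)
	}
	return c.rows[start:end], nil
}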
I am not aware of a perfect solution to this problem. But if you want the user to have a stale view of the data, then a cursor is the way to go. The only tuning you can do is to store only the data for the first two pages in the cursor. Beyond that you fetch it again.

Sorting on the server or on the client?

I had a discussion with a colleague at work about SQL queries and sorting. He is of the opinion that you should let the server do any sorting before returning the rows to the client. I, on the other hand, think that the server is probably busy enough as it is, and it must be better for performance to let the client handle the sorting after it has fetched the rows.
Does anyone know which strategy is best for the overall performance of a multi-user system?
In general, you should let the database do the sorting; if it doesn't have the resources to handle this effectively, you need to upgrade your database server.
First off, the database may already have indexes on the fields you want so it may be trivial for it to retrieve data in sorted order. Secondly, the client can't sort the results until it has all of them; if the server sorts the results, you can process them one row at a time, already sorted. Lastly, the database is probably more powerful than the client machine and can probably perform the sorting more efficiently.
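A small Go sketch of the "already sorted, one row at a time" point (table and column names invented for illustration): with an index on the sort column, the database can stream rows in order and the client never has to buffer the full result set just to sort it.

package sortserver

import (
	"database/sql"
	"time"
)

// streamSortedOrders asks the database for rows already in order (cheap if
// orders(created_at) is indexed) and hands each row to the caller as it
// arrives, instead of collecting everything client-side and sorting there.
func streamSortedOrders(db *sql.DB, handle func(id int64, created time.Time) error) error {
	rows, err := db.Query(`SELECT id, created_at FROM orders ORDER BY created_at DESC`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var created time.Time
		if err := rows.Scan(&id, &created); err != nil {
			return err
		}
		if err := handle(id, created); err != nil {
			return err
		}
	}
	return rows.Err()
}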
It depends... Is there paging involved? What's the max size of the data set? Does the entire dataset need to be sorted the same way all the time, or according to user selection? Or (if paging is involved), do only the records on the single page on the client's screen need to be sorted (not normally acceptable), or does the entire dataset need to be sorted and page one of the newly sorted set redisplayed?
What's the distribution of client hardware compared to the processing requirements of this sort operation?
Bottom line: it's the overall user experience (measured against cost, of course) that should control your decision. In general, client machines are slower than servers, and may cause additional latency. ...
... But how often will clients request additional custom sort operations after initial page load? (client sort of data already on client is way faster than round trip...)
But sorting on the client always requires that the entire dataset be sent to the client on the initial load... That delays initial page display, which may require lazy loading, or AJAX, or other technical complexities to mitigate...
Sorting on the server, on the other hand, introduces additional scalability issues and may require that you add more boxes to the server farm to deal with the additional load... If you're doing the sorting in the DB and reach that threshold, it can get complicated. (To scale out the DB, you have to implement some read-only replication scheme, or some other solution that allows multiple servers, each doing processing, to share read-only data.)
I am in favor of Robert's answer, but I wanted to add a bit to it.
I also favor sorting the data in SQL Server. I have worked on many systems that have tried to do it on the client side, and in almost every case we have had to re-write the process to have it done inside SQL Server. Why is this, you might ask? Well, we have two primary reasons:
The amount of data being sorted
The need to implement proper paging due to #1
We deal with interfaces that show users very large sets of data, and leveraging the power of SQL Server to handle sorting and paging is by far better performing than doing it client side.
To put some numbers to this, we compared a SQL Server-side sort to a client-side sort in our environment, with no paging for either: the client-side sort took 28 seconds (using XML for sorting), while the server-side sort had a total load time of 3 seconds.
Generally I agree with the views expressed above that server-side sorting is usually the way to go. However, there are sometimes reasons to do client-side sorting:
The sort criteria are user-selectable or numerous. In this case, it may not be a good idea to go adding a shedload of indices to the table - especially if insert performance is a concern. If some sort criteria are rarely used, an index isn't necessarily worth it since inserts will outnumber selects.
The sort criteria can't be expressed in pure SQL [uncommon], or can't be indexed. It's not necessarily any quicker client-side, but it takes load off the server.
The important thing to remember is that while balancing the load between powerful clients and the server may be a good idea in theory, only the server can maintain an index which is updated on every insert. Whatever the client does, it's starting with a non-indexed unsorted set of data.
As usual, "It Depends" :)
If you have a stored procedure, for instance, that sends results to your presentation layer (whether a report, grid, etc.), it probably doesn't matter which method you go with.
What I typically run across, though, are views which have sorting (because they were used directly by a report, for instance) but are also used by other views or other procedures with their own sorting.
So as a general rule, I encourage others to do all sorting on the client-side and only on the server when there's reasonable justification for it.
If the sorting is just cosmetic and the client is getting the entire set of data I would tend to let the client handle it as it is about the presentation.
Also, say in a grid, you may have to implement the sorting in the client anyway as the user may change the ordering by clicking a column header (don't want to have to ask the server to retrieve all the information again)
Like any other performance-related question, the universal answer is... "It depends." However, I have developed a preference for sorting on the client. We write browser-based apps, and my definition of client is split between the web servers and the actual end-user client, the browser. I have two reasons for preferring sorting on the client to sorting in the DB.
First, there's the issue of the "right" place to do it from a design point of view. Most of the time the order of data isn't a business-rule thing but rather an end-user convenience thing, so I view it as a function of the presentation, and I don't like to push presentation issues into the database. There are exceptions, for example, where the current price for an item is the most recent one on file. If you're getting the price with something like:
SELECT TOP 1 price
FROM itemprice
WHERE ItemNumber = ?
AND effectivedate <= getdate()
ORDER BY effectivedate DESC
Then the order of the rows is very much a part of the business rule and obviously belongs in the database. However, if you're sorting on LastName when the user views customer by last name, and then again on FirstName when they click the FirstName column header, and again on State when they click that header then your sorting is a function of the presentation and belongs in the presentation layer.
The second reason I prefer sorting in the client layer is one of performance. Web servers scale horizontally; that is, if I overload my web server with users I can add another, and another, and another. I can have as many frontend servers as I need to handle the load and everything works just fine. But if I overload the database, I'm screwed. Databases scale vertically; you can throw more hardware at the problem, sure, but at some point that becomes cost-prohibitive, so I like to let the DB do the selection, which it has to do, and let the client do the sorting, which it can do quite simply.
I prefer custom sorting on the client, however I also suggest that most SQL statements should have some reasonable ORDER BY clause by default. It causes very little impact on the database, but without it you could wind up with problems later. Often times without ever realizing it, a developer or user will begin to rely on some initial default sort order. If an ORDER BY clause wasn't specified, the data is only in that order by chance. At some later date an index could change or the data might be re-organized and the users will complain because the initial order of the data might have changed out from under them.
Situations vary, and measuring performance is important.
Sometimes it's obvious - if you have a big dataset and you're interested in a small range of the sorted list (e.g. paging in a UI app) - sorting on the server saves the data transfer.
But often you have one DB and several clients, and the DB may be overloaded while the clients are idle. Sorting on the client isn't heavy, and in this situation it could help you scale.