Encapsulating data access logic - VB.NET

I have recently purchased the DoFactory framework in an attempt to learn more about design patterns. The product is good but it concentrates on transactional operations only e.g. a user updating a customer record or inserting a customer record.
I have an app that has scheduled tasks e.g. one thousand customers are created overnight via a web service. I am trying to understand the best way to approach this:
Option 1
Public Shared Sub InsertCustomerBatch(ByVal customers As List(Of Customer))
    For Each customer As Customer In customers
        'Connect to database
        'Insert customer
    Next
End Sub
Option 2
Public Sub InsertCustomer(ByVal list As List(Of typeCustomer))
    For Each item As typeCustomer In list
        Customer.Insert(item)
    Next
End Sub
Both options will obviously work; however, I believe that option 2 is "better" because it follows design principles, i.e. Customer.Insert is encapsulated in the Customer class.
However, after talking to a more senior developer earlier, he said to choose option 1, and I do not understand why. Is option 1 "better"?

I think one has to justify why a connection should be opened and closed for every row in a batch scenario (option 1). One advantage may be the implicit commit. However, frequent commits are not usually required in the batch processing of many LOB applications. A business decision may have to be taken about the sensitivity of the data, of course, but it generally makes sense to commit rows in groups of a reasonable size (bounded by the db log size).
One way is to divide a large batch into several smaller logical batches and commit each one separately. Another way is to use bulk copy to insert the rows into the db when appropriate (see, for example: bulk insert data into db). Also note that, by default, constraints on the table are not checked during a bulk copy operation unless CHECK_CONSTRAINTS is specified.
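As a rough sketch of both ideas, assuming SQL Server and invented names (a staging table dbo.CustomerStaging loaded from a feed file, then one set-based insert into the real table):

-- Bulk load in committed groups, with constraint checking switched on explicitly
BULK INSERT dbo.CustomerStaging
FROM 'C:\feeds\customers.csv'
WITH (FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      BATCHSIZE = 1000,        -- commit in groups of a reasonable size
      CHECK_CONSTRAINTS);      -- constraints are not checked by default

-- Then move the staged rows into the real table in one set-based statement
INSERT INTO dbo.Customer (CustomerId, CustomerName)
SELECT CustomerId, CustomerName
FROM dbo.CustomerStaging;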
Also, it may be good to check the connection timeout setting in case it has an effect on long transaction processing (not sure on that one). However, I guess in your case the defaults should work fine.
In conclusion, I would suggest you go with option 2, possibly with some modifications as suggested above if your case calls for a large number of rows.

Related

SQL code to avoid the unnecessary start of a calculation if it was started by some other process?

I have a stored procedure that calculates some facts (say usp_calculate). It fills a cache-like table. The part of the table determined by the arguments of the procedure must be recalculated every 20 minutes. Basically, usp_calculate returns early if the cached data is fresh, or it spends say a minute to calculate... and returns after that.
usp_calculate is shared by several outer procedures that need the data. How should I prevent starting the time-consuming part of the procedure if it was already started by some other process? How can I implement a kind of signaling and waiting for the result instead of starting the calculation again?
Context: I have a SQL stored procedure named say usp_products. It finally performs a SELECT that returns rows for a product code, a product name, and calculated information -- a special price for a customer and for the storage location. There are a lot of combinations (customers, price lists, other conditions) that make it impractical to precalculate the information in a separate process. It must be calculated on demand, for the specific combination.
The third-party database that is the source of the information is not designed for detecting changes. Anyway, the time condition (not older than 20 minutes) is considered good enough to treat the data as "fresh".
The building block for this is probably going to be application locks.
You can obtain an exclusive application lock using sp_getapplock. You can then determine if the data is "fresh enough" either using freshness information inherently contained in that data or a separate table that you use for tracking this.
At this point, if necessary, you refresh the data and update the freshness information.
Finally, you release the lock using sp_releaseapplock and let all of the other callers have their chance to acquire the lock and discover that the data is fresh.
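A minimal sketch of that sequence inside usp_calculate, assuming SQL Server and an invented freshness-tracking table dbo.cache_freshness (error handling omitted):

DECLARE @rc int;

-- Serialize the refresh: only one caller at a time gets past this point
EXEC @rc = sp_getapplock @Resource    = 'usp_calculate_refresh',
                         @LockMode    = 'Exclusive',
                         @LockOwner   = 'Session',
                         @LockTimeout = 120000;   -- wait up to two minutes

IF @rc >= 0   -- 0 = granted immediately, 1 = granted after waiting
BEGIN
    -- Someone else may have refreshed the data while we were waiting
    IF NOT EXISTS (SELECT 1 FROM dbo.cache_freshness
                   WHERE cache_key = 'some-combination'
                     AND refreshed_at > DATEADD(minute, -20, GETDATE()))
    BEGIN
        -- ...recalculate the cached rows here...
        UPDATE dbo.cache_freshness
           SET refreshed_at = GETDATE()
         WHERE cache_key = 'some-combination';
    END;

    EXEC sp_releaseapplock @Resource = 'usp_calculate_refresh',
                           @LockOwner = 'Session';
END;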

When to use internal tables?

So, I have read that using internal tables increases the performance of the program and that we should make as few operations on DB tables as possible. But I have started working on a project that does not use internal tables at all.
Some details:
It is a scanner that adds or removes products in/from a store. First the primary key is checked (to see if that type of product exists) and then the product is added or removed. We use ‘Insert Into’ and ‘Delete From’ to add/remove the products directly from the DB table.
I have not asked why they do not use internal tables because I do not have a better solution so far.
Here’s what I have so far: Insert all products in an internal table, place the deleted products in another internal table.
FORM update.
  " Add all new products in one statement
  MODIFY zop_db_table FROM TABLE gt_table.

  " Delete the removed products
  LOOP AT gt_deleted INTO gs_deleted.
    DELETE FROM zop_db_table WHERE index_nr = gs_deleted-index_nr.
  ENDLOOP.
ENDFORM.
But when can I perform this update?
I could add a ‘Save’ button to perform the update, but then there is the risk that the user forgets to save large amounts of data, or drops the scanner and shuts it down, or similar situations. So this is clearly not a good solution.
My final question is: Is there a (good) way to implement internal tables in a project like this?
Internal tables should be used for data processing, like lists or arrays in other languages (C#, Java, ...). From a performance and system-load perspective it is preferable to first load all the data you need into an internal table, then process that internal table instead of loading individual records from the database.
But that is mostly true for reporting, which is probably the most common type of custom ABAP program. You often see developers use SELECT ... ENDSELECT statements that in effect loop over a database table, transferring row after row to the report, one at a time. That is extremely slow compared to reading all records at once into an itab and then looping over the itab. More than once I've cut the execution time of a report down to a fraction just by eliminating roundtrips to the database.
If you have a good reason to read from the database or update records immediately, you should do so. If you can safely delay updates and deletes to a point in time where you can process all of them together, without risking inconsistencies, I'd consider that an improvement. But if there is a good reason (like consistency or data loss) to update immediately, do it.
Update: as @vwegert mentioned regarding the SELECT ... ENDSELECT statement, it doesn't actually create individual database queries for each row. The database interface of the application server optimizes the query, transferring rows in bulk to the application server. From there the records are transported to the ABAP report one by one (because the report only has a work area to store a single row), which has a significant performance impact, especially for queries with large result sets. A SELECT into an internal table can transport all rows directly to the ABAP report (as long as there is enough memory to hold them), since the internal table can hold all those records in the report.

Delete records from a database or simply hide them during Reads?

I'm wondering if someone can provide various rationales/solutions for knowing when to delete records from a database vs. simply hiding them during read operations via a field value, e.g., is_hidden=1.
My application is a social network/e-commerce web application. I tend to favor the is_hidden strategy but as one's site grows I can see this leading to a really badly performing site.
Here's my list. What items on the list am I missing? Is the list's prioritization good?
Delete:
rationale: Reduce table size/improve database performance
rationale: Useful if data is trivial to create
solution: SQL DELETE
is_hidden:
rationale: allow users/DBA to restore data/useful for sensitive & hard to CREATE data
rationale: can DELETE it later if necessary
solution: SQL SELECT ... WHERE is_hidden!=1
Thoughts?
The major reason you might want to do a soft delete is where an audit trail requires it. For example, we might have an invoice table along with a voided column, and we might normally just omit voided invoices. This preserves an audit trail, so we know which invoices were entered and which ones were voided.
There are many fields (particularly in finance) where soft deletes are preferred for this reason. Typically the number of deletes is small compared to the data set, and you don't want to really delete, because actually doing so might allow someone to cover for theft of money or real-world goods. The "deleted" data can then be shown for those queries which require it.
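A minimal sketch of that pattern (SQL Server syntax; the invoice table and the voided/voided_at columns are invented for the example):

-- "Delete" the invoice without losing the audit trail
UPDATE invoice
   SET voided = 1, voided_at = GETDATE()
 WHERE invoice_id = 42;

-- Normal queries simply omit the voided rows
SELECT invoice_id, customer_id, total
  FROM invoice
 WHERE voided = 0;

-- Auditors can still see what was voided and when
SELECT invoice_id, voided_at
  FROM invoice
 WHERE voided = 1;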
A good non-db example would be as follows: "When writing in your general journal or general ledger, write with a pen, and if you make an error that you spot right away, cross it out with a single line so that the original data is still legible, and write the correct values underneath. If you find out later, write in either an adjustment entry or a reversal and a new entry." In that case, your principal reason is to see what was changed and when, so that you can audit those changes if there is ever a question.
The people typically needing to see such information are likely to be financial or other auditors.
You've already said everything in your question:
DELETE will entirely delete the entry and
is_hidden=1 will hide it.
So: If there's the possibility that you will need the data in the future you should use the hiding method. If you are sure that the data will never ever be used again: Use delete.
Concerning performance:
You can use two tables:
1 for visible items
1 for the hidden ones
Or even three tables:
1 for visible
1 for hidden
1 as an archive, where you move all the hidden data that's older than 3 years or something
Or:
1 for visible and hidden ones (using the is_hidden flag)
1 as an archive for old entries
It's all up to you. But if you look at Facebook or Google: they will never, ever delete anything! Data == Money == Power ;)
As far as performance and ease of development go, it may be possible on your platform to use filtered indexes, indexed views, etc., which would mean that keeping the soft-deleted data around has little impact on your system.
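For instance, on SQL Server a filtered index along these lines (names invented for the example) keeps the hidden rows out of the index used on the hot read path:

-- Index only the visible rows, so hidden rows add no cost to these lookups
CREATE INDEX ix_post_visible
    ON dbo.post (created_at, author_id)
    WHERE is_hidden = 0;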

Applying business rules at the database level

I'm working on a project in which we will need to determine certain types of statuses for a large body of people, stored in a database. The business rules for determining these statuses are fairly complex and may change.
For example,
if a person is part of group X
and (if they have attribute O) has either attribute P or attribute Q,
or (if they don't have attribute O) has attribute P but not Q,
and don't have attribute R,
and aren't part of group Y (unless they also are part of group Z),
then status A is true.
Multiply by several dozen statuses and possibly hundreds of groups and attributes. The people, groups, and attributes are all in the database.
Though this will be consumed by a Java app, we also want to be able to run reports directly against the database, so it would be best if the set of computed statuses were available at the data level.
Our current design plan, then, is to have a table or view that consists of a set of boolean flags (hasStatusA? hasStatusB? hasStatusC?) for each person. This way, if I want to query for everyone who has status C, I don't have to know all of the rules for computing status C; I just check the flag.
(Note that, in real life, the flags will have more meaningful names: isEligibleForReview?, isPastDueForReview?, etc.).
So a) is this a reasonable approach, and b) if so, what's the best way to compute those flags?
Some options we're considering for computing flags:
Make the set of flags a view, and calculate the flag values from the underlying data in real time using SQL or PL/SQL (this is an Oracle DB). This way the values are always accurate, but performance may suffer, and the rules would have to be maintained by a developer.
Make the set of flags consist of static data, and use some type of rules engine to keep those flags up-to-date as the underlying data changes. This way the rules can be maintained more easily, but the flags could potentially be inaccurate at a given point in time. (If we go with this approach, is there a rules engine that can easily manipulate data within a database in this way?)
In a case like this I suggest applying Ward Cunningham's question: ask yourself "What's the simplest thing that could possibly work?".
In this case, the simplest thing might be to come up with a view that looks at the data as it exists and does the calculations and computations to produce all the fields you care about. Now load up your database and try it out. Is it fast enough? If so, good - you did the simplest possible thing and it worked out fine. If it's NOT fast enough, good - the first attempt didn't work, but you've got the rules mapped out in the view code. Now you can go on to try the next iteration of "the simplest thing" - perhaps you write a background task that watches for inserts and updates and then jumps in to recompute the flags. If that works, fine and dandy. If not, go to the next iteration... and so on.
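As a rough illustration of that first iteration - a view in the spirit of the status-A rule from the question, with invented table names (person, person_group, person_attribute) and only part of the rule encoded to keep the sketch short:

CREATE OR REPLACE VIEW person_status AS
SELECT p.person_id,
       CASE
         WHEN EXISTS (SELECT 1 FROM person_group g
                       WHERE g.person_id = p.person_id AND g.group_code = 'X')
          AND NOT EXISTS (SELECT 1 FROM person_attribute a
                           WHERE a.person_id = p.person_id AND a.attr_code = 'R')
          AND (EXISTS (SELECT 1 FROM person_group g
                        WHERE g.person_id = p.person_id AND g.group_code = 'Z')
               OR NOT EXISTS (SELECT 1 FROM person_group g
                               WHERE g.person_id = p.person_id AND g.group_code = 'Y'))
         THEN 1 ELSE 0
       END AS has_status_a
  FROM person p;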
Share and enjoy.
I would advise against making the statuses column names; rather, use a status ID and value, such as a customer status table with columns ID and Value.
I would have two methods for updating statuses. One is a stored procedure that either has all the logic or calls separate stored procs to figure out each status. You could make all this dynamic by having a function for each status evaluation, with the one stored proc calling each function. The second method would be to have whatever stored proc(s) update user info call a stored proc that updates all of the user's statuses based on the current data. These two methods give you both real-time updates for the data that changed and, if you add a new status, a way to call the method to update all statuses with the new logic.
Hopefully you have one point of updates to the user data, such as a user update stored proc, and you can put the status update stored proc call in that procedure. This would also save having to schedule a task every n seconds to update statuses.
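A rough PL/SQL sketch of that shape, with invented names (a person_status table keyed by person and status code, refreshed by one procedure that calls a hypothetical function per status):

CREATE OR REPLACE PROCEDURE refresh_statuses (p_person_id IN NUMBER) IS
BEGIN
  -- Throw away the person's old flags and re-derive them from current data
  DELETE FROM person_status WHERE person_id = p_person_id;

  INSERT INTO person_status (person_id, status_code)
  SELECT p_person_id, 'A' FROM dual WHERE has_status_a(p_person_id) = 1;

  INSERT INTO person_status (person_id, status_code)
  SELECT p_person_id, 'B' FROM dual WHERE has_status_b(p_person_id) = 1;
END refresh_statuses;
/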
An option I'd consider would be for each flag to be backed by a deterministic function that returns the up-to-date value given the relevant data.
The function might not perform well enough, however, if you're calling it for many rows at a time (e.g. for reporting). So, if you're on Oracle 11g, you can solve this by adding virtual columns (search for "virtual column") to the relevant tables based on the function. The Result Cache feature should improve the performance of the function as well.
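A hedged sketch of that idea (Oracle syntax; the function name, tables, and the fragment of the rule are all invented for the illustration):

CREATE OR REPLACE FUNCTION has_status_a (p_person_id IN NUMBER)
  RETURN NUMBER
  DETERMINISTIC
  RESULT_CACHE
IS
  l_in_group_x NUMBER;
BEGIN
  -- Only a fragment of the status-A rule, to keep the sketch short
  SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END
    INTO l_in_group_x
    FROM person_group g
   WHERE g.person_id = p_person_id
     AND g.group_code = 'X';

  RETURN l_in_group_x;
END has_status_a;
/

-- Called per row in a query, or (on 11g) used as the expression of a virtual column
SELECT p.person_id, has_status_a(p.person_id) AS has_status_a
  FROM person p;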

Design Question - Put hundreds of Yes/No switches in columns, rows, or other?

We are porting an old application that used a hierarchical database to a relational web app, and are trying to figure out the best way to port configuration switches (Y/N values).
Our old system had 256 distinct switches (per client) that were each stored as a bit in one of 8 32-bit data fields. Each client would typically have ~100 switches set. To read or set a switch, we'd use bitwise arithmetic using a #define value. For example:
if (a_switchbank4 & E_SHOW_SALARY_ON_CHECKS) //If true, print salary on check
We were debating which approach to use to store switches in our new relational (MS-SQL) database:
Put each switch in its own field
Pros: fast and easy read/write/access - 1 row per client
Cons: seems kludgey, need to change schema every time we add a switch
Create a row per switch per client
Pros: unlimited switches, no schema changes necessary w/ new switches
Cons: slightly more arduous to pull data, lose intellisense w/o extra work
Maintain bit fields
Pros: same code can be leveraged, smaller XML data transmissions between machines
Cons: doesn't make any sense to our developers, hard to debug, too easy to use wrong 'switch bank' field for comparison
I'm leaning towards #1 ... any thoughts?
It depends on a few factors such as:
How many switches are set for each client
How many switches are actually used
How often switches are added
If I had to guess (and I would be guessing) I'd say what you really want are tags. One table has clients, with a unique ID for each, another has tags (the tag name and a unique ID) and a third has client ID / tag ID pairs, to indicate which clients have which tags.
This differs from your solution #2 in that tags are only present for the clients where that switch is true. In other words, rather than storing a client ID, a switch ID, and a boolean you store just a client ID and a switch ID, but only for the clients with that switch set.
This takes up about one third the space over solution number two, but the real advantage is over solutions one and three: indexing. If you want to find out things like which clients have switches 7, 45, and 130 set but not 86 or 14, you can do them efficiently with a single index on a tag table, but there's no practical way to do them with the other solutions.
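A hedged sketch of that layout (all names invented), including the kind of set-based query it makes easy:

CREATE TABLE client (
    client_id INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

CREATE TABLE config_switch (
    switch_id INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

-- One row per (client, switch) pair, present only when the switch is ON
CREATE TABLE client_switch (
    client_id INT NOT NULL REFERENCES client (client_id),
    switch_id INT NOT NULL REFERENCES config_switch (switch_id),
    PRIMARY KEY (client_id, switch_id)
);

-- Clients with switches 7, 45 and 130 set, but not 86 or 14
SELECT cs.client_id
  FROM client_switch cs
 WHERE cs.switch_id IN (7, 45, 130)
 GROUP BY cs.client_id
HAVING COUNT(*) = 3
   AND NOT EXISTS (SELECT 1
                     FROM client_switch x
                    WHERE x.client_id = cs.client_id
                      AND x.switch_id IN (86, 14));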
You could think about using database views to give you the best of each solution.
For example store the data as one row per switch, but use a view that pivots the switches (rows) into columns where this is more convenient.
I would go with option #2, one row per flag.
However, I'd also consider a mix of #1 and #2. I don't know your app, but if some switches are related, you could group those into tables where you have multiple columns of switches. You could group them based on use or type. You could, and probably would, still have a generic table with one switch per row for the ones that don't fit into the groups.
Remember too that if you change the method, you may have a lot of application code to change that relies on the existing method of storing the data. Whether you should change the method may depend on exactly how hard it will be and how many hours it will take to change everything associated. I agree with Markus' solution, but you do need to consider exactly how hard the refactoring is going to be and whether your project can afford the time.
The refactoring book I've been reading would suggest that you maintain both for a set time period, with triggers to keep them in sync, while you start fixing all the references. Then on a set date you drop the original (and the triggers) from the database. This lets you use the new method going forward, but gives you the flexibility that nothing will break before you get it fixed, so you can roll out the change before all references are fixed. It requires discipline, however, as it is easy not to get rid of the legacy code and columns because everything is working and you are afraid to touch it. If you are in the midst of a total redesign where everything will be tested thoroughly and you have the time built into the project, then go ahead and change everything all at once.
I'd also lean toward option 1, but would also consider an option 4 in some scenarios.
4 - Store in a dictionary of name/value pairs and serialize it to the database.
I would recommend option 2. It's relatively straightforward to turn a list of tags/rows into a hash in the code, which makes it fairly easy to check variables. Having a table with 256+ columns seems like a nightmare.
One problem with option #2 is that having a crosstab query is a pain:
Client S1 S2 S3 S4 S5 ...
A X X
B X X X
But there are usually methods for doing that in a database-specific way.
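One database-specific way to do that in MS-SQL is PIVOT; a hedged sketch, reusing the invented client_switch table from the earlier sketch and listing a few switch IDs explicitly:

-- One column per chosen switch, one row per client that has any switch set
SELECT client_id,
       ISNULL([7], 0)   AS s7,
       ISNULL([45], 0)  AS s45,
       ISNULL([130], 0) AS s130
  FROM (SELECT client_id, switch_id, 1 AS is_set
          FROM client_switch) AS src
 PIVOT (MAX(is_set) FOR switch_id IN ([7], [45], [130])) AS p;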