Django performance: raw SQL SUM query vs. summing in a loop

I know that in general it's good practice to rely on your database for actions like getting the sum of records. However, here's a special situation:
I'm using raw SQL with Django.
I want to get a list of records first; let's say I want a list of books with their prices and their discount percentages. After that I want the sum of the prices and the sum of the discounts (price * discount for each book).
One way to do this is to use the SUM() function of SQL, but since I already have the records, why not just use a for loop to calculate that?
I know the database is supposed to be faster, but it takes some time for Django to connect to the database, and the number of books is not that large (around 5).
Can someone help me decide which solution is better?
Any help is appreciated, and thanks in advance.

You seem to be talking about making a separate database call to do the aggregation. Your intuition is right that the overhead of making that call would far outweigh any other performance differences.
But you don't need to make a separate call; you can do the aggregation in the same call that fetches the complete records. However, there's little point in doing that if you can just do the aggregation yourself in Python. It's possible that for large amounts of data there could be a performance advantage to letting the database do it, but on the other hand the database is more likely to be a performance bottleneck than your application server.
With five rows, performance won't be a factor either way. I would personally prefer to do it in Python, since it's a more powerful and flexible language than SQL, but it doesn't really matter.
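For illustration, here is a minimal sketch of doing the aggregation in the same query that fetches the rows, using window functions; the book table and its title, price, and discount columns are hypothetical, and this assumes a database backend that supports window functions:

SELECT title,
       price,
       discount,
       SUM(price)            OVER () AS total_price,
       SUM(price * discount) OVER () AS total_discount
FROM   book;

With only around five rows, though, summing the fetched values in a Python loop after the query is just as reasonable.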

Related

How to create random SQL queries based on a database?

For testing purposes I require a large amount of queries.
Creating these manually is not an option, so I am searching for a tool which will do this automatically.
Sadly, the only solution I found (sqlsmith) is limited to Postgres and SQLite.
Are there any similar tools for SQL Server?
"I do not know from what random place people will want to travel to a random other place, so instead, let's create roads for every possible combination of origin and destination".
That sounds kind of insane, doesn't it? The same applies to what you seem to be wanting to achieve. You basically are hoping to find a tool that generates random queries against your database so you can feed them to the tuning advisor, which will then suggest query optimization indexes for hypothetical queries.
If you want to performance tune your database, you should have a pretty good idea of the type of questions your users will be throwing at it, as well as the structure of your data. Typical questions that will help you get started would be things like:
What is the most common search my users would do against this table?
What criteria are they most likely to use?
Which columns are guaranteed or likely to contain unique data in every row?
Which columns will most likely have a low selectivity of data? (e.g., Male/Female)
Are you looking to generate random data for multiple tables? We generally use the Redgate data generator tool for that.
For SQL tuning purposes I would suggest:
https://www.brentozar.com/blitzindex/
http://www.nguyenlamminhdieu.com/zone/213/news/vi-VN/zone/213/news/351-database-engine-tuning-advisor-in-sql-server.aspx

Managing very large SQL queries

I'm looking for some ideas managing very large SQL queries in Oracle.
My employer is looking to build very wide reports (150-200 columns of data per report).
Each item is a sub-query or an element from a view. The data has to be real time, so DW-style batch processing is not an option. We also don't use any BI tools, just a Java app that generates Excel (it's a requirement to output data in Excel).
The query also contains unions as feeds from other systems.
The queries result in very large SQL (about 1,500 lines) that is very difficult to manage.
What strategies can I employ to make the work more manageable?
It is also not a performance problem. I was able to optimize the query to be very efficient; it's mostly the width of the query. Managing 200 columns is a challenge in itself.
I deal with queries this length daily, and here is some of what helps me out in maintaining them:
First, alias every single one of those columns. When you are building the query you may know where each one came from, but when it is time to make a change, it is really helpful to know exactly where each column came from. This applies to join conditions, group by and where conditions as well as the select columns.
Organize in easily understandable and testable chunks. I use temp tables to pull things that make sense together and so I can see the results before the final query while in test mode.
This brings me to test mode. If I have chunks of data, I design the proc with a test mode and then query individual temp tables when in test mode, so I can see where the data went wrong if there is a bug. Not sure how Oracle works but in SQL Server, I make this the last parameter and give it a default value, so that it doesn't need to be passed in by the application.
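To make the temp-table and test-mode ideas concrete, here is a minimal T-SQL sketch; the procedure, table, and column names are all hypothetical, and a real report would stage many more chunks:

CREATE PROCEDURE dbo.BuildWideReport
    @StartDate date,
    @EndDate   date,
    @TestMode  bit = 0   -- last parameter, defaulted, so the application never has to pass it
AS
BEGIN
    -- Stage one understandable, testable chunk into a temp table.
    SELECT o.OrderId    AS OrderId,
           o.OrderDate  AS OrderDate,
           c.CustomerId AS CustomerId
    INTO   #OrderChunk
    FROM   dbo.Orders o
    JOIN   dbo.Customers c ON c.CustomerId = o.CustomerId
    WHERE  o.OrderDate BETWEEN @StartDate AND @EndDate;

    -- In test mode, return the intermediate results so they can be inspected.
    IF @TestMode = 1
        SELECT * FROM #OrderChunk;

    -- The final, much wider query builds on the staged chunks.
    SELECT oc.OrderId, oc.OrderDate, oc.CustomerId
    FROM   #OrderChunk oc;
END;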
Consider logging the execution details and the values of passed in parameters and certainly log any error messages. This will help tremendously when you have to troubleshoot why this report that has functioned perfectly for six years doesn't work for this one user.
Put each column on a separate line and do the same for where clauses. At times you may have to troubleshoot by commenting out joins until you find the one that is causing the problem. It is easier if you can easily comment out the associated fields as well.
If you don't have a technical design document, then at least use comments to explain your thought process. You want to capture the whys, not the hows, in any comments. This stuff is hard to come back to later and understand, even when you wrote it. Give your future self some help.
In developing from scratch, I put the select list in and then comment all but the first item. Then I build the query only until I get that value - testing until I am sure what I got was correct. Then I add the next one and whatever joins or where conditions I might need to get it. Test again making sure it is right. (Oops why did that go from 1000 records to 20000 when I added that? Hmm maybe there is something I need to handle there or is that right?) By adding only one thing at a time, you will find an error in the logic much faster and be much more confident of your results. It will also take you less time than trying to build a massive query in one go.
Finally, there is no substitute for understanding your data. There are plenty of complex queries that work but do not give the correct answer. Know if you need an inner join or a left join. Know what where conditions you need to get the records you want. Know how to handle the records when you have a one-to-many relationship (this may require pushing back on the requirements): should you have three lines (one for each child record), should you put that data in a comma-delimited list, or should you pick only one of the many records and have one line using aggregation? If the last, what are the criteria for choosing the record you want to keep?
Without seeing the specifics of your problem, here are a couple of ideas that immediately come to mind:
If you are looking purely for management, I might suggest organizing your subqueries as a number of views and then referencing those views in your final query.
For performance, on the other hand, you may want to consider creating temp tables or even materialized views (views whose results are physically stored) to break up the heavier parts of your process.
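As a minimal sketch of the view approach, assuming hypothetical table, column, and view names, one of the sub-queries could be pulled out like this and then referenced from the final report query:

CREATE VIEW report_customer_totals AS
SELECT customer_id,
       SUM(order_total) AS lifetime_total
FROM   orders
GROUP  BY customer_id;

-- The final (wide) report query references the view instead of repeating the sub-query inline.
SELECT c.customer_id,
       c.customer_name,
       t.lifetime_total
FROM   customers c
JOIN   report_customer_totals t ON t.customer_id = c.customer_id;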
If your queries require an enormous amount of subquerying in order to gain usable data, you might need to rethink your database design and possibly create a number of datamarts to easily access reporting data. Think of these as mini-warehouses sans the multi-year trended data.
Finally, I know you said you don't use any BI tools but this problem certainly seems like one that might make sense by organizing your data into "cubes" or Business Object "universes". It might be worthwhile to at least entertain the cost of bringing on a BI tool vs. the programming hours to support the current setup.

Should I be concerned that ORMs, by default, return all columns?

In my limited experience in working with ORMs (so far LLBL Gen Pro and Entity Framework 4), I've noticed that inherently, queries return data for all columns. I know NHibernate is another popular ORM, and I'm not sure that this applies with it or not, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view
Use a stored procedure and create models and mappings on the result set returned
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First off: great question, and about time someone asked this! :-)
Yes, the fact an ORM typically returns all columns for a database table is something you need to take into consideration when designing your systems. But as you've mentioned - there are ways around this.
The main fact for me is to be aware that this is what happens - either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you load a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like:
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent table doesn't grab all those large blobs of data (see the sketch after this list)
maybe put groups of columns that are optional, that might only show up in certain situations, into separate tables and link them - again, just to keep the base tables lean'n'mean
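Here is a minimal sketch of that kind of split; the table and column names are hypothetical:

CREATE TABLE dbo.Document (
    DocumentId int            NOT NULL PRIMARY KEY,
    Title      nvarchar(200)  NOT NULL,
    CreatedAt  datetime2      NOT NULL
);

CREATE TABLE dbo.DocumentBody (
    DocumentId int            NOT NULL PRIMARY KEY
        REFERENCES dbo.Document (DocumentId),   -- 1:1 link to the base table
    Body       varbinary(max) NULL              -- the large BLOB lives here
);

-- Everyday queries (and the ORM's default SELECT) against dbo.Document never
-- touch the blob; dbo.DocumentBody is joined in only when actually needed.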
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not their strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
The question here covers your options fairly well. Basically you're limited to hand-crafting the HQL/SQL. It's something you want to do if you run into scalability problems, but if you do in my experience it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away though: analyse then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
For me this has only been an issue if the table has LOTS of columns (more than 30 or so) or if a column holds a lot of data, for example over 5,000 characters in a field.
The approach I have used is to just map another object to the existing table but with only the fields I need. So for a search that populates a table with 100 rows I would have a MyObjectLite, but when I click to view the details of that row I would call GetById and return a MyObject that has all the columns.
Another approach is to use custom SQL or stored procs, but I only think you should go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
You can limit the number of returned columns by using a Projection, Transformers.AliasToBean, and a DTO. Here is how it looks in the Criteria API:
.SetProjection(Projections.ProjectionList()
    .Add(Projections.Property("Id"), "Id")
    .Add(Projections.Property("PackageName"), "Caption"))
.SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)));
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.

How much calculation should be done by MySQL?

I am writing an application using MySQL and need the difference between two dates. Should MySQL do this, or should I actually let PHP handle it?
I also just need the sum of all the results I am getting back. Should I return them and add them up on the PHP side, or is there a way to add all the results together on the MySQL server side?
It depends somewhat on the application, but in general I'd push it to the PHP side, because normally you're building a web site for multiple concurrent accesses; why put the calculation into the database and potentially create a bottleneck?
I think that you have two separate cases here. In the case where you are returning two values and performing a calculation on them, then doing that on the front end probably makes the most sense as long as it's not a complex calculation that requires significant business logic. If it does involve complex or specialized business logic then you should have a central place for that logic, whether it is in a business layer or in the database, so that it is done consistently. If you're just finding the difference between two dates or something, then just do it on the front end.
In the second case, where you are summing values, that sounds like something that should probably be done in the database. Networks tend to be much more of a bottleneck than modern day databases on today's hardware. Save sending a bunch of rows over the network just to add them up if you can just do it in the database.
There's no good answer to that. I personally do as much as possible in SQL but mostly because I can, not because I should.
And yes, you can ask MySQL to calculate a sum:
SELECT id, SUM(price) FROM items GROUP BY id WITH ROLLUP
The last row, the one whose id is NULL, will contain the sum over all the rows.
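The date-difference part of the question can also be handled in MySQL. A minimal sketch, assuming a hypothetical events table with start_date and end_date columns:

SELECT id,
       DATEDIFF(end_date, start_date)            AS days_between,
       TIMESTAMPDIFF(HOUR, start_date, end_date) AS hours_between
FROM   events;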
If it's thousands of different numbers, I'd try to do it on the database side. What Charlie said is pretty much the usual approach. I often do calculations in the database and add them as an additional column in case I need to do server-side sorting, but this is obviously not your case.

use SUM() or caching

I don't have much experience with databases, so I don't know which is better for long-term performance, better practice, etc.
Here is my (hypothetical) case: imagine you have a database of customer information and history of purchase orders for each. You want to keep track of how much each customer is buying. I can think of two ways of calculating this:
1) Just do a SUM() every time it is needed. This is an easy solution, but the concern is that this database could be 20 years old, with tens of thousands of rows for each customer. As more customer purchases are added to the database, the SUM() operation will take longer to calculate.
2) Keep the sum cached in the customer information table, and every time a new purchase is made (updated, deleted, etc.), update this cache. That way, no matter how many purchase orders there are, there won't be an increase in the calculation time. The downside is that this is a less flexible solution (only the sum over all rows; what about a per-month sum? other intervals?), and the cached value could get out of sync with the actual total somehow (technically it shouldn't happen, but it might).
So what should I be doing here? I know I shouldn't be storing anything I can calculate from what's already in the database, but part of me doesn't like the fact that this type of calculation will get worse over time, and part of me sees some kind of elegance in option 2.
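To make the two options concrete, here is a minimal sketch against a hypothetical orders table (with customer_id and amount columns) and a customers table with a cached lifetime_total column:

-- Option 1: compute the total on demand.
SELECT SUM(amount) AS lifetime_total
FROM   orders
WHERE  customer_id = 42;

-- Option 2: maintain a cached total on the customer row with every purchase
-- (shown as an explicit UPDATE here; it could also be done in a trigger).
UPDATE customers
SET    lifetime_total = lifetime_total + 19.99
WHERE  customer_id = 42;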
In database terms, there is no elegance in option 2 - it would be considered a hack that you might use as a last resort if your database gets to be truly humongous - not likely to happen for a novice setting it up for the first time (but possible).
There would be a lot of work maintaining the totals; and you'd be forever dealing with the question: "Why doesn't the detail add up to the total?"
Go for option 1 until you've proven you can't, which in most cases will be a long time.
What you're describing in option #2 is a case of premature optimization. Using the SUM() of all purchases is going to work for a really long time (years). When (if) you start to see this function degrade, you can add indexes or a totals table to your database to speed things up. Don't complicate things when a simple solution exists.
Of course, the real solution is to try both solutions with 20 years of made-up data and see if there's any real difference. I suspect there isn't.
Almost always 1.
How often will you be querying for the total for 20 years' worth of history? If the answer is often, and the performance is poor, then you can think about optimisation or OLAP.
I suspect you're worrying about optimisation too early. This is the sort of thing databases are designed for - let them worry about caching.
I'll just add that another possibility is to create roll-up tables. For instance, when tracking page hits, it's not necessarily very useful to know that IP such-and-such accessed page1.php at 14:42:04 on 11/19/2008, but you might want to track daily stats for page1.php. In this case, at the end of each day you might run a procedure to sum up the hits for each page and create a record in a summary table, which can in turn be heavily indexed. Your reporting can then run against that table. In addition to speeding up reporting, it can also speed up writing the original records, as you don't need to worry about table locking or building indexes.
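A minimal, MySQL-flavored sketch of such a roll-up, assuming hypothetical page_hits and page_hits_daily tables:

CREATE TABLE page_hits_daily (
    hit_date  DATE         NOT NULL,
    page      VARCHAR(255) NOT NULL,
    hit_count INT          NOT NULL,
    PRIMARY KEY (hit_date, page)
);

-- Run once at the end of each day (e.g. from a scheduled job) to collapse the
-- raw detail rows into one row per page per day.
INSERT INTO page_hits_daily (hit_date, page, hit_count)
SELECT DATE(hit_time), page, COUNT(*)
FROM   page_hits
WHERE  hit_time >= CURRENT_DATE - INTERVAL 1 DAY
  AND  hit_time <  CURRENT_DATE
GROUP  BY DATE(hit_time), page;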
That said, good indexes can go a long way with reporting; and like others here have admonished, it's best to go with the easier, even if less optimal, solution until (if ever) it becomes a problem.
Kudos for thinking ahead, but the question that begs asking is: will your sales data stay in the transactional database for 20 years?
There comes a point where it's going to be a lot easier to move that data into a data warehouse, and just keep your frontline database current.
If this is a new project, be more concerned with making it work, and getting people to use it. Worry about scalability when you've crossed those bridges.
Use option 1. Later, if performance becomes poor, you can identify specific bottlenecks and resolve them with options like #2, or materialized views, or several other possibilities.