I am trying to improve the performance of a report in my system. The application is written in .NET Core 3.0; I am using EF Core as the ORM and PostgreSQL as the database. The report returns thousands of records, which are presented to the user in a view. The results are paginated and ordered by a selected column (like start_time or agent_name).
The result is produced by a heavy query (one execution takes about 10 seconds). To implement pagination, we need both the results for one page and the total count. I can see two approaches to this problem, and both have disadvantages.
In the current solution I download the full report, then sort it and slice out one page in memory. The advantage is that the data is fetched from the database only once. The disadvantage is that we load thousands of records when we actually need only one page (50 records).
The other approach I can see is to slice the records in the database (using the LIMIT and OFFSET operators). I would fetch only one page of data, but I wouldn't have the count of all records, so I would need a second query with the same parameters that returns the total count.
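Roughly, approach 2 would look like this in SQL (simplified; report_data and start_time stand in for my real query and sort column):

-- one page (50 rows), ordered by the selected column
SELECT *
FROM report_data
ORDER BY start_time
LIMIT 50 OFFSET 200;

-- second round trip: total count with the same filters
SELECT COUNT(*)
FROM report_data;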
What do you think about this problem? Which approach is better in your opinion? Or is there some other technique that fits this case best?
Related
I have a system built in ASP.NET WebForms with an accounts records generation form. In some situations I need to fetch close to 1 million records.
One solution could be to reduce the number of records fetched, but when we need records spanning one to five years, that still means half a million to a million records. How can I decrease the time this takes?
What are some points I could apply to reduce the time? I can't show the full query here; it's a big view that calls other views.
Would it take less time if I wrote it as a LINQ query? That's why I asked about LINQ vs. views.
I executed a "SELECT * FROM TableName" query and after 40 minutes it is still executing; the table has 117,000 records. Can we decrease this time?
I started this as a comment but ran out of room.
Use the server to do as much filtering for you as possible and return as few rows as possible. Client-side filtering is always going to be much slower than server-side filtering; for example, the client does not have access to the indexes and optimisation techniques that exist on the server.
LINQ uses "lazy evaluation", which means it builds up the filtering query but does not execute it until it is forced to. I've used it and was initially impressed with the speed ... until I started to access the data it returned. Accessing the data you want from LINQ is what triggers the actual selection process, which you'll find is slow.
Use the server to return a series of small result sets and then process those. If you need to join these result sets on a key, load them into dictionaries keyed on that value so you can join them quickly.
Another approach is to look at Entity Framework to create a mirror of the server database structure along with indexes so that the subset of data you retrieve can be joined quickly.
I'm re-designing a front-end for SQL Server in Access, to allow non-programmers in our company to query the database.
The problem is that some of the tables are very large. At the moment I'm using linked tables. One query that I'm trying to allow accesses five tables, including one very large table that has millions of rows: every transaction ever made in the company.
When I tried the query in Access it took minutes and would not finish; Access just froze. So instead I decided to use a subquery to narrow down the large table before doing the joins. Every entry in the table has a date, so I made a subquery and filtered it to return only the current day, just to test. In fact, because I was just testing, I filtered it further to return only the date column. This narrows it down to 80,000 entries or so. Eventually I did get results, but it took around three minutes, and that's just the subquery I'm testing. Once results DID return, Access would freeze every time I attempted to use the scroll bar.
Next I tried pass-through queries, thinking they'd be faster. They were faster, but still took around a minute and a half, and still had the freezing problems with the scroll bar. The issue is that this same query (the date query, I mean) takes only 3 seconds in SQL Server. I was hoping I could get this query very fast and then use it for the join.
I could use views, but the problem is that I want the user to be able to specify the date range.
Is there anything I can do to speed up this performance or am I screwed?
It makes no sense to let the users scroll through tens of thousands of records. They will be lost in the data flood. Instead, provide them means to analyze the data. First answer the question: "What kind of information do the users need?" They might want to know how many transactions of a certain type have occurred during the day or within an hour. They might want to compare different days. Let the users group the data; this reduces the number of records that have to be transmitted and displayed. Show them counts, sums or averages. Let them filter the data or present them the grouped data in charts.
I'm working on a dashboard page that does a lot of analytics to display both graphical and tabular data to users.
When the dashboard is filtered by a given year, I have to display analytics for the selected year, another year chosen for comparison, and historical averages from all time.
For the selected and comparison years, I create start/end DateTime objects that are set to the beginning_of_year and end_of_year.
year = Model.where("closed_at >= ?", start_date).where("closed_at <= ?", end_date).all
comp = Model.where("closed_at >= ?", comp_start).where("closed_at <= ?", comp_end).all
These queries are essentially the same, just different date filters. I don't really see any way to optimize this besides trying to only "select(...)" the fields I need, which will probably be all of them.
Since there will be an average of 250-1000 records in a given year, they aren't "horrible" (in my not-very-skilled opinion).
However, the historical averages are causing me a lot of pain. In order to adequately show the averages, I have to query ALL the records for all time and perform calculations on them. This is a bad idea, but I don't know how to get around it.
all_for_average = Model.all
Surely people have run into these kinds of problems before and have some means of optimizing them? Returning somewhere in the ballpark of 2,000 - 50,000 records for historical average analysis can't be very efficient. However, I don't see another way to perform the analysis unless I first retrieve the records.
Option 1: Grab everything and filter using Ruby
Since I'm already grabbing everything via Model.all, I "could" drop the two per-year queries by simply pulling the desired records out of that full set instead. But this seems wrong... I'm literally "downloading" my DB (so to speak) and then querying it with Ruby code instead of SQL, which seems very inefficient. Has anyone tried this before and seen any performance gains?
Option 2: Using multiple SQL DB calls to get select information
This would mean instead of grabbing all records for a given time period, I would make several DB queries to get the "answers" from the DB instead of analyzing the data in Ruby.
Instead of running something like this,
year = Model.where("closed_at >= ?", start_date).where("closed_at <= ?", end_date).all
I would perform multiple queries:
year_total_count = Model.where(DATE RANGE).size
year_amount_sum = Model.where(DATE RANGE).sum("amount")
year_count_per_month = Model.where(DATE RANGE).group("date_trunc('month', closed_at)").count
...other queries to extract selected info...
Again, this seems very inefficient, but I'm not knowledgeable enough about SQL and Ruby code efficiencies to know which would lead to obvious downsides.
I "can" code both routes and then compare them with each other, but it will take a few days to code/run them since there's a lot of information on the dashboard page I'm leaving out. Certainly these situations have been run into multiple times for dashboard/analytics pages; is there a general principle for these types of situations?
I'm using PostgreSQL on Rails 4. I've been looking into DB-specific solutions as well, as being "database agnostic" really is irrelevant for most applications.
Dan, I would look into using a materialized view (MV) for the all-time historical average. This would definitely fall under the "DB-specific" solutions category, as MVs are implemented differently in different databases (or sometimes not at all). The basics are covered in the PostgreSQL documentation.
A materialized view is essentially a physical table, except its data is based on a query of other tables. In this case, you could create an MV based on a query that averages the historical data. That query is run only when the view is created or refreshed, not every time the view is read. The dashboard could then do a simple read query on this MV instead of running the costly query against the underlying table.
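A minimal sketch of what that could look like in PostgreSQL, assuming the underlying table is called models with closed_at and amount columns (adjust the names and the aggregation to your schema):

-- rough sketch; table and column names are assumptions based on the question
CREATE MATERIALIZED VIEW historical_averages AS
SELECT date_trunc('month', closed_at) AS month,
       COUNT(*)    AS record_count,
       AVG(amount) AS avg_amount
FROM models
GROUP BY 1;

-- re-run the underlying query only when the data has changed enough to matter
REFRESH MATERIALIZED VIEW historical_averages;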
After discussing the issue with other more experienced DBAs and developers, I decided I was trying to optimize a problem that didn't need any optimization yet.
For my particular use case, I would have a few hundred users a day running these queries anywhere from 5-20 times each, so I wasn't really having major performance issues (i.e., I'm not a Google or Amazon servicing billions of requests a day).
I am actually just having the PostgreSQL DB execute the queries each time and I haven't noticed any major performance issues for my users; the page loads very quickly and the queries/graphs have no noticeable delay.
For others trying to solve similar issues, I recommend running it for a while in a staging environment to see if you really have a problem that needs solving in the first place.
If I hit performance hiccups, my first step will be specifically indexing the data that I query on, and my second step will be creating DB views that "pre-load" the queries more efficiently than querying them over live data each time.
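For the indexing step, I mean something as simple as this (closed_at being the column from the queries above; the table name is an assumption):

CREATE INDEX index_models_on_closed_at ON models (closed_at);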
Thanks to the incredible advances in DB speed and technology, however, I don't have to worry about this problem.
I'm answering my own question so others can spend time resolving more profitable questions.
I'm currently working on a project where the client has handed me a database that includes a table with over 200 columns and 3 million rows of data. This is definitely poorly designed, and I'm currently exploring some options. I developed the app on my 2012 MBP with 16 GB of RAM and a 512 GB SSD. I had to build the app using MVC 4, so I set up the development and test environment using Parallels 8 on OS X. As part of the design, I built an interface for the client to create custom queries against this large table, so I send a query string to the controller, which is passed using Dynamic LINQ, and the results are sent to the view as JSON (to populate a Kendo UI grid). On my MBP, when testing queries through the interface I created, it takes at most 10 seconds (which I find too long) to return the results to the Kendo UI grid. Similarly, when I test queries directly in SQL Server, it never takes very long.
However, when I deployed this to the client for testing, the same queries take in excess of 3 minutes. So, long story short, the client will be upgrading the server hardware, but in the meantime they still need to test the app.
My question is about the design: despite the fact that the table holds 200 columns, each row is unique. More specifically, the current design is:
PK (GUID) | OrganizationID (FK) | 200 columns (tax fields)
If I redesign this to:
PK (GUID) | OrganizationID (FK) | FieldID (FK) | Input
Field table:
FieldID | FieldName
This would turn the 3-million-row table into 600 million rows, but with only 3 columns. Will I see performance improvements?
Any insight would be appreciated - I understand normalization but most of my experience is in programming.
Thanks in advance!
It is very hard to make any judgements without knowing the queries that you are running on the table.
Here are some considerations:
Be sure that the queries are using indexes if they are returning only a handful of rows.
Check that you have enough memory to store the table in memory.
When doing timings, be sure to ignore the first run, because this is just loading the page cache.
For testing purposes, just reduce the size of the table. That should speed things up.
As for your question about normalization: your denormalized structure takes up much less disk space than a normalized one, because you do not need to repeat the keys for each value. If you are looking for one value in one row, normalization will not help you. You will still need to scan an index to find the row and then load the row. And the row will be on one page, regardless of whether it is normalized or denormalized. In fact, normalization might be worse, because the index will be much larger.
There are some examples of queries where normalizing the data will help. But, in general, you already have a more efficient data structure if you are fetching the data by rows.
You can take a paging approach. There will be two queries: the initial one returns all rows, but only the unique-ID column. This list can be split into pages, say 100 IDs per page. When the user selects a specific page, you pass those 100 IDs to the second query, which this time returns all 200 columns, but only for the requested 100 rows. This way you don't have to return all the columns across all the rows at once, which should yield a significant performance boost.
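A rough sketch of the two queries (table, key, and filter names are placeholders):

-- query 1: only the unique IDs, in the order the grid will display them
SELECT Id
FROM BigTaxTable
WHERE /* user-built filter */ 1 = 1
ORDER BY Id;

-- query 2: all 200 columns, but only for the 100 IDs of the selected page
SELECT *
FROM BigTaxTable
WHERE Id IN (/* the 100 IDs for the requested page */);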
We currently have a search on our website that allows users to enter a date range. The page calls a stored procedure that queries for the date range and returns the appropriate data. However, a lot of our tables contain 30m to 60m rows. If a user entered a date range of a year (or some large range), the database would grind to a halt.
Is there any solution that doesn't involve putting a time constraint on the search? Paging is already implemented to show only the first 500 rows, but the database is still getting hit hard. We can't put a hard limit on the number of results returned because the user "may" need all of them.
If the user-entered date range is too large, have your application do the search in small date-range steps, possibly using a slow-start approach: the first search is limited to, say, a one-month range, and if it brings back fewer than 500 rows, search the two preceding months, and so on, until you have 500 rows.
You will want to start with the most recent dates for descending order and with the oldest dates for ascending order.
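A single step of that loop might look roughly like this (T-SQL, with placeholder names); the application keeps widening @from until 500 rows have been accumulated:

-- newest range first, matching a descending search
SELECT TOP (500) *
FROM transactions
WHERE transaction_date >= @from AND transaction_date < @to
ORDER BY transaction_date DESC;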
It sounds to me like this is a design problem, not a technical one. No one ever needs millions of records of data on the fly.
You're going to have to ask yourself some hard questions: Is there another way of getting people their data than the web? Is there a better way you can ask for filtering? What exactly is it that the users need this information for and is there a way you can provide that level of reporting instead of spewing everything?
Reevaluate what it is that the users want and need.
"We can't put a hard limit on the number of results returned because the user "may" need all of them."
You seem to be saying that you can't prevent the user from requesting large datasets for business reasons. I can't see any technical way around that.
Index your date field and force a query to use that index:
CREATE INDEX ix_mytable_mydate ON mytable (mydate)
SELECT TOP 100 *
FROM mytable WITH (INDEX (ix_mytable_mydate))
WHERE mydate BETWEEN @start AND @end
It seems that the optimizer chooses a full table scan when it sees the large range.
Could you post the query you use and its execution plan?
I don't know which of these are possible:
Use a search engine rather than a database?
Don't allow very general searches
Cache the results of popular searches
Break the database into shards on separate servers and combine the results in your application.
Do multiple queries with smaller date ranges internally
It sounds like you really aren't paging. I would have the stored procedure take a row range (which you calculate) for the page and then fetch only those rows for the current page, as in the sketch below. Assuming the data doesn't change frequently, this would reduce the load on the database server.
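A sketch of what that row-range paging could look like inside the procedure (T-SQL with placeholder names; @startRow and @endRow are the range calculated by the application):

SELECT *
FROM (
    SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.transaction_date DESC) AS rn
    FROM transactions t
    WHERE t.transaction_date BETWEEN @start AND @end
) AS paged
WHERE paged.rn BETWEEN @startRow AND @endRow;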
How is your table data physically structured, i.e. partitioned, split across filegroups and disk storage, etc.?
Are you using table partitioning? If not, you should look into using aligned partitioning. You could partition your data by date, say one partition per year.
Were I to request a query spanning three years, on a multiprocessor system I could access all three partitions concurrently, thereby improving query performance.
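As a rough illustration of date-aligned partitioning on SQL Server (names and boundary dates are placeholders):

-- one boundary per year; RANGE RIGHT puts each boundary date into the later partition
CREATE PARTITION FUNCTION pf_by_year (datetime)
    AS RANGE RIGHT FOR VALUES ('2010-01-01', '2011-01-01', '2012-01-01');

CREATE PARTITION SCHEME ps_by_year
    AS PARTITION pf_by_year ALL TO ([PRIMARY]);

-- the table and its indexes are then created ON ps_by_year(transaction_date)
-- so that they are aligned and partitions can be scanned in parallel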
How are you implementing the paging?
I remember I faced a problem like this a few years back and the issue was to do with how I implemented the paging. However the data that I was dealing with was not as big as yours.
Parallelize, and put it in RAM (or a cloud). You'll find that once you want to access large amounts of data at the same time, an RDBMS becomes the problem instead of the solution. Nobody doing visualizations uses an RDBMS.