Lazy Loading on a Collection of Objects - vb.net

I have a SQL query that can bring back a large number of rows via a DataReader. At the moment I query the DB, transform the result set into a List(Of T), and data-bind the grid to the list.
This occasionally results in a timeout due to the size of the result set.
I currently have a three-tier setup whereby the UI acts on the list of objects in the business layer.
Can anyone suggest the best approach to implementing lazy loading in this scenario, or is there some other way of implementing this cleanly?
I am currently using Visual Studio 2005, .NET 2.0
EDIT: How would paging be used in this instance?

LINQ to SQL seems to make sense in your situation.
Otherwise, if for any reason you don't want to use LINQ to SQL (e.g. you are on .NET 2.0), consider writing an iterator that reads from the DataReader and converts each row to the appropriate object:
IEnumerable<MyObject> ReadDataReader() {
    // A DataReader advances with Read(); each row is converted and
    // yielded lazily, so the full result set is never held in memory.
    while (reader.Read())
        yield return FetchObject(reader);
}

Do you need to bring back all the data at once? You could consider paging.

Paging might be your best solution. If you are using SQL Server 2005 or later, there is a new feature you can use: ROW_NUMBER():
WITH MyThings AS
(
    SELECT ThingID, DateEntered,
           ROW_NUMBER() OVER (ORDER BY DateEntered) AS RowNumber
    FROM dbo.Things
)
SELECT *
FROM MyThings
WHERE RowNumber BETWEEN 50 AND 60;
There is an example by David Hayden which is very helpful in demonstrating the SQL.
This method decreases the number of records returned, reducing the overall load time. It does mean that you will have to do a bit more work to track where you are in the sequence of records, but it is worth the effort.
The standard paging technique requires everything to come back from the database and then be filtered in the middle tier or client tier (code-behind); this method reduces the records to a more manageable subset before they ever leave the database.
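To answer the EDIT about how paging would be used here: a hedged sketch of the same query parameterized by page number and page size (the parameter names are illustrative, and SQL Server 2005 requires DECLARE and SET as separate statements):

-- Sketch: page 5 with 10 rows per page (illustrative parameter names)
DECLARE @PageNumber int;
DECLARE @PageSize int;
SET @PageNumber = 5;
SET @PageSize = 10;

WITH MyThings AS
(
    SELECT ThingID, DateEntered,
           ROW_NUMBER() OVER (ORDER BY DateEntered) AS RowNumber
    FROM dbo.Things
)
SELECT *
FROM MyThings
WHERE RowNumber BETWEEN (@PageNumber - 1) * @PageSize + 1
                    AND @PageNumber * @PageSize;

The UI then only needs to track the current page number, and each round trip pulls back at most @PageSize rows.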

Related

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name    dob            age    info
"tom"   "11/27/2000"   "45"   "['one', 'two']"
And in the query, I'm currently doing the following
WITH table AS (
  SELECT
    "tom" AS name,
    "11/27/2000" AS dob,
    "45" AS age,
    "['one', 'two']" AS info
)
SELECT
  EXTRACT(YEAR FROM PARSE_DATE('%m/%d/%Y', dob)) AS birth_year,
  ANY_VALUE(PARSE_DATE('%m/%d/%Y', dob)) AS bod,
  ANY_VALUE(name) AS example_name,
  ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM table
GROUP BY EXTRACT(YEAR FROM PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic GROUP BY operation casting an item to a string vs. not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster).
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or in causing a large performance hit in various cases?
First of all, I don't really see any bigger show-stoppers than those you already know and listed.
Meantime,
though wouldn't really matter if using a query-builder ...
Based on the above excerpt, I wanted to touch on one aspect of this approach (storing everything as strings). While we are usually concerned about CASTing from string to a native type in order to apply the relevant functions and so on, I realized that building a complex, generic query with some sort of query builder sometimes requires the opposite: casting a native type to a string in order to apply a function like STRING_AGG, just as a quick example.
So, my thoughts are:
When a table is designed for direct user access with trivial or even complex queries, having native types is beneficial, both performance-wise and in being easier for users to understand.
Meantime, if you are developing your own query builder, and you design the table so that users will query it via that query builder with some generic logic implemented, having all fields as strings can be helpful in building the query builder itself.
So it is a balance: you can lose a little in performance, but you can win in being able to better implement a generic query builder. And that balance depends on the nature of your business, both from a data perspective and in terms of what kinds of queries you envision supporting.
Note: your question is quite broad and opinion-based (which, by the way, is not much respected on SO), so obviously my answer is totally my opinion, but one based on quite a bit of experience with BigQuery.
Are you OK with storing the string "33/02/2000" as a date in one row, "21st of December 2012" in another, and "22ое октября 2013" in yet another?
Are you OK with storing the string "45" as an age in one row and "young" in another?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
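A minimal illustration of the age comparison pitfall above, runnable in BigQuery as-is; string comparison is lexicographic, so '10' sorts before '9':

-- Returns string_compare = TRUE, numeric_compare = FALSE
SELECT '10' < '9' AS string_compare,
       10 < 9     AS numeric_compare;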
Does BigQuery have a notion of indexes?
If yes, then most likely those indexes become useless as soon as you start casting your strings to proper types, such as:
SELECT
  ...
WHERE
  age > 10 AND age < 30
vs
SELECT
  ...
WHERE
  SAFE_CAST(age AS INT64) > 10
  AND SAFE_CAST(age AS INT64) < 30
It is normal that with fewer columns/rows you don't feel the problems. You start to feel them when your data gets huge.
Major concerns:
Maintenance of the code: think of the future requirements you may receive. Every conversion for data manipulation adds extra complexity to your code. For example, if your customer asks to retrieve teenagers in the future, you'll need to convert the string to a date to get the age before you can do the manipulation (see the sketch after this list).
Data size: the data size has broader impacts that cannot be seen at the start. For example, if you have N parallel test teams which each require their own test systems, you'll need to allocate more disk space.
Read performance: when you have more bytes to read in huge tables, it costs considerable time. Telco operators, for example, typically have a couple of billion rows of data per month.
If your code complexity increases, you'll need to replicate the conversions in multiple places.
Even a single one of the above items should push one away from using strings for everything.
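To make the first concern concrete, here is a hedged sketch of that teenager query against an all-string table (the dataset, table, and column names are assumptions, and DATE_DIFF by YEAR is only an approximation of age):

-- Hypothetical: dob is stored as a string like '11/27/2000'
SELECT name
FROM my_dataset.people
WHERE DATE_DIFF(CURRENT_DATE(),
                SAFE.PARSE_DATE('%m/%d/%Y', dob),
                YEAR) BETWEEN 13 AND 19;  -- approximate age check

With a native DATE column, the SAFE.PARSE_DATE step (and its per-row cost) disappears.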
I would think the biggest issue would be other users of this table/data. For instance, if someone is trying to write reports with it and do calculations, charts, or date ranges, it could be a big headache to always have to cast or convert the data with whatever tool they are using. You, or someone, would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
With the approach you describe, you might face some storage and performance problems; you can find some guidance in the official documentation.
The main performance problem will come from the CAST operation: remember that the BigQuery engine will have to deal with a CAST operation for each value, per row.
In order to test the compute cost of this operation, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However, if we execute the same query adding the CAST operator:
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations consume time, which might cause problems when the operation scales up in size.
Also, remember that each time you want to use the properties of a proper data type, you will have to cast your value and pay for the compute operation required.
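If the table has to stay all-string, one hedged workaround is to materialize a typed copy once, so the CAST compute is paid a single time rather than in every query (the dataset and table names here are assumptions):

-- One-time materialization into native types (hypothetical names)
CREATE TABLE my_dataset.people_typed AS
SELECT
  name,
  SAFE.PARSE_DATE('%m/%d/%Y', dob) AS dob,
  SAFE_CAST(age AS INT64) AS age
FROM my_dataset.people;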
Finally, regarding storage performance: as you mentioned, strings do not have a fixed size, which might cause a size increase.

Limiting Amount of Rows in List View

Simple enough question: how would I be able to limit the number of rows in a ListView to the number of items/rows that actually contain information? I know how to count the rows with items by using this code:
ListView1.Items.Count
But how can I limit the number of rows the ListView has to the number of items?
Assuming a version of .NET that includes LINQ (3.5+), you get some really nice features which help a lot. These apply to any IEnumerable (and therefore any IList):
Dim MyList = [Some code to get hundreds of items]
Dim MyShortList = MyList.Take(30)
You can also implement paging very easily by using Skip...
Dim MyShortListPage2 = MyList.Skip(30).Take(30)
You should look into using the Entity Framework or an equivalent which implements IQueryable. These reduce memory overhead by using deferred execution, a.k.a. lazy loading.
In short, if I were to do the following using the EF:
Dim Users = DBContext.Set(Of User)
Users won't actually contain all the users in the database; instead it will contain the query to get all users. If I did Users.First, it would run the query against SQL to get the first user. If instead I did Users.Where(Function(x) x.Age = 30).First, it would only query SQL for the first user whose age is 30.
Thus, IQueryable lets you pare down a dataset quickly using the power of the underlying provider instead of doing it in-memory.
If, instead, I did
Dim Users = DBContext.Set(Of User).ToList()
It would retrieve all users from the database into memory; the ToList() is what forces this to happen. A List has to be stored in local memory; an IQueryable does not, as it can run the appropriate query at the last possible moment and fetch as little as possible to satisfy your request.
Whether you want this to happen or not depends on the use case.
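For intuition, the provider pushes the work to the server. The filtered query above would be translated into SQL along these lines (illustrative only; the exact SQL EF generates will differ):

-- Roughly what Users.Where(Function(x) x.Age = 30).First executes
SELECT TOP (1) *
FROM Users
WHERE Age = 30;

-- Whereas Set(Of User).ToList() pulls everything:
SELECT * FROM Users;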

Efficient way to compute accumulating value in sqlite3

I have an sqlite3 table that tells me when I gain/lose points in a game. Sample query and result:
SELECT time,p2 FROM events WHERE p1='barrycarter' AND action='points'
ORDER BY time;
1280622305|-22
1280625580|-9
1280627919|20
1280688964|21
1280694395|-11
1280698006|28
1280705461|-14
1280706788|-13
[etc]
I now want my running point total. Given that I start with 1000 points, here's one way to do it:
SELECT DISTINCT(time),
       (SELECT 1000 + SUM(p2)
        FROM events e
        WHERE p1 = 'barrycarter' AND action = 'points'
          AND e.time <= e2.time) AS points
FROM events e2
WHERE p1 = 'barrycarter' AND action = 'points'
ORDER BY time
but this is highly inefficient. What's a better way to write this?
MySQL has @variables, so you can do things like:
SELECT time, @tot := @tot + points ...
but I'm using sqlite3, and the above isn't ANSI-standard SQL anyway.
More info on the db if anyone needs it: http://ccgames.db.94y.info/
EDIT: Thanks for the answers! My dilemma: I let anyone run any single SELECT query on http://ccgames.db.94y.info/. I want to give them useful access to my data, but not to the point of allowing scripting or allowing multiple queries with state. So I need a single SQL query that can do accumulation. See also: Existing solution to share database data usefully but safely?
SQLite is meant to be a small embedded database. Given that definition, it is not unreasonable to find many limitations with it. The task at hand is not solvable using SQLite alone, or at least not without being terribly slow, as you have found. The query you have written is a triangular cross join that will not scale, or rather, will scale badly.
The most efficient way to tackle the problem is in the program that is making use of SQLite; e.g. if you were using Web SQL in HTML5, you could easily accumulate in JavaScript.
There is a discussion about this problem on the sqlite mailing list.
Your two options are:
Iterate through all the rows with a cursor and calculate the running sum on the client.
Store sums instead of, or as well as, points. (If you only store sums, you can recover the points by computing sum(n) - sum(n-1), which is fast.)
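For readers hitting this later: SQLite 3.25+ supports window functions (not available when this question was asked), which solve the problem in exactly the kind of single, stateless SELECT the edit asks for:

-- Running total via a window function (requires SQLite 3.25+)
SELECT time,
       1000 + SUM(p2) OVER (ORDER BY time) AS points
FROM events
WHERE p1 = 'barrycarter' AND action = 'points'
ORDER BY time;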

Performance bottleneck - Linq to SQL or the database - how do I tell?

I am currently trying to wring more performance out of my reporting website, which uses LINQ to SQL and a SQL Server Express 2008 database.
I am finding that, as I now approach a million rows in one of my more 'ugly' tables, performance is becoming a real issue, with one report in particular taking 3 minutes to generate.
Essentially, I have a loop that, for each user, hits the database and grabs a collection of data on them. This data is then queried in various ways (and more rows loaded as needed) until I have a nice little summary object that I can fire off to a set of Silverlight charts. Lazy loading is used, and the reporting pulls in data from around 8 linked tables.
The problem is that I don't know where the bottleneck now is or how to improve performance. Due to certain constraints I was forced to use uniqueidentifiers for a number of primary keys in the tables involved; could this be an issue?
Basically, I need to put time into increasing performance, but I don't have enough to do that for both the database and the LINQ to SQL layer. Is there any way I can see where the bottlenecks are?
As I'm running Express, I don't have access to the Profiler. I am considering rewriting my queries as compiled LINQ to SQL, but fear the database may be the culprit.
I understand this question is a bit open-ended, and it's hard to answer without knowing much more about my setup (database schema etc.), but any advice on how to find out where the bottlenecks are is much appreciated!
Thanks
UPDATE:
Thanks for all the great advice guys, and some links to some great tools.
UPDATE for those interested
I have been unable to make my queries any quicker by tweaking the LINQ. The problem seems to be that the majority of my database access code takes place in a loop, and I can't see a way around it. Basically, I am building up a report by looking through a number of users' data, hence the loop. Pulling all the records up front seems a bit crazy: 800,000+ rows. My gut feeling is that there is a much better way, but it's a technological leap too far for me!
However, adding another index to one of the foreign keys in one of the tables boosted performance, so the report now takes 20 seconds to generate as opposed to 3 minutes!
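For the record, the "much better way" usually means replacing the per-user loop with one set-based query that aggregates every user in a single round trip. A hedged sketch, with made-up table and column names:

-- Hypothetical set-based replacement for a query-per-user loop
SELECT u.UserId,
       COUNT(*)      AS ActivityCount,
       SUM(d.Amount) AS Total
FROM Users u
JOIN Details d ON d.UserId = u.UserId
GROUP BY u.UserId;

Whether this is feasible depends on how much of the per-user logic can be expressed as joins and aggregates.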
I used this excellent tool: Linq2Sql profiler. It works on the application side, so there is no need for database-server profiling functionality.
You have to add one line of initialization code to your application, and then a separate desktop application (the profiler) shows you the SQL for each LINQ query, with the exact line of code where it was executed (cs or aspx) and the database time and application time of execution. It even detects some common performance problems, like n+1 queries (a query executed per iteration) or unbounded result sets. You have to pay for it, but a trial version is also available.
As you're using SQL Express, which doesn't ship with the Profiler, there is a free third-party profiler you can download here. I've used it when running SQL Express, and it will allow you to trace what's going on in the database.
Also, you can query the Dynamic Management Views to see what the costly queries are, e.g. the TOP 10 queries that have taken the most time:
SELECT TOP 10 t.text, q.*, p.query_plan
FROM sys.dm_exec_query_stats q
CROSS APPLY sys.dm_exec_sql_text(q.sql_handle) t
CROSS APPLY sys.dm_exec_query_plan (q.plan_handle) AS p
ORDER BY q.total_worker_time DESC
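In the same vein, and relevant to the index that ultimately fixed this report: the missing-index DMVs (SQL Server 2005+) list indexes the optimizer would have liked to use. Treat the output as suggestions to evaluate, not as gospel:

-- Top missing-index suggestions, weighted by estimated impact
SELECT TOP 10
       d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats s
    ON s.group_handle = g.index_group_handle
ORDER BY s.avg_user_impact * s.user_seeks DESC;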
There are two tools I use for this: LINQPad and the Visual Studio debugger. First, check out LINQPad; even the free version is very powerful, showing you execution time and the SQL generated, and you can use it to run any code snippet... it's tremendously useful.
Second, you can use the Visual Studio debugger. This is something we use on our DataContext (note: only use this in debug builds; it's a performance hit and completely unnecessary outside of debugging):
#if DEBUG
private readonly Stopwatch Watch = new Stopwatch();

private static void Connection_StateChange(object sender, StateChangeEventArgs e)
{
    // Time the span between the connection opening and closing.
    if (e.OriginalState == ConnectionState.Closed && e.CurrentState == ConnectionState.Open)
    {
        Current.Watch.Start();
    }
    else if (e.OriginalState == ConnectionState.Open && e.CurrentState == ConnectionState.Closed)
    {
        Current.Watch.Stop();
        string msg = string.Format("SQL took {0}ms", Current.Watch.ElapsedMilliseconds);
        Trace.WriteLine(msg);
    }
}
#endif

private static DataContext New
{
    get
    {
        var dc = new DataContext(ConnectionString);
#if DEBUG
        // Only hook up the timing and logging when a debugger is attached.
        if (Debugger.IsAttached)
        {
            dc.Connection.StateChange += Connection_StateChange;
            dc.Log = new DebugWriter();
        }
#endif
        return dc;
    }
}
In a debug build, as an operation completes on each context, we see the elapsed time in the debug window along with the SQL it ran. The DebugWriter class you see can be found here (credit: Kris Vandermotten). We can quickly see if a query is taking a while. To use it, we just obtain a DataContext via:
var DB = DataContext.New;
(The profiler is not an option for me since we don't use SQL Server; this answer is simply to give you some alternatives that have been very useful for me.)

SQL server string manipulation in a view... Or in XSLT

I have been passed a piece of work that I can either do in my application or perhaps in SQL:
I have to get a date out of a string that may look like this:
1234567-DSP-01/01-VER-01/01
or like this:
1234567-VER-01/01-DSP-01/01
but may look like this:
00 12345 DISCH 01/01-VER-01/01 XXX X XXXXX
Yay. If it is a "DSP" then I want that date; if a "DISCH" then that date.
I am pulling the data out in a SQL Server view and would be happy to have the view transform the data for me. My application could do it, but that would add processor time. I could also see if the data could be manipulated before it is entered into the DB, I suppose.
Thank you for your time.
An option would be to check for the presence of DSP or DISCH and then substring out the date as necessary.
For example (I don't have SQL Server today, so I can't verify the syntax, sorry):
select
    -- beg/end are placeholders; note that T-SQL SUBSTRING takes (value, start, length)
    date = case
               when charindex('DSP', date_attribute) > 0 then substring(date_attribute, beg, end)
               when charindex('DISCH', date_attribute) > 0 then substring(date_attribute, beg, end)
               else 'unknown'
           end
from myTable
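A concrete, hedged version against the three sample formats in the question, assuming the date always sits one separator character after the marker and is exactly five characters ('01/01'); verify against real data before relying on it:

-- 'DSP' is 3 chars + 1 separator; 'DISCH' is 5 chars + 1 separator
SELECT CASE
           WHEN CHARINDEX('DSP', date_attribute) > 0
               THEN SUBSTRING(date_attribute, CHARINDEX('DSP', date_attribute) + 4, 5)
           WHEN CHARINDEX('DISCH', date_attribute) > 0
               THEN SUBSTRING(date_attribute, CHARINDEX('DISCH', date_attribute) + 6, 5)
           ELSE NULL
       END AS extracted_date
FROM myTable;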
Don't store multiple items in the same column!
Store the date in its own column when inserting the row!
Add a new nullable column for the date.
Write an update that pulls the date out and sets the new column.
Alter the column to be not nullable.
Fix your save routine to pull the date out and insert it for you.
If you do it in the view, you're adding processing time on SQL, which is in general a more expensive resource than an app, web, or some other type of client.
I'd recommend you try to format the data when you insert it, or handle it in the application tier. Scaling an app tier horizontally is so much easier than scaling your SQL.
Edit
I mean that the database server's physical resources are usually more expensive than a properly designed application server's physical resources. This is because it is very easy to scale an application horizontally, while it is, in my opinion, an order of magnitude more expensive to scale a DB server horizontally, especially if you're dealing with a transactional database and need to manage merging.
I am not saying it is impossible, just that scaling a database server horizontally is a much more difficult task; hence it's more expensive. The only reason I pointed this out is that the OP raised a concern about using CPU cycles on the app server vs the database server. Most applications I have worked with have been data-centric applications which processed through GBs of data to get the user an answer. We initially put everything on the database server because that was easier than doing it in classic ASP and VB6 at the time. Over time the DB server became more and more loaded, until scaling vertically was no longer an option.
Database servers are also designed for retrieving and joining data. You should leave the formatting of the data to the application and business rules (in general, of course).