groovy sql eachRow and rows method

I am new to grails and groovy.
Can anyone please explain to me the difference between these two groovy sql methods
sql.eachRow
sql.rows
Also, which is more efficient?
I am working on an application that retrieves data from the database (the result set is very large) and writes it to a CSV file or returns JSON.
I was wondering which of the two methods mentioned above I should use to get the job done faster and more efficiently.

Can anyone please explain to me the difference between these two groovy sql methods sql.eachRow sql.rows
It's difficult to tell exactly which two methods you're referring to because there are a large number of overloaded versions of each method. However, in all cases, eachRow returns nothing
void eachRow(String sql, Closure closure)
whereas rows returns a list of rows
List rows(String sql)
So if you use eachRow, the closure passed in as the second parameter should handle each row, e.g.
sql.eachRow("select * from PERSON where lastname = 'murphy'") { row ->
println "$row.firstname"
}
whereas if you use rows the rows are returned, and therefore should be handled by the caller, e.g.
rows("select * from PERSON where lastname = 'murphy'").each {row ->
println "$row.firstname"
}
Also, which is more efficient?
This question is almost unanswerable. Even if I had implemented these methods myself there's no way of knowing which one will perform better for you because I don't know
what hardware you're using
what JVM you're targeting
what version of Groovy you're using
what parameters you'll be passing
whether this method is a bottleneck for your application's performance
or any of the other factors that influence a method's performance that cannot be determined from the source code alone. The only way you can get a useful answer to the question of which method is more efficient for you is by measuring the performance of each.
Despite everything I've said above, I would be amazed if the performance difference between these two was in any way significant, so if I were you, I would choose whichever one you find more convenient. If you find later on that this method is a performance bottleneck, try using the other one instead (but I'll bet you a dollar to a dime it makes no difference).

If we set aside minor syntax differences, there is one difference that seems important. Let's consider
sql.rows("select * from my_table").each { row -> doIt(row) }
vs
sql.eachRow("select * from my_table") { row -> doIt(row) }
The first one opens a connection, retrieves the results, closes the connection, and returns them. You can then iterate over the results after the connection has been released. The drawback is that you now have the entire result list in memory, which in some cases might be a lot.
eachRow, on the other hand, opens a connection and, while keeping it open, executes your closure for each row. If your closure operates on the database and requires another connection, your code will consume two connections from the pool at the same time. The connection used by eachRow is released only after it has iterated through all the resulting rows. Also, if you don't perform any database operations but the closure takes a while to execute, you will be blocking one database connection until eachRow completes.
I am not 100% sure, but eachRow possibly allows you to avoid keeping all the resulting rows in memory by accessing them through a cursor - this may depend on the database driver.
If you don't perform any database operations inside your closure, the closure executes quickly, and the result list is big enough to impact memory, then I'd go for eachRow. If you do perform DB operations inside the closure, or each closure call takes significant time while the result list is manageable, then go for rows.
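To make the memory point concrete: Groovy's Sql class sits on top of plain JDBC, so the two access patterns can be sketched in JDBC terms. This is only a rough illustration (with a hypothetical DataSource and a driver that honours setFetchSize), not how groovy.sql.Sql is implemented line for line.

import java.sql.*;
import java.util.*;
import javax.sql.DataSource;

public class RowsVsEachRowSketch {

    // rows(...)-style: materialise everything, release the connection, then let the caller iterate.
    static List<Map<String, Object>> loadAll(DataSource ds, String query) throws SQLException {
        List<Map<String, Object>> all = new ArrayList<>();
        try (Connection c = ds.getConnection();
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery(query)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.put(md.getColumnLabel(i), rs.getObject(i));
                }
                all.add(row);                 // the whole result set ends up in memory
            }
        }                                     // connection released before the caller iterates
        return all;
    }

    // eachRow(...)-style: keep the connection open and hand each row to a callback as it arrives.
    static void streamAll(DataSource ds, String query,
                          java.util.function.Consumer<ResultSet> perRow) throws SQLException {
        try (Connection c = ds.getConnection();
             Statement st = c.createStatement()) {
            st.setFetchSize(500);             // hint to fetch in batches; driver support varies
            try (ResultSet rs = st.executeQuery(query)) {
                while (rs.next()) {
                    perRow.accept(rs);        // only the current batch needs to fit in memory
                }
            }
        }                                     // connection is held until the last row is processed
    }
}

Whether the streaming version really avoids buffering the whole result set depends on the driver (PostgreSQL, for instance, only streams when autocommit is off), which matches the "not 100% sure" caveat above.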

They differ in signature only - both support result set paging, so both will be efficient. Use whichever fits your code.

Related

What's more optimal: query chaining parent & child or selecting from parent's child objects

Curious which of these is better performance-wise. If you have a User with many PlanDates, and you know your user is the user with an id of 60 in a variable current_user, is it better to do:
plan_dates = current_user.plan_dates.select { |pd| pd.attribute == test }
OR
plan_dates = PlanDate.joins(:user).where("plan_dates.attribute" => test).where("users.id" => 60)
Asking because I keep reading about the dangers of using select since it builds the entire object...
select is discouraged because, unlike the ActiveRelation methods where, joins, etc., it's a Ruby method from Enumerable which means that the entire user.plan_dates relation must be loaded into memory before the selection can begin. That may not make a difference at a small scale, but if an average user has 3,000 plan dates, then you're in trouble!
So, your second option, which uses just one SQL query to get the result, is the better choice. However, you can also write it like so:
user.plan_dates.where(attribute: test)
This is still just one SQL query, but leverages the power of ActiveRelation for a more expressive result.
The second. The select has to compare objects at the code level, while the second is just a query.
In addition, the second expression may not actually be executed unless you use the variable, while the first will always be executed.

How to speed up table-retrieval with MATLAB and JDBC?

I am accessing a PostgreSQL 8.4 database via JDBC, called from MATLAB.
The tables I am interested in basically consist of various columns of different data types. They are selected through their timestamps.
Since I want to retrieve large amounts of data, I am looking for a way to make the request faster than it is right now.
What I am doing at the moment is the following:
First I establish a connection to the database and call it DBConn. The next step is to prepare a SELECT statement and execute it:
QUERYSTRING = ['SELECT * FROM ' TABLENAME ...
    ' WHERE ts BETWEEN ''' TIMESTART ''' AND ''' TIMEEND ''''];
QUERY = DBConn.prepareStatement(QUERYSTRING);
RESULTSET = QUERY.executeQuery();
Then I store the column types in the variable COLTYPE (1 for FLOAT, -1 for BOOLEAN and 0 for the rest - nearly all columns contain FLOAT). The next step is to process every row, column by column, and retrieve the data with the corresponding methods. FNAMES contains the field names of the table.
m = 0; % Variable containing rownumber
while RESULTSET.next()
    m = m + 1;
    for n = 1:length(FNAMES)
        if COLTYPE(n) == 1        % Columntype is a FLOAT
            DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getDouble(n);
        elseif COLTYPE(n) == -1   % Columntype is a BOOLEAN
            DATA{1}.(FNAMES{n})(m,1) = RESULTSET.getBoolean(n);
        else
            DATA{1}.(FNAMES{n}){m,1} = char(RESULTSET.getString(n));
        end
    end
end
When I am done with my request I close the statement and the connection.
I don't have the MATLAB Database Toolbox, so I am looking for solutions without it.
I understand that it is very inefficient to request the data of every single field individually. Still, I have failed to find a way to get more data at once - for example, multiple rows of the same column. Is there any way to do so? Do you have other suggestions for speeding up the request?
Summary
To speed this up, push the loops, and then your column datatype conversion, down into the Java layer, using the Database Toolbox or custom Java code. The Matlab-to-Java method call overhead is probably what's killing you, and there's no way of doing block fetches (multiple rows in one call) with plain JDBC. Make sure the knobs on the JDBC driver you're using are set appropriately. And then optimize the transfer of expensive column data types like strings and dates.
(NB: I haven't done this with Postgres, but have with other DBMSes, and this will apply to Postgres too because most of it is about the JDBC and Matlab layers above it.)
Details
Push loops down to Java to get block fetching
The most straightforward way to get this faster is to push the loops over the rows and columns down into the Java layer, and have it return blocks of data (e.g. 100 or 1000 rows at a time) to the Matlab layer. There is substantial per-call overhead in invoking a Java method from Matlab, and looping over JDBC calls in M-code is going to incur it on every call (see Is MATLAB OOP slow or am I doing something wrong? - full disclosure: that's my answer). If you're calling JDBC from M-code like that, you're incurring that overhead on every single column of every row, and that's probably the majority of your execution time right now.
The JDBC API itself does not support "block cursors" like ODBC does, so you need to get that loop down into the Java layer. Using the Database Toolbox like Oleg suggests is one way to do it, since they implement their lower-level cursor stuff in Java. (Probably for precisely this reason.) But if you can't have a Database Toolbox dependency, you can just write your own thin Java layer to do so, and call that from your M-code (probably through a Matlab class that is coupled to your custom Java code and knows how to interact with it). Make the Java code and Matlab code share a block size, buffer up the whole block on the Java side (using primitive arrays instead of object arrays for column buffers wherever possible), and have your M-code fetch the result set in batches, buffering those blocks in cell arrays of primitive column arrays, and then concatenate them together.
Pseudocode for the Matlab layer:
colBufs = repmat({{}}, [1 nCols]);
while (cursor.hasMore())
    cursor.fetchBlock();
    for iCol = 1:nCols
        colBufs{iCol}{end+1} = cursor.getBlock(iCol); % should come back as primitive
    end
end
for iCol = 1:nCols
    colResults{iCol} = cat(2, colBufs{iCol}{:});
end
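A rough sketch of what the corresponding thin Java layer might look like, assuming all-numeric columns for simplicity; BlockCursor, fetchBlock and getBlock are invented names for illustration, not an existing API:

import java.sql.*;

// Illustrative block-fetching wrapper: it buffers rows on the Java side so Matlab pays
// the Java call overhead once per block rather than once per cell.
public class BlockCursor {
    private final ResultSet rs;
    private final int blockSize;
    private final double[][] block;   // one primitive buffer per column (numeric columns assumed)
    private int rowsInBlock;
    private boolean more = true;

    public BlockCursor(ResultSet rs, int blockSize) throws SQLException {
        this.rs = rs;
        this.blockSize = blockSize;
        this.block = new double[rs.getMetaData().getColumnCount()][blockSize];
    }

    public boolean hasMore() {
        return more;
    }

    // Pull up to blockSize rows from the JDBC cursor into the primitive buffers.
    public void fetchBlock() throws SQLException {
        rowsInBlock = 0;
        while (rowsInBlock < blockSize && (more = rs.next())) {
            for (int c = 0; c < block.length; c++) {
                block[c][rowsInBlock] = rs.getDouble(c + 1);
            }
            rowsInBlock++;
        }
    }

    // Return one column of the current block as a primitive array,
    // which arrives in Matlab as a plain double vector.
    public double[] getBlock(int col) {
        double[] out = new double[rowsInBlock];
        System.arraycopy(block[col - 1], 0, out, 0, rowsInBlock);
        return out;
    }
}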
Twiddle JDBC DBMS driver knobs
Make sure your code exposes the DBMS-specific JDBC connection parameters to your M-code layer, and use them. Read the doco for your specific DBMS and fiddle with them appropriately. For example, Oracle's JDBC driver defaults to setting the default fetch buffer size (the one inside their JDBC driver, not the one you're building) to about 10 rows, which is way too small for typical data analysis set sizes. (It incurs a network round trip to the db every time the buffer fills.) Simply setting it to 1,000 or 10,000 rows is like turning on the "Go Fast" switch that had shipped set to "off". Benchmark your speed with sample data sets and graph the results to pick appropriate settings.
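As a hedged example of what "exposing the knobs" can look like, the property name below is the PostgreSQL JDBC driver's; Oracle and other drivers spell theirs differently, so treat these values as placeholders to check against your driver's documentation:

import java.sql.*;
import java.util.Properties;

public class FetchSizeKnobs {
    public static void main(String[] args) throws SQLException {
        // Driver-specific connection property (PostgreSQL calls it defaultRowFetchSize;
        // other drivers use different names - check the documentation for yours).
        Properties props = new Properties();
        props.setProperty("user", "me");
        props.setProperty("password", "secret");
        props.setProperty("defaultRowFetchSize", "10000");

        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", props);
             Statement st = conn.createStatement()) {
            // The standard per-statement hint; whether and how it is honoured is up to the driver.
            st.setFetchSize(10000);
            try (ResultSet rs = st.executeQuery("SELECT * FROM my_table")) {
                while (rs.next()) {
                    // ... buffer rows into blocks as described above ...
                }
            }
        }
    }
}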
Optimize column datatype transfer
In addition to giving you block fetch functionality, writing custom Java code opens up the possibility of doing optimized type conversion for particular column types. After you've got the per-row and per-cell Java call overhead handled, your bottlenecks are probably going to be in date parsing and passing strings back from Java to Matlab. Push the date parsing down into Java by having it convert SQL date types to Matlab datenums (as Java doubles, with a column type indicator) as they're being buffered, maybe using a cache to avoid recalculation of repeated dates in the same set. (Watch out for TimeZone issues. Consider Joda-Time.) Convert any BigDecimals to double on the Java side.
Cellstrs are a big bottleneck - a single char column could swamp the cost of several float columns. Return narrow CHAR columns as 2-D chars instead of cellstrs if you can (by returning a big Java char[] and then using reshape()), converting to cellstr on the Matlab side if necessary. (Returning a Java String[] converts to cellstr less efficiently.)
You can also optimize the retrieval of low-cardinality character columns by passing them back as "symbols": on the Java side, build up a list of the unique string values and map them to numeric codes, and return the strings as a primitive array of numeric codes along with that map of number -> string; convert the distinct strings to cellstr on the Matlab side and then use indexing to expand it to the full array. This will be faster and save you a lot of memory, too, since the copy-on-write optimization will reuse the same primitive char data for repeated string values. Or convert them to categorical or ordinal objects instead of cellstrs, if appropriate. This symbol optimization could be a big win if you use a lot of character data and have large result sets, because then your string columns transfer at about primitive numeric speed, which is substantially faster, and it reduces cellstr's typical memory fragmentation. (Database Toolbox may support some of this stuff now, too. I haven't actually used it in a couple of years.)
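A minimal sketch of the "symbol" encoding described above, assuming you own the Java layer (the class and method names are made up for illustration):

import java.util.*;

// Illustrative only: encode a low-cardinality string column as numeric codes plus a lookup
// table, so the bulk transfer to Matlab is a primitive int[] rather than a String[]/cellstr.
public class SymbolColumn {
    private final Map<String, Integer> codes = new HashMap<>();
    private final List<String> symbols = new ArrayList<>();
    private final List<Integer> data = new ArrayList<>();

    public void add(String value) {
        Integer code = codes.get(value);
        if (code == null) {
            code = symbols.size();
            codes.put(value, code);
            symbols.add(value);            // remember each distinct string once
        }
        data.add(code);
    }

    // Primitive codes transfer to Matlab at roughly numeric speed.
    public int[] getCodes() {
        int[] out = new int[data.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = data.get(i);
        }
        return out;
    }

    // Small list of distinct values; Matlab converts this to a cellstr once and indexes into it.
    public String[] getSymbols() {
        return symbols.toArray(new String[0]);
    }
}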
After that, depending on your DBMS, you could squeeze out a bit more speed by including mappings for all the numeric column type variants your DBMS supports to appropriate numeric types in Matlab, and experimenting with using them in your schema or doing conversions inside your SQL query. For example, Oracle's BINARY_DOUBLE can be a bit faster than their normal NUMERIC on a full trip through a db/Matlab stack like this. YMMV.
You could consider optimizing your schema for this use case by replacing string and date columns with cheaper numeric identifiers, possibly as foreign keys to separate lookup tables to resolve them to the original strings and dates. Lookups could be cached client-side with enough schema knowledge.
If you want to go crazy, you can use multithreading at the Java level to have it asynchronously prefetch and parse the next block of results on separate Java worker thread(s), possibly parallelizing per-column date and string processing if you have a large cursor block size, while you're doing the M-code level processing for the previous block. This really bumps up the difficulty, though, and is probably only a small performance win because you've already pushed the expensive data processing down into the Java layer. Save this for last. And check the JDBC driver doco; it may already effectively be doing this for you.
Miscellaneous
If you're not willing to write custom Java code, you can still get some speedup by changing the syntax of the Java method calls from obj.method(...) to method(obj, ...). E.g. getDouble(RESULTSET, n). It's just a weird Matlab OOP quirk. But this won't be much of a win because you're still paying for the Java/Matlab data conversion on each call.
Also, consider changing your code so you can use ? placeholders and bound parameters in your SQL queries, instead of interpolating strings as SQL literals. If you're doing a custom Java layer, defining your own @connection and @preparedstatement M-code classes is a decent way to do this. So it looks like this:
QUERYSTRING = ['SELECT * FROM ' TABLENAME ' WHERE ts BETWEEN ? AND ?'];
query = conn.prepare(QUERYSTRING);
rslt = query.exec(startTime, endTime);
This will give you better type safety and more readable code, and may also cut down on the server-side overhead of query parsing. This won't give you much speed-up in a scenario with just a few clients, but it'll make coding easier.
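On the Java side, those hypothetical conn.prepare / query.exec wrappers would boil down to an ordinary PreparedStatement with bound parameters, roughly like this (note that the table name itself cannot be bound, only the values):

import java.sql.*;

// Sketch of what the M-code prepare/exec wrappers would delegate to on the Java side.
public class BoundParamSketch {
    public static ResultSet queryRange(Connection conn, String table,
                                       Timestamp start, Timestamp end) throws SQLException {
        // Only the values are bound; the table name still has to be concatenated in.
        PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM " + table + " WHERE ts BETWEEN ? AND ?");
        ps.setTimestamp(1, start);
        ps.setTimestamp(2, end);
        return ps.executeQuery();
    }
}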
Profile and test your code regularly (at both the M-code and Java level) to make sure your bottlenecks are where you think they are, and to see if there are parameters that need to be adjusted based on your data set size, both in terms of row counts and column counts and types. I also like to build in some instrumentation and logging at both the Matlab and Java layer so you can easily get performance measurements (e.g. have it summarize how much time it spent parsing different column types, how much in the Java layer and how much in the Matlab layer, and how much waiting on the server's responses (probably not much due to pipelining, but you never know)). If your DBMS exposes its internal instrumentation, maybe pull that in too, so you can see where you're spending your server-side time.
It occurs to me that to speed up the query you could get rid of the if / else if chain: JDBC's ResultSetMetaData gives you the column count, plus each column's data type and name, so you can retrieve every column the same way and save the time spent in the if / else if tests.
ResultSetMetaData rsMd = rs.getMetaData();
int cols = rsMd.getColumnCount();

while (rs.next()) {
    for (int i = 1; i <= cols; i++) {
        row[i] = rs.getString(i);   // fetch every column generically as a String
    }
}
My example is pseudocode because I'm not a MATLAB programmer.
I hope you find JDBC useful - if you need anything else, let me know!

Is there a way for VBA UDF to "know" what other functions will be run?

Assume I have a UDF that will be used in a worksheet 100,000+ times. Is there a way, within the function, for it to know how many more times it is going to be called in the batch? Basically what I want to do is have every function create a to-do list of work to do. I want to do something like:
IF remaining functions to be executed after this one = 0 then ...
Is there a way to do this?
Background:
I want to make a UDF that will perform SQL queries with the user just giving parameters (date, hour, node, type). This is pretty easy to make if you're willing to actually execute the SQL query every time the function is run. I know it's easy because I did this, and it was ridiculously slow. My new idea is to have the function first see if the data it is looking for exists in a global cache variable and, if it doesn't, add it to a global "job list" variable.
What I want is for the last function call to then go through the job list, perform the fewest possible SQL queries, and fill the global cache variable. Once the cache variable is full, it would do a table refresh so all the other functions get called again; on that subsequent call they'll find the data they need in the cache.
Firstly:
VBA UDF performance is extremely sensitive to the way the UDF is coded:
see my series of posts about writing efficient VBA UDFs:
http://fastexcel.wordpress.com/2011/06/13/writing-efficient-vba-udfs-part-3-avoiding-the-vbe-refresh-bug/
http://fastexcel.wordpress.com/2011/05/25/writing-efficient-vba-udfs-part-1/
You should also consider using an Array UDF to return multiple results:
http://fastexcel.wordpress.com/2011/06/20/writing-efiicient-vba-udfs-part5-udf-array-formulas-go-faster/
Secondly:
The 12th post in this series outlines using the AfterCalculate event and a cache
http://fastexcel.wordpress.com/2012/12/05/writing-efficient-udfs-part-12-getting-used-range-fast-using-application-events-and-a-cache/
Basically the approach you would need is for the UDF to check the cache and, if the data is not current or available, add a request to the queue. Then use the after-calculation event to process the queue and, if necessary, trigger another recalc.
Performing 100,000 SQL queries from an Excel spreadsheet seems like a poor design. Creating a caching mechanism on top of these seems to compound the problem, making it more complicated than it probably needs to be. There are some circumstances where this might be appropriate, but I would consider other design approaches instead.
The most obvious is to take the data from the Excel spreadsheet and load it into a table in the database. Then use the database to do the processing on all the rows at once. The final step is to read the result back into Excel.
I find that the best way to get large numbers of rows from Excel into a database is to save the Excel file as csv and bulk insert them.
This approach may not work for your problem. In general, though, set-based approaches running in the database are going to perform much better.
As for the caching mechanism, if you have to go down that route, I can imagine a function with the following pseudo-code:
Check if input values are in cache.
If so, read values from cache.
Else do complex processing.
Load values in cache.
This logic could go in the function. As #Bulat suggests, though, it is probably better to add an additional caching layer around the function.
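For illustration only, here is the cache-plus-job-list idea sketched outside VBA (in Java; the names are invented, and in the real workbook this logic would live in VBA against your global cache and job-list variables, with the flush step driven by the AfterCalculate event):

import java.util.*;

// Purely illustrative sketch of the "check cache, else queue, then batch-fill" pattern.
public class BatchedLookupCache {
    private final Map<String, Double> cache = new HashMap<>();
    private final Set<String> pending = new LinkedHashSet<>();

    // Called once per "UDF" invocation: answer from the cache if possible, otherwise queue the key.
    public Double lookup(String key) {
        Double hit = cache.get(key);
        if (hit == null) {
            pending.add(key);          // remember the miss for the batch step
        }
        return hit;                    // null means "not yet; ask again after the batch runs"
    }

    // Called once per batch (the AfterCalculate-style step): one bulk query for all the misses.
    public void flush(java.util.function.Function<Set<String>, Map<String, Double>> bulkQuery) {
        if (pending.isEmpty()) {
            return;
        }
        cache.putAll(bulkQuery.apply(pending));   // e.g. one SQL query with an IN (...) list
        pending.clear();
        // ...then trigger a recalculation so the callers re-run and now hit the cache.
    }
}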

What state is saved between rerunning queries in Linqpad?

What state is saved between reruns of queries in LINQPad? I presumed none, so if you run a script twice it will produce the same results both times.
However, run the C# program below twice in the same LINQPad tab. You'll find that the first time it prints an empty list, and the second time a list with the message 'hey'. What's going on?
System.ComponentModel.TypeDescriptor.GetAttributes(typeof(String)).OfType<ObsoleteAttribute>().Dump();
System.ComponentModel.TypeDescriptor.AddAttributes(typeof(String),new ObsoleteAttribute("hey"));
LINQPad caches the application domain between queries, unless you request otherwise in Edit | Preferences (or press Ctrl+Shift+F5 to clear the app domain). This means that anything stored in static variables will be preserved between queries, assuming the types are numerically identical. This is why you're seeing the additional type description attribute in your code, and also explains why you often see a performance advantage on subsequent query runs (since many things are cached one way or another in static variables).
You can take advantage of this explicitly with LINQPad's Cache extension method:
var query = <someLongRunningQuery>.Cache();
query.Select (x => x.Name).Dump();
Cache() is a transparent extension method that returns exactly what it was fed if the input was not already seen in a previous query. Otherwise, it returns the enumerated result from the previous query.
Hence if you change the second line and re-execute the query, the first query will execute quickly since it will be supplied from the cache instead of having to re-execute.

How can I tell when a LINQ query is enumerated?

This might seem to be a silly question at first, but please read on.
I know that LINQ queries are deferred and only executed when the query is enumerated, but I'm having trouble figuring out exactly when that happens. Certainly in a For Each loop, the query would be enumerated. What's the rule of thumb to follow? I don't want to accidentally enumerate over my query twice if it's a huge result.
For example, does System.Linq.Enumerable.First enumerate over the whole query? I ask for performance reasons. I want to pass a LINQ result set to an ASP.NET MVC view, and I also want to pass the First element separately. Enumerating over the results twice would be painful.
It would be great to turn on some kind of flag that alerts me each time a LINQ query is enumerated. That way I could catch scenarios when I accidentally enumerate twice.
You can add your own logging quite easily to see what's going on. Other than that, the lazy/eager bit is reasonably clear. Basically it's lazy when it can be - any time the return type is IEnumerable<T> or IOrderedEnumerable<T>. It's possible for those to be lazy because you can't get at any of the data without calling GetEnumerator(). Compare that to First() for example - it has to return a value to you. It can't defer anything.
As a general point, if you want to make sure that a query won't be evaluated more than once, call ToList or ToArray on it, then use the results of that several times. Again, those methods have to return a list or an array immediately, neither of which allows for lazy population. The query is evaluated, but then it's effectively disconnected from the resulting populated collection - the query won't be executed again, however much you examine the list.
In addition to the lazy/eager question, there's streaming/non-streaming: will the method read everything from the source enumerable, or just "sip" at it, reading when it needs to. Again, in general LINQ will only read when it has to - so while Reverse is non-streaming (but still lazy), Where and Select are streaming.
There is no hard and fast rule as to when a LINQ query will be enumerated and when it won't. Partially because some methods will or won't based on the underlying type of the query source.
Here is a quick breakdown. This is not a complete breakdown by any means, mainly what I could come up with in 5 minutes.
Aggregate Functions
They enumerate the list entirely and immediately. They can usually be spotted as the extension methods which return a scalar value, for example Sum, Min, Max, Count, Last, etc.
Note: Count and Last do not necessarily enumerate the entire list. If the underlying type is convertible to ICollection<T> they will instead use a more efficient method.
Front of the list Selectors
They only look at the first element of the list and potentially the second. They are First, FirstOrDefault, Single, SingleOrDefault.
The above is referencing the versions which do not take a predicate. If they take a predicate they are better classified as Inquiries (see below)
Inquiries
They will only enumerate the minimal amount of the list necessary to do the operation. This can be as little as 1 element and as many as the entire list.
Examples: Any, Contains
Create a new list and do no enumeration immediately.
This is the vast majority of the operators in LINQ. Their cost is incurred when the new list is enumerated. Examples: Select, Where, Group, Join, SkipWhile, Skip.