I'm building a Ruby on Rails application that uses raw SQL to query my database because I heard that it performs better than using ActiveRecord. I have an array that stores a list of item IDs that are bigints. For example:
my_items_id = [43627164222, 43667161211, 43667161000]
And my SQL statement is supposed to return all rows from a table where the id matches any of the values in my_items_id.
sql = "select * from table1 where id IN #{my_items_id}"
records_array = ActiveRecord::Base.connection.execute(sql)
I know this doesn't work because my_items_id is an array. But what is the best way to convert it to (43627164222, 43667161211, 43667161000) so that the SQL statement actually works?
You write that you want to avoid ActiveRecord because you read that raw SQL performs better. That is probably correct, especially when you have to handle millions of records.
But what makes ActiveRecord slower compared to raw SQL is certainly not building and sanitizing the query upfront. ActiveRecord is slower because it parses the result and returns instances of your database models instead of a simple hash-like structure.
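To make that difference concrete, here is a rough sketch (the Table1 model and the result columns are placeholders, not taken from the original question):
rows = ActiveRecord::Base.connection.exec_query("SELECT * FROM table1").to_a
# => hash-like rows such as [{"id" => 43627164222, "name" => "..."}, ...]
records = Table1.all.to_a
# => full model instances such as [#<Table1 id: 43627164222, ...>, ...] – the comparatively expensive part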
That said, IMO it is perfectly fine to build the query with the ActiveRecord query language but run it as raw SQL on the plain connection. Then you would still benefit from ActiveRecord's query language and its security features against SQL injection.
my_items_id = [43627164222, 43667161211, 43667161000]
sql = Table1.where(id: my_items_id).to_sql # <= Note the `to_sql` here
records_array = ActiveRecord::Base.connection.execute(sql)
Another issue when handling millions of records is not only ActiveRecord's parsing but the fact that millions of records consume a lot of RAM and might, therefore, be slower than expected. ActiveRecord has helper methods for this problem too – like find_each or find_in_batches. These methods do not load all records into memory at once but in smaller batches, which can improve the overall performance of the operation a lot.
Table1.where(id: my_items_id).find_each do |item|
# handle each item
end
Or you might only need some columns of the original records rather than all of them; then pluck will be helpful. It improves the performance of the ActiveRecord query further because it returns a simple nested array instead of full ActiveRecord instances, which saves time on parsing and memory.
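For example (the name column is only assumed for illustration):
Table1.where(id: my_items_id).pluck(:id, :name)
# => [[43627164222, "foo"], [43667161211, "bar"], ...] – plain arrays, no Table1 instances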
Another issue with slow queries on millions of records is certainly missing database indexes. But without knowing more about the database structure and the slow queries, it is impossible to give any specific advice.
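If a missing index does turn out to be the culprit, adding one is a one-line migration. A hypothetical sketch (the table and column names are made up for illustration):
class AddIndexToTable1OnCreatedAt < ActiveRecord::Migration[7.0]
  def change
    # lets the database find rows by created_at without a full table scan
    add_index :table1, :created_at
  end
end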
I would like to use the find_by_sql method in place of the Active Record query interface because of the flexibility I get in writing my own SQL.
For example,
It is considered good practice in SQL to order the tables in your query from smallest to largest. When using the Active Record query interface, you have to start with the result model, which could be the largest table.
I can also easily avoid the N+1 problem by joining the target table directly and selecting the required columns from multiple tables.
I would like to know if there are reasons I should not be using the find_by_sql option, especially when I will be writing ANSI SQL that should be compatible with most, if not all, databases.
Thanks!
Writing SQL directly is normally not encouraged because it prevents you from using features such as lazy execution of queries and may lead to invalid or unsafe queries.
In fact, ActiveRecord automatically converts Ruby objects into the corresponding SQL (such as ranges into BETWEEN or arrays into IN), sanitizes strings against SQL injection, and ActiveRecord::Relation provides lazy query execution.
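A small sketch of those conversions, assuming a User model with a created_at column:
User.where(created_at: 1.week.ago..Time.current).to_sql
# => SELECT "users".* FROM "users" WHERE "users"."created_at" BETWEEN ... AND ...
User.where(id: [1, 2, 3]).to_sql
# => SELECT "users".* FROM "users" WHERE "users"."id" IN (1, 2, 3)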
That said, if you know what you are doing, or if a specific query would be extremely complex to achieve with ActiveRecord, there is nothing wrong with writing your own SQL.
My suggestion is to avoid doing that for queries that can easily be expressed in ActiveRecord and Arel.
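If you do drop down to find_by_sql, you can keep the sanitization by passing bind parameters instead of interpolating values. A sketch with assumed model and column names:
author_name = "Alice" # e.g. user-supplied input
Post.find_by_sql(["SELECT posts.* FROM posts JOIN users ON users.id = posts.user_id WHERE users.name = ?", author_name])
# the placeholder is escaped by ActiveRecord, so the input cannot inject SQL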
When finding objects in Rails, is it better to combine query conditions first and then perform the query, or to start with less strict criteria and perform array operations on the results to narrow down what I want? I want to know which one performs faster and/or is the standard approach.
If possible, narrowing results in SQL is generally faster, so add as many query conditions as you need, provided the query does not get too complex. Running Ruby code to narrow the results is not faster than SQL, not least because Ruby is interpreted.
Moreover, narrowing results in SQL benefits from the following (see the sketch after this list):
Index lookups, which are much faster if there is a query condition or sort clause on an indexed column (e.g. add an index on created_at, then find results created within a specific timeframe)
Less data sent from the SQL database to Ruby/Rails (if both are on the same machine, a small improvement in speed; if the database is on a separate machine, less bandwidth is used and results are returned faster)
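As a rough illustration (the model and column names are assumptions):
# narrowing in SQL: the database filters the rows, ideally via an index, and only matches cross the wire
recent = Table1.where(created_at: 1.day.ago..Time.current).to_a
# narrowing in Ruby: every row is loaded and instantiated first, then filtered in memory
recent = Table1.all.to_a.select { |r| r.created_at >= 1.day.ago }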
I am wondering which is a more efficient method to retrieve data from the database.
ex.
One particular part of my application can have well over 100 objects. Right now I have it set up to query the database twice for each object. This part of the application periodically refreshes itself, say every 2 minutes, and the application will probably end up being installed on 25-30 PCs. I am thinking that this is a large number of select statements to make against the database, and I am thinking about trying to optimize the procedure. I have to pull the information out of several tables, and both queries use join statements.
Would it be better to rewrite the queries so that I am only executing two queries per update instead of 200? For example, using a large WHERE clause to include each object and then processing the data outside of the objects rather than inside each one?
Using SQL Server and .NET. No indexes on the tables; size of database is less than 10-5th.
All things being equal, many statements each returning few rows are usually worse than a few statements returning many rows.
Show the actual code and you will get better answers.
The default answer for optimization must always be: don't optimize until you have a demonstrated need for it. The follow-up is: once you have a demonstrated need for it and an alternative approach, try both ways to determine which is faster. You can't predict which will be faster, and we certainly can't. KM's answer is good – fewer queries tend to be better – but the way to know is to test.
I'm overloading a VB.NET search procedure which queries a SQL database.
One of the older methods I'm using as a comparison uses a stored procedure to perform the search and return the results.
My new method uses LINQ.
I'm slightly concerned about performance when using Contains queries with LINQ. I'm comparing equivalent queries using both methods.
Basically, each has a single WHERE clause.
Here are some profiler results:
WHERE name = "ber10rrt1"
LINQ query: 24 reads
Stored proc query: 111 reads
WHERE name = "%ber10%"
LINQ query: 53174 reads
Stored proc query: 23386 reads
Forgetting the indexes for a moment (it's not my database)... the fact of the matter is that both methods are fundamentally performing the same query (albeit the stored procedure references a view for some of the tables).
Is this consistent with other people's experience of LINQ to SQL?
Also, interestingly enough:
Using LIKE "BER10%"
resultset.Where(Function(c) c.ci.Name.StartsWith(name))
results in the stored proc using 13125 reads and LINQ using 8172 reads.
I'm not sure there is enough there for a complete analysis... I'm assuming we are talking about string.Contains/string.StartsWith here (not List<T>.Contains).
If the generated TSQL is similar, then the results should be comparable. There are a few caveats to this - for example, is the query column a calculated+persisted value? If so, the SET options must be exact matches for it to be usable "as is" (otherwise it has to re-calculate per row).
So: what is the TSQL from the SP and LINQ? Are they directly comparable?
You mention a VIEW - I'm guessing this could make a big difference if (for example) it filters out data (either via a WHERE or an INNER JOIN).
Also - LIKE clauses starting with % are rarely a good idea; not least, they can't make effective use of any index. You might get better performance using full-text search, but this isn't directly available via LINQ, so you'll have to wrap it in a stored procedure and expose the SP via the LINQ data context (just drag the SP into the designer).
My money is on the VIEW (and the other code in the SP) being the main difference here.
I am concentrating this question on 'reporting-type' queries (count, avg, etc., i.e. ones that don't return the domain model itself), and I was wondering if there is any inherent performance benefit in using HQL, as it may be able to leverage the second-level cache. Or, perhaps even better, cache the entire query.
The obvious implied benefit is that NHibernate knows what the column names are as it already knows about the model mapping.
Any other benefits I should be aware of?
[I am using NHibernate but I assume that in this instance what applies to Hibernate will be equally applicable to NHibernate]
There are zero advantages. HQL will not outperform a direct database query for data aggregation and computation.
The result of something like:
SELECT COUNT(*), dept FROM employees GROUP BY dept
will always perform faster in the DB than in HQL. Note I say always because it is lame to take the 'depends on your situation' line of thinking. If it has to do with data and data aggregation, do it in SQL.
The objects in the second level cache are only retrieved by id, so Hibernate will always run a query to obtain a list of ids and then read those objects either from the second-level cache or with another query.
On the other hand, Hibernate can cache the query and avoid the DB call completely in some situations. However, you have to consider that a change to any of the tables involved in the query will invalidate it, so you might not hit the cache very often. See here for a description of how the query cache works.
So the cost of your query is either 0, if the query is cached, or about the same as doing the query in straight SQL. Depending on how often your data changes you might save a lot by enabling query caching or you might not save anything.
If you have a high volume of queries and you can tolerate stale results, I'd say it's a lot better to use another cache for the query results that only expires every x minutes.
The only advantage I can think of is that ORM queries are typically cached at the (prepared) statement level, so if you run the same query lots of times, chances are you are reusing a prepared statement.
But since you asked specifically about reporting queries and performance, I cannot think of any practical advantages. (I'm glossing over the fact that you get other benefits like data-access consistency, easier query writing with HQL compared to SQL most of the time, data-type conversions, etc.)
HQL is an object query language. SQL is a relational query language.