Rails - find_by_sql vs active record querying

I would like to use the find_by_sql method in place of the Active Record query interface because of the flexibility I get in writing my own SQL.
For example:
It is considered good practice in SQL to order the tables in your query from smallest to largest. With the Active Record query interface, you have to start with the result model, which could be the largest table.
I can also easily avoid the N+1 problem by joining the target tables directly and selecting the required columns from multiple tables.
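For example (the Post/Author models and columns here are made up just to illustrate what I mean):
# With the query interface I would avoid N+1 with includes:
Post.includes(:author).limit(10).each { |post| post.author.name }
# With find_by_sql I can order the joins myself and select columns from both tables at once:
Post.find_by_sql(<<~SQL)
  SELECT posts.id, posts.title, authors.name AS author_name
  FROM authors                               -- smaller table first
  JOIN posts ON posts.author_id = authors.id
SQL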
I would like to know if there are reasons I should not be using the find_by_sql option, especially since I will be writing ANSI SQL that should be compatible with most, if not all, databases.
Thanks!

Writing SQL directly is normally discouraged because it prevents you from using features such as lazy execution of queries and may lead to invalid or unsafe SQL.
In fact, ActiveRecord automatically converts Ruby objects into the corresponding SQL (such as ranges into BETWEEN or arrays into IN), sanitizes values against SQL injection, and ActiveRecord::Relation provides lazy query execution.
That said, if you know what you are doing, or if achieving a specific query with ActiveRecord would be extremely complex, there is nothing wrong with writing your own SQL.
My suggestion is to avoid doing that for queries that can easily be reproduced with ActiveRecord and Arel.
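A rough sketch of those features (the Order model and its columns are made up for illustration):
# Ruby objects become the corresponding SQL, with values sanitized:
Order.where(created_at: 1.week.ago..Time.current)  # => ... WHERE created_at BETWEEN ... AND ...
Order.where(status: %w[paid shipped])              # => ... WHERE status IN ('paid', 'shipped')
# Relations are lazy: no SQL runs until the records are actually needed.
recent = Order.where(status: "paid")  # no query yet
recent.limit(10).to_a                 # the query executes here
# find_by_sql still returns model instances, but it runs immediately and cannot be chained further:
Order.find_by_sql(["SELECT * FROM orders WHERE status = ?", "paid"])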

Related

Putting an array in a WHERE IN clause in Ruby on Rails

I'm building a Ruby on Rails application that uses raw SQL to query my database because I heard that it performs better than using ActiveRecord. I have an array that stores a list of items that are BigInts. For example:
my_items_id = [43627164222, 43667161211, 43667161000]
And my sql statement is supposed to return all values from a table where the id is any of the ones in my_items_id.
sql = "select * from table1 where id IN #{my_items_id}"
records_array = ActiveRecord::Base.connection.execute(sql)
The reason this doesn't work is that my_items_id is an array, and I know that. But what is the best way to convert it to (43627164222, 43667161211, 43667161000) so that the SQL statement actually works?
You write that you want to avoid ActiveRecord because you read that raw SQL performs better. That is probably correct, especially when you have to handle millions of records.
But what makes ActiveRecord slower compared to raw SQL is certainly not building and sanitizing the query up front. ActiveRecord is slower because it parses the result and returns instances of your database models instead of a simple hash-like structure.
That said, IMO it is perfectly fine to build the query with the ActiveRecord query language but run it as raw SQL on the plain connection. Then you still benefit from ActiveRecord's query language and its protection against SQL injection.
my_items_id = [43627164222, 43667161211, 43667161000]
sql = Table1.where(id: my_items_id).to_sql # <= note the `to_sql` here
records_array = ActiveRecord::Base.connection.execute(sql)
Another issue when handling millions of records is not only the parsing part of ActiveRecord but the fact that millions of records consume a lot of RAM and might, therefore, be slower than expected. ActiveRecord has helper methods for this problem too – like find_each or find_in_batches. These methods do not load all records into memory at the same time but in smaller batches and can improve the overall performance of the operation a lot.
Table1.where(id: my_items_id).find_each do |item|
# handle each item
end
Or you might only need parts of the original records and not all columns; then pluck will be helpful. It again improves the performance of the ActiveRecord query, because it returns a simple nested array instead of complex ActiveRecord instances, which saves time on parsing and memory.
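For example (the name column is made up):
# pluck loads only the selected columns and skips building model instances
Table1.where(id: my_items_id).pluck(:id, :name)
# => [[43627164222, "foo"], [43667161211, "bar"], ...]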
Another issue with slow queries on millions of records is certainly missing database indexes. But without knowing more about the database structure and the slow queries, it is impossible to give any advice.

What's the common practice for constructing the WHERE clause based on user input

If we take a database table, we can query all the rows or we can apply a filter to it. The filter can vary depending on the user input. When there are only a few options, we can write a different query for each specific condition. But if there are lots and lots of options that the user might or might not specify, that method does not work well.
I know I can compose the filter based on the user input and send it as a string to the corresponding stored procedure as a parameter, build the query with that filter, and finally execute the query string with the help of EXECUTE IMMEDIATE (in Oracle's case). I don't know why, but I really don't like this way of query building. I think this way I leave the door open for SQL injection. Besides that, I always have trouble with the query itself, since everything is just a string and I need to handle dates and numbers carefully.
What is the best and most used method of forming the WHERE clause of a query against a database table?
Using database parameters instead of attempting to quote your literals is the way forward.
This will guard you against SQL injection.
A common way of approaching this problem is building expression trees that represent your query criteria, converting them to parameterized SQL (to avoid SQL injection risks), binding parameter values to the generated SQL, and executing the resultant query against your target database.
The exact approach depends on your client programming framework: .NET has Entity Framework and LINQ2SQL, which both support expression trees; Java has Hibernate and JPA; and so on. I have seen several different frameworks used to construct customizable queries with a great deal of success. In situations where these frameworks are not available, you can roll your own, although it requires a lot more work.
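In the Rails context of the original question, Arel plays that expression-tree role. A minimal sketch, assuming a hypothetical users table and run inside a Rails app:
users = Arel::Table.new(:users)
# compose the criteria as objects rather than strings
criteria = users[:age].gteq(18).and(users[:name].matches("a%"))
users.project(users[:id], users[:name]).where(criteria).to_sql
# => roughly: SELECT "users"."id", "users"."name" FROM "users"
#             WHERE "users"."age" >= 18 AND "users"."name" LIKE 'a%'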

NHibernate problem choosing between CreateSql and CreateCriteria

I have a very silly doubt in NHibernate. There are two or three entities, of which two are related and one is not related to the other two. I have to fetch some selected columns from these three tables by joining them. Is it a good idea to use session.CreateSql(), or do we have to use session.CreateCriteria()? I am really confused here, as I could not write the Criteria queries and was forced to use CreateSql. Please advise.
In general you should avoid writing SQL whenever possible;
one of the advantages of using an ORM is that it's implementation-agnostic.
that means that you don't know (and don't care) what the underlying database is, and you can actually switch DB providers or tweak with the DB structure very easily.
If you write your own SQL statements you run the risk of them not working on other providers, and you also have to maintain them yourself (for example, if you change the name of the underlying column for the Id property from 'Id' to 'Employee_Id', you'd have to change your SQL query, whereas with Criteria no change would be necessary).
Having said that, there's nothing stopping you from writing a Criteria / HQL query that pulls data from more than one table. For example (with HQL):
select emp.Id, dep.Name, po.Id
from Employee emp, Department dep, Posts po
where emp.Name like 'snake' //etc...
There are multiple ways to make queries with NH.
HQL, the classic way, a powerful object oriented query language. Disadvantage: appears in strings in the code (actually: there is no editor support).
Criteria, a classic way to create dynamic queries without string manipulations. Disadvantages: not as powerful as HQL and not as typesafe as its successors.
QueryOver, a successor of Criteria, which has a nicer syntax and is more type safe.
LINQ, now based on HQL, is more integrated than HQL and typesafe, and is generally a matter of taste.
SQL as a fallback for cases where you need something you can't get the object oriented way.
I would recommend HQL or LINQ for regular queries, QueryOver (resp. Criteria) for dynamic queries and SQL only if there isn't any other way.
To answer your specific problem, which I don't know: If all information you need for the query is available in the object oriented model, you should be able to solve it by the use of HQL.

LINQ to SQL performance using Contains

I'm overloading a vb.net search procedure which queries a SQL database.
One of the older methods I'm using as a comparison uses a Stored Procedure to perform the search and return the query.
My new method uses LINQ.
I'm slightly concerned about the performance when using Contains queries with LINQ. I'm looking at equally comparable queries using both methods.
Basically, each query has just one WHERE clause to compare.
Here are some profiler results;
Where name = "ber10rrt1"
Linq query: 24 reads
Stored query: 111 reads
Where name = "%ber10%"
Linq query: 53174 reads
Stored proc query: 23386 reads
Forgetting for a moment the indexes (not my database)... the fact of the matter is that both methods are fundamentally performing the same query (albeit the stored procedure does reference a view for [some] of the tables).
Is this consistent with other people's experience of LINQ to SQL?
Also, interestingly enough:
Using like "BER10%"
resultset.Where(Function(c) c.ci.Name.StartsWith(name))
Results in the stored proc using 13125 reads and LINQ using 8172 reads.
I'm not sure there is enough there for a complete analysis... I'm assuming we are talking about string.Contains/string.StartsWith here (not List<T>.Contains).
If the generated TSQL is similar, then the results should be comparable. There are a few caveats to this - for example, is the query column a calculated+persisted value? If so, the SET options must be exact matches for it to be usable "as is" (otherwise it has to re-calculate per row).
So: what is the TSQL from the SP and LINQ? Are they directly comparable?
You mention a VIEW - I'm guessing this could make a big difference if (for example) it filters out data (either via a WHERE or an INNER JOIN).
Also - LIKE clauses starting with % are rarely a good idea - not least because the query can't make effective use of any index. You might have better performance using "full text search"; but this isn't directly available via LINQ, so you'll have to wrap it in an SP and expose the SP via the LINQ data-context (just drag the SP into the designer).
My money is on the VIEW (and the other code in the SP) being the main difference here.

How to create dynamic and safe queries

A "static" query is one that remains the same at all times. For example, the "Tags" button on Stackoverflow, or the "7 days" button on Digg. In short, they always map to a specific database query, so you can create them at design time.
But I am trying to figure out how to do "dynamic" queries where the user basically dictates how the database query will be created at runtime. For example, on Stackoverflow, you can combine tags and filter the posts in ways you choose. That's a dynamic query albeit a very simple one since what you can combine is within the world of tags. A more complicated example is if you could combine tags and users.
First of all, when you have a dynamic query, it sounds like you can no longer use the substitution API to avoid SQL injection, since the query elements will depend on what the user decided to include in the query. I can't see how else to build this query other than by appending strings.
Secondly, the query could potentially span multiple tables. For example, if SO allows users to filter based on Users and Tags, and these probably live in two different tables, building the query gets a bit more complicated than just appending columns and WHERE clauses.
How do I go about implementing something like this?
The first rule is that users are allowed to specify values in SQL expressions, but not SQL syntax. All query syntax should be literally specified by your code, not user input. The values that the user specifies can be provided to the SQL as query parameters. This is the most effective way to limit the risk of SQL injection.
Many applications need to "build" SQL queries through code, because as you point out, some expressions, table joins, order by criteria, and so on depend on the user's choices. When you build a SQL query piece by piece, it's sometimes difficult to ensure that the result is valid SQL syntax.
I worked on a PHP class called Zend_Db_Select that provides an API to help with this. If you like PHP, you could look at that code for ideas. It doesn't handle any query imaginable, but it does a lot.
Some other PHP database frameworks have similar solutions.
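In Rails terms, a minimal sketch of the "syntax comes from code, values come from the user" rule, assuming a hypothetical Post model:
ALLOWED_FILTERS = %w[author_id tag status].freeze

def filtered_posts(user_filters)
  # only column names from the whitelist ever reach the query;
  # the user-supplied values are passed through as bound values
  user_filters.slice(*ALLOWED_FILTERS).reduce(Post.all) do |scope, (column, value)|
    scope.where(column => value)
  end
end

filtered_posts("tag" => "rails", "DROP TABLE" => "x")  # unknown keys are simply ignored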
Though not a general solution, here are some steps that you can take to mitigate the dynamic yet safe query issue.
Criteria in which a column value must belong to a set of values of arbitrary cardinality do not need to be dynamic. Consider using either the instr function or a special filtering table that you join against. This approach can easily be extended to multiple columns as long as the number of columns is known. Filtering on users and tags could easily be handled with this approach.
When the number of columns in the filtering criteria is arbitrary yet small, consider using different static queries for each possibility (see the sketch after this list).
Only when the number of columns in the filtering criteria is arbitrary and potentially large should you consider using dynamic queries. In which case...
To be safe from SQL injection, either build or obtain a library that defends against that attack. Though more difficult, this is not an impossible task. This is mostly about escaping SQL string delimiters in the values to filter for.
To be safe from expensive queries, consider using views that are specially crafted for this purpose and some up front logic to limit how those views will get invoked. This is the most challenging in terms of developer time and effort.
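A hypothetical sketch of the "different static queries for each possibility" idea, assuming a Post model with user_id and tag columns:
def posts_for(user_id: nil, tag: nil)
  # each combination of filters maps to its own fixed query
  if user_id && tag
    Post.where(user_id: user_id, tag: tag)
  elsif user_id
    Post.where(user_id: user_id)
  elsif tag
    Post.where(tag: tag)
  else
    Post.all
  end
end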
If you were using Python to access your database, I would suggest you use the Django model system. There are many similar APIs, both for Python and for other languages (notably in Ruby on Rails). I am saving so much time by avoiding the need to talk directly to the database with SQL.
From the example link:
# Model definition
from django.db import models

class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()

    def __unicode__(self):
        return self.name
Model usage (this is effectively an insert statement)
from mysite.blog.models import Blog
b = Blog(name='Beatles Blog', tagline='All the latest Beatles news.')
b.save()
The queries get much more complex - you pass around a query object and you can add filters / sort elements to it. When you are finally ready to use the query, Django creates an SQL statement that reflects all the ways you adjusted the query object. I think that it is very cute.
Other advantages of this abstraction
Your models can be created as database tables with foreign keys and constraints by Django
Many databases are supported (PostgreSQL, MySQL, SQLite, etc.)
Django analyses your models and creates an automatic admin site out of them.
Well, the options have to map to something.
Concatenating a SQL query string isn't a problem if you still use parameters for the option values.
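A minimal sketch of that idea in the Rails setting of the original question (the posts table and param names are made up): the SQL text is concatenated only from fixed fragments chosen by the code, while the user-supplied values are passed as bind parameters.
sql   = "SELECT * FROM posts WHERE 1 = 1"
binds = []
if params[:tag]
  sql   += " AND tag = ?"
  binds << params[:tag]
end
if params[:user_id]
  sql   += " AND user_id = ?"
  binds << params[:user_id]
end
Post.find_by_sql([sql, *binds])  # ActiveRecord quotes and escapes the bound values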