Invoking a large set of SQL from a Rails 4 application - sql

I have a Rails 4 application that I use in conjunction with Sidekiq to run asynchronous jobs. One of the jobs I normally run outside of my Rails application is a large set of complex SQL queries that cannot really be modeled with ActiveRecord. The connection this set of queries has with my Rails app is that it should be executed any time one of my controller actions is invoked.
Ideally, I'd queue a job from within the controller for Sidekiq to go ahead and run the queries. Right now they're stored in an external file, and I'm not entirely sure what the best way is to have Rails run that SQL.
Any solutions are appreciated.

I agree with Sharagoz: if you just need to run a specific query, the best way is to send the query string directly to the connection, like:
ActiveRecord::Base.connection.execute(File.read("myquery.sql"))
If the query is not static and you have to compose it, I would use Arel, which is already part of Rails 4.x:
https://github.com/rails/arel
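To illustrate composing a query with Arel, here is a minimal sketch (the users table and the visits column are made up for the example):

users = Arel::Table.new(:users)

query = users
  .project(users[:id], users[:name])
  .where(users[:visits].gt(10))
  .order(users[:name].asc)

# Arel only builds the SQL string; execution still goes through the connection.
ActiveRecord::Base.connection.execute(query.to_sql)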

You didn't say what database you are using, so I'm going to assume MySQL.
You could shell out to the mysql binary to do the work:
result = `mysql -u #{user} --password=#{password} #{database} < #{huge_sql_filename}`
Or use ActiveRecord::Base.connection.execute(File.read("huge.sql")), but it won't work out of the box if you have multiple SQL statements in your SQL file.
In order to run multiple statements you will need to create an initializer that monkey patches the ActiveRecord::Base.mysql2_connection to allow setting MySQL's CLIENT_MULTI_STATEMENTS and CLIENT_MULTI_RESULTS flags.
Create a new initializer, config/initializers/mysql2.rb:
module ActiveRecord
  class Base
    # Overriding ActiveRecord::Base.mysql2_connection
    # method to allow passing options from database.yml
    #
    # Example of database.yml
    #
    #   login: &login
    #     socket: /tmp/mysql.sock
    #     adapter: mysql2
    #     host: localhost
    #     encoding: utf8
    #     flags: 131072
    #
    # @param [Hash] config hash that you define in your
    #   database.yml
    # @return [Mysql2Adapter] new MySQL adapter object
    #
    def self.mysql2_connection(config)
      config[:username] = 'root' if config[:username].nil?

      if Mysql2::Client.const_defined? :FOUND_ROWS
        config[:flags] = config[:flags] ? config[:flags] | Mysql2::Client::FOUND_ROWS : Mysql2::Client::FOUND_ROWS
      end

      client = Mysql2::Client.new(config.symbolize_keys)
      options = [config[:host], config[:username], config[:password], config[:database], config[:port], config[:socket], 0]
      ConnectionAdapters::Mysql2Adapter.new(client, logger, options, config)
    end
  end
end
Then update config/database.yml to add flags:
development:
  adapter: mysql2
  database: app_development
  username: user
  password: password
  flags: <%= 65536 | 131072 %>
I just tested this on Rails 4.1 and it works great.
Source: http://www.spectator.in/2011/03/12/rails2-mysql2-and-stored-procedures/

Executing a single query is, as outlined by other people, quite simply done through
ActiveRecord::Base.connection.execute("SELECT COUNT(*) FROM users")
You are talking about a 20,000-line SQL script of multiple queries. Assuming you have the file somewhat under control, you can extract the individual queries from it.
script = Rails.root.join("lib").join("script.sql").read # ah, Pathnames

# this needs to match the delimiter of your queries
STATEMENT_SEPARATOR = ";\n\n"

ActiveRecord::Base.transaction do
  script.split(STATEMENT_SEPARATOR).each do |stmt|
    ActiveRecord::Base.connection.execute(stmt)
  end
end
If you're lucky, the query delimiter is ";\n\n", but this of course depends on your script; in another case we had "\x0" as the delimiter. The point is that you split the script into individual queries and send them to the database. I wrapped it in a transaction to let the database know that more than one statement is coming; the transaction commits when no exception is raised while executing the statements.
If you do not have the script file under control, start talking to those who control it to get a reliable delimiter. If it's not under your control and you cannot talk to whoever controls it, you probably shouldn't be executing it, I guess :-).
UPDATE
This is a generic way to solve this. For PostgreSQL, you don't need to split the statements manually. You can just send them all at once via execute. For MySQL, there seem to be solutions to get the adapter into a CLIENT_MULTI_STATEMENTS mode.

If you want to execute raw SQL through active record you can use this API:
ActiveRecord::Base.connection.execute("SELECT COUNT(*) FROM users")

If you are running a big SQL job every time, I suggest creating a SQL view for it; that should boost execution time. Also, if possible, try to split the SQL queries so that they can be executed in parallel rather than sequentially, and then push them to the Sidekiq queue.
You have to use ActiveRecord::Base.connection.execute or ModelClass.find_by_sql to run custom SQL.
Also, keep an eye on ROLLBACK transactions; you will find many places where you don't need such a ROLLBACK feature. If you avoid it, the queries will run faster, but it is dangerous.
That's all I can suggest.
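As a rough illustration of the view suggestion (the view name, its defining query, and the model are all made up for the example):

# In a migration: wrap the heavy query in a database view.
class CreateReportSummariesView < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE VIEW report_summaries AS
      SELECT account_id, COUNT(*) AS row_count
      FROM events
      GROUP BY account_id
    SQL
  end

  def down
    execute "DROP VIEW report_summaries"
  end
end

# A model can then read from the view like an ordinary table.
class ReportSummary < ActiveRecord::Base
  self.table_name = "report_summaries"
end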

A few suggestions:
- Use the database tools available to you to handle the complex queries, such as views and stored procedures, and call them as other people already suggested (ActiveRecord::Base.connection.execute and ModelClass.find_by_sql, for example). This may well cut down significantly on query preparation time in the DB and make your code easier to maintain.
http://dev.mysql.com/doc/refman/5.0/en/create-view.html
http://dev.mysql.com/doc/connector-cpp/en/connector-cpp-tutorials-stored-routines-statements.html
- Abstract your query input parameters into a hash so you can pass it on to Sidekiq; don't send SQL strings, as this will probably degrade performance (due to query preparation time) and make your life more complicated due to funny SQL driver parsing bugs.
- Run your complex queries in a dedicated named queue and set its concurrency to a value that will prevent your database from getting overwhelmed, as these queries sound like they could be pretty DB-heavy (see the worker sketch after this answer).
https://github.com/mperham/sidekiq/wiki/API
https://github.com/mperham/sidekiq/wiki/Advanced-Options
- Have a look at Squeel; it's a great addition to ActiveRecord and might be able to pull off some of the things you are doing.
https://github.com/activerecord-hackery/squeel
http://railscasts.com/episodes/354-squeel
I'll assume you use MySQL for now, but your mileage will vary depending on the DB type that you use. For example, Oracle has some good gems for handling stored procedures, views, etc., for example https://github.com/rsim/ruby-plsql
Let me know if some of this stuff doesn't fit your use case and I'll expand
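To make the dedicated-queue suggestion concrete, here is a minimal sketch of a worker; the worker name, queue name, and SQL file path are assumptions, not anything from the question:

# app/workers/heavy_sql_worker.rb
class HeavySqlWorker
  include Sidekiq::Worker
  # Route these jobs to their own named queue.
  sidekiq_options queue: :heavy_sql

  def perform(query_params)
    # query_params is the plain hash of inputs suggested above (unused in this sketch).
    sql = File.read(Rails.root.join("lib", "heavy_queries.sql"))
    ActiveRecord::Base.connection.execute(sql)
  end
end

You can then run a separate Sidekiq process for that queue with a deliberately low concurrency, for example: bundle exec sidekiq -q heavy_sql -c 1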

I see this post is kind of old, but I would like to add my solution to it. I was in a similar situation; I also needed a way to force-feed "PRAGMA foreign_keys = on;" into my SQLite connection (I could not find a previous post that spelled out how to do it). Anyway, this worked like a charm for me. It allowed me to write "pretty" SQL and still get it executed; blank lines are skipped by the unless check.
conn = ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: DB_NAME)
sqls = File.read(DDL_NAME).split(';')
sqls.each { |sql| conn.connection.execute(sql << ';') unless sql.strip.size == 0 }
conn.connection.execute('PRAGMA foreign_keys = on;')

I had the same problem with a set of SQL statements that I needed to execute all in one call to the server. What worked for me was to set up an initializer for the Mysql2 adapter (as explained in infused's answer) but also do some extra work to process multiple results. A direct call to ActiveRecord::Base.connection.execute would only retrieve the first result and raise an internal error.
My solution was to get the Mysql2 adapter and work directly with it:
client = ActiveRecord::Base.connection.raw_connection
Then, as explained here, execute the query and loop through the results:
client.query(multiple_stms_query) # the first result set is returned by #query itself
while client.next_result
  result = client.store_result
  # do something with it ...
end

Related

Executing raw SQL in migrations: keep SQL statements as strings inside a migration or as code inside a separate SQL file?

In the database, I have multiple materialized views with big definitions. I also have multiple migrations that change the definitions of some of these materialized views using DROP and CREATE statements. Thus, we often are dropping / recreating the same views over and over, with small changes. These (rather bulky) statements are now stored inside strings:
class MyMigrationName < ActiveRecord::Migration[5.2]
  def up
    sql = <<~SQL
      ...
      create materialized view if not exists foo_1 as ... ;
      create materialized view if not exists foo_2 as ... ;
      ...
    SQL
    execute sql
  end

  def down
    ...
  end
end
I am considering switching from this current approach to a different one, where the SQL code is stored inside separate SQL files, for example in db/migrate/concerns/create_foo_matviews.sql. The code is read from the file and executed from inside the migrations, like so:
class MyMigrationName < ActiveRecord::Migration[5.2]
  def up
    execute File.read(File.expand_path('concerns/create_foo_matviews.sql', __dir__))
  end

  def down
    ...
  end
end
The pros of this approach are:
- It is easier to see the differences between the old and the new SQL code using git diff (especially important given that the materialized views' definitions are big, but the actual changes in migrations are relatively small).
- The SQL file adds syntax highlighting to the SQL code.
- There is less copy/pasted code if I only change the relevant parts in the SQL file.
Are there any problems associated with this proposed approach? If yes, what would be an alternative solution to maximize maintainability?
See also
Is it possible to use an external SQL file in a Rails migration?
Running sql file using rails migration file
Execute SQL-Statement from File with ActiveRecord
I'd leave it in the Migration.
Mainly because the migration then contains everything that actually makes up the DB change.
You would need two external SQL files (up and down) that a reader has to search for and find before understanding what the migration does.
Depending on the editor you are using, you will get (limited) syntax highlighting in the heredoc anyway.
The migrations that execute custom SQL would all look the same; only the name of the external file would differ.
What problem are you trying to solve? Just the "bulky" strings? I don't think that is a problem worth spending a lot of time on (to be honest, once a migration has run, you don't go back to it anyhow). Just do the simplest thing: SQL in a heredoc string.
There are also gems that allow you to create (materialized) views with normal migration code (by adding support for create_view or similar), but I'd not add an additional dependency for something this simple.
Also consider changing from schema.rb to structure.sql, if not yet done.
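For reference, switching to structure.sql is a one-line setting in config/application.rb (the application module name here is a placeholder):

# config/application.rb
module MyApp
  class Application < Rails::Application
    # Dump the schema as SQL (db/structure.sql) so views and other
    # raw-SQL objects are preserved by the schema dump.
    config.active_record.schema_format = :sql
  end
end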
Sounds like you want to create your own helpers for materialized views, something like add_index or add_column.
You could make a module named something like MaterializedMigrations in your lib directory, require it in an initializer, and finally include it in your migration code, like this:
class MyMigrationName < ActiveRecord::Migration[5.2]
  include MaterializedMigrations

  def up
    create_materialized_view("name_of_view")
  end
end
The helper API is only a suggestion; you could design a better API for your use cases.
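To make that concrete, here is a minimal sketch of what such a module might look like; it assumes the view definitions live in SQL files under db/views/, which is a made-up convention:

# lib/materialized_migrations.rb
module MaterializedMigrations
  # Reads db/views/<name>.sql and creates the materialized view from it.
  def create_materialized_view(name)
    definition = File.read(Rails.root.join("db", "views", "#{name}.sql"))
    execute "CREATE MATERIALIZED VIEW IF NOT EXISTS #{name} AS #{definition}"
  end

  def drop_materialized_view(name)
    execute "DROP MATERIALIZED VIEW IF EXISTS #{name}"
  end
end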

Airflow + pandas read_sql_query() with commit

Question
Can I commit a SQL transaction to a DB using read_sql()?
Use Case and Background
I have a use case where I want to allow users to execute some predefined SQL and have a pandas dataframe returned. In some cases, this SQL will need to query a pre-populated table, and in other cases, this SQL will execute a function which will write to a table and then that table will be queried.
This logic is currently contained inside a method in an Airflow DAG in order to leverage database connection information accessible to Airflow using the PostgresHook; the method is eventually called in a PythonOperator task. It's my understanding through testing that the PostgresHook creates a psycopg2 connection object.
Code
from airflow.hooks.postgres_hook import PostgresHook
import pandas as pd
def create_df(job_id, other_unrelated_inputs):
    conn = job_type_to_connection(job_type)  # method that helps choose a database
    sql = open('/sql_files/job_id_{}.sql'.format(job_id))  # chooses arbitrary SQL
    sql_template = sql.read()
    hook = PostgresHook(postgres_conn_id=conn)  # connection information for alias is predefined elsewhere within Airflow

    try:
        hook_conn_obj = hook.get_conn()
        print(type(hook_conn_obj))  # <class 'psycopg2.extensions.connection'>
        # Runs SQL template with variables, but does not commit. Alternatively, have used hook.get_pandas_df(sql_template)
        df = pd.io.sql.read_sql(sql_template, con=hook_conn_obj)
    except:
        # catches some errors #
        pass

    return df
Problem
Currently, when executing a SQL function, this code generates a dataframe, but does not commit any of the DB changes made in the SQL function. For example, to be more precise, if the SQL function INSERTs a row into a table, that transaction will not commit and the row will not appear in the table.
Attempts
I've attempted a few fixes but am stuck. My latest effort was to change the autocommit attribute of the psycopg2 connection that read_sql uses in order to autocommit the transaction.
I'll admit that I haven't been able to figure out when the attributes of the connection have an impact on the execution of the SQL.
I recognize that an alternative path is to replicate some of the logic in PostgresHook.run() to commit and then add some code to push results into a dataframe, but it seems more parsimonious and easier for future support to use the methods already created, if possible.
The most analogous SO question I could find was this one, but I'm interested in an Airflow-independent solution.
EDIT
...
try:
    hook_conn_obj = hook.get_conn()
    print(type(hook_conn_obj))  # <class 'psycopg2.extensions.connection'>
    hook_conn_obj.autocommit = True
    df = pd.io.sql.read_sql(sql_template, con=hook_conn_obj)  # Runs SQL template with variables
except:
    # catches some errors #
    pass

return df
This seems to work. If anyone has any commentary or thoughts on a better way to achieve this, I'm still interested in learning from a discussion.
Thank you!
read_sql won't commit because, as the method name implies, its goal is to read data, not write it. This is a good design choice by pandas: it prevents accidental writes and allows interesting scenarios like running a procedure and reading its effects without persisting anything. read_sql's intent is to read, not to write, and expressing intent directly is a gold-standard principle.
A more explicit way to express your intent would be to execute (with commit) explicitly before fetchall. But because pandas offers no simple way to read from a cursor object, you would lose the convenience provided by read_sql and have to create the DataFrame yourself.
So, all in all, your solution is fine: by setting autocommit=True you are indicating that your database interactions will persist whatever they do, so there should be no accidents. It's a bit weird to read, but if you named your sql_template variable something like write_then_read_sql, or explained it in a docstring, the intent would be clearer.
I had a similar use case -- load data into SQL Server with Pandas, call a stored procedure that does heavy lifting and writes to tables, then capture the result set into a new DataFrame.
I solved it by using a context manager and explicitly committing the transaction:
# Connect to SQL Server
engine = sqlalchemy.create_engine('db_string')
with engine.connect() as connection:
    # Write dataframe to table with replace
    df.to_sql(name='myTable', con=connection, if_exists='replace')

    with connection.begin() as transaction:
        # Execute verification routine and capture results
        df_processed = pandas.read_sql(sql='exec sproc', con=connection)
        transaction.commit()

Why does ActiveRecord generate parameterized queries for most operations, but not for find_by?

I'm working on a basic Rails 4.0 app to learn how it works, and I've run into something that I can't seem to figure out. I've been doing queries to the default Sqlite DB via ActiveRecord, and for most queries, according to the debug output, it seems to generate parameterized queries, like so:
2.0.0-p247 :070 > file.save
(0.2ms) begin transaction
SQL (0.6ms) UPDATE "rep_files" SET "report_id" = ?, "file_name" = ?, "updated_at" = ?
WHERE "rep_files"."id" = 275 [["report_id", 3], ["file_name", "hello.jpg"],
["updated_at", Mon, 09 Sep 2013 04:30:19 UTC +00:00]]
(28.8ms) commit transaction
However, whenever I do a query using find_by, it seems to just stick the provided parameters into the generated SQL:
2.0.0-p247 :063 > file = RepFile.find_by(report_id: "29", file_name: "1.png")
RepFile Load (6.2ms) SELECT "rep_files".* FROM "rep_files" WHERE
"rep_files"."report_id" = 29 AND "rep_files"."file_name" = '1.png' LIMIT 1
It does seem to be escaping the parameters properly to prevent SQL injection:
2.0.0-p247 :066 > file = RepFile.find_by(report_id: "29", file_name: "';")
RepFile Load (0.3ms) SELECT "rep_files".* FROM "rep_files" WHERE
"rep_files"."report_id" = 29 AND "rep_files"."file_name" = ''';' LIMIT 1
However, it was my understanding that sending parameterized queries to the database was considered a better option than trying to escape query strings, since the parameterized option will cause the query data to bypass the database's parsing engine entirely.
So what's going on here? Is this some oddity in the Sqlite adapter or the way that the debug output is generated? If ActiveRecord is actually working like this, is there some reason for it? I can't find anything about this anywhere I've looked. I've started looking through the ActiveRecord code, but haven't figured anything out yet.
If we look at find_by in the source, we see this:
def find_by(*args)
  where(*args).take
end
The take just tacks the LIMIT 1 onto the query, so we're left with where. The where method can deal with arguments in various forms and with various placeholder formats; in particular, you can call where like this:
where('c = :pancakes', :pancakes => 6)
Named placeholders are quite nice when you have a complicated query that is best expressed with an SQL snippet, or a query that uses the same value several times, so they are a valuable feature. You can also apply where to the ActiveRecord::Relation you got from a previous where call, building the final query in pieces spread across several methods and scopes that don't know about each other. This gives where a problem: multiple pieces that don't know about each other can use the same named placeholder, and conflicts can arise. One way around this would be to rename the named placeholders to ensure uniqueness; another is to manually fill in the placeholders through string wrangling. A further complication is that different databases support different placeholder syntaxes. ActiveRecord has chosen to manually fill in the placeholders.
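As an illustration of that piecemeal composition, here are two scopes that don't know about each other (the scopes are made up; RepFile is the model from the question):

class RepFile < ActiveRecord::Base
  scope :for_report, ->(id)   { where('report_id = :report', report: id) }
  scope :named,      ->(name) { where('file_name = :name', name: name) }
end

# Chaining merges both conditions into a single relation before the SQL is built.
RepFile.for_report(29).named('1.png').take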
Summary: find_by doesn't use placeholders because where doesn't and where doesn't because it is easier to build the query piecemeal through string interpolation than it is to keep track of all the placeholders and database-specific syntaxes.

Getting the SQL from a Doctrine Migration

I have been researching a way to get the SQL statements that are built by a generated Migration file. These extend Doctrine_Migration_Base. Essentially I would like to save the SQL as change scripts.
The execution path leads me to Doctrine_Export which has methods that build the SQL statement and executes them. I have found no way of asking for just them. The export methods found in Doctrine_Export only operate on Doctrine_Record models and not Migration scripts.
From the command line './doctrine migrate version#' the path goes:
Doctrine_Cli::run(cmd)
Doctrine_Task_Migrate::setArguments(args)
Doctrine_Task_Migrate::execute()
Doctrine_Migration::migrate(to)
Doctrine_Migration_Process / Doctrine_Export: various create, drop, and alter methods with their SQL equivalents
Has anyone tackled this before? I really would not like to change Doctrine base files. Any help is greatly appreciated.
Could you make a dev server and do the migration on that, storing a SQL trace as you go? You don't have to keep the results, but you would get a list of every command.
Taking into account Rob Farley's suggestion, I modified:
Doctrine_Core::migrate
Doctrine_Task_Migrate::execute
When the execute method is called, the optional argument 'dryRun' is checked. If true, a 'Doctrine_Connection_Profiler' instance is created and the 'dryRun' value is passed on to the 'Doctrine_Core::migrate' method. A 'dryRun' value of true allows the changes to roll back when the SQL statements have finished executing. When the method returns, the profiler is parsed, and non-empty SQL statements not containing 'migration_version' are saved and displayed to the terminal.

Session.SetBatchSize does not change the batchsize

I create the session factory like this:
FluentConfiguration cfg =
    Fluently.Configure()
        .Database(MsSqlConfiguration.MsSql2005
            .ConnectionString(c => c.Is(dbConnectionString))
            .AdoNetBatchSize(100)
            .ShowSql())
        .Mappings(m => m.FluentMappings.AddFromAssembly(mappingAssembly))
        .Mappings(m => m.HbmMappings.AddFromAssembly(mappingAssembly));
If I later call session.SetBatchSize(someOtherSize); during program execution, nothing happens; it is as if this command were just a mock. Why is that?
Thanks in advance.
I have no idea if and how NHProf reports batching, but using the normal SQL Profiler you cannot notice it.
To verify how it works, and that it is indeed enabled as I set it up, I had to debug NHibernate's code.
What NHibernate does is add each generated SQL command to a collection of SQL commands that is flushed (sent to the DB) when the defined BatchSize is reached or when there are no more SQL commands to execute.
Observing the SQL Profiler this is not noticeable, as the SQL queries still appear individually, but NHibernate actually sends the commands to the DB in batches.
This way, if you execute 10 SQL statements without setting the BatchSize, NHibernate will talk to the DB 10 times, but with the BatchSize set to 10 it will talk to the DB only once, sending all the SQL queries in one go. Unfortunately this is not noticeable in the SQL Profiler...
How are you checking that batching actually occurs and what batch size is being used? SQL Profiler does not show batching; you have to use NHibernate Profiler to get a good understanding of what is being batched.
Looking at the NH source, session.SetBatchSize() does what it says it does, so it should work :)
Don't forget to set <property name="adonet.batch_size">3</property> in the config file. The max value, I think, is 50, but NH doesn't throw any error if you set a higher value, and I don't know what the default value is.