Seeding thousands of records in Rails 3 - ruby-on-rails-3

I have several tables that need to be populated when I move my project to production, and each of these tables has several thousand rows. I have the data stored in a CSV file now, but using the seeds.rb file seems like it would be cumbersome because the data from my CSV file would have to be reformatted to fit the seeds.rb format. If this were only a handful of rows, it wouldn't be such a problem. What would be the best/easiest way to get this data loaded?

I would probably use a small custom script and the FasterCSV gem, which has good tools for parsing CSV files quickly. Then you can map the fields to model attributes.
I would implement this via TDD as model methods and use ActiveRecord's create method to instantiate instances. While this is slower than writing straight SQL, it's safer in that your data runs through all the model validations, so you have better confidence in the data integrity.
Flushing out data-integrity issues from a legacy data import up front will save you a lot of trouble later.
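Roughly, an import script along those lines might look like the sketch below (the Product model, its columns, and the CSV path are placeholders; on Ruby 1.9+ the standard library's CSV class is what FasterCSV became):
require 'csv' # FasterCSV was merged into the stdlib as CSV in Ruby 1.9

# Each CSV row becomes a model instance, so every validation still runs.
CSV.foreach('db/data/products.csv', headers: true) do |row|
  Product.create!(
    name:  row['name'],
    price: row['price'],
    sku:   row['sku']
  )
end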

If I were doing this with MySQL, I'd use MySQL's LOAD DATA feature, like:
LOAD DATA INFILE '/my/rails/project/data.csv' REPLACE INTO TABLE table_name FIELDS TERMINATED BY 'field_terminator' LINES TERMINATED BY 'line_terminator';
If the tables' designs do not change frequently, you could put such a statement into a Perl, Ruby, or shell script.
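In a Rails app you could put that statement in a Rake task rather than a separate Perl or shell script. A rough sketch (the path, table name, and terminators are placeholders; the plain INFILE form needs the FILE privilege and a file on the database server, while LOAD DATA LOCAL INFILE works for a file on the app server if the adapter allows it):
# lib/tasks/import.rake: sketch only; adjust terminators to match your CSV
namespace :import do
  desc "Bulk-load data.csv with MySQL's LOAD DATA"
  task :csv => :environment do
    ActiveRecord::Base.connection.execute(<<-'SQL')
      LOAD DATA INFILE '/my/rails/project/data.csv'
      REPLACE INTO TABLE table_name
      FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    SQL
  end
end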

Like others have mentioned, many databases have bulk-load support. But if you're looking for a Rails-style solution, ar-extensions has bulk insert:
http://rorstuff.blogspot.com/2010/05/activerecord-bulk-insertion-of-data-in.html
https://github.com/zdennis/ar-extensions
You can also check out ActiveWarehouse.

The fast_seeder gem will help you. It populates the database from CSV files using multi-row inserts and supports different DB adapters.

As most of the answers are outdated, this gem will help you: https://github.com/zdennis/activerecord-import.
E.g. if you have a collection of books, you can use:
Book.import(books)
It will execute only one SQL statement.
The gem also works with associations.
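For example, building the records in memory and importing them in one statement could look like this sketch (the Book model, its columns, and the CSV path are made up for illustration):
# Gemfile: gem 'activerecord-import'
require 'csv'

rows  = CSV.read('db/data/books.csv', headers: true)
books = rows.map { |row| Book.new(title: row['title'], author: row['author']) }

# One multi-row INSERT instead of thousands of single-row ones.
# Pass validate: false if the data is already trusted and speed matters.
Book.import(books)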

Related

How to load csv file to multiple tables in postgres (mainly concerned about best practice)

I'm new to DB/Postgres SQL.
Scenario:
I need to load a CSV file into a Postgres DB, and the CSV data needs to be loaded into multiple tables according to the DB schema. I'm looking for a good design using a Python script.
My thought:
1. Load the CSV file into an intermediate (staging) table in Postgres
2. Write a trigger on the intermediate table to insert the data into the multiple target tables on each insert
3. The trigger also truncates the staging data at the end
Any suggestions for a better design or other ways to do this without ETL tools, and any info on useful modules in Python 3?
Thanks.
Rather than using a trigger, use explicit INSERT or UPDATE statements. That is probably faster, since a set-based statement is not invoked once per row.
Apart from that, your procedure is fine.
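A minimal sketch of that staging-table approach, written here with the Ruby pg gem to match the rest of this thread (the same two SQL steps carry over unchanged to a Python script with psycopg2; table and column names are invented):
require 'pg'

conn = PG.connect(dbname: 'mydb')

# 1. Client-side COPY into a staging table (no superuser needed).
conn.exec('CREATE TEMP TABLE staging (name text, email text, city text)')
conn.copy_data('COPY staging FROM STDIN WITH (FORMAT csv, HEADER true)') do
  File.foreach('data.csv') { |line| conn.put_copy_data(line) }
end

# 2. Distribute into the real tables with ordinary set-based statements.
conn.exec('INSERT INTO people (name, email) SELECT DISTINCT name, email FROM staging')
conn.exec('INSERT INTO addresses (email, city) SELECT email, city FROM staging')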

Do while loop with GPDB using talend

I have a very large data set in GPDB from which I need to extract close to 3.5 million records. I use this for a flat file which is then used to load different tables. I use Talend, and do a select * from the table using the tGreenplumInput component and feed that to a tFileOutputDelimited. However, due to the very large volume of the file, I run out of memory while executing it on the Talend server.
I lack superuser permissions and am unable to do a \copy to output it to a CSV file. I think something like a do-while loop or a tLoop with a more limited number of rows might work for me, but my table doesn't have any row_id or uid to distinguish the rows.
Please help me with suggestions on how to solve this. I appreciate any ideas. Thanks!
If your requirement is to load data into different tables from one table, then you do not need to extract to a file and then load from the file into the tables.
There is a component named tGreenplumRow which allows you to write SQL queries (DDL and DML) directly in it.
For example, you can place three INSERT statements inside this component; they will be executed one by one, separated by semicolons.

Migrating two columns in SQL script to the db of my rails app

I have a SQL script that has four columns and about 100 records. I only need two columns. I want to transfer the two columns into my seeds.rb file so that I can have these records in my db when I deploy my app. What would be the easiest way possible to do this? How would it look in my seeds.rb file?
The first thing to do is get the database into the format you want, and then create a database dump of some sort. MySQL makes this easier than SQLite. Put the INSERT statements into your seeds.rb file like this:
ActiveRecord::Base.connection.execute("INSERT INTO `example` (`abbreviation`,`name`)
VALUES
('ABC', 'Alphabet Broadcasting Company'),
('DEF', 'Denver Echo Factory'),
('GHI', 'Gimbal Helper Industries')
")
Although seeds.rb is a handy way of pre-populating certain critical things, like essential administrators or lookup tables for countries, it does become difficult to maintain over time, as seeds.rb must always conform to the latest schema.
It may be easier to simply deploy a seed SQLite file and migrate that instead. With MySQL you typically deploy and load a seed database dump to get things started, then migrate and enhance as required.
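If you would rather keep the seed data as plain Ruby so it keeps working across schema and adapter changes, a minimal re-runnable sketch could look like this (the Network model name is invented to match the example data above):
# db/seeds.rb: only the two columns we care about, safe to run more than once
[
  ['ABC', 'Alphabet Broadcasting Company'],
  ['DEF', 'Denver Echo Factory'],
  ['GHI', 'Gimbal Helper Industries']
].each do |abbreviation, name|
  unless Network.exists?(abbreviation: abbreviation)
    Network.create!(abbreviation: abbreviation, name: name)
  end
end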

What is the easiest way to add a bunch of content to a SQL database?

Nothing technical here. Suppose I have a lot of different categorized data and I would like to create a database out of it. Would someone literally hand-enter all that info with SQL code itself? Or do some people make a mock website just to input data? What are some of your strategies?
If there were no way to do it automatically, then a mock website would be the way to go: you could even use it with several people at once, actually multiplying the input speed (as long as you don't mess up assigning each of them a different part of the data).
What format is your data in, and how much of it is there? If it's Excel, then SQL Server has tools to import it. I'm not sure if MySQL has anything similar. Even if it doesn't, one other technique I have used with Excel data is to use a formula to concatenate as required to generate the INSERT statements, then just paste those into a query window and run them.
I wouldn't build a website for it unless I was already building an admin site and wanted to test that with the initial load.
Most databases have a way to do bulk inserts or have tools for data import.
My strategies normally involve such tools.
Here is an example of importing a CSV file to SQL Server.
Most database servers provide a way to import data from a variety of formats, so you could look into that first.
If not, you could write a simple script or console application to parse your input data and write out a SQL script that inserts the data into the appropriate tables.
For example, if your data were in a CSV file, you would parse each line of the file and generate an INSERT statement to write out to a .sql file, as in the example and the script sketch below.
MyData.csv
1,2,3,'Test',4
2,3,4,'Test2',6
GeneratedInsert.sql
insert into table (col1,col2,col3,col4,col5) values (1,2,3,'Test',4)
insert into table (col1,col2,col3,col4,col5) values (2,3,4,'Test2',6)
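The "simple script" mentioned above can be very small. Here is a Ruby sketch that produces exactly that kind of output; the table name and column list are placeholders, and real data would need proper quoting and escaping:
require 'csv'

# Reads MyData.csv and writes one INSERT statement per row to GeneratedInsert.sql.
columns = %w[col1 col2 col3 col4 col5]
File.open('GeneratedInsert.sql', 'w') do |out|
  CSV.foreach('MyData.csv') do |row|
    out.puts "insert into table_name (#{columns.join(',')}) values (#{row.join(',')})"
  end
end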
MySQL has a statement, LOAD DATA INFILE, that is intended for loading bulk data from flat files. It's easy to use and much faster than alternative methods.
But first you do have to use SQL to design tables with fields that match the fields of your import data. That is, if you have a file with semicolon-separated data:
Titanic;1997;4 stars
Batman Begins;2005;5 stars
"Harry Potter and the Sorcerer's Stone";2001;3 stars
...
You would create a table:
CREATE TABLE Movies (
title VARCHAR(100) NOT NULL,
year YEAR NOT NULL,
rating VARCHAR(10)
);
Then load data:
LOAD DATA INFILE 'movies.txt' INTO TABLE Movies
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"';
Most web languages have some sort of auto-scaffolding that you can quickly set up. Useful for admin work as well, if your site is hosted without direct access to DB.
Otherwise, yeah - write the SQL statements. Useful to bring a database up as part of your build process.

How do I handle large SQL SERVER batch inserts?

I'm looking to execute a series of queries as part of a migration project. The scripts are generated by a tool which analyses the legacy database and then produces a script to map each of the old entities to an appropriate new record. The scripts run well for small entities, but some have records in the hundreds of thousands, which produces script files of around 80 MB.
What is the best way to run these scripts?
Is there an SQLCMD option from the prompt which deals with larger scripts?
I could also break the scripts down into further smaller scripts but I don't want to have to execute hundreds of scripts to perform the migration.
If possible, have the export tool modified to export a BULK INSERT-compatible file.
Barring that, you can write a program that parses the insert statements into something that BULK INSERT will accept.
BULK INSERT uses BCP format files, which come in traditional (non-XML) and XML flavours. Does each row have to get a new identity and use it in a child, and you can't get away with using SET IDENTITY_INSERT ON because the database design has changed so much? If so, I think you might be better off using SSIS or similar and doing a Merge Join once the identities are assigned. You could also load the data into staging tables in SQL using SSIS or BCP, and then use regular SQL (potentially within SSIS in a SQL task) with the OUTPUT INTO feature to capture the identities and use them in the children.
Just execute the script. We regularly run backup/restore scripts that are hundreds of MB in size. It only takes 30 seconds or so.
If it is critical not to block your server for that amount of time, you'll have to split it up a bit.
Also look into the --tab option of mysqldump, which outputs the data using TO OUTFILE; that is more efficient and faster to load.
It sounds like this is generating a single INSERT for each row, which is really going to be pretty slow. If they are all wrapped in a transaction too, that can also be slow (although the number of rows doesn't sound so big that it would make a transaction nearly impossible, the way a multi-million-row insert held in one transaction would).
You might be better off looking at ETL (DTS, SSIS, BCP, BULK INSERT, or some other tool) to migrate the data instead of scripting each insert.
You could break up the script and execute it in parts (especially if it currently makes it all one big transaction), and just automate the execution of the individual scripts using PowerShell or similar.
I've been looking into the "BULK INSERT" from file option but cannot see any examples of the file format. Can the file mix row formats, or does it always have to be consistent in a CSV fashion? The reason I ask is that I've got identities involved across various parent/child tables, which is why per-row inserts are currently being used.