Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I'm looking for recommendations of a good, free tool for generating sample data for the purpose of loading into test databases. By analogy, something that produces "lorem ipsum" text for any RDBMS. Features I'm looking for include:
Flexibility to generate data for an existing table definition.
Ability to generate small and large data sets (> 1 million rows or more).
Generate in SQL script format (INSERT statements) or else in a flat file format suitable for bulk import (which is usually faster).
A command-line interface for easy scripting.
Extensible, open source, written in a dynamic language (these are nice-to-haves, not strong requirements).
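To make the requirements above concrete, here is a minimal sketch in Python of what such a generator might do: produce rows lazily and render them either as a flat file (CSV) or as INSERT statements. The `users(id, name, age)` table, the name list, and all function names are assumptions made up for illustration; they do not come from any of the tools discussed here.

```python
import csv
import io
import random

# Hypothetical sample values; a real tool would ship much larger data files.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Edsger"]

def make_row(rng, row_id):
    """Build one row for a hypothetical users(id, name, age) table."""
    return {"id": row_id, "name": rng.choice(FIRST_NAMES), "age": rng.randint(18, 90)}

def generate_rows(n, seed=0):
    """Yield n rows lazily so large data sets never sit in memory at once."""
    rng = random.Random(seed)
    for row_id in range(1, n + 1):
        yield make_row(rng, row_id)

def to_csv(rows, out):
    """Write rows as a flat file suitable for bulk import."""
    writer = csv.DictWriter(out, fieldnames=["id", "name", "age"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

def to_insert(row, table="users"):
    """Render one row as an INSERT statement, doubling single quotes in strings."""
    name = row["name"].replace("'", "''")
    return f"INSERT INTO {table} (id, name, age) VALUES ({row['id']}, '{name}', {row['age']});"

buffer = io.StringIO()
to_csv(generate_rows(5), buffer)
statements = [to_insert(row) for row in generate_rows(5)]
```

Because `generate_rows` is a generator, the same code path should work for a million rows if the output is streamed to a file instead of an in-memory buffer.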
PS: I did search for a duplicate question on StackOverflow, but I didn't find one. If there is one, I'll be grateful to get a pointer to it.
Thanks for the great responses everyone! I should amend my requirements that I use Mac OS X as my primary development environment, not Windows (though I did say command-line interface is desirable, and that practically rules out Windows). The Windows-specific suggestions will no doubt be useful to other readers of this question, though, so thanks.
Here is my conclusion:
GenerateData:
PHP web app interface, not command line
limited to generating 200 records (or pay $20 for a license to generate 5,000 records)
RedGate SQL Data Generator
not free, price $295
requires Windows, .NET, SQL Server
Visual Studio 2008 Database Edition
requires Windows
requires costly MSDN or ISV subscription
Banner Datatect
not free, price $595
requires Windows (?)
no support for MySQL (?)
GUI, not command line or scriptable
Ruby Faker gem
way too slow to use ActiveRecord for bulk data load
Super Smack
chiefly a load-testing tool, with a random data generator built in
pretty simple to use nevertheless
overall a good runner-up tool
Databene Benerator
best solution for my needs
XML scripts, compatible with DbUnit
open source (GPL) Java code
command-line usage
access many databases directly via JDBC
Take a look at databene benerator, a test data generator that looks close to your requirements.
it can generate data for an existing table definition (or even anonymize production data)
it can generate large data sets (unlimited size)
it supports various input formats (CSV, Flat Files, DBUnit) and output formats (CSV, Flat Files, DBUnit, XML, Excel, Scripts)
it can be used on the command line or through a maven plugin
it's open source and customizable
I would give it a try.
BTW, a list of similar products is available on databene benerator's web site.
This looks quite promising: generatedata.com. Open-source, has lots of built-in data types.
There are several others listed here: Test (Sample) Data Generators. I don't have experience with any of them, but a few on that list look like they could be pretty decent.
Try http://www.mockaroo.com
This is a tool my company made to help test our own applications. We've made it free for anyone to use. It's basically the Forgery ruby gem with a web app wrapped around it. You can generate data in CSV, txt, or SQL formats. Hope this helps.
I know you said you were looking for a free tool, but this is one case where I would suggest that spending $295 will pay you back quickly in time saved. I've been using the RedGate tool SQL Data Generator for the last year and it is, to be short, an awesome tool. It allows for setting dependencies between columns, generates realistic data for business objects such as phone numbers, urls, names, etc. I can honestly state that this tool has paid for itself time and time again.
If you are looking or willing to use something MySQL-specific, you could take a look at Super Smack. It is currently maintained by Tony Bourke.
Super Smack allows you to generate random data to insert into your database tables. It is customizable, allowing you to use the packaged words.dat file, or any test data of your choice.
One of the nice things about it is that it is command-line driven and highly customizable. There are some fairly decent examples of usage in the book High Performance MySQL, which is also excerpted here.
Not sure if that is along the lines of what you are looking for, but just a thought.
A Ruby script with one of the available fake data generators should do you just fine.
http://faker.rubyforge.org/ is one such gem. Unfortunately, this doesn't fulfill all your requirements.
Here is another: http://random-data.rubyforge.org/
And a tutorial for using Faker: http://www.rubyandhow.com/how-to-generate-fake-names-addresses-in-ruby/
RE: Flexibility to generate data for an existing table definition. Combine the Faker gem with one of the available ORMs. ActiveRecord would probably be easiest.
Normally very costly, but if you are a small ISV you can get Visual Studio 2008 Database Edition very cheaply; see the Empower and BizSpark promotions. It provides a lot more functionality than just generating test data (integration with SCC, unit testing, DB refactoring, etc.).
As I like the fact that Red Gate tools are so easy to learn, I would still look at SQL Data Generator.
A tool that really should not be missing from the list is the Data Generator from Datanamic, which populates databases directly or generates insert scripts, has a large collection of pre-installed generators, and supports multiple databases.
http://www.datanamic.com/datagenerator/index.html
I know you're not looking for actual lorem ipsum text; but in case anyone else searches for an actual lorem ipsum generator and finds this thread: lipsum.com does a great job of it.
Not free, but Visual Studio 2008 Database Edition is a good alternative and it provides a lot more functionality (Integration with SCC, Unit Testing, DB Refactoring, etc...)
I use a tool called Datatect:
Generates data to flat files or any ODBC compliant database.
Extensible via VBScript.
Referentially aware; will populate foreign keys with values from parent table.
Data is context aware; city, state and phone numbers for given zip codes, first names and titles with gender.
Can create custom, complex data types.
Generate over 2 billion proper names, business names, street addresses, cities, states, and zip codes.
I've used this tool to generate as many as 40,000,000 rows of data to a SQLServer database, and 8,000,000 rows of data to an Oracle database.
I am in no way affiliated with Banner Systems, just a satisfied customer.
Here is the list of such tools (both free and commercial):
http://c2.com/cgi/wiki?TestDataGenerator
For OS X there is Data Creator (US $ 7). The download is free for testing purposes, so you can use it to evaluate the software and its features.
It requires OS X Lion or later. It can generate many different field types and has a custom export mode plus some presets (TSV, CSV, HTML table, web page with a table inside).
http://www.tensionsoftware.com/osx/datacreator/
here at the App Store:
https://itunes.apple.com/us/app/data-creator/id491686136?mt=12
You can use DbSchema, www.dbschema.com it's a database management tool and it has a Random Data Generator to populate your database.
Not a direct answer to your question, but this can be helpful for certain kinds of data:
Fake Name Generator can be useful - http://www.fakenamegenerator.com/ - not for everything, but for user accounts or things like that. AFAIK they support bulk orders.
+1 for Benerator: I tried 3 or 4 of the other tools on offer (including dbmonster) but found Benerator to be very quick, to deliver realistic data and to be flexible. I also got very quick & helpful feedback from the tool's creator when I posted on the forum.
Just a bit of background on where my question is coming from: my company has multiple databases across the globe that use the same schema, and one of my department's responsibilities is to monitor them and make sure all these DBs are in sync from a schema SQL change perspective.
Now, my question is whether anyone knows of any software/tool with a front-end UI that is able to do the following (the lower the number, the more important it is to have):
Able to track what SQL code change was applied on which database and when. Basically, if we write a SQL query that changes the structure of a table and we need it applied to 80% or 100% of the DBs, the tool will tell me, either via manual input or some automatic check, that yes, this was indeed applied.
Code distribution tool: we give it the query or a file that contains the code and it's able to push to the Databases it needs to (and create the audit log for that)
Code/object repository: keeps track of what was custom developed and pushed to the databases
I know SSIS might be able to do some of these things, but we need a tool that also has a simple frontend interface that can be accessed by non-IT personnel. (*clarification: we are not planning on giving non-DBA people access to change things, just to the audit aspect of said tool)
I've tried searching the internet, but I have a feeling I'm not using the right vocabulary to get the results I'm looking for.
Hence I wanted to see if the community was aware of any such tool or something similar.
Try searching for one of these two types of systems:
Release/Build/Deployment Automation: Complex programs like Serena that have modules for pushing, tracking, and auditing any kind of software, anywhere. These will include all the GUI bells and whistles. But you'll have to deal with extra databases, configuration, agents, workflows, consultants(?), etc. These programs are geared more towards developers.
Remote Execution/Configuration Management: Simpler programs like Salt, Fabric, and Ansible that let you run operating system commands anywhere. They don't offer as many features, and you have to do more of the work yourself, but in some ways that's liberating. If you know exactly what commands you want to run you don't need some other program holding your hand. These programs are geared more towards administrators.
From a database administrator's point of view, the main problem with those types of programs is that none of them are relational. Yes they can connect to a database and run a script, but none of them really speak SQL. Their native languages are Java, XML, SSH, etc. There's nothing wrong with those technologies, but if you only care about databases you don't want to deal with all that complexity.
If you're not happy with either of those types of programs I recommend you look at my open source program Method5. It is a remote execution program built as an extension to Oracle SQL. It works entirely inside an Oracle database, so you can install it yourself and won't need any additional websites, agents, configuration files, GUIs, etc.
Based on your comment about getting bogged down by links, and my answer to your question about half a year ago, I think this is the kind of program you were gradually heading towards creating. It took my team a couple thousand hours of developing and testing to get it right so you were probably wise to give up on making your own.
To specifically answer your requirements:
Tracking: Changes are stored in an audit trail. But more importantly it has the ability and a pre-built script to compare an unlimited number of schemas, all in one view. At the end of the day what you really want to know is "are my schemas the same", not necessarily "did the same thing get run everywhere?".
Code Distribution: If you just have SQL or PL/SQL, deploying it through Method5 is as easy as it can possibly get. Just specify what you want to run, and where you want to run it, like this: select * from table(m5('create index ...', 'dev, qa, prodDB1, prodDB2')); The program does not (yet) run SQL*Plus scripts. But when you have the ability to run SQL and PL/SQL so easily there's little need for SQL*Plus.
Code Repository: All executions are stored in a simple table, M5_AUDIT. It contains the code, who ran it, where they ran it, and how they ran it. It wasn't designed to be a repository like SVN but it's good enough for simple auditing and tracking code.
Method5 does not contain a GUI but in some ways I consider that to be a feature. Since everything is done relationally, everything is in a simple table. You can use any of your existing GUIs - Toad, PL/SQL Developer, Excel, Apex, etc. It's a robust back-end solution that will hopefully make a good foundation for easily building a simple front end.
I am working on Selenium automation using WebDriver. The framework takes a keyword-driven and data-driven approach. I handle all the inputs (objects, data, and test configuration) from Microsoft Excel.
Now the client wants to use a database. He is asking me which is better to use in the framework: a database or Microsoft Excel. I have to reply to him with valid points.
Which one is better to use in the framework, and why is the other one not as good?
This question really requires an opinionated answer, but because there are some important benefits from one or the other I will still answer it!
With Excel, data entry is easy and accessible to non-technical testers. It has good sorting ability and the data can be organised in a pretty intuitive fashion. However, you cannot easily add or alter data during a test (I understand you could do this, but I would think this is overboard). This means data must be organised beforehand and be specific to the test requirements.
In a database data can come from queries and be altered on the fly. You could write helper functions that populate required data if it can't be found in the db. This means tests can run without worrying about what data is currently in the db (to an extent) and can alleviate some project issues. However, with a database there can be issues importing/exporting and there could be a lot of coding overhead. Also, obviously non-technical testers will find it hard to change the data.
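The "helper functions that populate required data if it can't be found" idea could look roughly like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in; the `users` table and the `ensure_user` helper are hypothetical names, not part of Selenium or any framework mentioned here.

```python
import sqlite3

def ensure_user(conn, username):
    """Return the id of `username`, inserting the row first if it is missing."""
    row = conn.execute(
        "SELECT id FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO users (username) VALUES (?)", (username,))
    conn.commit()
    return cur.lastrowid

# In-memory database standing in for the real test database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE)")

first = ensure_user(conn, "test_admin")   # row is created here
second = ensure_user(conn, "test_admin")  # row already exists, nothing is inserted
```

A test can call such a helper in its setup step and then run regardless of what the database contained beforehand, which is the benefit described above.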
As I said this is really an opinion in the end and is very much a project by project decision.
In my experience, efficiency plays an integral role when comparing these two approaches. Using a database tends to slow down performance, whereas using a file (Excel) is faster. In our company's automation suites we rarely use databases. Instead, we push data into files and use them as the repository.
I'm about to release a FOSS data generator that can generate random yet meaningful data in CSV format. Rather belatedly, I guess, I need to poll the state of the art for such products - because if there is a well known and useful existing tool, I can write my work off to experience. I am aware of a couple of SQL Server specific tools, but mine is not database specific.
So, links? And if you have used such a product,
what features did you find it was missing?
Edit: To add a bit more info on my tool (Ooh, Matron!) it is intended to allow generation of any kind of random data from existing data files, and
supports weighting. It is XML based (sorry, folks) and lets you say things like:
<pick distribute="20,80" >
<datafile file="femalenames.dat"/>
<datafile file="malenames.dat"/>
</pick>
to select female names about 20% of the time and male names 80% of the time.
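For readers who want to see the weighting idea outside of XML, here is a rough Python equivalent using the standard library's `random.choices`. It is only an illustration of a 20/80 weighted pick, not how the tool above is implemented; the name lists are made-up stand-ins for the contents of the two data files.

```python
import random

# Stand-ins for the contents of femalenames.dat and malenames.dat.
FEMALE_NAMES = ["Alice", "Beth", "Carol"]
MALE_NAMES = ["Dave", "Eric", "Frank"]

def pick_name(rng):
    """Choose a source list with 20/80 weighting, then a name uniformly from it."""
    source = rng.choices([FEMALE_NAMES, MALE_NAMES], weights=[20, 80], k=1)[0]
    return rng.choice(source)

rng = random.Random(1)  # fixed seed so runs are repeatable
names = [pick_name(rng) for _ in range(10_000)]
female_share = sum(name in FEMALE_NAMES for name in names) / len(names)
```

With 10,000 draws, `female_share` should land close to 0.2, though the exact value depends on the seed.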
But the purpose of this question is not to describe my product but to get info on other tools.
Latest: If anyone is interested, they can get the alpha of my data generator at http://code.google.com/p/csvtest
That can be a one-liner in R where I use the littler scripting front-end:
# generate the data as a one-liner from the command-line
# we set the RNG seed, and draw from a bunch of distributions
# indented just to fit the box here
edd@ron:~$ r -e'set.seed(42); write.csv(data.frame(y=runif(10), x1=rnorm(10),
x2=rt(10,4), x3=rpois(10, 0.4)), file="/tmp/neil.csv",
quote=FALSE, row.names=FALSE)'
edd@ron:~$ cat /tmp/neil.csv
y,x1,x2,x3
0.914806043496355,-0.106124516091484,0.830735621223563,0
0.937075413297862,1.51152199743894,1.6707628713402,0
0.286139534786344,-0.0946590384130976,-0.282485683052060,0
0.830447626067325,2.01842371387704,0.714442314565005,0
0.641745518893003,-0.062714099052421,-1.08008578470128,0
0.519095949130133,1.30486965422349,2.28674786332467,0
0.736588314641267,2.28664539270111,-0.73270267483628,1
0.134666597237810,-1.38886070111234,-1.45317770550920,1
0.656992290401831,-0.278788766817371,-1.01676025893376,1
0.70506478403695,-0.133321336393658,0.404860813371462,0
edd@ron:~$
You have not said anything about your data-generating process, but rest assured that R can probably cope with just about any requirement, including multivariate normal, t, skew-t, and more. The (six different) random-number generators in R are also of very high quality.
R can also write to DBs, or read parameters from them, and if it needs to be on Windoze then the Rscript front-end could be used instead of littler.
I asked a similar question some months ago:
Tools for Generating Mock Data?
I got some sincere suggestions, but most were not suitable for my needs. Either expensive (non-free) software, or else not flexible enough w.r.t. data types and database structure, or range of mock data, or way too slow (e.g. the Rails ActiveRecord solution).
Features I was looking for were:
Generate mock data to fill existing database tables
Quick to generate > 1 million rows
Produce either SQL script format or flat file suitable for importing
Scriptable command-line interface, not a GUI
Not dependent on Microsoft Windows environment
Nice-to-have features:
Extensible/configurable
Open-source, free license
Written in a dynamic language like Perl/PHP/Python
Point it at a database and let it "discover" the metadata
Integrated with testing tools (e.g. DbUnit)
Option to fill directly into the database as it generates data
The answer I accepted was Databene Benerator. Though since asking the question, I admit I haven't used it very much.
I was surprised that even when asking the community, the range of tools for generating mock data was so thin. This seems like a niche waiting to be filled! I'll be interested to see what you release.
Closed 9 years ago.
The community reviewed whether to reopen this question 12 months ago and left it closed:
Original close reason(s) were not resolved
I am interested in learning how a database engine works (i.e. the internals of it). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) as well as a pretty good understanding of compiler theory (and have implemented a very simple interpreter) but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and I couldn't find any, so I am hoping someone else can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant to this. It doesn't have to be an on-disk database - even an in-memory database is fine (if it is easier) because I just want to learn the principles behind it.
Many thanks for your help.
If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/
The answer to this question is a huge one. Expect a PhD thesis to have it answered 100% ;)
But we can think about the problems one by one:
How to store the data internally:
You should have a data file containing your database objects, and a caching mechanism to load the data in focus, plus some data around it, into RAM.
Assume you have a table with some data. We would create a data format to convert this table into a binary file by agreeing on a column delimiter and a row delimiter, and by making sure the delimiter pattern is never used in the data itself. For example, if you have selected <*> to separate columns, you should validate that the data you place in this table does not contain this pattern. You could also use a row header and a column header: the row header specifies the size of the row and an internal indexing number to speed up your search, and the start of each column holds the length of that column.
like "Adam", 1, 11.1, "123 ABC Street POBox 456"
you can have it like
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
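The header-with-length variant can be sketched in Python with the `struct` module. This is one possible layout chosen for illustration (a column count, then a type tag, a length, and the payload for each column); it is not the exact textual format shown above, and the function names are hypothetical. Because every value carries its own length, no delimiter has to be kept out of the data.

```python
import struct

def encode_row(values):
    """Encode a row as: column count, then (type tag, length, payload) per column."""
    parts = [struct.pack("<H", len(values))]
    for value in values:
        if isinstance(value, str):
            tag, payload = b"S", value.encode("utf-8")
        elif isinstance(value, int):
            tag, payload = b"I", struct.pack("<q", value)
        elif isinstance(value, float):
            tag, payload = b"F", struct.pack("<d", value)
        else:
            raise TypeError(f"unsupported column type: {type(value).__name__}")
        parts.append(tag + struct.pack("<I", len(payload)) + payload)
    return b"".join(parts)

def decode_row(data):
    """Reverse encode_row by walking the buffer one column header at a time."""
    (count,) = struct.unpack_from("<H", data, 0)
    offset = 2
    values = []
    for _ in range(count):
        tag = data[offset:offset + 1]
        (length,) = struct.unpack_from("<I", data, offset + 1)
        payload = data[offset + 5:offset + 5 + length]
        offset += 5 + length
        if tag == b"S":
            values.append(payload.decode("utf-8"))
        elif tag == b"I":
            values.append(struct.unpack("<q", payload)[0])
        else:
            values.append(struct.unpack("<d", payload)[0])
    return values

row = ["Adam", 1, 11.1, "123 ABC Street POBox 456"]
encoded = encode_row(row)
decoded = decode_row(encoded)
```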
How to find items quickly
Try using hashing and indexing to point at the stored and cached data based on different criteria.
Taking the same example as above, you could sort the values of the first column and store them in a separate object that points at the row ids of the items, sorted alphabetically, and so on.
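As a small illustration of that separate sorted object, the Python sketch below keeps the rows in insertion order and builds an index of (name, row id) pairs that can be binary-searched with `bisect`. The sample rows and the `find_by_name` helper are invented for this example; a real engine would typically use a B-tree or hash structure on disk.

```python
import bisect

# Rows stay in insertion order; the list position acts as the row id.
rows = [
    ("Eve", 3),
    ("Adam", 1),
    ("Dana", 4),
    ("Bob", 2),
]

# The index holds (key, row_id) pairs sorted by key.
name_index = sorted((name, row_id) for row_id, (name, _) in enumerate(rows))

def find_by_name(name):
    """Binary-search the index and return every row whose first column equals name."""
    pos = bisect.bisect_left(name_index, (name, -1))
    matches = []
    while pos < len(name_index) and name_index[pos][0] == name:
        matches.append(rows[name_index[pos][1]])
        pos += 1
    return matches
```

A lookup such as `find_by_name("Dana")` touches only a logarithmic number of index entries instead of scanning every row.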
How to speed insert data
What I know from Oracle is that they insert data into a temporary place, both in RAM and on disk, and do housekeeping on a periodic basis. The database engine is busy all the time optimizing its structure, but at the same time we do not want to lose data in case of a power failure or something like that.
So try to keep data in this temporary place with no sorting, append it to your original storage, and later on, when the system is free, re-sort your indexes and clear the temp area when done.
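A very reduced Python sketch of that staging idea follows. It is an assumption-laden toy, not a description of how Oracle works: the class and method names are hypothetical, everything lives in memory, and the on-disk copy that would protect against power failure is left out.

```python
import bisect

class BufferedTable:
    def __init__(self):
        self.sorted_keys = []  # main storage, kept sorted for fast lookups
        self.pending = []      # unsorted staging area; inserts are plain appends

    def insert(self, key):
        """Fast path: append to the staging area without any sorting."""
        self.pending.append(key)

    def contains(self, key):
        """Check main storage by binary search, then scan the staging area."""
        pos = bisect.bisect_left(self.sorted_keys, key)
        in_main = pos < len(self.sorted_keys) and self.sorted_keys[pos] == key
        return in_main or key in self.pending

    def housekeeping(self):
        """Merge the staging area into main storage, then clear it."""
        self.sorted_keys = sorted(self.sorted_keys + self.pending)
        self.pending.clear()

table = BufferedTable()
for key in [42, 7, 19]:
    table.insert(key)
found_before = table.contains(19)  # visible even before housekeeping runs
table.housekeeping()
```

The trade-off is that lookups must check both places until housekeeping runs, which is why the staging area should stay small.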
good luck, great project.
There are books on the topic. A good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom.
SQLite was mentioned before, but I want to add something.
I personally learned a lot by studying SQLite. The interesting thing is that I did not go to the source code (though I did have a short look). I learned much by reading the technical material and especially by looking at the internal commands it generates. It has its own stack-based interpreter inside, and you can read the P-Code it generates internally just by using EXPLAIN. Thus you can see how various constructs are translated to the low-level engine (that is surprisingly simple -- but that is also the secret of its stability and efficiency).
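One way to try this without installing anything extra is Python's built-in `sqlite3` module, as in the sketch below. The `people` table is a made-up example. The exact instructions printed depend on the SQLite version bundled with your Python, so no expected output is shown; note also that current SQLite documentation describes the virtual machine as register-based rather than stack-based.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")

# EXPLAIN returns one row per virtual-machine instruction:
# (addr, opcode, p1, p2, p3, p4, p5, comment)
program = conn.execute("EXPLAIN SELECT name FROM people WHERE id = 1").fetchall()
opcodes = [row[1] for row in program]

# Print the instruction address, opcode, and the first three operands.
for addr, opcode, p1, p2, p3, *_ in program:
    print(addr, opcode, p1, p2, p3)
```

Comparing the listing for this primary-key lookup with one for an unindexed column (for example `WHERE name = 'Adam'`) is a quick way to see how the engine changes its strategy.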
I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite
Okay, I have found a site which has some information on SQL and implementation - it is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction
Maybe you can learn from HSQLDB. I think it offers a small and simple database for learning. You can look at the code since it is open source.
If MySQL interests you, I would also suggest this wiki page, which has got some information about how MySQL works. Also, you might want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your database engine. Please take a look at Apache CouchDB. It's what you would call a document-oriented database system.
Good Luck!
I am not sure whether it would fit your requirements, but I implemented a simple file-oriented database with support for simple statements (SELECT, INSERT, UPDATE) using Perl.
What I did was store each table as a file on disk, with entries in a well-defined pattern, and manipulate the data using built-in Linux tools like awk and sed. To improve efficiency, frequently accessed data was cached.
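The same table-per-file idea can be sketched in Python instead of Perl with awk and sed. This is not the implementation described above, just a minimal illustration under the assumption that each record is one CSV line; the `FileTable` class and the `users.tbl` file name are hypothetical, and caching and UPDATE are left out.

```python
import csv
import os
import tempfile

class FileTable:
    """One table per file; the first line holds column names, each record is one CSV line."""

    def __init__(self, path, columns):
        self.path = path
        self.columns = columns
        if not os.path.exists(path):
            with open(path, "w", newline="") as f:
                csv.writer(f).writerow(columns)

    def insert(self, *values):
        """INSERT: append one record to the end of the table file."""
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow(values)

    def select(self, where=lambda row: True):
        """SELECT: scan the file and keep the rows the predicate accepts."""
        with open(self.path, newline="") as f:
            return [row for row in csv.DictReader(f) if where(row)]

workdir = tempfile.mkdtemp()
users = FileTable(os.path.join(workdir, "users.tbl"), ["id", "name"])
users.insert(1, "Adam")
users.insert(2, "Eve")
result = users.select(lambda row: row["name"] == "Eve")
```

Values come back as strings because the file stores text only; a fuller version would record column types in the header.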
This question already has answers here:
Closed 14 years ago.
I'd like to stress test some of my SQL queries and find out about bad query plans and bottlenecks. I plan to fill some tables with random test data.
Are there tools or a set of scripts available for this purpose, preferably for SQL Server?
Thanks!
UPDATE: Sorry, didn't know these two question already existed:
Data generators for SQL server?
Creating test data in a database
This website will generate reams of customized data for you.
From that site:
Ever needed custom formatted sample / test data, like, bad? Well, that's the idea of the Data Generator. It's a free, open source script written in JavaScript, PHP and MySQL that lets you quickly generate large volumes of custom data in a variety of formats for use in testing software, populating databases, and scoring with girls.
This site offers an online demo where you're welcome to tinker around to get a sense of what the script does, what features it offers and how it works. Then, once you've whet your appetite, there's a free, fully functional, GNU-licensed version available for download.
I've used this data generator with success in the past - may not be big enough for your needs though.