My company has a legacy micro-simulation program that simulates a population and changes to that population over a period of years.
For each year, the program produces a binary file with a record for each individual that holds their characteristics (e.g., age, marital status, income ... about 20 fields).
We currently have several utility programs that read these files and produce summary reports. The problem is that each time somebody wants a new report, a new utility program has to be written.
Changing the program so that the records are stored in a database instead of binary files is out of the question (I have asked ... several times). I have written a few programs that import the binary files into a database and then run queries on the tables I have created. The problem here is that it always takes longer to import the data and run the query than it does to run a utility program written in C++ that just reads the records one by one and accumulates the desired data. Often the binary files contain over 30 million records and the import step alone takes forever.
So here is my question. Is there anything out there that would allow me to specify the structure of my binary file and then run SQL queries on the file? I think you can use ODBC to run queries on plain text files, but I've never seen anything like that for binary files.
If there isn't anything available, what are the steps I would need to take to build something that could run a query directly on my file? I understand this would probably be way beyond my ability, but it can't hurt to know where I would need to start.
OpenAccess is a toolkit that you can use to build ODBC or JDBC drivers for arbitrary systems. Disclaimer: I've not used it, and another division of my company sells it.
It's possible using SSIS: Loading Binary Files into SQL Server Using SSIS
This might also be of interest: Reading and Writing Files in SQL Server using T-SQL
I do not have much experience with LINQ, but couldn't you use InteropServices to parse the binary files into C# objects and then query stuff out with LINQ's SQL?
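The "parse the records into objects, then query them" idea can be sketched without any database. The following is a minimal Python illustration, not a description of the actual file format: the three-field layout (age, marital status, income), the byte order and the function names are all assumptions made for the example. It shows a fixed-size record being described once and then aggregated in a single pass, roughly what a GROUP BY query would do.

```python
import io
import struct
from collections import defaultdict

# Hypothetical record layout (an assumption, not the real file format):
# little-endian int32 age, int8 marital status code, float64 income.
RECORD = struct.Struct("<ibd")

def iter_records(stream):
    """Yield (age, marital_status, income) tuples from a binary stream."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of file; a trailing partial record is ignored
        yield RECORD.unpack(chunk)

def mean_income_by_status(stream):
    """Rough equivalent of:
    SELECT marital_status, AVG(income) FROM people GROUP BY marital_status
    """
    totals = defaultdict(lambda: [0.0, 0])  # status -> [sum of income, count]
    for _age, status, income in iter_records(stream):
        totals[status][0] += income
        totals[status][1] += 1
    return {status: total / count for status, (total, count) in totals.items()}

# Usage: an in-memory stream stands in for one year's output file.
rows = [(30, 1, 40000.0), (45, 1, 60000.0), (22, 0, 18000.0)]
sample = io.BytesIO(b"".join(RECORD.pack(*row) for row in rows))
print(mean_income_by_status(sample))
```

With a real file, the stream would come from open(path, "rb"), and the format string would have to match the roughly 20 fields the simulation actually writes.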
I suppose this is somewhat of an extension of the question asked here.
However, I believe the linked OP's reason for reading a file with SQL Developer and my own are different. I am learning SQL and databases and am attempting to create a model database (as in, I won't be editing the data after insertion, just setting up search queries and whatnot). I want it to be large (over 100,000 entries), so I've created a C++ program that writes randomly generated entries for the database to a .txt file (one entry per line) instead of hard-coding the insertion of each entry. Now what I want to do is read the .txt file in SQL Developer and insert its contents into a table.
My problem lies in the fact that I am not able to create directories. I am using a university Oracle connection and I do not have the privileges to actually make a directory so that I can use UTL_FILE on my .txt file, as was answered in the linked question. Assuming there is no way for me to gain this permission, is there an alternate way I can read a .txt file to get data for my table? Is there a better way to go about creating "dummy data" for my database?
What I ended up doing to insert my mock data was change the way the .txt file was formatted. Instead of having my C++ code write the data one entry per row, I made the code write SQL INSERT statements to the .txt file, which I think is more or less what @toddlermenot was suggesting. After the C++ code had written as many inserts-with-mock-entries as I needed to the text file, I just copied and pasted it into SQL Developer and achieved the desired results.
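The approach described above (have a program write INSERT statements instead of raw rows) could look roughly like the sketch below. It is in Python rather than the C++ used in the post, and the table name, column names and value ranges are invented for illustration.

```python
import random

def make_insert_script(table, n_rows, seed=0):
    """Return a string of INSERT statements filled with random mock data.

    The columns (id, name, age) are hypothetical; adapt them to the real table.
    """
    rng = random.Random(seed)  # fixed seed so the script is reproducible
    first_names = ["Ann", "Bob", "Cho", "Dee"]
    lines = []
    for row_id in range(1, n_rows + 1):
        name = rng.choice(first_names)
        age = rng.randint(18, 90)
        lines.append(
            f"INSERT INTO {table} (id, name, age) VALUES ({row_id}, '{name}', {age});"
        )
    return "\n".join(lines)

# Usage: print a small script; redirect it to a file to paste into a SQL client.
print(make_insert_script("people", 3))
```

The names in the list contain no quote characters, so no escaping is needed here; real text data would need single quotes doubled before being embedded in a statement.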
My problem is a classic case of making the process more complicated than it needed to be.
Also, even though I did not use the method, @Multisync provided an interesting way to go about achieving my goal. I had no idea SQL had the tools for me to generate mock data. Thanks for introducing me to that.
I am wondering if it makes sense to use multiple SQL database files, such as SQLite databases (which I believe are single-file based?), as project files in my software. The project files contain basic information as well as multiple records (spectra) with lists of parameters (floating point values) and lists of measurement data (also floating point).
I currently use my own binary format, which is a pain to maintain. I tried to use XML which works very well, but the file sizes explode (500 kB before, 7.5 MB as XML).
Now I wonder if I can structure SQL databases to contain this kind of information and effectively load and save this data in my .NET software.
I am not very experienced in SQL, so:
Can SQL tables contain sub-tables (like subnodes in XML) or be linked to other tables?
E.g. Can I make a table for the record, and this table has subtables for the lists of measurement data and parameters?
Will this be more efficient than XML in terms of storage space?
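To make the "sub-table" idea concrete: in SQL the usual pattern is a second table whose rows point back to the parent row through a foreign key. The sketch below uses Python's built-in sqlite3 module rather than .NET, and the table and column names (spectrum, measurement, position, value) are assumptions chosen for the example.

```python
import sqlite3

def create_project_db(path=":memory:"):
    """Create a project file with a parent table and a linked child table."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        CREATE TABLE spectrum (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE measurement (
            spectrum_id INTEGER NOT NULL REFERENCES spectrum(id),
            position    INTEGER NOT NULL,
            value       REAL NOT NULL,
            PRIMARY KEY (spectrum_id, position)
        );
    """)
    return conn

def save_spectrum(conn, name, values):
    """Insert one spectrum row plus one measurement row per value."""
    cur = conn.execute("INSERT INTO spectrum (name) VALUES (?)", (name,))
    spectrum_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO measurement (spectrum_id, position, value) VALUES (?, ?, ?)",
        [(spectrum_id, i, v) for i, v in enumerate(values)],
    )
    conn.commit()
    return spectrum_id

def load_spectrum(conn, spectrum_id):
    """Return the measurement values of one spectrum in their original order."""
    rows = conn.execute(
        "SELECT value FROM measurement WHERE spectrum_id = ? ORDER BY position",
        (spectrum_id,),
    )
    return [value for (value,) in rows]

# Usage: store a short list of floats and read it back.
conn = create_project_db()
sid = save_spectrum(conn, "sample A", [1.5, 2.5, 3.0])
print(load_spectrum(conn, sid))
```

A list of parameters would follow the same pattern with its own child table. Storing one row per value is the most queryable layout; whether it beats a custom binary format on size depends on the data.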
I went with a SQLite database. It can be easily integrated into .NET using the System.Data.SQLite project, which can even be used with AnyCPU builds.
It is working very nicely, both performance and storage space wise.
You still need to take a lot of care with different versions of your databases. If you try to save data that follows a newer schema into a database file that uses an older schema, some columns or tables might not exist. You need to implement a method that migrates an old database file to the new schema.
The real advantage is that it is an open format, and I stand behind the premise that the stuff a user saves is theirs and does not need to be hidden in an obscure file structure if the latter does not bring any significant advantages to the table.
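One common way to handle the schema-version problem is to store a version number in the file itself and apply the missing upgrade steps when the file is opened. SQLite has a built-in integer slot for this, PRAGMA user_version. The sketch below is in Python, and the migration statements and table name are hypothetical; it is one possible scheme, not the one used in this project.

```python
import sqlite3

# Hypothetical migrations: entry i upgrades a file from schema version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE spectrum (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE spectrum ADD COLUMN comment TEXT",
]

def migrate(conn):
    """Apply every migration the file has not seen yet; return the final version."""
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    for index in range(version, len(MIGRATIONS)):
        conn.execute(MIGRATIONS[index])
        # PRAGMA does not accept bound parameters, so the integer is formatted in.
        conn.execute(f"PRAGMA user_version = {index + 1}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]

# Usage: a brand-new file starts at version 0 and is brought up to date.
conn = sqlite3.connect(":memory:")
print(migrate(conn))
```

Running migrate on an already current file does nothing, so it can be called every time a project file is opened. A file written by a newer program version (user_version larger than the list) is left untouched and would need separate handling.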
If the user can no longer use your software, he or she can still access all data, using other tools like the Database Browser for SQLite if need be.
For most database-backed projects I've worked on, there is a need to get "startup" or test data into the database before deploying the project. Examples of startup data: a table that lists all the countries in the world or a table that lists a bunch of colors that will be used to populate a color palette.
I've been using a system where I store all my startup data in an Excel spreadsheet (with one table per worksheet), then I have a utility script in SQL that (1) creates the database, (2) creates the schemas, (3) creates the tables (including primary and foreign keys), (4) connects to the spreadsheet as a linked server, and (5) inserts all the data into the tables.
I mostly like this system. I find it very easy to lay out columns in Excel, verify foreign key relationships using simple lookup functions, perform concatenation operations, copy in data from web tables or other spreadsheets, etc. One major disadvantage of this system is the need to sync up the columns in my worksheets any time I change a table definition.
I've been going through some tutorials to learn new .NET technologies or design patterns, and I've noticed that these typically involve using Visual Studio to create the database and add tables (rather than scripts), and the data is typically entered using the built-in designer. This has me wondering if maybe the way I'm doing it is not the most efficient or maintainable.
Questions
In general, do you find it preferable to build your whole database via scripts or a GUI designer, such as SSMSE or Visual Studio?
What method do you recommend for populating your database with startup or test data and why?
Clarification
Judging by the answers so far, I think I should clarify something. Assume that I have a significant amount of data (hundreds or thousands of rows) that needs to find its way into the database. This data could be sourced from various places, such as text files, spreadsheets, web tables, etc. I've received several suggestions to script this process using INSERT statements, but is this really viable when you're talking about a lot of data?
Which leads me to...
New questions
How would you write a SQL script to take the country data on this page and insert it into the database?
With Excel, I could just copy/paste the table into a worksheet and run my utility script, and I'd basically be done.
What if you later realized you needed a new column, CapitalCity?
With Excel, I could take that information from this page, paste it into Excel, and with a quick text-to-column manipulation, I'd have the data in the format I need.
I honestly didn't write this question to defend Excel as the best way or even a good way to get data into a database, but the answers so far don't seem to be addressing my main concern--how to get all this data into your database. Writing a script with hundreds of INSERT statements by hand would be extremely time consuming and error prone. Somehow, this script needs to be machine generated, but how?
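As an illustration of what "machine generated" could mean here, the sketch below turns tab-delimited text (the form a table takes when copied from a web page or a spreadsheet) into INSERT statements. It is written in Python; the function names and the choice to treat empty cells as NULL are assumptions, and every value is emitted as a quoted string, leaving type conversion to the database.

```python
import csv
import io

def sql_literal(value):
    """Quote a text value for SQL; empty cells become NULL (an assumption)."""
    if value == "":
        return "NULL"
    return "'" + value.replace("'", "''") + "'"  # double any embedded quotes

def inserts_from_delimited(text, table, delimiter="\t"):
    """Yield one INSERT statement per data row; the first row names the columns."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = ", ".join(header)
    for row in reader:
        values = ", ".join(sql_literal(cell) for cell in row)
        yield f"INSERT INTO {table} ({columns}) VALUES ({values});"

# Usage: two pasted rows of country data.
pasted = "Code\tName\nCI\tCote d'Ivoire\nFR\tFrance\n"
for statement in inserts_from_delimited(pasted, "Country"):
    print(statement)
```

Adding a column such as CapitalCity later would then mean regenerating the script from the updated source data rather than editing statements by hand.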
I think your current process is fine for seeding the database with initial data. It's simple, easy to maintain, and works for you. If you've got a good database design with adequate constraints then it doesn't really matter how you seed the initial data. You could use an intermediate tool to generate scripts but why bother?
SSIS has a steep learning curve, doesn't work well with source control (impossible to tell what changed between versions), and is very finicky about type conversions from Excel. There's also an issue with how many rows it reads ahead to determine the data type -- you're in deep trouble if your first x rows contain numbers stored as text.
1) I prefer to use scripts for several reasons.
• Scripts are easy to modify, and plus when I get ready to deploy my application to a production environment, I already have the scripts written so I'm all set.
• If I need to deploy my database to a different platform (like Oracle or MySQL) then it's easy to make minor modifications to the scripts to work on the target database.
• With scripts, I'm not dependent on a tool like Visual Studio to build and maintain the database.
2) I like good old-fashioned insert statements using a script. Again, at deployment time scripts are your best friend. At our shop, when we deploy our applications we have to have scripts ready for the DBAs to run, as that's what they expect.
I just find that scripts are simple, easy to maintain, and the "least common denominator" when it comes to creating a database and loading up data to it. By least common denominator, I mean that the majority of people (i.e. DBAs, other people in your shop that might not have Visual Studio) will be able to use them without any trouble.
The other thing that's important with scripts is that they force you to learn SQL and, more specifically, DDL (data definition language). While the hand-holding GUI tools are nice, there's no substitute for taking the time to learn SQL and DDL inside out. I've found that those skills are invaluable to have in almost any shop.
Frankly, I find the concept of using Excel here a bit scary. It obviously works, but it's creating a dependency on an ad-hoc data source that won't be resolved until much later. Last thing you want is to be in a mad rush to deploy a database and find out that the Excel file is mangled, or worse, missing entirely. I suppose the severity of this would vary from company to company as a function of risk tolerance, but I would be actively seeking to remove Excel from the equation, or at least remove it as a permanent fixture.
I always use scripts to create databases, because scripts are portable and repeatable - you can use (almost) the same script to create a development database, a QA database, a UAT database, and a production database. For this reason it's equally important to use scripts to modify existing databases.
I also always use a script to create bootstrap data (AKA startup data), and there's a very important reason for this: there's usually more scripting to be done afterward. Or at least there should be. Bootstrap data is almost invariably read-only, and as such, you should be placing it on a read-only filegroup to improve performance and prevent accidental changes. So you'll generally need to script the data first, then make the filegroup read-only.
On a more philosophical level, though, if this startup data is required for the database to work properly - and most of the time, it is - then you really ought to consider it part of the data definition itself, the metadata. For that reason, I don't think it's appropriate to have the data defined anywhere but in the same script or set of scripts that you use to create the database itself.
Test data is a little different, but in my experience you're usually trying to auto-generate that data in some fashion, which makes it even more important to use a script. You don't want to have to manually maintain an ad-hoc database of millions of rows for testing purposes.
If your problem is that the test or startup data comes from an external source - a web page, a CSV file, etc. - then I would handle this with an actual "configuration database." This way you don't have to validate references with VLOOKUPS as in Excel, you can actually enforce them.
Use SQL Server Integration Services (formerly DTS) to pull your external data from CSV, Excel, or wherever, into your configuration database - if you need to periodically refresh the data, you can save the SSIS package so it ends up being just a couple of clicks.
If you need to use Excel as an intermediary, i.e. to format or restructure some data from a web page, that's fine, but the important thing IMO is to get it out of Excel as soon as possible, and SSIS with a config database is an excellent repeatable method of doing that.
When you are ready to migrate the data from your configuration database into your application database, you can use SQL Server Management Studio to generate a script for the data (in case you don't already know - when you right click on the database, go to Tasks, Generate Scripts, and turn on "Script Data" in the Script Options). If you're really hardcore, you can actually script the scripting process, but I find that this usually takes less than a minute anyway.
It may sound like a lot of overhead, but in practice the effort is minimal. You set up your configuration database once, create an SSIS package once, and refresh the config data maybe once every few months or maybe never (this is the part you're already doing, and this part will become less work). Once that "setup" is out of the way, it's really just a few minutes to generate the script, which you can then use on all copies of the main database.
Since I use an object-relational mapper (Hibernate, there is also a .NET version), I prefer to generate such data in my programming language. The ORM then takes care of writing things into the database. I don't have to worry about changing column names in the data because I need to fix the mapping anyway. If refactoring is involved, it usually takes care of the startup/test data also.
Excel is an unnecessary component of this process.
Script the current version of the database components that you want to reuse, and add the script to your source control system. When you need to make changes in the future, either modify the entities in the database and regenerate the script, or modify the script and regenerate the database.
Avoid mixing Visual Studio's db designer and Excel as they only add complexity. Scripts and SQL Management Studio are your friends.
I'm project managing a development that's pulling data from all kinds of data sources (SQL, MySQL, FileMaker, Excel) before installing it into a new database structure, with records going back 10 years. Obviously I need to clean all this before exporting, and am wondering if there are any apps that can simplify this process for me, or any guides that I can follow.
Any help would be great
I do this all the time and, like Tom, do it in SQL Server using DTS or SSIS, depending on the version of the final database.
Some things I strongly recommend:
Archive all files received before you process them, especially if you are getting this data from outside sources; you may have to research old imports and go back to the raw data. After the archive is successful, copy the file to the processing location.
For large files especially, it is helpful to get some sort of flag file that is only copied after the other file is completed or, even better, which contains the number of records in the file. This can help prevent problems from corrupted or incomplete files.
Keep a log of number of records and start failing your jobs if the file size or number of records is suspect. Put in a method to process anyway if you find the change is correct. Sometimes they really did mean to cut the file in half but most of the time they didn't.
If possible get column headers in the file. You would be amazed at how often data sources change the columns, column names or order of the columns without advance warning and break imports. It is easier to check this before processing data if you have column headers.
Never import directly to a production table. Always better to use a staging table where you can check and clean data before putting it into prod.
Log each step of your process, so you can easily find what caused a failure.
If you are cleaning lots of files consider creating functions to do specific types of cleaning (phone number formatting for instance) then you can use the same function in multiple imports.
Excel files are evil. Look for places where leading zeros have been stripped in the import process.
I write my processes so I can run them as a test with a rollback at the end. Much better to do this than realize your dev data is so hopelessly messed up that you can't even do a valid test to be sure everything can be moved to prod.
Never do a new import on prod without doing it on dev first. Eyeball the records directly when you are starting a new import (not all of them if it is a large file of course, but a good sampling). If you think you should get 20 columns and it imports the first time as 21 columns, look at the records in that last column; many times that means the tab-delimited file had a tab somewhere in the data and the column data is off for that record.
Don't assume the data is correct, check it first. I've had first names in the last name column, phones in the zip code column etc.
Check for invalid characters, string data where there should just be numbers etc.
Any time it is possible, get the identifier from the people providing the data. Put this in a table that links to your identifier. This will save you from much duplication of records because the last name changed or the address changed.
There's lots more but this should get you started on thinking about building processes to protect your company's data by not importing bad stuff.
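A few of the checks above (column headers, record counts from a flag file, rows with an unexpected number of columns) can be run on the raw file before anything touches a staging table. The sketch below is in Python rather than DTS or SSIS, and the function name, its parameters and the message wording are all assumptions for illustration.

```python
import csv
import io

def validate_import(text, expected_header, expected_count, delimiter="\t"):
    """Return a list of problems found in a delimited file; empty means it passed.

    expected_count would come from a flag file or a log of earlier imports.
    """
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows or rows[0] != expected_header:
        problems.append("header mismatch")
        return problems  # column positions cannot be trusted, so stop here
    data = rows[1:]
    if len(data) != expected_count:
        problems.append(f"expected {expected_count} records, found {len(data)}")
    for line_no, row in enumerate(data, start=2):  # line 1 is the header
        if len(row) != len(expected_header):
            problems.append(
                f"line {line_no}: {len(row)} columns instead of {len(expected_header)}"
            )
    return problems

# Usage: the second record contains a stray tab inside the name field.
header = ["id", "name", "zip"]
received = "id\tname\tzip\n1\tAnn\t02134\n2\tBob\tJr\t90210\n"
print(validate_import(received, header, 2))
```

Reading every field as text, as csv.reader does, also keeps leading zeros such as those in the zip codes intact until the data is deliberately converted.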
I work mostly with Microsoft SQL Server, so that's where my expertise is, but SSIS can connect to a pretty big variety of data sources and is very good for ETL work. You can use it even if none of your data sources are actually MS SQL Server. That said, if you're not using MS SQL Server there is probably something out there that's better for this.
To provide a really good answer one would need to have a complete list of your data sources and destination(s) as well as any special tasks which you might need to complete along with any requirements for running the conversion (is it a one-time deal or do you need to be able to schedule it?)
Not sure about tools, but you're going to have to deal with:
synchronizing generated keys
synchronizing/normalizing data formats (e.g. different date formats)
synchronizing record structures.
orphan records
If the data is running/being updated while you're developing this process or moving data you're also going to need to capture the updates. When I've had to do this sort of thing in the past the best, not so great answer I had was to develop a set of scripts that ran in multiple iterations, so that I could develop and test the process iteratively before I moved any of the data. I found it helpful to have a script (I used a schema and an ant script, but it could be anything) that could clean/rebuild the destination database. It's also likely that you'll need to have some way of recording dirty/mismatched data.
In similar situations I personally have found Emacs and Python mighty useful, but I guess any text editor with good searching capabilities and a language with powerful string manipulation features should do the job. I first convert the data into flat text files and then:
Eyeball either the whole data set or a representative true random sample of the data.
Based on that, make conjectures about different columns ("doesn't allow nulls", "contains only values 'Y' and 'N'", "'start date' always precedes 'end date'", etc.).
Write scripts to check the conjectures.
Obviously this kind of method tends to focus on one table at a time and therefore only complements the checks made after uploading the data into a relational database.
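A script for checking such conjectures can be quite small. The following Python sketch encodes the three example conjectures as named predicates and reports which rows violate each one; the column names (id, active, start, end) and the ISO date format are assumptions made for the example.

```python
from datetime import date

# Each conjecture is a name plus a predicate over one row (a dict of strings).
CHECKS = {
    "id is present": lambda r: r["id"] not in (None, ""),
    "active is Y or N": lambda r: r["active"] in ("Y", "N"),
    "start precedes end": lambda r: (
        date.fromisoformat(r["start"]) < date.fromisoformat(r["end"])
    ),
}

def check_conjectures(rows):
    """Return {conjecture name: [indexes of rows that violate it]}."""
    failures = {name: [] for name in CHECKS}
    for index, row in enumerate(rows):
        for name, check in CHECKS.items():
            if not check(row):
                failures[name].append(index)
    # Report only the conjectures that were actually violated.
    return {name: found for name, found in failures.items() if found}

# Usage: the second row breaks all three conjectures.
rows = [
    {"id": "1", "active": "Y", "start": "2020-01-01", "end": "2020-06-01"},
    {"id": "", "active": "X", "start": "2021-01-01", "end": "2020-01-01"},
]
print(check_conjectures(rows))
```

Rows read with csv.DictReader from the flat text files would fit this function directly.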
One trick that comes in useful for me with this is to find a way for each type of data source to output a single column plus unique identifier at a time, in tab-delimited form say, so that you can clean it up using text tools (sed, awk, or TextMate's grep search), and then re-import it / update the (copy of!) original source.
It then becomes much quicker to clean up multiple sources, as you can re-use tools across them (e.g. capitalising last names - McKay, O'Leary, O'Neil, Da Silva, Von Braun, etc., fixing date formats, trimming whitespace) and to some extent automate the process (depending on the source).
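The "one column plus identifier" trick pairs naturally with small reusable cleaning functions. The sketch below is in Python rather than sed or awk, and covers only whitespace trimming and date formats; the list of accepted date formats is an assumption (day-first for slash dates), and name capitalisation is left out because its rules are too source-specific for a short example.

```python
from datetime import datetime

# Formats tried in order; the day/month order for slash dates is an assumption.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"]

def clean_whitespace(value):
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return " ".join(value.split())

def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = clean_whitespace(value)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_column(pairs, cleaner):
    """Apply one cleaning function to (identifier, value) pairs from any source."""
    return [(key, cleaner(value)) for key, value in pairs]

# Usage: the same helper cleans a date column exported from any of the sources.
exported = [("17", " 2014-03-07 "), ("18", "07/03/2014"), ("19", "unknown")]
print(clean_column(exported, normalize_date))
```

Values that come back as None are the ones to record as dirty data and review by hand.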
I'm working on a tool to generate TSV files for import into a database using bcp.exe and I'd like to validate my output. I can do this by comparing the file I generate to the files produced by exporting using bcp from an existing database. My problem is that the ordering can sometimes be different between files. I'd like a tool that will tell me just if there are lines that have no exact match in a pair of files, irregardless of the order of the lines.
'Irregardless' of whether 'irregardless' is a word...
The reliable way to do that comparison is to sort the two files into the same order, and then do a file comparison. Since you mention 'bcp.exe', that sounds more like Windows and probably MS SQL Server than Unix and Sybase.
I'd probably use Cygwin and either diff or comm to compare (and sort to order) the files, or any equivalent Unix workalike toolset (MKS, ...). Other people might recommend other tools. It depends, in part, on how many differences you think you're likely to find normally, and how you will handle them after you find them. Is a GUI output necessary? Also, you face a problem tracking the differences back to specific line numbers in the unsorted data files.
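If installing a Unix toolset is not an option, the same question can be answered with a few lines of script. The sketch below is in Python and swaps the sort-then-diff approach for a multiset comparison: it counts how often each line occurs in each file and reports the surplus on either side. The function name is hypothetical.

```python
from collections import Counter

def unmatched_lines(lines_a, lines_b):
    """Return (lines only in A, lines only in B), ignoring line order.

    Duplicates are respected: a line present twice in A and once in B
    is reported once as unmatched in A.
    """
    count_a = Counter(line.rstrip("\r\n") for line in lines_a)
    count_b = Counter(line.rstrip("\r\n") for line in lines_b)
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b

# Usage: lists of lines stand in for the generated file and the bcp export.
generated = ["1\ta\n", "2\tb\n", "2\tb\n"]
exported = ["2\tb\n", "1\ta\n", "3\tc\n"]
print(unmatched_lines(generated, exported))
```

Open file objects can be passed directly in place of the lists. Like the sorted comparison, this does not report where in the original files the unmatched lines sit, and it holds the distinct lines of both files in memory.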