I was bored and looking at old code that runs like molasses on a cold day. I found a group of tables in our accounting system - each with 500,000 records of ~20 datapoints - that use a single column of concatenated, fixed-width values instead of separate columns. (Fixing the tables isn't an option.) An old .NET ETL project grabs all the records, does a bunch of substrings on each record to set an object's corresponding attributes, then sends the object to merge with production data via a stored proc.
The way it is working is fine. It works. And, to be perfectly honest, I doubt I'll be given the go-ahead to fix it even if I come up with a better solution, but I was curious to see if anyone knew of a better way of doing this, because it's not entirely unlikely that I'll face a situation like this in the future.
I was thinking that if there were a way to use the TextFieldParser to parse a string already in memory instead of a file/stream, that might be a valid approach. Or, instead, I could write the entire table to a text file and then use the TextFieldParser to send data to the SProc. http://www.dotnetperls.com/textfieldparser shows that TextFieldParser is quite a bit faster than Split, which I would assume is comparable to the string manipulation our project is currently doing with Substring. So there may be something to that idea.
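Something along these lines is what I had in mind - just a rough sketch with a made-up record layout and field widths, not our actual code:

    using System;
    using System.IO;
    using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll

    class FixedWidthSketch
    {
        static void Main()
        {
            // Pretend this came from the table's concatenated column.
            string record = "0001Widget     00012500";

            // TextFieldParser accepts any TextReader, so a StringReader works
            // just as well as a file or stream.
            using (var parser = new TextFieldParser(new StringReader(record)))
            {
                parser.TextFieldType = FieldType.FixedWidth;
                parser.SetFieldWidths(4, 11, 8); // made-up widths; the real tables have ~20 fields

                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields();
                    Console.WriteLine(string.Join(" | ", fields));
                }
            }
        }
    }

Whether that actually beats a handful of Substring calls per record is something I'd have to benchmark.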
Or perhaps the whole old project should be dumped for a shiny new SSIS project. Would SSIS also have to write the records to a flat file before importing into SQL, or can it import directly from the table?
Thank you in advance!
I only found answers about how to import CSV files into a database, for example as a blob or as a 1:1 representation of the table you are importing it into.
What I need is a little different: my team and I are tracking everything we do in a database. A lot of these tasks produce logfiles, benchmark results, etc., which are stored in CSV format. The number of columns is far from consistent, and the data can be completely different from file to file, e.g. it could be a log from Fraps with frametimes in it, a log of CPU temperatures over a period of time, or something completely different.
Long story short, I came up with an idea, but - being far from an SQL pro - I am not sure if it makes sense or if there is a more elegant solution.
Does this make sense to you:
We also need to deal with a lot of data, so please also give me your opinion on whether this is feasible with something like 200 files per day, each of which can easily have a couple of thousand rows.
The purpose of all this is that we can generate reports from the stored data and perform analysis on it, e.g. view it on a webpage in a graph or do calculations with it.
I'm limited to MS-SQL in this case, because that's what the current (quite complex) database is and I'm just adding a new schema with that functionality to it.
Currently we just archive the files on a RAID and store a link to them in the database. So everyone who wants to do magic with the data needs to download every file they need and then use R or Excel to create a visualization of the data.
Have you considered a column of XML data type for the file data as an alternative to the ColumnId -> Data structure? SQL Server provides a special dedicated XML index (over the entire XML structure), so your data can be fully indexed no matter what CSV columns you have. You will have far fewer records in the database to handle (as an entire CSV file will be a single XML field value). There are good XML query options to search by values and attributes of the XML type.
For that you will need to translate CSV to XML, but you will have to parse it either way ...
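For illustration, a rough sketch of that CSV-to-XML translation (shown here in C#; the file/row/field layout is just one possible shape, not a requirement):

    using System;
    using System.IO;
    using System.Linq;
    using System.Xml.Linq;

    class CsvToXmlSketch
    {
        // Turns one CSV file into a single XML document that can be stored in an
        // xml column. Assumes a simple comma-separated file with a header row.
        static XElement ToXml(string csvPath)
        {
            string[] lines = File.ReadAllLines(csvPath);
            string[] headers = lines[0].Split(',');

            return new XElement("file",
                lines.Skip(1).Select(line =>
                {
                    string[] values = line.Split(',');
                    return new XElement("row",
                        headers.Zip(values, (h, v) =>
                            new XElement("field",
                                new XAttribute("name", h.Trim()),
                                v.Trim())));
                }));
        }

        static void Main()
        {
            Console.WriteLine(ToXml("frametimes.csv")); // hypothetical file name
        }
    }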
Not that your plan won't work; I am just giving an idea :)
=========================================================
Update with some online info:
An article from Simple Talk: The XML Methods in SQL Server
Microsoft documentation for nodes() with various use case samples: nodes() Method (xml Data Type)
Microsoft documentation for value() with various use case samples: value() Method (xml Data Type)
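To make those links concrete, a query against such a column could look roughly like this (the table, column and XML layout are all invented here, matching the sketch above):

    using System;
    using System.Data.SqlClient;

    class QueryXmlColumnSketch
    {
        static void Main()
        {
            // Hypothetical table dbo.LogFiles with an xml column named FileData.
            const string sql = @"
                SELECT f.value('@name', 'nvarchar(100)') AS ColumnName,
                       f.value('.', 'nvarchar(100)')     AS ColumnValue
                FROM dbo.LogFiles
                CROSS APPLY FileData.nodes('/file/row/field') AS t(f)
                WHERE f.value('@name', 'nvarchar(100)') = 'frametime';";

            using (var conn = new SqlConnection("<your connection string>"))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine("{0} = {1}", reader[0], reader[1]);
                }
            }
        }
    }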
I'm brand new to using SQL and I'm not quite sure where to begin. I have many .CSV files in a folder, and I want to build a database that provides a way to search through the information stored in each .csv file. All of the files are identical in their parameters, meaning all of the files are set up the same way in terms of columns and rows. Because everything is laid out identically, this should be easy to code in SQL.
Currently I am using SQLite to store and organize all of the data AFTER the first 17 rows of information; deleting the first 17 rows is also acceptable. I am familiar with Java and C++, but I'm not quite sure I understand how to skip or delete the first 17 rows in SQL, and I also do not know how to code for this in SQLite. I would think that this is a simple thing, but I can't find anything on how to achieve it. How would I go about telling SQLite to skip or delete the first 17 rows of each .csv file?
I have more than 100 CSV/text files (varying in size between 1 MB and 1 GB). I just need to create an Excel sheet for each CSV file, presenting:
name of columns
type of each column, i.e. numeric or string
number of records in each column
min & max value and min & max length of each column
so the output on a sheet would be something like this (I cannot paste a table image here as I am new on this site, so please consider the dummy table below as the Excel sheet):
    A            B        C         D          E          F           G
1   Column_name  Type     #records  min_value  max_value  min_length  max_length
2   Name         string   123456    Alis       Zomby      4           30
3   Age          numeric  123456    10         80         2           2
Is it possible to create VBA code for this? I am at a very beginner stage, so if any expert can help me out on the code side, that would be really helpful.
thanks!!!
You could try writing complex VBA file- and string-handling code for this; my advice is: don't.
A better approach is to ask: "What other tools can read a csv file?"
This is tabulated data, and the files are quite large. Larger, really, than you should be reading using a spreadsheet: it's database work, and your best toolkit will be SQL queries with MIN(), MAX() and COUNT() functions to aggregate the data.
Microsoft Access has a good set of 'external data' tools that will read fixed-width files, and if you use 'linked data' rather than 'import table' you'll be able to read the files using SQL queries without importing all those gigabytes into an Access .mdb or .accdb file.
Outside MS-Access, you're looking at intermediate-to-advanced VBA using the ADODB database objects (Microsoft Active-X Data Objects) and a schema.ini file.
Your link for text file schema.ini files is here:
http://msdn.microsoft.com/en-us/library/ms709353%28v=vs.85%29.aspx
...And you'll then be left with the work of creating an ADODB database 'connection' object that sees text files in a folder as 'tables', and writing code to scan the file names and build the SQL queries. All fairly straightforward for an experienced developer who's used the ADO text file driver.
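Purely as orientation, the shape of that ADO work looks something like this (shown in C# rather than VBA because it's shorter; the folder, file and column names are placeholders, and a schema.ini describing the files is assumed to exist in the folder):

    using System;
    using System.Data.OleDb;

    class CsvFolderStatsSketch
    {
        static void Main()
        {
            // The ACE text driver treats the folder as a database and each
            // file in it as a table named after the file.
            string connStr =
                @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\CsvFolder;" +
                @"Extended Properties=""text;HDR=YES;FMT=Delimited""";

            // Len() is a Jet/ACE expression function; [Name] is a placeholder column.
            string sql =
                "SELECT COUNT(*), MIN([Name]), MAX([Name]), " +
                "MIN(LEN([Name])), MAX(LEN([Name])) FROM [sample.csv]";

            using (var conn = new OleDbConnection(connStr))
            using (var cmd = new OleDbCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine(
                            "rows={0} min={1} max={2} minLen={3} maxLen={4}",
                            reader[0], reader[1], reader[2], reader[3], reader[4]);
                }
            }
        }
    }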
Beyond a rough outline like that, I can't offer anything more concrete, because this is quite a complex task, and it's not really an Excel-VBA task; it's a programming task best undertaken with database tools, except for the very last step of displaying your results in a spreadsheet.
This is not a task I'd give a beginner as a teaching exercise: it demands so many unfamiliar concepts and techniques that they'd get nowhere until it was broken down into a structured series of separate tutorials.
I have a table called Animals. I pull data from this table to populate another system.
I get Excel data with lists of animals that need to go in the Animals table.
The Excel data will also have other identifiers, like Breed, Color, Age, Favorite Toy, Veterinarian, etc.
These identifiers will change with each new Excel file. Some may repeat, others are brand new.
Because the fields change, and I never know what new fields will come with each new Excel file, my Animals table only has Animal Id and Animal Name.
I've created a Values table to hold all the other identifier fields. That table is structured like this:
AnimalId
Value
FieldId
DataFileId
And then I have a Fields table that holds the key to each FieldId in the Values table.
I do this because the alternative is to keep a big table with fields that may not even be used each time I need to add data. A big table with a lot of null columns.
I'm not sure my way is a good way either. It can seem overly complex.
But, assuming it is a good way, what is the best way to get this Excel data into my Values table? The list of animals is easy to add to my Animals table. But for each identifier (Breed, Color, etc.) I have to copy or import the values and then update the table to assign a matching FieldId (or create a new FieldId in the Fields table if it doesn't exist yet).
It's a huge pain to load new data if there are a lot of identifiers. I'm really struggling and could use a better system.
Any advice, help, or just pointing me in a better direction would be really appreciated.
Thanks.
Depending on your client (e.g. I use SequelPro on a Mac), you might be able to import CSVs. This is generally pretty shaky, but you can also export your Excel document as a CSV... how convenient.
However, this doesn't really help with your database structure. Granted, using foreign keys is a good idea, but importing that data unobtrusively (and easily) will likely need to be done a row at a time.
That said, you could try modifying something like this to suit your needs: first export your Excel document as a CSV, remove the header row (the first one), and then use regular expressions on it to turn it into a big chunk of SQL. For example:
Your CSV:
myval1.1,myval1.2,myval1.3,myval1.4
myval2.1,myval2.2,myval2.3,myval2.4
...
At which point, you could do something like:
myCsvText.replace(/^(.+),(.+),(.+),(.+)$/mg, "INSERT INTO table_name(col1, col2, col3, col4) VALUES('$1', '$2', '$3', '$4');")
where you know the number of columns, their names, and how their values are organized (via the regular expression & replacement).
Might be a good place to start.
Your table looks OK. Since you have a variable number of fields, it seems logical to expand vertically, although you might want to make it easier on yourself by changing DataFileId and FieldId into DataFileName and FieldName, unless you will use them in a lot of other tables too.
Getting data from Excel into SQL Server is unfortunately not as easy as you would expect from two Microsoft products interacting with each other. There are several routes that I know of that you can take:
1) Work with CSV files instead of Excel files. Excel can edit CSV files just as easily as Excel files, but CSV is an infinitely more reliable data source when it comes to importing. You don't get problems with different file formats for different Excel versions, Excel having to be installed on the computer that will run the script, or quirks with automatic datatype recognition. A CSV can be read with the BCP command-line tool, the BULK INSERT command or with SSIS. Then use stored procedures to convert the data from a horizontal bulk of columns into a pure vertical format.
2) Use SSIS to read the data directly from the Excel file(s). It is possible to make a package that loops over several Excel files. A drawback is that the column format and the sheet name of the Excel file have to be known beforehand, so a different template (with a separate loop) has to be made each time a new Excel format arrives. There are third-party SSIS components that claim to be more flexible, but I haven't tested them yet.
3) Write a Visual C# program or PowerShell script that grabs the Excel file, extracts the data and writes it into your SQL table (a rough sketch of this approach follows the list). Visual C# is a pretty easy language with powerful interfaces into Office and SQL Server. I don't know how big the learning curve is to get started, but once you do, it will be a pretty easy program to write. I have also heard good things about PowerShell.
4) Create an Excel macro that uses VBA code to open other Excel files, loop through their data and write the results either to a predefined sheet or as CSV to disk. Once everything is in a standard format it will be easy to import the data using one of the above methods.
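For what it's worth, a hedged sketch of 3) against your Animals/Values/Fields structure might look like this (all names here - the sheet, the connection strings, and the AddAnimalValue procedure that would resolve the AnimalId/FieldId keys - are invented):

    using System;
    using System.Data;
    using System.Data.OleDb;
    using System.Data.SqlClient;

    class ExcelToValuesSketch
    {
        static void Main()
        {
            // Read the whole worksheet through the ACE OLEDB provider.
            string excelConn =
                @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Imports\animals.xlsx;" +
                @"Extended Properties=""Excel 12.0 Xml;HDR=YES""";

            var sheet = new DataTable();
            using (var conn = new OleDbConnection(excelConn))
            using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn))
            {
                adapter.Fill(sheet);
            }

            using (var sql = new SqlConnection("<your connection string>"))
            {
                sql.Open();
                foreach (DataRow row in sheet.Rows)
                {
                    // First column is assumed to hold the animal name; every other
                    // column is pivoted into one row of the vertical Values table.
                    string animal = row[0].ToString();
                    for (int i = 1; i < sheet.Columns.Count; i++)
                    {
                        // dbo.AddAnimalValue is a hypothetical proc that looks up or
                        // creates the AnimalId and FieldId before inserting the value.
                        using (var cmd = new SqlCommand(
                            "EXEC dbo.AddAnimalValue @AnimalName, @FieldName, @Value", sql))
                        {
                            cmd.Parameters.AddWithValue("@AnimalName", animal);
                            cmd.Parameters.AddWithValue("@FieldName", sheet.Columns[i].ColumnName);
                            cmd.Parameters.AddWithValue("@Value", row[i].ToString());
                            cmd.ExecuteNonQuery();
                        }
                    }
                }
            }
        }
    }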
Since I have had headaches with 1) and 2) before, I would advise either 3) or 4). Because of my greater experience with VBA than Visual C# or PowerShell, I'd go for 4) if I was in a hurry. But I think 3) is the better investment for the long term.
(You could also get adventurous and use another scripting language, such as Python, as I once did because Python is cool; unfortunately, Python offers pretty slow and limited interfaces to SQL Server and Excel.)
Good luck!