Read Excel Files from External Tables - sql

I am tasked to create a template that will be Filled up by Business Users with Employee Information, then our program will load this into the Database using External Tables.
However, our Business Users constantly change the template by adding, removing or reordering fields.
I am convinced to use XLSX instead of CSV so that I can lock the Column Headers so they cannot remove, add and reorder the columns.
However, When i query the External Table, it shows Non-ASCII Characters when reading XLSX because its in Binary.
How can i do either of the following?
Effectively Read Excel Files from External Tables
Lock the Headers of CSV Files?

What you have here is a political problem, but you are looking for a technical fix. Not a good fit.
The problem comes in two halves:
Somebody decided it was a good idea to collect user input in a spreadsheet, which it is generally not.
Users are fiddling with the input format, which they should not.
Fixes are:
Strictly enforce the data structure. Reject any CSV which doesn't natch and make the users edit them. They will quickly tire of tweaking the spreadsheets when they realise they're just creating more work for themselves. But they will also get resentful, so consider ...
Building a data input screen. It's pretty simple to knock up a spreadsheet-like grid UI. You don't need anything complicated in Java: Oracle's Apex is intended for exactly this sort of thing. Find out more.
However, if you are stuck with Excel as a UI I suggest you have a look at Anton Scheffer's excellent PLSQL as_read_xlsx package on the Amis site. Check it out. You'll probably need to replace your external table with a view over a table (perhaps pipelined) function.

Related

Looking to build a lineup builder for the site draftkings using excel

Why excel? Well excel is what is used to import the player salaries.
Now I need the spreadsheet to do the following.
Create teams within salary cap.
Include/exclude specific player function
Build multiple lineups from a selected list of players within the cap
Can I do all of this with excel? or do I need to know excel vba as well?
Also which parts of excel or if necessary excel vba must I need to know to code such a thing? Also if someone could give me a short summary of the steps needed to hypothetically make such a thing it would be great. Thanks.
I'm posting this as a reply because it is too long for a comment window.
Just because Excel has a grid doesn't mean that it is fit for data storage and data handling, on the contrary.
What you typically want to do is create a transparent structure that guarantees the integrity of your data and that allows dynamic portability when needed one day.
Excel is meant to be a spreadsheet, people forget this all the time or they just avoid the topic: although Excel has a grid, doesn't mean that it is a good fit for reliable data storage. It is not even the least complex way of storing data, depending on the amount of VBA that you need to manage all these data and the gates that you unnecessarily open towards potential bugs.
This is why an RDBMS is what will fit your needs, in this case Access would be a good option as it preserves your data integrity if you get the table structure right and it executes a lot of tasks for you that you should otherwise need to program yourself to protect the integrity of your data.
Although you can perform SQL on spreadsheets too, note that Excel does NOT cover related tables (what you typically seem to need for building your teams and salary limits), so what many Excel programmers will typically do is to create their own code to make this cross-table data storage thing work.
Don't do this if the alternative is available and much more reliable and future-proof.
At first sight, it seems that you won't even need any VBA; I'm not sure of all the things you want to do, but my first impression is that you can manage everything with SQL syntax and stored queries in MS Access. You can import Excel sheets into Access if you get their format right so that should not be a problem.
Once your data is stored there, rest assured that you have made your life a lot easier.

Transferring data from a database of unknown structure to a databases of known structure

I have a few challenges I need help on. I need to pull data in to my SQL database from arbitrary sources.
The details are: I know the exact structure of my database and the structure will not change. When I do take in new data, it will occur only one time, at the time I set up an instance of my database. I will make many instances of my database and each time it will have to pull data from a different source, and those sources will be structured in different ways.
The data will most likely contain thousands of rows of records. The data source will most likely be held in Excel, Access, more rare Word and even rarer, it'll be held in a SQL database.
I can assume that most of the core data will be the same, just put in different locations. They will follow a general grouping despite how there held.
Essentially, I'm transferring data from legacy systems to a SQL system and this must be done for many groups and they need their own private instance of the database.
Any thoughts on how I would do this? How hard would it be to write a program that would do most of this for me?
This is definitely a real-world question. Is it possible to write a program that will do most of this? Not most of this, I think, but perhaps some of it.
For each table in your target system, create a view that displays the source data you expect to be able to insert. Choose column names that make it easy to tell what has to be done; most likely you'll choose column names that match the target columns in your INSERT statement. Save your INSERT statements as stored procedures.
Now, when you are given a new source of data in a new format, you will still have to recreate your views, but once the views are displaying the right data under your chosen column names, you can run your stored procedures without change.
I have a similar type of project where data is being retrieved from Access, .ini file, file modification dates, and MySql. I scrape this data every morning and basically append to a set SqlServer schema.
I created a DataTable and as I iterate a set of directories, insert the data into each new row. Once I have the DataTable complete, I perform a bulkcopy to append to the database.
I hope that helps you out a bit. I know my project doesn't cover all the aspects of your question; but also don't have a DBA to provide views, stored procedures, etc. Nor do I have the additional time to devote to such things. Not the most favorable of conditions, but that's the way it is.
HTH...
The best way of solving this problem is with and ETL (Extract-Transform-Load) solution. A good choice is SSIS which is through Microsoft's BI suite.
This is the building blocks for consciousness or the base......
1 A data base that organizes thousands of files similar to dna,
2 user interface
3 parts are hidden, preventing a system breach/crash

Getting a significant amount of data into a SQL Server (Express) database at time of deployment

For most database-backed projects I've worked on, there is a need to get "startup" or test data into the database before deploying the project. Examples of startup data: a table that lists all the countries in the world or a table that lists a bunch of colors that will be used to populate a color palette.
I've been using a system where I store all my startup data in an Excel spreadsheet (with one table per worksheet), then I have a utility script in SQL that (1) creates the database, (2) creates the schemas, (3) creates the tables (including primary and foreign keys), (4) connects to the spreadsheet as a linked server, and (5) inserts all the data into the tables.
I mostly like this system. I find it very easy to lay out columns in Excel, verify foreign key relationships using simple lookup functions, perform concatenation operations, copy in data from web tables or other spreadsheets, etc. One major disadvantage of this system is the need to sync up the columns in my worksheets any time I change a table definition.
I've been going through some tutorials to learn new .NET technologies or design patterns, and I've noticed that these typically involve using Visual Studio to create the database and add tables (rather than scripts), and the data is typically entered using the built-in designer. This has me wondering if maybe the way I'm doing it is not the most efficient or maintainable.
Questions
In general, do you find it preferable to build your whole database via scripts or a GUI designer, such as SSMSE or Visual Studio?
What method do you recommend for populating your database with startup or test data and why?
Clarification
Judging by the answers so far, I think I should clarify something. Assume that I have a significant amount of data (hundreds or thousands of rows) that needs to find its way into the database. This data could be sourced from various places, such as text files, spreadsheets, web tables, etc. I've received several suggestions to script this process using INSERT statements, but is this really viable when you're talking about a lot of data?
Which leads me to...
New questions
How would you write a SQL script to take the country data on this page and insert it into the database?
With Excel, I could just copy/paste the table into a worksheet and run my utility script, and I'd basically be done.
What if you later realized you needed a new column, CapitalCity?
With Excel, I could take that information from this page, paste it into Excel, and with a quick text-to-column manipulation, I'd have the data in the format I need.
I honestly didn't write this question to defend Excel as the best way or even a good way to get data into a database, but the answers so far don't seem to be addressing my main concern--how to get all this data into your database. Writing a script with hundreds of INSERT statements by hand would be extremely time consuming and error prone. Somehow, this script needs to be machine generated, but how?
I think your current process is fine for seeding the database with initial data. It's simple, easy to maintain, and works for you. If you've got a good database design with adequate constraints then it doesn't really matter how you seed the initial data. You could use an intermediate tool to generate scripts but why bother?
SSIS has a steep learning curve, doesn't work well with source control (impossible to tell what changed between versions), and is very finicky about type conversions from Excel. There's also an issue with how many rows it reads ahead to determine the data type -- you're in deep trouble if your first x rows contain numbers stored as text.
1) I prefer to use scripts for several reasons.
• Scripts are easy to modify, and plus when I get ready to deploy my application to a production environment, I already have the scripts written so I'm all set.
• If I need to deploy my database to a different platform (like Oracle or MySQL) then it's easy to make minor modifications to the scripts to work on the target database.
• With scripts, I'm not dependent on a tool like Visual Studio to build and maintain the database.
2) I like good old fashioned insert statements using a script. Again, at deployment time scripts are your best friend. At our shop, when we deploy our applications we have to have scripts ready for the DBA's to run, as that's what they expect.
I just find that scripts are simple, easy to maintain, and the "least common denominator" when it comes to creating a database and loading up data to it. By least common denominator, I mean that the majority of people (i.e. DBA's, other people in your shop that might not have visual studio) will be able to use them without any trouble.
The other thing that's important with scripts is that it forces you to learn SQL and more specfically DDL (data definition language). While the hand-holding GUI tools are nice, there's no substitute for taking the time to learn SQL and DDL inside out. I've found that those skills are invaluable to have in almost any shop.
Frankly, I find the concept of using Excel here a bit scary. It obviously works, but it's creating a dependency on an ad-hoc data source that won't be resolved until much later. Last thing you want is to be in a mad rush to deploy a database and find out that the Excel file is mangled, or worse, missing entirely. I suppose the severity of this would vary from company to company as a function of risk tolerance, but I would be actively seeking to remove Excel from the equation, or at least remove it as a permanent fixture.
I always use scripts to create databases, because scripts are portable and repeatable - you can use (almost) the same script to create a development database, a QA database, a UAT database, and a production database. For this reason it's equally important to use scripts to modify existing databases.
I also always use a script to create bootstrap data (AKA startup data), and there's a very important reason for this: there's usually more scripting to be done afterward. Or at least there should be. Bootstrap data is almost invariably read-only, and as such, you should be placing it on a read-only filegroup to improve performance and prevent accidental changes. So you'll generally need to script the data first, then make the filegroup read-only.
On a more philosophical level, though, if this startup data is required for the database to work properly - and most of the time, it is - then you really ought to consider it part of the data definition itself, the metadata. For that reason, I don't think it's appropriate to have the data defined anywhere but in the same script or set of scripts that you use to create the database itself.
Test data is a little different, but in my experience you're usually trying to auto-generate that data in some fashion, which makes it even more important to use a script. You don't want to have to manually maintain an ad-hoc database of millions of rows for testing purposes.
If your problem is that the test or startup data comes from an external source - a web page, a CSV file, etc. - then I would handle this with an actual "configuration database." This way you don't have to validate references with VLOOKUPS as in Excel, you can actually enforce them.
Use SQL Server Integration Services (formerly DTS) to pull your external data from CSV, Excel, or wherever, into your configuration database - if you need to periodically refresh the data, you can save the SSIS package so it ends up being just a couple of clicks.
If you need to use Excel as an intermediary, i.e. to format or restructure some data from a web page, that's fine, but the important thing IMO is to get it out of Excel as soon as possible, and SSIS with a config database is an excellent repeatable method of doing that.
When you are ready to migrate the data from your configuration database into your application database, you can use SQL Server Management Studio to generate a script for the data (in case you don't already know - when you right click on the database, go to Tasks, Generate Scripts, and turn on "Script Data" in the Script Options). If you're really hardcore, you can actually script the scripting process, but I find that this usually takes less than a minute anyway.
It may sound like a lot of overhead, but in practice the effort is minimal. You set up your configuration database once, create an SSIS package once, and refresh the config data maybe once every few months or maybe never (this is the part you're already doing, and this part will become less work). Once that "setup" is out of the way, it's really just a few minutes to generate the script, which you can then use on all copies of the main database.
Since I use an object-relational mapper (Hibernate, there is also a .NET version), I prefer to generate such data in my programming language. The ORM then takes care of writing things into the database. I don't have to worry about changing column names in the data because I need to fix the mapping anyway. If refactoring is involved, it usually takes care of the startup/test data also.
Excel is an unnecessary component of this process.
Script the current version the database components that you want to reuse, and add the script to your source control system. When you need to make changes in the future, either modify the entities in the database and regenerate the script, or modify the script and regenerate the database.
Avoid mixing Visual Studio's db designer and Excel as they only add complexity. Scripts and SQL Management Studio are your friends.

cleaning datasources [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
Improve this question
I'm project managing a development that's pulling data from all kinds of data sources (SQL MySQL, Filemaker, excel) before installing into a new database structure with a record base through 10 years. Obviously I need to clean all this before exporting, and am wondering if there are any apps that can simplify this process for me, or any guides that I can follow.
Any help would be great
I do this all the time and like Tom do it in SQl Server using DTS or SSIS depending on the version of the final database.
Some things I strongly recommend:
Archive all files received before you process them especially if you are getting this data from outside sources, you may have to research old imports and go back to the raw data. After the archive is successful, copy the file to the processing location.
For large files especially, it is helpful to get some sort of flag file that is only copied after the other file is completed or even better whcich contains the number of records in the file. This can help prevent problems from corrupted or incomplete files.
Keep a log of number of records and start failing your jobs if the file size or number of records is suspect. Put in a method to process anyway if you find the change is correct. Sometimes they really did mean to cut the file in half but most of the time they didn't.
If possible get column headers in the file. You would be amazed at how often data sources change the columns, column names or order of the columns without advance warning and break imports. It is easier to check this before processing data if you have column headers.
Never import directly to a production table. Always better to use a staging table where you can check and clean data before putting it into prod.
Log each step of your process, so you can easily find what caused a failure.
If you are cleaning lots of files consider creating functions to do specific types of cleaning (phone number formatting for instance) then you can use the same function in multiple imports.
Excel files are evil. Look for places where leading zeros have been stripped in the import process.
I write my processes so I can run them as a test with a rollback at the end. Much better to do this than realize your dev data is so hopelessly messed up that you can't even do a valid test to be sure everything can be moved to prod.
Never do a new import on prod without doing it on dev first. Eyeball the records directly when you are starting a new import (not all of them if it is a large file of course, but a good sampling). If you think you should get 20 columns and it imports the first time as 21 columns, look at the records in that last column, many times that means the tab delimited file had a tab somewhere in the data and the column data is off for that record.
Don't assume the data is correct, check it first. I've had first names in the last name column, phones in the zip code column etc.
Check for invalid characters, string data where there should just be numbers etc.
Any time it is possible, get the identifier from the people providing the data. Put this in a table that links to your identifier. This will save you from much duplication of records becuase the last name changed or the address changed.
There's lots more but this should get you started on thinking about building processes to protect your company's data by not importing bad stuff.
I work mostly with Microsoft SQL Server, so that's where my expertise is, but SSIS can connect to a pretty big variety of data sources and is very good for ETL work. You can use it even if none of your data sources are actually MS SQL Server. That said, if you're not using MS SQL Server there is probably something out there that's better for this.
To provide a really good answer one would need to have a complete list of your data sources and destination(s) as well as any special tasks which you might need to complete along with any requirements for running the conversion (is it a one-time deal or do you need to be able to schedule it?)
Not sure about tools, but your going to have to deal with:
synchronizing generated keys
synchronizing/normalizing data formats (e.g. different date formats)
synchronizing record structures.
orphan records
If the data is running/being updated while you're developing this process or moving data you're also going to need to capture the updates. When I've had to do this sort of thing in the past the best, not so great answer I had was to develop a set of scripts that ran in multiple iterations, so that I could develop and test the process iteratively before I moved any of the data. I found it helpful to have a script (I used a schema and an ant script, but it could be anything) that could clean/rebuild the destination database. It's also likely that you'll need to have some way of recording dirty/mismatched data.
In similar situations I personally have found Emacs and Python mighty useful but, I guess, any text editor with good searching capabilities and a language with powerful string manipulation features should do the job. I first convert the data into flat text files and then
Eyeball either the whole data set or a representative true random sample of the data.
Based on that make conjectures about different columns ("doesn't allow nulls", "contains only values 'Y' and 'N'", "'start date' always precede 'end date'", etc.).
Write scripts to check the conjectures.
Obviously this kind method tends to focus on one table at a time and therefore only complements the checks made after uploading the data into a relational database.
One trick that comes in useful for me with this, is to find a way for each type of data source to output a single column plus unique identifier at a time in tab delimited form say, so that you can clean it up using text tools (sed, awk, orTextMate's grep search), and then re-import it / update the (copy of!) original source.
It then becomes much quicker to clean up multiple sources, as you can re-use tools across them (e.g. capitalising last names - McKay, O'Leary o'Neil, Da Silva, Von Braun, etc., fixing date formats, trimming whitespace) and to some extent automate the process (depending on the source).

Creating a database in Microsoft Access that is searchable only by certain fields

How would you create a database in Microsoft Access that is searchable only by certain fields and controlled by only a few (necessary) text boxes and check boxes on a form so it is easy to use - no difficult queries?
Example:
You have several text boxes and several corresponding check boxes on a form, and when the check box next to the text box is checked, the text box is enabled and you can then search by what is entered into said text box
(Actually I already know this, just playing stackoverflow jeopardy, where I ask a question I know the answer just to increase the world's coding knowledge! answer coming in about 5 mins)
My own solution is to add a "filter" control in the header part of the form for each of the columns I want to be able to filter on (usually all ...). Each time such a "filter" control is updated, a procedure will run to update the active filter of the form, using the "BuildCriteria" function available in Access VBA.
Thus, When I type "*cable*" in the "filter" at the top of the Purchase Order Description column, the "WHERE PODescription IS LIKE "*cable*" is automatically added to the MyForm.filter property ....
Some would object that filtering record source made of multiple underlying tables can become very tricky. That's right. So the best solution is according to me to always (I mean it!) use a flat table or a view ("SELECT" query in Access) as a record source for a form. This will make your life a lot easier!
Once you're convinced of this, you can even think of a small module that will automate the addition of "filter" controls and related procedures to your forms. You'll be on the right way for a real user-friendly client interface.
This is actually a pretty large topic, and fraught with all kinds of potential problems. Most intermediate to advanced books on Access will have some kind of section discussing "Query by Form," where you have an unbound form that allows the user to choose certain criteria, and that when executed, writes on-the-fly SQL to return the matching data.
In anything but a flat, single-table data structure, this is not a trivial task because the FROM clause of the SQL is dependent on the tables queried in the WHERE clause.
A few examples of some QBF forms from apps I've created for clients:
Querying 4 underlying tables
Querying a flat single table
Querying 3 underlying tables
Querying 6 underlying tables
Querying 2 underlying tables
The first one is driven by a class module that has properties that reflect the criteria selected in this form, and that has methods that write the FROM and WHERE clauses. This makes it extremely easy to add other fields (as long as those fields don't come from tables other than the ones already included).
The most complex part of the process is writing the FROM clause, as you have to have appropriate join types and include only the tables that are either in the SELECT clause or the WHERE clause. If you include anything else, you'll slow down your query a lot (especially if you have any outer joins).
But this is a big subject, and there is no magic bullet solution -- instead, something like this has to be created for each particular application. It's also important that you test it thoroughly with users, since what is completely clear and understandable to you, the developer, is often pretty darned mystifying to end users.
But that's a principle that doesn't just apply to QBF!
At start-up, you need to show a form and disable other menus etc. That way your user only ever sees your limited functionality and cannot directly open the tables etc.
This book excerpt, Real World Microsoft Access Database Protection and Security, should be enlightening.
For a question that vague, all that I can answer is open MS Access, and click the mouse a few times.
On second thought:
Use the "WhereCondition" argument of the "OpenForm" method
If the functionality is very limited and/or specialised then a SQL database is probably going to be overkill anyhow e.g. cache all combinations of the data locally, in memory even, and show one according to the checkboxes on the form. Previously you could have revoked permissions from the table and granted them only on VIEWs/PROCs that queried the data in the prescribed way, however security has been removed from MS Access 2007 so you can you now really stop users bypassing your simple app using, say, Excel and querying the data any way they like ...but then isn't that the point of an enterprise database? ;-)