We are presently developing an application, let's call it APP1, which uses a SQL database with about 800 stored procedures, 600 tables, etc. APP1 was originally created to replace another application, APP0, for which we have no source code, only the SQL tables, stored procedures, views, etc. Previous programmers of APP1 reused some DB objects from this same database and added other objects specific to APP1 as it grew bigger than APP0. We no longer need APP0, since APP1 does everything we want, and more.
So, now, we are thinking about a way to find out which objects are used by APP1 in order to remove objects which are ONLY used by APP0.
What is the best approach to discover all objects used by APP1 without having to open every single class and form?
Once we have a complete list of these objects, it will be easy to use a program we bought which detects all dependencies for any SQL objects specified directly in SQL, and then remove the objects that do not appear in any dependency chain. Any ideas on how I could get this list without having to go through our whole program, which has many, many, many classes and forms?
Thanks,
Note: I know that in a perfect world, all calls to SPs and tables would be in a DAL, but for the application we're presently working on ... that is not the case! Yippy! (sarcastic yippy) ;)
Note 2: This application does not use any ORM. All queries go directly through SqlCommand, so every call to a DB object is in string form.
You mentioned you have all the tables, sprocs, etc. from APP0. Presumably there is a BAK of them, or you can grab the original SQL objects by installing APP0 on a fresh PC.
Then use SQL Compare from Red Gate to compare the database APP1 uses against the original APP0 database; you can then see which objects you've added and strip out all the redundant APP0 DB objects.
You could run a trace on the database whilst the application is in use. This is likely to create a rather large amount of data, but you can then reduce it to the procedures and/or SQL statements executed by your application.
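For SQL Server, a minimal sketch of such a trace using an Extended Events session (2008 or later; on older versions you'd use SQL Profiler or a server-side trace instead) - all names here, including the APP1_DB database and the file path, are made up:

    -- Hypothetical session capturing everything sent to the APP1 database;
    -- adjust the session name, database name and target path to your setup.
    CREATE EVENT SESSION App1Usage ON SERVER
    ADD EVENT sqlserver.rpc_completed (
        ACTION (sqlserver.client_app_name, sqlserver.sql_text)
        WHERE (sqlserver.database_name = N'APP1_DB')),
    ADD EVENT sqlserver.sql_batch_completed (
        ACTION (sqlserver.client_app_name, sqlserver.sql_text)
        WHERE (sqlserver.database_name = N'APP1_DB'))
    ADD TARGET package0.event_file (SET filename = N'C:\Traces\App1Usage.xel');
    GO
    ALTER EVENT SESSION App1Usage ON SERVER STATE = START;

You can then read the .xel file back with sys.fn_xe_file_target_read_file and dedupe the captured statements and procedure names.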
Can you guarantee that you will, or can, use all the functionality? You might want to also run something like NCover to check how much of the application code you've exercised whilst using it.
I don't have an easy answer, but here's how I'd attack it. I admit up front this would take a fair amount of time, so I'll readily yield to someone who has a better answer.
It's a two-step problem.
Step 1: Find all the dependencies within SQL. That is, find all the tables that are used to build views, and all the tables and views that are used in stored procedures and functions. MS SQL Server has a function to do this for you. With other DBs you could write some queries against information_schema (or whatever their proprietary equivalent is).
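For example, on SQL Server 2008 and later the recorded dependencies can be dumped out of sys.sql_expression_dependencies (a sketch; on 2005 you'd use sys.sql_dependencies or sp_depends instead):

    -- One row per "object X references object Y" relationship the engine has
    -- recorded (views over tables, procs over tables/views, and so on).
    SELECT
        referencing_object = OBJECT_SCHEMA_NAME(d.referencing_id) + '.'
                           + OBJECT_NAME(d.referencing_id),
        referenced_object  = COALESCE(d.referenced_schema_name + '.', '')
                           + d.referenced_entity_name
    FROM sys.sql_expression_dependencies AS d
    ORDER BY referencing_object, referenced_object;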
Step 2: Get a list of all the SQL statements and stored procedures executed from within your code. This should be relatively straightforward if you do not build SQL statements on the fly. Just search for all your SqlCommand objects and find what you set the query text to. You could write a little program to scan your source and dump this out.
Then parse through this dump and make a list of referenced sprocs, tables, and views. Sort alphabetically and eliminate duplicates. Then add any tables or views referenced from sprocs and any tables referenced from views. Sort and eliminate duplicates again. Then you have your list.
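If the database is SQL Server 2008 or later, the "add whatever the sprocs and views reference" step can also be done in the database itself. A sketch, where #used_objects is a hypothetical temp table seeded from the source-code scan:

    -- Seed with the names found in the code (example values are made up),
    -- then keep adding whatever those objects reference until nothing new
    -- turns up.
    CREATE TABLE #used_objects (name sysname PRIMARY KEY);
    INSERT INTO #used_objects (name)
    VALUES ('dbo.usp_GetOrders'), ('dbo.Customers');

    DECLARE @added int = 1;
    WHILE @added > 0
    BEGIN
        INSERT INTO #used_objects (name)
        SELECT DISTINCT
            COALESCE(d.referenced_schema_name, 'dbo') + '.' + d.referenced_entity_name
        FROM sys.sql_expression_dependencies AS d
        JOIN #used_objects AS u
          ON u.name = OBJECT_SCHEMA_NAME(d.referencing_id) + '.' + OBJECT_NAME(d.referencing_id)
        WHERE COALESCE(d.referenced_schema_name, 'dbo') + '.' + d.referenced_entity_name
              NOT IN (SELECT name FROM #used_objects);
        SET @added = @@ROWCOUNT;
    END;

    SELECT name FROM #used_objects ORDER BY name;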
If you do generate SQL on the fly, I think the complexity level multiplies greatly. Then you have to work your way through the code that generates SQL and pick out the table names. If there are places where table names are passed from function to function, this could get very hard. (I can imagine real nightmare scenarios, like asking the user to type in a table name, or building a table name from pieces, e.g. Dim tablename = If(dept = "B17", "accounting", "manufacturing") & "_" & year.)
I am upgrading a webapp that will be using two different database types. The existing MySQL database is tightly integrated with the current systems, and a MongoDB database will be added for the extended functionality. The new functionality will also rely pretty heavily on the MySQL database for environmental data such as information on the current user, content, etc.
Although I know I can just assemble the queries independently, it got me thinking of a way that might make query construction much simpler (only for legibility while building; once it's finished, I'd convert back to hard-coded queries): an encapsulation object that would contain:
what data is being selected (including functionally derived data)
source (including joined data; I know that joins are not a good idea for non-relational DBs, but it would be nice to have the facility just in case, and it could be rewritten into two queries later for performance)
where and having conditions (stored as their own object types so they can be processed later, potentially including other select queries that can be interpreted by whatever db is using it)
orders
groupings
limits
This data can then be passed to an interface adapter that builds and executes the query, returning the result as an array, an object, or whatever is desired.
Although this sounds good, I have no idea if any code like this exists. If so, can anybody point it out to me, if not, are there any resources on similar projects undertaken that might allow me to continue the work and build a basic version?
I know this would be a complicated library, but I have been working on this update for the last few days, and constantly switching back and forth has been getting me muddled at times and allowing mistakes to creep in.
I would study things like the SQL grammar: http://www.h2database.com/html/grammar.html
Gives you an idea of how queries should be constructed.
You can study existing libraries around LINQ (C#): https://code.google.com/p/linqbridge/
Maybe even check out this link about FQL (Facebook's query language): https://code.google.com/p/mockfacebook/issues/list?q=label:fql
As you already know, this is a hard problem, and it will be a big challenge to make it run efficiently. Maybe consider moving all the data from MySQL and Mongo into a third data store that holds a copy of everything and running your queries against that - for example, replicating all writes to something like Redis or Elasticsearch and then querying there?
Either way, good luck!
I have a medium-sized app written in Ruby which makes pretty heavy use of an RDBMS. As our code grows, I find ugly SQL statements spreading to all the modules and methods in my app, embedded in much of the application logic. I am not sure if this is bad; however, my gut tells me it is quite ugly...
So, generally, in any language, how do you manage your SQL statements? Or do you think it is harmful for maintainability to embed many SQL statements in the application logic? Why or why not?
Thanks.
SQL is a language for accessing databases. Often, it gets confused as being the API into the data store for a larger application. In fact, you should design a real API between the data store and the app.
This means several things.
For accessing data stored in tables, you want to go through views in the database, rather than directly access the tables.
For data modification steps, you want to wrap insert/update/delete in stored procedures. This has secondary benefits, where you can handle constraints and triggers in the stored procedure and better log what is happening.
For security, you want to include database security as part of your security architecture. Giving all users full access may not be the best approach.
Unfortunately, it is easy to write a simple app that uses a database directly, whether in java or ruby or VBA or whatever. This grows into a bigger app, and then the maintenance problems arise.
I would suggest an incremental approach to fixing this. Go through the code and create views where you have nasty select statements. You'll probably find you need many fewer views than selects (the views can be re-used -- a good thing).
Find places where the code modifies data, and change these to stored procedures. I always return a status from the stored procedure for error checking, and put log information into a table called something like splog or _spcalls.
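A minimal sketch of both ideas (every object name here - CustomerSummary, Customers, splog, usp_SetCustomerEmail - is made up for illustration):

    -- Read access goes through a view rather than the base table.
    CREATE VIEW dbo.CustomerSummary AS
        SELECT CustomerId, Name, Email
        FROM dbo.Customers
        WHERE IsDeleted = 0;
    GO

    -- Writes go through a procedure that logs the call and returns a status.
    CREATE PROCEDURE dbo.usp_SetCustomerEmail
        @CustomerId int,
        @Email      varchar(255)
    AS
    BEGIN
        SET NOCOUNT ON;
        DECLARE @rows int;

        UPDATE dbo.Customers
        SET Email = @Email
        WHERE CustomerId = @CustomerId;
        SET @rows = @@ROWCOUNT;

        -- Log the call so the data layer can be audited later.
        INSERT INTO dbo.splog (ProcName, CalledAt, Detail)
        VALUES ('usp_SetCustomerEmail', GETDATE(),
                'CustomerId=' + CAST(@CustomerId AS varchar(12)));

        RETURN CASE WHEN @rows = 0 THEN 1 ELSE 0 END;  -- 0 = ok, 1 = nothing updated
    END;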
If you want to limit permissions for different users of your app, then you might be interested in this.
Leaving the raw SQL statements in the code is a problem. Just wait until you want to rename a column and you have to find all the places where this breaks the code.
Yes, this is not optimal - maintenance becomes a nightmare; it's hard to forecast and determine which code must change when the underlying DB changes. This is why it is good practice to create a data access layer (DAL) to encapsulate CRUD operations away from the application logic. There is often a business logic layer (BLL) between the application logic and the DAL to enforce business rules/logic.
Google "data access layer" "business logic layer" and even "n-tier architecture" to learn more.
If you are concerned about the SQL statements littered around your application logic, maybe consider implementing them as Stored Procedures?
That way you will only be including the procedure name and any parameters that need to be passed to it in your code.
It has other benefits too, a common one being that procedures are easier to re-use from multiple places.
There is much debate about the speed and security of stored procedures, and you will never get a definitive answer, so I won't even open that can of worms.
Here is how you do this with Java: Create a class that encapsulates all access to the database. Add a method to the class for each query you need to run.
The answer for ruby will be similar to this.
It depends on the architecture of your application, but a simple solution is to keep each SQL statement in its own file, e.g. qry.sql. For each Ruby module (or whatever is used in Ruby to aggregate related code) you can keep a SQL folder with these files. The collection of SQL folders/files then forms the data access layer of your application, and the Ruby code provides the business layer. If your data model changes (field names, etc.), you can grep to identify the SQL files that need changes. Anyway, definitely separate the SQL from your logic code.
I have a few challenges I need help on. I need to pull data in to my SQL database from arbitrary sources.
The details are: I know the exact structure of my database and the structure will not change. When I do take in new data, it will occur only one time, at the time I set up an instance of my database. I will make many instances of my database and each time it will have to pull data from a different source, and those sources will be structured in different ways.
The data will most likely contain thousands of rows of records. The data source will most likely be held in Excel or Access, more rarely in Word, and even more rarely in a SQL database.
I can assume that most of the core data will be the same, just put in different locations. It will follow a general grouping regardless of how it's held.
Essentially, I'm transferring data from legacy systems to a SQL system and this must be done for many groups and they need their own private instance of the database.
Any thoughts on how I would do this? How hard would it be to write a program that would do most of this for me?
This is definitely a real-world question. Is it possible to write a program that will do most of this? Not most of this, I think, but perhaps some of it.
For each table in your target system, create a view that displays the source data you expect to be able to insert. Choose column names that make it easy to tell what has to be done; most likely you'll choose column names that match the target columns in your INSERT statement. Save your INSERT statements as stored procedures.
Now, when you are given a new source of data in a new format, you will still have to recreate your views, but once the views are displaying the right data under your chosen column names, you can run your stored procedures without change.
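A minimal sketch of that pattern, with made-up names; only the view needs rewriting for each new source, the procedure stays the same:

    -- The view reshapes whatever the current source looks like into the
    -- column names the target INSERT expects (source names are hypothetical).
    CREATE VIEW dbo.v_src_Customers AS
        SELECT
            src.CUST_NO    AS CustomerId,
            src.CUST_NAME  AS Name,
            src.EMAIL_ADDR AS Email
        FROM staging.LegacyCustomers AS src;
    GO

    -- The stored procedure never changes between sources.
    CREATE PROCEDURE dbo.usp_LoadCustomers
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.Customers (CustomerId, Name, Email)
        SELECT CustomerId, Name, Email
        FROM dbo.v_src_Customers;
    END;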
I have a similar type of project where data is being retrieved from Access, .ini file, file modification dates, and MySql. I scrape this data every morning and basically append to a set SqlServer schema.
I created a DataTable and as I iterate a set of directories, insert the data into each new row. Once I have the DataTable complete, I perform a bulkcopy to append to the database.
I hope that helps you out a bit. I know my project doesn't cover all the aspects of your question, but I also don't have a DBA to provide views, stored procedures, etc., nor the additional time to devote to such things. Not the most favorable of conditions, but that's the way it is.
HTH...
The best way of solving this problem is with an ETL (Extract-Transform-Load) tool. A good choice is SSIS, which is part of Microsoft's BI suite.
This is the building blocks for consciousness or the base......
1 A data base that organizes thousands of files similar to dna,
2 user interface
3 parts are hidden, preventing a system breach/crash
I'd like to start a discussion about the implementation of a database system.
I'm working for a company whose database system has grown over roughly the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do data calculations and prepare the data for the Reporting Team. This is done in multiple stages. Let me try to explain the process in a little more detail: the Dataflow Team uses the data from the Entry database, copied to another server and another database via transactional replication (this data contains information from all clients). Once per hour, a self-written application checks for changed rows in the input tables (using a ChangedDate column) and then calls a stored procedure for each output table, calculating new data using 1-N of the input tables. After that, the data is copied to another database on another server, again using transactional replication, where another stored procedure is called to calculate additional output tables; this stored procedure is started by a SQL job.

From there the data is split into different databases, one per client. This copying is done by another self-written application using the .NET bulk copy command (filtering on the client). These client-specific databases are then copied to client-specific reporting databases on other servers by yet another self-written application, which compares the reporting database with the client-specific database and copies only the differences (because the reporting databases used to run on the client servers).
This whole process is orchestrated by another self-written application, which controls, for example, whether the transactional replications have finished before starting the job that calls the stored procedure, and so on. Furthermore, the synchronisation between the different clients is also orchestrated here. The process can be graphically displayed by a self-written monitoring tool, which looks pretty complex, as you can imagine...
The status of all these components is logged and can be viewed in another self-written application.
If new columns or tables are added, all of these components have to be changed manually.
For deployment, installation instructions are written in MS Word. (About 10 people work in this team.)
Reporting:
The Reporting Team created its own platform, written in .NET, to allow the client to create custom reports via a GUI. The reports are accessible via the Web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people handle this kind of scenario; I can't imagine that every company writes its own custom applications.
What are the options for doing fast calculations on databases (besides T-SQL)? I'm somehow missing the link here to the object-oriented programming I'm used to from my old company, but we never dealt with so much data there, and maybe for fast calculations this is the way to do it... Or is it possible to create the algorithms and calculations using e.g. LINQ or BizTalk Server, maybe even in a graphical way? The question is just how to convert the existing metre-long stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing, working, complex stored procs (which can be performance-tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like T-SQL? That's not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process could be improved using SSIS, but given how complex SSIS is and the amount of time a rewrite of the process would take, I'm not sure you would really gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to start by learning how SQL works, and doing so in the context of the existing architecture sounds like a good way to begin (since nothing you've described sounds irrational on the face of it).
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained).
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million-row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself (because of extremely poor scalar UDF performance), the entire system is implemented as a series of T-SQL SPs.
We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand; otherwise, they would be working directly in SQL Server, which would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking: you are building a customized system to report on the data at hand. The only item that might be worth considering on this side is standardization between the teams on common libraries and technologies.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it, and it's hard to say more without knowing your organization and the exact nature of its operations better.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard pressed to beat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure, I can see why you'd want to go to .NET, but you'd probably gain more by figuring out how to rewrite them nice and set-based. BCP is very hard to beat; when I used SQL Server 2000, BCP was often faster than DTS, and SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance, no doubt people are going to come to you. Still, if you are doing a ton of row-by-row complex calculations, optimizing that into a CLR stored procedure, or even a .NET application called from SQL Server to do the processing, will probably result in a speed-up. Of course, if you are row processing and you manage to rewrite the queries to do set processing, you'd probably get a bigger speed-up; but depending on how complex the calculations are, .NET may help.
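To illustrate the row-by-row versus set-based point, a made-up example: instead of cursoring over orders and updating one row per iteration, a single joined UPDATE does the whole thing in one pass.

    -- Set-based rewrite of a hypothetical per-row total calculation.
    UPDATE o
    SET o.TotalAmount = t.LineTotal
    FROM dbo.Orders AS o
    JOIN (SELECT OrderId, SUM(Quantity * UnitPrice) AS LineTotal
          FROM dbo.OrderLines
          GROUP BY OrderId) AS t
      ON t.OrderId = o.OrderId;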
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might add is that there may be a lot of duplicate SQL that looks exactly the same except for a table name and/or the column names. If so, you can probably use .NET combined with SQL-SMO (or DMO if using SQL Server 2000) to code-generate it.
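The same metadata-driven idea can also be sketched in T-SQL alone (not the .NET/SMO route the answer above suggests): pull the column list out of INFORMATION_SCHEMA and stamp out the boilerplate statement. Table names here are made up; the FOR XML PATH trick assumes SQL Server 2005 or later.

    DECLARE @table sysname, @cols nvarchar(max);
    SET @table = 'DimCustomer';

    -- Build a comma-separated column list from the catalog.
    SELECT @cols = STUFF((
        SELECT ', ' + c.COLUMN_NAME
        FROM INFORMATION_SCHEMA.COLUMNS AS c
        WHERE c.TABLE_NAME = @table
        ORDER BY c.ORDINAL_POSITION
        FOR XML PATH('')), 1, 2, '');

    -- Emit the statement instead of hand-writing it for every table.
    PRINT 'INSERT INTO dbo.' + @table + ' (' + @cols + ') '
        + 'SELECT ' + @cols + ' FROM staging.' + @table + ';';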
Here's an example of a pattern I often see for loading a data warehouse:
Assuming some raw tables are loaded with the data from the source:
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
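A sketch of that pattern for one hypothetical table (DimCustomer, with #changed_rows standing in for the temp table of changed source rows); "terminating" a row here means stamping its EndDate rather than overwriting it:

    -- Close out current rows whose columns that matter have changed.
    UPDATE d
    SET d.EndDate = GETDATE()
    FROM dbo.DimCustomer AS d
    JOIN #changed_rows AS c ON c.CustomerId = d.CustomerId
    WHERE d.EndDate IS NULL
      AND (d.Name <> c.Name OR d.Email <> c.Email);

    -- Insert a fresh current row for anything changed or brand new.
    INSERT INTO dbo.DimCustomer (CustomerId, Name, Email, StartDate, EndDate)
    SELECT c.CustomerId, c.Name, c.Email, GETDATE(), NULL
    FROM #changed_rows AS c
    WHERE NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                      WHERE d.CustomerId = c.CustomerId AND d.EndDate IS NULL);

Only the table name, the key and the "columns that matter" vary from table to table, which is exactly what makes this a good target for code generation.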
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....
I am currently creating a master ddl for our database. Historically we have used backup/restore to version our database, and not maintained any ddl scripts. The schema is quite large.
My current thinking:
Break script into parts (possibly in separate scripts):
table creation
add indexes
add triggers
add constraints
Each script would get called by the master script.
I might need a script to drop constraints temporarily for testing
There may be orphaned tables in the schema, I plan to identify suspect tables.
Any other advice?
Edit: Also if anyone knows good tools to automate part of the process, we're using MS SQL 2000 (old, I know).
I think the basic idea is good.
The nice thing about building all the tables first and then building all the constraints is that the tables can be created in any order. When I've done this, I had one file per table, which I put in a directory called "Tables", and then a script which executed all the files in that directory. Likewise I had a folder for constraint scripts (which did foreign keys and indexes too), which were executed after the tables were built.
I would separate the build of the triggers and stored procedures, and run these last. The point about these is they can be run and re-run on the database without affecting the data. This means you can treat them just like ordinary code. You should include "if exists...drop" statements at the beginning of each trigger and procedure script, to make them re-runnable.
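For example, the re-runnable pattern for a procedure looks something like this (SQL 2000-compatible syntax; usp_GetOrders and the Orders table are made-up names):

    IF EXISTS (SELECT * FROM sysobjects
               WHERE name = 'usp_GetOrders' AND type = 'P')
        DROP PROCEDURE dbo.usp_GetOrders
    GO

    CREATE PROCEDURE dbo.usp_GetOrders
        @CustomerId int
    AS
        SELECT OrderId, OrderDate
        FROM dbo.Orders
        WHERE CustomerId = @CustomerId
    GO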
So the order would be
table creation
add indexes
add constraints
Then
add triggers
add stored procedures
On my current project we are using MSBuild to run the scripts. There are some extension targets you can get for it which allow you to call SQL scripts. In the past I have used Perl, which was fine too (and batch files... which I would not recommend - they're too limited).
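Another lightweight option (just a sketch, not what we use) is a master script run through sqlcmd, pulling in the per-object files with :r; the file names below are invented. sqlcmd ships with the 2005+ client tools, so with a pure SQL 2000 toolset you'd loop over the files with osql from a batch or Perl script instead.

    -- master.sql, executed with:  sqlcmd -S myserver -d mydb -i master.sql
    :r .\Tables\Customers.sql
    :r .\Tables\Orders.sql
    :r .\Indexes\IX_Orders_CustomerId.sql
    :r .\Constraints\FK_Orders_Customers.sql
    :r .\Triggers\trg_Orders_Audit.sql
    :r .\Procs\usp_GetOrders.sql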
#Adam
Or how about just by domain -- a useful grouping of related tables in the same file, but separate from the rest?
The only problem is if some domains (in this somewhat legacy system) are tightly coupled. Plus, you have to maintain the dependencies between your different sub-scripts.
If you are looking for an automation tool, I have often worked with EMS SQL Manager, which lets you automatically generate a DDL script from a database.
Data inserts into reference tables might be mandatory before putting your database online. This can even be considered part of the DDL script. EMS can also generate scripts for data inserts from existing databases.
The need for indexes might not be properly estimated at the DDL stage. You will just need to declare them for primary/foreign keys. Other indexes should be created later, once views and queries have been defined.
What you have there seems to be pretty good. My company has on occasion, for large enough databases, broken it down even further, perhaps to the individual object level. In this way each table/index/... has its own file. Can be useful, can be overkill. Really depends on how you are using it.
#Justin
By domain is almost always sufficient. I agree that there are some complexities to deal with when doing it this way, but they should be easy enough to handle.
I think this method provides a little more separation (which you will come to appreciate in a large database) while still being pretty manageable. We also write Perl scripts that do a lot of the processing of these DDL files, so that might be a good way to handle that part.
There is a neat tool that will iterate through an entire SQL Server instance and extract all the table, view, stored procedure and UDF definitions to the local file system as SQL scripts (text files). I have used it with 2005 and 2008; not sure how it will work with 2000, though. Check out http://www.antipodeansoftware.com/Home/Products
Invest the time to write a generic "drop all constraints" script, so you don't have to maintain it.
A cursor over the following statements does the trick.
    SELECT * FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
    SELECT * FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
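A sketch of what that cursor might look like (SQL 2000-compatible; it orders foreign keys first using Constraint_Type from Table_Constraints rather than joining Referential_Constraints, which achieves the same effect of dropping them before the keys they reference):

    DECLARE @schema sysname, @table sysname, @constraint sysname, @sql nvarchar(4000)

    DECLARE c CURSOR FAST_FORWARD FOR
        SELECT tc.TABLE_SCHEMA, tc.TABLE_NAME, tc.CONSTRAINT_NAME
        FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
        ORDER BY CASE WHEN tc.CONSTRAINT_TYPE = 'FOREIGN KEY' THEN 0 ELSE 1 END

    OPEN c
    FETCH NEXT FROM c INTO @schema, @table, @constraint
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'ALTER TABLE [' + @schema + N'].[' + @table
                 + N'] DROP CONSTRAINT [' + @constraint + N']'
        EXEC sp_executesql @sql
        FETCH NEXT FROM c INTO @schema, @table, @constraint
    END

    CLOSE c
    DEALLOCATE c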
I previously organised my DDL code with one file per entity and made a tool that combined these into a single DDL script.
My former employer used a scheme where all table DDL was in one file (stored in Oracle syntax), indices in another, constraints in a third, and static data in a fourth. A change script was kept in parallel with this (again in Oracle). The conversion to SQL Server was manual. It was a mess. I actually wrote a handy tool that converts Oracle DDL to SQL Server (it worked 99.9% of the time).
I have recently switched to using Visual Studio Team System for Database professionals. So far it works fine, but there are some glitches if you use CLR functions within the database.