Purging SQL Tables from large DB?

Purging SQL Tables from large DB? - sql

The site I am working on as a student will be redesigned and released in the near future and I have been assigned the task of manually searching through every table in the DB the site uses to find tables we can consider for deletion. I'm doing the search through every HTML files source code in dreamweaver but I was hoping there is an automated way to check my work. Does anyone have any suggestions as to how this is done in the business world?

If you search through the code, you may find SQL that is never used, because the users never choose those options in the application.
Instead, I would suggest that you turn on auditing on the database and log what SQL is actually used. For example in Oracle you would do it like this. Other major database servers have similar capabilities.
From the log data you can identify not only what tables are being used, but their frequency of use. If there are any tables in the schema that do not show up during a week of auditing, or show up only rarely, then you could investigate this in the code using text search tools.
Once you have candidate tables to remove from the database, and approval from your manager, then don't just drop the tables, create them again as an empty table, or put one dummy record in the table with mostly null values (or zero or blank) in the fields, except for name and descriptive fields where you can put something like "DELETED" "Report error DELE to support center", etc. That way, the application won't fail with a hard error, and you have a chance at finding out what users are doing when they end up with these unused tables.

Reverse engineer the DB (Visio, Toad, etc...), document the structure and ask designers of the new site what they need -- then refactor.

I would start by combing through the HTML source for keywords:
SELECT
INSERT
UPDATE
DELETE
...using grep/etc. None of these are HTML entities, and you can't reliably use table names because you could be dealing with views (assuming any exist in the system). Then you have to pour over the statements themselves to determine what is being used.
If [hopefully] functions and/or stored procedures were used in the system, most DBs have a reference feature to check for dependencies.
This would be a good time to create a Design Document on a screen by screen basis, listing the attributes on screen & where the value(s) come from in the database at the table.column level.
Compile your list of tables used, and compare to what's actually in the database.

If the table names are specified in the HTML source (and if that's the only place they are ever specified!), you can do a Search in Files for the name of each table in the DB. If there are a lot of tables, consider using a tool like grep and creating a script that runs grep against the source code base (HTML files plus any others that can reference the table by name) for each table name.
Having said that, I would still follow Damir's advice and take a list of deletion candidates to the data designers for validation.

I'm guessing you don't have any tests in place around the data access or the UI, so there's no way to verify what is and isn't used. Provided that the data access is consistent, scripting will be your best bet. Have it search out the tables/views/stored procedures that are being called and dump those to a file to analyze further. That will at least give you a list of everything that is actually called from some place. As for if those pages are actually used anywhere, that's another story.
Once you have the list of the database elements that are being called, compare that with a list of the user-defined elements that are in the database. That will give you the ones that could potentially be deleted.
All that being said, if the site is being redesigned then a fresh database schema may actually be a better approach. It's usually less intensive to start fresh and import the old data than it is to find dead tables and fields.

Related

I'm a new CDS/Dataverse user and am wondering why there are so many columns in new tables?

I'm new to CDS/Dataverse, coming from the SQL Server world. I created a new Dataverse table and there are over a dozen columns in my "new" table (e.g. "status", "version number"). Apparently these are added automatically. Why is this?
Also, there doesn't seem to be a way to view a grid of data (like I can with SQL Server) for quick review/modification of the data. Is there a way to view data visually like this?
Any tips for a new user, coming from SQL Server, would be appreciated. Thanks.
Edit: clarified the main question with examples (column names). (thanks David)

I am also new to CDS/Dataverse, so the following is a limited understanding from what I have explored so far.
The idea behind Dataverse is that it gives you a pre-built schema that follows best-practice for you build off of, so that you spend less time worrying about building a comprehensive data schema, creating tables, and how to relate them all together, and more time building applications in Power Apps.
For example, amongst the several dozen tables it generates from the get-go is Account and Contact. The former is for organisational entities and the latter is for single-person entities. You can go straight into adding your user records in one of these tables and take advantage of bits of Power Apps functionality already hooked up to these tables. You do not have to spend time thinking up column names, creating the table, making sure it hooks up to all the other Dataverse tables, testing whether the Power Apps functionality works with it correctly etc.
It is much the same story with the automatically generated columns for new tables: they are all there to maintain a best-practice schema and functionality for Power Apps. For example, the extra columns give you good auditing with the data you add, including when a row was created, modified, who created the row etc. The important thing is to start from what you want to build, and not get too caught up in the extra tables/columns. After a bit of research, you'll probably find you can utilise some more tables/columns in your design.
Viewing and adding data is very tedious -- it seems to take 5 clicks and several seconds to load the bit of data you want, which is eons in comparison to doing it in SQL Server. I believe it is how it is due to Microsoft's attempt to make it "user friendly".
Anyhow, the standard way to view data, starting from the main Power Apps view is:
From the right-hand side pane, click Data
Click Tables
From the list of tables, click your table
Along the top row, click Data
There is an alternative method that allows you to view the Dataverse tables in SSMS – see link below:
https://www.strategy365.co.uk/using-sql-to-query-the-common-data-service/
To import data in bulk:
Click on Data from the top drop-down menu > Get data.
Importing data from Excel is free. To import from other sources, including SQL Server, I believe is a paid service (although I think you may be able to do this on the free Community Plan).

Database optimisation

I'm starting a web application that will be used by a lot of companies (over 20K), and most importantly a lot of information will be recorded daily. I would like your advice on the following idea: create a database for each company to do sql queries like this:
select * from enterprisedb1.tablename;
select * from enterprisedb2.tablename2 where enterprisedb2.tablename2.col='foo'
Pleace i need your advice, i don't find anything on google

If you are selling this to multiple clients then it might come down to separation of their data.
On the one hand everything for the app is in the one database for each client, and provided you get the connection string right you probably don't need to ever specify the company name again for the rest of the app. No more "where customer=123" on every single query.
Also means a client could be deleted, backed up, moved, audited, whatever in a completely independent manner.
And also means there is no risk of a developer or a query accidentally doing cross-client things. So you can even open up to generic query access that still cant accidentally cross a client-to-client border. And security set-up will be simpler.
But if you have a million clients you do end up with a lot of databases. How well this works will depend on all sorts of things, including your database of choice.
You also end up having multiple copies of reference data unless you create an additional database "common" or something like that.
Its going to be very much a "depends" answer, but that's a few things to consider.

I suggest to use common tables for each company. It will better to manage and easy to understand.
Create one table for company data and use Integer reference of that key in another mete data tables. For better performance, Index and Query must be well formed.

How do I reconstruct a database schema from hundreds of separate scripts?

I requested that a client send me a copy of their current MS SQL database. Instead of being given a database backup, or small set of scripts I could use to recreate the database, I was provided with hundreds upon hundreds of individual SQL scripts, and no instructions on the order in which they'd need to be run.
The scripts cannot simply be executed in one batch operation, as there are foreign key dependencies between tables. It appears as though they've limited these scripts to creating a single table or stored procedure per script.
Normally, I'd simply ask a client to provide the information in a more usable format, but they're not known for getting back to us in a timely manner, and our project timeline is already in jeopardy due to delays on their end.
Are there any tools I can use to recreate the database from this enormous set of scripts?

This may sound a bit arcane, but you can do the following, iteratively:
Put all the scripts into a list of "scripts to be run"
Run all the scripts in the "to be run" scripts
Remove the successful runs
Repeate 2-3 until no scripts are left
The scripts with no dependencies will finish in the first round. The ones that depend on them in the next round, and then so on and so on.
I would suggest that you operate all this from a metascript, that uses a database table to store the names of the available scripts.
Good luck.

If you set your folder of scripts as a data source in Red Gate SQL Compare, and specify a blank database as the target, it should allow you to compare and deploy to the target database. This is because the tool is able to read all SQL creation scripts recursively from the folder you specify. This is available as a fully functional 14-day trial, so you can easily test it in your scenario.
http://www.red-gate.com/products/sql-development/sql-compare/

The quickest (and by far the dirtiest) way of (maybe) doing this is to concatenate all of the scripts together, ensuring that you have a GO statement in between each one. Make sure there are no DROP statements in your scripts, or this technique won't work.
Run the concatenated script repeatedly for... I don't know, 10 or so iterations. Chances are you will have their database recreated properly in your test system.
If you're feeling more rigorous, go with Gordon's suggestion. I'm not really aware of a tool which will be able to reconstruct the dependencies, but you may want to take a look at Red-Gate's SQL Compare, which you can get a demo of for free, and can do some pretty magical things.

You can remove all the foreign keys constraints. Then, organize the scripts so that it first creates all the tables, then add back all the foreign keys. Finally create indexes.

Building on Gordon.
Split them up into one table each.
Count the number of FK and sort starting with least first.
Then remove the scripts that run as Gordon suggests.
Another potential problem is that is creates the table and fails on the FK and leaves the table.
You come back later to create the take and the table is already there so it fails.
If you parse them out with Table FKs
Start with Tables with no FK in a list
Then loop thru Tables with FK
Only add to the List if all the FK are already in the List.
If you know .NET then a class with string property table, sting property script, and a property List String of FK property names.
They should parse out pretty clean regex.

Best practices for writing SQL scripts for deployment

I was wondering what are the best practices in order to write SQL scripts to set up databases for production and/or development, for instance:
Should I include the CREATE DATABASE statement?
Should I create users for the database in the same script?
Is correct to disable FK check before executing the body of the script?
May I include the hole script in a transaction?
Is better to generate 1 script per database than one script for all of them?
Thanks!

The problem with your question is is hard to answer as it depends on the way the scripts are used in what you are trying to achieve. you also don't say which DB server you are using as there are tools provided which can make some tasks easier.
Taking your points in order, here are some suggestions, which will probably be very different to everyone elses :)
Should I include the CREATE DATABASE
statement?
What alternative are you thinking of using? If your question is should you put the CREATE DATABASE statement in the same script as the table creation it depends. When developing DB I use a separate create DB script as I have a script to drop all objects and so I don't need to create the database again.
Should I create users for the database in the same script?
I wouldn't, simply because the users may well change but your schema has not. Might as well manage those changes in a smaller script.
Is correct to disable FK check before executing the body of the script?
If you are importing the data in an attempt to recover the database then you may well have to if you are using auto increment IDs and want to keep the same values. Also you may end up importing the tables "out of order" an not want checks performed.
May I include the whole script in a transaction?
Yes, you can, but again it depends on the type of script you are running. If you are importing data after rebuilding a db then the whole import should work or fail. However, your transaction file is going to be huge during the import.
Is better to generate 1 script per database than one script for all of them?
Again, for maintenance purposes it's probably better to keep them separate.

This probably depends what kind of database and how it is used and deployed. I am developing a n-tier standard application that is deployed at many different customer sites.
I do not add a CREATE DATABASE statement in the script. Creating the the database is a part of the installation script which allows the user to choose server, database name and collation
I have no knowledge about the users at my customers sites so I don't add create users statements also the only user that needs access to the database is the user executing the middle tire application.
I do not disable FK checks. I need them to protect the consistency of the database, even if it is I who wrote the body scripts. I use FK to capture my errors.
I do not include the entire script in one transaction. I require from the users to take a backup of the db before they run any db upgrade scripts. For creating of a new database there is nothing to protect so running in a transaction is unnecessary. For upgrades there are sometimes extensive changes to the db. A couple of years ago we switched from varchar to nvarchar in about 250 tables. Not something you would like to do in one transaction.
I would recommend you to generate one script per database and version control the scripts separately.

Direct answers, please ask if you need to expand on any point
* Should I include the CREATE DATABASE statement?
Normally I would include it since you are creating and owning the database.
* Should I create users for the database in the same script?
This is also a good idea, especially if your application uses specific users.
* Is correct to disable FK check before executing the body of the script?
If the script includes data population, then it helps to disable it so that the order is not too important, otherwise you can get into complex scripts to insert (without fk link), create fk record, update fk column.
* May I include the hole script in a transaction?
This is normally not a good idea. Especially if data population is included as the transaction can become quite unwieldy large. Since you are creating the database, just drop it and start again if something goes awry.
* Is better to generate 1 script per database than one script for all of them?
One per database is my recommendation so that they are isolated and easier to troubleshoot if the need arises.

For development purposes it's a good idea to create one script per database object (one script for each table, stored procedure, etc). If you check them into your source control system that way then developers can check out individual objects and you can easily keep track of versions and know what changed and when.
When you deploy you may want to combine the changes for each release into one single script. Tools like Red Gate SQL compare or Visual Studio Team System will help you do that.

Should I include the CREATE DATABASE statement?
Should I create users for the database in the same script?
That depends on your DBMS and your customer.
In an Oracle environment you will probably never be allowed to do such a thing (mainly because in the Oracle world a "database" is something completely different than e.g. in the PostgreSQL or MySQL world).
Sometimes the customer will have a DBA that won't let you create databases (or schemas or users - depending on the DBMS in use). So you will need to supply that information to the DBA in order for him/her to prepare the environment for your script.
May I include the hole script in a transaction?
That totally depends on the DBMS that you are using.
Some DBMS don't support transactional DDL and will implicitely commit any transaction when you execute a DDL statement, so you need to consider the order of your installation script.
For populating the tables with data I would definitely try to do that in a single transaction, but again this depends on your DBMS.
Some DBMS are faster if you commit only once or very seldomly (Oracle and PostgreSQL fall into this category) but will slow down if you commit more often.
Other DBMS handle smaller but more transactions better and will slow down if the transactions get too big (SQL Server and MySQL tend to fall into that direction)

The best practices will differ considerably on whether it is the first time set-up or a new version being pushed. For the first time set-up yes you need create database and create table scripts. For a new version, you need to script only the changes from the previous version, so no create database and no create table unless it is a new table. Now you need alter table statements becasue you don't want to lose the existing data. I do usually write stored procs, functions and views with a drop and create statment as dropping those pbjects doesn't generally affect the underlying data.
I find it best to create all database changes with scripts that are stored in source control under the version. So if a client is new, you run the create version 1.0 scripts, then apply all the other versions in order. If a client is just upgrading from version 1.2 to version 1.3, then you run just the scripts in version 1.3 source control repository. This would also include scripts to populate or add records to lookup tables.
For transactions you may want to break them up into several chunks not to leave a prod database locked in one transaction.
We also write reversal scripts to return to the old version if need be. This makes life easier if you have a part of a change that causes unanticipated problems on prod (usually performance issues).

Ideas for Combining Thousand Databases into One Database

We have a SQL server that has a database for each client, and we have hundreds of clients. So imagine the following: database001, database002, database003, ..., database999. We want to combine all of these databases into one database.
Our thoughts are to add a siteId column, 001, 002, 003, ..., 999.
We are exploring options to make this transition as smoothly as possible. And we would LOVE to hear any ideas you have. It's proving to be a VERY challenging problem.
I've heard of a technique that would create a view that would match and then filter.
Any ideas guys?

Create a client database id for each of the client databases. You will use this id to keep the data logically separated. This is the "site id" concept, but you can use a derived key (identity field) instead of manually creating these numbers. Create a table that has database name and id, with any other metadata you need.
The next step would be to create an SSIS package that gets the ID for the database in question and adds it to the tables that have to have their data separated out logically. You then can run that same package over each database with the lookup for ID for the database in question.
After you have a unique id for the data that is unique, and have imported the data, you will have to alter your apps to fit the new schema (actually before, or you are pretty much screwed).
If you want to do this in steps, you can create views or functions in the different "databases" so the old client can still hit the client's data, even though it has been moved. This step may not be necessary if you deploy with some downtime.
The method I propose is fairly flexible and can be applied to one client at a time, depending on your client application deployment methodology.

Why do you want to do that?
You can read about Multi-Tenant Data Architecture and also listen to SO #19 (around 40-50 min) about this design.

The "site-id" solution is what's done.
Another possibility that may not work out as well (but is still appealing) is multiple schemas within a single database. You can pull common tables into a "common" schema, and leave the customer-specific stuff in customer-specific schema. In some database products, however, the each schema is -- effectively -- a separate database. In other products (Oracle, DB2, for example) you can easily write queries that work in multiple schemas.
Also note that -- as an optimization -- you may not need to add siteId column to EVERY table.
Sometimes you have a "contains" relationship. It's a master-detail FK, often defined with a cascade delete so that detail cannot exist without the parent. In this case, the children don't need siteId because they don't have an independent existence.

Your first step will be to determine if these databases even have the same structure. Even if you think they do, you need to compare them to make sure they do. Chances are there will be some that are customized or missed an upgrade cycle or two.
Now depending on the number of clients and the number of records per client, your tables may get huge. Are you sure this will not create a performance problem? At any rate you may need to take a fresh look at indexing. You may need a much more powerful set of servers and may also need to partion by client anyway for performance.
Next, yes each table will need a site id of some sort. Further, depending on your design, you may have primary keys that are now no longer unique. You may need to redefine all primary keys to include the siteid. Always index this field when you add it.
Now all your queries, stored procs, views, udfs will need to be rewritten to ensure that the siteid is part of them. PAy particular attention to any dynamic SQL. Otherwise you could be showing client A's information to client B. Clients don't tend to like that. We brought a client from a separate database into the main application one time (when they decided they didn't still want to pay for a separate server). The developer missed just one place where client_id had to be added. Unfortunately, that sent emails to every client concerning this client's proprietary information and to make matters worse, it was a nightly process that ran in the middle of the night, so it wasn't known about until the next day. (the developer was very lucky not to get fired.) The point is be very very careful when you do this and test, test, test, and test some more. Make sure to test all automated behind the scenes stuff as well as the UI stuff.

what I was explaining in Florence towards the end of last year is if you had to keep the database names and the logical layer of the database the same for the application. In that case you'd do the following:
Collapse all the data into consolidated tables into one master, consolidated database (hereafter referred to as the consolidated DB).
Those tables would have to have an identifier like SiteID.
Create the new databases with the existing names.
Create views with the old table names which use row-level security to query the tables in the consolidated DB, but using the SiteID to filter.
Set up the databases for cross-database ownership chaining so that the service accounts can't "accidentally" query the base tables in the consolidated DB. Access must happen through the views or through stored procedures and other constructs that will enforce row-level security. Now, if it's the same service account for all sites, you can avoid the cross DB ownership chaining and assign the rights on the objects in the consolidated DB.
Rewrite the stored procedures to either handle the change (since they are now referring to views and they don't know to hit the base tables and include SiteID) or use InsteadOf Triggers on the views to intercept update requests and put the appropriate site specific information into the base tables.

If the data is large you could look at using a partioned view. This would simplify your access code as all you'd have to maintain is the view; however, if the data is not large, just add a column to identify the customer.

Depending on what the data is and your security requirements the threat of cross contamination may be a show stopper.
Assuming you have considered this and deem it "safe enough". You may need/want to create VIEWS or impose some other access control to prevent customers from seeing each-other's data.
IIRC a product called "Trusted Oracle" had the ability to partition data based on such a key (about the time Oracle 7 or 8 was out). The idea was that any given query would automagically have "and sourceKey = #userSecurityKey" (or some such) appended. The feature may have been rolled into later versions of the popular commercial product.

To expand on Gregory's answer, you can also make a parent ssis that calls the package doing the actual moving within a foreach loop container.
The parent package queries a config table and puts this in an object variable. The foreach loop then uses this recordset to pass variables to the package, such as your database name and any other details the package might need.
You table could list all of your client databases and have a flag to mark when you are ready to move them. This way you are not sitting around running the ssis package on 32,767 databases. I'm hooked on the foreach loop in ssis.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas