Finding unused columns

I'm working with a legacy database which, due to poor management and design, has accumulated a wild growth of columns that either have never been used or are no longer being used.
Is it possible to somehow query for column usage? As in, how often a column is being selected (either specifically or with *) or joined on?
It seems to me that this is something we should be able to retrieve somehow, but I have been unable to find anything like it.
Greetings,
F.B. ten Kate

Unfortunately, this analysis on the DB side isn't really going to be a full answer. I've seen a LOT of instances where application code only needed 3 columns of a 10+ column table, but selected them all anyway.
Your column would still show up on a usage report in any sort of trace or profiling you did, but it still may not ACTUALLY be in use.
You might have to either a) analyze the entire collection of apps that use this website or b) start drafting a return-on-investment style doc on whether it's worth rebuilding.

This article will give you a good idea of how to search all fixed code (procedures, views, functions and triggers) for the columns that are used. The code in the article searches for a specific table/column combination. You could easily adapt it to run for all columns. For anything dynamically executed, you'd probably have to set up a profiler trace.
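The idea of searching stored definitions for a column name can be illustrated with a small sketch. The linked article's code is not reproduced here; this version uses Python with an in-memory SQLite database as a stand-in, where view and trigger source text lives in `sqlite_master.sql` (on SQL Server the corresponding text is exposed through catalog views such as `sys.sql_modules`). The `customer` table, its columns and the two views are hypothetical.

```python
import re
import sqlite3


def find_column_references(conn, column):
    """Return (type, name) for each stored view/trigger whose SQL text mentions `column`."""
    # Whole-word match so that "name" does not match "customer_names".
    pattern = re.compile(r"\b" + re.escape(column) + r"\b", re.IGNORECASE)
    rows = conn.execute(
        "SELECT type, name, sql FROM sqlite_master "
        "WHERE type IN ('view', 'trigger') AND sql IS NOT NULL"
    )
    return [(kind, name) for kind, name, sql in rows if pattern.search(sql)]


def demo():
    """Build a hypothetical schema and report which objects reference each column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, fax TEXT, legacy_code TEXT);
        CREATE VIEW customer_names AS SELECT id, name FROM customer;
        CREATE VIEW customer_contacts AS SELECT id, name, fax FROM customer;
    """)
    return {
        col: sorted(find_column_references(conn, col))
        for col in ("name", "fax", "legacy_code")
    }
```

A column with no hits, such as `legacy_code` here, is only a candidate for removal. A text search like this cannot see dynamically built SQL or application code, which is the limitation the other answers describe.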

Even if you could determine whether a column had been used in the past X period of time, would that be good enough? There may be some obscure program out there that populates a column once a week, a month, a year; or once every time they click the mystery button that no one ever clicks, or to log the report that only Fred in accounting ever runs (he quit two years ago), or that gets logged to if that one rare bug happens (during daylight savings time, perhaps?)
My point is, the only way you can truly be certain that a column is absolutely not used by anything is to review everything -- every call, every line of code, every ad hoc Excel data dump, every possible contingency -- everything that references the database. As this may be all but unachievable, try to get a formally defined group of programs and procedures that must be supported, bend over backwards to make sure they are supported, and be prepared to fix things when some overlooked or forgotten piece of functionality turns up.

Related

Search entire database in Toad for searchterm

I need a way to search my entire Oracle database for a column that contains the value 'Beef'. What I need is the column name and table name so I can complete my query. Beef is an animal feed type and it is a known value in my database. I just don't know where.....
Essentially, we have a very old, very clunky application that I am working around by using SQL data sets generated from Toad freeware. The application shows us laboratory testing information for our companies. The catch is you can only look at one company's lab report at a time, and as I said, it takes FOREVER. We have over 700 companies we regulate so this is not an option (oh, and you can't copy any of the fields).
I have already generated a query that gets me 99% of the information I need until I realized I was missing one column value that for some unearthly reason isn't included with the other attributes of the lab samples. We have around 100 or so tables and many of them aren't even in use. It's a poorly organized database and I've tried manually going through it and simply cannot find the stupid column and I have no idea what it could be named (naming conventions here seem not to apply).
A monkey wrench is: although I've done a decent amount of SQL coding for my job I'm not in IT. My job hooked me up with a read-only access of our database so I could run reports for them etc, but I'm not in the IT section so I don't get write privileges. So a lot of solutions I see that use DDL aren't available to me.
I guess it might be relevant, it looks like we're running oraClient10g.
I've tried this code that I got from here: community.Oracle
but as you can see I don't get any results.
I also tried the one suggested here at stackOverflow, but got a litany of errors so I abandoned that pretty quickly. (I figured it's because of my read-only privileges and/or my version of Oracle.)
Any help would be greatly appreciated.
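
One read-only way to approach this is to walk the catalog, and for every table and column run a plain SELECT that looks for the value. The sketch below shows only the shape of that loop, using Python and an in-memory SQLite database as a stand-in. It is not Oracle code: on Oracle the table and column names would instead come from a dictionary view such as `ALL_TAB_COLUMNS`, and the generated SELECTs would need adapting. The `lab_sample` and `feed_lookup` tables are hypothetical.

```python
import sqlite3


def find_value(conn, needle):
    """Return (table, column) pairs in which at least one row equals `needle`."""
    hits = []
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            # Only SELECTs are issued, so read-only access is enough.
            query = f'SELECT 1 FROM "{table}" WHERE "{column}" = ? LIMIT 1'
            if conn.execute(query, (needle,)).fetchone():
                hits.append((table, column))
    return hits


def demo():
    """Hypothetical schema: the value hides in a lookup table, not with the samples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE lab_sample (id INTEGER PRIMARY KEY, company_id INTEGER, feed_code TEXT);
        CREATE TABLE feed_lookup (code TEXT, feed_type TEXT);
        INSERT INTO lab_sample (company_id, feed_code) VALUES (1, 'B1'), (2, 'P7');
        INSERT INTO feed_lookup VALUES ('B1', 'Beef'), ('P7', 'Poultry');
    """)
    return find_value(conn, "Beef")
```

Scanning every column of roughly 100 tables this way is slow but needs no DDL, which fits the read-only constraint described above.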

Relational DB's View of View of View AntiPattern?

I have inherited a database that's causing me issues.
I need to describe something horrible to stakeholders. So far, using the names of anti-patterns and sending people away to do a Google search on them has been the most efficient way to buy me some time.
Trouble is, I have not come across this before. Here's what's happening.
I have a simple single table, with a couple of columns. One of these columns contains values like:
660x90_SomeCity_SomeCountryISO_ImageName_SomeRubbish
or
SomeIataAirportCode_SomeCountry_660x90_SomeRubbish_ImageName
Now the database contains an (admittedly so far and on current data) faultless logic to extract and lookup things so that the output has additional columns such as:
AdSize
Country
City
The trouble is that this is achieved through gradual conversions implemented in a labyrinth of 50 (not joking) different views. I've now got to formalize the logic to something like
View One: Extract the first column and work out the length of it.
View Two: Now split off the 2nd column using the length.
View Three: If after replacing the x in the first column the value is numeric, store the value in "AdSize", and place the second value in the "CityCandidateOne" column.
To me this is a horrible antipattern and should all be done either in custom functions, or preferably during the ETL process, in one place so the logic can be captured.
However I'm not being given the time and wonder if this is a known anti pattern. Usually I can then use the credibility of a Google search to buy a little time to really sort this out.
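For comparison, a minimal sketch of what "in one place" could look like: a single function that splits the value and classifies each token in one pass, written in Python as it might appear in an ETL step. The real rules spread across the 50 views are not known here, so the matching rules, the `countries` and `cities` lookup sets, and the sample values are all assumptions for illustration.

```python
import re

# An ad size is assumed to look like "<digits>x<digits>", e.g. "660x90".
AD_SIZE = re.compile(r"^\d+x\d+$")


def parse_ad_name(value, countries, cities):
    """Split an underscore-delimited name and classify its tokens in one pass.

    `countries` and `cities` are hypothetical lookup sets standing in for
    whatever reference tables the existing views consult.
    """
    result = {"AdSize": None, "Country": None, "City": None, "Unmatched": []}
    for token in value.split("_"):
        if AD_SIZE.match(token) and result["AdSize"] is None:
            result["AdSize"] = token
        elif token in countries and result["Country"] is None:
            result["Country"] = token
        elif token in cities and result["City"] is None:
            result["City"] = token
        else:
            # Keep whatever could not be classified so nothing is silently lost.
            result["Unmatched"].append(token)
    return result
```

Because the token order does not matter to this function, both layouts shown above go through the same code path, and the whole rule set can be read and tested in one screen.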

I'd start with this answer which covers the violation of First Normal Form.
I also found this free ebook that might be of value.
I understand that what you are facing is something on a grander scale than just putting a couple of values in a field with a comma or other token to separate them, but I don't know of any antipattern that covers such a baroque mess.
Finally, here you can find more about "replacing SQL logic with Views" as an antipattern (just look for "Views as SQL Building Blocks Anti-Pattern" in the article), but take into account that in this case the problem seems to be about inefficient access to the data.
Last minute edit: maybe this is just a special case of the general Golden Hammer antipattern? (see also: http://en.wikipedia.org/wiki/Golden_hammer)

Why not simply rewrite the SQL how you would rather do it, then print out the execution plans of both, and show the performance and timing of both? That should be enough to show them that it needs to change (and if there is no major performance difference, then your only other argument can be one of maintainability, and that's something you're going to have to argue by showing them what it takes to make changes).

Is it good practice to count on the file system as a database?

I'm working on an ASP.net web application that uses SQL as a database back-end. One issue that I have is that it sometimes takes a while to get my DBA to create or modify tables in the database which under no circumstance am I allowed to modify on my own.
Here is something that I do when I expect users to upload files with their data.
Suppose the user uploads a new record for a table called Student_Records. The user uploads a record with fname Bob and lname Smith. The record is assigned primary key 123. The user also uploads two files: attendance_record.pdf and homework_record.pdf. Let's suppose that I have a network share: \\foo\bar where the files are saved.
One way of handling this situation would be to have a table Student_Records_Files that associates the key 123 with Bob Smith. However, since I have trouble getting tables created, I've gone and done something different: When I save the files on the server, I call them 123_attendance_record.pdf and 123_homework_record.pdf. That way, I can easily identify what table record each file is associated with without having to create a new SQL table. I am, in essence, using the file system itself as a join table (obviously, the file system is a type of database).
In my code for retrieving the files, I scan the directory \\foo\bar and look for files that begin with each primary key number from Student_Records.
It seems to work very well, but is it good practice?
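The lookup described above might look like the following sketch. It is written in Python rather than ASP.NET and uses a temporary directory in place of the \\foo\bar share, so the function name and the extra sample files are assumptions. It compares the whole prefix before the first underscore instead of using a plain "starts with" test, because a plain prefix match for key 123 would also pick up files belonging to key 1234.

```python
import os
import tempfile


def files_for_key(directory, key):
    """Return file names whose prefix before the first underscore equals `key`."""
    wanted = str(key)
    matches = []
    for name in sorted(os.listdir(directory)):
        prefix, separator, _rest = name.partition("_")
        # Require an underscore and an exact prefix match, not just startswith().
        if separator and prefix == wanted:
            matches.append(name)
    return matches


def demo():
    """Create a throwaway directory with a few uploads and look up key 123."""
    names = [
        "123_attendance_record.pdf",
        "123_homework_record.pdf",
        "1234_attendance_record.pdf",  # different student, shares the "123" prefix
        "12_notes.pdf",
    ]
    with tempfile.TemporaryDirectory() as share:
        for name in names:
            open(os.path.join(share, name), "w").close()
        return files_for_key(share, 123)
```

Note that every lookup lists the whole directory, so the cost grows with the total number of uploaded files rather than with the number of files for one record.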

There is nothing wrong with using the file system to store files. It's what it is used for.
There are a few things to keep in mind though.
I would consider a better method of storing the files - perhaps a directory for each user, rather than simply appending the user id to the filename.
Ensure that the file store is resilient and backed up with the same regularity as your database. If your database is configured to give you a backup every 10 minutes, but your file store only does a backup every day (or worse, every week), then you might be in for a world of pain.
Also consider what would happen if the user uploads two documents that have the same name.

First of all, I think it's a bad practice, in general, to design your architecture based on how responsive your DBA is. Any given compromise based on this approach may or may not be a big deal, but over time it will result in a poorly designed system.
Second, making the file name this critical seems dangerous to me; there's no protection against a person or application modifying the filename without realizing its importance.
Third, one of the advantages of having a table to maintain the join between the person and the file is that you can add additional data, such as: when was the file uploaded, what is the MIME type, has the file been read by anyone through the system, is this file a newer version of a previous file, etc. etc. Metadata can be very powerful, and the filesystem offers only limited ways to store it.
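As an illustration of the third point, here is a minimal sketch of a join table that carries metadata, using Python and an in-memory SQLite database. The table names Student_Records and Student_Records_Files come from the question; every column in the files table, and the stored paths, are assumptions.

```python
import sqlite3


def build_demo():
    """Create the two tables and register Bob Smith's two uploads."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE Student_Records (
            id INTEGER PRIMARY KEY,
            fname TEXT NOT NULL,
            lname TEXT NOT NULL
        );
        CREATE TABLE Student_Records_Files (
            file_id INTEGER PRIMARY KEY,
            student_id INTEGER NOT NULL REFERENCES Student_Records(id),
            stored_path TEXT NOT NULL UNIQUE,
            original_name TEXT NOT NULL,
            mime_type TEXT,
            uploaded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
    """)
    conn.execute(
        "INSERT INTO Student_Records (id, fname, lname) VALUES (123, 'Bob', 'Smith')"
    )
    conn.executemany(
        "INSERT INTO Student_Records_Files "
        "(student_id, stored_path, original_name, mime_type) VALUES (?, ?, ?, ?)",
        [
            (123, "123/attendance_record.pdf", "attendance_record.pdf", "application/pdf"),
            (123, "123/homework_record.pdf", "homework_record.pdf", "application/pdf"),
        ],
    )
    conn.commit()
    return conn


def files_for_student(conn, student_id):
    """Return (original_name, mime_type) for each file linked to the student."""
    return conn.execute(
        "SELECT f.original_name, f.mime_type "
        "FROM Student_Records_Files f "
        "JOIN Student_Records s ON s.id = f.student_id "
        "WHERE s.id = ? ORDER BY f.file_id",
        (student_id,),
    ).fetchall()
```

With the foreign key in place, a file row cannot point at a student that does not exist, and renaming a file on disk no longer changes which record it belongs to.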

There are really two questions here. One is, given that for administrative reasons you cannot get changes made to the database schema, is it acceptable to devise some workaround. To that I'd have to say yes. What else can you do? In theory, if it takes two weeks to get the DBA to make a schema change for you, then this two weeks should be added to any deadline that you are given. In practice, this almost never happens. I've often worked places where some paperwork or whatever required two weeks before I could even begin work, and then I'd be given two weeks and one day to do the project. Sometimes you just have to put it together with rubber bands and bandaids.
Two is, is it a good idea to build a naming convention into file names and use this to identify files and their relationship to other data. I've done this at times and it's generally worked for me, though I have a perhaps irrational emotional feeling that it's not a good idea.
On the plus side, (a) By building information into a file name, you make it easy for both the computer and a human being to identify file associations. (Human readable as long as the naming convention is straightforward enough, anyway.) (b) By eliminating the separate storage of a link, you eliminate the possibility of a bad link. A file with the appropriate name may not exist, of course, but a database record with appropriate keys may not exist, or the file reference in such a record may be null or invalid. So it seems to solve one problem there without creating any new problems.
Potential minuses are: (a) You may have characters in the key that are not legal in file names. You may be able to just strip such characters out, or this may cause duplicates. The only safe thing to do is to escape them in some way, which is a pain. (b) You may exceed the legal length of a file name. Not as much of an issue as it was in the bad old 8.3 days. (c) You can't share files. If a database record points to a file, then two db records could point to the same file. If you must make two copies of a file, not only does this waste disk space, but it also means that if the file is updated, you must be sure to update all copies. If in your application it would make no sense to share files, then this isn't an issue.
You have to manage the files in some way, but you had to do that anyway.
I really can't think of any overriding minuses. As I say, I've done this on occasion and didn't run into any particular problems. I'm interested in seeing others' responses.

I think it is not good practice because you are making your working application very dependent on specific implementation details and it would make it pretty hard to work with in the future to maintain, or if other people later needed access to your code/api.
Now whether you should do this or not is a whole different question. If you are really taking that much of a performance hit and it is significantly easier to work with how you have it, then I would say go ahead and break the rules. Ideally it's good to follow best practice methods, but sometimes you have to bend the rules a little to make things work.

First, why is this a table change as opposed to a data change? Once you have the tables set up you should only need to update rows in that table every time that a user adds new files. If you have to put up with this one-time, two-week delay then bite the bullet and just get it done right.
Second, instead of trying to work around the problem why don't you try to fix the problem? Why is the process of implementing table changes so slow? Are you at least able to work on a development database (in which you have control to test and try out these changes)? Even if it's your own laptop you can at least continue on with development. Work with your manager, the DBA, and whoever else you need to, in order to improve the process. Would it help to speed things up if your scripts went through a formal testing process before you handed them off to the DBA so that he doesn't need to test the scripts, etc. himself?
Third, if this is a production database then you should probably be building in this two-week delay into your development cycle. You know that it takes two weeks for the DBA to review and implement changes in production, so make sure that if you have a deadline for releasing functionality that you have enough lead time for it.
Building this kind of "data" into a filename has inherent problems as others have pointed out. You have no relational integrity guarantees and the "data" can be changed without knowledge of the rest of the application/database.

It's best to keep everything in the database.
Network file I/O is spotty at best. In addition, it's slower than the DB I/O.
If the DBA is difficult in getting small changes into the database, you may be dealing with:
A political control issue. Maybe he just knows DB stuff and is threatened when he perceives others moving in on his turf. Whatever his reasons, you need to GET WORK DONE. Period. Document all the extra time / communication / work you need to do for each small change and take that up with the management. If the first level of management is unwilling to see things your way (it does not matter what their reasons are), escalate the issue to the next level of management. In the past, I've gotten results this way. It was more of a political territory problem than a technical problem. The DBA eventually gave up and gave me full access to the TEST system BUT he also stipulated that I would need to learn his testing process, naming convention, his DB standards and practices, his way of testing, etc. I was game. I would also need to fix any database problems arising from changes I introduced. This was fair and I got to wear the DBA hat in addition to the developer hat. I got the freedom I needed and he got one less thing to worry about.
A process issue. Maybe the DBA needs to put every small DB change you submit through a gauntlet of testing and performance analysis. Maybe he has a highly normalized DB schema and because he has the big picture, he needs to normalize or denormalize your requested DB changes to fit into the existing schema. Ask to work with him. Ask him for a full DB design diagram. Get a good sense of his DB design philosophy. Implement your DB changes with his DB design philosophy in mind. Show that you understand that he's trying to keep the DB in good order (understand normalization, relational constraints, check constraints). Give him less to worry about. He needs to trust that you will not muck up his database.
Accumulate all the small changes into a lengthy script and submit them to the DBA. This way, you won't have to wait for each small change to go through all of his process / testing. In addition, you're giving him a bigger picture view of your development planning (that is in step with his DB design philosophy) instead of just the play by play.

How do you think while formulating Sql Queries. Is it an experience or a concept?

I have been working on SQL Server and front-end coding, and I usually have problems formulating queries.
I understand most of the SQL concepts needed to formulate queries, but whenever some new functionality comes up that can be done with a SQL query, I usually fail to work it out.
I am very comfortable with SELECT queries using joins and the like, but when it comes to DML operations I usually fail.
I am usually uncomfortable writing any query that I have never written before. Whenever I go for an interview I face this problem.
Is there some concept behind the approach to formulating SQL queries?
Eg.
I need to create an sql query such that
A table contain single column having duplicate record. I need to remove duplicate records.
I know I can find the solution to this query very easily by Googling, but I want to know how everyone arrives at the desired result.
Is it a case of practice makes perfect, i.e. once you have done it you will be able to formulate it next time, or is there some logic or concept behind it?
I could have got an answer to the above problem simply by posting it on Stack Overflow, and I would have had one within 5 to 10 minutes, but I want to know the reasoning. How do you work on any new kind of query? Is it mainly a matter of experience, or an application of concepts?
Whenever I learn something new in coding, I try to use it wherever I can. But here the situation seems different, perhaps because I am lacking some concepts.
EDIT
How could I test my knowledge and concepts in SQL and related SQL queries?

Typically, the first time you need to open a child proof bottle of pills, you have a hard time, but after that you are prepared for what it might/will entail.
So it is with programming (methinks).
You find problems, research best practices, and beat your head against a couple of rocks, but in the process you will come to have a handy set of tools.
Also, reading what others tried/did is a good way to avoid major obstacles.
All in all, with a lot of practice/coding, you will see patterns quicker, and learn to notice where to make use of what tool.

I have a somewhat methodical method of constructing queries in general, and it is something I use elsewhere with any problem solving I need to do.
The first step is ALWAYS listing out any bits of information I have in a request. Information is essentially anything that tells me something about something.
A table contain single column having duplicate record. I need to remove duplicate
I have a table (I'll call it table1)
I have a column on table table1 (I'll call it col1)
I have duplicates in col1 on table table1
I need to remove duplicates.
The next step of my query construction is identifying the action I'll take from the information I have.
I'll look for certain keywords (e.g. remove, create, edit, show, etc...) along with the standard insert, update, delete to determine the action.
In the example this would be DELETE because of remove.
The next step is isolation.
Answer the question "the action determined above should only be valid for ______..?" This part is almost always the most difficult part of constructing any query because it's usually abstract.
In the above example you're listing "duplicate records" as a piece of information, but that's really an abstract concept of something (anything where a specific value is not unique in usage).
Isolation is also where I test my action using a SELECT statement.
Every new query I run gets thrown through a select first!
The next step is execution, or essentially the "how do I get this done" part of a request.
A lot of times you'll figure the how out during the isolation step, but in some instances (yours included) how you isolate something and how you fix it are not the same thing.
Showing duplicated values is different than removing a specific duplicate.
The last step is implementation. This is just where I take everything and make the query...
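The steps above can be sketched end to end for the duplicate example, using the names table1 and col1 from the list. This is one possible implementation under stated assumptions, not the only one: it runs through Python against an in-memory SQLite database and leans on SQLite's implicit `rowid` to tell otherwise identical rows apart. Other engines would need a primary key or a window function instead.

```python
import sqlite3

# Isolation: a SELECT that shows exactly the rows the DELETE would remove.
# For each col1 value, the row with the lowest rowid is kept.
ISOLATE = """
    SELECT rowid, col1 FROM table1
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM table1 GROUP BY col1)
    ORDER BY rowid
"""

# Execution: the same WHERE clause, with the action swapped to DELETE.
DELETE = """
    DELETE FROM table1
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM table1 GROUP BY col1)
"""


def demo():
    """Return (rows flagged for deletion, values remaining after the delete)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (col1 TEXT)")
    conn.executemany(
        "INSERT INTO table1 (col1) VALUES (?)",
        [("a",), ("b",), ("a",), ("c",), ("b",), ("a",)],
    )
    doomed = conn.execute(ISOLATE).fetchall()  # inspect before deleting anything
    conn.execute(DELETE)
    conn.commit()
    remaining = [row[0] for row in conn.execute("SELECT col1 FROM table1 ORDER BY col1")]
    return doomed, remaining
```

Running the SELECT first shows three rows flagged out of six, one fewer than the total number of copies for each duplicated value, which is the check that the isolation is right before anything is removed.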
Summing it all up... for me to construct a query I'll pick out all information that I have in the request. Using the information I'll figure out what I need to do (the action), and what I need to do it on (isolation). Once I know what I need to do with what I figure out the execution.
Every single time I'm starting a new "query" I'll run it through these general steps to get an idea for what I'm going to do at an abstract level.
For specific implementations of an actual request you'll have to have some knowledge (or access to google) to go further than this.
Kris

I think in the same way I cook dinner. I have some ingredients (tables, columns etc.), some cooking methods (SELECT, UPDATE, INSERT, GROUP BY etc.) then I put them together in the way I know how.
Sometimes I will do something weird and find it tastes horrible, or that it is amazing.
Occasionally I will pick up new recipes from the internet or friends, then use parts of these in my own.
I also save my recipes in handy repositories, broken down into reusable chunks.

On the "Delete a duplicate" example, I'd come to the result by googling it. This scenario is so rare if the DB is designed properly that I wouldn't bother keeping this information in my head. Why bother, when there is a good resource available for me to look it up when I need it?
For other queries, it really is practice makes perfect.
Over time, you get to remember frequently used patterns just because they ARE frequently used. Rare cases should be kept in a reference material. I've simply got too much other stuff to remember.

Find good documentation for your software. I use MySQL a lot, and MySQL has an excellent documentation site with a decent search function, so you get many answers just by reading the docs. If you do NOT get your answer, at least you are learning something.
Then I set up an example database (or use the one I am working on) and gradually build my SQL. I tend to separate the problem into small pieces and solve it step by step. This works very well if you are building queries with many JOINs: it is best to start with one particular case and "pollute" your SQL with many conditions like WHERE id = "123", which you take out as you work towards your solution.
The best and fastest way to learn good SQL is to work with someone else, preferably someone who knows more than you, but that is not a necessary condition. It can be replaced by studying mature code written by others.

Your example is a test of how well you understand the DISTINCT keyword and the GROUP BY clause, which are SQL's ways of dealing with duplicate data.
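A short sketch of the two constructs side by side, run through Python against an in-memory SQLite database. The names table1 and col1 and the sample values are assumptions. DISTINCT collapses duplicates in the output, while GROUP BY with HAVING reports which values are duplicated and how often. Neither one removes rows from the table.

```python
import sqlite3


def demo():
    """Return (distinct values, duplicated values with their counts)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (col1 TEXT)")
    conn.executemany(
        "INSERT INTO table1 (col1) VALUES (?)",
        [("a",), ("b",), ("a",), ("c",), ("b",), ("a",)],
    )
    # DISTINCT: each value once, duplicates hidden from the result set.
    distinct = [
        row[0]
        for row in conn.execute("SELECT DISTINCT col1 FROM table1 ORDER BY col1")
    ]
    # GROUP BY + HAVING: only the values that occur more than once.
    duplicated = conn.execute(
        "SELECT col1, COUNT(*) FROM table1 "
        "GROUP BY col1 HAVING COUNT(*) > 1 ORDER BY col1"
    ).fetchall()
    return distinct, duplicated
```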

Examples and experience. You look at other people's examples and you create your own code, and once you grok it, you don't need to think about it again.

I would have a look at the Mere Mortals book - I think it's the one by Hernandez. I remember that when I first started seriously with SQL Server 6.5, moving from manual ISAM databases and Access database systems using VB4, that it was difficult to understand the syntax, the joins and the declarative style. And the SQL queries, while powerful, were very intimidating to understand - because typically, I was looking at generated code in Microsoft Access.
However, once I had developed a relatively systematic approach to building queries in a consistent and straightforward fashion, my skills and confidence quickly moved forward.

From seeing your responses you have two options.
Have a copy of the specification for whatever you're working on (the SQL spec and the documentation for the SQL implementation: SQLite, SQL Server, etc.).
Use Google, SO, Books, etc.. as a resource to find answers.
You can't formulate an answer to a problem without doing one of the above. The first option is to become well versed in the capabilities of whatever you are working on.
The second option allows you to find answers that you may not even fully know how to ask. Your example is fairly simplistic, so if you read the spec/implementation documentation you would know the answer right away. But there are times when, even if you read the spec/documentation, you don't know the answer. You only know that it IS possible, just not how to do it.
Remember that as far as jobs and supervisors go, being able to resolve a problem is important, but the faster you can do it the better which can often be done with option 2.

What are your best practices for ensuring the correctness of the reports from SQL?

Part of my work involves creating reports and data from SQL Server to be used as information for decision. The majority of the data is aggregated, like inventory, sales and costs totals from departments, and other dimensions.
When I am creating the reports, and more specifically, I am developing the SELECTs to extract the aggregated data from the OLTP database, I worry about mistaking a JOIN or a GROUP BY, for example, returning incorrect results.
I try to use some "best practices" to prevent me from "generating" wrong numbers:
When creating an aggregated data set, always explode this data set without the aggregation and look for any obvious error.
Export the exploded data set to Excel and compare the SUM(), AVG(), etc, from SQL Server and Excel.
Involve the people who would use the information and ask for some validation (ask people to help to identify mistakes on the numbers).
Never deploy those things in the afternoon - when possible, try to take a look at the T-SQL on the next morning with a refreshed mind. I had many bugs corrected using this simple procedure.
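The first two checks in the list can be sketched in a few lines. The example below uses Python with an in-memory SQLite database instead of SQL Server and Excel, and the `sales` and `dept_manager` tables are hypothetical. It shows the kind of mistake the checks catch: a JOIN against a table with more than one row per department silently multiplies the amounts.

```python
import sqlite3
from collections import Counter

# The aggregated report as it might be written, with an unnoticed fan-out join.
REPORT = """
    SELECT s.dept, SUM(s.amount)
    FROM sales s JOIN dept_manager m ON m.dept = s.dept
    GROUP BY s.dept ORDER BY s.dept
"""

# The same query "exploded": no aggregation, one output row per joined row.
EXPLODED = """
    SELECT s.id, s.dept, s.amount, m.manager
    FROM sales s JOIN dept_manager m ON m.dept = s.dept
    ORDER BY s.id
"""


def build():
    """Hypothetical data: the toys department has two managers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (id INTEGER PRIMARY KEY, dept TEXT, amount INTEGER);
        CREATE TABLE dept_manager (dept TEXT, manager TEXT);
        INSERT INTO sales (dept, amount) VALUES ('toys', 100), ('toys', 50), ('food', 70);
        INSERT INTO dept_manager VALUES ('toys', 'Ann'), ('toys', 'Raj'), ('food', 'Li');
    """)
    return conn


def check(conn):
    """Compare the report total with the base table and list repeated sale ids."""
    report_total = sum(total for _dept, total in conn.execute(REPORT))
    base_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
    seen = Counter(row[0] for row in conn.execute(EXPLODED))
    repeated_ids = sorted(sale_id for sale_id, n in seen.items() if n > 1)
    return report_total, base_total, repeated_ids
```

When the two totals disagree, the repeated ids in the exploded set point directly at the rows the join duplicated.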
Even with those procedures, I always worry about the numbers.
What are your best practices for ensuring the correctness of the reports?

Have you considered filling your tables with test data that produces known results and comparing your query results with your expected results?

Signed, in writing
I've found that one of the best practices is that both the reader/client and the developers are on the same (documented) page. That way, when mysterious numbers appear (and they do), I can point to the specification in writing and say, "This is why you see this number. Would you like it to be different?".
Test, test, test
For seriously complicated reports, we went through test data up and down with the client, until all the numbers were correct and the client was satisfied.
Edge Cases
We discovered a seriously complicated case in our reporting system that turned everything upside down (on our end). What if the user generates a report (say Year-End 2009), enters data for the new year, and then comes back to generate the same report? The data has changed but that report should not. Thinking and working these cases out can save a lot of heartache.

Write some automated tests.
We have quite a lot of reporting services reports - we test them using Selenium. We use a test data page to squirt some known data into an empty database, then run the report and assert that the numbers are as expected.
The builds run every time we check in, so we know we haven't done anything too stupid.
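The same pattern, known data into an empty database followed by assertions on the report numbers, can be sketched without Selenium or Reporting Services. The version below is a stand-in using Python's `unittest` and an in-memory SQLite database. The report query, the `sales` table and the figures are assumptions.

```python
import sqlite3
import unittest

# Hypothetical report query under test.
REPORT_SQL = "SELECT dept, SUM(amount) FROM sales GROUP BY dept ORDER BY dept"


def run_report(conn):
    """Run the report and return its rows as a list of (dept, total) tuples."""
    return conn.execute(REPORT_SQL).fetchall()


class ReportTest(unittest.TestCase):
    def setUp(self):
        # Start from an empty database on every test, then load known data.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE sales (dept TEXT, amount INTEGER)")

    def tearDown(self):
        self.conn.close()

    def test_known_totals(self):
        self.conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [("toys", 100), ("toys", 50), ("food", 70)],
        )
        self.assertEqual(run_report(self.conn), [("food", 70), ("toys", 150)])

    def test_empty_database_gives_empty_report(self):
        self.assertEqual(run_report(self.conn), [])
```

Hooked into the build, a suite like this fails as soon as a change to the query alters a number that was previously agreed to be correct.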
