How do I tell if data formatting inconsistencies are themselves consistent?

How do I tell if data formatting inconsistencies are themselves consistent? - sql

A client has sent us a couple of tables that we need to be able to cross reference against each other. Unfortunately, the columns that we need to use for cross-referencing have inconsistently formatted data.
However, it looks like they are inconsistent in a consistent way. I.e., in one column, there is a "name", and in the other column there is a name preceded by a 4 digit id code and a space, so "1234 name".
If it's true that the inconsistencies are consistent, then we can use the data as is just by calling the MySQL substring function. But I'm not convinced. How do I know for certain whether the inconsistencies are universal? What if there are other inconsistencies that I'm not seeing?
What I need to know is, do all the unique values in columnA = all the unique values in substring(columnB, 6).
I'm not great at MySQL and have tried a few queries, but they've either returned all results (not distinct ones) or they've gotten "interrupted" on the server since there's a lot of data and they take forever to run. Help?

You can do this with a not exists clause:
select t1.*
from t1
where not exists (select 1
from t2
where t2.name = substring(t1.columnB, 6)
);
This will identify all t1.columnB that do not have a matching name in t2 by the rule you have given.

This is a horrible problem to solve - especially if you're not familiar with SQL.
In principle, I always treat this kind of data as "untrusted" - whatever rules you think apply usually turn out to be false over time.
My strategy is to use the "dirty" data to populate similar "clean" tables by running SQL queries, rather than try to work with the "dirty" data directly.
So, you might create two tables with the schema you think works best< and then populate that table by inserting substring(columnB, 6) into that table. By adding a where clause (e.g. isnumeric(substring(t1.columnB, 6))), you can validate your assumptions.
Once you have "clean" tables, you can perform the join easily.

Related

Check against a list of values in the database

Let's say TableA has a field called, FieldA. FieldA consists of a list of values with comma being the delimiter (i.e. 1,2,3,4). #MyVar# contains the value (a single value, NOT a list) I want to search for. Other than doing the followings,
SELECT *
FROM TableA
WHERE FieldA LIKE '%,#MyVar#,%'
OR FieldA LIKE '#MyVar#,%'
OR FieldA LIKE '%,#MyVar#'
Is there a SQL command/reserved word to perform the equivalent of the SQL statement above?

You have a badly designed, denormalized database. If at all possible, fix the design.
If that's not possible then consider the nature of the MyVar values. If they cannot possibly nest inside each other (for instance, they're all guaranteed to be the same number of characters) you can get away with a single like. If not, your solution is about as good as you'll get.

As per Pounding A Nail: Old Shoe or Glass Bottle?:
This is quite possibly the worst way
you could store this data in a
database. No, seriously. They had
better ways of doing this over thirty
years ago. You have created a horrible
problem that is only starting to rear
its ugly head. It will only continue
to get horribly worse, costing your
company more and more money to
maintain.
The answer is to make a new table, with a one-to-many relationship between the main ID and the different values. Then it's a much simpler query:
SELECT MainID
FROM NewTable
WHERE FieldA = #Value
This is a much easier solution to code against, not to mention understand. And it will help significantly when you have to join this table to others.
EDIT: From your responses, I understand that doing this would be a serious amount of work. However, you're already running into the issues that are arising from the current setup. Doing this will take time up front, but it will make future development significantly faster in the long run.

There isn't any need to include the several ORs or commas. This query should do what you need.
SELECT *
FROM TableA
WHERE FieldA LIKE '%#MyVar#%'
If the variable appears anywhere in the field, the row will be returned. You don't have to try to compare it to each of your "list" items in each field.
You're kind of abusing database design by including a list of values in a single field. There isn't any built-in SQL function that deals with this kind of data, that I know of.

Put your delimiters on the left side of the like comparison too and then make sure you include your delimiters on the right side. Like this:
SELECT *
FROM TableA
WHERE ',' + FieldA + ',' LIKE '%,#MyVar#,%'

This actually should do it in MySQL... correct or am I missing something?
where find_in_set('1', field_with_a_list_of_numbers)

You can use IN to check an list:
SELECT * FROM TableA WHERE FieldA IN '%,#MyVar#,%'

No, in a word. In a normalized database you would usually split these list items into a separate table, but there are times where is this is either not possible (user has no ability to alter the database structure) or undesired (budget limitations).
you can either
do something like durilai suggests
(not performant for large amounts of
data, but probably the easiest to
implement)
use a trigger to split the entries in
your on arrival (i.e. on INSERT) into a subtable (SubTableA) and
then do
:-
select * from TableA a where exists (select 1 from SubTableA s on s.a_id = a.id
and list_element like '%#MyVar#%')

Why are positional queries bad?

I'm reading CJ Date's SQL and Relational Theory: How to Write Accurate SQL Code, and he makes the case that positional queries are bad — for example, this INSERT:
INSERT INTO t VALUES (1, 2, 3)
Instead, you should use attribute-based queries like this:
INSERT INTO t (one, two, three) VALUES (1, 2, 3)
Now, I understand that the first query is out of line with the relational model since tuples (rows) are unordered sets of attributes (columns). I'm having trouble understanding where the harm is in the first query. Can someone explain this to me?

The first query breaks pretty much any time the table schema changes. The second query accomodates any schema change that leaves its columns intact and doesn't add defaultless columns.
People who do SELECT * queries and then rely on positional notation for extracting the values they're concerned about are software maintenance supervillains for the same reason.

While the order of columns is defined in the schema, it should generally not be regarded as important because it's not conceptually important.
Also, it means that anyone reading the first version has to consult the schema to find out what the values are meant to mean. Admittedly this is just like using positional arguments in most programming languages, but somehow SQL feels slightly different in this respect - I'd certainly understand the second version much more easily (assuming the column names are sensible).

I don't really care about theoretical concepts in this regard (as in practice, a table does have a defined column order). The primary reason I would prefer the second one to the first is an added layer of abstraction. You can modify columns in a table without screwing up your queries.

You should try to make your SQL queries depend on the exact layout of the table as little as possible.
The first query relies on the table only having three fields, and in that exact order. Any change at all to the table will break the query.
The second query only relies on there being those three felds in the table, and the order of the fields is irrelevant. You can change the order of fields in the table without breaking the query, and you can even add fields as long as they allow null values or has a default value.
Although you don't rearrange the table layout very often, adding more fields to a table is quite common.
Also, the second query is more readable. You can tell from the query itself what the values put in the record means.

Something that hasn't been mentioned yet is that you will often be having a surrogate key as your PK, with auto_increment (or something similar) to assign a value. With the first one, you'd have to specify something there — but what value can you specify if it isn't to be used? NULL might be an option, but that doesn't really fit in considering the PK would be set to NOT NULL.
But apart from that, the whole "locked to a specific schema" is a much more important reason, IMO.

SQL gives you syntax for specifying the name of the column for both INSERT and SELECT statements. You should use this because:
Your queries are stable to changes in the column ordering, so that maintenance takes less work.
The column ordering maps better to how people think, so it's more readable. It's more clear to think of a column as the "Name" column rather than the 2nd column.

I prefer to use the UPDATE-like syntax:
INSERT t SET one = 1 , two = 2 , three = 3
Which is far easier to read and maintain than both the examples.

Long term, if you add one more column to your table, your INSERT will not work unless you explicitly specify list of columns. If someone changes the order of columns, your INSERT may silently succeed inserting values into wrong columns.

I'm going to add one more thing, the second query is less prone to error orginally even before tables are changed. Why do I say that? Becasue with the seocnd form you can (and should when you write the query) visually check to see if the columns in the insert table and the data in the values clause or select clause are in fact in the right order to begin with. Otherwise you may end up putting the Social Security Number in the Honoraria field by accident and paying speakers their SSN instead of the amount they should make for a speech (example not chosen at random, except we did catch it before it actually happened thanks to that visual check!).

'SELECT *' from inner joined tables

How do you select all fields of two joined tables, without having conflicts with the common field?
Suppose I have two tables, Products and Services. I would like to make a query like this:
SELECT Products.*, Services.*
FROM Products
INNER JOIN Services ON Products.IdService = Services.IdService
The problem with this query is that IdService will appear twice and lead to a bunch of problems.
The alternative I found so far is to discriminate every field from Products except the IdService one. But this way I'll have to update the query every time I add a new field to Products.
Is there a better way to do this?

What are the most common SQL anti-patterns?
You've hit anti-pattern #1.
The better way is to provide a fieldlist. One way to get a quick field list is to
sp_help tablename
And if you want to create a view from this query - using select * gets you in more trouble. SQL Server captures the column list at the time the view is created. If you edit the underlying tables and don't recreate the view - you're signing up for trouble (I had a production fire of this nature - view was against tables in a different database though).

You should NEVER have SELECT * in production code (well, almost never, but the times where it is justified can be easily counted).

As far as I am aware you'll have to avoid SELECT * but this't really a problem.
SELECT * is usually regarded as a problem waiting to happen for the reason you quote as an advantage! Usually extra results columns appearing for queries when the database has been modified will cause problems.

Does your dialect of SQL support COMPOSE? COMPOSE gets rid of the extra copy of the column that's used on an equijoin, like the one in your example.

As others have said the Select * is bad news especially if other fields are added to the tables in which you are querying. You should select out the exact fields you want from the tables and can use an alias for fields with the same names or just use table.columnName.

Do not use *. Use somthing like this:
SELECT P.field1 AS 'Field from P'
, P.field2
, S.field1 AS 'Field from S'
, S.field4
FROM Products P
INNER JOIN
Services S
ON P.IdService = S.IdService

That would be correct, list the fields you want (in SQL Server you can drag them over from the object browser, so you don't have to type them all). Incidentally, if there are fields your specific query doe not need, do not list them. This creates extra work for the server and uses up extra network resources and can be one of the causes of poor performance when it is done thoughout your system and such wasteful queries are run thousands of times a day.
As to it being a maintenance problem, you only need to add the fields if the part of the application that uses your query would be affected by them. If you don't know what affect the new field would have or where you need to add it, you shouldn't be adding the field. Also adding new fileds unexopectedly through the use of select * can cause maintenance problems as well. Creating performance problems to avoid doing maintenance (maintenance you may never even need to do as column changes should be rare (if they aren't you need to look at your design)) is pretty short-sighted.

The best way is to specify the exact fields that you want from the query. You shouldn't use * anyway.
It is convenient to use * to get all fields, but it doesn't produce robust code. Any change in the table will change the result that is returned from the query, and that is not always desirable.
You should return only the data that you really want from the query, specified in the exact order you want it. That way the result looks exactly the same even if you add fields to the table or change the order of the fields in the table.
It's a litte more work to specify the exact output, but in the long run it usually pays off. When you make a change, only what you actually change is affected, you don't get cascading effects that breaks code that you didn't even know was affected.

SQL query - Select * from view or Select col1, col2, ... colN from view [duplicate]

This question already has answers here:
What is the reason not to use select *?
(20 answers)
Closed 3 years ago.
We are using SQL Server 2005, but this question can be for any RDBMS.
Which of the following is more efficient, when selecting all columns from a view?
Select * from view
or
Select col1, col2, ..., colN from view

NEVER, EVER USE "SELECT *"!!!!
This is the cardinal rule of query design!
There are multiple reasons for this. One of which is, that if your table only has three fields on it and you use all three fields in the code that calls the query, there's a great possibility that you will be adding more fields to that table as the application grows, and if your select * query was only meant to return those 3 fields for the calling code, then you're pulling much more data from the database than you need.
Another reason is performance. In query design, don't think about reusability as much as this mantra:
TAKE ALL YOU CAN EAT, BUT EAT ALL YOU TAKE.

It is best practice to select each column by name. In the future your DB schema might change to add columns that you would then not need for a particular query. I would recommend selecting each column by name.

Just to clarify a point that several people have already made, the reason Select * is inefficient is because there has to be an initial call to the DB to find out exactly what fields are available, and then a second call where the query is made using explicit columns.
Feel free to use Select * when you are debugging, running casual queries or are in the early stages of developing a query, but as soon as you know your required columns, state them explicitly.

Select * is a poor programming practice. It is as likely to cause things to break as it is to save things from breaking. If you are only querying one table or view, then the efficiency gain may not be there (although it is possible if you are not intending to actually use every field). If you have an inner join, then you have at least two fields returning the same data (the join fields) and thus you are wasting network resources to send redundant data back to the application. You won't notice this at first, but as the result sets get larger and larger, you will soon have a network pipeline that is full and doesn't need to be. I can think of no instance where select * gains you anything. If a new column is added and you don't need to go to the code to do something with it, then the column shouldn't be returned by your query by definition. If someone drops and recreates the table with the columns in a different order, then all your queries will have information displaying wrong or will be giving bad results, such as putting the price into the part number field in a new record.
Plus it is quick to drag the column names over from the object browser, so that is just pure laziness not efficiency in coding.

It depends. Inheritance of views can be a handy thing and easy to maintain (SQL Anywhere):
create view v_fruit as select F.id, S.strain from F key join S;
create view v_apples as select v_fruit.*, C.colour from v_fruit key join C;

If you're really selecting all columns, it shouldn't make any noticeable difference whether you ask for * or if you are explicit. The SQL server will parse the request the same way in pretty much the same amount of time.

Always do select col1, col2 etc from view. There's no efficieny difference between the two methods that I know of, but using "select *" can be dangerous. If you modify your view definition adding new columns, you can break a program using "select *", whereas selecting a predefined set of columns (even all of them, named), will still work.

I guess it all depends on what the query optimizer does.
If I want to get every record in the row, I will generally use the "SELECT *..." option, since I then don't have to worry should I change the underlying table structure. As well, for someone maintaining the code, seeing "SELECT *" tells them that this query is intended to return every column, whereas listing the columns individually does not convey the same intention.

For performance - look at the query plan (should be no difference).
For maintainability. - always supply a fieldlist (that goes for INSERT INTO too).

select
column1
,column2
,column3
.
.
.
from Your-View
this one is more optimizer than Using the
select *
from Your View

Why is using '*' to build a view bad?

Why is using '*' to build a view bad ?
Suppose that you have a complex join and all fields may be used somewhere.
Then you just have to chose fields needed.
SELECT field1, field2 FROM aview WHERE ...
The view "aview" could be SELECT table1.*, table2.* ... FROM table1 INNER JOIN table2 ...
We have a problem if 2 fields have the same name in table1 and table2.
Is this only the reason why using '*' in a view is bad?
With '*', you may use the view in a different context because the information is there.
What am I missing ?
Regards

I don't think there's much in software that is "just bad", but there's plenty of stuff that is misused in bad ways :-)
The example you give is a reason why * might not give you what you expect, and I think there are others. For example, if the underlying tables change, maybe columns are added or removed, a view that uses * will continue to be valid, but might break any applications that use it. If your view had named the columns explicitly then there was more chance that someone would spot the problem when making the schema change.
On the other hand, you might actually want your view to blithely
accept all changes to the underlying tables, in which case a * would
be just what you want.
Update: I don't know if the OP had a specific database vendor in mind, but it is now clear that my last remark does not hold true for all types. I am indebted to user12861 and Jonny Leeds for pointing this out, and sorry it's taken over 6 years for me to edit my answer.

Although many of the comments here are very good and reference one common problem of using wildcards in queries, such as causing errors or different results if the underlying tables change, another issue that hasn't been covered is optimization. A query that pulls every column of a table tends to not be quite as efficient as a query that pulls only those columns you actually need. Granted, there are those times when you need every column and it's a major PIA having to reference them all, especially in a large table, but if you only need a subset, why bog down your query with more columns than you need.

Another reason why "*" is risky, not only in views but in queries, is that columns can change name or change position in the underlying tables. Using a wildcard means that your view accommodates such changes easily without needing to be changed. But if your application references columns by position in the result set, or if you use a dynamic language that returns result sets keyed by column name, you could experience problems that are hard to debug.
I avoid using the wildcard at all times. That way if a column changes name, I get an error in the view or query immediately, and I know where to fix it. If a column changes position in the underlying table, specifying the order of the columns in the view or query compensates for this.

These other answers all have good points, but on SQL server at least they also have some wrong points. Try this:
create table temp (i int, j int)
go
create view vtemp as select * from temp
go
insert temp select 1, 1
go
alter table temp add k int
go
insert temp select 1, 1, 1
go
select * from vtemp
SQL Server doesn't learn about the "new" column when it is added. Depending on what you want this could be a good thing or a bad thing, but either way it's probably not good to depend on it. So avoiding it just seems like a good idea.
To me this weird behavior is the most compelling reason to avoid select * in views.
The comments have taught me that MySQL has similar behavior and Oracle does not (it will learn about changes to the table). This inconsistency to me is all the more reason not to use select * in views.

Using '*' for anything production is bad. It's great for one-off queries, but in production code you should always be as explicit as possible.
For views in particular, if the underlying tables have columns added or removed, the view will either be wrong or broken until it is recompiled.

Using SELECT * within the view does not incur much of a performance overhead if columns aren't used outside the view - the optimizer will optimize them out; SELECT * FROM TheView can perhaps waste bandwidth, just like any time you pull more columns across a network connection.
In fact, I have found that views which link almost all the columns from a number of huge tables in my datawarehouse have not introduced any performance issues at all, even through relatively few of those columns are requested from outside the view. The optimizer handles that well and is able to push the external filter criteria down into the view very well.
However, for all the reasons given above, I very rarely use SELECT *.
I have some business processes where a number of CTEs are built on top of each other, effectively building derived columns from derived columns from derived columns (which will hopefully one day being refactored as the business rationalizes and simplifies these calculations), and in that case, I need all the columns to drop through each time, and I use SELECT * - but SELECT * is not used at the base layer, only in between the first CTE and the last.

The situation on SQL Server is actually even worse than the answer by #user12861 implies: if you use SELECT * against multiple tables, adding columns to a table referenced early in the query will actually cause your view to return the values of the new columns under the guise of the old columns. See the example below:
-- create two tables
CREATE TABLE temp1 (ColumnA INT, ColumnB DATE, ColumnC DECIMAL(2,1))
CREATE TABLE temp2 (ColumnX INT, ColumnY DATE, ColumnZ DECIMAL(2,1))
GO
-- populate with dummy data
INSERT INTO temp1 (ColumnA, ColumnB, ColumnC) VALUES (1, '1/1/1900', 0.5)
INSERT INTO temp2 (ColumnX, ColumnY, ColumnZ) VALUES (1, '1/1/1900', 0.5)
GO
-- create a view with a pair of SELECT * statements
CREATE VIEW vwtemp AS
SELECT *
FROM temp1 INNER JOIN temp2 ON 1=1
GO
-- SELECT showing the columns properly assigned
SELECT * FROM vwTemp
GO
-- add a few columns to the first table referenced in the SELECT
ALTER TABLE temp1 ADD ColumnD varchar(1)
ALTER TABLE temp1 ADD ColumnE varchar(1)
ALTER TABLE temp1 ADD ColumnF varchar(1)
GO
-- populate those columns with dummy data
UPDATE temp1 SET ColumnD = 'D', ColumnE = 'E', ColumnF = 'F'
GO
-- notice that the original columns have the wrong data in them now, causing any datatype-specific queries (e.g., arithmetic, dateadd, etc.) to fail
SELECT *
FROM vwtemp
GO
-- clean up
DROP VIEW vwTemp
DROP TABLE temp2
DROP TABLE temp1

It's because you don't always need every variable, and also to make sure that you are thinking about what you specifically need.
There's no point getting all the hashed passwords out of the database when building a list of users on your site for instance, so a select * would be unproductive.

Once upon a time, I created a view against a table in another database (on the same server) with
Select * From dbname..tablename
Then one day, a column was added to the targetted table. The view started returning totally incorrect results until it was redeployed.
Totally incorrect : no rows.
This was on Sql Server 2000.
I speculate that this is because of syscolumns values that the view had captured, even though I used *.

A SQL query is basically a functional unit designed by a programmer for use in some context. For long-term stability and supportability (possibly by someone other than you) everything in a functional unit should be there for a purpose, and it should be reasonably evident (or documented) why it's there - especially every element of data.
If I were to come along two years from now with the need or desire to alter your query, I would expect to grok it pretty thoroughly before I would be confident that I could mess with it. Which means I would need to understand why all the columns are called out. (This is even more obviously true if you are trying to reuse the query in more than one context. Which is problematic in general, for similar reasons.) If I were to see columns in the output that I couldn't relate to some purpose, I'd be pretty sure that I didn't understand what it did, and why, and what the consequences would be of changing it.

It's generally a bad idea to use *. Some code certification engines mark this as a warning and advise you to explicitly refer only the necessary columns. The use of * can lead to performance louses as you might only need some columns and not all. But, on the other hand, there are some cases where the use of * is ideal. Imagine that, no matter what, using the example you provided, for this view (aview) you would always need all the columns in these tables. In the future, when a column is added, you wouldn't need to alter the view. This can be good or bad depending the case you are dealing with.

I think it depends on the language you are using. I prefer to use select * when the language or DB driver returns a dict(Python, Perl, etc.) or associative array(PHP) of the results. It makes your code alot easier to understand if you are referring to the columns by name instead of as an index in an array.

No one else seems to have mentioned it, but within SQL Server you can also set up your view with the schemabinding attribute.
This prevents modifications to any of the base tables (including dropping them) that would affect the view definition.
This may be useful to you for some situations. I realise that I haven't exactly answered your question, but thought I would highlight it nonetheless.

And if you have joins using select * automatically means you are returning more data than you need as the data in the join fields is repeated. This is wasteful of database and network resources.
If you are naive enough to use views that call other views, using select * can make them even worse performers (This is technique that is bad for performance on its own, calling mulitple columns you don't need makes it much worse).

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How do I tell if data formatting inconsistencies are themselves consistent? - sql

You can do this with a not exists clause: select t1.* from t1 where not exists (select 1 from t2 where t2.name = substring(t1.columnB, 6) ); This will identify all t1.columnB that do not have a matching name in t2 by the rule you have given.

Related

Check against a list of values in the database

Why are positional queries bad?

'SELECT *' from inner joined tables

SQL query - Select * from view or Select col1, col2, ... colN from view [duplicate]

Why is using '*' to build a view bad?

Categories

Resources