SQL (MS) - Custom compare for Null=Value on many columns

I have a table with 50 columns of identifying information that is inconsistently filled out, even for the same individual. Sadly, individuals do not have a unique identifier in this system.
For example, sometimes we capture a person's middle name or preferred name, and sometimes it is null - for the SAME individual.
Simplest solution I could think of would be a custom compare function that takes (NULL,VALUE) and returns true, but I'm not sure how to implement this, or if it's even wise.
Ideally I would like to link up records with LAG ... OVER (PARTITION BY ...), but there is frustratingly little information on how PARTITION BY works other than that it takes a 'value expression'. I have tested that it accepts multiple comma-separated columns, but the occurrence of null values causes us to miss matches.
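For illustration, a minimal sketch of the kind of NULL-tolerant comparison described above, assuming a hypothetical People table with a PersonId and a handful of the identifying columns; a NULL on either side is treated as a wildcard that matches any value:

-- Hypothetical schema: People(PersonId, LastName, MiddleName, PreferredName, ...)
-- A NULL on either side compares as "equal", so (NULL, 'James') is a match.
SELECT p1.PersonId, p2.PersonId AS MatchedPersonId
FROM People AS p1
JOIN People AS p2
  ON  p1.PersonId < p2.PersonId
  AND p1.LastName = p2.LastName   -- anchor on a column that is always populated
  AND (p1.MiddleName    = p2.MiddleName    OR p1.MiddleName    IS NULL OR p2.MiddleName    IS NULL)
  AND (p1.PreferredName = p2.PreferredName OR p1.PreferredName IS NULL OR p2.PreferredName IS NULL);

The same pattern repeats for the remaining identifying columns, though with 50 of them it gets verbose, and a NULL-heavy row will match almost anything.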

Related

"Cannot construct data type datetime" when filtering data, but all values filtered DO have valid dates

I am convinced that this question is NOT a duplicate of:
Cannot construct data type datetime, some of the arguments have values which are not valid
In that case the values passed in are explicitly not valid, whereas in this case all the values the function could reasonably be expected to be called on are valid.
I know what the actual problem is, and it's not something that would help most people that find the other question. But it IS something that would be good to be findable on SO.
Please read the answer, and understand why it's different from the linked question before voting to close as dupe of that question.
I've run some SQL that's errored with the error message: Cannot construct data type datetime, some of the arguments have values which are not valid.
My SQL uses DATETIMEFROMPARTS, but it's fine evaluating that function in the select - it's only a problem when I filter on the selected value.
It's also demonstrating weird, can't-possibly-be-happening behaviour w.r.t. other changes to the query.
My query looks roughly like this:
WITH FilteredDataWithDate AS (
SELECT *, DATETIMEFROMPARTS(...some integer columns representing date data...) AS Date
FROM Table
WHERE <unrelated pre-condition filter>
)
SELECT * FROM FilteredDataWithDate
WHERE Date > '2020-01-01'
If I run that query, then it errors with the invalid data error.
But if I omit the final Date > filter, then it happily renders every result record, so clearly none of the values it's filtering on are invalid.
I've also manually examined the contents of Table WHERE <unrelated pre-condition filter> and verified that everything is a valid date.
It also has a wild collection of other behaviours:
If I replace all of ...some integer columns representing date data... with hard-coded numbers then it's fine.
If I replace some parts of that data with hard-coded values it's fixed, but replacing other parts isn't, and I can't find any pattern in which replacements do or don't help.
If I remove most of the * columns from the Table select, then it starts to be fine again.
Specifically, it appears to break any time I include an nvarchar(max) column in the CTE.
If I add an additional filter to the CTE that limits the results to Id values in the following ranges, then the results are:
Between 130,000 and 140,000: error.
Between 130,000 and 135,000: fine.
Between 135,000 and 140,000: fine (!).
Filtering by the Date column breaks everything ... but ORDER BY Date is fine. (and confirms that all dates lie within perfectly sensible bounds.)
Adding TOP 1000000 makes it work ... even though there are only about 1000 rows.
... WTAF?!
This took me a while to decode, but it turns out that the SQL Server compiler doesn't necessarily restrict its execution of the function just to rows that are, or could be, relevant to the result set.
Depending on the execution plan it arrives at, the function could get called on any record in Table, even one that doesn't satisfy WHERE <unrelated pre-condition filter>.
This was found by another user, for another function, over here.
So the fact that it could return all the results without the filter wasn't actually proving that every input into the function was valid. And indeed there were some records in the table that weren't in the result set, but still had invalid data.
That actually means that even if you were to add an explicit WHERE filter to exclude rows containing invalid date-component data ... that isn't actually guaranteed to fix it, because the function may still get called against the 'excluded' rows.
Each of the random other things I did will have been influencing the query plan in one way or another that happened to fix/break things.
The solution is, naturally, to fix the underlying table data.
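As a rough sketch of that clean-up step (YearCol, MonthCol and DayCol are placeholders for the real integer date-part columns, and TRY_CONVERT needs SQL Server 2012 or later), something like this lists the offending rows so they can be corrected:

SELECT *
FROM Table
WHERE MonthCol NOT BETWEEN 1 AND 12
   OR DayCol   NOT BETWEEN 1 AND 31
   OR YearCol  NOT BETWEEN 1753 AND 9999           -- DATETIMEFROMPARTS builds a datetime, so 1753 is the floor
   OR TRY_CONVERT(date, CONCAT(YearCol, '-', MonthCol, '-', DayCol)) IS NULL;  -- catches e.g. February 30th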

SQL query equality comparison with special character that is equal to everything

I am writing a python script that gets info from a database through SQL queries. Let's say we have an SQL array with information about some people. I have one query that can retrieve this information about a specific person whose name I pass as an argument to the query.
(" SELECT telephone FROM People_info WHERE name=%s " % (name))
Is it possible to pass as an argument a special character, or something like it, that will return the telephone for all names? Meaning something that compares as equal to every name? I want to use only one query for all cases (whether I want the info about one person or about all of them).
You can change the SQL to
SELECT telephone FROM People_info WHERE name=nvl(%s, name)
and pass null if you want to get all the records
Notice that this will never get the records where name is null, but I suppose this is not a problem.
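A quick sketch of how that behaves, using COALESCE (the portable equivalent of NVL) and a parameter @name standing in for the value bound from the Python script:

-- Pass NULL in @name to get every row; pass a name to get just that person.
-- As noted, rows where the name column itself is NULL are never returned.
SELECT telephone
FROM People_info
WHERE name = COALESCE(@name, name);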
You can use LIKE and the wildcard %, which matches zero, one, or any number of characters.
SELECT telephone
FROM people_info
WHERE name LIKE '%';
However, it won't show records where name IS NULL.
Maybe the optimizer is smart enough to see that this is actually equivalent to WHERE name IS NOT NULL and will use an index if there is one. But maybe it doesn't see it, in which case this may cost more than necessary. So rather than use such tricks, I'd change the WHERE clause in the application (or omit it completely if I wanted all records) to express what I actually want.

Access SQL nested query as a dictionary for final query result in very long run. Ways to optimize?

Want to warn you that this relates to MS Access SQL, so FULL OUTER JOIN and some other nice features don't work here. The code below (by my design) creates a dictionary of tuples from an initial query, then refines it into the next dictionary by selecting those LKAK that have multiple values of ChainCl (there are only 2 possible values, but each LKAK must have exactly one value of ChainCl, so this query finds the mistakes for me). My goal is, relying on that final dictionary-list of LKAK, to get the records with those LKAK from the MS_FLT query (all of them are there, since the dictionary is built from its values) and from the RC_FLT query (the dictionary values are only partially present there, but they are present).

Creation of the final dictionary works just fine and is instant. However, when I come to retrieving the records, the query runs for about 20 minutes over 130,000 records. What can be done to optimize the link between the dictionary and the source query? Note that adding DISTINCT after the first SELECT doesn't change the run time significantly. My real problem is that I need to get joined results (using UNION) from three identically formatted tables/queries (say RC, RW, RE), and if one takes 20 minutes, then three take an hour.

This is quite useful to me, and possibly to others, as a workaround for cases where you need to display extra data that doesn't take part in criteria involving totals that need GROUP BY (like COUNT). Creating an additional saved query won't work for me either, as I need to collect everything within a single complex query (I don't like the idea, but that's the requirement). So, any suggestions on optimization?
select distinct
RC_FLT.LKAK as DataKeyAccount,
MS_FLT.LKAK as MasterKeyAccount,
RC_FLT.LSTCK as DataSubTradeChannel,
MS_FLT.LSTCK as MasterSubTradeChannel,
RC_FLT.UFPCh as DataUFPChannel,
MS_FLT.UFPCh as MasterUFPChannel,
RC_FLT.ChainCl as DataChainClass,
MS_FLT.ChainCl as MasterChainClass,
RC_FLT.Mkt as MarketUnit
from
RC_FLT,
MS_FLT
where
RC_FLT.LSTCK=MS_FLT.LSTCK
and RC_FLT.LKAK=MS_FLT.LKAK
and RC_FLT.ChainCl=MS_FLT.ChainCl
and MS_FLT.LKAK in(
SELECT LKAK
FROM (
select distinct LKAK, ChainCl
from MS_FLT)
group by LKAK
having count(LKAK)>1);
PS: Both initial queries contain the fields used here, but there are other fields in them as well.
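For what it's worth, one commonly tried rewrite (a sketch only, untested against this data and not guaranteed to be faster in Access) is to turn the IN (...) filter into a join against the list of duplicated LKAK values, since the Jet/ACE engine sometimes copes better with the join form:

SELECT DISTINCT
  RC_FLT.LKAK AS DataKeyAccount,
  MS_FLT.LKAK AS MasterKeyAccount,
  RC_FLT.LSTCK AS DataSubTradeChannel,
  MS_FLT.LSTCK AS MasterSubTradeChannel,
  RC_FLT.UFPCh AS DataUFPChannel,
  MS_FLT.UFPCh AS MasterUFPChannel,
  RC_FLT.ChainCl AS DataChainClass,
  MS_FLT.ChainCl AS MasterChainClass,
  RC_FLT.Mkt AS MarketUnit
FROM (RC_FLT
  INNER JOIN MS_FLT
    ON (RC_FLT.LSTCK = MS_FLT.LSTCK)
   AND (RC_FLT.LKAK = MS_FLT.LKAK)
   AND (RC_FLT.ChainCl = MS_FLT.ChainCl))
  INNER JOIN (
    SELECT LKAK
    FROM (SELECT DISTINCT LKAK, ChainCl FROM MS_FLT) AS T
    GROUP BY LKAK
    HAVING COUNT(*) > 1
  ) AS Dups
  ON MS_FLT.LKAK = Dups.LKAK;

Whether this helps will likely depend more on indexes on LKAK, LSTCK and ChainCl in the underlying tables than on the exact SQL shape.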

SQL Server Text Searching

I have a business requirement where we need to do some crazy name matching against records stored in the database, and I was wondering if there is any easy way to do it using SQL Server.
Name Stored in the DB : Austin K
Name to be Matched from UI : Austin Kierland
That's just a sample. In reality, there could be whole lot of different permutations and combinations.
If it were the other way round I could have used a wildcard, but in this case the name in the database is shorter than the search criteria.
Any suggestions?
Realistically - no. Databases were meant for comparing absolute values, not for messy comparisons; the way they store their data internally just isn't fit for really messy matching. Even a super-powerful dedicated search engine like Google, which has a LOT of messy-matching features, wouldn't be able to pull off your example without prior knowledge.
I don't know how the requirement is precisely worded, but I'd either shoot down the feature request as "technically impossible", or implement a rule set for which messy matches are tried - for your example, you could easily hard-code that multiple searches are executed when capitalized words are entered, shortening them down to a single letter. No idea if that's a solution to your problem though.
You can do a normal search using the LIKE operator, which determines whether a specific character string matches a specified pattern. The problem you will run into is the probability of returning multiple records or the wrong people. I've had a similar requirement myself for a business app, and the best solution to the issue is to require other qualifying values rather than just the name. If you do a partial name search without other qualifying data you are certain to come across false-positive matches and/or multiple records.

In my case I built a web service that checks eligibility, allowing a text search on first and last name but also requiring date of birth, the primary person's SSN, and gender, which ensured the matching person was indeed the person intended by the search. If my situation were like yours, in which name is the only search criterion, my recommendation to the business would be that we cannot perform the search until qualifying data is entered into the database; otherwise there is no accurate way to query the results they are looking for.
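A hedged sketch of the reversed-wildcard idea combined with qualifying data (the table and column names here are made up): because the stored name may be an abbreviation of what the user types, test whether the stored value is a prefix of the search string rather than the other way round.

DECLARE @SearchName nvarchar(200);
SET @SearchName = N'Austin Kierland';

-- 'Austin K' in the table matches, because it is a prefix of the UI value.
-- (Wildcard characters inside StoredName would need escaping.)
SELECT p.StoredName, p.Telephone
FROM People AS p
WHERE @SearchName LIKE p.StoredName + N'%';
-- In practice, add the extra qualifying predicates (date of birth, SSN, gender, ...) described above.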

Best SQL query for list of records containing certain characters?

I'm working with a relatively large SQL Server 2000 DB at the moment. It's 80 GB in size and has millions and millions of records.
I currently need to return a list of names that contain at least one of a series of illegal characters. By illegal characters I just mean an arbitrary list of characters defined by the customer. In the example below I use question mark, semicolon, period, and comma as the illegal character list.
I was initially thinking of writing a CLR function that worked with regular expressions, but as it's SQL Server 2000, I guess that's out of the question.
At the moment I've done like this:
select x from users
where
columnToBeSearched like '%?%' OR
columnToBeSearched like '%;%' OR
columnToBeSearched like '%.%' OR
columnToBeSearched like '%,%' OR
otherColumnToBeSearched like '%?%' OR
otherColumnToBeSearched like '%;%' OR
otherColumnToBeSearched like '%.%' OR
otherColumnToBeSearched like '%,%'
Now, I'm not a SQL expert by any means, but I get the feeling that the above query will be very inefficient. Doing eight wildcard searches on a table with millions of records seems like it could slow the system down rather seriously. While it seems to work fine on test servers, I'm getting the "this has to be completely wrong" vibe.
As I need to execute this script on a live production server eventually, I hope to achieve good performance, so as not to clog the system. The script might need to be expanded later on to include more illegal characters, but this is very unlikely.
To sum up: My aim is to get a list of records where either of two columns contain a customer-defined "illegal character". The database is live and massive, so I want a somewhat efficient approach, as I believe the above queries will be very slow.
Can anyone tell me the best way for achieving my result? Thanks!
/Morten
It doesn't get used much, but the LIKE operator accepts patterns in a similar (but much simplified) way to regex. This link is the MSDN page for it.
In your case you could simplify to (untested):
select x from users
where
columnToBeSearched like '%[?;.,]%' OR
otherColumnToBeSearched like '%[?;.,]%'
Also note that you can create the LIKE pattern as a variable, allowing for the customer defined part of your requirements.
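For example, a small sketch of that variable form (on SQL Server 2000 the variable has to be assigned with a separate SET, and characters such as ], ^ or - would need extra handling inside the brackets):

DECLARE @illegal varchar(50)
SET @illegal = '?;.,'   -- the customer-defined character list

SELECT x
FROM users
WHERE columnToBeSearched      LIKE '%[' + @illegal + ']%'
   OR otherColumnToBeSearched LIKE '%[' + @illegal + ']%'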
One other major optimization: If you've got an updated date (or timestamp) on the user row (for any audit history type of thing), then you can always just query rows updated since the last time you checked.
If this is a query that will be run repeatedly, you are probably better off creating an index for it. The syntax escapes me at the moment, but you could probably create a computed column (edit: probably a PERSISTED computed column) which is 1 if columnToBeSearched or otherColumnToBeSearched contain illegal characters, and 0 otherwise. Create an index on that column and simply select all rows where the column is 1. This assumes that the set of illegal characters is fixed for that database installation (I assume that that's what you mean by "specified by the customer"). If, on the other hand, each query might specify a different set of illegal characters, this won't work.
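Roughly like this (a sketch only; the PERSISTED keyword needs SQL Server 2005 or later, and indexing a computed column on SQL Server 2000 has its own determinism and SET-option requirements):

ALTER TABLE users ADD HasIllegalChar AS
  CASE WHEN columnToBeSearched LIKE '%[?;.,]%'
         OR otherColumnToBeSearched LIKE '%[?;.,]%'
       THEN 1 ELSE 0 END PERSISTED;

CREATE INDEX IX_users_HasIllegalChar ON users (HasIllegalChar);

-- The search then becomes a cheap seek on a single small column.
SELECT x FROM users WHERE HasIllegalChar = 1;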
By the way, if you don't mind the risk of reading uncommitted rows, you can run the query in a transaction with the isolation level READ UNCOMMITTED, so that you won't block other transactions.
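The per-query equivalent uses a table hint (the same dirty-read caveat applies):

SELECT x
FROM users WITH (NOLOCK)
WHERE columnToBeSearched LIKE '%[?;.,]%'
   OR otherColumnToBeSearched LIKE '%[?;.,]%'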
You can try to partition your data horizontally and "split" your query into a number of smaller queries. For instance, you can do:
SELECT x FROM users
WHERE users.ID BETWEEN 1 AND 5000
AND -- your filters on columnToBeSearched
Putting your results back together into one list may be a little inconvenient, but if it's a report you're only extracting once (or once in a while) it may be feasible.
I'm assuming ID is the primary key of users, or a column that has an index defined, which means SQL Server should be able to create an efficient execution plan where it evaluates users.ID BETWEEN 1 AND 5000 (fast) before trying to check the filters (which may be slow).
Look up PATINDEX. It allows you to put in a set of characters: PATINDEX('%[._]%', ColumnName) returns 0 if none of the characters are present, or the position of the first occurrence of an illegal character found in the value. Hope this helps.
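A small usage sketch against the queries above (same customer-defined character set):

SELECT x
FROM users
WHERE PATINDEX('%[?;.,]%', columnToBeSearched) > 0
   OR PATINDEX('%[?;.,]%', otherColumnToBeSearched) > 0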