I have a MS SQL 2005/2008 database and trying to compare two tables of data using substrings with % wildcard to try and find data within one character of a column in other table.
Example is:
UPDATE table1
SET table1.Marker = 1
FROM table1
INNER JOIN table2
ON table1.ForeignKey = table2.ID
AND tabl1.CharacterColumn LIKE SUBSTRING(table2.CharacterColumn , 1, 5) + '%' + SUBSTRING(table2.CharacterColumn , 7, 8)
UPDATE table1
SET table1.Marker = 1
FROM table1
INNER JOIN table2
ON table1.ForeignKey = table2.ID
AND tabl1.CharacterColumn LIKE SUBSTRING(table2.CharacterColumn , 1, 6) + '%' + SUBSTRING(table2.CharacterColumn , 8, 8)
At present it takes a while to run this routine as the column can contain up to 10 characters and the dataset is on a table1 of 300 million rows (however a dataset of maybe 300k) and table2 of 2 million rows (a dataset of 100k).
My question is is the JOIN statement the best way to do one character out searching on a column?
i can't give exact examples as the data is protected, however this should help:
Table2 -
ID | FK | Name
1 | 100 | Phillips
2 | 100 | Bloggs
3 | 100 | Jones
Table1 -
ID | Table2FK | Name
1 | 100 | Philpips
2 | 100 | Bloggs
3 | 100 | Jones
As you see table2 record 1 is within one character of table1 record 1 and I want to identify that. Also the one character out can be at any point in the string
When you wrap column in SQL function, SQL Server is no longer able to use indexes. If you have large tables like you have described SQL Server will need to do many CPU intensive operations like Index Scan. Your have 2 alternatives
Create indexed view with a columns of that sub-string. It will take longer to build it first time but after that you will be able to easily join.
Second alternative is to modify your tables to break apart character column into two separate column and that create index on those two columns.
String operations are very costly and it is best to be able to break strings apart instead into separate columns instead of doing it real time.
Indexed Views Documentation http://technet.microsoft.com/en-us/library/dd171921(v=sql.100).aspx
Related
I need to join 2 tables of which I need to substring a column in Table 1. What is unknown is the length of the substring in order to join to Table 2. The first few numbers are the joining key with differing lengths. Table 2 does state the length and will be the indicator on which entries need to be substringed with a specific length. The second table has fixed length of 9 so will also need to be substringed (which will be easy to do). Table 1 is my problem. The length column in Table 2 tells you how much of the ShortRef to use as well as how much to substring RefNr in Table1 which then becomes the join. However, I am not sure how to do this in SSMS or whether it is possible.
Since table 2 informs how much to substring, I currently don't see a solution and I don't know if like will work or how to do this using like.
Example:
TABLE 1
|RefNr |
|----------------|
|1234567890101234|
|9876543210090876|
|1234569000100223|
TABLE 2
|ShortRef | Length | Name |
|---------|--------|------|
|123456789|8 |Alice |
|123456909|8 |Cindy |
|987654999|6 |Ben |
RESULTS SHOULD BE:
|RefNr | Substr Table1&2 based on Length in Table2 | Name |
|----------------|-------------------------------------------|------|
|1234567890101234| 12345678 |Alice |
|9876543210090876| 987654 |Ben |
|1234569000100223| 12345690 |Cindy |
EXAMPLE OF TABLES
I don't know if you're wanting this exactly, but i took correct output for the example table with this.
SELECT RefNr
,tb2."Substring Table1 based on length in Table2"
,tb2.Name
FROM Table1
INNER JOIN
(SELECT SUBSTRING(ShortRef, 0, Length+1) as "Substring Table1 based on length in Table2",
Name,
Length
FROM Table2) as tb2
ON SUBSTRING(RefNr, 0, tb2.Length + 1) = tb2."Substring Table1 based on length in Table2"
It looks like you want to join the tables together with string operations:
select t1.*, left(t2.refnr, t2.length), t2.name
from table1 t1 join
table2 t2
on left(t2.refnr, t2.length) = left(t1.refnr, t2.length);
I have been given a task to develop a script/ function/ query to aggregate groups of rows in a table and then search for specific keywords in it. The column to be aggregated is a varchar2 column with size 3200 and some of the aggregated rows have lengths way beyond 5000.
(I understand that the size of varchar2 is 4000)
When I try to aggregate the data into a single column, it gives a "result of string concatenation is too long" error (ORA-01489)
I have tried inbuilt aggregators like LISTAGG, XMLAGG, and also some custom functions but I have been asked to prefer a SQL query over a function or procedure.
Once I can get the data to be aggregated, I have to then search through the rows for matching keywords.
(can't just search the rows without aggregating as some of the words are split across the rows, eg row1 ends with "KEYW" and row2 starts with "ORD" if I need to look for "KEYWORD" in the table
my table kind of looks like this (can't post the real table data, sorry),
id_1 | id_2 | name | row_num | description
1 5 A 0 this has so
1 5 A 1 me keyword
1 5 B 0 this is
1 3 E 0 new some
2 12 A 0 diff str
here the unique rows are identified using the first 3 columns and the 4th column lists the order in which these "description" strings need to be concatenated.
I would like to get the output as:
id_1 | id_2 | name | description (concated)
1 5 A this is **some** keyword
1 3 E new **some**
when looking for the keyword "some"
Please help as I am fairly new to DBs and any help will be highly appreciated.
Thanks & Regards
Kunal
From within the results of a query I need to take the first 3 digits (got this part - substr) from a column in one table and pull the value from a second table where the 3 digit code is in a different column of the same row. While I'm not looking for assistance in this part I will be grouping by the code column for reference
Example
Table 1 Table 2
3_Digit Code_Column Description
123456 123 Blue
103456 103 Green
so if my substr query return 123 5 times I'm looking for
Blue 3
select t2.Code_Column, count(*)
from table1 t1
join table2 t2
on substr(t1.digit, 0, 3) = t2.Code_Column
group by t2.Code_Column
SQL Fiddle sample
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I inherited a table that has a column containing hand-entered award numbers. It has been used for many years by many people. The award numbers in general look like this:
R01AR012345-01
R01AR012345-02
R01AR012345-03
Award numbers get assigned each year. Because so many different people have had their hands in this in the past, there isn't a lot of consistency in how these are entered. For instance, an award sequence may appear like this:
R01AR012345-01
1 RO1AR012345-02
12345-03
12345-05A1
1234506
The rule I've been given to find is to return any record in which 5 consecutive integers from that column match with another record.
I know how to match a given string, but am at a loss when the 5 consecutive integers are unknown.
Here's a sample table to make what I'm looking for more clear:
+----------------------+
| table: AWARD |
+-----+----------------+
| ID | AWARD_NO |
+-----+----------------+
| 12 | R01AR015123-01 |
+-----+----------------+
| 13 | R01AR015124-01 |
+-----+----------------+
| 14 | 15123-02A1 |
+-----+----------------+
| 15 | 1 Ro1XY1512303 |
+-----+----------------+
| 16 | R01XX099232-01 |
+-----+----------------+
In the above table, the following IDs would be returned: 12,13,14,15
The five consecutive integers that match are:
12,13: 01512
12,14: 15123
12,15: 15123
In our specific case, ID 13 is a false positive... but we're willing to deal with those on a case-by-case basis.
Here's the desired return set for the above table:
+-----+-----+----------------+----------------+
| ID1 | ID2 | AWARD_NO_1 | AWARD_NO_2 |
+-----+-----+----------------+----------------+
| 12 | 13 | R01AR015123-01 | R01AR015124-01 |
+-----+-----+----------------+----------------+
| 12 | 14 | R01AR015123-01 | 15123-02A1 |
+-----+-----+----------------+----------------+
| 12 | 15 | R01AR015123-01 | 1 Ro1XY1512303 |
+-----+-----+----------------+----------------+
Now... I'm OK with false positives (like 12 matching 13) and duplicates (because if 12 matches 14, then 14 also matches 12). We're looking through something like 18,000 rows. Optimization isn't really necessary in this situation, because it's only needed to be run one time.
This should handle removing duplicates and most false-positives:
DECLARE #SPONSOR TABLE (ID INT NOT NULL PRIMARY KEY, AWARD_NO VARCHAR(50))
INSERT INTO #SPONSOR VALUES (12, 'R01AR015123-01')
INSERT INTO #SPONSOR VALUES (13, 'R01AR015124-01')
INSERT INTO #SPONSOR VALUES (14, '15123-02A1')
INSERT INTO #SPONSOR VALUES (15, '1 Ro1XY1512303')
INSERT INTO #SPONSOR VALUES (16, 'R01XX099232-01')
;WITH nums AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS [Num]
FROM sys.objects
),
cte AS
(
SELECT sp.ID,
sp.AWARD_NO,
SUBSTRING(sp.AWARD_NO, nums.Num, 5) AS [TestCode],
SUBSTRING(sp.AWARD_NO, nums.Num + 5, 1) AS [FalsePositiveTest]
FROM #SPONSOR sp
CROSS JOIN nums
WHERE nums.Num < LEN(sp.AWARD_NO)
AND SUBSTRING(sp.AWARD_NO, nums.Num, 5) LIKE '%[1-9][0-9][0-9][0-9][0-9]%'
-- AND SUBSTRING(sp.AWARD_NO, nums.Num, 5) LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
)
SELECT sp1.ID AS [ID1],
sp2.ID AS [ID2],
sp1.AWARD_NO AS [AWARD_NO1],
sp2.AWARD_NO AS [AWARD_NO2],
sp1.TestCode
FROM cte sp1
CROSS JOIN #SPONSOR sp2
WHERE sp2.AWARD_NO LIKE '%' + sp1.TestCode + '%'
AND sp1.ID < sp2.ID
--AND 1 = CASE
-- WHEN (
-- sp1.FalsePositiveTest LIKE '[0-9]'
-- AND sp2.AWARD_NO NOT LIKE
-- '%' + sp1.TestCode + sp1.FalsePositiveTest + '%'
-- ) THEN 0
-- ELSE 1
-- END
Output:
ID1 ID2 AWARD_NO1 AWARD_NO2 TestCode
12 14 R01AR015123-01 15123-02A1 15123
12 15 R01AR015123-01 1 Ro1XY1512303 15123
14 15 15123-02A1 1 Ro1XY1512303 15123
If IDs 14 and 15 should not match, we might be able to correct for that as well.
EDIT:
Based on the comment from #Serpiton I commented out the creation and usage of the [FalsePositiveTest] field since changing the initial character range in the LIKE clause on the SUBSTRING to be [1-9] accomplished the same goal and slightly more efficiently. However, this change assumes that no valid Award # will start with a 0 and I am not sure that this is a valid assumption. Hence, I left the original code in place but just commented out.
You want to use the LIKE command in your where clause and use a pattern to look for the 5 numbers. See this post here:
There are probably better ways of representing this but the below example looks for 5 digits from 0-9 next to each other in the data anywhere in your column value. This could perform quite slowly however...
Select *
from blah
Where column LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
Create a sql server function to extract the 5 numbers and then use the function in your query.
Perhaps something like:
select GetAwardNumber(AwardNumberField) as AwardNumber
from Awards
group by GetAwardNumber(AwardNumberField)
I will not post the code, but an idea on how to do it.
First of all you need to make a table valued function that will return all number sequences from a string bigger then 5 characters. (there are examples on SO)
So for each entry your function will return a list of numbers.
After that the query will simplify like:
;with res as (
select
id, -- hopefully there is an id on your table
pattern -- pattern is from the list of patterns the udtf returns
from myTable
cross apply udtf_custom(myString) -- myString is the string you need to split
)
select
pattern
from res
group by pattern
having count(distinct id)>1
I have to note that this is for example purposes, there should be some coding and testing involved, but this should be the story with it.
Good luck, hope it helps.
Here's what I ended up with:
SELECT a1.ID as AWARD_ID_1, a2.ID as AWARD_ID_2, a1.AWARD_NO as Sponsor_Award_1, a2.AWARD_NO as Sponsor_Award_2
FROM AWARD a1
LEFT OUTER JOIN AWARD a2
ON SUBSTRING(a1.AWARD_NO,PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%',a1.AWARD_NO + '1'),5) = SUBSTRING(a2.AWARD_NO,PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%',a2.AWARD_NO + '1'),5)
WHERE
a1.AWARD_NO <> '' AND a2.AWARD_NO <> ''
AND a1.ID <> a2.ID
AND a1.AWARD_NO LIKE '%[0-9][0-9][0-9][0-9][0-9]%' AND a2.AWARD_NO LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
There's a possibility that the first substring of five characters might not match (when they should generate a match), but it's close enough for us. :-)
I have one table that stores a range of integers in a field, sort of like a print range, (e.g. "1-2,4-7,9-11"). This field could also contain a single number.
My goal is to join this table to a second one that has discrete values instead of ranges.
So if table one contains
1-2,5
9-15
7
And table two contains
1
2
3
4
5
6
7
8
9
10
The result of the join would be
1-2,5 1
1-2,5 2
1-2,5 5
7 7
9-15 9
9-15 10
Working in SQL Server 2008 R2.
Use a string split function of your choice to split on comma. Figure out the min/max values and join using between.
SQL Fiddle
MS SQL Server 2012 Schema Setup:
create table T1(Col1 varchar(10))
create table T2(Col2 int)
insert into T1 values
('1-2,5'),
('9-15'),
('7')
insert into T2 values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)
Query 1:
select T1.Col1,
T2.Col2
from T2
inner join (
select T1.Col1,
cast(left(S.Item, charindex('-', S.Item+'-')-1) as int) MinValue,
cast(stuff(S.Item, 1, charindex('-', S.Item), '') as int) MaxValue
from T1
cross apply dbo.Split(T1.Col1, ',') as S
) as T1
on T2.Col2 between T1.MinValue and T1.MaxValue
Results:
| COL1 | COL2 |
----------------
| 1-2,5 | 1 |
| 1-2,5 | 2 |
| 1-2,5 | 5 |
| 9-15 | 9 |
| 9-15 | 10 |
| 7 | 7 |
Like everybody has said, this is a pain to do natively in SQL Server. If you must then I think this is the proper approach.
First determine your rules for parsing the string, then break down the process into well-defined and understood problems.
Based on your example, I think this is the process:
Separate comma separated values in the string into rows
If the data does not contain a dash, then it's finished (it's a standalone value)
If it does contain a dash, parse the left and right sides of the dash
Given the left and right sides (the range) determine all the values between them into rows
I would create a temp table to populate the parsing results into which needs two columns:
SourceRowID INT, ContainedValue INT
and another to use for intermediate processing:
SourceRowID INT, ContainedValues VARCHAR
Parse your comma-separated values into their own rows using a CTE like this Step 1 is now a well-defined and understood problem to solve:
Turning a Comma Separated string into individual rows
So your result from the source
'1-2,5'
will be:
'1-2'
'5'
From there, SELECT from that processing table where the field does not contain a dash. Step 2 is now a well-defined and understood problem to solve These are standalone numbers and can go straight into the results temp table. The results table should also get the ID reference to the original row.
Next would be to parse the values to the left and right of the dash using CHARINDEX to locate it, then the appropriate LEFT and RIGHT functions as needed. This will give you the starting and ending value.
Here is a relevant question for accomplishing this step 3 is now a well-defined and understood problem to solve:
T-SQL substring - separating first and last name
Now you have separated the starting and ending values. Use another function which can explode this range. Step 4 is now a well-defined and understood problem to solve:
SQL: create sequential list of numbers from various starting points
SELECT all N between #min and #max
What is the best way to create and populate a numbers table?
and, also, insert it into the temp table.
Now what you should have is a temp table with every value in the exploded range.
Simply JOIN that to the other table on the values now, then to your source table on the ID reference and you're there.
My suggestion is to add one more field and many more records to your ranges table. Specifically, the primary key would be the integer and the other field would be the range. Records would look like this:
number range
1 1-2,5
2 1-2,5
3 na
4 na
5 1-2,5
etc
Having said that, this is still rather limiting because a number can only have one range. If you want to be thorough, set up a many to many relationship between numbers and ranges.
As far as I can tell you best option is something like below:
Create a table value function that accepts your ranges an converts them to a collection of ints. So 1-3,5 would return:
1
2
3
5
Then use these results to join to other tables. I don't have an exact function to do this at hand, but this one seems like an excellent start.