I have a table with three columns: Name, Address, City. This table is around a million records long. The name and address fields can probably have duplicates.
Examples of duplicate names are:
XYZ foundation Coorporation
XYZ foundation Corp
XYZ foundation Co-orporation
Or another example
XYZ Center
XYZ Ctr
An example of duplication in addresses would be
60909 East 34TH STREET BAY #1
60909 East 34TH ST. BAY #1
60909 East 34TH ST. BAY 1
As you can see, the name and address fields are duplicates, but only to the human eye, because we understand abbreviations and short forms. How do I build this into a select statement in SQL Server? If not SQL Server, is there another way to scan and remove such duplicates?
The approach that I used is better suited for surnames, but I used it for company names as well. Most likely it will not work well for addresses.
Stage 1
Add a column to the table that stores a "normalized" company name. In my case I've written a function that populates the column via a trigger. The function has a set of rules, like this:
adds one space in the front and one in the back
replaces single char symbols ~`!@#$%^&*()=_+[]{}|;':",.<>? with space (all except / -)
replaces multi-char tokens with space: T/A C/- P/L
replaces single char symbols -/ with space
replaces multi-char tokens with space: PTY PTE INC INCORPORATED LTD LIMITED CO COMPANY MR DR THE AND 'TRADING AS' 'TRADE AS' 'OPERATING AS'
replaces CORPORATION with CORP
trim all leading and trailing spaces
replace multiple consecutive spaces with single space
Note: when dealing with multi-char tokens surround them with spaces
I looked through my data and made these rules up. Adjust them for your case. A rough T-SQL sketch of such a normalization function follows below.
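For illustration, here is a minimal sketch of such a normalization function in T-SQL. It covers only a handful of the rules above, and the function name dbo.NormalizeCompanyName is an assumption made for this example, not my actual code.
-- Hedged sketch: a scalar function implementing a few of the normalization rules above.
CREATE FUNCTION dbo.NormalizeCompanyName (@Name NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS
BEGIN
    -- pad with spaces so the token rules also work at the start and end of the string
    DECLARE @s NVARCHAR(4000) = N' ' + UPPER(@Name) + N' ';

    -- replace punctuation with spaces (only a few symbols shown here)
    SET @s = REPLACE(REPLACE(REPLACE(REPLACE(@s, N',', N' '), N'.', N' '), N'"', N' '), N';', N' ');
    SET @s = REPLACE(REPLACE(@s, N'-', N' '), N'/', N' ');

    -- drop noise tokens, surrounding them with spaces as noted above
    SET @s = REPLACE(@s, N' PTY ', N' ');
    SET @s = REPLACE(@s, N' LTD ', N' ');
    SET @s = REPLACE(@s, N' LIMITED ', N' ');
    SET @s = REPLACE(@s, N' THE ', N' ');

    -- canonicalize common contractions
    SET @s = REPLACE(@s, N' CORPORATION ', N' CORP ');

    -- collapse runs of spaces and trim
    WHILE CHARINDEX(N'  ', @s) > 0
        SET @s = REPLACE(@s, N'  ', N' ');
    RETURN LTRIM(RTRIM(@s));
END;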
Stage 2
I used the so-called Jaro-Winkler metric to calculate the distance between two normalized company names. I implemented the function that calculates this metric in CLR.
In my case my goal was to check for duplicates as a new entry is added to the system. The user enters the company name, the program normalizes it and calculates the Jaro-Winkler distance between the given name and all existing names. The closer the distance is to 1, the closer the match. The user saw existing records ordered by relevance and could decide whether the company name they had just entered already existed in the database, or whether they still wanted to create a new one.
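At entry time the duplicate check can then be a simple query over the normalized column. A minimal sketch, assuming a dbo.Companies table with a NormalizedName column (both names are placeholders) and the normalization function sketched above:
-- Rank existing companies by similarity to the newly entered, normalized name.
DECLARE @NewName NVARCHAR(4000) = dbo.NormalizeCompanyName(N'XYZ foundation Coorporation');

SELECT TOP (20)
       c.Name,
       dbo.StringSimilarityJaroWinkler(@NewName, c.NormalizedName) AS Similarity
FROM dbo.Companies AS c
ORDER BY Similarity DESC;  -- values closest to 1 are the most likely duplicates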
There exist other metrics that try to perform fuzzy search, like Levenshtein distance. Most likely, you'll have to use different metrics for names and addresses, because the types of mistakes are significantly different for them.
SQL Server has built-in functions to do fuzzy search, but I didn't use them and I'm not sure whether they are available in the Standard edition or only in Enterprise, e.g. CONTAINSTABLE:
Returns a table of zero, one, or more rows for those columns
containing precise or fuzzy (less precise) matches to single words and
phrases, the proximity of words within a certain distance of one
another, or weighted matches.
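For completeness, a minimal CONTAINSTABLE call looks roughly like the sketch below. It requires a full-text index on the searched column, and the dbo.Companies table with its Id and Name columns is assumed here purely for illustration.
-- Hedged sketch: full-text prefix search; KEY and RANK are the columns CONTAINSTABLE returns.
SELECT c.Name, k.[RANK]
FROM CONTAINSTABLE(dbo.Companies, Name, '"XYZ foundation*"') AS k
JOIN dbo.Companies AS c ON c.Id = k.[KEY]
ORDER BY k.[RANK] DESC;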
Note
When I was looking into this topic I came to the conclusion that all these metrics (Jaro-Winkler, Levenshtein, etc.) look for simple mistypes, like a missed/extra letter or two letters swapped. In both my case and yours this approach as-is would perform poorly, because you effectively have a dictionary of contractions first, and then on top of that there can be simple mistypes. That's why I ended up doing it in two stages - normalization and then applying the fuzzy search metric.
To make the list of rules that I mentioned above, I made a dictionary of all words that appear in my data. Essentially, take each Name and split it into multiple rows by space. Then group by the found tokens and count how many times they appear. Manually look through the list of tokens; it should not be too long once you remove the rare tokens from it. Hopefully common words and contractions will be easy to spot. I would imagine that the words "Corporation" and "Corp" would appear many times, as opposed to the actual company name XYZ. Odd mistypes like "Coorporation" should be picked up by the fuzzy metric later.
In a similar way make a separate dictionary for Addresses, where you would see that Street and St. appear many times. For addresses you can "cheat" and get a list of common words from the index of some city map (street/st, road/rd, highway/hwy, grove/gv, etc.)
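A sketch of how such a token dictionary can be built in SQL Server follows. STRING_SPLIT requires SQL Server 2016 or later (on older versions you would need your own splitter), and the dbo.Companies table is again an assumed name.
-- Count how often each whitespace-separated token appears across all names.
SELECT s.value AS Token,
       COUNT(*) AS Occurrences
FROM dbo.Companies AS c
CROSS APPLY STRING_SPLIT(c.Name, ' ') AS s
WHERE LEN(s.value) > 0
GROUP BY s.value
HAVING COUNT(*) > 10   -- drop rare tokens; actual company names should fall below this threshold
ORDER BY Occurrences DESC;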
This is my implementation of the Jaro-Winkler metric:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
    /*
    The Winkler modification will not be applied unless the percent match
    was at or above the WeightThreshold percent without the modification.
    Winkler's paper used a default value of 0.7
    */
    private static readonly double m_dWeightThreshold = 0.7;

    /*
    Size of the prefix to be considered by the Winkler modification.
    Winkler's paper used a default value of 4
    */
    private static readonly int m_iNumChars = 4;

    [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
    public static SqlDouble StringSimilarityJaroWinkler(SqlString string1, SqlString string2)
    {
        if (string1.IsNull || string2.IsNull)
        {
            return 0.0;
        }
        return GetStringSimilarityJaroWinkler(string1.Value, string2.Value);
    }

    private static double GetStringSimilarityJaroWinkler(string string1, string string2)
    {
        int iLen1 = string1.Length;
        int iLen2 = string2.Length;
        if (iLen1 == 0)
        {
            return iLen2 == 0 ? 1.0 : 0.0;
        }

        int iSearchRange = Math.Max(0, Math.Max(iLen1, iLen2) / 2 - 1);

        bool[] Matched1 = new bool[iLen1];
        for (int i = 0; i < Matched1.Length; ++i)
        {
            Matched1[i] = false;
        }
        bool[] Matched2 = new bool[iLen2];
        for (int i = 0; i < Matched2.Length; ++i)
        {
            Matched2[i] = false;
        }

        int iNumCommon = 0;
        for (int i = 0; i < iLen1; ++i)
        {
            int iStart = Math.Max(0, i - iSearchRange);
            int iEnd = Math.Min(i + iSearchRange + 1, iLen2);
            for (int j = iStart; j < iEnd; ++j)
            {
                if (Matched2[j]) continue;
                if (string1[i] != string2[j]) continue;
                Matched1[i] = true;
                Matched2[j] = true;
                ++iNumCommon;
                break;
            }
        }
        if (iNumCommon == 0) return 0.0;

        int iNumHalfTransposed = 0;
        int k = 0;
        for (int i = 0; i < iLen1; ++i)
        {
            if (!Matched1[i]) continue;
            while (!Matched2[k])
            {
                ++k;
            }
            if (string1[i] != string2[k])
            {
                ++iNumHalfTransposed;
            }
            ++k;
            // even though length of Matched1 and Matched2 can be different,
            // number of elements with true flag is the same in both arrays
            // so, k will never go outside the array boundary
        }
        int iNumTransposed = iNumHalfTransposed / 2;

        double dWeight =
            (
                (double)iNumCommon / (double)iLen1 +
                (double)iNumCommon / (double)iLen2 +
                (double)(iNumCommon - iNumTransposed) / (double)iNumCommon
            ) / 3.0;

        if (dWeight > m_dWeightThreshold)
        {
            int iComparisonLength = Math.Min(m_iNumChars, Math.Min(iLen1, iLen2));
            int iCommonChars = 0;
            while (iCommonChars < iComparisonLength && string1[iCommonChars] == string2[iCommonChars])
            {
                ++iCommonChars;
            }
            dWeight = dWeight + 0.1 * iCommonChars * (1.0 - dWeight);
        }
        return dWeight;
    }
};
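To call the function from T-SQL, the compiled assembly has to be registered in the database, roughly like the sketch below; the assembly name and DLL path are assumptions, not my actual deployment script.
-- Hedged sketch: enable CLR, register the assembly and expose the function to T-SQL.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;

CREATE ASSEMBLY FuzzyMatching
FROM 'C:\clr\FuzzyMatching.dll'   -- assumed path to the compiled DLL
WITH PERMISSION_SET = SAFE;

CREATE FUNCTION dbo.StringSimilarityJaroWinkler (@string1 NVARCHAR(4000), @string2 NVARCHAR(4000))
RETURNS FLOAT
AS EXTERNAL NAME FuzzyMatching.UserDefinedFunctions.StringSimilarityJaroWinkler;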
You could also look at a more customized solution, for instance one built around the DIFFERENCE function (see: DIFFERENCE function, SQL Server).
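DIFFERENCE scores two strings from 0 to 4 based on their SOUNDEX codes, so a quick sanity check on one of the examples from the question might look like this (both codes should come out as C616, giving the maximum score of 4):
SELECT SOUNDEX('Corporation')  AS s1,
       SOUNDEX('Coorporation') AS s2,
       DIFFERENCE('Corporation', 'Coorporation') AS score;  -- expected: 4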
Is it possible for Name and City to be logically similar yet different, as well?
Since there's a lot of room for variation here and only you have access to the real data, only you can check what works and what kinds of exceptions you actually have.
But hopefully this will get you started.
-- Creating the test set
DECLARE @TESTTABLE TABLE (Name VARCHAR(256), City VARCHAR(256), Address VARCHAR(256))
INSERT INTO @TESTTABLE VALUES ('Billy bob' ,'New York' ,'Baker street 125')
INSERT INTO @TESTTABLE VALUES ('Billy bob' ,'New York' ,'Baker street 120')
INSERT INTO @TESTTABLE VALUES ('Billy bob' ,'New York' ,'Baker st 125')
INSERT INTO @TESTTABLE VALUES ('Billy bob' ,'New York' ,'Mallroad 1')
INSERT INTO @TESTTABLE VALUES ('James Dean' ,'Washington DC' ,'Primadonna road 15 c 100')
INSERT INTO @TESTTABLE VALUES ('James Dean' ,'Washington DC' ,'Primadonna r 15')
INSERT INTO @TESTTABLE VALUES ('Got Nuttin' ,'Philly' ,'Mystreet 1500') -- Doesn't show, since no real duplicates
And then, after the test data, the actual query.
-- The query
;WITH CTE AS
(SELECT DISTINCT SRC.RN, T1.*, DIFFERENCE(T1.Address, T2.Address) DIFF_FACTOR
FROM @TESTTABLE T1
JOIN @TESTTABLE T2 ON T2.Name = T1.Name AND T2.City = T1.City AND T1.Address <> T2.Address
JOIN (SELECT DENSE_RANK() OVER (ORDER BY Name, City) RN, Name, City FROM @TESTTABLE T3 GROUP BY Name, City HAVING COUNT(*) > 1) SRC
ON SRC.City = T1.City AND SRC.Name = T1.Name)
SELECT DISTINCT RN, Name, City, COUNT(DISTINCT C.Address) Address_CT
, STUFF((SELECT ','+B.Address
FROM CTE B
WHERE B.RN = C.RN AND B.DIFF_FACTOR = C.DIFF_FACTOR
ORDER BY B.Address ASC
FOR XML PATH('')),1,1,'') AllAddresses
, DIFF_FACTOR
FROM CTE C
WHERE DIFF_FACTOR > 1 -- Comment this row to see that 'Mallroad 1' was considered to be too different from the rest, so this filter prevents us from considering that in the result set
GROUP BY RN, Name, City, DIFF_FACTOR
ORDER BY RN ASC, DIFF_FACTOR DESC
That is probably not the most effective - or accurate - way to go about doing this, but it's a good place to start and shows what can be done. If there's a chance for Name and City to also be different but duplicates to human eyes, you could modify the query to match any two identical column values, comparing the third. But it gets really difficult to automate comparisons in cases where you have one identifying column and both of the others can differ from one another to varying degrees.
I suspect you need to make several queries to first sort out the biggest mess, and eventually find the last most evasive "duplicates" by hand, a few at a time.
Related
Given I have a table with two string columns:
A                           | B
John likes to go jumpping   | Max likes swimming but he also likes to go jummping
John is cool                | max is smart
John                        | max
In BigQuery SQL, how can I find the longest common substring, such that I get:
A                          | B                                                    | C
John likes to go jumping   | Max likes swimming but he also likes to go jumping   | likes to go jumping
John is cool               | max is smart                                         | is
John                       | max                                                  | null
Try the below, very much SQL-ish approach:
select A, B,
(
select string_agg(word, ' ' order by a_pos) phrase
from unnest(split(A, ' ')) word with offset a_pos
join unnest(split(B, ' ')) word with offset b_pos
using(word)
group by b_pos - a_pos
order by length(phrase) desc
limit 1
) as C
from `project.dataset.table`
When applied to the sample data in your question, the output is as shown in the expected result above.
Obviously your example is very simple, so in a real use case you might need to adjust the above to reflect reality.
Also, note: there are many other options/approaches for your problem that SO already has multiple answers for, including mine; for text similarity they are mostly based on using a JS UDF and Levenshtein distance or similar algorithms.
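For reference, the query can also be tried without creating a table by inlining the sample rows in a WITH clause (using the corrected spellings from the expected output above):
with sample_data as (
  select 'John likes to go jumping' as A, 'Max likes swimming but he also likes to go jumping' as B union all
  select 'John is cool', 'max is smart' union all
  select 'John', 'max'
)
select A, B,
  (
    select string_agg(word, ' ' order by a_pos) phrase
    from unnest(split(A, ' ')) word with offset a_pos
    join unnest(split(B, ' ')) word with offset b_pos
    using(word)
    group by b_pos - a_pos
    order by length(phrase) desc
    limit 1
  ) as C
from sample_data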
This is probably not a problem for your SQL to solve (it is, though, very simple to solve via any scripting language). However, BigQuery does support JS-based UDFs, which usually come in handy for solving such problems.
Here is an option (which at its core is not SQL) that you can take in BigQuery:
CREATE TEMP FUNCTION lcsub(a string, b string)
RETURNS STRING
LANGUAGE js AS """
  // split both strings into word arrays and look for the longest run of
  // consecutive words from b that also appears consecutively in a
  a = a.split(' ');
  b = b.split(' ');
  let la = a.length;
  let lb = b.length;
  let output = [];
  for (var i = 0; i < la; i++) {
    for (var j = 0; j < lb; j++) {
      if (a[i] == b[j]) {
        // candidate run starting at a[i] / b[j]; extend it word by word
        let u = [b[j]];
        let aidx = i;
        for (var k = j + 1; k < lb; k++) {
          u.push(b[k]);
          if (u.join(' ') == a.slice(i, aidx + 1 + 1).join(' ')) {
            if (u.length >= output.length) {
              output = u;
            }
          }
          else {
            // run broken: drop the last word and keep the best run seen so far
            u.pop();
            if (u.length >= output.length) {
              output = u;
            }
            break;
          }
          aidx += 1;
          if (aidx > la - 1) {
            break;
          }
        }
      }
    }
  }
  return output.join(' ')
""";
select A, B, lcsub(A, B) as C from dataset.table
My project is a Latin language learning app. My DB has all the words I'm teaching, in the table 'words'. It has the lemma (the main form of the word), along with the definition and other information the user needs to learn.
I show one word at a time for them to guess/remember what it means. The correct word is shown along with some wrong words, like:
What does Romanus mean? Greek - /Roman/ - Phoenician - barbarian
What does domus mean? /house/ - horse - wall - senator
The wrong options are randomly drawn from the same table, and must be from the same part of speech (adjective, noun...) as the correct word; but I am only interested in their lemma. My return value looks like this (some properties omitted):
[
{ lemma: 'Romanus', definition: 'Roman', options: ['Greek', 'Phoenician', 'barbarian'] },
{ lemma: 'domus', definition: 'house', options: ['horse', 'wall', 'senator'] }
]
What I am looking for is a more efficient way of doing it than my current approach, which runs a new query for each word:
// All the necessary requires are here
class Word extends Model {
  static async fetch() {
    const words = await this.findAll({
      limit: 10,
      order: [Sequelize.literal('RANDOM()')],
      attributes: ['lemma', 'definition'], // also a few other columns I need
    });
    const wordsWithOptions = await Promise.all(words.map(this.addOptions.bind(this)));
    return wordsWithOptions;
  }

  static async addOptions(word) {
    const options = await this.findAll({
      order: [Sequelize.literal('RANDOM()')],
      limit: 3,
      attributes: ['lemma'],
      where: {
        partOfSpeech: word.dataValues.partOfSpeech,
        lemma: { [Op.not]: word.dataValues.lemma },
      },
    });
    return { ...word.dataValues, options: options.map((row) => row.dataValues.lemma) };
  }
}
So, is there a way I can do this with raw SQL? How about Sequelize? One thing that still helps me is to give a name to what I'm trying to do, so that I can Google it.
EDIT: I have tried the following and at least got somewhere:
const words = await this.findAll({
  limit: 10,
  order: [Sequelize.literal('RANDOM()')],
  attributes: {
    include: [[sequelize.literal(`(
      SELECT lemma FROM words AS options
      WHERE "partOfSpeech" = "options"."partOfSpeech"
      ORDER BY RANDOM() LIMIT 1
    )`), 'options']],
  },
});
Now, there are two problems with this. First, I only get one option, when I need three; but if the query has LIMIT 3, I get: SequelizeDatabaseError: more than one row returned by a subquery used as an expression.
The second error is that while the code above does return something, it always gives the same word as an option! I thought to remedy that with WHERE "partOfSpeech" = "options"."partOfSpeech", but then I get SequelizeDatabaseError: invalid reference to FROM-clause entry for table "words".
So, how do I tell PostgreSQL "for each row in the result, add a column with an array of three lemmas, WHERE existingRow.partOfSpeech = wordToGoInTheArray.partOfSpeech?"
Revised
Well that seems like a different question and perhaps should be posted that way, but...
The main technique remains the same: JOIN instead of sub-select. The difference is generating the list of lemmas and then piping them into the initial query. In a single statement this can get nasty.
As a single statement (actually this turned out not to be too bad):
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma in( select lemma
from words
order by random()
limit 4 --<<< replace with parameter
)
group by w.lemma, w.definition;
The other approach builds a small SQL function to randomly select a specified number of lemmas. This selection is then piped into the (renamed) function from the previous fiddle.
create or replace
function exam_lemma_definition_options(lemma_array_in text[])
returns table (lemma text
,definition text
,option text[]
)
language sql strict
as $$
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(lemma_array_in)
group by w.lemma, w.definition;
$$;
create or replace
function exam_lemmas(num_of_lemmas integer)
returns text[]
language sql
strict
as $$
select string_to_array(string_agg(lemma,','),',')
from (select lemma
from words
order by random()
limit num_of_lemmas
) ll
$$;
Using this approach, your calling code reduces to a single SQL statement:
select *
from exam_lemma_definition_options(exam_lemmas(4))
order by lemma;
This permits you to specify the number of lemmas to select (in this case 4), limited only by the number of rows in the words table. See the revised fiddle.
Original
Instead of using a sub-select to get the option words just JOIN.
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(array['Romanus', 'domus'])
group by w.lemma, w.definition;
See fiddle. Obviously this will not necessarily produce the same options as your question provides, due to the random() selection, but it will get matching parts of speech. I will leave translation to your source language to you; or you can use the function option and reduce your SQL to a simple "select *".
I'm looking for Berkeley DB equivalent of
SELECT COUNT All, SELECT COUNT WHERE LIKE "%...%"
I have got 100 records with keys: 1, 2, 3, ... 100.
I have got the following code:
//Key = 1
i=1;
strcpy_s(buf, to_string(i).size()+1, to_string(i).c_str());
key.data = buf;
key.size = to_string(i).size()+1;
key.flags = 0;
data.data = rbuf;
data.size = sizeof(rbuf)+1;
data.flags = 0;
//Cursor
if ((ret = dbp->cursor(dbp, NULL, &dbcp, 0)) != 0) {
dbp->err(dbp, ret, "DB->cursor");
goto err1;
}
//Get
dbcp->get(dbcp, &key, &data_read, DB_SET_RANGE);
db_recno_t cnt;
dbcp->count(dbcp, &cnt, 0);
cout <<"count: "<<cnt<<endl;
Count cnt is always 1, but I expect it to count all the partial key matches for Key=1: 1, 10, 11, 21, ... 91.
What is wrong in my code/understanding of DB_SET_RANGE ?
Is it possible to get SELECT COUNT WHERE LIKE "%...%" in BDB ?
Also is it possible to get SELECT COUNT All records from the file ?
Thanks
You're expecting Berkeley DB to be way more high-level than it actually is. It doesn't contain anything like what you're asking for. If you want the equivalent of WHERE field LIKE '%1%' you have to make a cursor, read through all the values in the DB, and do the string comparison yourself to pick out the ones that match. That's what an SQL engine actually does to implement your query, and if you're using libdb instead of an SQL engine, it's up to you. If you want it done faster, you can use a secondary index (much like you can create additional indexes for a table in SQL), but you have to provide some code that links the secondary index to the main DB.
DB_SET_RANGE is useful to optimize a very specific case: you're looking for items whose key starts with a specific substring. You can DB_SET_RANGE to find the first matching key, then DB_NEXT your way through the matches, and stop when you get a key that doesn't match. This works only on DB_BTREE databases because it depends on the keys being returned in lexical order.
The count method tells you how many exact duplicate keys there are for the item at the current cursor position.
You can use the DB->stat() method.
For example, to get the number of unique keys in a B-tree database:
bool row_amount(DB *db, size_t &amount) {
    amount = 0;
    if (db == NULL) return false;
    DB_BTREE_STAT *sp;
    int ret = db->stat(db, NULL, &sp, 0);
    if (ret != 0) return false;
    amount = (size_t)sp->bt_nkeys;
    free(sp); /* DB->stat() allocates the statistics structure; the caller is responsible for freeing it */
    return true;
}
I have a sqlite3 database in which I have corrupt data. I qualify "corrupt" with the following characteristics:
Data in the name, telephone, latitude, longitude columns is corrupt if: the value is NULL or "" or its length is < 2
Data in the address column is corrupt if: the value is NULL or "" or the number of words is < 2 and the length of a word is < 2
To test this I wrote the following script in Ruby:
require 'sqlite3'
db = SQLite3::Database.new('development.sqlite3')
db.results_as_hash = true;
#Checks for empty strings in name, address, telephone, latitude, longitude
#Also checks length of strings is valid
rows = db.execute(" SELECT * FROM listings WHERE LENGTH('telephone') < 2 OR LENGTH('fax') < 2 OR LENGTH('address') < 2 OR LENGTH('city') < 2 OR LENGTH('province') < 2 OR LENGTH('postal_code') < 2 OR LENGTH('latitude') < 2 OR LENGTH('longitude') < 2
OR name = '' OR address = '' OR telephone = '' OR latitude = '' OR longitude = '' ")
rows.each do |row|
=begin
db.execute("INSERT INTO missing (id, name, telephone, fax, suite, address, city, province, postal_code, latitude, longitude, url) VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", row['id'], row['name'], row['telephone'], row['fax'], row['suite'], row['address'], row['city'], row['province'],
row['postal_code'], row['latitude'], row['longitude'], row['url'] )
=end
id_num = row['id']
puts "Id = #{id_num}"
corrupt_name = row['name']
puts "name = #{corrupt_name}"
corrupt_address = row['address']
puts "address = #{corrupt_address}"
corrupt_tel = row['telephone']
puts "tel = #{corrupt_tel}"
corrupt_lat = row['latitude']
puts "lat = #{corrupt_lat}"
corrupt_long = row['longitude']
puts "lat = #{corrupt_long}"
puts '===end===='
end
#After inserting the records into the new table delete them from the old table
=begin
db.execute(" DELETE * FROM listings WHERE LENGTH('telephone') < 2 OR LENGTH('fax') < 2 OR LENGTH('address') < 2 OR
LENGTH('city') < 2 OR LENGTH('province') < 2 OR LENGTH('postal_code') < 2 OR LENGTH('latitude') < 2 OR LENGTH('longitude') < 2
OR name = '' OR address = '' OR telephone = '' OR latitude = '' OR longitude = '' ")
=end
This works, but I'm new to Ruby and DB programming, so I would welcome any suggestions to make this query better.
The ultimate goal I have is to run a script on my database which tests the validity of data in it and if there are some data that are not valid they are copied to a different table and deleted from the 1st table.
Also, I would like to add to this query a test to check for duplicate entries.
I qualify an entry as duplicate if more than 1 rows share the same name and the same address and the same telephone and the same latitude and the same longitude
I came up with this query, but I'm not sure if it's the most optimal:
SELECT *
FROM listings L1, listings L2
WHERE L1.name = L2.name
AND L1.telephone = L2.telephone
AND L1.address = L2.address
AND L1.latitude = L2.latitude
AND L1.longitude = L2.longitude
Any suggestions, links, help would be greatly appreciated
Your first query doesn't have any significant performance problem; it will run as a seq scan, evaluating your "is corrupt" predicate. The check for == '' is redundant with length(foo) < 2, since length('') is < 2. You have a bug where you quoted the field names in your length() calls, so you'll be evaluating the length of the literal field name instead of the value of the field. You have also failed to test for NULL, which is a value distinct from ''. You can use the coalesce function to convert NULL to '' and capture NULLs with the length check. You also don't seem to have addressed the special word-based rule for address. The latter is trouble unless you extend SQLite with a regexp function; I suggest approximating it with LIKE or GLOB.
Try this alternative:
SELECT * FROM listings
WHERE LENGTH(coalesce(telephone,'')) < 2
OR LENGTH(coalesce(fax,'')) < 2
OR LENGTH(coalesce(city,'')) < 2
OR LENGTH(coalesce(province,'')) < 2
OR LENGTH(coalesce(postal_code,'')) < 2
OR LENGTH(coalesce(latitude,'')) < 2
OR LENGTH(coalesce(longitude,'')) < 2
OR LENGTH(coalesce(name,'')) < 2
OR LENGTH(coalesce(address,'')) < 5
OR trim(address) not like '%__ __%'
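To match your stated goal of copying the bad rows into another table and then deleting them from listings, the same predicate can simply be reused twice inside a transaction. A sketch, assuming a missing table with the same columns already exists (the conditions are abbreviated here):
BEGIN TRANSACTION;

INSERT INTO missing
SELECT * FROM listings
WHERE LENGTH(coalesce(name,'')) < 2
   OR LENGTH(coalesce(address,'')) < 5
   OR trim(address) NOT LIKE '%__ __%';   -- same conditions as the full query above

DELETE FROM listings
WHERE LENGTH(coalesce(name,'')) < 2
   OR LENGTH(coalesce(address,'')) < 5
   OR trim(address) NOT LIKE '%__ __%';

COMMIT;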
Your find-duplicates query doesn't work, since there's always at least one record to match when self-joining on equality. You need to exclude the record under test on one side of the join. Typically this can be done by excluding on the primary key. You haven't mentioned whether the table has a primary key, but IIRC SQLite can give you a proxy for one with ROWID. Something like this:
SELECT L1.*
FROM listings L1
where exists (
select null
from listings L2
where L1.ROWID <> L2.ROWID
AND L1.name = L2.name
AND L1.telephone = L2.telephone
AND L1.address = L2.address
AND L1.latitude = L2.latitude
AND L1.longitude = L2.longitude
)
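If, after reviewing them, you want to remove the duplicates while keeping one row from each group, a common SQLite pattern is to keep the smallest ROWID per group. A sketch:
-- Keep the row with the lowest ROWID in each duplicate group and delete the rest.
DELETE FROM listings
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)
    FROM listings
    GROUP BY name, telephone, address, latitude, longitude
);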
BTW, while you stressed efficiency in your question, it's important to make your code correct before you worry about efficiency.
I think you're doing overprocessing. As the length of the string '' is 0, it matches the condition length('') < 2. So you don't need to check whether a field is equal to '', as it has already been filtered out by the conditions on the length function.
However, I don't see how you're checking for null values. I'd replace all the aField = '' with aField is null.
I am having some trouble creating some SQL (for SQL Server 2008).
I have a table of tasks whose names are priority-ordered, comma-delimited lists of tasks:
Id = 1, LongTaskName = "a,b,c"
Id = 2, LongTaskName = "a,c"
Id = 3, LongTaskName = "b,c"
Id = 4, LongTaskName = "a"
etc...
I am trying to build a new table that groups them by the first task, along with the id:
GroupName: "a", TaskId: 1
GroupName: "a", TaskId: 2
GroupName: "a", TaskId: 4
GroupName: "b", TaskId: 3
Here is the naive, slow LINQ code:
foreach (var t in Tasks)
{
    var gt = new GroupedTasks();
    gt.TaskId = t.Id;
    var firstWord = t.LongTaskName.Split(',');
    if (firstWord.Count() > 0)
    {
        gt.GroupName = firstWord.First();
    }
    else
    {
        gt.GroupName = t.LongTaskName;
    }
    GroupedTasks.InsertOnSubmit(gt);
}
I wrote a sql function to do the string split:
create function fn_Split(
    @String nvarchar (4000),
    @Delimiter nvarchar (10)
)
returns nvarchar(4000)
begin
    declare @FirstComma int
    set @FirstComma = charindex(@Delimiter, @String)
    if(@FirstComma = 0)
        return @String
    return substring(@String, 0, @FirstComma)
end
go
However, I am getting stuck on the real sql to do the work.
I can get the group by alone:
SELECT dbo.fn_Split(LongTaskName, ',')
FROM [dbo].[Tasks]
GROUP BY dbo.fn_Split(LongTaskName, ',')
And I know I need to head down something like this:
DECLARE @RowSet TABLE (GroupName nvarchar(1024), Id nvarchar(5))
insert into @RowSet
select ???
FROM [dbo].Tasks as T
INNER JOIN
(
SELECT dbo.fn_Split(LongTaskName, ',')
FROM [dbo].[Tasks]
GROUP BY dbo.fn_Split(LongTaskName, ',')
) G
ON T.??? = G.???
ORDER BY ???
INSERT INTO dbo.GroupedTasks(GroupName, Id)
select * from #RowSet
But I am not quite grokking how to reference the grouped relationships, and am confused about having to call split multiple times.
Any thoughts?
If you only care about the first item in the list, there's really no need for a function. I would recommend this way. You also don't need the @RowSet table variable for any temporary holding.
INSERT dbo.GroupedTasks(GroupName, Id)
SELECT
LEFT(LongTaskName, COALESCE(NULLIF(CHARINDEX(',', LongTaskName)-1, -1), 1024)),
Id
FROM dbo.Tasks;
It is even easier if the tasks are one character long: you can use LEFT(LongTaskName, 1) instead of the ugly SUBSTRING/CHARINDEX mess. But I'm guessing your task names are not one character long (if they are, you should include some sample data that varies a bit so that others don't make assumptions about length).
Now, keep in mind that you'll have to do something like this to keep dbo.GroupedTasks up to date every time a dbo.Tasks row is inserted, updated or deleted. How are you going to keep these two tables in sync?
More to the point, you should consider storing the top priority task separately in the first place, either by using a computed column or separating it out before the insert. Munging data together is something that you do with hash tables and arrays in application code, but it rarely has any positive attributes inside a database. You almost always spend more time and effort extracting the data apart than you ever saved by keeping it together in the first place. This will negate the need for a second table at all.
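If you go the computed-column route, a minimal sketch (reusing the CHARINDEX expression from above; the column and index names are illustrative) could look like this:
-- Persist the first task as its own column so no second table is needed.
ALTER TABLE dbo.Tasks
    ADD FirstTask AS LEFT(LongTaskName, COALESCE(NULLIF(CHARINDEX(',', LongTaskName) - 1, -1), 1024)) PERSISTED;

-- It can then be indexed and queried directly:
CREATE INDEX IX_Tasks_FirstTask ON dbo.Tasks (FirstTask);
SELECT FirstTask AS GroupName, Id AS TaskId FROM dbo.Tasks ORDER BY FirstTask, Id;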
SELECT Id, dbo.fn_Split(LongTaskName, ',') AS GroupName INTO TasksWithGroupInfo FROM dbo.Tasks
Does this answer your question?