Delete duplicate rows with soundex? - sql

I have two tables, one has foreign keys to the other. I want to delete duplicates from Table 1 at the same time updating the keys on Table 2. I.e count the duplicates on Table 1 keep 1 key from the duplicates and query the rest of the duplicate records on Table 2 replacing them with the key I'm keeping from Table 1. Soundex would be the best option because not all the names are spelled right in Table 1. I have the basic algorithm but not sure how to do it. Help?
So far this is what I have:
declare #Duplicate int
declare #OriginalKey int
create table #tempTable1
(
CourseID int, <--- The Key I want to keep or delete
SchoolID int,
CourseName nvarchar(100),
Category nvarchar(100),
IsReqThisYear bit,
yearrequired int
);
create table #tempTable2
(
CertID int,
UserID int,
CourseID int, <---- Must stay updated with Table 1
SchoolID int,
StartDateOfCourse datetime,
EndDateOfCourse datetime,
Type nvarchar(100),
HrsOfClass float,
Category nvarchar(100),
Cost money,
PassFail varchar(20),
Comments nvarchar(1024),
ExpiryDate datetime,
Instructor nvarchar(200),
Level nchar(10)
)
--Deletes records from Table 1 not used in Table 2--
delete from Table1
where CourseID not in (select CourseID from Table2 where CourseID is not null)
insert into #tempTable1(CourseID, SchoolID, CourseName, Category, IsReqThisYear, yearrequired)
select CourseID, SchoolID, CourseName, Category, IsReqThisYear, yearrequired from Table1
insert into #tempTable2(CertID, UserID, CourseID, SchoolID, StartDateOfCourse, EndDateOfCourse, Type, HrsOfClass,Category, Cost, PassFail, Comments, ExpiryDate, Instructor, Level)
select CertID, UserID, CourseID, SchoolID, StartDateOfCourse, EndDateOfCourse, Type, HrsOfClass,Category, Cost, PassFail, Comments, ExpiryDate, Instructor, Level from Table2
select cour.CourseName, Count(cour.CourseName) cnt from Table1 as cour
join #tempTable1 as temp on cour.CourseID = temp.CourseID
where SOUNDEX(temp.CourseName) = SOUNDEX(cour.CourseName) <---
The last part does not exactly work, gives me an error
Error: Column 'Table1.CourseName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
UPDATE: Some of the names in CourseName have numbers in them too. Like some are in romans and numeral format. Need to find those too but Soundex ignores numbers.

Related

How to sum/group/sort columns in a view

I have two tables, Author_Dim and Author_Fact as below:
CREATE TABLE Author_Dim
(
TitleAuthor_ID INT IDENTITY(1,1),
Title_ID CHAR(20),
Title VARCHAR(80),
Type_Title CHAR(12),
Author_ID CHAR(20),
Last_Name VARCHAR(40),
First_Name VARCHAR(20),
Contract_Author BIT,
Author_Order INT,
PRIMARY KEY (TitleAuthor_ID),
);
GO
CREATE TABLE Author_Fact
(
Fact_ID INT IDENTITY(1,1),
TitleAuthor_ID INT,
Author_ID CHAR (20),
Price DEC(10,2),
YTD_Sales INT,
Advance DEC(10,2),
Royalty INT,
Royalty_Perc INT,
Total_Sales DEC(10,2),
Total_Advance DEC(10,2),
Total_Royalty DEC(10,2)
PRIMARY KEY(Fact_ID),
FOREIGN KEY (TitleAuthor_ID) REFERENCES Author_Dim(TitleAuthor_ID),
);
go
I wish to create a view that gives the total royalties paid per author and then sorts it with the highest paid author shown first, i.e. it sums the Total_Royalty column, groups it by the Author_ID and then sorts the Total_Royalty in descending order.
I have the below but I'm not sure how to add the sum/group/sort functions to the view:
Create view [Total_Royalty_View] As (
Select Author_Dim.Author_ID, Author_Dim.Last_Name, Author_Dim.First_Name, Author_Fact.Total_Royalty
From Author_Dim
Join Author_Fact
On Author_Fact.TitleAuthor_ID = Author_Dim.TitleAuthor_ID
);
Go
In SQL it is all tables. You select from tables and the result is again a table (consisting of columns and rows). You can store a select statement for re-use and this is called a view. You could just as well write an ad-hoc view (a subquery in a from clause). Their results are again tables.
And tables are considered unordered sets of data.
So, you cannot write a view that produces an ordered set of rows.
Here is the view (unordered):
create view total_royalty_view as
select
a.author_id,
a.last_name,
a.first_name,
coalesce(r.sum_total_royalty, 0) as total_royalty
from author_dim a
left join
(
select titleauthor_id, sum(total_royalty) as sum_total_royalty
from author_fact
group by titleauthor_id
) r on r.titleauthor_id = a.titleauthor_id;
And here is how to select from it:
select *
from total_royalty_view
order by total_royalty desc;

Select from a Temp table is giving slow performance

I need some help with the below query where the last step of
Select * from #PersonDetail order by....
is taking so long to execute - why?
There are millions of records being inserted in this temp table #PersonDetail and insert process takes a few seconds, but the last Select from this same temp table is taking so long.
I created a unique clustered index on the columns used for order by and tried many other options but it doesn't make any difference in the performance.
It is a big stored procedure with many temp table but it is this last select step which is impacting the performance. Here is an example of the last step of the query:
DROP TABLE IF EXISTS #PersonDetail
CREATE TABLE #PersonDetail
(
PersonId INT NOT NULL,
Name NVARCHAR(50) NULL,
Number INT NOT NULL,
Tag NVARCHAR(50) NULL,
UserId INT NOT NULL,
NumberEncrypted VARCHAR(100),
Type NVARCHAR(255),
Status NVARCHAR(50),
CreatedDate DATETIMEOFFSET(7),
AddressDetailId NVARCHAR(50),
Category NVARCHAR(50),
PrimaryId INT,
DailyAmount MONEY,
UNIQUE (PersonId UserId),
UNIQUE CLUSTERED(CreatedDate, UserId)
)
INSERT INTO #PersonDetail (PersonId, Name, Number, Tag, UserId, NumberEncrypted,
Type, Status, CreatedDate, AddressDetailId, Category, PrimaryId, Amount)
SELECT
PersonId, Name, Number, Tag, UserId, NumberEncrypted,
Type, Status, CreatedDate, AddressDetailId, Category, PrimaryId, DailyAmount
FROM
#User u
JOIN
dbo.DailyAmount da (NOLOCK) ON da.UserId = u.UserId
SELECT *
FROM #PersonDetail pd
ORDER BY CreatedDate, UserId
You must specify what database are you using.
In general, you must do theese things:
create some indexes on the join columns (DailyAmount.userId, User.userID); how to create the index must change;
create an index on the order by columns, (CreatedDate+UserID); this must change, in postgresql for example an index with the 2 column is better than 2 indexes;
If your data are not changing frequently, you could try materialized view and create the indexes on the materialized view.

How can insert value by selecting from another

I have three tables. table1 for insert data and table2, table3 for selecting data.
table1:
int id,
varchar name,
varchar gender,
varchar category,
table2:
int id,
varchar gender,
table3:
int id,
varchar category,
According to these tables, i'm using stored procedure as below:
Create procedure ABC
(
#id varchar(10),
#name varchar(11),
#gender varchar(10),
#category varchar(10)
)
As
begin
Insert into table1(id, name, gender, category) select #id, #name, g.id, c.id from gender g, category c where g.gender=#gender and c.category=#category
end
Now, using this if gender or category is empty then data is not inserting. Please tell me where i'm wrong and tell best query to insert data using select query. Please help me to solve the problem. Thanks
You are using and in where clause which means only if gender = #gender then it is selecting any record.
Create procedure ABC
(
#id int, ----- change this to int since id is int type in your table
#name varchar(11),
#gender varchar(10)='',
#category varchar(10)=''
)
As
begin
Insert into table1(id, name, gender, category)
select #id, #name, g.id, c.id from gender g, category c
where (g.gender=#gender or isnull(#gender,'')='') and (c.category=#category or isnull(#category,'')='')
end
Although you may use this approach. But since you want to insert id of gender and category column in your table, then on passing null or blank value what is your expected id to be put in your record.
It is better if you first check gender and category value in your respected table and if record not found then insert it in your table and then save its corresponding id in your final table.

group by clause

i have a table called table1 and it has following columns.
suppose there are records like localamount is 20,000, 30000,50000, 100000 then as per my condition i have to delete records from this table according to the group by set of site id, till id, transid,shift id where localamount exceeds 10,000... the rest of the records can be available?
my aim is to delete rows from this table where local amount is exceeds 10,0000 according to site id, till id, transid,shift id
SiteId varchar(10),
TillId tinyint,
ShiftId int,
TransId int,
TranDate datetime,
SettlementType varchar(5),
CreditCardNumber varchar(25),
ProductTypeCode varchar(10),
NewProductTypeCode varchar(10),
TransactionType int,
ForeignAmount money,
LocalAmount money,
ProductCode varchar(10)
Im not sure I understand what you are saying, but couldn't you do this without a group by?
delete from table1 where LocalAmount > 10,0000[sic] and SiteId = whatever and TillId = whatever...
obviously take the [sic] out...
Assuming you want to delete the whole group where the sum is > 10000
;with cte as
(
select sum(localamount) over
(partition by siteid, tillid, transid,shiftid) as l,
* from table1
)
delete from cte where l>10000

versioning of a table

anybody has seen any examples of a table with multiple versions for each record
something like if you would had the table
Person(Id, FirstName, LastName)
and you change a record's LastName than you would have both versions of LastName (first one, and the one after the change)
I've seen this done two ways. The first is in the table itself by adding an EffectiveDate and CancelDate (or somesuch). To get the current for a given record, you'd do something like: SELECT Id, FirstName, LastName FROM Table WHERE CancelDate IS NULL
The other is to have a global history table (which holds all of your historical data). The structure for such a table normally looks something like
Id bigint not null,
TableName nvarchar(50),
ColumnName nvarchar(50),
PKColumnName nvarchar(50),
PKValue bigint, //or whatever datatype
OriginalValue nvarchar(max),
NewValue nvarchar(max),
ChangeDate datetime
Then you set a trigger on your tables (or, alternatively, add a policy that all of your Updates/Inserts will also insert into your HX table) so that the correct data is logged.
The way we're doing it (might not be the best way) is to have an active bit field, and a foreign key back to the parent record. So for general queries you would filter on active employees, but you can get the history of a single employee with their Employee ID.
declare #employees
(
PK_emID int identity(1,1),
EmployeeID int,
FirstName varchar(50),
LastName varchar(50),
Active bit,
FK_EmployeeID int
primary key(PK_emID)
)
insert into #employees
(
EmployeeID,
FirstName,
LastName,
Active,
FK_EployeeID
)
select 1, 'David', 'Engle', 1,null
union all
select 2, 'Amy', 'Edge', 0,null
union all
select 2, 'Amy','Engle',1,2