Access 2016 Using Top in a Union Query - sql

I am trying to create a query that will output the top 6 results based on a union of two other queries. The two queries are as follows:
SELECT GameData.GameID,
GameData.DivisionID,
GameData.SeasonID,
GameData.HomeTeamID AS TeamID,
GameData.GameDate,
IIf([FullTimeResult] = 'H', 1, 0) AS W,
IIf([FullTimeResult] = 'D', 1, 0) AS D,
IIf([FullTimeResult] = 'A', 1, 0) AS L,
IIf([HalfTimeResult] = 'H', 1, 0) AS WHT,
IIf([HalfTimeResult] = 'D', 1, 0) AS DHT,
IIf([HalfTimeResult] = 'A', 1, 0) AS LHT,
GameData.FullTimeHomeGoals AS GS,
GameData.FullTimeAwayGoals AS GC,
IIf([FullTimeResult] = 'H', 3, IIf([FullTimeResult] = 'D', 1, 0)) AS P,
GameData.HalfTimeHomeGoals AS GSHT,
GameData.HalfTimeAwayGoals AS GCHT,
GameData.HomeShots AS Sh,
GameData.AwayShots AS ShA,
GameData.HomeShotsOnTarget AS ShOnT,
GameData.AwayShotsOnTarget AS ShAOnT,
GameData.HomeFouls AS FM,
GameData.AwayFouls AS FA,
GameData.HomeCorners AS C,
GameData.AwayCorners AS CA,
GameData.HomeYellowCards AS YC,
GameData.AwayYellowCards AS YCA,
GameData.HomeRedCards AS RC,
GameData.AwayRedCards AS RCA
FROM TeamsDivSea
INNER JOIN GameData ON TeamsDivSea.TeamID = GameData.HomeTeamID
WHERE (
(
(GameData.GameID) IN (
SELECT TOP 3 GameID
FROM GameData AS Dupe
WHERE Dupe.HomeTeamID = GameData.HomeTeamID
ORDER BY Dupe.GameDate DESC,
Dupe.GameID DESC
)
)
);
SELECT GameData.GameID,
GameData.DivisionID,
GameData.SeasonID,
GameData.AwayTeamID AS TeamID,
GameData.GameDate,
IIf([FullTimeResult] = 'A', 1, 0) AS W,
IIf([FullTimeResult] = 'D', 1, 0) AS D,
IIf([FullTimeResult] = 'H', 1, 0) AS L,
IIf([HalfTimeResult] = 'A', 1, 0) AS WHT,
IIf([HalfTimeResult] = 'D', 1, 0) AS DHT,
IIf([HalfTimeResult] = 'H', 1, 0) AS LHT,
GameData.FullTimeAwayGoals AS GS,
GameData.FullTimeHomeGoals AS GC,
IIf([FullTimeResult] = 'A', 3, IIf([FullTimeResult] = 'D', 1, 0)) AS P,
GameData.HalfTimeAwayGoals AS GSHT,
GameData.HalfTimeHomeGoals AS GCHT,
GameData.AwayShots AS Sh,
GameData.HomeShots AS ShA,
GameData.AwayShotsOnTarget AS ShOnT,
GameData.HomeShotsOnTarget AS ShAOnT,
GameData.AwayFouls AS FM,
GameData.HomeFouls AS FA,
GameData.AwayCorners AS C,
GameData.HomeCorners AS CA,
GameData.AwayYellowCards AS YC,
GameData.HomeYellowCards AS YCA,
GameData.AwayRedCards AS RC,
GameData.HomeRedCards AS RCA
FROM TeamsDivSea
INNER JOIN GameData ON TeamsDivSea.TeamID = GameData.AwayTeamID
WHERE (
(
(GameData.GameID) IN (
SELECT TOP 3 GameID
FROM GameData AS Dupe
WHERE Dupe.AwayTeamID = GameData.AwayTeamID
ORDER BY Dupe.GameDate DESC,
Dupe.GameID DESC
)
)
);
Is it possible to create a union query only using one SELECT TOP statement so only the top 6 are output from the joined results?
Many thanks

This should work - basically UNION your two queries and select the top 6.
SELECT TOP 6 GameID, DivisionID, SeasonID, TeamID, GameDate, W,D,L,WHT,DHT,LHT,GS,GC,P,GSHT,GCHT,Sh,shA, shOnT, ShAOnt, FM,FA,C,CA,YC,YCA,RC,RCA
FROM (
SELECT GameData.GameID,
GameData.DivisionID,
GameData.SeasonID,
GameData.HomeTeamID AS TeamID,
GameData.GameDate,
IIf([FullTimeResult] = 'H', 1, 0) AS W,
IIf([FullTimeResult] = 'D', 1, 0) AS D,
IIf([FullTimeResult] = 'A', 1, 0) AS L,
IIf([HalfTimeResult] = 'H', 1, 0) AS WHT,
IIf([HalfTimeResult] = 'D', 1, 0) AS DHT,
IIf([HalfTimeResult] = 'A', 1, 0) AS LHT,
GameData.FullTimeHomeGoals AS GS,
GameData.FullTimeAwayGoals AS GC,
IIf([FullTimeResult] = 'H', 3, IIf([FullTimeResult] = 'D', 1, 0)) AS P,
GameData.HalfTimeHomeGoals AS GSHT,
GameData.HalfTimeAwayGoals AS GCHT,
GameData.HomeShots AS Sh,
GameData.AwayShots AS ShA,
GameData.HomeShotsOnTarget AS ShOnT,
GameData.AwayShotsOnTarget AS ShAOnT,
GameData.HomeFouls AS FM,
GameData.AwayFouls AS FA,
GameData.HomeCorners AS C,
GameData.AwayCorners AS CA,
GameData.HomeYellowCards AS YC,
GameData.AwayYellowCards AS YCA,
GameData.HomeRedCards AS RC,
GameData.AwayRedCards AS RCA
FROM TeamsDivSea
INNER JOIN GameData ON TeamsDivSea.TeamID = GameData.HomeTeamID
WHERE (
(
(GameData.GameID) IN (
SELECT TOP 3 GameID
FROM GameData AS Dupe
WHERE Dupe.HomeTeamID = GameData.HomeTeamID
ORDER BY Dupe.GameDate DESC,
Dupe.GameID DESC
)
)
)
UNION ALL SELECT GameData.GameID,
GameData.DivisionID,
GameData.SeasonID,
GameData.AwayTeamID AS TeamID,
GameData.GameDate,
IIf([FullTimeResult] = 'A', 1, 0),
IIf([FullTimeResult] = 'D', 1, 0),
IIf([FullTimeResult] = 'H', 1, 0),
IIf([HalfTimeResult] = 'A', 1, 0),
IIf([HalfTimeResult] = 'D', 1, 0),
IIf([HalfTimeResult] = 'H', 1, 0),
GameData.FullTimeAwayGoals,
GameData.FullTimeHomeGoals,
IIf([FullTimeResult] = 'A', 3, IIf([FullTimeResult] = 'D', 1, 0)),
GameData.HalfTimeAwayGoals,
GameData.HalfTimeHomeGoals,
GameData.AwayShots,
GameData.HomeShots,
GameData.AwayShotsOnTarget,
GameData.HomeShotsOnTarget,
GameData.AwayFouls,
GameData.HomeFouls,
GameData.AwayCorners,
GameData.HomeCorners,
GameData.AwayYellowCards,
GameData.HomeYellowCards,
GameData.AwayRedCards,
GameData.HomeRedCards
FROM TeamsDivSea
INNER JOIN GameData ON TeamsDivSea.TeamID = GameData.AwayTeamID
WHERE (
(
(GameData.GameID) IN (
SELECT TOP 3 GameID
FROM GameData AS Dupe
WHERE Dupe.AwayTeamID = GameData.AwayTeamID
ORDER BY Dupe.GameDate DESC,
Dupe.GameID DESC
)
)
)
)

Related

How much unique data is there, put it all in a table

I would like to query in SQL how many unique values ​​there are and how many rows are there. In Python, I could do it like this. But how do I do this in SQL so that I get the result like at the bottom?
In Python I could do the following
d = {'sellerid': [1, 1, 1, 2, 2, 3, 3, 3], 'modelnumber': [85, 45, 85, 12 ,85, 74, 85, 12]
, 'modelgroup': [2, 3, 2, 1, 2, 3, 2, 1 ]}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
unique_sellerid = df['sellerid'].nunique()
print("unique_sellerid", unique_sellerid)
unique_modelnumber = df['modelnumber'].nunique()
print("unique_modelnumber", unique_modelnumber)
unique_modelgroup = df['modelgroup'].nunique()
print("unique_modelgroup", unique_modelgroup)
total_rows = df.shape[0]
print("total_rows", total_rows)
[OUT]
unique_sellerid 3
unique_modelnumber 4
unique_modelgroup 3
total_rows 8
I want a query like
Here is the dummy table
CREATE TABLE cars (
sellerid INT NOT NULL,
modelnumber INT NOT NULL,
modelgroup INT,
);
INSERT INTO cars
(sellerid , modelnumber, modelgroup )
VALUES
(1, 85, 2),
(1, 45, 3),
(1, 85, 2),
(2, 12, 1),
(2, 85, 2),
(3, 74, 3),
(3, 85, 2),
(3, 12, 1);
You could use the count(distinct column) aggregation function like :
select
count(distinct col1) as nunique_col1,
count(distinct col2) as nunique_col2,
count(1) as nb_rows
from database
Also in pandas, you can also apply the nunique() function on the dataset, rather than doing it on each column: df.nunique()

Invalid identifier error when using RIGHT JOIN inside FROM clause

I want to use within a FROM a subset of 2 tables using RIGHT JOIN (I want from that subset all the rows of ITV2_VEHICULOS whose ID is not in ITV2_HIST_VEHICULOS) so that the SELECT "takes" the data from there and with the WHERE it can filter
My query:
SELECT
*
FROM
ITV2_INSPECCIONES I,
ITV2_HORAS_INSPECCION HI_FIN,
ITV2_INSPECCIONES I_SIG,
ITV2_HORAS_INSPECCION HI_SIG_INI,
ITV2_HIST_VEHICULOS VH,
ITV2_CATEGORIAS_VEHICULO CAT,
ITV2_CLASIF_VEH_CONS CVC,
ITV2_CLASIF_VEH_USO CVU,
(
SELECT
*
FROM
ITV2_HIST_VEHICULOS VH
RIGHT JOIN ITV2_VEHICULOS V ON
VH.C_VEHICULO_ID = V.C_VEHICULO_ID
) VI
WHERE
I.C_TIPO_INSPECCION = 1
AND I.F_DESFAVORABLE IS NOT NULL
AND I.C_RESULTADO IN(
3,
4
)
AND I.C_VEHICULO_ID = VI.C_VEHICULO_ID
AND VI.C_CATEGORIA_ID = CAT.C_CATEGORIA_ID
AND VI.C_CLASIF_VEH_CONS_ID = CVC.C_CLASIF_VEH_CONS_ID
AND VI.C_CLASIF_VEH_USO_ID = CVU.C_CLASIF_VEH_USO_ID -- HORAS
AND I.C_ESTACION_ID = HI_FIN.C_ESTACION_ID
AND I.C_INSPECCION_ID = HI_FIN.C_INSPECCION_ID
AND I.N_ANNO = HI_FIN.N_ANNO
AND HI_FIN.C_TIPO_HORA_ID = 6 -- INSPECCION SIGUIENTE
AND I.C_ESTACION_ID = I_SIG.C_ESTACION_ID_FASE_ANT
AND I.C_INSPECCION_ID = I_SIG.C_INSPECCION_ID_FASE_ANT
AND I.N_ANNO = I_SIG.N_ANNO_FASE_ANT --
AND I_SIG.N_ANNO IN(
2013,
2014,
2015,
2016,
2017,
2018
)
AND I_SIG.C_ESTACION_ID IN(
3,
21,
22,
26,
28,
32,
34,
37,
41,
47,
53,
59,
60
)
AND I_SIG.F_INSPECCION >= '01/09/2015'
AND I_SIG.F_INSPECCION <= '30/09/2018' --
AND I_SIG.F_DESFAVORABLE IS NULL
AND I_SIG.C_RESULTADO IN(
1,
2
) -- Y HORAS
AND I_SIG.C_ESTACION_ID = HI_SIG_INI.C_ESTACION_ID
AND I_SIG.C_INSPECCION_ID = HI_SIG_INI.C_INSPECCION_ID
AND I_SIG.N_ANNO = HI_SIG_INI.N_ANNO
AND HI_SIG_INI.C_TIPO_HORA_ID = 1
--GROUP BY...
I expect in the output:
C_ESTACION_ID(FROM I) |C_VEHICULO_ID(FROM(I) |C_TIPO_HORA_ID(FROM HI_FIN)|F_HORA (FROM I_FIN) |A_MATRICULA FROM (V) | F_CAMBIO FROM (VH -> IF subdata of V EXISTS)
---------------------|----------------------|---------------------------|--------------------|---------------------|---------------------------------------
This is what your query would look like if you use "explicit join syntax" instead of just some commas between table names:
SELECT *
FROM ITV2_INSPECCIONES I
INNER JOIN ITV2_HORAS_INSPECCION HI_FIN ON I.C_ESTACION_ID = HI_FIN.C_ESTACION_ID
AND I.C_INSPECCION_ID = HI_FIN.C_INSPECCION_ID
AND I.N_ANNO = HI_FIN.N_ANNO
INNER JOIN ITV2_INSPECCIONES I_SIG ON I.C_ESTACION_ID = I_SIG.C_ESTACION_ID_FASE_ANT
AND I.C_INSPECCION_ID = I_SIG.C_INSPECCION_ID_FASE_ANT
AND I.N_ANNO = I_SIG.N_ANNO_FASE_ANT
INNER JOIN ITV2_HORAS_INSPECCION HI_SIG_INI ON I_SIG.C_ESTACION_ID = HI_SIG_INI.C_ESTACION_ID
AND I_SIG.C_INSPECCION_ID = HI_SIG_INI.C_INSPECCION_ID
AND I_SIG.N_ANNO = HI_SIG_INI.N_ANNO
WHERE I.C_TIPO_INSPECCION = 1
AND I.F_DESFAVORABLE IS NOT NULL
AND I.C_RESULTADO IN (3, 4)
AND HI_FIN.C_TIPO_HORA_ID = 6 -- INSPECCION SIGUIENTE
AND HI_SIG_INI.C_TIPO_HORA_ID = 1
AND I_SIG.F_INSPECCION >= '01/09/2015'
AND I_SIG.F_INSPECCION <= '30/09/2018'
AND I_SIG.F_DESFAVORABLE IS NULL
AND I_SIG.N_ANNO IN (2013, 2014, 2015, 2016, 2017, 2018)
AND I_SIG.C_ESTACION_ID IN (3, 21, 22, 26, 28, 32, 34, 37, 41, 47, 53, 59, 60)
AND I_SIG.C_RESULTADO IN (1, 2) -- Y HORAS
Now I had to pull out several tables and the subquery from that because, frankly, they don't make much sense to me:
ITV2_HIST_VEHICULOS VH, << no join conditions to preceding tables
ITV2_CATEGORIAS_VEHICULO CAT, << no join conditions to preceding tables
ITV2_CLASIF_VEH_CONS CVC, << no join conditions to preceding tables
ITV2_CLASIF_VEH_USO CVU, << no join conditions to preceding tables
(
SELECT
*
FROM ITV2_VEHICULOS V
LEFT JOIN ITV2_HIST_VEHICULOS VH ON
VH.C_VEHICULO_ID = V.C_VEHICULO_ID
) VI
AND I.C_VEHICULO_ID = VI.C_VEHICULO_ID
AND VI.C_CATEGORIA_ID = CAT.C_CATEGORIA_ID
AND VI.C_CLASIF_VEH_CONS_ID = CVC.C_CLASIF_VEH_CONS_ID
AND VI.C_CLASIF_VEH_USO_ID = CVU.C_CLASIF_VEH_USO_ID

Pandas apply on index of dataframe with a mult-iindex

I have a dataframe with an index like
MultiIndex(levels=[['A', 'B', 'C', 'D', 'E', 'F', 'G'], [0, 1]],
labels=[[0, 0, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 6], [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['I1', 'I2'])
Now I would like to apply a function to index I1.
If it were a simple column I would do something like
df['I1'] = df['I1'].apply(lamdba x :...)
How can I apply this to an Index in a df with a multi-index?
I believe need rename:
df = df.rename(lambda x: 'a' + x, level=0)
Or Index.get_level_values for select level of MultiIndex, map and then create MultiIndex.from_arrays:
idx = df.index.get_level_values('I1').map(lambda x: 'a' + x)
df.index = pd.MultiIndex.from_arrays([idx, df.index.get_level_values('I2')])
because I get :
df.index = df.index.set_levels(idx, level='I1')
ValueError: Level values must be unique: ['aA', 'aA', 'aB', 'aB', 'aC', 'aD', 'aD', 'aE', 'aE', 'aF', 'aF', 'aG', 'aG'] on level 0
Sample:
mux = pd.MultiIndex(levels=[['A', 'B', 'C', 'D', 'E', 'F', 'G'], [0, 1]],
labels=[[0, 0, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 6], [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['I1', 'I2'])
df = pd.DataFrame([0] * 13, index=mux, columns=['a'])
df = df.rename(lambda x: 'a' + x, level=0)
print(df)
a
I1 I2
aA 0 0
1 0
aB 0 0
1 0
aC 1 0
aD 0 0
1 0
aE 0 0
1 0
aF 0 0
1 0
aG 0 0
1 0

Converting formula from Crystal Reports to SSRS

I'll try and keep this as short as possible but I'm trying to convert a formula cell from crystal report to SSRS.
Here is the query:
SELECT
(SELECT START_DATE
FROM APPS.GL_PERIODS
WHERE PERIOD_TYPE = 'Month'
AND TRUNC(SYSDATE-:Days) BETWEEN START_DATE AND END_DATE) STR_DATE,
(SELECT END_DATE
FROM APPS.GL_PERIODS
WHERE PERIOD_TYPE = 'Month'
AND TRUNC(SYSDATE-:Days) BETWEEN START_DATE AND END_DATE) END_DATE,
DECODE(RT.ORGANIZATION_ID, 104, 'LPD',RT.ORGANIZATION_ID) ORG,
SUBSTR(POV.VENDOR_NAME, 1, 24) VENDOR_NAME,
DECODE(SUBSTR(PHA.SEGMENT1, 2,1), 'E', 'EXPENSE', 'e', 'EXPENSE', 'P', 'PRODUCT', 'p', 'PRODUCT', ' OTHER') PO_TYPE,
DECODE(SIGN(TRUNC(RT.TRANSACTION_DATE) - TRUNC(NVL(PLL.PROMISED_DATE - 3, PLL.NEED_BY_DATE - 3))), -1, 'LATE', 'ON TIME') PERFORMANCE,
COUNT(*) LINE_COUNT
FROM
APPS.RCV_TRANSACTIONS RT,
APPS.PO_HEADERS_ALL PHA,
APPS.PO_LINES_ALL PLA,
APPS.PO_LINE_LOCATIONS_ALL PLL,
APPS.PO_VENDORS POV
WHERE
RT.ORGANIZATION_ID = 104
AND RT.TRANSACTION_DATE >= (SELECT START_DATE
FROM APPS.GL_PERIODS
WHERE PERIOD_TYPE = 'Month'
AND TRUNC(SYSDATE-:Days) BETWEEN START_DATE AND END_DATE)
AND RT.TRANSACTION_DATE < (SELECT END_DATE + 1
FROM APPS.GL_PERIODS
WHERE PERIOD_TYPE = 'Month'
AND TRUNC(SYSDATE-:Days) BETWEEN START_DATE AND END_DATE)
AND RT.TRANSACTION_TYPE = 'RECEIVE'
AND RT.PO_HEADER_ID = PLL.PO_HEADER_ID
AND RT.PO_LINE_LOCATION_ID = PLL.LINE_LOCATION_ID
AND RT.PO_LINE_ID = PLL.PO_LINE_ID
AND RT.ORGANIZATION_ID = PLL.SHIP_TO_ORGANIZATION_ID
AND PLA.PO_LINE_ID = PLL.PO_LINE_ID
AND PLA.PO_HEADER_ID = PLL.PO_HEADER_ID
AND PHA.PO_HEADER_ID = PLA.PO_HEADER_ID
AND PHA.VENDOR_ID = POV.VENDOR_ID
GROUP BY
DECODE(RT.ORGANIZATION_ID, 104, 'LPD', RT.ORGANIZATION_ID),
SUBSTR(POV.VENDOR_NAME, 1, 24),
DECODE(SUBSTR(PHA.SEGMENT1, 2, 1), 'E', 'EXPENSE', 'e', 'EXPENSE', 'P', 'PRODUCT', 'p', 'PRODUCT', ' OTHER'),
DECODE(SIGN(TRUNC(RT.TRANSACTION_DATE) - TRUNC(NVL(PLL.PROMISED_DATE - 3, PLL.NEED_BY_DATE - 3))), -1, 'LATE', 'ON TIME')
ORDER BY
ORG, VENDOR_NAME, PO_TYPE, PERFORMANCE
In crystal the formula is
SUM({query.LINE_COUNT},{query.PERFORMANCE}) % SUM({query.LINE_COUNT}, {query.PO_TYPE})
This cell basically is just calculating the percentage of on time deliveries and late ones.

SQL Server Fuzzy Search with Percentage of match

I am using SQL Server 2008 R2 SP1.
I have a table with about 36034 records of customers.
I am trying to implement Fuzy search on Customer Name field.
Here is Function for Fuzzy Search
ALTER FUNCTION [Party].[FuzySearch]
(
#Reference VARCHAR(200) ,
#Target VARCHAR(200)
)
RETURNS DECIMAL(5, 2)
WITH SCHEMABINDING
AS
BEGIN
DECLARE #score DECIMAL(5, 2)
SELECT #score = CASE WHEN #Reference = #Target
THEN CAST(100 AS NUMERIC(5, 2))
WHEN #Reference IS NULL
OR #Target IS NULL
THEN CAST(0 AS NUMERIC(5, 2))
ELSE ( SELECT [Score %] = CAST(SUM(LetterScore)
* 100.0 / MAX(WordLength
* WordLength) AS NUMERIC(5,
2))
FROM ( -- do
SELECT seq = t1.n ,
ref.Letter ,
v.WordLength ,
LetterScore = v.WordLength
- ISNULL(MIN(tgt.n),
v.WordLength)
FROM ( -- v
SELECT
Reference = LEFT(#Reference
+ REPLICATE('_',
WordLength),
WordLength) ,
Target = LEFT(#Target
+ REPLICATE('_',
WordLength),
WordLength) ,
WordLength = WordLength
FROM
( -- di
SELECT
WordLength = MAX(WordLength)
FROM
( VALUES
( DATALENGTH(#Reference)),
( DATALENGTH(#Target)) ) d ( WordLength )
) di
) v
CROSS APPLY ( -- t1
SELECT TOP ( WordLength )
n
FROM
( VALUES ( 1),
( 2), ( 3), ( 4),
( 5), ( 6), ( 7),
( 8), ( 9),
( 10), ( 11),
( 12), ( 13),
( 14), ( 15),
( 16), ( 17),
( 18), ( 19),
( 20), ( 21),
( 22), ( 23),
( 24), ( 25),
( 26), ( 27),
( 28), ( 29),
( 30), ( 31),
( 32), ( 33),
( 34), ( 35),
( 36), ( 37),
( 38), ( 39),
( 40), ( 41),
( 42), ( 43),
( 44), ( 45),
( 46), ( 47),
( 48), ( 49),
( 50), ( 51),
( 52), ( 53),
( 54), ( 55),
( 56), ( 57),
( 58), ( 59),
( 60), ( 61),
( 62), ( 63),
( 64), ( 65),
( 66), ( 67),
( 68), ( 69),
( 70), ( 71),
( 72), ( 73),
( 74), ( 75),
( 76), ( 77),
( 78), ( 79),
( 80), ( 81),
( 82), ( 83),
( 84), ( 85),
( 86), ( 87),
( 88), ( 89),
( 90), ( 91),
( 92), ( 93),
( 94), ( 95),
( 96), ( 97),
( 98), ( 99),
( 100), ( 101),
( 102), ( 103),
( 104), ( 105),
( 106), ( 107),
( 108), ( 109),
( 110), ( 111),
( 112), ( 113),
( 114), ( 115),
( 116), ( 117),
( 118), ( 119),
( 120), ( 121),
( 122), ( 123),
( 124), ( 125),
( 126), ( 127),
( 128), ( 129),
( 130), ( 131),
( 132), ( 133),
( 134), ( 135),
( 136), ( 137),
( 138), ( 139),
( 140), ( 141),
( 142), ( 143),
( 144), ( 145),
( 146), ( 147),
( 148), ( 149),
( 150), ( 151),
( 152), ( 153),
( 154), ( 155),
( 156), ( 157),
( 158), ( 159),
( 160), ( 161),
( 162), ( 163),
( 164), ( 165),
( 166), ( 167),
( 168), ( 169),
( 170), ( 171),
( 172), ( 173),
( 174), ( 175),
( 176), ( 177),
( 178), ( 179),
( 180), ( 181),
( 182), ( 183),
( 184), ( 185),
( 186), ( 187),
( 188), ( 189),
( 190), ( 191),
( 192), ( 193),
( 194), ( 195),
( 196), ( 197),
( 198), ( 199),
( 200)
) t2 ( n )
) t1
CROSS APPLY ( SELECT
Letter = SUBSTRING(Reference,
t1.n, 1)
) ref
OUTER APPLY ( -- tgt
SELECT TOP ( WordLength )
n = ABS(t1.n
- t2.n)
FROM
( VALUES ( 1),
( 2), ( 3), ( 4),
( 5), ( 6), ( 7),
( 8), ( 9),
( 10), ( 11),
( 12), ( 13),
( 14), ( 15),
( 16), ( 17),
( 18), ( 19),
( 20), ( 21),
( 22), ( 23),
( 24), ( 25),
( 26), ( 27),
( 28), ( 29),
( 30), ( 31),
( 32), ( 33),
( 34), ( 35),
( 36), ( 37),
( 38), ( 39),
( 40), ( 41),
( 42), ( 43),
( 44), ( 45),
( 46), ( 47),
( 48), ( 49),
( 50), ( 51),
( 52), ( 53),
( 54), ( 55),
( 56), ( 57),
( 58), ( 59),
( 60), ( 61),
( 62), ( 63),
( 64), ( 65),
( 66), ( 67),
( 68), ( 69),
( 70), ( 71),
( 72), ( 73),
( 74), ( 75),
( 76), ( 77),
( 78), ( 79),
( 80), ( 81),
( 82), ( 83),
( 84), ( 85),
( 86), ( 87),
( 88), ( 89),
( 90), ( 91),
( 92), ( 93),
( 94), ( 95),
( 96), ( 97),
( 98), ( 99),
( 100), ( 101),
( 102), ( 103),
( 104), ( 105),
( 106), ( 107),
( 108), ( 109),
( 110), ( 111),
( 112), ( 113),
( 114), ( 115),
( 116), ( 117),
( 118), ( 119),
( 120), ( 121),
( 122), ( 123),
( 124), ( 125),
( 126), ( 127),
( 128), ( 129),
( 130), ( 131),
( 132), ( 133),
( 134), ( 135),
( 136), ( 137),
( 138), ( 139),
( 140), ( 141),
( 142), ( 143),
( 144), ( 145),
( 146), ( 147),
( 148), ( 149),
( 150), ( 151),
( 152), ( 153),
( 154), ( 155),
( 156), ( 157),
( 158), ( 159),
( 160), ( 161),
( 162), ( 163),
( 164), ( 165),
( 166), ( 167),
( 168), ( 169),
( 170), ( 171),
( 172), ( 173),
( 174), ( 175),
( 176), ( 177),
( 178), ( 179),
( 180), ( 181),
( 182), ( 183),
( 184), ( 185),
( 186), ( 187),
( 188), ( 189),
( 190), ( 191),
( 192), ( 193),
( 194), ( 195),
( 196), ( 197),
( 198), ( 199),
( 200) ) t2 ( n )
WHERE
SUBSTRING(#Target,
t2.n, 1) = ref.Letter
) tgt
GROUP BY t1.n ,
ref.Letter ,
v.WordLength
) do
)
END
RETURN #score
END
Here is the query to call the function
select [Party].[FuzySearch]('First Name Middle Name Last Name', C.FirstName) from dbo.Customer C
This is taking about 2 minutes 22 seconds to give me the percentage of fuzzy match for all
How can I fix this to run in lessthan a second. Any suggestions on my function to make it more robust.
Expected ouput is 45.34, 40.00, 100.00, 23.00, 81.23.....
The best I have been able to do is simplify some of the query, and change it to a table valued function. Scalar functions are notoriously poor performers, and the benefit of an inline TVF is that the query definition is expanded out into the main query, much like a view.
This reduces the execution time significantly on the tests I have done.
ALTER FUNCTION dbo.FuzySearchTVF (#Reference VARCHAR(200), #Target VARCHAR(200))
RETURNS TABLE
AS
RETURN
( WITH N (n) AS
( SELECT TOP (ISNULL(CASE WHEN DATALENGTH(#Reference) > DATALENGTH(#Target)
THEN DATALENGTH(#Reference)
ELSE DATALENGTH(#Target)
END, 0))
ROW_NUMBER() OVER(ORDER BY n1.n)
FROM (VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) AS N1 (n)
CROSS JOIN (VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) AS N2 (n)
CROSS JOIN (VALUES (1), (1)) AS N3 (n)
WHERE #Reference IS NOT NULL AND #Target IS NOT NULL
), Src AS
( SELECT Reference = CASE WHEN DATALENGTH(#Reference) > DATALENGTH(#Target) THEN #Reference
ELSE #Reference + REPLICATE('_', DATALENGTH(#Target) - DATALENGTH(#Reference))
END,
Target = CASE WHEN DATALENGTH(#Target) > DATALENGTH(#Reference) THEN #Target
ELSE #Target + REPLICATE('_', DATALENGTH(#Target) - DATALENGTH(#Reference))
END,
WordLength = CASE WHEN DATALENGTH(#Reference) > DATALENGTH(#Target) THEN DATALENGTH(#Reference) ELSE DATALENGTH(#Target) END
WHERE #Reference IS NOT NULL
AND #Target IS NOT NULL
AND #Reference != #Target
), Scores AS
( SELECT seq = t1.n ,
Letter = SUBSTRING(s.Reference, t1.n, 1),
s.WordLength ,
LetterScore = s.WordLength - ISNULL(MIN(ABS(t1.n - t2.n)), s.WordLength)
FROM Src AS s
CROSS JOIN N AS t1
INNER JOIN N AS t2
ON SUBSTRING(#Target, t2.n, 1) = SUBSTRING(s.Reference, t1.n, 1)
WHERE #Reference IS NOT NULL
AND #Target IS NOT NULL
AND #Reference != #Target
GROUP BY t1.n, SUBSTRING(s.Reference, t1.n, 1), s.WordLength
)
SELECT [Score] = 100
WHERE #Reference = #Target
UNION ALL
SELECT 0
WHERE #Reference IS NULL OR #Target IS NULL
UNION ALL
SELECT CAST(SUM(LetterScore) * 100.0 / MAX(WordLength * WordLength) AS NUMERIC(5, 2))
FROM Scores
WHERE #Reference IS NOT NULL
AND #Target IS NOT NULL
AND #Reference != #Target
GROUP BY WordLength
);
And this would be called as:
SELECT f.Score
FROM dbo.Customer AS c
CROSS APPLY [dbo].[FuzySearch]('First Name Middle Name Last Name', c.FirstName) AS f
It is still a fairly complex function though, and, depending on the number of records in your customer table, I think getting it down to 1 second is going to be a bit of a challenge.
This is how I could accomplish this:
Explained further # SQL Server Fuzzy Search - Levenshtein Algorithm
Create below file using any editor of your choice:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class StoredFunctions
{
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
public static SqlDouble Levenshtein(SqlString stringOne, SqlString stringTwo)
{
#region Handle for Null value
if (stringOne.IsNull)
stringOne = new SqlString("");
if (stringTwo.IsNull)
stringTwo = new SqlString("");
#endregion
#region Convert to Uppercase
string strOneUppercase = stringOne.Value.ToUpper();
string strTwoUppercase = stringTwo.Value.ToUpper();
#endregion
#region Quick Check and quick match score
int strOneLength = strOneUppercase.Length;
int strTwoLength = strTwoUppercase.Length;
int[,] dimention = new int[strOneLength + 1, strTwoLength + 1];
int matchCost = 0;
if (strOneLength + strTwoLength == 0)
{
return 100;
}
else if (strOneLength == 0)
{
return 0;
}
else if (strTwoLength == 0)
{
return 0;
}
#endregion
#region Levenshtein Formula
for (int i = 0; i <= strOneLength; i++)
dimention[i, 0] = i;
for (int j = 0; j <= strTwoLength; j++)
dimention[0, j] = j;
for (int i = 1; i <= strOneLength; i++)
{
for (int j = 1; j <= strTwoLength; j++)
{
if (strOneUppercase[i - 1] == strTwoUppercase[j - 1])
matchCost = 0;
else
matchCost = 1;
dimention[i, j] = System.Math.Min(System.Math.Min(dimention[i - 1, j] + 1, dimention[i, j - 1] + 1), dimention[i - 1, j - 1] + matchCost);
}
}
#endregion
// Calculate Percentage of match
double percentage = System.Math.Round((1.0 - ((double)dimention[strOneLength, strTwoLength] / (double)System.Math.Max(strOneLength, strTwoLength))) * 100.0, 2);
return percentage;
}
};
Name it levenshtein.cs
Go to Command Prompt. Go to the file directory of levenshtein.cs then call csc.exe /t: library /out: UserFunctions.dll levenshtein.cs you may have to give the full path of csc.exe from NETFrameWork 2.0.
Once your DLL is ready. Add it to the assemblies Database>>Programmability>>Assemblies>> New Assembly.
Create function in your database:
CREATE FUNCTION dbo.LevenshteinSVF
(
#S1 NVARCHAR(200) ,
#S2 NVARCHAR(200)
)
RETURNS FLOAT
AS EXTERNAL NAME
UserFunctions.StoredFunctions.Levenshtein
GO
In my case I had to enable clr:
sp_configure 'clr enabled', 1
GO
reconfigure
GO
Test the function:
SELECT dbo.LevenshteinSVF('James','James Bond')
Result: 50 % match