We are trying to use full-text search on an Azure database and have run into performance problems with CONTAINS.
Our data has a star schema. The fact table has a clustered columnstore index and around 40 million rows. Below is how we use CONTAINS on a dimension table and aggregate the fact table in different queries:
Query 1 using EXISTS:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE EXISTS (
SELECT * FROM [SPENDBY].[DimCompanyCode] d
WHERE f.[FK_DimCompanyCodeId] = d.Id
AND CONTAINS(d.*, 'Comcast'))
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
This query seems to run forever and never returns a result.
There is a non-clustered index on the foreign key [FK_DimCompanyCodeId], and only one row is returned when searching for Comcast:
SELECT id FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.*, 'Comcast');
-- will return id = 5
Around 27 million rows of the fact table have FK_DimCompanyCodeId = 5.
Query 2 using INNER JOIN:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
INNER JOIN [SPENDBY].[DimCompanyCode] d ON (f.[FK_DimCompanyCodeId] = d.Id)
WHERE CONTAINS(d.*, 'Comcast')
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
This query also seems to run forever and never returns a result.
Query 3 using #temp table:
SELECT id INTO #temp FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.*, 'Comcast');
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE EXISTS (
SELECT * FROM #temp
WHERE f.[FK_DimCompanyCodeId] = #temp.Id)
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
This is very fast and returns the result after 5 seconds.
Why is full-text search so slow in cases 1 and 2?
The problem is competing indexes -- one for the JOIN and one for the filter. Perhaps a subquery would convince SQL Server to use the text index first:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f JOIN
(SELECT id
FROM [SPENDBY].[DimCompanyCode] cc
WHERE CONTAINS(cc.*, 'Comcast')
) cc
ON cc.id = f.FK_DimCompanyCodeId
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
It would probably also help if you have an index on FactInvoiceDetail(FK_DimCompanyCodeId).
Eventually, I figured out that CONTAINS works well on a specific column (Description, for example):
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE f.[FK_DimCompanyCodeId] IN (
SELECT d.Id FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.[Description], 'Comcast')
)
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
To search the whole table, CONTAINSTABLE gives the best performance and avoids the #temp table:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
LEFT OUTER JOIN CONTAINSTABLE([SPENDBY].[DimCompanyCode], *, '"Comcast"') ct
ON f.[FK_DimCompanyCodeId] = ct.[Key]
WHERE ct.[Key] IS NOT NULL
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
Related
I have a problem with this query. It seems to hang: after 15 minutes it still has not finished.
But if I remove one of the left joins, it works.
What am I doing wrong?
Select distinct a.sito,
Count(distinct a.id_us) as us,
Count (distinct b.id_invmat) as materiali,
Count (distinct c.id_struttura) as Struttura,
Count(distinct d.id_tafonomia) as tafonomia
From us_table as a
Left join invetario_materiali as b on a.sito=b.sito
Left join struttura_table as c on a.sito=c.sito
Left join tafonomia_table as d on a.sito=d.sito
Group by a.sito
Order by us
thanks
E
This is a case where correlated subqueries might be the simplest approach:
select s.sito,
(select count(*) from invetario_materiali m where s.sito = m.sito) as materiali,
(select count(*) from struttura_table st where s.sito = st.sito) as Struttura,
(select count(*) from tafonomia_table t where s.sito = t.sito) as tafonomia
from (select sito, count(*) as us
from us_table
group by sito
) s
order by us;
This should be much, much faster than your version for two reasons. First, it avoids the outer aggregation. Second, it avoids the Cartesian products among the tables.
You can make this even faster by creating indexes on each of the secondary tables on sito.
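To make the Cartesian-product point concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names follow the question, but the data (one sito with 10, 20, 30 and 40 rows per table) is made up for illustration, and SQLite stands in for whatever database the question actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE us_table (id_us INTEGER PRIMARY KEY, sito TEXT);
    CREATE TABLE invetario_materiali (id_invmat INTEGER PRIMARY KEY, sito TEXT);
    CREATE TABLE struttura_table (id_struttura INTEGER PRIMARY KEY, sito TEXT);
    CREATE TABLE tafonomia_table (id_tafonomia INTEGER PRIMARY KEY, sito TEXT);
""")

# Hypothetical data: a single sito "A" with a different row count per table.
for table, n in [("us_table", 10), ("invetario_materiali", 20),
                 ("struttura_table", 30), ("tafonomia_table", 40)]:
    conn.executemany(f"INSERT INTO {table} (sito) VALUES (?)", [("A",)] * n)

# The original shape: every table joined on sito before aggregating.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM us_table a
    LEFT JOIN invetario_materiali b ON a.sito = b.sito
    LEFT JOIN struttura_table c ON a.sito = c.sito
    LEFT JOIN tafonomia_table d ON a.sito = d.sito
""").fetchone()[0]

# The correlated-subquery shape: each table is counted on its own.
counts = conn.execute("""
    SELECT s.sito, s.us,
           (SELECT COUNT(*) FROM invetario_materiali m WHERE m.sito = s.sito),
           (SELECT COUNT(*) FROM struttura_table st WHERE st.sito = s.sito),
           (SELECT COUNT(*) FROM tafonomia_table t WHERE t.sito = s.sito)
    FROM (SELECT sito, COUNT(*) AS us FROM us_table GROUP BY sito) s
""").fetchall()

print(joined_rows)  # 10 * 20 * 30 * 40 intermediate rows for one sito
print(counts)
```

With these numbers the join has to build 240,000 intermediate rows for a single sito before COUNT(DISTINCT ...) collapses them again, while the subquery version only reads 100 rows in total.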
Assuming that id_us, id_invmat, id_struttura and id_tafonomia are all PRIMARY KEY CLUSTERED, you should add indexes on the join columns:
CREATE INDEX IX_SITO ON us_table ( sito ASC) ;
CREATE INDEX IX_SITO ON invetario_materiali ( sito ASC) ;
CREATE INDEX IX_SITO ON struttura_table ( sito ASC) ;
CREATE INDEX IX_SITO ON tafonomia_table ( sito ASC) ;
Then you can reduce the complexity this way:
with
_us_table as (
select sito, count(distinct a.id_us) us
from us_table a
group by sito
),
_invetario_materiali as (
select sito, count(distinct b.id_invmat) materiali
from invetario_materiali b
group by sito
),
_struttura_table as (
select sito, count(distinct c.id_struttura) Struttura
from struttura_table c
group by sito
),
_tafonomia_table as (
select sito, count(distinct d.id_tafonomia) tafonomia
from tafonomia_table d
group by sito
)
Select a.sito, a.us, b.materiali, c.Struttura, d.tafonomia
From _us_table as a
Left join _invetario_materiali as b on a.sito=b.sito
Left join _struttura_table as c on a.sito=c.sito
Left join _tafonomia_table as d on a.sito=d.sito
Order by a.us;
This should be much faster.
Unfortunately COUNT(DISTINCT ...) is difficult to improve upon using an index. However, we can at least try adding indices which cover all the joins in your query:
CREATE INDEX inv_mat_idx ON invetario_materiali (sito, id_invmat);
CREATE INDEX strut_tbl_idx ON struttura_table (sito, id_struttura);
CREATE INDEX taf_tbl_idx ON tafonomia_table (sito, id_tafonomia);
Note that the above indices would only help the joins, and would not affect the aggregation step by sito and the distinct counts per group. As @jarlh has noted in the comments, SELECT DISTINCT is superfluous, since you are using GROUP BY, so just do a plain SELECT.
The table holds more than 7 billion rows.
I want to display the maximum affiliation EntryDate for each participant, and I want to include the null values, so I used a left join, but the query takes several minutes. It does give me the expected results.
Does anyone have a better idea or another solution to fix the performance?
Select ParticipantID,MaxDate
From dbo.Participant Par
LEFT JOIN dbo.Affiliation Aff
ON AFF.ParticipantID=Par.ParticipantID
LEFT JOIN (
SELECT AFF.AffiliationID,
MAX(EntryDate) as MaxDate
FROM dbo.Affiliation
GROUP BY AFF.AffiliationID
)AS AFF1
ON AFF1.AffiliationID = AFF.AffiliationID
AND AFF1.MaxDate = AFF.EntryDate
I think the first join is unnecessary
SELECT ParticipantID, MaxDate
FROM dbo.Participant Par
OUTER APPLY (
SELECT MAX(EntryDate) as MaxDate
FROM dbo.Affiliation Aff
WHERE Aff.ParticipantID = Par.ParticipantID
) A
You also need an index on Affiliation:
CREATE INDEX IX_Affiliation_ParticipantID_EntryDate ON dbo.Affiliation(ParticipantID, EntryDate)
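As a rough illustration of the same idea, the sketch below uses Python's built-in sqlite3 module. SQLite has no OUTER APPLY, so a correlated scalar subquery is swapped in; for a single MAX per participant it returns the same rows. The sample rows are invented, and the dbo schema prefix is dropped because SQLite does not use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Participant (ParticipantID INTEGER PRIMARY KEY);
    CREATE TABLE Affiliation (
        AffiliationID INTEGER PRIMARY KEY,
        ParticipantID INTEGER,
        EntryDate TEXT
    );
    -- Same column order as the index suggested above.
    CREATE INDEX IX_Affiliation_ParticipantID_EntryDate
        ON Affiliation (ParticipantID, EntryDate);

    -- Hypothetical data: participant 3 has no affiliation at all.
    INSERT INTO Participant VALUES (1), (2), (3);
    INSERT INTO Affiliation (ParticipantID, EntryDate) VALUES
        (1, '2020-01-01'), (1, '2021-06-15'), (2, '2019-03-10');
""")

# One output row per participant; MaxDate is NULL when nothing matches.
rows = conn.execute("""
    SELECT Par.ParticipantID,
           (SELECT MAX(Aff.EntryDate)
            FROM Affiliation Aff
            WHERE Aff.ParticipantID = Par.ParticipantID) AS MaxDate
    FROM Participant Par
    ORDER BY Par.ParticipantID
""").fetchall()

for row in rows:
    print(row)
```

Participant 3 still appears, with None as its MaxDate, which is the behaviour the left joins in the question were meant to provide.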
@llyas There could be a lot more to consider if you turn on the XML showplan or include the actual execution plan and check the subtree cost.
Anyway, you can also write this query using the ROW_NUMBER function:
WITH par
AS (
SELECT Par.ParticipantID
,Aff.EntryDate AS MaxDate
,ROW_NUMBER() OVER (
PARTITION BY Par.ParticipantID ORDER BY Aff.EntryDate DESC
) rn
FROM dbo.Participant Par
LEFT JOIN dbo.Affiliation Aff
ON Aff.ParticipantID = Par.ParticipantID
)
SELECT ParticipantID
,MaxDate
FROM par
WHERE rn = 1
Hi, I have two tables, A and B. A has 6 rows and B has 7 rows. Both tables have common values in the name column, and all 6 rows of table A are present in table B on the name column.
When I run select * from a,b where a.name = b.name I get 14 rows. I was expecting an inner join with 6 rows in the result.
Please explain how a query works when there are two tables in the FROM clause.
Table A
Table B
The query is
select * from a,b where a.tt = b.tt and a.nename=b.nename;
The result is
You've got duplicates in both tables (except for {2, 2017-03-04 03:00:00} which has three copies) which is why you get 14 = (2 * 4) + (2 * 3).
It's very hard to make sense of duplicate data. It's even harder when it is duplicated on both sides of a join.
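The arithmetic can be reproduced with a small sketch in Python's built-in sqlite3 module. The question's actual rows are not shown, so the key values below (x, y, z with a single tt value) are placeholders chosen only to match the described shape: every key appears twice in A, and twice in B except for one key that appears three times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (nename TEXT, tt TEXT);
    CREATE TABLE b (nename TEXT, tt TEXT);
""")

# Placeholder data: 6 rows in A, 7 rows in B.
a_rows = [("x", "t1")] * 2 + [("y", "t1")] * 2 + [("z", "t1")] * 2
b_rows = [("x", "t1")] * 2 + [("y", "t1")] * 2 + [("z", "t1")] * 3
conn.executemany("INSERT INTO a VALUES (?, ?)", a_rows)
conn.executemany("INSERT INTO b VALUES (?, ?)", b_rows)

# The comma join from the question: every matching pair becomes a row.
total = conn.execute("""
    SELECT COUNT(*)
    FROM a, b
    WHERE a.tt = b.tt AND a.nename = b.nename
""").fetchone()[0]

# Per key, the row count is (copies in A) * (copies in B).
per_key = conn.execute("""
    SELECT a.nename, COUNT(*)
    FROM a JOIN b ON a.tt = b.tt AND a.nename = b.nename
    GROUP BY a.nename
    ORDER BY a.nename
""").fetchall()

print(total)    # 2*2 + 2*2 + 2*3
print(per_key)
```

Each key contributes the product of its copies on both sides, so the two keys with 2 x 2 matches give 8 rows and the key with 2 x 3 matches gives 6, for 14 in total.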
You could do something like
With fixedA AS (SELECT
*,
row_number() over (partition by nename, tt order by nename) rn
FROM
A),
fixedb AS (SELECT
*,
row_number() over (partition by nename, tt order by nename) rn
FROM
B)
SELECT *
FROM fixedA a full outer join fixedb b
on a.neName = b.neName
and a.tt = b.tt
and a.rn = b.rn
This will, however, leave one B record with a NULL A record.
The row_number also seems to do what cellID does, so you could just do
SELECT *
FROM a full outer join b
on a.neName = b.neName
and a.tt = b.tt
and a.cellID = b.cellID
You should use a full outer join on the table you need the result set from. I would suggest something like this:
select * from a full outer join b on a.tt = b.tt and a.nename=b.nename;
If you are dealing with a bigger data set, joining on a data type like varchar might take a long time to load the result set because of the comparisons. So it would be better to join on foreign keys or primary keys.
https://www.w3schools.com/sql/sql_join_full.asp
Table a.
Table b.
I have two tables. Table A has over 8000 records and continues to grow with time.
Table B has only 5 or so records and grows rarely but does grow sometimes.
I want to query Table A's last records where the Id for Table A matches Table B. The problem is that I am getting all the rows from Table A. I just need the ones where Table A and B match once. These are unique Ids that are created when a new row is inserted into Table B and are never repeated.
Any help is most appreciated.
SELECT a.nshift,
a.loeeworkcellid,
b.loeeconfigworkcellid,
b.loeescheduleid,
b.sdescription,
b.sshortname
FROM oeeworkcell a
INNER JOIN dbo.oeeconfigworkcell b
ON a.loeeconfigworkcellid = b.loeeconfigworkcellid
ORDER BY a.loeeworkcellid DESC
I am assuming you want only the latest row (as you said) from Table A, but the JOIN is giving you all the rows. You can use ROW_NUMBER() to number the rows, then apply the join and filter with the WHERE clause to keep only the first row per group. So you can try the following:
;WITH CTE
AS
(
SELECT * , ROW_NUMBER() OVER(PARTITION BY loeeconfigworkcellid ORDER BY loeeworkcellid desc) AS Rn
FROM oeeworkcell
)
SELECT a.nshift,
a.loeeworkcellid,
b.loeeconfigworkcellid,
b.loeescheduleid,
b.sdescription,
b.sshortname
FROM CTE a
INNER JOIN dbo.oeeconfigworkcell b
ON a.loeeconfigworkcellid = b.loeeconfigworkcellid
WHERE
a.Rn = 1
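A minimal sketch of this pattern with Python's built-in sqlite3 module is shown below. It assumes an SQLite build with window-function support (3.25 or later); the table and column names come from the question, while the sample rows and the reduced column list are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oeeconfigworkcell (
        loeeconfigworkcellid INTEGER PRIMARY KEY,
        sshortname TEXT
    );
    CREATE TABLE oeeworkcell (
        loeeworkcellid INTEGER PRIMARY KEY,
        loeeconfigworkcellid INTEGER,
        nshift INTEGER
    );

    -- Hypothetical data: two configured work cells, five work cell rows.
    INSERT INTO oeeconfigworkcell VALUES (1, 'cell-1'), (2, 'cell-2');
    INSERT INTO oeeworkcell VALUES
        (10, 1, 1), (11, 1, 2), (12, 2, 1), (13, 1, 3), (14, 2, 2);
""")

# Number the rows per config id, newest first, then keep only row 1.
latest = conn.execute("""
    WITH cte AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY loeeconfigworkcellid
                   ORDER BY loeeworkcellid DESC
               ) AS rn
        FROM oeeworkcell
    )
    SELECT a.loeeworkcellid, a.nshift, b.sshortname
    FROM cte a
    INNER JOIN oeeconfigworkcell b
        ON a.loeeconfigworkcellid = b.loeeconfigworkcellid
    WHERE a.rn = 1
    ORDER BY b.loeeconfigworkcellid
""").fetchall()

print(latest)
```

Only the row with the highest loeeworkcellid per config id survives the rn = 1 filter, so the result has one row per row in the small table.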
You need to group by your data and select only the data having the condition with min id.
SELECT a.nshift,
a.loeeworkcellid,
b.loeeconfigworkcellid,
b.loeescheduleid,
b.sdescription,
b.sshortname
FROM oeeworkcell a
INNER JOIN dbo.oeeconfigworkcell b
ON a.loeeconfigworkcellid = b.loeeconfigworkcellid
group by
a.nshift,
a.loeeworkcellid,
b.loeeconfigworkcellid,
b.loeescheduleid,
b.sdescription,
b.sshortname
having a.loeeworkcellid = min(a.loeeworkcellid)
SQL Server 2005 has the TOP keyword. How do I select the top 1 row in MySQL if I have a join on multiple tables and want to retrieve the extreme value for each ID/column? LIMIT restricts the total number of rows returned, so it can't solve my problem.
SELECT v.*
FROM document d
OUTER APPLY
(
SELECT TOP 1 *
FROM version v
WHERE v.document = d.id
ORDER BY
v.revision DESC
) v
or
SELECT v.*
FROM document d
LEFT JOIN
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY document ORDER BY revision DESC) AS rn
FROM version
) v
ON v.document = d.id
AND v.rn = 1
The latter is more efficient if your documents usually have few revisions and you need to select all or almost all documents; the former is more efficient if the documents have many revisions or you need to select just a small subset of documents.
Update:
Sorry, didn't notice the question is about MySQL.
In MySQL, you do it this way:
SELECT *
FROM document d
LEFT JOIN
version v
ON v.id =
(
SELECT id
FROM version vi
WHERE vi.document = d.id
ORDER BY
vi.document DESC, vi.revision DESC, vi.id DESC
LIMIT 1
)
Create a composite index on version (document, revision, id) for this to work fast.
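The sketch below tries the same join shape with Python's built-in sqlite3 module, since SQLite accepts the same LIMIT 1 subquery syntax. It is not a MySQL benchmark: the sample rows, the title column and the index name are made up, and it assumes version.document refers to document.id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE version (
        id INTEGER PRIMARY KEY,
        document INTEGER,   -- assumed to reference document.id
        revision INTEGER
    );
    -- Composite index matching the ORDER BY of the subquery.
    CREATE INDEX ix_version_document_revision_id
        ON version (document, revision, id);

    -- Hypothetical data: document 3 has no versions.
    INSERT INTO document VALUES (1, 'spec'), (2, 'notes'), (3, 'empty');
    INSERT INTO version VALUES
        (100, 1, 1), (101, 1, 2), (102, 2, 1), (103, 1, 3);
""")

# For each document, join only the version picked by the LIMIT 1 subquery.
rows = conn.execute("""
    SELECT d.id, v.id, v.revision
    FROM document d
    LEFT JOIN version v
        ON v.id = (
            SELECT vi.id
            FROM version vi
            WHERE vi.document = d.id
            ORDER BY vi.document DESC, vi.revision DESC, vi.id DESC
            LIMIT 1
        )
    ORDER BY d.id
""").fetchall()

print(rows)
```

Each document gets at most one joined version row, the one with the highest revision, and a document without versions is kept with NULLs because of the LEFT JOIN.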
If I understand you correctly, TOP doesn't solve your problem either. TOP is exactly equivalent to LIMIT. What you are looking for is aggregate functions, like max() or min(), if you want the extremes. For example:
select a.link_id, max(column_a), min(column_b) from table_a a, table_b b
where a.link_id = b.link_id group by a.link_id