SQL: Does it make sense to re-order the database instead of using ORDER BY to increase performance?

I have a database with around 120,000 entries and I need to do substring comparisons (where ... like 'test%') for an autocomplete function. The database won't change.
I have a column called "relevance", and I want my search results to be ordered by relevance DESC. I noticed that as soon as I add "ORDER BY relevance DESC" to my queries, the execution time increases by about 100%. Since my queries already take around 100ms on average, this causes significant lag.
Does it make sense to re-order the whole database by relevance once so I can remove the ORDER BY? Can I be certain, that when searching through the table with SQL it will always go through the database in the order that I added the rows?
This is what my query looks like right now:
select *
from hao2_dict
where definitions like 'ba%'
or searchable_pinyin like 'ba%'
ORDER BY relevance DESC
LIMIT 100
UPDATE: For context, here is my DB structure:
And some time measurements:
Using an Index (relevance DESC) for the search term 'b%' gives me 50ms, which is faster than not using an Index. But the search term 'banana%' takes over 1700ms which is way slower than not using an Index. These are the results from 'explain':
b%:
0 Init 0 27 0 0
1 Noop 1 11 0 0
2 Integer 100 1 0 0
3 OpenRead 0 5 0 9 0
4 OpenRead 2 4223 0 k(2,-,) 0
5 Rewind 2 26 2 0 0
6 DeferredSeek 2 0 0 0
7 Column 0 6 4 0
8 Function 1 3 2 like(2) 0
9 If 2 13 0 0
10 Column 0 4 6 0
11 Function 1 5 2 like(2) 0
12 IfNot 2 25 1 0
13 IdxRowid 2 7 0 0
14 Column 0 1 8 0
15 Column 0 2 9 0
16 Column 0 3 10 0
17 Column 0 4 11 0
18 Column 0 5 12 0
19 Column 0 6 13 0
20 Column 0 7 14 0
21 Column 2 0 15 0
22 RealAffinity 15 0 0 0
23 ResultRow 7 9 0 0
24 DecrJumpZero 1 26 0 0
25 Next 2 6 0 1
26 Halt 0 0 0 0
27 Transaction 0 0 10 0 1
28 String8 0 3 0 b% 0
29 String8 0 5 0 b% 0
30 Goto 0 1 0 0
banana%:
0 Init 0 27 0 0
1 Noop 1 11 0 0
2 Integer 100 1 0 0
3 OpenRead 0 5 0 9 0
4 OpenRead 2 4223 0 k(2,-,) 0
5 Rewind 2 26 2 0 0
6 DeferredSeek 2 0 0 0
7 Column 0 6 4 0
8 Function 1 3 2 like(2) 0
9 If 2 13 0 0
10 Column 0 4 6 0
11 Function 1 5 2 like(2) 0
12 IfNot 2 25 1 0
13 IdxRowid 2 7 0 0
14 Column 0 1 8 0
15 Column 0 2 9 0
16 Column 0 3 10 0
17 Column 0 4 11 0
18 Column 0 5 12 0
19 Column 0 6 13 0
20 Column 0 7 14 0
21 Column 2 0 15 0
22 RealAffinity 15 0 0 0
23 ResultRow 7 9 0 0
24 DecrJumpZero 1 26 0 0
25 Next 2 6 0 1
26 Halt 0 0 0 0
27 Transaction 0 0 10 0 1
28 String8 0 3 0 banana% 0
29 String8 0 5 0 banana% 0
30 Goto 0 1 0 0

Can I be certain, that when searching through the table with SQL it will always go through the database in the order that I added the rows?
No. SQL results have no inherent order. They might come out in the order you inserted them, but there is no guarantee.
Instead, put an index on the column. Indexes keep their values in order.
However, this only takes care of the sorting. The query above still has to go through the whole table to find rows with a matching definitions or searchable_pinyin value. In general, a database will use only one index per table at a time; trying to use two is usually inefficient. So you need one multi-column index so that this query does not have to search the whole table and still gets its results in sorted order. Make sure relevance comes first: the index columns need to be in the same order as your ORDER BY.
An index on (relevance, definitions, searchable_pinyin) should let that query use only the index for searching and sorting. Adding (relevance, searchable_pinyin) as well will handle searching by definitions, searchable_pinyin, or both.
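To check what an index on relevance changes, you can compare the query plans before and after creating it. Below is a minimal sketch using Python's built-in sqlite3 module. The table layout, the sample rows and the index name idx_relevance are assumptions made up for illustration, not the asker's real schema, and the exact plan text depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema; the real hao2_dict table has more columns.
conn.execute(
    "CREATE TABLE hao2_dict ("
    " id INTEGER PRIMARY KEY,"
    " definitions TEXT,"
    " searchable_pinyin TEXT,"
    " relevance REAL)"
)
# Made-up rows, inserted in an order unrelated to relevance.
rows = [
    ("banana", "xiangjiao", 0.2),
    ("eight", "ba", 0.9),
    ("father", "baba", 0.7),
    ("bar", "jiuba", 0.5),
    ("tree", "shu", 0.8),
]
conn.executemany(
    "INSERT INTO hao2_dict (definitions, searchable_pinyin, relevance)"
    " VALUES (?, ?, ?)",
    rows,
)

QUERY = (
    "SELECT definitions, relevance FROM hao2_dict"
    " WHERE definitions LIKE 'ba%' OR searchable_pinyin LIKE 'ba%'"
    " ORDER BY relevance DESC LIMIT 100"
)


def query_plan(connection, sql):
    """Return the human-readable steps reported by EXPLAIN QUERY PLAN."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index, SQLite has to sort the matches in a temporary B-tree.
plan_without_index = query_plan(conn, QUERY)

# With an index on relevance, the planner may walk the index in order instead.
conn.execute("CREATE INDEX idx_relevance ON hao2_dict (relevance DESC)")
plan_with_index = query_plan(conn, QUERY)

results = conn.execute(QUERY).fetchall()

print(plan_without_index)
print(plan_with_index)
print(results)
```

The ORDER BY stays in the query in both cases. The index only changes how the database produces the sorted result; it is never a reason to drop the clause.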

Related

Select only rows whose columns do not have specific corresponding values

I want to select only the rows whose columns do not have specific corresponding values.
Table Values:
1 D675F009-6908-47A4-816A-AD25A68D8514 0
2 7C96A948-B889-4630-BF67-2187ECFA37DC 1
3 FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E 1
4 178B055F-45FF-4951-A9E2-3470B1DE25E9 1
5 FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E 0
6 D675F009-6908-47A4-816A-AD25A68D8514 0
7 59737584-F44F-4B42-AF9C-1550DFEC1EA5 1
8 FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E 1
9 D675F009-6908-47A4-816A-AD25A68D8514 1
10 7C96A948-B889-4630-BF67-2187ECFA37DC 0
11 178B055F-45FF-4951-A9E2-3470B1DE25E9 1
12 016FAF52-8FBF-4C9C-802D-CA9E13071719 0
Do not select rows where:
(D675F009-6908-47A4-816A-AD25A68D8514) has the value 1, or
(FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E) has the value 1
Rows that may be selected:
(D675F009-6908-47A4-816A-AD25A68D8514) with the value 0, and
(FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E) with the value 0
Expected result:
1 D675F009-6908-47A4-816A-AD25A68D8514 0
2 7C96A948-B889-4630-BF67-2187ECFA37DC 1
4 178B055F-45FF-4951-A9E2-3470B1DE25E9 1
5 FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E 0
6 D675F009-6908-47A4-816A-AD25A68D8514 0
7 59737584-F44F-4B42-AF9C-1550DFEC1EA5 1
10 7C96A948-B889-4630-BF67-2187ECFA37DC 0
11 178B055F-45FF-4951-A9E2-3470B1DE25E9 1
12 016FAF52-8FBF-4C9C-802D-CA9E13071719 0
Is this what you want?
SELECT *
FROM table
WHERE (is_active = 1
       AND Participant_id NOT IN
           ('D675F009-6908-47A4-816A-AD25A68D8514', 'FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E'))
   OR is_active = 0;
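As a quick check, the filter can be run against the sample rows from the question. This is only a sketch with Python's built-in sqlite3 module. The table name participants and the id column name are assumptions, since the question shows neither; Participant_id and is_active are taken from the answer.

```python
import sqlite3

D675 = "D675F009-6908-47A4-816A-AD25A68D8514"
FD6D = "FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E"
C7C9 = "7C96A948-B889-4630-BF67-2187ECFA37DC"
B178 = "178B055F-45FF-4951-A9E2-3470B1DE25E9"
A597 = "59737584-F44F-4B42-AF9C-1550DFEC1EA5"
F016 = "016FAF52-8FBF-4C9C-802D-CA9E13071719"

conn = sqlite3.connect(":memory:")
# Hypothetical table and id column names; the data is the question's sample.
conn.execute(
    "CREATE TABLE participants (id INTEGER, Participant_id TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO participants VALUES (?, ?, ?)",
    [
        (1, D675, 0), (2, C7C9, 1), (3, FD6D, 1), (4, B178, 1),
        (5, FD6D, 0), (6, D675, 0), (7, A597, 1), (8, FD6D, 1),
        (9, D675, 1), (10, C7C9, 0), (11, B178, 1), (12, F016, 0),
    ],
)

# Keep every inactive row, and every active row except the two listed IDs.
selected_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM participants"
        " WHERE (is_active = 1 AND Participant_id NOT IN (?, ?))"
        " OR is_active = 0"
        " ORDER BY id",
        (D675, FD6D),
    )
]
print(selected_ids)
```

Rows 3, 8 and 9 are the only ones dropped, which matches the expected result in the question.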

pandas aggregate based on continuous same rows

Suppose I have this data frame and I want to sum the values in column 'a' over each run of consecutive rows that share the same label.
a label
0 1 0
1 3 0
2 5 0
3 2 1
4 2 1
5 2 1
6 3 0
7 3 0
8 4 1
The desired result will be:
a label
0 9 0
1 6 1
2 6 0
3 4 1
and not this:
a label
0 15 0
1 10 1
IIUC
s=df.groupby(df.label.diff().ne(0).cumsum()).agg({'a':'sum','label':'first'})
s
Out[280]:
a label
label
1 9 0
2 6 1
3 6 0
4 4 1
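The key part of that one-liner is df.label.diff().ne(0).cumsum(), which gives every run of consecutive equal labels its own group number. Here is a plain-Python sketch of the same idea, without pandas, using the sample data from the question. The variable names are made up for illustration.

```python
a = [1, 3, 5, 2, 2, 2, 3, 3, 4]
label = [0, 0, 0, 1, 1, 1, 0, 0, 1]

# Start a new run whenever the label differs from the previous row.
# This mirrors diff().ne(0).cumsum(): the first row always opens run 1.
run_ids = []
current_run = 0
previous = None
for value in label:
    if value != previous:
        current_run += 1
    run_ids.append(current_run)
    previous = value

# Sum 'a' per run and remember the label of the run's first row.
totals = {}
for run, amount, lab in zip(run_ids, a, label):
    running_sum, first_label = totals.get(run, (0, lab))
    totals[run] = (running_sum + amount, first_label)

print(run_ids)
print(totals)
```

Grouping by the run number instead of by the label itself is what keeps the two separate blocks of label 0 from being merged.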

How to populate columns depending on a found value?

I have a pandas DataFrame with customer IDs and one column per month (1, 2, 3, ...).
I also have a column with the number of months since the last purchase.
I am using the following to populate the relevant month columns:
dt.loc[dt.month == 1, '1'] = 1
dt.loc[dt.month == 2, '2'] = 1
dt.loc[dt.month == 3, '3'] = 1
etc,
How can I populate the columns in a better way to avoid creating 12 statements?
pd.get_dummies
pd.get_dummies(dt.month)
Consider the dataframe dt
dt = pd.DataFrame(dict(
month=np.random.randint(1, 13, (10)),
a=range(10)
))
a month
0 0 8
1 1 3
2 2 8
3 3 11
4 4 3
5 5 4
6 6 1
7 7 5
8 8 3
9 9 11
Add columns like this
dt.join(pd.get_dummies(dt.month))
a month 1 3 4 5 8 11
0 0 8 0 0 0 0 1 0
1 1 3 0 1 0 0 0 0
2 2 8 0 0 0 0 1 0
3 3 11 0 0 0 0 0 1
4 4 3 0 1 0 0 0 0
5 5 4 0 0 1 0 0 0
6 6 1 1 0 0 0 0 0
7 7 5 0 0 0 1 0 0
8 8 3 0 1 0 0 0 0
9 9 11 0 0 0 0 0 1
If you wanted the column names to be strings
dt.join(pd.get_dummies(dt.month).rename(columns='month {}'.format))
a month month 1 month 3 month 4 month 5 month 8 month 11
0 0 8 0 0 0 0 1 0
1 1 3 0 1 0 0 0 0
2 2 8 0 0 0 0 1 0
3 3 11 0 0 0 0 0 1
4 4 3 0 1 0 0 0 0
5 5 4 0 0 1 0 0 0
6 6 1 1 0 0 0 0 0
7 7 5 0 0 0 1 0 0
8 8 3 0 1 0 0 0 0
9 9 11 0 0 0 0 0 1

SQL Insert distinct records

I'm trying to batch some insertion scripts (avoiding duplicates) and I've come across some tables that have no primary key (I know...I didn't create them and I cannot modify them). Basically what I've done is grabbed the rows I need, put them into a temporary table ([TempTable]), and updated some values in them.
Now I need to re-insert DISTINCT TOP values from [TempTable] into [OriginalTable] in batches. To do this, I imagine I would need a column in the temp table (which I've created...let's call it [ValuesInserted]) that specifies which rows were just inserted.
I would do an INSERT statement to put DISTINCT values into the original table, using TOP to batch it.
INSERT INTO [OriginalTable]
SELECT DISTINCT TOP (1000) *
FROM [TempTable]
Then I would UPDATE the temp table to have ValuesInserted set to 1 for the records that were just inserted. This is where I'm stuck:
UPDATE /*TOP (1000) - Doesn't work*/ [TempTable]
SET [ValuesInserted] = 1
???
Then I would DELETE those records from the temp table so that my next INSERT statement (using TOP) will not capture the previous set of records.
DELETE
FROM [TempTable]
WHERE [ValuesInserted] = 1
The main problem I'm having is that just running an UPDATE on just the TOP (1000) rows, doesn't capture all of the records that may have duplicates in [TempTable]. I also cannot perform an INNER JOIN on all columns on two copies of [TempTable] because this is being run on many different tables using dynamic SQL. Basically, the script needs to be generic (not specific to any table), but it should be assumed that there is no primary key.
The following generic sample captures the idea:
Val1 Val2 ValuesInserted
1 1 0
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
1 7 0
1 8 0
1 9 0
1 1 0 <--Duplicate
2 1 0
2 2 0
2 3 0
2 4 0
2 5 0
2 6 0
2 7 0
2 8 0
2 9 0
2 1 0 <--Duplicate
3 1 0
3 2 0
3 3 0
3 4 0
3 5 0
3 6 0
3 7 0
3 8 0
3 9 0
3 1 0 <--Duplicate
1 2 0 <--Duplicate
1 3 0 <--Duplicate
Doing an UPDATE TOP (5) on this above data set will only update the first 5 records:
Val1 Val2 ValuesInserted
1 1 1 <--Updated
1 2 1 <--Updated
1 3 1 <--Updated
1 4 1 <--Updated
1 5 1 <--Updated
1 6 0
1 7 0
1 8 0
1 9 0
1 1 0 <--Duplicate
2 1 0
2 2 0
2 3 0
2 4 0
2 5 0
2 6 0
2 7 0
2 8 0
2 9 0
2 1 0 <--Duplicate
3 1 0
3 2 0
3 3 0
3 4 0
3 5 0
3 6 0
3 7 0
3 8 0
3 9 0
3 1 0 <--Duplicate
1 2 0 <--Duplicate
1 3 0 <--Duplicate
I need to update any records that match the top 5 records like so:
Val1 Val2 ValuesInserted
1 1 1 <--Updated
1 2 1 <--Updated
1 3 1 <--Updated
1 4 1 <--Updated
1 5 1 <--Updated
1 6 0
1 7 0
1 8 0
1 9 0
1 1 1 <--Updated
2 1 0
2 2 0
2 3 0
2 4 0
2 5 0
2 6 0
2 7 0
2 8 0
2 9 0
2 1 0 <--Duplicate
3 1 0
3 2 0
3 3 0
3 4 0
3 5 0
3 6 0
3 7 0
3 8 0
3 9 0
3 1 0 <--Duplicate
1 2 1 <--Updated
1 3 1 <--Updated
If you can make your idea work on this sample, I can apply it to my specific case.
Am I approaching this completely wrong, or am I missing something? I'm looking for a solution that doesn't hog resources because the script is batched and is running on very large databases on high-impact servers.
The closest topic I could find on this was:
Using Distinct in SQL Update.
However, the answers given would not work when using TOP.
EDIT: This apparently wasn't clear at the beginning. The first thing I'm doing is grabbing rows from [OriginalTable] and putting them into [TempTable]. These rows are initially unique. However, I perform an update that modifies some of the values, yielding data like the sample above. From there, I need to grab DISTINCT rows and re-insert them into [OriginalTable].
It looks like you're really going out of your way to make this as complicated as possible. I would just remove the duplicates from the temporary table in the first place. Or never INSERT them there, which is even better. Or build an actual ETL solution, perhaps with SSIS.
Those things said, what you're looking for is the OUTPUT clause, which can be added to any INSERT, UPDATE, or DELETE statement:
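-- Table variable that collects the rows inserted in this batch.
-- The INT column types are assumed; match them to the real table.
DECLARE @inserted_ids TABLE (val1 INT, val2 INT);

INSERT INTO dbo.OriginalTable (val1, val2)
OUTPUT INSERTED.val1, INSERTED.val2 INTO @inserted_ids
SELECT DISTINCT TOP (1000) val1, val2
FROM dbo.TempTable;

-- Remove every copy of the inserted rows, duplicates included.
DELETE TT
FROM @inserted_ids II
INNER JOIN dbo.TempTable TT ON
    TT.val1 = II.val1 AND
    TT.val2 = II.val2;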
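The batching loop itself can be tried out on the sample data from the question. The sketch below uses Python's built-in sqlite3 module, so it is an analogue rather than the SQL Server solution: OUTPUT ... INTO is replaced by a hypothetical staging table named Batch, and TOP by LIMIT. The batch size of 5 mirrors the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OriginalTable (val1 INTEGER, val2 INTEGER);
    CREATE TABLE TempTable (val1 INTEGER, val2 INTEGER);
    -- Hypothetical staging table standing in for the OUTPUT target.
    CREATE TABLE Batch (val1 INTEGER, val2 INTEGER);
""")

# Sample data from the question: 27 distinct pairs plus 5 duplicates.
rows = [(v1, v2) for v1 in (1, 2, 3) for v2 in range(1, 10)]
rows += [(1, 1), (2, 1), (3, 1), (1, 2), (1, 3)]
conn.executemany("INSERT INTO TempTable VALUES (?, ?)", rows)

BATCH_SIZE = 5
batches = 0
while conn.execute("SELECT COUNT(*) FROM TempTable").fetchone()[0] > 0:
    # Stage one batch of distinct rows.
    conn.execute("DELETE FROM Batch")
    conn.execute(
        "INSERT INTO Batch SELECT DISTINCT val1, val2 FROM TempTable LIMIT ?",
        (BATCH_SIZE,),
    )
    # Copy the batch into the target table.
    conn.execute("INSERT INTO OriginalTable SELECT val1, val2 FROM Batch")
    # Delete every copy of the staged rows, duplicates included.
    conn.execute(
        "DELETE FROM TempTable WHERE EXISTS ("
        " SELECT 1 FROM Batch b"
        " WHERE b.val1 = TempTable.val1 AND b.val2 = TempTable.val2)"
    )
    batches += 1

inserted = conn.execute("SELECT COUNT(*) FROM OriginalTable").fetchone()[0]
distinct_inserted = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT val1, val2 FROM OriginalTable)"
).fetchone()[0]
print(batches, inserted, distinct_inserted)
```

Because the delete joins on the values rather than on a row count, no [ValuesInserted] flag column is needed.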

Result from CTE query not sorted by Level, why?

Update:
I asked the question because the result does not look like how it is executed... which is explained here:
http://msdn.microsoft.com/en-us/library/ms186243(v=sql.105).aspx
Well I have another question...
If I do not need the same parents and the Level, is it possible to return one parent only once using the CTE (not the select after it) or maybe some other sql?
===========================
Desired result is like:
......
54 4
**** the above is anchor member, the numbers are correct
4 1
1 0
2 1
36 35
35 8
8 1
54 12
12 1
11 1
3 1
===========================
I am using a recursive query to find all the parents in a Hierarchy table for items from the Items table. I thought the result would be ordered by Level, but it is not. I know I can sort it using ORDER BY; I just think the output itself should be ordered by Level because the recursive member is run level by level, right?
WITH Result(ItemID, ParentID, Level)
AS
(
--get the anchor member from tbItems
SELECT itemID, itemParentID, 0 AS Level
FROM tbItems WHERE
approved = 0
UNION ALL
--recursive member from tbHierarchy
SELECT h.hierarchyItemID, h.parentItemID, Level + 1
FROM tbHierarchy AS h
INNER JOIN
Result AS r
ON
h.hierarchyItemID = r.ParentID
)
SELECT *
FROM Result
the Result is:
ItemID ParentID Level
----------- ----------- -----------
7 3 0
11 2 0
18 11 0
19 11 0
21 54 0
31 2 0
33 36 0
34 36 0
35 36 0
36 36 0
38 2 0
39 2 0
40 2 0
54 4 0
**** the above is anchor member, the numbers are correct
4 1 1
1 0 2
2 1 1
1 0 2
2 1 1
1 0 2
2 1 1
1 0 2
36 35 1
35 8 2
8 1 3
1 0 4
36 35 1
35 8 2
8 1 3
1 0 4
36 35 1
35 8 2
8 1 3
1 0 4
36 35 1
35 8 2
8 1 3
1 0 4
2 1 1
1 0 2
54 12 1
12 1 2
1 0 3
11 1 1
1 0 2
11 1 1
1 0 2
2 1 1
1 0 2
3 1 1
1 0 2
The data in a database is not ordered. If you do not add an ORDER BY, you cannot be sure of the order of the output.
Never assume the database will execute a query the way you imagine; most of the time that assumption is wrong. The database's job is to understand your query and find the fastest way to produce the result, and that is often not the way you expect.
Depending on your editor, you can sometimes ask it to show the execution plan used for the query, which may explain how the result was obtained.
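If the rows have to come back level by level, the order needs to be requested explicitly. The sketch below runs a cut-down version of the query with Python's built-in sqlite3 module. The few rows are made up for illustration, and SQLite needs the RECURSIVE keyword, which the original SQL Server query does not use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbItems (itemID INTEGER, itemParentID INTEGER, approved INTEGER);
    CREATE TABLE tbHierarchy (hierarchyItemID INTEGER, parentItemID INTEGER);
    -- Made-up sample rows: two unapproved items and a small hierarchy.
    INSERT INTO tbItems VALUES (54, 4, 0), (11, 2, 0);
    INSERT INTO tbHierarchy VALUES (4, 1), (1, 0), (2, 1);
""")

rows = conn.execute("""
    WITH RECURSIVE Result(ItemID, ParentID, Level) AS (
        -- anchor member from tbItems
        SELECT itemID, itemParentID, 0
        FROM tbItems
        WHERE approved = 0
        UNION ALL
        -- recursive member from tbHierarchy
        SELECT h.hierarchyItemID, h.parentItemID, r.Level + 1
        FROM tbHierarchy AS h
        INNER JOIN Result AS r ON h.hierarchyItemID = r.ParentID
    )
    SELECT ItemID, ParentID, Level
    FROM Result
    ORDER BY Level, ItemID  -- the only thing that guarantees the order
""").fetchall()

for row in rows:
    print(row)
```

Item 1 shows up twice at level 2 because it is reached through two different branches, which is the same repetition seen in the question's output.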