Result from CTE query not sorted by Level, why? - sql

Update:
I asked the question because the order of the result does not look like the order in which the query is executed, which is explained here:
http://msdn.microsoft.com/en-us/library/ms186243(v=sql.105).aspx
I have a follow-up question.
If I do not need the duplicate parents or the Level, is it possible to return each parent only once using the CTE itself (not the SELECT after it), or with some other SQL?
===========================
Desired result is like:
......
54 4
**** the above is anchor member, the numbers are correct
4 1
1 0
2 1
36 35
35 8
8 1
54 12
12 1
11 1
3 1
===========================
I am using a recursive query to find all the parents in a Hierarchy table for items from an Items table. I expected the result to be ordered by Level, but it is not. I know I can sort it with ORDER BY; I just thought the output itself would be ordered by Level, because the recursive member is run level by level, right?
WITH Result(ItemID, ParentID, Level)
AS
(
    --get the anchor member from tbItems
    SELECT itemID, itemParentID, 0 AS Level
    FROM tbItems
    WHERE approved = 0
    UNION ALL
    --recursive member from tbHierarchy
    SELECT h.hierarchyItemID, h.parentItemID, Level + 1
    FROM tbHierarchy AS h
    INNER JOIN Result AS r
        ON h.hierarchyItemID = r.ParentID
)
SELECT *
FROM Result
The result is:
ItemID ParentID Level
----------- ----------- -----------
7 3 0
11 2 0
18 11 0
19 11 0
21 54 0
31 2 0
33 36 0
34 36 0
35 36 0
36 36 0
38 2 0
39 2 0
40 2 0
54 4 0
**** the above is anchor member, the numbers are correct
4 1 1
1 0 2
2 1 1
1 0 2
2 1 1
1 0 2
2 1 1
1 0 2
36 35 1
35 8 2
8 1 3
1 0 4
36 35 1
35 8 2
8 1 3
1 0 4
36 35 1
35 8 2
8 1 3
1 0 4
36 35 1
35 8 2
8 1 3
1 0 4
2 1 1
1 0 2
54 12 1
12 1 2
1 0 3
11 1 1
1 0 2
11 1 1
1 0 2
2 1 1
1 0 2
3 1 1
1 0 2

Data in a database is not ordered. If you do not add an ORDER BY, you cannot be sure of the order of the output.
Never assume the database executes a query the way you picture it; it usually does not. The optimizer's job is to understand your query and find the fastest way to produce the result, and that is often not the way you would expect.
Depending on your editor, you can sometimes ask it to show the execution plan for the query, which may help explain how the result was obtained.
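To make both points concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for SQL Server. The table and column names come from the question, but the sample rows are made up, and SQLite needs the RECURSIVE keyword that T-SQL does not use. It shows an explicit ORDER BY on Level, and one way to list each parent only once by de-duplicating in the outer SELECT (as far as I know, SQL Server does not allow DISTINCT inside the recursive member itself).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbItems (itemID INTEGER, itemParentID INTEGER, approved INTEGER);
    CREATE TABLE tbHierarchy (hierarchyItemID INTEGER, parentItemID INTEGER);
    -- Hypothetical sample rows, loosely modelled on the output in the question
    INSERT INTO tbItems VALUES (7, 3, 0), (11, 2, 0), (54, 4, 0);
    INSERT INTO tbHierarchy VALUES (3, 1), (2, 1), (4, 1), (1, 0);
""")

# Same shape as the CTE in the question (SQLite spells it WITH RECURSIVE).
CTE = """
WITH RECURSIVE Result(ItemID, ParentID, Level) AS (
    SELECT itemID, itemParentID, 0 FROM tbItems WHERE approved = 0
    UNION ALL
    SELECT h.hierarchyItemID, h.parentItemID, r.Level + 1
    FROM tbHierarchy AS h
    JOIN Result AS r ON h.hierarchyItemID = r.ParentID
)
"""

# Only an explicit ORDER BY guarantees that rows come back level by level.
by_level = conn.execute(
    CTE + "SELECT ItemID, ParentID, Level FROM Result ORDER BY Level, ItemID"
).fetchall()

# Each parent row once, ignoring Level: de-duplicate in the outer SELECT.
parents_once = conn.execute(
    CTE + "SELECT DISTINCT ItemID, ParentID FROM Result WHERE Level > 0 ORDER BY ItemID"
).fetchall()

print(by_level)
print(parents_once)
```

The anchor rows are excluded from the second query with Level > 0, so it returns only the ancestors.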

Related

SQL Does it make sense to re-order database instead of using ORDER BY to increase performance?

I have a database with around 120,000 entries and I need to do substring comparisons (where ... like 'test%') for an autocomplete function. The database won't change.
I have a column called "relevance", and I want my search results to be ordered by relevance DESC. I noticed that as soon as I add "ORDER BY relevance DESC" to my queries, the execution time increases by about 100%. Since my queries already take around 100ms on average, this causes significant lag.
Does it make sense to re-order the whole database by relevance once so I can remove the ORDER BY? Can I be certain that when searching through the table with SQL it will always go through the database in the order that I added the rows?
This is what my query looks like right now:
select *
from hao2_dict
where definitions like 'ba%'
or searchable_pinyin like 'ba%'
ORDER BY relevance DESC
LIMIT 100
UPDATE: For context, here is my DB structure:
And some time measurements:
Using an Index (relevance DESC) for the search term 'b%' gives me 50ms, which is faster than not using an Index. But the search term 'banana%' takes over 1700ms which is way slower than not using an Index. These are the results from 'explain':
b%:
0 Init 0 27 0 0
1 Noop 1 11 0 0
2 Integer 100 1 0 0
3 OpenRead 0 5 0 9 0
4 OpenRead 2 4223 0 k(2,-,) 0
5 Rewind 2 26 2 0 0
6 DeferredSeek 2 0 0 0
7 Column 0 6 4 0
8 Function 1 3 2 like(2) 0
9 If 2 13 0 0
10 Column 0 4 6 0
11 Function 1 5 2 like(2) 0
12 IfNot 2 25 1 0
13 IdxRowid 2 7 0 0
14 Column 0 1 8 0
15 Column 0 2 9 0
16 Column 0 3 10 0
17 Column 0 4 11 0
18 Column 0 5 12 0
19 Column 0 6 13 0
20 Column 0 7 14 0
21 Column 2 0 15 0
22 RealAffinity 15 0 0 0
23 ResultRow 7 9 0 0
24 DecrJumpZero 1 26 0 0
25 Next 2 6 0 1
26 Halt 0 0 0 0
27 Transaction 0 0 10 0 1
28 String8 0 3 0 b% 0
29 String8 0 5 0 b% 0
30 Goto 0 1 0 0
banana%:
0 Init 0 27 0 0
1 Noop 1 11 0 0
2 Integer 100 1 0 0
3 OpenRead 0 5 0 9 0
4 OpenRead 2 4223 0 k(2,-,) 0
5 Rewind 2 26 2 0 0
6 DeferredSeek 2 0 0 0
7 Column 0 6 4 0
8 Function 1 3 2 like(2) 0
9 If 2 13 0 0
10 Column 0 4 6 0
11 Function 1 5 2 like(2) 0
12 IfNot 2 25 1 0
13 IdxRowid 2 7 0 0
14 Column 0 1 8 0
15 Column 0 2 9 0
16 Column 0 3 10 0
17 Column 0 4 11 0
18 Column 0 5 12 0
19 Column 0 6 13 0
20 Column 0 7 14 0
21 Column 2 0 15 0
22 RealAffinity 15 0 0 0
23 ResultRow 7 9 0 0
24 DecrJumpZero 1 26 0 0
25 Next 2 6 0 1
26 Halt 0 0 0 0
27 Transaction 0 0 10 0 1
28 String8 0 3 0 banana% 0
29 String8 0 5 0 banana% 0
30 Goto 0 1 0 0
"Can I be certain that when searching through the table with SQL it will always go through the database in the order that I added the rows?"
No. SQL results have no inherent order. They might come out in the order you inserted them, but there is no guarantee.
Instead, put an index on the column. Indexes keep their values in order.
However, this only takes care of the sorting. The query above still has to search the whole table for rows with a matching definitions or searchable_pinyin value. In general, SQL will use only one index per table at a time; trying to use two is usually inefficient. So you need one multi-column index for this query to avoid searching the whole table and still get the results in sorted order. Make sure relevance comes first: the index columns need to be in the same order as your ORDER BY.
An index on (relevance, definitions, searchable_pinyin) will let that query use only the index for searching and sorting. Adding (relevance, searchable_pinyin) as well will handle searching by definitions, searchable_pinyin, or both.
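The EXPLAIN output in the question looks like SQLite bytecode, so the following sketch uses Python's sqlite3 module to show how one might check what the planner does with an index on relevance. The schema is an assumption (only the three columns used by the query, with made-up rows), and idx_relevance is a hypothetical index name. It does not claim the index will be faster on the real data; it only shows that ORDER BY decides the order and how to inspect the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema: the real hao2_dict table has more columns.
conn.execute("""
    CREATE TABLE hao2_dict (
        id INTEGER PRIMARY KEY,
        definitions TEXT,
        searchable_pinyin TEXT,
        relevance REAL
    )
""")
# Made-up rows, inserted in an order unrelated to relevance
rows = [
    ("bank", "yinhang", 0.2),
    ("eight", "ba", 0.9),
    ("banana", "xiangjiao", 0.5),
    ("dad", "baba", 0.7),
    ("tree", "shu", 0.8),
]
conn.executemany(
    "INSERT INTO hao2_dict (definitions, searchable_pinyin, relevance) VALUES (?, ?, ?)",
    rows,
)
conn.execute("CREATE INDEX idx_relevance ON hao2_dict (relevance DESC)")

QUERY = """
    SELECT definitions, relevance FROM hao2_dict
    WHERE definitions LIKE 'ba%' OR searchable_pinyin LIKE 'ba%'
    ORDER BY relevance DESC
    LIMIT 100
"""
result = conn.execute(QUERY).fetchall()
print(result)

# The plan shows whether SQLite walks the index or sorts in a temporary b-tree.
for step in conn.execute("EXPLAIN QUERY PLAN " + QUERY):
    print(step)
```

Whatever the plan says, the rows come back by relevance only because of the ORDER BY, not because of the insertion order.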

Remove Elements from Dataframe Based on Group Appearance Rate

I have a simple dataframe that is basically a list of objects with their own list of items (see below). What is the cleanest method of filtering out all rows in the overall dataframe based on their rate of occurrence within each group? For example, I want to remove all rows that appear in groups at least 75% of the time. In this example table, I would expect all rows with '30' in column 2 to be deleted, because it appears in 3 out of the 4 groups. Is this a use case for a lambda filter? If so, what would the filter be?
Col1 Col2
0 3
0 7
0 15
0 30
1 5
1 6
1 11
1 30
2 1
2 9
2 17
2 29
3 2
3 14
3 18
3 30
Try:
# Fraction of groups (distinct Col1 values) in which each Col2 value appears
rate = df.groupby('Col2')['Col1'].nunique() / df['Col1'].nunique()
# Col2 values that appear in fewer than 75% of the groups
keep = rate[rate < 0.75].index
print(df[df['Col2'].isin(keep)])
Output:
Col1 Col2
0 0 3
1 0 7
2 0 15
4 1 5
5 1 6
6 1 11
8 2 1
9 2 9
10 2 17
11 2 29
12 3 2
13 3 14
14 3 18

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
    return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Because r is either a list or an element?
You can use Series.rank() with df.groupby() (the ranks come back as floats):
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
Another option is cumcount() on the frame sorted by price:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Output:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2
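The two answers above agree on the sample data, but they can differ when prices tie within a search id. Here is a small sketch of the tie-handling options of rank(), assuming pandas is installed. The data is the question's frame with one price changed (25 becomes 20) to create a tie, and the rank_* column names are made up for illustration.

```python
import pandas as pd

# Data from the question, with one price changed to create a tie in srch_id 1
df = pd.DataFrame({
    "srch_id": [1, 1, 1, 3, 3, 3],
    "price":   [30, 20, 20, 15, 102, 39],
})

grouped = df.groupby("srch_id")["price"]
df["rank_average"] = grouped.rank()                          # default: ties share the mean rank
df["rank_min"] = grouped.rank(method="min").astype(int)      # ties share the lowest rank
df["rank_dense"] = grouped.rank(method="dense").astype(int)  # like min, but leaves no gaps
df["rank_first"] = grouped.rank(method="first").astype(int)  # ties broken by row order
print(df)
```

Which method fits depends on whether two equally priced results should share a position; method="first" gives each row a distinct position within its group.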

In SQL, how to select minimum value of a column and group by other columns?

I have a lookup table below:
id ref order
1 6 0
2 6 0
3 7 0
5 34 0
6 33 0
6 255 1
9 12 0
9 80 1
12 7 0
12 76 1
13 10 0
15 12 0
16 6 0
16 7 1
17 6 1
17 63 0
18 7 0
19 7 1
19 75 0
20 6 0
20 63 1
So in the lookup table (tab_lkp), it has column [id] (the IDs of entities), [ref] (the reference id that points to other entities in another table) and [order] (tells the order of reference, smaller order means higher priority).
My expectation is that, for each of the IDs, only one ref with the smallest order is selected. My code is (by following Phil's answer):
select id
, ref
, min_order = min(order)
from [dbo].[tab_lkp]
group by id, ref
order by id, ref
But the code doesn't work for me, the results still contains multiple records for each of the IDs:
id ref order
1 6 0
2 6 0
3 7 0
5 34 0
6 33 0
6 255 1
9 12 0
9 80 1
12 7 0
12 76 1
13 10 0
15 12 0
16 6 0
16 7 1
17 6 1
17 63 0
18 7 0
19 7 1
19 75 0
20 6 0
20 63 1
Could you please let me know what is wrong with my code? And how should I achieve my goal?
From an ANSI SQL approach (order is a reserved word, so it has to be quoted):
select x2.id, x2.ref, x2."order"
from tab_lkp x2
inner join
(
    select id, min("order") as min_order
    from tab_lkp
    group by id
) x1
    on x1.id = x2.id
    and x1.min_order = x2."order"
You would normally do this using row_number(), ordering by [order] within each id so that the smallest order gets seqnum 1:
select t.*
from (select t.*, row_number() over (partition by id order by [order]) as seqnum
      from [dbo].[tab_lkp] t
     ) t
where seqnum = 1;
or by using a subquery that does exactly what you state that you want,
"for each of the IDs, only one ref with the smallest order is selected"
Select * from tab_lkp t
Where [order] =
    (Select Min([order]) from tab_lkp
     where Id = t.Id)
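To see why the query in the question keeps several rows per id, and that the join-on-minimum approach does not, here is a runnable sketch using Python's sqlite3 module in place of SQL Server. It loads a handful of rows taken from the question's table and runs both queries; the variable names are only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "order" is a reserved word, so it is quoted wherever it is used as a column name.
conn.execute('CREATE TABLE tab_lkp (id INTEGER, ref INTEGER, "order" INTEGER)')
conn.executemany(
    "INSERT INTO tab_lkp VALUES (?, ?, ?)",
    [(1, 6, 0), (6, 33, 0), (6, 255, 1), (17, 6, 1), (17, 63, 0)],  # subset of the question's rows
)

# What the question ran: every (id, ref) pair is its own group, so nothing collapses.
grouped_by_both = conn.execute(
    'SELECT id, ref, MIN("order") FROM tab_lkp GROUP BY id, ref ORDER BY id, ref'
).fetchall()

# Find the smallest order per id first, then join back to pick up the matching ref.
one_per_id = conn.execute("""
    SELECT t.id, t.ref, t."order"
    FROM tab_lkp AS t
    JOIN (SELECT id, MIN("order") AS min_order FROM tab_lkp GROUP BY id) AS m
      ON m.id = t.id AND m.min_order = t."order"
    ORDER BY t.id
""").fetchall()

print(grouped_by_both)  # still five rows
print(one_per_id)       # one row per id
```

If two refs of the same id share the smallest order, the join returns both of them; the row_number() variant returns exactly one.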

Display Rows only if group of rows' sum is greater than 0

I have a table like the one below. I would like to bring this data into SSRS (grouped by LineID and Product, with Hour as the column) and show only those LineID and Product combinations that have HourCount > 0 in at least one row.
LineID Product Hour HourCount
3 A 0 0
3 A 1 0
3 A 2 0
3 A 3 0
3 A 4 0
3 A 5 0
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
4 B 0 0
4 B 1 0
4 B 2 0
4 B 3 0
4 B 4 0
4 B 5 0
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
Basically I would like this table to look like this before it's in SSRS:
LineID Product Hour HourCount
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
So display a Product for a line only if any of its Hours has an HourCount higher than 0.
Is there a query that could give me these results, or should I play with the display settings in SSRS?
Something like this should work:
with NonZero as
(
    select *
        -- total HourCount for the whole LineID/Product group
        , GroupTotal = sum(HourCount) over (partition by LineID, Product)
    from HourTable
)
select LineID
    , Product
    , [Hour]
    , HourCount
from NonZero
where GroupTotal > 0
You could certainly do something similar in SSRS, but it's much easier and more intuitive to apply at the T-SQL level.
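As a quick check of the windowed-sum idea, here is a sketch that runs the same kind of query through Python's sqlite3 module. It assumes an SQLite build with window-function support (3.25 or later), uses AS aliases instead of the T-SQL "name = expression" form, and loads only a few rows from the question's table; Totals and GroupTotal are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HourTable (LineID INTEGER, Product TEXT, Hour INTEGER, HourCount INTEGER)"
)
conn.executemany(
    "INSERT INTO HourTable VALUES (?, ?, ?, ?)",
    [
        (3, "A", 0, 0), (3, "A", 1, 0),    # all zero: the group should disappear
        (3, "B", 0, 65), (3, "B", 1, 56),  # all non-zero: kept
        (5, "B", 0, 0), (5, "B", 2, 45),   # mixed: kept, including the zero row
    ],
)

rows = conn.execute("""
    WITH Totals AS (
        SELECT *, SUM(HourCount) OVER (PARTITION BY LineID, Product) AS GroupTotal
        FROM HourTable
    )
    SELECT LineID, Product, Hour, HourCount
    FROM Totals
    WHERE GroupTotal > 0
    ORDER BY LineID, Product, Hour
""").fetchall()
print(rows)
```

The zero row of group (5, B) survives because the filter looks at the group total, not at the individual HourCount.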
I think you are looking for
SELECT t.LineID, t.Product, t.Hour, t.HourCount
FROM abc AS t
JOIN (SELECT LineID, Product FROM abc
      GROUP BY LineID, Product HAVING SUM(HourCount) > 0) AS g
  ON g.LineID = t.LineID AND g.Product = t.Product