MariaDB: Building the best INDEX for a given SELECT - GROUP BY - indexing

I do not have much database knowledge. To study, I have been reading MariaDB's indexing documentation, but there are parts I do not understand.
From the document:
Algorithm, step 2b (GROUP BY)¶
WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
I understand that the column order within the index matters, regardless of the order of the conditions in the WHERE clause. So the aaa and bbb columns from the WHERE clause come first in the index, and ccc is then sorted within each matched (aaa, bbb) pair.
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
Does "(no WHERE)" mean that there is no WHERE clause?
What if I use it like this?
WHERE x > 1 GROUP BY x, y
My understanding:
(1) FROM table
(2) WHERE x > 1 -> uses the index
(3) GROUP BY x, y -> uses the index..? Because (2) already sorted the rows..? Or does it sort again?
(4) HAVING -> if I did not write this keyword, is it simply not applied?
(5) SELECT -> returns the data(?)
(6) ORDER BY -> doesn't GROUP BY already imply an ORDER BY(?)

Algorithm, step 2b (GROUP BY)¶
WHERE aaa = 123 AND bbb = 1 GROUP BY ccc ⇒ INDEX(bbb, aaa, ccc) or INDEX(aaa, bbb, ccc) (='s first, in any order; then the GROUP BY)
Suppose there is a table like the one below:
aaa | bbb | ccc
----+-----+----
123 |   1 |  30
123 |   1 |  48
123 |   2 |  27
125 |   1 |  11
125 |   3 |  29
125 |   3 |  40
The result of the WHERE aaa = 123 AND bbb = 1 clause is:
aaa | bbb | ccc
----+-----+----
123 |   1 |  30
123 |   1 |  48
Look at the ccc column. Within the matched rows, ccc is stored in sorted order (it follows aaa and bbb in the index), so the GROUP BY can be evaluated quickly because the ccc values are already sorted.
**CAUTION**
Now think about WHERE aaa >= 123 AND bbb = 1 GROUP BY ccc.
aaa | bbb | ccc
----+-----+----
123 |   1 |  30
123 |   1 |  48
125 |   1 |  11
Here the ccc column is no longer sorted across all of the matched rows. The ordering of ccc is only guaranteed while aaa and bbb keep the same values.
GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
The same reasoning applies here.
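To make the first case above (the equality filters plus GROUP BY ccc) concrete, here is a minimal sketch; the table and index names are made up to match the example rows, so treat it as an illustration rather than a real schema:

-- Hypothetical table holding the example rows above.
CREATE TABLE t (
    aaa INT,
    bbb INT,
    ccc INT,
    INDEX idx_aaa_bbb_ccc (aaa, bbb, ccc)  -- ='s first (aaa, bbb in either order), then the GROUP BY column
);

-- The index lets the server jump straight to aaa = 123 AND bbb = 1; within that
-- range the ccc values are already stored in order, so the GROUP BY needs no
-- extra sort.
SELECT ccc, COUNT(*)
FROM t
WHERE aaa = 123 AND bbb = 1
GROUP BY ccc;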

GROUP BY x,y ⇒ INDEX(x,y) (no WHERE)
should probably say "(if there is no WHERE)". If there is a WHERE, then that index may or may not be useful. You should (usually) build the INDEX based on the WHERE, and only if you get past it, consider the GROUP BY.
WHERE x > 1 GROUP BY x, y
OK, that can use INDEX(x,y), in that order. First, it will filter, and that leaves the rest of the index still in a good order for the grouping. Similarly:
WHERE x > 1 ORDER BY x, y
WHERE x > 1 GROUP BY x, y ORDER BY x, y
No sorting should be necessary.
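As a quick way to check this yourself, something along these lines can be run through EXPLAIN (the table and index names here are made up); if the optimizer uses the index as described, the Extra column should not show "Using filesort" for the grouping:

-- Hypothetical table for the x/y examples.
CREATE TABLE t2 (
    x INT,
    y INT,
    INDEX idx_x_y (x, y)
);

-- The range filter on x reads one contiguous slice of idx_x_y, and the rows it
-- returns are still ordered by (x, y), so the GROUP BY can reuse that order.
EXPLAIN
SELECT x, y, COUNT(*)
FROM t2
WHERE x > 1
GROUP BY x, y;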
So, here are the steps I might take:
1. WHERE x > 1 ... --> INDEX(x) (or any index _starting_ with `x`)
2. ... GROUP BY x, y --> INDEX(x,y)
3. recheck that I did not mess up the WHERE.
This has no really good index:
WHERE x > 1 AND y = 4 GROUP BY x,y
1. WHERE x > 1 AND y = 4 ... --> INDEX(y,x) in this order!
2. ... GROUP BY x,y --> can use that index
However, flipping to GROUP BY y,x has the same effect (ignoring the order of display).
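For reference, the flipped form from step 2 looks like this, again using the made-up t2 table; this just restates the suggestion above as runnable statements:

-- Equality column first, then the range column, as suggested for this WHERE.
ALTER TABLE t2 ADD INDEX idx_y_x (y, x);

-- Flipped GROUP BY, matching idx_y_x's column order; since y is fixed to a
-- single value, the groups are the same as with GROUP BY x, y - only the
-- display order may differ.
SELECT x, y, COUNT(*)
FROM t2
WHERE x > 1 AND y = 4
GROUP BY y, x;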
(4) HAVING -> if I did not write this keyword, is it simply not applied?
HAVING, if present, is applied after the steps for which INDEXes are useful. If you do not write a HAVING clause, there simply is no HAVING step.
(6) ORDER BY -> doesn't GROUP BY already imply an ORDER BY(?)
That has become a tricky question. Until very recently (MySQL 8.0; don't know when or if MariaDB changed), GROUP BY implied the equivalent ORDER BY. That was non-standard and potentially interfered with optimization. With 8.0, GROUP BY does not imply any order; you must explicitly request the order (if you care).
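So if the grouped output has to come back in a particular order, spell it out. A minimal sketch with the same made-up table:

-- Do not rely on GROUP BY for ordering; request the order explicitly.
SELECT x, y, COUNT(*) AS cnt
FROM t2
WHERE x > 1
GROUP BY x, y
ORDER BY x, y;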
(I updated the source document in response to this discussion.)

Related

Remove group rows

How do I remove group rows based on other columns for a particular ID such that:
ID   Att  Comp Att.  Inc. Att
aaa   2       0          2
aaa   2       0          2
bbb   3       1          2
bbb   3       1          2
bbb   3       0          2
becomes:
ID   Att  Comp Att.  Inc. Att
aaa   2       0          2
bbb   3       1          2
I need to discard rows that are not just exact duplicates but that also imply the same data based on the other columns.
Use drop_duplicates -- check out the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
I can't tell for sure from your description which columns should count toward a duplicate, but you can tell drop_duplicates which column(s) to look at.

Aggregate with groupby but with distinct condition on the aggregated column [duplicate]

I have encountered the following problem. There is a table A that I want to aggregate using GROUP BY x, ordered by diff (which is abs(x - y)) in increasing order.
Both x and y always increase, and when two different x values could be paired with the same y, the smaller x has priority.
 x   | y | diff
-----+---+------
 1   | 2 | 1
 1   | 4 | 3
 1   | 6 | 5
 3   | 2 | 1
 3   | 4 | 1
 3   | 6 | 3
 4.5 | 2 | 3.5
 4.5 | 4 | 0.5
 4.5 | 6 | 1.5
The aggregation I want is:
take, in each group, the y that has the smallest difference from x (the smallest diff value).
BUT a y that has already been taken cannot be reused (for example, y = 2 is taken in the x = 1 group, so it cannot be reused in the x = 3 group).
Expected result:
 x   | y | diff
-----+---+------
 1   | 2 | 1
 3   | 4 | 1
 4.5 | 4 | 0.5
This seems very tricky to do in plain SQL. I am using PostgreSQL. The real data will be much more complicated and longer than this toy example.
If I have understood your question properly:
test=# select * from A;
x | y | diff
---+---+------
1 | 2 | 1
1 | 4 | 3
1 | 6 | 5
3 | 2 | 1
3 | 4 | 1
3 | 6 | 3
5 | 2 | 3
5 | 4 | 1
5 | 6 | 1
(9 rows)
test=# SELECT MIN(x) AS x, y FROM A WHERE diff = 1 GROUP BY y ORDER BY x;
x | y
---+---
1 | 2
3 | 4
5 | 6
(3 rows)
SELECT MIN(x) AS x, y, MIN(diff) FROM A WHERE diff = 1 GROUP BY y ORDER BY x;
x | y | min
---+---+-----
1 | 2 | 1
3 | 4 | 1
5 | 6 | 1
(3 rows)
I added MIN(diff); if it is not needed, it can be removed.
Try it like this (here t1 is the table name and d is the diff column):
with cte as (
  select x, y, d
  from t1
  where d = (select min(d) from t1)
  order by x
)
select t1.x, min(t1.y), min(t1.d)
from t1
inner join cte
  on t1.x = cte.x and not t1.y in (select y from cte where cte.x < t1.x)
group by t1.x
This is more of a comment.
This problem is essentially a graph problem: finding the closest set of pairs between two discrete sets (x and y in this case). Technically, this is a maximum matching of a weighted bipartite graph. I don't think this problem is NP-complete, but that can still make it hard to solve, particularly in SQL.
Regardless of whether or not it is hard in the theoretical sense (NP-complete is considered "hard theoretically"), it is hard to do in SQL. One issue is that greedy algorithms don't work. The same "y" value might be closest to all the X values. Which one to choose? Well, the algorithm has to look further.
The only way that I can think of to do this accurately in SQL is an exhaustive approach. That is, generate all possible combinations and then check for the one that meets your conditions. Finding all possible combinations requires generating N-factorial combinations of the X's (or Y's), which in turn requires a lot of computation. My first thought would be to use recursive CTEs for this. However, that would only work on small problems.
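For what it is worth, here is a minimal sketch of the greedy pass discussed above, written as a PostgreSQL recursive CTE against the question's table (assumed to be called a, with columns x, y, diff). As noted, the greedy choice is not guaranteed to find the globally best pairing; it only illustrates the idea of walking the x values in order and taking the closest unused y each time:

-- Greedy pairing: smallest x first, each step takes the nearest y not used yet.
WITH RECURSIVE pick AS (
    -- Start with the smallest x and its closest y.
    (SELECT a.x, a.y, a.diff, ARRAY[a.y] AS used
     FROM a
     WHERE a.x = (SELECT min(x) FROM a)
     ORDER BY a.diff, a.y
     LIMIT 1)
    UNION ALL
    -- For the next larger x, take the closest y that has not been used yet.
    SELECT nxt.x, nxt.y, nxt.diff, p.used || nxt.y
    FROM pick p
    CROSS JOIN LATERAL (
        SELECT a.x, a.y, a.diff
        FROM a
        WHERE a.x = (SELECT min(x) FROM a WHERE x > p.x)
          AND a.y <> ALL (p.used)
        ORDER BY a.diff, a.y
        LIMIT 1
    ) nxt
)
SELECT x, y, diff FROM pick ORDER BY x;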

Convert multi-dimensional array to records

Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
foo | bar
-----+------
1 | a
2 | b
3 | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] as foo, arr[subscript][2] as bar
FROM ( select generate_subscripts(arr,1) as subscript, arr
from (select '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] as arr) input
) sub;
This works:
select key as foo, value as bar
from json_each_text(
json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
foo | bar
-----+------
1 | a
2 | b
3 | c
Docs
I am not sure exactly what you mean by "it'd be better to have something that scales with the size of the array". Of course you cannot have extra columns added to the result set as the inner array size grows, because PostgreSQL must know the exact columns of a query before executing it (that is, before it begins to read the string).
But I would like to propose converting the string into a normal relational representation of the matrix:
select i, j, arr[i][j] a_i_j from (
  select i, generate_subscripts(arr, 2) as j, arr from (
    select generate_subscripts(arr, 1) as i, arr
    from (select ('{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][]) arr) input
  ) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be quite usable in further data processing, I think.
Of course, such a query can only handle arrays with a predefined number of dimensions, but the array sizes in each dimension can change without rewriting the query, so this is a somewhat more flexible approach.
ADDITION: Yes, using WITH RECURSIVE one can build a similar query capable of handling arrays with an arbitrary number of dimensions. Nonetheless, there is no way to overcome the limitation imposed by the relational data model: the exact set of columns must be defined at query parse time and cannot be postponed until execution time. So we are forced to store all the indices in one column, using another array.
Here is the query that extracts all elements from arbitrary multi-dimensional array along with their zero-based indices (stored in another one-dimensional array):
with recursive extract_index(k, idx, elem, arr, n) as (
  select (row_number() over()) - 1 k, idx, elem, arr, n from (
    select array[]::bigint[] idx, unnest(arr) elem, arr, array_ndims(arr) n
    from ( select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] arr ) input
  ) plain_indexed
  union all
  select k / array_length(arr, n)::bigint k,
         array_prepend(k % array_length(arr, n), idx) idx,  -- zero-based index along the current dimension n
         elem, arr, n - 1 n
  from extract_index
  where n != 1
)
select array_prepend(k, idx) idx, elem from extract_index where n = 1
Which gives:
idx | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what a real practical use one could make out of it :)

View to replace values with max value corresponding to a match

I am sure my question is very simple for some, but I cannot figure it out, and it is one of those things that are difficult to search for. I hope you can help.
In a table in SQL I have the following (simplified data):
UserID  UserIDX  Number  Date
aaa     bbb       1      21.01.2000
aaa     bbb       5      21.01.2010
ppp     ggg       9      21.01.2009
ppp     ggg       3      15.02.2020
xxx     bbb      99      15.02.2020
And I need a view which gives me the same number of records, but for every combination of UserID and UserIDX there should be only one value in the Number field, namely the highest value found for that combination. The Date field needs to remain unchanged. So the above would be transformed to:
UserID  UserIDX  Number  Date
aaa     bbb       5      21.01.2000
aaa     bbb       5      21.01.2010
ppp     ggg       9      21.01.2009
ppp     ggg       9      15.02.2020
xxx     bbb      99      15.02.2020
So, for all instances of the aaa+bbb combination the value in Number should be 5, and for ppp+ggg it should be 9.
Thank you very much.
Leo
select a.userid, a.useridx, b.maxnum as number, a.date
from yourtable a                 -- replace yourtable with your actual table name
inner join (
    select userid, useridx, max(number) maxnum
    from yourtable
    group by userid, useridx
) b
  on a.userid = b.userid and a.useridx = b.useridx
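If your database supports window functions (recent versions of PostgreSQL, SQL Server, MariaDB and MySQL 8 do), a sketch like the following avoids the self-join; the table name user_numbers is made up here, so substitute your own:

-- MAX() OVER a partition repeats the per-(UserID, UserIDX) maximum on every row,
-- leaving Date and the number of rows unchanged.
select UserID,
       UserIDX,
       max(Number) over (partition by UserID, UserIDX) as Number,
       Date
from user_numbers;  -- hypothetical table name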

SQL - querying without duplicates based on another column / improving the condition?

I have written a query which involves joins and finally returns the result below:
Name ID
AAA 1
BBB 1
BBB 6
CCC 1
CCC 6
DDD 6
EEE 1
But I want the result to be filtered further so that duplicate values in the first column are ignored for the lesser ID; i.e., the BBB and CCC rows with value 1 should be removed. The result should be:
AAA 1
BBB 6
CCC 6
DDD 6
EEE 1
Note: I have a condition WHERE (ID = '6' OR ID = '1'). Is there any way to improve this condition so that it means "ID = 6, or ID = 1 if no 6 is available in that table"?
You will likely want to add:
GROUP BY name
to the bottom of your query and change ID to MAX(ID) in your SELECT statement
It is hard to give a more specific answer without seeing the query you've already written.
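Concretely, and assuming the result of your joins is wrapped as a subquery or view (called joined_result here purely for illustration), the suggestion looks like this:

-- Keep one row per Name, preferring ID 6 over ID 1 when both are present.
select Name, max(ID) as ID
from joined_result        -- hypothetical name for the result of your joins
where ID in (1, 6)
group by Name;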