Why do bloom filters not work? Please tell me - indexing

I have two tables: the original one, and a new one with bloom filters. The bloom filter is created on an int column (CLUSTERED BY and 'orc.bloom.filter.columns'). In HDFS, inside the partition, I see that the number of files equals the number of unique values in the column. But when I query these tables (select min(...) from table where id = ...), the queries finish in the same amount of time, and in the job's logs and in 'explain analyze' I do not see any use of a bloom filter; the query reads the entire partition. What else needs to be configured so that the bloom filter works, queries run faster, and not all files in the partition are read, but only the one file with the desired id?

Bloom filters do not help in all cases.
ORC contains indexes at the file level, stripe level and row level (every 10,000 rows, configurable). If PPD (predicate push-down) is configured, these indexes (min/max values) can be used to skip reading files (the footer is still read) and to skip stripes. Such indexes are useful for filtering sortable, sequential values and for range queries, like integer numbers. For the indexes to be efficient you should sort the data by the index keys when inserting; an unsorted index is not efficient because every stripe may contain every key.
Sorting during insert can be expensive.
Indexes alone are enough in most cases.
Bloom filters are structures that can tell you, with certainty, that a key is not present in a dataset.
Bloom filters are efficient for equality queries, especially on non-sequential, unsorted values like GUIDs. MIN/MAX indexes do not work efficiently for such values, so a filter on a specific GUID should be very efficient with a bloom filter.
For sortable, sequential values like an integer id, the sorted min/max values stored in the ORC indexes are better.
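As a minimal sketch (table names, column names, the bucket count and partition value are illustrative, not taken from the question), a bloom-filtered, bucketed table with sorted inserts might look like this:

-- Illustrative DDL: bucket and sort by id, add a bloom filter on id
CREATE TABLE events_orc (
  id      INT,
  payload STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='id');

-- PPD must be on so the ORC reader consults min/max indexes and bloom filters
SET hive.optimize.index.filter=true;

-- Keep the inserted data sorted by the index key (details depend on Hive version)
INSERT OVERWRITE TABLE events_orc PARTITION (dt='2020-01-30')
SELECT id, payload
FROM events_staging
SORT BY id;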

Related

What is the difference between partitioning and indexing in a DB? (Performance-wise)

I am new to SQL and have been trying to optimize the query performance of my microservices against my DB (Oracle SQL).
Based on my research, I found that you can use indexing and partitioning to improve query performance. I think I understand each concept and how to apply it, but I'm not sure about the difference between the two.
For example, suppose I have a table Orders with 100 million entries and these columns:
OrderId (PK)
CustomerId (6 digit unique number)
Order (what the order is. Ex: "Burger")
CreatedTime (Timestamp)
In essence, both methods "subdivide" the Orders table so that a DB query won't need to scan through all 100 million entries in the DB, right?
Let's say I want to find orders on "2020-01-30"; I can create an index on CreatedTime to improve the performance.
But I can also create a partition based on CreatedTime to improve the performance (one partition per day).
Is there any difference between the two methods in this case? Is one better than the other?
There are several ways to partition - by range, by list, by hash, and by reference (although that tends to be uncommon).
If you were to partition by a date column, it would usually be by range, so one month/week/day uses one partition, another uses another, etc. If you want to filter rows where this date is equal to a value, then you can do a full scan of just the partition that houses that data. This can end up being quite efficient if most of the data in the partition matches your filter; apply the same thinking as for whether a full table scan in general is a good idea, but with the table already filtered down to one partition. If you wanted to look for an hour-long date range and you're partitioning by range with monthly intervals, then you're going to be reading about 730 times more data than necessary. Local indexes are useful in that scenario.
Smaller partitions also help here, but you can end up with thousands of partitions. If you have selective queries that don't know which partition needs to be read, you may want global indexes. These add a lot of effort to all partition maintenance operations.
If you index the date column instead, then you can quickly establish the location of the rows in your table that meet your filter. This is easy in an index because it's just a sorted list: you find the first key in the index that matches the filter and read until it no longer matches. You then have to look up those rows using single-block reads. The usual efficiency rules of an index apply: the less data you need to read with your filters, the more useful the index will be.
Usually, queries include more than just a date filter. These additional filters might be more appropriate for your partitioning scheme. You could also just include the extra columns in your index (remembering the golden rule of indexing: columns you filter with equality should go before columns you filter with ranges in an index).
You can generally get all the performance you need with just indexing. Partitioning really comes into play when you have important queries that need to read huge chunks of data (generally reporting queries) or when you need to do things like purge data older than X months.
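As a sketch of the two approaches on the Orders table from the question (the interval clause and names are illustrative, and the Order column is renamed here because ORDER is a reserved word):

-- Daily range (interval) partitioning on the created_time column
CREATE TABLE orders (
  order_id     NUMBER PRIMARY KEY,
  customer_id  NUMBER(6),
  order_item   VARCHAR2(100),
  created_time TIMESTAMP
)
PARTITION BY RANGE (created_time)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION p_start VALUES LESS THAN (TIMESTAMP '2020-01-01 00:00:00'));

-- A local index on the same column, for selective filters within a partition
CREATE INDEX orders_created_time_ix ON orders (created_time) LOCAL;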

Issue with big tables (no primary key available)

Table1 has around 10 lakh records (1 million) and does not contain any primary key. Retrieving the data using a SELECT command (with a specific WHERE condition) is taking a large amount of time. Can we reduce the retrieval time by adding a primary key to the table, or do we need to follow some other approach? Kindly help me.
A primary key does not have a direct effect on performance. But indirectly, it does. This is because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. But you can create your own unique indexes on a table. So, strictly speaking, a primary key does not affect performance, but the index used by the primary key does.
WHEN SHOULD A PRIMARY KEY BE USED?
A primary key is needed for referring to a specific record.
To make your SELECTs run fast you should consider adding an index on the appropriate columns you're using in your WHERE clause.
E.g. to speed up SELECT * FROM "Customers" WHERE "State" = 'CA', one should create an index on the State column.
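In SQL that index could be created like this (the index name is arbitrary):

CREATE INDEX "IX_Customers_State" ON "Customers" ("State");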
A primary key will not help if you don't have the primary key column in your WHERE clause.
If you would like to make your query faster, you can create a non-clustered index on the columns in your WHERE clause. You may also want to add included columns on top of your index (it depends on your SELECT clause).
The SQL optimizer will seek on your indexes, which will make your query faster.
(But you should think about what happens when data is added to your table: insert operations might take longer if you create indexes on many columns.)
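As a sketch in SQL Server syntax (the table and column names are hypothetical, since the question doesn't list them), a non-clustered index with included columns looks like this:

-- Hypothetical names: index the WHERE column, include the SELECTed columns
CREATE NONCLUSTERED INDEX IX_Table1_FilterColumn
ON dbo.Table1 (FilterColumn)
INCLUDE (SelectedColumn1, SelectedColumn2);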
It depends on the SELECT statement, the size of each row in the table, the number of rows in the table, whether you are retrieving all the data in each row or only a small subset of the data (and if a subset, whether the columns that are needed are all present in a single index), and whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers
One way you can do it is to create indexes on your table. It's always better to create a primary key, which creates a unique index that by default will reduce the retrieval time.
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?

SQL Indexing: None, Single Column, and Multiple Columns

How does indexing work in SQL and what benefits does it provide? What reason would there be for not indexing? And what is the difference between indexing a single column vs. indexing multiple columns?
How does indexing work in SQL and what benefits does it provide?
When you index columns you express your intent to query the indexed columns in conditional expressions, such as equality or range queries. With this information the storage engine can build a structure that makes such queries faster, often arranging the keys in tree structures. B-trees are the most common ones, but a lot of different structures exist, such as hash indices, R-tree indices for spatial data, etc. Each structure is specialized in a certain type of lookup. For instance, hash indices are very fast for equality conditions, such as:
SELECT * FROM example_table WHERE type = "example";
SELECT * FROM example_table WHERE id = X;
B-trees are also fairly quick for equality look ups, but their main strength is that they support range queries:
SELECT * FROM example_table WHERE id > 5 AND id < 10;
SELECT * FROM example_table WHERE type = "example" AND value > 25;
It is VERY important, however, when you build B-tree indices, to understand that the tree is ordered in a "left-to-right" manner. That is, if you build a B-tree index (let's call it A) on {type, value}, then you NEED a condition on the type column for the query to be able to utilize the index. The example index can NOT be used in a query whose condition depends solely on value.
Furthermore, if you mix equality and a range condition, make sure that the equality columns are listed first in the index, otherwise the index can only be partially used.
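As a short sketch of that rule, using the example_table columns from above (the index name is arbitrary):

-- Composite B-tree index on {type, value}
CREATE INDEX idx_type_value ON example_table (type, value);

-- Can use the index: the leading column (type) has an equality condition
SELECT * FROM example_table WHERE type = 'example' AND value > 25;

-- Cannot use the index effectively: no condition on the leading column
SELECT * FROM example_table WHERE value > 25;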
What reason would there be for not indexing?
If the selectivity of the index is low, then you might not gain much over a table scan. Say, for instance, that you have an index on a field called gender. Then the selectivity of that index will be low, since a lookup on that index will return half the rows of the original table. You can read a pretty simple explanation of selectivity, and the reasoning behind it, here: http://mattfleming.com/node/192
Also, maintaining an index has a cost. For each data manipulation the index might need restructuring. So keeping the number of indices down to the minimum required to perform well on the queries against that table might be desirable.
What is the difference between indexing a single column vs. indexing multiple columns?
Once again, it depends on the type of queries you issue. Indexing a single column such as gender might not be a good idea, since its selectivity is low. When the selectivity is high, such an index makes much more sense. For instance, an index on the primary key is a very good index, since the selectivity is as high as it gets: each key in the index corresponds to exactly one record. Indices on columns with unique or highly distinct values (such as slugs, password hashes and the like) are also good single-column indices.
There is also the concept of covering indices. Basically, each leaf in an index contains a pointer into the table where the row is stored (unless the index is a clustered index, in which case the leaf is the record). So for each index hit, the query engine has to fetch the corresponding table row, increasing the number of I/O operations. Since I/O is extremely slow, you want to keep this to a minimum. Now, let's say that you often need to query for something and also fetch some additional columns; then you can create a covering index, trading storage space for query performance. Example: let's find the name and email of all users who have joined in the last 6 months (assuming MySQL):
With index on {joined_at}:
SELECT first_name, last_name, email
FROM users
WHERE joined_at > NOW() - INTERVAL 6 MONTH;
Query explanation:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE users ALL test NULL NULL NULL 873 Using where
As you can see in the type column, the query engine resorted to a full table scan, since the index selectivity was too low to be worth using in this query (too many results would be returned, and thus followed into the table, costing too much in I/O).
With index on {joined_at, first_name, last_name, email}:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE users range test,test2 test2 8 NULL 514 Using where; Using index
Now, since all the information that is necessary to complete the query is available in the index, the query engine evaluates that it is much better to use the index (with 514 rows) instead of doing a full table scan.
So as you can see, by using covering indices we can speed up queries for partial selects of a table, even if the selectivity of the index is quite low.
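For reference, the covering index used in the second plan could be created like this (the index name is arbitrary; depending on the column types you may need prefix lengths on the string columns):

CREATE INDEX ix_users_joined_covering
ON users (joined_at, first_name, last_name, email);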
How does indexing work in SQL
That's a pretty open question, but basically databases store a structure that enables faster lookup of information. That structure is dependent on the implementation, but it's typically a type of tree.
what benefits does it provide?
Queries that are SARGable can be significantly faster.*
What reason would there be for not indexing?
Some data modification queries can take longer and there is storage cost to indexes but generally speaking, both these considerations are negligible.
And what is the difference between indexing a single column vs. indexing multiple columns?
There isn't much difference, but sometimes people create covering indexes** that index multiple columns to increase the performance of a specific query.
*SARGable is from Search ARGument ABLE. Basically if you do WHERE FOO > 5 it can be faster if FOO is indexed. On the other hand WHERE h(FOO) > 5 probably won't benefit from an index.
** If all the fields used in the SELECT, JOIN and WHERE of a statement are also in an index, a database can retrieve all the information it needs without going back to the base table. This is called a covering index. If all the fields were in separate indexes, it would only use the ones for the joins and the WHERE, and then go back to the base table for the columns in the SELECT.
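A small sketch of the SARGable footnote (the table name and the ABS() wrapper are placeholders for illustration):

-- SARGable: the optimizer can seek on an index on FOO
SELECT * FROM some_table WHERE FOO > 5;

-- Not SARGable: wrapping the column in a function hides it from the index
SELECT * FROM some_table WHERE ABS(FOO) > 5;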

How do I optimize this query?

I have a very specific query. I tried lots of ways but I couldn't reach the performance I want.
SELECT *
FROM items
WHERE user_id = 1
  AND (item_start < 20000 AND item_end > 30000);
I created an index on user_id, item_start, item_end.
This didn't work, so I dropped all indexes and created new indexes:
user_id, (item_start, item_end)
This also didn't work.
(user_id, item_start and item_end are int.)
Edit: the database is MySQL 5.1.44, the engine is InnoDB.
UPDATE: per your comment below, you need all the columns in the query (hence your SELECT *). If that's the case, you have a few options to maximize query performance:
create (or change) your clustered index to be on item_user_id, item_start, item_end. This will ensure that as few rows as possible are examined for each query. Per my original answer below, this approach may speed up this particular query but may slow down others, so you'll need to be careful.
if it's not practical to change your clustered index, you can create a non-clustered index on item_user_id, item_start, item_end and any other columns your query needs. This will slow down inserts somewhat, and will double the storage required for your table, but will speed up this particular query.
There are always other ways to increase performance (e.g. by reducing the size of each row) but the primary way is to decrease the number of rows which must be accessed and to increase the % of rows which are accessed sequentially rather than randomly. The indexing suggestions above do both.
ORIGINAL ANSWER BELOW:
Without knowing the exact schema or query plan, the main performance problem with this query is that SELECT * forces a lookup back to your clustered index for every row. If there are large numbers of matching rows for a particular user ID and your clustered index's first column is not item_user_id, then this will likely be a very inefficient operation, because your disk will be trying to fetch lots of randomly distributed rows from the clustered index.
In other words, even though filtering the rows you want is fast (because of your index), actually fetching the data is slower.
If, however, your clustered index is ordered by item_user_id, item_start, item_end, then that should speed things up. Note that this is not a panacea: if you have other queries which depend on a different ordering, or if you're inserting rows in a different order, you could end up slowing down other queries.
A less impactful solution would be to create a covering index which contains only the columns you need (also ordered by item_user_id, item_start, item_end, with the other columns you need added after them). Then change your query to pull back only the columns you need, instead of using SELECT *.
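For example, a sketch of that covering index in MySQL, using the column names from the query in the question; the trailing columns are hypothetical placeholders for whatever your query actually needs:

-- Key columns first, then the remaining columns the query selects
CREATE INDEX ix_items_covering
  ON items (user_id, item_start, item_end, col_a, col_b);

-- Then select only those columns instead of SELECT *
SELECT user_id, item_start, item_end, col_a, col_b
FROM items
WHERE user_id = 1
  AND (item_start < 20000 AND item_end > 30000);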
If you post more info about the DBMS brand and version, and the schema of your table, we can help with more details.
Do you need to SELECT *?
If not, you can create an index on user_id, item_start, item_end with the fields you need in the SELECT part as included columns. This is all assuming you're using Microsoft SQL Server 2005+.

Same query uses different indexes?

Can a select query use different indexes if I change the value of a WHERE condition?
The two following queries use different indexes, and the only difference between them is the value of the condition: and typeenvoi='EXPORT' versus and typeenvoi='MAIL'.
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='EXPORT'
and nbessais<1
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='MAIL'
and nbessais<1
Can anyone give me an explanation?
Details on indexes are stored as statistics in a histogram-type dataset in SQL Server.
Each index is chunked into ranges, and each range contains a summary of the key values within that range, things like:
the high value of the range
number of values in the range
number of distinct values in the range (cardinality)
number of values equal to the high value
...and so on.
You can view the statistics on a given index with:
DBCC SHOW_STATISTICS(<tablename>, <indexname>)
Each index has a couple of characteristics like density, and ultimately selectivity, that tell the query optimiser how unique each value in an index is likely to be, and how efficient this index is at quickly locating records.
As your query has three columns in the where clause, it's likely that any of these columns might have an index that could be useful to the optimiser. It's also likely that the primary key index will be considered, in the event of the selectivity of other indexes not being high enough.
Ultimately, it boils down to the optimiser making a quick judgement call on how many page reads will be necessary to read each of your non-clustered indexes plus the bookmark lookups, comparing those costs with each other and with doing a table scan.
The statistics that these judgements are based on can vary wildly too; SQL Server, by default, only samples a small percentage of any significant table's rows, so the selectivity of that index might not be representative of the whole. This is particularly problematic where you have highly non-unique keys in the index.
In this specific case, I'm guessing your typeenvoi index is highly non-unique. This being so, the statistics gathered probably indicate to the optimiser that one of the values is rarer than the other, and the likelihood of that index being chosen is increased.
The query optimiser in SQL Server (as in most modern DBMS platforms) uses a methodology known as 'cost based optimisation.' In order to do this it uses statistics about the tables in the database to estimate the amount of I/O needed. The optimiser will consider a number of semantically equivalent query plans that it generates by transforming a basic query plan generated by parsing the statement.
Each plan is evaluated for cost by a heuristic based on the statistics maintained about the tables. The statistics come in various flavours:
Table and index row counts
Distribution histograms of the values in individual columns.
If the occurrence of 'MAIL' vs. 'EXPORT' in the distribution histograms is significantly different, the query optimiser can come up with different optimal plans. This is probably what happened.
It probably has to do with the "cardinality", I believe the word is, of the values in the table. If there are a lot more rows that match one value, SQL Server may decide that the query will be more efficient using an index on a different column. This is an extreme case, but if there were only one row that matched 'MAIL', it would likely use that index. If every other row in the table were 'EXPORT', but only half of those 'EXPORT' rows had an etat of 0, then it would probably use the index on that column.