Has anyone used Snowflake search optimization and gained benefits over cluster keys?

Reference:
https://docs.snowflake.com/en/user-guide/search-optimization-service.html#what-access-control-privileges-are-needed-for-the-search-optimization-service
Has anyone used Snowflake search optimization and gained benefits over cluster keys?
Please share any use cases, and cost vs. performance findings as well.
Appreciate the insights

In general, the Search Optimization Service (SOS) is more beneficial than clustering for point-lookup queries: the type of query that retrieves one or a few rows from a very large table using an equality or IN filter condition.
Since a table can have only one clustering key, SOS can also help optimize searches on non-clustering-key columns in a clustered table.
However, unlike clustering, SOS adds a storage cost: it maintains search access path data for each table on which SOS is enabled.
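For reference, enabling SOS is a single DDL statement. A minimal sketch, assuming a hypothetical sales table with a customer_id column; the ON clause is optional and limits the search access paths (and their storage cost) to specific columns and methods:

-- Enable search optimization for the whole table (illustrative names).
ALTER TABLE sales ADD SEARCH OPTIMIZATION;

-- Or restrict it to equality lookups on one column to contain storage cost.
ALTER TABLE sales ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id);

-- The kind of point-lookup query SOS is designed to accelerate.
SELECT * FROM sales WHERE customer_id = 12345;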

Related

Index vs Extended Statistics

In Oracle specifically, and possibly other platforms, what is the difference between indexes and extended statistics? They seem to be constructed in a similar fashion and to perform the same function. There must be some core differences - can anyone provide details?
Hmmm . . . They seem quite different to me.
An index is a copy of data in one or more columns in a table (perhaps with expressions) structured to speed access or to enforce a unique constraint. The index can be directly used to return values from these columns (or expressions).
Part of the process of creating and maintaining an index provides statistics about the underlying distributions of values. The optimizer can take advantage both of the data in the index and the information about distributions. However, the main purpose of indexes is either to provide an alternative, faster access path or to enforce uniqueness constraints.
Statistics (and extended statistics) describe properties of one or more columns. These properties are used by the optimizer to choose the best algorithm for running the query. The most important property is cardinality -- the number of distinct values -- although skewness can also be important.
Statistics are not used to directly return values in result sets. They only affect the optimizer. Indexes can be used to return values; information gathered in the creation of indexes can also be used by the optimizer to define the best execution plan.
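To make the contrast concrete, here is a hedged Oracle sketch (the ORDERS table and its columns are invented). The index is a physical structure that can be used to fetch rows; the extended statistic merely records the combined selectivity of the (CUST_ID, PROD_ID) column group for the optimizer:

-- An index: a physical access structure the optimizer can read rows from.
CREATE INDEX orders_cust_prod_ix ON orders (cust_id, prod_id);

-- Extended statistics: metadata about the column group, nothing more.
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(USER, 'ORDERS', '(CUST_ID, PROD_ID)')
FROM DUAL;

-- Gather statistics so the column group's combined cardinality is recorded.
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ORDERS');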

Indexing and cluster in Greenplum database

I am new to Greenplum database. I have a question.
Is running CLUSTER on a table mandatory after creating an index on a column in Greenplum, in the case of row-based distribution?
The "massively parallel" (MPP) nature of Greenplum's software-level architecture, when coupled with the throughput capabilities of modern servers makes indexes unnecessary in most cases.
To say it differently, the speed of table scans in Greenplum is a feature, rather than a bottleneck. Please refer to this great writeup on how MPP works under the hood : https://dwarehouse.wordpress.com/2012/12/28/introduction-to-massively-parallel-processing-mpp-database/
If your data is not updated frequently and you need results returned quickly, you can CLUSTER the table on an index, although the operation itself takes a long time. You can also build indexes on column-oriented tables.
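If you do go that route, here is a minimal sketch (table, column, and index names are invented; depending on the Greenplum version, the older CLUSTER indexname ON tablename spelling may be required). Note that CLUSTER is a one-time physical reorder, not something maintained automatically:

-- Create the index, then physically reorder the table rows by it.
CREATE INDEX events_user_ix ON events (user_id);
CLUSTER events USING events_user_ix;

-- CLUSTER does not track later inserts; re-run it if the ordering degrades.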

Is having many partition keys in Azure Table Storage a good design for read queries?

I know that having many partition keys reduces batch processing (EGT) in Azure Table Storage. However, I want to know whether there is any performance issue in terms of reading as well. For example, suppose I designed my Azure Table such that every new entity has a new partition key, and I end up having 1M or more partition keys. Is there any performance disadvantage for read queries?
If the operation you perform most often is a Point Query (PartitionKey and RowKey specified), the unique-partition-key design is quite good. However, if your querying operation is usually a Table Scan (no PartitionKey specified), the design will be awful.
You can refer to the chapter "Design for querying" in the Azure Table Design Guide for details.
A point query is the most efficient query for retrieving a single entity, specifying a single PartitionKey and RowKey using equality predicates. If your PartitionKey is unique, you may consider using a constant string as the RowKey to let you leverage point queries. The choice of design also depends on how you plan to read/retrieve your data. If you always plan to use point queries to retrieve the data, this design makes sense.
Please see the "New PartitionKey Value for Every Entity" section in the following article: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx. In short, it will scale very well, since our storage system has the option to load balance across many partitions. However, if your application requires you to retrieve data without specifying a PartitionKey, it will be inefficient because it will result in a table scan.
Please send me an email at ascl#microsoft.com if you want to discuss your table design further.

Give DB2 a hint about which index to use?

Hello,
I have a join over some tables, and I want to give the DB2 database a hint about which index I want it to use. I know this may result in a slow query, but I have a production and a test database, and I want the same behaviour in both databases, even if the amount of data differs significantly between them or the (index) cache is in a different state.
Is this possible (and how)? I did not find anything in the online manual, which could mean I used the wrong search criteria.
Thanks a million.
This is not something that is commonly done with DB2. However, you can use the SELECTIVITY clause; it should still be around in current versions. Adding SELECTIVITY clauses to queries affects the decisions made by the query optimizer.
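A hedged sketch of what that looks like (the table and the value are invented; depending on the DB2 version and configuration, the DB2_SELECTIVITY registry variable may need to be enabled before the clause is accepted on ordinary predicates):

-- Tell the optimizer to assume this predicate matches ~0.1% of rows,
-- regardless of what the catalog statistics say.
SELECT *
FROM orders
WHERE status = ? SELECTIVITY 0.001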
Also, what Gilbert Le Blanc noted above will work: you can UPDATE the table statistics columns (via the updatable SYSSTAT catalog views) and fool DB2 into optimizing the queries for non-existent row counts. The rest of your DB/DBM CFG should match as well (i.e. the calculated disk and CPU speeds, memory-usage-related settings, etc.), because in some situations they also matter to some degree.
You can also influence the optimizer via an optimization profile:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1202storedprocedure/index.html
It was recently asked here: Is it possible to replace NL join with HS join in sql
However, I hadn't heard about the SELECTIVITY clause, and I think you should try that option before creating a profile. In any case, you should do either only after having tried the standard options. Follow the steps indicated in the developerWorks tutorial before influencing the optimizer:
Experiment with different SQL optimization classes. The default optimization class is controlled by the DFT_QUERYOPT parameter in the database configuration file.
Attempt to resolve any performance problems by ensuring that proper database statistics have been collected. The more detailed the statistics, the better the optimizer can perform. (See RUNSTATS in the DB2 Command Reference.)
If the poor access plan is the result of rapidly changing characteristics of the table (i.e. it grows so quickly that statistics get out of date fast), try marking the table as VOLATILE using the ALTER TABLE command. (A sketch of this and the RUNSTATS step follows this list.)
Try explaining the query using literal values instead of parameter markers in your predicates. If you are getting different access plans when using parameter markers, it will help you understand the nature of the performance problem better. You may find that using literals in your application will yield a better plan (and therefore better performance) at the cost of SQL compilation overhead.
Try using DB2’s index advisor (db2advis) to see if there are any useful indexes which you may have overlooked.
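To make the statistics and VOLATILE steps concrete, here is a minimal sketch (the schema and table names are invented); RUNSTATS is issued from the DB2 command line, while the ALTER TABLE is plain SQL:

-- Collect detailed statistics, including distribution statistics.
RUNSTATS ON TABLE myschema.orders WITH DISTRIBUTION AND DETAILED INDEXES ALL

-- Mark a rapidly changing table as volatile so the optimizer prefers
-- index access instead of trusting possibly stale cardinality statistics.
ALTER TABLE myschema.orders VOLATILE CARDINALITY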

SQL Query Optimization

I am not very proficient in T-SQL yet (I have been writing it for the last 4-5 months), but I have written many queries. Although I get the right outputs, sometimes I feel that the queries are not optimized. I searched Google, found a lot of material about query optimization, and it says to look into the query plan (actual & estimated) for performance improvement.
As I already said, I am very new to writing queries, so it is difficult for me to grasp those solutions. But I need to learn query optimization.
Can anybody help me with how and where I should start?
Searching the internet reveals that a SEEK is better than a SCAN (be it an index or a table). How can I achieve a seek instead of a scan?
They also say that the ORDER BY clause, i.e. sorting, is costly. What is the workaround? How can I write an effective query?
Can anybody explain to me, with some examples, which kind of query is better in which situation?
Edited
Dear all,
You have all answered, and that will help me a lot. But what I intend to say is that you have all practised a lot to become experts. Once upon a time, I guess you were all where I am now. So my humble request is: how did you all start out writing optimized queries? I know that patience is needed, and I will devote it.
I apologise for any wrong statement of mine.
Thanks in advance
Articles discussing query optimization issues are often very factual and useful, but as you found out they can be hard to follow. It is a bit like someone trying to learn the basic rules of baseball when all the sports commentary he/she finds on the subject is rife with acronyms and strategic details about the benefits of sacrificing someone at bat, and other "inside baseball" trivia...
So you need to learn the basics first:
the structure(s) of the database storage
indexes' structure, the clustered and non-clustered kinds, multi-column indexes
the concept of covering a query
the selectivity of a particular column
the disadvantage of indexes when it comes to CRUD operations
the basic subtasks/strategies of a query: table or index scan, index seek, sorting, inner-outer merge etc.
the log file, the data recovery model.
The following links apply to MS SQL Server. If that is not the DBMS you are using, you can try to find similar material for the system of your choice. In fact, as long as you realize that the implementation may vary, it may be useful to peruse the MS documentation.
MS SQL storage structures
MS SQL pages and extents
Then, as you have started doing, learn to read query plans (even if you don't fully understand them at first), and all this should bring you to a level where you can start to make sense of the more advanced books and articles on the topic. I do not know of tutorials for query plans on the internet (though I'm quite sure they exist...), but the following methodology may be of use: start with simple queries and review the query plan (if possible in graphical form); start recognizing the most common elements: Table Scan, Index Seek, Sort, Nested Loops...; read the detailed properties of these operators: estimated number of rows, cost percentage, etc. When you find a new element that you do not know or understand, use that keyword to find details on the internet. Also: experiment a lot.
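As a starting point, here is a minimal T-SQL sketch for inspecting a query's behaviour (dbo.Orders and its columns are placeholders); in SSMS you can also press Ctrl+M to include the actual execution plan before running the query:

-- Show I/O and timing statistics for each statement in this session.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the query you want to analyze; check the Messages tab for
-- logical reads, and the execution plan for scans vs. seeks.
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerID = 42;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;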
Finally, you should remember that while the way the query is written and the set of indexes provided cover a great part of optimization needs, there are other sources of optimization, for example the way the hardware is put to use (a basic example: by putting the data file and the log file on separate physical disks, we can greatly improve CRUD performance).
Searching the internet reveals that a SEEK is better than a SCAN (be it an index or a table). How can I achieve a seek instead of a scan?
Add the necessary index -- if the incremental costs on INSERT and UPDATE (and extra storage) are an overall win to speed up the seeking in your queries.
They also say that the ORDER BY clause, i.e. sorting, is costly. What is the workaround? How can I write an effective query?
Add the necessary index -- if the incremental costs on INSERT and UPDATE (and extra storage) are an overall win to speed up the ordering in your queries.
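For both answers, the mechanics look something like the following hedged T-SQL sketch (dbo.Orders and its columns are invented). The composite index lets the filter become an index seek, and because the index is already sorted by OrderDate within each CustomerID, the ORDER BY needs no separate sort step:

-- One index serves both the equality filter (seek) and the ordering.
CREATE INDEX IX_Orders_Customer_Date
    ON dbo.Orders (CustomerID, OrderDate);

SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerID = 42      -- index seek on the leading column
ORDER BY OrderDate;        -- rows come back pre-sorted by the index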
Can anybody explain to me, with some examples, which kind of query is better in which situation?
You already pointed out a couple of specific questions -- and the answers were nearly identical. What good would it do to add another six?
Run benchmark queries over representative artificial data sets (they must resemble what you're planning to have in production -- if you have small toy-sized tables, the query plans will be neither representative nor meaningful), try with and without the indexes that the different query plans appear to suggest, measure performance; rinse, repeat.
It takes 10,000 hours of practice to be good at anything. Optimizing DB schemas, indexes, queries, etc. is no exception ;-).
ORDER BY is a necessary evil - there's no way around it.
Refer to this question for solving index seek, scan and bookmark/key lookups. And this site is very good for optimization techniques...
Always ensure that you have indexes on your tables. Not too many and not too few.
Using SQL Server 2005, apply included columns in these indexes; they help with lookups.
ORDER BY is costly: if sorting is not required, don't sort the data.
Always filter as early as possible; if you reduce the number of joins, function calls, etc. as early as possible, you reduce the overall time taken.
Avoid cursors if you can.
Use temp tables/table variables for filtering where possible.
Remote queries will cost you.
Queries with subselects in the WHERE clause can be hurtful.
Table functions can be costly if not filtered.
As always, there is no hard rule, and things should be taken on a per-query basis.
Always write the query to be as understandable/readable as possible, and optimize when needed.
EDIT, in response to the question in the comments:
Temp tables can be used when you need to add indexes on the temp table (you cannot add indexes on table variables, except the primary key). I mostly use table variables when I can, and only keep the required fields in them, like so:
DECLARE @Table TABLE (
    FundID int PRIMARY KEY   -- assuming an int key; declare to match your FundID type
);
I would use this to hold my fund group IDs instead of joining to tables that are less optimized.
I read a couple of articles the other day and, to my surprise, found that table variables are actually created in tempdb.
Also, I have heard, and found, that table UDFs can seem like a "black box" to the query planner. Once again, we tend to move the SELECTs from the table functions into table variables, and then join on these table variables. But as mentioned earlier, write the code first, then optimize when you find bottlenecks.
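As a hedged sketch of that pattern (dbo.GetFundIds and dbo.Funds are hypothetical): materialize the function's output into a table variable once, then join on the variable instead of the opaque function:

-- Materialize the UDF's output so the planner sees a plain rowset.
DECLARE @FundIds TABLE (FundID int PRIMARY KEY);

INSERT INTO @FundIds (FundID)
SELECT FundID FROM dbo.GetFundIds();  -- hypothetical table-valued function

SELECT f.FundID, f.FundName
FROM dbo.Funds AS f                   -- hypothetical table
JOIN @FundIds AS ids ON ids.FundID = f.FundID;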
I have found that CTEs can be useful, but also that when the level of recursion grows, they can become very slow...