row_number() over partition by resource consumption - sql

If the table contains many distinct PARTITION BY values (10K+ ids) and the table keeps growing (millions of records), will it run into out-of-memory issues?
When this query runs, does the system need to hold the partition window (10K partitions, in that case) in memory, along with a row number for each partition?

It is safe to use. How it will perform depends. If there is a "POC index" (Partition, Order by, Cover), i.e. an index with your PARTITION BY column as the first key column, then your ORDER BY column(s) as the following key column(s), and the selected columns as included columns, it will be the best fit for this particular query. But you should weigh the pros and cons of maintaining this index. If there is no such index (or something similar), the query will be heavier - think table scans, spills to tempdb, etc. What the impact on your server will be, you should test to see.
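As a sketch of the POC pattern, assuming a hypothetical `dbo.Orders` table and a ROW_NUMBER() partitioned by `CustomerId` and ordered by `OrderDate` (all names are illustrative, not from the question):

```sql
-- Hypothetical query the index serves:
-- SELECT CustomerId, OrderDate, Amount,
--        ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY OrderDate) AS rn
-- FROM dbo.Orders;

-- POC index: Partition column first, Order by column next, selected columns covered
CREATE INDEX IX_Orders_POC
    ON dbo.Orders (CustomerId, OrderDate)
    INCLUDE (Amount);
```

With this index the window function can stream the rows in partition/order sequence, avoiding a sort and the associated tempdb spills.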

Related

Oracle partition performance

I have a large Oracle table with more than 600 million records, and we just repartitioned it so that we can purge some of the old data without the logs growing too large.
The problem is that some queries do a full index scan and run very often, around 300 times per second. Before partitioning the query used to take about 0.15 sec, but after partitioning it takes 0.50 to 1.25 sec. Does partitioning an Oracle table degrade query performance? If so, why? There are some articles on this, but none clear enough for me to understand.
If the index is local and the query is not based on the partitioning key (meaning: partition pruning is not possible) but is highly selective, the effort will increase in proportion to the number of partitions you create. If you have 30 partitions, then 30 indexes have to be searched for your values. The fact that each index is smaller does not offset the larger number of indexes. (You might want to look at how B-tree indexes work to understand why.)
To cut a long story short: If you use a global index you should be able to avoid this problem.
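For illustration, a global index is just a non-LOCAL index on the partitioned table, so a single B-tree is probed instead of one index per partition (table and column names here are hypothetical):

```sql
-- A plain (non-LOCAL) index on a partitioned table is a global,
-- nonpartitioned index: one B-tree covering all partitions.
CREATE INDEX idx_orders_customer
    ON orders (customer_id);
```

The trade-off, as noted below, is that global indexes are invalidated by partition maintenance (e.g. DROP PARTITION) unless you use UPDATE GLOBAL INDEXES, which makes those operations slower.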
When you have a partitioned table and run a lot of SELECT queries against it, always include the partition column in the WHERE clause: PartitionColumn = value.
For example, if the partition is based on a date column (PERSISTED_DATE):
Query
SELECT * FROM TABLE_NAME WHERE COLUMN1='VALUE' AND PERSISTED_DATE >= TRUNC(SYSDATE) AND PERSISTED_DATE < TRUNC(SYSDATE) + 1;
(Use a range predicate on the bare column rather than TRUNC(PERSISTED_DATE) = TRUNC(SYSDATE); wrapping the partition column in a function prevents partition pruning.)
Important points to note.
Avoid using a global index on a highly transactional table; otherwise you have to rebuild the global index after dropping a partition.
For better performance keep the partition count low; you can automate creating a new partition and dropping the oldest one on a daily basis.
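As a sketch of the automation point: on Oracle 11g+, interval partitioning creates new daily partitions automatically, and aged-out partitions can be dropped by a scheduled job (all names below are hypothetical):

```sql
-- Oracle creates a new daily partition automatically as rows arrive.
CREATE TABLE events (
    event_id       NUMBER,
    persisted_date DATE
)
PARTITION BY RANGE (persisted_date)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2015-01-01')
);

-- Dropping an aged-out partition by value, without knowing its
-- system-generated name:
-- ALTER TABLE events DROP PARTITION FOR (DATE '2015-01-01');
```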

Dynamo DB single partition, Global Secondary indexes

Current Scenario
Datastore used: Dynamo Db.
DB size: 15-20 MB
Problem: to store the data I am thinking of using a common hash as the partition key (and timestamp as the sort key), so that the complete table is saved in a single partition only. This would give me the table's undivided throughput.
But I also intend to create GSIs for querying, so I was wondering whether it would be wrong to use GSIs for single partition. I can use Local SIs also.
Is this the wrong approach?
Under the hood, a GSI is basically just another DynamoDB table. It follows the same partitioning rules as the main table. Partitions in your primary table are not correlated with the partitions of your GSIs. So it doesn't matter whether your table has a single partition or not.
Using a single partition in DynamoDB is a bad architectural choice overall, but I would argue that for a 20 MB database it doesn't matter too much.
DynamoDB manages table partitioning for you automatically, adding new
partitions if necessary and distributing provisioned throughput
capacity evenly across them.
Which partition an item lands in can't be controlled when the partition key values differ.
I guess what you are going to do is use the same partition key value for all items, with different sort key values (timestamps). In that case, I believe the data will be stored in a single partition, though I didn't understand your point about undivided throughput.
If you want to keep all the items of the index in a single partition, I think an LSI (Local Secondary Index) would be best suited here. An LSI is basically an alternate sort key for the same partition key.
A local secondary index maintains an alternate sort key for a given
partition key value.
If your single-partition rule does not apply to the index and you want a different partition key, then you need a GSI.

Cassandra SELECT on 2ndary index always sorted on partition key?

Say I have the following table and secondary indices defined:
CREATE TABLE ItemUpdates (
time timestamp,
item_name text,
item_context text,
item_descr text,
tags map<text, int>,
PRIMARY KEY ((time, item_name, item_context))
);
CREATE INDEX ItemUpdateByName
ON ItemUpdates(item_name);
CREATE INDEX ItemUpdateByContext
ON ItemUpdates(item_context);
CREATE INDEX ItemUpdateByTag
ON ItemUpdates(KEYS(tags));
General background information on the data model: an item has a unique name within a context, so (item_name, item_context) is a natural key for items. Tags have some value associated with them.
A natural query in my application is "show me all updates on item X with a certain tag". This translates to:
SELECT * FROM ItemUpdates
WHERE item_name='x'
AND item_context='a'
AND tags CONTAINS KEY 't';
When I try some queries I notice that although the cluster uses the Murmur3Partitioner, the results come back ordered by time. This makes sense when you consider that Cassandra stores secondary indexes as wide rows, and that columns are ordered by their name.
(1) Does Cassandra always return rows sorted by partition key when selecting on an indexed column (or set of indexed columns)?
The reason I find this interesting is that other natural queries in my application include:
fetch all updates on item X, since date D
fetch the 300 most recent updates on item X
What surprises me is that adding a clause ORDER BY time DESC to my select statement on ItemUpdates results in an error message "ORDER BY with 2ndary indexes is not supported."
(2) (How) can I do a range query on the partition key when I narrow the query by selecting on an indexed column?
The only natural "auto" sorting you should get in Cassandra is for columns within a wide row. Partitions under Murmur3 are not "sorted", as that would break the random distribution (afaik). Indexes are stored on each node as wide rows in a "hidden" table. When you filter on an index, the query hits that "partition" on the node, and the values are the rows in that partition (which correspond to the matching rows on that node). Try your query with different data sets and different columns. Maybe the data you have just happens to come back sorted.
(2) As it stands, you can only do range queries on clustering keys, not on the partition key. In general, for efficient querying, you should attempt to hit one (or a few) partitions, and filter on indexes / filter on clustering keys / range query on the clustering key. If you attempt to not hit a partition, it becomes a clusterwide operation, which isn't usually great. If you are looking to do cluster wide analysis (ala map reduce style), take a look at Apache Spark. Spark cassandra integration is quite good and is getting better.
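A query-driven alternative, sketched here under the schema in the question, is a second table whose partition key is the item and whose clustering key is time; the range and ORDER BY queries then become natural (the tag filter would still need an index or client-side filtering):

```sql
CREATE TABLE ItemUpdatesByItem (
    item_name    text,
    item_context text,
    time         timestamp,
    item_descr   text,
    tags         map<text, int>,
    PRIMARY KEY ((item_name, item_context), time)
) WITH CLUSTERING ORDER BY (time DESC);

-- "the 300 most recent updates on item X":
SELECT * FROM ItemUpdatesByItem
WHERE item_name = 'x' AND item_context = 'a'
LIMIT 300;

-- "all updates on item X since date D":
SELECT * FROM ItemUpdatesByItem
WHERE item_name = 'x' AND item_context = 'a'
  AND time >= '2014-01-01';
```

Both queries hit a single partition and use the clustering order, which is exactly the efficient pattern described above.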

Clustering Factor and Unique Key

Clustering factor - an awesome, simple explanation of how it is calculated:
Basically, the CF is calculated by performing a Full Index Scan and
looking at the rowid of each index entry. If the table block being
referenced differs from that of the previous index entry, the CF is
incremented. If the table block being referenced is the same as the
previous index entry, the CF is not incremented. So the CF gives an
indication of how well ordered the data in the table is in relation to
the index entries (which are always sorted and stored in the order of
the index entries). The better (lower) the CF, the more efficient it
would be to use the index as less table blocks would need to be
accessed to retrieve the necessary data via the index.
My Index statistics:
So, here are my indexes(index over just one column) under analysis.
The index starting with PK_ is my primary key and UI is a unique key. (Of course both hold unique values.)
Query1:
SELECT index_name,
UNIQUENESS,
clustering_factor,
num_rows,
CEIL((clustering_factor/num_rows)*100) AS cluster_pct
FROM all_indexes
WHERE table_name='MYTABLE';
Result:
INDEX_NAME UNIQUENES CLUSTERING_FACTOR NUM_ROWS CLUSTER_PCT
-------------------- --------- ----------------- ---------- -----------
PK_TEST UNIQUE 10009871 10453407 96 --> So High
UITEST01 UNIQUE 853733 10113211 9 --> Very Less
We can see the PK has the highest CF, while the other unique index does not.
The only logical explanation that strikes me is that the data underneath is actually stored in the order of the column covered by the unique index.
1) Am I right with this understanding?
2) Is there any way to give the PK , the lowest CF number?
3) Looking at the query cost when using either of these indexes, single selects are very fast. But still, the CF number is what baffles us.
The table is relatively huge, over 10M records, and it also receives real-time inserts/updates.
My Database version is Oracle 11gR2, over Exadata X2
You are seeing the evidence of a heap table indexed by an ordered tree structure.
To get extremely low CF numbers you'd need to order the data as per the index. If you want to do this (like SQL Server or Sybase clustered indexes), in Oracle you have a couple of options:
Simply create supplemental indexes with additional columns that can satisfy your common queries. Oracle can return a result set from an index without referring to the base table if all of the required columns are in the index. If possible, consider adding columns to the trailing end of your PK to serve your heaviest query (practical if your query has small number of columns). This is usually advisable over changing all of your tables to IOTs.
Use an IOT (Index Organized Table) - It is a table, stored as an index, so is ordered by the primary key.
Sorted hash cluster - More complicated, but can also yield gains when accessing a list of records for a certain key (like a bunch of text messages for a given phone number)
Reorganize your data and store the records in the table in order of your index. This option is ok if your data isn't changing, and you just want to reorder the heap, though you can't explicitly control the order; all you can do is order the query and let Oracle append it to a new segment.
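A minimal sketch of the IOT option (table and column names are hypothetical, chosen to match the text-messages example above):

```sql
-- Index-organized table: rows live in the primary key's B-tree itself,
-- so the data is physically ordered by the PK and the clustering factor
-- concern for that access path disappears.
CREATE TABLE messages_iot (
    phone_number VARCHAR2(20),
    sent_at      TIMESTAMP,
    body         VARCHAR2(4000),
    CONSTRAINT pk_messages_iot PRIMARY KEY (phone_number, sent_at)
)
ORGANIZATION INDEX;
```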
If most of your access patterns are random (OLTP), single record accesses, then I wouldn't worry about the clustering factor alone. That is just a metric that is neither bad nor good, it just depends on the context, and what you are trying to accomplish.
Always remember, Oracle's issues are not SQL Server's issues, so make sure any design change is justified by performance measurement. Oracle is highly concurrent, and very low on contention. Its multi-version concurrency design is very efficient and differs from other databases. That said, it is still a good tuning practice to order data for sequential access if that is your common use case.
To read some better advice on this subject, read Ask Tom: what are oracle's clustered and nonclustered indexes

What is a good size (# of rows) to partition a table to really benefit?

E.g. suppose we have a table with 4 million rows, with a STATUS field that can take the following values: TO_WORK, BLOCKED or WORKED_CORRECTLY.
Would you partition on a field which changes just once (most of the time from TO_WORK to WORKED_CORRECTLY)? How many partitions would you create?
The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.
For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.
Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.
The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.
So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows - if not all - will end up in WORKED_CORRECTLY. So that partition will become enormous compared to the others, and the chances of benefiting from partitioning become even more remote.
Finally, your proposed scheme is not elastic. At the current volume each partition would hold 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.
So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change and one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.
This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.
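As an illustrative sketch of the DATE_CREATED pattern (table, column, and partition names are hypothetical), monthly range partitioning looks like:

```sql
CREATE TABLE fact_sales (
    sale_id      NUMBER,
    date_created DATE,
    amount       NUMBER
)
PARTITION BY RANGE (date_created) (
    PARTITION p2015_01 VALUES LESS THAN (DATE '2015-02-01'),
    PARTITION p2015_02 VALUES LESS THAN (DATE '2015-03-01'),
    PARTITION p_future VALUES LESS THAN (MAXVALUE)
);

-- A BI query with the TIME dimension prunes to a single partition:
-- SELECT SUM(amount) FROM fact_sales
-- WHERE date_created >= DATE '2015-01-01'
--   AND date_created <  DATE '2015-02-01';
```

Archiving then becomes a metadata operation (drop or exchange the oldest partition) instead of a bulk DELETE.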
The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.
What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?
Assuming you are trying to improve query performance: do all your queries have predicates on the STATUS column? Are they doing single-row lookups? Or would you want your queries to scan an entire partition?