I had a table with a global B-tree index on a date column.
We range-partitioned the table for better retrieval performance. The partition key is the same date column that carries the index.
Later we converted the global index to a local one by dropping it and recreating it as a local index on the same date column.
Now the local index and the partition key are on the same column. Since this change, the data load into this table takes about three times as long as usual.
Why would the data load take longer when the partition key and the local index are on the same column?
When we checked the explain plan, we found that the queries are not using the local index. Why doesn't Oracle use the local index in this case?
Is there a hidden built-in index attached to the partition key that Oracle uses in place of the local index?
What could we do so that data load performance is not impacted?
Any leads on this would be highly appreciated.
I'm currently using MSSQL Server and have created a table with indexes on 4 columns. I plan on appending 1MM rows every month end. Is it customary to drop the indexes and recreate them every time you add data to the table?
Don't recreate the index. Instead, you can use update statistics to compute the statistics for the given index or for the whole table:
UPDATE STATISTICS mytable myindex; -- statistics for the table index
UPDATE STATISTICS mytable; -- statistics for the whole table
I don't think it is customary, but it is not uncommon. Presumably the database would not be used for other tasks during the data load, otherwise, well, you'll have other problems.
It could save time and effort if you just disabled the indexes:
ALTER INDEX IX_MyIndex ON dbo.MyTable DISABLE
More info on this non-trivial topic can be found here. Note especially that disabling the clustered index will block all access to the table (i.e. don't do that). If the data being loaded is ordered in [clustered index] order, that can help some.
A last note, do some testing. 1MM rows doesn't seem like that much; the time you save may get used up by recreating the indexes.
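For what it's worth, the disable / load / rebuild pattern could be scripted roughly like this. It is only a sketch in Python that builds the T-SQL strings; the table and index names are placeholders, and `index_maintenance_statements` is a made-up helper, not part of any library. Pass in only the nonclustered indexes, since the clustered index should stay enabled.

```python
def index_maintenance_statements(table, nonclustered_indexes):
    """Return (before_load, after_load) lists of T-SQL statements.

    Hypothetical helper: `table` and the index names are placeholders
    and are assumed to be trusted identifiers (no escaping is done).
    """
    # Disable each nonclustered index before the bulk load.
    before = [f"ALTER INDEX {ix} ON {table} DISABLE;" for ix in nonclustered_indexes]
    # REBUILD re-enables a disabled index once the load is done.
    after = [f"ALTER INDEX {ix} ON {table} REBUILD;" for ix in nonclustered_indexes]
    return before, after


before, after = index_maintenance_statements("dbo.MyTable", ["IX_MyIndex"])
print("\n".join(before))
print("-- bulk load goes here")
print("\n".join(after))
```

Timing the load both ways (indexes left enabled vs. disabled and rebuilt) on a copy of the table is the only way to know which one wins for your data.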
I have a very large existing partitioned table in BigQuery. I want to make the table clustered, at least for the new partitions.
According to the documentation (https://cloud.google.com/bigquery/docs/creating-clustered-tables), it is possible to create a clustered table when you load data, so I tried loading a new partition with clustering fields set: job_config.clustering_fields = ["event_type"].
The load finished successfully, but the new partition does not seem to be clustered. (I am not sure how to check whether it is clustered, but when I query that particular partition it always scans all rows.)
Is there a good way to add clustering fields to an existing partitioned table?
Any comment, suggestion, or answer is well appreciated.
Thanks a lot,
Yosua
BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table.
You can change the clustering specification in the following ways:
Call the tables.update or tables.patch API method.
Call the bq command-line tool's bq update command with the --clustering_fields flag.
Reference
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
This answer is no longer valid / correct
https://cloud.google.com/bigquery/docs/creating-clustered-tables#modifying-cluster-spec
You can only specify clustering columns when a table is created
So you cannot expect an existing non-clustered table, let alone just its new partitions, to become clustered.
The "workaround" is to create a new table that is properly partitioned / clustered and load the data into it from Google Cloud Storage (GCS). You can export the data from the original table into GCS first, so the whole process will be free of charge.
What I missed from the above answers was a real example, so here it goes:
bq update --clustering_fields=tool,qualifier,user_id my_dataset.my_table
Where tool, qualifier and user_id are the three columns I want the table to be clustered by (in that order) and the table is my_dataset.my_table.
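If you would rather do the same thing from Python, something along these lines should work with the google-cloud-bigquery client. This is a sketch, not tested against a real project: `set_clustering_fields` is a name I made up, and the table id and columns are the placeholders from the command above. As far as I understand the docs, changing the clustering spec only affects data written afterwards; existing data is not re-clustered by this call.

```python
def set_clustering_fields(client, table_id, fields):
    """Update the clustering spec of an existing table.

    `client` is assumed to be a google.cloud.bigquery.Client; the table id
    and column names are placeholders.
    """
    table = client.get_table(table_id)        # fetch current table metadata
    table.clustering_fields = list(fields)    # order of the columns matters
    # Send only the changed property back to the API.
    return client.update_table(table, ["clustering_fields"])


# Usage (needs the google-cloud-bigquery package and credentials):
#   from google.cloud import bigquery
#   set_clustering_fields(bigquery.Client(), "my_dataset.my_table",
#                         ["tool", "qualifier", "user_id"])
```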
Current Scenario
Datastore used: DynamoDB.
DB size: 15-20 MB
Problem: to store the data, I am thinking of using a single common hash value as the partition key (and the timestamp as the sort key), so that the complete table is stored in a single partition. This would give me undivided throughput for the table.
But I also intend to create GSIs for querying, so I was wondering whether it would be wrong to use GSIs with a single partition. I could also use LSIs.
Is this the wrong approach?
Under the hood, a GSI is basically just another DynamoDB table. It follows the same partitioning rules as the main table. Partitions in your primary table are not correlated with the partitions of your GSIs, so it doesn't matter whether your table has a single partition or not.
Using a single partition in DynamoDB is a bad architectural choice overall, but I would argue that for a 20 MB database it doesn't matter too much.
DynamoDB manages table partitioning for you automatically, adding new
partitions if necessary and distributing provisioned throughput
capacity evenly across them.
You cannot control which partition an item goes to when the partition key values differ.
I guess what you are going to do is use the same partition key value for all items, with a different sort key value (the timestamp) for each. In that case, I believe the data will be stored in a single partition, though I didn't understand your point about undivided throughput.
If you want to keep all the items of the index in a single partition as well, I think an LSI (Local Secondary Index) would be best suited here. An LSI basically gives you an alternate sort key for the same partition key.
A local secondary index maintains an alternate sort key for a given
partition key value.
If the single-partition rule does not need to apply to the index and you want a different partition key, then you need a GSI.
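To make the LSI option concrete, here is a rough sketch of the table definition as the keyword arguments you would hand to boto3's `create_table`. The attribute names (`pk`, `ts`), the index name and the helper `single_partition_table_params` are all made up for illustration, and on-demand billing is just an assumption to keep the example short. Note that an LSI has to be defined when the table is created.

```python
def single_partition_table_params(table_name, alt_sort_attr):
    """Build keyword arguments for boto3's DynamoDB create_table call.

    All attribute and index names here are made up for illustration.
    """
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "pk", "AttributeType": "S"},  # same value on every item
            {"AttributeName": "ts", "AttributeType": "N"},  # timestamp sort key
            {"AttributeName": alt_sort_attr, "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "pk", "KeyType": "HASH"},
            {"AttributeName": "ts", "KeyType": "RANGE"},
        ],
        # An LSI reuses the table's partition key with a different sort key.
        "LocalSecondaryIndexes": [
            {
                "IndexName": f"by_{alt_sort_attr}",
                "KeySchema": [
                    {"AttributeName": "pk", "KeyType": "HASH"},
                    {"AttributeName": alt_sort_attr, "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


params = single_partition_table_params("events", "status")
# With boto3 installed and credentials configured:
#   boto3.client("dynamodb").create_table(**params)
```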
We have a range-partitioned table with about 10 local bitmap indexes. As part of our daily load we perform some DDL/DML operations on that table: we truncate a specific partition and then load data into it. When we do this, the local bitmap indexes do not become unusable; they stay in USABLE status. My question is: even though the indexes do not become unusable, should we always include an index rebuild as a best practice for range-partitioned tables, or rebuild only when it is required? Index rebuilding takes time, and with 10 local indexes on a large table it becomes a costly step for the ETL.
Could you please share your suggestions or thoughts on this situation?
No, a rebuild of local indexes is not required; that is one of the main purposes of a local index.
A local partitioned index actually has a separate 'sub-index' for each partition, so each 'sub-index' can be managed independently of the other partitions. When you truncate a partition, the matching partitions of all its local indexes are truncated as well.
Oracle doc:
"You cannot truncate an index partition. However, if local indexes are
defined for the table, the ALTER TABLE ... TRUNCATE PARTITION
statement truncates the matching partition in each local index."
So when you load data into that partition, the local index partitions are populated again. However, the statistics on those indexes could be stale, and the optimizer may then decide not to use the index. So consider gathering statistics on those indexes if you don't already.
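As a rough illustration of the statistics step, the snippet below builds a PL/SQL block that calls DBMS_STATS.GATHER_TABLE_STATS for just the reloaded partition, with cascade so the index statistics are gathered too. It is a Python sketch that only produces the text; the helper name and the owner, table and partition names are placeholders I made up, and how you run the block (SQL*Plus, a driver, your ETL tool) is up to you.

```python
def gather_partition_stats_plsql(owner, table, partition):
    """Return a PL/SQL block that gathers stats for one partition.

    Owner, table and partition names are placeholders and are assumed to be
    trusted identifiers (no quoting or escaping is done here).
    """
    return (
        "BEGIN\n"
        "  DBMS_STATS.GATHER_TABLE_STATS(\n"
        f"    ownname     => '{owner}',\n"
        f"    tabname     => '{table}',\n"
        f"    partname    => '{partition}',\n"
        "    granularity => 'PARTITION',\n"
        "    cascade     => TRUE);\n"  # cascade also gathers index statistics
        "END;"
    )


print(gather_partition_stats_plsql("DWH", "SALES_FACT", "P_2024_01"))
```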
I'm working with table partitioning on an extremely large fact table in a warehouse. I have executed the script a few different ways, with and without nonclustered indexes. With the indexes in place, the split appears to dramatically expand the log file; without them, the log file does not expand as much, but the overall run takes longer because the indexes have to be rebuilt.
What I am looking for is any links or information on what is happening behind the scenes, specifically to the log file, when you split a table partition.
I think it isn't too hard to theorize what is going on (to a certain extent). Behind the scenes each partition is given a different HoBT, which in plain language means each partition is in effect sitting in its own hidden table.
So theorizing the splitting of a partition (assuming data is moving) would involve:
inserting the data into the new table
removing data from the old table
The effect on the NC indexes can be worked out too, but the reasoning changes depending on whether there is a clustered index or not. It also matters whether the index is partition-aligned or not.
Given a bit more information on the table (clustered or heap), we could theorize this further.
If the partition function is used by a
partitioned table and SPLIT results in
partitions where both will contain
data, SQL Server will move the data to
the new partition. This data movement
will cause transaction log growth due
to inserts and deletes.
This is from an article by Microsoft on Partitioned Table and Index Strategies
So it looks like it's doing a delete from the old partition and an insert into the new partition. This could explain the growth in the t-log.
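To get a feel for why that adds up, here is a toy model in Python. It is not how SQL Server is implemented; it just counts one insert and one delete record per moved row, which is the behaviour the quote describes. The function name and the RANGE RIGHT-style boundary are my own assumptions. My guess is that aligned nonclustered indexes add their own records for each moved row, which would fit the larger log growth seen with the indexes in place.

```python
def split_partition(rows, boundary):
    """Toy model of a SPLIT on a partition that already holds data.

    `rows` are the partition's key values; rows at or above `boundary`
    (RANGE RIGHT-style) move to the new partition. Returns the two
    partitions plus a list of simulated log records.
    """
    old, new, log = [], [], []
    for key in rows:
        if key >= boundary:
            new.append(key)
            log.append(("INSERT", "new_partition", key))
            log.append(("DELETE", "old_partition", key))
        else:
            old.append(key)  # rows that stay put generate no log records
    return old, new, log


old, new, log = split_partition(range(10), boundary=6)
print(len(new), "rows moved,", len(log), "log records")
```

Splitting at a boundary where one side is empty moves no rows in this model, which is why splits are usually done ahead of the data arriving.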