How to get last update time of each partition in a table in Azure Cosmos DB - azure-storage

I am storing a table in Azure Cosmos DB which has a lot of partitions. I want to query content only for those partitions which had an update in the last X minutes. Is there a way to efficiently query the list of partition keys that were updated in the last X minutes? Cross-partition queries can turn out to be really expensive.

One way to do this efficiently is to store the last update time in its own container with a single logical partition and a composite index with a descending sort on your lastUpdated property.
{
    "id": "xxxx",
    "pk": "0000",
    "pkValue": "pk value from other container",
    "lastUpdated": "2020-06-21T23:14:25.7251173Z"
}
Make sure you use the ISO 8601 UTC format for datetime strings; see DateTime in Cosmos DB. Also see the docs on Composite Indexes.
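With that layout, finding the partitions touched in the last X minutes becomes a single-partition query against the tracking container. A minimal sketch, assuming the caller computes @cutoff as an ISO 8601 UTC string for "now minus X minutes" (the parameter name and exact query shape are assumptions, not part of the original answer):

SELECT c.pkValue, c.lastUpdated
FROM c
WHERE c.pk = "0000" AND c.lastUpdated >= @cutoff
ORDER BY c.pk ASC, c.lastUpdated DESC

The returned pkValue list can then drive targeted single-partition queries against the main container instead of one large cross-partition query.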
The question to answer, though, is whether this is more cost-efficient than doing cross-partition queries. If, as you say, there is a lot of data across many physical partitions, then it likely is - but that is something you will have to measure for your own workload.
Hope this is helpful.

Related

Creative use of date partitions

I have some data that I would like to partition by date, and also partition by an internally-defined client id.
Currently, we store this data using the table-per-date model. It works well, but querying individual client ids is slow and expensive.
We have considered creating a table per client id and using date partitioning within those tables. The only issue is that this would force us to incur thousands of load jobs per day, and it would also require the data to be partitioned by client id in advance.
Here is a potential solution I came up with:
- Stick with the table-per-date approach (e.g. log_20170110)
- Create a dummy date column which we use as the partition date, and set that date to <client id>-01-01 (e.g. for client id 1235, set _PARTITIONTIME to 1235-01-01)
This would allow us to load data per-day, as we do now, would give us partitioning by date, and would leverage the date partitioning functionality to partition by client id. Can you see anything wrong with this approach? Will BigQuery allow us to store data for the year 200, or the year 5000?
PS: We could also use a scheme that pushes the dates past zero Unix time, e.g. add 2000 to the year, or push the last two digits into the month and day, e.g. 1235 => 2012-03-05.
Will BigQuery allow us to store data for the year 200, or the year 5000?
Yes, any date between 0001-01-01 and 9999-12-31.
So, formally speaking, this is an option (and, by the way, how well it works depends on how many clients you plan to have or already have).
See more about same idea at https://stackoverflow.com/a/41091896/5221944
In the meantime, I would expect BigQuery to soon have the ability to partition by an arbitrary field. Maybe at NEXT 2017 - just guessing :o)
The suggested idea is likely to create some performance issues for queries as the number of partitions increases. Generally speaking, date partitioning works well with up to a few thousand partitions.
client_ids are generally unrelated to each other and are ideal for hashing. While we work towards supporting richer partitioning flavors, one option is to hash your client_ids into N buckets (~100?) and have N partitioned tables. That way you can query across your N tables for a given date. Using, for example, 100 tables would drop the cost down to 1% of what it would be with 1 table holding all the client_ids. It should also scan a small number of partitions, improving performance accordingly. Unfortunately, this approach doesn't address the concern of putting the client ids in the right table (that has to be managed by you).
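A hedged sketch of the bucketing idea in BigQuery Standard SQL, assuming client_id is numeric, 100 buckets, and a log_bucket_NN naming convention for the date-partitioned tables (none of which is prescribed by the answer):

-- Pick the bucket for each row at load time; FARM_FINGERPRINT is one
-- hashing option, MOD keeps unrelated client ids evenly spread.
SELECT MOD(ABS(FARM_FINGERPRINT(CAST(client_id AS STRING))), 100) AS bucket
FROM source_data;

-- At query time, compute the same bucket for the client id you need,
-- then query only that bucket's date-partitioned table:
SELECT *
FROM log_bucket_42
WHERE _PARTITIONTIME = TIMESTAMP('2017-01-10')
  AND client_id = 1235;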

Indexes vs Partitions in hive

How are indexes in Hive different from partitions? Both improve query performance, as far as I know, so in what way do they differ?
In what situations would I use indexing versus partitioning?
Can I use them together?
Kindly suggest.
Partitions allow users to store data files in different HDFS directories (based on a chosen parameter - date, for example, if you want to store your data files by date), thus minimizing the number of files to scan when users run queries.
Indexes help fetch data faster, but they require index tables to be built, in which the data to be indexed is stored. This leads to storing the data twice.
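For concreteness, this is roughly what building such an index looks like in HiveQL; the table and index names are made up, and note that Hive indexes were removed in Hive 3.0, so this only applies to older releases:

-- Hive materializes the index as its own index table,
-- which is the "storing the data twice" cost mentioned above.
CREATE INDEX idx_sale_date
ON TABLE sales (sale_date)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate (and later refresh) the index table.
ALTER INDEX idx_sale_date ON sales REBUILD;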
Partition:
Think of a table keeping the transactions created by your applications. This table gets bigger day by day.
If you partition this table on a daily interval, the database creates something like a separate table for each day, but you still see only one table. It makes your daily queries more effective.
Index:
An index is used to access your table records quickly.
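A minimal HiveQL sketch of the daily-partitioned transactions table described above (all names are hypothetical):

-- One HDFS directory per day; queries that filter on txn_date
-- only scan the matching directories (partition pruning).
CREATE TABLE transactions (
  txn_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (txn_date STRING);

-- Load a single day's files into its partition.
LOAD DATA INPATH '/staging/2017-01-10'
INTO TABLE transactions PARTITION (txn_date = '2017-01-10');

-- Only the 2017-01-10 directory is scanned here.
SELECT SUM(amount) FROM transactions WHERE txn_date = '2017-01-10';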

Organizing lots of timestamped values in a DB (sql / nosql)

I have a device I'm polling for lots of different fields every x milliseconds;
the device returns a list of ids and values which I need to store with a timestamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or query the last timestamp for each value.
A simple approach would be to define a MySQL table with
id,value_id,timestamp,value
and let users select
SELECT value FROM t WHERE value_id = x ORDER BY timestamp DESC LIMIT 1
and just push everything there with indexes on timestamp and id. But my question is: what's the best approach, performance- and size-wise, for designing the schema? Or should we use NoSQL? Can anyone comment on possible design trade-offs? Will such a design scale to millions of records?
When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have [timestamp] column in the WHERE clause, you do not need an index on [value] column if there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
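A sketch of that schema in MySQL with the indexes discussed above; the table, index, and column names are illustrative (timestamp is renamed ts here to avoid confusion with the type name):

-- Readings table; the composite index serves both
-- "latest value for a given value_id" and per-value time-range scans.
CREATE TABLE readings (
  id        BIGINT NOT NULL AUTO_INCREMENT,
  value_id  INT NOT NULL,
  ts        DATETIME(3) NOT NULL,
  value     DOUBLE NOT NULL,
  PRIMARY KEY (id),
  KEY idx_value_ts (value_id, ts),
  KEY idx_ts (ts)   -- only needed if queries filter on time alone
);

-- Latest reading for one value_id, resolved from idx_value_ts.
SELECT value
FROM readings
WHERE value_id = 42
ORDER BY ts DESC
LIMIT 1;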
EDIT:
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
If you rarely need to sort multiple products but are most often selecting based on a single product-code, then a non-traditional DBMS that uses a hashed-key sparse-table rather than a b-tree might be a very viable even a superior alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
YET ANOTHER EDIT SINCE THE SITE IS NOT LETTING ME ADD ANOTHER ANSWER:
The simplest way is to have two tables: one with the ongoing history, which always has new values inserted, and the other containing only 250 records, one per part, where the latest value overwrites/replaces the previous one.
UPDATE latest
SET value = x
WHERE id = ?
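In MySQL, the two-table idea can be sketched like this (names are hypothetical); the upsert keeps the small latest table at exactly one row per id:

-- Small table: exactly one row per id, overwritten on every poll.
CREATE TABLE latest (
  id    INT NOT NULL PRIMARY KEY,
  ts    DATETIME(3) NOT NULL,
  value DOUBLE NOT NULL
);

-- Insert-or-overwrite in one statement instead of the plain UPDATE above.
INSERT INTO latest (id, ts, value)
VALUES (42, NOW(3), 3.14)
ON DUPLICATE KEY UPDATE ts = VALUES(ts), value = VALUES(value);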
You have a choice of:
- indexes (composite; covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes, composite and non-composite; also be aware that there are quite a few significantly different ways to get 'max per group' (search SO, especially the MySQL version with variables)
- triggers - you might use triggers to maintain max row values in another table (best performance for further selects; this is redundant data and could be kept in memory); see the sketch after this list
- lazy statistics/triggers - since your database is updated quite often, you can save cycles if you update your statistics periodically (if you can allow the stats to be y seconds old and you poll 1000 / x times a second, then you potentially save y * 1000 / x updates; this can be noticeable, especially in terms of scalability)
The above is true if you are looking for the last bit of performance; if not, keep it simple.
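And a hedged sketch of the trigger option from the list above, assuming the readings and latest tables from the earlier sketches:

-- After every insert into the history table, mirror the row into latest.
DELIMITER //
CREATE TRIGGER trg_readings_latest
AFTER INSERT ON readings
FOR EACH ROW
BEGIN
  INSERT INTO latest (id, ts, value)
  VALUES (NEW.value_id, NEW.ts, NEW.value)
  ON DUPLICATE KEY UPDATE ts = NEW.ts, value = NEW.value;
END//
DELIMITER ;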

SQL Server : XML : Database Table Optimization

I have a SQL Server DB which has a table that logs every exception with details, along with 2 XMLs (1 for the request, 1 for the response).
These 2 XMLs are compressed.
Now, as the data volume is high, I need to clean the table every 3-4 months.
What optimization techniques can I use to avoid these data clean-ups?
Create indexes on all columns that require searching.
Run OPTIMIZE TABLE tablename (or the equivalent for your RDBMS) through a cron job every day.
Probably the best thing you can look into is table partitioning, which will allow you to quickly remove data when it needs to age out. Also, make sure that you cluster your index on a monotonically increasing value (either a surrogate identity value or a datetime column, like dateofreceipt); this will reduce fragmentation of the clustered index.
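A hedged SQL Server sketch of that suggestion: partition the log table by month on the receipt date, so aging out a month becomes a metadata operation (a partition SWITCH) rather than a large DELETE. All object names, and the use of DateOfReceipt as both the clustering and partitioning key, are assumptions for illustration:

-- Monthly partition function/scheme on the log date.
CREATE PARTITION FUNCTION pf_log_month (datetime)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME ps_log_month
AS PARTITION pf_log_month ALL TO ([PRIMARY]);

-- Cluster on the partitioning column (monotonically increasing).
CREATE TABLE dbo.ExceptionLog (
  LogId          BIGINT IDENTITY(1,1) NOT NULL,
  DateOfReceipt  DATETIME NOT NULL,
  RequestXml     VARBINARY(MAX) NULL,   -- compressed request XML
  ResponseXml    VARBINARY(MAX) NULL,   -- compressed response XML
  CONSTRAINT PK_ExceptionLog PRIMARY KEY CLUSTERED (DateOfReceipt, LogId)
) ON ps_log_month (DateOfReceipt);

-- Age out the oldest month by switching it to a staging table, then drop/archive it.
-- (The staging table must have an identical structure and sit on the same filegroup.)
ALTER TABLE dbo.ExceptionLog SWITCH PARTITION 1 TO dbo.ExceptionLog_Archive;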

SQL 2005 DB Partitioning for SharePoint

Background
I have a massive DB for a SharePoint site collection. It is 130GB and growing at 10GB per month. 100GB of the 130GB is in one site collection. 30GB is the version table. There is only one site collection - this is by design.
Question
Am I able to partition a database (SharePoint) using SQL 2005's data partitioning features (creating multiple data files)?
Is it possible to partition a database that is already created?
Has anyone partitioned a SharePoint DB? Will I encounter any issues?
You would have to create a partition set and rebuild the table on that partition set. SQL 2005 can only partition on a single column, so you would have to have a column in the DB that:
- behaves fairly predictably, so you don't get a large skew in the amount of data in each partition
- IIRC, is a numeric or datetime value
- in practice is easiest to work with if it's monotonically increasing - you can create a series of partitions (automatically or manually) and the system will fill them up as it gets to the range definitions.
A date (perhaps the date the document was entered) would be ideal. However, you may or may not have a useful column on the large table. M.S. tech support would be the best source of advice for this.
The partitioning should be transparent to the application (again, you need a column with appropriate behaviour to use as a partition key).
Unless you are lucky enough to have a partition key column that is also used as a search predicate in the most common queries, you may not get much query performance benefit from the partitioning. An example of a column that works well is a date column on a data warehouse. However, your SharePoint application may not make extensive use of this sort of query.
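Purely to illustrate the mechanics described above (a partition function on a datetime column, a scheme spread across filegroups, and rebuilding the clustered index onto it) - every object name here is hypothetical, and whether doing this to a SharePoint content database is sensible or supported is exactly the question for M.S. tech support raised above:

-- Range partition function on a datetime column, and a scheme spreading
-- the ranges across filegroups (FG2006..FG2008 must already exist).
CREATE PARTITION FUNCTION pf_docs_by_year (datetime)
AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01');

CREATE PARTITION SCHEME ps_docs_by_year
AS PARTITION pf_docs_by_year TO (FG2006, FG2007, FG2008);

-- "Rebuild the table on that partition set": recreate the clustered index
-- on the partition scheme, keyed by the (hypothetical) date column.
-- DROP_EXISTING assumes a clustered index of the same name already exists.
CREATE CLUSTERED INDEX cix_docs_timecreated
ON dbo.Docs (TimeCreated)
WITH (DROP_EXISTING = ON)
ON ps_docs_by_year (TimeCreated);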
Mauro,
Is there no way you can segment the data at the SharePoint level?
i.e. you may have multiple "sites" using a single (SQL) content database.
You could migrate site data to a new content database, which would allow you to reduce the data in that large content database and then shrink the data files.
It will also assist you in managing your obvious continued growth.
James.