What's the probability of repeating a rowkey that depends on datetime in Azure

What's the probability of repeating a rowkey that depends on datetime in Azure - azure-storage

Is the there the chance to repeat a RowKey in an Azure table that is constructed like this?
string.Format("{0:d19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks)
This is the same to ask:
Will my webrole create two records before we get to a new value for Ticks?
I want to return the entities in reverse chronological order. This table is not expected to be constantly added new entities, but hopefully will have many insert transactions.
Solution
OK, this is the solution I came across thanks to Mark Seeman:
string.Format("{0:d19}+{1}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks, Guid.NewGuid().ToString("N"))
Using a Guid, I know for sure that the rowkey will not be repeated.

The probability of getting more than one identical rowkey is 1 as the number of concurrent writes approaches infinity :)
Seriously, a Tick is 100 nanoseconds. Modern processors run at speeds where you will have hundreds of CPU cycles per Tick, so getting identical Ticks will happen.
In fact, I was just talking to one of my customers yesterday. He had that exact problem and had had to hack around it to ensure uniqueness.
Consider introducing a truly unique component (such as GUID) as part of a (sortable) composite.

Related

ReducyByKeyAndWindowByCount in Spark stateful streaming aggegations

I've to inner join two relational tables extracted from Oracle.
Actually i want to perform 1-to-1 join to get one row per primary key with aggegated in list values from the second table. So before joining 1-to-1 two tables i have to reduce all my rows by key to a 1 with values kept in the list.
Here is the illustration of what i need:
[![tables aggregation][1]][1]
And here i've met a problem which is when to stop aggegation for my key and pass aggegated entity to the next step. Spark offers solutions for that by providing window intervals and watermaking for late data. And so assumption for keeping data consistency is the time it receives the data. It is feasible and applicable for infinite datasets but in my case i exactly know the count of aggegations for each key. For exampe for customer_id 1000 i know exactly that there are only 3 products and after i've aggegated 3 products i know that i can stop aggegation now and go to the next streaming step in my pipeline. How can this solution be implemented using Spark and streaming? I know there is reduceByKeyAndWindow operation but in my case i need something like reduceByKeyAndWindowByCount.
Count will be stored in a static dataset or simply store it in a row as an additional data.

Finally we've decided to switch from streaming to core spark with batch processing because we have finite dataset and that thing works well for our use case. We've came to a conclusion that spark streaming was designed for processing continuous (which was actually obvious only from it's naming) datasets. And thats why we have only window intervals by time and watermarks to correct network or other delays during transportation. We've also found our design with counters ugly, complex and in the other words bad. It is a live example of a bad design and such growing complexity was a marker that we were moving in the wrong direction and were trying to use a tool for a purpose it was not designed for.

Plone - ZODB catalog query sort_on multiple indexes?

I have a ZODB catalog query with a start and end date. I want to sort the result on end_date first and then start_date second.
Sorting on either end_date or start_date works fine.
I tried with a tuple (start_date,end_date), but with no luck.
Is there a way to achieve this or do one have to employ some custom logic afterwards?

The generalized answer ought to be post-hoc-sort of your entire result set of catalog brains, use zope.sequencesort (via PyPI, but already shipped with Plone) or similar.
The more complex answer is a rabbit-hole of optimizations that you should only go down if you know you need to and know what you are doing:
Make sure when you do sort the brains that your user gets a sticky session to the same instance, at least for cache-affinity to get the same catalog indexes and brains (metadata);
You might want to cache across requests (thread-global) a unique session id, and a sequence of catalog RID (integer) values for your entire sorted request, should you expect the user to come back and need in subsequent batches. Of course, RIDs need to be re-constituted into ZCatalog's lazy-sequences of brains, and this requires some know-how (or reading the source).
Finally, for large result (many thousands) sets, I would suggest that it is reasonable to make application-specific compromises that approximate correct by post-hoc sorting of the current batch through to the end of the n-batches after it, where n is inversely proportional to the len(site.portal_catalog.uniqueValuesFor(indexnamehere)). For a large set of results, the correctness of an approximated secondary-sort is high for high-variability, and low for low variability (many items with same secondary value, such that count is much larger than batch size can make this frustrating).
Do not optimize as such unless you are dealing with particularly large result sets.
It should go without saying: if you do optimize, you need to verify that you are actually getting a superior result (profile and benchmark). If you cannot justify investing the time to do this, you cannot justify optimizing.

Why database designers do not make IDENTITY columns start from the min value rather than 1?

As we know, In Sql Server, The IDENTITY (n,m) means that the values will start from n, and the increment value is m, but I noticed that all database designers make Identity columns as IDENTITY(1,1) , without taking advantage of all values of int data type which are from (-2,147,483,648) to (2,147,483,647),
I am planning to make all Identity columns as IDENTITY (-2,147,483,648, 1), (the identity columns are hidden from the application user).
Is that a good idea ?

If you find that 2billion values isn't enough, you're going to find out that 4billion isn't enough either (needing more than twice as many of anything over the lifetime of a project, than it was first designed for, is hardly rare*), so you need to take a different approach entirely (possibly long values, possibly something totally different).
Otherwise you're just being strange and unreadable for no gain.
Also, who doesn't have a database where they know that e.g. item 312 is the one with some nice characteristics for testing particular things? I know I have some arbitrary ids burned in my head. They may call it "so good they named it twice", but I'll always know New York as "city 657, covers most of our test cases". It's only a shorthand, but -2147482991 wouldn't be as handy.
*To add a bit to that. With some things you might say "ah about 100" and find it's actually 110, okay. With others you'll find actually it's actually 100,000 - you were out by orders of magnitude. The higher the number, the more often the mistake is of this sort due to the sort of problems that end up with estimates in the billions being different to those that end up with answers in the dozens. If you estimate 200 is your max in a given case, you should probably leave room for maybe a few hundred more. If you estimate 2billion in a given case, you should probably leave room for a few quadrillion more. That said, the only time I saw someone actually start an id at minus 2billion they ended up having about 3,000 rows.

If you have a class that represent your table in your code (which is very likely to happen), everytime you will create a new object it will be assigned the ID 0 by default. It could lead to mistakes that overwrite data in the database if the ID 0 is already assigned. This also makes it easy to determine if an object is new or if it came from the database by just doing if (myObject.ID != 0)

On the SQL Server side the negative ID-s are ok, handled like positive numbers, so you could do that.
The others are right, you should think about different suggestion, but the major problems are the applications connected to your database.
Let say take MS Enviroment. Here is an example:
.NET DataSet is using negative ID-s on autoincremented id-s to track changes in a code. So may you will have trouble, because:
The negative keys are used for temporary instances for the rows.
Here is the reference : MSDN
So definietly it is not a good idea to design a database like this for MSSQL in an MS Enviroment.

One of the 'nice' side effects of working with integers close to zero is that they are easy on the eye and easy for devs, testers etc to remember, especially with debugging and unit testing in mind.
Also, surrogate keys have a nasty habit of creeping into business terminology, e.g. users may be able to see the PK sitting in the URL querystring at the top of the browser - the more digits in the number, the more likely they are to misquote something in a helpdesk query.
So this is one of the reasons why I'm quite happy to seed my identities at 1, and not at -2147483231, and instead, as will, as #Jon suggests, move up to a BIGINT anytime that I may ever need more than 2 billion rows in my table.

As you transition from the negative numbers to positive ids, you will cross zero. That means (assuming you are actually inserting a couple of billion rows) that you will eventually have an identifier of zero. This is not intrinsically bad, but could present a potential edge-case for ORM tools or simply sloppy applicaiton code that has difficulty differentiating between a zero and a null.

Negative IDs can be useful for testing in a live environment where dummy data needs to be mixed with real data for testing purposes, then disposed of once the testing is complete. This should be done only with very good reason - I've only used the technique once.
Negative IDs can also be useful for Administrator purposes in single-row, read-only tables (i.e., no transactions, no executable SQL run on the table).
Aside from those specific purposes, identity values <= 0 will generate more heat than light.

And just to add, according to MS Docs (https://learn.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/implementing-identity-in-a-memory-optimized-table?view=sql-server-2014), IDENTITY(1, 1) is supported on a memory-optimized table. However, identity columns with definition of IDENTITY(x, y) where x != 1 or y != 1 are NOT supported on memory-optimized tables.
So in my opinion, and the other reasons alluded to by the other users, IDENTITY(1, 1) is more practical and memory efficient for memory-optimized tables.
Start with INT IDENTITY(1, 1), then when you have maxed-out you can follow these steps to 'upgraded' to BIGINT:
Drop current PK constraint:
ALTER TABLE [dbo].[tbName] DROP CONSTRAINT [PK_tbName_Id]
GO
Upgrade PK to BIGINT:
ALTER TABLE [dbo].[tbName] ALTER COLUMN [dbo.Id] BIGINT
Recreate the PK Constraint:
ALTER TABLE [dbo].[tbName] ADD CONSTRAINT [PK_tbName_Id] PRIMARY KEY
CLUSTERED
(
[Id] ASC
)
Hope this helps, and good lucky!

mysql - Creating rows vs. columns performance

I built an analytics engine that pulls 50-100 rows of raw data from my database (lets call it raw_table), runs a bunch statistical measurements on it in PHP and then comes up with exactly 140 datapoints that I then need to store in another table (lets call it results_table). All of these data points are very small ints ("40","2.23","-1024" are good examples of the types of data).
I know the maximum # of columns for mysql is quite high (4000+) but there appears to be a lot of grey area as far as when performance really starts to degrade.
So a few questions here on best performance practices:
1) The 140 datapoints could be, if it is better, broken up into 20 rows of 7 data points all with the same 'experiment_id' if fewer columns is better. HOWEVER I would always need to pull ALL 20 rows (with 7 columns each, plus id, etc) so I wouldn't think this would be better performance than pulling 1 row of 140 columns. So the question: is it better to store 20 rows of 7-9 columns (that would all need to be pulled at once) or 1 row of 140-143 columns?
2) Given my data examples ("40","2.23","-1024" are good examples of what will be stored) I'm thinking smallint for the structure type. Any feedback there, performance-wise or otherwise?
3) Any other feedback on mysql performance issues or tips is welcome.
Thanks in advance for your input.

I think the advantage to storing as more rows (i.e. normalized) depends on design and maintenance considerations in the face of change.
Also, if the 140 columns have the same meaning or if it differs per experiment - properly modeling the data according to normalization rules - i.e. how is data related to a candidate key.
As far as performance, if all the columns are used it makes very little difference. Sometimes a pivot/unpivot operation can be expensive over a large amount of data, but it makes little difference on a single key access pattern. Sometimes a pivot in the database can make your frontend code a lot simpler and backend code more flexible in the face of change.
If you have a lot of NULLs, it might be possible to eliminate rows in a normalized design and this would save space. I don't know if MySQL has support for a sparse table concept, which could come into play there.

You have a 140 data items to return every time, each of type double.
It makes no practical difference whether this is 1x140 or 20x7 or 7x20 or 4x35 etc. It could be infinitesimally quicker for one shape of course but then have you considered the extra complexity in the PHP code to deal with a different shape.
Do you have a verified bottleneck, or is this just random premature optimisation?

You've made no suggestion that you intend to store big data in the database, but for the purposes of this argument, I will assume that you have 1 billion (10^9) data points.
If you store them in 140 columns, you'll have a mere 7 millon rows, however, if you want to retrieve a single data point from lots of experiments, then it will have to fetch a large number of very wide rows.
These very wide rows will take up more space in your innodb_buffer_pool, hence you won't be able to cache so many; this will potentially slow you down when you access them again.
If you store one datapoint per row, in a table with very few columns (experiment_id, datapoint_id, value) then you'll need to pull out the same number of smaller rows.
However, the size of rows makes little difference to the number of IO operations required. If we assume that your 1 billion datapoints doesn't fit in ram (which is NOT a safe assumption nowadays), maybe the resulting performance will be approximately the same.
It is probably better database design to use few columns; but it will use less disc space and perhaps be faster to populate, if you use lots of columns.

Increase increment size to match GUID advantage

I've been thinking of implementing this system, but can't help but feel there's a catch somewhere. One of the points of using GUID over incrementing int is that, in the future, if you were to merge databases together, you wouldn't have any clashes over the primary key/identifier. However, my approach is to set the increment size to X where X is the number of servers I'll most likely have in the future. Then, on each server, have the seed be an increment over the seed number on the previous server. That way, during merging, there would be no clashes with the primary key. Is this a safe, normal method or have I gone mental :)?
Thanks

In multi-master SQL replication, you typically have primary keys defined as:
GUIDs
int's with a increment size > number of installs
int's with a fixed offset
The downside of GUIDs is they can be harder to read and take up slightly more space. However, it allows you to scale to n instances.
Integers are a bit easier to deal with. They also have the advantage of being able to easily tell which server created a record. The downside is you must either predict the maximum number of databases which might be merged, or guess the maximum number of rows a single instance might insert.
An example of a fixed offset is: site A starts at 0, site B starts at 1,000,000 and site C starts at 2,000,000. This scheme works fine until one site inserts one million rows. This scheme might work well for cars at car dealerships, where it's unlikely that any one dealer would ever sell more than 1,000,000 cars, and you might have hundreds of dealerships over the life of the application.

What scares me here is your use of "most likely". You're assuming on the future here, and typically that's not a good thing to do with things like this. Why not use a GUID?
What if you add one extra server over what you thought you'd have? I could see things getting really complicated really quickly.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

What's the probability of repeating a rowkey that depends on datetime in Azure - azure-storage

Related

ReducyByKeyAndWindowByCount in Spark stateful streaming aggegations

Plone - ZODB catalog query sort_on multiple indexes?

Why database designers do not make IDENTITY columns start from the min value rather than 1?

mysql - Creating rows vs. columns performance

Increase increment size to match GUID advantage

Categories

Resources