Should I initialize my AUTO_INCREMENT id column to 2^32+1 instead of 0? - sql

I'm designing a new system to store short text messages [sic].
I'm going to identify each message by a unique identifier in the database, and use an AUTO_INCREMENT column to generate these identifiers.
Conventional wisdom says that it's okay to start with 0 and number my messages from there, but I'm concerned about the longevity of my service. If I make an external API, and make it to 2^31 messages, some people who use the API may have improperly stored my identifier in a signed 32-bit integer. At this point, they would overflow or crash or something horrible would happen. I'd like to avoid this kind of foo-pocalypse if possible.
Should I run something like "ALTER TABLE message AUTO_INCREMENT = 4294967297;" (that is, 2^32+1) before I launch my service, forcing everyone to store my identifiers as signed 64-bit numbers from the start?

If you wanted to achieve your goal and avoid the problems that cletus mentioned, the solution is to set your starting value to 2^32+1. There are still plenty of IDs to go, and it won't fit in a 32-bit value, signed or otherwise.
Of course, documenting the value's range and providing guidance to your API or data customers is the only right solution. Someone's always going to try and stick a long into a char and wonder why it doesn't work (always)
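To make the "won't fit" claim concrete, here is a small Python sketch (Python is used only for illustration; the helper name `fits` is made up for this example). It checks whether the proposed starting value can be packed into 32-bit and 64-bit integer fields.

```python
import struct

FIRST_ID = 2**32 + 1  # 4294967297, the proposed starting value


def fits(fmt: str, value: int) -> bool:
    """Return True if value can be packed using the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


print("signed 32-bit:  ", fits("<i", FIRST_ID))
print("unsigned 32-bit:", fits("<I", FIRST_ID))
print("signed 64-bit:  ", fits("<q", FIRST_ID))
```

A client that stores the identifier in either 32-bit type fails on the very first message instead of years later.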

What if you provided a set of test suites or a test service that used messages in the "high but still valid" range, and persuaded your service users to use it to validate that their code is correct? Starting at an arbitrary value for defensive reasons is a little weird to me; providing sanity tests rubs me right.

Actually 0 can be problematic with many persistence libraries. That's because they use it as some sort of sentinel value (a substitute for NULL). Rightly or wrongly, I would avoid using 0 as a primary key value. Convention is to start at 1 and go up. With negative numbers you're likely just to confuse people for no good reason.

If everyone alive on the planet sent one message per second, every second, non-stop, your counter wouldn't wrap until around the year 2050 using signed 64-bit integers.
Just starting at 1 would probably be sufficient.
(But if you did start at the lower bound of the signed range, it would extend into the start of 2092.)
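The arithmetic behind those years can be reproduced with a short Python sketch. The population figure (7 billion) and the starting date (late 2008) are assumptions chosen for this illustration; the answer does not state which values it used.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
WORLD_POPULATION = 7e9  # assumption: roughly 7 billion senders
START_YEAR = 2008.7     # assumption: counting starts in late 2008


def wrap_year(id_space: int) -> float:
    """Year in which id_space identifiers run out at one message per person per second."""
    seconds_until_wrap = id_space / WORLD_POPULATION
    return START_YEAR + seconds_until_wrap / SECONDS_PER_YEAR


print("starting at 1:     ", int(wrap_year(2**63)))  # positive half of a signed 64-bit range
print("starting at -2**63:", int(wrap_year(2**64)))  # the full 64-bit range
```

Under these assumptions the two results land in 2050 and 2092 respectively, in line with the figures quoted above.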

Why use incrementing IDs? These require locking and will kill any plans for distributing your service over multiple machines. I would use UUIDs. API users will likely store these as opaque character strings, which means you can probably change the scheme later if you like.
If you want to ensure that messages have an order, implement the ordering like a linked list:
---
id: 61746144-3A3A-5555-4944-3D5343414C41
msg: "Hello, world"
next: 006F6F66-0000-0000-655F-444E53000000
prev: null
posted_by: jrockway
---
id: 006F6F66-0000-0000-655F-444E53000000
msg: "This is my second message EVER!"
next: 00726162-0000-0000-655F-444E53000000
prev: 61746144-3A3A-5555-4944-3D5343414C41
posted_by: jrockway
---
id: 00726162-0000-0000-655F-444E53000000
msg: "OH HAI"
next: null
prev: 006F6F66-0000-0000-655F-444E53000000
posted_by: jrockway
(As an aside, if you are actually returning the results as YAML, you can use & and * references instead of just using the IDs as data. Then the client will get the linked-list structure "for free".)

One thing I don't understand is why developers don't grasp that they don't need to expose their AUTO_INCREMENT field. For example, richardtallent mentioned using GUIDs as the primary key. I say do one better. Use a 64-bit integer for your table ID/primary key, but also use a GUID, or something similar, as your publicly exposed ID.
An example Message table:
Name | Data Type
-------------------------------------
Id | BigInt - Primary Key
Code | Guid
Message | Text
DateCreated | DateTime
Then your data looks like:
Id | Code                                 | Message | DateCreated
-------------------------------------------------------------------------------
1 | 81e3ab7e-dde8-4c43-b9eb-4915966cf2c4 | ....... | 2008-09-25T19:07:32-07:00
2 | c69a5ca7-f984-43dd-8884-c24c7e01720d | ....... | 2007-07-22T18:00:02-07:00
3 | dc17db92-a62a-4571-b5bf-d1619210245a | ....... | 2001-01-09T06:04:22-08:00
4 | 700910f9-a191-4f63-9e80-bdc691b0c67f | ....... | 2004-08-06T15:44:04-07:00
5 | 3b094cf9-f6ab-458e-965d-8bda6afeb54d | ....... | 2005-07-16T18:10:51-07:00
Where Code is what you would expose to the public, whether through a URL, service, CSV, XML, etc.
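A minimal sketch of that split between an internal key and a public code, using Python's built-in sqlite3 module as a stand-in for the real database. The column names follow the example table; the function names `add_message` and `get_message` are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE message (
           id INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal key, never exposed
           code TEXT NOT NULL UNIQUE,             -- public identifier (GUID as text)
           message TEXT NOT NULL,
           date_created TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
       )"""
)


def add_message(text: str) -> str:
    """Store a message and return the public code to hand to API users."""
    code = str(uuid.uuid4())
    conn.execute("INSERT INTO message (code, message) VALUES (?, ?)", (code, text))
    return code


def get_message(code: str):
    """Look a message up by its public code; the integer id stays internal."""
    row = conn.execute(
        "SELECT message FROM message WHERE code = ?", (code,)
    ).fetchone()
    return row[0] if row else None
```

Because clients only ever see the code, the internal id can stay a compact integer for joins and indexing.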

Don't want to be the next Twitter, eh? lol
If you're worried about scalability, consider using a GUID (uniqueidentifier) instead.
They are only 16 bytes (twice that of a bigint), but they can be assigned independently on multiple database or BL servers without worrying about collisions.
Since they are random, use NEWSEQUENTIALID() (in SQL Server) or a COMB technique (in your business logic or pre-MSSQL 2005 database) to ensure that each GUID is "higher" than the last one (speeds inserts into your table).
If you start with a number that high, some "genius" programmer will either subtract 2^32 to squeeze it in an int, or will just ignore the first digit (which is "always the same" until you pass your first billion or so messages).
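For the COMB technique mentioned above, here is a rough Python sketch. It assumes, as COMB implementations for SQL Server generally do, that the database gives the most weight to the trailing 6 bytes when ordering GUIDs, so a timestamp is placed there and the remaining 10 bytes are random. The function name `comb_guid` is made up, and the result is not a standards-compliant RFC 4122 UUID.

```python
import os
import time
import uuid


def comb_guid(timestamp_ms=None) -> uuid.UUID:
    """Build a 16-byte value: 10 random bytes followed by a 6-byte millisecond timestamp."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    random_part = os.urandom(10)
    # Big-endian, so that later timestamps compare greater byte by byte.
    time_part = (timestamp_ms % 2**48).to_bytes(6, "big")
    return uuid.UUID(bytes=random_part + time_part)


print(comb_guid())
```

Values generated later carry a larger trailing timestamp, which is what keeps new rows near the end of the index.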

Related

REDIS usecase using large keys with small values

I have a use-case for using redis that is a little bit different.
In my MySQL database I have an entity, let's call it HumanEntity. This HumanEntity has many-to-many relations:
HumanEntity.Urls - Many URLs per HumanEntity.
HumanEntity.UserNames - Many UserNames per HumanEntity.
HumanEntity.Phones ...
HumanEntity.Emails ...
In a typical hour, the application creates hundreds of these values.
The use case is that the application receives HTTP calls (about 100 per second), each with a HumanEntity value (URL, UserName, Phone or Email).
I need to scan my MySQL table (1,000,000 records) and return the HumanEntity.Id (integer).
Since it's OK to have some latency in the data integrity, I thought about Redis.
Can I store the values as Redis keys and the HumanEntity.Id (integer) as the value?
My API needs to return the HumanEntity.Id (integer).
Does it make sense to have such a long key and such a short value? The URL, for example, may be 1500 bytes and the value can be 1 byte.
What is the best Redis method to implement that?
Thanks
If the values are not unique, you may have a problem. Phones, emails or usernames may be unique per user, but I am not sure about URLs or any other property stored in your database. You may overwrite the value of an identifier with another user's.
If you don't have that problem, you may proceed with the string type. The time complexity of GET and SET is O(1), which is the best you can get.
In some cases, such as checking whether a user has used a coupon, you may use a long user id (let's say 64 characters) as the key and 1 as the value, and use EXISTS to check it. So it is valid to use a long key and a short value.
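The key layout and the overwrite concern can be sketched in Python. To keep the example self-contained, a plain dict stands in for Redis string keys, so no Redis server or client library is involved; the function names and the "kind:value" key format are assumptions made for this illustration. With a real Redis instance the corresponding commands would be SET (optionally with NX) and GET.

```python
# A plain dict standing in for Redis string keys.
store = {}


def make_key(kind: str, value: str) -> str:
    """Prefix the key with the attribute type so a phone never collides with a username."""
    return f"{kind}:{value}"


def remember(kind: str, value: str, human_entity_id: int) -> bool:
    """Map a value to an entity id; refuse to overwrite a different entity's mapping."""
    key = make_key(kind, value)
    if key in store and store[key] != human_entity_id:
        return False  # value is not unique: it already points at another entity
    store[key] = human_entity_id
    return True


def lookup(kind: str, value: str):
    """Return the entity id for a value, or None if it is unknown."""
    return store.get(make_key(kind, value))
```

A rejected write is the signal that the value is shared between entities and needs a different structure than one key per value.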

Create master table for status column

I have a table that represents a request sent through the frontend:
coupon_fetching_request
---------------------------------------------------------------
request_id | request_time | requested_by | request_status
Above I tried to create a table to address the issue.
Here request_status is an integer. It could have some values as follows.
1 : request successful
2 : request failed due to incorrect input data
3 : request failed in otp verification
4 : request failed due to internal server error
That table is very simple, and the status is used to let the frontend know what happened to the request that was sent. I had a discussion with my team, and the other developers proposed that we should have a status representation table. On the database side we are not going to need this status, but the team said that in the future we may need to show simple output from the database listing the status of all requests. According to the YAGNI principle, I don't think it is a good idea.
Currently I have code that converts the returned request_status value to a descriptive value at the frontend. I tried to convince the team that I could create an enumeration at the business layer to represent the meaning of the status, or add documentation at the frontend and in Java, but failed to convince them.
The table proposed is as follows
coupon_fetching_request_status
---------------------------------------------------
status_id | status_code | status_description
My question is: is it necessary to create a table for such a simple status in cases like this?
I tried to create a simple example to illustrate the problem. In reality, the table represents a Discount Coupon Code Request, with the status indicating whether the code was successfully fetched.
It really depends on your use case.
To start with: in your main table, you are already storing request_status as an integer, which is a good thing (storing the whole description, like 'request successful', in every row would waste space).
The main question is: will you eventually need to display that data in a human-readable format?
If no, then it is probably useless to create a representation table.
If yes, then having a representation table would be a good thing, instead of adding some code in the presentation layer to do the transcodification; let the data live in the database, and the frontend take care of presentation only.
Since this table can be easily created when needed, a pragmatic approach would be to hold on until you have a real need for the representation table.
You should create the reference table in the database. You currently have business logic on the application side, interpreting data stored in the database. This seems dangerous.
What does "dangerous" mean? It means that ad-hoc queries on the database might need to re-implement the logic. That is prone to error.
It means that if you add a reporting front end, then the reports have to re-implement the logic. That is prone to error and a maintenance nightmare.
It means that if you have another developer come along, or another module implemented, then the logic might need to be re-implemented. Red flag.
The simplest solution is to have a reference table to define the official meanings of the codes. The application should use this table (via join) to return the strings. The application should not be defining the meaning of codes stored in the database. YAGNI doesn't apply, because the application is so in need of this information that it implements the logic itself.
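As an illustration of the reference-table approach, here is a sketch using Python's built-in sqlite3 module in place of the real database. The table and column names come from the question; the status_code strings and the two sample request rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE coupon_fetching_request_status (
    status_id INTEGER PRIMARY KEY,
    status_code TEXT NOT NULL UNIQUE,      -- hypothetical short codes
    status_description TEXT NOT NULL
);
CREATE TABLE coupon_fetching_request (
    request_id INTEGER PRIMARY KEY,
    request_time TEXT NOT NULL,
    requested_by TEXT NOT NULL,
    request_status INTEGER NOT NULL
        REFERENCES coupon_fetching_request_status (status_id)
);
INSERT INTO coupon_fetching_request_status VALUES
    (1, 'SUCCESS', 'request successful'),
    (2, 'BAD_INPUT', 'request failed due to incorrect input data'),
    (3, 'OTP_FAILED', 'request failed in otp verification'),
    (4, 'SERVER_ERROR', 'request failed due to internal server error');
-- Made-up sample rows.
INSERT INTO coupon_fetching_request VALUES
    (100, '2020-01-01 10:00:00', 'alice', 1),
    (101, '2020-01-01 10:05:00', 'bob', 3);
""")

# The meaning of each status comes from the database, not from application code.
rows = conn.execute("""
    SELECT r.request_id, s.status_description
    FROM coupon_fetching_request AS r
    JOIN coupon_fetching_request_status AS s
      ON s.status_id = r.request_status
    ORDER BY r.request_id
""").fetchall()
print(rows)
```

Any ad-hoc query or reporting tool can reuse the same join, so the descriptions are defined in exactly one place.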

Error itgensql005 when fetching serial number from Exact Online GoodsDeliveryLines for upload to Freshdesk ticket

I want to exchange information between ExactOnline and Freshdesk based on deliveries (Exact Online Accounts -> Freshdesk Contacts, Exact Online deliveries -> Freshdesk tickets).
The serial number of delivered goods is available neither in the ExactOnlineREST..GoodsDeliveryLines table nor in ExactOnlineXML..DeliveryLines.
The following query lists all columns that are also documented on Exact Online REST API GoodsDeliveryLines:
select * from goodsdeliverylines
All other fields in the REST API documentation are included in GoodsDeliveryLines; only serial numbers and batch numbers are missing.
I've tried selecting the column explicitly, as with the ExactOnlineXML tables, where columns only come into existence when actually specified:
select stockserialnumbers from goodsdeliverylines
This raises however an error:
itgensql005: Unknown identifier 'stockserialnumbers'.
How can I retrieve the serial numbers?
StockSerialNumbers is an array. The Exact Online documentation describes it as:
Collection of batch numbers
So for every delivery line, there can be 0, 1 or more serial numbers included.
These serial numbers were not available till some time ago; please make sure you upgrade to at least build 16282 of the Exact Online SQL provider. It should work then using a query on a separate table:
select ssrdivision
, ssritemcode
, ssrserialnumber
from GoodsDeliveryLineSerialNumbers
Output:
ssrdivision | ssritemcode | ssrserialnumber
----------- | ----------- | ---------------
868,035 | OUT30074 | 132
868,035 | OUT30074 | 456
Use of serial numbers may require additional Exact Online modules such as "Trade", but if you can see them in the web user interface, you already have them. If you get an HTTP 401 Unauthorized, you don't have the module for serial numbers.
Since stockserialnumbers is actually a list and not a single field, you have to query it using the entity GoodsDeliveryLineSerialNumbers, which you can find in the latest release.
select * from GoodsDeliveryLineSerialNumbers
If you execute the above query, you will get the fields for GoodsDeliveryLine and those of the underlying serial numbers. The latter fields are prefixed with Ssr to disambiguate both entities. This means you don't need an additional join on GoodsDeliveryLine, which may benefit performance.

How predictable is NEWSEQUENTIALID?

According to Microsoft's documentation on NEWSEQUENTIALID, the output of NEWSEQUENTIALID is predictable. But how predictable is predictable? Say I have a GUID that was generated by NEWSEQUENTIALID, how hard would it be to:
Calculate the next value?
Calculate the previous value?
Calculate the first value?
Calculate the first value, even without knowing any GUID's at all?
Calculate the amount of rows? E.g. when using integers, /order?id=842 tells me that there are 842 orders in the application.
Below is some background information about what I am doing and what the various tradeoffs are.
One of the security benefits of using GUID's over integers as primary keys is that GUID's are hard to guess. E.g. say a hacker sees a URL like /user?id=845 he might try to access /user?id=0, since it is probable that the first user in the database is an administrative user. Moreover, a hacker can iterate over /user?id=0..1..2 to quickly gather all users.
Similarly, a privacy downside of integers is that they leak information. /order?id=482 tells me that the web shop has had 482 orders since its implementation.
Unfortunately, using GUID's as primary keys has well-known performance downsides. To this end, SQL Server introduced the NEWSEQUENTIALID function. In this question, I would like to learn how predictable the output of NEWSEQUENTIALID is.
The underlying OS function is UuidCreateSequential. The value is derived from the MAC address of one of your network cards and a per-OS-boot incremental value. See RFC 4122. SQL Server does some byte-shuffling to make the result sort properly. So the value is highly predictable, in a sense. Specifically, if you know one value you can immediately predict a range of similar values.
However, one cannot predict the equivalent of id=0, nor can one infer that 52DE358F-45F1-E311-93EA-00269E58F20D means the store sold at least 482 items.
The only 'approved' random generation is CRYPT_GEN_RANDOM (which wraps CryptGenRandom) but that is obviously a horrible key candidate.
In most cases, the next newsequentialid can be predicted by taking the current value and adding one to the first hex pair.
In other words:
1E29E599-45F1-E311-80CA-00155D008B1C
is followed by
1F29E599-45F1-E311-80CA-00155D008B1C
is followed by
2029E599-45F1-E311-80CA-00155D008B1C
Occasionally, the sequence will restart from a new value.
So, it's very predictable
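The "add one to the first hex pair" rule can be sketched in Python. The sketch assumes the first four displayed bytes behave as a little-endian counter, so that an overflow of the first pair carries into the second pair; that carry behaviour is an assumption, not something confirmed here. The function name `predict_next` is made up.

```python
import uuid


def predict_next(guid_text: str) -> str:
    """Guess the GUID that follows guid_text by incrementing its first hex pair."""
    raw = bytearray(uuid.UUID(guid_text).bytes)  # bytes in display order
    # Assumption: the first four bytes form a little-endian counter.
    counter = int.from_bytes(raw[:4], "little")
    counter = (counter + 1) % 2**32
    raw[:4] = counter.to_bytes(4, "little")
    return str(uuid.UUID(bytes=bytes(raw))).upper()


print(predict_next("1E29E599-45F1-E311-80CA-00155D008B1C"))
```

This does not account for the occasional restart of the sequence from a new value, so it only predicts neighbours within one run.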
NewSequentialID is a wrapper around the windows function UuidCreateSequential
You can try this code:
DECLARE @tbl TABLE (
PK uniqueidentifier DEFAULT NEWSEQUENTIALID(),
Num int
)
INSERT INTO @tbl(Num) values(1),(2),(3),(4),(5)
select * from @tbl
On my machine, the result at the time was:
PK Num
52DE358F-45F1-E311-93EA-00269E58F20D 1
53DE358F-45F1-E311-93EA-00269E58F20D 2
54DE358F-45F1-E311-93EA-00269E58F20D 3
55DE358F-45F1-E311-93EA-00269E58F20D 4
56DE358F-45F1-E311-93EA-00269E58F20D 5
You should try it several times, at different times and dates, to work out the behaviour.
I ran it several times, and the first part changes every time (as you can see in the results: 52..., 53..., 54..., etc.). I waited a while and checked again, and after some time the second part is incremented too. I suppose the incrementing carries on through all the parts. Basically, it looks like a simple += 1 increment transformed into a GUID.
EDIT:
If you want sequential GUIDs and you want to have control over the values, you can use sequences.
Sample code:
select cast(cast(next value for [dbo].[MySequence] as varbinary(max)) as uniqueidentifier)
• Calculate the next value? Yes
Microsoft says:
If privacy is a concern, do not use this function. It is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.
So it is possible to get the next value. I couldn't find any information on whether it is possible to get the previous one.
from: http://msdn.microsoft.com/en-us/library/ms189786.aspx
edit: another few words about NEWSEQUENTIALID and security: http://vadivel.blogspot.com/2007/09/newid-vs-newsequentialid.html
Edit:
NewSequentialID contains the server's MAC address (or one of them), therefore knowing a sequential ID gives a potential attacker information that may be useful as part of a security or DoS attack.
from: Are there any downsides to using NewSequentialID?

Storing IP addresses efficiently, for faster lookups and insertions (proxy checking)

I'm writing a small Python 3 program that is supposed to test the validity of a large number of proxies, and I want to reorganize the data so I can quickly look up an IP, test it via curl, and write to the database whether it works, along with a timestamp.
With about 50,000 rows, the 'simple way' takes too long, so I need some clever way of searching through the IPs.
I'm new to SQL, but if I were to do it in some programming language, I would make something like this:
| IP_BYTE1 | IP_BYTE2 | IP_BYTE3 | IP_BYTE4 | TIMESTAMP | WORKS |
and then search 'left to right'.
Can anyone help me with creation of such a table and algorithm for fast lookup/insertion?
The simple way is to store them in a table using your favorite data type (varchar or int) and then build an index on them.
If you are looking for different classes of IP addresses, then you might want to break them into separate pieces. Are you generally looking at class D addresses? Or do you also need to look at classes A, B, and C?
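A minimal sketch of the "store it with an index" suggestion in Python 3, using the standard sqlite3 and ipaddress modules. It assumes IPv4 addresses stored as integers, with the primary key providing the index; the table name `proxy` and the function names are made up for this example.

```python
import ipaddress
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE proxy (
           ip INTEGER PRIMARY KEY,   -- IPv4 address as an integer; the primary key is indexed
           works INTEGER NOT NULL,   -- 1 if the last check succeeded, else 0
           checked_at REAL NOT NULL  -- Unix timestamp of the last check
       )"""
)


def record_result(ip_text: str, works: bool, checked_at=None) -> None:
    """Insert or update the latest test result for one proxy address."""
    ip_int = int(ipaddress.IPv4Address(ip_text))
    when = time.time() if checked_at is None else checked_at
    conn.execute(
        "INSERT OR REPLACE INTO proxy (ip, works, checked_at) VALUES (?, ?, ?)",
        (ip_int, int(works), when),
    )


def lookup(ip_text: str):
    """Return (works, checked_at) for an address, or None if it was never tested."""
    ip_int = int(ipaddress.IPv4Address(ip_text))
    return conn.execute(
        "SELECT works, checked_at FROM proxy WHERE ip = ?", (ip_int,)
    ).fetchone()
```

With the address as an indexed integer key, a lookup is a single index search, so there is no need to split the address into four byte columns by hand.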