Swedish "personnummer" (personal identity number) in SQL - sql

This is a specific instance of an old problem: How to store "numbers" (e.g. phone numbers, IP addresses, social security numbers) in SQL databases?
Background: In Sweden, Personal Identity Numbers ("personnummer") are extremely common: You use them when communicating with the government, the bank, your employer, etc. People born in Sweden are assigned them when born. My immigrant friends lament the dark couple of weeks before they got a personnummer and could finally get a debit card and start looking for jobs.
My organization needs to store personnummer of our members. We have a SQL database for this. How should I store the data?
From Wikipedia, regarding the format of a personnummer:
The personal identity number consists of 10 digits and a hyphen. The first six correspond to the person's birthday, in YYMMDD form. They are followed by a hyphen. People over the age of 100 replace the hyphen with a plus sign. The seventh through ninth are a serial number. An odd ninth number is assigned to males and an even ninth number is assigned to females. Some county authorities, such as Stockholm, and some banks, have started using 12 digit numbers to allow YYYYMMDD. This format is also used on some Swedish ID-cards[clarification needed] and on the Swedish European Health Insurance Cards but not on state-issued identity documents.
The tenth digit is a checksum which was introduced in 1967 when the system was computerized.
So, a personnummer could be "120101-3842" for a person born this year. This is also commonly formatted as "20120101-3842" because of Y2K and "replacing the hyphen with a plus sign" is not well-known.
In a database column, I imagine I can:
Store it as a VARCHAR, formatted as "120101-3842", "20120101-3842" or "201201013842" (shaving of a byte by getting of the superfluous hyphen in the YYYYMMDD-format).
Store the full YYYYMMDDXXXX as an INTEGER, which is too big for 32 bits but fits without problems in 64 bits.
There won't be any issues with leading zeroes in this case, and using a VARCHAR is almost twice the size. Unlike IP addresses, storing this number as an INTEGER does not make it harder to read for a human (i.e. "127.0.0.1" compared to 2130706433).
I appreciate the "strictness" of an INTEGER column but also feel that this might run into unseen issues.
EDIT: We have a real need to validate this input with the checksum et cetera, which requires doing math on the indivdual digits (multiplying, summing etc). Since digits aren't really ... uh... part of a quantity, but of decimal formatting, it might make sense to consider it a varchar after all.

Use VARCHAR with a fixed length because it is the most simple approach. And I don't think that your organisation will store the number of all 9.5 million inhabitants so that saving space is a real design goal? :)

So, as I understand it, the hyphen / plus signs are only required for the format with 2 digit year.
If I were you, I would on the application side convert to the 4 digit year format (And drop the hyphen). Then store the resulting value as an integer. As you have stated, this will save space, and will allow you to mathematically transform the values (Although I imagine that on personal numbers this may be irrelevant).
I think the key here is that you should choose a single format rather than trying to manage two different formats in the database. This will also help to lead to application consistency. When it comes to external applications that require one or another format, you can place a transform into the transfer code.
On a side note, it should be fairly trivial to create a trigger that would automatically assign the 2 digit year format (As long as you replace the hyphen / plus with a digit) To the 4 year format.

I would store the canonical form 201201013842 as a CHAR (rather than a VARCHAR).
The bottom line is that you do not control the semantics of the number (Swedish authorities do). If at some point they decide to add non numeric characters to the number (as the number already does in the older format), you will be better equipped to deal with the change.

We have the same problem and we currently store it as yyyyMMdd-xxxx, but if i where to redesign this today i would store the yyyyMMdd in a date field as that would handle the validation of the date, then i would store the 4 other values in a nchar(4) and add a constraint to ensure its only numbers.

Related

Minimum precision SQL numbers

What is the best way to check for minimum precision with numbers in SQL with Oracle database?
CreditCardNumber NUMBER(16) NOT NULL
CHECK (CreditCardNumber LIKE '________________')
or
CreditCardNumber NUMBER(16) NOT NULL
CHECK (REGEXP_LIKE(CreditCardNumber, '^\d{16}'))
I understand this is presentation level checking but I heard it's still good practice to avoid illogical data on the database level.
None of your function really work as expected as you will end-up having an implicit number to string conversion. Losing leading zeros. Maybe this is not a problem in that particular case, assuming a credit card number never start with a leading zero (and never will ? — according to ISO/IEC 7812 the leading number could be a 0 in some corner cases).
However, notice you don't have any benefit here in using the NUMBER type, as you will never perform calculation on the credit card "number". So, for that kind of data (credit card "numbers, telephone "numbers", zip codes, ...), I would strongly suggest you to use a character type (VARCHAR2 or CHAR if you prefer) instead, and at the very least check using an appropriate regexp than only digits are part of the string. Would be better to validate the checksum as suggested by #Allan in his answer though.
In addition, even if 16 digits is the most common case, bank card numbers are variable length -- from 12 to 19 digits (according to http://www.watersprings.org/pub/id/draft-eastlake-card-map-08.txt as I don't have access to the ISO official document).
Finally, concerning credit card numbers, you have to remember that depending your local regulation you are not necessarily allowed to store them unencrypted...
NUMBER(16) will only allow integers (i.e. if you try to insert '10.1', it will round to '10').
Keep in mind too that credit card numbers aren't always 16 digits - American Express uses 15.
The only benefit you'll gain from storing credit cards numbers in the NUMBER type is storage space. Since the digits are packed at a ratio of 9 digits to 4 bytes, 8 bytes will store 16 digits. However, every interaction with the data will require a type conversion to or from text, so you'll have to weigh the costs of storage, processing, and ease of coding.
The obvious way to validate that the value is wide enough is to check the value numerically:
CreditCardNumber NUMBER(16) NOT NULL
CHECK (CreditCardNumber >= 1000000000000000)
However, as #BenGrimm points out, this may not be valid for all credit card numbers.
One way to validate the card lenght per providers is to have a lookup table with each provider that you accept and the length of their card numbers. Again, you'd have to use a trigger to check against that, but it would allow you to verify that the length is appropriate is precisely correct.
A better validation might be to implement the Luhn algorithm in a function and use it to validate the column value via a trigger.
Finally, to reiterate what Sylvain Leroux pointed out, this should all be academic. You shouldn't be storing credit card numbers in plain text and may even be legally or contractually prevented from doing so.
You could potentially use ceiling(log10(Number)) = 16. I think for a credit card number there are better ways to check though.

Should I use VARCHAR(20) or VARCHAR(255) to store a name?

At the university I was told to use VARCHAR(20) to store a first name. The VARCHAR type takes space depending of string length, so is it necessary to specify smaller length range? I'm asking because RedBean ORM creates on default a VARCHAR(255) field for strings which length is <= 255 chars.
Is it any difference between using VARCHAR(20) and VARCHAR(255)? Except the maximum string length that can be stored :)
*I know that such questions have already been asked but all I understand from them is that using VARCHAR(255) where it isn't necessary could cause excessive memory consumption in DB applications.
What it is in real life programming? Should I use VARCHAR(255) for all short text inputs or try to limit length whenever it is possible?
Because 255 is now just an arbitrary choice for a VARCHAR length.
Explanation: Prior to MySQL 5.0.3 (give or take a few point releases - I forget) a VARCHAR column could be 255 characters in length maximum, so VARCHAR(255) was often used as a default. Now, however, you can go up to 65,535 characters on VARCHAR, so if you're still using "255" then that seems arbitrary and not well thought out (or your schema is just old).
It is better to define the limit to reduce the default database size and also helpful for validation purpose.
As far as I know with varchar(20) you are telling that the field will contain not more than 20 characters.
First of all determine an ideal range for the specified field depending on what it will hold (name or address etc). Its always efficient to use as small length as required.
The decision to use 20 characters is just as arbitrary as 255 unless you know your data.
You could use varchar(max) but that changes the way your data is stored and could impact performance; but without knowing more about your application and the size/volume of the data it's hard to give advice.
VARchar (255) is generally abad choice unless you need that much space. YOU always wmake all fields the size they need to be and no more. There are several reasons for this. ONe is the row size of the record has limits. Often you can crete a row larger than the limits but the first time you try to enter data that exceeds teh limits, it willfail. THis is a bad thing and shoudl be avopided by using smaller field sizes. Larger field sizes also encourage the entry of bad information. If users know they havea lot of room in a filed, they willenter notes inthat field instead of theh data, I have seen such gems as (and this is genuine example from a past job) "Talk to the fat girl as the blonde is useless." in fields where teh length was too long. You don't want to give room for junk to be put into a field if you can help it. Bad data in means bad data out.
Wider pages can also be slower to access, so it is the database's best interest to limit field size.
Under no circumstances should you use nvarchar(max) or varchar(max) for any string type fields unless you intend for some records to contain more than 4000 characters. These fields cannot be indexed (in SQL server, know your own datbase limits when doing design) and using them indiscriminately is very bad and will cause a slower than slow database.
Names are tricky and can be hard to determine the size so some people go big. But 25-50 is more reasonable than 255. It may vary depending on the kind of names you are storing, for instance if corporations are mixed in with people names, then the field will need to be wider. If you have a lot of foreign names to store, you need to know what is the norm for names in that country as far as length. Some countries have typically longer names than others. ANd remeber as fara as first name are considered, it will make a difference whether the person uses their middle names as well and if there is anywhere to store that. This is especillay true for people who have more than one middle name or who are using their maiden name as a middle name but still go by their other names such as Mary Elizabeth Annette Von Middlesworth Jamison - you can see how hard a name like that is to break up into first, middle and last and the majority of the name might end up in the firstname column.

When to use VARCHAR and DATE/DATETIME

We had this programming discussion on Freenode and this question came up when I was trying to use a VARCHAR(255) to store a Date Variable in this format: D/MM/YYYY. So the question is why is it so bad to use a VARCHAR to store a date. Here are the advantages:
Its faster to code. Previously I used DATE, but date formatting was a real pain.
Its more power hungry to use string than Date? Who cares, we live in the Ghz era.
Its not ethically correct (lolwut?) This is what the other user told me...
So what would you prefer to use to store a date? SQL VARCHAR or SQL DATE?
Why not put screws in with a hammer?
Because it isn't the right tool for the job.
Some of the disadvantages of the VARCHAR version:
You can't easily add / subtract days to the VARCHAR version.
It is harder to extract just month / year.
There is nothing stopping you putting non-date data in the VARCHAR column in the database.
The VARCHAR version is culture specific.
You can't easily sort the dates.
It is difficult to change the format if you want to later.
It is unconventional, which will make it harder for other developers to understand.
In many environments, using VARCHAR will use more storage space. This may not matter for small amounts of data, but in commercial environments with millions of rows of data this might well make a big difference.
Of course, in your hobby projects you can do what you want. In a professional environment I'd insist on using the right tool for the job.
When you'll have database with more than 2-3 million rows you'll know why it's better to use DATETIME than VARCHAR :)
Simple answer is that with databases - processing power isn't a problem anymore. Just the database size is because of HDD's seek time.
Basically with modern harddisks you can read about 100 records / second if they're read in random order (usually the case) so you must do everything you can to minimize DB size, because:
The HDD's heads won't have to "travel" this much
You'll fit more data in RAM
In the end it's always HDD's seek times that will kill you. Eg. some simple GROUP BY query with many rows could take a couple of hours when done on disk compared to couple of seconds when done in RAM => because of seek times.
For VARCHAR's you can't do any searches. If you hate the way how SQL deals with dates so much, just use unix timestamp in 32 bit integer field. You'll have (basically) all advantages of using SQL DATE field, you'll just have to manipulate and format dates using your choosen programming language, not SQL functions.
Two reasons:
Sorting results by the dates
Not sensitive to date formatting changes
So let's take for instance a set of records that looks like this:
5/12/1999 | Frank N Stein
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
If we were to store the data your way, but sorted on the dates in assending order SQL will respond with the resultset that looks like this:
1/22/2005 | Drake U. La
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
Where if we stored the dates as a DATETIME, SQL will respond correctly ordering them like this:
10/4/1962 | Goul Friend
5/12/1999 | Frank N. Stein
1/22/2005 | Drake U. La
Additionally, if somewhere down the road you needed to display dates in a different format, for example like YYYY-MM-DD, then you would need to transform all your data or deal with mixed content. When it's stored as a SQL DATE, you are forced to make the transform in code, and very likely have one spot to change the format to display all dates--for free.
Between DATE/DATETIME and VARCHAR for dates I would go with DATE/DATETIME everytime. But there is a overlooked third option. Storing it as a INTEGER unsigned!
I decided to go with INTEGER unsigned in my last project, and I am really satisfied with making that choice instead of storing it as a DATE/DATETIME. Because I was passing along dates between client and server it made the ideal type for me to use. Instead of having to store it as DATE and having to convert back every time I select, I just select it and use it however I want it. If you want to select the date as a "human-readable" date you can use the FROM_UNIXTIME() function.
Also a integer takes up 4 bytes while DATETIME takes up 8 bytes. Saving 50% storage.
The sorting problem that Berin proposes is also solved using integer as storage for dates.
I'd vote for using the date/datetime types, just for the sake of simplicity/consistency.
If you do store it as a character string, store it in ISO 8601 format:
http://www.iso.org/iso/date_and_time_format
http://xml.coverpages.org/ISO-FDIS-8601.pdf
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
Among other things, ISO 8601 date/time string (A) collate properly, (B) are human readable, (C) are locale-indepedent, and (D) are readily convertable to other formats. To crib from the ISO blurb, ISO 8601 strings offer
representations for the following:
Date
Time of the day
Coordinated universal time (UTC)
Local time with offset to UTC
Date and time
Time intervals
Recurring time intervals
Representations can be in one of two formats: a basic format
that has a minimal number of characters and an extended format
that adds characters to enhance human readability. For example,
the third of January 2003 can be represented as either 20030103
or 2003-01-03.
[and]
offer the following advantages over many of the locally used
representations:
Easily readable and writeable by systems
Easily comparable and sortable
Language independent
Larger units are written in front of smaller units
For most representations the notation is short and of constant length
One last thing: If all you need to do is store a date, then storing it in the ISO 8601 short form YYYYMMDD in a char(8) column takes no more storage than a datetime value (and you don't need to worry about the 3 millisecond gap between the last tick of the one day and the first tick of the next. But that's a matter for another discussion. If you break it up into 3 columns — YYYY char(4), MM char(2), DD char(2) you'll use up the same amount of storage, and get more options for indexing. Even better, store the fields as a short for yyyy (4 bytes), and a tinyint for each of MM and DD — now you're down to 6 bytes for the date. The drawback, of course, to decomposing the date components into their constituent parts is that conversion to proper date/time data types is complicated.

US phone numbers

I'm building an app that uses phone numbers to perform different tasks, and recently I've had quite a few requests to implement it for the US market. Unfortunately as I live in the UK I don't have much knowledge of US phone number formats, and with so many USA users on here I was hoping some of you would be able to help.
I'm looking to obtain a list of sample phone numbers as they appear in your call log on your mobile phones. I'm trying to determine if they come through in the format +1234567, +001234567, 001234567, 01234567, 1234567, 234567 etc, or perhaps the format can vary..
Hopefully you're hesitant about giving out phone numbers on the web, so feel free to change a few digits (I'm mainly interested in the first few digits and the format of the numbers).
The more numbers you can provide the better, thanks!
The following formats are common:
+12312322334
2312322334
(231) 232-2334
2322334
232-2334
The last two forms are unusual, though may be encountered. The area code is implied to be local to the phone.
Note that there are some invalid entries: Numbers never start with a "1" (thats the long distance dial indicator, optional on cell phones), the "555" prefix is reserved (so commonly used in movies).
U.S. phone numbers have three parts: A three-digit area code, a three-digit number, and a four-digit number. Generally, these are written in the format (234)-555-1234. If you are calling from the same area code as the person you are calling, you can omit the area code (the (234) part). For landlines, you often need to input a 1 first if you intend to include the area code, but most cell phones don't require this.
As I say -- interesting q. Have you searched for something like "dirty north american phone format" or "how are north american phone numbers typically formatted"? Struck me as being something that has to be done often.
Google brings up this as an example: Phone number format provider. It has a) some example formats and b) some code that actually deals with dirty or non-standard formats, and reformats them ...
So -- from my comment I guess I'd strip spaces (and hyphens) to start with, but from then on assume that you've got a right-most part of the number, and that any missing left-most parts represent increasingly wider geographic areas.
In reverse -- if the assumption works, you can create your own sample numbers by taking a standard format number and chopping groups from the left hand side -- I think.

List of Best Practice MySQL Data Types

Is there a list of best practice MySQL data types for common applications. For example, the list would contain the best data type and size for id, ip address, email, subject, summary, description content, url, date (timestamp and human readable), geo points, media height, media width, media duration, etc
Thank you!!!
i don't know of any, so let's start one!
numeric ID/auto_increment primary keys: use an unsigned integer. do not use 0 as a value. and keep in mind the maximum value of of the various sizes, i.e. don't use int if you don't need 4 billion values when the 16 million offered by mediumint will suffice.
dates: unless you specifically need dates/times that are outside the supported range of mysql's DATE and TIME types, use them! if you instead use unix timestamps, you have to convert them to use the built-in date and time functions. if your app needs unix timestamps, you can always convert the standard date and time data types on the way out using unix_timestamp().
ip addresses: use inet_aton() and inet_ntoa() since it easily compacts an ip address in to 4 bytes and gives you the ability to do range searches that utilize indexes.
Integer Display Width You likely define your integers something like this "INT(4)" but have been baffled by the fact that (4) has no real effect on the stored numbers. In other words, you can store numbers like 999999 just fine. The reason is that for integers, (4) is the display width, and only has an effect if used with the ZEROFILL modifier. Further, this is for display purposes only, so you could define a column as "INT(4) ZEROFILL" and store 99999. If you stored 999, the mysql REPL (console) would output 0999 when you've selected this column.
In other words, if you don't need the ZEROFILL stuff, you can leave off the display width.
Money: Use the Decimal data type. Based on real-world production scenarios I recommend (19,8).
EDIT: My original recommendation was (19,4); however, I've recently run into a production issue where the client reported that they absolutely needed decimal with a "scale" of "8"; thus "4" wasn't enough and was causing improper tax calculations. I now recommend (19,8) based on a real-world scenario. I would love to hear stories needing a more granular scale.