capturing user IP address information for audit - sql

We have a requirement to log IP address information of all users who use a certain web application based on Java EE 5.
What would be an appropriate sql data type for storing IPv4 or IPv6 addresses in the following supported databases (h2, mysql, oracle)?
There is also a need to filter activity from certain IP addresses. Should I just treat the representation as a string field (say varchar(32) to hold ipv4, ipv6 addresses)?

I'd store the IP addresses in a varchar(15). This is easily readable, and you can filter for specific IPs like where ip = '1.2.3.4'.
If you have to filter on networks, like 1.2.3.4/24, it becomes a different story. In that case you're better off storing the IP address as a 4 byte binary.

If you have huge amounts of data and have to search through, for performance it would be better to convert string (dotted) representation of IPs to their proper integer values.
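A minimal sketch of that conversion, assuming Python's standard ipaddress module. The helper names ip_to_int and in_network are made up for illustration and do not come from the answers above:

```python
import ipaddress

def ip_to_int(dotted: str) -> int:
    """Convert a dotted IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(dotted))

def in_network(dotted: str, cidr: str) -> bool:
    """Check membership by comparing integers against the network bounds."""
    # strict=False accepts a CIDR whose host bits are set, e.g. 1.2.3.4/24
    net = ipaddress.ip_network(cidr, strict=False)
    low = int(net.network_address)
    high = int(net.broadcast_address)
    return low <= ip_to_int(dotted) <= high
```

With the integer stored in a numeric column, the same bounds check becomes a plain range comparison that an ordinary index can serve.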

Either of these is valid
4 bytes, perhaps a 5th byte for CIDR
varchar(15) or (18) to store full representation in one go
Saying that, varchar(48) for SQL Server's sys.dm_exec_connections...

Related

Redis | Best data structure to store IPs and networks

My goal is to store in Redis:
Plain IP addresses like 228.228.228.228
IP networks like 228.228.228.0/24
in order to check in request/response cycle whether or not
current IP xxx.yyy.xxx.vvv is inside( contained by):
Plain ips
or
Ip network ( for example 228.228.228.228 inside 228.228.228.0/24)
Overall amount of ips and networks - few 1000 items.
Question is – what is the best way (best structure) to store both plain ips and networks in Redis and make aforementioned check without fetching data from Redis?
Thanks.
P.S. Current IP is already known.
UPDATE
Ok, let's simplify it a bit with an example.
I have 2 ips and 2 networks, and I want to check whether a certain ip is contained in them.
# 2 plain ip
202.76.250.29
37.252.145.1
# 2 networks
16.223.132.0/24
9.76.202.0/24
There are 2 possible ways where exact ip might be contained:
1)Just in plain ips. For example 202.76.250.29 contained in the structure above and 215.08.11.23 is not contained simply by definition.
2)Ip might be contained inside network. For example 9.76.202.100 is contained inside network 9.76.202.0/24 but not inside the list of plain ips, as there is no exact ip = 9.76.202.100.
A little bit of explanation about ip networks. Very simplified.
Ip network represents range of ips. For example ipv4 network "192.4.2.0/24" represents 256 ip addresses:
IPv4Address('192.4.2.1'), IPv4Address('192.4.2.2'),
…
…
…
IPv4Address('192.4.2.253'), IPv4Address('192.4.2.254')
In other words, an ip network is a range of ip addresses
from '192.4.2.1' up to '192.4.2.254'
In our example 9.76.202.100 contained inside networks 9.76.202.0/24 as one of this addresses inside the range.
My idea is like this:
Any ip address can be represented as integer. One of our ip addresses
202.76.250.29 converted to integer is 3394042397.
As ip network is a range of ips, so that it is possible to convert it in a range of integers by converting first and last ip in range in integers.
For example one of our networks 16.223.132.0/24 represents the range between IPv4Address('16.223.132.1') and IPv4Address('16.223.132.254'), or the integer range from 283083777 up to 283084030 with step 1.
An individual ip can be represented as the range between its integer and its integer + 1 (lower bound included, upper bound excluded).
Obviously search in plain ips can be done by putting them to SET and then using SISMEMBER. But what about searching inside networks. Can we do some trick with ranges maybe?
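The range idea above could be sketched roughly like this in Python. It is only an illustration: the lookup is kept in a plain list with no Redis calls, the names to_range and is_listed are made up, and the whole /24 block (including .0 and .255) is treated as part of the network.

```python
import ipaddress

def to_range(entry: str) -> tuple:
    """Return (low, high): low inclusive, high exclusive."""
    # A plain IP is parsed as a /32 network, so it becomes a range of size 1
    net = ipaddress.ip_network(entry, strict=False)
    low = int(net.network_address)
    return low, low + net.num_addresses

ENTRIES = ["202.76.250.29", "37.252.145.1", "16.223.132.0/24", "9.76.202.0/24"]
RANGES = [to_range(e) for e in ENTRIES]

def is_listed(ip: str) -> bool:
    """True if the ip equals a plain entry or falls inside a network."""
    value = int(ipaddress.IPv4Address(ip))
    return any(low <= value < high for low, high in RANGES)
```

For a few thousand entries a linear scan like this is cheap; plain IPs and networks end up in the same structure because both are just ranges.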
"Best" is subjective(in memory, in speed etc) but you may use two sets/hash to store them. Since they are unique both hashes and sets would be fine. If you prefer you can use a single set/hash to save both ip and network ip addresses but i would prefer separate since they are two different type of data sets(just like database tables).
Then you can use either of those
SISMEMBER with O(1) time complexity
HEXISTS with O(1) time complexity.
It can be handled on application level with multiple commands or lua script(in a single transaction).
Depending on your choice add to your keys with SADD and HSET(the field value would be 1).
--
Edit: (hope i get it right)
For the range of network addresses create sets from the integers surrounding two dots such as 12.345.67.1-12.345.67.254 range will be represented as 12.345.67 and you will add this to the set. When you want to search for 12.345.67.x it will be parsed into 12.345.67 in your application level and you will check with SISMEMBER. Same can be done with hash with HEXISTS.
Since ip addresses contain four different numbers with three dots, you will discard last dot and last number and the rest will be representing(i assume) the network range.
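A rough sketch of that prefix trick, with Python sets standing in for the Redis sets (so `in` plays the role of SISMEMBER). The names are hypothetical, and the approach only works when every network is a /24:

```python
def prefix24(ip: str) -> str:
    """Drop the last octet: '9.76.202.100' -> '9.76.202'."""
    return ip.rsplit(".", 1)[0]

# Stand-ins for two Redis sets: one for plain ips, one for /24 prefixes
plain_ips = {"202.76.250.29", "37.252.145.1"}
network_prefixes = {prefix24("16.223.132.0"), prefix24("9.76.202.0")}

def is_listed(ip: str) -> bool:
    """Two O(1) membership checks: the exact ip, then its /24 prefix."""
    return ip in plain_ips or prefix24(ip) in network_prefixes
```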
For IPs you can use Set and query by certain IP within O(1) time.
For IP range, I think you can use List with Lua Script for query. List will have O(n) time for searching, but since you only have 1000 items, O(N) and O(1) will not have a huge difference for Redis in memory query.

How long should SQL email fields be? [duplicate]

I recognize that an email address can basically be indefinitely long so any size I impose on my varchar email address field is going to be arbitrary. However, I was wondering what the "standard" is? How long do you guys make it? (same question for Name field...)
update: Apparently the max length for an email address is 320 (<=64 name part, <= 255 domain). Do you use this?
The theoretical limit is really long but do you really need worry about these long Email addresses? If someone can't login with a 100-char Email, do you really care? We actually prefer they can't.
Some statistical data may shed some light on the issue. We analyzed a database with over 10 million Email addresses. These addresses are not confirmed so there are invalid ones. Here are some interesting facts,
The longest valid one is 89.
There are hundreds longer ones up to the limit of our column (255) but they are apparently fake by visual inspection.
The peak of the length distribution is at 19.
There isn't a long tail. Everything falls off sharply after 38.
We cleaned up the DB by throwing away anything longer than 40. The good news is that no one has complained but the bad news is not many records got cleaned out.
I've in the past just done 255 because that's the so-ingrained standard of short but not too short input. That, and I'm a creature of habit.
However, since the max is 319, I'd do nvarchar(320) on the column. Gotta remember the @!
nvarchar won't use the space that you don't need, so if you only have a 20 character email address, it will only take up the space for those 20 characters (two bytes each for nvarchar, plus a small length overhead). This is in contrast to a nchar which will always take up its maximum (it right-pads the value with spaces).
I'd also use nvarchar in lieu of varchar since it's Unicode. Given the volatility of email addresses, this is definitely the way to go.
The following email address is only 94 characters:
i.have.a.really.long.name.like.seetharam.krishnapillai@AReallyLongCompanyNameOfSomeKind.com.au
Would an organisation actually give you an email that long?
If they were stupid enough to, would you actually use an email address like that?
Would anyone? Of course not. Too long to type and too hard to remember.
Even a 92-year-old technophobe would figure out how to sign up for a nice short gmail address, and just use that, rather than type this into your registration page.
Disk space probably isn't an issue, but there are at least two problems with allowing user input fields to be many times longer than they need to be:
Displaying them could mess up your UI (at best they will be cut off, at worst they push your containers and margins around)
Malicious users can do things with them you can't anticipate (like those cases where hackers used a free online API to store a bunch of data)
I like 50 chars:
123456789.123456789.123456789@1234567890123456.com
If one user in a million has to use their other email address to use my app, so be it.
(Statistics show that no-one actually enters more than about 40 chars for email address, see e.g.: ZZ Coder's answer https://stackoverflow.com/a/1297352/87861)
According to this text, based on the proper RFC documents, it's not 320 but 254:
http://www.eph.co.uk/resources/email-address-length-faq/
Edit:
Using WayBack Machine:
https://web.archive.org/web/20120222213813/http://www.eph.co.uk/resources/email-address-length-faq/
What is the maximum length of an email address?
254 characters
There appears to be some confusion over the maximum valid email
address size. Most people believe it to be 320 characters (64
characters for the username + 255 characters for the domain + 1
character for the @ symbol). Other sources suggest 129 (64 + 1 + 64)
or 384 (128+1+255, assuming the username doubles in length in the
future).
This confusion means you should heed the 'robustness principle'
("developers should carefully write software that adheres closely to
extant RFCs but accept and parse input from peers that might not be
consistent with those RFCs." - Wikipedia) when writing software that
deals with email addresses. Furthermore, some software may be crippled
by naive assumptions, e.g. thinking that 50 characters is adequate
(examples). Your 200 character email address may be technically valid
but that will not help you if most websites or applications reject it.
The actual maximum email length is currently 254 characters:
"The original version of RFC 3696 did indeed say 320 was the maximum
length, but John Klensin (ICANN) subsequently accepted this was
wrong."
"This arises from the simple arithmetic of maximum length of a domain
(255 characters) + maximum length of a mailbox (64 characters) + the @
symbol = 320 characters. Wrong. This canard is actually documented in
the original version of RFC3696. It was corrected in the errata.
There's actually a restriction from RFC5321 on the path element of an
SMTP transaction of 256 characters. But this includes angled brackets
around the email address, so the maximum length of an email address is
254 characters." - Dominic Sayers
I use varchar(64). I do not think anyone could have a longer email.
If you're really being pedantic about it, make a username varchar(60), domain varchar(255). Then you can do ridiculous statistics on domain usage that is slightly faster than doing it as a single field. If you're feeling really gung-ho about optimization, that will also make your SMTP server able to send out emails with fewer connections / better batching.
RFC 5321 (the current SMTP spec, obsoletes RFC2821) states:
4.5.3.1.1. Local-part
The maximum total length of a user
name or other local-part is 64
octets.
4.5.3.1.2. Domain
The maximum total length of a
domain name or number is 255 octets.
This pertains to just localpart@domain, for a total of 320 ASCII (7-bit) characters.
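As a rough illustration, the limits could be checked like this in Python. This is a sketch, not a validator: it counts characters rather than octets (so it assumes ASCII input), the function name is made up, and the 254 total is the figure from the RFC 3696 erratum quoted earlier rather than the 320 sum.

```python
MAX_LOCAL = 64    # RFC 5321, section 4.5.3.1.1
MAX_DOMAIN = 255  # RFC 5321, section 4.5.3.1.2
MAX_TOTAL = 254   # overall limit per the erratum quoted earlier

def email_length_ok(address: str) -> bool:
    """Check only the length limits, not the address syntax."""
    # Split on the last '@' so the domain is whatever follows it
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    return (
        len(local) <= MAX_LOCAL
        and len(domain) <= MAX_DOMAIN
        and len(address) <= MAX_TOTAL
    )
```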
If you plan to normalize your data, perhaps by splitting the localpart and domain into separate fields, additional things to keep in mind:
A technique known as VERP may result in full-length localparts for automatically generated mail (may not be relevant to your use case)
domains are case insensitive; recommend lowercasing the domain portion
localparts are case sensitive; user@domain.com and USER@domain.com are technically different addresses per the specs, although the policy at the domain.com may be to treat the two addresses as equivalent. It's best to restrict localpart case folding to domains that are known to do this.
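A small sketch of the normalization described in the points above: lowercase only the domain and leave the localpart untouched. The function name normalize_email is hypothetical:

```python
def normalize_email(address: str) -> str:
    """Lowercase the domain; keep the localpart exactly as given."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("expected an address of the form localpart@domain")
    return local + "@" + domain.lower()
```

Storing the normalized form lets a unique index treat user@Domain.com and user@domain.com as the same row while still keeping USER@domain.com distinct.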
For email, regardless of the spec, I virtually always go with 512 (nvarchar). Names and surnames are similar.
Really, you need to look at how much you care about having a little extra data. For me, mostly, it's not a worry, so I'll err on the conservative side. But if you've decided, through logical and accurate means, that you'll need to conserve space, then do so. But in general, be conservative with field sizes, and life shall be good.
Note that probably not all email clients support the RFC, so regardless of what it says, you may encounter different things in the wild.

Indexed ranged search algorithm for IP Addresses

Given an ACL list with 10 billion IPv4 ranges in CIDR notation or between two IPs:
x.x.x.x/y
x.x.x.x - y.y.y.y
What is an efficient search/indexing algorithm for testing that a given IP address meets the criteria of one or more ACL ranges?
Let's assume most ACL range definitions span a great number of class C blocks.
Indexing points via hash tables is easy, but try as I might, I have not been able to come up with a reasonable method for detecting which points are covered by a large list of "lines".
Had some thoughts like indexing hints at a certain level of detail -- say pre-computing at the class C level each ACL that covered that point but the table would be too large.. Or some sort of KD tree to dynamically set levels of detail.
Also had the thought that maybe there are collision detection algorithms out there that can address this.
Any hints or pointers in the right direction?
The simple Radix Tree which has been used in the longest prefix match Internet route lookups, can be scaled to hold nodes that represent the larger CIDR subnets that overlap other smaller ones. A longest match lookup will traverse these nodes which will also be selected to get the entire set of CIDR subnets that match an IP address.
Now, to hold the IP ranges in the same tree, we can convert each range into a set of CIDR subnets. This can be always done though the set may have lots of subnets (and even some host IPs -- that is, IP/32 kind CIDR addresses).
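For what it's worth, Python's standard library can do this range-to-CIDR conversion. A minimal sketch, where range_to_cidrs is a made-up wrapper around ipaddress.summarize_address_range:

```python
import ipaddress

def range_to_cidrs(first: str, last: str) -> list:
    """Cover the inclusive range first..last with CIDR blocks."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]
```

For example, the range 192.0.2.0 - 192.0.2.130 comes back as 192.0.2.0/25, 192.0.2.128/31 and 192.0.2.130/32, which shows how an awkward range ends with small subnets and a host address.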
You have 10 billion rules to match 4 billion possible addresses?
Make a table of 4 billion addresses. For each of the 10 billion rules, 'paint' the addresses it applies to, doing something sensible when two or more rules apply to the same address.
You can look at the Interval tree to find all intervals that overlap with any given interval or point.
For non-overlapping ip-ranges, you can use a b-tree or compact-tries like Judy arrays (64-bits) for indexing and searching (Store the start-ip as key and the end-ip as value).
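A sketch of the start-ip-as-key idea for non-overlapping ranges, with a sorted Python list and binary search standing in for the b-tree. The class name RangeIndex is an assumption for illustration, and nothing here addresses the 10 billion scale:

```python
import bisect
import ipaddress

class RangeIndex:
    def __init__(self, ranges):
        # ranges: (start_ip, end_ip) string pairs, inclusive, non-overlapping
        pairs = sorted(
            (int(ipaddress.ip_address(s)), int(ipaddress.ip_address(e)))
            for s, e in ranges
        )
        self.starts = [s for s, _ in pairs]
        self.ends = [e for _, e in pairs]

    def contains(self, ip: str) -> bool:
        value = int(ipaddress.ip_address(ip))
        # Find the last range whose start is <= value, then test its end
        i = bisect.bisect_right(self.starts, value) - 1
        return i >= 0 and value <= self.ends[i]
```

Each lookup is O(log n) because only the single candidate range with the nearest start at or below the address needs to be checked.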

Max length for client ip address [duplicate]

Possible Duplicate:
Maximum length of the textual representation of an IPv6 address?
What would you recommend as the maximum size for a database column storing client ip addresses? I have it set to 16 right now, but could I get an ip address that is longer than that with IPv6, etc?
There's a caveat with the general 39 character IPv6 structure. For IPv4 mapped IPv6 addresses, the string can be longer (than 39 characters). An example to show this:
IPv6 (39 characters) :
ABCD:ABCD:ABCD:ABCD:ABCD:ABCD:ABCD:ABCD
IPv4-mapped IPv6 (45 characters) :
ABCD:ABCD:ABCD:ABCD:ABCD:ABCD:192.168.158.190
Note: the last 32-bits (that correspond to IPv4 address) can need up to 15 characters (as IPv4 uses 4 groups of 1 byte and is formatted as 4 decimal numbers in the range 0-255 separated by dots (the . character), so the maximum is DDD.DDD.DDD.DDD).
The correct maximum IPv6 string length, therefore, is 45.
This was actually a quiz question in an IPv6 training I attended. (We all answered 39!)
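The two lengths are easy to double-check with a couple of lines of Python, using the example strings from above:

```python
FULL_IPV6 = "ABCD:ABCD:ABCD:ABCD:ABCD:ABCD:ABCD:ABCD"
MAPPED_IPV4 = "ABCD:ABCD:ABCD:ABCD:ABCD:ABCD:192.168.158.190"

# 8 groups of 4 hex digits plus 7 colons
full_length = len(FULL_IPV6)
# 6 groups of 4 hex digits, 6 colons, and up to 15 characters of dotted IPv4
mapped_length = len(MAPPED_IPV4)

print(full_length, mapped_length)
```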
For IPv4, you could get away with storing the 4 raw bytes of the IP address (each of the numbers between the periods in an IP address are 0-255, i.e., one byte). But then you would have to translate going in and out of the DB and that's messy.
IPv6 addresses are 128 bits (as opposed to 32 bits of IPv4 addresses). They are usually written as 8 groups of 4 hex digits separated by colons: 2001:0db8:85a3:0000:0000:8a2e:0370:7334. 39 characters is appropriate to store IPv6 addresses in this format.
Edit: However, there is a caveat, see @Deepak's answer for details about IPv4-mapped IPv6 addresses. (The correct maximum IPv6 string length is 45 characters.)
If you want to handle IPV6 in standard notation there are 8 groups of 4 hex digits:
2001:0dc5:72a3:0000:0000:802e:3370:73E4
32 hex digits + 7 separators = 39 characters.
CAUTION: If you also want to hold IPV4 addresses mapped as IPV6 addresses, use 45 characters as @Deepak suggests.
Take it from someone who has tried it all three ways... just use a varchar(39)
The slightly less efficient storage far outweighs any benefit of having to convert it on insert/update and format it when showing it anywhere.
As described in the IPv6 Wikipedia article,
IPv6 addresses are normally written as
eight groups of four hexadecimal
digits, where each group is separated
by a colon (:)
A typical IPv6 address:
2001:0db8:85a3:0000:0000:8a2e:0370:7334
This is 39 characters long. IPv6 addresses are 128 bits long, so you could conceivably use a binary(16) column, but I think I'd stick with an alphanumeric representation.
If you are just storing it for reference, you can store it as a string, but if you want to do a lookup, for example, to see if the IP address is in some table, you need a "canonical representation." Converting the entire thing to a (large) number is the right thing to do. IPv4 addresses can be stored as a long int (32 bits) but you need a 128 bit number to store an IPv6 address.
For example, all these strings are really the same IP address: 127.0.0.1, 127.000.000.001, ::1, 0:0:0:0:0:0:0:1
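A minimal sketch of canonicalizing before lookup, assuming Python's ipaddress module. The function names are made up. Note that this module keeps IPv4 and IPv6 as separate address families, so it only unifies different spellings within one family, and recent Python versions may reject zero-padded dotted forms such as 127.000.000.001:

```python
import ipaddress

def canonical_text(text: str) -> str:
    """Full, zero-padded spelling of the address."""
    return ipaddress.ip_address(text).exploded

def canonical_key(text: str) -> tuple:
    """(version, integer) pair usable as a lookup key."""
    addr = ipaddress.ip_address(text)
    return addr.version, int(addr)
```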
IPv4 uses 32 bits, in the form of:
255.255.255.255
I suppose it depends on your datatype, whether you're just storing as a string with a CHAR type or if you're using a numerical type.
IPv6 uses 128 bits. You won't have IPs longer than that unless you're including other information with them.
IPv6 is grouped into sets of 4 hex digits separated by colons, like (from wikipedia):
2001:0db8:85a3:0000:0000:8a2e:0370:7334
You're safe storing it as a 39-character long string, should you wish to do that. There are other shorthand ways to write addresses as well though. Sets of zeros can be truncated to a single 0, or sets of zeroes can be hidden completely by a double colon.
People are talking about characters when one can compress an IP address into raw data.
So in principle, since we only use IPv4 (32bit) or IPv6 (128bit), that means you need at most 128 bits of space, or 128/8 = 16 bytes!
Which is much less than the suggested 39 bytes (assuming charset is ascii).
That said, you will have to decode and encode the IP address into/from the raw data, which in itself is a trivial thing to do (I've done it before, see PHP's ip2long() for 32-bit IPs).
Edit: inet_pton (and its opposite, inet_ntop()) does what you need, and works with both address types. But beware, on Windows it's available since PHP 5.3.
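The same round trip can be sketched in Python, as a rough analogue of the PHP functions mentioned above rather than a translation of them. The helper names pack_ip and unpack_ip are hypothetical:

```python
import ipaddress

def pack_ip(text: str) -> bytes:
    """Text to raw bytes: 4 bytes for IPv4, 16 bytes for IPv6."""
    return ipaddress.ip_address(text).packed

def unpack_ip(raw: bytes) -> str:
    """Raw bytes back to text; the length decides the address family."""
    return str(ipaddress.ip_address(raw))
```

The packed value fits a binary(16) column, with IPv4 rows simply using 4 of those bytes in a variable-length binary type.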

SQL SHA1 inside WHERE

In my program, we store a user's IP address in a record. When we display a list of records to a user, we don't want to give away the other user's IP, so we SHA1 hash it. Then, when the user clicks on a record, it goes to a URL like this:
http://www.example.com/allrecordsbyipaddress.php?ipaddress=SHA1HASHOFTHEIPADDRESS
Now, I need to list all the records by the IP address specified in the SHA1 hash. I tried this:
SELECT * FROM records
WHERE SHA1(IPADDRESS)="da39a3ee5e6b4b0d3255bfef95601890afd80709"
but this does not work. How would I do this?
Thanks,
Isaac Waller
Don't know if it matters, but your SHA1 hash da39a3ee5e6b4b0d3255bfef95601890afd80709 is a well-known hash of an empty string.
Is it just an example or you forgot to provide an actual IP address to the hash calculation function?
Update:
Does your webpage code generate SHA1 hashes in lowercase?
This check will fail in MySQL:
SELECT SHA1('') = 'DA39A3EE5E6B4B0D3255BFEF95601890AFD80709'
In this case, use this:
SELECT SHA1('') = LOWER('DA39A3EE5E6B4B0D3255BFEF95601890AFD80709')
, which will succeed.
Also, you can precalculate the SHA1 hash when you insert the records into the table:
INSERT
INTO ip_records (ip, ip_sha)
VALUES (@ip, SHA1(CONCAT('my_secret_salt', @ip)))
SELECT *
FROM ip_records
WHERE ip_sha = @my_salted_sha1_from_webpage
This will return you the original IP and allow indexing of ip_sha, so that this query will work fast.
I'd store the SHA1 of the IP in the database along with the raw IP, so that the query would become
SELECT * FROM records WHERE ip_sha1 = "..."
Then I'd make sure that the SHA1 calculation happens in exactly one place in code, so that there's no opportunity for it to be done slightly differently in multiple places. That also gives you the opportunity to mix a salt into the calculation, so that someone can't simply compute the SHA1 on an IP address they're interested in and pass that in by hand.
Storing the SHA1 hash in the database also gives you the opportunity to add a secondary index on ip_sha1 to speed up that SELECT. If you have a very large data set, doing the SHA1 in the WHERE clause forces the database to do a complete table scan, along with redoing a calculation for every record on every scan.
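A sketch of keeping the calculation in one place, written in Python for illustration. The function name ip_token is made up, and the salt value is just the placeholder from the SQL example above. It prepends the salt the same way CONCAT('my_secret_salt', ip) does, and strips stray whitespace before hashing:

```python
import hashlib

SECRET_SALT = "my_secret_salt"  # placeholder; a real salt should stay private

def ip_token(ip: str, salt: str = SECRET_SALT) -> str:
    """Lowercase hex SHA1 of salt + ip, computed in exactly one place."""
    cleaned = ip.strip()  # avoid mismatches from a trailing newline or spaces
    return hashlib.sha1((salt + cleaned).encode("ascii")).hexdigest()
```

The value returned here is what would be stored in ip_sha1 on insert and compared with a plain equality in the WHERE clause.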
Every time I've had an unexpected hashing mismatch, it was because I accidentally hashed a string that included some whitespace, such as "\n".
Just a quick thought: that's a very simple obfuscation. There are only 2^32 possible IP addresses, so if somebody with technical knowledge wanted to figure it out, they could do that by calculating all 4 billion hashes, which wouldn't take very long. Depending on the sensitivity of those ip addresses, you may want to consider a private lookup table.
Did you compare the output of your hash algorithm with the output of MySQL's SHA1()? For example for IP address 1.2.3.4?
I ended up encrypting the IP addresses, and decrypting them on the other page. Then I can just use the raw IP address in the SQL query. Also, it protects against brute force attacks, like Autocracy said.