How to store an approximate number? (the number is too small to be measured) - SQL

I have a table representing standards of alloys. The standard is partly based on the chemical composition of the alloys. The composition is presented in percentages. The percentage is determined by a chemical composition test. Sample data.
But sometimes, the lab cannot measure below a certain percentage. So they indicate that the element is present, but the percentage is less than they can measure.
I was unsure how to accurately store such a number in an SQL database. I thought of storing the number with a negative sign. No element can have a negative composition of course, but I could interpret this as "less than the specified value". Another option is to add an extra column for each element!! The latter option I really don't like.
Any other ideas? It's a small issue if you think about it, but I think a crowd is always wiser. Somebody might have a neater solution.
Question updated:
Thanks for all the replies.
The test results come from different labs, so there is no common lower bound.
When the percentage of Titanium is less than 0.0004, for example, the number is still important; only the formula will differ slightly in that case.
Hence the value cannot be stored as NULL, and I don't know the lower bound for all values.
Tricky one.
Another possibility I thought of is to store it as a string. Any other ideas?

What you're talking about is a sentinel value. It's a common technique: C-style strings, after all, use 0 as a sentinel end-of-string value, and many string functions return -1 to indicate that what you're looking for isn't there. You can do that here; you just need to find a number that makes sense and isn't used for anything else.
0 might work, because if the element isn't there at all there shouldn't even be a record, but you face the problem that it might be mistaken for actually meaning 0. -1 is another option, and it obviously doesn't have that same problem.
Another column to indicate whether the amount is measurable or not is also a viable option. The case for this one becomes stronger if you need to store different categories of trace elements (e.g. <1%, <0.1%, <0.01%, etc.). Storing the negative of those numbers seems a bit hacky to me.

You could just store it as NULL, meaning that the value exists but is undefined.
Any arithmetic operation with a NULL yields a NULL.
Division by NULL is safe.
NULLs are ignored by the aggregate functions, so a query like this:
SELECT SUM(metal_percent), COUNT(metal_percent)
FROM alloys
GROUP BY metal
will give you the sum and the count of the actual, defined values, not taking the unfilled values into account.

I would use a threshold value that is at least one significant digit smaller than your smallest expected value. This way you can logically say that any value less than, say, 0.01 can be presented to your application as a "trace" amount. This remains easy to understand and gives you flexibility in deciding where your threshold should lie.

Since the constraints on the values are well defined (you cannot have a negative composition), I would go for the "negative value to indicate less-than" approach. As long as this use of sentinel values is sufficiently documented, it should be reasonably easy to implement and maintain.
An alternative but similar method would be to add 100 to the values, assuming that you can't get more than 100%. So <0.001 becomes 100.001.
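For illustration, a minimal sketch of decoding the negative-sentinel convention on the way out (reusing the alloys/metal_percent names from the earlier answer; treat them as placeholders):
-- A negative stored value means "present, but below the detection limit".
SELECT metal,
       ABS(metal_percent) AS percent_value,
       CASE WHEN metal_percent < 0 THEN 'less than (trace)' ELSE 'measured' END AS interpretation
FROM alloys;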

I would have a table modelling the certificate, in a one-to-many relation with another table storing the values for the elements. The elements table would contain the value in one column and a flag ("less than") in a separate column.
Draft:
create table CERTIFICATES
(
    PK_ID integer,
    NAME  varchar(128)
);

create table ELEMENTS
(
    ELEMENT_ID     varchar(2),   -- chemical symbol, e.g. 'Ti'
    CERTIFICATE_ID integer,      -- references CERTIFICATES.PK_ID
    CONCENTRATION  number,
    MEASURABLE     integer       -- e.g. 1 = measured value, 0 = trace ("less than")
);
Depending on the database engine you're using, the types of the columns may vary.
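A brief usage sketch against that draft (the values and the 0/1 convention for MEASURABLE are assumptions for illustration):
-- A measured value and a below-detection-limit ("trace") value.
INSERT INTO ELEMENTS (ELEMENT_ID, CERTIFICATE_ID, CONCENTRATION, MEASURABLE)
VALUES ('Ti', 1, 0.0120, 1);   -- 0.012% titanium, actually measured
INSERT INTO ELEMENTS (ELEMENT_ID, CERTIFICATE_ID, CONCENTRATION, MEASURABLE)
VALUES ('Ti', 2, 0.0004, 0);   -- titanium present, but below the lab's 0.0004% limit

SELECT CERTIFICATE_ID,
       ELEMENT_ID,
       CONCENTRATION,
       CASE WHEN MEASURABLE = 1 THEN 'measured' ELSE 'trace (less than)' END AS kind
FROM ELEMENTS;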

Why not add another column to store whether or not it's a trace amount?
This will allow you to save the amount that the trace is less than, too.

Since there is no common lowest threshold value and NULL is not acceptable, the cleanest solution now is to have a marker column which indicates whether there is a quantifiable amount or a trace amount present. A value of "Trace" would indicate to anybody reading the raw data that only a trace amount was present. A value of "Quantity" would indicate that you should check an amount column to find the actual quantity present.
I would have to warn against storing numerical values as strings. It will inevitably add additional pain, since you now lose the assertions a strong type definition gives you. When your application consumes the values in that column, it has to read the string to determine whether it's a sentinel value, a numeric value or simply some other string it can't interpret. Trying to handle data conversion errors at this point in your application is something I'm sure you don't want to be doing.

Another field seems like the way to go; call it 'MinMeasurablePercent'.
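Something along these lines, for instance (T-SQL syntax; the alloys table name is borrowed from the earlier example and purely illustrative):
ALTER TABLE alloys ADD MinMeasurablePercent float NULL;
-- NULL   -> the stored percentage is an actual measurement
-- 0.0004 -> the element was detected, but the lab could not measure below 0.0004%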

Related

Is there a Postgres feature or built-in function that limits the display of uuids to only that needed to make them uniquely identifiable?

It would have to return the portion necessary to uniquely identify the row even if a select statement didn't return all rows, of course, to be of any use. And I'm not sure how it would work if the uuid column were not part of a pk/index and was repeated.
Does this exist?
I think you would have to decide what constitutes uniquely identifiable by assuming that a number of places from the right make it uniquely identifiable. I think this is folly but the way you would do that is something like this:
SELECT RIGHT(uuid_column_name::text, 7) as your_truncated_uuid FROM table_with_uuid_column;
That takes the 7 places from the right of the text value of the uuid column.
No, there is not. A UUID is a hex representation of a 128-bit number, of which 122 bits are random in the v4 variant. It's not even guaranteed to be unique, though it very likely is.
You have a few options to implement this:
Shave off characters and hope you don't introduce a collision. For instance, if you shorten d8366842-8c1d-4a31-a4c0-f1765b8ab108 to d8366842, you have 16**8 possible combinations, or 4,294,967,296. How likely is your dataset to have a collision with 4.2 billion (2**32) possibilities? Perhaps you can add 8c1d back in to make it 16**12, or 281,474,976,710,656 possibilities.
Process each row looking for collisions and progressively widen the window of characters until no collisions are found, or check every possible truncation length.
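If you go the shave-off-characters route, a quick sanity check for collisions first is a grouping query along these lines (table and column names are the placeholders from the answer above):
-- Any rows returned mean an 8-character suffix is not unique in this table.
SELECT RIGHT(uuid_column_name::text, 8) AS suffix, COUNT(*) AS occurrences
FROM table_with_uuid_column
GROUP BY RIGHT(uuid_column_name::text, 8)
HAVING COUNT(*) > 1;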
That all said, another idea is to use ints and not uuids and then to use http://hashids.org/ which has a plugin for PostgreSQL. This is the method YouTube uses afaik.

How predictable is NEWSEQUENTIALID?

According to Microsoft's documentation on NEWSEQUENTIALID, the output of NEWSEQUENTIALID is predictable. But how predictable is predictable? Say I have a GUID that was generated by NEWSEQUENTIALID, how hard would it be to:
Calculate the next value?
Calculate the previous value?
Calculate the first value?
Calculate the first value, even without knowing any GUIDs at all?
Calculate the amount of rows? E.g. when using integers, /order?id=842 tells me that there are 842 orders in the application.
Below is some background information about what I am doing and what the various tradeoffs are.
One of the security benefits of using GUIDs over integers as primary keys is that GUIDs are hard to guess. E.g. if a hacker sees a URL like /user?id=845, he might try to access /user?id=0, since it is probable that the first user in the database is an administrative user. Moreover, a hacker can iterate over /user?id=0..1..2 to quickly gather all users.
Similarly, a privacy downside of integers is that they leak information. /order?id=482 tells me that the web shop has had 482 orders since its implementation.
Unfortunately, using GUIDs as primary keys has well-known performance downsides. To this end, SQL Server introduced the NEWSEQUENTIALID function. In this question, I would like to learn how predictable the output of NEWSEQUENTIALID is.
The underlying OS function is UuidCreateSequential. The value is derived from one of your network cards' MAC addresses and a per-OS-boot incremental value. See RFC 4122. SQL Server does some byte-shuffling to make the result sort properly. So the value is highly predictable, in a sense: specifically, if you know a value, you can immediately predict a range of similar values.
However, one cannot predict the equivalent of id=0, nor that 52DE358F-45F1-E311-93EA-00269E58F20D means the store has sold at least 482 items.
The only 'approved' random generation is CRYPT_GEN_RANDOM (which wraps CryptGenRandom) but that is obviously a horrible key candidate.
In most cases, the next newsequentialid can be predicted by taking the current value and adding one to the first hex pair.
In other words:
1E29E599-45F1-E311-80CA-00155D008B1C
is followed by
1F29E599-45F1-E311-80CA-00155D008B1C
is followed by
2029E599-45F1-E311-80CA-00155D008B1C
Occasionally, the sequence will restart from a new value.
So, it's very predictable.
NEWSEQUENTIALID is a wrapper around the Windows function UuidCreateSequential.
You can try this code:
CREATE TABLE #tbl (
    PK uniqueidentifier DEFAULT NEWSEQUENTIALID(),
    Num int
)
INSERT INTO #tbl (Num) VALUES (1), (2), (3), (4), (5)
SELECT * FROM #tbl
On my machine, at the time of writing, the result is:
PK Num
52DE358F-45F1-E311-93EA-00269E58F20D 1
53DE358F-45F1-E311-93EA-00269E58F20D 2
54DE358F-45F1-E311-93EA-00269E58F20D 3
55DE358F-45F1-E311-93EA-00269E58F20D 4
56DE358F-45F1-E311-93EA-00269E58F20D 5
You should run it several times, at different times and dates, to observe the behaviour.
I ran it several times and the first part changes every time (you can see it in the results: 52..., 53..., 54..., etc.). I waited a while and checked again, and after some time the second part is incremented too. I suppose the incrementation continues through all the parts. Basically it looks like a simple +=1 increment transformed into a GUID.
EDIT:
If you want sequential GUIDs and you want to have control over the values, you can use sequences.
Sample code:
select cast(cast(next value for [dbo].[MySequence] as varbinary(max)) as uniqueidentifier)
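For completeness, a minimal sketch of creating such a sequence first (the [dbo].[MySequence] name is just the one assumed in the snippet above):
CREATE SEQUENCE [dbo].[MySequence]
    AS bigint
    START WITH 1
    INCREMENT BY 1;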
• Calculate the next value? Yes
Microsoft says:
If privacy is a concern, do not use this function. It is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.
So it is possible to get the next value. I couldn't find information on whether it is possible to get the previous one.
from: http://msdn.microsoft.com/en-us/library/ms189786.aspx
Edit: a few more words about NEWSEQUENTIALID and security: http://vadivel.blogspot.com/2007/09/newid-vs-newsequentialid.html
Edit:
NewSequentialID contains the server's MAC address (or one of them), therefore knowing a sequential ID gives a potential attacker information that may be useful as part of a security or DoS attack.
from: Are there any downsides to using NewSequentialID?

SQL Server 'In' versus 'less than'

I've got a table of addresses that I'm replicating. There are 14 different types of addresses available. In order to cut down on the replicated data, I'm filtering on the AddressType field. The field is an int, and has a value of 1 to 14. I originally had a filter of
AddressType = 2
, as I was only interested in addresses with that type. However, a recent change requires that I have both AddressTypes 1 and 2 replicated. At first I changed the filter to
AddressType in (1,2)
Would I be better off with a filter of
AddressType < 3
Thoughts?
There can be a significant difference as the numbers get larger. You won't see a performance difference at the smaller numbers but you will see a difference as it gets larger, especially if there is an index on AddressType. Your IN () version essentially gets translated to:
WHERE AddressType = 1
OR AddressType = 2
OR ...
I do agree with the others for this specific case. (1) the performance difference when there are only 14 values is unlikely to be noticeable. (2) Jonathan's point that IN () more accurately reflects what you want to do is a good one, also.
But for future readers who may have far more possible values, I think it's important to note how things can change when the list is not so limited (or when < and IN () no longer offer the same functionality, e.g. when an address type changes). At larger sizes, even when everything in the IN () list conveniently matches a range criterion, there are still other things to consider: it is less convenient to type IN (1,2,3,4,5,6,7,8,9,10, ...), and at the extreme it can lead to a much larger batch size.
When your requirements change and you need types 1, 2, 9 and 14, the IN formulation will be better. The list comparison more accurately reflects what you are doing (choosing two types from a small list of possible values).
The less than notation happens to work, but it is coincidental that the representation of the types is susceptible to range comparisons like that.
In terms of performance, there is essentially nothing to choose between the two. The less than operation will perhaps be marginally quicker, but the margin is unlikely to be measurable.
Execution plans look identical. I'm inclined to say you should go with IN, in case you need to add another address type like "5" that would force you to rewrite the < query. IN is a lot more extensible because it doesn't matter what you add to it.
All the answers are fine; in the same vein as the other posters, it's "what else you might do" that might make a difference.
So always consider NULL in any comparison. Your query is fine with respect to NULLs as written, but if NULLs are possible, and if you later change or reuse the SQL ad hoc, say to negate it, the IN and the range comparison can surprise you.
For instance, consider how NOT IN (1,2) behaves versus >= 3 (or whatever incarnation we might use): a row with a NULL AddressType satisfies neither predicate, because any comparison involving NULL evaluates to UNKNOWN, so such rows silently drop out of both result sets (NULLs in comparisons).
Considering NULLS should be like breathing in SQL creation.
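A tiny sketch of that NULL pitfall (hypothetical temp table and data):
CREATE TABLE #addr (AddressType int NULL);
INSERT INTO #addr (AddressType) VALUES (1), (2), (3), (NULL);

-- Both queries return only the AddressType = 3 row: the NULL row satisfies
-- neither predicate, because a comparison involving NULL evaluates to UNKNOWN.
SELECT * FROM #addr WHERE AddressType NOT IN (1, 2);
SELECT * FROM #addr WHERE AddressType >= 3;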

Precision gains when data moves from one table to another in SQL Server

There are three tables in our SQL Server 2008 database:
transact_orders
transact_shipments
transact_child_orders.
All three of them have a common column, carrying_cost. The data type is the same in all three tables: it is float with NUMERIC_PRECISION 53 and NUMERIC_PRECISION_RADIX 2.
In table 1 - transact_orders this column has value 5.1 for three rows. convert(decimal(20,15), carrying_cost) returns 5.100000..... here.
Table 2 - transact_shipments three rows are fetching carrying_cost from those three rows in transact_orders.
convert(decimal(20,15), carrying_cost) returns 5.100000..... here also.
Table 3 - transact_child_orders is summing up those three carrying costs from transact_shipments. And the value shown there is 15.3 when I run a normal select.
But convert(decimal(20,15), carrying_cost) returns 15.299999999999999 in this table. And that precision-gained value is showing in the UI as well, though the UI is only fetching the value, not doing any conversion. In the Java code, the variable that fetches the value from the database is defined as double.
The code in step 3 to sum up the three carrying costs is simply:
...sum(isnull(transact_shipments.carrying_costs,0)) sum_carrying_costs,...
Any idea why this change occurs in the third step ? Any help will be appreciated. Please let me know if any more information is needed.
Rather than post a bunch of comments, I'll write an answer.
Floats are not suitable for precise values where you can't accept rounding errors - for example, finance.
Floats can scale from very small numbers to very large numbers, but they don't do that without losing a degree of accuracy. You can look the details up online; there is a host of good work out there for you to read.
But, simplistically, it's because they're true binary numbers - some decimal numbers just can't be represented as a binary value with 100% accuracy. (Just like 1/3 can't be represented with 100% accuracy in decimal.)
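You can reproduce the effect you're seeing without any tables at all; a minimal T-SQL sketch:
DECLARE @c float = 5.1;   -- 5.1 has no exact binary representation

SELECT CONVERT(decimal(20,15), @c)           AS single_value,   -- rounds back to 5.100000000000000
       CONVERT(decimal(20,15), @c + @c + @c) AS summed_value;   -- the error accumulates: 15.299999999999999, as in the question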
I'm not sure what is causing your performance issue with the DECIMAL data type; often it's because there is some implicit conversion going on. (You've got a float somewhere, or decimals with different definitions, etc.)
But regardless of the cause, nothing is faster than integer arithmetic. So, store your values as integers? £1.10 could be stored as 110p. Or, if you know you'll get some fractions of a penny for some reason, as 1100dp (deci-pennies).
You do then need to consider the biggest value you will ever reach, and whether INT or BIGINT is more appropriate.
Also, when working with integers, be careful of divisions. If you divide £10 between 3 people, where does the last 1p need to go? £3.33 for two people and £3.34 for one person? £0.01 eaten by the bank? But, invariably, it should not get lost to the digital elves.
And, obviously, when presenting the number to a user, you then need to manipulate it back to £ rather than dp; but you need to do that often anyway, to get £10k or £10M, etc.
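A rough sketch of that store-as-integers idea (table and column names invented for illustration):
-- Store the amount as a whole number of the smallest unit you care about.
CREATE TABLE order_costs (
    order_id              int    NOT NULL PRIMARY KEY,
    carrying_cost_pennies bigint NOT NULL   -- £1.10 stored as 110
);

-- Convert back to pounds only when presenting the value.
SELECT order_id,
       carrying_cost_pennies / 100.0 AS carrying_cost_pounds
FROM order_costs;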
Whatever you do, and if you don't want rounding errors due to floating point values, don't use FLOAT.
(There is a lot written online about how to use floats and, more importantly, how not to. It's a big topic; just don't fall into the trap of "it's so accurate, it's amazing, it can do anything" - I can't count the number of times people have screwed up data through that unfortunately common but naive assumption.)

Which data type to use for ordinal?

Whenever I have some records/objects that I want to be in a certain order, I usually create a field called Ordinal.
I often wonder if it would be better to use an integer or a decimal value for the ordinal field.
This is a consideration when moving an object to a different position in the order:
If you use consecutive integers, you have to do some serious reworking of all of the ordinals (or at least the ordinals that fall before the original position of the object being moved).
If you use integers but space them out (maybe at 1000 intervals), then you can just change the ordinal to a mid point value between the surrounding objects where you want to move the object. This could fail if somewhere down the line you end up with consecutive integers.
If you use decimal numbers, you could just find the average of the surrounding objects' ordinals and use that for the object to be moved.
Maybe it would be possible to use a string, but I could see that getting pretty goofy.
I'm sure there are other considerations I haven't thought of.
What do you use and why?
"This could fail if somewhere down the line you end up with consecutive integers."
For this (probably rare and thus not performance important) case, you could implement a renumber method that spaces out again. When I used to program in COMAL (anyone know that language?), you could do this very thing with line numbers.
Decimals seem to solve your problem pretty well. Since decimals are just base-10 floats, you actually have a lot of digits available. Unless you've seen cases where you've gotten out to quite a few digits and have reason to suspect that an unlimited number of digits might be necessary, I'd let it ride.
If you really need an alternative and don't see a need to stick with a basic data type, you might go with tumbler arithmetic. The basic idea is that it's a place notation that is infinitely expandable at each position. Pretty simple conceptually.
I used to use a decimal type for a field of this kind to order records in a table, which we actually exposed to the customer so that they could set their own order. Although it sounds cheesy our customers liked it; they found it very intuitive. They caught on very quickly that they could use numbers like 21.5 to move something between 21 and 22.
Maybe it's because they were accountants.
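That midpoint move is a one-liner (hypothetical table and column names):
-- Move item 42 between the items currently at ordinals 21 and 22.
UPDATE items
SET Ordinal = (21 + 22) / 2.0   -- 21.5
WHERE ItemID = 42;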
I use integers and just rearrange as necessary when a new item needs to be inserted in the middle of the order. Since you can create the necessary gap with a single update statement (see the sketch below), it's fairly trivial. However, I've only ever done this on lookup tables of a few dozen rows at most; obviously this scales a bit poorly. But I would say that if you need a solution to this problem for a large number of rows, the process(es) for maintaining the order should be proceduralized anyway, which makes the choice of data type largely moot.
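For example, the single-statement gap creation mentioned above might look like this (hypothetical names):
-- Make room at position 5 by shifting everything at or after it down the list...
UPDATE lookup_items
SET Ordinal = Ordinal + 1
WHERE Ordinal >= 5;

-- ...then insert the new item into the gap.
INSERT INTO lookup_items (Name, Ordinal) VALUES ('New item', 5);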
I remember this being a similar question to a previous post. It can be found here:
SQL Server Priority Ordering
The linked list would still work, but this is a much easier solution if you don't want to track a parent child relationship.
Sounds like what you want is a linked list. That way you always know what comes next and you don't have to guess. So the position field would be a pointer to the object following it.
The problem I have always had with using arbitrary numbers for position, is that it can quickly fall to entropy. What if more items get added and the number become consecutive etc. etc. It can quickly become unmanageable if the list of items changes position.
To implement this in a SQL Server table, add another field with the same data type as the primary key. If the field is NULL then it is the bottom element in the list. If you are storing multiple lists in the same table, you will probably want to add another field called ListID, which designates that all rows with the same ListID pertain to the same list. So, something like this.
Table:
ID INT
ListID INT
Child INT

Parent row for first list: 1, 1, 2
First child: 2, 1, 3
Second child: 3, 1, NULL
Parent row for second list: 4, 2, 5
First child: 5, 2, 6
Second child: 6, 2, NULL
You'll probably have to do an insert and an update every time you add a row, which can be a little tedious, but it will always make the list line up.
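A minimal sketch of that insert-plus-update step, using the structure above (names are illustrative):
CREATE TABLE ListItems (
    ID     int NOT NULL PRIMARY KEY,
    ListID int NOT NULL,
    Child  int NULL   -- ID of the item that follows; NULL marks the end of the list
);

-- Append a new item (ID 7) to the end of list 1: insert the new tail...
INSERT INTO ListItems (ID, ListID, Child) VALUES (7, 1, NULL);
-- ...then point the old tail (ID 3 in the example above) at it.
UPDATE ListItems SET Child = 7 WHERE ID = 3;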
Is the "certain order" based on data outside of the table? If so, why not include it so you can do the sorting dynamically? If it's already in the table, adding a field is redundant.