Should we put units of measurement in attribute names? [closed] - naming-conventions

I think most of us agree that it's a good idea to use a descriptive name for variables, object attributes, and database columns. If you want to store something's name, you may as well call the attribute Name so people know what to put in it.
Where the unit of measurement isn't immediately apparent, I think you should go a step further and include the unit of measurement in the name. Length_mm, for example, should help remind developers that they'd better convert the length to mm if the user just entered it in inches.
My database administrator, however, just told me that including units of measurement in database column names is “frowned upon”. I think that's just nuts, but perhaps there's some risk DBAs know about that I don't.
Throw me a line here: should we embed units of measurement in our attribute names? Why? Why not?

If you have a consistent UOM for things, then your DBA's policy is OK.
For example, if timespans are ALWAYS in minutes, etc.
If the UOM could change, then you should store it in another column, alongside the qty.
That said, I tend to side with you on this. Clarity trumps most things, including this. I'd rather see DurationMinutes than Duration and have to guess what the UOM is.
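To make that contrast concrete, here is a minimal sketch (hypothetical Python names, not from the answer above): a fixed unit is baked into the attribute name, while a variable unit gets its own field.
from dataclasses import dataclass

@dataclass
class Task:
    duration_minutes: int   # the unit never changes, so it lives in the name

@dataclass
class Measurement:
    value: float            # the quantity itself
    unit: str               # the unit could change, so it is stored alongside, e.g. "minutes"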

Yes. You should.
The key, as Charles Bretana pointed out, is legibility and that the other users of your table or developers following you know what you're using.
I would absolutely involve the units/measurement in a field name - in my business you can't guess what you'll find from the context or name: a field entitled MarketValue - is that in millions, thousands or units? US Dollars, Euros, pounds, $CURRENCY? Is that value a percentage, a ratio? Absolute or relative? Daily, monthly, calendar year, financial year? That timestamp, what time zone is it?
Your first, last and only task when providing data is to ensure that it isn't used incorrectly because the consumer wasn't able to find out enough about it. As developers, throwing "Metre", "USD", "GMT", "Percent" or whatever into a field name isn't the least bit smelly.
There are enormous smells that need resolving before the tiny whiff of field naming needs standardising.

This is why the Mars Climate Orbiter was lost: one team's software produced thruster data in pound-force seconds while the other's expected newton-seconds (or something like that).
Although "Never say 'Never' or 'Always'" is, in general, a good rule of thumb, here I will bend my rule and say I think you should "always" make it clear what units a numeric value is in.

The convention of naming all my columns in the format:
{name}_in_{unit}
helped for one project; since I was using SI units, it ended up letting me infer the column data type and generally simplified my writing style.
length_in_m
speed_in_ms-1
color_in_nm
There were a few exceptions that I handled either with at_time or number_of:
started_at_time
updated_at_time
number_of_rotations
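Applied to an application-side model, the convention might look something like the sketch below (hypothetical Python, with identifiers adapted to be valid names, e.g. ms-1 becomes ms_1):
from dataclasses import dataclass

@dataclass
class Reading:
    length_in_m: float          # metres
    speed_in_ms_1: float        # metres per second (m s^-1)
    color_in_nm: float          # wavelength in nanometres
    started_at_time: str        # one of the noted exceptions; an ISO 8601 timestamp is assumed
    number_of_rotations: int    # dimensionless count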

I think this is a good idea anywhere since there is always room for ambiguity.
For example, with the high-performance timer class we use, I keep having to check whether the GetElapsed() method returns seconds, milliseconds, or something else. If it were called GetElapsedMilliseconds() that would save the confusion.
The only downside being if you wanted to change your mind ... but in that case any clients would need to know about the change anyway.
F# has an interesting twist on this allowing measurement units to be specified in the type system. See this blog post, and another stackoverflow question discussing Are units of measurement unique to F#?
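There is no compile-time equivalent in Python, but as a rough runtime analogue a units library can catch the same class of mistakes; a hedged sketch using the third-party pint package:
import pint

ureg = pint.UnitRegistry()

distance = 10 * ureg.meter
time = 5 * ureg.second
speed = distance / time                        # 2.0 meter / second
print(speed.to(ureg.kilometer / ureg.hour))    # conversions are explicit and checked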

I've done a lot of database work, and I would not frown upon that at all, nor have I heard of frowning on it.
It's better than extended properties, which are not apparent to the casual developer. It's better than a separate document, because many developers won't read it, and certainly not in great detail. If the units are set, then having them in the name sounds like a good idea. If that changes, then when the unit field is added, change the name of the measurement field.

Where the unit of measurement isn't immediately apparent, I think you should go a step further and include the unit of measurement in the name. Length_mm, for example, should help remind developers that they'd better convert the length to mm if the user just entered it in inches.
You could go even a step further (in your code, not in the database) and have a Length type, which takes care of the measurement unit and of possible conversions. This is the approach of the "Quantity" pattern in Martin Fowler's "Analysis Patterns" book.
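A minimal sketch of that idea in Python (hypothetical names, loosely following the Quantity pattern rather than Fowler's exact code):
class Length:
    _TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0, "in": 25.4}

    def __init__(self, value, unit):
        if unit not in self._TO_MM:
            raise ValueError(f"unknown unit: {unit}")
        self._mm = value * self._TO_MM[unit]   # always normalised to a canonical unit

    def in_unit(self, unit):
        return self._mm / self._TO_MM[unit]

    def __add__(self, other):
        return Length(self._mm + other._mm, "mm")

With such a type, Length(3, "in").in_unit("mm") gives 76.2, and the mm-versus-inches confusion from the question cannot arise in the first place.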

Do not put units of measurement (or column type) in your database column names.
Many Databases have the ability to document/comment columns in some way (in SQL Server it is sp_addextendedproperty), I would suggest that is a more appropriate place.

For Python datetimes, consider using objects from the datetime package. Doing so will capture the unit implicitly, to microsecond resolution. There is then no basis for including the unit in the variable name.
If you must use an int or float instead, it is strongly recommended to suffix the unit name abbreviation to the variable name. For example, instead of the variable name diff, use diff_secs for seconds, diff_ms for milliseconds, diff_µs for microseconds, or diff_ns for nanoseconds.
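For instance (a small illustrative sketch, not from the original answer):
from datetime import datetime

start = datetime.now()
# ... do some work ...
diff = datetime.now() - start      # a timedelta: the unit is implicit in the type

# With a bare number instead, the unit has to go into the name:
diff_secs = diff.total_seconds()
diff_ms = diff.total_seconds() * 1000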

We don't put units of measurement in column names in our database. We do, however, have a data dictionary document where all of the columns and relationships are described.

The ideal approach is, if possible, to use a type that leaves no ambiguity as to the measurement. For example in .NET rather than saying int periodInSeconds you'd be much better off using TimeSpan period.
The F# language actually has units of measurement as part of the type system so you can declare types in units such as 10<m/s> and 5<s> and even perform calculations on them so something like 10<m/s> * 5<s> would result in 50<m>. See here for more info.
So I'd say if possible use a type that conveys your intention, but if that isn't possible then you should probably encode the measurement into the name. It's better and more obvious than a comment.

You definitely want units of measurement somewhere. I don't know if the column names are a good place or if the schema is better. Ask your database administrator:
Where is the information about units of measure stored?
How can I get access to the units programmatically?
If the answers are "it isn't" or "you can't", complain bitterly---they have no right to deny you your naming convention. Otherwise, all may be happier if you work within the system.
P.S. I really like the support for units of measure that they've put into F#.

I have to say, I hate "descriptive" variable names becoming "incredibly verbose" variable names.
My preferred alternative is to use nothing but the unit-of-measure names in short functions. Eg.
function velocity(m, s) {
    return m / s;
}
You don't need to say "length_m" because in this context, it's obvious that only lengths are measurable in metres.
Having said that. If I was writing a system where units of measure errors were really dangerous, I'd probably use the type system and define a Length class which always converted itself into a standard unit for any calculation. Maybe even different sub-classes for Feet, Metres etc.

NO, the name of the attribute is separate from its unit of measurement.
If you call a variable length_mm then you are tied to mm.
What if you use a 32-bit int to store length_mm? Eventually the length in mm may grow past roughly 2.1 billion, the limit of a signed 32-bit int. You can't switch over to m because you tied your length variable to length_mm.

I think putting units in your identifiers is a huge design smell. It almost surely means that you chose the wrong language: if units are so important to the project, you'd better be using a language whose type system is capable of representing them.

Related

Why does class Money extend Expression in Kent Beck's TDD by Example?

I'm studying TDD by Example and so far I'm finding it a great book. But there's a point where he tells us to write:
// in class Money:
Expression plus(Money addend) {
    return new Money(amount + addend.amount, currency);
}
Which won't build unless we declare:
class Money implements Expression {...
This doesn't really make sense to me. The author created Expression as an interface for Sum, and Money has nothing in common with Sum. Later on he adds a method reduce() common to both classes, but reduce in Money merely returns this.
Making Money implement Expression is only the path of least effort for removing the error from the plus() method, but it fills the code with unnecessary information (it will have to implement reduce() because of this decision) and increases entropy.
I haven't given it much of a thought, but wouldn't it be cleaner to do something such as this?
class Money {
    Money plus(Money addend) {
        return new Money(amount + addend.amount, currency).reduce();
    }
}
// edited this, it previously returned Expression
Edit:
In the following chapter, the author implements a reduce() method in another class (named Bank), which converts Money between currencies. I still find this to be a weird solution; the Sum and Expression names imply we should have a conversion expression class for this task instead. What the author is likely intending to do is to use recursion to add money in different currencies. Either way, it seems to me that he did some far-ahead planning, which seems incompatible with TDD as it is presented in the book.
On planning ahead
TDD doesn't prohibit planning ahead. The process is about getting rapid feedback on your plans instead of wasting days (or weeks) creating elaborate plans, only to see them 'not survive contact with reality' (to paraphrase Helmuth von Moltke). It's okay to think ahead.
Still, Kent Beck reveals in chapter 17 that this isn't his first rodeo:
"I have programmed money in production at least three times that I can think of. I have used it as an example in print another half-dozen times. I have programmed it live on stage [...] another fifteen times. I coded another three or four times preparing for writing [...] Then, while I was writing this, I thought of using expression as a metaphor and the design went in a completely different direction than before."
So, if you think that he's cheating: yes, he is. He's open about it, though. I think that the motivation was to present a compelling example. He also writes that it was in part based on early reviews of the book.
On the API
That doesn't explain why the code looks like it does, but there's a reason for that. It's actually a nice API.
Why can't you write something like return new Expression(amount + addend.amount, currency).reduce()?
You can't because the reduce method isn't nullary. It takes arguments. You must supply both a bank (which holds the currency conversion rates) and a destination currency.
Keep in mind the problem being solved by the code. People get this wrong all the time, and I think that Kent Beck (inadvertently) fuelled the confusion by naming the example Money.
The problem isn't to model money, but to model investment portfolios in different currencies. If you have a portfolio of 25.000 USD and 10.000 CHF, reducing it to a single currency hides important details. With a portfolio in multiple currencies you diversify risk. Portfolio owners will want to see more than one view of their portfolios. Sometimes, they want to see the portfolio grouped by currency, and at other times they want to see 'the current total worth' of the portfolio presented in a single currency.
The API in the book enables both views.
The reason the underlying 'metaphor' is called an expression is that the API is just a specialised expression tree. It turns out fairly well, though, because it's lawful. Restricted to the sub-types presented in the book, it informally gives rise to a monoid.
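For readers without the book at hand, here is a loose Python rendering of the shape of that API (names follow the book only approximately; treat it as an illustration, not Kent Beck's code):
class Bank:
    def __init__(self):
        self._rates = {}                  # (from_currency, to_currency) -> rate

    def add_rate(self, src, dst, rate):
        self._rates[(src, dst)] = rate

    def rate(self, src, dst):
        return 1 if src == dst else self._rates[(src, dst)]

class Money:
    def __init__(self, amount, currency):
        self.amount, self.currency = amount, currency

    def plus(self, addend):
        return Sum(self, addend)          # preserves the multi-currency view

    def reduce(self, bank, to):
        return Money(self.amount / bank.rate(self.currency, to), to)

class Sum:
    def __init__(self, augend, addend):
        self.augend, self.addend = augend, addend

    def reduce(self, bank, to):
        return Money(self.augend.reduce(bank, to).amount
                     + self.addend.reduce(bank, to).amount, to)

A portfolio such as Money(25000, "USD").plus(Money(10000, "CHF")) stays in multiple currencies until you explicitly ask for reduce(bank, "USD"), which gives exactly the pair of views described above.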

Object condition in multiple places/repeated code (DRY)

This is a fundamental application design question I’ve struggled with and flip-flopped on for years. We have a legacy webapp that doesn't really have a solid ORM, if that tidbit might influence your answer. To abstract my question, let’s say we have a class Car and a corresponding table in our database named car. Car has a few properties: color, weight, year, maxspeed. These properties directly correspond to columns in the db table.
In our application, we define the car as “classic old” if year is < 1960 and color = black. And in many places within our app knowing whether the car is "classic old" is extremely important (maybe we’re running a very illogical insurance agency which gives steep discounts and other perks to cars which are “classic old”).
All over our application, we do things like:
--list all classic old cars
--give the current user a discount if their car is classic old
--list all classic old cars with max speed > 100 miles per hour
--email the current user if their car is classic old and weights more than 1000 pounds
What is the best way to go about this? We have a legacy application that does this in some places:
getOldClassicCars()
select * from car where year < 1960 and color = 'black'
and in other places:
cararray = getAllCars();
for each car in cararray
    if car.year < 1960 and car.color = 'black'
        oldcararray.add(car)
The point being that this very important, fundamental piece of our application – is the car classic old – is “hardcoded” as year < 1960 and color = black in many places. Sometimes in SQL, sometimes in application code, etc. Obviously that is not good, but as we’ve refactored things I’m not sure we’re refactoring things the best way we can.
Well, you are stuck with the fundamental problem that
you can't run your code on the database
you want to be able to use the database's selection functionality for this criterion
you want the calculation of "classic old" to be defined in a single place (preferably code)
Let's enumerate the solutions:
1: Put the calculation in a sproc and always use the sproc to retrieve cars.
The problem here is that if you create a new car in code, its classic status is undefined, so you haven't really solved the 'not in two places' problem.
2: Get the DB to run your calc via an assembly. For example, you can get MSSQL to run functions from a .NET assembly which you can also use in your code base to perform the same calculation.
Problem: it's hard work. Plus, essentially it's still in two places; you have to keep the db up to date and ensure that the table is accessed correctly.
3: Persist the calculated value on the DB, but perform the calc in the code
Problem: if the calculation changes, the DB values will be incorrect and will need updating.
3 seems to be the best option, as we will know when the calculation changes and be able to take some action to resolve the situation.
However, it might be best, given the fundamental nature of this calculation, to make that 'out of dateness' implicit in the way we structure the code.
Instead of simply persisting car.IsClassic we could add a CarStatusReport object with a datetime property. We then generate a CarStatusReport(2017) which evaluates all the cars at that point in time and saves that data in a separate table.
Our business logic is then no longer, "Is this car a classic?" but "What does the latest CarStatusReport say the status of this car is?"
Your business logic will then reside in a single CarStatusReportGenerator service, and any other logic accessing the IsClassic calculation will be forced to acknowledge the ephemeral nature of the stored info.
There is no optimal solution here. But one good step would be to move all the business logic into one place. If you can't (e.g. when you make methods or functions calculating some property, such as isOld()), then hide all those inconsistencies under the hood, so that users of the implementation will (conceptually) never notice the DRY violation from outside.
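A minimal sketch of putting the rule in one place (hypothetical Python names; the SQL fragment is generated from the same constants the in-memory check uses):
CLASSIC_MAX_YEAR = 1960
CLASSIC_COLOR = "black"

def is_classic_old(car):
    # The single in-code definition of "classic old".
    return car.year < CLASSIC_MAX_YEAR and car.color == CLASSIC_COLOR

def classic_old_sql_filter():
    # Parameterised fragment for the places that must filter in the database.
    return "year < %s AND color = %s", (CLASSIC_MAX_YEAR, CLASSIC_COLOR)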

Complex derived attributes in Django models

What I want to do is implement submission scoring for a site with users voting on the content, much like in e.g. reddit (see the 'hot' function in http://code.reddit.com/browser/sql/functions.sql). Edit: Ultimately I want to be able to retrieve an arbitrarily filtered list of arbitrary length of submissions ranked according to their score.
My submission model currently keeps track of up and down vote totals. Currently, when a user votes I create and save a related Vote object and then use F() expressions to update the Submission object's voting totals. The problem is that I want to update the score for the submission at the same time, but F() expressions are limited to only simple operations (it's missing support for log(), date_part(), sign() etc.)
From my limited experience with Django I can see 5 options here:
extend F() somehow (haven't looked at the code yet) to support the missing SQL functions; this is my preferred option and seems to fit within the Django framework the best
define a scoring function (much like reddit's 'hot' function) in my database, and have Django use the value of that function for the value of the score field; as far as I can tell, #2 is not possible
wrap my two step voting process in a suitably isolated transaction so that I can calculate the voting totals in Python and then update the Submission's voting totals without fear that another vote against the submission could be added/changed in the meantime; I'm hesitant to take this route because it seems overly complex - what is a "suitably isolated transaction" in this case anyway?
use raw SQL; I would prefer to avoid this entirely -- what's the point of an ORM if I have to revert to SQL for such a common use case as this! (Note that this coming from somebody who loves sprocs, but is using Django for ease of development.)
(edit: added this after further discussion) compute the score using an extra select parameter containing a call to my function; this would work but impose unnecessary load on the DB (would be forced to calculate the score for every submission ever made every time the query ran; caching could help here, but it still seems like a bit of lame workaround)
Before I embark on this mission to extend F() (which I'm not sure is even possible), am I about to reinvent the wheel? Is there a more standard way to do this? It seems like such a common use case and yet in an hour of searching I have yet to find a common solution...
EDIT: There is another option: set the default value of the field in the database script to be an expression containing my function. This is not as flexible as #1, but probably the quickest and cleanest approach to solving the problem (although my initial investigation into extending F() looks promising).
Why can't you just denormalize the score and reconstruct it from the Vote objects every once in a while?
If you can't do that, it is very easy to make a 'property' function that acts as an object attribute for scoring.
@property
def score(self):
    # ... calculate score from the related Vote objects ...
    return score
I've never used F() on a property like this, but it's Python, so I bet it works.
If you are using django-voting (which I recommend), you can put #3 in the manager's record_vote function since that's how all vote transactions take place.
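As a rough sketch of the denormalise-and-recompute suggestion (model and field names are assumptions, not taken from the question's code, and it targets current Django):
from django.db import models

class Submission(models.Model):
    score = models.FloatField(default=0.0)

    def recompute_score(self):
        ups = self.vote_set.filter(is_up=True).count()
        downs = self.vote_set.filter(is_up=False).count()
        # Substitute the real 'hot' formula here; a plain difference keeps the sketch simple.
        self.score = float(ups - downs)
        self.save(update_fields=["score"])

class Vote(models.Model):
    submission = models.ForeignKey(Submission, on_delete=models.CASCADE)
    is_up = models.BooleanField()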

how to store an approximate number? (number is too small to be measured)

I have a table representing standards of alloys. The standard is partly based on the chemical composition of the alloys. The composition is presented in percentages. The percentage is determined by a chemical composition test. Sample data.
But sometimes, the lab cannot measure below a certain percentage. So they indicate that the element is present, but the percentage is less than they can measure.
I was confused about how to accurately store such a number in an SQL database. I thought of storing the number with a negative sign. No element can have a negative composition, of course, but I can interpret this as "less than the specified value". Another option is to add an extra column for each element!! The latter option I really don't like.
Any other ideas? It's a small issue if you think about it, but I think a crowd is always wiser. Somebody might have a neater solution.
Question updated:
Thanks for all the replies.
The test results come from different labs, so there is no common lower bound.
When the percentage of titanium is less than 0.0004, for example, the number is still important; only the formula will differ slightly in this case.
Hence the value cannot be stored as NULL, and I don't know the lower bound for all values.
Tricky one.
Another possibility I thought of is to store it as a string. Any other ideas?
What you're talking about is a sentinel value. It's a common technique. C strings, after all, use 0 as a sentinel end-of-string value. You can do that; you just need to find a number that makes sense and isn't used for anything else. Many string functions will return -1 to indicate that what you're looking for isn't there.
0 might work, because if the element isn't there, there shouldn't even be a record; but you face the problem that it might be mistaken for actually meaning 0. -1 is another option, which obviously doesn't have that problem.
Another column to indicate if the amount is measurable or not is also a viable option. The case for this one becomes stronger if you need to store different categories of trace elements (eg <1%, <0.1%, <0.01%, etc). Storing the negative of those numbers seems a bit hacky to me.
You could just store it as NULL, meaning that the value exists but is undefined.
Any arithmetic operation with a NULL yields a NULL.
Division by NULL is safe.
NULLs are ignored by the aggregate functions, so queries like this:
SELECT SUM(metal_percent), COUNT(metal_percent)
FROM alloys
GROUP BY metal
will give you the sum and the count of the actual, defined values, not taking the unfilled values into account.
I would use a threshold value which is at least one significant digit smaller than your smallest expected value. This way you can logically say that any value less than, say, 0.01 can be presented to your application as a "trace" amount. This remains easy to understand and gives you flexibility in determining where your threshold should lie.
Since the constraints on the values are well defined (you cannot have a negative composition), I would go for the "negative value to indicate less-than" approach. As long as the use of such sentinel values is sufficiently documented, it should be reasonably easy to implement and maintain.
An alternative but similar method would be to add 100 to the values, assuming that you can't get more than 100%. So <0.001 becomes 100.001.
I would have a table modeling the certificate, in a one to many relation with another table, storing the values for elements. Then, I would still have the elements table containing the value in one column and a flag (less than) as a separate column.
Draft:
create table CERTIFICATES
(
    PK_ID integer,
    NAME  varchar(128)
)

create table ELEMENTS
(
    ELEMENT_ID     varchar(2),
    CERTIFICATE_ID integer,
    CONCENTRATION  number,
    MEASURABLE     integer
)
Depending on the database engine you're using, the types of the columns may vary.
Why not add another column to store whether or not it's a trace amount? This will also allow you to save the amount that the trace is less than.
Since there is no common lowest threshold value and NULL is not acceptable, the cleanest solution now is to have a marker column which indicates whether there is a quantifiable amount or a trace amount present. A value of "Trace" would indicate to anybody reading the raw data that only a trace amount was present. A value of "Quantity" would indicate that you should check an amount column to find the actual quantity present.
I would have to warn against storing numerical values as strings. It will inevitably add additional pain, since you now lose the assertions a strong type definition gives you. When your application consumes the values in that column, it has to read the string to determine whether it's a sentinel value, a numeric value or simply some other string it can't interpret. Trying to handle data conversion errors at this point in your application is something I'm sure you don't want to be doing.
Another field seems like the way to go; call it 'MinMeasurablePercent'.

Using integers and requiring multiplication vs. using decimals as a data type - what are your thoughts?

What are your thoughts on this? I'm working on integrating some new data that's in a tab-delimited text format, and all of the decimal columns are kept as single integers; in order to determine the decimal amount you need to multiply the number by .01. It does this for things like percentages, weight and pricing information. For example, an item's price is expressed as 3259 in the data files, and when I want to display it I would need to multiply it in order to get the "real" amount of 32.59.
Do you think this is a good or bad idea? Should I be keeping my data structure identical to the one provided by the vendor, or should I make the database columns true decimals and use SSIS or some sort of ETL process to automatically multiply the integer columns into their decimal equivalent? At this point I haven't decided if I am going to use an ORM or Stored Procedures or what to retrieve the data, so I'm trying to think long term and decide which approach to use. I could also easily just handle this in code from a DTO or similar, something along the lines of:
public class Product
{
    // ...
    private int _price;   // price in hundredths, as supplied by the vendor

    public decimal Price
    {
        get
        {
            return this._price * 0.01m;   // decimal literal avoids mixing double and decimal
        }
        set
        {
            this._price = (int)(value * 100m);
        }
    }
}
But that seems like extra and unnecessary work on the part of a class. How would you approach this, keeping in mind that the data is provided in the integer format by a vendor that you will regularly need to get updates from.
"Do you think this is a good or bad idea?"
Bad.
"Should I be keeping my data structure identical to the one provided by the vendor?"
No.
"Should I make the database columns true decimals?"
Yes.
It's so much simpler to do what's right. Currently, the data is transmitted with no "." to separate the whole part from the decimal part; that has no real significance.
The data is decimal. Decimal math works. Use the decimal math provided by your language and database. Don't invent your own version of Decimal arithmetic.
Personally I would much prefer to have the data stored correctly in my database and just do a simple conversion every time an update comes in.
Pedantically: they aren't kept as ints either. They are strings that require parsing.
Philosophically: you have information in the file and you should write data into the database. That means transforming the information in any way necessary to make it meaningful/useful. If you don't do this transform up front, then you'll be doomed to repeat the transform across all consumers of the database.
There are some scenarios where you aren't allowed to transform the data, such as being able to answer the question: "What was in the file?". Those scenarios would require the data to be written as string - if the parse failed, you wouldn't have an accurate representation of the file.
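A hedged sketch of doing that transform once at import time (the tab-delimited layout and column positions are assumptions):
import csv
from decimal import Decimal

SCALE = Decimal("0.01")

def load_prices(path):
    # Convert the vendor's scaled integers (e.g. "3259") to exact decimals (32.59)
    # once, on the way into the database, so every consumer sees real decimal values.
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            sku, raw_price = row[0], row[1]        # assumed column layout
            yield sku, Decimal(raw_price) * SCALE  # Decimal("3259") * 0.01 -> Decimal("32.59")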
In my mind the most important facet of using Decimal over Int in this scenario is maintainability.
Data stored in the tables should be clearly meaningful without the need for arbitrary manipulation. If manipulation is required, it should be clearly evident that it is (such as from the field name).
I recently dealt with data where days of the week were stored as the values 2-8. You cannot imagine the fallout this caused (testing didn't show the problem for a variety of reasons, but live use did cause political explosions).
If you do ever run in to such a situation, I would be absolutely certain to ensure data can not be written to or read from the table without use of stored procedures or views. This enables you to ensure the necessary manipulation is both enforced and documented. If you don't have both of these, some poor sod who follows you in the future will curse your very name.