Should I store both birthday and age? - sql

I'm making an app that uses both birthday and age to make some deductions,
but as the age can be obtained from the birthday and the current date, I'm questioning whether I should store them both and not just the date. On the one hand, I could use the age attribute to simplify some queries without converting dates. What would be the right thing to do following conventions?

Calculations based on data should always be... calculated, not stored. Well, not always, but usually;
it depends on the situation. Below are a couple of pros and cons:
Cons
the calculation logic might change, so stored values will no longer be valid.
invalid data could be entered (and you will receive invalid data when querying).
the result changes with time, as age does, e.g. today you are 20 years old, but in one year you will be 21.
Pros
as @RonenAriely mentioned, storing calculated data can improve performance.
So, to sum up, you should calculate the age from DateOfBirth at query time, as the result changes over time and the function doesn't influence performance much. Note that in MySQL DATEDIFF(NOW(), DateOfBirth) returns a number of days; TIMESTAMPDIFF(YEAR, DateOfBirth, CURDATE()) returns the age in completed years.

I would say store just the DOB and calculate the age when using.
I mainly prefer this because age changes continuously, and you would have to keep updating it, depending on how accurately you are measuring it. That defeats the purpose of computing once and using many times, because you'll be recomputing it over and over. And since it is redundant, it will also occupy space in your tables unnecessarily.
Hope it helped

Generally only birth date is stored.
You can create a common helper method to calculate age. Preferably static to avoid additional memory consumption.
Also saving age in database makes less sense as in such a case you would be required to run a daily cron to see which user's age is increasing by 1 that day and then update in the database.
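A minimal sketch of such a helper, in Python. The name age_on is made up for illustration, and the rule used (a Feb 29 birthday counts as reached on Mar 1 in non-leap years) is an assumption you may want to change:

```python
from datetime import date


def age_on(birth_date: date, today: date) -> int:
    """Return the number of completed years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years
```

Passing today in explicitly (instead of calling date.today() inside) keeps the function easy to test.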

As said here,
you have to ensure that it is not possible for the derived value to
become out-of-date undetected.
Birthday never goes out-of-date so you would be OK!

Better to follow the normalised approach and only store date of birth. Age would be marginally quicker to retrieve but, for that to be correct, you'd have to refresh the table on a daily basis.
If you were running a DB search on age range, then you could convert min and max ages to an upper and lower date of birth based on the current date and then search accordingly.
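The age-range search described above could look roughly like this. It is only a sketch: it uses Python with an in-memory SQLite database as a stand-in, and the users table, the dob column (ISO text) and the helper names are all invented for the example.

```python
import sqlite3
from datetime import date


def years_before(day: date, years: int) -> date:
    """Same calendar day `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:  # Feb 29 in a non-leap target year
        return day.replace(year=day.year - years, day=28)


def find_by_age(conn, min_age: int, max_age: int, today: date):
    # Anyone born on or before `latest` has already turned min_age.
    latest = years_before(today, min_age)
    # Anyone born on or before this date has already turned max_age + 1.
    earliest_exclusive = years_before(today, max_age + 1)
    rows = conn.execute(
        "SELECT name FROM users WHERE dob > ? AND dob <= ? ORDER BY name",
        (earliest_exclusive.isoformat(), latest.isoformat()),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, dob TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ann", "2000-11-27"), ("bob", "1990-05-01"), ("cy", "2010-01-15")],
)
aged_13_to_29 = find_by_age(conn, 13, 29, date(2024, 6, 1))
```

The comparison is done on the stored column itself, so nothing has to be computed per row.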

Related

Store Quarter/Year in SQL Server

I have a table which needs to store the quarter and year, and I need to know which is the best way to do this. I found this answer from 10 years ago on SO: Best way to store quarter and year in SQL Server. However, there are two suggestions given--one is storing quarter and year in separate columns and making them integers, the other being storing as a datetime and using the first day of the month for the day (i.e., 1/1/2021, 4/1/2021, etc.).
Considering this answer is 10 years old and there could be better ways now for storing this data, what is the best method?
FYI, this data will not be used for calculation purposes, but will probably be searched on.
Thanks!
I recommend always storing date related data as the datetime data type.
Storing them separately is the worst possible approach: searching becomes very difficult. Try writing the query returning all quarters between 3Q2019 and 1Q2021 when your year and quarter are separate.
Breaking it into separate parts puts the responsibility on the developer to handle the year boundary appropriately, which many do not.
DateTime data type also includes validation (Q5 2020 would throw an error) to prevent data errors.
Use the right tool for the job. DateTime data should always be stored in a DateTime datatype.
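To illustrate the range query mentioned above, here is a rough sketch in Python with an in-memory SQLite database standing in for SQL Server. The reports table, its columns and the quarter_start helper are made up for the example; each quarter is stored as the date of its first day.

```python
import sqlite3
from datetime import date


def quarter_start(year: int, quarter: int) -> date:
    """First day of the given quarter; rejects values such as Q5."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"invalid quarter: {quarter}")
    return date(year, 3 * (quarter - 1) + 1, 1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (label TEXT, quarter_start TEXT)")
for year in (2019, 2020, 2021):
    for quarter in (1, 2, 3, 4):
        conn.execute(
            "INSERT INTO reports VALUES (?, ?)",
            (f"{quarter}Q{year}", quarter_start(year, quarter).isoformat()),
        )

# All quarters from 3Q2019 through 1Q2021 with a single range predicate.
rows = conn.execute(
    "SELECT label FROM reports WHERE quarter_start BETWEEN ? AND ? "
    "ORDER BY quarter_start",
    (quarter_start(2019, 3).isoformat(), quarter_start(2021, 1).isoformat()),
).fetchall()
labels = [row[0] for row in rows]
```

With separate year and quarter columns the same query needs extra logic for the year boundary.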

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
table AS (
SELECT
"tom" AS name,
"11/27/2000" AS dob,
"45" AS age,
"['one', 'two']" AS info )
SELECT
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob)) birth_year,
ANY_value(PARSE_DATE('%m/%d/%Y', dob)) bod,
ANY_VALUE(name) example_name,
ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
table
GROUP BY
EXTRACT( year from PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic group by operation casting an item to a string vs not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster):
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there are other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or having a huge performance hit in various cases?
First of all, I don't really see any bigger show-stoppers than those you already know and listed.
Meanwhile,
though wouldn't really matter if using a query-builder ...
Based on the above excerpt, I wanted to touch upon one aspect of this approach (storing everything as strings).
While we are usually concerned about CASTing from string to a native type to apply the relevant functions and so on, I realized that building a complex and generic query with some sort of query builder in some cases requires the opposite: casting a native type to string in order to apply a function like STRING_AGG, just as a quick example.
So, my thoughts are:
When a table is designed for direct user access with trivial or even complex queries, having native types is beneficial both performance-wise and in being friendlier for users to understand, etc.
Meanwhile, if you are developing your own query builder and you design the table so that it will be available to users for querying via that query builder with some generic logic implemented, having all fields as strings can be helpful in building the query builder itself.
So it is a balance: you can lose a little in performance but you can win in being able to better implement a generic query builder. And such a balance depends on the nature of your business, both from a data perspective and in terms of what kind of queries you envision supporting.
Note: your question is quite broad and opinion-based (which, by the way, is not much respected on SO), so obviously my answer is totally my opinion, but it is based on quite a lot of experience with BigQuery.
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
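A quick way to see these problems outside the database, sketched in Python. The helper names parse_age and is_valid_dob are invented for the example, and the date format is the one from the question:

```python
from datetime import datetime

ages_as_text = ["9", "10", "45", "young"]

# Strings compare character by character, so "10" sorts before "9".
sorted_as_text = sorted(ages_as_text)


def parse_age(raw: str):
    """Return the age as an int, or None when the text is not a whole number."""
    try:
        return int(raw)
    except ValueError:
        return None


def is_valid_dob(raw: str, fmt: str = "%m/%d/%Y") -> bool:
    """True only when the text is a real calendar date in the given format."""
    try:
        datetime.strptime(raw, fmt)
        return True
    except ValueError:
        return False


# Only after conversion do the ages sort numerically.
numeric_ages = sorted(a for a in map(parse_age, ages_as_text) if a is not None)
```

With typed columns the database does this checking once, at insert time, instead of every reader doing it on every query.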
Do BigQuery databases have a notion of indexes?
If yes, then most likely these indexes become useless as soon as you start casting your strings to proper types. Compare
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
SAFE_CAST(age AS INT64) > 10
and SAFE_CAST(age AS INT64) < 30
It is normal that with fewer columns/rows you don't feel the problems. You start to feel them when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks to retrieve teenagers in the future, you'll need to convert the string to a date to get the age before you can do the manipulation.
Data size: The data size has broader impacts that cannot be seen at the start. For example, if you have N parallel test teams which require their own test systems, you'll need to allocate more disk space.
Read Performance: When you have more bytes to read in huge tables, it will cost you considerable time. For example, telco operators typically have a couple of billion rows of data per month.
If your code complexity increases, you'll need to replicate conversions in multiple places.
Even a single one of the above items should push you away from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data, for instance if someone is trying to write reports with it and do calculations or charts or date ranges it could be a big headache having to always cast or convert the data with whatever tool they are using. You or someone would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
From the solution below, you might face some storage and performance problems, you can find some guidance in the official documentation:
The main performance problem will come from the CAST operation; remember that the BigQuery engine will have to perform a CAST for each value in each row.
In order to test the compute cost of this operations, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However, if we execute the same query with the CAST operator added:
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, which might cause problems when scaling up the size of the operation.
Also, remember that each time you want to use the properties of the proper data type, you will have to cast your value and pay for the compute time required.
Finally, regarding storage, as you mentioned, strings do not have a fixed size, and that might cause a size increase.

When to use separate date and time instead of a single datetime

If I want to store date and time, is it better to store them in a separate date and time or use a single datetime?
When should we use date and time instead of a single datetime?
I want to filter my queries either using date or time.
When you are talking about a moment in time, whether a universal moment, or a specific date and time on someone's local calendar, you use a datetime. If you want to be sure that you are talking about an exact moment in time, regardless of the observer, then you use a datetimeoffset.
If you are storing just a date then you mean a date without a time component, meaning "any time on this date".
If you are storing just a time then you mean a time without a date component, meaning "this time on any date", or "this time on a date determined by some other means".
There is no practical purpose to having both a date and a time that are about the same thing, sitting on the same row. Just use a datetime for that.
In SQL Server 2008 you have date and time data types so this becomes a non issue.
Whether it is a good choice really depends on your business and how you will query your data.
If, for example, you want to know all the orders placed between 1 and 2 PM on any day, using separate Date and Time columns will make it quicker.
If you intentionally do not care about the time, it's more efficient to store this data as a date datatype. Think of a customer birthday column: there are not too many cases I can think of that would use the time. If there happens to be a time attached to it (often a bug), it needs to be removed via a convert statement in order to do a comparison. It also consumes additional space if you don't need these values (3 bytes compared to 8).
I think it's similar to having a status code table with the id as a bigint instead of a tinyint or the like (depending on how many status codes you would plan to have).
It's just a matter of what you're using the data for: if you think there's a good chance you'll ever need that data, then use datetime; otherwise use date.
There is nothing brilliant about separating date and time;
it is better to save the date and time in the same column.
The same issue is discussed here: are-there-any-good-reasons-for-keeping-date-and-time-in-separate-columns
You can also get the date and time separately with a query:
SELECT
CONVERT(VARCHAR(10),GETDATE(),111) as DatePart,
convert(varchar(15), getdate(), 108) TimePart
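The same idea on the application side, as a small Python sketch: keep a single datetime value and take the date or time part only when you need to filter on it. The sample orders list and the placed_between helper are invented for the example.

```python
from datetime import datetime, time

# Hypothetical order timestamps, each stored as one datetime value.
orders = [
    datetime(2021, 5, 1, 13, 30),
    datetime(2021, 5, 2, 9, 0),
    datetime(2021, 5, 3, 13, 59),
]


def placed_between(stamps, start: time, end: time):
    """Keep the timestamps whose time-of-day falls in [start, end), on any date."""
    return [stamp for stamp in stamps if start <= stamp.time() < end]


# Orders placed between 1 PM and 2 PM, regardless of the day.
early_afternoon = placed_between(orders, time(13, 0), time(14, 0))
order_dates = [stamp.date() for stamp in orders]
```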

How to calculate blocks of free time using start and end time?

I have a Ruby on Rails application that uses MySQL and I need to calculate blocks of free (available) time given a table that has rows of start and end datetimes. This needs to be done for a range of dates, so for example, I would need to look for which times are free between May 1 and May 7. I can query the table with the times that are NOT available and use that to remove periods of time between May 1 and May 7. Times in the database are stored at a fidelity of 15 minutes on the quarter hour, meaning all times end at 00, 15, 30 or 45 minutes. There is never a time like 11:16 or 10:01, so no rounding is necessary.
I've thought about creating a hash that has time represented in 15 minute increments and defaulting all of the values to "available" (1), then iterating over an ordered resultset of rows and flipping the values in the hash to 0 for the times that come back from the database. I'm not sure if this is the most efficient way of doing this, and I'm a little concerned about the memory utilization and computational intensity of that approach. This calculation won't happen all the time, but it needs to scale to happening at least a couple hundred times a day. It seems like I would also need to reprocess the entire hash to find the blocks of time that are free after this which seems pretty inefficient.
Any ideas on a better way to do this?
Thanks.
I've done this a couple of ways. First, my assumption is that your table shows appointments, and now you want to get a list of un-booked time, right?
So, the first way I did this was like yours, just a hash of unused times. It's slow and limited and a little wasteful, since I have to re-calculate the hash every time someone needs to know the times that are available.
The next way I did this was to borrow an idea from the data warehouse people. I build an attribute table of all time slots that I'm interested in. If you build this kind of table, you may want to put more information in there besides the slot times. You may also include things like whether it's a weekend, which hour of the day it's in, whether it's during regular business hours, whether it's on a holiday, that sort of thing. Then, I have to do a join of all slots between my start and end times where my appointments are null. So, this is a LEFT JOIN, something like:
SELECT *
FROM slots
LEFT JOIN appointments
ON ...
WHERE ...
AND appointments.id IS NULL
That keeps me from having to re-create the hash every time, and it's using the database to do the set operations, something the database is optimized to do.
Also, if you make your slots table a little rich, you can start doing all sorts of queries about not only the available slots you may be after, but also on the kinds of times that tend to get booked, or the kinds of times that tend to always be available, or other interesting questions you might want to answer some day. At the very least, you should keep track of the fields that tell you whether a slot should be one that is being filled or not (like for business hours).
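Here is a rough, self-contained sketch of that slots-table approach. It uses Python with an in-memory SQLite database rather than MySQL, and the table layout, column names and join condition are assumptions made for the example, not the schema from the question.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slots (start_time TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE appointments "
    "(id INTEGER PRIMARY KEY, start_time TEXT, end_time TEXT)"
)

# Eight 15-minute slots: 09:00 through 10:45 on May 1.
first = datetime(2021, 5, 1, 9, 0)
conn.executemany(
    "INSERT INTO slots VALUES (?)",
    [((first + timedelta(minutes=15 * i)).isoformat(sep=" "),) for i in range(8)],
)
# One booked block from 09:30 to 10:15.
conn.execute(
    "INSERT INTO appointments (start_time, end_time) VALUES (?, ?)",
    ("2021-05-01 09:30:00", "2021-05-01 10:15:00"),
)

# A slot is free when no appointment covers its start time.
free = [
    row[0]
    for row in conn.execute(
        """
        SELECT s.start_time
        FROM slots AS s
        LEFT JOIN appointments AS a
          ON s.start_time >= a.start_time AND s.start_time < a.end_time
        WHERE s.start_time >= ? AND s.start_time < ?
          AND a.id IS NULL
        ORDER BY s.start_time
        """,
        ("2021-05-01 09:00:00", "2021-05-01 11:00:00"),
    )
]
```

Consecutive free slots can then be merged into blocks in application code if you need ranges rather than individual slots.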
Why not have a flag in the row that indicates this? As time is allocated, flip the flag for every date/time in the appropriate range. For example, May 2, 12pm to 1pm, would be marked as not available.
Then it's a simple matter of querying the date range for every row that has the availability flag set to true.

Should we put units of measurements in attribute names? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
I think most of us agree that it's a good idea to use a descriptive name for variables, object attributes, and database columns. If you want to store something's name, you may as well call the attribute Name so people know what to put in it.
Where the unit of measurement isn't immediately apparent, I think you should go a step further and include the unit of measurement in the name. Length_mm, for example, should help remind developers that they'd better convert the length to mm if the user just entered it in inches.
My database administrator, however, just told me that including units of measurement in database column names is “frowned upon”. I think that's just nuts, but perhaps there's some risk DBAs know about that I don't.
Throw me a line, here: should we embed units of measurement in our attribute names? Why? Why not?
If you have a consistent UOM for things, then your DBA's policy is OK.
For example, if timespans are ALWAYS in minutes, etc.
If the UOM could change, then you should store it in another column, alongside the qty.
That said, I tend to side with you on this. Clarity trumps most things, including this. I'd rather see DurationMinutes than Duration and have to guess what the UOM is.
Yes. You should.
The key, as @Charles Bretana pointed out, is legibility and that the other users of your table or developers following you know what you're using.
I would absolutely involve the units/measurement in a field name - in my business you can't guess what you'll find from the context or name: a field entitled MarketValue - is that in millions, thousands or units? US Dollars, Euros, pounds, $CURRENCY? Is that value a percentage, a ratio? Absolute or relative? Daily, monthly, calendar year, financial year? That timestamp, what time zone is it?
Your first, last and only task when providing data is to ensure that it isn't used incorrectly because the consumer wasn't able to find out enough about it. As developers, throwing "Metre", "USD", "GMT", "Percent" or whatever into a field name isn't the least bit smelly.
There are enormous smells that need resolving before the tiny whiff of field naming needs standardising.
This is why the Mars Climate Orbiter was lost: one piece of software reported thruster impulse in pound-force seconds while another expected newton-seconds.
Although "Never say 'Never' or 'Always'" is, in general, a good rule of thumb, here I will bend my rule and say I think you should "always" make it clear what units a numeric value is in.
The convention of naming all my columns in the format:
{name}_in_{unit}
helped on one project. Since I was using SI units, it actually ended up letting me infer the column data type and generally simplified my writing style.
length_in_m
speed_in_ms-1
color_in_nm
there were a few exceptions that I handled either with at_time or number_of:
started_at_time
updated_at_time
number_of_rotations
I think this is a good idea anywhere since there is always room for ambiguity.
For example, with the high-performance timer class we use, I keep having to check whether the GetElapsed() method returns seconds or milliseconds or something else. If it were called GetElapsedMilliseconds(), that would save the confusion.
The only downside being if you wanted to change your mind ... but in that case any clients would need to know about the change anyway.
F# has an interesting twist on this allowing measurement units to be specified in the type system. See this blog post, and another stackoverflow question discussing Are units of measurement unique to F#?
I've done a lot of database work, and I would not frown upon that at all, nor have I heard of frowning on it.
It's better than extended properties, which are not apparent to the casual developer. It's better than a separate document, because many developers won't read it, and certainly not in great detail. If the units are set, then having them in the name sounds like a good idea. If that changes, then when the unit field is added, change the name of the measurement field.
Where the unit of measurement isn't immediately apparent, I think you should go a step further and include the unit of measurement in the name. Length_mm, for example, should help remind developers that they'd better convert the length to mm if the user just entered it in inches.
You could go even a step further (in your code, not in the database) and have a Length type, which takes care of the measurement unit and of possible conversions. This is the approach of the "Quantity" pattern in Martin Fowler's "Analysis Patterns" book.
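A very small sketch of what such a Length type might look like, in Python. This is not Fowler's code: the class layout, the of/to method names and the choice of millimetres as the internal unit are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Conversion factors to the internal unit (millimetres).
_MM_PER_UNIT = {"mm": 1.0, "m": 1000.0, "in": 25.4}


@dataclass(frozen=True)
class Length:
    """A length stored in one standard unit, whatever unit it was entered in."""

    millimetres: float

    @classmethod
    def of(cls, value: float, unit: str) -> "Length":
        return cls(value * _MM_PER_UNIT[unit])

    def to(self, unit: str) -> float:
        return self.millimetres / _MM_PER_UNIT[unit]

    def __add__(self, other: "Length") -> "Length":
        return Length(self.millimetres + other.millimetres)


# Mixing inches and metres is safe because both are converted on the way in.
total = Length.of(2, "in") + Length.of(1, "m")
```

The unit then only appears at the edges (input and display), so variable names such as total do not need a suffix.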
Do not put units of measurement (or column type) in your database column names.
Many Databases have the ability to document/comment columns in some way (in SQL Server it is sp_addextendedproperty), I would suggest that is a more appropriate place.
For Python datetimes, consider using objects from the datetime package. Doing so captures the unit implicitly, to microsecond resolution. There is then no basis for including the unit in the variable name.
If you must use an int or float instead, it is strongly recommended to suffix the unit abbreviation to the variable name. For example, instead of the variable name diff, use diff_secs for seconds, diff_ms for milliseconds, diff_µs for microseconds, or diff_ns for nanoseconds.
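For example (a short sketch with made-up timestamps): a timedelta carries its own unit, and the suffixed names only show up at the point where you convert to a plain number.

```python
from datetime import datetime, timedelta

started = datetime(2021, 5, 1, 12, 0, 0)
finished = datetime(2021, 5, 1, 12, 0, 1, 500000)

# A timedelta needs no unit suffix: the unit is part of the type.
elapsed = finished - started

# Plain numbers do need one, so the unit goes in the name.
elapsed_secs = elapsed.total_seconds()
elapsed_ms = elapsed / timedelta(milliseconds=1)
```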
We don't put units of measurement in column names in our database. We do, however, have a data dictionary document where all of the columns and relationships are described.
The ideal approach is, if possible, to use a type that leaves no ambiguity as to the measurement. For example in .NET rather than saying int periodInSeconds you'd be much better off using TimeSpan period.
The F# language actually has units of measurement as part of the type system so you can declare types in units such as 10<m/s> and 5<s> and even perform calculations on them so something like 10<m/s> * 5<s> would result in 50<m>. See here for more info.
So I'd say if possible use a type that conveys your intention, but if that isn't possible then you should probably encode the measurement into the name. It's better and more obvious than a comment.
You definitely want units of measurement somewhere. I don't know if the column names are a good place or if the schema is better. Ask your database administrator:
Where is the information about units of measure stored?
How can I get access to the units programmatically?
If the answers are "it isn't" or "you can't", complain bitterly---they have no right to deny you your naming convention. Otherwise, all may be happier if you work within the system.
P.S. I really like the support for units of measure that they've put into F#.
I have to say, I hate "descriptive" variable names becoming "incredibly verbose" variable names.
My preferred alternative is to use nothing but the unit-of-measure names in short functions, e.g.
function velocity(m, s) {
return m/s;
}
You don't need to say "length_m" because in this context, it's obvious that only lengths are measurable in metres.
Having said that. If I was writing a system where units of measure errors were really dangerous, I'd probably use the type system and define a Length class which always converted itself into a standard unit for any calculation. Maybe even different sub-classes for Feet, Metres etc.
NO, the name of the attribute is separate from its unit of measurement.
If you call a variable length_mm then you are tied to mm.
What if you use a 32-bit int to store length_mm? Eventually the length in mm may get larger than 2,147,483,647 (the limit of a signed 32-bit int). You can't switch over to m because you tied your length variable to length_mm.
I think putting units in your identifiers is a huge design smell. It almost surely means that you chose the wrong language: if units are so important to the project, you'd better be using a language whose type system is capable of representing them.