In my database, I have a table that has to get info from two adjacent rows from another table.
Allow me to demonstrate. There's a bill that calculates the difference between two adjacent meter values and calculates the cost accordingly (i.e., I have a water meter and if I want to calculate the amount I should pay in December, I take the value I measured in November and subtract it from the December one).
My question is, how to implement the references the best way? I was thinking about:
Making each meter value an entity on its own. The bill will then have two foreign keys, one for each meter value. That way I can include other useful data, like measurement date and so on. However, implementing and validating adjacency becomes icky.
Making a pair of meter values an entity (or a meter value and a diff). The bill will reference that pair. However, that leads to data duplication.
Is there a better way? Thank you very much.
First, there is no such thing as "adjacent" rows in a relational database. Tables represent unordered sets. If you have a concept of ordering, it needs to be implemented using data in the rows. Let me assume that you have some sort of "id" or "creation date" that specifies the ordering.
Because you don't specify the database, I'll assume you have one that supports the ANSI-standard window functions. In that case, you can get what you want using the LAG() function. The syntax to get the previous meter reading is something like:
SELECT LAG(value) OVER (PARTITION BY meterid ORDER BY readdatetime)
There is no need for data duplication or some arcane data structure. LAG() should also be able to take advantage of appropriate indexes.
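To make this concrete, here is a minimal runnable sketch using SQLite (3.25+, which supports ANSI window functions) from Python. The table and column names (meter_reading, meterid, readdatetime, value) are illustrative assumptions, not the asker's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE meter_reading (
    meterid INTEGER, readdatetime TEXT, value REAL)""")
conn.executemany(
    "INSERT INTO meter_reading VALUES (?, ?, ?)",
    [(1, "2023-11-01", 100.0),
     (1, "2023-12-01", 130.0),
     (1, "2024-01-01", 155.0)])

# LAG() fetches the previous reading for the same meter, in date order;
# subtracting it gives the consumption to bill for that period.
rows = conn.execute("""
    SELECT readdatetime,
           value,
           value - LAG(value) OVER (PARTITION BY meterid
                                    ORDER BY readdatetime) AS consumed
    FROM meter_reading
    WHERE meterid = 1
    ORDER BY readdatetime
""").fetchall()
for r in rows:
    print(r)
# The first row has consumed = None (no previous reading);
# the December row shows 130.0 - 100.0 = 30.0.
```

Note that no adjacency needs to be stored anywhere: the "previous" row is derived at query time from the ordering column.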
Related
I have a pretty simple table with an ID, a Date, and 20 value columns. Each column and each row can hold a different type of data, with a different unit of measure, and the ID column defines each field's unit of measure. So basically the ID field identifies the meaning behind each field. Naturally, I have an explanatory table that holds these definitions by ID.
The table holds sensor data, and these sensors insert thousands of rows of data each second (each TYPE of sensor has its own ID).
My problem is: how do I aggregate this kind of table? Each type of measurement requires a different aggregation (some measurements I need to average, others to sum or min or max, etc.).
I think the ideal solution would be an explanatory table keyed by ID which defines, for each field of that ID, how it should be aggregated, and the aggregation command should (somehow... magically...) be made dynamic by this table...
Do you have any suggestion on how I can accomplish that? Or is it even possible to make the aggregation function dynamic based on a certain condition (in this case, the explanatory table's value)?
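A table-driven version of the dynamic aggregation the question describes could be sketched roughly like this, assuming SQLite and made-up table names (sensor_data, agg_rules). The key point is that the aggregate function name comes from the explanatory table and is checked against a whitelist before being spliced into the SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sensor_data (id INTEGER, val REAL);
    CREATE TABLE agg_rules (id INTEGER PRIMARY KEY, agg_fn TEXT);
    INSERT INTO sensor_data VALUES (1, 10), (1, 20), (2, 5), (2, 7);
    INSERT INTO agg_rules VALUES (1, 'AVG'), (2, 'SUM');
""")

# Function names must be interpolated into the SQL text (they cannot be
# bound as parameters), so a whitelist guards against SQL injection.
ALLOWED = {"AVG", "SUM", "MIN", "MAX"}

results = {}
for sensor_id, agg_fn in conn.execute("SELECT id, agg_fn FROM agg_rules"):
    if agg_fn not in ALLOWED:
        raise ValueError(f"unknown aggregate: {agg_fn}")
    sql = f"SELECT {agg_fn}(val) FROM sensor_data WHERE id = ?"
    results[sensor_id] = conn.execute(sql, (sensor_id,)).fetchone()[0]

print(results)  # {1: 15.0, 2: 12.0}
```

This is a client-side sketch; in SQL Server or similar you would build the same statement with dynamic SQL in a stored procedure, with the same whitelisting caveat.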
Are you sure SQL is the right tool for the job? It sounds to me like a columnar database, or another type of NoSQL store, would fit better.
I have a database that keeps track of attendance for students in a school. There's one table (SpecificClasses) with dates of all of the classes, and another table (Attendance) with a list of all the students in each class and their attendance on that day.
The school wants to be able to view that data in many different ways and to filter it according to many different parameters. (I won't paste the entire query here because it is quite complicated and the details are not important for my question.) One option they want is to view the attendance of a specific student on a certain day of the week. Meaning, they want to be able to notice if a student is missing every Tuesday or something like that.
To make the query be able to do that, I have used DatePart("w",[SpecificClasses]![Day]). However, running this on every class (when we may be talking about hundreds of classes taken by one student in one semester) is quite time-consuming. So I was thinking of storing the day of the week manually in the SpecificClasses table, or perhaps even in the Attendance table to be able to avoid making a join, and just being very careful in my events to keep this data up-to-date (meaning to fill in the info when the secretaries insert a new SpecificClass or fix the Day field).
Then I was wondering whether I could just make a calculated field that would store this value. (The school has Access 2010 so I don't have to worry about compatibility). If I create a calculated field, does Access actually store that field and remember it for the future and not have to recalculate it each time?
As HansUp mentions in his answer, a calculated field cannot be indexed, so it might not give you much of a performance boost. However, since you are using Access 2010 you could create a "real" Integer field named [WeekdayNumber], put an index on it, and then use a Before Change data macro to fill in the Weekday() value for you. (The Weekday() function gives the same result as DatePart("w", ...).)
I was wondering whether I could just make a calculated field that would store this value.
No, not for a calculated field expression which uses DatePart(). Access supports a limited set of functions for calculated fields, and DatePart() is not one of those.
If I create a calculated field, does Access actually store that field and remember it for the future and not have to recalculate it each time?
Doesn't apply to your current case. But for a calculated field which Access would accept, yes, that is the way it works.
However, a calculated field cannot be indexed, so that limits how much improvement it can offer in terms of data retrieval speed. If you encounter another situation where you can create a valid calculated field, test the performance to see whether you notice any improvement (vs. calculating the value in a query).
For your DatePart() query problem, consider creating a calendar table with a row for each date and include the weekday number as a separate indexed field. Then you could join the calendar table into your query, avoid the need to compute DatePart() again, and allow Access to use the indexed weekday number to quickly identify which rows match the weekday of interest.
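A minimal sketch of that calendar-table idea, using SQLite from Python instead of Access (the table and column names are illustrative). The weekday numbering reproduces Access's DatePart("w", ...) default of Sunday = 1 through Saturday = 7:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE calendar (
    the_date TEXT PRIMARY KEY,
    weekday_number INTEGER)""")
# The index is the whole point: weekday lookups become seeks, not scans.
conn.execute("CREATE INDEX ix_calendar_weekday ON calendar(weekday_number)")

# Populate one month. Python's weekday() is Monday=0..Sunday=6, so we
# remap it to Access's default Sunday=1..Saturday=7 numbering.
d = date(2023, 1, 1)
rows = []
while d <= date(2023, 1, 31):
    access_weekday = (d.weekday() + 1) % 7 + 1
    rows.append((d.isoformat(), access_weekday))
    d += timedelta(days=1)
conn.executemany("INSERT INTO calendar VALUES (?, ?)", rows)

# All Tuesdays (weekday 3) in January 2023, via the indexed column:
tuesdays = [r[0] for r in conn.execute(
    "SELECT the_date FROM calendar WHERE weekday_number = 3 "
    "ORDER BY the_date")]
print(tuesdays)
```

In the attendance query you would then join SpecificClasses to calendar on the date and filter on weekday_number, never calling DatePart() per row.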
I have a need to store a fairly large history of data and have been researching the best ways to store such an archive. It seems that a data warehouse approach is what I need. It also seems highly recommended to use a date dimension table rather than the date itself. Can anyone please explain to me why a separate table would be better? I don't need to summarize any of the data, just access it quickly and efficiently for any given day in the past. I'm sure I'm missing something, but I just can't see how storing the dates in a separate table is any better than just storing a date in my archive.
I have found these enlightening posts, but nothing that quite answers my question.
What should I have in mind when building OLAP solution from scratch?
Date Table/Dimension Querying and Indexes
What is the best way to store historical data in SQL Server 2005/2008?
How to create history fact table?
Well, one advantage is that as a dimension you can store many other attributes of the date in that other table - is it a holiday, is it a weekday, what fiscal quarter is it in, what is the UTC offset for a specific (or multiple) time zone(s), etc. etc. Some of those you could calculate at runtime, but in a lot of cases it's better (or only possible) to pre-calculate.
Another is that if you just store the DATE in the table, you only have one option for indicating a missing date (NULL) or you need to start making up meaningless token dates like 1900-01-01 to mean one thing (missing because you don't know) and 1899-12-31 to mean another (missing because the task is still running, the person is still alive, etc). If you use a dimension, you can have multiple rows that represent specific reasons why the DATE is unknown/missing, without any "magic" values.
Personally, I would prefer to just store a DATE, because it is smaller than an INT (!) and it keeps all kinds of date-related properties, the ability to perform date math etc. If the reason the date is missing is important, I could always add a column to the table to indicate that. But I am answering with someone else's data warehousing hat on.
Let's say you've got a thousand entries per day for the last year. If you have a date dimension, your query grabs the date in the date dimension and then uses the join to collect the one thousand entries you're interested in. If there's no date dimension, your query reads all 365 thousand rows to find the one thousand you want. Quicker and more efficient.
I know the question may not make much sense.
What is a typical number of columns in a small, medium, and large table (in a database) in a professional environment?
I want to get an idea of what good design looks like and how wide I can grow a table. Are 100 or 200 columns in a table OK?
It totally depends upon the nature of the subject you are modeling. If you need one column, you need one; if you need 500, then you need 500. Properly designed, the size of the tables you end up with will always be "just right".
How big can they be, what performs well, what if you need more columns than you can physically stuff into SQL... those are all implementation, maintenance, and/or performance questions, and that's a different and secondary subject. Get the models right first, then worry about implementation.
Well, for SQL Server 2005, according to the Maximum Capacity Specifications, the maximum number of columns per base table is 1,024. So that's a hard upper limit.
You should normalize your tables to at least third normal form (3NF). That should be your primary guide.
You should consider the row size. 100 sparse columns is different from 100 varchar(3000) columns. It is almost always better to make a related table (with an enforced 1-1 relationship) when you are starting to get past the record size that SQL Server can store on one page.
You also should consider how the data will be queried. Are many of those fields ones that will not frequently need to be returned? Do they have a natural grouping (think a person record versus a user login record), and will that natural grouping dictate how they will be queried? In that case it might be better to separate them out.
And of course you should consider normalization. If you are adding multiple columns to avoid a one-to-many relationship and a join, then you should not do that even if you only have 6 columns. Joins (with the key fields indexed) are generally preferable to denormalized tables, except in data warehousing situations. It is better if you don't have to add new columns just because you need to store a new phone type.
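For the phone-type point, a small sketch of the normalized alternative (illustrative table names, SQLite): instead of phone1/phone2/... columns on the person table, a separate phone table means a new phone type is just a new row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Instead of person(phone_home, phone_work, phone_cell, ...):
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone (
        person_id INTEGER REFERENCES person(person_id),
        phone_type TEXT,
        number TEXT);
    CREATE INDEX ix_phone_person ON phone(person_id);

    INSERT INTO person VALUES (1, 'Alice');
    INSERT INTO phone VALUES
        (1, 'home', '555-0100'), (1, 'cell', '555-0101');
""")

# A new phone type is just a new row -- no ALTER TABLE needed.
conn.execute("INSERT INTO phone VALUES (1, 'fax', '555-0102')")

# The indexed join brings the numbers back when they are wanted:
numbers = conn.execute("""
    SELECT ph.phone_type, ph.number
    FROM person p JOIN phone ph ON ph.person_id = p.person_id
    WHERE p.name = 'Alice'
    ORDER BY ph.phone_type
""").fetchall()
print(numbers)
```

The join cost is paid only by queries that actually need the phone numbers, while queries on the person table alone stay narrow.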
To preface this, I'm not familiar with OLAP at all, so if the terminology is off, feel free to offer corrections.
I'm reading about OLAP and it seems to be all about trading space for speed, wherein you precalculate (or calculate on demand) and store aggregations about your data, keyed off by certain dimensions. I understand how this works for dimensions that have a discrete set of values, like { Male, Female } or { Jan, Feb, ... Dec } or { #US_STATES }. But what about dimensions that have completely arbitrary values like (0, 1.25, 3.14156, 70000.23, ...)?
Does the use of OLAP preclude the use of aggregations in queries that hit the fact tables, or is it merely used to bypass things that can be precalculated? Like, arbitrary aggregations on arbitrary values still need to be done on the fly?
Any other help regarding learning more about OLAP would be much appreciated. At first glance, both Google and SO seem to be a little dry (compared to other, more popular topics).
Edit: I was asked for examples of dimensions with arbitrary values.
VELOCITY of experiments: 1.256 m/s, -2.234 m/s, 33.78 m/s
VALUE of transactions: $120.56, $22.47, $9.47
Your velocity and value column examples are usually not the sort of columns that you would query in an OLAP way - they are the values you're trying to retrieve, and would presumably be in the result set, either as individual rows or aggregated.
However, I said usually. In our OLAP schema, we have a good example of the kind of column you're thinking of: event_time (a date-time field, with granularity to the second). In our data it is nearly unique: no two events happen during the same second. But since we have years of data in our table, that still means there are hundreds of millions of potentially distinct values, and when we run our OLAP queries, we almost always want to constrain based on time ranges.
The solution is to do what David Raznick has said - you create a "bucketed" version of the value. So, in our table, in addition to the event_time column, we have an event_time_bucketed column - which is merely the date of the event, with the time part being 00:00:00. This reduces the count of distinct values from hundreds of millions to a few thousand. Then, in all queries that constrain on date, we constrain on both the bucketed and the real column (since the bucketed column will not be accurate enough to give us the real value), e.g.:
WHERE event_time BETWEEN '2009.02.03 18:32:41' AND '2009.03.01 18:32:41'
AND event_time_bucketed BETWEEN '2009.02.03' AND '2009.03.01'
In these cases, the end user never sees the event_time_bucketed column - it's just there for query optimization.
For floating point values like you mention, the bucketing strategy may require a bit more thought, since you want to choose a method that will result in a relatively even distribution of the values and that preserves contiguity. For example, if you have a classic bell distribution (with tails that could be very long) you'd want to define the range where the bulk of the population lives (say, 1 or 2 standard deviations from mean), divide it into uniform buckets, and create two more buckets for "everything smaller" and "everything bigger".
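A rough sketch of that bucketing strategy in Python (the bucket count and the 2-standard-deviation cutoff are arbitrary choices for illustration, not a prescription): uniform buckets across mean +/- 2 standard deviations, plus two overflow buckets for the tails:

```python
from statistics import mean, stdev

def make_bucketer(values, n_buckets=10, k=2.0):
    """Return a function mapping a float to a bucket id.

    Buckets 0..n_buckets-1 cover [mean - k*sd, mean + k*sd) uniformly;
    -1 is "everything smaller", n_buckets is "everything bigger".
    """
    m, s = mean(values), stdev(values)
    lo, hi = m - k * s, m + k * s
    width = (hi - lo) / n_buckets

    def bucket(x):
        if x < lo:
            return -1              # left-tail overflow bucket
        if x >= hi:
            return n_buckets       # right-tail overflow bucket
        return int((x - lo) / width)

    return bucket

# A sample with a long right tail: the outlier lands in the overflow
# bucket while the bulk of the population spreads over the middle ones.
values = [1.2, 2.5, 2.7, 3.1, 3.3, 4.0, 4.2, 50.0]
bucket = make_bucketer(values)
print([bucket(v) for v in values])
```

The bucketed value would then be stored as an extra indexed column next to the raw float, exactly like event_time_bucketed above.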
I have found this link to be handy http://www.ssas-info.com/
Check out the webcasts section, wherein they walk you through different aspects, starting from what BI and warehousing are, to designing a cube, dimensions, calculations, aggregations, KPIs, perspectives, etc.
In OLAP aggregations help in reducing the query response time by having pre-calculated values which would be used by the query. However, the flip side is increase in storage space as more space would be needed to store aggregations apart from the base data.
SQL Server Analysis Services has Usage Based Optimization Wizard which helps in aggregation design by analyzing queries that have been submitted by clients (reporting clients like SQL Server Reporting Services, Excel or any other) and refining the aggregation design accordingly.
I hope this helps.
cheers