I have a pretty simple table with an ID, a Date, and 20 value columns. Each column and each row can hold a different type of data - with a different unit of measure - and the ID column defines each field's unit of measure. So basically the ID field helps identify the meaning behind each field. Naturally, I have an explanatory table that holds these definitions by ID.
The table holds sensor data, and these sensors insert thousands of rows each second (each TYPE of sensor has its own ID).
My problem is: how do I aggregate this kind of table? Each type of measurement requires a different aggregation (some measurements I need to average, others to sum or min or max, etc.).
I think the perfect solution would be something like having an explanatory table by ID, which defines for each field (of that ID) how I should aggregate it, and the aggregation command should (somehow... magically...) be made dynamic by this table...
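To illustrate what I mean, here is a rough sketch (hypothetical names: an explanatory table agg_config holding an aggregation code per value column, and a data table sensor_data; only value1 is shown, the other columns would follow the same pattern). For a truly dynamic set of aggregation functions, the statement would presumably have to be built as dynamic SQL from that explanatory table:

select d.id,
       case c.value1_agg                     -- 'avg', 'sum', 'min' or 'max', taken from the explanatory table
            when 'avg' then avg(d.value1)
            when 'sum' then sum(d.value1)
            when 'min' then min(d.value1)
            when 'max' then max(d.value1)
       end as value1
from sensor_data d
join agg_config c on c.id = d.id
group by d.id, c.value1_agg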
Do you have any suggestions on how I can accomplish that? Or is it even possible to make the aggregation function dynamic based on a certain condition (in this case, the explanatory table's value)?
Are you sure SQL is the right tool for the job? It sounds to me like a columnar DB, or some other type of NoSQL store, would fit better.
I expect this is a common enough use case, but I'm unsure of the best way to leverage database features to do it. Hopefully the community can help.
Given a business domain where a number of attributes make up a record - we can just call these a, b, c.
Each of these belongs to a parent record, of which there can be many.
Given an external data source that will post updates to those attributes at arbitrary times, and typically only a subset of them, you get instructions like
z:{a:3}
or
y:{b:2,c:100}
What are good ways to query postgres for the 'current state', i.e. a single-row result that represents the most recent value of each of a, b, c, for each of the parent records?
The current state overall looks like
x:{a:0, b:0, c:1}
y:{a:1, b:2, c:3}
z:{a:2, b:65, c:6}
If it matters, the difference in time between updates on a single value could be arbitrarily long.
I am deliberately avoiding a table that keeps updating and rewriting an individual row for the state, because the write contention could be a problem, and I think there must be a better overall pattern.
Your question is a bit theoretical - but in essence you are describing a top-1-per-group problem. In Postgres, you can use distinct on for this.
Assuming that your table is called mytable, where attributes are stored in column attribute, and that column ordering_id defines the ordering of the rows (that could be a timestamp or a serial, for example), you would phrase the query as:
select distinct on (attribute) t.*
from mytable t
order by attribute, ordering_id desc
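If the table also has a parent_id and a value column (names assumed here), the single-row-per-parent 'current state' from the question can be built on top of that: take the latest row per (parent, attribute) with distinct on, then pivot. A sketch (the filter clause needs Postgres 9.4+):

select parent_id,
       max(value) filter (where attribute = 'a') as a,
       max(value) filter (where attribute = 'b') as b,
       max(value) filter (where attribute = 'c') as c
from (
    -- latest row per (parent, attribute)
    select distinct on (parent_id, attribute) parent_id, attribute, value
    from mytable
    order by parent_id, attribute, ordering_id desc
) latest
group by parent_id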
I am creating a Data Warehouse and have hit an interesting problem...
I have DimQualification and DimUnit tables. A unit is a part of a qualification.
However, some units are optional. Having stated all available units in the DimUnit table, I am puzzled by how best to show the customer's choice.
FactAttendance - The attendance on the qualification
Would it be best to put multiple rows in the fact table (qualification and units taken) or is there another option?
The other option, besides putting multiple rows in the fact table, is to have a single row for each fact in the fact table, and a separate column for each unit. The column would be a count of the number of that unit associated with that fact. Something like this:
FactID Unit1Count Unit2Count Unit3Count ...
I have looked at a few things now and have decided that there is a way to achieve this without the reduction in speed that multiple rows in the fact table would create.
Instead of having multiple rows for each unit, I am going to create another fact table which holds all the units chosen; then, from the FactAttendance table, we can immediately and efficiently identify the units chosen.
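A rough sketch of the layout I have in mind (table and column names are only placeholders):

create table FactAttendance (
    AttendanceID    int primary key,
    QualificationID int not null,   -- references DimQualification
    DateKey         int not null    -- references the date dimension
);

create table FactAttendanceUnit (
    AttendanceID int not null,      -- references FactAttendance
    UnitID       int not null,      -- references DimUnit
    primary key (AttendanceID, UnitID)
);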
In my database, I have a table that has to get info from two adjacent rows from another table.
Allow me to demonstrate. There's a bill that calculates the difference between two adjacent meter values and calculates the cost accordingly (i.e., I have a water meter and if I want to calculate the amount I should pay in December, I take the value I measured in November and subtract it from the December one).
My question is, how to implement the references the best way? I was thinking about:
Making each meter value an entity on its own. The bill will then have two foreign keys, one for each meter value. That way I can include other useful data, like measurement date and so on. However, implementing and validating adjacency becomes icky.
Making a pair of meter values an entity (or a meter value and a diff). The bill will reference that pair. However, that leads to data duplication.
Is there a better way? Thank you very much.
First, there is no such thing as "adjacent" rows in a relational database. Tables represent unordered sets. If you have a concept of ordering, it needs to be implemented using data in the rows. Let me assume that you have some sort of "id" or "creation date" that specifies the ordering.
Because you don't specify the database, I'll assume you have a database that supports the ANSI-standard window functions. In that case, you can get what you want using the LAG() function. The syntax to get the previous meter reading is something like:
select lag(value) over (partition by meterid order by readdatetime)
There is no need for data duplication or some arcane data structure. LAG() should also be able to take advantage of appropriate indexes.
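For the billing calculation itself, a minimal sketch, assuming the readings live in a table called meter_readings with the columns used above:

select meterid,
       readdatetime,
       value,
       -- current reading minus the previous reading for the same meter
       value - lag(value) over (partition by meterid order by readdatetime) as consumption
from meter_readings

The first reading of each meter has no predecessor, so its consumption comes back as NULL.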
We have a client table with a field DateOfBirth.
I'm new to MS Analysis Services, OLAP and data cubes. I'm trying to report on client metrics by age categories (18-25,26-35,35-50,50-65,66+)
I don't see a way to accomplish this. (Note: I'm not concerned with age at the time of a sale. I'm interested in knowing the age distribution of my current active customers).
You can create either a T-SQL expression or a Named Calculation in the Data Source View that calculates the CurrentAge based on the DOB field.
You will likely also want to implement another, similarly derived field that assigns the CurrentAge value to a bucket in your age ranges. This is a simple T-SQL CASE statement.
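A minimal sketch of such a calculation (assuming the client table is called Client with a ClientID key; note that DATEDIFF(YEAR, ...) only counts year boundaries, so it is off by one for people who have not had their birthday yet):

select
    c.ClientID,
    datediff(year, c.DateOfBirth, getdate()) as CurrentAge,
    case
        when datediff(year, c.DateOfBirth, getdate()) < 18 then 'Under 18'
        when datediff(year, c.DateOfBirth, getdate()) <= 25 then '18-25'
        when datediff(year, c.DateOfBirth, getdate()) <= 35 then '26-35'
        when datediff(year, c.DateOfBirth, getdate()) <= 50 then '35-50'
        when datediff(year, c.DateOfBirth, getdate()) <= 65 then '50-65'
        else '66+'
    end as AgeBucket
from Client c

The CASE expression on its own is what would go into the Named Calculation.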
Depending on how large the client table is (and the analytical purpose), you may want to make this into a fact table or at least use snowflaking to separate this from the other relatively static attribute fields in the client table.
I have a table with 25 columns where 20 columns can have null values for some (30-40%) rows.
Now what is the cost of having rows with 20 null columns? Is this OK?
Or
is it a good design to have another table to store those 20 columns and add a ref to the first table?
This way I will only write to the second table when there are values.
I am using SQL Server 2005 and will migrate to 2008 in the future.
Only 20 columns are varchar; the rest are smallint and smalldatetime.
What I am storing:
These columns store different attributes of the row they belong to. These attributes can sometimes be null.
The table will hold ~1 billion rows.
Please comment.
You should describe the type of data you are storing. It sounds like some of those columns should be moved to another table.
For example, if you have several columns that hold repeated values of the same type of data, then I would say move them to another table. On the other hand, if you need this many columns to describe different types of data, then you may need to keep it as it is.
So it kind of depends on what you are modelling.
Are there some circumstances where some of those columns are required? If so, then perhaps you should use some form of inheritance. For instance, if this were information about patients in a hospital, and there was some data that only made sense for female patients, then you could create a FemalePatients table with those columns. Those columns that must always be collected for female patients could then be declared NOT NULL in that separate table.
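A minimal sketch of that supertype/subtype idea, with purely hypothetical table and column names:

create table Patients (
    PatientID int primary key,
    Name      varchar(100) not null
);

create table FemalePatients (
    PatientID     int primary key references Patients (PatientID),
    LastMammogram datetime not null   -- only collected for female patients, so it can be NOT NULL here
);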
It depends on the data types (40 nullable ints is going to basically take the same space as 40 non-nullable ints, regardless of the values). In SQL Server, the space is fairly efficient with ordinary techniques. In 2008, you do have the SPARSE feature.
If you do split the table vertically with an optional 1:1 relationship, there is a possibility of wrapping the two tables with a view and adding triggers on the view to make it updatable and hide the underlying implementation.
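A rough sketch of that vertical split and view, with hypothetical names (the INSTEAD OF triggers that would make the view updatable are left out):

create table RowCore (
    RowID    int primary key,
    ReadDate smalldatetime not null,
    Metric1  smallint not null
);

create table RowDetail (   -- only gets a row when the optional values exist
    RowID int primary key references RowCore (RowID),
    Attr1 varchar(50) null,
    Attr2 varchar(50) null
);
go

create view RowFull as    -- hides the split from readers
select c.RowID, c.ReadDate, c.Metric1, d.Attr1, d.Attr2
from RowCore c
left join RowDetail d on d.RowID = c.RowID;
go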
So there are plenty of options, many of which can be implemented after you see the data load and behavior.
Create tables based on the distinct sets of attributes you have. So if you have some data where some of your columns do not apply then it would make sense to have that data in a table which doesn't have those columns. As far as possible, avoid repeating the same attribute in multiple tables. Make sure your data is in at least Boyce-Codd / 5th Normal Form and you won't go far wrong.