Storing large SQL datasets with variable numbers of columns - sql

In America’s Cup yachting, we generate large datasets where at every time-stamp (e.g. 100Hz) we need to store maybe 100-1000 channels of sensor data (e.g. speed, loads, pressures). We store this in MS SQL Server and need to be able to retrieve subsets of channels of the data for analysis, and perform queries such as the maximum pressure on a particular sensor in a test, or over an entire season.
The set of channels to be stored stays the same for several thousand time-stamps, but day-to-day will change as new sensors are added, renamed, etc... and depending on testing, racing or simulating, the number of channels can vary greatly.
The textbook way to structure the SQL tables would probably be:
OPTION 1
ChannelNames
+-----------+-------------+
| ChannelID | ChannelName |
+-----------+-------------+
| 50 | Pressure |
| 51 | Speed |
| ... | ... |
+-----------+-------------+
Sessions
+-----------+---------------+-------+----------+
| SessionID | Location | Boat | Helmsman |
+-----------+---------------+-------+----------+
| 789 | San Francisco | BoatA | SailorA |
| 790 | San Francisco | BoatB | SailorB |
| ... | ... | ... | |
+-----------+---------------+-------+----------+
SessionTimestamps
+-------------+-------------+------------------------+
| SessionID | TimestampID | DateTime |
+-------------+-------------+------------------------+
| 789 | 12345 | 2013/08/17 10:30:00:00 |
| 789 | 12346 | 2013/08/17 10:30:00:01 |
| ... | ... | ... |
+-------------+-------------+------------------------+
ChannelData
+-------------+-----------+-----------+
| TimestampID | ChannelID | DataValue |
+-------------+-----------+-----------+
| 12345 | 50 | 1015.23 |
| 12345 | 51 | 12.23 |
| ... | ... | ... |
+-------------+-----------+-----------+
This structure is neat but inefficient. Each DataValue requires three storage fields, and at each time-stamp we need to INSERT 100-1000 rows.
If we always had the same channels, it would be more sensible to use one row per time-stamp and structure like this:
OPTION 2
+-----------+------------------------+----------+-------+----------+--------+-----+
| SessionID | DateTime | Pressure | Speed | LoadPt | LoadSb | ... |
+-----------+------------------------+----------+-------+----------+--------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+-------+----------+--------+-----+
However, the channels change every day, and over the months the number of columns would grow and grow, with most cells ending up empty. We could create a new table for every new Session, but it doesn’t feel right to be using a table name as a variable, and would ultimately result in tens of thousands of tables – also, it becomes very difficult to query over a season, with data stored in multiple tables.
Another option would be:
OPTION 3
+-----------+------------------------+----------+----------+----------+----------+-----+
| SessionID | DateTime | Channel1 | Channel2 | Channel3 | Channel4 | ... |
+-----------+------------------------+----------+----------+----------+----------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+----------+----------+----------+-----+
with a look-up from Channel column IDs to channel names – but this requires an EXEC or eval to execute a pre-constructed query to obtain the channel we want – because SQL isn’t designed to have column names as variables. On the plus side, we can re-use columns when channels change, but there will still be many empty cells because the table has to be as wide as the largest number of channels we ever encounter. Using a SPARSE table may help here, but I am uncomfortable with the EXEC/eval issue above.
What is the right solution to this problem, that achieves efficiency of storage, inserts and queries?

I would go with Option 1.
Data integrity is first, optimization (if needed) - second.
Other options would eventually have a lot of NULL values and other problems stemming from not being normalized. Managing the data and making efficient queries would be difficult.
Besides, there is a limit on the number of columns that a table can have - 1024, so if you have 1000 sensors/channels you are already dangerously close to the limit. Even if you make your table a wide table, which allows 30,000 columns, still there is a limitation on the size of the row in a table - 8,060 bytes per row. And there are certain performance considerations.
I would not use wide tables in this case, even if I was sure that the data for each row would never exceed 8060 bytes and growing number of channels would never exceed 30,000.
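To make the Option 1 argument concrete, here is a minimal sketch of the normalized layout and of the "maximum pressure in a session" query from the question. It uses Python's built-in sqlite3 as a stand-in for SQL Server; the composite primary key on ChannelData and the helper name max_channel_value are assumptions for illustration, not part of the original schema.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; table and column names follow Option 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ChannelNames (ChannelID INTEGER PRIMARY KEY, ChannelName TEXT NOT NULL);
CREATE TABLE SessionTimestamps (
    TimestampID INTEGER PRIMARY KEY,
    SessionID   INTEGER NOT NULL,
    DateTime    TEXT NOT NULL
);
CREATE TABLE ChannelData (
    TimestampID INTEGER NOT NULL REFERENCES SessionTimestamps(TimestampID),
    ChannelID   INTEGER NOT NULL REFERENCES ChannelNames(ChannelID),
    DataValue   REAL NOT NULL,
    PRIMARY KEY (ChannelID, TimestampID)  -- assumed key: one value per channel per timestamp
);
""")
conn.executemany("INSERT INTO ChannelNames VALUES (?, ?)",
                 [(50, "Pressure"), (51, "Speed")])
conn.executemany("INSERT INTO SessionTimestamps VALUES (?, ?, ?)",
                 [(12345, 789, "2013-08-17 10:30:00.00"),
                  (12346, 789, "2013-08-17 10:30:00.01")])
conn.executemany("INSERT INTO ChannelData VALUES (?, ?, ?)",
                 [(12345, 50, 1015.23), (12345, 51, 12.23),
                  (12346, 50, 1012.51), (12346, 51, 12.44)])


def max_channel_value(conn, channel_name, session_id):
    """Return the maximum value of one named channel within one session (None if no data)."""
    row = conn.execute("""
        SELECT MAX(d.DataValue)
        FROM ChannelData d
        JOIN ChannelNames n ON n.ChannelID = d.ChannelID
        JOIN SessionTimestamps t ON t.TimestampID = d.TimestampID
        WHERE n.ChannelName = ? AND t.SessionID = ?
    """, (channel_name, session_id)).fetchone()
    return row[0]


print(max_channel_value(conn, "Pressure", 789))
```

The channel name is an ordinary query parameter here, so no dynamic SQL is needed when channels are added or renamed.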
I don't see a problem with inserting 100 - 1000 rows in Option 1 vs 1 row in the other options. To do such an INSERT efficiently, don't issue 1000 individual INSERT statements; do it in bulk. In various places in my system I use the following two approaches:
1) Build one long INSERT statement
INSERT INTO ChannelData (TimestampID, ChannelID, DataValue) VALUES
(12345, 50, 1015.23),
(12345, 51, 12.23),
...
(), (), (), (), ........... ();
that contains 1000 rows, and execute it as a normal INSERT in one transaction rather than 1000 transactions (check the syntax details; SQL Server allows at most 1000 rows in a single VALUES list).
2) Have a stored procedure that accepts a table-valued parameter. Call the procedure, passing the 1000 rows as a table.
CREATE TYPE [dbo].[ChannelDataTableType] AS TABLE(
    [TimestampID] [int] NOT NULL,
    [ChannelID] [int] NOT NULL,
    [DataValue] [float] NOT NULL
)
GO
CREATE PROCEDURE [dbo].[InsertChannelData]
    -- Table-valued parameter holding the rows to insert
    @ParamRows dbo.ChannelDataTableType READONLY
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
    BEGIN TRY
        INSERT INTO [dbo].[ChannelData]
            ([TimestampID],
            [ChannelID],
            [DataValue])
        SELECT
            TT.[TimestampID]
            ,TT.[ChannelID]
            ,TT.[DataValue]
        FROM
            @ParamRows AS TT
        ;
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        -- Undo the whole batch if any row fails
        ROLLBACK TRANSACTION;
    END CATCH;
END
GO
If possible, accumulate data from several timestamps before inserting to make the batches larger. You should try with your system and find the optimal size of the batch. I have batches around 10K rows using the stored procedure.
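As a rough illustration of the batching idea, the sketch below accumulates rows and writes each batch in a single transaction. It uses Python's sqlite3 rather than SQL Server, and the function names insert_in_batches and _flush as well as the default batch size are assumptions; the right batch size has to be measured on the real system.

```python
import sqlite3


def _flush(conn, batch):
    """Write one batch inside a single transaction and return its size."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO ChannelData (TimestampID, ChannelID, DataValue) VALUES (?, ?, ?)",
            batch,
        )
    return len(batch)


def insert_in_batches(conn, rows, batch_size=10_000):
    """Accumulate rows from any iterable and insert them batch by batch."""
    batch = []
    inserted = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            inserted += _flush(conn, batch)
            batch = []
    if batch:  # final partial batch
        inserted += _flush(conn, batch)
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ChannelData (TimestampID INTEGER, ChannelID INTEGER, DataValue REAL)")

# 30 timestamps x 1000 channels of dummy data, generated lazily.
rows = ((ts, ch, 0.0) for ts in range(30) for ch in range(1000))
total = insert_in_batches(conn, rows, batch_size=10_000)
```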
If your data arrives from the sensors 100 times a second, then I would first dump the incoming raw data into some very simple CSV file(s) and have a parallel background process insert it into the database in chunks. In other words, keep a buffer for incoming data, so that if the server can't cope with the incoming volume, you do not lose your data.
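A minimal sketch of such a buffer, assuming a single CSV file with three columns (TimestampID, ChannelID, DataValue) and a loader that is called periodically. The file layout and the names append_samples and load_csv_in_chunks are hypothetical, and sqlite3 again stands in for the real server.

```python
import csv
import itertools
import os
import sqlite3
import tempfile


def append_samples(path, rows):
    """Acquisition side: append raw samples to the CSV buffer."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)


def load_csv_in_chunks(conn, path, chunk_size=10_000):
    """Loader side: read the buffer and insert it chunk by chunk."""
    loaded = 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        while True:
            chunk = [(int(ts), int(ch), float(v))
                     for ts, ch, v in itertools.islice(reader, chunk_size)]
            if not chunk:
                break
            with conn:  # one transaction per chunk
                conn.executemany("INSERT INTO ChannelData VALUES (?, ?, ?)", chunk)
            loaded += len(chunk)
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    buffer_path = os.path.join(tmp, "buffer.csv")
    append_samples(buffer_path, [(12345, 50, 1015.23), (12345, 51, 12.23)])
    append_samples(buffer_path, [(12346, 50, 1012.51), (12346, 51, 12.44)])

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ChannelData (TimestampID INTEGER, ChannelID INTEGER, DataValue REAL)")
    loaded = load_csv_in_chunks(conn, buffer_path, chunk_size=3)
```

A real loader would also have to rotate or truncate the buffer file once a chunk is safely committed; that part is left out here.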
Based on your comments, where you said that some channels are likely to be more interesting and queried several times while others are less interesting, here is one optimization that I would consider. In addition to the one table ChannelData for all channels, have another table InterestingChannelData. ChannelData would hold the whole set of data, just in case. InterestingChannelData would hold a subset for only the most interesting channels. It should be much smaller, so it should take less time to query. In any case, this is an optimization (denormalization/data duplication) built on top of a properly normalized structure.

Is your process like this:
Generate data during the day
Analyse data afterwards
If these are separate activities then you might want to consider using different 'insert' and 'select' schemas. You could create a schema that's fast for inserting on the boat, then afterwards batch upload this data into an analysis-optimised schema. This requires a transformation step (where, for example, you map generic column names to useful column names).
This is along the lines of data warehousing and data marts. In this kind of design, you batch load and optimise the schema for reporting. Does your current daily upload have much of a window?

Related

how to have one itempointer serialize from 1 to n across the selected rows

As shown in the example below, the output of the query contains block IDs starting at 324 and ending at 327, so the item pointer (the row index within the block) restarts from 1 for each new block ID. In other words:
for block ID 324 there is only the item pointer with index 10
for block ID 325 the item pointers start at 1 and end at 9
I want to have a single block ID so that the item pointer (the row index) starts at 1 and ends at 25.
Please let me know how to achieve that, and
why do I have several different block IDs?
ex-1
query:
select ctid
from awanti_grid_cell_data agcd
where selectedsiteid = '202230060950'
and centerPointsOfWindowAsGeoJSONInEPSG4326ForCellsInTreatment IS NOT NULL
and centerPointsOfWindowAsGeoJSONInEPSG4326ForCellsInTreatment <> 'None'
result:
|ctid |
|--------|
|(324,10)|
|(325,1) |
|(325,2) |
|(325,3) |
|(325,4) |
|(325,5) |
|(325,6) |
|(325,7) |
|(325,8) |
|(325,9) |
|(326,1) |
|(326,2) |
|(326,3) |
|(326,4) |
|(326,5) |
|(326,6) |
|(326,7) |
|(326,8) |
|(326,9) |
|(327,1) |
|(327,2) |
|(327,3) |
|(327,4) |
|(327,5) |
|(327,6) |
You are missing the point. The ctid is the physical address of a row in the table, and it is none of your business. The database is free to choose whatever place it thinks fit for a table row. As a comparison, you cannot go to the authorities and request that your social security number should be 12345678 - it is simply assigned to you, and you have no say. That's how it is with the physical location of tuples.
Very likely you are not asking this question out of pure curiosity, but because you want to solve some problem. You should instead ask a question about your real problem, and there may be a good answer to that. But whatever problem you are trying to solve, using the ctid is probably not the correct answer, in particular if you want to control it.
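If the underlying goal is simply a running 1..n number over the selected rows, that is usually done with a window function rather than with the ctid. The sketch below shows the idea with ROW_NUMBER() in Python's sqlite3 (which needs SQLite 3.25 or later); the table grid_cell and its id column are hypothetical stand-ins, since the real table's ordering column is not known.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the table in the question.
conn.execute("CREATE TABLE grid_cell (id INTEGER PRIMARY KEY, selectedsiteid TEXT)")
conn.executemany(
    "INSERT INTO grid_cell (id, selectedsiteid) VALUES (?, ?)",
    [(i, "202230060950" if i % 2 else "other") for i in range(1, 51)],
)

# Number the matching rows 1..n in a defined order, independent of physical storage.
rows = conn.execute("""
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, id
    FROM grid_cell
    WHERE selectedsiteid = ?
    ORDER BY rn
""", ("202230060950",)).fetchall()

row_numbers = [rn for rn, _ in rows]
```

The numbering depends only on the ORDER BY inside the window, so it stays stable however the database places the rows on disk.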

SQLAlchemy getting label names out from columns

I want to use the same labels from a SQLAlchemy table, to re-aggregate some data (e.g. I want to iterate through mytable.c to get the column names exactly).
I have some spending data that looks like the following:
| name | region | date | spending |
| John | A | .... | 123 |
| Jack | A | .... | 20 |
| Jill | B | .... | 240 |
I'm then passing it to an existing function we have, that aggregates spending over 2 periods (using a case statement) and groups by region:
grouped table:
| Region | Total (this period) | Total (last period) |
| A | 3048 | 1034 |
| B | 2058 | 900 |
The function returns a SQLAlchemy query object that I can then use subquery() on to re-query e.g.:
subquery = get_aggregated_data(original_table).subquery()
region_A_results = session.query(subquery).filter(subquery.c.region == 'A')
I then want to re-aggregate this subquery (summing every column that can be summed and replacing the region column with the string 'other').
The problem is, if I iterate through subquery.c, I get labels that look like:
anon_1.region
anon_1.sum_this_period
anon_1.sum_last_period
Is there a way to get the textual label from a set of column objects, without the anon_1. prefix? Especially since I feel that the prefix may change depending on how SQLAlchemy decides to generate the query.
Split the name string and take the second part, and if you want to prepare for the chance that the name is not prefixed by the table name, put the code in a try - except block:
for col in subquery.c:
    try:
        print(col.name.split('.')[1])
    except IndexError:
        print(col.name)
Also, the result proxy (region_A_results) has a method keys which returns a list of column names. Again, if you don't need the table names, you can easily get rid of them.
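The splitting step can also be written without the try/except by always taking the last dot-separated part. This is a plain-Python sketch operating on label strings like the ones shown in the question; the helper name bare_label is hypothetical and nothing here calls SQLAlchemy itself.

```python
def bare_label(label: str) -> str:
    """Return the part after the last '.', or the whole label if there is no '.'."""
    return label.rsplit(".", 1)[-1]


# Labels as they appear in the question, plus one that is already unprefixed.
labels = ["anon_1.region", "anon_1.sum_this_period", "anon_1.sum_last_period", "region"]
bare = [bare_label(label) for label in labels]
print(bare)
```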

SQL group by one column, sort by another and transpose a third

I have the following table, which is actually a minimal example of the result of multiple joined tables. I would now like to group by 'person_ID' and get all the 'value' entries in one row, sorted by feature_ID.
person_ID | feature_ID | value
123 | 1 | 1.1
123 | 2 | 1.2
123 | 3 | 1.3
123 | 4 | 1.2
124 | 1 | 1.0
124 | 2 | 1.1
...
The result should be:
123 | 1.1 | 1.2 | 1.3 | 1.2
124 | 1.0 | 1.1 | ...
There should be an elegant SQL query for this, but I can neither come up with one nor find one.
For fast reconstruction, here is the example data:
create table example(person_ID integer, feature_ID integer, value float);
insert into example(person_ID, feature_ID, value) values
(123,1,1.1),
(123,2,1.2),
(123,3,1.3),
(123,4,1.2),
(124,1,1.0),
(124,2,1.1),
(124,3,1.2),
(124,4,1.4);
Edit: Every person has 6374 entries in the real life application.
I am using a PostgreSQL 8.3.23 database, but I think that should probably be solvable with standard SQL.
Databases aren't much good at transposing. There is a nebulous column growth issue at hand: how does the database deal with a variable number of columns? It's not a spreadsheet.
This kind of transposing is normally done in the report writer, not in SQL,
or in a program, for example in PHP.
A dynamic cross tab in SQL is only possible through a procedure, see:
https://www.simple-talk.com/sql/t-sql-programming/creating-cross-tab-queries-and-pivot-tables-in-sql/
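As a sketch of the "do it in a program" route, the following fetches the rows ordered by person and feature and groups them in application code. It uses Python with sqlite3 in place of PostgreSQL, loaded with the example data from the question; the function name pivot_by_person is an assumption.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (person_ID INTEGER, feature_ID INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO example (person_ID, feature_ID, value) VALUES (?, ?, ?)",
    [(123, 1, 1.1), (123, 2, 1.2), (123, 3, 1.3), (123, 4, 1.2),
     (124, 1, 1.0), (124, 2, 1.1), (124, 3, 1.2), (124, 4, 1.4)],
)


def pivot_by_person(conn):
    """Return {person_ID: [values ordered by feature_ID]}."""
    cur = conn.execute(
        "SELECT person_ID, feature_ID, value FROM example ORDER BY person_ID, feature_ID"
    )
    # groupby relies on the rows already being sorted by person_ID.
    return {person: [value for _, _, value in group]
            for person, group in groupby(cur, key=itemgetter(0))}


pivoted = pivot_by_person(conn)
for person, values in pivoted.items():
    print(person, *values, sep=" | ")
```

The database only sorts; the variable-width row is built on the client, so thousands of features per person do not turn into thousands of SQL columns.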

Creating a flattened table/view of a hierarchically-defined set of data

I have a table containing hierarchical data. There are currently ~8 levels in this hierarchy.
I really like the way the data is structured, but performance is dismal when I need to know if a record at level 8 is a child of a record at level 1.
I have PL/SQL stored functions which do these lookups for me, each having a select * from tbl start with ... connect by... statement. This works fine when I'm querying a handful of records, but I'm in a situation now where I need to query ~10k records at once and for each of them run this function. It's taking 2-3 minutes where I need it to run in just a few seconds.
Using some heuristics based on my knowledge of the current data, I can get rid of the lookup function and just do childrecord.key || '%' LIKE parentrecord.key but that's a really dirty hack and will not always work.
So now I'm thinking that for this hierarchically-defined table I need to have a separate parent-child table, which will contain every relationship. For a single chain going from level 1 to level 8 there would be 7 + 6 + ... + 1 = 28 records, associating 1 with 2, 1 with 3,...,1 with 8 and 2 with 3, 2 with 4,...,2 with 8. And so forth.
My thought is that I would need to have an insert trigger where it will basically run the connect by query and for every match going up the hierarchy it will insert a record in the lookup table. And to deal with old data I'll just set up foreign keys to the main table with cascading deletes.
Are there better options than this? Am I missing another way that I could determine these distant ancestor/descendant relationships more quickly?
EDIT: This appears to be exactly what I'm thinking about: http://evolt.org/working_with_hierarchical_data_in_sql_using_ancestor_tables
So what you want is to materialize the transitive closures. That is, given this application table ...
ID | PARENT_ID
------+----------
1 |
2 | 1
3 | 2
4 | 2
5 | 4
... the graph table would look like this:
PARENT_ID | CHILD_ID
-----------+----------
1 | 2
1 | 3
1 | 4
1 | 5
2 | 3
2 | 4
2 | 5
4 | 5
It is possible to maintain a table like this in Oracle, although you will need to roll your own framework for it. The question is whether it is worth the overhead. If the source table is volatile then keeping the graph data fresh may cost more cycles than you will save on the queries. Only you know your data's profile.
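To show what the graph table contains, here is a small sketch that derives it from the ID/PARENT_ID rows above by walking up from each node. It is plain Python, assumes each ID has at most one parent and that there are no cycles, and the function name transitive_closure is only illustrative.

```python
def transitive_closure(parent_of):
    """Return sorted (ancestor, descendant) pairs for a child -> parent mapping."""
    pairs = set()
    for child in parent_of:
        ancestor = parent_of[child]
        while ancestor is not None:  # walk up until the root is reached
            pairs.add((ancestor, child))
            ancestor = parent_of.get(ancestor)
    return sorted(pairs)


# The application table from above: ID -> PARENT_ID (None for the root).
parent_of = {1: None, 2: 1, 3: 2, 4: 2, 5: 4}
closure = transitive_closure(parent_of)
for parent_id, child_id in closure:
    print(parent_id, child_id)
```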
I don't think you can maintain such a graph table with CONNECT BY queries and cascading foreign keys. Too much indirect activity, too hard to get right. Also a materialized view is out, because we cannot write a SQL query which will zap the 1->5 record when we delete the source record for ID=4.
So I suggest you read a paper called Maintaining Transitive Closure of Graphs in SQL by Dong, Libkin, Su and Wong. It contains a lot of theory and some gnarly (Oracle) SQL, but it will give you the grounding to build the PL/SQL you need to maintain a graph table.
"can you expand on the part about it
being too difficult to maintain with
CONNECT BY/cascading FKs? If I control
access to the table and all
inserts/updates/deletes take place via
stored procedures, what kinds of
scenarios are there where this would
break down?"
Consider the record 1->5, which is a short-circuit of 1->2->4->5. Now what happens if, as I said before, we delete the source record for ID=4? Cascading foreign keys could delete the entries for 2->4 and 4->5. But that leaves 1->5 (and indeed 2->5) in the graph table, although they no longer represent a valid edge in the graph.
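The stale-edge scenario can be simulated in a few lines. This sketch models a cascade as "remove every graph row that mentions the deleted ID" and assumes, as in the description above, that the row for ID=5 stays in the source table without a parent; the function names are hypothetical.

```python
def transitive_closure(parent_of):
    """Return the set of (ancestor, descendant) pairs for a child -> parent mapping."""
    pairs = set()
    for child in parent_of:
        ancestor = parent_of[child]
        while ancestor is not None:
            pairs.add((ancestor, child))
            ancestor = parent_of.get(ancestor)
    return pairs


def cascade_delete(graph_rows, deleted_id):
    """Model ON DELETE CASCADE on PARENT_ID and CHILD_ID: drop rows that reference the ID."""
    return {(p, c) for (p, c) in graph_rows if deleted_id not in (p, c)}


graph = transitive_closure({1: None, 2: 1, 3: 2, 4: 2, 5: 4})

# Delete ID=4 from the source; ID=5 is assumed to remain, now without a parent.
remaining = cascade_delete(graph, 4)
correct = transitive_closure({1: None, 2: 1, 3: 2, 5: None})

stale = sorted(remaining - correct)
print(stale)
```

The rows left in stale are exactly the short-circuit edges that no foreign key on the IDs can reach.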
What might work (I think, I haven't done it) would be to use an additional synthetic key in the source table, like this.
ID | PARENT_ID | NEW_KEY
------+-----------+---------
1 | | AAA
2 | 1 | BBB
3 | 2 | CCC
4 | 2 | DDD
5 | 4 | EEE
Now the graph table would look like this:
PARENT_ID | CHILD_ID | NEW_KEY
-----------+----------+---------
1 | 2 | BBB
1 | 3 | CCC
1 | 4 | DDD
1 | 5 | DDD
2 | 3 | CCC
2 | 4 | DDD
2 | 5 | DDD
4 | 5 | DDD
So the graph table has a foreign key referencing the relationship in the source table which generated it, rather than linking to the ID. Then deleting the record for ID=4 would cascade deletes of all records in the graph table where NEW_KEY=DDD.
This would work if any given ID can only have zero or one parent IDs. But it won't work if it is permissible for this to happen:
ID | PARENT_ID
------+----------
5 | 2
5 | 4
In other words the edge 1->5 represents both 1->2->4->5 and 1->2->5. So, what might work depends on the complexity of your data.

How to represent and insert into an ordered list in SQL?

I want to represent the list "hi", "hello", "goodbye", "good day", "howdy" (with that order), in a SQL table:
pk | i | val
------------
1 | 0 | hi
0 | 2 | hello
2 | 3 | goodbye
3 | 4 | good day
5 | 6 | howdy
'pk' is the primary key column. Disregard its values.
'i' is the "index" that defines that order of the values in the 'val' column. It is only used to establish the order and the values are otherwise unimportant.
The problem I'm having is with inserting values into the list while maintaining the order. For example, if I want to insert "hey" and I want it to appear between "hello" and "goodbye", then I have to shift the 'i' values of "goodbye" and "good day" (but preferably not "howdy") to make room for the new entry.
So, is there a standard SQL pattern to do the shift operation, but only shift the elements that are necessary? (Note that a simple "UPDATE table SET i=i+1 WHERE i>=3" doesn't work, because it violates the uniqueness constraint on 'i', and also it updates the "howdy" row unnecessarily.)
Or, is there a better way to represent the ordered list? I suppose you could make 'i' a floating point value and choose values between, but then you have to have a separate rebalancing operation when no such value exists.
Or, is there some standard algorithm for generating string values between arbitrary other strings, if I were to make 'i' a varchar?
Or should I just represent it as a linked list? I was avoiding that because I'd like to also be able to do a SELECT .. ORDER BY to get all the elements in order.
As I read your post, I kept thinking 'linked list', and at the end I still think that's the way to go.
If you are using Oracle, and the linked list is a separate table (or even the same table with a self-referencing id, which I would avoid), then you can use a CONNECT BY query and the pseudo-column LEVEL to determine sort order.
You can easily achieve this by using a cascading trigger that updates any 'index' entry equal to the new one on the insert/update operation to the index value +1. This will cascade through all rows until the first gap stops the cascade - see the second example in this blog entry for a PostgreSQL implementation.
This approach should work independent of the RDBMS used, provided it offers support for triggers to fire before an update/insert. It basically does what you'd do if you implemented your desired behavior in code (increase all following index values until you encounter a gap), but in a simpler and more effective way.
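The "shift until the first gap" behaviour can be sketched in application code as follows. This is not the trigger from the blog entry; it is an equivalent loop written in Python against sqlite3, with a hypothetical table named items and a hypothetical helper insert_at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE items (
        pk  INTEGER PRIMARY KEY,
        i   INTEGER NOT NULL UNIQUE,
        val TEXT NOT NULL
    )
""")
conn.executemany("INSERT INTO items (i, val) VALUES (?, ?)",
                 [(0, "hi"), (2, "hello"), (3, "goodbye"), (4, "good day"), (6, "howdy")])


def insert_at(conn, index, val):
    """Insert val at position index, shifting only the run of rows up to the first gap."""
    with conn:  # all changes commit together or not at all
        # Find the consecutive occupied indexes starting at the target position.
        occupied = []
        i = index
        while conn.execute("SELECT 1 FROM items WHERE i = ?", (i,)).fetchone():
            occupied.append(i)
            i += 1
        # Shift from the highest index down so UNIQUE(i) is never violated.
        for j in reversed(occupied):
            conn.execute("UPDATE items SET i = i + 1 WHERE i = ?", (j,))
        conn.execute("INSERT INTO items (i, val) VALUES (?, ?)", (index, val))


insert_at(conn, 3, "hey")
ordered = conn.execute("SELECT i, val FROM items ORDER BY i").fetchall()
print(ordered)
```

With the example data, "goodbye" and "good day" move up by one while "howdy" keeps its index, because the gap at 5 stops the shift.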
Alternatively, if you can live with a restriction to SQL Server, check the hierarchyid type. While mainly geared at defining nested hierarchies, you can use it for flat ordering as well. It somewhat resembles your approach using floats, as it allows insertion between two positions by assigning fractional values, thus avoiding the need to update other entries.
If you don't use numbers, but Strings, you may have a table:
pk | i | val
------------
1 | a0 | hi
0 | a2 | hello
2 | a3 | goodbye
3 | b | good day
5 | b1 | howdy
You may insert a4 between a3 and b, a21 between a2 and a3, a1 between a0 and a2, and so on. You would need a clever function to generate an i for a new value v between p and n. The index strings can become longer and longer, so you may need a big rebalancing from time to time.
Another approach would be to implement a (doubly) linked list in the table, where you don't store indexes but links to the previous (and next) element, which means that an insert normally has to update only 1-2 elements:
pk | prev | val
------------
1 | NULL | hi
0 | 1 | hello
2 | 0 | goodbye
3 | 2 | good day
5 | 3 | howdy
To insert hey between hello & goodbye,
hey gets pk 6:
pk | prev | val
------------
1 | NULL | hi
0 | 1 | hello
6 | 0 | hey <- ins
2 | 6 | goodbye <- upd
3 | 2 | good day
5 | 3 | howdy
The previous element of hey is hello with pk=0, and goodbye, which linked to hello until now, has to link to hey from now on.
But I don't know whether an 'order by' mechanism for such a list is available in many database implementations.
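On databases that support recursive common table expressions, the linked list can be walked in order inside SQL. The sketch below shows this with Python's sqlite3; it assumes the head row stores NULL in prev, and the table name items and the CTE name walk are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE items (
        pk   INTEGER PRIMARY KEY,
        prev INTEGER REFERENCES items(pk),  -- NULL marks the head of the list (assumption)
        val  TEXT NOT NULL
    )
""")
conn.executemany("INSERT INTO items (pk, prev, val) VALUES (?, ?, ?)", [
    (1, None, "hi"),
    (0, 1, "hello"),
    (6, 0, "hey"),
    (2, 6, "goodbye"),
    (3, 2, "good day"),
    (5, 3, "howdy"),
])

# Start at the head and repeatedly follow the row whose prev is the current pk.
ordered = [val for (val,) in conn.execute("""
    WITH RECURSIVE walk(pk, val, pos) AS (
        SELECT pk, val, 1 FROM items WHERE prev IS NULL
        UNION ALL
        SELECT i.pk, i.val, w.pos + 1
        FROM items i JOIN walk w ON i.prev = w.pk
    )
    SELECT val FROM walk ORDER BY pos
""")]
print(ordered)
```

The traversal costs one join step per element, so for long lists an index on prev matters.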
Since I had a similar problem, here is a very simple solution:
Make your i column floats, but insert integer values for the initial data:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
Then, if you want to insert something in between, just compute a float value in the middle between the two surrounding values:
pk | i | val
------------
1 | 0.0 | hi
0 | 2.0 | hello
2 | 3.0 | goodbye
3 | 4.0 | good day
5 | 6.0 | howdy
6 | 2.5 | hey
This way the number of inserts between the same two values is limited to the resolution of float values but for almost all cases that should be more than sufficient.
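A short sketch of the midpoint rule, including a check for the point where the float resolution runs out. It is plain Python using IEEE double precision; the function name midpoint and the choice to raise ValueError are assumptions for illustration.

```python
def midpoint(lo, hi):
    """Return a sort key strictly between lo and hi, or raise if none is representable."""
    mid = (lo + hi) / 2
    if not lo < mid < hi:
        raise ValueError("no representable value left between the neighbours; rebalance")
    return mid


# Insert "hey" between "hello" (2.0) and "goodbye" (3.0).
hey_key = midpoint(2.0, 3.0)

# Worst case: keep inserting directly after 2.0 until the keys can no longer be split.
splits = 0
upper = 3.0
try:
    while True:
        upper = midpoint(2.0, upper)
        splits += 1
except ValueError:
    pass

print(hey_key, splits)
```

With double precision, repeated insertion at the same spot between 2.0 and 3.0 succeeds on the order of fifty times before a rebalance is needed; inserts spread across the list consume the resolution far more slowly.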