Extracting network/tree structure from table in PostgreSQL - sql

I have a PostgreSQL table that contains several conditions and their individual successors. Some conditions have several successors and these successors might have several successors too. So the goal is extract all possible chains of conditions to achieve something like a tree diagram in data.
The table looks like this:
id | con | succ
----|-----|-----
1 | a | b
2 | a | c
3 | a | d
4 | b | c
5 | b | f
6 | c | e
7 | c | g
8 | c | h
9 | d | h
10 | d | i
I still have no clear idea how to store the single chains in the end, but I need the starting point (a), the respective end point and all nodes between starting and end point.
I'm thankful for all kind of advice on how to store the chains and how to extract them.
UPDATE:
This is an extract of my data:
ID | parent_ID
----|----------
403 | 302
404 | 2xx
405 | 303
406 | 304
407 | 304
408 | 2xx
409 | 305
501 | 2xx
502 | 305
503 | 2xx
504 | 2xx
505 | 2xx
506 | 305
507 | 2xx
508 | 306
509 | 2xx
510 | 307
511 | 308
512 | 308
513 | 308
514 | 309
515 | 310
600 | 5xx
You see that some parent-IDs are not ID themselves but groups of IDs ('all beginning with 2'). Now the question is how to make the recursive query run or how to make the recursive query handle the '2xx'. The values are stored as characters. Instead of '2xx' another notation is possible aswell.

Querying tree- and graph-related data stored in a database efficiently is a rather vast topic.
In terms of storage, note that storing an (id, parent_id) pair will usually be the better (as in widely accepted) option.
The question is how to query it, and more importantly how to do so efficiently.
Your main options for trees include:
WITH queries: http://www.postgresql.org/docs/current/static/queries-with.html
Pros: Built-in, and works fine when dealing with small sets
Cons: Doesn't scale well for larger sets
MPTT, aka pre-ordered trees: http://en.wikipedia.org/wiki/Tree_traversal
Pros: Fastest reads for trees
Cons: Slow writes, hard to maintain unless you do rows one by one
Nested sets (or intervals) for trees: http://en.wikipedia.org/wiki/Nested_set_model
Pros: Fast reads for trees
Cons: Faster than MPTT but still slow, not trivial to understand
The ltree type in Postgres contrib: http://www.postgresql.org/docs/current/static/ltree.html
Pros: Built-in, indexable
Cons: Not ORM friendly
I'd add a hybrid variation of MPTT to the list: if you implement MPTT using float indexes, you can get away with not updating anything when moving things around in your tree, which makes things plenty fast. It's a lot trickier to maintain however, because collisions can occur when the difference between two indexes is too small — you need to re-index a large enough subset of the tree when this happens.
For graphs, WITH queries work too. Variations of MPTT and nested sets exist as well; for instance the GRIPP index. It's an area where research and new indexing methods are still quite active.

Your best best is to work with the ltree data type. See the documentation here. That does require that you rework your table structure a bit though. If that is not an option, you should look at recursive with-queries that can - at first sight - work with your current table structure, but the queries will provide data in a format that is not as easy to manipulate as ltree data.
Converting your current table to a ltree variant is best done using a recursive with-query. First you need to create a new table to hold the ltree column:
CREATE TABLE tree_list (
id int,
chain ltree
);
Then run the recursive query and insert the results into the new table:
WITH RECURSIVE build_tree(id, chain) AS (
SELECT id, con::ltree || succ
FROM tree
WHERE con = 'a'
UNION ALL
SELECT tree.id, build_tree.chain || tree.succ
FROM tree, build_tree
WHERE build_tree.chain ~ ('*.' || tree.con)::lquery)
INSERT INTO tree_list SELECT * FROM build_tree;
You will note that the 10 rows of data you provide above will yield 13 chains because there are multiple paths from a to each of e, g and h. This query should work with trees of practically unlimited depth.
id | chain
----+---------
1 | a.b
2 | a.c
3 | a.d
4 | a.b.c
5 | a.b.f
6 | a.c.e
7 | a.c.g
8 | a.c.h
9 | a.d.h
10 | a.d.i
6 | a.b.c.e
7 | a.b.c.g
8 | a.b.c.h
(13 rows)

Related

how to have one itempointer serialize from 1 to n across the selected rows

as shown in the example below, the output of the query contains blockid startds from 324 and it ends at 127, hence, the itempointer or the row index within the block starts from one for each new block id. in otherwords, as shown below
for the blockid 324 it has only itempointer with index 10
for the blockid 325 it has itempointers starts with 1 and ends with 9
i want to have a single blockid so that the itempointer or the row index starts from 1 and ends with 25
plese let me know how to achive that and
why i have three different blockids?
ex-1
query:
select ctid
from awanti_grid_cell_data agcd
where selectedsiteid = '202230060950'
and centerPointsOfWindowAsGeoJSONInEPSG4326ForCellsInTreatment IS NOT NULL
and centerPointsOfWindowAsGeoJSONInEPSG4326ForCellsInTreatment <> 'None'
result:
|ctid |
|--------|
|(324,10)|
|(325,1) |
|(325,2) |
|(325,3) |
|(325,4) |
|(325,5) |
|(325,6) |
|(325,7) |
|(325,8) |
|(325,9) |
|(326,1) |
|(326,2) |
|(326,3) |
|(326,4) |
|(326,5) |
|(326,6) |
|(326,7) |
|(326,8) |
|(326,9) |
|(327,1) |
|(327,2) |
|(327,3) |
|(327,4) |
|(327,5) |
|(327,6) |
You are missing the point. The ctid is the physical address of a row in the table, and it is none of your business. The database is free to choose whatever place it thinks fit for a table row. As a comparison, you cannot go to the authorities and request that your social security number should be 12345678 - it is simply assigned to you, and you have no say. That's how it is with the physical location of tuples.
Very likely you are not asking this question out of pure curiosity, but because you want to solve some problem. You should instead ask a question about your real problem, and there may be a good answer to that. But whatever problem you are trying to solve, using the ctid is probably not the correct answer, in particular if you want to control it.

Best data structure for finding tags of nested locations

Somebody pointed out that my data structure architecture sucks.
The task
I have a locations table which stores the name of a location. Then I have a tags table which stores information about those locations. The locations have a hierarchie which I want to use to get all tags.
Example
Locations:
USA <- California <- San Francisco <- Mission St
Tags:
USA: English
California: Sunny
California: West coast
San Francisco: Sea side
Mission St: Cable car station
If somebody requests information about the Mission St I want to deliver all tags of it and it's ancestors (["English", "Sunny", "West coast", "Sea side", "Cable car station"]. If I request all tags of California the answer would be ["English", "Sunny", "West coast"].
I'm looking for the best read performance! I don't care about write performance. This data is not changed very often. And I don't care about table sizes either. If I need more or larger tables to solve this quicker so be it.
The tables
So currently I'm thinking about setting up these tables:
locations
id | name
---|--------------
1 | USA
2 | California
3 | San Francisco
4 | Mission St
tags
id | location_id | name
---|-------------|------------------
1 | 1 | English
2 | 2 | Sunny
3 | 2 | West coast
4 | 3 | Sea side
5 | 4 | Cable car station
ancestors
I added a position field to store the hierarchy.
| id | location_id | ancestor_id | position |
|----|-------------|-------------|----------|
| 1 | 2 | 1 | 1 |
| 2 | 3 | 2 | 1 |
| 3 | 3 | 1 | 2 |
| 4 | 4 | 3 | 1 |
| 5 | 4 | 2 | 2 |
| 6 | 4 | 1 | 3 |
Question
Is this a good solution to solve the problem or is there a better one? I want to select as fast as possible all tags of any given location including all the tags of it's ancestors. I'm using a PostgreSQL database but I think this is a pure SQL architecture problem.
Your problem seems to consist of two challenges. The most interesting is "how do I store hierarchies in a relational database". There are lots of answers to that - the one you've proposed is the most common.
There's an alternative called "nested set" which is faster for reading (in your example, finding all locations within a particular hierarchy would be "between x and y".
Postgres has dedicated support for hierachies; I'd assume this would also provide great performance.
The second part of your question is "given a path in my hierarchy, retrieve all matching tags". The easiest option is to join to the tags table as you suggest.
The final aspect is "should you denormalize/precalculate". I usually recommend building and optimizing the "normalized" solution and only denormalize when you need to.
If you want to deliver all tags for a particular location, then I would recommend replicating the data and storing the tags in a tags array on a row for each location.
You say that the locations don't change very much. So, I would simply batch create the entire table, when any underlying data changes.
Modifying the data in situ is rather problematic. A single update could end up affecting a zillion different rows -- consider a tag change on USA. Recalculating the entire table is going to be more efficient.
If you need to search on the tags as well as return them, then I would go for a more traditional structure of a table with two important columns, location and tag. Then you can have indexes on both (location) and (tag) to facilitate searching in either direction.
If write performance is not crucial, I would go for denormalization of the database. That means you use the above structure for your write operations and fill a table for your read operations by a trigger or a some async job, if you are afraid of triggers. Then the read performance is optimal, but you have to invest a bit more into the write logic.
Using the above structure for read operations is indeed not a smart solution, cause you don't know how deep the tree can get.

Storing large SQL datasets with variable numbers of columns

In America’s Cup yachting, we generate large datasets where at every time-stamp (e.g. 100Hz) we need to store maybe 100-1000 channels of sensor data (e.g. speed, loads, pressures). We store this in MS SQL Server and need to be able to retrieve subsets of channels of the data for analysis, and perform queries such as the maximum pressure on a particular sensor in a test, or over an entire season.
The set of channels to be stored stays the same for several thousand time-stamps, but day-to-day will change as new sensors are added, renamed, etc... and depending on testing, racing or simulating, the number of channels can vary greatly.
The textbook way to structure the SQL tables would probably be:
OPTION 1
ChannelNames
+-----------+-------------+
| ChannelID | ChannelName |
+-----------+-------------+
| 50 | Pressure |
| 51 | Speed |
| ... | ... |
+-----------+-------------+
Sessions
+-----------+---------------+-------+----------+
| SessionID | Location | Boat | Helmsman |
+-----------+---------------+-------+----------+
| 789 | San Francisco | BoatA | SailorA |
| 790 | San Francisco | BoatB | SailorB |
| ... | ... | ... | |
+-----------+---------------+-------+----------+
SessionTimestamps
+-------------+-------------+------------------------+
| SessionID | TimestampID | DateTime |
+-------------+-------------+------------------------+
| 789 | 12345 | 2013/08/17 10:30:00:00 |
| 789 | 12346 | 2013/08/17 10:30:00:01 |
| ... | ... | ... |
+-------------+-------------+------------------------+
ChannelData
+-------------+-----------+-----------+
| TimestampID | ChannelID | DataValue |
+-------------+-----------+-----------+
| 12345 | 50 | 1015.23 |
| 12345 | 51 | 12.23 |
| ... | ... | ... |
+-------------+-----------+-----------+
This structure is neat but inefficient. Each DataValue requires three storage fields, and at each time-stamp we need to INSERT 100-1000 rows.
If we always had the same channels, it would be more sensible to use one row per time-stamp and structure like this:
OPTION 2
+-----------+------------------------+----------+-------+----------+--------+-----+
| SessionID | DateTime | Pressure | Speed | LoadPt | LoadSb | ... |
+-----------+------------------------+----------+-------+----------+--------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+-------+----------+--------+-----+
However, the channels change every day, and over the months the number of columns would grow and grow, with most cells ending up empty. We could create a new table for every new Session, but it doesn’t feel right to be using a table name as a variable, and would ultimately result in tens of thousands of tables – also, it becomes very difficult to query over a season, with data stored in multiple tables.
Another option would be:
OPTION 3
+-----------+------------------------+----------+----------+----------+----------+-----+
| SessionID | DateTime | Channel1 | Channel2 | Channel3 | Channel4 | ... |
+-----------+------------------------+----------+----------+----------+----------+-----+
| 789 | 2013/08/17 10:30:00:00 | 1015.23 | 12.23 | 101.12 | 98.23 | ... |
| 789 | 2013/08/17 10:30:00:01 | 1012.51 | 12.44 | 100.33 | 96.82 | ... |
| ... | ... | ... | | | | |
+-----------+------------------------+----------+----------+----------+----------+-----+
with a look-up from Channel column IDs to channel names – but this requires an EXEC or eval to execute a pre-constructed query to obtain the channel we want – because SQL isn’t designed to have column names as variables. On the plus side, we can re-use columns when channels change, but there will still be many empty cells because the table has to be as wide as the largest number of channels we ever encounter. Using a SPARSE table may help here, but I am uncomfortable with the EXEC/eval issue above.
What is the right solution to this problem, that achieves efficiency of storage, inserts and queries?
I would go with Option 1.
Data integrity is first, optimization (if needed) - second.
Other options would eventually have a lot of NULL values and other problems stemming from not being normalized. Managing the data and making efficient queries would be difficult.
Besides, there is a limit on the number of columns that a table can have - 1024, so if you have 1000 sensors/channels you are already dangerously close to the limit. Even if you make your table a wide table, which allows 30,000 columns, still there is a limitation on the size of the row in a table - 8,060 bytes per row. And there are certain performance considerations.
I would not use wide tables in this case, even if I was sure that the data for each row would never exceed 8060 bytes and growing number of channels would never exceed 30,000.
I don't see a problem with inserting 100 - 1000 rows in Option 1 vs 1 row in other options. To do such INSERT efficiently don't make 1000 individual INSERT statements, do it in bulk. In various places in my system I use the following two approaches:
1) Build one long INSERT statement
INSERT INTO ChannelData (TimestampID, ChannelID, DataValue) VALUES
(12345, 50, 1015.23),
(12345, 51, 12.23),
...
(), (), (), (), ........... ();
that contains 1000 rows and execute it as normal INSERT in one transaction, rather than 1000 transactions (check the syntax details).
2) Have a stored procedure that accepts a table-valued parameter. Call such procedure passing 1000 rows as a table.
CREATE TYPE [dbo].[ChannelDataTableType] AS TABLE(
[TimestampID] [int] NOT NULL,
[ChannelID] [int] NOT NULL,
[DataValue] [float] NOT NULL
)
GO
CREATE PROCEDURE [dbo].[InsertChannelData]
-- Add the parameters for the stored procedure here
#ParamRows dbo.ChannelDataTableType READONLY
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
BEGIN TRANSACTION;
BEGIN TRY
INSERT INTO [dbo].[ChannelData]
([TimestampID],
[ChannelID],
[DataValue])
SELECT
TT.[TimestampID]
,TT.[ChannelID]
,TT.[DataValue]
FROM
#ParamRows AS TT
;
COMMIT TRANSACTION;
END TRY
BEGIN CATCH
ROLLBACK TRANSACTION;
END CATCH;
END
GO
If possible, accumulate data from several timestamps before inserting to make the batches larger. You should try with your system and find the optimal size of the batch. I have batches around 10K rows using the stored procedure.
If you have your data coming from sensors 100 times a second, then I would at first dump the incoming raw data in some very simple CSV file(s) and have a parallel background process that would insert it into the database in chunks. In other words, have some buffer for incoming data, so that if the server can't cope with the incoming volume, you would not loose your data.
Based on your comments, when you said that some channels are likely to be more interesting and queried several times, while others are less interesting, here is one optimization that I would consider. In addition to having one table ChannelData for all channels have another table InterestingChannelData. ChannelData would have the whole set of data, just in case. InterestingChannelData would have a subset only for the most interesting channels. It should be much smaller and it should take less time to query it. In any case, this is an optimization (denormalization/data duplication) built on top of properly normalized structure.
Is your process like this:
Generate data during the day
Analyse data afterwards
If these are separate activities then you might want to consider using different 'insert' and 'select' schemas. You could create a schema that's fast for inserting on the boat, then afterwards you batch upload this data into an analysis optimised schema. This requires a transformation step (where for example you map generic column names into useful column names)
This is along the lines of data warehousing and data marts. In this kind of design, you batch load and optimise the schema for reporting. Does your current daily upload have much of a window?

SQL: What is a value?

The Question
One thing that I am confused about is the technical definition of possibly the most basic component of a database: a single value.
Some Examples
I understand and follow (at a minimum) the first three normal forms of database normalization - or so I think. That said, with the introduction of RANGE in PostgreSQL 9.2 I started thinking about what makes a single value.
From the docs:
Range types are useful because they represent many element values in a single range value
So, what are you? Several values, or a single value... nothingness... 42?
Why does this matter?
Because is speaks directly to the Second Normal Form:
Create separate tables for sets of values that apply to multiple records.
Relate these tables with a foreign key.
#1 Ranges
For example, in Postgres 9.1 I had some tables structured like this:
"SomeSchema"."StatusType"
"StatusTypeID" | "StatusType"
--------------------|----------------
1 | Start
2 | Stop
"SomeSchema"."Statuses"
"StatusID" | "Identifier" | "StatusType" | "Value" | "Timestamp"
---------------|----------------|----------------|---------|---------------------
1 | 1 | 1 | 0 | 2000-01-01 00:00:00
2 | 1 | 2 | 5 | 2000-01-02 12:00:00
3 | 2 | 1 | 1 | 2000-01-01 00:00:00
4 | 3 | 1 | 2 | 2000-01-01 00:00:00
5 | 2 | 2 | 7 | 2000-01-01 18:30:00
6 | 1 | 2 | 3 | 2000-01-02 12:00:00
This enabled me to keep an historical record of how things were configured at any given point in time.
This structure takes the position that the data in the "Value" column were all separate values.
Now, in Postgres 9.2 if I do the same thing with a RANGE value it would look like this:
"SomeSchema"."Statuses"
"StatusID" | "Identifier" | "Value" | "Timestamp"
---------------|----------------|-------------|---------------------
1 | 1 | (0, NULL) | 2000-01-01 00:00:00
2 | 1 | (0, 5) | 2000-01-02 12:00:00
3 | 2 | (1, NULL) | 2000-01-01 00:00:00
4 | 3 | (2, NULL) | 2000-01-01 00:00:00
5 | 2 | (1, 7) | 2000-01-01 18:30:00
6 | 1 | (0, 3) | 2000-01-02 12:00:00
Again, this structure would enable me to keep an historical record of how things were configured, but I would be storing the same value several times in separate places. It makes updating (technically inserting a new record) more tricky because I have to make sure the data rolls over from the original record.
#2 Arrays
Arrays have been around for a long time, and while they can be abused, I tend to use them for things like color codes. For example, my project stores information and at times needs to know how to display it. I could create three columns to store red, green, and blue values; but that just seems silly. When would I ever create a foreign key (or even just filter) based on one of the given color codes.
When I created the field it was from the perspective that I needed to store a color in a neutral format so that I could feed anything that accepts a color value. I made the column an array and filled it with the appropriate codes to make the color I want.
#3 PostGIS: Geometry & Geography
When storing a polygon in PostGIS, it stores all the points that make the boundary in a single field. If one point were to change and I wanted to keep an historical record, I would have to store all of the points that have not changed twice in order to store the new polygon along with the old.
So, what is a value? and... if RANGE, ARRAY, and GEOGRAPHY are values do they really break the second normal form?
The fact that some operation can derive new values from X that appear to be components of X's value doesn't mean X itself isn't "single valued". Thus "range" values and "geography" values should be single values as far as the DBMSs type system is concerned. I don't know enough about Postgresql's implementation to know whether "arrays" can be considered as single values in themselves. SQL DBMSs like Postgresql are not truly relational DBMSs and SQL supports various structures that certainly aren't proper relation variables, values or types (pointers, nulls and other exotica).
This is a difficult and sometimes controversial topic however. If you haven't read it then I recommend the book Databases, Types, and the Relational Model - The Third Manifesto by Date and Darwen. It addresses exactly the kind of questions you are asking about.
I don't like your description of 2NF but it's not very relevant here.

Creating a flattened table/view of a hierarchically-defined set of data

I have a table containing hierarchical data. There are currently ~8 levels in this hierarchy.
I really like the way the data is structured, but performance is dismal when I need to know if a record at level 8 is a child of a record at level 1.
I have PL/SQL stored functions which do these lookups for me, each having a select * from tbl start with ... connect by... statement. This works fine when I'm querying a handful of records, but I'm in a situation now where I need to query ~10k records at once and for each of them run this function. It's taking 2-3 minutes where I need it to run in just a few seconds.
Using some heuristics based on my knowledge of the current data, I can get rid of the lookup function and just do childrecord.key || '%' LIKE parentrecord.key but that's a really dirty hack and will not always work.
So now I'm thinking that for this hierarchically-defined table I need to have a separate parent-child table, which will contain every relationship...for a hierarchy going from level 1-8 there would be 8! records, associating 1 with 2, 1 with 3,...,1 with 8 and 2 with 3, 2 with 4,...,2 with 8. And so forth.
My thought is that I would need to have an insert trigger where it will basically run the connect by query and for every match going up the hierarchy it will insert a record in the lookup table. And to deal with old data I'll just set up foreign keys to the main table with cascading deletes.
Are there better options than this? Am I missing another way that I could determine these distant ancestor/descendant relationships more quickly?
EDIT: This appears to be exactly what I'm thinking about: http://evolt.org/working_with_hierarchical_data_in_sql_using_ancestor_tables
So what you want is to materialize the transitive closures. That is, given this application table ...
ID | PARENT_ID
------+----------
1 |
2 | 1
3 | 2
4 | 2
5 | 4
... the graph table would look like this:
PARENT_ID | CHILD_ID
-----------+----------
1 | 2
1 | 3
1 | 4
1 | 5
2 | 3
2 | 4
2 | 5
4 | 5
It is possible to maintain a table like this in Oracle, although you will need to roll your own framework for it. The question is whether it is worth the overhead. If the source table is volatile then keeping the graph data fresh may cost more cycles than you will save on the queries. Only you know your data's profile.
I don't think you can maintain such a graph table with CONNECT BY queries and cascading foreign keys. Too much indirect activity, too hard to get right. Also a materialized view is out, because we cannot write a SQL query which will zap the 1->5 record when we delete the source record for ID=4.
So what I suggest you read a paper called Maintaining Transitive Closure of Graphs in SQL by Dong, Libkin, Su and Wong. This contains a lot of theory and some gnarly (Oracle) SQL but it will give you the grounding to build the PL/SQL you need to maintain a graph table.
"can you expand on the part about it
being too difficult to maintain with
CONNECT BY/cascading FKs? If I control
access to the table and all
inserts/updates/deletes take place via
stored procedures, what kinds of
scenarios are there where this would
break down?"
Consider the record 1->5 which is a short-circuit of 1->2->4->5. Now what happens if, as I said before, we delete the the source record for ID=4? Cascading foreign keys could delete the entries for 2->4 and 4->5. But that leaves 1->5 (and indeed 2->5) in the graph table although they no longer represent a valid edge in the graph.
What might work (I think, I haven't done it) would be to use an additional synthetic key in the source table, like this.
ID | PARENT_ID | NEW_KEY
------+-----------+---------
1 | | AAA
2 | 1 | BBB
3 | 2 | CCC
4 | 2 | DDD
5 | 4 | EEE
Now the graph table would look like this:
PARENT_ID | CHILD_ID | NEW_KEY
-----------+----------+---------
1 | 2 | BBB
1 | 3 | CCC
1 | 4 | DDD
1 | 5 | DDD
2 | 3 | CCC
2 | 4 | DDD
2 | 5 | DDD
4 | 5 | DDD
So the graph table has a foreign key referencing the relationship in the source table which generated it, rather than linking to the ID. Then deleting the record for ID=4 would cascade deletes of all records in the graph table where NEW_KEY=DDD.
This would work if any given ID can only have zero or one parent IDs. But it won't work if it is permissible for this to happen:
ID | PARENT_ID
------+----------
5 | 2
5 | 4
In other words the edge 1->5 represents both 1->2->4->5 and 1->2->5. So, what might work depends on the complexity of your data.