Improve MERGE performance when using big tables - sql

Context
We have a model in which each element has an element kind and from 0 to N features. Each feature belongs to only one element and has a feature name.
This is modeled as the following tables:
ELEMENTS
elem_id int not null -- PK
elem_elki_id int not null -- FK to ELEMENT_KINDS
-- more columns with elements data
ELEMENT_KINDS
elki_id int not null -- PK
-- more columns with elements kinds data
FEATURES
feat_id int not null -- PK
feat_elem_id int not null -- FK to ELEMENTS
feat_fena_id int not null -- FK to FEATURE_NAMES
-- more columns with features data
FEATURE_NAMES
fena_id int not null -- PK
-- more columns with feature_names data
Requirement
There is a new requirement of replacing the feature names table with a feature kinds table.
There is one (and only one) feature kind for each (element kind, feature name) pair.
The changes in the models were adding a new column and creating a new table:
ALTER TABLE features ADD feat_feki_id int null;
CREATE TABLE FEATURE_KINDS
(
feki_id int not null, -- PK
feki_elki_id int not null, -- FK to ELEMENT_KINDS
feki_fena_id int null, -- FK* to FEATURE_NAMES
-- more columns with feature kinds data
)
*feki_fena_id is actually a temporary column showing which feature name was used to create each feature kind. After populating feat_feki_id, feki_fena_id should be discarded along with feat_fena_id and the feature names table.
Problem
After successfully populating the feature kinds table, we are trying to populate the feat_feki_id column using the following query:
MERGE INTO features F
USING
(
SELECT *
FROM elements
INNER JOIN feature_kinds
ON elem_elki_id = feki_elki_id
) EFK
ON
(
F.feat_elem_id = EFK.elem_id AND
F.feat_fena_id = EFK.feki_fena_id
)
WHEN MATCHED THEN
UPDATE SET F.feat_feki_id = EFK.feki_id;
This works in small-scale scenarios with test data, but in production we have ~20 million elements and ~2000 feature_kinds, and it runs for about an hour before failing with ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDOTBS1'.
Question
Is there any way I could improve the performance of the MERGE so that it works? (Maybe I'm lacking some indexes?)
Is there another alternative to fill up the feat_feki_id column? (We already have tried UPDATE instead of MERGE with similar results)

It's not clear whether there is something wrong going on or whether your undo segments are just too small. Can you do the following statement without getting an ORA-30036?
UPDATE features f SET f.feat_feki_id = 12345;
If that doesn't work, you just need to increase the size of your undo segment. Kludges are available to do the update in chunks, but you really shouldn't have to do that.
Assuming it's NOT a simple UNDO size issue, one thing you might do is make sure that your MERGE (or UPDATE) is updating rows in the order they appear in your table. Otherwise, you could be revisiting the same blocks over and over, really hurting performance and increasing UNDO usage. I encountered this in a similar operation I had to do a few years ago and I was shocked when I finally figured it out.
To avoid the problem I had, you would want something like this:
MERGE INTO features F
USING
(
SELECT f.feat_id, fk.feki_id
FROM features f
INNER JOIN elements e ON e.elem_id = f.feat_elem_id
INNER JOIN feature_kinds fk ON fk.feki_elki_id = e.elem_elki_id and fk.feki_fena_id = f.feat_fena_id
-- Order by the ROWID of the table you are updating to ensure you are not revisiting the same block over and over
ORDER BY f.rowid
) EFK
ON
(
F.feat_id = EFK.feat_id
)
WHEN MATCHED THEN
UPDATE SET F.feat_feki_id = EFK.feki_id;
I may have gotten your data model wrong, but the key point is to include the FEATURES table in the MERGE query and ORDER BY features.rowid to ensure that the updates happen in row order.
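If the undo space really cannot be enlarged, the chunked fallback mentioned above could look roughly like this. It is only a sketch: it uses Python's standard sqlite3 module with an in-memory database instead of Oracle, the table and column names follow the question, and `populate_in_chunks` and `chunk_size` are names made up for illustration. Each chunk covers a contiguous feat_id range and is committed on its own, so no single transaction has to hold undo for the whole table.

```python
import sqlite3

def populate_in_chunks(conn, chunk_size=1000):
    """Fill features.feat_feki_id one feat_id range at a time (hypothetical helper)."""
    low, high = conn.execute(
        "SELECT MIN(feat_id), MAX(feat_id) FROM features").fetchone()
    if low is None:          # empty table, nothing to do
        return 0
    updated = 0
    start = low
    while start <= high:
        cur = conn.execute(
            """
            UPDATE features
               SET feat_feki_id = (
                     SELECT fk.feki_id
                       FROM elements e
                       JOIN feature_kinds fk
                         ON fk.feki_elki_id = e.elem_elki_id
                      WHERE e.elem_id = features.feat_elem_id
                        AND fk.feki_fena_id = features.feat_fena_id)
             WHERE feat_id >= ? AND feat_id < ?
            """, (start, start + chunk_size))
        updated += cur.rowcount
        conn.commit()        # one short transaction per chunk
        start += chunk_size
    return updated

# Tiny made-up data set to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE elements (elem_id INTEGER PRIMARY KEY, elem_elki_id INT NOT NULL);
CREATE TABLE feature_kinds (feki_id INTEGER PRIMARY KEY,
                            feki_elki_id INT NOT NULL, feki_fena_id INT);
CREATE TABLE features (feat_id INTEGER PRIMARY KEY, feat_elem_id INT NOT NULL,
                       feat_fena_id INT NOT NULL, feat_feki_id INT);
INSERT INTO elements VALUES (1, 10), (2, 20);
INSERT INTO feature_kinds VALUES (100, 10, 7), (101, 20, 7), (102, 10, 8);
INSERT INTO features VALUES (1, 1, 7, NULL), (2, 1, 8, NULL), (3, 2, 7, NULL);
""")
total = populate_in_chunks(conn, chunk_size=2)
result = conn.execute(
    "SELECT feat_id, feat_feki_id FROM features ORDER BY feat_id").fetchall()
```

Walking the key range in ascending order also tends to touch rows roughly in the order they were stored, which is in the same spirit as the ORDER BY rowid advice, though that is an assumption about how the table was loaded.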

Related

Limit Rows in ETL Without Date Column for Cue

We have two large tables (Clients and Contacts) which undergo an ETL process every night, being inserted into a single "People" table in the data warehouse. This table is used in many places and cannot be significantly altered without a lot of work.
The source tables are populated by third party software; we used to assume that we could identify the rows that had been updated since last night by using the "UpdateDate" column in each, but more recently identified some rows that were not touched by the ETL, as the "UpdateDate" column was not behaving as we had thought; the software company do not see this as a bug, so we have to live with this fact.
As a result, we now take all source rows, transformed into a temp staging table and then Merge that into the data warehouse, using the Merge to identify any changed values. We have noticed that this process is taking too long on some days and would like to limit the number of rows that the ETL process looks at, as we believe that the reason for the hold-up is principally the sheer volume of data that is examined and stored on the temp database. We can see no way to look purely at the source data and identify when each row last changed.
Here is a simplified pseudocode of the ETL stored procedure, although what the procedure actually does is not really relevant to the question (included just in case you disagree with me!)
CREATE #TempTable (ClientOrContact BIT NOT NULL, Id INT NOT NULL, [Some_Other_Columns])
INSERT #TempTable
SELECT 1 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ClientsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
INSERT #TempTable
SELECT 0 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ContactsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
ALTER #TempTable ADD PRIMARY KEY (ClientOrContact, Id);
MERGE Target_PeopleTable AS Tgt
USING (SELECT [SomeColumns] FROM #TempTable JOIN [SomeOtherTables]) AS Src
ON Tgt.ClientOrContact = Src.ClientOrContact AND Tgt.Id = Src.Id
WHEN MATCHED AND NOT EXISTS (SELECT Tgt.* INTERSECT SELECT Src.*)
THEN UPDATE SET ([All_NonKeyTargetColumns] = [All_NonKeySourceColumns])
WHEN NOT MATCHED BY Target THEN INSERT [All_TargetColumns] VALUES [All_SourceColumns]
OUTPUT $Action INTO #Changes;
RETURN COUNT(*) FROM #Changes;
GO
The source tables have about 1.5M rows each, but each day only a relatively small number of rows are inserted or updated (never deleted). There are about 50 columns in each table; of those, about 40 columns can have changed values each night. Most columns are VARCHAR and each table contains an independent incremental primary key column. We can add indexes to the source tables, but not alter them in any other way. (They have already been indexed by a predecessor.) The source tables and target table are on the same server, but different databases. Edit: The Target Table has a composite primary key on the ClientOrContact and Id columns, matching that shown on the temp table in the script above.
So, my question is this - please could you suggest any general possible strategies that might be useful to limit the number of rows we look at or copy across each night? If we only touched the rows that we needed to each night, we would be touching less than 1% of the data we do at the moment...
Before you try the following suggestion, just one thing to check is that the Target_PeopleTable has an index or primary key on the id column. It probably does but without schema information to verify I am making no assumptions and this might speed up the merge stage.
As you've identified if you could somehow limit the records in TempTable to just the changed rows then this could offer a performance win for the actual MERGE statement (depending on how expensive determining just the changed rows is).
As a general strategy I would consider some kind of checksum to identify only the changed records. The T-SQL CHECKSUM function can calculate a checksum across the required columns, passed to it as a comma-separated list; BINARY_CHECKSUM is a related function (not a column type), and a computed column can hold the result.
Since you cannot change the source schema you would have to maintain a list of record ids and associated checksums in your target database so that you can readily compare the source checksums to the target checksums from the last run in order to identify a difference.
You can then only insert into the Temp table where there is a checksum difference between the target and source or the id does not exist in the target db.
This might just be moving the performance problem to the temp insert part but I think it's worth a try.
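To make the comparison step concrete, here is a rough sketch in Python. It swaps in SHA-256 from the standard hashlib module rather than T-SQL's CHECKSUM, and `row_hash`, `changed_ids` and the sample rows are invented for illustration. The dictionary of stored hashes stands in for the id/checksum list you would keep in the target database.

```python
import hashlib

def row_hash(values):
    """Hash the tracked column values of one source row."""
    # A unit separator keeps ('ab', 'c') from colliding with ('a', 'bc').
    # Note: NULL and empty string hash alike here; distinguish them if it matters.
    joined = "\x1f".join("" if v is None else str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def changed_ids(source_rows, stored_hashes):
    """Return ids that are new or whose hash differs from the last run."""
    changed = []
    for row_id, *values in source_rows:
        if stored_hashes.get(row_id) != row_hash(values):
            changed.append(row_id)
    return changed

# Hashes saved after last night's run (made-up data).
last_run = [(1, "Ann", "London"), (2, "Bob", "Leeds")]
stored = {row_id: row_hash(values) for row_id, *values in last_run}

# Tonight: row 2 changed and row 3 is new, so only those get staged.
tonight = [(1, "Ann", "London"), (2, "Bob", "York"), (3, "Cy", "Bath")]
to_stage = changed_ids(tonight, stored)
```

Only the ids in `to_stage` would then be inserted into the temp table, and their stored hashes refreshed after a successful merge.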
Have you considered triggers? I avoid them like the plague, but they really are the solution to some problems.
Put an INSERT/UPDATE [/DELETE?] trigger on your two source tables. Program it so that when rows are added or updated, the trigger logs their IDs in an audit table (which you'll have to create) containing the ID, the type of change (update or insert, and delete if you have to worry about those) and when the change was made. When you run ETL, join this list of “to be merged” items with the source tables. When you’re done, delete the processed rows from the audit table so it’s reset for the next run. (Use the “added on” datetime column to make sure you don’t delete rows that may have been added while you were running ETL.)
There’s lots of details behind proper use and implementation, but overall this idea should do what you need.
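As a small illustration of the audit-table idea, the sketch below uses Python's standard sqlite3 module rather than SQL Server, so the trigger syntax differs from T-SQL. The `clients`, `clients_audit` and trigger names are made up. It logs inserted and updated ids, stages only the logged rows, and clears the log up to a cutoff taken before staging.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE clients_audit (
    id          INTEGER NOT NULL,
    change_type TEXT NOT NULL,
    changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER clients_ai AFTER INSERT ON clients
BEGIN
    INSERT INTO clients_audit (id, change_type) VALUES (NEW.id, 'I');
END;
CREATE TRIGGER clients_au AFTER UPDATE ON clients
BEGIN
    INSERT INTO clients_audit (id, change_type) VALUES (NEW.id, 'U');
END;
INSERT INTO clients VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
DELETE FROM clients_audit;  -- pretend last night's ETL already ran
UPDATE clients SET name = 'Robert' WHERE id = 2;
""")

# Take the cutoff first so rows logged during the ETL run survive the cleanup.
cutoff = conn.execute("SELECT CURRENT_TIMESTAMP").fetchone()[0]

# ETL step: only rows logged up to the cutoff are staged.
to_merge = conn.execute("""
    SELECT DISTINCT c.id, c.name
      FROM clients c
      JOIN clients_audit a ON a.id = c.id
     WHERE a.changed_at <= ?
     ORDER BY c.id
""", (cutoff,)).fetchall()

# Reset the log for the next run, again bounded by the cutoff.
conn.execute("DELETE FROM clients_audit WHERE changed_at <= ?", (cutoff,))
conn.commit()
remaining = conn.execute("SELECT COUNT(*) FROM clients_audit").fetchone()[0]
```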

Column Copy and Update vs. Column Create and Insert

I have a table with 32 Million rows and 31 columns in PostgreSQL 9.2.10. I am altering the table by adding columns with updated values.
For example, if the initial table is:
id initial_color
-- -------------
1 blue
2 red
3 yellow
I am modifying the table so that the result is:
id initial_color modified_color
-- ------------- --------------
1 blue blue_green
2 red red_orange
3 yellow yellow_brown
I have code that will read the initial_color column and update the value.
Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this? My present choices are:
Copy the column and update the rows in the new column
Create an empty column and insert new values
I could do either option with one column at a time or with all five at once. The column types are either character varying or character.
"The column types are either character varying or character."
Don't use character, that's a misunderstanding. varchar is ok, but I would suggest just text for arbitrary character data.
Any downsides of using data type "text" for storing strings?
"Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this?"
If you don't have objects (views, foreign keys, functions) depending on the existing table, the most efficient way is to create a new table. Something like this (details depend on your installation):
BEGIN;
LOCK TABLE tbl_org IN SHARE MODE; -- to prevent concurrent writes
CREATE TABLE tbl_new (LIKE tbl_org INCLUDING STORAGE INCLUDING COMMENTS);
ALTER TABLE tbl_new ADD COLUMN modified_color text
, ADD COLUMN modified_something text;
-- , etc
INSERT INTO tbl_new (<all columns in order here>)
SELECT <all columns in order here>
, myfunction(initial_color) AS modified_color -- etc
FROM tbl_org;
-- ORDER BY tbl_id; -- optionally order rows while being at it.
-- Add constraints and indexes like in the original table here
DROP TABLE tbl_org;
ALTER TABLE tbl_new RENAME TO tbl_org;
COMMIT;
If you have depending objects, you need to do more.
Either way, be sure to add all five at once. If you update each in a separate query, you write another row version each time due to the MVCC model of Postgres.
Related cases with more details, links and explanation:
Updating database rows without locking the table in PostgreSQL 9.2
Best way to populate a new column in a large table?
Optimizing bulk update performance in PostgreSQL
While creating a new table you might also order columns in an optimized fashion:
Calculating and saving space in PostgreSQL
Maybe I'm misreading the question, but as far as I know, you have 2 possibilities for creating a table with the extra columns:
CREATE TABLE
This would create a new table and filling could be done using
CREATE TABLE .. AS SELECT.. for filling with creation or
using a separate INSERT...SELECT... later on
Neither variant seems to be what you want, as you asked for a solution that does not list all the fields.
Also this would require all data (plus the new fields) to be copied.
ALTER TABLE...ADD ...
This creates the new columns. As I'm not aware of any possibility to reference existing column values, you will need an additional UPDATE ..SET... for filling in values.
So, I'm not seeing any way to realize a procedure that follows your choice 1.
Nevertheless, copying the (column) data just to overwrite them in a second step would be suboptimal in any case. Altering a table adding new columns is doing minimal I/O. From this, even if there would be a possibility to execute your choice 1, following choice 2 promises better performance by factors.
Thus, two statements will achieve what you want: one ALTER TABLE adding all your new columns in one go, then an UPDATE providing the new values for these columns.
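The two-step approach can be sketched as follows. It uses Python's standard sqlite3 module instead of PostgreSQL, so each ALTER TABLE adds one column (PostgreSQL accepts several ADD COLUMN clauses in one statement). The table `tbl`, the second column pair and the `modify_color` function are invented stand-ins for the asker's real transformation code. The point is that a single UPDATE fills every new column in one pass.

```python
import sqlite3

# Hypothetical stand-in for the asker's transformation code.
SUFFIX = {"blue": "green", "red": "orange", "yellow": "brown"}

def modify_color(color):
    return f"{color}_{SUFFIX.get(color, 'unknown')}"

conn = sqlite3.connect(":memory:")
conn.create_function("modify_color", 1, modify_color)
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, initial_color TEXT, initial_shape TEXT);
INSERT INTO tbl VALUES (1, 'blue', 'round'), (2, 'red', 'square'), (3, 'yellow', 'flat');
-- Step 1: add the new columns (cheap, they start out NULL).
ALTER TABLE tbl ADD COLUMN modified_color TEXT;
ALTER TABLE tbl ADD COLUMN modified_shape TEXT;
""")

# Step 2: one UPDATE sets all new columns, so each row is rewritten once.
conn.execute("""
    UPDATE tbl
       SET modified_color = modify_color(initial_color),
           modified_shape = upper(initial_shape)
""")
conn.commit()
rows = conn.execute(
    "SELECT id, modified_color, modified_shape FROM tbl ORDER BY id").fetchall()
```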
Create a new column (modified_color); it will be NULL on all records.
Run an update statement, assuming your table name is 'Table':
update table
set modified_color = 'blue_green'
where initial_color = 'blue'
if I am correct this can also work like this
update table set modified_color = 'blue_green' where initial_color = 'blue';
update table set modified_color = 'red_orange' where initial_color = 'red';
update table set modified_color = 'yellow_brown' where initial_color = 'yellow';
Once you have done this you can do another update (assuming you have another column that I will call modified_color1):
update table set modified_color1 = modified_color

Fastest way to modify each row in a table

What's the recommended way of updating a relatively large table (~70 million rows), in order to replace a foreign key column with an id of a different table (indirectly linked by the current key)?
Let's say I have three tables:
Person
Id long,
Group_id long --> foreign key to Group table
Group
Id long
Device_id long --> foreign key to Device table
Device
Id long
I would like to update the Person table to have a direct foreign key to the Device table, i.e.:
Person
Id long,
Device_Id long --> foreign key to Device table
Device
Id long
The query would look something like this:
-- replace Group_id with Device_id
update p from Person p
inner join Group g
on g.Id = p.Group_id
set p.Group_id = g.Device_id
I would first drop the FK constraint, and then rename the column afterwards.
Will this work?
Is there a better way?
Can I speed it up? (while this query is running, everything else will be offline, server is UPS backed-up, so I'd like to skip any transactional updates)
It would work if you wrote the UPDATE properly (assuming this is SQL Server)
update p
set p.Group_id = g.Device_id
from Person p
inner join Group g on g.Id = p.Group_id
Apart from that, it's a really smart move to re-use, then rename the column*. Can't think of any smart way to make this any faster, unless you wish to use a WHILE loop and person.Id markers to break up the updates into batches.
* - ALTER TABLE DROP COLUMN DOES NOT RECLAIM THE SPACE THE COLUMN TOOK
Drop indexes on the table you are updating and recreate after the update is complete.
Drop constraints on the table you are updating and recreate appropriately (you are changing the reference after all) after the update is complete.
Turn off triggers on the table you are updating and enable after the update is complete.
You might want to consider running batches. I personally would create a loop and batch update 10k rows at a time. This seemed to cause the fewest problems on my hardware (running out of disk space, etc). You could order the update and track the PK so you know where you are at. Or create a bit column that is set when a particular record is updated; this method might make it easier overall as you won't need to track the PK at all.
An example of such a loop might look like this:
DECLARE @MinPK BIGINT
DECLARE @MaxPK BIGINT
DECLARE @Rows INT
SET @MinPK=0
SET @MaxPK=0
SET @Rows=1
WHILE @Rows>0
BEGIN
SELECT
@MaxPK=MAX(a.PK)
FROM (
SELECT TOP 3
PK
FROM Table
WHERE PK>@MinPK
ORDER BY PK ASC
) a
--Change this to an update
SELECT
PK
FROM Table
WHERE PK>@MinPK
AND PK<=@MaxPK
-- Capture the row count now: the SET below would reset @@ROWCOUNT to 1
SET @Rows=@@ROWCOUNT
SET @MinPK=@MaxPK
END
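The flag-column variant mentioned above could be sketched like this, again with Python's standard sqlite3 module rather than SQL Server. The `migrated` column, `migrate_in_batches` and the sample rows are made up; the table names follow the question. The flag matters here because Group_id is overwritten in place: without it, a second pass would look up a device id as if it were a group id.

```python
import sqlite3

def migrate_in_batches(conn, batch_size=10000):
    """Replace Person.Group_id with the group's Device_id, batch by batch."""
    batches = 0
    while True:
        cur = conn.execute(
            """
            UPDATE Person
               SET Group_id = (SELECT g.Device_id FROM "Group" g
                                WHERE g.Id = Person.Group_id),
                   migrated = 1
             WHERE Id IN (SELECT Id FROM Person
                           WHERE migrated = 0
                           ORDER BY Id
                           LIMIT ?)
            """, (batch_size,))
        conn.commit()            # keep each transaction small
        if cur.rowcount == 0:    # nothing left to migrate
            return batches
        batches += 1

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Device (Id INTEGER PRIMARY KEY);
CREATE TABLE "Group" (Id INTEGER PRIMARY KEY, Device_id INT);
CREATE TABLE Person (Id INTEGER PRIMARY KEY, Group_id INT,
                     migrated INT NOT NULL DEFAULT 0);
INSERT INTO Device VALUES (100), (200);
INSERT INTO "Group" VALUES (1, 100), (2, 200);
INSERT INTO Person (Id, Group_id) VALUES (1, 1), (2, 2), (3, 1), (4, 2), (5, 1);
""")
batches = migrate_in_batches(conn, batch_size=2)
people = conn.execute("SELECT Id, Group_id FROM Person ORDER BY Id").fetchall()
```

Once every row is flagged, the helper column can be dropped and Group_id renamed, as discussed in the question.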
Your idea won't "work", unless there is only one device per group (which would be ridiculous, so I assume not).
The problem is that you would have to cram many device_id values into one column in the person table - that's why you've got a group table in the first place.

What is the preferred way of saving dynamic lists in database?

In our application user can create different lists (like sharepoint) for example a user can create a list of cars (name, model, brand) and a list of students (name, dob, address, nationality), e.t.c.
Our application should be able to query on different columns of the list so we can't just serialize each row and save it in one row.
Should I create a new table at runtime for each newly created list? If this was the best solution then probably Microsoft SharePoint would have done it as well I suppose?
Should I use the following schema
Lists (Id, Name)
ListColumns (Id, ListId, Name)
ListRows (Id, ListId)
ListData(RowId, ColumnId, Value)
Though a single row will create as many rows in list data table as there are columns in the list, this just doesn't feel right.
Have you dealt with this situation? How did you handle it in database?
What you did is called EAV (Entity-Attribute-Value model).
For a list with 3 columns and 1000 entries:
1 record in Lists
3 records in ListColumns
and 3000 Entries in ListData
This is fine. I'm not a fan of creating tables on-the-fly because it could mess up your database and you would have to "generate" your SQL queries dynamically. I would get a strange feeling when users could CREATE/DROP/ALTER Tables in my database!
Another nice feature of the EAV model is that you could merge two lists easily without dropping and altering a table.
Edit:
I think you need another table called ListRows that tells you which ListData records belong together in a row!
Well, I've experienced something like this before. I don't want to share the actual table schema, so let's do some thought exercises using some of the suggested table structures:
Let's have a lists table containing a list of all my lists
Let's also have a columns table containing the metadata (column names)
Now we need a values table which contains the column values
We also need a rows table which contains a list of all the rows, otherwise it gets very difficult to work out how many rows there actually are
To keep things simple, let's just make everything a string (VARCHAR) and have a go at coming up with some queries:
Counting all the rows in a table
SELECT COUNT(*) FROM [rows]
JOIN [lists]
ON [rows].list_id = [Lists].id
WHERE [Lists].name = 'Cars'
Hmm, not too bad, compared to:
SELECT COUNT(*) FROM [Cars]
Inserting a row into a table
BEGIN TRANSACTION
DECLARE @row_id INT
DECLARE @list_id INT
SELECT @list_id = id FROM [lists] WHERE name = 'Cars'
INSERT INTO [rows] (list_id) VALUES (@list_id)
SELECT @row_id = @@IDENTITY
DECLARE @column_id INT
-- === Need one of these for each column ===
SELECT @column_id = id FROM [columns]
WHERE name = 'Make'
AND list_id = @list_id
INSERT INTO [values] (column_id, row_id, value)
VALUES (@column_id, @row_id, 'Rover')
-- === Need one of these for each column ===
SELECT @column_id = id FROM [columns]
WHERE name = 'Model'
AND list_id = @list_id
INSERT INTO [values] (column_id, row_id, value)
VALUES (@column_id, @row_id, 'Metro')
COMMIT TRANSACTION
Um, starting to get a little bit hairy compared to:
INSERT INTO [Cars] ([Make], [Model]) VALUES ('Rover', 'Metro')
Simple queries
I'm now getting bored of constructing tediously complex SQL statements, so maybe you can have a go at coming up with equivalent queries for the following statements:
SELECT [Model] FROM [Cars] WHERE [Make] = 'Rover'
SELECT [Cars].[Make], [Cars].[Model], [Owners].[Name] FROM [Cars]
JOIN [Owners] ON [Owners].id = [Cars].owner_id
WHERE [Owners].Age > 50
SELECT [Cars].[Make], [Cars].[Model], [Owners].[Name] FROM [Cars]
JOIN [Owners] ON [Owners].id = [Cars].owner_id
JOIN [Addresses] ON [Addresses].id = [Owners].address_id
WHERE [Addresses].City = 'London'
I hope you are beginning to get the idea...
In short - I've experienced this before and I can assure you that creating a database inside a database in this way is definitely a Bad Thing.
If you need to do anything but the most basic querying on these lists (and literally I mean "Can I have all the items in this list please?"), you should try and find an alternative.
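For a sense of scale, here is what the first of those exercise queries might look like against the lists/columns/rows/values layout. This is a sketch using Python's standard sqlite3 module; the tables are renamed list_columns, list_rows and list_values to avoid reserved words, and `models_for_make` and the sample data are made up. Each attribute that is filtered on or returned costs two extra joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lists (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE list_columns (id INTEGER PRIMARY KEY, list_id INT NOT NULL, name TEXT NOT NULL);
CREATE TABLE list_rows (id INTEGER PRIMARY KEY, list_id INT NOT NULL);
CREATE TABLE list_values (row_id INT NOT NULL, column_id INT NOT NULL, value TEXT);
INSERT INTO lists VALUES (1, 'Cars');
INSERT INTO list_columns VALUES (1, 1, 'Make'), (2, 1, 'Model');
INSERT INTO list_rows VALUES (1, 1), (2, 1), (3, 1);
INSERT INTO list_values VALUES
    (1, 1, 'Rover'), (1, 2, 'Metro'),
    (2, 1, 'Rover'), (2, 2, '200'),
    (3, 1, 'Ford'),  (3, 2, 'Fiesta');
""")

def models_for_make(conn, list_name, make):
    """EAV equivalent of: SELECT Model FROM <list> WHERE Make = ?"""
    rows = conn.execute("""
        SELECT md.value
          FROM list_rows r
          JOIN lists l         ON l.id = r.list_id
          JOIN list_columns ck ON ck.list_id = l.id AND ck.name = 'Make'
          JOIN list_values mk  ON mk.row_id = r.id AND mk.column_id = ck.id
          JOIN list_columns cd ON cd.list_id = l.id AND cd.name = 'Model'
          JOIN list_values md  ON md.row_id = r.id AND md.column_id = cd.id
         WHERE l.name = ? AND mk.value = ?
         ORDER BY md.value
    """, (list_name, make)).fetchall()
    return [value for (value,) in rows]

rover_models = models_for_make(conn, "Cars", "Rover")
```

The Owners and Addresses joins from the other two exercises would each add another such pair of joins per attribute.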
As long as each user pretty much has their own database I'll definitely recommend the CREATE TABLE approach. Even if they don't I'd still recommend that you at least consider it.
Perhaps creating a list could simply issue a CREATE TABLE statement for that entity/list?
It sounds like the db structure or schema can change at runtime, or at the user's command, so perhaps something like this might help?
User wants to create a new list of an entity never seen before. Call it Computer.
User defines the attributes (screensize, CpuSpeed, AmountRAM, NumberOfCores)
System allows user to create in the UI
System generally lets them all be strings, unless it can tell that all supplied values are dates or numbers.
build the CREATE scripts, execute them against the DB.
insert the data that the user defined into that new table.
Properly coded, we're working with the requirements given: let users create new entities. There was no mention of scale here. Of course, this requires all input to be sanitized, queries parameterized, actions logged, etc.
The negative comment below doesn't actually give any good reasons, but creates a bit of FUD. I'd be interested in addressing any concerns with this potential solution. We haven't heard about scale, security, performance, or usage (internal LAN vs. internet).
You should absolutely not dynamically create tables when your users create lists. That isn't how databases are meant to work.
Your schema is correct, and the pluralization is, in my opinion, also correct, though I would remove the camel case and call them lists, list_columns, list_rows and list_data.
I would further improve upon your schema by skipping rows and columns tables, they serve no purpose. Simply have a row/column number attached to each cell, and keep things sparse: Don't bother holding empty cells in the database. You retain the ability to query/sort based on row/column, your queries will be (potentially very much) faster because the number of list_cells will be reduced, and you won't have to do any crazy joining to link your data back to its table.
Here is the complete schema:
create table lists (
id int primary key,
name varchar(25) not null
);
create table list_cells (
id int primary key,
list_id int not null references lists(id)
on delete cascade on update cascade,
row int not null,
col int not null,
data varchar(25) not null
);
It sounds like you might have Sharepoint already deployed in your environment.
Consider integrating your application with Sharepoint, and have it be your datastore. No need to recreate all the things you like about Sharepoint, when you could leverage it.
It'd take a bit of configuring, but you could call SP web services to CRUD your list data for you.
inserting list data into Sharepoint via web services
reading SP lists via web services
Sharepoint 2010 can also expose lists via OData, which would be simple to consume from any application.

Linked List in SQL

What's the best way to store a linked list in a MySQL database so that inserts are simple (i.e. you don't have to re-index a bunch of stuff every time) and such that the list can easily be pulled out in order?
Using Adrian's solution, but instead of incrementing by 1, increment by 10 or even 100. Then an insertion can take the position halfway between its two neighbours without having to update everything below the insertion. Pick a number large enough to handle your average number of insertions; if it's too small, you'll have to fall back to updating all rows with a higher position during an insertion.
Create a table with two self-referencing columns, PreviousID and NextID. If the item is the first thing in the list, PreviousID will be null; if it is the last, NextID will be null. The SQL will look something like this:
create table tblDummy
(
PKColumn int not null,
PreviousID int null,
DataColumn1 varchar(50) not null,
DataColumn2 varchar(50) not null,
DataColumn3 varchar(50) not null,
DataColumn4 varchar(50) not null,
DataColumn5 varchar(50) not null,
DataColumn6 varchar(50) not null,
DataColumn7 varchar(50) not null,
NextID int null
)
Store an integer column in your table called 'position'. Record a 0 for the first item in your list, a 1 for the second item, etc. Index that column in your database, and when you want to pull your values out, sort by that column.
alter table linked_list add column position integer not null default 0;
alter table linked_list add index position_index (position);
select * from linked_list order by position;
To insert a value at index 3, modify the positions of rows 3 and above, and then insert:
update linked_list set position = position + 1 where position >= 3;
insert into linked_list (my_value, position) values ("new value", 3);
A linked list can be stored using recursive pointers in the table. This is very much the same way hierarchies are stored in SQL, using the recursive association pattern.
You can learn more about it here (Wayback Machine link).
I hope this helps.
The simplest option would be creating a table with a row per list item, a column for the item position, and columns for other data in the item. Then you can use ORDER BY on the position column to retrieve in the desired order.
create table linked_list
( list_id integer not null
, position integer not null
, data varchar(100) not null
);
alter table linked_list add primary key ( list_id, position );
To manipulate the list just update the position and then insert/delete records as needed. So to insert an item into list 1 at index 3:
begin transaction;
update linked_list set position = position + 1 where position >= 3 and list_id = 1;
insert into linked_list (list_id, position, data)
values (1, 3, "some data");
commit;
Since operations on the list can require multiple commands (eg an insert will require an INSERT and an UPDATE), ensure you always perform the commands within a transaction.
A variation of this simple option is to have position incrementing by some factor for each item, say 100, so that when you perform an INSERT you don't always need to renumber the position of the following elements. However, this requires a little more effort to work out when to increment the following elements, so you lose simplicity but gain performance if you will have many inserts.
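The gapped-position variation could be sketched like this, using Python's standard sqlite3 module instead of MySQL. `GAP`, `insert_after` and the sample data are invented, and the sketch leaves out the (list_id, position) primary key so that the fallback renumbering cannot trip over a uniqueness check halfway through. An insert normally takes the midpoint between its neighbours and only shifts the tail when no free position is left.

```python
import sqlite3

GAP = 100  # spacing between freshly numbered positions (arbitrary choice)

def insert_after(conn, list_id, after_pos, data):
    """Insert data right after the item at after_pos; return the new position."""
    (next_pos,) = conn.execute(
        "SELECT MIN(position) FROM linked_list WHERE list_id = ? AND position > ?",
        (list_id, after_pos)).fetchone()
    with conn:  # one transaction for the optional shift plus the insert
        if next_pos is None:
            new_pos = after_pos + GAP          # appending at the end
        else:
            if next_pos - after_pos < 2:
                # Gap used up: fall back to shifting everything that follows.
                conn.execute(
                    "UPDATE linked_list SET position = position + ? "
                    "WHERE list_id = ? AND position > ?",
                    (GAP, list_id, after_pos))
                next_pos += GAP
            new_pos = (after_pos + next_pos) // 2
        conn.execute(
            "INSERT INTO linked_list (list_id, position, data) VALUES (?, ?, ?)",
            (list_id, new_pos, data))
    return new_pos

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE linked_list (list_id INTEGER NOT NULL, "
             "position INTEGER NOT NULL, data TEXT NOT NULL)")
conn.executemany("INSERT INTO linked_list VALUES (1, ?, ?)",
                 [(100, "a"), (200, "b")])
first = insert_after(conn, 1, 100, "x")    # lands between 100 and 200
second = insert_after(conn, 1, 100, "y")   # lands between 100 and the new row
ordered = [d for (d,) in conn.execute(
    "SELECT data FROM linked_list WHERE list_id = 1 ORDER BY position")]
```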
Depending on your requirements other options might appeal, such as:
If you want to perform lots of manipulations on the list and not many retrievals, you may prefer to have an ID column pointing to the next item in the list, instead of using a position column. Then you need iterative logic in the retrieval of the list in order to get the items in order. This can be relatively easily implemented in a stored proc.
If you have many lists, a quick way to serialise and deserialise your list to text/binary, and you only ever want to store and retrieve the entire list, then store the entire list as a single value in a single column. Probably not what you're asking for here though.
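For the next-item pointer option, the iterative retrieval might be sketched as below. It uses Python's standard sqlite3 module and a recursive common table expression in place of a stored procedure, which is a swapped-in technique rather than what the answer describes. The `items` table, `read_list` and `insert_after` are made-up names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (
    id      INTEGER PRIMARY KEY,
    data    TEXT NOT NULL,
    next_id INT REFERENCES items(id)   -- NULL marks the end of the list
);
INSERT INTO items VALUES (1, 'first', 3), (2, 'third', NULL), (3, 'second', 2);
""")

def read_list(conn, head_id):
    """Follow next_id pointers from head_id and return the data in list order."""
    rows = conn.execute("""
        WITH RECURSIVE chain(id, data, next_id, depth) AS (
            SELECT id, data, next_id, 0 FROM items WHERE id = ?
            UNION ALL
            SELECT i.id, i.data, i.next_id, c.depth + 1
              FROM items i JOIN chain c ON i.id = c.next_id
        )
        SELECT data FROM chain ORDER BY depth
    """, (head_id,)).fetchall()
    return [data for (data,) in rows]

def insert_after(conn, prev_id, data):
    """Splice a new item in after prev_id: one INSERT plus one UPDATE."""
    with conn:
        (old_next,) = conn.execute(
            "SELECT next_id FROM items WHERE id = ?", (prev_id,)).fetchone()
        cur = conn.execute(
            "INSERT INTO items (data, next_id) VALUES (?, ?)", (data, old_next))
        conn.execute(
            "UPDATE items SET next_id = ? WHERE id = ?", (cur.lastrowid, prev_id))
    return cur.lastrowid

before = read_list(conn, 1)
insert_after(conn, 1, "between")
after = read_list(conn, 1)
```

Inserts touch only two rows regardless of list length, while every read has to walk the chain, which matches the trade-off described for this option.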
This is something I've been trying to figure out for a while myself. The best way I've found so far is to create a single table for the linked list using the following format (this is pseudo code):
LinkedList(
key1,
information,
key2
)
key1 is the starting point. Key2 is a foreign key linking to itself in the next column. So your columns will link something like this:
col1
key1 = 0,
information= 'hello'
key2 = 1
Key1 is primary key of col1. key2 is a foreign key leading to the key1 of col2
col2
key1 = 1,
information= 'wassup'
key2 = null
key2 from col2 is set to null because it doesn't point to anything
When you first enter a column in for the table, you'll need to make sure key2 is set to null or you'll get an error. After you enter the second column, you can go back and set key2 of the first column to the primary key of the second column.
This makes it best to enter many entries at a time, then go back and set the foreign keys accordingly (or build a GUI that does that for you).
Here's some actual code I've prepared (all actual code worked on MSSQL. You may want to do some research for the version of SQL you are using!):
createtable.sql
create table linkedlist00 (
key1 int primary key not null identity(1,1),
info varchar(10),
key2 int
)
register_foreign_key.sql
alter table dbo.linkedlist00
add foreign key (key2) references dbo.linkedlist00(key1)
*I put them into two separate files, because it has to be done in two steps. MSSQL won't let you do it in one step, because the table doesn't exist yet for the foreign key to reference.
Linked List is especially powerful in one-to-many relationships. So if you've ever wanted to make an array of foreign keys? Well this is one way to do it! You can make a primary table that points to the first column in the linked-list table, and then instead of the "information" field, you can use a foreign key to the desired information table.
Example:
Let's say you have a Bureaucracy that keeps forms.
Let's say they have a table called file cabinet
FileCabinet(
Cabinet ID (pk)
Files ID (fk)
)
each column contains a primary key for the cabinet and a foreign key for the files. These files could be tax forms, health insurance papers, field trip permissions slips etc
Files(
Files ID (pk)
File ID (fk)
Next File ID (fk)
)
this serves as a container for the Files
File(
File ID (pk)
Information on the file
)
this is the specific file
There may be better ways to do this and there are, depending on your specific needs. The example just illustrates possible usage.
There are a few approaches I can think of right off, each with differing levels of complexity and flexibility. I'm assuming your goal is to preserve an order in retrieval, rather than requiring storage as an actual linked list.
The simplest method would be to assign an ordinal value to each record in the table (e.g. 1, 2, 3, ...). Then, when you retrieve the records, specify an order-by on the ordinal column to get them back in order.
This approach also allows you to retrieve the records without regard to membership in a list, but allows for membership in only one list, and may require an additional "list id" column to indicate to which list the record belongs.
A slightly more elaborate, but also more flexible, approach would be to store information about membership in a list or lists in a separate table. The table would need 3 columns: the list id, the ordinal value, and a foreign key pointer to the data record. Under this approach, the underlying data knows nothing about its membership in lists, and can easily be included in multiple lists.
This post is old, but I'm still going to give my $.02. Updating every record in a table or record set just to solve ordering sounds crazy, and so does the amount of indexing, but it sounds like most have accepted it.
The crazy solution I came up with to reduce updates and indexing is to create two tables (in most use cases you don't sort all records in just one table anyway): table A holds the records of the list being sorted, and table B groups them and holds the order as a string. The order string represents an array that can be used to order the selected records on either the web server or the browser layer of a web application.
Create Table A (
Id int primary key identity(1,1),
Data varchar(10) not null,
B_Id int
)
Create Table B (
Id int primary key Identity(1,1),
GroupName varchar(10) not null,
[Order] varchar(max) null
)
The format of the order string should be id, position and some separator to split() your string by. In the case of jQuery UI, the .sortable('serialize') function outputs a POST-friendly order string for you that includes the id and position of each record in the list.
The real magic is the way you choose to reorder the selected list using the saved ordering string. this will depend on the application you are building. here is an example again from jQuery to reorder the list of items: http://ovisdevelopment.com/oramincite/?p=155
https://dba.stackexchange.com/questions/46238/linked-list-in-sql-and-trees suggests a trick of using floating-point position column for fast inserts and ordering.
It also mentions specialized SQL Server 2014 hierarchyid feature.
I think it's much simpler to add a created column of Datetime type and a position column of int. You can then have duplicate positions: in the select statement, order by position, created desc, and your list will be fetched in order.
Increment the SERIAL 'index' by 100, but manually add intermediate values with an 'index' equal to (Prev + Next) / 2. If you ever saturate the 100 rows, reorder the index back to 100s.
This should maintain sequence with primary index.
A list can be stored by having a column contain the offset (list index position) -- an insert in the middle is then incrementing all above the new parent and then doing an insert.
You could implement it like a double-ended queue (deque) to support fast push/pop/delete (if the ordinal is known) and retrieval. You would have two data structures: one with the actual data and another with the number of elements added over the history of the key. Tradeoff: this method would be slower, O(n), for any insert into the middle of the linked list.
create table queue (
primary_key,
queue_key,
ordinal,
data
)
You would have an index on queue_key+ordinal
You would also have another table which stores the number of rows EVER added to the queue...
create table queue_addcount (
primary_key,
add_count
)
When pushing a new item to either end of the queue (left or right) you would always increment the add_count.
If you push to the back you could set the ordinal...
ordinal = add_count + 1
If you push to the front you could set the ordinal...
ordinal = -(add_count + 1)
update
add_count = add_count + 1
This way you can delete anywhere in the queue/list and it would still return in order and you could also continue to push new items maintaining the order.
You could optionally rewrite the ordinal to avoid overflow if a lot of deletes have occurred.
You could also have an index on the ordinal to support fast ordered retrieval of the list.
If you want to support inserts into the middle, you would need to find the ordinal at which the item needs to be inserted, insert it with that ordinal, and then increment every ordinal following that insertion point by one. Also, increment the add_count as usual. If the ordinal is negative, you could instead decrement all of the earlier ordinals to do fewer updates. This would be O(n).
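The push-to-either-end scheme can be sketched as follows, using Python's standard sqlite3 module. Column types, the choice to key queue_addcount by queue_key, and the `push` and `read_queue` helpers are assumptions made for the example, not part of the description above. Back pushes get ordinal add_count + 1, front pushes get -(add_count + 1), and add_count only ever grows, so deletes never disturb the order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE queue (
    primary_key INTEGER PRIMARY KEY,
    queue_key   TEXT NOT NULL,
    ordinal     INTEGER NOT NULL,
    data        TEXT NOT NULL
);
CREATE INDEX queue_order ON queue (queue_key, ordinal);
CREATE TABLE queue_addcount (
    queue_key TEXT PRIMARY KEY,
    add_count INTEGER NOT NULL   -- rows EVER added, never decremented
);
""")

def push(conn, queue_key, data, front=False):
    """Push data to the back (default) or front of the queue; return its ordinal."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO queue_addcount VALUES (?, 0)", (queue_key,))
        (count,) = conn.execute(
            "SELECT add_count FROM queue_addcount WHERE queue_key = ?",
            (queue_key,)).fetchone()
        ordinal = -(count + 1) if front else count + 1
        conn.execute(
            "INSERT INTO queue (queue_key, ordinal, data) VALUES (?, ?, ?)",
            (queue_key, ordinal, data))
        conn.execute(
            "UPDATE queue_addcount SET add_count = add_count + 1 WHERE queue_key = ?",
            (queue_key,))
    return ordinal

def read_queue(conn, queue_key):
    return [d for (d,) in conn.execute(
        "SELECT data FROM queue WHERE queue_key = ? ORDER BY ordinal",
        (queue_key,))]

push(conn, "q", "b")               # ordinal 1
push(conn, "q", "c")               # ordinal 2
push(conn, "q", "a", front=True)   # ordinal -3
push(conn, "q", "d")               # ordinal 4
full = read_queue(conn, "q")
conn.execute("DELETE FROM queue WHERE queue_key = 'q' AND data = 'c'")
conn.commit()
after_delete = read_queue(conn, "q")
```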