SQL Full Text Index on multiple tables and columns

We have electronic forms that filers fill out online, and we store the data in SQL Server. We want to provide a search feature that lets us search inside each electronic filing for matching keywords. We don't need to know which word matched or where in the form it matched; we just need a ranked list of forms that match our keywords. We think SQL Full-Text Search would be our best option because we are already using SQL Server 2016. We have just started implementing a solution, but would like some guidance since this is new territory for us.
Here is an example of how our tables are structured.
Filing is our top-level table for all electronic forms. We have sub-tables that are all related through the FilingId. The Form Six Published Filings table has child tables to store information like Assets. The Form One Published Filings table has child tables to store information like Liabilities.
CREATE SCHEMA [Forms]
GO
CREATE SCHEMA [Form6]
GO
CREATE SCHEMA [Form1]
GO
CREATE TABLE [Forms].[Filing](
    [FilingId] INT NOT NULL IDENTITY(1,1)
        CONSTRAINT [PK_Forms_Filing_FilingId] PRIMARY KEY CLUSTERED,
    [FilerUserId] [int] NOT NULL,
    [FormYear] [int] NOT NULL,
    [FormTypeId] [int] NOT NULL,
    [FilingStatusId] [int] NOT NULL,
    [FilerSignatureId] INT NULL,
    [SubmissionDate] DATETIME2(0) NULL,
    [IsScannedForm] BIT NOT NULL
        CONSTRAINT [DF_Forms_Filing_IsScannedForm] DEFAULT(0)
)
GO
CREATE TABLE [Form6].[FormSixPublishedFilings](
    [FormSixPublishedFilingId] INT NOT NULL IDENTITY(1,1)
        CONSTRAINT [PK_Form6_FormSixPublishedFilings_FormSixPublishedFilingId] PRIMARY KEY CLUSTERED,
    [FilingId] INT NOT NULL
        CONSTRAINT [FK_Form6_FormSixPublishedFilings_Filings] FOREIGN KEY ([FilingId]) REFERENCES [Forms].[Filing] ([FilingId]),
    [LastDateOfEmployment] DATE NULL,
    [NetWorthDate] DATE NULL,
    [NetWorth] MONEY NULL
)
GO
CREATE TABLE [Form6].[FormSixPublishedAssets](
    [FormSixPublishedAssetId] INT NOT NULL IDENTITY(1,1)
        CONSTRAINT [PK_Form6_FormSixPublishedAssets_FormSixPublishedAssetId] PRIMARY KEY CLUSTERED,
    [FormSixPublishedFilingId] INT NOT NULL
        CONSTRAINT [FK_Form6_FormSixPublishedAssets_FormSixPublishedFilings] FOREIGN KEY ([FormSixPublishedFilingId]) REFERENCES [Form6].[FormSixPublishedFilings] ([FormSixPublishedFilingId]),
    [Name] VARCHAR(8000) NOT NULL,
    [Amount] MONEY NOT NULL
)
GO
CREATE TABLE [Form1].[FormOnePublishedFilings]
(
    [FormOnePublishedFilingId] INT NOT NULL IDENTITY(1,1)
        CONSTRAINT [PK_Form1_FormOnePublishedFilings_FormOnePublishedFilingId] PRIMARY KEY CLUSTERED,
    [FilingId] INT NOT NULL,
    CONSTRAINT [FK_Form1_FormOnePublishedFilings_Filing] FOREIGN KEY ([FilingId]) REFERENCES [Forms].[Filing] ([FilingId]),
    [HasServedAsAgent] BIT NULL,
    [LastDateOfEmployment] DATE NULL,
    [AmendmentReason] VARCHAR(1024) NULL
)
GO
CREATE TABLE [Form1].[FormOnePublishedLiabilities]
(
    [FormOnePublishedLiabilityId] INT NOT NULL IDENTITY(1,1)
        CONSTRAINT [PK_Form1_FormOnePublishedLiabilities_FormOnePublishedLiabilityId] PRIMARY KEY CLUSTERED,
    [FormOnePublishedFilingId] INT NOT NULL,
    CONSTRAINT [FK_Form1_FormOnePublishedLiabilities_FormOnePublishedFilings] FOREIGN KEY ([FormOnePublishedFilingId]) REFERENCES [Form1].[FormOnePublishedFilings] ([FormOnePublishedFilingId]),
    [NameOfCreditor] VARCHAR(8000) NOT NULL,
    [AddressOfCreditor] VARCHAR(8000) NOT NULL
)
GO
In order to be able to search through all the forms, I think we need to create a view that has just two columns: one for the FilingId, and an XML column that holds an XML representation of all the data in each electronic filing. This XML column is what we will use to set up our full-text index. I think we will use FREETEXTTABLE because we would like the results ranked, and the search terms will be entered by end users.
create view [Forms].[ViewForFullTextSearching] with schemabinding as
select f.FilingId,
    (select
        filing.FilingId
        ,filing.FormYear
        ,filing.FormTypeId
        ,filing.FilingStatusId
        ,filing.FilerSignatureId
        ,filing.SubmissionDate
        ,filing.IsScannedForm
        ,form6Filing.LastDateOfEmployment 'Form6LastDateOfEmployment'
        ,form6Filing.NetWorthDate
        ,form6Filing.NetWorth
        ,form6Asset.Name
        ,form6Asset.Amount
        ,form1Filing.HasServedAsAgent
        ,form1Filing.LastDateOfEmployment 'Form1LastDateOfEmployment'
        ,form1Filing.AmendmentReason
        ,form1Liability.NameOfCreditor
        ,form1Liability.AddressOfCreditor
    from Forms.Filing filing
        left join Form6.FormSixPublishedFilings form6Filing on filing.FilingId = form6Filing.FilingId
        left join Form6.FormSixPublishedAssets form6Asset on form6Filing.FormSixPublishedFilingId = form6Asset.FormSixPublishedFilingId
        left join Form1.FormOnePublishedFilings form1Filing on filing.FilingId = form1Filing.FilingId
        left join Form1.FormOnePublishedLiabilities form1Liability on form1Liability.FormOnePublishedFilingId = form1Filing.FormOnePublishedFilingId
    where filing.FilingId = f.FilingId
    for xml auto, type
    ) as 'Filing'
from Forms.Filing f
GO
create unique clustered index [IX_ViewForFullTextSearching_FilingId] ON [Forms].[ViewForFullTextSearching] ([FilingId])
GO
The above SQL does not actually work because I get this error.
Cannot create an index on view "EthicsFdms.Forms.ViewForFullTextSearching" because it contains one or more subqueries. Consider changing the view to use only joins instead of subqueries. Alternatively, consider not indexing this view.
So, I’m a bit lost on how to create a view with XML to search over if I’m not allowed to create a materialized view that has subqueries.
Next we set up our full-text catalog and index on this view:
CREATE FULLTEXT CATALOG [FtcFilings];
GO
CREATE FULLTEXT INDEX ON [Forms].[ViewForFullTextSearching] ([Filing] language 1033) key index [IX_ViewForFullTextSearching_FilingId] on [FtcFilings];
GO
Then I was hoping we could search the filings like so:
select ftt.*
from [Forms].[Filing] filing
inner join freetexttable(Forms.ViewForFullTextSearching, Filing, 'APPLE') as ftt on filing.FilingId = ftt.[KEY]
order by ftt.[RANK] desc
Right now my challenges are: is it even possible to create a materialized view like this? It seems I can't, because indexed views can't contain subqueries, and I don't know how to build the XML column without subqueries.
If I can't create a materialized view, how else can I create a full-text index that can search the electronic forms?

You can create an indexed view (which is a synchronous materialized view in SQL Server) only if there is a mathematical surjection and all scalar computations are deterministic and precise. Also, OUTER JOINs, subqueries and set operators (UNION, EXCEPT, INTERSECT) cannot be used...
The best way to design your system is to do it the reverse way:
Create a persisted computed column using the CONCAT function over all the columns you want to full-text index.
Create full-text indexes on the computed columns.
Create a UDF that searches the full-text index on each table, concatenates the results with UNION, and then aggregates the results to compute the rank.
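A rough sketch of those three steps, applied to the liabilities table (the SearchText column, the FtcForms catalog, and the SearchFilings function are illustrative names, not part of the original schema):
-- Step 1: persisted computed column concatenating the searchable text
ALTER TABLE [Form1].[FormOnePublishedLiabilities]
    ADD [SearchText] AS CONCAT([NameOfCreditor], ' ', [AddressOfCreditor]) PERSISTED;
GO
-- Step 2: full-text index on the computed column,
-- keyed on the table's primary key index
CREATE FULLTEXT CATALOG [FtcForms];
GO
CREATE FULLTEXT INDEX ON [Form1].[FormOnePublishedLiabilities] ([SearchText] LANGUAGE 1033)
    KEY INDEX [PK_Form1_FormOnePublishedLiabilities_FormOnePublishedLiabilityId] ON [FtcForms];
GO
-- Step 3: a function that searches each indexed table, maps the hits back to
-- the FilingId, and aggregates the ranks
CREATE FUNCTION [Forms].[SearchFilings] (@terms nvarchar(4000))
RETURNS TABLE
AS RETURN
(
    SELECT f.FilingId, SUM(hits.[RANK]) AS TotalRank
    FROM (
        SELECT l.FormOnePublishedFilingId AS FormOneFilingId, ftt.[RANK]
        FROM FREETEXTTABLE([Form1].[FormOnePublishedLiabilities], [SearchText], @terms) ftt
        JOIN [Form1].[FormOnePublishedLiabilities] l
            ON l.FormOnePublishedLiabilityId = ftt.[KEY]
        -- UNION ALL ... one SELECT like the above per full-text indexed table
    ) hits
    JOIN [Form1].[FormOnePublishedFilings] f1
        ON f1.FormOnePublishedFilingId = hits.FormOneFilingId
    JOIN [Forms].[Filing] f
        ON f.FilingId = f1.FilingId
    GROUP BY f.FilingId
);
GO
A caller could then select from [Forms].[SearchFilings](N'APPLE') and order by TotalRank descending, much like the FREETEXTTABLE query against the view.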
Let me know if you want more assistance to do so...

If this form-filing data is seldom changed once created, and it makes business sense to store the form1 and form6 data together with its Filing, you may consider going with a document-oriented design.
SQL Server has good JSON support now. You can save all the Filing and form info as JSON, run full-text search against it, and create views to simulate your current design if needed.
Here is an example -
create table tst.form (
    form_id int not null identity primary key
    ,content_json nvarchar(max)
)
-- inside content_json, the json may look like -
{
    "filler_user_id": 111,
    "filler_type_id": 1,
    "is_scanned_form": 1,
    "form1": [
        {
            "form1_filling_id": 101,
            "has_served_as_agent": 0,
            "liabilities": [{"name_of_creditor": "abc"}]
        }
    ]
}
I only modelled form1 related info. You can add form6 related info as needed.
Then you can do full text search against this content_json column.
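For instance, the full-text setup against content_json could look like this (a sketch: the catalog name is made up, and it assumes the primary key constraint on tst.form was created with an explicit name such as pk_form, since CREATE FULLTEXT INDEX needs a named unique key index):
create fulltext catalog ftc_form;
GO
create fulltext index on tst.form (content_json language 1033)
    key index pk_form on ftc_form;
GO
-- ranked search across everything stored in the JSON document
select f.form_id, ftt.[RANK]
from freetexttable(tst.form, content_json, 'abc') ftt
    join tst.form f on f.form_id = ftt.[KEY]
order by ftt.[RANK] desc;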
Then create views to simulate your current design if needed -
create or alter view tst.form_base WITH SCHEMABINDING as
select form_id
    ,convert(int, JSON_VALUE(content_json, '$.filler_user_id')) filler_user_id
    ,convert(int, JSON_VALUE(content_json, '$.filler_type_id')) filler_type_id
    ,convert(bit, JSON_VALUE(content_json, '$.is_scanned_form')) is_scanned_form
    ,JSON_QUERY(content_json, '$.form1') form1_json
from tst.form
GO
create unique clustered index idx_form_base_form_id on tst.form_base(form_id);
-- you can create more indexes as needed
create index idx_form_base_filler_user_id on tst.form_base(filler_user_id);
GO
create or alter view tst.form1 as
select form_id
    ,a.form1_filling_id
    ,a.has_served_as_agent
    ,a.liabilities liabilities_json
from tst.form_base
    cross apply OPENJSON(form1_json) WITH (
        form1_filling_id int '$.form1_filling_id',
        has_served_as_agent int '$.has_served_as_agent',
        liabilities nvarchar(max) '$.liabilities' as json) a
GO
create or alter view tst.form1_liabilities as
select form_id
    ,form1_filling_id
    ,a.name_of_creditor
from tst.form1
    cross apply OPENJSON(liabilities_json) WITH (
        name_of_creditor nvarchar(max) '$.name_of_creditor') a
GO
Then create some test data -
insert into tst.form (content_json) values ('{
"filler_user_id": 111,
"filler_type_id": 1,
"is_scanned_form": 1,
"form1": [
{
"form1_filling_id": 101,
"has_served_as_agent":0,
"liabilities": [{"name_of_creditor": "abc"}]
}
]
}');
insert into tst.form (content_json) values ('{
"filler_user_id": 222,
"filler_type_id": 1,
"is_scanned_form": 0,
"form1": [
{
"form1_filling_id": 102,
"has_served_as_agent":1,
"liabilities": [{"name_of_creditor": "def"}]
}
]
}');
Try it -
select *
from tst.form1_liabilities
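Given the two test rows inserted above, that query should return:
form_id  form1_filling_id  name_of_creditor
1        101               abc
2        102               def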

Related

Get value of PRIMARY KEY during SELECT in ORACLE

For a specific task I need to store the identity of a row in a table to access it later. Most of these tables do NOT have a numeric ID, and the primary key sometimes consists of multiple fields, VARCHAR & INT combined.
Background info:
The participating tables have a trigger storing delete, update and insert events in a general 'sync' table (Oracle v11). Every 15 minutes a script is then launched to update the corresponding tables in a remote database (SQL Server 2012).
One solution I came up with was to use multiple columns in this 'sync' table: 3 INT columns and 3 VARCHAR columns. A table whose key has 2 VARCHAR columns would then use 2 of the VARCHAR columns in this 'sync' table.
A better/nicer solution would be to 'select' the value of the primary key and store that in this table.
Example:
CREATE TABLE [dbo].[Workers](
[company] [nvarchar](50) NOT NULL,
[number] [int] NOT NULL,
[name] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Workers] PRIMARY KEY CLUSTERED ( [company] ASC, [number] ASC )
)
-- Fails:
SELECT [PK_Workers], [name] FROM [dbo].[Workers]
UPDATE [dbo].[Workers] SET [name]='new name' WHERE [PK_Workers]=#PKWorkers
-- Bad (?) but works:
SELECT ([company] + CAST([number] AS NVARCHAR)) PK, [name] FROM [dbo].[Workers];
UPDATE [dbo].[Workers] SET [name]='newname' WHERE ([company] + CAST([number] AS NVARCHAR))=#PK
The [PK_Workers] fails in these queries. Is there another way to get this value without manually combining and casting the index?
Or is there some other way to do this that I don't know?
For each table, create a function returning the concatenated primary key. Create a function-based index on this function too. Then use this function in SELECT and WHERE clauses.
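On the SQL Server side, where the Workers table above lives, the closest analogue to a function-based index is an indexed persisted computed column. A sketch, with the column name and separator chosen purely for illustration:
ALTER TABLE [dbo].[Workers]
    ADD [PK_Composite] AS ([company] + N'|' + CAST([number] AS NVARCHAR(12))) PERSISTED;

CREATE INDEX [IX_Workers_PK_Composite] ON [dbo].[Workers] ([PK_Composite]);

-- the earlier failing queries then become:
DECLARE @PK nvarchar(100) = N'SomeCompany|42';  -- hypothetical key value
SELECT [PK_Composite], [name] FROM [dbo].[Workers];
UPDATE [dbo].[Workers] SET [name] = 'new name' WHERE [PK_Composite] = @PK;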

Derived table with an index

Please see the TSQL below:
DECLARE @TestTable table (reference int identity,
                          TestField varchar(10),
                          primary key (reference))
INSERT INTO @TestTable VALUES ('Ian')
select * from @TestTable as TestTable
INNER JOIN LiveTable on LiveTable.Reference=TestTable.Reference
Is it possible to create an index on @TestTable.TestField? The following webpage suggests it is not. However, I read on another webpage that it is possible.
I know I could create a physical table instead (for #TestTable). However, I want to see if I can do this with a derived table first.
You can create an index on a table variable as described in the top voted answer on this question:
SQL Server : Creating an index on a table variable
Sample syntax from that post:
DECLARE @TEMPTABLE TABLE (
    [ID] [INT] NOT NULL PRIMARY KEY,
    [Name] [NVARCHAR](255) COLLATE DATABASE_DEFAULT NULL,
    UNIQUE NONCLUSTERED ([Name], [ID])
)
Alternatively, you may want to consider using a temp table, which persists for the scope of the current operation (i.e. during execution of a stored procedure), exactly like table variables. Temp tables are structured and optimized just like regular tables, but they are stored in tempdb, so they can be indexed in the same way as regular tables.
Temp tables will generally offer better performance than table variables, but it's worth testing with your dataset.
More in depth details can be found here:
When should I use a table variable vs temporary table in sql server?
You can see a sample of creating a temp table with an index from:
SQL Server Planet - Create Index on Temp Table
One of the most valuable assets of a temp table (#temp) is the ability
to add either a clustered or non clustered index. Additionally, #temp
tables allow for the auto-generated statistics to be created against
them. This can help the optimizer when determining cardinality. Below
is an example of creating both a clustered and non-clustered index on
a temp table.
Sample code from site:
CREATE TABLE #Users
(
    ID INT IDENTITY(1,1),
    UserID INT,
    UserName VARCHAR(50)
)

INSERT INTO #Users
(
    UserID,
    UserName
)
SELECT
    UserID = u.UserID
    ,UserName = u.UserName
FROM dbo.Users u

CREATE CLUSTERED INDEX IDX_C_Users_UserID ON #Users(UserID)
CREATE INDEX IDX_Users_UserName ON #Users(UserName)

Creating a table specifically for tracking change information to remove duplicated columns from tables

When creating tables, I have generally created them with a couple extra columns that track change times and the corresponding user:
CREATE TABLE dbo.Object
(
ObjectId int NOT NULL IDENTITY (1, 1),
ObjectName varchar(50) NULL ,
CreateTime datetime NOT NULL,
CreateUserId int NOT NULL,
ModifyTime datetime NULL ,
ModifyUserId int NULL
) ON [PRIMARY]
GO
I have a new project now where if I continued with this structure I would have 6 additional columns on each table with this type of change tracking. A time column, user id column and a geography column. I'm now thinking that adding 6 columns to every table I want to do this on doesn't make sense. What I'm wondering is if the following structure would make more sense:
CREATE TABLE dbo.Object
(
ObjectId int NOT NULL IDENTITY (1, 1),
ObjectName varchar(50) NULL ,
CreateChangeId int NOT NULL,
ModifyChangeId int NULL
) ON [PRIMARY]
GO
-- foreign key relationships on CreateChangeId & ModifyChangeId
CREATE TABLE dbo.Change
(
ChangeId int NOT NULL IDENTITY (1, 1),
ChangeTime datetime NOT NULL,
ChangeUserId int NOT NULL,
ChangeCoordinates geography NULL
) ON [PRIMARY]
GO
Can anyone offer some insight into this minor database design problem, such as common practices and functional designs?
Where I work, we use the same construct as yours: every table has the following fields:
CreatedBy (int, not null, FK users table - user id)
CreationDate (datetime, not null)
ChangedBy (int, null, FK users table - user id)
ChangeDate (datetime, null)
Pro: easy to track and maintain; only one I/O operation (I'll come to that later)
Con: I can't think of any at the moment (well, OK, sometimes we don't use the change fields ;-)
IMO the approach with the extra table has the problem that you somehow also have to reference the owning table for every record (unless you only need the one direction, Object to Tracking table). The approach also leads to more database I/O operations; for every insert or modify you will need to (see the sketch after this list):
add an entry to the Object table
add an entry to the Tracking table and get the new Id
update the Object table entry with the Tracking table Id
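A sketch of that round-tripping with the tables from the question (in practice, because Object's CreateChangeId is NOT NULL and references Change, the Change row has to be inserted first; the variable values are hypothetical):
DECLARE @ChangeId int, @UserId int = 42;

-- extra I/O: record who/when/where in the tracking table first
INSERT INTO dbo.Change (ChangeTime, ChangeUserId, ChangeCoordinates)
VALUES (GETDATE(), @UserId, NULL);
SET @ChangeId = SCOPE_IDENTITY();

-- then create the object row pointing at that change
INSERT INTO dbo.Object (ObjectName, CreateChangeId)
VALUES ('example', @ChangeId);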
It would certainly make the application code that communicates with the DB a bit more complicated and error-prone.

store two values in one field sql

I have to create a table in SQL where one of the columns stores awards for a movie. The schema says it should store something like 'Oscar, screenplay'. Is it possible to store two values in the same field in SQL? If so, what datatype would that be, and how would you query the table for it?
It's a horrible design pattern to store more than one piece of data in a single column in a relational database. The exact design of your system depends on several things, but here is one possible way to model it:
CREATE TABLE Movie_Awards (
    movie_id INT NOT NULL,
    award_id INT NOT NULL,
    CONSTRAINT PK_Movie_Awards PRIMARY KEY CLUSTERED (movie_id, award_id)
)
CREATE TABLE Movies (
    movie_id INT NOT NULL,
    title VARCHAR(50) NOT NULL,
    year_released SMALLINT NULL,
    ...
    CONSTRAINT PK_Movies PRIMARY KEY CLUSTERED (movie_id)
)
CREATE TABLE Awards (
    award_id INT NOT NULL,
    ceremony_id INT NOT NULL,
    name VARCHAR(50) NOT NULL, -- Ex: Best Picture
    CONSTRAINT PK_Awards PRIMARY KEY CLUSTERED (award_id)
)
CREATE TABLE Ceremonies (
    ceremony_id INT NOT NULL,
    name VARCHAR(50) NOT NULL, -- Ex: "Academy Awards"
    nickname VARCHAR(50) NULL, -- Ex: "Oscars"
    CONSTRAINT PK_Ceremonies PRIMARY KEY CLUSTERED (ceremony_id)
)
I didn't include Foreign Key constraints here, but hopefully they should be pretty obvious.
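To answer the querying half of the question, a lookup against this structure might look like this (a sketch using the names above; the movie title is hypothetical):
-- all awards for a given movie, with the ceremony's nickname (e.g. "Oscars")
SELECT m.title, c.nickname, a.name
FROM Movies m
    JOIN Movie_Awards ma ON ma.movie_id = m.movie_id
    JOIN Awards a ON a.award_id = ma.award_id
    JOIN Ceremonies c ON c.ceremony_id = a.ceremony_id
WHERE m.title = 'Some Movie';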
Anything's possible; that doesn't mean it's a good idea :)
Far better to normalize your structure and store types like so:
AwardTypes:
AwardTypeID
AwardTypeName
Movies:
MovieID
MovieName
MovieAwardType:
MovieID
AwardTypeID
You can serialize your data in JSON format, store the JSON string, and deserialize it on read. That is safer than using your own format.
Data presentation doesn't have to be so closely tied to physical data organisation. Wouldn't it be better to store these two pieces of data in two separate columns and just do some kind of concatenation at display time?
It is much less painful to join data than to split it, if you happen to need just the screenplay one day...

Can "auto_increment" on "sub_groups" be enforced at a database level?

In Rails, I have the following
class Token < ActiveRecord::Base
  belongs_to :grid
  attr_accessible :turn_order
end
When you insert a new token, turn_order should auto-increment. HOWEVER, it should only auto-increment for tokens belonging to the same grid.
So, take 4 tokens for example:
Token_1 belongs to Grid_1, turn_order should be 1 upon insert.
Token_2 belongs to Grid_2, turn_Order should be 1 upon insert.
If I insert Token_3 to Grid_1, turn_order should be 2 upon insert.
If I insert Token_4 to Grid_2, turn_order should be 2 upon insert.
There is an additional constraint: imagine I execute Token_3.turn_order = 1; now Token_1 must automatically set its turn_order to 2, because within these sub-groups there can be no turn_order collision.
I know MySQL has auto_increment, I was wondering if there is any logic that can be applied at the DB level to enforce a constraint such as this. Basically auto_incrementing within sub-groups of a query, those sub-groups being based on a foreign key.
Is this something that can be handled at a DB level, or should I just strive for implementing rock-solid constraints at the application layer?
If I understood your question properly, then you could use one of the following two methods (InnoDB vs MyISAM). Personally, I'd take the InnoDB road, as I'm a fan of clustered indexes (which MyISAM doesn't support) and I prefer performance over how many lines of code I need to type, but the decision is yours...
http://dev.mysql.com/doc/refman/5.0/en/innodb-table-and-index.html
Rewriting mysql select to reduce time and writing tmp to disk
full sql script here : http://pastie.org/1259734
innodb implementation (recommended)
-- TABLES
drop table if exists grid;
create table grid
(
    grid_id int unsigned not null auto_increment primary key,
    name varchar(255) not null,
    next_token_id int unsigned not null default 0
)
engine = innodb;

drop table if exists grid_token;
create table grid_token
(
    grid_id int unsigned not null,
    token_id int unsigned not null,
    name varchar(255) not null,
    primary key (grid_id, token_id) -- note clustered PK order (innodb only)
)
engine = innodb;

-- TRIGGERS
delimiter #
create trigger grid_token_before_ins_trig before insert on grid_token
for each row
begin
    declare tid int unsigned default 0;
    select next_token_id + 1 into tid from grid where grid_id = new.grid_id;
    set new.token_id = tid;
    update grid set next_token_id = tid where grid_id = new.grid_id;
end#
delimiter ;

-- TEST DATA
insert into grid (name) values ('g1'),('g2'),('g3');
insert into grid_token (grid_id, name) values
    (1,'g1 t1'),(1,'g1 t2'),(1,'g1 t3'),
    (2,'g2 t1'),
    (3,'g3 t1'),(3,'g3 t2');

select * from grid;
select * from grid_token;
myisam implementation (not recommended)
-- TABLES
drop table if exists grid;
create table grid
(
    grid_id int unsigned not null auto_increment primary key,
    name varchar(255) not null
)
engine = myisam;

drop table if exists grid_token;
create table grid_token
(
    grid_id int unsigned not null,
    token_id int unsigned not null auto_increment,
    name varchar(255) not null,
    primary key (grid_id, token_id) -- non clustered PK
)
engine = myisam;

-- TEST DATA
insert into grid (name) values ('g1'),('g2'),('g3');
insert into grid_token (grid_id, name) values
    (1,'g1 t1'),(1,'g1 t2'),(1,'g1 t3'),
    (2,'g2 t1'),
    (3,'g3 t1'),(3,'g3 t2');

select * from grid;
select * from grid_token;
My opinion: Rock-solid constraints at the app level. You may get it to work in SQL -- I've seen some people do some pretty amazing stuff. A lot of SQL logic used to be squirreled away in triggers, but I don't see much of that lately.
This smells more like business logic and you absolutely can get it done in Ruby without wrapping yourself around a tree. And... people will be able to see the tests and read the code.
This to me sounds like something you'd want to handle in an after_save method or in an observer. If the model itself doesn't need to be aware of when or how something increments then I'd stick the business logic in the observer. This approach will make the incrementing logic more expressive to other developers and database agnostic.