I have a table that stores a tree-like structure of file names. There are currently 8 million records in this table. I am working on a way to quickly find a list of files that have a specific serial number embedded in the name.
FS_NODES
-----------------------------------
NODE_ID bigint PK
ROOT_ID bigint
PARENT_ID bigint
NODE_TYPE tinyint
NODE_NAME nvarchar(250)
REC_MODIFIED_UTC datetime
REC_DELETION_BIT bit
Example file name (as stored in the node_name):
scriptname_SomeSerialNumber_201205240730.xml
As expected, the LIKE statement used to find the files takes several minutes to scan the entire table, and I would like to improve this. There is no consistent pattern for the names, as each developer likes to create their own naming convention.
I tried using Full-Text Search and really love the idea, but I am not able to get it to find files based on keywords in the name. I believe the problem is due to the underscores.
Any suggestions on how I can get this to work? I am using a neutral language for the catalog.
##VERSION
Microsoft SQL Server 2005 - 9.00.4035.00 (Intel X86)
Nov 24 2008 13:01:59
Copyright (c) 1988-2005 Microsoft Corporation
Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)
Is there a way to alter the catalog and split the keywords out manually?
Thank you!
Full-text search is not the answer. It is used for words, not partial string matching. What you should do is, when inserting or updating data in this table, extract the parts of the filename that are relevant for future searching into their own column(s) which you can index. After all, they are separate pieces of data the way you are using them. You could also consider enforcing a more predictable naming convention instead of just letting the developers do whatever they want.
EDIT per user request:
Add a computed column that is REPLACE(filename, '_', ' '). Or, instead of a computed column, use a regular column that you populate manually for existing data and have your insert procedure maintain going forward. Or even break those keywords out into separate rows in a related table.
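A minimal sketch of the "extract the searchable piece into its own column" idea (column and index names are illustrative, and it assumes the serial number is always the second underscore-delimited token):

ALTER TABLE FS_NODES ADD SERIAL_NUMBER AS
    CASE WHEN NODE_NAME LIKE '%[_]%[_]%'   -- only names with at least two underscores
         THEN SUBSTRING(NODE_NAME,
                        CHARINDEX('_', NODE_NAME) + 1,
                        CHARINDEX('_', NODE_NAME, CHARINDEX('_', NODE_NAME) + 1)
                            - CHARINDEX('_', NODE_NAME) - 1)
    END PERSISTED;

CREATE INDEX IX_FS_NODES_SERIAL_NUMBER ON FS_NODES (SERIAL_NUMBER);

-- an index seek instead of a full scan with LIKE '%...%'
SELECT NODE_ID, NODE_NAME
FROM FS_NODES
WHERE SERIAL_NUMBER = 'SomeSerialNumber';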
Related
I am trying to find a way to quickly load a lot of data into a database, and someone suggested using Firebird external tables. I would like to know more about this method. I've tried searching online but I'm not finding useful information about it. How do they really work? Do the tables have to be exactly the same? And what if you are loading data from more than one database?
Use external tables like this:
CREATE TABLE ext1 EXTERNAL 'c:\myfile.txt'
(
field1 char(20),
field2 smallint
);
To do a quick import into a regular table, do something like this:
INSERT INTO realtable1 (field1, field2)
SELECT field1, field2 FROM ext1;
Remember to disable triggers and indexes (if possible) before loading, and reactivate them after.
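For example (index and trigger names here are just placeholders; indexes that back primary key, unique, or foreign key constraints cannot be deactivated this way):

ALTER INDEX idx_realtable1_field1 INACTIVE;
ALTER TRIGGER trg_realtable1_bi INACTIVE;

-- run the INSERT ... SELECT shown above

ALTER TRIGGER trg_realtable1_bi ACTIVE;
ALTER INDEX idx_realtable1_field1 ACTIVE;

Reactivating the index rebuilds it in one pass, which is usually faster than maintaining it row by row during the load.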
This information is from Firebird FAQ: http://www.firebirdfaq.org/faq209/
Here's more information about using external tables, including information about file format: http://www.delphiman.de/Bin/UsingExternalFilesAsTables.pdf
Using an external file as a table is a great way to get lots of data into Firebird quickly. However, the sample, which is from the Firebird FAQ, seems to me to be either unnecessarily complex or incorrect because of the use of smallint in the table definition. As the FB 2.5 documentation points out, "for most purposes, only columns of CHAR types would be useful."
The external file must be a text file of fixed-length records (so a .csv file won't work). The external table def should then use CHAR fields with sizes that match the lengths of the fields in each record.
Any variation in the length of the records in the text file will lead to misery (from bitter experience). I suppose the example might work if all of the smallints were the same number of digits, but more generally, things will go more smoothly if other formats (date, numeric) are simply expressed as CHAR in the text file by padding with spaces.
For example, if the raw data looked like this:
Canada 37855702
Central African Republic 4829764
Chad 16425859
Chile 19116209
China 1404676330
Then the text file should look like this:
Canada                     37855702
Central African Republic    4829764
Chad                       16425859
Chile                      19116209
China                    1404676330
Countries are right-padded to twenty-five characters and the (big) integers are left-padded to ten characters, so the records are 35 characters, plus one for a line feed (*nix) or two for Windows' CRLF. (Note that things get more complicated if the file uses a Unicode encoding.)
The table def would look like this:
CREATE TABLE ext_test EXTERNAL '/home/dave/fbtest.txt'
(
COUNTRY CHAR(25),
POPULATION CHAR(10),
LF CHAR(1)
);
Make sure that the file resides on the same machine as the FB server process, that the server process has rights to the file (perhaps through an FB group), and that the ExternalFileAccess parameter in firebird.conf is set appropriately - see the 2.5 documentation for details.
There are some limited things you can do with an external table, but it's most useful as a temporary transfer table, as a source for the ultimate FB table. INSERT each row from the external table into the ultimate target, casting the CHAR fields to the appropriate data types. For data of any real volume, the process runs much faster than, say, some Python code to read and feed each line individually.
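For example, continuing with the ext_test table above (the target table and its column types are assumptions):

INSERT INTO country_population (country, population)
SELECT TRIM(TRAILING FROM country),      -- drop the padding spaces
       CAST(population AS BIGINT)        -- leading blanks are ignored by the cast
FROM ext_test;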
If you are using an older version of FB, don't forget to DROP the external table when you're done with it to free up file locks, as outlined in the FAQs. Newer versions do this automatically. There's lots more on external tables in the 2.5 documentation at the above link.
PS - I have emailed the above to the Firebird documentation team.
Due to a bug in one of our applications, a certain character was duplicated 2^n times in many CLOB fields, where n is anywhere between 1 and 24. For the sake of simplicity, let's say the character is X. It is safe to assume that any adjacent occurrence of two or more of these characters identifies broken data.
We've thought of going over every CLOB field in the database and replacing the value where necessary. We quickly found out that you can easily replace the value by using REGEXP_REPLACE, e.g. like this (might contain syntax errors, typed from memory):
SELECT REGEXP_REPLACE( clob_value, 'XX*', 'X' )
FROM someTable
WHERE clob_value LIKE 'XX%';
However, even when changing the WHERE part to WHERE primary_key = 1234, for a row whose CLOB field contains around four million characters with broken data in two locations, this query takes more than fifteen minutes to execute (we aborted the attempt after that time, so we're not sure how long it would actually take).
As a comparison, reading the same value into a C# application, fixing it there using a similar regular expression approach, and writing it back into the database only takes 3 seconds.
We could write such a C# application and execute it, but due to security restrictions it would just be a lot easier to send our customer a database script which they could execute themselves.
Is there any way to do a replacement like this much faster on an Oracle 10g (10.2.0.3) database?
Note: There are two configurations, one running the database on a Windows 2003 Server with clients on Windows XP, and another running both the database and the client on a standalone Windows XP notebook. Both configurations are affected.
How does your client access the Oracle server? If it is via a Unix environment (which is most likely the case), then maybe you can write a shell script to extract the value from the database, fix it using sed, and write it back to the database. Replacing text on Unix should be really quick.
Maybe you are facing a problem with LOB segment space fragmentation; after the fix, each of your LOBs will be shorter than before. Try creating a new table and copying the modified CLOBs into it.
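A minimal sketch of that approach (the new table name is an assumption, and only the key and the CLOB are copied here; in practice you would carry the other columns across as well):

CREATE TABLE someTable_fixed AS
SELECT primary_key,
       REGEXP_REPLACE(clob_value, 'XX*', 'X') AS clob_value
FROM someTable;

Note that this still pays the REGEXP_REPLACE cost per row; the point is to avoid in-place LOB updates and the resulting segment fragmentation.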
As we didn't find any way to make it faster on the database, we delivered the C# tool within an executable patch.
We have a site which supports different languages. We have millions of rows, so for searching we would like to implement SQL Server Full-Text Search.
Our current table structure is shown below.
CREATE TABLE Product
(
ID INT IDENTITY(1,1),
Code VARCHAR(50),
........
........
)
CREATE TABLE ProductLanguage
(
ID INT,
LanguageID INT,
Name NVARCHAR(200),
........
........
)
We would like to implement full-text search on the "Name" column, so we have created a full-text index on it. But while creating the full-text index we can select only one language per column. If we select "English" or "Neutral", it does not return the expected data for other languages like Japanese, Chinese, French, etc.
So what is the best way to implement full-text search in SQL Server for multilingual content?
Do we need to create a different table? If yes, what should the table structure be (keeping in mind that the languages are not fixed; other languages can be added later), and what would the search query look like?
We are using SQL Server 2008 R2.
Certain content (document) types support language settings - e.g. Microsoft Office Documents, PDF, [X]HTML, or XML.
If you change the type of your Name column to XML, you can determine the language of each value (i.e. per row). For instance:
Instead of storing values as strings
name 1
name 2
name 3
...you could store them as XML documents with the appropriate language declarations:
<content xml:lang="en-US">name 1</content>
<content xml:lang="fr-FR">name 2</content>
<content xml:lang="en-UK">name 3</content>
During full-text index population the correct word breaker/stemmer will be used, based on the language settings of each value (XML document): US English for name 1, French for name 2, and UK English for name 3.
Of course, this would require a significant change in the way your data is managed and consumed.
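For example, a minimal sketch of that conversion (the Languages lookup table, its CultureCode column, and the new NameXml column are assumptions; the REPLACEs handle the most common XML special characters):

ALTER TABLE ProductLanguage ADD NameXml XML;
GO

UPDATE pl
SET NameXml = CAST(N'<content xml:lang="' + l.CultureCode + N'">'
                 + REPLACE(REPLACE(REPLACE(pl.Name, N'&', N'&amp;'),
                                   N'<', N'&lt;'), N'>', N'&gt;')
                 + N'</content>' AS XML)
FROM ProductLanguage AS pl
JOIN Languages AS l ON l.LanguageID = pl.LanguageID;
GO

The full-text index would then be created on NameXml instead of Name, so the XML filter can pick up each row's xml:lang attribute and hand the text to the matching word breaker.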
ML
I'd be concerned about the performance of using XML instead of NVARCHAR(n) - though I have no hard proof for it.
One alternative could be to use dynamic SQL (generate the language-specific code on the fly), combined with language-specific indexed views on the Product table; a rough sketch follows below. The drawback of this is the lack of execution plan caching, i.e. again: performance.
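A sketch of the indexed-view variant (all object names and the LanguageID value are assumptions; it presumes ID is non-nullable, a full-text catalog already exists, and one such view plus full-text index is created per language):

CREATE VIEW dbo.ProductLanguage_French
WITH SCHEMABINDING
AS
SELECT ID, LanguageID, Name
FROM dbo.ProductLanguage
WHERE LanguageID = 2;   -- 2 = French in this sketch
GO

CREATE UNIQUE CLUSTERED INDEX IX_ProductLanguage_French
    ON dbo.ProductLanguage_French (ID);
GO

CREATE FULLTEXT INDEX ON dbo.ProductLanguage_French (Name LANGUAGE 'French')
    KEY INDEX IX_ProductLanguage_French;

Queries then have to target the right view per language, which is where the dynamic SQL mentioned above comes in.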
Same idea as Matija Lah's answer, but this is the suggested solution outlined in the MS whitepaper.
When the indexed content is of binary type (such as a Microsoft Word document), the iFilter responsible for processing the text content before sending it to the word breaker might honor specific language tags in the binary file. When this is the case, at indexing time the iFilter invokes the correct word breaker for a specific document or section of a document specified in a particular language. All you need to do in this case is to verify after indexing that the multilanguage content was indexed correctly. Filters for Word, HTML, and XML documents honor language specification attributes in document content:
Word – language settings
HTML – <meta name="MS.locale" …>
XML – xml:lang attribute
When your content is plain text, you can convert it to the XML data type and add specific language tags to indicate the language corresponding to that specific document or document section. Note that for this to work, before you index you must know the language that will be used.
https://technet.microsoft.com/en-us/library/cc721269%28v=sql.100%29.aspx
Is there a way around this?
I am trying to insert some data into a table whose structure is:
Column name Type Nulls
crs_no char(12) no
cat char(4) no
pr_cat char(1) yes
pr_sch char(1) yes
abstr text yes
The type of the last field reads 'text', but when trying to insert into this table, I get this error:
insert into crsabstr_rec values ("COMS110","UG09","Y","Y","CHEESE");
617: A blob data type must be supplied within this context.
Error in line 1
Near character position 66
So this field is some sort of blob apparently, but won't take inserts (or updates). Normally, these records are inserted via a GUI form, and C code handles the insertions.
There are no blob (BYTE or TEXT) literals in Informix Dynamic Server (IDS) - nor for CLOB or BLOB types in IDS 9.00 and later. It is an ongoing source of frustration to me; I've had the feature request in the system for years, but it never reaches the pain threshold internally that means it gets fixed -- other things get given a higher priority.
Nevertheless, it bites people all the time.
In IDS 7.3 (which you should aim to upgrade from - it goes out of service in September 2009 after a decade or so), you are pretty much stuck with using C to get the data into the TEXT field of a database. You have to use the approved C type 'loc_t' to store the information about the BYTE or TEXT data, and pass that to the server.
If you need examples in ESQL/C, look at the International Informix User Group web site, and especially the Software Repository. Amongst other things, you'll find the original SQLCMD program (Microsoft's program of the same name is a Johnny-Come-Lately) in source form. It also includes a set of programs that I dub 'vignettes'; they manipulate blobs in various ways, and are designed to show how to use 'loc_t' structures in various scenarios.
In iSQL:
LOAD FROM "desc.txt" INSERT INTO crsabstr_rec;
3 row(s) loaded.
desc.txt is a | (pipe) delimited text file, and the number of fields in the text file has to match the number of columns in the table.
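For example, a desc.txt row matching the crsabstr_rec table above might look like this (with a pipe after the last field as well, which is the format UNLOAD produces and LOAD expects by default):

COMS110|UG09|Y|Y|CHEESE|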
I am looking for an efficient way to use a wild card search on a text (blob) column.
I have seen that it is internally stored as bytes...
The amount of data will be limited, but unfortunately my vendor has decided to use this stupid datatype. I would also consider moving everything into a temp table if there is an easy system-side function to modify it - unfortunately something like rpad does not work...
I can see the text value correctly via using the column in the select part or when reading the data via Perl's DBI module.
Unfortunately, you are stuck - there are very few operations that you can perform on TEXT or BYTE blobs. In particular, none of these work:
+ create table t (t text in table);
+ select t from t where t[1,3] = "abc";
SQL -615: Blobs are not allowed in this expression.
+ select t from t where t like "%abc%";
SQL -219: Wildcard matching may not be used with non-character types.
+ select t from t where t matches "*abc*";
SQL -219: Wildcard matching may not be used with non-character types.
Depending on the version of IDS, you may have options with BTS - Basic Text Search (requires IDS v11), or with other text search data blades. On the other hand, if the data is already in the DB and cannot be type-converted, then you may be forced to extract the blobs and search them client-side, which is less efficient. If you must do that, ensure you filter on as many other conditions as possible to minimize the traffic that is needed.
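If BTS is available, a rough sketch of what that looks like (from memory, so treat the operator class, the sbspace name, and the retyped table/column as assumptions - BTS indexes character/CLOB columns rather than legacy TEXT, so the data would first have to be copied into, say, an LVARCHAR or CLOB column):

-- t2(id, txt) is a hypothetical copy of the data with txt as LVARCHAR/CLOB
CREATE INDEX t2_txt_bts ON t2 (txt bts_lvarchar_ops) USING bts IN bts_sbspace;

-- word searches instead of LIKE/MATCHES wildcards
SELECT id FROM t2 WHERE bts_contains(txt, 'abc');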
You might also notice that DBD::Informix has to go through some machinations to make blobs appear to work - machinations that it should not, quite frankly, have to go through. So far, in a decade of trying, I've not persuaded people that these things need fixing.