Informix SQL text Blob wildcard search - sql

I am looking for an efficient way to use a wild card search on a text (blob) column.
I have seen that it is internally stored as bytes...
The amount of data will be limited, but unfortunately my vendor has decided to use this stupid datatype. I would also consider moving everything into a temp table if there is an easy system-side function to convert it; unfortunately something like rpad does not work...
I can see the text value correctly by using the column in the select list or when reading the data via Perl's DBI module.

Unfortunately, you are stuck - there are very few operations that you can perform on TEXT or BYTE blobs. In particular, none of these work:
+ create table t (t text in table);
+ select t from t where t[1,3] = "abc";
SQL -615: Blobs are not allowed in this expression.
+ select t from t where t like "%abc%";
SQL -219: Wildcard matching may not be used with non-character types.
+ select t from t where t matches "*abc*";
SQL -219: Wildcard matching may not be used with non-character types.
Depending on the version of IDS, you may have options with BTS - Basic Text Search (requires IDS v11), or with other text search data blades. On the other hand, if the data is already in the DB and cannot be type-converted, then you may be forced to extract the blobs and search them client-side, which is less efficient. If you must do that, ensure you filter on as many other conditions as possible to minimize the traffic that is needed.
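If you do end up searching client-side, the idea can be sketched roughly as follows. This is only an illustration, not DBD::Informix or any Informix API: the function name search_blobs and the sample rows are hypothetical, and the rows stand in for whatever your driver returns after the server has already applied every other WHERE condition. Python's fnmatchcase uses shell-style wildcards, which are close to (but not identical to) MATCHES patterns.

```python
from fnmatch import fnmatchcase


def search_blobs(rows, pattern, encoding="utf-8"):
    """Return the keys of rows whose blob text matches a shell-style pattern.

    rows    -- iterable of (key, blob) pairs, already filtered server-side
    pattern -- wildcard pattern such as "*abc*"
    """
    hits = []
    for key, blob in rows:
        if blob is None:  # NULL blob: nothing to search
            continue
        # A driver may hand back bytes or str; normalise to str.
        if isinstance(blob, bytes):
            text = blob.decode(encoding, errors="replace")
        else:
            text = blob
        if fnmatchcase(text, pattern):
            hits.append(key)
    return hits


# Hypothetical rows standing in for a fetched result set.
sample_rows = [(1, b"xxabcxx"), (2, b"no match here"), (3, None), (4, "abc at start")]
matching_keys = search_blobs(sample_rows, "*abc*")
```

The tighter the server-side filter that produces the rows, the fewer blobs have to cross the network before this loop ever sees them.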
You might also notice that DBD::Informix has to go through some machinations to make blobs appear to work - machinations that it should not, quite frankly, have to go through. So far, in a decade of trying, I've not persuaded people that these things need fixing.


Why am I getting results when querying a column using like '%text%', with the data actually in another column?

With Firebird 2.5.8, and a table with a dozen blob fields, I see this weird behavior when querying this way:
SELECT *
FROM TABLE
WHERE BLOBFIELD4 LIKE '%SOMETEXT%'
and I get results though SOMETEXT is actually in a different column and not in BLOBFIELD4 (happens with every blob column).
What am I missing?
Thanks for the data. I ran a few quick tests using the latest IB Expert with Firebird 2.5.5 (what I had on hand).
It seems that you actually have much more data than you might think you have.
First of all, it is a bad, dangerous practice to keep text data in columns marked as CHARSET NONE! Make sure that your columns are marked with some reasonable charset, like Windows 1250 or UTF8. Also make sure that the very CONNECTION of all your applications (including development tools) to the database server has an explicitly defined character set that suits your textual data.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Or, if you want those BLOBs to be seen as binary, then explicitly create them as SUB_TYPE BINARY, not SUB_TYPE TEXT.
However, here is the simple script to run on your database.
alter table comm
add NF_VC VARCHAR(4000) CHARACTER SET UTF8,
add NF_BL BLOB SUB_TYPE 1 SEGMENT SIZE 4096 CHARACTER SET UTF8
then
update comm
set nf_vc = '**' || com1 || '**'
then
update comm
set nf_bl = '##' || nf_vc || '##'
Notice that I intentionally force Firebird to do the conversion BLOB -> VARCHAR -> BLOB,
just to be on the safe side.
Now check some data.
select id_comm, nf_vc
from comm where
nf_vc containing 'f4le dans 2 ans'
and
select id_comm, nf_bl
from comm where
nf_bl containing 'f4le dans 2 ans'
What do you see now?
In the first picture we see that very mystery: the line is selected, but we cannot see your search pattern, "f4le dans 2 ans", in it.
BUT!
Can you see the marks, the double asterisks, the ** ?
Yes, you can, at the beginning! But you cannot see them at the end!
That means you DO NOT see the whole text, but only the first part of it!
In the second picture you see the very same row ID=854392, but re-converted back to BLOB and additionally marked with ## at both ends.
Can you see the marks at both the start and the end?
Can you see your search pattern?
Yes and yes, if you look at the grid row (white).
No and no, if you look at the tooltip (yellow).
So, again, the data you search for DOES exist. You just fail to see it for some reason.
Now, what may be a typical reason that a string is not displayed completely?
It can be a zero-value byte (or several bytes, for a UNICODE codepoint), the way the C language marks the end of a string, a custom that is widely used in Windows and in many libraries and programs. Or maybe some other unusual value (EOF, EOT, -1, etc.) that makes the programs you use falsely detect the end of the text where it has not actually ended yet.
Look at the two screenshots again: where is it that the lines start to differ? It is after \viewkind4 ... \par} and before pard. Notice the weird anomaly: that pard should start with a backslash (\) to be a valid RTF command. Instead it is preceded by something invisible, something blank. What can it be?...
Let us go back to your original query in your comments.
Also, it is bad practice to put important details into comments! They are hard to find there for anyone who was not tracking the story from the very start, and the more comments are added, the harder it gets. The proper avenue for you would have been to EDIT the question, adding the new data to the question body, and then to add a comment (for notification's sake) saying the question was edited. Please add new data that way in future.
select id_comm, COM1
from comm where
COM1 containing 'f4le dans 2 ans'
At first glance our fishing ended with nothing: we see text that does not contain your pattern, ending at that very \par}.
But is it so? Switch into binary view, and....
Voila! What is there before the found-lost-found-again pard? There is that very ZERO BYTE I talked about earlier.
So, what happened, to wrap it up?
Firebird is correct, the data is found because the data is really there, in the BLOBs.
Your applications that read the data are not correct. Confused by the zero byte, they show you only part of the data, not all of it.
Your application that writes the data might not be correct either. Or the data itself might be at fault.
How did that zero byte end up there? Why was the RTF structure corrupt, lacking the backslash before pard? Was the data size you passed to the server when inserting that data larger than it should have been, passing some garbage after the meaningful data? Or was the data size correct, but the data contents corrupted before inserting?
Something is fishy there. I do not think the RTF specification explicitly prohibits a zero byte, but having one is very atypical, because it triggers bugs like this in way too many applications and libraries.
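To make the effect concrete, here is a small sketch of how a zero byte can hide data from a viewer while a server-side search still finds it. Everything in it is an assumption made for illustration: the blob content is a made-up stand-in modelled on the case above, and c_style_view is a hypothetical helper that mimics a consumer treating the data as a NUL-terminated string. It is not how IB Expert or Firebird work internally.

```python
def c_style_view(data: bytes) -> bytes:
    """Mimic a NUL-terminated-string consumer: keep only what precedes the first zero byte."""
    return data.split(b"\x00", 1)[0]


def find_nul_offsets(data: bytes):
    """Return the offsets of all zero bytes, useful when checking blobs for this problem."""
    return [i for i, byte in enumerate(data) if byte == 0]


# Hypothetical blob: RTF-like text with a zero byte where a backslash should be.
blob = b"{\\rtf1 \\viewkind4 first part\\par}\x00pard f4le dans 2 ans\\par}"
needle = b"f4le dans 2 ans"

found_in_full_data = needle in blob                      # what the server searches
found_in_displayed_part = needle in c_style_view(blob)   # what a truncating viewer shows
nul_offsets = find_nul_offsets(blob)
```

Running a check like find_nul_offsets over the data read back from each BLOB is one way to find out which rows are affected before deciding how to repair them.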
P.S. the design of the table having MANY columns with BLOB types seems poor.
"wide" tables often lead to problems in future development and maintenance.
While it is not the essence of your question, please do think about remaking this table into a narrow one, and saving your data as a number of one-BLOB rows.
It will give you some fixed extra work now, but it would probably save you from snowballing problems in future.

Creating a table in NexusDB with German umlauts?

I'm trying to import a CREATE TABLE statement in NexusDB.
The table name contains some German umlauts and so do some field names, but I receive an error that there were some invalid characters in my statement (obviously the umlauts...).
My question is now: can somebody give a solution or any ideas to solve my problem?
It's not so easy to just change the umlauts into equivalent terms like ä -> ae or ö -> oe since our application has fixed table names every customer uses currently.
It is not a good idea to use characters outside what is normally permitted in the SQL standard. This will bite you not only in NexusDB, but in many other databases as well. Take special note that there is a good chance you will also run into problems when you want to access data via ODBC etc, as other environments may also have similar standard restrictions. My strong recommendation would be to avoid use of characters outside the SQL naming standard for tables, no matter which database is used.
However... having said all that, given that NexusDB is one of the most flexible database systems for the programmer (it comes with full source), there is already a solution. If you add an "extendedliterals" define to your database server project, then a larger array of characters are considered valid. For the exact change this enables, see the nxcValidIdentChars constant in the nxllConst.pas unit. The constant may also be changed if required.

Replace all occurrences of a substring in a database text field

I have a database that has around 10k records and some of them contain HTML characters which I would like to replace.
For example I can find all occurrences:
SELECT * FROM TABLE
WHERE TEXTFIELD LIKE '%&#47%'
the original string example:
this is the cool mega string that contains &#47
how to replace all &#47 with / ?
The end result should be:
this is the cool mega string that contains /
If you want to replace a specific string with another string or a transformation of that string, you could use the "replace" function in PostgreSQL. For instance, to replace all occurrences of "cat" with "dog" in the column "myfield", you would do:
UPDATE tablename
SET myfield = replace(myfield, 'cat', 'dog')
You could add a WHERE clause or any other logic as you see fit.
Alternatively, if you are trying to convert HTML entities, ASCII characters, or between various encoding schemes, PostgreSQL has functions for that as well; see the PostgreSQL String Functions documentation.
The answer given by @davesnitty will work, but you need to think very carefully about whether the text pattern you're replacing could appear embedded in a longer pattern you don't want to modify. Otherwise you'll find someone's nooking a fire, and that's just weird.
If possible, use a suitable dedicated tool for what you're un-escaping. Got URLEncoded text? Use a URL decoder. Got XML entities? Process them through an XSLT stylesheet in text mode output. And so on. These are usually safer for your data than hacking it with find-and-replace, in that find-and-replace often has unfortunate side effects if not applied very carefully, as noted above.
It's possible you may want to use a regular expression. They are not a universal solution to all problems but are really handy for some jobs.
If you want to unconditionally replace all instances of "&#47" with "/", you don't need a regexp.
If you want to replace "&#47" but not "&#471", you might need a regexp, because you can do things like match only whole words, match various patterns, specify min/max runs of digits, etc.
In the PostgreSQL string functions and operators documentation you'll find the regexp_replace function, which will let you apply a regexp during an UPDATE statement.
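As a rough illustration of the "&#47 but not &#471" case, here is a sketch using Python's re module rather than PostgreSQL's regexp_replace; the regular expression syntax differs between the two, so treat the pattern as something to adapt, not to paste. The function name unescape_slash is hypothetical, and accepting an optional trailing ";" is my own assumption, since the question only shows the entity without one.

```python
import re

# "&#47" only when it is NOT followed by another digit (so "&#471" is left alone),
# plus an optional trailing ";" (an assumption, not shown in the question).
ENTITY_47 = re.compile(r"&#47(?!\d);?")


def unescape_slash(text: str) -> str:
    """Replace every standalone &#47 entity with a forward slash."""
    return ENTITY_47.sub("/", text)


example = unescape_slash("this is the cool mega string that contains &#47")
```

The negative lookahead is what protects longer numeric entities; a plain find-and-replace on "&#47" would turn "&#471" into "/1".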
To be able to say much more I'd need to know what your real data is and what you're really trying to do.
If you don't have PostgreSQL, you can export the whole database to an SQL file, replace your string with a text editor, delete the database on your host, and re-import the modified dump.
PS: be careful

Error Inserting Entry With Text Column That Contains New Line And Quotes

I have an Informix 11.70 database. I am unable to successfully execute this insert statement on a table.
INSERT INTO some_table(
col1,
col2,
text_col,
col3)
VALUES(
5,
50,
CAST('"id","title1","title2"
"row1","some data","some other data"
"row2","some data","some other"' AS TEXT),
3);
The error I receive is:
[Error Code: -9634, SQL State: IX000] No cast from char to text.
I found that I should add this statement in order to allow using new lines in text literals, so I added this above the same query I have already written:
EXECUTE PROCEDURE IFX_ALLOW_NEWLINE('t');
Still, I receive the same error.
I have also read the IBM documentation that says: to alternatively allow new lines, I could set the ALLOW_NEWLINE parameter in the ONCONFIG file. I suppose the last one requires administrative access to the server to alter that config file, which I do not have, and I prefer not to take advantage of this setting.
Informix's TEXT (and BYTE) columns pre-date any standard, and are in many ways very peculiar types. TEXT in Informix is very different from TEXT found in other DBMS. One of the long-standing (over 20 years) problems with them is that there isn't a string literal notation that can be used to insert data into them. The 'No cast from char to text' is saying there is no explicit conversion from string literal to TEXT, either.
You have a variety of options:
Use LVARCHAR in the table (good if your values won't be longer than a few KiB, because the total row length is approximately 32 KiB). Maximum size of an LVARCHAR column is just under 32 KiB.
Use a programming language which can handle Informix 'locator' structures — in ESQL/C, the type used to hold a TEXT is loc_t.
Consider using CLOB instead. However, this has the same limitation (no string to CLOB conversion), but you'd be able to use the FILETOCLOB() function to get the information from a file on the client to the database (and LOTOFILE transfers information from the DB to a file on the client).
If you can use LVARCHAR, that is by far the simplest alternative.
I forgot to mention an important detail in the question - I use Java and the Hibernate ORM to access my Informix database, thus some of the suggested approaches (the loc_t handling in particular) in Jonathan Leffler's answer are unfortunately not applicable. Also, I need to store large data of dynamic length and I fear the LVARCHAR column would not be sufficient to hold it.
The way I got it working was to follow Michał Niklas's suggestion from his comment and use PreparedStatement. This could potentially be explained by Informix handling the TEXT data type in its own manner.
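The general idea behind that fix, binding the value as a parameter instead of embedding it in the SQL text, can be sketched as follows. Note the assumptions: this uses Python's built-in sqlite3 module as a stand-in, not Java, Hibernate or Informix, and SQLite's TEXT type is unrelated to Informix TEXT. It only shows that a bound value needs no literal syntax, so embedded newlines and quotes are passed through untouched.

```python
import sqlite3

# The value from the question: three lines of CSV containing double quotes.
csv_text = (
    '"id","title1","title2"\n'
    '"row1","some data","some other data"\n'
    '"row2","some data","some other"'
)

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE some_table (col1 INTEGER, col2 INTEGER, text_col TEXT, col3 INTEGER)"
)

# The statement contains only placeholders; the driver ships the values separately.
conn.execute(
    "INSERT INTO some_table (col1, col2, text_col, col3) VALUES (?, ?, ?, ?)",
    (5, 50, csv_text, 3),
)

stored = conn.execute("SELECT text_col FROM some_table WHERE col1 = 5").fetchone()[0]
```

With a JDBC PreparedStatement the shape is the same: placeholders in the statement, values supplied through the setter methods.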

Full Text Searching for single characters

I have a table with a TEXT column where the contents are just strings of comma-separated numbers. Example: ",1,76,77,115,". Each string can contain an arbitrary number of numbers.
I am trying to set up Full Text Indexing so that I can search this column rapidly. This works great. Instead of running queries with
where MY_COL LIKE '%,77,%' and MY_COL LIKE '%,115,%'
I can do
where CONTAINS(MY_COL,'77 and 115')
However, when I try to search for a single character it doesn't work.
where CONTAINS(MY_COL,'1')
But I know that there should be records returned! I quickly found that I need to edit the Noise file and rebuild the index. But even after doing that it still doesn't work.
Working with relational databases that way is going to hurt.
Use a proper schema. Either store the values in different rows or use an array datatype for the column.
That will make solving the problem trivial.
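A minimal sketch of the one-value-per-row layout, using Python's built-in sqlite3 module purely for illustration (the question is about a different database, and the table name record_value, its columns, and the sample data are all hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (record, number) pair instead of one CSV string per record.
conn.execute(
    "CREATE TABLE record_value ("
    " record_id INTEGER, value INTEGER, PRIMARY KEY (record_id, value))"
)

# Hypothetical source data in the original CSV-in-a-column form.
raw = {1: ",1,76,77,115,", 2: ",12,15,33,", 3: ",77,200,"}
for record_id, csv_string in raw.items():
    numbers = [int(v) for v in csv_string.split(",") if v]
    conn.executemany(
        "INSERT INTO record_value VALUES (?, ?)",
        [(record_id, n) for n in numbers],
    )


def records_with_all(conn, wanted):
    """Return ids of records that contain every (distinct) number in wanted."""
    placeholders = ",".join("?" * len(wanted))
    sql = (
        f"SELECT record_id FROM record_value WHERE value IN ({placeholders}) "
        "GROUP BY record_id HAVING COUNT(DISTINCT value) = ? ORDER BY record_id"
    )
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


both = records_with_all(conn, [77, 115])  # the CONTAINS '77 and 115' case
single = records_with_all(conn, [1])      # the single-character case
```

With this layout a search for 1 is an exact integer comparison, so it cannot match 12 or 15, and an ordinary index on value serves the lookup without any full-text machinery.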
I fixed my own problem, although I'm not exactly sure what fixed it.
I dropped my table and populated a new one (my program does batch processing) and created a new Full Text Index. Maybe I wasn't being patient enough to allow the indexing to fully rebuild.
Agreed. How does 12,15,33 not return that record for a search for 1 with fulltext? Use an actual table schema to accomplish this.