How do you do a PostgreSQL fulltext search on encoded or encrypted data? - sql

For various reasons that don't matter here we are storing chunks of text in either an encrypted or base64 encoded format in PostgreSQL. However, we want to be able to use PostgreSQL's fulltext search to find and return data which in its unencrypted/decoded form matches a search query.
How would one go about accomplishing this? I've seen other posts mention the ability to build the tsvector values before sending data to the database, but I was hoping there would be something available on the Postgres end of things (at least for the base64 text).

Encrypted values
For encrypted values you can't. Even if you created the tsvector client-side, the tsvector would contain a form of the encrypted text so it wouldn't be acceptable for most applications. Observe:
regress=> SELECT to_tsvector('my secret password is CandyStrip3r');
to_tsvector
------------------------------------------
'candystrip3r':5 'password':3 'secret':2
(1 row)
... whoops. It doesn't matter if you create that value client side instead of using to_tsvector, it'll still have your password in cleartext. You could encrypt the tsvector, but then you couldn't use it for fulltext seach.
Sure, given the encrypted value:
CREATE EXTENSION pgcrypto;
regress=> SELECT encrypt( convert_to('my s3kritPassw1rd','utf-8'), '\xdeadbeef', 'aes');
encrypt
--------------------------------------------------------------------
\x10441717bfc843677d2b76ac357a55ac5566ffe737105332552f98c2338480ff
(1 row)
you can (but shouldn't) do something like this:
regress=> SELECT to_tsvector( convert_from(decrypt('\x10441717bfc843677d2b76ac357a55ac5566ffe737105332552f98c2338480ff', '\xdeadbeef', 'aes'), 'utf-8') );
to_tsvector
--------------------
's3kritpassw1rd':2
(1 row)
... but if the problems with that aren't immediately obvious after scrolling right in the code display box then you should really be getting somebody else to do your security design for you ;-)
There's been tons of research on ways to perform operations on encrypted values without decrypting them, like adding two encrypted numbers together to produce a result that's encrypted with the same key, so the process doing the adding doesn't need the ability to decrypt the inputs in order to get the output. It's possible some of this could be applied to fts - but it's way beyond my level of expertise in the area and likely to be horribly inefficient and/or cryptographically weak anyway.
Base64-encoded values
For base64 you decode the base64 before feeding it into to_tsvector. Because decode returns a bytea and you know the encoded data is text you need to use convert_from to decode the bytea into text in the database encoding, eg:
regress=> SELECT encode(convert_to('some text to search','utf-8'), 'base64');
encode
------------------------------
c29tZSB0ZXh0IHRvIHNlYXJjaA==
(1 row)
regress=> SELECT to_tsvector(convert_from( decode('c29tZSB0ZXh0IHRvIHNlYXJjaA==', 'base64'), getdatabaseencoding() ));
to_tsvector
---------------------
'search':4 'text':2
(1 row)
In this case I've used the database encoding as the input to convert_from, but you need to make sure you use the encoding that the underlying base64 encoded text was in. Your application is responsible for getting this right. I suggest either storing the encoding in a 2nd column or ensuring that your application always encodes the text as utf-8 before applying base64 encoding.

Related

DB2 creating a random but unique character identifier upon row insert

I currently have a column in a DB2 table which is being passed through web calls and procedure by a character-encrypted value. It is type CHARACTER(13) with a CSSID for encryption.
This has become a huge pain to accommodate through multiple APIs but was initially intended to allow us a unique ID to use in calls that wasn't the primary key.
In DB2-400, what would be the next best thing as far as a 13 or more character string that is unique and randomly created upon insert, but doesn't require decryption (just a plain string)?
Is there a commonly-gravitated-to method for this? We aren't passing secure data, so there's no need for encryption, but we just want a randomly created and unique character
Try hex(generate_unique()). It's unique CHAR(26) string.
Or to_char(timestamp(generate_unique()), 'YYYYMMDDHH24MISSFF6'). You may play with format of the to_char function as well. May be useful to use, let's say, reverse format like FF6SSMIHH24DDMMYYYY to avoid unique index page contention upon heavy insert activity.
This is a comment that doesn't fit in the comments section.
I don't have access to a DB2-400 (anymore), but I tested the code below in DB2 10.5 for Linux.
create sequence seq1;
select concat('A', varchar_format(next value for seq1, '000000000000')) as my_id
from sysibm.sysdummy1;
Result, if you run it 4 times in a row:
A0000000000001
A0000000000002
A0000000000003
A0000000000004
Maybe there's something equivalent in DB2-400.
Sounds like you might be using GENERATE_UNIQUE()
GENERATE_UNIQUE function returns a bit data character string 13 bytes
long (CHAR(13) FOR BIT DATA)
Doesn't really have anything to do with encryption...
And pretty much the ideal solution in my opinion generating a unique value other than a simple numeric identity. So what the problem you are having?

HANA: Unknown Characters in Database column of datatype BLOB

I need help on how to resolve characters of unknown type from a database field into a readable format, because I need to overwrite this value on database level with another valid value (in the exact format the application stores it in) to automate system copy acitvities.
I have a proprietary application that also allows users to configure it in via the frontend. This configuration data gets stored in a table and the values of a configuration property are stored in a column of type "BLOB". For the here desired value, I provide a valid URL in the application frontend (like http://myserver:8080). However, what gets stored in the database is not readable (some square characters). I tried all sorts of conversion functions of HANA (HEX, binary), simple, and in a cascaded way (e.g. first to binary, then to varchar) to make it readable. Also, I tried it the other way around and make the value that I want to insert appear in the correct format (conversion to BLOL over hex or binary) but this does not work either. I copied the value to clipboard and compared it to all sorts of character set tables (although I am not sure if this can work at all).
My conversion tries look somewhat like this:
SELECT TO_ALPHANUM('') FROM DUMMY;
while the brackets would contain the characters in question. I cant even print them here.
How can one approach this and maybe find out the character set that is used by this application? I would be grateful for some more ideas.
What you have in your BLOB column is a series of bytes. As you mentioned, these bytes have been written by an application that uses an unknown character set.
In order to interpret those bytes correctly, you need to know the character set as this is literally the mapping of bytes to characters or character identifiers (e.g. code points in UTF).
Now, HANA doesn't come with a whole lot of options to work on LOB data in the first place and for C(haracter)LOB data most manipulations implicitly perform a conversion to a string data type.
So, what I would recommend is to write a custom application that is able to read out the BLOB bytes and perform the conversion in that custom app. Once successfully converted into a string you can store the data in a new NVCLOB field that keeps it in UTF-8 encoding.
You will have to know the character set in the first place, though. No way around that.
I assume you are on Oracle. You can convert BLOB to CLOB as described here.
http://www.dba-oracle.com/t_convert_blob_to_clob_script.htm
In case of your example try this query:
select UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(<your_blob_value)) from dual;
Obviously this only works for values below 32767 characters.

Storing HASHBYTES output in NVARCHAR vs BYTES

I am going to create:
a table for storing IDs and unique text values (which are expected to
be large)
a stored procedure which will have a text value as input parameter
(it will check if the value exists in the above table and return the
corresponding ID if it exists, or inserted a new record if not and
return the new ID as well)
I want to optimize the search of text values using hash value of the text and created index on it. So, during the search I expect a non-clustered index to be used (not the clustered index).
I decided to use the HASHBYTES with SHA2_256 and I am wondering are there any differences/benefits if I am storing the hash value as BINARY(32) or NVARCHAR(16)?
You can't reasonably store a hash value as chars because binary data is not text. Various text processing and comparison functions interpret those chars. For example trailing whitespace is sometimes ignored leading to incorrect results.
Since you've got 32 totally random unstructured bytes to store a binary(32) is the most natural format and it is the fastest one.

Update multiple rows in same query using mapping table in Postgres 8.4

In question Update multiple rows in same query using PostgreSQL Roman Peckar gave an answer similar to this; I have modified it for the purpose of my question:
update test as t set
column_a = c.column_a,
column_b = c.column_b
from (values
('123', bytea1),
('345', bytea2)
) as c(column_a, column_b)
where c.column_a = t.column_a;
In my case table test has a column of type bytea, say column_b. However, this does not work as c.column_b is of type text and thus an error is produced saying there is no conversion from text to bytea and hinting to use a cast. Well, using a cast does not help either as another error occurs about encoding referring to a LATIN encoding. I apologise for the imprecise reporting of the errors but I do not presently have access to the machine on which this work was carried out.
It seems that the default type of the c.column_b is text. Cannot the type of a column be dictated in the 'as' clause say, 'as c(column_a, column_b type bytea)' or in some other way? If not I assume I must resort to using some binary string function which seems a bit inelegant to say the least.
Because text type is for text. It needs properly encoded text in your client encoding and which can be saved with no data loss in your server encoding (so for example in latin1 no “ or €, as characters like this can not be saved using this encoding).
So if you need to save text, which can contain characters outside of latin1 (like anything typed to a web form) you'd need to change database encoding to utf-8. Or, as a last resort, use encode(data,'base64').

Selecting rows where a column contains at least one character not in a whitelist

I'm converting some sensitive data from a low-security encryption to a higher security encryption (specifically, from CFMX_COMPAT to AES with a 256-bit key). I intend to encode my AES-encrypted strings using Hex, and CFMX_COMPAT is extremely likely to use special characters, so finding records that aren't yet converted should be as simple as (pseudocode):
select from table where column has at least one character not in [A-Z0-9]
Is this possible in SQL? If so, how?
I found this documentation, but I had no idea it was possible in a simple LIKE clause. Awesome!
select top 10 foo
from bar
where foo like '%[^A-Z0-9]%'