Hash database values such that NULL is assigned an integer, not NULL - sql

I'm interested in hashing database field values as part of an attempt to detect changes in tables.
The database in question (Vertica) has a HASH function, mainly for internal use I guess, as well as other hashes. The internal function assigns a non-null hash value to NULL (in fact, it differs for NULLs of different datatypes).
I might end up using that internal hash function, but if its statistical properties and collision avoidance turn out not to be that good, how can I use the other provided functions such as MD5 (I don't need a strong cryptographic hash) when they all map NULL to NULL?
Of course I could just assign some other hash value to NULL, but I don't know an elegant way to do that (as opposed to expanding the set of hash values with an extra value reserved for NULL).
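To make the problem concrete, this is roughly the kind of workaround I have in mind; MD5(), the column name, and the sentinel string are just placeholders:

-- map NULL to its own sentinel before hashing, and prefix real values
-- so that no real value can collide with the sentinel
SELECT CASE
         WHEN val IS NULL THEN MD5('##NULL##')
         ELSE MD5('V' || CAST(val AS VARCHAR))
       END AS val_hash
FROM   my_table;

It works, but it has to be repeated for every nullable column, which is what I would call inelegant.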

You could simply select the relevant part of the table (that is, only the required columns), generate a hash of that queried data, and compare it against the hash you get the next time.
The result of a query over a table is itself a table, so you can track changes on that derived table.
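A minimal sketch of that idea, assuming an MD5() function and placeholder column names; each row gets a fingerprint built from the columns you care about, and you compare the fingerprints between two snapshots:

-- note: a real value equal to '<null>' would collide with NULL,
-- so pick a sentinel that cannot occur in the data
SELECT id,
       MD5(COALESCE(CAST(col_a AS VARCHAR), '<null>') || '|' ||
           COALESCE(CAST(col_b AS VARCHAR), '<null>')) AS row_fingerprint
FROM   source_table;

Store the result (for example in a snapshot table keyed by id) and diff it against the next run to find inserted, deleted, or modified rows.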

Related

Query without partition key in DynamoDB

I'm designing a table in DynamoDB which will contain a large number of records, each with a unique ID and a timestamp. I will need to retrieve a set of records that fall between two dates, irrespective of all other property values.
Adding a global secondary index for the timestamp field seems like a logical solution, however this isn't straightforward.
The Query command in DynamoDB requires a KeyConditionExpression parameter, which determines which results are returned by the query. From the DynamoDB developer guide:
To specify the search criteria, you use a key condition expression—a string that determines the items to be read from the table or index. You must specify the partition key name and value as an equality condition. You can optionally provide a second condition for the sort key (if present).
Since the partition key must be specified exactly, it cannot be used for querying a range of values. A range could be specified for the sort key, but only in addition to an exact match on the partition key. Hence, the only way I can see this working is to add a dummy field for the index partition key, where every record has the same value, then perform the query on the timestamp as the sort key. This seems hacky, and, presumably, is not how it's intended to be used.
Some answers to similar questions suggest using the Scan command instead of Query, however this would be very inefficient when fetching a small number of records from a very large table.
Is it possible to efficiently query the table to get all records where a condition matches the timestamp field only?
How big of a range are you dealing with?
You could for instance have a GSI partition key of YYYY, or YYYY-MM, or YYYY-MM-DD, depending on the range you need to cover.
Your sort key could be the remainder of the timestamp.
You may need to make multiple queries, if for instance the amount of data necessitates daily partitions and you want to show 7 days at a time.
Also be sure to read the best practices for time-series data section of the developer guide.
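As a hedged sketch of what such a query could look like using DynamoDB's PartiQL support (the Events table, the ByDay index, and the attribute names are hypothetical), one statement per daily partition:

SELECT *
FROM   "Events"."ByDay"
WHERE  day_bucket = '2024-03-01'
  AND  event_time BETWEEN '2024-03-01T00:00:00Z' AND '2024-03-01T23:59:59Z';

You would repeat this for each day in the requested range and merge the results client-side.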

How to identify best Max_bucket and Seed_value for Oracle ORA_Hash function?

I am new to the Oracle hash function. I know that this function is for encryption purposes: it converts a very large paragraph into one single hash value.
The ORA_HASH function has three different parameters:
Expression
Max_bucket
Seed_value
For Max_bucket and Seed_value, the documentation says I can specify anything between 0 and 4294967295. Max_bucket defaults to 4294967295 and Seed_value defaults to 0.
However, does anyone know what difference the choice between 0 and 4294967295 makes for those values?
I am actually planning to use it to compare two columns from two different tables. Each row in those columns has close to 3,000 characters; one table will have close to 1 million records while the other will have close to billions of records. Of course, both tables can be joined on an ID column.
Because of this, I think using a hash value will be a better option than simply comparing A = B.
However, could anyone explain how to identify the best Max_bucket and Seed_value for the Oracle ORA_HASH function?
Thanks in advance!
ORA_HASH is not intended for generating unique hash values. You probably want to use a function like STANDARD_HASH instead.
ORA_HASH is intended for situations where you want to quickly throw a bunch of values into a group of buckets, and hash collisions are useful. ORA_HASH is useful for hash partitioning; for example, you might want to split a table into 64 segments to improve manageability.
STANDARD_HASH can be used to generate practically-unique hashes, using algorithms like MD5 or SHA. These hash algorithms are useful for cryptographic purposes, whereas ORA_HASH would not be suitable. For example:
select standard_hash('asdf') the_hash from dual;
THE_HASH
--------
3DA541559918A808C2402BBA5012F6C60B27661C
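For the comparison use case in the question, a hedged sketch (table and column names are illustrative; STANDARD_HASH defaults to SHA-1):

SELECT a.id
FROM   table_a a
JOIN   table_b b ON b.id = a.id
WHERE  STANDARD_HASH(a.big_text_col) <> STANDARD_HASH(b.big_text_col);

Comparing the short hash values avoids comparing the full ~3,000-character strings, at the cost of an astronomically small chance of a collision hiding a real difference.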

Would replacing a SQL index text column with numeric values make it faster?

I have an old and very bad database.
I have a child table with a text column for the users. All my users have numeric values, but there is an exception for the admin user: its code is 'ADMIN'.
So I created a numeric code for the ADMIN user and I will update all the records with that numeric value, but I won't change the column type to integer.
So I want to know: with this change in place, and all the values of the user column being numeric, will the index on the user column be better, faster and stronger?
Indexing performance aside, it is always better to use the database type that matches the actual type in your model. Since the actual type of the ID is integer, changing database type to int would make it more natural to work with your database.
For example, ordering on ID would behave in a natural way, because it would no longer alphabetize your numbers (i.e. ordering 199 ahead of 2, because 199 comes first lexicographically). Searches using BETWEEN operator would produce correct results for the numbers as well.
Another important improvement is that the application relying on your database would no longer be able to insert non-numeric data into the ID column by mistake. This additional validation alone is worth making the change.
As far as the size and performance of an index goes, the size is very likely to shrink, which has a potential of improving performance by reducing the amount of reads.
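A small illustration of the ordering point, with hypothetical table and column names; on the text column '199' sorts before '2', while casting (safe once every value really is numeric) gives the natural order:

SELECT user_code FROM child_table ORDER BY user_code;                  -- lexicographic: '199' before '2'
SELECT user_code FROM child_table ORDER BY CAST(user_code AS INTEGER); -- numeric order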
It sounds like you really want a reference table.
Integers have advantages over strings for indexes:
They are fixed length.
They are usually shorter (although at 32 bits each, your codes might be shorter).
I think they are easier to gather statistics on.
The first two are optimizations for the index, but they are pretty minor, and the third might affect the optimizer. These are the sorts of things that are helpful, but you wouldn't change your data structure for them.
These also affect joins and foreign keys. The second is particularly important for foreign key references. If your values are wide, you end up repeating them in multiple tables -- eating up even more space.
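A hedged sketch of the reference-table shape described above (all names are illustrative, and ALTER syntax varies by product):

CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    user_code VARCHAR(20) NOT NULL UNIQUE   -- '1001', 'ADMIN', ...
);

ALTER TABLE child_table ADD COLUMN user_id INTEGER REFERENCES users (user_id);

The child table then joins on the narrow integer key, and the original text codes live in exactly one place.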

Implicit enumerated types (i.e. symbols) in SQL

We are often using VARCHARs for essentially enumerated values. I know it would often be smart to extract them into a separate lookup table and use an integer ID as a foreign key, but sometimes no other table is using it, and we don't want another JOIN, so we opt to keep them in the main table.
So, the question is, is there some DB feature that would allow me to mark such columns, and then use some internal lookup table to save space and improve performance of my queries? Something similar to Postgres' ENUMs, but that would not require explicitly declaring possible values up front.
For example, I would want to do an INSERT:
INSERT INTO table (date, status) VALUES ('2011-01-25', 'pending');
and 'pending' would be internally treated as an integer, keeping only one instance of the actual string, even if multiple rows contain the same value 'pending'.
In some programming languages (Lisp, Ruby), a similar feature is called symbols, de facto "named integers".
I'm mainly interested in Postgres and MySQL, but any other pointers would be appreciated as well.
Oracle table compression and SQL Server page compression both do this, in addition to other tricks. The nice thing about using inbuilt compression routines is that they are completely transparent: no extra joins are required in your code, and because there's less disk access, it is often quicker to access compressed data than uncompressed data. I think Postgres does this as part of TOAST when it uses the EXTERNAL storage strategy, but only on larger fields.
I know this doesn't answer your question, but I've done it with functions and lookup tables, or, where speed is important, functions which just return a constant.
ie:
INSERT INTO
table (date, status)
VALUES
('2011-01-25', udf_getConst('statuscode','pending'));
or
INSERT INTO
table (date, status)
VALUES
('2011-01-25', udf_Const_StatusCode_Pending());
If you're using the constant in multiple places in a query, consider selecting it into a variable first.
You can also use bitwise logic for different status codes and store multiple values in a single integer column.
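A minimal sketch of the bitwise idea (the flag values and table/column names are made up; the operators shown are standard in MySQL and PostgreSQL):

-- pending = 1, approved = 2, shipped = 4, ...
UPDATE orders SET status_flags = status_flags | 4 WHERE order_id = 42;   -- set the 'shipped' bit
SELECT * FROM orders WHERE status_flags & 1 <> 0;                        -- rows with the 'pending' bit set

The trade-off is that such columns are hard to index usefully and the flag meanings live only in application code.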

Generally, are string (or varchar) fields used as join fields?

We have two tables. The first contains a name (varchar) field. The second contains a field that references the name field from the first table. This foreign key in the second table will be repeated for every row associated with that name. Is it generally discouraged to use a varchar/string field as a join between two tables? When is the best case where a string field can be used as a join field?
It's certainly possible to use a varchar as a key field (or simply something to join on). The main problems with it stem from what you normally store in a varchar field: mutable data. Strictly speaking, it's not advisable to have key fields change. A person's name, telephone number, even their SSN can all change. However, the employee with internal ID 3 will always be ID 3, even if there are two John Smiths.
Second, string comparison depends on a number of nit-picky details, such as culture, collation, whitespace translation, etc., that can break a join for no immediately apparent reason. Say you use a tab character \t in a certain string you're joining on. Later, you change your software to replace \t with 3 spaces to reduce character escapes in your raw strings. You have now broken any functionality requiring a string with escaped tabs to be matched to an identical-looking, but differently composed, string.
Lastly, even given two perfectly identical strings, there is a slight performance benefit to comparing two integers over comparing two strings. Integer comparison is effectively constant-time. String comparison is linear at best, based on the length of the string.
Is it generally discouraged to use a varchar/string field as a join between two tables?
If there's a natural key to be used (extremely rare in real life, but state/province abbreviations are a good example), then VARCHAR fields are fine.
When is the best case where a string field can be used as a join field?
It depends on the database because of the bits allocated to the data type, but generally VARCHAR(4) or smaller takes around the same amount of space as an INT would (less with fewer characters).
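A sketch of that kind of natural key, with illustrative DDL:

CREATE TABLE states (
    state_code CHAR(2) PRIMARY KEY,       -- 'NY', 'CA', ...
    state_name VARCHAR(50) NOT NULL
);

CREATE TABLE addresses (
    address_id INTEGER PRIMARY KEY,
    state_code CHAR(2) NOT NULL REFERENCES states (state_code)
);

The two-character code is stable, unique, and about as compact as an integer key would be.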
Generally speaking, you shouldn't use anything editable by the end users as a FK, since an edit would require not one update, but one update per table which references that key.
Everyone else has already mentioned the potential performance implications for queries, but the update cost is also worth noting. I strongly suggest using a generated key instead.
If you're concerned about performance, the best way to know is to create tables that implement your potential design choices, then load them up with massive amounts of data to see what happens.
In theory, very small strings should perform as well as a number in joins. In practice, it would definitely depend upon the database, indexing, and other implementation choices.
In a relational database, you shouldn't use a string in one table that references the same string in another table. If the second table is a look-up, create an identity column for the table, and then reference the integer value in the first. When displaying the data, use a join to the second table. Just make sure in the second table you never actually delete records.
The only exception would be if you are creating an archive table where you want to store exactly what was chosen at a given time.
Sometimes a join will happen on fields that are not "join fields", because that's just the nature of the query (e.g. most ways of identifying records that are duplicates in a particular column). If the query you want relates to those values, then that's what the join will be on, end of story.
If a field genuinely identifies a row, then it is possible to use it as a key. It's even possible to do so if it could change (it brings issues, but not insurmountable issues) as long as it remains a genuine identifier (it'll never change to a value that exists for another row).
The performance impact varies by common query and by database. The indexing strategies of some databases make them better at using varchar and other textual keys than other databases (in particular, hash indices are nice).
Common queries can be such that it becomes more performant to use varchar even without hash indices. A classic example is storing pieces of text for a multi-lingual website. Each such piece of text will have a particular language ID relating to the language it is in. However, obtaining other information about that language (its name, etc.) is rarely needed; what's much more often needed is to either filter by the RFC 5646 code, or to find out what that RFC 5646 code is. If we use a numeric ID, then we will have to join for both types of query to obtain that code. If we use the code as the ID, then the most common queries concerned with the language won't need to look in the language table at all. Most queries that do care about the details of the language also won't need to do any join; pretty much the only time the key will be used as a foreign key is in maintaining referential integrity on update and insert of text, or on deletion of languages. Hence, while the join is less efficient when it is used, the system as a whole will be more efficient by using fewer joins.
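A hedged sketch of the design described above (names are illustrative):

CREATE TABLE languages (
    lang_code VARCHAR(12) PRIMARY KEY,    -- RFC 5646 tag, e.g. 'en-GB'
    lang_name VARCHAR(50) NOT NULL
);

CREATE TABLE site_text (
    text_id   INTEGER,
    lang_code VARCHAR(12) NOT NULL REFERENCES languages (lang_code),
    body      TEXT,
    PRIMARY KEY (text_id, lang_code)
);

-- the most common kind of query needs no join at all:
SELECT body FROM site_text WHERE text_id = 7 AND lang_code = 'en-GB';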
It depends on the nature of your data.
If the string is some user-entered and updated value then I would probably shy away from joining on it. You may run into consistency difficulties from storing the name in both the parent and the detail table.
Nothing has duplicate names?
I have used a string field as a join when using GUIDs or single-character identifiers, or when I know the string to be a natural key (though I almost always prefer a surrogate).
Natural primary keys like a zip code, phone number, email address or user name are by definition strings. They are unique and relatively short.
If you put an index on such a column there is no problem with using it in a join. The impact on performance will usually be minimal.