PostgreSQL: Using BIT VARYING column for bitmask operations

Having used bitmasks for years as a C programmer, I'm attempting to do something similar in Postgres, and it's not working as I expected. So here is a table definition with 2 columns:
CREATE TABLE dummy
(
    countrymask BIT VARYING(255) NOT NULL, -- Yes, it's a pretty wide bitmask
    countryname CHARACTER VARYING NOT NULL
);
So, some data in the "dummy" table would be:
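For example (illustrative rows only; masks shown at width 9 for readability, assuming each country owns one bit position):

countrymask | countryname
------------+------------
000000001   | Albania
000000010   | Algeria
000010000   | Armenia
100000000   | Belarus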
Now, what is the SQL to return Albania, Armenia and Belarus with one SELECT, using the mask '100010001'?
I thought it would be something like this:
SELECT * FROM DUMMY WHERE (countrymask & (b'100010001')) <> 0;
But I get a type mismatch, which I'd love some assistance on.
But also, is this going to work when the typecasting is sorted out?

You would have to use bit strings of the same length throughout, that is bit(255), and store all the leading zeros.
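A minimal sketch at width 9 instead of 255, so the literals stay readable; the same pattern works at bit(255) as long as every mask is written out at full width:

-- assuming the column is redeclared with a fixed width:
--   countrymask bit(9) NOT NULL
SELECT countryname
FROM   dummy
WHERE  countrymask & B'100010001' <> B'000000000';
-- with the illustrative rows above, this returns Albania, Armenia and Belarus

Both operands of & must have exactly the same length, which is why the fixed-width bit(n) type and full-width literals matter.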
This would be simpler if you could use integers and do
WHERE countrymask & 273 <> 0
but there are no integer types with 255 bits supporting the & operator.
Anyway, such a query could never use an index, which is no problem with a tiny table like dummy, but it could be a problem if you want to scan a bigger table.
In a way, that data model violates the first normal form, because it stores several country codes in a single datum. I think that you would be happier with a classical relational model: have a country table that has a numerical primary key filled with a sequence, and use a mapping table to associate rows in another table with several countries.
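For example (a minimal sketch; the dummy_id key and all names are hypothetical):

CREATE TABLE country (
    country_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    countryname text NOT NULL
);

CREATE TABLE dummy (
    dummy_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);

-- the mapping table associates each dummy row with any number of countries
CREATE TABLE dummy_country (
    dummy_id   bigint REFERENCES dummy,
    country_id bigint REFERENCES country,
    PRIMARY KEY (dummy_id, country_id)
);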
An alternative would be to store the countries for a row as an array of country identifiers (bigint[]). Then you can use the “overlaps” operator && to scan the table for rows that have any of the countries in a given array. Such an operation can be made fast with a GIN index.
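For example (assuming the three countries have identifiers 1, 5 and 9):

CREATE TABLE dummy_arr (
    countries bigint[] NOT NULL
);

CREATE INDEX ON dummy_arr USING gin (countries);

-- rows associated with ANY of the given countries:
SELECT *
FROM   dummy_arr
WHERE  countries && ARRAY[1, 5, 9]::bigint[];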

Related

Difference between an array and single-column in SQL

For databases that support arrays -- for example, Postgres -- what would be the difference between the following two items:
name | field_a (array version)
-----+------------------------
Tom  | [1, 2, 3]

And:

name | field_a (single-column version)
-----+--------------------------------
Tom  | 1
Tom  | 2
Tom  | 3

The above would be two 'variations' of combining two tables:

Table "name":

name
----
Tom

Table "numbers":

field_a
-------
1
2
3
If the two versions are not interchangeable, what are the main differences between them?
An array stores the data in a single row, and it needs some kind of processing (different for different databases) before you can access, search, or sort a particular value. It reduces the table size, since repeating data sits in a single row, but it ultimately costs more processing time when it comes to updating, searching, sorting, and most other operations on the stored data.
Single values in each row are generally preferred in databases, as it is easy to find a record, update a particular value, sort the data, and so on.
So, in my view, only the insertion of an array is faster than inserting the individual values, and it saves some table space; all other operations will be more time-consuming. It is therefore better to store individual values in each row.
Databases are designed to handle single values, and operations on single values are faster than operations on arrays.
Simple example of complexity from your question: Replace 2 with 5 in field_a for name = 'Tom'
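For illustration, here is that update in both layouts (PostgreSQL syntax; the table names numbers_rows and numbers_array are hypothetical):

-- single-column version: a plain UPDATE
UPDATE numbers_rows
SET    field_a = 5
WHERE  name = 'Tom' AND field_a = 2;

-- array version: the element has to be replaced inside the array value
UPDATE numbers_array
SET    field_a = array_replace(field_a, 2, 5)
WHERE  name = 'Tom';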
Another way of thinking about it is that an array column is effectively column0, column1, column2, etc., which the DB handles for you, whereas the table version is normalized (1st Normal Form) into rows.
It is, however, harder to enforce a fixed-size array in a normalized structure. You can enforce a maximum by defining a third table with numbers 0, 1, 2 and foreign-keying the child table on that. You cannot enforce a minimum like this (except in certain DBMSs with DB-level constraints).
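For example (PostgreSQL syntax; all names are illustrative):

CREATE TABLE slot (
    position int PRIMARY KEY
);
INSERT INTO slot VALUES (0), (1), (2);

CREATE TABLE field_a (
    name     text NOT NULL,
    position int  NOT NULL REFERENCES slot,
    value    int  NOT NULL,
    PRIMARY KEY (name, position)  -- one value per slot, so at most three per name
);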
Fixed-size arrays are rarely actually necessary. The majority of cases where they are used just break 1st Normal Form.

In SQL, in a given table, in a column with data type text, how can we show the rest of the entries in that column after a particular entry?

In SQL, in any given table, there is a column named "name" with data type text.
If there are ten entries and one of them is "rohit", I want to show all the entries in the name column after "rohit", and I do not know the row id. Can it be done?
select * from your_table where name > 'rohit'
But in general you should not treat text columns like that.
A database is more than a collection of tables.
Think about how to organize your data and what defines a data row.
Maybe, besides their name, there is another way you would classify such rows? Something like "shall be displayed?", "is modified", "is active"?
So if you had a second column, say display of type int, and your table looked like
CREATE TABLE MYDATA (
    NAME    TEXT,
    DISPLAY INT NOT NULL DEFAULT 1
);
you could flag every row with 1 or 0 to mark whether it should be displayed or not, and then your query could look like
SELECT * FROM MYDATA WHERE DISPLAY=1 ORDER BY NAME
to get your list of values.
It's not much of a difference with ten rows, and you don't even need indexes here, but if you build something bigger, say 10,000+ rows, you'd be surprised how slow that can become!
In general, TEXT columns are good to select and display, but should be avoided in a WHERE condition as much as you can. Use describing columns, preferably int fields, which can be indexed with extremely high efficiency, so the application doesn't get slower even when the table grows past 100k records.
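For instance, a plain index on the flag column (the index name is illustrative; the syntax is essentially the same across DBMSs):

CREATE INDEX idx_mydata_display ON MYDATA (DISPLAY);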
You can use the "default" keyword for it.
CREATE TABLE Persons (
ID int NOT NULL,
name varchar(255) DEFAULT 'rohit'
);

Joining same column from same table multiple times

I need to retrieve data from the same table, but divided into different columns.
First table "PRODUCTS" has the following columns:
PROD_ID
PRO_TYPE_ID
PRO_COLOR_ID
PRO_WEIGHT_ID
PRO_PRICE_RANGE_ID
Second table "COUNTRY_TRANSLATIONS" has the following columns:
ATTRIBUTE_ID
ATT_LANGUAGE_ID
ATT_TEXT_ID
Third and last table "TEXT_TRANSLATIONS" has the following columns:
TRANS_TEXT_ID
TRA_TEXT
PRO_TYPE_ID, PRO_COLOR_ID, PRO_WEIGHT_ID and PRO_PRICE_RANGE_ID are all integers and are found in the column ATTRIBUTE_ID multiple times (depending on how many translations are available). ATT_TEXT_ID is then joined with TRANS_TEXT_ID from the TEXT_TRANSLATIONS table.
Basically I need to run a query that retrieves information from TEXT_TRANSLATIONS multiple times. Right now I get an error saying that the correlation is not unique.
The data is available in more than 20 languages, hence the need to work with integers for each of the attributes.
Any suggestion on how I should build up the query? Thank you.
Hopefully you're on an RDBMS that supports CTEs (pretty much everything except MySQL), or you'll have to modify this to refer to the joined tables each time...
WITH Translations (attribute_id, text)
  AS (SELECT c.attribute_id, t.tra_text
        FROM Country_Translations c
        JOIN Text_Translations t
          ON t.trans_text_id = c.att_text_id
       WHERE c.att_language_id = #languageId)
SELECT Products.prod_id,
       Type.text,
       Color.text,
       Weight.text,
       Price_Range.text
  FROM Products
  JOIN Translations AS Type
    ON Type.attribute_id = Products.pro_type_id
  JOIN Translations AS Color
    ON Color.attribute_id = Products.pro_color_id
  JOIN Translations AS Weight
    ON Weight.attribute_id = Products.pro_weight_id
  JOIN Translations AS Price_Range
    ON Price_Range.attribute_id = Products.pro_price_range_id
Of course, personally I think the design of the localization table was botched in two ways:
1. Everything is in the same table (especially without an 'attribute type' column).
2. The language attribute is in the wrong table.
For 1), this is mostly going to be a problem because you now have to maintain system-wide uniqueness of all attribute values. I can pretty much guarantee that, at some point, you're going to run into 'duplicates'. Also, unless you've designed your ranges with a lot of free space, the values for a given type are non-consecutive; if you're not careful, there is the potential for update statements being run over the wrong values, simply because the start and end of a given range belong to the same attribute, but not every value in between does.
For 2), this is because a text can't be completely divorced from its language (and country 'locale'). From what I understand, there are pieces of text that are valid as written in multiple languages, but mean completely different things when read.
You'd likely be better off storing your localizations in something similar to this (only one table shown here, the rest are an exercise for the reader):
Color
=========
color_id -- autoincrement
cyan -- smallint
yellow -- smallint
magenta -- smallint
key -- smallint
-- assuming CMYK palette, add other required attributes
Color_Localization
===================
color_localization_id -- autoincrement, but optional:
-- the tuple (color_id, locale_id) should be unique
color_id -- fk reference to Color.color_id
locale_id -- fk reference to locale table.
-- Technically this is also country dependent,
-- but you can start off with just language
color_name -- localized text
This should make it so that all attributes have their own set of ids, and tie the localized text to what it was localized to directly.
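For completeness, the uniqueness rule from the comments above, written out as DDL:

ALTER TABLE Color_Localization
    ADD CONSTRAINT uq_color_localization UNIQUE (color_id, locale_id);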

MySQL. Working with Integer Data Interval

I've just started using SQL, so I have no idea how to work with non-standard data types.
I'm working with MySQL...
Say, there are 2 tables: Stats and Common. The Common table looks like this:
CREATE TABLE Common (
Mutation VARCHAR(10) NOT NULL,
Deletion VARCHAR(10) NOT NULL,
Stats_id ??????????????????????,
UNIQUE(Mutation, Deletion) );
Instead of the ? symbols there must be some type that references the Stats table (Stats.id).
The problem is that this type must make it possible to save data in a format like 1..30 (an interval between 1 and 30). My idea with this type was to shorten the Common table's length.
Is it possible to do this, or are there other ideas?
Assuming that Stats.id is an INTEGER (if not, change the below items as appropriate):
first_stats_id INTEGER NOT NULL REFERENCES Stats(id)
last_stats_id INTEGER NOT NULL REFERENCES Stats(id)
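Put together, the table could look like this (note that MySQL parses but ignores inline REFERENCES clauses on a column, so the foreign keys are written out explicitly):

CREATE TABLE Common (
    Mutation       VARCHAR(10) NOT NULL,
    Deletion       VARCHAR(10) NOT NULL,
    first_stats_id INT NOT NULL,
    last_stats_id  INT NOT NULL,
    UNIQUE (Mutation, Deletion),
    FOREIGN KEY (first_stats_id) REFERENCES Stats(id),
    FOREIGN KEY (last_stats_id)  REFERENCES Stats(id)
);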
Given that your table contains two VARCHAR fields and a unique index over them, an additional integer field is the least of your concerns as far as memory usage goes (seriously, one extra integer field costs a mere 1 GB of memory per roughly 268 million rows).

SQL static data / lookup lists IDENTIFIER

In regard to static data table design: having static data in tables like these,
Currencies (Code, Name). Row example: USD, United States Dollar
Countries (Code, Name). Row example: DE, Germany
XXXObjectType (Code, Name, ... additional attributes)
...
does it make sense to have another (INTEGER) column as a Primary Key so that all Foreign Key references would use it?
Possible solutions:
Use additional INTEGER as PK and FK
Use Code (usually CHAR(N), where N is small) as PK and FK
Use Code only if less than a certain size... What size?
Other _______
What would be your suggestion? Why?
I have usually used INT IDENTITY columns, but very often the short code is good enough to show to the user in the UI, in which case the query needs one JOIN less.
An INT IDENTITY is absolutely not needed here. Use the 2- or 3-character mnemonics instead. If you have an entity that has no small, unique property, then you should consider using a synthetic key. But currency codes and country codes aren't such a case.
I once worked on a system where someone actually had a table of years, and each year had a YearID. And, true to form, 2001 was year 3, and 2000 was year 4. It made everything else in the system so much harder to understand and query for, and it was for nothing.
Whether you use an INT ID or a CHAR, referential integrity is preserved in both cases.
An INT is 4 bytes long, so it's equal in size to a CHAR(4). If you use CHAR(x) with x < 4, your CHAR key will be shorter than an INT one; with x > 4, it will be longer. For short keys it doesn't usually make sense to use VARCHAR, as it carries a 2-byte overhead. Anyway, for tables with, say, 500 records, the total overhead of a CHAR(5) over an INT key would be just 500 bytes, a laughably small amount for a database where some tables could have millions of records.
Considering that countries and currencies, for example, are limited in number (a few hundred at most), you have no real gain in using an INT ID instead of a CHAR(4); moreover, a CHAR(4) key can be easier to remember for the end user, and can ease your life when you have to debug/test your SQL and/or data.
Therefore, though I usually use an ID INT key for most of my tables, in several circumstances I choose to have a PK/FK made of CHARs: countries, languages, currencies are amongst those cases.
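For example, a sketch of the code-keyed variant (the Prices table is purely illustrative):

CREATE TABLE Currencies (
    Code CHAR(3) PRIMARY KEY,      -- ISO 4217 code, e.g. 'USD'
    Name VARCHAR(64) NOT NULL
);

CREATE TABLE Prices (
    Price_Id     INT PRIMARY KEY,
    Amount       DECIMAL(12, 2) NOT NULL,
    CurrencyCode CHAR(3) NOT NULL REFERENCES Currencies (Code)
);

-- the code is readable on its own, so a listing needs one JOIN less:
SELECT Price_Id, Amount, CurrencyCode FROM Prices;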