Is boolean or tinyint a better choice for data that only have two states in HIVE? - hive

I was designing a data mart for a site where I need to use a boolean datetype to store only 2 states, true or false. I am using Hive.
1.What is a better choice,boolean or tinyint,and why? Should I consider in terms of how many bits they tak up?

Related

What is a portable SQL data type for storing boolean values?

I am doing an ODBC based SQL client and I need to store a boolean value - what is the most portable SQL data type to use for the corresponding column? I already saw the related questions What is the best data type to store boolean values in a database? and Are there reasons for not storing boolean values in SQL as bit data types?, but they do not cover portability per dbms. I assume TINYINT is the way to go since it seems that at least MySQL and MSSQL support it, but maybe someone can give a more precise answer from experience?

Storing flags in SQL column, and indexing them

I need to store a set of flags that are related to an entity into database. Flags might not be the best word because these are not binary information (on/off), but rather a to-be-defined set of codes.
Normally, you would store each information (say each flag value) in a distinct column, but I'm exploring opportunities for storing such information in data structures different than one-column-for-each-attribute to prevent a dramatic increase in column mappings. Since each flag is valid for each attribute of an entity, you understand that for large entities that intrinsically require a large number of columns the total number of columns may grow as 2n.
Eventually, these codes can be mapped to a positional string.
I'm thinking about something like: 02A not being interpreted as dec 42 but rather as:
Flag 0 in position 1 (or zero if you prefer...)
Flag 2 in position 2
Flag A in position 3
Data formatted in such a way can be easily processed by high-level programming languages, because PL/SQL is out of the scope of the question and all these values are supposed to be processed by Java.
Now the real problem
One of my specs is to optimize searching. I have been required to find a way (say, an efficient way) to seek for entities that show a certain flag (or a special 0 flag) in a given position.
Normally, in SQL, given the RDBMS-specific substring function, you would
SELECT * FROM ENTITIES WHERE SUBSTRING(FLAGS,{POSITION},1) = {VALUE};
This works, but I'm afraid it may be a little slow on all platforms but Oracle, which, AFAIK, supports creating secondary indexes mapped to a substring.
However, my solution must work in MySQL, Oracle, SQL Server and DB2 thanks to Hibernate.
Given such a design, is there some, possibly cross-platform, indexing strategy that I'm missing?
If performance is an issue I would go for a some different model here.
Say a table that store entities and a relation 1->N to another table (say: flags table: entId(fk), flag, position) and this table would have an index on flag and position.
The issue here would be to get this flags in a simple column wich can be done in java or even on the database (but it would be difficult to have a cross plataform query to this)
If you want a database-independent, reasonable method for storing such flags, then use typical SQL data types. For a binary flag, you can use bit or boolean (this differs among databases). For other flags, you can use tinyint or smallint.
Doing bit-fiddling is not going to be portable. If nothing else, the functions used to extract particular bits from data differ among databases.
Second, if performance is an issue, then you may need to create indexes to avoid full table scans. You can create indexes on normal SQL data types (although some databases may not allow indexes on bits).
It sounds like you are trying to be overly clever. You should first get the application to work using reasonable data structures. Then you will understand where the performance issues are and can work on fixing them.
I have improved my design and performed a benchmark and found an interesting result.
I created a dummy demographic entity with first/last name columns, birthdate, birthplace, email, SSN...
Then in version 1
I added a column VALIDATION VARCAHR(40) NULL DEFAULT NULL with an index on it.
Instead of positional flags, the new column contains an unordered set of codes each representing a specific format error (e.g. A01 means "last name not specified", etc.). Each code is terminated by a colon : symbol.
Example columns look like
NULL
'A01:A03:A10:'
'A05:'
Typical queries are:
SELECT * FROM ENTITIES WHERE VALIDATION IS {NOT} NULL
Search for entities that are valid/invalid (NULL = no problem)
SELECT * FROM ENTITIES WHERE VALIDATION LIKE '%AXX:';
Selects entities with a specific problem
Then in version 1
I added a column VALID TINYINT NOT NULL with an index which is 0=invalid, 1=valid (Hibernate maps a Boolean to a TINYINT in MySQL).
I added a lookup table
CREATE TABLE ENTITY_VALIDATION (
ID BIGINT NOT NULL PRIMARY KEY,
PERSON_ID LONG NOT NULL, --REFERENCES PERSONS(ID) --Omitted for performance
ERROR CHAR(3) NOT NULL
)
With index on both PERSON_ID and ERROR. This represents the 1:N relationship
Queries:
SELECT * FROM ENTITIES WHERE VALIDATION = {0|1}
Select invalid/valid entities
SELECT * FROM ENTITIES JOIN ENTITY_VALIDATION ON ENTITIES.ID = ENTITY_VALIDATION.PERSON_ID WHERE ERROR = 'Axx';
Selects entities with a given problem
Then I benchmarked
the count(*) function via JUnit+JDBC. So the same queries you see above replace * with COUNT(*).
I did several benchmarks, with entity table containing 100k, 250k, 500k, 750k, 1M entities with a mean ratio entity:flag of 1:3 (there are meanly 3 errors for each entity).
The result
is displayed below. While correct/incorrect entities lookup is equally performing, it looks like MySQL is faster in the LIKE operator rather than in a JOIN, even though there are indexes
Of course,
This was only a benchmark on MySQL. While the approach is cross-platform, the benchmark does not (yet) compare performance in different DBMSes

SQL: best way to store yes/no values? Looking after performance in huge databases

I have some columns where I have to store basically yes/no values.
For example user status for active or inactive. Newsletter suscription status for suscribed or unsuscribed.
Well I want to know (considering tables with a lot of records) if the best way is to put a tiny int with char length of 1 and set 1 for yes, and 0 for no.
Is this a correct thought? Or there are no impact in the performance of db queries when using just words like yes, no, active, inactive, suscribed, etc.
thanks in advance.
Semantically, I suggest you use bit if it's available to you. When looking at the column, any other developer can immediately determine that a boolean value is stored in it. If you don't have bit, try using tinyint. Ensuring that 1 is the only true value and 0 is the only false value will bring consistency. Otherwise, you could end up with a messy mixture of true/false, yes/no, valid/invalid, y/n, and/or t/f.
Comparing bit or tinyint values probably isn't slower than comparing strings, and even if it were slower than comparing strings, I can't imagine it having a significant effect on overall speed.
Is there something you don't like about the 'bit' data type?
The most commonly supported means is to use CHAR(1) - on most databases, it takes the same amount of space as BIT (assuming BIT is available, 1 byte) but supports more values (26 if case insensitive, 52 if not) if there's any chance of supporting more values. Unlike BIT, CHAR(1) is human readable. Also, BIT isn't supported on every database.
If your RDBMS supports bitmap indexes, go for BIT every time. If it doesn't, use whatever you want, there is really no difference between char(1), tinyint (byte).
Are you just asking in general, what the most efficient way to store a yes/no flag is?
Or do you have a performance problem at hand?
If so, when do you have the performance problem (specific queries, inserts, maintenance etc)? What kind of performance gain are you looking for?
2%? 10%? 50%?
Changing datatypes will likely result in only a minor improvement unless we are talking about several hundred million rows. I will give you an example. Let's say that whatever change you did, you shaved of 3 bytes per row. Let's say the table contains 100,000,000 rows. That would be a saving of ~285 mb. Assuming the disk subsystem can provide you 100mb/s you have saved a whopping 3 seconds for a full table scan. Something tells me that the users would think 2 hours and 3 seconds vs 2 hours is same same :)
My intuition would have said performance would have been better with tinyints, but this post doesn't really bare that thought out. This SO post also offers some other interesting opinions.
I do think that performing analysis with data stored as numbers is typically easier than character data. What other programs are you going to have to interface with and use? For example, several of my analysis tools do not read character data at all, so we have to recode any data we receive in the format of "yes", "no", etc.

Why do we care about data types?

Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMS's use principles that were designed in early 80's, when it was vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and keeps its NUMBER's as sets of centesimal digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, non-indexed values were used that were stored as VARCHAR2's, casted dynamically into appropriate datatypes depending on several conditions.
That was quite a special thing, though: it was used for bulk loading key-value pairs in one call to the database using collections.
Dynamic SQL statements were used for parsing data and putting them into appropriate tables based on key name.
All values were loaded to the temporary VARCHAR2 column as is and then converted into NUMBER's and DATETIME's to be put into their columns.
Explicit data types are huge for efficiency, and storage. If they are implicit they have to be 'figured' out and therefore incur speed costs. Indexes would be hard to implement as well.
I would suspect, although not positive, that having explicit types also on average incur less storage space. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... Your question is sort of confusing.
If I understand it correctly, you're asking why it is that we specify data types for table columns, and why it is that the "engine" automatically determines what is needed for the user.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't automatically decided for you, you specify it when you create the database - almost always using SQL.
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that some data item is supposed to be numeric integer, and you deliberately choose NOT to let the DBMS take care of enforcing this, then it becomes YOUR responsibility to ensure all sorts of things such as data integrity (ensuring that no value 'A' can be entered in the column, ensuring that no value 1.5 can be entered in the column), such as consistency of system behaviour (ensuring that the value '01' is considered equal to the value '1', which is not the behaviour you get from type String), ...
Types take care of all those sorts of things for you.
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
Not all databases work this way. SQLite was mentioned earlier, but a much older set of databases also does this, multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require that you specify what type it is. Searches are still (relatively) fast, it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using meta-data (dictionary items), but that is the limit of how you restrict the data.
See the wikipedia article on UniVerse
When you're pushing half a billion rows in 5 months after go live, every byte counts (in our system)
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about datatypes when it comes to filtering (WHERE clause) or sorting (ORDER BY). For example "200" is LOWER than "3" if those values are strings, and the opposite when they are integers.
I believe sooner or later you wil have to sort or filter your data ("200" > "3" ?) or use some aggregate functions in reports (like sum() or (avg()). Until then you are good with text datatype :)
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though, since I've never seen it implemented, I don't know what it would be like to use it. I can see that it would reduce the chance of errors in using values whose implementation happen to be the same, when their conceptual domain is quite different. It might also help keep people from comparing cm and inches, for instance.
Constraint is perhaps the most important thing mentioned here. Data types exist for ensuring the correctness of your data so you are sure you can manipulate it correctly. There are 2 ways we can store a date. In a type of date or as a string "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893" or similar. Datatypes constrain that and defines a canonical form for a date.
Furthermore, a datatype has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be as a date. How about "30th of February 1983"? Poor databases, like MySQL, does not make these checks by default (although you can configure MySQL to do it -- and you should!).
data types will ensure the consistency of your data. This is one of the most important concepts as keeping your data sane will spare your head from insanity.
RDBMs generally require definition of column types so it can perform lookups fast. If you want to get the 5th column of every row in a huge dataset, having the columns defined is a huge optimisation.
Instead of scanning each row for some form of delimiter to retrieve the 5th column (if column widths were not fixed width), the RDBMs can just take the item at sizeOf(column1 - 4(bytes)) + sizeOf(column5(bytes)). Imagine how much quicker this would be on a table of say 10,000,000 rows.
Alternatively, if you don't want to specify the types of each column, you have two options that I'm aware of. Specify each column as a varchar(255) and decide what you want to do with it within the calling program. Or you can use a different database system that uses key-value pairs such as Redis.
database is all about physical storage, data type define this!!!

Char(4) versus int as StatusID/StatusCode column in a table

I need a status column that will have about a dozen possible values.
Is there any reason why I should choose int (StatusID) over char(4) (StatusCode)?
Since sql server doesn't support named constants, char is far more descriptive than int when used in stored procedure and views as constants.
To clarify, I would still use a lookup table either way. Since the I will need a more descriptive text for the UI. So this decision is only to help me as the developer when I'm maintaining the stored procedures and views.
Right now I'm leaning toward char(4). Especially since designing views in SQL Server Management Studio prevents me from adding comments (I know it's possible to add it in the script editor, but realistically I will use the View Designer far more often, especially if the view is trivial). StateCODE = 'NEW' is much more readable than StateID = 1000.
I guess the question is will there be cases where char(4) is problematic, and since the database is pretty small, I'm not too concerned about slight performance hit (like using TinyInt versus int), but more afraid of code maintenance problems.
Database purists will say a key should have no meaning in the business domain, and that you should create a status table where you look up the description and other meanings of the status.
But for operators and end users, having a descriptive status code can be a blessing. And it doesn't even have to be char(4), you can make it varchar(20). This allows them to query without joins, and inspect the database in an easier way.
In the end, I think the char(20) organization will run more smoothly, and go home earlier on Friday. But the int organization has a better abstraction of the database, and they can enjoy meta programming on friday evening (or boosting on forums.)
(All of this assuming that you're writing business support software. One of the more succesful business support systems, SAP, makes successful use of meaningful keys.)
There are many pro's and con's to each method. I'm sure other arguments will come up in favour of using a char(4). My reasons for choosing an int over a char include:
I always use lookup tables. They allow for an audit trail of the value to be retained and easily examined. For example, if one of your status codes is 'MING' and a business decision is made to change it from 'MING' to 'MONG' from a certain date, my lookup table handles this.
Smaller index - if you need to index this column, it will be thinner.
Extendability - OK, I made that word up, but if you need to go from 4 chars to 5 chars for example, a lookup table would be a blessing.
Descriptions: We use a lot of TLA's here which once you know what they are is great but if I gave a business user a report that said "GDA's 2007 1001", they wouldn't necessarily twig that GDA = Good Dead on Arrival. With a lookup table, I can add this description.
Best practice: Can't find the link to hand but it might be something I read in a K.Tripp article. Aim to make your clustered primary key incrementing integers to optimise the index.
Of course if you are absolutely positive that you will never need any more than a handful of 4 characters, there is no reason not to bang it in the table.
The best thing should be a lookup table with defined values and then relate it to original table, that uses that enumeration.
Collation ambigities are one reason to say no to char 4: Does ABcD = abCD = äBCd?
If you have 12 possible values, why not tinyint/byte and a Status table?
If you have to store the status for 10 million rows the 3 bytes different and the collation/string compares add up.
The place where I've run into this use case is columns that would map onto things that I would typically use an Enum for when programming. Do you store the integer value of the Enum or the name of the Enum in the database column? Honestly, I've done it both ways. Usually, I ask myself if the database will be used outside the application I'm building. If so, I will choose the human readable format to store in the database. If not, then I'll choose the integer value as it saves a little time when reconstituting (it's just a cast instead of a parse operation) the Enum in code.
You could also use a tinyint over an int
i always choose int's simply because they are easier to map to enums in code.
If you're dealing with huge amounts of data and high throughput then a smallint or tinyint can give better performance and a smaller footprint on the hard disk. If the data in your application is often viewed directly through applications like Access or Cognos then your business people will probably appreciate the descriptive values. I know that when I'm analyzing data as part of my Database Developer role I get tired of joining a lot of lookup tables because I can't remember if 1 = Foo and 2 = Bar or 1 = Bar and 2 = Foo.
Also, although performance will be enhanced if you have to lookup rows by these codes which can have smaller indexes, it can also be hurt (in a minor way) by having to do the joins if you are often looking up rows regardless of the code but where you have to include the text value. In most applications that's not an issue though and would probably only come into play in large data warehousing/reporting environments.