Can we have constraints in Hive tables? Is it possible to have two tables where one table has a primary key column and the other has a foreign key column referencing it?
No, this is not possible. This is getting into ways in which Hive differs greatly from a traditional RDBMS.
For instance, one crucial feature of Hive is that you can "load" data into it simply by adding files to an HDFS directory that is the underlying location of a Hive table. Since this operation does not use Hive at all, it would have no way of checking constraints for this new data.
Hive is first and foremost a batch processing system; row-level checks of this kind are not the focus at all.
In addition to what Joe K explained, note that Hive is "schema on read": Hive will not complain when data is loaded using "load inpath" etc. It's only during a read that the checks are done. What are you trying to accomplish? The query seems to be more of an RDBMS requirement.
The reading schema could be different for the same underlying data. For example, consider the following sample dataset (focus on the format):
Manufacturer | product | qty| list price
Manu1, CA| home, accessories | 12 | 73.11
Manu2, GA| mobile phone | 25 | 200
Manu3, TX| mattress | 3 | 1000
The Hive table contents would change based on your CREATE TABLE syntax. Consider the basic CREATE TABLE below:
CREATE EXTERNAL TABLE products (
  manufacturer string,
  product string,
  qty double,
  listprice double)
row format delimited
fields terminated by '|'
location '<source location>';
If the '|' in fields terminated by is omitted or replaced by ',', the results of a SELECT on the table change completely. That is where the checks happen: if qty receives a string value it will not parse as a double (Hive typically returns NULL for such values rather than rejecting the row).
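For illustration, here is a hedged sketch of a second table definition over the same files, with ',' as the delimiter (the table name is made up):

CREATE EXTERNAL TABLE products_csv (
  manufacturer string,
  product string,
  qty double,
  listprice double)
row format delimited
fields terminated by ','
location '<source location>';

-- Reading the same '|'-delimited row "Manu1, CA| home, accessories | 12 | 73.11"
-- through this table splits on ',' instead: manufacturer = 'Manu1',
-- product = ' CA| home', and qty gets ' accessories ', which cannot be
-- parsed as a double, so the read yields NULL for that column.
SELECT * FROM products_csv;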
My colleague asked if it was possible to reverse the order of the data in a cluster, so it would look something like the following.
| Normal cluster | Reversed cluster |
|---|---|
| 1 | 2 |
| 1 | 1 |
| 2 | 1 |
I said that I remembered reading that the data is searched through like a binary tree, so it doesn't really matter whether it's reversed or not. But now I can't find anything that mentions how it actually searches through the cluster.
How does BigQuery actually search for a specific value in clusters / partitions?
When you create a clustered table in BigQuery, the data is automatically organized based on the contents of one or more columns in the table's schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of the columns is important, as it determines the sort order of the data.
When you create a partitioned table, data is stored in physical blocks, each of which holds one partition of data. A partitioned table maintains these properties across all operations that modify it. You can typically split large tables into many smaller partitions using data ingestion time, a TIMESTAMP/DATE column, or an INTEGER column.
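As a hedged illustration (the dataset, table, and column names here are made up), a partitioned and clustered table can be declared like this:

CREATE TABLE mydataset.transactions (
  event_time  TIMESTAMP,
  customer_id STRING,
  amount      NUMERIC
)
PARTITION BY DATE(event_time)   -- each day's rows land in their own physical block
CLUSTER BY customer_id;         -- rows within each partition are sorted by customer_id

BigQuery keeps min/max metadata for each storage block, so a filter such as WHERE DATE(event_time) = '2023-05-01' AND customer_id = 'C42' lets it prune partitions and skip blocks whose customer_id range cannot match. That block-elimination approach, rather than a tree walk, is why you cannot specify an ASC/DESC direction for clustering columns and why a "reversed" sort order would not change lookup behavior.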
We have some tables in PostgreSQL with details such as customer transactions, where we have a bunch of old data but new data is constantly coming in. It could look something like:
| id | timestamp | cost | tax |
|---|---|---|---|
| 0 | 1432807913984 | 20.10 | 3.20 |
...
A bunch of us use this data in specific ways, and we'd like to each keep track of which rows have been used for our own purposes.
An example: we decide to apply machine learning to the transactions table, and to keep track of which rows have been learned from, a column is added to indicate that a row has been used for that purpose.
Then, later on, a new task needs to use the transactions table for another post-processing purpose. Another column is added to transactions indicating whether a row has been used or not.
Is there any convention to use for these kinds of scenarios?
I'm not sure about conventions, but you can certainly use an array data type by adding a new column:
alter table table_name add column markers text[];
To mark a record as processed, run:
update table_name set markers = coalesce(markers, array[]::text[]) || array['some_marker'] where ...;
This will add a new marker to the existing ones.
To check whether a record is marked with some_marker, run:
select * from table_name where 'some_marker' = any(markers);
For more info about arrays in PostgreSQL see:
http://www.postgresql.org/docs/9.4/static/arrays.html and
http://www.postgresql.org/docs/9.4/static/functions-array.html
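One hedged addition: the = any(markers) form above cannot use an index, but the equivalent containment check can, if you add a GIN index (the index name here is made up):

create index table_name_markers_idx on table_name using gin (markers);

-- same result as the any() query, but able to use the GIN index:
select * from table_name where markers @> array['some_marker'];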
For database design, if the value of a column comes from a fixed list of strings, such as a status or a type, should I create a new table and have a foreign key, or just store the plain strings in the same table?
For example, I have an orders table with a status column:
----------------------------
| id | price | status |
----------------------------
| 1 | 10.00 | pending |
| 2 | 03.00 | in_progress |
| 3 | xx.xx | done |
An alternative to the table above is to have an order_status table and store a status_id in the orders table. I'm not sure if another table is necessary here.
If it's more than just a few different values and/or values are frequently added, you should go with a normalized data model, i.e. a table.
Otherwise you might go for a plain column, but you need to add a CHECK(status IN ('pending','in_progress','done')) to avoid wrong data. This way you get the same consistency without the FK.
To save space you might use abbreviations (one or a few characters, e.g. 'p', 'i', 'd') but not meaningless numbers (1, 2, 3). Resolving the long values can be done at the view level using CASE.
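A minimal sketch of that combination, with the CHECK on abbreviated values and a view to resolve them (the view name is made up):

CREATE TABLE orders (
  id INT PRIMARY KEY,
  price DECIMAL(10,2),
  status CHAR(1) NOT NULL CHECK (status IN ('p','i','d'))
);

CREATE VIEW orders_v AS
SELECT id, price,
       CASE status
         WHEN 'p' THEN 'pending'
         WHEN 'i' THEN 'in_progress'
         WHEN 'd' THEN 'done'
       END AS status
FROM orders;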
ENUMs are proprietary, so IMHO it's better to avoid them...
It's not a good practice to create a table just for static values.
Instead, you could use the ENUM type, which has a preset list of values, as in the example:
CREATE TABLE orders (
  id INT,
  price DOUBLE,
  status ENUM('pending', 'in progress', 'done')
);
There are pros and cons to each solution; you need to pick the best one for your own project, and you may have to switch later if the initial choice turns out to be bad.
In your case, storing the status directly can be good enough. But if you want to prevent invalid statuses from being stored in your database, or you have very long status text, you may want to store them separately with a foreign key constraint.
ENUM is another solution. However, if you need a new status later, you have to change your table definition, which can be a very bad thing.
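For illustration, adding a hypothetical 'cancelled' status to the ENUM above would require redefining the column (MySQL syntax):

ALTER TABLE orders
  MODIFY COLUMN status ENUM('pending', 'in progress', 'done', 'cancelled');

Appending a value at the end of the list is a cheap metadata change in MySQL, but reordering or removing values forces a full table rebuild.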
If the status has extra data associated with it, like display order or a colour, then you would need a separate table. Also, choosing pre-entered values from a table prevents semi-duplicate values (for example, one person might write "in progress" whereas another might write "in_progress" or "progressing") and aids in searching for orders with the same status.
I would go for a separate table, as it allows more capabilities and reduces errors.
I would use an order_status table with the literal as the primary key. Then, in your orders table, cascade updates on the status column in case you modify the literals in the order_status table. This way you have data consistency and avoid join queries.
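A sketch of that design (column types are assumptions):

CREATE TABLE order_status (
  status VARCHAR(20) PRIMARY KEY
);

CREATE TABLE orders (
  id INT PRIMARY KEY,
  price DECIMAL(10,2),
  status VARCHAR(20) NOT NULL,
  FOREIGN KEY (status) REFERENCES order_status (status) ON UPDATE CASCADE
);

-- Renaming a literal now propagates to every order automatically:
UPDATE order_status SET status = 'in progress' WHERE status = 'in_progress';

Because orders stores the literal itself, reads need no join; the lookup table exists only to constrain (and centrally rename) the allowed values.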
Let's assume that I have N tables for N bookstores. I have to keep the data about books in a separate table for each bookstore, because each table has a different schema (the number and types of columns differ); however, there is a set of columns common to all the bookstore tables.
Now I want to create one "MasterTable" with only a few columns.
MasterTable:

| id | title | isbn |
|---|---|---|
| 1 | abc | 123 |

MasterToBookstores:

| m_id | tb_id | p_id |
|---|---|---|
| 1 | 1 | 2 |
| 1 | 2 | 1 |

BookStore_Foo:

| p_id | title | isbn | date | size |
|---|---|---|---|---|
| 1 | xyz | 456 | 1998 | 3KB |
| 2 | abc | 123 | 2003 | 4KB |

BookStore_Bar:

| p_id | title | isbn | publisher | Format |
|---|---|---|---|---|
| 1 | abc | 123 | H&K | PDF |
| 2 | mnh | 986 | Amazon | MOBI |
My question: is it right to keep data this way? What are the best practices for this and similar cases? Can I give a particular bookstore table an alias with a number, which would help me manage the whole set of tables?
Is there a better way of doing such thing?
I think you are confusing the concepts of "store" and "book".
From your comments and the example data, it appears the problem is in having different sets of attributes for books, not stores. If so, you'll need a structure similar to this:
The symbol in the diagram denotes inheritance¹. The BOOK is the "base class" and BOOK1/BOOK2/BOOK3 are various "subclasses"². This is a common strategy when entities share a set of attributes or relationships³. For a fuller explanation of this concept, please search for "Subtype Relationships" in the ERwin Methods Guide.
Unfortunately, inheritance is not directly supported by current relational databases, so you'll need to transform this hierarchy into plain tables. There are generally three strategies for doing so, as described in these posts:
Interpreting ER diagram
Parent and Child tables - ensuring children are complete
Supertype-subtype database design
NOTE: The structure above allows various book types to be mixed inside the same bookstore. Let me know if that's not desirable (i.e. you need exactly one type of book in any given bookstore)...
¹ Aka. category, subclassing, subtyping, generalization hierarchy, etc.
² I.e. types of books, depending on which attributes they require.
³ In this case, books of all types are in a many-to-many relationship with stores.
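As a hedged sketch of one of those strategies, "table per subtype", applied to the question's data (names and types adapted from the example tables):

-- Base "class": the attributes all books share
CREATE TABLE book (
  book_id INT PRIMARY KEY,
  title   VARCHAR(200),
  isbn    VARCHAR(20)
);

-- Subtypes: each holds only its extra attributes; the PK doubles as a FK
CREATE TABLE book_foo (
  book_id  INT PRIMARY KEY REFERENCES book (book_id),
  pub_date INT,
  size     VARCHAR(10)
);

CREATE TABLE book_bar (
  book_id   INT PRIMARY KEY REFERENCES book (book_id),
  publisher VARCHAR(100),
  format    VARCHAR(10)
);

-- Many-to-many between stores and books; any book type can sit in any store
CREATE TABLE store_book (
  store_id INT,
  book_id  INT REFERENCES book (book_id),
  PRIMARY KEY (store_id, book_id)
);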
If you had at least two columns which all the other tables use, then you could have a base table for all books and add more tables for the rest of the data, using the id from the base table.
UPDATE:
If you use Entity Framework to connect to your DB, I suggest you try this:
Create your entities model something like this:
Then let Entity Framework generate the database (Update Database from Model) for you. Note this uses inheritance (not in the database).
Let me know if you have questions.
Suggested data model:
1. Have a master database, which stores the master data.
2. The dimension tables in the master database are transactionally replicated to your distributed bookstore databases.
3. You can choose to use an updatable subscriber; merge replication is also a good choice.
4. Each distributed bookstore database still works independently; however, master data is merged back either by merge replication or by an updatable subscriber.
5. If you want to ensure master data integrity, you can use a read-only subscriber and transactional replication to distribute master data into the distributed databases, but in this design you need stored procedures in the master database to register your dimension data. Make sure there is no double-hop issue.
I would suggest you have two tables:
bookStores:
id name someMoreColumns
books:
id bookStore_id title isbn date publisher format size someMoreColumns
It's easy to see the relationship here: a bookStore has many books.
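A sketch of those two tables in MySQL (column types are assumptions), which the partitioning example further down builds on:

CREATE TABLE bookStores (
  id   INT PRIMARY KEY,
  name VARCHAR(100)
  -- someMoreColumns
);

CREATE TABLE books (
  id           INT PRIMARY KEY,
  bookStore_id INT NOT NULL,
  title        VARCHAR(200),
  isbn         VARCHAR(20),
  date         YEAR NULL,         -- nullable: not every store tracks this
  publisher    VARCHAR(100) NULL,
  format       VARCHAR(10) NULL,
  size         VARCHAR(10) NULL,
  -- someMoreColumns
  FOREIGN KEY (bookStore_id) REFERENCES bookStores (id)
);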
Note that I'm putting all the columns you have across all of your BookStore tables into just one table, even if some row from some table does not have a value for some column.
Why I prefer this way:
1) Across all the data from the BookStore tables, only a few columns will never have a value in the books table (for example, size and format if you don't have an e-book version). The other columns can be filled in someday (you can set a date for your e-books, but you don't have this column in your BookStore_Bar table, which seems to refer to e-books). This way you can have much more detailed info on all your books if you ever want to update it.
2) If you have a bunch of BookStore tables, let's say 12, you will not be able to handle your data easily. That is, if you want to run some query against all your books (which means against all your tables), you will have at least three options:
First: run the query manually against each of the 12 tables and then merge the data;
Second: write a query with 12 joins, or list all 12 tables in your FROM clause, to query all your data;
Third: depend on some script, stored procedure, or software to do the first or the second option for you;
I like to be able to work with my data as easily as possible and with no dependence on some other script or software, unless I really need it.
3) In MySQL (because I know much more about MySQL) you can use partitions on your books table. It is a higher level of data management in which you can distribute the data from your table across several files on disk instead of just one, as a table is generally allocated. It is very useful when handling a large amount of data in a single table, and it speeds up queries based on your data distribution plan. Let's see an example:
Let's say you already have 12 distinct bookstores, but under my database model. For each row in your books table you'll have an association to one of the 12 bookstores. If you partition your data over bookStore_id, it will be almost the same as having 12 tables, because you can create a partition for each bookStore_id, and each partition will then handle only the related data (the data that matches that bookStore_id).
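A minimal MySQL sketch of that layout (MySQL requires the partitioning column to be part of every unique key, so the primary key from the sketch above becomes composite):

CREATE TABLE books (
  id           INT NOT NULL,
  bookStore_id INT NOT NULL,
  title        VARCHAR(200),
  isbn         VARCHAR(20),
  PRIMARY KEY (id, bookStore_id)
)
PARTITION BY LIST (bookStore_id) (
  PARTITION p01 VALUES IN (1),
  PARTITION p02 VALUES IN (2),
  PARTITION p03 VALUES IN (3)
  -- ...one partition per bookStore_id, up to p12
);

-- Note: MySQL partitioned tables cannot carry foreign keys, so the FK
-- from the earlier sketch is omitted here.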
Let's say you want to query the books table for bookStore_id IN (1, 4, 9). If your query really needs just these three partitions to produce the desired output, then the others will not be scanned, and it will be as fast as querying each separate table.
You can drop a partition and the others will not be affected. You can add new partitions to handle new bookstores. You can subpartition a partition. You can merge two partitions. In a nutshell, you can turn your single books table into an easy-to-handle, multi-storage table.
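For illustration, those management operations look like this in MySQL, using the partition names assumed in the sketch above:

-- a new bookstore gets its own partition
ALTER TABLE books ADD PARTITION (PARTITION p13 VALUES IN (13));

-- dropping a partition removes only that bookstore's rows
ALTER TABLE books DROP PARTITION p03;

-- merging two partitions into one
ALTER TABLE books REORGANIZE PARTITION p01, p02 INTO (
  PARTITION p01_02 VALUES IN (1, 2)
);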
Side Effects:
1) I don't know everything about table partitioning, so it's good to refer to the documentation to learn all the important points for creating and managing it.
2) Take care of your data with regular backups (dumps), as you will probably end up with a very populated books table.
I hope it helps you!
Background
I have a dimension table that has a single record for each day. Each record has a primary key so example data would be:
Dimension Table
---------------
---------------------------------
| ID | DateTime |
---------------------------------
| 1083 | 04/10/2008 10:02:00 PM |
---------------------------------
What I am trying to do is take my source data column which has a SQL datetime value (such as 04/10/2008 10:02:00 PM) and have SSIS derive what the primary key from the dimension table should be (1083 in the above example). I am trying to fit this into the Data Flow within my package and avoid using staging tables.
I would like to call a database function during my data flow to have my SSIS package discover the timeId for a datetime record. I have tried to use the Derived Column transformation, but that doesn't seem to allow the use of T-SQL; rather, only its own built-in expression functions.
Question
Is there another way to do this inside the data flow? Or will I need to use staging tables and an Execute SQL Task outside of the data flow to manipulate my data?
If I understand you correctly, you have a data mart with a time dimension, and you need to get the timeId that corresponds to a particular time.
If that's correct, then you want to use a Lookup component. For the reference table, use something like SELECT timeId, timeStamp FROM TimeDimension, then look up against the input column that holds the timestamp. Use timeId as the output column, and each row in your data flow will then carry the timeId that corresponds to its timestamp.
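A hedged sketch of that reference query (object names taken from this answer, not from the asker's actual schema):

-- Reference query for the Lookup component:
SELECT timeId, timeStamp
FROM TimeDimension;

The Lookup matches on equality, so the incoming datetime must be at the same grain as timeStamp; if the dimension really holds one row per day, truncate the source value first, e.g. CAST(sourceDateTime AS DATE) in the source query.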