I'm building a data warehouse, and the data is of a quality where 8 fields may be required to uniquely identify a record, and this applies to three tables, each of which will have a few million rows of data per year. It's all 0NF.
Obviously every situation is unique, but considering that the purpose of the data warehouse is OLAP, am I right in thinking that I would be better off creating a single column to use as the primary key rather than a composite primary key of 8 separate fields? It's straightforward to concatenate the fields into an extra column as part of the ETL pipeline.
I appreciate that the redundancy increases the storage requirement, and we are talking millions of rows a year, but I'm guessing it'll significantly improve query performance, and reduce memory requirements if the data is modelled in a BI tool?
Can anybody give me any general thoughts or advice on this please?
Below is some entirely made-up simulated data. I need to link the order table to the shipment table to get where the order was shipped from, for example, or maybe link the order table to the shipment table to sum the quantity shipped.
I don't think normalising the tables is the way to go, as all four of the columns I'm using here would be subject to change, and only in combination do they form a reliable key for a unique shipment.
Because of this the data is bulk deleted/inserted based on shift date.
Thanks!
Phil.
Those look like fact tables. In a dimensional model only dimension tables need single-column keys. Fact tables typically have compound keys made up of the dimension foreign keys that define the fact table grain.
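A minimal sketch of what that looks like, with made-up table and column names chosen only for illustration:

-- Dimension: one surrogate key column
CREATE TABLE dim_warehouse (
    warehouse_key   INT          NOT NULL PRIMARY KEY,  -- surrogate key
    warehouse_code  VARCHAR(20)  NOT NULL,              -- natural/business key
    warehouse_name  VARCHAR(100) NOT NULL
);

-- Fact: the grain is defined by the dimension foreign keys,
-- which together form a compound primary key
CREATE TABLE fact_shipment (
    order_key        INT NOT NULL,
    warehouse_key    INT NOT NULL REFERENCES dim_warehouse (warehouse_key),
    ship_date_key    INT NOT NULL,
    quantity_shipped INT NOT NULL,
    PRIMARY KEY (order_key, warehouse_key, ship_date_key)
);

BI tools then join the fact to the small dimension tables on those integer keys, rather than on an 8-field composite or a long concatenated string.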
If I have multiple key value pairs in Azure Blob Storage such as:
-/files/key1
-/files/key2
-/files/key3
And each key is uploaded by a user, but a user can upload multiple keys, what is the best table design in my SQL database to reference what keys are associated with what user?
A) Table with single column - Every time I add a file to BLOB storage I add a row to a single-column table with the username and the associated key value, i.e.:
AssociationColumn
-User1+key1
-User2+key2
-User1+key3
Will this be slow when looking up all the keys for User1, for example, if I query using some sort of "starts with" pattern match? Would making this two columns, with the user in one column and the key in another, affect performance at all? How can I achieve this one-to-many relationship?
Also is it bad to store keys using an identifier such as 1-2-n? Any suggestions on how to create unique identifiers that can fit in the space of varchar(MAX)?
The correct approach in a relational database is to have a junction table. This would have at least two columns:
User
Key
You wouldn't put these in a single column -- not even in Azure, I don't think.
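A minimal sketch, with assumed table/column names and lengths -- adjust the types to match your actual usernames and blob key paths:

CREATE TABLE UserBlobKeys (
    UserName  VARCHAR(100) NOT NULL,
    BlobKey   VARCHAR(400) NOT NULL,
    PRIMARY KEY (UserName, BlobKey)   -- one row per (user, key) pair
);

-- All keys for one user: a plain equality lookup, no pattern matching needed
SELECT BlobKey
FROM UserBlobKeys
WHERE UserName = 'User1';

With the user in its own column, the lookup is a seekable equality predicate on the primary key rather than a "starts with" scan over a concatenated string.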
Hi everyone, I have a project I am working on that consists of keeping tables the same at 3 different locations.
I pull data that doesn't exist from each of these locations into a corporate table, then I need to send the new data back down to the locations so they are all the same.
The table I am pulling from has an identity column.
My question is: in SQL, is there any way to make a table behave like an identity without making it an identity, e.g. by setting the default value to max(id)+1? This is the only way I can figure out how to keep the data structure the same without going to replication.
The problem is that you're generating records in an IDENTITY field in multiple sources, then unable to combine them without those records being assigned new IDENTITY values.
By using a GUID as your key field, each of the 3 databases can create records which will have a unique ID, and you'll be able to then combine them without issue. You can still have a UNIQUE constraint on the field, but the likelihood of generating the same GUID is astronomically small.
Most replication processes utilize this GUID approach at some level already, so it's a common solution to this problem.
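For example, in SQL Server (the table and column names here are purely illustrative):

-- Each location generates its own globally unique ids, so rows from the
-- three sites can be merged into the corporate table without collisions.
CREATE TABLE LocationData (
    RowId     UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
    SiteCode  VARCHAR(10)      NOT NULL,   -- illustrative payload columns
    LoadedAt  DATETIME         NOT NULL
);

If index fragmentation from random GUIDs is a concern, SQL Server's NEWSEQUENTIALID() can be used as the column default instead of NEWID().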
I have a problem finding a way to represent multiple hash tables in a single table.
Say I have 3 tables with the format:
Table1(Table1_PK1,Table1_PK2,Table1_PK3,Table1_Hash)
Table2(Table2_PK1,Table2_PK2,Table2_Hash)
Table3(Table3_Pk1,Table3_PK2,Table3_PK3,Table3_PK4,Table3_PK5,Table3_Hash)
Table1_PK1,Table1_PK2,Table1_PK3... are columns and they might have different datatypes (VARCHAR, INT or DATETIME ...).
My question is whether there is a way to create a single table (with a fixed number of columns) that can represent all of these 3 tables (maybe more in practice).
I am trying to do this for my database tool. Each table is actually a table which contains primary keys and the hash data associated with them.
Since you're apparently building a database tool, not a database, it might make more sense to do this in application code rather than in a database table.
In a different answer, you commented
I am still looking for a dynamic way to do it without knowing how many primary keys a table can have.
A table can have only one primary key. That primary key can consist of more than one column, though. (You already knew this; you were just using the wrong words, which might confuse others.)
A table can also have an arbitrary number of other keys, which will be either declared (as NOT NULL UNIQUE) or "undeclared" (by creating an index that guarantees uniqueness over a set of columns).
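For example (an illustrative table, not from the question): one multi-column primary key, a declared alternate key, and an "undeclared" one:

CREATE TABLE employee (
    company_id  INT          NOT NULL,
    employee_no INT          NOT NULL,
    badge_no    VARCHAR(20)  NOT NULL UNIQUE,   -- a declared alternate key
    email       VARCHAR(200) NOT NULL,
    PRIMARY KEY (company_id, employee_no)       -- the one primary key (two columns)
);

-- An "undeclared" key: uniqueness enforced only by an index
CREATE UNIQUE INDEX ux_employee_email ON employee (email);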
You can look all that stuff up at run time in one or both of two ways (the PostgreSQL documentation covers both, for example).
System tables, sometimes called system catalogs
information_schema views
As far as I know, all modern SQL platforms implement at least one of these interfaces. The information_schema views are covered in the SQL standards, but there seems to be some room for interpretation. They don't look quite the same on all platforms.
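A sketch of the information_schema route in PostgreSQL -- this lists the columns that make up a table's primary key (the schema and table names are placeholders):

SELECT kcu.column_name, kcu.ordinal_position
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
  ON kcu.constraint_name = tc.constraint_name
 AND kcu.table_schema    = tc.table_schema
 AND kcu.table_name      = tc.table_name
WHERE tc.constraint_type = 'PRIMARY KEY'
  AND tc.table_schema    = 'public'
  AND tc.table_name      = 'your_table'   -- placeholder
ORDER BY kcu.ordinal_position;

Your tool can run this (or the equivalent against the system catalogs) for each table it encounters, so it never needs to know the number of key columns in advance.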
Why combine the 3 tables into one? Would be really bad db design. But here's a way to do it:
The one table will have a column for each of the 3 tables' columns you want in the final table. I am making the assumption that TableX_Hash is the same type, so that remains as one unique column:
Table_All_in_One (
Table1_PK1,
Table1_PK2,
Table1_PK3,
# space just for clarity of grouping
Table2_PK1,
Table2_PK2,
Table3_PK1,
Table3_PK2,
Table3_PK3,
Table3_PK4,
Table3_PK5,
TableX_Hash # Assuming all the _Hash'es are the same type+length,
# otherwise, add Table1_Hash, Table2_Hash, Table3_Hash
# This can be your new primary key
)
The Primary Keys (PKx) are required to be non-NULL only in their own tables. For this table, they have to allow nulls. The idea is that each row of this new table will only hold the data for one of the tables. The other columns will be empty for that row. If you want to associate the row of one table with another, you can add that to the same row or add FK_Table1_Hash, FK_Table2_Hash and FK_Table3_Hash columns which will refer to the TableX_Hash value of a record.
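A sketch of that in SQL, with nullable columns and made-up datatypes -- use whatever types the source tables actually have:

CREATE TABLE Table_All_in_One (
    Table1_PK1  VARCHAR(50) NULL,   -- NULL unless the row came from Table1
    Table1_PK2  INT         NULL,
    Table1_PK3  DATETIME    NULL,
    Table2_PK1  VARCHAR(50) NULL,   -- NULL unless the row came from Table2
    Table2_PK2  INT         NULL,
    Table3_PK1  VARCHAR(50) NULL,   -- NULL unless the row came from Table3
    Table3_PK2  VARCHAR(50) NULL,
    Table3_PK3  INT         NULL,
    Table3_PK4  INT         NULL,
    Table3_PK5  DATETIME    NULL,
    TableX_Hash CHAR(32)    NOT NULL PRIMARY KEY  -- assumes all hashes share one type/length
);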
PS: I wonder if what you are really looking for is a View and not this really bad all-in-one table.
Edit: Combining them into one "without knowing how many primary keys a table can have." as per your comment:
Store all the _PKs concatenated into one column:
Table_All_in_One (
New_PK,
TableX_Hash,
Table1_PKx, # Concatenated PKs of Table1
Table2_PKx, # Concatenated PKs of Table2, etc.
...,
# OR just one
TableX_PKs, # concatenate all the PK's into one VARCHAR field
# Add a pipe `|` between them optionally.
Table_Num # If using just one, then you'll need to store the table number
)
You will not be able to conveniently pick records based on part of their composite primary key. It will always have to be TableX_PKs = CONCAT_WS('|', Table1_PK1, Table1_PK2, ...). So your only dependency is the number of PKs in the original table.
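For example, loading and looking up Table1's rows in the concatenated-key variant might look like this (MySQL syntax, since CONCAT_WS is a MySQL function; it assumes New_PK is an AUTO_INCREMENT column and the other names are as above):

-- Load Table1's rows, concatenating its key columns into one string
INSERT INTO Table_All_in_One (TableX_Hash, TableX_PKs, Table_Num)
SELECT Table1_Hash,
       CONCAT_WS('|', Table1_PK1, Table1_PK2, Table1_PK3),
       1                                  -- Table_Num: 1 = Table1
FROM Table1;

-- Looking a row back up requires rebuilding the whole concatenated key
SELECT TableX_Hash
FROM Table_All_in_One
WHERE Table_Num = 1
  AND TableX_PKs = CONCAT_WS('|', 'pk1value', 'pk2value', 'pk3value');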
In order to model a bunch of tables you will need 3 tables. An entity (or "factor") table that contains the table names of the tables you wish to set up this way. A factor_detail table that contains all the columns and their associated properties for those tables. And a factor_detail_value table for storing things like lookup values for lookup tables.

I'm trying to learn more about this myself, because we are using this technique at work. You generate SQL on the fly for any table encoded this way, and store the data in a repository pertinent to the data itself. That way, if a table changes and you need to add a column or change a datatype, you can add a row to the factor_detail table without needing a database shutdown in production. In most businesses a four-hour shutdown to make a SQL table change can cost thousands of dollars. If you are dealing with insurance, for example, each additional state that you sell insurance in has different requirements for being able to sell it, and that will result in table changes. We reduced our table count way down from over 700 tables in this manner, and we can make changes without a database shutdown, thus avoiding the loss in revenue.
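A rough sketch of that kind of metadata schema -- the table and column names here are only guesses at what is described above, not the actual design:

CREATE TABLE factor (
    factor_id   INT          NOT NULL PRIMARY KEY,
    table_name  VARCHAR(128) NOT NULL              -- the logical table being described
);

CREATE TABLE factor_detail (
    factor_detail_id INT          NOT NULL PRIMARY KEY,
    factor_id        INT          NOT NULL REFERENCES factor (factor_id),
    column_name      VARCHAR(128) NOT NULL,
    data_type        VARCHAR(50)  NOT NULL,        -- e.g. 'VARCHAR(50)', 'INT'
    is_nullable      CHAR(1)      NOT NULL         -- 'Y' / 'N'
);

CREATE TABLE factor_detail_value (
    factor_detail_id INT          NOT NULL REFERENCES factor_detail (factor_detail_id),
    lookup_value     VARCHAR(255) NOT NULL,        -- lookup/code values
    PRIMARY KEY (factor_detail_id, lookup_value)
);

Adding a column to a logical table is then an INSERT into factor_detail rather than an ALTER TABLE against a live database.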
I am working on a legacy database and am not able to change the schema :( In a couple of tables the primary key uses multiple columns.
In the app I read the data in each row into a table the user then updates the data and I write the data back into the table.
Currently I concatenate the various PK columns and store them as a unique id for when I put the data back into the table.
Now I was wondering if there is a more efficient way to do that. Coming from a MySQL background I am not aware of any, but thought SQL Server 2005 may have a function like
SELECT PRIMARYKEY() as pk, ... FROM table WHERE ...
The above would select the key that the database engine uses as the primary key for the given record.
I searched and couldn't find anything. It's probably just me being fussy, but I don't like the concatenation trick.
DC
In SQL Server, there is no equivalent of PRIMARYKEY() that I'm aware of. You can consult the system catalog views to find out which columns make up the primary key, but you can't simply select the primary key value(s) with a function call.
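For example, this catalog-view query lists the columns of a table's primary key in SQL Server 2005 and later (the table name is a placeholder):

SELECT c.name AS column_name, ic.key_ordinal
FROM sys.key_constraints AS kc
JOIN sys.index_columns   AS ic
  ON ic.object_id = kc.parent_object_id
 AND ic.index_id  = kc.unique_index_id
JOIN sys.columns         AS c
  ON c.object_id  = ic.object_id
 AND c.column_id  = ic.column_id
WHERE kc.type = 'PK'
  AND kc.parent_object_id = OBJECT_ID('dbo.YourTable');   -- placeholder

That tells you which columns form the key, but you still read and write the key as those individual column values.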
I would agree with StarShip3000 - what do you concatenate your PK values for? While I don't think a compound primary key made up of several columns is necessarily a very good idea, if it's a legacy system and you can't change it, I wouldn't bother concatenating the PK values on read, and then having to split them apart again when you write your data back. Just leave the structure as it is - compound keys aren't generally recommended, but they are indeed supported, no problem.
"Currently I concatenate the various PK columns and store them as a unique id for when I put the data back into the table."
Can't you just store the pk as two columns in the target table and use that to join back to the two columns on the source table?
What benefit is concatenating giving you here?
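For instance, writing the edited rows back by joining on the two key columns directly might look like this in T-SQL (table and column names are made up for illustration):

-- Update the legacy table from the app's working table by joining on the
-- two key columns, instead of splitting a concatenated id back apart
UPDATE legacy
SET    legacy.SomeValue = work.SomeValue
FROM   LegacyTable AS legacy
JOIN   WorkTable   AS work
  ON   work.KeyCol1 = legacy.KeyCol1
 AND   work.KeyCol2 = legacy.KeyCol2;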