I want to add a unique value to my Hive table whenever I insert a record; that value should not be repeated anywhere in the table. I have not been able to find a solution or a built-in function for this. In my case I want to insert the records into Hive using Pig Latin. Please help.
Hive does not enforce RDBMS-style uniqueness constraints.
The suggested approach using a Pig script is as follows.
1. Load data
2. Apply DISTINCT to data
3. Store data at a location
4. Create external hive table at the same location.
Steps 3 and 4 can be combined if you can use HCatalog, which allows you to store data directly in a Hive table.
Did you take a look at this? https://github.com/manojkumarvohra/hive-hilo It seems to provide a way to generate sequence numbers in Hive using the hi/lo algorithm.
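To illustrate the hi/lo idea itself, here is a minimal Python sketch. It is not the code of the hive-hilo project: the class name `HiLoGenerator`, the `block_size` parameter, and the in-memory counter standing in for a shared store are all assumptions made for this example. Each task fetches a "hi" value once per block and then hands out ids locally, so parallel tasks never collide.

```python
import itertools


class HiLoGenerator:
    """Hands out unique ids using the hi/lo technique (illustrative only)."""

    def __init__(self, next_hi, block_size=100):
        # next_hi: callable returning a fresh 'hi' value on every call.
        # In a real setup this would be a shared store; here it is an assumption.
        self._next_hi = next_hi
        self._block_size = block_size
        self._hi = None
        self._lo = block_size  # forces a 'hi' fetch on first use

    def next_id(self):
        # Fetch a new block only when the current one is exhausted.
        if self._lo >= self._block_size:
            self._hi = self._next_hi()
            self._lo = 0
        value = self._hi * self._block_size + self._lo
        self._lo += 1
        return value


# Hypothetical shared counter; two generators model two parallel tasks.
shared_counter = itertools.count()


def fetch_hi():
    return next(shared_counter)


task_a = HiLoGenerator(fetch_hi, block_size=3)
task_b = HiLoGenerator(fetch_hi, block_size=3)
ids_a = [task_a.next_id() for _ in range(4)]
ids_b = [task_b.next_id() for _ in range(4)]
```

The ids are unique across both tasks but not gap-free or globally ordered, which is the usual trade-off of this technique.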
Suppose I have a non-transactional table in Hive named 'ccm'. It has hundreds of columns and one partition field.
I know how to create a copy with "create table abc like ccm", but I would like abc to be bucketed, stored as ORC, and have transaction support turned on via TBLPROPERTIES.
I do not want to mention all the columns in ccm when I compose the HQL.
Can I do this?
This answer may have the correct way to proceed in your case, and it also explains some limitation of the method used.
Create hive table using "as select" or "like" and also specify delimiter
So, from the example, you should add the missing parts:
CLUSTERED BY (...) INTO ... BUCKETS
TBLPROPERTIES ("transactional"="true")
I have some doubts that you can achieve exactly the result you expect, but I would consider it a step forward.
My hive version is 1.2.0
I am doing Hive-HBase integration where my HBase table is already present.
While creating the Hive table, I was checking whether I can use some of Hive's built-in date functions to define virtual/derived columns, something like this:
create external table `Hive_Test`(
*existing hbase columns*,
*new_column* AS to_date(from_unixtime(unix_timestamp(*existing_column*,'yyyy/MM/dd HH:mm:ss')...
)CLUSTERED BY (..) SORTED BY (new_column) INTO n BUCKETS
..
WITH SERDEPROPERTIES(
'hbase.columns.mapping'=':key,cf:*,:timestamp',
..
)
If there is any other way where I can use built-in functions capability in create table, then please let me know.
Thanks.
With reference to Hive Computed Column: I think you are trying to define logic while creating the table, which is not possible in Hive.
You can refer to this article: Apache Hive Derived Column Support and Alternative
A better way is to create a view on top of the non-native table created for Hive-HBase integration, with which you can do almost any kind of mapping that facilitates your business.
How to find when table rows were last updated/inserted? Presto is ANSI-SQL compliant so even if you don't know Presto, maybe there's a generic SQL way that would point me in the right direction.
I'm using Hadoop. Presto queries are quicker than Hive. "Describe" just gives column names.
https://prestosql.io/docs/current/
Presto 309 added a hidden $properties table in the Hive connector for each table that exposes the Hive table properties. You can use it to find the last update time (replace example with your table name):
SELECT transient_lastddltime FROM "example$properties"
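The value returned is the Hive `transient_lastDdlTime` property, which is stored as seconds since the Unix epoch. A minimal Python sketch for turning it into a readable UTC timestamp follows; the helper name `ddl_time_to_utc` and the sample value are assumptions made for illustration, not output from a real table.

```python
from datetime import datetime, timezone


def ddl_time_to_utc(raw_value):
    """Convert an epoch-seconds string to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(raw_value), tz=timezone.utc)


# Hypothetical value, shaped like what the query above might return.
example = ddl_time_to_utc("1546300800")
print(example.isoformat())
```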
How can I programmatically find all Impala tables that need an INVALIDATE METADATA statement (because they were created in Hive but are not yet known to Impala) or a REFRESH (because a column was added, a data file was added, etc.)?
Invalidate Metadata:
As a workaround, create a shell script that does the following steps.
1. Using beeline, connect to a particular database, run a show tables statement, and save the output to a file.
2. Using impala-shell, connect to the same database, run a show tables statement, and save the output to another file.
3. Compare the two files. The table names that appear only in the first file are the tables that exist in hive but are not yet known to impala.
Note:
a. Instead of handling one database at a time in steps 1 and 2, you can loop over all databases. Inside the loop, append each result to a final output file in a format such as database.table or database_table, so that all tables from all databases end up in a single file. Finally, follow step 3.
b. The table names that appear only in the second output file are tables that were deleted in hive; invalidate metadata needs to be run in impala to remove them from impala's list.
c. A table rename done in impala is recognized by hive, but not vice versa: after a rename in hive, invalidate metadata should be run for both the old and the new table name, to remove and add them respectively in impala. This applies to most operations, not just table renames.
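The comparison step can be sketched in Python as below. This is only an illustration under stated assumptions: the two input lists stand for the saved `show tables` output of beeline and impala-shell, already reduced to one `database.table` name per line, and the function names and sample table names are hypothetical.

```python
def diff_table_lists(hive_tables, impala_tables):
    """Return (only_in_hive, only_in_impala) as sorted lists.

    Names are trimmed and lower-cased before comparison; blank lines
    are ignored.
    """
    hive = {name.strip().lower() for name in hive_tables if name.strip()}
    impala = {name.strip().lower() for name in impala_tables if name.strip()}
    return sorted(hive - impala), sorted(impala - hive)


def invalidate_statements(tables):
    """Build one INVALIDATE METADATA statement per table name."""
    return ["INVALIDATE METADATA {};".format(name) for name in tables]


# Hypothetical contents of the two saved output files.
hive_out = ["sales.orders", "sales.customers", "web.clicks", ""]
impala_out = ["sales.orders", "web.sessions"]

only_in_hive, only_in_impala = diff_table_lists(hive_out, impala_out)
statements = invalidate_statements(only_in_hive + only_in_impala)
```

Tables found only on the hive side are new to impala, and tables found only on the impala side were dropped in hive; both groups get an INVALIDATE METADATA statement, matching notes b and c.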
Refresh:
Consider a text-format table with 2 columns and 1 row of data.
Now suppose a third column is added to that table from beeline.
select * from table; ---gives 3 columns in beeline and 2 columns in impala, since refresh has not been run in impala for this table.
If we run compute stats in impala before running refresh in this case, the column newly added from beeline is removed from the table schema in hive as well.
select * from table; ---gives 2 columns in beeline and 2 columns in impala, since compute stats from impala deleted the extra column from the table metadata, although the data for that column still resides in hdfs. This might cause parsing issues in impala if the column was added in the middle or at the front instead of at the end.
So it is advisable to run REFRESH on the table in impala right after adding a new column or making any other modification to an existing table from beeline, so that the table schema is not lost as in the scenario above.
refresh table; ---run refresh in impala right after the modification in hive.
select * from table; ---gives 3 columns in beeline and 3 columns in impala, since refresh was run before compute stats in impala.
For Pig, the default schema is ByteArray. Is there a default schema for Hive if we don't mention a schema in Hive? I tried to look at some Hive documentation but couldn't find any.
Hive is schema on read --- I am not sure this is the full answer. If someone could give more insight on this, that would be great.
Hive does the best that it can to read the data. You will get lots of null values if there aren’t enough fields in each record to match the schema. If some fields are numbers and Hive encounters nonnumeric strings, it will return nulls for those fields. Above all else, Hive tries to recover from all errors as best it can.
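The behaviour described in that passage can be imitated with a small Python sketch. This is not Hive's own reader code; the function name `read_with_schema`, the comma delimiter, and the sample rows are assumptions made for the example. The declared schema is applied only when a line is read, and anything that does not fit becomes a null (`None`) instead of raising an error.

```python
def read_with_schema(line, schema, delimiter=","):
    """Parse one text line against a list of type casters, null on mismatch."""
    fields = line.split(delimiter)
    row = []
    for index, caster in enumerate(schema):
        if index >= len(fields):
            # Record has fewer fields than the schema declares.
            row.append(None)
            continue
        try:
            row.append(caster(fields[index]))
        except ValueError:
            # Field cannot be read as the declared type.
            row.append(None)
    return row


# Hypothetical schema: (name STRING, age INT, score DOUBLE).
schema = [str, int, float]
lines = ["alice,30,1.5", "bob,abc,2.0", "carol"]
rows = [read_with_schema(line, schema) for line in lines]
```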
There is no default schema in Hive. In order to query data in Hive you first have to create a table describing the content of your data (for example with create external table ... location).
So you basically have to tell Hive the schema before querying the data.