Mapping hbase table with counter column to external hive table? - hive

I am trying to map a table in hive to view an hbase table. I did this without a problem with several columns, but am unsure how to manage with a counter column. Is this possible?
When I scan the hbase table an example value of the counter column is \x00\x00\x00\x00\x00\x00\x00\x01.
I suspect I am setting the column type incorrectly in the hive table. I have tried int and string (both show only nulls in the hive view). Is there a better way at getting the number of increments from this value? The ideal world would be a column in hive that is the sum of all the increments, i assume.
It is entirely possible I am misunderstanding what is possible in viewing the counter (or how the counter was originally setup).

Ended up finding answer through this link on cloudera community.
Answer is to define counter column in hive table as bigint and define the SERDEPROPERTIES with '#b' added to the end to indicate the hbase column type is binary.
For example:
create external table md_extract_file_status ( table_key string, fl_counter bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,colfam:FL_Counter#b )
TBLPROPERTIES('hbase.table.name' ='HBTABLE');

Related

Hbase scan showing hexa character for special character

Inserted data to hbase table through hive external table.
Hive table contains basically 3 columns
id - String
identifier - map<string,string>
src - string
inserted the tale to hbase table.
map data present in hive table for identifier column.
sample map data
{"CUSTID":"CUST4302109","LYLT":"44302109"}
Data inserted to hbase table. while fetching the data through scan command.
O2008031353044301300 column=INTR:IDNFS-string, timestamp=1626515550906, value=CUSTID\x03CUST4301300\x02\x03\x02LYLT\x0344301300
Hexachar are coming instead of special char.
Using below mentioned properties, while creating the hbase hive external table,
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'`enter code here`
WITH SERDEPROPERTIES (
'hbase.columns.mapping'=':key,INTR:CID,INTR:IDNFS-string,INTR:SRC',enter code here
'hbase.table.default.storage.type' ='string',
'serialization.format'='1')
how to get the actual special char?

Can we add column to an existing table in AWS Athena using SQL query?

I have a table in AWS Athena which contains 2 records. Is there a SQL query using which a new column can be inserted in to the table?
You can find more information about adding columns to table in Athena documentation
Or you can use CTAS
For example, you have a table with
CREATE EXTERNAL TABLE sample_test(
id string)
LOCATION
's3://bucket/path'
and you can create another table from sample_test with the query
CREATE TABLE new_test
AS
SELECT *, 'new' AS new_col FROM sample_test
You can use any available query after AS
This is mainly for future readers like me, who was struggling to get this working for Hive table with AVRO data and if you don't want to create new table i.e updating schema of the existing table. It works for csv using 'add columns', but not for Hive + AVRO. For Hive + AVRO, to append columns at the end, before partition columns, the solution is available at this link. However, there are couple of things to note that, we need to pass full schema to the literal attribute and not just the changes; and (not sure why but) we had to alter hive table for all 3 things in the same order - 1. add columns using add columns 2. set tblproperties and 3. set serdeproperties. Hopefully it helps someone.

data appears as null on redshift external table while working right on athena

So I'm trying to run the following simple query on redshift spectrum:
select * from company.vehicles where vehicle_id is not null
and it return 0 rows(all of the rows in the table are null). However when I run the same query on athena it works fine and return results. Tried msck repair but both athena and redshift are using the same metastore so it shouldn't matter.
I also don't see any errors.
The format of the files is orc.
The create table query is:
CREATE EXTERNAL TABLE 'vehicles'(
'vehicle_id' bigint,
'parent_id' bigint,
'client_id' bigint,
'assets_group' int,
'drivers_group' int)
PARTITIONED BY (
'dt' string,
'datacenter' string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table ??
For Spectrum,you have to explicitly set the parameters to treat what should be treated as null
add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns with '' will be treated as NULL to your external table in spectrum
**
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
**
Alternatively,you can setup the SERDE-PROPERTIES while creating the external table which will automatically recognize NULL values
Eventually it turned out to be a bug in redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE properties(‘orc.schema.resolution’=‘position’);
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I used json_serialization_enable to true; before my queries to make the nested JSON columns queryable. This lead to some columns being NULL when the JSON exceeded a size limit, see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.

Bucket is not creating on hadoop-hive

I'm trying to create a bucket in hive by using following commands:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
Command is executing successfully: when I load data into this table, it executes successfully and all data is shown when using select * from emp.
However, on HDFS it is only creating one table and only one file is there with all data. That is, there is no folder for specific country records.
First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
In the above statement I have mention 2 buckets, similarly you can mention any number you want.
Still you are not done!!
After that, while loading data into the table you also have to mention the below hint to hive.
set hive.enforce.bucketing = true;
That should do it.
After this you should be able to see that number of files created under the table directory is same as the number of buckets mentioned in the DDL statement.
Bucketing doesn't create HDFS folders, rather if you want a separate floder to be created for a country then you should PARTITION.
Please go through hive partitioning and bucketing in detail.

automatically partition Hive tables based on S3 directory names

I have data stored in S3 like:
/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN
/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...
My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:
CREATE EXTERNAL TABLE search_input(
col 1 STRING,
col 2 STRING,
...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';
However Hive doesn't recognize any data. Any queries I run return with 0 results. If I instead just grab one of the dates via:
CREATE EXTERNAL TABLE search_input_20140701(
col 1 STRING,
col 2 STRING,
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';
I can query data just fine.
Why doesn't Hive recognize the nested directories with the "date=date_str" partition?
Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?
In order to get this to work I had to do 2 things:
Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
For some reason it would still not recognize my partitions so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
You can use:
SHOW PARTITIONS table;
to check and see that they've been recovered.
I had faced the same issue and realized that hive does not have partitions metadata with it. So we need to add that metadata using ALTER TABLE ADD PARTITION query. It becomes tedious, if you have few hundred partitions to create same queries with different values.
ALTER TABLE <table name> ADD PARTITION(<partitioned column name>=<partition value>);
Once you run above query for all available partitions. You should see the results in hive queries.