DynamoDB EMR Hive Connector writes 1 item at a time - hive

While writing to DynamoDB (on-demand capacity) from Hive with INSERT OVERWRITE TABLE t SELECT * FROM s3data; I notice that it writes one item at a time, which is evident from the consumed write capacity graph below. Here are the settings:
SET dynamodb.throughput.write.percent=1.0;
CREATE EXTERNAL TABLE IF NOT EXISTS t (userId string, categoryName string, score double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "data.reader-score.test",
"dynamodb.column.mapping" = "userId:userId,categoryName:categoryName,score:score",
"dynamodb.throughput.write" = "5000");
Is there any other configuration that must be done?

Related

Snowflake returns error "Number of auto-ingest pipes on location <bucket name> cannot be greater than allowed limit: 50000"

I have a Lambda function that loads data into an AWS Glue table a few times a day and then, at the end, creates an external table in Snowflake. The function ran fine for some time, but now it has started returning this error every now and then:
Number of auto-ingest pipes on location <bucket name> cannot be greater than allowed limit: 50000
The CREATE TABLE SQL looks like the following:
create or replace external table table_name(
row_id STRING as (value:row_id::string),
created TIMESTAMP as (value:created::timestamp)
...
) with location = @conformedstage/table_name/ file_format = (type = parquet);
I have googled this issue and almost all the answers refer to Snowpipe; however, I don't use Snowpipe at all.
Any ideas are appreciated.
When an external table is created with auto_refresh, Snowflake creates an internal pipe to process the events, and you probably have lots of external tables using the same bucket.
Can you try setting AUTO_REFRESH=false?
create or replace external table table_name(
row_id STRING as (value:row_id::string),
created TIMESTAMP as (value:created::timestamp)
...
) with location = @conformedstage/table_name/ file_format = (type = parquet) auto_refresh=false;
https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html
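Note that with auto_refresh disabled, Snowflake no longer registers newly arrived files automatically; a minimal follow-up, assuming the same table_name as above, is to refresh the table's metadata manually after each load (for example at the end of the Lambda run):
-- re-scan the stage location and register any new files
alter external table table_name refresh;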

How to create a Hive table using Ozone?

How can we create a Hive table using the Ozone object store?
In order to create a Hive table, we first need to create a volume and a bucket in Ozone.
Step 1: Create a volume named vol1 in Apache Ozone.
# ozone sh volume create /vol1
Step 2: Create a bucket named bucket1 under vol1.
# ozone sh bucket create /vol1/bucket1
Step 3: Log in to the Beeline shell.
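There is no single fixed command for this step; a typical Beeline connection, assuming a HiveServer2 on a placeholder host hs2.example.com and the default port 10000, looks like this:
# beeline -u "jdbc:hive2://hs2.example.com:10000/default"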
Step 4: Create the Hive database.
CREATE DATABASE IF NOT EXISTS ozone_db;
USE ozone_db;
Step 5: Create the Hive table.
CREATE EXTERNAL TABLE IF NOT EXISTS `employee`(
`id` bigint,
`name` string,
`age` smallint)
STORED AS parquet
LOCATION 'o3fs://bucket1.vol1.om.host.example.com/employee';
Note: Replace om.host.example.com with your Ozone Manager host.
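As a quick sanity check (a sketch assuming the employee table and ozone_db database created above, with made-up test values), you can write a row from Beeline and read it back; the data file should land under the o3fs location:
-- insert a test row and confirm it is readable from the Ozone-backed table
INSERT INTO ozone_db.employee VALUES (1, 'Alice', 30);
SELECT * FROM ozone_db.employee;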
Reference:
https://community.cloudera.com/t5/Community-Articles/Spark-Hive-Ozone-Integration-in-CDP/ta-p/323346

Hive-Hbase integration - issues while inserting data

I was able to successfully integrate Hive and HBase for straightforward scenarios (no partitioning or bucketing), and I could insert data into both Hive and HBase in those simple cases.
I am having issues with a Hive partitioned table stored in HBase. I was able to execute the CREATE DDL statement, but when I try to perform an insert I get an error message saying "Must specify table name". The DDL and the failing insert are below:
CREATE TABLE hivehbase_customer(id int,holdid int,fname string,lname string,address string,zipcode string)
partitioned by (city string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal_data:hold_id,personal_data:f_name,personal_data:l_name,personal_address:address,personal_address:zipcode")
TBLPROPERTIES ("hbase.table.name" = "hivehbase_custom", "hbase.mapred.output.outputtable" = "hivehbase_custom");
insert into table hivehbase_customer partition(city= 'tegacay') values (7394,NULL,NULL,NULL,NULL,29708);
Try the following insert query:
insert into table hivehbase_customer partition(city) values (7394,NULL,NULL,NULL,NULL,29708,'tegacay');
The partition column needs to be specified as the last column in the insert query.
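To confirm the write worked end to end (a quick check using the table names from the DDL above), you can read the row back through the Hive storage handler and, if needed, scan the backing HBase table from the hbase shell:
-- the inserted row should be returned
select * from hivehbase_customer where city = 'tegacay';
-- from the hbase shell: scan 'hivehbase_custom'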

hive transactional table compaction fails

The table was created with this:
create table syslog_staged (
  id string, facility string, sender string, severity string,
  tstamp string, service string, msg string)
partitioned by (hostname string, year string, month string, day string)
clustered by (id) into 20 buckets
stored as orc
tblproperties ("transactional"="true");
The table is populated with Apache NiFi's PutHiveStreaming processor.
alter table syslog_staged partition (hostname="cloudserver19", year="2016", month="10", day="24") compact 'major';
Now it turns out the compaction fails for some reason (from the job history):
No of maps and reduces are 0 job_1476884195505_0031
Job commit failed: java.io.FileNotFoundException: File hdfs://hadoop1.openstacksetup.com:8020/apps/hive/warehouse/log.db/syslog_staged/hostname=cloudserver19/year=2016/month=10/day=24/_tmp_27c40005-658e-48c1-90f7-2acaa124e2fa does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:904)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:113)
at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:966)
at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:962)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:962)
at org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorOutputCommitter.commitJob(CompactorMR.java:776)
at org.apache.hadoop.mapred.OutputCommitter.commitJob(OutputCommitter.java:291)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:285)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
From the Hive metastore log:
2016-10-24 16:33:35,503 WARN [Thread-14]: compactor.Initiator (Initiator.java:run(132)) - Will not initiate compaction for log.syslog_staged.hostname=cloudserver19/year=2016/month=10/day=24 since last hive.compactor.initiator.failed.compacts.threshold attempts to compact it failed.
Please set the properties below, which are needed for compaction of transactional tables:
set hive.compactor.worker.threads=1;
set hive.compactor.initiator.on=true;
I assume you have already set the transactional Hive properties below:
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.support.concurrency=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;
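With those properties in place (the initiator and worker settings take effect on the metastore side), you can re-trigger the compaction and track it; SHOW COMPACTIONS lists each compaction request with its current state:
alter table syslog_staged partition (hostname="cloudserver19", year="2016", month="10", day="24") compact 'major';
-- shows the compaction queue: database, table, partition, type and state
show compactions;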

Issue creating Hive External table using tblproperties

I am trying to create an external table with tblproperties in Hive. The table gets created but it does not return any rows. Any ideas? Please find the scripts I am using below.
Thanks for your time and suggestions in advance.
The data is in a nested folder: /user/test/test1/test2/samplefile.csv
use dw_raw;
drop table if exists temp_external_tab1;
create external table if not exists temp_external_tab1 (
col1 int,
col2 string,
col3 string,
col4 string
)
row format delimited fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/user/test/test1/'
tblproperties ("hive.input.dir.recursive" = "TRUE",
"hive.mapred.supports.subdirectories" = "TRUE",
"hive.supports.subdirectories" = "TRUE",
"mapred.input.dir.recursive" = "TRUE");
These are not table properties, but global settings.
You should set these using 'set', i.e.:
set hive.mapred.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
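After running those two set commands in the same session, a plain query should pick up samplefile.csv in the nested folder (a quick check against the table from the question):
-- rows from /user/test/test1/test2/ should now be returned
select * from temp_external_tab1 limit 10;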
You've created a table but haven't loaded any data into it. Since the file is already in HDFS (not on the local filesystem), try:
hive> LOAD DATA INPATH '/user/test/test1/test2/samplefile.csv'
INTO TABLE temp_external_tab1;
If you are using Ambari, set the following properties in the Hive advanced config, inside the custom hive-site.xml:
SET hive.input.dir.recursive=TRUE
SET hive.mapred.supports.subdirectories=TRUE
SET hive.supports.subdirectories=TRUE
SET mapred.input.dir.recursive=TRUE
And then restart the affected services. This will read all the data recursively.