How to fix org.apache.kafka.common.config.ConfigException: Missing required configuration "group.id" which has no default value - hive

I have a Kafka topic set up and am attempting to create an external table in Hive to query the Kafka stream.
However, when querying the external table I get the following error:
Error: java.io.IOException: org.apache.kafka.common.config.ConfigException: Missing required configuration "group.id" which has no default value. (state=,code=0)
I tried putting group.id in server.properties when starting the Kafka server.
I also tried putting group.id in the external table properties:
CREATE EXTERNAL TABLE kafka_table2
(`timestamp` timestamp , `page` string, `newPage` boolean,
added int, deleted bigint, delta double)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES
("kafka.topic" = "connect-test", "kafka.bootstrap.servers"="mykafka:9092","kafka.group.id"="1")
INFO : Completed compiling command(queryId=hive_20190426082255_729f8adb-bb23-4317-8f3f-2f9049b62bd7); Time taken: 0.6 seconds
INFO : Executing command(queryId=hive_20190426082255_729f8adb-bb23-4317-8f3f-2f9049b62bd7): select * from kafka_table2
INFO : Completed executing command(queryId=hive_20190426082255_729f8adb-bb23-4317-8f3f-2f9049b62bd7); Time taken: 0.018 seconds
INFO : OK
Error: java.io.IOException: org.apache.kafka.common.config.ConfigException: Missing required configuration "group.id" which has no default value. (state=,code=0)

You should put "kafka.consumer.group.id"="1" and not "kafka.group.id"="1" in TBLPROPERTIES.
See: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_set_consumer_producer.html
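For example, the DDL from the question with the consumer group property renamed (same topic and broker as above) should look like this:
-- Same table definition as in the question, with kafka.group.id replaced by kafka.consumer.group.id,
-- which is the property the Kafka storage handler passes through to the consumer's group.id
CREATE EXTERNAL TABLE kafka_table2
(`timestamp` timestamp, `page` string, `newPage` boolean,
 added int, deleted bigint, delta double)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "connect-test",
  "kafka.bootstrap.servers" = "mykafka:9092",
  "kafka.consumer.group.id" = "1"
);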

Related

DBT RUN - Getting Database Error using VS Code BUT Not Getting Database Error using DBT Cloud

I'm using DBT connected to Snowflake. I use DBT Cloud, but we are moving to using VS Code for our DBT project work.
I have an incremental DBT model that compiles and works without error when I issue the DBT RUN command in DBT Cloud. Yet when I attempt to run the exact same model from the same git branch using the DBT RUN command from the terminal in VS Code, I get the following error:
Database Error in model dim_cifs (models\core_data_warehouse\dim_cifs.sql)
16:45:31 040050 (22000): SQL compilation error: cannot change column LOAN_MGMT_SYS from type VARCHAR(7) to VARCHAR(3) because reducing the byte-length of a varchar is not supported.
The table in Snowflake defines this column as VARCHAR(50). I have no idea why DBT is attempting to change the data length or why it only happens when the command is run from VS Code Terminal. There is no need to make this DDL change to the table.
When I view the compiled SQL in the Target folder there is nothing that indicates a DDL change.
When I look in the logs I find the following, but don't understand what is triggering the DDL change:
describe table "DEVELOPMENT_DW"."DBT_XXXXXXXX"."DIM_CIFS"
16:45:31.354314 [debug] [Thread-9 (]: SQL status: SUCCESS 36 in 0.09 seconds
16:45:31.378864 [debug] [Thread-9 (]:
In "DEVELOPMENT_DW"."DBT_XXXXXXXX"."DIM_CIFS":
Schema changed: True
Source columns not in target: []
Target columns not in source: []
New column types: [{'column_name': 'LOAN_MGMT_SYS', 'new_type': 'character varying(3)'}]
16:45:31.391828 [debug] [Thread-9 (]: Using snowflake connection "model.xxxxxxxxxx.dim_cifs"
16:45:31.391828 [debug] [Thread-9 (]: On model.xxxxxxxxxx.dim_cifs: /* {"app": "dbt", "dbt_version": "1.1.1", "profile_name": "xxxxxxxxxx", "target_name": "dev", "node_id": "model.xxxxxxxxxx.dim_cifs"} */
alter table "DEVELOPMENT_DW"."DBT_XXXXXXXX"."DIM_CIFS" alter "LOAN_MGMT_SYS" set data type character varying(3);
16:45:31.546962 [debug] [Thread-9 (]: Snowflake adapter: Snowflake query id: 01a5bc8d-0404-c9c1-0000-91b5178ac72a
16:45:31.548895 [debug] [Thread-9 (]: Snowflake adapter: Snowflake error: 040050 (22000): SQL compilation error: cannot change column LOAN_MGMT_SYS from type VARCHAR(7) to VARCHAR(3) because reducing the byte-length of a varchar is not supported.
Any help is greatly appreciated.
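No answer is given above, but one hedged pointer: in dbt 1.x, whether an incremental model issues column ALTERs is governed by the model's on_schema_change setting (the "Schema changed: True" / "New column types" lines in the debug log come from that schema-change check), so it is worth confirming how the model is configured in each environment. A minimal sketch of an incremental model's config header, purely for illustration (the body below is a placeholder, not the poster's dim_cifs SQL):
-- on_schema_change controls how dbt handles schema drift on incremental runs:
-- 'ignore' (the default) leaves existing columns alone, while 'sync_all_columns'
-- lets dbt issue ALTERs like the one shown in the log above.
{{
  config(
    materialized = 'incremental',
    on_schema_change = 'ignore'
  )
}}

-- placeholder body for illustration only
select 'ABC' as loan_mgmt_sys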

Loading CSV data containing string and numeric fields into Ignite is failing

I am evaluating Apache Ignite and trying to load CSV data into it. I have created a table in Ignite:
jdbc:ignite:thin://127.0.0.1/> create table if not exists SAMPLE_DATA_PK(SID varchar(30),id_status varchar(50), active varchar, count_opening int,count_updated int,ID_caller varchar(50),opened_time varchar(50),created_at varchar(50),type_contact varchar, location varchar,support_incharge varchar,pk varchar(10) primary key);
I tried to load data into this table with the following command:
copy from '/home/kkn/data/sample_data_pk.csv' into SAMPLE_DATA_PK(SID,ID_status,active,count_opening,count_updated,ID_caller,opened_time,created_at,type_contact,location,support_incharge,pk) format csv;
But the data load is failing with this error:
Error: Server error: class org.apache.ignite.internal.processors.query.IgniteSQLException: Value conversion failed [column=COUNT_OPENING, from=java.lang.String, to=java.lang.Integer] (state=50000,code=1)
java.sql.SQLException: Server error: class org.apache.ignite.internal.processors.query.IgniteSQLException: Value conversion failed [column=COUNT_OPENING, from=java.lang.String, to=java.lang.Integer]
at org.apache.ignite.internal.jdbc.thin.JdbcThinConnection.sendRequest(JdbcThinConnection.java:1009)
at org.apache.ignite.internal.jdbc.thin.JdbcThinStatement.sendFile(JdbcThinStatement.java:336)
at org.apache.ignite.internal.jdbc.thin.JdbcThinStatement.execute0(JdbcThinStatement.java:243)
at org.apache.ignite.internal.jdbc.thin.JdbcThinStatement.execute(JdbcThinStatement.java:560)
at sqlline.Commands.executeSingleQuery(Commands.java:1054)
at sqlline.Commands.execute(Commands.java:1003)
at sqlline.Commands.sql(Commands.java:967)
at sqlline.SqlLine.dispatch(SqlLine.java:734)
at sqlline.SqlLine.begin(SqlLine.java:541)
at sqlline.SqlLine.start(SqlLine.java:267)
at sqlline.SqlLine.main(SqlLine.java:206)
Below is the sample data I am trying to load:
SID|ID_status|active|count_opening|count_updated|ID_caller|opened_time|created_at|type_contact|location|support_incharge|pk
|---|---------|------|-------------|-------------|---------|-----------|----------|------------|--------|----------------|--|
INC0000045|New|true|1000|0|Caller2403|29-02-2016 01:16|29-02-2016 01:23|Phone|Location143||1
INC0000045|Resolved|true|0|3|Caller2403|29-02-2016 01:16|29-02-2016 01:23|Phone|Location143||2
INC0000045|Closed|false|0|1|Caller2403|29-02-2016 01:16|29-02-2016 01:23|Phone|Location143||3
INC0000047|Active|true|0|1|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||4
INC0000047|Active|true|0|2|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||5
INC0000047|Active|true|0|489|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||6
INC0000047|Active|true|0|5|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||7
INC0000047|AwaitingUserInfo|true|0|6|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||8
INC0000047|Closed|false|0|8|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||9
INC0000057|New|true|0|0|Caller4416|29-02-2016 06:10||Phone|Location204||10
I need help understanding how to figure out what the issue is and how to resolve it.
You have to upload the CSV without the header line, which contains the column names. The error is thrown when Ignite tries to convert the string value "count_opening" to an Integer.
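Once the header row is removed from the file, the same COPY command should succeed; a quick sanity check on the loaded data (against the SAMPLE_DATA_PK table defined above) could be:
-- verify all rows arrived and the numeric columns were parsed as integers
SELECT count(*) AS loaded_rows,
       min(count_opening) AS min_opening,
       max(count_updated) AS max_updated
FROM SAMPLE_DATA_PK;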

Hive 3.1+ doesn't deserialize Avro 1.8.3+ messages from Kafka 1.0+

Let's say I have a topic created via Kafka Streams from Confluent which contains messages serialized in Avro with io.confluent.kafka.streams.serdes.avro.SpecificAvroSerializer.
Then I create an external Kafka table in Hive:
CREATE EXTERNAL TABLE k_table
(`id` string , `sequence` int)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES
(
"kafka.topic" = "sample-topic",
"kafka.bootstrap.servers"="kafka1:9092",
"kafka.serde.class"="org.apache.hadoop.hive.serde2.avro.AvroSerDe",
"avro.schema.url"="Sample.avsc"
);
When I run the query:
select * from k_table WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '2' DAYS)
I get an unexpected IO error:
INFO : Executing command(queryId=root_20190205160129_4579b5ff-9a5c-496d-8d03-9a7ccc0f6d90): select * from k_tickets_prod2 WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '1' minute)
INFO : Completed executing command(queryId=root_20190205160129_4579b5ff-9a5c-496d-8d03-9a7ccc0f6d90); Time taken: 0.002 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
Error: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 55 (state=,code=0)
Everything works fine with the Confluent Kafka consumer, and I also tried setting the Confluent Kafka deserializer in TBLPROPERTIES, but that seems to have no effect.
Environment:
Hive 4.0 + Beeline 3.1.1 + Kafka 1.1 (Clients & Broker) + Confluent 4.1
The problem is that the Confluent producer serializes Avro messages in a custom format: <magic_byte 0x00><4 bytes of schema ID><regular Avro bytes for an object that conforms to the schema>. The Hive Kafka handler therefore has trouble deserializing them, because it uses the basic byte-array Kafka deserializer and these 5 bytes at the beginning of the message are unexpected.
I've filed a bug against Hive to support the Confluent format and the Schema Registry, and I've also made a PR with a quick fix that strips the 5 bytes from the message when the "avro.serde.magic.bytes"="true" property is set in TBLPROPERTIES.
After this patch it works like a charm.
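For reference, a sketch of what the table definition looks like with that quick fix in place; the extra property is the one introduced by the PR, so it only has an effect on a build that includes the patch:
-- same DDL as above, plus the flag from the patch that strips the
-- 5-byte Confluent prefix (magic byte + 4-byte schema ID) before Avro decoding
CREATE EXTERNAL TABLE k_table
(`id` string, `sequence` int)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES
(
  "kafka.topic" = "sample-topic",
  "kafka.bootstrap.servers" = "kafka1:9092",
  "kafka.serde.class" = "org.apache.hadoop.hive.serde2.avro.AvroSerDe",
  "avro.schema.url" = "Sample.avsc",
  "avro.serde.magic.bytes" = "true"
);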

Presto DDL on S3 using FileHiveMetastore not working

I tried to connect Presto to S3 using FileHiveMetastore with the configuration below, but when I try to create a table with the statement shown, it fails with the error message below. Could anyone let me know if the configuration is wrong?
I can see that this should be possible, as someone has already described connecting this way.
Reference thread: Setup Standalone Hive Metastore Service For Presto and AWS S3
Error message: com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: 33F01AA7477B12FC)
connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=s3://ap-south-1.amazonaws.com/prestos3test/
hive.s3.aws-access-key=yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
hive.s3.aws-secret-key=zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
hive.s3.endpoint=http://prestos3test.s3-ap-south-1.amazonaws.com
hive.s3.ssl.enabled=false
hive.metastore.uri=thrift://localhost:9083
External Table Creation
CREATE TABLE PropData (
prop0 integer,
prop1 integer,
prop2 varchar,
prop3 varchar ,
prop4 varchar
)
WITH (
format = 'ORC',
external_location = 's3://prestos3test'
)
Thanks
Santosh
I got help from other quarters and thought it would be helpful to others, hence I'm documenting the necessary config below.
connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=s3://prestos3test/
hive.s3.aws-access-key=yyyyyyyyyyyyyyyyyy
hive.s3.aws-secret-key=zzzzzzzzzzzzzzzzzzzzzz
hive.s3.ssl.enabled=false
hive.metastore.uri=thrift://localhost:9083
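With the catalog directory pointing at the bucket itself (and no custom hive.s3.endpoint), the original DDL should run; a minimal re-check, assuming the catalog is named hive and the default schema is used (both names are placeholders, adjust to your setup):
-- hypothetical catalog/schema names (hive.default)
CREATE TABLE hive.default.PropData (
  prop0 integer,
  prop1 integer,
  prop2 varchar,
  prop3 varchar,
  prop4 varchar
)
WITH (
  format = 'ORC',
  external_location = 's3://prestos3test'
);

-- confirm the table is visible and the bucket is reachable
SELECT count(*) FROM hive.default.PropData;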
Thanks
Santosh

Problems using Hive + Cassandra community

I am trying to use Hive 0.13 to access Cassandra 2.0.8 column families created with CQL3.
Here is how I created my column families:
CREATE KEYSPACE IF NOT EXISTS Identification
WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
'DC1' : 2 };
USE Identification;
CREATE TABLE IF NOT EXISTS entitylookup (
name varchar,
value varchar,
entity_id uuid,
PRIMARY KEY ((name, value), entity_id))
WITH
caching=all
;
I followed the instructions from the README of this project: https://github.com/tuplejump/cash/tree/master/cassandra-handler
I generated hive-cassandra-1.2.6.jar, copied it and cassandra-all-1.2.6.jar, cassandra-thrift-1.2.6.jar to hive lib folder.
Then I started hive and tried the following:
CREATE EXTERNAL TABLE identification.entitylookup(name string, value string, entity_id binary)
STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' WITH SERDEPROPERTIES("cql.primarykey" = "name, value", "cassandra.host" = "localhost", "cassandra.port "= "9160")
TBLPROPERTIES ("cassandra.ks.name" = "identification", "cassandra.ks.stratOptions"="'DC1':2", "cassandra.ks.strategy"="NetworkTopologyStrategy");
Here is the output:
hive> mvalle#mvalle:~/hadoop$ hive
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/05/30 12:02:02 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
Logging initialized using configuration in jar:file:/home/mvalle/hadoop/apache-hive-0.13.0-bin/lib/hive-common-0.13.0.jar!/hive-log4j.properties
OpenJDK 64-Bit Server VM warning: You have loaded library /home/mvalle/hadoop/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
hive> CREATE EXTERNAL TABLE identification.entitylookup(name string, value string, entity_id binary)
> STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' WITH SERDEPROPERTIES("cql.primarykey" = "name, value", "cassandra.host" = "ident.s1mbi0se.com", "cassandra.port "= "9160")
> TBLPROPERTIES ("cassandra.ks.name" = "identification", "cassandra.ks.stratOptions"="'DC1':2", "cassandra.ks.strategy"="NetworkTopologyStrategy");
FAILED: SemanticException [Error 10072]: Database does not exist: identification
Question: what can I do to get more information about what is going wrong? I tried the same Hive command using "Identification" (capital I), but got the same result. Is it possible to access CQL3 column families in Cassandra community edition? It seems the keyspace has not been mapped, but I don't see how to map them. In DSE, they are mapped automatically...
EDIT:
To clarify further: if I create an empty database and then try to create the external table, here is what I get:
hive> create database identification;
OK
Time taken: 0.154 seconds
hive> CREATE EXTERNAL TABLE identification.entity_lookup(name string, value string, entity_id binary)
> STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler' WITH SERDEPROPERTIES("cql.primarykey" = "name, value", "cassandra.host" = "localhost", "cassandra.port "= "9160")
> TBLPROPERTIES ("cassandra.ks.name" = "identification", "cassandra.ks.stratOptions"="'DC1':3", "cassandra.ks.strategy"="NetworkTopologyStrategy");
OK
Time taken: 3.58 seconds
hive> select * from identification.entity_lookup limit 10;
OK
Exception in thread "main" java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
at org.apache.hadoop.hive.cassandra.input.cql.HiveCqlInputFormat.getSplits(HiveCqlInputFormat.java:166)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:418)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:561)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:534)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:137)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1488)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:285)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
The error is not because Cash couldn't map the keyspace, but because the database is not present in Hive.
Just create the database in Hive using:
CREATE DATABASE identification;
That should get it working.