Can we use bucketing in a Hive table backed by an Avro schema? - hive

I am trying to create a Hive table backed by an Avro schema. Below is the DDL for that:
CREATE TABLE avro_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
CLUSTERED BY (col_name) INTO N BUCKETS
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ( 'avro.schema.url' = 'hdfs://sandbox.hortonworks.com:8020/avroschema/test_schema.avsc')
But it is throwing the error below:
FAILED: ParseException line 3:3 missing EOF at 'clustered' near ''org.apache.hadoop.hive.serde2.avro.AvroSerDe''
I am not sure whether we can use bucketing in a Hive table backed by Avro or not.
Hive version: 1.2
Can anyone help me or provide any idea of how to achieve this?

Your syntax is in the wrong order and is missing pieces. ROW FORMAT is defined after CLUSTERED BY, and CLUSTERED BY requires a column name, which presumably needs to be defined as part of the CREATE TABLE command.
I assume the N in N BUCKETS is really replaced with your actual number of buckets, but if not, that's another error.
I have formatted the query in your question so that I could read it; comparing it to the syntax here made it easier to spot what the parser didn't like.
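A minimal sketch of the corrected ordering, with an illustrative column name and bucket count (the column would also need to exist in the Avro schema referenced by avro.schema.url):
CREATE TABLE avro_table
(
  col_name STRING
)
CLUSTERED BY (col_name) INTO 4 BUCKETS
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ( 'avro.schema.url' = 'hdfs://sandbox.hortonworks.com:8020/avroschema/test_schema.avsc');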

Related

athena generate table ddl output not working

In the AWS console (for Athena), one has the option of directly running the statement SHOW CREATE TABLE `foo`; or of doing the same thing via the GUI (in the tables option) to produce a single DDL statement (for that table).
I've created multiple DDLs this way and am now (just for experimentation) trying to run them for another database (db2, db3, ...). Here's one of them for example:
CREATE EXTERNAL TABLE db1.anomaly_eh_pred(
"node" string,
"workzone_desc" string,
"eh_pred" double,
"create_dt" timestamp)
COMMENT "test"
PARTITIONED BY (
"site_id" string,
"dt" date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
WITH SERDEPROPERTIES (
'escape.delim'='\\')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://foo/s3-stuff'
TBLPROPERTIES (
'transient_lastDdlTime'='1616731896')
After trying to run this, I get the following error:
line 1:8: mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW'
This error is documented in other places (here, here, and here), and I've compared/tried those potential solutions to no avail. Regardless, even if they did work, I'm confused why a raw, AWS-supplied DDL for an Athena table would simply not work. I've applied no editing whatsoever.
What could be a solution to this issue?

AWS Athena DDL from parquet file with structs as columns

I generated an Athena DDL using a Glue crawler to create an AWS Athena table from a Parquet file stored in S3. However, on copying the DDL and using it in a different AWS account I get the following error:
line 7:25: mismatched input '<'. expecting: ')', ',' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: ...; proxy: null)
Athena DDL
CREATE TABLE x.y(
"questiontext" string,
"dataexporttag" string,
"questiontype" string,
"selector" string,
"subselector" string,
"configuration" struct<ChoiceColumnWidth:bigint,MobileFirst:boolean,QuestionDescriptionOption:string,RepeatHeaders:string,TextPosition:string,WhiteSpace:string>,
"questiondescription" string,
"choices" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>,7:struct<Display:string>,8:struct<Display:string,ExclusiveAnswer:boolean>,9:struct<Display:string>>,
"choiceorder" array<bigint>,
"validation" struct<Settings:struct<ForceResponse:string,ForceResponseType:string,Type:string>>,
"language" array<int>,
"nextchoiceid" bigint,
"nextanswerid" bigint,
"questionid" string,
"questiontext_unsafe" string,
"variablenaming" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>,
"datavisibility" struct<Hidden:boolean,Private:boolean>,
"recodevalues" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>,
"randomization" struct<Advanced:struct<FixedOrder:array<string>,RandomSubSet:array<int>,RandomizeAll:array<string>,TotalRandSubset:bigint,Undisplayed:array<int>>,EvenPresentation:boolean,TotalRandSubset:string,Type:string>,
"defaultchoices" boolean,
"gradingdata" array<int>,
"searchsource" struct<AllowFreeResponse:string>,
"displaylogic" struct<0:struct<0:struct<ChoiceLocator:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,QuestionID:string,QuestionIDFromLocator:string,QuestionIsInLoop:string,RightOperand:string,Type:string>,1:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,2:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,3:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,4:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,5:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,6:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,Type:string>,Type:string,inPage:boolean>,
"analyzechoices" struct<6:string,8:string>,
"answers" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>>,
"answerorder" array<bigint>,
"choicedataexporttags" boolean)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
I'm able to query the table generated by the crawler and the schema seems correct. Can anyone help me understand why I can't copy the DDL and use it for the same file in a different AWS account?
There are a number of things wrong with the DDL statement. How was it generated? I recommend using SHOW CREATE TABLE x to generate a DDL statement that will work with Athena.
These are some of the problems:
The first line is missing EXTERNAL between CREATE and TABLE.
Column names cannot be quoted with double quotes in Athena DDL. This is a bit weird since this is how you quote them in DML, but DDL is parsed by Hive whereas DML is parsed by Presto and they have different syntaxes ¯\_(ツ)_/¯. If you need to quote column names in DDL, the correct character is the backtick.
Struct fields cannot start with numbers. Do these structs really have fields with numeric names? Are they in fact arrays?
You're also probably going to have some trouble with the casing of the field names: Athena is case-insensitive, and this can trip things up in struct fields, but YMMV.
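To illustrate the first two points, here is a sketch of how the opening lines could look with EXTERNAL added and backtick quoting (only the first few columns are shown, the struct columns with numeric field names still need to be reworked, and the LOCATION is a placeholder):
CREATE EXTERNAL TABLE x.y(
  `questiontext` string,
  `dataexporttag` string,
  `questiontype` string
  -- remaining columns omitted
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://your-bucket/your-prefix/'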
Glue crawlers are notoriously bad at generating correct schemas when things aren't bog standard and basic. I recommend that you set up your tables by hand and use partition projection.
You may wonder how Glue managed to create a table when the DDL can't be used to create another table. The reason is that Glue crawlers use the Glue API. They don't generate a DDL statement and run it through Athena. The Glue APIs don't impose the same rules, because they are made to support multiple services besides Athena, like Spark and Hadoop on EMR, and Redshift Spectrum. Just because there is a table in Glue Data Catalog does not mean it will work with Athena, unfortunately.

What is the default schema in Hive?

For Pig, the default schema is ByteArray. Is there a default schema for Hive if we don't mention a schema in Hive? I tried to look at some Hive documentation but couldn't find any.
Hive is schema-on-read, but I am not sure this is the answer. If someone could give some insight on this, that would be great.
Hive does the best that it can to read the data. You will get lots of null values if there aren't enough fields in each record to match the schema. If some fields are numbers and Hive encounters nonnumeric strings, it will return nulls for those fields. Above all else, Hive tries to recover from all errors as best it can.
There is no default schema in Hive. In order to query data in Hive, you first have to create a table describing the content of your data (by using CREATE EXTERNAL TABLE ... LOCATION).
So you basically have to tell Hive the schema before querying the data.
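For example, a minimal sketch, assuming tab-delimited text files under an illustrative HDFS path (the table name, columns, and path are made up):
CREATE EXTERNAL TABLE my_logs
(
  id INT,
  message STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/my_logs';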

Partitioned by non-first column

I have a table, which was created using the following HiveQL script:
CREATE EXTERNAL TABLE Logs
(
ip STRING,
time STRING,
query STRING,
pageSize STRING,
statusCode STRING,
browser STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
-- some regexps
)
STORED AS TEXTFILE
LOCATION '/path';
I need to create partitioning by the time field. But in all the examples I have seen, partitioning is created only on the first field, or on a sequence of fields starting from the first. I also saw that if I put a field in the PARTITIONED BY section, I must not list it in the CREATE TABLE column list.
I tried to create partitioning by time in several ways but always caught different exceptions.
For example this:
ParseException line 11:20 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
or this:
ParseException line 16:0 missing EOF at 'PARTITIONED' near ')'
and so on.
So, how can I create partitioning by time field in my case?
The partition column in Hive is not a real column. It just gives Hive a hint about where to find the files of a specific partition.
So if you have a file whose rows you want to store in different partitions based on one column, there is no automatic way to do this: you have to split the input file on your own and load each split file into a different partition. (In case you don't know how to split a file based on a column, use awk '{print $0 >> "filebase."$2}'.)
Or you can load your input into an unpartitioned table first, and then use a query to insert that data into another, partitioned table, as sketched below.
I hope this can help.
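A minimal sketch of the second approach, assuming the data has already been loaded into the unpartitioned Logs table and that dynamic partitioning is enabled (the partitioned table name and location are illustrative):
CREATE EXTERNAL TABLE LogsPartitioned
(
  ip STRING,
  query STRING,
  pageSize STRING,
  statusCode STRING,
  browser STRING
)
PARTITIONED BY (time STRING)
STORED AS TEXTFILE
LOCATION '/path_partitioned';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- the partition column must come last in the SELECT list
INSERT OVERWRITE TABLE LogsPartitioned PARTITION (time)
SELECT ip, query, pageSize, statusCode, browser, time
FROM Logs;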

Create hive table for schema less avro files

I have multiple Avro files and each file has a STRING in it. Each Avro file is a single row. How can I create a Hive table that consumes all the Avro files located in a single directory?
Each file has a big number in it and hence I do not have any JSON-style schema that I can relate to. I might be wrong when I say schema-less, but I cannot find a way for Hive to understand this data. This might be very simple, but I am lost since I have tried numerous different ways without success. I have created tables pointing to a JSON schema as the Avro URI before, but that is not the case here.
For more context, the files were written using the Crunch API:
final Path outcomesVersionPath = ...
pipeline.write(fruit.keys(), To.avroFile(outcomesVersionPath));
I tried the following query, which creates the table but does not read the data properly:
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
If your data set only has one STRING field then you should be able to read it from Hive with a single column called data (or whatever you would like) by changing your DDL to:
CREATE EXTERNAL TABLE test_table
(data STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
And then read the data with:
SELECT data FROM test_table;
Use the Avro tools JAR to see the Avro schema for any given binary file, as shown here!
Then just link the schema file while creating the table.
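A sketch of that approach, assuming the schema has been dumped with the avro-tools getschema command (java -jar avro-tools.jar getschema somefile.avro > values.avsc) and uploaded to an illustrative HDFS path:
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///avroschema/values.avsc');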