I was following along with the tutorials for connecting Tableau to Amazon Athena and got stuck when running the query: it does not return the expected result. I downloaded student-db.csv from https://github.com/aws-samples/amazon-athena-tableau-integration and uploaded the CSV to an S3 bucket that I created. I can create the database within Athena, but when I create a table (either with the bulk add or directly from the query editor) and preview it with a query, the data comes back corrupted: it includes unexpected characters and punctuation, sometimes all the data is aggregated into a single column, and it also contains metadata such as "1 ?20220830_185102_00048_tnqre"0 2 ?hive" 3 Query Plan* 4 Query Plan2?varchar8 #H?P?". I also get the same issues through my Athena-Tableau connection when I preview the table that was created with Athena and stored in my bucket.
CREATE EXTERNAL TABLE IF NOT EXISTS student(
  `school` string,
  `country` string,
  `gender` string,
  `age` string,
  `studytime` int,
  `failures` int,
  `preschool` string,
  `higher` string,
  `remotestudy` string,
  `health` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://jj2-test-bucket/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='1',
  'transient_lastDdlTime'='1595149168');
SELECT * FROM "studentdb"."student" limit 10;
[Query preview screenshot showing the corrupted rows]
The solution is to create a separate S3 bucket to house the query results. Athena writes its query results (CSV output plus metadata files) to the query result location; if that location is the same bucket and prefix as the table's LOCATION, the table ends up reading Athena's own output files as data, which is where the garbled rows and the "Query Plan" metadata come from. Additionally, when connecting Tableau you must set the S3 Staging Directory to the location of the query result bucket, rather than connecting to the S3 bucket that contains your raw data/CSV.
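For example, a setup along these lines keeps the table's data and Athena's query results apart (a sketch only; the data prefix and the results bucket name are placeholders):
-- Hypothetical layout: the CSV lives under a data-only prefix, while query results
-- go to a separate bucket.
--   raw data:      s3://jj2-test-bucket/data/student-db.csv
--   query results: s3://jj2-test-athena-results/   (set as Athena's query result
--                  location and as Tableau's S3 Staging Directory)
CREATE EXTERNAL TABLE IF NOT EXISTS student(
  `school` string,
  `country` string,
  `gender` string,
  `age` string,
  `studytime` int,
  `failures` int,
  `preschool` string,
  `higher` string,
  `remotestudy` string,
  `health` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://jj2-test-bucket/data/'
TBLPROPERTIES ('skip.header.line.count'='1');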
I generated an Athena DDL using a Glue crawler to create an AWS Athena table from a Parquet file stored in S3. However, on copying the DDL and using it in a different AWS account I get the following error:
line 7:25: mismatched input '<'. expecting: ')', ',' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: ...; proxy: null)
Athena DDL
CREATE TABLE x.y(
"questiontext" string,
"dataexporttag" string,
"questiontype" string,
"selector" string,
"subselector" string,
"configuration" struct<ChoiceColumnWidth:bigint,MobileFirst:boolean,QuestionDescriptionOption:string,RepeatHeaders:string,TextPosition:string,WhiteSpace:string>,
"questiondescription" string,
"choices" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>,7:struct<Display:string>,8:struct<Display:string,ExclusiveAnswer:boolean>,9:struct<Display:string>>,
"choiceorder" array<bigint>,
"validation" struct<Settings:struct<ForceResponse:string,ForceResponseType:string,Type:string>>,
"language" array<int>,
"nextchoiceid" bigint,
"nextanswerid" bigint,
"questionid" string,
"questiontext_unsafe" string,
"variablenaming" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>,
"datavisibility" struct<Hidden:boolean,Private:boolean>,
"recodevalues" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>,
"randomization" struct<Advanced:struct<FixedOrder:array<string>,RandomSubSet:array<int>,RandomizeAll:array<string>,TotalRandSubset:bigint,Undisplayed:array<int>>,EvenPresentation:boolean,TotalRandSubset:string,Type:string>,
"defaultchoices" boolean,
"gradingdata" array<int>,
"searchsource" struct<AllowFreeResponse:string>,
"displaylogic" struct<0:struct<0:struct<ChoiceLocator:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,QuestionID:string,QuestionIDFromLocator:string,QuestionIsInLoop:string,RightOperand:string,Type:string>,1:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,2:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,3:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,4:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,5:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,6:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,Type:string>,Type:string,inPage:boolean>,
"analyzechoices" struct<6:string,8:string>,
"answers" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>>,
"answerorder" array<bigint>,
"choicedataexporttags" boolean)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
I'm able to query the table generated by the crawler and the schema seems correct. Can anyone help me understand why I can't copy the DDL and use it for the same file in a different AWS account?
There are a number of things wrong with the DDL statement. How was it generated? I recommend using SHOW CREATE TABLE x to generate a DDL statement that will work with Athena.
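For example, run this in the account where the crawler created the table and copy the statement it prints (table name taken from the DDL in the question):
-- Athena prints a CREATE TABLE statement in its own dialect, which can then be
-- replayed in the other account:
SHOW CREATE TABLE x.y;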
These are some of the problems:
The first line is missing EXTERNAL between CREATE and TABLE.
Column names cannot be quoted with double quotes in Athena DDL. This is a bit weird since that is how you quote them in DML, but DDL is parsed by Hive whereas DML is parsed by Presto, and they have different syntaxes ¯\_(ツ)_/¯. If you need to quote column names in DDL, the correct character is a backtick.
Struct fields cannot start with numbers. Do these structs really have fields with numeric names? Are they in fact arrays?
You're also probably going to have some trouble with the casing of the field names: Athena is case insensitive, and this can trip things up in struct fields, but YMMV.
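Putting those points together, the start of a working statement might look something like this (a sketch only; whether the numeric-keyed structs should really be arrays or maps depends on the underlying Parquet schema, and the LOCATION is a placeholder):
CREATE EXTERNAL TABLE x.y(
  `questiontext` string,
  `dataexporttag` string,
  `questiontype` string,
  `selector` string,
  `subselector` string,
  `configuration` struct<choicecolumnwidth:bigint,mobilefirst:boolean,questiondescriptionoption:string,repeatheaders:string,textposition:string,whitespace:string>,
  `choiceorder` array<bigint>
  -- ... remaining columns, with the numeric-keyed structs remodelled ...
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://your-bucket/path/';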
Glue crawlers are notoriously bad at generating correct schemas when things aren't bog standard and basic. I recommend that you set up your tables by hand and use partition projection.
You may wonder how Glue managed to create a table when the DDL can't be used to create another table. The reason is that Glue crawlers use the Glue API. They don't generate a DDL statement and run it through Athena. The Glue APIs don't impose the same rules, because they are made to support multiple services besides Athena, like Spark and Hadoop on EMR, and Redshift Spectrum. Just because there is a table in Glue Data Catalog does not mean it will work with Athena, unfortunately.
I tried creating a Hive external table:
CREATE EXTERNAL TABLE TestXML (storexml string)
STORED AS TEXTFILE
LOCATION 'wasb:///test/';
However, when I try executing a query like the one below, it's not able to extract the fields:
SELECT
xpath_string (storexml, '/trades/trade/USI')
FROM TestXML;
I saw a post that talked about specifying the input format:
add JARS <>
set xmlinput.element=Store;
CREATE EXTERNAL TABLE EventStoreXML (storexml string)
STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'wasb:///eventstore#tradedata.blob.core.windows.net/';
I could not determine which JARs to include in the ADD JARS statement. I am using HDInsight on Linux.
Any pointers will be appreciated.
-Madhu
Realised the issue was that the XML contained carriage returns; as a result, Hive was not able to read the XML.
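A quick way to check for this, and a workaround for stray carriage returns, is sketched below (the XPath is the one from the question; regexp_replace is a standard Hive function):
-- If each XML document should be one row, a larger count means the documents
-- still contain embedded newlines/carriage returns and are being split up:
SELECT count(*) FROM TestXML;

-- Carriage returns that survive inside a single row can be stripped in the query:
SELECT xpath_string(regexp_replace(storexml, '\r', ''), '/trades/trade/USI') AS usi
FROM TestXML;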
I am trying to create a Hive table backed by an Avro schema. Below is the DDL for that:
CREATE TABLE avro_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
CLUSTERED BY (col_name) INTO N BUCKETS
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ( 'avro.schema.url' = 'hdfs://sandbox.hortonworks.com:8020/avroschema/test_schema.avsc')
But it is throwing the error mentioned below:
FAILED: ParseException line 3:3 missing EOF at 'clustered' near ''org.apache.hadoop.hive.serde2.avro.AvroSerDe''
I am not sure whether we can use bucketing in a Hive table backed by Avro or not.
Hive version: 1.2
Can anyone help me or provide any idea of how to achieve this?
Your syntax is in the wrong order, and missing pieces. ROW FORMAT is defined after CLUSTERED BY, and CLUSTERED BY requires a column name, which presumably needs to be defined as part of the CREATE TABLE command.
I assume the N in N BUCKETS is really replaced with your actual number of buckets, but if not, that's another error.
I have formatted the query in your question so that I could read it, and comparing it to the syntax here, it was easier to spot what the parser didn't like.
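Put together, a statement with the clauses in the order Hive expects might look like this (a sketch only: the column name and the bucket count are placeholders, and the column is declared explicitly so that CLUSTERED BY has something to refer to):
-- Clause order per the Hive CREATE TABLE grammar: columns, CLUSTERED BY,
-- ROW FORMAT, STORED AS, TBLPROPERTIES. `col_name` and 4 buckets are placeholders.
CREATE TABLE avro_table (col_name string)
CLUSTERED BY (col_name) INTO 4 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url' = 'hdfs://sandbox.hortonworks.com:8020/avroschema/test_schema.avsc');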
I have a table, which was created using the following HiveQL script:
CREATE EXTERNAL TABLE Logs
(
ip STRING,
time STRING,
query STRING,
pageSize STRING,
statusCode STRING,
browser STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
-- some regexps
)
STORED AS TEXTFILE
LOCATION '/path';
I need to partition by the time field. But in all the examples I have seen, partitioning is done only by the first field, or by a sequence of fields starting with the first. I also saw that if I put a field in the PARTITIONED BY section, I must not put it in the CREATE TABLE column list.
I tried to create partitioning by time in several ways but always caught different exceptions.
For example this:
ParseException line 11:20 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
or this:
ParseException line 16:0 missing EOF at 'PARTITIONED' near ')'
and so on.
So, how can I create partitioning by the time field in my case?
The partition column in Hive is not a real column. It just gives Hive a hint about where to find the files of a specific partition.
So if you have a file and you want to store its rows in different partitions based on one column of that file, there is no automatic way to do this: you have to split the input file yourself and load each split into its own partition. (In case you don't know how to split a file based on a column, use awk: {print $0>>"filebase."$2;})
Or you can load your input into an unpartitioned table first, and then use a query to insert that data into another, partitioned table, as in the sketch below.
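A sketch of that second approach, assuming a hypothetical partitioned copy of the table and a placeholder path (the regex SerDe settings are left out for brevity; the partition column moves out of the regular column list):
-- Hypothetical partitioned copy of the Logs table; `time` is now a partition
-- column, so it is declared in PARTITIONED BY rather than in the column list.
CREATE EXTERNAL TABLE LogsPartitioned
(
  ip STRING,
  query STRING,
  pageSize STRING,
  statusCode STRING,
  browser STRING
)
PARTITIONED BY (time STRING)
STORED AS TEXTFILE
LOCATION '/path_partitioned';

-- Dynamic partitioning lets Hive create one partition per distinct time value;
-- the partition column must come last in the SELECT list.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE LogsPartitioned PARTITION (time)
SELECT ip, query, pageSize, statusCode, browser, time
FROM Logs;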
I hope this can help.