Amazon Athena partitioning query error "no viable alternative" - sql

I'm trying to create a partitioned table in Amazon Athena so that I can analyze the contents of a bucket containing S3 access logs. I've followed the instructions almost exactly as they are written in the docs, just substituting my own info. However, I keep getting the error line 1:8: no viable alternative at input 'create external' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: 847e3d9c-8d3c-4810-a98c-8527270f8dd8). Here's what I'm entering:
CREATE EXTERNAL TABLE access_data (
`Date` DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties ( 'paths'='`Date`,Time, Uri' )
PARTITIONED BY (dt DATE) STORED AS parquet LOCATION 's3://[source bucket]/';
I've looked at other similar questions on here, but I don't have a hyphenated table name, trailing commas, unbalanced backticks, or missing parentheses, etc., so I really don't know what's wrong. Thanks to anyone who can help!

It appears that these two lines conflict with each other:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' WITH serdeproperties ...
and
STORED AS parquet
Removing one of them allows the table creation to proceed.
Parquet does not store data in JSON format.
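For instance, if the files in the bucket really are Parquet, a sketch of the statement from the question with the JSON SerDe clause dropped would look like this; conversely, if the files are JSON, keep the ROW FORMAT SERDE ... WITH SERDEPROPERTIES clause and drop STORED AS parquet instead:
CREATE EXTERNAL TABLE access_data (
`Date` DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
)
PARTITIONED BY (dt DATE)
STORED AS PARQUET
LOCATION 's3://[source bucket]/';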

Related

Query Athena from s3 database - remove metadata/corrupted data

I was following along with the tutorials for connecting Tableau to Amazon Athena and got hung up when running the query and checking for the expected result. I downloaded student-db.csv from https://github.com/aws-samples/amazon-athena-tableau-integration and uploaded the csv to an S3 bucket that I created. I can create the database within Athena, but when I create a table (either with the bulk add or directly from the query editor) and preview it with a query, the data comes back corrupted: it includes unexpected characters and unnecessary punctuation, sometimes all the data is aggregated into a single column, and it also contains metadata such as "1 ?20220830_185102_00048_tnqre"0 2 ?hive" 3 Query Plan* 4 Query Plan2?varchar8 #H?P?". My Athena-Tableau connection has the same issue when I preview the table that was created with Athena and stored in my bucket.
CREATE EXTERNAL TABLE IF NOT EXISTS student(
`school` string,
`country` string,
`gender` string,
`age` string,
`studytime` int,
`failures` int,
`preschool` string,
`higher` string,
`remotestudy` string,
`health` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://jj2-test-bucket/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1',
'transient_lastDdlTime'='1595149168')
SELECT * FROM "studentdb"."student" limit 10;
Query preview (screenshot)
The solution is to create a separate S3 bucket to house the query results. Athena writes its query results and metadata files into the result location, so when that location is the same bucket your table's LOCATION points at, those result files get read back as table rows, which is what produces the stray characters and the "Query Plan" metadata in the preview. Additionally, when connecting from Tableau you must set the S3 Staging Directory to the location of the query results bucket rather than to the S3 bucket that contains your raw data/csv.
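A rough sketch of that setup (the data/ prefix and the results bucket name are made-up placeholders, not values from the question): keep the CSV under its own prefix, point the table's LOCATION at that prefix only, and send query results to a different bucket.
-- table over the raw CSV only; LOCATION points at a dedicated data prefix
CREATE EXTERNAL TABLE IF NOT EXISTS student(
`school` string,
`country` string,
`gender` string,
`age` string,
`studytime` int,
`failures` int,
`preschool` string,
`higher` string,
`remotestudy` string,
`health` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://jj2-test-bucket/data/'
TBLPROPERTIES ('skip.header.line.count'='1');
-- query result location (Athena workgroup setting, and Tableau's S3 Staging Directory):
-- s3://jj2-athena-query-results/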

Problem in performing SELECT operation on AWS Athena with JSON files with array and struct type data formatting [duplicate]

This question already has answers here:
OpenCSVSerde escapeChar overriding quoteChar
(1 answer)
How to load json snappy compressed in HIVE
(2 answers)
Hive external table read json as textfile
(1 answer)
Closed 1 year ago.
I have created a table in Athena to query S3 data (JSON format, the output from SageMaker).
The file extension is "filename.json.out" and the content is JSON.
The files come from a SageMaker batch transform job. The data I am trying to map has a format like this:
the "col10" key is basically an array of strings, like "col10": ["value1", "value2"]
Now, I've tried to create a table in a database like below:
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
`data` array<struct<
`col1`: string,
`col2`: string,
`col3`: string,
`col4`: string,
`col5`: string,
`col6`: string,
`col7`: string,
`col8`: string,
`col9`: string,
`col10`: array,
`col11`: string,
`col12`: string,
`col13`: string,
`col14`: string,
`col15`: string
>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket_name/object/'
TBLPROPERTIES ('has_encrypted_data'='false');
After creating the table, I tried to see the records by using the query below:
select * from db_name.table_name limit 10
Whenever I try to run the "select" query above, I get an error.
I have followed the links below:
Stackoverflow: How to Create Tables AWS Athena --> Mappings Json Array?
AWS: https://aws.amazon.com/blogs/big-data/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight/
Stackoverflow: https://stackoverflow.com/a/50411984/9409770
And some others that I have forgotten. This problem is driving me crazy.
I have tried querying the same data format on a single file using "S3 Select", and it worked without any problem. In Athena, I cannot make it work. I am sure I am making a silly mistake here, but I cannot figure out where; I am not an expert in this field.
If there is a cheaper alternative for getting this data programmatically from S3 without incurring much cost, please share that too.
Thank you.
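One purely syntactic observation about the DDL as posted, independent of whatever the actual SELECT error says (the error text isn't included above): Hive/Athena complex types must be fully parameterized, so an array field inside a struct needs an element type. Assuming col10 really holds strings, that one field would read
`col10`: array<string>,
rather than the unparameterized
`col10`: array,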

AWS Athena DDL from parquet file with structs as columns

I generated an Athena DDL using a Glue crawler to create an AWS Athena table from a Parquet file stored in S3. However, on copying the DDL and using it in a different AWS account I get the following error:
line 7:25: mismatched input '<'. expecting: ')', ',' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: ...; proxy: null)
Athena DDL
CREATE TABLE x.y(
"questiontext" string,
"dataexporttag" string,
"questiontype" string,
"selector" string,
"subselector" string,
"configuration" struct<ChoiceColumnWidth:bigint,MobileFirst:boolean,QuestionDescriptionOption:string,RepeatHeaders:string,TextPosition:string,WhiteSpace:string>,
"questiondescription" string,
"choices" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>,7:struct<Display:string>,8:struct<Display:string,ExclusiveAnswer:boolean>,9:struct<Display:string>>,
"choiceorder" array<bigint>,
"validation" struct<Settings:struct<ForceResponse:string,ForceResponseType:string,Type:string>>,
"language" array<int>,
"nextchoiceid" bigint,
"nextanswerid" bigint,
"questionid" string,
"questiontext_unsafe" string,
"variablenaming" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>,
"datavisibility" struct<Hidden:boolean,Private:boolean>,
"recodevalues" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>,
"randomization" struct<Advanced:struct<FixedOrder:array<string>,RandomSubSet:array<int>,RandomizeAll:array<string>,TotalRandSubset:bigint,Undisplayed:array<int>>,EvenPresentation:boolean,TotalRandSubset:string,Type:string>,
"defaultchoices" boolean,
"gradingdata" array<int>,
"searchsource" struct<AllowFreeResponse:string>,
"displaylogic" struct<0:struct<0:struct<ChoiceLocator:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,QuestionID:string,QuestionIDFromLocator:string,QuestionIsInLoop:string,RightOperand:string,Type:string>,1:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,2:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,3:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,4:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,5:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,6:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,Type:string>,Type:string,inPage:boolean>,
"analyzechoices" struct<6:string,8:string>,
"answers" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>>,
"answerorder" array<bigint>,
"choicedataexporttags" boolean)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
I'm able to query the table generated by the crawler, and the schema seems correct. Can anyone help me understand why I can't copy the DDL and use it for the same file in a different AWS account?
There are a number of things wrong with the DDL statement. How was it generated? I recommend using SHOW CREATE TABLE x to generate a DDL statement that will work with Athena.
These are some of the problems:
The first line is missing EXTERNAL between CREATE and TABLE.
Column names cannot be quoted with double quotes in Athena DDL. This is a bit weird since that is how you quote them in DML, but DDL is parsed by Hive whereas DML is parsed by Presto, and they have different syntaxes ¯\_(ツ)_/¯. If you need to quote column names in DDL, the correct character is the backtick.
Struct fields cannot start with numbers. Do these structs really have fields with numeric names? Are they in fact arrays?
You're also probably going to have some trouble with the casing of the field names; Athena is case insensitive, and this can trip things up in struct fields, but YMMV.
Glue crawlers are notoriously bad at generating correct schemas when things aren't bog standard and basic. I recommend that you set up your tables by hand and use partition projection.
You may wonder how Glue managed to create a table when the DDL can't be used to create another table. The reason is that Glue crawlers use the Glue API. They don't generate a DDL statement and run it through Athena. The Glue APIs don't impose the same rules, because they are made to support multiple services besides Athena, like Spark and Hadoop on EMR, and Redshift Spectrum. Just because there is a table in Glue Data Catalog does not mean it will work with Athena, unfortunately.
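As an illustration of the first two fixes only (a sketch of the pattern, not the full corrected statement; the struct columns with numeric field names still need to be remodeled, and the LOCATION path here is a placeholder since the posted DDL doesn't include one):
CREATE EXTERNAL TABLE x.y (
`questiontext` string,
`dataexporttag` string,
`questiontype` string,
`selector` string,
`subselector` string
-- the struct/array columns go here once their field names are sorted out
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://your-bucket/path/';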

Presto fails to import PARQUET files from S3

I have a Presto table that reads partitioned PARQUET files from S3, created as follows:
create table hive.data.datadump
(
tUnixEpoch varchar,
tDateTime varchar,
temperature varchar,
series varchar,
sno varchar,
date date
)
WITH (
format = 'PARQUET',
partitioned_by = ARRAY['series','sno','date'],
external_location = 's3a://dev/files');
The S3 folder structure where the parquet files are stored looks like:
s3a://dev/files/series=S5/sno=242=/date=2020-1-23
and the partitioning starts at series.
The original PySpark code that produces the Parquet files declares every column in the schema as a string, and I am trying to import them as strings, but while my create script in Presto successfully creates the table, it then fails to read the data.
On Running,
select * from hive.data.datadump;
I get the following error:
[Code: 16777224, SQL State: ] Query failed (#20200123_191741_00077_tpmd5): The column tunixepoch is declared as type string, but the Parquet file declares the column as type DOUBLE
Can you guys help to resolve this issue?
Thank You in advance!
I ran into the same issue, and I found out that it was caused by one of the records in my source not having a matching datatype for the column it was complaining about. I am sure this is just data. You need to track down the exact record that doesn't have the right type.
This might have been solved already, but just for info: this can also be caused by a column declaration mismatch between the Hive table and the Parquet file. To match columns by name instead of by position, use the property -
hive.parquet.use-column-names=true
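A sketch of where that setting typically lives, assuming a standard Presto Hive connector deployment (the file path below is the conventional catalog location, not something quoted from this thread); catalog property changes only take effect after restarting the Presto servers:
# etc/catalog/hive.properties on each Presto node
hive.parquet.use-column-names=true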

How to load data to Hive table and make it also accessible in Impala

I have a table in Hive:
CREATE EXTERNAL TABLE sr2015(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'collection.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',', 'skip.header.line.count'='1',
'quoteChar'= "\"")
The data is loaded into the table this way:
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
Why is the table only accessible in Hive? When I attempt to access it in the HUE/Impala editor I get the following error:
AnalysisException: Could not resolve table reference: 'sr2015'
which seems to say there is no such table, but the table does show up in the left panel.
In impala-shell, the error is different:
ERROR: AnalysisException: Failed to load metadata for table: 'sr2015'
CAUSED BY: TableLoadingException: Failed to load metadata for table:
sr2015 CAUSED BY: InvalidStorageDescriptorException: Impala does not
support tables of this type. REASON: SerDe library
'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.
I had always thought Hive tables and Impala tables were essentially the same, the only difference being that Impala is a more efficient query engine.
Can anyone help sort this out? Thank you very much.
Assuming that sr2015 is located in a DB called db, in order to make the table visible in Impala, you need to issue either
invalidate metadata db;
or
invalidate metadata db.sr2015;
in the Impala shell.
However, in your case the reason is probably the version of Impala you're using, since it doesn't support this table format (the OpenCSVSerde) at all.
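A workaround sketch, not part of the original answer and only appropriate if the CSV does not rely on quoted fields that contain commas: recreate the table as a plain delimited text table, which Impala can read (the table name sr2015_text is just a placeholder):
CREATE EXTERNAL TABLE sr2015_text(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ('skip.header.line.count'='1');
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015_text;
After loading, run invalidate metadata db.sr2015_text; in the Impala shell so Impala picks up the new table.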