AWS Athena DDL from parquet file with structs as columns - amazon-s3

I generated an Athena DDL using a Glue crawler to create an AWS Athena table from a Parquet file stored in S3. However, when I copy the DDL and use it in a different AWS account I get the following error:
line 7:25: mismatched input '<'. expecting: ')', ',' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: ...; proxy: null)
Athena DDL
CREATE TABLE x.y(
"questiontext" string,
"dataexporttag" string,
"questiontype" string,
"selector" string,
"subselector" string,
"configuration" struct<ChoiceColumnWidth:bigint,MobileFirst:boolean,QuestionDescriptionOption:string,RepeatHeaders:string,TextPosition:string,WhiteSpace:string>,
"questiondescription" string,
"choices" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>,7:struct<Display:string>,8:struct<Display:string,ExclusiveAnswer:boolean>,9:struct<Display:string>>,
"choiceorder" array<bigint>,
"validation" struct<Settings:struct<ForceResponse:string,ForceResponseType:string,Type:string>>,
"language" array<int>,
"nextchoiceid" bigint,
"nextanswerid" bigint,
"questionid" string,
"questiontext_unsafe" string,
"variablenaming" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>,
"datavisibility" struct<Hidden:boolean,Private:boolean>,
"recodevalues" struct<1:string,2:string,3:string,4:string,5:string,6:string,7:string,8:string,9:string>,
"randomization" struct<Advanced:struct<FixedOrder:array<string>,RandomSubSet:array<int>,RandomizeAll:array<string>,TotalRandSubset:bigint,Undisplayed:array<int>>,EvenPresentation:boolean,TotalRandSubset:string,Type:string>,
"defaultchoices" boolean,
"gradingdata" array<int>,
"searchsource" struct<AllowFreeResponse:string>,
"displaylogic" struct<0:struct<0:struct<ChoiceLocator:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,QuestionID:string,QuestionIDFromLocator:string,QuestionIsInLoop:string,RightOperand:string,Type:string>,1:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,2:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,3:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,4:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,5:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,6:struct<Conjuction:string,Description:string,LeftOperand:string,LogicType:string,Operator:string,RightOperand:string,Type:string>,Type:string>,Type:string,inPage:boolean>,
"analyzechoices" struct<6:string,8:string>,
"answers" struct<1:struct<Display:string>,2:struct<Display:string>,3:struct<Display:string>,4:struct<Display:string>,5:struct<Display:string>,6:struct<Display:string>>,
"answerorder" array<bigint>,
"choicedataexporttags" boolean)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
I'm able to query the table generated by the crawler and the schema seems correct. Can anyone help me understand why I can't copy the DDL and use it for the same file in a different AWS account?

There are a number of things wrong with the DDL statement. How was it generated? I recommend using SHOW CREATE TABLE x to generate a DDL statement that will work with Athena.
These are some of the problems:
The first line is missing EXTERNAL between CREATE and TABLE.
Column names cannot be quoted with double quotes in Athena DDL. This is a bit weird since this is how you quote them in DML, but DDL is parsed by Hive whereas DML is parsed by Presto, and they have different syntaxes ¯\_(ツ)_/¯. If you need to quote column names in DDL, the correct character is the backtick.
Struct field names cannot start with numbers. Do these structs really have fields with numeric names? Are they in fact arrays?
You're also probably going to have some trouble with the casing of the field names; Athena is case insensitive, and this can trip things up in struct fields, but YMMV.
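For illustration, a minimal sketch of what a working version of the start of that DDL could look like, assuming the structs with numeric field names are resolved separately and with only a few of the columns shown; the LOCATION is a placeholder you would need to fill in:
-- Sketch only: EXTERNAL added, backticks instead of double quotes, field names lowercased
CREATE EXTERNAL TABLE x.y (
  `questiontext` string,
  `dataexporttag` string,
  `configuration` struct<choicecolumnwidth:bigint,mobilefirst:boolean,questiondescriptionoption:string,repeatheaders:string,textposition:string,whitespace:string>,
  `choiceorder` array<bigint>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://your-bucket/your-prefix/'
Note that an EXTERNAL table also needs a LOCATION clause, which the pasted DDL does not include.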
Glue crawlers are notoriously bad at generating correct schemas when things aren't bog-standard and basic. I recommend that you set up your tables by hand and use partition projection.
You may wonder how Glue managed to create a table when the DDL can't be used to create another table. The reason is that Glue crawlers use the Glue API; they don't generate a DDL statement and run it through Athena. The Glue APIs don't impose the same rules, because they are made to support multiple services besides Athena, like Spark and Hadoop on EMR, and Redshift Spectrum. Just because there is a table in the Glue Data Catalog does not mean it will work with Athena, unfortunately.

Related

Redshift Spectrum query returns 0 rows from S3 file

I tried Redshift Spectrum. Both of the queries below completed successfully without any error message, but I can't get the right count from the uploaded file in S3; it just returns a row count of 0, even though the file has over 3 million records.
-- Create External Schema
CREATE EXTERNAL SCHEMA spectrum_schema FROM data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
create external database if not exists;
-- Create External Table
create EXTERNAL TABLE spectrum_schema.principals(
tconst VARCHAR (20),
ordering BIGINT,
nconst VARCHAR (20),
category VARCHAR (500),
job VARCHAR (500),
characters VARCHAR(5000)
)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://xxxxx/xxxxx/'
I also tried the option 'stored as parquet'; the result was the same.
My IAM role has "s3:*", "athena:*", "glue:*" permissions, and the Glue table was created successfully.
And just in case, I confirmed that the same S3 file could be copied into a table in the Redshift cluster successfully. So I concluded the file/data has no issue by itself.
If there is something wrong with my procedure or query, any advice would be appreciated.
As your DDL is not scanning any data, it looks like the issue is that Spectrum is not understanding the actual data in S3. To figure this out you can simply generate a table using an AWS Glue crawler.
Once that table is created you can compare its properties with those of the table you created manually with DDL in the Glue Data Catalog. That will show you the difference and what is missing in the table you created with DDL by hand.
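If you prefer to compare from the Redshift side rather than in the Glue console, a query along these lines (assuming the external schema is named spectrum_schema as above) shows the location, input format and SerDe that Spectrum has registered for each external table, which is usually where the mismatch hides:
-- Inspect how Redshift Spectrum sees the external table definitions
SELECT schemaname, tablename, location, input_format, serialization_lib, serde_parameters
FROM svv_external_tables
WHERE schemaname = 'spectrum_schema';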

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS into BigQuery with the schema auto-detected from the file in GCS. I'm using Apache Airflow to do this; the problem I'm having is that when I use schema auto-detection, BigQuery creates the schema based on only the first ~100 values.
For example, in my case there is a column, say X; the values in X are mostly integers, but there are some string values, so bq load will fail with a schema mismatch. In such a scenario we need to change the data type to STRING.
What I could do is manually create a new table by generating the schema on my own, or I could set the max_bad_records value to something like 50, but that doesn't seem like a good solution. An ideal solution would look like this:
Try to load the file from GCS into BigQuery; if the table is created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BigQuery (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script
Use the one line that best suits your case, with the optimized field types.
This will create the table with the correct schema and a single row, and from there you can load the rest of the data.
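An alternative sketch of the same idea, if you prefer to pre-create the table instead of loading a single row first: define the table with the problematic column forced to STRING, then load the rest of the data into it. The dataset, table and column names below are hypothetical.
-- Hypothetical names; X is declared STRING up front so mixed values load cleanly
CREATE TABLE IF NOT EXISTS my_dataset.my_table (
  X STRING,
  other_column INT64
);
Once the table exists, loads that neither auto-detect nor pass an explicit schema will use the table's existing schema and simply append the rows.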

Migrating data from Hive PARQUET table to BigQuery, Hive STRING data type is getting converted to BYTES datatype in BQ

I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in the PARQUET file format, and the data type of one column is STRING. I am uploading the file behind the Hive table to Google Cloud Storage, and from that I am creating a BigQuery internal table with the GUI. The datatype of the column in the imported table is getting converted to BYTES.
But when I imported columns of CHAR or VARCHAR datatype, the resultant datatype was STRING.
Could someone please help me understand why this is happening?
This does not answer the original question, as I do not know exactly what happened, but I have had experience with similar odd behavior.
I was facing a similar issue when trying to move a table between Cloudera and BigQuery.
First I created the table as external in Impala, like:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with the STRING datatype.
Then I transferred that to GCS and imported it into BigQuery from the console GUI; there are not many options, just select the Parquet format and point to GCS.
And to my surprise the columns were now of type BYTES; the column names were preserved fine, but the content was scrambled.
Trying different codecs, and pre-creating the table and inserting, still in Impala, led to no change.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating the external table in Hive like this:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
And I repeated the same dance of copying from S3 to GCS and importing into BQ, this time without any issue. The columns are now recognized in BQ as STRING and the data is as it should be.

Hive partitioning for data on s3

Our data is stored using s3://bucket/YYYY/MM/DD/HH and we are using AWS Firehose to land Parquet data in those locations in near real time. I can query the data using AWS Athena just fine; however, we have a Hive query cluster which is having trouble querying the data when partitioning is enabled.
This is what I am doing:
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
This doesn't seem to work when the data on S3 is stored as s3://bucket/YYYY/MM/DD/HH;
however, it does work for s3://bucket/year=YYYY/month=MM/day=DD/hour=HH.
Given the stringent bucket paths of Firehose, I cannot modify the S3 paths. So my question is: what's the right partitioning scheme in Hive DDL when you don't have an explicitly defined column name in your data path like year= or month=?
Now you can specify a custom S3 prefix in Firehose: https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
myPrefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
If you can't obtain folder names that follow the Hive naming convention, you will need to map all the partitions manually:
ALTER TABLE tableName ADD PARTITION (year='YYYY') LOCATION 's3://bucket/YYYY/'
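Note that since the table above is partitioned by year, month, day and hour, each partition needs the full key and should point at the matching hour folder. A sketch with made-up values (the table and bucket names are placeholders):
-- One partition per hour folder; repeat (or script) for each hour you need
ALTER TABLE tableName ADD IF NOT EXISTS
  PARTITION (year='2023', month='01', day='15', hour='00')
  LOCATION 's3://bucket/2023/01/15/00/';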

Can we use bucketing in hive table backed by avro schema

I am trying to create a Hive table backed by an Avro schema. Below is the DDL for that:
CREATE TABLE avro_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
CLUSTERED BY (col_name) INTO N BUCKETS
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ( 'avro.schema.url' = 'hdfs://sandbox.hortonworks.com:8020/avroschema/test_schema.avsc')
But it is throwing the error mentioned below:
FAILED: ParseException line 3:3 missing EOF at 'clustered' near ''org.apache.hadoop.hive.serde2.avro.AvroSerDe''
I am not sure whether we can use bucketing in a Hive table backed by Avro or not.
Hive version: 1.2
Can anyone help me or provide any idea of how to achieve this?
Your syntax is in the wrong order, and is missing things. ROW FORMAT is defined after CLUSTERED BY, and CLUSTERED BY requires a column name, which presumably needs to be defined as part of the CREATE TABLE command.
I assume the N in N BUCKETS is really replaced with your actual number of buckets, but if not, that's another error.
I have formatted the query in your question so that I could read it, and by comparing it to the syntax here it was easier to spot what the parser didn't like.
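For reference, a sketch of the statement with the clauses reordered, assuming a hypothetical column col_name declared explicitly and 4 buckets; whether bucketing combines cleanly with avro.schema.url on Hive 1.2 is something you would still want to verify:
-- CLUSTERED BY comes before ROW FORMAT, and the bucketing column is declared
CREATE TABLE avro_table (col_name STRING)
CLUSTERED BY (col_name) INTO 4 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url' = 'hdfs://sandbox.hortonworks.com:8020/avroschema/test_schema.avsc');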