We have an EMR cluster running Hive 0.13.1 (I know how archaic that is, but this cluster has so many dependencies that we cannot do away with it).
Anyway, cutting to the chase: we converted something like 10 TB of TSV data to Parquet using a different EMR cluster running a recent version of Hive.
That was a temporary setup to facilitate the huge one-off processing.
Now we are back on the old EMR cluster to do incremental TSV-to-Parquet processing.
We query this data with AWS Redshift Spectrum coupled with Glue: Glue crawls the S3 path where the data resides and gives us a schema to work with.
Now the data processed by the old EMR cluster is giving us "incompatible Parquet schema" issues.
The error we get when we try to read Parquet data that combines files written by the newer Hive and the old Hive is:
[2018-08-13 09:40:36] error: S3 Query Exception (Fetch)
[2018-08-13 09:40:36] code: 15001
[2018-08-13 09:40:36] context: Task failed due to an internal error. File '<Some s3 path >/parquet/<Some table name>/caldate=2018080900/8e71ebbe-b398-483c-bda0-81db6f848d42-000000 has an incompatible Parquet schema for column
[2018-08-13 09:40:36] query: 11500732
[2018-08-13 09:40:36] location: dory_util.cpp:724
[2018-08-13 09:40:36] process: query1_703_11500732 [pid=5384]
My hunch is that this is down to the different Hive versions, or it could be a Redshift Spectrum bug.
Has anyone faced the same issue?
I think this post will help you solve the issue. It discusses the problems that arise when a schema is written by one version and read by another:
https://slack.engineering/data-wrangling-at-slack-f2e0ff633b69
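If it helps to pin down where the mismatch is, here is a minimal diagnostic sketch, assuming Spark is available on one of your EMR clusters; the two S3 paths are placeholders, not the paths from the error message. It just prints the schema embedded in one file written by each Hive version so they can be compared side by side; a column whose type differs between the files (or from the Glue table) is the usual cause of this Spectrum error.
import org.apache.spark.sql.SparkSession

object CompareParquetSchemas {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compare-parquet-schemas").getOrCreate()

    // Placeholder paths: one file written by Hive 0.13.1, one by the newer Hive.
    val oldHiveFile = "s3://my-bucket/parquet/my_table/caldate=2018080900/old-hive-part.parquet"
    val newHiveFile = "s3://my-bucket/parquet/my_table/caldate=2018080800/new-hive-part.parquet"

    // Reading a single file just exposes that file's embedded schema.
    println(spark.read.parquet(oldHiveFile).schema.treeString)
    println(spark.read.parquet(newHiveFile).schema.treeString)

    spark.stop()
  }
}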
Related
I am using Spark 3.x with Apache Hudi 0.8.0.
While trying to create a Presto table using the hudi-hive-sync tool, I get the error below.
Got runtime exception when hive syncing
java.lang.IllegalArgumentException: Could not find any data file written for commit [20220116033425__commit__COMPLETED], could not get schema for table
But I checked all the data for the partition keys using a Zeppelin notebook, and I can see that it is all present.
I understand that I need to commit the file manually. How do I do that?
I have a pipeline set up that reads data from Kafka, processes it using Spark Structured Streaming and then writes Parquet files to HDFS. Downstream clients query the data using Presto, configured to read it as Hive tables.
Kafka --> Spark --> Parquet on HDFS --> Presto
In general this works. The problem arises when a query happens while the Spark job is running a batch. The Spark job creates a zero-length Parquet file on HDFS. If Presto attempts to open this file in the course of processing a query, then it throws an error:
Query 20171116_170937_07282_489cc failed: Error opening Hive split hdfs://namenode:50071/hive/warehouse/table/part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet (offset=0, length=0): hdfs://namenode:50071/hive/warehouse/table/part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet is not a Parquet file (too small)
The file is indeed zero bytes at this time, so the error is strictly correct, but this is not the behavior I want for the pipeline. I would like to be able to continuously write into the appropriate HDFS folders without disturbing the Presto queries.
The Spark scala code for the job looks like this:
// ProcessingTime comes from org.apache.spark.sql.streaming;
// Spark.initKafkaStream and the job.* members are helpers from our own codebase.
import org.apache.spark.sql.streaming.ProcessingTime

val FilesOnDisk = 1
Spark
.initKafkaStream("fleet_profile_test")
.filter(_.name.contains(job.kafkaTag))
.flatMap(job.parser)
.coalesce(FilesOnDisk)
.writeStream
.trigger(ProcessingTime("1 hours"))
.outputMode("append")
.queryName(job.queryName)
.format("parquet")
.option("path", job.outputFilesPath)
.start()
The job starts at the top of the hour, :00. The file is first visible on HDFS as a zero-length file at :05. It is not updated until it is written completely at :21, just before the job finishes. This makes the table effectively unusable from Presto 25% of the time.
Each file is only a little over 500kB, so I wouldn't expect the physical writing of the file to take very long. From my understanding, Parquet files have their metadata at the end of the file so someone writing bigger files would have even more trouble.
What strategies have people used to integrate Spark structured streaming and Presto while working around this Presto error?
You could try to persuade Presto (or the Presto team) to ignore empty files, but that wouldn't help: the program writing the file (here: Spark) will eventually flush partial data, and the file would then appear non-empty but not well formed, leading to an error as well.
The way to prevent Presto (or any other program reading the table data, for that matter) from seeing a partial file is to assemble the file in a different location and then atomically move it into the correct location.
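A minimal sketch of that approach, assuming the batch output is first written to a staging directory and that the staging and final directories live on the same HDFS filesystem (a rename within one HDFS filesystem is atomic); the directory names are placeholders:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object AtomicPublish {
  // Move every completed Parquet file from the staging directory into the
  // directory Presto reads, one atomic rename per file.
  def publish(stagingDir: String, finalDir: String): Unit = {
    val fs = FileSystem.get(new java.net.URI(stagingDir), new Configuration())
    fs.mkdirs(new Path(finalDir))
    fs.listStatus(new Path(stagingDir))
      .filter(st => st.isFile && st.getPath.getName.endsWith(".parquet"))
      .foreach { st =>
        fs.rename(st.getPath, new Path(finalDir, st.getPath.getName))
      }
  }
}

// e.g. after each batch completes (paths are placeholders):
// AtomicPublish.publish("hdfs://namenode:50071/hive/warehouse/_staging/table",
//                       "hdfs://namenode:50071/hive/warehouse/table")
Because the rename is atomic, Presto only ever sees files that are fully written.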
I am currently using Cloudera 5.6 and trying to create a Parquet-format Hive table based on another table, but I am running into an error.
create table sfdc_opportunities_sandbox_parquet like
sfdc_opportunities_sandbox STORED AS PARQUET
Error Message
Parquet does not support date. See HIVE-6384
I read that Hive 1.2 has a fix for this issue, but Cloudera 5.6 and 5.7 do not come with Hive 1.2. Has anyone found a way around this issue?
Apart from using another data type such as TIMESTAMP, or another storage format such as ORC, there may be no way around it as long as you are tied to that Hive version and the Parquet storage format.
According to Cloudera's CDH 5 Packaging and Tarball Information, the whole CDH 5 branch ships with Apache Parquet 1.5.0 and Apache Hive 1.1.0.
DATE support was implemented in the Parquet SerDe with HIVE-8119, as of Hive 1.2.
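A rough sketch of the TIMESTAMP workaround, here driven over HiveServer2 JDBC purely for illustration (the same statement can be run from the Hive CLI); the connection URL and column names are placeholders, and close_date is only assumed to be the offending DATE column:
import java.sql.DriverManager

object DateToTimestampWorkaround {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
    val stmt = conn.createStatement()
    // Use CTAS instead of CREATE TABLE ... LIKE, so the DATE column can be
    // cast to TIMESTAMP, which the bundled Parquet SerDe does support.
    stmt.execute(
      """CREATE TABLE sfdc_opportunities_sandbox_parquet
        |STORED AS PARQUET AS
        |SELECT opportunity_id, CAST(close_date AS TIMESTAMP) AS close_date
        |FROM sfdc_opportunities_sandbox""".stripMargin)
    conn.close()
  }
}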
I have a couple of Spark jobs that produce Parquet files in AWS S3. Every once in a while I need to run ad-hoc queries over a given date range of this data. I don't want to do this in Spark, because I want our QA team, which has no knowledge of Spark, to be able to do it. What I would like to do is spin up an AWS EMR cluster, load the Parquet files into HDFS and run my queries against them. I have figured out how to create tables with Hive and point them at a single S3 path, but that limits my data to only one day, because each day of data has multiple files under a path like
s3://mybucket/table/date/(parquet files 1 ... n).
So problem one is figuring out how to load multiple days of data into Hive, i.e.
s3://mybucket/table_a/day_1/(parquet files 1 ... n).
s3://mybucket/table_a/day_2/(parquet files 1 ... n).
s3://mybucket/table_a/day_3/(parquet files 1 ... n).
...
s3://mybucket/table_b/day_1/(parquet files 1 ... n).
s3://mybucket/table_b/day_2/(parquet files 1 ... n).
s3://mybucket/table_b/day_3/(parquet files 1 ... n).
I know Hive supports partitions, but my S3 files are not set up that way.
I have also looked into PrestoDB, which seems to be the favorite tool for this type of data analysis. The fact that it supports ANSI SQL makes it a great tool for people who have SQL knowledge but know very little about Hadoop or Spark. I installed it on my cluster and it works great, but it looks like you can't really load data into your tables and you have to rely on Hive for that part. Is this the right way to use PrestoDB? I watched a Netflix presentation about their use of PrestoDB with S3 in place of HDFS. If that works it's great, but I wonder how the data gets moved into memory: at what point are the Parquet files moved from S3 to the cluster? Do I need a cluster that can load the entire dataset into memory? How is this generally set up?
You can install Hive and create Hive tables over your data in S3, as described in this blog post: https://blog.mustardgrain.com/2010/09/30/using-hive-with-existing-files-on-s3/
Then install Presto on AWS and configure it to connect to the Hive catalog you set up previously. You can then query your data on S3 with Presto using SQL.
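To address "problem one" (multiple days per table) concretely: since the day prefixes are not in Hive's key=value layout, each one can be attached explicitly as a partition with its own LOCATION. A hedged sketch, again driving HiveQL over HiveServer2 JDBC from Scala (the host, column names and day list are placeholders; the same DDL works from the Hive CLI on EMR):
import java.sql.DriverManager

object RegisterDayPartitions {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://emr-master:10000/default", "hadoop", "")
    val stmt = conn.createStatement()

    // One external table over the whole table_a prefix, partitioned by day.
    stmt.execute(
      """CREATE EXTERNAL TABLE IF NOT EXISTS table_a (col1 STRING, col2 BIGINT)
        |PARTITIONED BY (day STRING)
        |STORED AS PARQUET
        |LOCATION 's3://mybucket/table_a/'""".stripMargin)

    // Point each partition at the existing day_N prefix; no files are moved.
    for (day <- Seq("day_1", "day_2", "day_3")) {
      stmt.execute(
        s"ALTER TABLE table_a ADD IF NOT EXISTS PARTITION (day='$day') " +
        s"LOCATION 's3://mybucket/table_a/$day/'")
    }
    conn.close()
  }
}
Presto then sees table_a through the Hive connector, and a query that filters on the day partition column only reads the matching S3 prefixes.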
Rather than trying to load multiple files, you could instead use the API to concatenate the days you want into a single object, which you can then load through the means you already mention.
AWS has a blog post highlighting how to do this exact thing purely through the API (without downloading + re-uploading the data):
https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby
I am using Hive and HBase to do some analysis on data.
I accidentally removed the following path in HDFS:
(hdfs) /user/hive/mydata.db
Although all the tables still exist in Hive, when I retrieve data from Hive through the Thrift server to plot it, nothing comes back. How can I rebuild my data? Any guidance would be appreciated.