I am testing a large data set (1.5 TB, 5.5 billion records) in Athena in both Parquet and ORC formats. My first test is a simple one, a count query:
SELECT COUNT(*) FROM events_orc
SELECT COUNT(*) FROM events_parquet
The Parquet table takes half the time of the ORC table to run this query. But one thing I noticed is that when running a count on the Parquet table, it returns 0 KB as the bytes scanned, whereas with ORC it returns 78 GB. This makes sense for Parquet because the count is in the metadata, so there is no need to scan bytes. ORC also has metadata with the count, but Athena doesn't seem to be using that metadata to determine the counts for these files.
Why doesn't Athena use the metadata in the ORC file to determine the count, when it clearly does with Parquet files?
The answer is, as you say, that Athena reads the Parquet metadata but not the ORC metadata. There is no reason beyond that feature not being in the version of Presto and/or the ORC serde that Athena uses.
I've also noticed that Athena reads too much data when using ORC, e.g. it doesn't skip columns when it should. I think the Athena ORC serde is just old and doesn't have all the optimisations you would expect. Athena is, after all, based on a very old Presto version.
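As a quick local check (a minimal sketch, assuming pyarrow is installed and using placeholder file names), you can confirm that both formats carry the row count in their footers, so the difference is in what Athena chooses to read rather than in what the formats provide:

import pyarrow.parquet as pq
import pyarrow.orc as orc

# Both counts come from footer metadata; no data pages need to be scanned.
print("parquet rows:", pq.ParquetFile("events.parquet").metadata.num_rows)
print("orc rows:", orc.ORCFile("events.orc").nrows)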
This depends on how those ORC files were created. Could you explain a bit how you ETLed the data in and what the table definitions are?
There are a few indexes that ORC has:
ORC provides three levels of indexes within each file:

file level: statistics about the values in each column across the entire file
stripe level: statistics about the values in each column for each stripe
row level: statistics about the values in each column for each set of 10,000 rows within a stripe

The file and stripe level column statistics are in the file footer so that they are easy to access to determine if the rest of the file needs to be read at all. Row level indexes include both the column statistics for each row group and the position for seeking to the start of the row group.
Athena, just like PrestoDB (the query engine it is built on), can use these indexes to speed up queries.
I would be extremely surprised if Athena were not using these bits of information for its queries.
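Whether these indexes exist also depends on how the files were written. For example, with Spark you can ask the ORC writer for bloom filters at write time (a hedged sketch; the column name and paths are placeholders, and file/stripe statistics plus the 10,000-row index stride are written by default):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-orc-with-indexes").getOrCreate()
events_df = spark.read.parquet("s3://your-bucket/events_parquet/")  # hypothetical source

(events_df.write
    .format("orc")
    .option("orc.bloom.filter.columns", "user_id")  # bloom filters for this (hypothetical) filter column
    .mode("overwrite")
    .save("s3://your-bucket/events_orc/"))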
Related
I have a Glue table on S3 where partitions are populated through Spark save mode overwrite (the script is executed through a Glue job).
What is the expected behavior from Athena if we query such partitions while they are being overwritten?
If you rewrite files while queries are running you may run into errors like "HIVE_FILESYSTEM_ERROR: Incorrect fileSize 1234567 for file".
The reason is that during query planning all the files are listed on S3, and among other things the file sizes are used to divide up the work between the worker nodes. If a file is splittable, which includes file formats like ORC and Parquet, as well as uncompressed text formats (e.g. JSON, CSV), parts of it (called splits) may be processed by different nodes.
If the file changes between query planning and query execution the plan is no longer valid and the query execution fails.
New partitions are picked up by Athena as long as you set enableUpdateCatalog = True when writing. If you just overwrite the content of existing partitions, Athena will be able to query the data, as long as you don't have a schema mismatch.
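For reference, a hedged sketch of a Glue job sink that writes partitions and registers them in the Data Catalog as it goes (database, table, path and partition key are placeholders, and the exact sink options depend on your Glue version, so check the Glue documentation):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

sink = glue_context.getSink(
    connection_type="s3",
    path="s3://your-bucket/events/",          # placeholder target path
    enableUpdateCatalog=True,                  # register new partitions in the catalog
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["dt"],                      # placeholder partition key
)
sink.setCatalogInfo(catalogDatabase="your_db", catalogTableName="events")
sink.setFormat("glueparquet")
sink.writeFrame(dynamic_frame)  # dynamic_frame: the DynamicFrame produced earlier in the job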
I want to create a Hive table which will store data in ORC format with Snappy compression. Will Power BI be able to read from that table? Also, do you suggest any other format/compression for my table?
ORC is a specialized file format that only works with Hive and is highly optimized for HDFS read operations. Power BI can connect to Hive using the Hive ODBC data connection. So, I think if you have to use Hive all the time, you can use this format to store the data. But if you want the flexibility of both Hive and Impala and use the Cloudera-provided Impala ODBC driver, you can think of using Parquet.
Now, both ORC and Parquet have their own advantages and disadvantages. The main deciding factors are the tools that access the data, how nested the data is, and how many columns there are.
If you have many columns with nested data and you want to use both Hive and Impala to access the data, go with Parquet. If you have few columns with a flat data structure and a huge amount of data, go with ORC.
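For the table itself, a minimal sketch of the DDL, run here through Spark's Hive support (table and column names are placeholders; 'orc.compress' = 'SNAPPY' is the standard ORC table property for Snappy):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Create an ORC table whose files are Snappy-compressed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orc (
        order_id BIGINT,
        amount DOUBLE,
        country STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY')
""")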
There are options for transferring a DB snapshot from a relational database to S3 in AWS.
But S3 is an object store, so it only stores files (e.g. parquet).
Are the relationships (like keys) between tables in the relational DB somehow carried over to S3? Can queries still be made against the files in S3 that would allow joins to be made between tables?
There are no "keys" like foreign key, primary key in the exported parquet files in S3, but you can still query the the exported data directly through tools like Amazon Athena or Amazon Redshift Spectrum. For more information on using Athena to read Parquet data, see Parquet SerDe in the Amazon Athena User Guide. For more information on using Redshift Spectrum to read Parquet data, see COPY from columnar data formats in the Amazon Redshift Database Developer Guide.
The time it takes for the export to complete depends on the data stored in the database. For example, tables with well distributed numeric primary key or index columns will export the fastest. Tables that don't contain a column suitable for partitioning and tables with only one index on a string-based column will take longer because the export uses a slower single threaded process. For example if a table got a numeric pk and got 100,000 rows, during export data will be "partitioned" in a few portion, each portion are a directory in the S3 bucket, so that when you query data in Athena/Redshift spectrum with that id, AWS know what buckets to scan to get the data and thus improve performance and speed.
In summary, after data exported as columnar format like parquet in S3, you can do inplace query by Athena, load the data to redshift or data store for more analytics, etc..
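To illustrate the point about joins: once the exported tables are in the catalog, the relationships are expressed in your queries rather than enforced by keys. A sketch from Python via boto3 (database, table, column names and the output location are placeholders):

import boto3

athena = boto3.client("athena")

# Join two exported tables on the former key columns; nothing about the export
# enforces this relationship, the query simply declares it.
response = athena.start_query_execution(
    QueryString="""
        SELECT o.order_id, c.customer_name
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
    """,
    QueryExecutionContext={"Database": "exported_db"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
print(response["QueryExecutionId"])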
Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
ability to query the data using Athena
Problem:
The ORC files are read-only, and we want to update the file contents constantly, every 20 minutes,
which implies we
need to download the ORC files from S3,
read the file,
append to the end of the file,
and finally upload it back to S3.
This was not a problem initially, but the data grows significantly, by ~2 GB every day, so it is a highly costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there any way to use another file format which also offers appends/inserts and can be queried by Athena?
From this article it says Avro is such a file format, but I'm not sure:
can Athena be used for querying it?
are there any other issues?
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files to a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
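Concretely, instead of downloading and rewriting existing ORC files, each 20-minute batch can simply become a new object under the table (or partition) prefix. A sketch assuming a reasonably recent pyarrow and boto3 (bucket, key layout and columns are placeholders):

import uuid
import boto3
import pyarrow as pa
import pyarrow.orc as orc

# Build this batch's rows however you collect them; columns are placeholders.
batch = pa.table({"event_id": [1, 2, 3], "payload": ["a", "b", "c"]})

local_path = "/tmp/batch.orc"
orc.write_table(batch, local_path)  # write a brand-new ORC file for this batch

s3 = boto3.client("s3")
key = "path_to_table/date=20240101/batch-%s.orc" % uuid.uuid4()  # unique object name per batch
s3.upload_file(local_path, "your-bucket", key)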
I have tabular data of 100 Million records, each record having 15 columns.
I need to query 3 columns of this data and filter out the records to be used in further processing.
Currently I'm deciding between two approaches
Approach 1
Store the data as CSV or Parquet in HDFS. When I need to query, read the whole data set and query using Spark SQL.
Approach 2
Create a Hive table using HiveContext and persist the table and Hive-metadata. Query this table when needed using HiveContext.
Doubts:
In Approach 2, is the query pushed down to the storage level (HDFS) so that only the records which satisfy the criteria are read and returned? Or is the entire data set read into memory (as is the case with most Spark jobs) and then the query run using the metadata?
Runtime: Of the two approaches, which one will be faster?
Please note that the Hive setup isn't Hive on Spark; it's the HiveContext provided with Spark.
Spark Version: 2.2.0
In Approach 2, you should have the Hive table structured and stored in a proper way.
Spark doesn't load all the data if the Hive table is partitioned and stored in a file format that supports indexing (like ORC).
Spark's optimized engine will use partition pruning and predicate pushdown and load only the relevant data for further processing (transformations/actions).
Partition pruning:
Choose an appropriate column (one that distributes data evenly across partitions) to partition the Hive table.
Spark partition pruning works efficiently with the Hive metastore. It will look into only the relevant partitions according to the partition column used in the WHERE clause of your query.
Predicate pushdown:
ORC files have min/max indexes and bloom filters. These also work for string columns in ORC (not sure about the latest Parquet string support), but they are more efficient on numerical columns.
Spark will read only the rows that match the filters, as it pushes the filters down to the underlying storage (the ORC files).
Below is a sample Spark snippet to create such a Hive table (assuming raw_df is the DataFrame created from your raw data):
sorted_df = raw_df.sort("column2")
sorted_df.write.mode("append").format("orc").partitionBy("column1").saveAsTable("hive_table_name")
This will partition the data according to the column1 values, save the ORC files in HDFS, and update the Hive metastore.
We sort the table by column2 assuming that we are going to use column2 in our query's WHERE clause (the sort is needed for an efficient ORC index).
Then you can query Hive and load a Spark DataFrame with the relevant data. Below is a sample:
filtered_df = spark.sql('SELECT column1,column2,column3 FROM hive_table_name WHERE column1= "some_value1" AND column2= "some_value2"')
In the above sample, Spark will look into only the some_value1 partition, as column1 is the partition column of the Hive table we created.
Then Spark will push the predicate (i.e. the filter) some_value2 for column2 down into the ORC files only under the some_value1 partition.
Here Spark will load only the values of column1, column2, and column3, ignoring the other columns in the table.
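You can check that both mechanisms kicked in by printing the physical plan for the filtered DataFrame; the file scan node reports the partition filters and the pushed-down filters:

# The scan node of the physical plan lists PartitionFilters and PushedFilters.
filtered_df.explain(True)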
Unless you combine the second approach with a more advanced storage layout (bucketBy / DISTRIBUTE BY), which can be used to optimize the query, there should be no difference between these two, as long as you don't use schema inference in Approach 1 (you'll have to provide a schema for the DataFrameReader).
Bucketing can be used to optimize execution plans for joins, aggregations, and filters on the bucketing column, but everything is still executed with Spark. In general, Spark will use Hive only as a metastore, not as an execution engine.
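For reference, a minimal sketch of the bucketed layout mentioned above (the bucket count and names are arbitrary; note that bucketBy requires saveAsTable):

# Bucket and sort by the join/filter column so later joins and aggregations
# on column2 can avoid a full shuffle.
(raw_df.write
    .format("orc")
    .bucketBy(32, "column2")  # arbitrary bucket count; choose based on data volume
    .sortBy("column2")
    .mode("overwrite")
    .saveAsTable("hive_table_bucketed"))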