Delta Lake: Performance Challenges - pandas

Approach 1: My input data is a bunch of JSON files. After preprocessing, the output is a pandas DataFrame, which is written to an Azure SQL Database table.
Approach 2: I implemented Delta Lake, where the output pandas DataFrame is converted to a Spark DataFrame and the data is then inserted into a partitioned Delta table. The process is simple, and the time required to convert the pandas DataFrame to a Spark DataFrame is in milliseconds. But the performance compared to Approach 1 is bad: with Approach 1, I am able to finish in less than half of the time required by Approach 2.
I tried different optimization techniques such as ZORDER, compaction (bin-packing), and using insertInto rather than saveAsTable, but none of them really improved the performance.
Please let me know if I have missed any performance tuning methods. If there are none, I am curious to know why Delta Lake did not perform better than the pandas + database approach. I would also be happy to hear about other, better approaches; for example, I came across Dask.
Many thanks in advance for your answers.
Regards,
Chaitanya

You don't give enough information to answer your question. What exactly is not performant: the whole process of data ingestion?
Z-ordering doesn't give you an advantage while you are writing data into the Delta lake; it will more likely slow you down. It gives you an advantage when you are reading the data afterwards. Z-ordering by, for example, an ID column tries to store rows with the same ID in the same file(s), which enables Spark to use data skipping to avoid reading unnecessary data.
Also, how big is your data actually? If we are talking about a few GB of data in the end, pandas and a traditional database will perform faster.
I can give you an example:
Let's say you have a daily batch job that processes 4 GB of data. If it's just about processing that 4 GB and storing it somewhere, Spark will not necessarily perform faster, as I already mentioned.
But now consider that this job runs for a year, which gives you ~1.5 TB of data at the end of the year. Now you can perform analytics on the entire history of data, and in this scenario you will probably be much faster than a database and pandas.
As a side note, you say you are reading in a bunch of JSON files, converting them to pandas, and then writing to Delta Lake.
If there is no specific reason to do so, in Approach 2 I would just use:
spark.read.json("path")
to avoid the conversion from pandas to Spark DataFrames altogether.
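As a rough end-to-end sketch of that idea (the paths and the partition column are placeholders, and the Delta write assumes the Delta Lake package is available on your cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-delta").getOrCreate()

# Read the raw JSON files directly into a Spark DataFrame,
# skipping the pandas intermediate step entirely.
df = spark.read.json("/mnt/raw/input_json/")

# Append straight into a partitioned Delta table.
# "event_date" is just a placeholder partition column.
(df.write
   .format("delta")
   .mode("append")
   .partitionBy("event_date")
   .save("/mnt/delta/events"))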

Related

When exactly should I repartition my dataframes?

I'm working with a large Spark application on EMR that processes data in the hundreds of GBs. At the moment, I'm reading in a number of parquet files that are transformed into Spark dataframes and subsequently performing data enrichment.
During this step, I'm joining several of these dataframes together in order to compute intermediate values or derive new columns from existing ones. As I'm learning more about Spark, it's become obvious to me as this job continues to scale that I need to be mindful of potential performance optimizations.
After looking into the Spark UI for my application, I notice that a number of my executors are not getting any input and that there is apparently skew in the shuffle reads and writes.
There is an ID key in my dataframes that I am joining on, which should be evenly distributed, so it makes me feel like I need to repartition my dataframes on that key when reading them in from the original parquet files. Is this line of thinking correct? Should I be calling repartition manually on that ID key, or is Spark already handling this and is what I am seeing above actually a symptom of something else?
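For concreteness, what I have in mind is something like this (the paths, partition count, and id column are placeholders standing in for my actual job):

# Repartition both sides on the join key right after reading the parquet files,
# so that each executor gets a comparable share of rows for the join.
left = spark.read.parquet("s3://bucket/left/").repartition(400, "id")
right = spark.read.parquet("s3://bucket/right/").repartition(400, "id")

enriched = left.join(right, on="id", how="left")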
Thanks!

How to save an input data stream to a performant data structure for SQL queries in Spark?

I'm new to the world of big data. My goal is to maintain an input data stream in some kind of data structure so that I can perform queries and aggregation operations on it.
Given a continuous data stream as input, I store it in a DataFrame through Spark Structured Streaming. My questions are:
Is a DataFrame a volatile data structure?
In case the program crashes, is the DataFrame maintained?
Is a DataFrame distributed on the various nodes of the cluster or is it kept on the node that executes the code?
Is it possible to create index on a Dataframe to speed up the response of some queries?
If you are planning to use Spark 2.4+, then go with DataFrames.
Yes, DataFrames are volatile; they vanish once the Spark job finishes. But you can save them to disk as Parquet files and query them later via Spark, Hive, or any tool that can read the Parquet file format.
In case the program crashes, the DataFrame is not recoverable unless you happened to save it before the crash, in which case you can read it back once your Spark job is up again.
A DataFrame is a distributed data structure used and understood by Spark, so yes, it is partitioned/split across the Spark nodes.
DataFrame partitioning can be tuned to improve query performance and minimize data shuffling.
Apart from the above points, Spark has an inbuilt checkpointing mechanism to make sure there is no data loss when your job crashes. Detailed documentation can be found on the Spark site.
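A minimal sketch of that pattern (the paths are placeholders, stream_df stands for the streaming DataFrame produced by Structured Streaming, and the checkpointLocation is what gives you the recovery behaviour mentioned above):

# Persist the stream to Parquet so the data survives job restarts;
# the checkpoint directory lets Spark track progress across failures.
query = (stream_df.writeStream
         .format("parquet")
         .option("path", "/data/output/events")
         .option("checkpointLocation", "/data/checkpoints/events")
         .outputMode("append")
         .start())

# The persisted Parquet files can later be queried like any other table.
saved = spark.read.parquet("/data/output/events")
saved.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) FROM events").show()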

Improving write performance in Hive

I am performing various calculations (using UDFs) on Hive. The computations are fast enough, but I am hitting a roadblock with the write performance in Hive. My result set is close to ten million records, and it takes a few minutes to write them to the table. I have experimented with cached tables and various file formats (ORC and RC), but haven't seen any performance improvement.
Indexes are not possible since I am using Shark. It would be great to hear suggestions from the SO community on methods I can try to improve the write performance.
Thanks,
TM
I don't really use Shark since it is deprecated, but I believe it has the ability to read and write Parquet files just like Spark SQL. In Spark SQL it is trivial (from the website):
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a JavaSchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")
Basically, Parquet is your best bet for improving I/O speed without considering another framework (Impala is supposed to be extremely fast, but its queries are more limited). This is because, if you have a table with many columns, Parquet allows you to deserialize only the needed columns, since the data is stored in a columnar format. In addition, that deserialization may be faster than with row-oriented storage, since storing data of the same type next to each other offers better compression rates. Also, as I said in my comments, it would be a good idea to upgrade to Spark SQL, since Shark is no longer supported and I don't believe there is much difference in terms of syntax.
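For comparison, here is a rough PySpark sketch of the same round trip on the newer DataFrame-based Spark SQL API (df and the column names are placeholders for whatever result set you produce):

# Write the result set as Parquet; the schema travels with the files.
df.write.mode("overwrite").parquet("/warehouse/results.parquet")

# Reading it back recovers the schema, and only the columns referenced
# by a query need to be deserialized thanks to the columnar layout.
results = spark.read.parquet("/warehouse/results.parquet")
results.select("id", "value").where(results.value > 0).show()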

Database for 5-dimensional data?

I commonly deal with data-sets that have well over 5 billion data points in a 3D grid over time. Each data point has a certain value, which needs to be visualised, so it's a 5-dimensional data-set. Let's say the data for each point looks like (x, y, z, time, value).
I need to run arbitrary queries against these data-sets where for example the value is between a certain range, or below a certain value.
I need to run queries where I need all data for a specific z value.
I have tried the likes of MySQL and MongoDB and created indices for those values, but the resource requirements are quite extreme, with long query runtimes. I ended up writing my own file format to at least store the data for relatively easy retrieval. However, this approach makes it difficult to find data without having to read/scan the entire data-set.
I have looked at the likes of Hadoop and Hive, but the queries are not designed to be run in real-time. In terms of data size it seems a better fit though.
What would be the best method to index such large amounts of data efficiently? Is a custom indexing system the best approach, or should I slice the data into smaller chunks and devise a specific way of indexing them (which way, though)? The goal is to be able to run queries against the data and have the results returned in under 0.5 seconds. My best so far was 5 seconds, achieved by running the entire DB on a huge RAM drive.
Any comments and suggestions are welcome.
EDIT:
the x, y, z, time and value fields are all FLOAT
It really depends on the hardware you have available, but regardless of that and considering the type and the amount of data you are dealing with, I definitely suggest a clustered solution.
As you already mentioned, Hadoop is not a good fit because it is primarily a batch processing tool.
Have a look at Cassandra and see if it solves your problem. I feel like a column-store RDBMS such as CitusDB (free up to 6 nodes) or Vertica (free up to 3 nodes) may also prove useful.

Improve Apache Hive performance

I have 5 GB of data in my HDFS sink. When I run any query on Hive, it takes more than 10-15 minutes to complete. The number of rows I get when I run
select count(*) from table_name
is 3,880,900. My VM has 4.5 GB of memory and runs on a 2012 MBP. I would like to know whether creating an index on the table will bring any performance improvement. Also, are there other ways to tell Hive to use only a limited amount of data or rows, so that results come back faster? I am OK even if the queries run on a smaller subset of the data, at least to get a glimpse of the results.
Yes, indexing should help. However, getting a subset of the data (using limit) isn't really helpful, as Hive still scans all the data before limiting the output.
You can try using the RCFile/ORC file format for faster results. In my experiments, RCFile-based tables executed queries roughly 10 times faster than textfile/sequence-file-based tables.
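If you try ORC, the switch can be as simple as recreating the table with a CTAS. The ORC table name below is a placeholder and table_name is taken from your query; I am issuing the DDL through a Hive-enabled SparkSession here, but the same statements can be run from the Hive CLI or beeline:

# Rebuild the existing table in ORC format via CREATE TABLE ... AS SELECT.
spark.sql("""
    CREATE TABLE table_name_orc
    STORED AS ORC
    AS SELECT * FROM table_name
""")

# Queries against the ORC copy should scan noticeably less data.
spark.sql("SELECT COUNT(*) FROM table_name_orc").show()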
Depending on the data you are querying, you can get gains by using different file formats like ORC or Parquet. What kind of data are you querying: is it structured or unstructured? What kind of queries are you trying to perform? If it is structured data, you can also see gains by using other SQL-on-Hadoop solutions such as InfiniDB, Presto, Impala, etc.
I am an architect for InfiniDB
http://infinidb.co
SQL-on-Hadoop solutions like InfiniDB, Impala and others work by having you load your data through them, at which point they perform calculations, optimizations, etc. to make that data faster to query. This helps tremendously for interactive analytical queries, especially when compared to something like Hive.
With that said, you are working with 5 GB of data (but data always grows! someday it could be TBs), which is pretty small, so you can still work in the world of some of the tools that are not intended for high-performance queries. Your best option with Hive is to look at how your data is laid out and see whether ORC or Parquet could benefit your queries (columnar formats are good for analytic queries).
Hive is always going to be one of the slower options for performing SQL queries on your HDFS data, though. Hortonworks, with their Stinger initiative, is making it better; you might want to check that out.
http://hortonworks.com/labs/stinger/
The use case sounds like a fit for ORC or Parquet if you are interested in a subset of the columns. ORC with Hive 0.12 comes with PPD (predicate pushdown), which helps you discard blocks while running queries, using the metadata that ORC stores for each column.
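To illustrate what PPD buys you (the table and column names here are hypothetical), a selective predicate lets the reader skip whole blocks whose stored min/max statistics cannot match the filter, instead of scanning all 5 GB:

# With ORC predicate pushdown enabled, only blocks whose column
# statistics can satisfy the filter are actually read.
spark.sql("""
    SELECT id, value
    FROM events_orc
    WHERE value > 1000
""").show()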
We did an implementation on top of Hive to support Bloom filters in the metadata indexes for ORC files, which gave a performance gain of 5-6x.
What is the average number of mapper/reducer tasks launched for the queries you execute? Tuning some parameters can definitely help.