Multiple xml file processing on pyspark - pandas

I have around 15,000 XML files. We use a Databricks notebook and a pandas DataFrame to process them in a loop with an XML tree parser. Each file takes around 1.67 sec, which adds up to about 6 hrs for all files. That is quite high for a daily job.
Is there a better way to achieve good performance? Can a PySpark DataFrame be faster than a pandas DataFrame? Also, would combining all the XML files into one big file and then processing it with pandas be faster?
Any suggestions would be appreciated.
Thank you
Avani

You can try the steps below to improve performance:
Use High Concurrency clusters:
The key benefits of High Concurrency clusters are that they provide fine-grained sharing for maximum resource utilization and minimum query latencies.
Enable autoscaling.
All-Purpose cluster - On the Create Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box:
Job cluster - On the Configure Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box:
Configure the min and max workers.
When the cluster is running, the cluster detail page displays the number of allocated workers. You can compare the number of allocated workers with the worker configuration and adjust as needed.
Refer - https://docs.databricks.com/clusters/configure.html#high-concurrency-clusters
EDIT -
Can PySpark df be faster compared to pandas Df?
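pandas runs operations on a single machine, whereas PySpark distributes them across multiple machines. For a workload like this PySpark is a good fit and can process operations many times faster than pandas (figures of up to 100x are often quoted), although the actual speedup depends on the cluster size and on how well the work parallelizes.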
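As a rough sketch of the parallel approach: the per-file parsing can be written as a plain function and fanned out over worker processes. The element name `record`, the child tags and the helper names are assumptions for illustration, not the asker's schema. On a Spark cluster the same `parse_xml_text` function could be applied with `flatMap` to the result of `SparkContext.wholeTextFiles` instead of a process pool.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def parse_xml_text(text):
    """Turn one XML document (str or bytes) into a list of flat row dicts.

    Assumes a hypothetical layout:
    <records><record><id/><value/></record>...</records>
    """
    root = ET.fromstring(text)
    rows = []
    for rec in root.iter("record"):
        rows.append({child.tag: child.text for child in rec})
    return rows


def parse_xml_file(path):
    """Read and parse a single file; runs inside a worker process."""
    # Reading bytes lets the parser honour the file's own encoding declaration.
    return parse_xml_text(Path(path).read_bytes())


def parse_many(paths, max_workers=None):
    """Parse many files and return all rows in one list."""
    paths = list(paths)
    if max_workers == 1:
        per_file = [parse_xml_file(p) for p in paths]  # sequential fallback
    else:
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            # chunksize batches small files to cut inter-process overhead
            per_file = list(pool.map(parse_xml_file, paths, chunksize=64))
    return [row for rows in per_file for row in rows]
```

The combined list can then be handed to `pandas.DataFrame(rows)` once, rather than building a DataFrame per file inside the loop.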

Related

Suggestion for Non Analytical Distributed Processing Frameworks

Can someone please suggest a tool, framework or a service to perform the below task faster.
Input : The input to the service is a CSV file which consists of an identifier and several image columns with over a million rows.
Objective: To check whether any image column in a row meets the minimum resolution, and to create a new boolean column for every row according to the result.
True - if any image in the row meets the min resolution
False - if no image in the row meets the min resolution
Current Implementation: Python script with pandas and multiprocessing running on a large VM(60 Core CPU) which takes about 4 - 5 Hours. Since this is a periodic task we schedule and manage it with Cloud Workflow and Celery Backend.
Note: We are looking to cut down on costs as uptime of server is just about 4-6Hrs a day. Hence 60 Core CPU 24*7 would be a lot of resources wasted.
Options Explored:
We have ruled out Cloud Run due to the memory, cpu and timeout limitations.
Apache Beam with Cloud Dataflow: there seems to be less support for non-analytical workloads, and the DataFrame implementation in Apache Beam still looks buggy.
Spark and Dataproc seem to be good for analytical workloads, although a serverless option would be much preferred.
Which direction should I be looking in?
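Whichever framework ends up running it, the per-row check itself is small. A minimal sketch of the objective described above, assuming a hypothetical `get_size` callable that returns `(width, height)` for an image reference (or `None` when the image cannot be read) and illustrative column names. Fetching image headers is mostly I/O-bound, so this sketch uses a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def meets_min_resolution(size, min_width, min_height):
    """True if a (width, height) pair reaches both minimums."""
    if size is None:  # unreadable or missing image
        return False
    width, height = size
    return width >= min_width and height >= min_height


def row_flag(row, image_columns, get_size, min_width, min_height):
    """True if ANY image column in the row meets the minimum resolution."""
    # any() short-circuits, so no further images are fetched after the first hit.
    return any(
        meets_min_resolution(get_size(row[col]), min_width, min_height)
        for col in image_columns
        if row.get(col)  # skip empty cells
    )


def flag_rows(rows, image_columns, get_size, min_width, min_height, workers=32):
    """Return one boolean per row, in input order."""
    def check(row):
        return row_flag(row, image_columns, get_size, min_width, min_height)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, rows))
```

Because each row is independent, the same `row_flag` function could be mapped over chunks of the CSV by whatever service is chosen.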

Which is a more efficient orchestrating mechanism, chaining Databricks notebooks together or using Apache Airflow?

The data size is in the terabytes.
I have multiple Databricks notebooks for incremental data load into Google BigQuery for each dimension table.
Now, I have to perform this data load every two hours i.e. run these notebooks.
What is a better approach among the following:
Create a master Databricks notebook and use dbutils to chain/parallelize the execution of the aforementioned Databricks notebooks.
Use Google Composer (Apache Airflow's Databricks Operator) to create a master DAG to orchestrate these notebooks remotely.
I want to know which is better approach when I have use cases for both parallel execution and sequential execution of said notebooks.
I'd be extremely grateful if I could get a suggestion or opinion on this topic, thank you.
Why not try Databricks Jobs? A job lets you run a notebook either immediately or on a scheduled basis.
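For comparison, the master-notebook option from the question can be sketched as a list of stages: stages run one after another, and the notebooks inside a stage run in parallel. The `run_notebook` callable and the notebook paths are assumptions; inside Databricks the callable would typically wrap `dbutils.notebook.run`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stages(stages, run_notebook, max_parallel=4):
    """Run stages sequentially; run the notebooks within each stage in parallel.

    stages: list of lists of notebook paths.
    run_notebook: callable(path) -> result; expected to raise on failure.
    Returns {path: result}. A failure in one stage stops the later stages.
    """
    results = {}
    for stage in stages:
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            futures = {path: pool.submit(run_notebook, path) for path in stage}
            # Leaving the `with` block waits for every notebook in the stage.
        for path, future in futures.items():
            results[path] = future.result()  # re-raises a notebook's exception
    return results


# Hypothetical usage inside a Databricks notebook (1 hour timeout, no arguments):
# run_stages([["/load/dim_a", "/load/dim_b"], ["/load/fact"]],
#            lambda path: dbutils.notebook.run(path, 3600, {}))
```

This covers both parallel and sequential execution in one place, but scheduling, retries and alerting still have to come from elsewhere, which is what a job scheduler or Airflow provides.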

Using pyspark on AWS EMR

I am new to both PySpark and AWS EMR. I have been given a small project where I need to scrub large amounts of data files every hour and build aggregated data sets based on them. These data files are stored on S3, and I can use some of the basic functions in Spark (like filter and map) to derive the aggregated data. To save on egress costs, and after performing a cost-benefit analysis, I decided to create an EMR cluster and make pyspark calls. The concept is working fine using Lambda functions triggered when a file is created in the S3 bucket. I am writing the output files back to S3.
But I am not able to comprehend the need for the 3 node EMR cluster I created and its use for me. How can I use the Hadoop file system to my advantage here and all the storage that is made available on the nodes?
How do I view (if possible) the utilization of the slave/core nodes in the cluster? How do I know whether they are used, how often, etc.? I am executing the pyspark code on the master node.
Are there alternatives to EMR that I can use with pyspark?
Is there any good documentation available to get a better understanding?
Thanks
Spark is a framework for distributed computing. It can process larger-than-memory datasets by splitting the workload into chunks that run on multiple workers in parallel. By default EMR creates 1 master node and 2 worker nodes. The disk space on the Spark nodes is typically not used directly, but Spark can use it to cache temporary results.
To use a Hadoop filesystem, you need to start an HDFS service in AWS.
However, S3 is also distributed storage, and it is supported by the Hadoop libraries. Spark on EMR ships with the Hadoop drivers and supports S3 out of the box. Using Spark with S3 is a perfectly valid storage solution and will be good enough for a lot of basic data processing tasks.
There is a Spark manager UI in AWS EMR. You can see each running Spark application session and the current job. By clicking on the job you can see how many executors are used. Whether those executors run on all nodes depends on your Spark memory and CPU configuration. Tuning those is a really big topic. There are good hints here on SO.
There is also a hardware monitoring tab, showing cpu and memory usage for each node.
The spark code is always executed on the master node. But it just creates a DAG plan on that node and shifts the actual work to the worker nodes according to the plan. Hence the guides speak of submitting the spark application rather than executing.
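To make the point about memory and CPU configuration concrete: how many executors can land on one worker follows from simple arithmetic on cores and memory. This is a back-of-the-envelope sketch, not EMR's scheduler. The node sizes are hypothetical, and the overhead rule (the larger of 384 MiB or 10% of executor memory) is Spark's default for `spark.executor.memoryOverhead`.

```python
def executors_per_node(node_cores, node_mem_mib, executor_cores, executor_mem_mib,
                       overhead_fraction=0.10, min_overhead_mib=384):
    """Estimate how many executors fit on one worker node.

    node_mem_mib should be the memory YARN may hand out on the node,
    which is less than the instance's physical RAM.
    """
    overhead = max(min_overhead_mib, int(executor_mem_mib * overhead_fraction))
    container_mem = executor_mem_mib + overhead  # what YARN must reserve
    by_cores = node_cores // executor_cores
    by_memory = node_mem_mib // container_mem
    return min(by_cores, by_memory)


# Hypothetical worker: 8 vCPUs, 24 GiB available to YARN.
# 2-core / 4 GiB executors: 4 fit by cores, 5 by memory.
print(executors_per_node(8, 24 * 1024, 2, 4 * 1024))   # 4
# 2-core / 16 GiB executors: memory becomes the limit.
print(executors_per_node(8, 24 * 1024, 2, 16 * 1024))  # 1
```

If the estimate multiplied by the number of workers is below the executor count you asked for, some requested executors will never start; if the executors are large, a node may sit partly idle. Note that YARN can be configured to schedule on memory only, in which case the core limit here is advisory.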
Yes. You can start your own Spark cluster on normal EC2 instances. There is even a standalone mode, which allows you to start Spark on only one machine. That installs quite a large footprint, and you still need to tune the memory, CPU and executor settings, so it is more complex than just implementing some multiprocessing in Python or using Dask. However, there are valid reasons to do so: it lets you use all cores on one machine, and it lets you use a well-known, well-documented API, the same one that can be used to process petabytes of data. The linked article above explains the motivation.
Another possibility is to use AWS Glue. It is serverless Spark. The service will submit your jobs to some on-demand Spark nodes on AWS over which you have no control, similar to how Lambda functions run on random AWS EC2 instances. However, Glue has some limitations. With pyspark on Glue, you cannot install Python libs with C extensions, e.g. numpy, pandas and most ML libs. Glue also forces you to create a schema mapping of your data in the Athena catalog, whereas standalone Spark can just process the data on the fly.
Databricks also offers a separate serverless spark solution outside of AWS. It is more sophisticated in my opinion. It also allows custom c-extensions.
A big part of the official documentation focuses on the different data processing APIs and not on the internals of Apache Spark. There are some good notes on Spark internals on GitHub. I assume every good book will cover some of the inner workings of Spark. AWS EMR is just an automated Spark cluster with the YARN orchestrator. (Unfortunately I have never read a good book on Spark; I picked up information here and there, so I cannot recommend one.)

Load a huge data from BigQuery to python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution.
I have a very large table in BigQuery (assume 20 million rows inserted per day). I want to have around 20 million rows of data with around 50 columns in python/pandas/dask to do some analysis. I have tried the bqclient, pandas-gbq and BQ Storage API methods, but it takes 30 min to get 5 million rows into Python. Is there any other way to do this? Is there even a Google service available to do a similar job?
Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:
Export + Download:
bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
Load into Dask:
>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv")
Hope it helps.
First, you should profile your code to find out what is taking the time. Is it just waiting for big-query to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it do you use? Is it the parsing of that data into memory?
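A lightweight way to do that profiling is to time each stage separately. A minimal sketch using only the standard library; the stage names and the placeholder work inside each block are hypothetical and stand in for the real query, download and parse steps.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Add the wall-clock seconds spent inside the block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


with timed("query"):
    time.sleep(0.01)  # placeholder: waiting for the query job to finish
with timed("download"):
    time.sleep(0.02)  # placeholder: transferring the result
with timed("parse"):
    rows = [str(i) for i in range(1000)]  # placeholder: building the dataframe

# Slowest stage first.
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:10s} {seconds:.3f}s")
```

Whichever stage dominates tells you whether to change the query, the transfer path or the in-memory format.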
Since you can make SQLAlchemy support big-query ( https://github.com/mxmzdlv/pybigquery ), you could try to use dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. In case big-query is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
Experiment!
Some options:
Try to do aggregations etc. in BigQuery SQL before exporting (a smaller table) to Pandas.
Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery dataset. That way, network overhead is minimized.
Probably you want to export the data to Google Cloud Storage first, and then download the data to your local machine and load it.
Here are the steps you need to take:
Create an intermediate table which will contain the data you want to export. You can run a select and store the result in the intermediate table.
Export the intermediate table to Google Cloud Storage, to JSON/Avro/Parquet format.
Download your exported data and load to your python app.
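The export and download steps can be scripted. The sketch below only builds the `bq extract` and `gsutil cp` command lines and prints them by default; the table, bucket and directory names are placeholders taken from the earlier answer, and Parquet is chosen here as one of the formats mentioned above.

```python
import subprocess


def export_commands(table, bucket_prefix, local_dir,
                    location="US", fmt="PARQUET", extension="parquet"):
    """Build the argv lists for exporting a table to GCS and downloading it."""
    uri = f"{bucket_prefix}/data-*.{extension}"  # wildcard lets BigQuery shard the export
    extract = ["bq", f"--location={location}", "extract",
               f"--destination_format={fmt}", table, uri]
    download = ["gsutil", "-m", "cp", uri, local_dir]
    return [extract, download]


def run_all(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing step


run_all(export_commands("dataset.tablename", "gs://mystoragebucket", "/my/local/dir/"))
```

Once the files are local, something like `dask.dataframe.read_parquet` on the download directory would load them; Parquet keeps column types, which usually makes it faster to load than CSV.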
Besides downloading the data to your local machine, you can leverage the processing using PySpark and SparkSQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster and load the data from Google Cloud Storage to Spark, and do analysis there.
You can read the example here
https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
and you can also spin up Jupyter Notebook in the Dataproc cluster
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
Hope this helps.
A couple years late, but we're developing a new dask_bigquery library to help easily move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!

Use spark RDD as a source of data in a REST API

There is a graph that is computed on Spark and stored in Cassandra. There is also a REST API with an endpoint that returns a graph node with its edges and the edges of those edges.
This second-degree graph may include up to 70,000 nodes. We currently use Cassandra as the database, but extracting a lot of data by key from Cassandra takes a lot of time and resources. We tried TitanDB, Neo4j and OrientDB to improve performance, but Cassandra showed the best results.
Now there is another idea: persist the RDD (or maybe a GraphX object) in the API service and, on each API call, filter the necessary data from the persisted RDD.
I guess that it will work fast while the RDD fits in memory, but if it is cached to disk it will behave like a full scan (e.g. a full scan of a parquet file).
I also expect that we will face these issues:
memory leaks in Spark;
updating this RDD (unpersist the previous one, read and persist the new one) will require stopping the API;
concurrent use of this RDD will require managing CPU resources manually.
Does anybody have such experience?
Spark is NOT a storage engine. Unless you need to process a large amount of data on each call, you should consider:
In-memory data grids - Hazelcast, Apache Ignite, Coherence, GigaSpaces, etc.
Cassandra in-memory - https://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/inMemory.html
Search for an "in-memory" option in other frameworks/databases.
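To illustrate why a keyed in-memory structure suits this lookup better than filtering a persisted RDD: the second-degree neighbourhood is two rounds of dictionary lookups, and a refresh can swap in a new snapshot without stopping the API. This is a minimal single-process sketch with hypothetical names; a data grid such as those listed above would spread the same idea across nodes.

```python
import threading


class GraphStore:
    """Holds an adjacency map {node: set(neighbours)} that can be swapped at runtime."""

    def __init__(self, adjacency):
        self._adjacency = adjacency
        self._lock = threading.Lock()

    def replace(self, new_adjacency):
        """Publish a freshly computed graph; in-flight readers keep the old snapshot."""
        with self._lock:
            self._adjacency = new_adjacency

    def second_degree(self, node):
        """Return (edges of node, edges of its neighbours) as sets of (src, dst) pairs."""
        adj = self._adjacency  # one consistent snapshot for the whole request
        first = {(node, n) for n in adj.get(node, ())}
        second = {(n, m) for _, n in first for m in adj.get(n, ())}
        return first, second


store = GraphStore({"a": {"b", "c"}, "b": {"d"}, "c": {"a"}})
first, second = store.second_degree("a")
print(sorted(first))   # [('a', 'b'), ('a', 'c')]
print(sorted(second))  # [('b', 'd'), ('c', 'a')]
```

Each request costs lookups proportional to the size of the returned neighbourhood rather than a scan of the whole graph, and the Spark job only has to call `replace` (or the grid's equivalent bulk load) when a new graph is ready.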