I'm new to PySpark and I wonder what the best practice is for exporting a large amount of data from Hive using PySpark.
I have a SQL query that retrieves a huge amount of data/results, and I want to export it to another machine for further processing.
What is the quickest way to do this?
I know that I can use collect(), but since this is a huge amount of data I'll run out of memory pretty fast...
Approach 1: My input data is a bunch of JSON files. After preprocessing, the output is a pandas dataframe, which is written to an Azure SQL Database table.
Approach 2: I implemented a delta lake, where the output pandas dataframe is converted to a Spark dataframe and the data is then inserted into a partitioned Delta table. The process is simple, and the time required to convert the pandas dataframe to a Spark dataframe is in milliseconds. But the performance compared to Approach 1 is bad: with Approach 1, I am able to finish in less than half of the time required by Approach 2.
I tried different optimization techniques like Z-ordering, compaction (bin-packing), and using insertInto rather than saveAsTable, but none of them really improved the performance.
Please let me know if I have missed any performance tuning methods. If there are none, I am curious to know why Delta Lake did not perform better than the pandas + database approach. I am also happy to hear about other, better approaches; for example, I came across Dask.
Many thanks for your answers in advance.
Regards,
Chaitanya
You don't give enough information to answer your question. What exactly is not performant? The whole process of data ingestion?
Z-ordering doesn't give you an advantage while you are writing data into the delta lake; if anything, it is more likely to slow that down. It gives you an advantage when you are reading the data back afterwards. Z-ordering by, for example, an ID column tries to store rows with the same ID in the same file(s), which enables Spark to use data skipping to avoid reading unnecessary data.
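As a plain-Python sketch of the data-skipping idea that Z-ordering enables (the file names and statistics here are made up; this is not Delta Lake's actual implementation):

```python
# Each file in the table carries min/max statistics for the id column,
# the way Delta Lake records per-file column stats in the transaction log.
files = [
    {"name": "part-0", "min_id": 1,   "max_id": 100},
    {"name": "part-1", "min_id": 101, "max_id": 200},
    {"name": "part-2", "min_id": 201, "max_id": 300},
]

def files_to_read(files, wanted_id):
    """Skip any file whose [min, max] range cannot contain wanted_id."""
    return [f["name"] for f in files
            if f["min_id"] <= wanted_id <= f["max_id"]]

# With well-clustered (Z-ordered) data, a point lookup touches one file
# instead of scanning all of them.
matching = files_to_read(files, 150)
```

If the data were not clustered, every file's min/max range would likely span the whole ID domain, and no file could be skipped; that is the advantage Z-ordering buys at read time.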
Also, how big is your data actually? If we are talking about just a few GBs of data at the end, pandas and a traditional database will perform faster.
I can give you an example:
Let's say you have a daily batch job that processes 4 GB of data. If it's just about processing those 4 GB to store them somewhere, Spark will not necessarily perform faster, as I already mentioned.
But now consider that job running for a year, which gives you ~1.5 TB of data at the end of the year. Now you can perform analytics on the entire history of data, and in this scenario you will probably be much faster than a database and pandas.
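The year-end figure is simple arithmetic; as a quick sanity check of the numbers above:

```python
# Back-of-the-envelope check of the yearly volume mentioned above.
daily_gb = 4
days_per_year = 365

total_gb = daily_gb * days_per_year   # 1460 GB
total_tb = total_gb / 1024            # ~1.43 TB, i.e. roughly 1.5 TB
```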
As a side note: you say you are reading in a bunch of JSON files, converting them to pandas, and then to Delta Lake.
If there is no specific reason to do otherwise, in approach 2 I would just use:
spark.read.json("path")
to avoid converting from pandas to Spark dataframes altogether.
I have a use case: designing storage for 30 TB of text files as part of deploying a data pipeline on Google Cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which of the options below would be better for this use case?
Using Cloud Storage for storage and linking permanent tables in BigQuery for querying, or using Cloud Bigtable for storage and installing HBase Shell on Compute Engine to query the Bigtable data.
Based on my analysis below for this specific use case, I see that Cloud Storage can be queried through BigQuery, and that Bigtable supports CSV imports and querying. The BigQuery quotas also mention a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro, which I assume means I could run multiple load jobs if loading more than 15 TB.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above use case?
The short answer is yes.
I wrote about this in:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And when loading, cluster your tables for massive cost improvements on the most common queries:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
In summary:
BigQuery can read CSVs and other files straight from GCS.
You can define a view that parses those CSVs in any way you might prefer, all within SQL.
You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables for better performance and costs.
Instead of CREATE TABLE you can do imports via the API; those are free (unlike CREATE TABLE, which incurs the cost of a query).
15 TB can be handled easily by BigQuery.
Say that in a Dataflow/Apache Beam program I am trying to read a table whose data is growing exponentially, and I want to improve the performance of the read.
BigQueryIO.Read.from("projectid:dataset.tablename")
or
BigQueryIO.Read.fromQuery("SELECT A, B FROM [projectid:dataset.tablename]")
Will the performance of my read improve if I select only the required columns, rather than the entire table, as above?
I am aware that selecting fewer columns reduces cost, but I would like to know about the read performance in the cases above.
You're right that referencing only the columns you need, instead of all of them, will reduce cost. Also, when you use from() instead of fromQuery(), you don't pay for any table scans in BigQuery; I'm not sure if you were aware of that or not.
Under the hood, whenever Dataflow reads from BigQuery, it actually calls BigQuery's export API and instructs BigQuery to dump the table(s) to GCS as sharded files. Dataflow then reads those files into your pipeline in parallel. It does not read "directly" from BigQuery.
As such, yes, this should improve performance, because the amount of data that needs to be exported to GCS under the hood and read into your pipeline will be smaller: fewer columns = less data.
However, I'd also consider using partitioned tables, and then even think about clustering them too. Also, use WHERE clauses to even further reduce the amount of data to be exported and read.
Given a 1-terabyte data set that comes from the sources in a couple hundred CSV files, and divides naturally into two large tables, what's the best way to store the data in Google Cloud Storage? Partitioning by date does not apply, as the data is relatively static and only updated quarterly. Is it best to combine all of the data into two large files and map each to a BigQuery table? Is it better to partition? If so, on what basis? Is there a threshold file size above which BigQuery performance degrades?
Depending on the use case:
To query data => then load it into BigQuery from GCS.
To store the data => leave it in GCS.
Question: "I want to query, and have created a table in BigQuery, but only with a subset of the data, totaling a few GB. My question is: if I have a TB of data, should I keep it in one giant file in GCS, or should I split it up?"
Answer: Just load it all into BigQuery. BigQuery eats TB's for breakfast.
I have 5 GB of data in my HDFS sink. When I run any query on Hive, it takes more than 10-15 minutes to complete. The number of rows I get when I run
select count(*) from table_name
is 3,880,900. My VM has 4.5 GB of memory and runs on a 2012 MBP. I would like to know whether creating an index on the table will give any performance improvement. Also, are there any other ways to tell Hive to use only a limited amount of data or rows, so as to get results faster? I am OK even if the queries run over a smaller subset of the data, at least to get a glimpse of the results.
Yes, indexing should help. However, getting a subset of data (using LIMIT) isn't really helpful, as Hive still scans the whole data set before limiting the output.
You can try using the RCFile/ORCFile formats for faster results. In my experiments, RCFile-based tables executed queries roughly 10 times faster than textfile/sequence-file-based tables.
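Why columnar formats speed up analytic queries can be sketched in plain Python (a simplified illustration of the idea, not the actual RCFile/ORC on-disk layout):

```python
# Row layout: each record is stored together. A query like
# SELECT sum(amount) still touches every field of every row.
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]

# Columnar layout (the idea behind RCFile/ORC): each column is stored
# contiguously, so an aggregate reads only the column it needs.
columns = {
    "id":     [1, 2, 3],
    "name":   ["a", "b", "c"],
    "amount": [10.0, 20.0, 30.0],
}

fields_scanned_row_layout = sum(len(r) for r in rows)   # 9 fields touched
total = sum(columns["amount"])                          # only 3 values read
```

With wide tables and millions of rows, reading one column instead of all of them is where the ~10x speedups come from.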
Depending on the data you are querying, you can get gains by using different file formats like ORC or Parquet. What kind of data are you querying: structured or unstructured? What kinds of queries are you trying to perform? If it is structured data, you can also see gains by using other SQL-on-Hadoop solutions such as InfiniDB, Presto, Impala, etc.
I am an architect for InfiniDB
http://infinidb.co
SQL-on-Hadoop solutions like InfiniDB, Impala and others work by having you load your data through them, at which point they perform calculations, optimizations, etc. to make that data faster to query. This helps tremendously for interactive analytical queries, especially when compared to something like Hive.
That said, you are working with 5 GB of data (but data always grows! someday it could be TBs), which is pretty small, so you can still work with some of the tools that are not intended for high-performance queries. Your best option with Hive is to look at how your data is laid out and see whether ORC or Parquet could benefit your queries (columnar formats are good for analytic queries).
Hive is always going to be one of the slower options for performing SQL queries on your HDFS data, though. Hortonworks, with their Stinger initiative, is making it better; you might want to check that out.
http://hortonworks.com/labs/stinger/
The use case sounds like a fit for ORC or Parquet if you are interested in a subset of the columns. ORC with Hive 0.12 comes with predicate pushdown (PPD), which helps you discard blocks while running queries, using the metadata that it stores for each column.
We did an implementation on top of Hive to support bloom filters in the metadata indexes for ORC files, which gave a performance gain of 5-6x.
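The bloom-filter idea can be sketched in a few lines of plain Python; this is a toy illustration of the concept, not ORC's actual metadata format:

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: a sketch of the file-skipping idea, not ORC's implementation."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes pseudo-independent bit positions from the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means the value is definitely absent, so the reader can
        # skip the whole file; True means "maybe present, read it".
        return all(self.bits[pos] for pos in self._positions(item))

# One filter per file, built over a column's values at write time
# (the values here are made up):
file_filter = BloomFilter()
for value in ["user_1", "user_2", "user_42"]:
    file_filter.add(value)
```

At query time, a point lookup first probes each file's filter and only reads the files whose filter answers "maybe"; the definite-no answers are what produce the large skipping gains.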
What is the average number of mapper/reducer tasks launched for the queries you execute? Tuning some parameters can definitely help.