How to get accurate percentiles for big data in PIG?

How to get accurate percentiles for big data in PIG? - apache-pig

I'm trying to find percentiles in PIG and for that I'm using datafu-1.2.0.jar package, but my answers are not matching. I found out that all the data is pushed to a single reducer per key so I think that's why it is not working for me (file is 10.5 GB). Is there a way to solve this problem without having to partition my data as in I want to find percentiles for entire record using group all?

Related

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single node Hadoop and I wrote a job to output a sorted dataframe with only one partition to one single csv file. And I discovered several outcomes when using repartition differently.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of in an overall manner.
Then, I tried to discard repartition function, but the output was only a part of the records. I realized without using repartition spark will output 200 CSV files instead of 1, even though I am working on a one partition dataframe.
Thus, what I did next were placing repartition(1), repartition(1, "column of partition"), repartition(20) function before orderBy. Yet output remained the same with 200 CSV files.
So I used the coalesce(1) function before orderBy, and the problem was fixed.
I do not understand why working on a single partitioned dataframe has to use repartition and coalesce, and how the aforesaid processes affect the output. Grateful if someone can elaborate a little.

Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like sort in your case, it triggers something called a shuffle operation
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe to spark.sql.shuffle.partitions partitions.
I also struggled with the same problem as you do and did not find any elegant solution.

Spark generally doesn’t have a great concept of ordered data, because all your data is split accross multiple partitions. And every time you call an operation that requires a shuffle your ordering will be changed.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger
As Miroslav points out your data gets shuffled between partitions every time you trigger what’s called a shuffle stage (this is things like grouping or join or window operations)
You can set the number of shuffle partitions in the spark Config - the default is 200
Calling repartition before a group by operation is kind of pointless because spark needs to reparation your data again to execute the groupby
Coalesce operations sometimes get pushed into the shuffle stage by spark. So maybe that’s why it worked. Either that or because you called it after the groupby operation
A good way to understand what’s going on with your query is to start using the spark UI - it’s normally available at http://localhost:4040
More info here https://spark.apache.org/docs/3.0.0-preview/web-ui.html

Pyspark - identical dataframe filter operation gives different output

I’m facing a particularly bizarre issue while firing filter queries on a spark dataframe. Here's a screenshot of the filter command I'm trying to run:
As you can see, I'm trying to run the same command multiple times. Each time, it's giving a different number of rows. It is actually meant to return 6 records, but it ends up showing a random number of records every time.
FYI, The underlying data source (from which I'm creating the dataframe) is an Avro file in a Hadoop data lake.
This query only gives me consistent results if I cache the dataframe. But this is not always possible for me because the dataframe might be very huge and hence would choke up memory resources if I cache it.
What might be the possible reasons for this random behavior? Any advice on how to fix it?
Many thanks :)

Google Bigquery results

I am getting a part of result from the Bigquery API.
Earlier, I solved the issue of 1,00,000 records per result using iterators.
However, now I'm stuck at some other obstacle.
If I take more than 6-7 columns in a result, I do not get the complete set of result.
However, if I take a single column, I get the complete result.
Can there be a size limit as well for results in Bigquery API ?

There are some limits for Query Job
In particular
Maximum response size — 128 MB compressed
Of course, it is unlimited when writing large query results to a destination table (and then reading from there)

Why is pyspark so much slower in finding the max of a column?

Is there a general explanation, why spark needs so much more time to calculate the maximum value of a column?
I imported the Kaggle Quora training set (over 400.000 rows) and I like what spark is doing when it comes to rowwise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value.
I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
I also tried df.toPandas() and then calculate the max in pandas (you guessed it, df.toPandas took a long time.)
The only thing I did ot try yet is the RDD way.
Before I provide some test code (I have to find out how to generate dummy data in spark), I'd like to know
can you give me a pointer to an article discussing this difference?
is spark more sensitive to memory constraints on my computer than pandas?

As #MattR has already said in the comment - you should use Pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you encounter MemoryError with Pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has an overhead, because it needs to split your data set first, then process those distributed chunks, then process and join "processed" data, collect it on one node and return it back to you.

#MaxU, #MattR, I found an intermediate solution that also makes me reassess Sparks laziness and understand the problem better.
sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
In testing this I noticed that Spark is even lazier then expected, so this part of my original post ' I like what spark is doing when it comes to rowwise feature extraction' boils down to 'I like that Spark is doing nothing quite fast'.
On the other hand a lot of the time spent on calculating the maximum of the column has most presumably been the calculation of the intermediate values.
Thanks for yourinput and this topic really got me much further in understanding Spark.

How to preserve Google Cloud Storage rows order in compressed files

We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows and we export it to Google Cloud Storage which results in several compressed files.
These files are downloaded and we have a Python application that loops through them to make some calculations using the correlations.
We tried then to make use of the fact that our first columns of SKUs is already ordered and not have to apply this ordering inside of our application.
But then we just found that the files we get from GCS changes the order in which the skus appear.
It looks like the files are created by several processes reading the results and saving it in different files, which breaks the ordering we wanted to maintain.
As an example, if we have 2 files created, the first file would look something like that:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.0848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.328
sku_1,sku_3,0.736
This is an example of what it looks like two processes reading the results and each one saving its current row on a specific file which changes the order of the column.
So we looked for some flag that we could use to force the preservation of the ordering but couldn't find any so far.
Is there some way we could use to force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance,

As far I know there is no flag to maintain order.
As a workaround you can rethink your data output to use of NESTED type, and make sure that what you want to group together are converted in NESTED rows, and you can export to JSON.

is there some workaround?
As an option - you can move your processing logic from Python to BigQuery, thus eliminating moving data out of BigQuery to GCS.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas