According to the documentation, the percentile function should return the exact percentile for any numeric column. At least when my input consists of floating-point values, this is not true.
The Hive docs say the percentile function only works for integers. I don't understand the relation between Spark SQL and Hive, but it seems Spark doesn't just run Hive - otherwise it wouldn't have changed its docs. Also, Spark's percentile has a different signature that accepts a frequency parameter, and I have no idea what purpose it serves.
This is an example with unexpected output:
from pyspark.sql import functions as sf
d = spark.createDataFrame([[35.138071000000004], [34.119932999999996], [34.487992]], ['a'])
d.select(sf.expr('percentile(a, array(0.25,0.5,0.75,0.9,0.95)) AS res')).collect()
Out[1]: [Row(res=[34.3039625, 34.487992, 34.8130315, 35.0080552, 35.0730631])]
If I switch the sf.expr content to percentile_approx with a high accuracy, or use a high frequency in the current method, I get a reasonable output.
Could you explain what's happening?
Also:
Can you please explain or point me to some resource about the relation between Spark SQL and Apache Hive?
Where is the code that Spark SQL commands run?
Thanks
There is no direct relation between Spark and Hive except Spark's ability to retrieve metadata from Hive MetaStore regarding databases, tables and views defined in Hive. You can get familiar with Spark by reading its online documentation.
Spark SQL is a completely independent (from Hive) implementation of the SQL language, written in Scala. Spark SQL is one of Spark's modules and uses the Spark cluster-computing platform. Along with the other Spark modules, it can run on Spark's own cluster (aka standalone), or make use of YARN or Mesos.
Specifically, the percentile function in Spark SQL, according to the Spark SQL documentation...
Returns the exact percentile value of numeric column col at the given
percentage. The value of percentage must be between 0.0 and 1.0. The
value of frequency should be positive integral.
EDIT
The frequency parameter was added to the percentile function as part of SPARK-18940, to make it possible to optionally supply an extra column (generally speaking, an expression) that contains the frequency distribution of the analyzed values. The default value is frequency = 1L.
There is a follow-up SPARK-27929 that will relax the requirement to have it as type Long.
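As a rough illustration in Scala (the freq column and its values are made up here, reusing the numbers from the question): with the three-argument form, each value of a is treated as if it occurred freq times when the exact percentile is computed.
import spark.implicits._

// Sketch only: freq is a hypothetical count column holding how many times each value occurs.
val d = Seq((34.119932999999996, 1L), (34.487992, 3L), (35.138071000000004, 2L))
  .toDF("a", "freq")

d.selectExpr("percentile(a, array(0.25, 0.5, 0.75), freq) AS res").show(false)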
I want to select a random sample of 'n' bins from a set in the namespace. Is there a way to achieve this in the Aerospike Query Language?
In Oracle, we achieve something similar with the following query:
SELECT * FROM <table-name> sample block(10) where rownum < 101
The above query fetches blocks of size 10 rows from a sample size of 100.
Can we do something similar to this in Aerospike also?
Rows are like records in Aerospike, and columns are like bins. You don’t have a way to sample random columns from a table, do you?
You can sample random records from a set using ScanPolicy.maxRecords added to a scan of that set. Note the new (optional) set indexes in Aerospike version 5.6 may accelerate that operation.
Each namespace has its data partitioned into 4096 logical partitions, and the records in the namespace evenly distributed to each of those using the characteristics of the 20-byte RIPEMD-160 digest. Therefore, Aerospike doesn't have a rownum, but you can leverage the data distribution to sample data.
Each partition is roughly 0.0244% of the namespace. That's a sample space you can use, similar to the SQL query above. Next, if you are using the scanPartitions method of the client, you can give it ScanPolicy.maxRecords to pick a specific number of records out of that partition. Further, you can start after an arbitrary digest (see PartitionFilter.after) if you'd like.
Ok, now let's talk data browsing. Instead of using the aql tool, you could be using the Aerospike JDBC driver, which works with any JDBC compatible data browser like DBeaver, SQuirreL, and Tableau. When you use LIMIT on a SELECT statement it will basically do what I described above - use partition scanning and a max-records sample on that scan. I suggest you try this as an alternative.
AQL is a tool written using the Aerospike C client. Aerospike does not have a SQL-like query language per se that the server understands. Whatever functionality AQL provides is documented - type HELP at the aql> prompt.
You can write an application in C or Java to achieve this. For example, in Java, you can do a scanAll() API call with maxRecords defined in the ScanPolicy. I don't see the AQL tool offering that option for scans. (It just allows you to specify a scan rate, one of the other ScanPolicy options.)
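For illustration, here is a rough sketch of that scanAll() + maxRecords idea, written in Scala on top of the Aerospike Java client so it matches the other snippets in this thread (host, namespace and set names are placeholders, and the exact fields and signatures should be checked against your client version):
import com.aerospike.client.{AerospikeClient, Key, Record, ScanCallback}
import com.aerospike.client.policy.ScanPolicy

val client = new AerospikeClient("127.0.0.1", 3000)  // placeholder host/port

val policy = new ScanPolicy()
policy.maxRecords = 100L  // ask the server to stop the scan after roughly 100 records

client.scanAll(policy, "test", "myset", new ScanCallback {
  override def scanCallback(key: Key, record: Record): Unit = println(record.bins)
})

client.close()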
I'm just wondering if this Spark code
val df = spark.sql("select * from db.table").filter(col("field") === value)
is as efficient as this one:
val df = spark.sql("select * from db.table where field=value")
In the first block, are we loading all the Hive data into RAM, or is Spark smart enough to filter those values in Hive during the execution of the generated DAG?
Thanks in advance!
Whether we apply the filter through DataFrame functions or through Spark SQL on a DataFrame or its view, both will result in the same physical plan (the plan according to which a Spark job is actually executed across the cluster).
The reason behind this is Apache Spark's Catalyst optimiser. It is a built-in feature of Spark which turns input SQL queries or DataFrame transformations into a logical plan and then a cost-optimised physical plan.
You can also have a look at this Databricks link to understand it more clearly. Further, we can check this physical plan using the .explain function. (Caution: .explain's output should be read in the opposite of the conventional way, as its last line represents the start of the physical plan and its first line represents the end.)
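For example, a quick way to verify this yourself (table name, column name, and the literal 'value' below are just placeholders taken from the question) is to compare the optimised plans directly:
import org.apache.spark.sql.functions.col

// Both ways of expressing the filter should produce the same optimised plan;
// sameResult returns true when the two plans are equivalent.
val viaApi = spark.sql("select * from db.table").filter(col("field") === "value")
val viaSql = spark.sql("select * from db.table where field = 'value'")

println(viaApi.queryExecution.optimizedPlan.sameResult(viaSql.queryExecution.optimizedPlan))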
You don't use the same functions, but internally it's the same.
You can use explain() to check the query plan:
spark.sql("select * from db.table").filter(col("field") === value).explain()
spark.sql("select * from db.table where field=value").explain()
In the first case you use a mix of Spark SQL and the Dataset API, with .filter(col("field") === value);
in the second case you use pure SQL.
I am tuning my cluster, which has Hive LLAP. According to the link below, https://community.hortonworks.com/articles/215868/hive-llap-deep-dive.html, I need to calculate the value of heapsize, but I am not sure what the meaning of * is.
I also have a question regarding how to calculate the value for hive.llap.daemon.yarn.container.mb other than the default value given by Ambari.
I have tried calculating the value by treating * as multiplication and setting the container value equal to yarn.scheduler.maximum-allocation-mb; however, HiveServer2 Interactive does not start after tuning.
Here's an excellent wiki article on setting up Hive LLAP in the HDP suite.
https://community.hortonworks.com/articles/149486/llap-sizing-and-setup.html
Your understanding of * is correct; it's used for multiplication.
The rule of thumb here is to set hive.llap.daemon.yarn.container.mb to yarn.scheduler.maximum-allocation-mb, but if your service is not coming up with that value, then I would recommend changing llap_heap_size to 80% of hive.llap.daemon.yarn.container.mb.
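As a worked example of that rule of thumb (the 24 GB figure below is only an assumption for illustration, not a recommendation for your cluster):
// Hypothetical numbers: yarn.scheduler.maximum-allocation-mb = 24576 MB (24 GB).
val yarnMaxAllocationMb = 24576
val llapDaemonContainerMb = yarnMaxAllocationMb            // hive.llap.daemon.yarn.container.mb
val llapHeapSizeMb = (llapDaemonContainerMb * 0.8).toInt   // llap_heap_size, about 19660 MB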
Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column?
I imported the Kaggle Quora training set (over 400,000 rows), and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of the column and divide by that value.
I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
I also tried df.toPandas() and then calculating the max in pandas (you guessed it, df.toPandas() took a long time).
The only thing I did not try yet is the RDD way.
Before I provide some test code (I have to find out how to generate dummy data in Spark), I'd like to know:
Can you give me a pointer to an article discussing this difference?
Is Spark more sensitive to memory constraints on my computer than pandas?
As @MattR has already said in the comments - you should use Pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you encounter a MemoryError with Pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has overhead, because it needs to split your data set first, process those distributed chunks, then join the "processed" data, collect it on one node and return it back to you.
@MaxU, @MattR, I found an intermediate solution that also makes me reassess Spark's laziness and understand the problem better.
sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
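Roughly, the Scala analogue of that accumulator approach would look like the sketch below (pyspark's AccumulatorParam corresponds more or less to AccumulatorV2 here; the column name "question_len" is hypothetical):
import org.apache.spark.util.AccumulatorV2

// Tracks the maximum of all values added to it.
class MaxAccumulator extends AccumulatorV2[Double, Double] {
  private var _max = Double.NegativeInfinity
  override def isZero: Boolean = _max == Double.NegativeInfinity
  override def copy(): MaxAccumulator = { val acc = new MaxAccumulator; acc._max = _max; acc }
  override def reset(): Unit = _max = Double.NegativeInfinity
  override def add(v: Double): Unit = _max = math.max(_max, v)
  override def merge(other: AccumulatorV2[Double, Double]): Unit = _max = math.max(_max, other.value)
  override def value: Double = _max
}

val maxAcc = new MaxAccumulator
spark.sparkContext.register(maxAcc, "columnMax")

// Piggy-back the max computation on a pass over the rows that happens anyway.
df.foreach(row => maxAcc.add(row.getAs[Double]("question_len")))
println(maxAcc.value)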
In testing this I noticed that Spark is even lazier than expected, so the part of my original post 'I like what Spark is doing when it comes to row-wise feature extraction' boils down to 'I like that Spark is doing nothing quite fast'.
On the other hand, a lot of the time spent on calculating the maximum of the column has presumably gone into calculating the intermediate values.
Thanks for your input; this topic really got me much further in understanding Spark.
I am new to Apache Spark SQL in Scala.
How can I find the size of each Row in an Apache Spark SQL DataFrame and discard the rows whose size exceeds a threshold in kilobytes? I am looking for a Scala solution.
This is actually kind of a tricky problem. Spark SQL uses columnar data storage, so thinking in terms of individual row sizes isn't very natural. We can of course call .rdd; from there you can filter the resulting RDD using the techniques from "Calculate size of Object in Java" to determine the object size, and then you can take your RDD of Rows and convert it back to a DataFrame using your SQLContext.
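A minimal sketch of that approach, assuming "size" means the Java-serialized byte size of each Row (an approximation, not Spark's internal columnar footprint), a hypothetical 100 KB threshold, and SparkSession.createDataFrame in place of the older SQLContext:
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.spark.sql.Row

// Approximate a Row's size by Java-serializing it.
def serializedSizeBytes(row: Row): Int = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(row)
  out.close()
  buffer.size()
}

val thresholdKb = 100
val filtered = spark.createDataFrame(
  df.rdd.filter(row => serializedSizeBytes(row) <= thresholdKb * 1024),
  df.schema
)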