How to check every value in array within a range using spark sql? - apache-spark-sql

I want to filter on an array column to check whether every value in the array is greater than 10 or smaller than 5, similar to using cardinality(filter(col, x -> x < 5 or x > 10)) > 0 in Presto. Is it possible to achieve this with pure Spark SQL?

You can use the forall function to achieve this.
Below is a sample filter:
df.filter(F.expr("forall(arr, x -> x < 5 or x > 10)"))
Here I am assuming arr is the column name inside the DataFrame df where each value should either be less than 5 or greater than 10.
Details of forall are in the Spark SQL built-in function documentation.
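Since the question asks for pure Spark SQL, the same predicate can also go directly into a WHERE clause. A minimal sketch, assuming the DataFrame is registered as a temp view named t with the same array column arr (forall requires Spark 3.0+):
df.createOrReplaceTempView("t")   # assumed view name; arr is the array column from above
result = spark.sql("select * from t where forall(arr, x -> x < 5 or x > 10)")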

Related

AttributeError: 'int' object has no attribute 'count' while using itertuples() method with dataframes

I am trying to iterate over rows in a pandas DataFrame using the itertuples() method, which works quite fine for my case. Now I want to check if a specific value ('x') is in a specific tuple. I used the count() method for that, as I need the number of occurrences of 'x' later.
The weird part is that for some tuples that works just fine (i.e. in my case (namedtuple[7].count('x')) + (namedtuple[8].count('x'))), but for some (i.e. namedtuple[9].count('x')) I get an AttributeError: 'int' object has no attribute 'count'.
Would appreciate your help very much!
Apparently, some columns of your DataFrame are of object type (actually strings)
and some of them are of int type (more generally, numbers).
To count occurrences of 'x' in each row, you should
apply a function to each element of the row which:
checks whether the type of the current element is str,
if it is, returns its count('x'),
if not, returns 0 (don't attempt to look for 'x' in a number).
This yields a Series with the number of 'x' occurrences in each column
(separately), so to compute the total for the whole row, this Series should
be summed.
Example of working code:
Test DataFrame:
C1 C2 C3
0 axxv bxy 10
1 vx cy 20
2 vv vx 30
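For completeness, a minimal sketch that reproduces this test DataFrame (assuming pandas is imported as pd; values taken from the table above):
import pandas as pd

df = pd.DataFrame({
    'C1': ['axxv', 'vx', 'vv'],
    'C2': ['bxy', 'cy', 'vx'],
    'C3': [10, 20, 30],
})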
Code:
for ind, row in df.iterrows():
    print(ind, row.apply(lambda it:
        it.count('x') if type(it).__name__ == 'str' else 0).sum())
(in my opinion, iterrows is more convenient here).
The result is:
0 3
1 1
2 1
So as you can see, it is possible to count occurrences of x,
even when some columns are not strings.
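If you prefer to avoid the explicit loop, the same per-row totals can be computed in one expression with apply along axis=1 (a sketch under the same assumptions; isinstance replaces the type-name check):
totals = df.apply(
    lambda row: sum(it.count('x') for it in row if isinstance(it, str)),
    axis=1)   # Series with the count of 'x' per row: 3, 1, 1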

Iteration in Spark SQL DataFrame, getting 1st row value in first iteration and second row value in next iteration and so on

Below is the query that gives the date and distance where the distance is <= 10 km:
var s = spark.sql("select date, distance from table_new where distance <= 10")
s.show()
this will give the output like
12/05/2018 | 5
13/05/2018 | 8
14/05/2018 | 18
15/05/2018 | 15
16/05/2018 | 23
---------- | --
I want to use the first row of the DataFrame s and store the date value in a variable v in the first iteration.
In the next iteration it should pick the second row, and the corresponding date value should replace the old value of v,
and so on.
I think you should look at Spark "Window Functions". You may find what you need there.
The "bad" way to do this would be to collect the dataframe using df.collect() which would return a list of Rows which you can manually iterate over each using a loop.This is bad cause it brings all the data in your driver.
The better way would be to use foreach():
df.foreach(lambda x: <<your code here>>)
foreach() takes a function as an argument and applies it to each row of the DataFrame without bringing all the data into the driver. But you can't overwrite a simple local variable v inside that function, because it runs on the executors; you can use Spark accumulators for such a case.
e.g. if I want to sum all the values in the 2nd column:
counter = sc.accumulator(0)                    # PySpark accumulator (longAccumulator is the Scala/Java API)
df.foreach(lambda row: counter.add(row[1]))    # Row fields are accessed by index in PySpark
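Once the foreach action has completed, the accumulated total can be read back on the driver:
total = counter.value   # sum of the 2nd column, available on the driver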

Using the DataFrame's 'where()' method to select rows where A is greater than 5 or B is greater than 5

Given a Spark DataFrame in a variable t representing a table with two integer columns (A, B), write the expression, using DataFrame columns, to be passed as a parameter to the DataFrame's where() method so that it selects rows where A is greater than 5 or B is greater than 5, using the DataFrame variable and not the col() function.
There are two col functions: one from the Dataset class and one from org.apache.spark.sql.functions. In this simple case, both would work:
t.where(t.col("A").gt(5).or(t.col("B").gt(5))).show() //from dataset
import org.apache.spark.sql.functions._
t.where(col("A").gt(5).or(col("B").gt(5))).show() //from functions
Depending on which of the two you want to avoid, you can take the other one.
If you use Scala, $ also works:
t.where($"A">5 or $"B">5).show
You can also switch entirely to the sql syntax:
t.where("A > 5 or B > 5").show
If filter is allowed, the lambda version would also work:
t.filter(r => r.getInt(0) > 5 || r.getInt(1) > 5).show

Get Maximum Value from DataFrame

I'm running the following code to get the maximum value in a dataframe. It works fine.
p_max_shot1_15_CH8 = corrected_shot1_data[['CH 8 [psi]']][0.0119:0.0122].max()
I would like to use the max value for math, but what comes back is not a plain value but another pandas object (a Series):
CH 8 [psi] 1.419032
dtype: float64
How do I get just the maximum value with no index?
You should be able to get the value without an index by using .values and taking the first element of the resulting array:
p_max_shot1_15_CH8 = corrected_shot1_data[['CH 8 [psi]']][0.0119:0.0122].max().values[0]
Alternatively, select the column with single brackets and pass the slice to the built-in max, e.g.:
p_max_shot1_15_CH8 = max(corrected_shot1_data['CH 8 [psi]'][0.0119:0.0122])
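The root of the issue is the double brackets, which select a one-column DataFrame; with single brackets the selection is a Series, so its .max() already returns a plain float. A sketch, same column and slice as above:
p_max_shot1_15_CH8 = corrected_shot1_data['CH 8 [psi]'][0.0119:0.0122].max()   # scalar, no index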

Using CONTAINS with variables in SQL

OK, so I am trying to reference one variable with another in SQL.
X = a,b,c,d (X is a string variable with a list of things in it)
Y = b (Y is a string variable that may or may not have a value that appears in X)
I tried this:
Case when Y in (X) then 1 else 0 end as aa
But it doesn't work, since it looks for exact matches between X and Y.
I also tried this:
where contains(X,#Y)
but I can't create Y globally since it is a variable that changes in each row of the table (X also changes).
A solution in SAS would also be useful.
Thanks
Maybe LIKE will help:
select *
from t
where X like ('%' + Y + '%')
or:
select case when (X like ('%' + Y + '%')) then 1 else 0 end
from t
SQLFiddle example
In SAS I would use the INDEX function, either in a data step or in PROC SQL. It returns the position within the string at which it finds the character(s), or zero if there is no match, so testing whether the returned value is greater than zero gives a binary 1/0 output. You need to use the COMPRESS function on the variable containing the search characters, as SAS pads the value with blanks.
Data step solution:
aa=index(x,compress(y))>0;
PROC SQL solution:
index(x,compress(y))>0 as aa