Spark - scan dataframe columns based on a value

I'm trying to find a column (I do not know the name of the column) based on a value. For example, in the dataframe below, I'd like to know which column contains yellow for Category = A. The thing is I don't know the column name (colour) in advance, so I can't just do select * where Category = 'A' and colour = 'yellow'. How can I scan the columns and achieve this? Many thanks for your help.
+--------+-------+-----+
|Category|colour |name |
+--------+-------+-----+
|A       |blue   |Elmo |
|A       |yellow |Alex |
|B       |desc   |Erin |
+--------+-------+-----+

You can loop that check through the list of column names. You can also wrap the loop in a function for readability (see the sketch after the snippet below). Please note that the check runs per column, in sequence.
from pyspark.sql import functions as F

cols = df.columns
for c in cols:
    # count the rows where Category is 'A' and this column holds 'yellow'
    cnt = df.where((F.col('Category') == 'A') & (F.col(c) == 'yellow')).count()
    if cnt > 0:
        print(c)
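For example, a minimal sketch of wrapping the loop in a reusable helper (the name find_matching_columns and its parameters are just illustrative, not from the original post):

from pyspark.sql import functions as F

def find_matching_columns(df, filter_col, filter_val, search_val):
    # Return the names of all columns that contain search_val in at least
    # one row where filter_col == filter_val.
    filtered = df.where(F.col(filter_col) == filter_val)
    return [c for c in filtered.columns
            if filtered.where(F.col(c) == search_val).count() > 0]

# e.g. find_matching_columns(df, 'Category', 'A', 'yellow') would return ['colour']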

Related

How to reduce rows to 1 row by concatenation in Azure Log Analytics

string row1
string row2
Is it possible to reduce these rows to 1 row?
The rows should be joined with a comma.
As a result I expect:
string row1, string row2
One possible workaround that can solve the above issue:
To concatenate, you can use, for example, | extend New_Column = strcat(tagname, ",", tagvalue), which joins the two strings with a comma between them.
For example, we tested this in our environment with a tag name and tag value:
resourcecontainers
| where type =~ 'microsoft.resources/subscriptions'
| extend tagname = tostring(bag_keys(tags)[0])
| extend tagvalue = tostring(tags[tagname])
| extend New_Column = strcat(tagname,",", tagvalue) // concatenate the two strings into one value with a comma between them
Here is the sample output for reference:
For more information, please refer to this SO THREAD.
UPDATE: To concatenate the rows, we tried the example code stated in the given SO THREAD, as suggested by @Yoni L.
| summarize result = strcat_array(make_list(word), ",")
Sample output for reference:
Thanks for the tips. In your links I found:
| summarize result = strcat_array(make_list(name_s), ",")

MS SQL Server: Logic to find groups of all dependent records from a table having source and destination items

I am using MS SQL Server DB. I have a specific need to find groups of interdependent items. Visualize a scenario where we have two items in each row: one is the source item and the other is the destination item. Any item can be the source of any item, and the same goes for the destination. We have two columns in the table, 'Source' and 'Destination'. Let's consider the 10 rows in the table below:
Source | Destination
A | B
B | C
C | D
E | A
D | E
X | Y
Y | Z
Z | X
P | Q
R | S
My requirement is to get the distinct groups of items across Source and Destination. Meaning, my query should return the result below with 4 rows (grouped items in comma-separated form):
RowNum| Result
1| A,B,C,D,E
2| X,Y,Z
3| P,Q
4| R,S
Here, the level of hierarchy can go up to n levels. In my example, I kept the first group at 5 items (A to B, B to C, C to D, D to E and E to A, which means 5 different items are involved in this group). But the data may have more items in a single group as well. Also, cyclical records are possible (X to Y, Y to Z and Z to X).
I can achieve this using nested WHILE loops. But, as we have thousands of records, the nested WHILE loop script takes too much time to execute. I iterate an outer loop over each record of the table, and then an inner loop takes the outer loop's record and compares it with all other records.
Can anybody suggest a better way or algorithm to achieve this?
Any help on this would be appreciated.
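This is essentially the connected-components problem, so a disjoint-set (union-find) structure avoids the quadratic record-by-record comparison. Below is a minimal sketch in Python rather than T-SQL, purely to illustrate the algorithm on the sample pairs above (the hard-coded pairs list and names are illustrative); in practice you would read the Source/Destination rows from the table, or port the same idea to a set-based or recursive approach in SQL:

# Illustrative union-find sketch over the sample Source/Destination pairs
pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "A"), ("D", "E"),
         ("X", "Y"), ("Y", "Z"), ("Z", "X"), ("P", "Q"), ("R", "S")]

parent = {}

def find(x):
    # Walk up to the set representative, compressing the path as we go.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for src, dst in pairs:
    union(src, dst)

groups = {}
for item in parent:
    groups.setdefault(find(item), set()).add(item)

for i, members in enumerate(groups.values(), start=1):
    print(i, ",".join(sorted(members)))
# 1 A,B,C,D,E
# 2 X,Y,Z
# 3 P,Q
# 4 R,S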

Call function in pyspark with values from dataframe as strings

I have to call a function func_test(spark, a, b) which accepts two string values and creates a df out of them. spark is a SparkSession variable.
These two string values are two columns of another dataframe and would be different for different rows of that dataframe.
I am unable to achieve this.
Things tried so far:
1.
ctry_df = func_test(spark, df.select("CTRY").first()["CTRY"],df.select("CITY").first()["CITY"])
Gives CTRY and CITY of only the first record of the df.
2.
ctry_df = func_test(spark, df['CTRY'],df['CITY'])
Gives Column<b'CTRY'> and Column<b'CITY'> as values.
Example:
df is:
+----------+----------+-----------+
| CTRY | CITY | XYZ |
+----------+----------+-----------+
| US | LA | HELLO|
| UK | LN | WORLD|
| SN | SN | SPARK|
+----------+----------+-----------+
So, I want the first call to be func_test(spark, US, LA); the second call to be func_test(spark, UK, LN); the third call to be func_test(spark, SN, SN); and so on.
Pyspark - 3.7
Spark - 2.2
Edit 1:
Issue in detail:
func_test(spark, string1, string2) is a function which accepts two string values. Inside this function, a set of various dataframe operations is done. For example, the first Spark SQL in func_test is a normal select, and these two variables string1 and string2 are used in the where clause. The df produced by this Spark SQL becomes a temp table for the next Spark SQL, and so on. Finally, the function func_test(spark, string1, string2) returns a df.
Now, in the main class, I have to call this func_test, and the two parameters string1 and string2 will be fetched from records of a dataframe. So the first func_test call generates the query select * from dummy where CTRY='US' and CITY='LA', and the subsequent operations happen, which results in a df. The second call to func_test becomes select * from dummy where CTRY='UK' and CITY='LN'. The third call becomes select * from dummy where CTRY='SN' and CITY='SN', and so on.
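For concreteness, a minimal sketch of what such a func_test might look like (the table name dummy is taken from the query text above; everything else here is a hypothetical illustration, not the asker's actual code):

def func_test(spark, string1, string2):
    # Hypothetical sketch: filter the registered temp table "dummy"
    # by the two string parameters and return the resulting dataframe.
    df1 = spark.sql(
        "select * from dummy where CTRY = '{0}' and CITY = '{1}'".format(string1, string2)
    )
    # ... further dataframe operations / temp tables would follow here ...
    return df1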
Instead of first(), use collect() and iterate through the rows in a loop:
collect_vals = df.select('CTRY', 'CITY').distinct().collect()
for row_col in collect_vals:
    func_test(spark, row_col['CTRY'], row_col['CITY'])
Hope this helps!

How to filter after group by and aggregate in Spark dataframe?

I have a Spark dataframe df with a schema as such:
[id: string, label: string, tag: string]
id | label | tag
---|-------|-----
1 | h | null
1 | w | x
1 | v | null
1 | v | x
2 | h | x
3 | h | x
3 | w | x
3 | v | null
3 | v | null
4 | h | null
4 | w | x
5 | w | x
(h, w, v are labels. x can be any non-empty value)
For each id, there is at most one label "h" or "w", but there might be multiple "v". I would like to select all the ids that satisfy the following conditions:
Each id has:
1. one label "h" and its tag = null,
2. one label "w" and its tag != null,
3. at least one label "v" for each id.
I am thinking that I need to create three columns checking each of the above conditions, and then I need to do a group by "id".
val hCheck = (label: String, tag: String) => {if (label=="h" && tag==null) 1 else 0}
val udfHCheck = udf(hCheck)
val wCheck = (label: String, tag: String) => {if (label=="w" && tag!=null) 1 else 0}
val udfWCheck = udf(wCheck)
val vCheck = (label: String) => {if (label==null) 1 else 0}
val udfVCheck = udf(vCheck)
val dfx = df.withColumn("hCheck", udfHCheck(col("label"), col("tag")))
  .withColumn("wCheck", udfWCheck(col("label"), col("tag")))
  .withColumn("vCheck", udfVCheck(col("label")))
  .select("id", "hCheck", "wCheck", "vCheck")
  .groupBy("id")
Somehow I need to group the three columns {"hCheck", "wCheck", "vCheck"} into a list of vectors like [x,0,0], [0,x,0], [0,0,x], and check whether these vectors contain all three of {[1,0,0], [0,1,0], [0,0,1]}.
I have not been able to solve this problem yet, and there might be a better approach than this one. Hope someone can give me suggestions. Thanks.
To convert the three checks to vectors you can do:
val df1 = df.withColumn("hCheck", udfHCheck(col("label"), col("tag")))
  .withColumn("wCheck", udfWCheck(col("label"), col("tag")))
  .withColumn("vCheck", udfVCheck(col("label")))
  .select($"id", array($"hCheck", $"wCheck", $"vCheck").as("vec"))
Next the groupby returns a grouped object on which you need to perform aggregations. Specifically to get all the vectors you should do something like:
.groupBy("id").agg(collect_list($"vec"))
Also, you do not need udfs for the various checks; you can do it with column semantics. For example, udfHCheck can be written as:
when($"label" === "h" && $"tag".isNull, 1).otherwise(0)
BTW, you said you wanted a label 'v' for each id, but in vCheck you just check whether the label is null.
Update: Alternative solution
Upon looking at this question again, I would do something like this:
val grouped = df.groupBy("id", "label").agg(count($"label").as("cnt"), first($"tag").as("tag"))
val filtered1 = grouped.filter($"label" === "v" || $"cnt" === 1)
val filtered2 = filtered1.filter($"label" === "v" || ($"label" === "h" && $"tag".isNull) || ($"label" === "w" && $"tag".isNotNull))
val ids = filtered2.groupBy("id").count.filter($"count" === 3)
The idea is that first we group by BOTH id and label so we have information on each combination. The information we collect is how many rows there are (cnt) and the first tag value (it doesn't matter which one).
Now we do two filtering steps:
1. we need exactly one h and one w and any number of v so the first filter gets us these cases.
2. we make sure all the rules are met for each of the cases.
Now we have only combinations of id and label which match the rules so in order for the id to be legal we need to have exactly three instances of label. This leads to the second groupby which simply counts the number of labels which matched the rules. We need exactly three to be legal (i.e. matched all the rules).

How to apply a custom filtering function on a Spark DataFrame

I have a DataFrame of the form:
A_DF = |id_A: Int|concatCSV: String|
and another one:
B_DF = |id_B: Int|triplet: List[String]|
Examples of concatCSV could look like:
"StringD, StringB, StringF, StringE, StringZ"
"StringA, StringB, StringX, StringY, StringZ"
...
while a triplet is something like:
("StringA", "StringF", "StringZ")
("StringB", "StringU", "StringR")
...
I want to produce the Cartesian product of A_DF and B_DF, e.g.:
| id_A: Int | concatCSV: String | id_B: Int | triplet: List[String] |
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 21 | ("StringA", "StringF", "StringZ")|
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 45 | ("StringB", "StringU", "StringR")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 21 | ("StringA", "StringF", "StringZ")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 45 | ("StringB", "StringU", "StringR")|
| ... | | | |
Then I want to keep just the records that have at least two substrings (e.g. StringA, StringB) from A_DF("concatCSV") that appear in B_DF("triplet"), i.e. use filter to exclude those that don't satisfy this condition.
First question is: can I do this without converting the DFs into RDDs?
Second question is: can I ideally do the whole thing in the join step--as a where condition?
I have tried experimenting with something like:
val cartesianRDD = A_DF
  .join(B_DF, "right")
  .where($"triplet".exists($"concatCSV".contains(_)))
but where cannot be resolved. I tried it with filter instead of where but still no luck. Also, for some strange reason, the type annotation for cartesianRDD is SchemaRDD and not DataFrame. How did I end up with that? Finally, what I am trying above (the short code I wrote) is incomplete, as it would keep records with just one substring from concatCSV found in triplet.
So, third question is: Should I just change to RDDs and solve it with a custom filtering function?
Finally, last question: Can I use a custom filtering function with DataFrames?
Thanks for the help.
CROSS JOIN is implemented in Hive, so you could first do the cross join using Hive SQL:
A_DF.registerTempTable("a")
B_DF.registerTempTable("b")
// sqlContext should be really a HiveContext
val result = sqlContext.sql("SELECT * FROM a CROSS JOIN b")
Then you can filter down to your expected output using two udfs: one that converts your string to an array of words, and a second one that gives us the length of the intersection of the resulting array column and the existing column "triplet":
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

// split the CSV string into a trimmed array of words
val splitArr = udf { (s: String) => s.split(",").map(_.trim) }
// length of the intersection of two string arrays
val commonLen = udf { (a: WrappedArray[String], b: WrappedArray[String]) => a.intersect(b).length }

val temp = result
  .withColumn("concatArr", splitArr(col("concatCSV")))
  .select(col("*"), commonLen(col("triplet"), col("concatArr")).alias("comm"))
  .filter(col("comm") >= 2)
  .drop("comm")
  .drop("concatArr")

temp.show
+----+--------------------+----+--------------------+
|id_A| concatCSV|id_B| triplet|
+----+--------------------+----+--------------------+
| 14|StringD, StringB,...| 21|[StringA, StringF...|
| 18|StringA, StringB,...| 21|[StringA, StringF...|
+----+--------------------+----+--------------------+