How to pass an array into a UDF in Spark SQL

I have a problem: 1) I don't really know how to call a registered UDF. I found an answer saying to use callUDF, so that is how I call the function in my code. 2) I don't really know how to pass arrays in as parameters.
Here is my code:
val df = Seq(("1","2","3","4","5","6")).toDF("A","B","C","D","E","F")
val newdf = Seq(("1","2","3","4","5","6")).toDF("A","B","C","D","E","F")
val cols = df.columns
val temp = Array(df.select($"A"),df.select($"B"),df.select($"C"),df.select($"D"),df.select($"E"),df.select($"F"))
val temp2 = Array(newdf.select($"A"),newdf.select($"B"),newdf.select($"C"),newdf.select($"D"),newdf.select($"E"),newdf.select($"F"))
sparkSession.udf.register("myfunc", (A: Array[String], B: Array[String]) => {
  for (i <- 0 to 5) yield (if (A(i) == B(i)) "U" else "N")
})
val a = df.withColumn("A", callUDF("myfunc", (temp, temp2)))
Thanks in advance!

You are trying to use columns from two different dataframes, which is not possible in a UDF: a Spark UDF only works on a per-row basis, and you can't combine rows from different dataframes. To do so you need to perform a join between the two.
In your case you have just one row, but in a realistic case you would have multiple rows, so you need to make sure you have some unique key to join on, such as a unique id.
If you don't, and both dataframes have the same number of rows and the same number of partitions, you can easily create an id for both dataframes like this:
df.withColumn("id",monotonicallyIncreasingId)
You should probably also rename the columns to have different names.
Look at the different options for join (see http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset) to see what best matches your need.
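For example, here is a minimal sketch of the id-based join, assuming both dataframes have the same number of rows and the same number of partitions (the id column and the new_ prefix are just illustrative names):
import org.apache.spark.sql.functions.monotonically_increasing_id

// Give newdf's columns distinct names, then add a matching id to both sides and join on it.
val dfWithId = df.withColumn("id", monotonically_increasing_id())
val newdfWithId = newdf.toDF(newdf.columns.map("new_" + _): _*)
  .withColumn("id", monotonically_increasing_id())
val joined = dfWithId.join(newdfWithId, "id")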
As for registering and calling a UDF, you can do:
import org.apache.spark.sql.functions.udf

def myFunc(s1: Seq[String], s2: Seq[String]) = {
  for (i <- s1.indices) yield {
    if (s1(i) == s2(i)) "U" else "N"
  }
}

val u = udf(myFunc _)
val a = df.withColumn("A", u($"temp", $"temp2"))
Note that temp and temp2 should each be a column representing an array in the same dataframe, i.e. you should define them after the join on the relevant columns.
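Putting it together, here is a sketch that assumes the joined dataframe built above (the temp/temp2 names are just illustrative, and array from org.apache.spark.sql.functions packs the individual columns into one array column per side):
import org.apache.spark.sql.functions.{array, col}

// Build one array column per original dataframe, then apply the UDF to the pair of arrays.
val withArrays = joined
  .withColumn("temp", array(df.columns.map(col): _*))
  .withColumn("temp2", array(df.columns.map(c => col("new_" + c)): _*))
val result = withArrays.withColumn("A", u($"temp", $"temp2"))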

Related

How to rename the columns starting with 'abcd' to starting with 'wxyz' in Spark Scala?

How can I rename the columns starting with abcd so that they start with wxyz?
List of columns: abcd_name, abcd_id, abcd_loc, empId, empCode
I need to change the names of the columns in a dataframe that start with abcd.
Required column list: wxyz_name, wxyz_id, wxyz_loc, empId, empCode
I tried getting the list of all such columns using the code below, but I'm not sure how to implement the rename.
val df_cols_abcd = df.columns.filter(_.startsWith("abcd")).map(df(_))
You can do that with foldLeft:
val oldPrefix = "abcd"
val newPrefix = "wxyz"
val newDf = df.columns
  .filter(_.startsWith(oldPrefix))
  .foldLeft(df)((acc, oldName) =>
    acc.withColumnRenamed(oldName, newPrefix + oldName.substring(oldPrefix.length))
  )
Your first idea to filter columns with startsWith is correct. The only thing you are missing is the part where you rename all the columns.
I recommend doing some research on foldLeft if you're not familiar with it. The idea is the following:
I start with an initial dataframe (df, in the first pair of parentheses).
I then apply a function to it for each of the columns I need to rename (the function is the one in the second pair of parentheses). This function takes as arguments an accumulator (acc), which is an intermediate dataframe (because the columns are renamed one at a time), and the current element of the list (here the list contains the names of the columns that need to be modified).
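If foldLeft is new to you, here is a minimal sketch of the same accumulator idea on plain strings (the values are made up purely for illustration):
val letters = Seq("a", "b", "c")
// Start from "start" and feed the running result back in at each step.
val concatenated = letters.foldLeft("start")((acc, s) => acc + "-" + s)
// concatenated == "start-a-b-c"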

Spark scala how to remove the columns that are not in common between 2 dataframes

I have 2 dataframes: the first one has 53 columns and the second one has 132 columns.
I want to compare the 2 dataframes, remove all the columns that are not common to both, and then display each dataframe containing only the common columns.
What I did so far is get a list of all the columns that don't match, but I don't know how to drop them.
val diffColumns = df2.columns.toSet.diff(df1.columns.toSet).union(df1.columns.toSet.diff(df2.columns.toSet))
This is getting me a scala.collection.immutable.Set[String].
Now I'd like to use this to drop these columns from each dataframe. Something like this, but it is not working...
val newDF1 = df1.drop(diffColumns)
The .drop function accepts a list of columns, not a Set, so you need to convert it to a Seq and "expand" it using the : _* syntax, like this:
df1.drop(diffColumns.toSeq: _*)
Also, instead of generating the diff, it may be easier to do an intersect to find the common columns, and then use .select on each dataframe to get those columns:
val df = spark.range(10).withColumn("b", rand())
val df2 = spark.range(10).withColumn("c", rand())
val commonCols = df.columns.toSet.intersect(df2.columns.toSet).toSeq.map(col)
df.select(commonCols: _*)
df2.select(commonCols: _*)

Spark Scala Compare Row and Row of 2 Data frames and get differences

I have Dataframe 1 (Df1) and Dataframe 2 (Df2) with the same schema.
I have Row 1 from Df1 (Dfw1) and Row 1 from Df2 (Dfw2).
I need to compare the two to get the differences between Dfw1 and Dfw2, and return the differences as a collection (a Map or something similar).
A simple solution would be to transform the Row objects to Maps and then compare the values of the two Maps.
Something like in Scala:
val m1 = Dfw1.getValuesMap[AnyVal](Dfw1.schema.fieldNames)
val m2 = Dfw2.getValuesMap[AnyVal](Dfw2.schema.fieldNames)
val differences = for {
  field <- m1.keySet
  if !m1.get(field).equals(m2.get(field))
} yield (field, m1(field), m2(field))
This returns a collection of tuples (field, value in Dfw1, value in Dfw2) for the fields whose values differ.
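If you need the result as a Map rather than a collection of tuples, you can build one from the same differences value (a minimal sketch):
// Map each differing field name to the pair of values from Dfw1 and Dfw2.
val diffMap = differences.map { case (field, v1, v2) => field -> (v1, v2) }.toMap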
You may also use pattern matching on the Row object to compare:
Dfw1 match {
  case Row(id: String, desc: String, ....) => // assuming you know the schema
    // compare each value with the corresponding field of Dfw2 and return the differences
}

Alternatives to iloc for searching dataframes

I have a simple piece of code that iterates through a list of ids. If an id is in a particular dataframe column (in this case the column is called uniqueid), it uses iloc to get the value from another column on the matching row and then adds it as a value in a dictionary, with the id as the key:
union_cols = ['uniqueid', 'FLD_ZONE', 'FLD_ZONE_1', 'ST_FIPS', 'CO_FIPS', 'CID']
union_df = gpd.GeoDataFrame.from_features(records(union_gdb, union_cols))
pop_df = pd.read_csv(pop_csv, low_memory=False) # Example dataframe
uniqueid_inin = ['', 'FL1234', 'F54323', ....] # Just an example
isin_dict = dict()
for id in uniqueid_inin:
    if (id != '') & (id in pop_df.uniqueid.values):
        v = pop_df.loc[pop_df['uniqueid'] == id, 'Pop_By_Area'].iloc[0]
        isin_dict.setdefault(id, v)
This works, but it is very slow. Is there a quicker way to do this?
To resolve this issue (and make the process more efficient) I had to think about the problem in a different way that took advantage of pandas instead of relying on a generic Python solution. I first had to get a list of only the uniqueids from my union_df that were definitely in pop_df; otherwise the later .iloc[0] lookup would throw an indexing error.
#Get list of uniqueids in pop_df
pop_uniqueids = pop_df['uniqueid'].unique()
#Get only the union_df rows where the uniqueid matches pop_uniqueid
union_df = union_df.loc[(union_df['uniqueid'].isin(pop_uniqueids))]
union_df = union_df.reset_index()
union_df = union_df.drop(columns='index')
When the uniqueid_inin list is created from union_df (by taking the uniqueids from the rows where my zone_status column equals 'in-in'), it will only contain items that are definitely in pop_df, and empty values are no longer an issue. Then I simply create a subset dataframe using the list and zip the desired column values together into a dictionary:
inin_subset = pop_df[pop_df['uniqueid'].isin(uniqueid_inin)]
inin_pop_dict = dict(zip(inin_subset.uniqueid, inin_subset.Pop_By_Area))
I hope this technique helps.

Is there a faster way through list comprehension to iterate through two dataframes?

I have two dataframes: one contains screen names/display names, and the other contains individuals. I am trying to create a third dataframe that contains the data from both, with a new row for each time a last name appears in a screen name/display name. Functionally this creates a list of possible matching names. My current code, which works perfectly but very slowly, looks like this:
# Original Social Media Screen Names
# cols = 'userid','screen_name','real_name'
usernames = pd.read_csv('social_media_accounts.csv')
# List Of Individuals To Match To Accounts
# cols = 'first_name','last_name'
individuals = pd.read_csv('individuals_list.csv')
userid, screen_name, real_name, last_name, first_name = [],[],[],[],[]
for index1, row1 in individuals.iterrows():
    for index2, row2 in usernames.iterrows():
        if (row2['Screen_Name'].lower().find(row1['Last_Name'].lower()) != -1) | (row2['Real_Name'].lower().find(row1['Last_Name'].lower()) != -1):
            userid.append(row2['UserID'])
            screen_name.append(row2['Screen_Name'])
            real_name.append(row2['Real_Name'])
            last_name.append(row1['Last_Name'])
            first_name.append(row1['First_Name'])
cols = ['UserID', 'Screen_Name', 'Real_Name', 'Last_Name', 'First_Name']
index = range(0, len(userid))
match_list = pd.DataFrame(index=index, columns=cols)
match_list = match_list.fillna('')
match_list['UserID'] = userid
match_list['Screen_Name'] = screen_name
match_list['Real_Name'] = real_name
match_list['Last_Name'] = last_name
match_list['First_Name'] = first_name
Because I need the whole row from each dataframe, the list comprehension methods I have tried do not seem to work.
What you want is to iterate through a dataframe faster. Doing that with a list comprehension means taking data out of a pandas dataframe, handling it with plain Python operations, and then putting it back into a pandas dataframe. The faster way (at least for data of this size) is to stay within pandas' built-in methods.
The next thing you want is to work with the two dataframes together, and the pandas tool for that is a join (merge):
result = pd.merge(usernames, individuals, on=['Screen_Name', 'Last_Name'])
After the merge you can do your filtering.
Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html