PySpark VectorAssembler on Ngram/Tokenizer Transformed Dataframe - dataframe

If I have a dataframe with fields ['did','doc'] such as
data = sc.parallelize(['This is a test',
                       'This is also a test',
                       'These sentence are tests',
                       'This tests these sentences'])\
    .zipWithIndex()\
    .map(lambda x: (x[1], x[0]))\
    .toDF(['did','doc'])
data.show()
+---+--------------------+
|did|                 doc|
+---+--------------------+
|  0|      This is a test|
|  1| This is also a test|
|  2|These sentence ar...|
|  3|This tests these ...|
+---+--------------------+
and I do some transformations on that document like tokenizing and finding 2-grams:
data = Tokenizer(inputCol = 'doc',outputCol = 'words').transform(data)
data = NGram(n=2,inputCol = 'words',outputCol='grams').transform(data)
data.show()
+---+--------------------+--------------------+--------------------+
|did| doc| words| grams|
+---+--------------------+--------------------+--------------------+
| 0| This is a test| [this, is, a, test]|[this is, is a, a...|
| 1| This is also a test|[this, is, also, ...|[this is, is also...|
| 2|These sentence ar...|[these, sentence,...|[these sentence, ...|
| 3|This tests these ...|[this, tests, the...|[this tests, test...|
+---+--------------------+--------------------+--------------------+
then at the end I want to combine the two-grams and words into a single column of features with a VectorAssembler:
data = VectorAssembler(inputCols=['words','grams'],
                       outputCol='features').transform(data)
then I get the following error:
Py4JJavaError: An error occurred while calling o504.transform.
: java.lang.IllegalArgumentException: Data type ArrayType(StringType,true) is not supported.
because the VectorAssembler doesn't work with arrays of strings. To get around that I can drop the dataframe to an RDD, map the RDD to appropriate rows, and zip it back up into a dataframe, a la
data = data.rdd.map(lambda x: Row(did=x['did'],
                                  features=x['words'] + x['grams']))\
    .toDF(['did','features'])
Which is not a problem for this tiny dataset, but which is prohibitively expensive for a large dataset.
Is there any way to achieve this more efficiently than the above?

You can use a UDF to create the features column, like this:
import pyspark.sql.functions as f
import pyspark.sql.types as t
udf_add = f.udf(lambda x,y: x+y, t.ArrayType(t.StringType()))
data.withColumn('features', udf_add('words','grams')).select('features').collect()
[Row(features=['this', 'is', 'a', 'test', 'this is', 'is a', 'a test']),
Row(features=['this', 'is', 'also', 'a', 'test', 'this is', 'is also', 'also a', 'a test']),
Row(features=['these', 'sentence', 'are', 'tests', 'these sentence', 'sentence are', 'are tests']),
Row(features=['this', 'tests', 'these', 'sentences', 'this tests', 'tests these', 'these sentences'])]
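If you are on Spark 2.4 or later, the built-in concat also works on array columns, so the combination can stay in the JVM instead of going through a Python UDF. And since the original goal was a features column for VectorAssembler, note that the combined token array still has to become a numeric vector before it can be used as ML features; CountVectorizer can do that. A minimal sketch under those assumptions (the names combined, cv and vectors are mine):
import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer

# Spark >= 2.4: concat accepts array columns, so no Python UDF is needed
combined = data.withColumn('features', F.concat('words', 'grams'))

# Turn the combined token array into a numeric (sparse) vector,
# which is what downstream ML stages such as VectorAssembler expect
cv = CountVectorizer(inputCol='features', outputCol='vectors')
cv.fit(combined).transform(combined).select('did', 'vectors').show()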

Related

Perform groupBy with array of strings - Scala

I have an array of strings defined in a variable which contains the names of the columns. I would like to perform a group by and get the count.
I am trying the code below but it throws an error.
val keys = Array("Col1", "Col2")
val grouppedByDf = myDf.groupBy(keys.mkString(",").count
Can you please guide me on what I am doing wrong here?
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(("βήτα", "άλφα", 20), ("άλφα", "βήτα", 10), ("άλφα", "βήτα", 20), ("βήτα", "άλφα", 10)).toDF("α", "β", "ω")
val keys = Array("α", "β")
df
  .groupBy(keys.map(col(_)): _*)
  .count()
  .show()
+----+----+-----+
| α| β|count|
+----+----+-----+
|βήτα|άλφα| 2|
|άλφα|βήτα| 2|
+----+----+-----+

How to compute hypotenuses with a pandas udf, pyspark

I want to write a pandas UDF which will take two arguments, cathetus1 and cathetus2, from another dataframe and return the hypotenuse.
# this data is a list of the cathetus pairs.
data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1",DoubleType(),True),StructField("cathetus2",DoubleType(),True)])
df = spark.createDataFrame(data=data,schema=schema)
df.show()
# and this is creating a dataframe where only the cathetuses are shown.
This is the function I have written so far:
def pandaUdf(cat1, cat2):
    leg1 = []
    leg2 = []
    for i in data:
        x = 0
        leg1.append(i[x])
        leg2.append(i[x+1])
        hypoData.append(np.hypot(leg1,leg2))
    return np.hypot(leg1,leg2)
#example_series = pd.Series(data)
and I'm trying to create a new column in df whose values will be the hypotenuses.
df.withColumn(col('Hypo'), pandaUdf(example_df.cathetus1,example_df.cathetus2)).show()
but this gives me an error --> col should be Column.
I don't understand how I can fix this error or why it's even there.
You can apply np.hypot on the two cathetus columns directly without extracting individual values.
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *
data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1",DoubleType(),True),StructField("cathetus2",DoubleType(),True)])
df = spark.createDataFrame(data=data,schema=schema)
df.show()
"""
+---------+---------+
|cathetus1|cathetus2|
+---------+---------+
| 3.0| 4.0|
| 6.0| 8.0|
| 3.3| 5.6|
+---------+---------+
"""
def hypot(cat1: pd.Series, cat2: pd.Series) -> pd.Series:
return np.hypot(cat1,cat2)
hypot_pandas_df = F.pandas_udf(hypot, returnType=FloatType())
df.withColumn("Hypo", hypot_pandas_df("cathetus1", "cathetus2")).show()
"""
+---------+---------+----+
|cathetus1|cathetus2|Hypo|
+---------+---------+----+
| 3.0| 4.0| 5.0|
| 6.0| 8.0|10.0|
| 3.3| 5.6| 6.5|
+---------+---------+----+
"""

Recommendation - Creating a new dataframe with conditions

I've been studying Spark for a while, but today I got stuck. I'm working on a recommendation model using the Audioscrobbler dataset.
I have my model based on ALS and the following definition for making the recommendations:
def makeRecommendations(model: ALSModel, userID: Int, howMany: Int): DataFrame = {
  val toRecommend = model.itemFactors.select($"id".as("artist")).withColumn("user", lit(userID))
  model.transform(toRecommend).
    select("artist", "prediction", "user").
    orderBy($"prediction".desc).
    limit(howMany)
}
It's generating the expected output, but now I would like to create a new list of DataFrames using Predictions DF and User Data DF.
DataFrame Example
A new list of DataFrames consisting of the predicted value from the "Predictions DF" and a "Listened" column that will be 0 if the user didn't listen to the artist and 1 if the user did, something like this:
Expected DF
I tried the following solution:
val recommendationsSeq = someUsers.map { userID =>
  // Get the artists for the user in testData
  val artistsOfUser = testData.where($"user" === userID).select("artist").rdd.map(r => r(0)).collect.toList
  // Recommendations for each user
  val recoms = makeRecommendations(model, userID, numRecom)
  // Insert a column "listened" with 1 if the artist is in the test set for the user and 0 otherwise
  val recomOutput = recoms.withColumn("listened", when($"artist".isin(artistsOfUser: _*), 1.0).otherwise(0.0)).drop("artist")
  (recomOutput)
}.toSeq
But it's very time consuming when the recommendation involves more than 30 users. I believe there's a better way to do it.
Could someone give me some ideas?
Thanks,
You can try joining the dataframes, then groupBy and count:
scala> val df1 = Seq((1205,0.9873411,1000019)).toDF("artist","prediction","user")
scala> df1.show()
+------+----------+-------+
|artist|prediction| user|
+------+----------+-------+
| 1205| 0.9873411|1000019|
+------+----------+-------+
scala> val df2 = Seq((1000019,1205,40)).toDF("user","artist","playcount")
scala> df2.show()
+-------+------+---------+
| user|artist|playcount|
+-------+------+---------+
|1000019| 1205| 40|
+-------+------+---------+
scala> df1.join(df2,Seq("artist","user")).groupBy('prediction).count().show()
+----------+-----+
|prediction|count|
+----------+-----+
| 0.9873411| 1|
+----------+-----+

Update pyspark dataframe from a column having the target column values

I have a dataframe which has a column ('target_column' in this case) naming the column to update, and I need to update those target columns with the 'val' column values.
I have tried using UDFs and .withColumn, but they all expect a fixed column name. In my case it can be variable. Using RDD map transformations didn't work either, as RDDs are immutable.
from pyspark.sql import SparkSession

def test():
    data = [("jose_1", 'mase', "firstname", "jane"), ("li_1", "ken", 'lastname', 'keno'), ("liz_1", 'durn', 'firstname', 'liz')]
    source_df = spark.createDataFrame(data, ["firstname", "lastname", "target_column", "val"])
    source_df.show()

if __name__ == "__main__":
    spark = SparkSession.builder.appName('Name Group').getOrCreate()
    test()
    spark.stop()
Input:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
| jose_1| mase| firstname|jane|
| li_1| ken| lastname|keno|
| liz_1| durn| firstname| liz|
+---------+--------+-------------+----+
Expected output:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
| jane| mase| firstname|jane|
| li_1| keno| lastname|keno|
| liz| durn| firstname| liz|
+---------+--------+-------------+----+
For example, in the first row of the input the target_column is 'firstname' and val is 'jane', so I need to update firstname to 'jane' in that row.
Thanks
You can do a loop over all your columns:
from pyspark.sql import functions as F

for col in df.columns:
    df = df.withColumn(
        col,
        F.when(
            F.col("target_column") == F.lit(col),
            F.col("val")
        ).otherwise(F.col(col))
    )
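If the dataframe has many columns, chaining one withColumn per column grows the query plan with every iteration; the same logic can be expressed as a single select built from a list comprehension. A sketch of that equivalent form, using the same column names as above:
from pyspark.sql import functions as F

# Build every column expression up front and apply them in one projection;
# this is equivalent to the loop above but avoids stacking withColumn calls
df = df.select([
    F.when(F.col("target_column") == F.lit(c), F.col("val"))
     .otherwise(F.col(c))
     .alias(c)
    for c in df.columns
])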

Ungrouping the grouped table in Scala

I have the following data:
53,Male,11th,<=50K
53,Male,11th,<=50K
53,Male,11th,<=50K
20,Female,Masters,>50K
20,Female,Masters,>50K
33,Male,Bachelors,<=50K
Next I need to group the above data using select and group by, so it will be something like:
53,Male,11th,<=50K,3
20,Female,Masters,>50K,2
33,Male,Bachelors,<=50K,1
where the last number shows the number of similar records. Now I need to filter for the records whose count is greater than 2, and store them in a separate file.
I have grouped the data with SQL queries in Scala. To ungroup the data, I thought we could create a table and add the grouped data with insert commands, line by line. It works, but it is very, very slow and took around one hour for a few records. Are there any thoughts on doing this with Scala? Many thanks.
The command is shown below:
import spark.sqlContext.implicits._
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
case class Rating(age: Double,edu: String, sex: String, salary: String)
val Result = sc.textFile("hdfs://NameNode01:9000/input/adult.csv").map(_.split(",")).map(p => Rating(p(0).trim.toDouble,p(1),p(2),p(3))).toDF()
Result.registerTempTable("Start")
val sal1=spark.sqlContext.sql("SELECT age,edu,sex,salary,count(*) as cnt from Start group by age,edu,sex,salary")
sal1.registerTempTable("adult")
val sal2=spark.sqlContext.sql("SELECT age,edu,sex,salary,cnt from adult WHERE cnt>3")
sal2.registerTempTable("adult2")
var ag=sal2.map(age => ""+age(0)).collect()
var ed=sal2.map(edu => ""+edu(1)).collect()
var se=sal2.map(sex => ""+sex(2)).collect()
var sa=sal2.map(salary => ""+salary(3)).collect()
var cn=sal2.map(cnt => ""+cnt(4)).collect()
//convert age to double
val ages= ag.map(_.toDouble)
//convert the cnt to integer
val counts= cn.map(_.toInt)
//length of the array
var cnt_length=counts.size
//create a table and add the sal2 records in it
val adlt2=spark.sqlContext.sql("CREATE TABLE adult3 (age double, edu string, sex string, salary string)")
//loop and enter the number of cn
var sql_querys="query"
var i=0
var j=0
var loop_cnt=0
for (i <- 0 to cnt_length - 1) {
  loop_cnt = counts(i)
  for (j <- 0 to loop_cnt - 1) {
    sql_querys = "INSERT into adult3 values (" + ages(i) + ",'" + ed(i) + "','" + se(i) + "','" + sa(i) + "')"
    val adlt3 = spark.sqlContext.sql("INSERT into adult3 values (" + ages(i) + ",'" + ed(i) + "','" + se(i) + "','" + sa(i) + "')")
  }
}
The main part is the loop at the end of the code.
Here is a shorter solution which uses only RDDs:
val result = sc
  .textFile("hdfs://NameNode01:9000/input/adult.csv")
  .map({ (line: String) =>
    val p = line.split(",")
    (Rating(p(0).trim.toDouble, p(1), p(2), p(3)), 1)
  })
  .reduceByKey(_ + _)
  .filter(_._2 > 2)
  .flatMap(rating => Array.fill(rating._2)(rating._1))
It works as follows:
textFile loads the RDD from the file
map transforms the lines into pairs of the form (rating, 1)
reduceByKey groups the pairs by rating and sums the 1s (i.e. counts the occurrences of each rating)
filter discards the ratings which appear fewer than 3 times
flatMap repeats each rating as many times as its count and then flattens all the results into a single RDD
Here are some reasons why the initial approach is not performant:
collect is used on the dataframe to read its contents onto the local machine. This means that you directly lose all the parallelization and clustering benefits of Spark.
The for loop executes single insertions into the dataframe. The available transformations on Spark objects (e.g. map, filter, reduce, single SQL queries) are highly optimised to perform those actions in a distributed fashion. By using a for loop to perform single-row actions, you lose this advantage, and additionally you risk extreme overhead from the dataframes that are copied during each iteration of the loop.
(minor) Converting an RDD to a dataframe adds some additional computational cost. Thus, unless you plan to perform several operations which would benefit from the performance features of dataframes or datasets, I would suggest keeping it simple and using just RDDs.
You might want to consider ungrouping your dataframe using explode in accordance with the groupBy count:
import org.apache.spark.sql.functions._
case class Rating(age: Double, edu: String, sex: String, salary: String)
val Result = sc.textFile("/Users/leo/projects/spark/files/testfile.csv").
map(_.split(",")).
map(p => Rating(p(0).trim.toDouble, p(1).trim, p(2).trim, p(3).trim)).
toDF
val saDF1 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt")
val saDF2 = Result.groupBy("age", "edu", "sex", "salary").agg(count("*") as "cnt").where($"cnt" > 2)
// Create a UDF to fill array of 1's to be later exploded
val fillArr = (n: Int) => Array.fill(n)(1)
val fillArrUDF = udf(fillArr)
val expandedDF1 = saDF1.withColumn("arr", fillArrUDF($"cnt"))
expandedDF1.show
+----+------+---------+------+---+---------+
| age| edu| sex|salary|cnt| arr|
+----+------+---------+------+---+---------+
|33.0| Male|Bachelors| <=50K| 1| [1]|
|20.0|Female| Masters| >50K| 2| [1, 1]|
|53.0| Male| 11th| <=50K| 3|[1, 1, 1]|
+----+------+---------+------+---+---------+
// Ungroup dataframe using explode
val ungroupedDF1 = expandedDF1.withColumn("a", explode($"arr")).
select("age", "edu", "sex", "salary")
ungroupedDF1.show
+----+------+---------+------+
| age| edu| sex|salary|
+----+------+---------+------+
|33.0| Male|Bachelors| <=50K|
|20.0|Female| Masters| >50K|
|20.0|Female| Masters| >50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
|53.0| Male| 11th| <=50K|
+----+------+---------+------+
From what I understood of your question, you want to filter out the similar records whose count is greater than 2 and write the result to a file. If that's so, the following can be your solution.
You must already have the original dataframe as
+----+------+---------+------+
|age |edu |sex |salary|
+----+------+---------+------+
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|53.0|Male |11th |<=50K |
|20.0|Female|Masters |>50K |
|20.0|Female|Masters |>50K |
|33.0|Male |Bachelors|<=50K |
+----+------+---------+------+
You don't need to write complex SQL queries to find the count; you can just use the built-in functions, as
val columnNames = Result.columns
val finalTemp = Result.groupBy(columnNames.map(col): _*).agg(count("salary").as("similar records"))
This should give output as
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|53.0|Male |11th |<=50K |3 |
+----+------+---------+------+---------------+
Now, to filter, you can just use the filter function as
val finalTable = finalTemp.filter($"similar records" < 3)
The final output is
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
+----+------+---------+------+---------------+
You can save it to a file:
finalTable.write.format("com.databricks.spark.csv").save("output path")
If you want the original (ungrouped) data restricted to the filtered records, then you can simply use a join as
Result.join(finalTable, Seq(columnNames: _*)).show(false)
The output is
+----+------+---------+------+---------------+
|age |edu |sex |salary|similar records|
+----+------+---------+------+---------------+
|33.0|Male |Bachelors|<=50K |1 |
|20.0|Female|Masters |>50K |2 |
|20.0|Female|Masters |>50K |2 |
+----+------+---------+------+---------------+
You can save it to a file as above
Note: you will need the following import for the above functions to work
import org.apache.spark.sql.functions._