How to get the mode of a column from a dataframe using spark scala?

I'm trying to get the mode of a column in a dataframe using Spark Scala, but my code is not working.
For example:
val type_mode = dfairports.groupBy("type").count().orderBy("count").first()
print("Mode", type_mode.get(0))

You're almost there! You're probably getting the least common value at the moment, since orderBy sorts in ascending order by default, so taking the first element gives you the value with the lowest count.
Try:
import org.apache.spark.sql.functions.desc

val type_mode = dfairports.groupBy("type").count().orderBy(desc("count")).first()
print("Mode", type_mode.get(0))
Hope this helps :)
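For completeness, a minimal self-contained sketch of the fixed version (assuming, as in the question, that dfairports has a "type" column; printing the count of the mode as well is my own addition):

import org.apache.spark.sql.functions.desc

// Count each value, sort the counts in descending order, and take the top row.
val topRow = dfairports.groupBy("type").count().orderBy(desc("count")).first()

// Column 0 is the most frequent "type"; column 1 is its count (a Long).
println(s"Mode: ${topRow.get(0)}, occurrences: ${topRow.getLong(1)}")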

Related

How do I access dataframe column value within udf via scala

I am attempting to add a column to a dataframe, using a value from a specific column (let's assume it's an id) to look up its actual value from another df.
So I set up a lookup def
def lookup(id: String): String = {
  lookupdf.select("value")
    .where(s"id = '$id'")
    .as[String]
    .first()
}
The lookup def works if I test it on its own by passing an id string: it returns the corresponding value.
But I'm having a hard time finding a way to use it within the withColumn function.
dataDf
.withColumn("lookupVal", lit(lookup(col("someId"))))
It rightly complains that I'm passing in a column instead of the expected string. The question is: how do I give it the actual value from that column?
You cannot access another dataframe from within withColumn. Think of it this way: withColumn can only access data at the level of a single record of dataDf.
Please use a join instead, like:
val resultDf = lookupDf.select("value", "id")
  .join(dataDf, lookupDf("id") === dataDf("id"), "right")
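For illustration, a small self-contained sketch of that join; the sample data, the someId key column, and the lookupVal column name are assumptions taken from the question's snippet, not part of the original answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lookup-join").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the question's dataframes.
val lookupDf = Seq(("1", "foo"), ("2", "bar")).toDF("id", "value")
val dataDf = Seq(("1", 10), ("2", 20), ("3", 30)).toDF("someId", "amount")

// A right join keeps every row of dataDf and attaches the looked-up value where the ids match.
val resultDf = lookupDf.select("value", "id")
  .join(dataDf, lookupDf("id") === dataDf("someId"), "right")
  .withColumnRenamed("value", "lookupVal")

resultDf.show()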

How can I select all fields of the same type in a pandas dataframe

I was looking for the best way to collect all fields of the same type in pandas, similar to the style that works with Spark dataframes, as below:
continuousCols = [c[0] for c in pysparkDf.dtypes if c[1] in ['int', 'double']]
I eventually figured it out with continuousCols = df.select_dtypes([float, int]).columns. If you have used any other method that works, feel free to add it as well.

I cannot understand why "in" doesn't work correctly

sp01 is a dataframe which contains the S&P 500 index, and I have another dataframe, interest, which contains daily interest rates. The two datasets start from the same date, but their sizes are not the same, which causes an error.
I want to keep only the exactly matching dates, so I tried to check every date using the "in" operator. But "in" doesn't work as I expect. This is the code:
print(sp01.Date[0], type(sp01.Date[0]) )
->1976-06-01, str
print(interest.DATE[0], type(interest.DATE[0]) )
->1976-06-01, str
print(sp01.Date[0] in interest.DATE)
->False
I cannot understand why the result is False.
The first dates of sp01 and interest are exactly the same; I checked that by printing them, as above. So True should come out, but False came out. It's driving me mad, please help me.
I solved it! The problem is that the "in" operator does not work the way I expected on a pandas Series: it checks the index, not the values. Both columns are pandas Series, so I had to convert one of them to a list first.

What happens when we do a repartition on a dataframe that is already repartitioned?

I was analysing some existing code and found something like this.
val newDF = df.repartition(1).withColumn("name", lit("xyz")).orderBy(col("count").asc)
Later, in a different module, this newDF was reused as below:
newDF.repartition(1).write.format("csv").save(path/of/file)
Now my doubt is this: since the same dataframe is repartitioned in two places, and the first one also has an orderBy, won't the data get shuffled again by the second repartition, making the orderBy void?
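No answer was recorded here, but one way to investigate it yourself is to compare the physical plans: explain() shows whether the write path introduces another Exchange (shuffle) after the sort. A minimal sketch, with toy data standing in for the question's df (the count column is an assumption made to match the orderBy):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().appName("repartition-check").master("local[*]").getOrCreate()

// Toy dataframe with a "count" column, standing in for the question's df.
val df = spark.range(10).toDF("count")

// First step, as in the question: repartition, add a literal column, then sort.
val newDF = df.repartition(1).withColumn("name", lit("xyz")).orderBy(col("count").asc)

// The plan of the second step shows whether another shuffle happens before the write.
newDF.repartition(1).explain()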

How to remove stopwords from text during pre-processing in Spark

I have a requirement to pre-process the data in Spark before running the algorithms.
One of the pre-processing steps is to remove stopwords from the text. I tried Spark's StopWordsRemover. StopWordsRemover requires both input and output to be Array[String]. After running the program, the final output column is shown as a collection of strings, but I need a plain string.
My code is as follows.
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.DataFrame

val tokenizer: RegexTokenizer = new RegexTokenizer().setInputCol("raw").setOutputCol("token")
val stopWordsRemover = new StopWordsRemover().setInputCol("token").setOutputCol("final")
stopWordsRemover.setStopWords(stopWordsRemover.getStopWords ++ customizedStopWords)
val tokenized: DataFrame = tokenizer.transform(hiveDF)
val transformDF = stopWordsRemover.transform(tokenized)
Actual Output
["rt messy support need help with bill"]
Required Output:
rt messy support need help with bill
My output should be a plain string, not an array of strings. Is there any way to do this? I need the output column in the dataframe as a string.
I would also appreciate a suggestion on which of the options below is better for removing stopwords from text in a Spark program:
StopWordsRemover from Spark MLlib
Stanford CoreNLP library
Which of the two gives better performance when parsing huge files?
Any help appreciated.
Thanks in advance.
You may use df.collect()(0) to get a string instead of an array of strings, if you are sure only the first item is of interest.
However, that should not be an issue here, as long as you traverse the array and read each item from it.
Ultimately hiveDF will give you an RDD[String], and it becomes an Array[String] when you convert it from the RDD.
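If the goal is simply a plain-string column rather than an array, one common option (not mentioned in the answer above, so treat it as a suggestion) is concat_ws, applied to the transformDF produced by the question's pipeline:

import org.apache.spark.sql.functions.{col, concat_ws}

// Re-join the filtered tokens into one space-separated string.
// "final" is the StopWordsRemover output column; "finalText" is a hypothetical name for the new column.
val flatDF = transformDF.withColumn("finalText", concat_ws(" ", col("final")))
flatDF.select("finalText").show(false)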