Getting DataFrame's Column value results in 'Column' object is not callable

For a stream read from FileStore, I'm trying to check whether the value of the first column of the first row equals some string. Unfortunately, when I access this column in any way, e.g. by calling .toList() on it, it throws:
if df["Name"].iloc[0].item() == "Bob":
TypeError: 'Column' object is not callable
I'm calling the customProcessing function from:
df.writeStream \
    .format("delta") \
    .foreachBatch(customProcessing) \
    [...]
And inside this function I'm trying to get the value, but none of the ways of getting the data work; the same error is thrown.
def customProcessing(df, epochId):
    if df["Name"].iloc[0].item() == "Bob":
        [...]
Is there a way to read single columns? Or is this writeStream-specific, so that I'm unable to use conditions on that input?

There is no iloc for Spark DataFrames; this is not pandas, and there is no concept of an index either.
If you want to get the first item, you could try:
df.select('Name').limit(1).collect()[0][0] == "Bob"
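Inside foreachBatch that could look like this (a minimal sketch; it guards against empty micro-batches and assumes the column is named Name):
def customProcessing(df, epochId):
    # collect() pulls the selected rows to the driver; limit(1) keeps that cheap
    rows = df.select("Name").limit(1).collect()
    if rows and rows[0][0] == "Bob":
        # ... handle the "Bob" case
        pass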

How do I access a dataframe column value within a UDF via Scala

I am attempting to add a column to a dataframe, using a value from a specific column (let's assume it's an id) to look up its actual value from another df.
So I set up a lookup def:
def lookup(id: String): String = {
  lookupdf.select("value")
    .where(s"id = '$id'")
    .as[String]
    .first()
}
The lookup def works if I test it on its own by passing an id string; it returns the corresponding value.
But I'm having a hard time finding a way to use it within the withColumn function.
dataDf
  .withColumn("lookupVal", lit(lookup(col("someId"))))
It rightly complains that I'm passing in a Column instead of the expected string; the question is, how do I give it the actual value from that column?
You cannot access another dataframe from within withColumn. Think of it this way: withColumn can only access data at the level of a single record of dataDf.
Use a join instead, like:
val resultDf = lookupDf.select("value", "id")
  .join(dataDf, lookupDf("id") === dataDf("id"), "right")

Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I've read the documentation in the XLSX.jl Tutorial, but it's not clear whether this is possible. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int # column number
    stop::Int  # column number

    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only consecutive columns are working.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work you need to add the ... operator to give
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])

Filtering DataFrame with static date value

I am trying to filter a DataFrame to get all dates greater than '2012-09-15'.
I tried the solution from another post, which suggested using
data.filter(data("date").lt(lit("2015-03-14")))
but I am getting an error:
TypeError: 'DataFrame' object is not callable
What is the solution for this?
You need square brackets around "date", i.e.
data.filter(data["date"] < lit("2015-03-14"))
Calling data("date") treats data as a function (rather than a DataFrame).
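Putting it together (a minimal sketch, assuming data has a date column):
from pyspark.sql.functions import col, lit

# Square-bracket (or col()) access returns a Column, which supports comparison operators
filtered = data.filter(data["date"] < lit("2015-03-14"))
# equivalent, since a Column can also be compared against a plain string:
filtered = data.filter(col("date") < "2015-03-14")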

Issue adding new columns to dataframe using pyspark

Say I run this
DF1.withColumn("Is_elite",
               array_intersect(DF1.year, DF1.elite_years)).show()
I get the result I want which is a new column called Is_elite with the correct values and all
Then in the next command I run
DF1.show()
and it just shows me what DF1 would have looked like had I not run the first command; my column is missing.
Since you appended the .show() method to that line, it does not return a new DataFrame. Make the following changes and try it out:
elite_df = DF1.withColumn("Is_elite", array_intersect(DF1.year, DF1.elite_years))
elite_df.show()
If you get confused about what object you have in Python, try printing its type:
# the following must return a DataFrame object
print(type(elite_df))
DataFrames are immutable, and every transformation creates a new DataFrame reference; hence, if you show the old dataframe, you will not see the revised result.
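If you want to keep using the name DF1 for the enriched DataFrame, reassign it (a minimal sketch, assuming DF1 already has the array columns year and elite_years):
from pyspark.sql.functions import array_intersect

# rebind the name DF1 to the transformed DataFrame returned by withColumn
DF1 = DF1.withColumn("Is_elite", array_intersect(DF1.year, DF1.elite_years))
DF1.show()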

How do I return more than one value from DataFrame.groupby.rolling.apply?

I am looking to apply a function on a rolling window, organizing the data with the groupby method. But the function must return two values, and when it does I get the error:
cannot convert the series to <type 'float'>
When I run df.groupby(['A'])['B'].rolling(2).apply(function) with a function that returns a single value, it works. But then I modify the code to use a function that returns two values:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                   'B': [0.2, 0.5, 1.0, 1.2, 1.7, 1.9, 2.1, 2.4]})

def basic_function(y):
    sum1 = 0
    sum2 = 0
    for x in y:
        sum1 = sum1 + x * x
        sum2 = sum2 + x * x * x
    return sum1 / len(y), sum2 / len(y)

df.groupby(['A'])['B'].rolling(2).apply(basic_function)
I was expecting the code to add two new columns to the dataframe. Instead I get an error message.
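For context, rolling(2).apply requires the applied function to return a single scalar, which is why returning a tuple fails. One workaround (a minimal sketch, not from the original post; the column names mean_sq and mean_cube are made up) is to run one apply per statistic and assign each result back as its own column:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                   'B': [0.2, 0.5, 1.0, 1.2, 1.7, 1.9, 2.1, 2.4]})

g = df.groupby('A')['B']
# each apply returns one scalar per window; dropping the group level from the
# resulting MultiIndex re-aligns the values with df's original index
df['mean_sq'] = g.rolling(2).apply(lambda y: (y ** 2).mean(), raw=True).reset_index(level=0, drop=True)
df['mean_cube'] = g.rolling(2).apply(lambda y: (y ** 3).mean(), raw=True).reset_index(level=0, drop=True)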