Issue adding new columns to a DataFrame using PySpark - apache-spark-sql

Say I run this:
DF1.withColumn("Is_elite", array_intersect(DF1.year, DF1.elite_years)).show()
I get the result I want: a new column called Is_elite with the correct values. Then in the next command I run:
DF1.show()
It just shows me what DF1 would have looked like had I not run the first command; my column is missing.

Since you call the .show() method at the end of the line, the expression returns None rather than a new DataFrame, and the result of withColumn is never stored. Make the following changes and try it out:
from pyspark.sql.functions import array_intersect

elite_df = DF1.withColumn("Is_elite", array_intersect(DF1.year, DF1.elite_years))
elite_df.show()
If you are unsure what kind of object you are dealing with in Python, print its type:
# the following should print a DataFrame type,
# e.g. <class 'pyspark.sql.dataframe.DataFrame'>
print(type(elite_df))
DataFrames are immutable: every transformation creates a new DataFrame reference, so if you show the old DataFrame you will not see the revised result.
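To make the immutability point concrete, here is a minimal illustration using the elite_df from the snippet above (all names are the question's own):
DF1.columns       # original columns only; "Is_elite" is absent
elite_df.columns  # includes "Is_elite" because withColumn returned a new DataFrame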

Related

Getting DataFrame's Column value results in 'Column' object is not callable

For a stream read from FileStore, I'm trying to check whether the value of the first column in the first row equals some string. Unfortunately, however I access this column, e.g. by calling .toList() on it, it throws:
if df["Name"].iloc[0].item() == "Bob":
TypeError: 'Column' object is not callable
I'm calling the customProcessing function from:
df.writeStream\
.format("delta")\
.foreachBatch(customProcessing)\
[...]
And inside this function I'm trying to get the value, but none of the ways of getting the data work; the same error is thrown.
def customProcessing(df, epochId):
    if df["Name"].iloc[0].item() == "Bob":
        [...]
Is there a way to read single columns? Or is this specific to writeStream, meaning I can't use conditions on that input?
There is no iloc for Spark DataFrames; this is not pandas, and there is no concept of an index either.
If you want to get the first item, you could try:
df.select('Name').limit(1).collect()[0][0] == "Bob"
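Putting that answer back into the question's foreachBatch callback, a minimal sketch could look like the following (the function and column names come from the question; the guard against an empty micro-batch is an added assumption):
def customProcessing(df, epochId):
    # collect at most one row of the "Name" column from this micro-batch
    rows = df.select("Name").limit(1).collect()
    if rows and rows[0][0] == "Bob":
        ...  # handle the "Bob" case here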

Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int # column number
    stop::Int  # column number

    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only consecutive column ranges work.
To get around this, you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work, you need to add the ... operator, giving:
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])

How to efficiently append a dataframe column with a vector?

Working with Julia 1.1:
The following minimal code works and does what I want:
using DataFrames

function test()
    df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
    append!(df.NbAlternative, ones(Int, 5))
    df
end
This appends a vector to one column of df. Note: in my real code, I add a more complicated Vector{Int} than what ones returns.
However, @code_warntype test() returns:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
which, I suppose, means this isn't efficient. I can't work out what this @code_warntype warning means. More generally, how can I understand the output of @code_warntype and fix it? This is a recurring, unclear issue for me.
EDIT: following @BogumiłKamiński's answer, how would one write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations) * na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations) * mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a nb_simulations length vector.
You should never do such things, as it will cause many functions from DataFrames.jl to stop working properly. In fact, such code will soon throw an error; see https://github.com/JuliaData/DataFrames.jl/issues/1844, which is exactly about patching this hole in the DataFrames.jl design.
What you should do is append a data-frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths), or use push! to add a single row to a DataFrame.
Now, the reason you have type instability is that a DataFrame can hold vectors of any type (technically, columns are held in a Vector{AbstractVector}), so it is not possible to determine at compile time what the type of the vector under a given name will be.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well, and I do it almost every day (as I run a lot of simulations). As I have indicated, you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # if it is passed 1 as nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative = ones(Int, nb_simulations) * na,
                     NbMonteCarlo = ones(Int, nb_simulations) * mt,
                     Similarity = compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that I append a NamedTuple here. As I wrote earlier, you can append a DataFrame or any data-frame-like object this way. "Data-frame-like object" covers a broad class of things; in general, it is anything you can pass to the DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame by name matching, so column names must be consistent between the target and the appended object.
This differs from push!, which also allows pushing a row that does not specify column names (in my example above I show that a Tuple can be pushed).

How to print the first n rows of a pandas DataFrame by default?

When printing a pandas DataFrame, how can I print only the first n rows by default?
I find myself frequently doing df.head(10) to view the column names and the first couple of rows.
I would prefer that typing df print the first n rows by default, instead of printing the whole df, in which case I cannot see the column names.
If I understand you correctly, you may set
pd.options.display.max_rows = 10
and whenever you type just df in your notebook, only 10 rows will be displayed (note that a truncated display shows the first and last rows with an ellipsis in between, not just the head).
You can always revert to the default value by doing
pd.reset_option('display.max_rows')
Check pd.describe_option('display') for more information
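For reference, a minimal end-to-end sketch of this option-based approach; the example DataFrame is illustrative rather than taken from the question:
import pandas as pd

df = pd.DataFrame({"a": range(100)})

pd.options.display.max_rows = 10     # truncate printed output to 10 rows
print(df)                            # shows the first and last rows with "..." in between

pd.reset_option('display.max_rows')  # restore the default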
Curry DataFrame.head using functools.partial.
import pandas as pd
from functools import partial

head10 = partial(pd.DataFrame.head, n=10)
Now you can either call the function passing your DataFrame as an argument,
head10(df)
Or, pass the function to df.pipe (which internally passes df as an argument to your function):
df.pipe(head10)
Either way, you get the first 10 rows by default.
The other option is to create a new class that extends DataFrame and add your own function (e.g., headXX) which internally calls df.head(n=10) and returns the result.
See the subclassing DataFrame section in the docs.
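A minimal sketch of that subclassing idea, following the pattern from the pandas docs; the class name PreviewFrame and the method head10 are illustrative, not part of pandas:
import pandas as pd

class PreviewFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # per the subclassing docs, keeps pandas operations returning PreviewFrame
        return PreviewFrame

    def head10(self):
        # returns the first 10 rows, mirroring df.head(n=10)
        return self.head(n=10)

# usage:
# PreviewFrame(df).head10()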

What's the cleanest way of assigning a new pandas DataFrame column to a single value?

Working with a DataFrame df, I wanted to create a new column A and assign it a single value (a string in my case):
df['A'] = value
This gave a warning and suggested using loc. However, the solution below still gave the same warning:
df.loc[:, 'A'] = value
Doing some research, I found the solution below, which does not generate a warning:
df = df.assign(A=value)
Is this the generally accepted way of creating a new column and assigning it a value? Are there other possibilities using loc?
pandas version '0.20.1'
EDIT: this is the warning message obtained with the first two methods:
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
As explained by @EdChum and @ScottBoston:
since df was derived using a boolean mask on some original DataFrame,
df = df_original[boolean_mask]
to avoid the warning with the first two methods, use instead
df = df_original[boolean_mask].copy()
df.assign does not need this because it automatically returns a copy of the original DataFrame.
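A minimal sketch of the fix described above; df_original, boolean_mask, and the assigned value are placeholders standing in for the question's data:
import pandas as pd

df_original = pd.DataFrame({"x": [1, 2, 3]})
boolean_mask = df_original["x"] > 1

df = df_original[boolean_mask].copy()  # explicit copy breaks the link to df_original
df["A"] = "some value"                 # assigns without a SettingWithCopyWarning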