How to assign Pandas.Series.str.extractall() result back to original dataset? (TypeError: incompatible index of inserted column with frame index) - pandas

Dataset brief overview
dete_resignations['cease_date'].head() and dete_resignations['cease_date'].value_counts() show that the column holds date strings such as '05/2012' (the output itself is not reproduced here).
What I tried
I was trying to extract only the year (e.g. 05/2012 -> 2012) from dete_resignations['cease_date'] using pandas.Series.str.extractall() and assign the result back to the original dataframe. However, since not all rows contain a value in that format (e.g. 05/2012), an error occurred.
Here is the code I wrote:
pattern = r"(?P<month>[0-1][0-9])/?(?P<year>[0-2][0-9]{3})"
years = dete_resignations['cease_date'].str.extractall(pattern)
dete_resignations['cease_date_'] = years['year']
'TypeError: incompatible index of inserted column with frame index'
I thought years shared the same index as dete_resignations['cease_date']. Therefore, even though the two indexes are not identical, I expected pandas to automatically align the values and assign them to the right rows. But it didn't.
Can anyone help solve this issue?
Much appreciated if someone can enlighten me!

If you only want the years, don't capture the month in the pattern, and use extract instead of extractall. extractall returns one row per match, with an extra 'match' index level that is incompatible with the frame's index (hence the TypeError), while extract returns exactly one row per input row:
# the $ indicates end of string
# \d is equivalent to [0-9]
# pattern extracts the last digit groups
pattern = r'(?P<year>\d+)$'
years = dete_resignations['cease_date'].str.extract(pattern)
dete_resignations['cease_date_'] = years['year']
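If you do want to keep extractall (for instance to reuse the month group), here is a sketch of one way to realign its result, assuming each row contains at most one match: drop the extra 'match' index level so the result shares the frame's index again.
years = dete_resignations['cease_date'].str.extractall(pattern)
# the index of `years` is (original row label, match number); with at most
# one match per row, dropping the 'match' level restores the alignment
dete_resignations['cease_date_'] = years['year'].droplevel('match')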

Related

Nested when in Pyspark

I need to apply many when conditions that take their input from a list by index. Is there a way to write this code more concisely that produces the same results without hurting runtime efficiency?
Below is the code I am using
df = df.withColumn('date_match_label', F.when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[0]} matches with {date_cols[3]}")
                   .when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[1]} matches with {date_cols[3]}")
                   .when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[1]} matches with {date_cols[3]}")
                   .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                   .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                   .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                   .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                   .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                   .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                   .otherwise('No Match'))
Here date_cols contains six column names. I need to check the first three columns with the last three columns and return the comment if there's a match.
The problem with the current approach is that as the size of the list grows, I have to add more and more lines, which makes my code prone to errors and looks ugly. I was wondering if there's a way to do this where I only need to specify the list indices that should be compared against the other list's elements.
Considering you want to compare the 1st half of the list of column names to the 2nd half, you can dynamically build the code expression, so there is no need to write more error-prone code each time the list expands.
You can build the code dynamically with the help of the indices in the following way:
from itertools import product
from pyspark.sql.functions import when, col

cols = date_cols  # the six column names from the question
n = len(cols)
req = list(range(n))
res = list(product(req[:n//2], req[n//2:]))  # all (left, right) index pairs

start = '''df.withColumn('date_match_label', '''
whens = []
for i, j in res:
    # double braces so the *generated* code interpolates the column names
    whens.append(f'''when(col(cols[{i}])==col(cols[{j}]), f"{{cols[{i}]}} matches with {{cols[{j}]}}")''')
final_exp = start + '.'.join(whens) + '''.otherwise('No Match'))'''
This will generate the final expression comparing the 1st half of the columns with the 2nd half (for instance, with 4 columns it chains the 4 resulting when calls; the generated string is not reproduced here).
The above is a string expression, so to execute it you can use the eval function and get the results as shown below:
df = eval(final_exp)
df.show(truncate=False)
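As a side note, the eval step can be avoided by chaining the when calls on Column objects directly. Here is a minimal sketch under the same assumptions as above (df and the six names in date_cols as in the question, 1st half compared against the 2nd half):
from itertools import product
from pyspark.sql import functions as F

n = len(date_cols)
pairs = list(product(date_cols[:n//2], date_cols[n//2:]))

# start the chain with the first pair, then fold the remaining pairs onto it
left, right = pairs[0]
expr = F.when(F.col(left) == F.col(right), f"{left} matches with {right}")
for left, right in pairs[1:]:
    expr = expr.when(F.col(left) == F.col(right), f"{left} matches with {right}")

df = df.withColumn('date_match_label', expr.otherwise('No Match'))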

Writing csv file with CSV.jl throws StackOverflowError message

CSV.write() returns ERROR: LoadError: StackOverflowError, further down claiming --- the last 2 lines are repeated 39990 more times ---. This is nonsense, since the data frame to be written is acknowledged to be of 129x6 dimension with no empty fields. I fear the function is counting rows beyond the dimension of the data frame, and it shouldn't be doing so. How do I make this function work?
using CSV, DataFrames

file = DataFrame(
    a = 1:50,
    b = 1:50,
    c = Vector{Any}(missing, 50),
    d = Vector{Any}(missing, 50))
CSV.write("a.csv", file)
file = CSV.read("a.csv", DataFrame)
c, d = [], []
for i in 1:nrow(file)
    push!(c, file.a[i]*file.b[i])
    push!(d, file.a[i]*file.b[i]/file.a[i])
end
file[!, :c] = c
file[!, :d] = d
# I found no straightforward way to fill empty columns based on the values of filled columns in a for loop that wouldn't generate error messages
CSV.write("a.csv", DataFrame) # Error in question here
The StackOverflowError is indeed due to the line
CSV.write("a.csv",DataFrame)
which tries to write the data type itself to a file instead of the actual dataframe variable. (The reason that leads to this particular error is an implementation detail that's described here.)
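In other words, the last line should pass the variable that holds the data:
CSV.write("a.csv", file)  # the DataFrame variable `file`, not the type `DataFrame`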
# I found no straight forward way to fill empty columns based on the values of filled columns in a for loop that wouldn't generate error messages
You don't need a for loop for this, just
file.c = file.a .* file.b
file.d = file.a .* file.b ./ file.a
will do the job. (Though I don't understand the point of the file.d calculation - maybe it's simply a dummy calculation you chose for illustration.)

Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int # column number
    stop::Int  # column number

    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only consecutive column ranges are supported.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work you need to add the ... operator, giving
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])
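Applied to the original question's columns A and C, the same pattern should work with single-column ranges written in "A:A" form (assuming ColumnRange accepts start == stop, which the @assert above allows):
df = reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:A", "C:C"]])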

I cannot understand why "in" doesn't work correctly

sp01 is a dataframe which contains the S&P 500 index, and interest is a dataframe which contains a daily interest rate. The two start on the same date, but they are not the same size, which causes an error.
I want to keep only the dates they share, so I tried to check every date using the "in" operator. But "in" doesn't behave as I expect. This is the code:
print(sp01.Date[0], type(sp01.Date[0]) )
->1976-06-01, str
print(interest.DATE[0], type(interest.DATE[0]) )
->1976-06-01, str
print(sp01.Date[0] in interest.DATE)
->False
I can never understand why the result is False. Of course, the first dates of sp01 and interest are exactly the same - I checked that by printing them, as above. So True should come out, but False came out. I'm going mad!!! Please help me.
I solved it! The problem is that the "in" operator does not work on the values of a pandas Series (it tests the index instead). Those two columns are pandas Series, so I had to convert one of them to a list.
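For reference, here is a minimal sketch of what is going on: on a pandas Series, the in operator tests membership against the index labels, not the values, which is why the comparison above returned False. (The series below is a toy example, not the original data.)
import pandas as pd

dates = pd.Series(['1976-06-01', '1976-06-02'])

print('1976-06-01' in dates)             # False: `in` checks the index (0, 1)
print('1976-06-01' in list(dates))       # True: the list workaround from above
print('1976-06-01' in dates.values)      # True: same idea without building a list
print(dates.isin(['1976-06-01']).any())  # True: idiomatic element-wise membership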

How to efficiently append a dataframe column with a vector?

Working with Julia 1.1:
The following minimal code works and does what I want:
using DataFrames

function test()
    df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
    append!(df.NbAlternative, ones(Int, 5))
    df
end
This appends a vector to one column of df. Note: in my full code, I append a more complicated Vector{Int} than the return value of ones.
However, @code_warntype test() returns:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
which I suppose means this isn't efficient. I can't work out what this @code_warntype output means. More generally, how can I understand the problems reported by @code_warntype and fix them? This is a recurring source of confusion for me.
EDIT: in response to @BogumiłKamiński's answer
Then how would one write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations)*na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a vector of length nb_simulations.
You should never do such things, as it will cause many functions from DataFrames.jl to stop working properly. In fact, such code will soon throw an error - see https://github.com/JuliaData/DataFrames.jl/issues/1844, which is exactly about patching this hole in the DataFrames.jl design.
What you should do is append a data frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths), or use push! to add a single row to a DataFrame.
Now, the reason you see type instability is that a DataFrame can hold vectors of any type (technically, the columns are held in a Vector{AbstractVector}), so it is not possible to determine at compile time what the type of the vector under a given name will be.
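A tiny illustration of that point (a toy frame, not from the question):
using DataFrames

df = DataFrame(x = [1, 2], y = ["a", "b"])
typeof(df.x)  # Vector{Int64}
typeof(df.y)  # Vector{String}
# both columns live in the same Vector{AbstractVector} container, so the
# concrete type of a column looked up by name cannot be inferred statically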
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # if it is passed 1 as nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
                     NbMonteCarlo=ones(Int, nb_simulations)*mt,
                     Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! matches columns between the appended object and the target DataFrame by name, so the column names must be consistent between the two.
This is different from push!, which also allows pushing a row that does not specify column names (in my example above I show that a plain Tuple can be pushed).
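A minimal toy sketch of that difference:
using DataFrames

df = DataFrame(a=Int[], b=Int[])
push!(df, (a=1, b=2))              # NamedTuple row: matched by column name
push!(df, (3, 4))                  # plain Tuple row: matched by position
append!(df, (a=[5, 6], b=[7, 8]))  # append! requires names matching df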