Julia merging dataframes - dataframe

i am rather new to julia so apologies if this is a rather trivial question. However I have stumbled upon a problem i cannot resolve so appreciate any help/suggestions.
I have a function which returns a set of values in a dataframe when I run it. Something along the lines of:
(values) = function(data)
However I run this function in
[for example see here][1]
I run this function in a loop so that each time it runs, I get a new set of values.
for x = 1:5
(values) = function(data[data[:sub].==[x], :])
end
After the function runs, I then want to put the values that come back into a "master" data frame which has exactly the same column headings and compiles the values that come back on each loop iteration.
This seems infuriatingly tricky to do. I tried using append! as described here:
https://discourse.julialang.org/t/adding-a-new-row-to-a-dataframe/1331/3
But this does not work. For example if I run the following commands
(values_1) = function(data[data[:sub].==[1], :])
(values_2) = function(data[data[:sub].==[2], :])
append(values_1, values_2)
This fails with the following error message:
MethodError: no method matching append!(::Array{Float64,2}, ::Array{Float64,2})
Closest candidates are:
append!(::PyCall.PyObject, ::Any) at /Users/neil/.julia/v0.5/PyCall/src/PyCall.jl:836
append!(::Array{T,1}, ::Any) at collections.jl:21
append!{T}(::PyCall.PyVector{T}, ::Any) at /Users/neil/.julia/v0.5/PyCall/src/conversions.jl:278
in append!(::DataFrames.DataFrame, ::DataFrames.DataFrame) at /Users/neil/.julia/v0.5/DataFrames/src/dataframe/dataframe.jl:791
in include_string(::String, ::String) at ./loading.jl:441
in include_string(::String, ::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
Note that the function does correctly return a dataframe each time I run it, it is just the concatanation of values that is causing problems.
appreciate any pointers.
n

You're looking for vcat, e.g. vcat(values_1, values_2). There might be a better way to do what you're doing, though, but it's hard to give specific advice without an 'MWE', i.e. an example we can paste into a terminal, run and rewrite.

Related

Can anyone tell me what's wrong with my code (I am a newbie in programming, pls do cooperate )

I am trying to write a code which calculates the HCF of two numbers but I am either getting a error or an empty list as my answer
I was expecting the HCF, My idea was to get the factors of the 2 given numbers and then find the common amongst them then take the max out of that
For future reference, do not attach screenshots. Instead, copy your code and put it into a code block because stack overflow supports code blocks. To start a code block, write three tildes like ``` and to end it write three more tildes to close. If you add a language name like python, or javascript after the first three tildes, syntax highlighting will be enabled. I would also create a more descriptive title that more accurately describes the problem at hand. It would look like so:
Title: How to print from 1-99 in python?
for i in range(1,100):
print(i)
To answer your question, it seems that your HCF list is empty, and the python max function expects the argument to the function to not to be empty (the 'arg' is the HCF list). From inspection of your code, this is because the two if conditions that need to be satisfied before anything is added to HCF is never satisfied.
So it could be that hcf2[x] is not in hcf and hcf[x] is not in hcf[x] 2.
What I would do is extract the logic for the finding of the factors of each number to a function, then use built in python functions to find the common elements between the lists. Like so:
num1 = int(input("Num 1:")) # inputs
num2 = int(input("Num 2:")) # inputs
numberOneFactors = []
numberTwoFactors = []
commonFactors = []
# defining a function that finds the factors and returns it as a list
def findFactors(number):
temp = []
for i in range(1, number+1):
if number%i==0:
temp.append(i)
return temp
numberOneFactors = findFactors(num1) # populating factors 1 list
numberTwoFactors = findFactors(num2) # populating factors 2 list
# to find common factors we can use the inbuilt python set functions.
commonFactors = list(set(numberOneFactors).intersection(numberTwoFactors))
# the intersection method finds the common elements in a set.

Pandas dataframe [] issue

I am new to python. Could someone help me why the first code statement generates no coding error but the second one will generate a keyerror(I do not even know what is a keyerror).
p.s.: data is a DataFrame,'StrategyCumulativePct' and 'BuyHold' are two columns in the DataFrame respectively.
data[['StrategyCumulativePct', 'BuyHold']].plot()
data['StrategyCumulativePct', 'BuyHold'].plot()
On the other hand, may I ask why sometimes when I had only written 20 lines of codes but there could be errors pointing an arrow at lines 2000 / 3000.... which I have not created before? Thanks.

How to efficiently append a dataframe column with a vector?

Working with Julia 1.1:
The following minimal code works and does what I want:
function test()
df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
append!(df.NbAlternative, ones(Int, 5))
df
end
Appending a vector to one column of df. Note: in my whole code, I add a more complicated Vector{Int} than ones' return.
However, #code_warntype test() does return:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
Which means I suppose, thisn't efficient. I can't manage to get what this #code_warntype error means. More generally, how can I understand errors returned by #code_warntype and fix them, this is a recurrent unclear issue for me.
EDIT: #BogumiłKamiński's answer
Then how one would do the following code ?
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
append!(df.NbAlternative, ones(Int, nb_simulations)*na)
append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
end
end
compare_smaa returns a nb_simulations length vector.
You should never do such things as it will cause many functions from DataFrames.jl to stop working properly. Actually such code will soon throw an error, see https://github.com/JuliaData/DataFrames.jl/issues/1844 that is exactly trying to patch this hole in DataFrames.jl design.
What you should do is appending a data frame-like object to a DataFrame using append! function (this guarantees that the result has consistent column lengths) or using push! to add a single row to a DataFrame.
Now the reason you have type instability is that DataFrame can hold vector of any type (technically columns are held in a Vector{AbstractVector}) so it is not possible to determine in compile time what will be the type of vector under a given name.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
for i in 1:nb_simulations
# here you have to make sure that compare_smaa returns a scalar
# if it is passed 1 in nb_simulations
push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
end
end
end
And this is how you can use append!:
for na in arr_nb_alternative
#show na
for mt in arr_nb_montecarlo
println("...$mt")
# here you have to make sure that compare_smaa returns a vector
append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
NbMonteCarlo=ones(Int, nb_simulations)*mt,
Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching so column names must be consistent between the target and appended object.
This is different in push! which also allows to push a row that does not specify column names (in my example above I show that a Tuple can be pushed).

Defining a function in Pandas

I am new to Pandas and I am taking this course online. I know there is a way to define a function to make this code cleaner but I'm not sure how to go about it.
noshow = len((df[
(df['Gender'] == 'M') \
& (df['No_show'] == 'Yes') \
& (df['Persons_age'] == 'Child')
]))
noshow
There is multiple Genders and multiple No_show answers and Multiple Person's age and I don't want to have write out the code for each one of those.
I've gotten the code for a single function but not for the mutiple iterations.
def print_noshow_percentage(column_name, value, percentage_text):
total = (df[column_name] == value).sum()
noshow = len((df[(df[column_name] == value) & (df['No_show'] == 'Yes')]))
print(int((noshow / total) * 100), percentage_text)
I hope this makes sense. Thanks for any help!
Welcome to Stack Exchange. You are not too clear about your desired output, but I think what you are trying to do is to get a summary of every possible combination of age, gender, and no_show in your df. To accomplish this you can use the built in groupby method of pandas documentation here.
As mentioned by #ALollz, the following code will get you everything you need to know about your counts in terms of percentages.
counts = df.groupby(['Gender', 'Persons_age'])['No_show'].value_counts(normalize=True)
Now you need to decide what to do with it. You can either iterate through the dataframe printing each line, or you can find specific combinations or you can print out the whole thing.
In general, it is better to look for a built in method than to try to build a function outside of pandas. There are a lot of different ways to do things and checking the documentation is a good place to start.

Pandas and Fuzzy Match

Currently I have two data frames. I am trying to get a fuzzy match of client names using fuzzywuzzy's process.extractOne function. When I have run the following script on sample data I get good results and no error, but when I run the following on my current data frames I get both an Attribute and Type error. I am not able to provide the data for security reasons, but if anyone can figure out why I am getting errors based on the script provided I would be much obliged.
names2 = list(dftr3['Common Name'])
names3 = dict(zip(names2,names2))
def get_fuzz_match(row):
match = process.extractOne(row['CLIENT_NAME'],choices = n3.keys(),score_cutoff = 80)
if match:
return n3[match[0]]
return np.nan
dfmi4['Match Name'] = dfmi4.apply(get_fuzz_match, axis=1)
I know not having some examples makes this more difficult to troubleshoot, so I will answer any question and edit the post to help this process along. The specific errors are:
1.AttributeError: 'dict_keys' object has no attribute 'items'
2.TypeError: expected string or buffer
The AttributeError is straightforward and to be expected, I think. Fuzzywuzzy's process.extract function, which does most of the actual work in process.extractOne, uses a try:... except: clause to determine whether to process the choices parameter as dict-like or list-like. I think you are seeing the exception because the TypeError is raised during the except: clause.
The TypeError is trickier to pin down, but I suspect it occurs somewhere in the StringProcessor class, used in the processor module, again called by extract, which uses several string methods and doesn't catch exceptions. So it seems likely that your apply call is passing something that is not a string. Is it possible that you have any empty cells?