Count will not work for unique elements in my dataframe, only when repeated - pandas

I want to count the number of occurrences in a dataframe, and I need to do it using the following loop:
for x in homicides_prec.reset_index().DATE.drop_duplicates():
    count = homicides_prec.loc[x]['VICTIM_AGE'].count()
    print(count)
However, this only works for when the specific Date is repeated. It does not work when dates only appear once, and I don't understand why. I get this error:
TypeError: count() takes at least 1 argument (0 given)
That said, it really doesn't make sense to me, because I get that error for this specific value (which only appears once on the dataframe):
for x in homicides_prec.reset_index().DATE[49:50].drop_duplicates():
    count = homicides_prec.loc[x]['VICTIM_AGE'].count()
    print(count)
However, I don't get the error if I run this:
homicides_prec.loc[homicides_prec.reset_index().DATE[49:50].drop_duplicates()]['VICTIM_AGE'].count()
Why does that happen??? I can't use the second option because I need to use the for loop.
More info, in case it helps: The problem seems to be that, when I run this (without counting), the output is just a number:
for x in homicides_prec.reset_index().DATE[49:50].drop_duplicates(): count= homicides_prec.loc[x]['VICTIM_AGE']
print(count)
Output: 33
So, when I add the .count it will not accept that input. How can I fix this?

There are a few issues with the code you shared, but the short answer is that when x appears only once you are not selecting a slice of rows; you are accessing a single row, and then a single value.
If x == '2019-01-01' and that value appears twice, then
homicides_prec.loc[x]
will be a pd.DataFrame with two rows, and
homicides_prec.loc[x]['VICTIM_AGE']
will be a pd.Series object with two rows, and it will happily take a .count() method.
However, if x == '2019-01-02' and that date is unique, then
homicides_prec.loc[x]
will be a pd.Series representing the single row whose index is x.
From that we see that
homicides_prec.loc[x]['VICTIM_AGE']
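is a single value, so calling .count() on it does not count rows. The TypeError suggests that the value is a string: str.count() expects a substring argument, which is why it complains that 0 arguments were given.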
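A minimal sketch of one way to keep the for loop, assuming DATE is the index and VICTIM_AGE holds strings (the data below is made up for illustration). Passing the label inside a list, `.loc[[x], ...]`, always returns a Series of rows, even when the date appears only once, so `.count()` works in both cases:

```python
import pandas as pd

# Hypothetical stand-in for homicides_prec: DATE is the index,
# one date is repeated and one is unique.
homicides_prec = pd.DataFrame(
    {"VICTIM_AGE": ["25", "41", "33"]},
    index=pd.Index(["2019-01-01", "2019-01-01", "2019-01-02"], name="DATE"),
)

counts = {}
for x in homicides_prec.index.unique():
    # [x] (a list) keeps the result a Series, never a scalar
    counts[x] = int(homicides_prec.loc[[x], "VICTIM_AGE"].count())

print(counts)

# If the loop is not strictly required, groupby gives the same counts
grouped = homicides_prec.groupby(level="DATE")["VICTIM_AGE"].count()
print(grouped)
```

The only change from the original loop is the extra pair of brackets around x; the groupby line is shown just for comparison.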


Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int # column number
    stop::Int # column number
    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only consecutive columns are working.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work you need to add the ... operator, giving
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])

Functionality of iloc and simply [ ] in a series

I think that for a pandas Series s, s[0:2] is equivalent to s.iloc[0:2]; in both cases two rows are returned. Recently, though, I ran into trouble. The first picture shows the expected output, but I don't know what went wrong in the second: there s[0] raises an error, and I don't know why.
I can try to explain this behavior a bit. In Pandas, you have selection by position or by label, and it's important to remember that every single row/column will always have a position "name" and a label name. In the case of columns, this distinction is often easy to see, because we usually give columns string names. The difference is also obvious when you explicitly use .iloc vs .loc. Finally, s[X] is indexing, while s[X:Y] is slicing, and the behaviour of the two actions is different.
Example:
df = pd.DataFrame({'a':[1,2,3], 'b': [3,3,4]})
df.iloc[:,0]
df.loc[:,'a']
both will return
0 1
1 2
2 3
Name: a, dtype: int64
Now, what happened in your case is that you overwrote the index names when you declared s.index = [11,12,13,14]. You can see that by inspecting the index before and after this change. Before, if you run s.index, you see that it is a RangeIndex(start=0, stop=4, step=1). After you change the index, it becomes Int64Index([11, 12, 13, 14], dtype='int64').
Why does this matter? Because although you overrode the labels of the index, the position of each one of them remains the same as before. So, when you call
s[0:2]
you are slicing by position (this section in the documentation explains that it's equivalent to .iloc). However, when you run
s[0]
Pandas thinks you want to select by label, so it starts looking for the label 0, which doesn't exist anymore, because you overrode it. Think of the square-bracket selection in the context of selecting a dataframe column: you would say df["column"] (so you're asking for the column by label), so the same is in the context of a series.
In summary, what happens with Series indexing is the following:
In the case you use string index labels, and you index by a string, Pandas will look up the string label.
In the case you use string index labels, and you index by an integer, Pandas will fall back to indexing by position (that's why your example in the comment works). Recent pandas releases deprecate this fallback, so .iloc is the safer choice.
In the case you use integer index labels, and you index by an integer, Pandas will always try to index by the "label" (that's why the first case doesn't work, as you have overridden the label 0).
Here are two articles explaining this bizarre behavior:
Indexing Best Practices in Pandas.series
Retrieving values in a Series by label or position
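The behavior described above can be reproduced with a small sketch. The values 10 to 40 are made up for illustration; only the index [11, 12, 13, 14] comes from the question. Using .iloc and .loc explicitly avoids the ambiguity altogether:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])   # hypothetical values
s.index = [11, 12, 13, 14]        # integer labels that no longer include 0

# Slicing with integers is positional: first two rows
by_slice = s[0:2]

# Indexing with a single integer is a label lookup: 0 is not a label
try:
    s[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Explicit and unambiguous alternatives
first_by_position = s.iloc[0]     # position 0
first_by_label = s.loc[11]        # label 11

print(list(by_slice), lookup_failed, first_by_position, first_by_label)
```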

How to efficiently append a dataframe column with a vector?

Working with Julia 1.1:
The following minimal code works and does what I want:
function test()
    df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
    append!(df.NbAlternative, ones(Int, 5))
    df
end
It appends a vector to one column of df. Note: in my full code, I append a more complicated Vector{Int} than what ones returns.
However, @code_warntype test() returns:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
Which means, I suppose, that this isn't efficient. I can't work out what this @code_warntype output means. More generally, how can I understand the problems reported by @code_warntype and fix them? This is a recurring point of confusion for me.
EDIT: following @BogumiłKamiński's answer
Then how would one write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations)*na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a nb_simulations length vector.
You should never do such things as it will cause many functions from DataFrames.jl to stop working properly. Actually such code will soon throw an error, see https://github.com/JuliaData/DataFrames.jl/issues/1844 that is exactly trying to patch this hole in DataFrames.jl design.
What you should do is append a data frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths), or use push! to add a single row to a DataFrame.
Now the reason you have type instability is that a DataFrame can hold vectors of any type (technically columns are held in a Vector{AbstractVector}), so it is not possible to determine at compile time what the type of the vector under a given name will be.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well and I do it almost every day (as I do a lot of simulations). As I have indicated - you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I add it as it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # if it is passed 1 in nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
                     NbMonteCarlo=ones(Int, nb_simulations)*mt,
                     Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that I append here a NamedTuple. As I have written earlier you can append a DataFrame or any data frame-like object this way. What "data frame-like object" means is a broad class of things - in general anything that you can pass to DataFrame constructor (so e.g. it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching so column names must be consistent between the target and appended object.
This is different from push!, which also allows pushing a row that does not specify column names (in my example above I show that a Tuple can be pushed).

python error message min() arg is an empty sequence

I am trying to create a function that takes up a string of numbers and outputs the maximum and minimum values. Here is my code
def high_and_low(numbers):
    numbers = map(int, numbers.split())
    max_n = max(numbers)
    print(max_n)
    min_n = min(numbers)
    return(max_n, min_n)
But I get the following error: ValueError: min() arg is an empty sequence. So I assume that it does not read the negative values, but I don't know why.
I assume you're using Python 3.x, where map() was changed to return a lazy iterator rather than an explicit list of results as in Python 2.x. Calling max() on this iterator exhausted it, leaving no elements for min() to iterate over.
One solution would be to convert this iterator to a list, perhaps numbers = list(numbers) as the second line of your function. A list can be iterated as many times as you need.
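A minimal sketch of that fix, with the exhaustion shown first. The input strings are made up for illustration, and the list comprehension is just one way to materialize the values (list(map(int, ...)) would work equally well):

```python
# A map object can only be consumed once
it = map(int, "1 2 3".split())
largest = max(it)        # consumes every element
leftover = list(it)      # nothing remains for a second pass


def high_and_low(numbers):
    # Materialize the values so they can be iterated more than once
    values = [int(n) for n in numbers.split()]
    return max(values), min(values)


result = high_and_low("4 -3 12 0")
print(result)  # (12, -3)
```

Negative numbers are parsed correctly by int(), so they were never the cause of the error.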

Why does the rebol interpreter return different results?

Consider:
>> print max 5 6 7 8
6
== 8
The documentation states that max only takes two arguments, so I understand the first line. But from the second line it looks like the interpreter is still able to find the max of an arbitrary number of args.
What's going on here? What is the difference between the two results returned? Is there a way to capture the second one?
I don't really know Rebol, but what I do notice is that you're using print inside of the REPL. The first output is from print, which is outputting the result of max 5 6. The second output is from the REPL, which is outputting the value of your whole expression, which is maybe just the last item in the list? If you changed the order of your inputs, I bet you would see a different result.
max is an abbreviation for maximum. As @hobbs correctly guessed, it takes two arguments, and what you're seeing is just the evaluator's logic of turning the crank...and becoming equal to the last value in the expression. In this case you're not using that result, so the interpreter shows it to you with "==". But you could have assigned that whole expression to a variable (for instance).
What you were intending is something that gets the maximum value out of a series. In the DO dialect all functions have fixed arity, and the right way to design such a beast would be to make it take one argument...the series.
Such a thing does exist, though there isn't an abbreviation...
>> print maximum-of [5 6 7 8]
8