I want to create a function in Python that normalizes the values of several columns depending on a condition.
As an example, take the following df; mine has 24 columns in total (23 int and 1 object):
Column A  Column B  Column C
2         4         A
3         3         B
0         0.4       A
5         7         B
3         2         A
6         0         B
Let's say that I want to create a new df with the values of Column A and Column B divided by a factor X or Y, depending on whether Column C is A or B: if Column C is A the factor is X, and if Column C is B the factor is Y.
I have written different versions of a function:
def normalized_new (columns):
    for col in df.columns:
        if df.loc[df['Column C'] =='A']:
            col=df[col]/X
        elif df.loc[df['Column C'] =='B']:
            col=df[col]/Y
        else: pass
    return columns
normalized_new (df)
and the other one I tried:
def new_norm (prog):
    if df.loc[(df['Column C']=='A')]:
        prog = 1/X
    elif df.loc[(df['Column C']=='B')]:
        prog = 1/Y
    else: print('this function doesnt work well')
    return (prog)
for col in df.columns:
    df[col]=new_norm(df)
With both functions I always get the same ValueError:
The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Could you help me understand what is going on here? Is there any other way to create a df with the desired output?
Thank you so much in advance!
Try to use np.where + .div:
import numpy as np

X = 10
Y = -10
# np.where builds one divisor per row: X where Column C is "A", otherwise Y
df[["Column A", "Column B"]] = df[["Column A", "Column B"]].div(
    np.where(df["Column C"].eq("A"), X, Y), axis=0
)
print(df)
Prints:
   Column A  Column B Column C
0       0.2      0.40        A
1      -0.3     -0.30        B
2       0.0      0.04        A
3      -0.5     -0.70        B
4       0.3      0.20        A
5      -0.6     -0.00        B
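As for why the original attempts fail: df.loc[mask] returns a DataFrame, and a DataFrame cannot be reduced to a single True/False, so using it as an `if` condition raises the ValueError quoted in the question. The following is a minimal sketch, not a definitive implementation, that reproduces the error and then extends the np.where + .div idea to every numeric column. The sample data is copied from the question; the names `divisor`, `numeric_cols` and `normalized` are illustrative, and the use of select_dtypes to pick the columns is an assumption about what the full 24-column frame looks like.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column A": [2, 3, 0, 5, 3, 6],
    "Column B": [4, 3, 0.4, 7, 2, 0],
    "Column C": ["A", "B", "A", "B", "A", "B"],
})
X, Y = 10, -10  # example factors, as in the answer above

# Filtering with a mask returns a DataFrame, and a DataFrame has no single
# truth value, so using it as an `if` condition raises ValueError.
try:
    if df.loc[df["Column C"] == "A"]:
        pass
except ValueError as err:
    error_message = str(err)

# Build one divisor per row, then divide every numeric column by it.
divisor = np.where(df["Column C"].eq("A"), X, Y)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
normalized = df.copy()  # leave the original df untouched
normalized[numeric_cols] = df[numeric_cols].div(divisor, axis=0)
print(normalized)
```

Because the result goes into a copy, the original df keeps its raw values, which matches the request for a new df.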
Would you consider using apply with a custom function that computes the new column from the whole row? This makes the logic easier to read.
For example:
X=10
Y=5
def new_norm(row):
    #put your if/elif logic here, for example:
    if row['Column C'] == 'A':
        return row['Column A']/X #don't forget to return value for new column
    ....
df['newcol'] = df.apply(new_norm, axis=1) #call function for each row and add column 'newcol'
A function also lets you handle edge cases (for example an empty Column C, or a value other than A or B).
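A filled-in version of the row-wise idea might look like the sketch below. It is only an illustration under stated assumptions: the `FACTORS` lookup table, the extra `col` parameter and the choice to return NaN for unknown categories are not from the answer above, and the small sample frame (with a missing Column C value) is made up to exercise the edge case.

```python
import numpy as np
import pandas as pd

X, Y = 10, 5
FACTORS = {"A": X, "B": Y}  # hypothetical lookup: category -> divisor

def new_norm(row, col):
    """Divide row[col] by the factor for the row's category, NaN if unknown."""
    factor = FACTORS.get(row["Column C"])
    if factor is None:  # empty Column C, or a value other than A or B
        return np.nan
    return row[col] / factor

df = pd.DataFrame({
    "Column A": [2, 3, 0],
    "Column B": [4, 3, 0.4],
    "Column C": ["A", "B", None],
})

# Call the function once per row (axis=1) and store the result in a new column.
df["A_norm"] = df.apply(lambda row: new_norm(row, "Column A"), axis=1)
print(df)
```

Row-wise apply is easy to read but runs Python code for every row, so on large frames the vectorized np.where approach is usually faster.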
So, I'm learning more about Julia and I would like to do the following:
I have a 2-row by 3-column matrix, which is fixed,
A = rand(2,3)
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.705942 0.553562 0.731246
0.205833 0.106978 0.131893
Then, I would like to have an anonymous function that does the following:
a = ones(1,3);
a[2] = rand();
Finally, I would like to broadcast
broadcast(+, ones(1,3) => a[2]=rand(), A)
So the middle column of A, i.e. A[:,2], should have two different random numbers added to it (one per row), and the other columns should have ones added.
EDIT:
If I add a, as it is:
julia> a = ones(1,3)
1×3 Matrix{Float64}:
1.0 1.0 1.0
julia> a[2] = rand()
0.664824196431979
julia> a
1×3 Matrix{Float64}:
1.0 0.664824 1.0
I would like a to be dynamic, i.e. a function, so that rand() is evaluated separately for each element.
So that:
broadcast(+, a, A)
Would give:
julia> broadcast(+, a, A)
2×3 Matrix{Float64}:
1.70594 0.553562 + rand() (correct) 1.73125
1.20583 0.106978 + rand() (different rand()) 1.13189
Instead of:
julia> broadcast(+, a, A)
2×3 Matrix{Float64}:
1.70594 1.21839 (0.553562 + -> 0.664824) 1.73125
1.20583 0.771802 (0.106978 + -> 0.664824) 1.13189
So, I thought of this pseudo-code:
broadcast(+, a=ones(1,3) => a[2]=rand(), A)
Formalizing:
broadcast(+, <anonymous-function>, A)
Second EDIT:
Rules/Constraints:
Rule 1: the call must be data-transparent. That is, A must not change state, just like when we call f.(A).
Rule 2: no auxiliary variable may be created (a must not exist). The only array that may exist, before and after the call, is A.
Rule 3: f.(A) must be anonymous; that is, you can't define f as function f(A) ... end
With the caveat that I don't know how much you really learn by setting artificial rules like this, some tidier ways are:
julia> A = [ 0.705942 0.553562 0.731246
0.205833 0.106978 0.131893 ]; # as given
julia> r = 0.664824196431979; # the one random number
julia> (A' .+ (1, r, 1))' # no extra vector
2×3 adjoint(::Matrix{Float64}) with eltype Float64:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
julia> mapslices(row -> row .+ (1, r, 1), A; dims=2) # one line, but slow
2×3 Matrix{Float64}:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
julia> B = A .+ 1; @views B[:, 2] .+= (-1 + r); B # fast, no extra allocations
2×3 Matrix{Float64}:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
I can't tell from your question whether you want one random number or two different ones. If you want two, then you can do this:
julia> using Random
julia> Random.seed!(1); mapslices(row -> row .+ (1, rand(), 1), A; dims=2)
2×3 Matrix{Float64}:
1.70594 0.675436 1.73125
1.20583 0.771383 1.13189
julia> Random.seed!(1); B = A .+ 1; @views B[:, 2] .+= (-1 .+ rand.()); B
2×3 Matrix{Float64}:
1.70594 0.675436 1.73125
1.20583 0.771383 1.13189
Note that (-1 .+ rand.()) isn't making a new array on the right; it's fused by .+= into one loop over a column of B. Note also that B[:, 2] .= stuff just writes into B, but B[:, 2] .+= stuff means B[:, 2] .= B[:, 2] .+ stuff, and so, without @views, the slice B[:, 2] on the right would allocate a copy.
Firstly I'd like to say that the approach taken in the other answers is the most performant one. It seems like you want the entire matrix at the end, in that case for the best performance it is generally good to get data (like randomness) in big batches and to not "hide" data from the compiler (especially type information). A lot of interesting things can be achieved with higher level abstractions but since you say performance is important, let's establish a baseline:
function approach1(A)
    a = ones(2,3)
    @. a[:, 2] = rand()
    broadcast(+, a, A)
end
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.199619 0.273481 0.99254
0.0927839 0.179071 0.188591
julia> @btime approach1($A)
65.420 ns (2 allocations: 256 bytes)
2×3 Matrix{Float64}:
1.19962 0.968391 1.99254
1.09278 1.14451 1.18859
With that out of the way let's try some other solutions.
If a single row with lazy elements doesn't count as an auxiliary variable this seems like a good starting point:
function approach2(A)
    a = Matrix{Function}(undef, 1, 3)
    fill!(a, ()->1.0)
    a[2] = rand
    broadcast((a,b)->a() + b, a, A)
end
We get a row a = [()->1.0 rand ()->1.0] and evaluate each function when the broadcast gets that element.
julia> @btime approach2($A)
1.264 μs (24 allocations: 960 bytes)
The performance is 20 times worse. Why? We've hidden type information from the compiler: it can't tell that a() is a Float64. Just asserting this (changing the last line to broadcast((a,b)->a()::Float64 + b, a, A)) improves performance almost eightfold:
julia> @btime approach2($A)
164.108 ns (14 allocations: 432 bytes)
If this is acceptable we can make it cleaner: introduce a LazyNumber type that keeps track of the return type, and has promote rules/operators so we can get back to broadcast(+, ...). However, we are still 2-3 times slower, can we do better?
An approach that could allow us to squeeze out some more would be to represent the whole array lazily: something like a Fill type, or a LazySetItem that applies on top of a matrix. Once again, actually creating the array will be cheaper unless you can avoid accessing parts of the array.
I agree that it is not very clear what you are trying to achieve, or whether you want to learn how to achieve something or how something works in theory.
If all you want is just to add a random vector to a matrix column (and 1 elsewhere), it is as simple as... add a random vector to the desired matrix column and 1 elsewhere:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.94194 0.691855 0.583107
0.198166 0.740017 0.914133
julia> A[:,[1,3]] .+= 1
2×2 view(::Matrix{Float64}, :, [1, 3]) with eltype Float64:
1.94194 1.58311
1.19817 1.91413
julia> A[:,2] += rand(size(A,1))
2-element Vector{Float64}:
1.0306116987831297
0.8757712661515558
julia> A
2×3 Matrix{Float64}:
1.94194 1.03061 1.58311
1.19817 0.875771 1.91413
Why not just have a be the same size as A, and then you don't even need broadcasting or any weird tricks:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.564824 0.765611 0.841353
0.566965 0.322331 0.109889
julia> a = ones(2,3);
julia> a[:, 2] .= [rand() for _ in 1:size(a, 1)] #in every row, make the second column's value a different rand() result
julia> a
2×3 Matrix{Float64}:
1.0 0.519228 1.0
1.0 0.0804104 1.0
julia> A + a
2×3 Matrix{Float64}:
1.56482 1.28484 1.84135
1.56696 0.402741 1.10989
I am trying to count the characters that strings in consecutive rows of a pandas Series have in common, using a user-defined function, and to write the output into a new column. I figured out the individual steps, but when I put them together I get a wrong result. Could you please tell me the best way to do this? I am a complete Python beginner!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x='d7e'
y='8e0d'
s=0
for i in y:
    b=str(i)
    if b not in x:
        s+=0
    else:
        s+=1
print(s)
The right result for these particular strings is 2.
Note: when I wrap this in def func(x,y): something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
# temp holds the previous row's Code; count the characters of Code found in temp
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.nan
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0
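The same computation can be spelled out with a named function, which may be easier to follow for a beginner. This is a sketch rather than the answer's own method: the helper name `count_shared` and the list comprehension over zip are illustrative choices, and the counter is a fresh sum on every call, so nothing carries over between rows.

```python
import numpy as np
import pandas as pd

def count_shared(prev, curr):
    """Count the characters of curr that also occur in prev (NaN if no prev)."""
    if pd.isna(prev):  # the first row has no previous code
        return np.nan
    return sum(ch in prev for ch in curr)

df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})

# shift(1) moves every code down one row, so each row sees its predecessor.
previous = df["Code"].shift(1)
df["R1SB"] = [count_shared(p, c) for p, c in zip(previous, df["Code"])]
print(df)
```

Pairing the shifted and original columns with zip avoids the temporary column and the row-wise apply.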
I want to return a logical vector showing which strings of a string array A are also members of a string array B.
In MATLAB, this would be
A = ["me","you","us"]
B = ["me","us"]
myLogicalVector = ismember(A,B)
myLogicalVector =
1×3 logical array
1 0 1
How do I achieve this in Julia?
I have tried
myLogicalVector = occursin.(A,B)
myLogicalVector = occursin(A,B)
It seems that occursin only works if the two input string arrays are of the same length or one string is a scalar - I am not sure if I am correct on this one.
You can write:
julia> in(B).(A)
3-element BitArray{1}:
1
0
1
More verbose versions of the same operation are (note that the type of the resulting array is different in all cases except the first):
julia> in.(A, Ref(B))
3-element BitArray{1}:
1
0
1
julia> [in(a, B) for a in A]
3-element Array{Bool,1}:
1
0
1
julia> map(a -> in(a, B), A)
3-element Array{Bool,1}:
1
0
1
julia> map(a -> a in B, A)
3-element Array{Bool,1}:
1
0
1
julia> [a in B for a in A]
3-element Array{Bool,1}:
1
0
1
If A and B were large and you needed performance, then convert B to a Set like this:
in(Set(B)).(A)
(you pay a one-time cost for creating the set, but then each lookup will be faster)
I'd like to get the number of rows of a dataframe.
I can achieve that with size(myDataFrame)[1].
Is there a cleaner way?
If you are using DataFrames specifically, then you can use nrow():
julia> df = DataFrame(Any[1:10, 1:10]);
julia> nrow(df)
10
Alternatively, you can specify the dimension argument for size:
julia> size(df, 1)
10
This also works for arrays, so it's a bit more general:
julia> my_array = rand(4, 3)
4×3 Array{Float64,2}:
0.980798 0.873643 0.819478
0.341972 0.34974 0.160342
0.262292 0.387406 0.00741398
0.512669 0.81579 0.329353
julia> size(my_array, 1)
4