When I log-transform a pandas column I get NaNs; should I replace these with 0? - pandas

I cannot find a similar question, but I have a df with some highly skewed columns. I plan to log-transform these columns and then standardize. However, when I log-transform I get NaNs; should I replace these with 0s?
log_train[skew_cols] = np.log2(featuresdf[skew_cols])
The error I get is:
RuntimeWarning: invalid value encountered in log2
This is separate from the ipykernel package so we can avoid doing imports until
Not sure what I am doing wrong.

You shouldn't replace them with 0s, because np.log(1) equals 0, so both 1 and 0 would end up as 0 in your log data.
Instead, just add 1 to your data before taking the log. Then log2(0 + 1) is 0, log2(1 + 1) is 1, and log2(2 + 1) is about 1.58, so distinct values stay distinct.
So the code would be:
log_train[skew_cols]=np.log2(featuresdf[skew_cols]+1)
The other option is to use another transform that can handle 0, such as the square root (np.sqrt).
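A minimal sketch of both options; featuresdf and skew_cols here are made-up stand-ins for the question's own data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's featuresdf / skew_cols
featuresdf = pd.DataFrame({"views": [0, 1, 2, 100]})
skew_cols = ["views"]

log_train = pd.DataFrame()
# log2(x + 1) keeps 0 valid: log2(0 + 1) == 0, so no NaNs or -inf appear
log_train[skew_cols] = np.log2(featuresdf[skew_cols] + 1)

# Alternative: square root handles 0 directly, no shift needed
sqrt_train = np.sqrt(featuresdf[skew_cols])
```

Either way, zeros stay meaningful in the transformed data instead of being conflated with log-transformed ones.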

Related

Writing csv file with CSV.jl throws StackOverflowError message

CSV.write() returns ERROR: LoadError: StackOverflowError, further down claiming --- the last 2 lines are repeated 39990 more times ---. That seems like nonsense, since the data frame being written is acknowledged to be of 129x6 dimension with no empty fields. I fear the function is counting rows beyond the dimension of the data frame and shouldn't be doing so. How do I make this function work?
using CSV,DataFrames
file=DataFrame(
a=1:50,
b=1:50,
c=Vector{Any}(missing,50),
d=Vector{Any}(missing,50))
CSV.write("a.csv",file)
file=CSV.read("a.csv",DataFrame)
c,d=[],[]
for i in 1:nrow(file)
push!(c,file.a[i]*file.b[i])
push!(d,file.a[i]*file.b[i]/file.a[i])
end
file[!,:c]=c
file[!,:d]=d
# I found no straight forward way to fill empty columns based on the values of filled columns in a for loop that wouldn't generate error messages
CSV.write("a.csv",DataFrame) # Error in question here
The StackOverflowError is indeed due to the line
CSV.write("a.csv",DataFrame)
which tries to write the data type itself to a file instead of the actual dataframe variable. (The reason that leads to this particular error is an implementation detail that's described here.)
# I found no straight forward way to fill empty columns based on the values of filled columns in a for loop that wouldn't generate error messages
You don't need a for loop for this, just
file.c = file.a .* file.b
file.d = file.a .* file.b ./ file.a
will do the job. (Though I don't see the point of the file.d calculation, since file.a .* file.b ./ file.a just simplifies back to file.b; maybe it's a dummy calculation chosen for illustration.)

Functionality of iloc and simply [ ] in a series

I think in pandas a series S
S[0:2] is equivalent to S.iloc[0:2]: in both cases two rows are returned. But recently I ran into trouble. The first screenshot shows the expected output of S[0:2], yet in the second screenshot S[0] raises an error, and I don't know why.
I can try to explain this behavior a bit. In Pandas, you can select by position or by label, and it's important to remember that every row/column always has both a positional "name" and a label name. For columns this distinction is often easy to see, because we usually give columns string names. The difference is also obvious when you use .iloc vs .loc slicing explicitly. Finally, s[X] is indexing, while s[X:Y] is slicing, and the behaviour of the two actions is different.
Example:
df = pd.DataFrame({'a':[1,2,3], 'b': [3,3,4]})
df.iloc[:,0]
df.loc[:,'a']
both will return
0 1
1 2
2 3
Name: a, dtype: int64
Now, what happened in your case is that you overwrote the index names when you declared s.index = [11,12,13,14]. You can see that by inspecting the index before and after this change. Before, if you run s.index, you see that it is a RangeIndex(start=0, stop=4, step=1). After you change the index, it becomes Int64Index([11, 12, 13, 14], dtype='int64').
Why does this matter? Because although you overrode the labels of the index, the position of each one of them remains the same as before. So, when you call
s[0:2]
you are slicing by position (this section of the documentation explains that it's equivalent to .iloc). However, when you run
s[0]
Pandas thinks you want to select by label, so it starts looking for the label 0, which doesn't exist anymore, because you overrode it. Think of the square-bracket selection in the context of selecting a dataframe column: you would say df["column"] (so you're asking for the column by label), so the same is in the context of a series.
In summary, what happens with Series indexing is the following:
If you use string index labels and index by a string, Pandas looks up the string label.
If you use string index labels and index by an integer, Pandas falls back to indexing by position (that's why the example in the comment works).
If you use integer index labels and index by an integer, Pandas always tries to index by the label (that's why the first case fails: you have overridden the label 0).
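A minimal sketch of this behaviour, with values and the [11, 12, 13, 14] index mirroring the question:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])
s.index = [11, 12, 13, 14]  # integer labels that no longer match positions

two_rows = s[0:2]  # a slice is positional, like .iloc -> values [10, 20]
by_label = s[11]   # plain indexing with an integer label goes by label -> 10

try:
    s[0]           # the label 0 no longer exists
    raised = False
except KeyError:
    raised = True  # True: Pandas looked for the label 0, not position 0
```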
Here are two articles explaining this bizarre behavior:
Indexing Best Practices in Pandas.series
Retrieving values in a Series by label or position

How to subset Julia DataFrame by condition, where column has missing values

This seems like something that should be almost dead simple, yet I cannot accomplish it.
I have a dataframe df in julia, where one column is of type Array{Union{Missing, Int64},1}.
The values in that column are: [missing, 1, 2].
I would simply like to subset the dataframe df to just see those rows that correspond to a condition, such as where the column is equal to 2.
What I have tried --> result:
df[df[:col].==2] --> MethodError: no method matching getindex
df[df[:col].==2, :] --> ArgumentError: invalid row index of type Bool
df[df[:col].==2, :col] --> BoundsError: attempt to access String. (Note that just df[!, :col] returns 1339-element Array{Union{Missing, Int64},1}: [...eliding output...], along with my favorite warning so far in Julia: Warning: getindex(df::DataFrame, col_ind::ColumnIndex) is deprecated, use df[!, col_ind] instead. Having just used that exact form would seem to exempt me from the warning, but whatever.)
This cannot be as hard as it seems.
Just as FYI, I can get what I want through using Query and making a multi-line sql query just to subset data, which seems...burdensome.
How to do row subsetting
There are two ways to solve your problem:
use isequal instead of ==, since == implements three-valued logic (a comparison with missing returns missing), so any of the following will work:
df[isequal.(df.col,2), :] # new data frame
filter(:col => isequal(2), df) # new data frame
filter!(:col => isequal(2), df) # update old data frame in place
if you want to use == use coalesce on top of it, e.g.:
df[coalesce.(df.col .== 2, false), :] # new data frame
There is nothing special about it related to DataFrames.jl. Indexing works the same way in Julia Base:
julia> x = [1, 2, missing]
3-element Array{Union{Missing, Int64},1}:
1
2
missing
julia> x[x .== 2]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
julia> x[isequal.(x, 2)]
1-element Array{Union{Missing, Int64},1}:
2
(In general you can expect that, where possible, DataFrames.jl works consistently with Julia Base, except for some corner cases where that is not possible; the major differences come from the fact that a DataFrame has heterogeneous column element types, while a Matrix in Julia Base has a homogeneous element type.)
How to do indexing
DataFrame is a two-dimensional object. It has rows and columns. In Julia, normally, df[...] notation is used to access object via locations in its dimensions. Therefore df[:col] is not a valid way to index into a DataFrame. You are trying to use one indexing dimension, while specifying both row and column indices is required. You are getting a warning, because you are using an invalid indexing approach (in the next release of DataFrames.jl this warning will be gone and you will just get an error).
Actually, your example df[df[:col].==2] shows why we disallow single-dimensional indexing: in the inner df[:col] you use a single-dimensional index to subset columns, but in the outer df[df[:col].==2] you want a single-dimensional index to subset rows.
The easiest way to get a column from a data frame is df.col or df."col" (the second way is usually used if you have characters like spaces in the column name). This way you can access column :col without copying it. An equivalent way to write this selection using indexing is df[!, :col]. If you would want to copy the column write df[:, :col].
A side note - more advanced indexing
Indeed in Julia Base, if a is an array (of whatever dimension) then a[i] is a valid index if i is an integer or CartesianIndex. Doing df[i], where i is an integer is not allowed for DataFrame as it was judged that it would be too confusing for users if we wanted to follow the convention from Julia Base (as it is related to storage mode of arrays which is not the same as for DataFrame). You are though allowed to write df[i] when i is CartesianIndex (as this is unambiguous). I guess this is not something you are looking for.
All the rules for what is allowed when indexing a DataFrame are described in detail here. Also, during JuliaCon 2020 there is going to be a workshop discussing the design of indexing in DataFrames.jl in detail (how it works, why it works this way, and how it is implemented internally).

how to treat numpy array division zero error

I have numpy arrays with random numbers:
a = np.array([[7.2], [2.3], [1.1]])
b = np.array([[0], [1.3], [1.1]])
and I want to divide a with b, so I write code
c=np.divide(a,b)
but if the b array includes a zero, it raises a division error.
I want to keep this code simple, so how do I ignore the zero-division error?
Referring to my Python book, I think a try/except/finally statement would work, but I don't know how to use it here.
with np.errstate(all='ignore'):
    c = np.divide(a, b)
Exception handling may not be what you are looking for: often you simply want NaNs for 0/0 rather than abandoning the whole computation. np.nan_to_num may also come in handy.
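A minimal sketch combining both suggestions, with array values chosen to mirror the question:

```python
import numpy as np

a = np.array([[7.2], [2.3], [1.1]])
b = np.array([[0.0], [1.3], [1.1]])

# Suppress the divide/invalid RuntimeWarnings for this block only
with np.errstate(divide="ignore", invalid="ignore"):
    c = np.divide(a, b)  # x / 0 yields inf, and 0 / 0 would yield nan

# Then map the non-finite entries to 0 (or whatever sentinel suits you)
c = np.nan_to_num(c, nan=0.0, posinf=0.0, neginf=0.0)
```

The finite quotients come through untouched; only the entries produced by division by zero are replaced.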

Division Always Yields 1 or 0 in SAP BusinessObjects Data Services

Going to be answering my own question here, but I wanted to post this to get some attention in case anyone else ever comes across this.
Within the BusinessObjects Data Services Designer, I have two decimal(36, 10) values, and I'd like to divide one by the other. In order to account for divide-by-zero situations, I have to first check if the denominator is zero, so I end up with an ifthenelse statement like the following:
ifthenelse(Query.Denominator = 0, 0, Query.Numerator / Query.Denominator)
When I execute my job, however, I end up always getting 0 or 1, rather than a decimal value.
And as I said, I already had an answer here, just wanted to provide this for the community.
An ifthenelse statement is interesting in that it takes its data type from its second parameter. In the code example above, it assumes the correct data type is that of the parameter "0", which happens to be an integer.
So when the ifthenelse receives a value from the else branch, it converts it to an integer: a value of 0.356 becomes 0 and 0.576 becomes 1.
The trick is to cast the "0" to a decimal(36, 10):
ifthenelse(Query.Denominator = 0, cast(0, 'Decimal(36, 10)'), Query.Numerator / Query.Denominator)
Thanks for this, I thought I was going crazy!
Incidentally, for something slightly simpler, this also works:
ifthenelse(Query.Denominator = 0, 0.0, Query.Numerator / Query.Denominator)
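Purely as an analogy (this is Python, not Data Services script), the coercion pitfall can be sketched like this; the ifthenelse helper below is hypothetical and only mimics the "result type comes from the then-branch" behaviour described above:

```python
# Hypothetical analogue of Data Services' ifthenelse: the result type is
# taken from the second argument (the "then" value), so the else value
# gets coerced to it. Decimal-to-integer conversion rounds.
def ifthenelse(cond, then_val, else_val):
    result = then_val if cond else else_val
    if isinstance(then_val, int) and isinstance(result, float):
        return round(result)  # 0.356 -> 0, 0.576 -> 1
    return result

ifthenelse(False, 0, 0.356)    # the integer "then" value 0 forces rounding
ifthenelse(False, 0.0, 0.356)  # a float (decimal) "then" value keeps the fraction
```

This is why casting the "0" to Decimal(36, 10), or simply writing 0.0, preserves the quotient.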