Julia DataFrame: generate an unlinked duplicate of a column

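I want to define a new column based on column a that is not linked to the original afterwards. With the code below, x.b and x.a share the same vector, so modifying x.b[1] also changes x.a[1]: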
using DataFrames
x = DataFrame(a=1:3)
x.b = x.a
x.b[1] += 1

There are several ways to do it; the main ones are:
x[:, :b] = x.a
or
x.b = x[:, :a]
You can also write:
x[!, :b] = x[:, :a]
(this form is useful when the column name :b is held in a variable)
Finally, you could also just write:
x.b = copy(x.a)
or
x.b = x.a[:]
All indexing rules for DataFrames.jl can be found at https://juliadata.github.io/DataFrames.jl/stable/lib/indexing/.
In short (simplifying a bit but these rules are enough to know in practice):
df.col is non-copying for getting and for setting a column
df[!, :col] is the same as df.col, except that you can easily use a variable instead of a literal for indexing, and it works with broadcasting even when :col is not present in the data frame (df.col does not in that case)
df[:, :col] copies for getting a column and is an in-place operation for setting a column, unless :col is not present in df in which case it freshly allocates it when setting

Related

Multiple column selection on a Julia DataFrame

Imagine I have a DataFrame with 10 rows and 26 columns named A to Z.
What I would like to do is select several ranges of columns by name (not by index). For instance, assume that I want columns A to D and P to Z in a new DataFrame named df2.
I tried something like this, but it doesn't work:
df2=df[:,[:A,:D ; :P,:Z]]
syntax: unexpected semicolon in array expression
top-level scope at Slicing.jl:1
Any idea how to do this?
Thanks for any help
df2 = select(df, Between(:A,:D), Between(:P,:Z))
or
df2 = df[:, Cols(Between(:A,:D), Between(:P,:Z))]
(older DataFrames.jl releases spelled this selector All(...); current releases use Cols(...))
If you are sure your columns are only from :A to :Z, you can also write:
df2 = select(df, Not(Between(:E, :O)))
or
df2 = df[:, Not(Between(:E, :O))]
Finally, you can easily find the index of a column using the columnindex function, e.g.:
columnindex(df, :A)
and later use column numbers, if that is what you prefer.
In Julia you can also build ranges of Chars, so when your columns are named by single letters yet another option is:
df[:, Symbol.(vcat('A':'D', 'P':'Z'))]

How to make a scatter plot based on the values of a column in the data set?

I am given a data set that looks something like this
and I am trying to graph all the points with a 1 on the first column separate from the points with a 0, but I want to put them in the same chart.
I know the final result should be something similar to this
But I can't find a way to filter the points in Julia. I'm using LinearAlgebra, CSV, Plots, and DataFrames for my project, and so far I haven't found a way to make DataFrames storage types work nicely with Plots functions. I keep running into errors like "Cannot convert Float64 to series data for plotting" when I try plotting the points individually, using a for loop as a filter, as shown in the code below:
filter = select(data, :1)
newData = select(data, 2:3)
#graph one initial point to create the plot
plot(newData[1,1], newData[1,2], seriestype = :scatter, title = "My Scatter Plot")
#add the additional points with the 1 in front
for i in 2:size(newData)
    if filter[i] == 1
        plot!(newData[i, 1], newData[i, 2], seriestype = :scatter, title = "My Scatter Plot")
    end
end
Other approaches have given me other errors, but I haven't recorded those.
I'm using Julia 1.4.0 and the latest versions of all of the packages mentioned.
Quick Edit:
It might help to know that I am trying to replicate the Nonlinear dimensionality reduction section of this article https://sebastianraschka.com/Articles/2014_kernel_pca.html#principal-component-analysis
With Plots.jl you can do the following (here is a fully reproducible example):
julia> using DataFrames, Plots
julia> df = DataFrame(c=rand(Bool, 100), x = 2 .* rand(100) .- 1);
julia> df.y = ifelse.(df.c, 1, -1) .* df.x .^ 2;
julia> plot(df.x, df.y, color=ifelse.(df.c, "blue", "red"), seriestype=:scatter, legend=nothing)
However, in this case I would additionally use StatsPlots.jl as then you can just write:
julia> using StatsPlots
julia> @df df plot(:x, :y, group=:c, seriestype=:scatter, legend=nothing)
If you want to do it manually by groups it is easiest to use the groupby function:
julia> gdf = groupby(df, :c);
julia> summary(gdf) # check that we have 2 groups in data
"GroupedDataFrame with 2 groups based on key: c"
julia> plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
julia> plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
Note that the gdf variable is bound to a GroupedDataFrame object, from which you can get the groups defined by the grouping column (:c in this case).

Pandas dividing filtered column from df 1 by filtered column of df 2 warning and weird behavior

I have a data frame which is conditionally broken up into two separate dataframes as follows:
df = pd.read_csv(file, names=names)
df = df.loc[df['name1'] == common_val]
df1 = df.loc[df['name2'] == target1]
df2 = df.loc[df['name2'] == target2]
# each df has a 'name3' I want to perform a division on after this filtering
The original df is filtered by a value shared by the two dataframes, and then each of the two new dataframes are further filtered by another shared column.
What I want to work:
df1['name3'] = df1['name3']/df2['name3']
However, as many questions have pointed out, this causes a SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I tried what was recommended in this question:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2.loc[:,'name3']
# also tried:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2['name3']
But in both cases I still get weird behavior and the SettingWithCopyWarning.
I then tried what was recommended in this answer:
df.loc[df['name2']==target1, 'name3'] = df.loc[df['name2']==target1, 'name3']/df.loc[df['name2'] == target2, 'name3']
which still results in the same copy warning.
If possible I would like to avoid copying the data frame to get around this because of the size of these dataframes (and I'm already somewhat wastefully making two almost identical dfs from the original).
If copying is the best way to go with this problem I'm interested to hear why that works over all the options I explored above.
Edit: here is a simple data frame along the lines of what df would look like after the line df.loc[df['name1'] == common_val]
name1  other1  other2  name2  name3
a      x       y       1      2
a      x       y       1      4
a      x       y       2      5
a      x       y       2      3
So if target1=1 and target2=2,
I would like df1 to contain only rows where name2=1 and df2 to contain only rows where name2=2, then divide the resulting df1['name3'] by the resulting df2['name3'].
If there is a less convoluted way to do this (without splitting the original df) I'm open to that as well!
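One possible source of the "weird behavior" described above, sketched on the small example frame: pandas aligns two Series on their index labels before dividing, and after the filtering df1 and df2 have no labels in common. This is only a minimal sketch under that assumption. The values of target1 and target2 come from the example, while the explicit .copy() and the positional division through .to_numpy() are illustrative choices, not something the thread confirms as the intended solution.

```python
import pandas as pd

# Example frame from the question (after filtering on name1).
df = pd.DataFrame({
    "name1": ["a", "a", "a", "a"],
    "name2": [1, 1, 2, 2],
    "name3": [2, 4, 5, 3],
})
target1, target2 = 1, 2

# .copy() makes df1 an independent frame, so assigning to it later
# does not trigger SettingWithCopyWarning.
df1 = df.loc[df["name2"] == target1].copy()
df2 = df.loc[df["name2"] == target2]

# Label-aligned division: df1 has index 0, 1 and df2 has index 2, 3,
# so no labels match and every result is NaN.
aligned = df1["name3"] / df2["name3"]

# Positional division: drop the index by converting to NumPy arrays.
# This assumes both slices have the same number of rows.
df1["name3"] = df1["name3"].to_numpy() / df2["name3"].to_numpy()
```

Only df1 is copied here; df2 is only read from, so it can stay a plain filtered selection.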

slice dataframe inplace and dynamically rename in a loop

I am aware that it may not be good practice, but I am curious to know if it is possible to take two dfs (in this case, srm and srae), take a slice of each, and then name the sliced dataframes srm1 and srae1.
The logic is below.
for x in (srm, srae):
    x1 = x[x['years_in_role']>5]
    print(x.shape, x1.shape)
You can build a list of the two sliced DataFrames and unpack it into two variables:
srm1, srae1 = [x[x['years_in_role']>5] for x in (srm, srae)]
Your loop can instead be used to build a list, from which you then create the new variables:
L = []
for x in (srm, srae):
    x1 = x[x['years_in_role']>5]
    L.append(x1)
srm1 = L[0]
srae1 = L[1]
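If the goal is mainly to keep each slice attached to a name, a dictionary keyed by the new names is another option that avoids creating variables dynamically. The sketch below is only an illustration: the sample data and the names frames and sliced are made up for the example, and only the years_in_role column and the srm/srae names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
srm = pd.DataFrame({"years_in_role": [2, 6, 9]})
srae = pd.DataFrame({"years_in_role": [7, 1, 3]})

# Map each original name to its frame, then build the slices under new keys.
frames = {"srm": srm, "srae": srae}
sliced = {
    f"{name}1": frame[frame["years_in_role"] > 5]
    for name, frame in frames.items()
}

for name, frame in sliced.items():
    print(name, frame.shape)
```

The slices are then available as sliced["srm1"] and sliced["srae1"].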

How to insert a column in a julia DataFrame at specific position (without referring to existing column names)

I have a DataFrame in Julia with hundreds of columns, and I would like to insert a column after the first one.
For example in this DataFrame:
df = DataFrame(
colour = ["green","blue"],
shape = ["circle", "triangle"],
border = ["dotted", "line"]
)
I would like to insert a column area after colour, but without referring specifically to shape and border (that in my real case are hundreds of different columns).
df[:area] = [1,2]
In this example I can use (but referring specifically to shape and border):
df = df[[:colour, :area, :shape, :border]] # with specific reference to shape and border names
Update: This function has changed. See @DiegoJavierZea's comment.
Congratulations on finding a workaround yourself, but there is a built-in function that is semantically clearer and possibly a little faster:
using DataFrames
df = DataFrame(
colour = ["green","blue"],
shape = ["circle", "triangle"],
border = ["dotted", "line"]
)
insert!(df, 3, [1,2], :area)
Here 3 is the index the new column will have after the insertion, [1,2] is its content, and :area is its name. You can find more detailed documentation by typing ?insert! in the REPL after loading the DataFrames package.
It is worth noting that the ! is a part of the function name. It's a Julia convention to indicate that the function will mutate its argument.
rows = size(df)[1]     # size(df) returns the tuple (rows, columns)
insertcols!(df,        # DataFrame to be changed
    1,                 # insert as column 1
    :Day => 1:rows,    # populate "Day" with 1, 2, 3, ...
    makeunique=true)   # if a column named Day already exists, name the new one Day_1
While writing the question I also found a solution (as often happens).
I am still posting the question here as a record for myself and for others.
It is enough to save the column names before adding the new column:
df = DataFrame(
colour = ["green","blue"],
shape = ["circle", "triangle"],
border = ["dotted", "line"]
)
dfnames = names(df)
df[:area] = [1,2]
df = df[vcat(dfnames[1:1],:area,dfnames[2:end])]