Changing Julia dataframe column headers to lowercase?

I am looking for a way to change column headers to lowercase.
Let's say I have this dataframe:
df = DataFrame(TIME = ["2021-10-21","2021-10-22","2021-10-23"],
MQ2= [-1.1, -2, 1],
MQ3=[-1, -1, 3.1],
MQ8= [-1, -4.2, 2],
)
>>>df
TIME MQ2 MQ3 MQ8
String Float64 Float64 Float64
1 2021-10-21 -1.1 -1.0 -1.0
2 2021-10-22 -2.0 -1.0 -4.2
3 2021-10-23 1.0 3.1 2.0
I want to change all of my column headers, such as MQ2 to mq2.
Maybe something like df.columns.str.lower() in Python.
That way, I would get this dataframe:
time mq2 mq3 mq8
String Float64 Float64 Float64
1 2021-10-21 -1.1 -1.0 -1.0
2 2021-10-22 -2.0 -1.0 -4.2
3 2021-10-23 1.0 3.1 2.0

I would probably do the following:
julia> using DataFrames
julia> df = DataFrame(TIME = rand(5), MQ2 = rand(5), MQ3 = rand(5), MQ8 = rand(5));
julia> rename!(df, lowercase.(names(df)))
5×4 DataFrame
Row │ time mq2 mq3 mq8
│ Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────
1 │ 0.0796718 0.997022 0.0838867 0.63886
2 │ 0.923035 0.904928 0.993185 0.36081
3 │ 0.392671 0.0577061 0.518647 0.81432
4 │ 0.0377552 0.506528 0.190017 0.488105
5 │ 0.828534 0.731297 0.383561 0.604786
Here I'm using the DataFrames rename function in its mutating version (hence the bang in rename!), with a vector of new column names as the second argument. The new vector is created by getting the current names using names(df), and then broadcasting the lowercase function across each element in that vector.
Note that rename! also accepts pairs of old/new names if you only want to rename specific columns, e.g. rename!(df, "TIME" => "time")
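As a small variation on the above, rename and rename! also accept a function as the first argument, which is applied to each column name. The sketch below uses the non-mutating rename so the original data frame is left untouched; the variable name df_lower is only illustrative.

```julia
using DataFrames

df = DataFrame(TIME = ["2021-10-21", "2021-10-22"], MQ2 = [-1.1, -2.0])

# rename (no bang) returns a new data frame; lowercase is called on each
# column name, which is passed in as a String.
df_lower = rename(lowercase, df)

println(names(df_lower))  # lowercased names
println(names(df))        # original names are unchanged
```

Use rename!(lowercase, df) instead if you want to modify df in place.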

Related

quote all during df to csv in julia

Is there a way to double quote all fields when outputting a DataFrame to a CSV in Julia? I am having trouble finding an answer with Google.
In python I would add quoting=csv.QUOTE_ALL to df.to_csv(file)
I am having trouble finding something similar with CSV.write(file,df)
You can do the following:
julia> using CSV, DataFrames
julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf,ptr=1, mark=-1)
julia> df = DataFrame(rand(1:10, 3, 5), :auto)
3×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 6 10 5 4 4
2 │ 1 9 6 5 3
3 │ 5 4 5 8 4
julia> CSV.write(io, df; quotestrings=true, transform=(col,val)->string(val)) |> take! |> String |> println
"x1","x2","x3","x4","x5"
"6","10","5","4","4"
"1","9","6","5","3"
"5","4","5","8","4"
The catch is that quotestrings=true only forces quoting of string values (so that numbers stay unquoted and are parsed correctly when the file is read back). You therefore also need the transform argument to force every value to be written as a string.
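The same call can be wrapped in a small helper for reuse. This is only a sketch: csv_all_quoted is a hypothetical name, not part of CSV.jl, and it relies on the quotestrings and transform keywords shown above. Note that string(missing) is "missing", so missing values would be written as that literal text.

```julia
using CSV, DataFrames

# Hypothetical helper: write df to an in-memory buffer with every field quoted
# and return the CSV text.
function csv_all_quoted(df)
    io = IOBuffer()
    # transform converts every value to a String, so quotestrings quotes it.
    CSV.write(io, df; quotestrings=true, transform=(col, val) -> string(val))
    return String(take!(io))
end

df = DataFrame(id = [1, 2], name = ["a", "b"])
csv_text = csv_all_quoted(df)
print(csv_text)
```

To write to disk instead, pass a file path as the first argument of CSV.write with the same keyword arguments.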

Correlation coefficient score between dataframe columns in Julia

I have a dataframe like this:
datetime sensor1 sensor2
String Int64 Int64
1 2021-09-28 13:36:04 626 570
2 2021-09-28 13:36:04 622 571
3 2021-09-28 13:36:05 620 574
4 2021-09-28 13:36:06 619 578
I would like to get the correlation coefficient between columns sensor1 and sensor2 of the above dataframe.
For example, in Python, I can do it as:
cor = np.corrcoef(data.sensor1[0:] , data.sensor2[0:])[0,1]
How can I get the correlation coefficient in Julia?
Use cor from the Statistics standard library:
julia> using Statistics, DataFrames
julia> df = DataFrame(sensor1 = [626, 622, 620, 619], sensor2 = [570, 571, 574, 578])
4×2 DataFrame
Row │ sensor1 sensor2
│ Int64 Int64
─────┼──────────────────
1 │ 626 570
2 │ 622 571
3 │ 620 574
4 │ 619 578
julia> cor(Matrix(df))
2×2 Matrix{Float64}:
1.0 -0.861357
-0.861357 1.0
Here passing Matrix(df) means you'll get back a correlation matrix with the correlations between all columns.
More specifically for just two columns, which I guess is in line with your Python example:
julia> cor(df.sensor1, df.sensor2)
-0.861356769214109
EDIT: Actually I see you are doing [0, 1] indexing in Python, so you're probably getting back a 2x2 matrix there as well - arrays in Julia are 1-based so the equivalent would be cor(Matrix(df))[1, 2]. If you only want one number though there's no point computing all cross-correlations.
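One caveat: the data frame in the question also has a String column (datetime), and Matrix(df) on that would mix strings with numbers. A possible sketch is to select only the numeric columns first; names(df, Real) returns the names of columns whose element type is a subtype of Real. The datetime values below are copied from the question.

```julia
using Statistics, DataFrames

df = DataFrame(datetime = ["2021-09-28 13:36:04", "2021-09-28 13:36:04",
                           "2021-09-28 13:36:05", "2021-09-28 13:36:06"],
               sensor1 = [626, 622, 620, 619],
               sensor2 = [570, 571, 574, 578])

# Keep only the numeric columns, then compute the full correlation matrix.
numeric_cols = names(df, Real)
corr_matrix = cor(Matrix(df[:, numeric_cols]))

# Single pairwise coefficient, without building the whole matrix.
r = cor(df.sensor1, df.sensor2)
println(r)
```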

set_index() on Julia dataframe

I am looking for a function like pandas' .set_index() in Python for a Julia dataframe.
I've searched and found that NamedArray can give a result similar to .set_index() in Python, as below:
n = NamedArray(rand(2,4))
setnames!(n, ["one", "two"], 1)
n["one", 2:3]
n["two", :] = 11:14
n[Not("two"), :] = 4:7
Out[10]
2×4 Named Matrix{Float64}
A ╲ B │ 1 2 3 4
──────┼───────────────────────
one │ 4.0 5.0 6.0 7.0
two │ 11.0 12.0 13.0 14.0
However, NamedArray returns a matrix, and I could not find an equivalent function for Julia dataframes. Is there any function like .set_index()?
This is what I expect:
>>> df
1 2 3 4
value Int64 Float64 Float64 Float64
one 84 64 42 77
two 24 90 8 33
There is no function similar to set_index in DataFrames.jl. The recommended approach is to store this data as a column of the data frame. You can then, for example, groupby the data frame by this column to get quick lookup.
If you provide more information about what you need the row index for, I can comment on how this can be done in DataFrames.jl.
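A minimal sketch of that idea, assuming the row labels are stored in a column called key (the column names and values here are only illustrative): grouping by the key column gives a GroupedDataFrame that can be indexed by a tuple of key values.

```julia
using DataFrames

# The would-be index is kept as an ordinary column named key.
df = DataFrame(key = ["one", "two"], a = [84, 24], b = [64.0, 90.0])

# Group by the key column; each group can then be looked up by its key value.
gdf = groupby(df, :key)
rows_two = gdf[("two",)]   # SubDataFrame with the rows where key == "two"

println(rows_two)
```

Each lookup returns a view of the matching rows, so a key that occurs more than once simply yields a group with several rows.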
One way is,
A = Dict("a" => 1, "b" => 2)
Then,
setindex!(A, 11, "c")
df = DataFrame(A)
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 11

How to extract column_name String and data Vector from a one-column DataFrame in Julia?

I was able to extract the column of a DataFrame that I want using a regular expression, but now I want to extract from that DataFrame column a String with the column name and a Vector with the data. How can I construct f and g below? Alternate approaches also welcome.
julia> df = DataFrame("x (in)" => 1:3, "y (°C)" => 4:6)
3×2 DataFrame
Row │ x (in) y (°C)
│ Int64 Int64
─────┼────────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> y = df[:, r"y "]
3×1 DataFrame
Row │ y (°C)
│ Int64
─────┼────────
1 │ 4
2 │ 5
3 │ 6
julia> y_units = f(y)
"°C"
julia> y_data = g(y)
3-element Vector{Int64}:
4
5
6
f(df) = only(names(df))
g(df) = only(eachcol(df)) # or df[!, 1] if you do not need to check that this is the only column
(only is used to check that the data frame actually has only one column)
An alternate approach to get the column name without creating an intermediate data frame is just writing:
julia> names(df, r"y ")
1-element Vector{String}:
"y (°C)"
This returns a vector of matching names, so take its first element to get the column name itself.
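Putting the pieces together, here is a sketch that skips the intermediate data frame. Note that f above returns the full column name ("y (°C)"), whereas the question shows only the unit ("°C"); the regular expression used below to pull out the text in parentheses is an assumption about how the column names are formatted.

```julia
using DataFrames

df = DataFrame("x (in)" => 1:3, "y (°C)" => 4:6)

# only throws if the regex matches zero or several columns.
col_name = only(names(df, r"y "))
y_data = df[!, col_name]          # the stored column itself, not a copy

# Assumed naming convention: the unit is the text between parentheses.
m = match(r"\((.*)\)", col_name)
y_units = m === nothing ? missing : String(m.captures[1])

println(col_name, " ", y_units, " ", y_data)
```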

Julia: Apply function to every cell within a DataFrame (without losing column names)

I am diving into Julia, hence my "novice" question.
Coming from R and Python, I am used to applying simple functions (arithmetic or otherwise) to entire data.frames and pandas.DataFrames, respectively.
#both R and Python
df - 1 # returns all values -1, given all values are numeric
df == "someString" # returns a boolean df
A bit more complex:
#python
df = df.applymap(lambda v: v - 1 if v > 1 else v)
#R
df[] <- lapply(df, function(x) ifelse(x>1,x-1,x))
The thing is, I don't know how to do this in Julia, and I can't easily find analogous solutions on the web. Stack Overflow usually shows up when searching with Google, so here is my question: how do I do this in Julia?
Thanks for your help!
PS:
So far I have come up with the following solution, where I lose my column names.
DataFrame(colwise(x -> x .-1, df))
# seems like too much code for only subtracting 1, and it loses the column names
Please update your DataFrames.jl installation to version 1.4.2.
You can do all you want using broadcasting like this:
julia> df = DataFrame(rand(2,3), :auto)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.720264 0.759493 0.998702
2 │ 0.726994 0.560153 0.243982
julia> df .+ 1
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.72026 1.75949 1.9987
2 │ 1.72699 1.56015 1.24398
julia> df .< 0.5
2×3 DataFrame
Row │ x1 x2 x3
│ Bool Bool Bool
─────┼─────────────────────
1 │ false false false
2 │ false false true
julia> df2 = string.(df)
2×3 DataFrame
Row │ x1 x2 x3
│ String String String
─────┼────────────────────────────────────────────────────────────
1 │ 0.7202642575401104 0.7594928463144177 0.9987024771396766
2 │ 0.7269944483236035 0.5601527006649413 0.2439815742224939
julia> parse.(Float64, df2)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.720264 0.759493 0.998702
2 │ 0.726994 0.560153 0.243982
Is this what you wanted?
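The conditional example from the question (subtract 1 only where the value is greater than 1) can be handled the same way, by broadcasting an anonymous function over the data frame just like string.(df) above. This is a sketch with made-up values; the variable names are illustrative.

```julia
using DataFrames

df = DataFrame(a = [0.5, 2.0], b = [3.0, 1.0])

# Apply the function to every cell; column names are preserved.
df_adj = (v -> v > 1 ? v - 1 : v).(df)

# Elementwise comparison returns a data frame of Bool values.
df_mask = df .> 1

println(df_adj)
println(df_mask)
```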