I have a dataframe like this:
datetime sensor1 sensor2
String Int64 Int64
1 2021-09-28 13:36:04 626 570
2 2021-09-28 13:36:04 622 571
3 2021-09-28 13:36:05 620 574
4 2021-09-28 13:36:06 619 578
I would like to get the correlation coefficient between the columns sensor1 and sensor2 of the dataframe above.
For example, in Python I can do it like this:
cor = np.corrcoef(data.sensor1[0:] , data.sensor2[0:])[0,1]
How can I get the correlation coefficient in Julia?
Use cor from the Statistics standard library:
julia> using Statistics, DataFrames
julia> df = DataFrame(sensor1 = [626, 622, 620, 619], sensor2 = [570, 571, 574, 578])
4×2 DataFrame
Row │ sensor1 sensor2
│ Int64 Int64
─────┼──────────────────
1 │ 626 570
2 │ 622 571
3 │ 620 574
4 │ 619 578
julia> cor(Matrix(df))
2×2 Matrix{Float64}:
1.0 -0.861357
-0.861357 1.0
Here passing Matrix(df) means you'll get back a correlation matrix with the correlations between all columns.
More specifically for just two columns, which I guess is in line with your Python example:
julia> cor(df.sensor1, df.sensor2)
-0.861356769214109
EDIT: Actually I see you are doing [0, 1] indexing in Python, so you're probably getting back a 2x2 matrix there as well - arrays in Julia are 1-based so the equivalent would be cor(Matrix(df))[1, 2]. If you only want one number though there's no point computing all cross-correlations.
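For anyone comparing against the Python side, here is a minimal sketch of what a single pairwise correlation computes, written in plain Python on the four rows from the question. The helper name pearson is made up for illustration; it is not part of NumPy or Julia's Statistics.

```python
import math

sensor1 = [626, 622, 620, 619]
sensor2 = [570, 571, 574, 578]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson(sensor1, sensor2)
print(round(r, 6))  # -0.861357, matching cor(df.sensor1, df.sensor2) above
```

This is the same number as the off-diagonal entry of the 2x2 correlation matrix, which is why the single-pair call is enough when you only need one value.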
Is there a way to double quote all fields when outputting a DataFrame to a CSV in Julia? I am having trouble finding an answer with Google.
In Python I would add quoting=csv.QUOTE_ALL to df.to_csv(file)
I am having trouble finding something similar for CSV.write(file, df)
You can do the following:
julia> using CSV, DataFrames
julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf,ptr=1, mark=-1)
julia> df = DataFrame(rand(1:10, 3, 5), :auto)
3×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 6 10 5 4 4
2 │ 1 9 6 5 3
3 │ 5 4 5 8 4
julia> CSV.write(io, df; quotestrings=true, transform=(col,val)->string(val)) |> take! |> String |> println
"x1","x2","x3","x4","x5"
"6","10","5","4","4"
"1","9","6","5","3"
"5","4","5","8","4"
The catch is that quotestrings=true only forces quoting of string values (so that numbers stay unquoted and are parsed correctly when you read the file back). You therefore also need the transform argument to convert every value to a string before it is written.
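For comparison, the Python behaviour the question refers to can be sketched with the standard csv module. The sample rows below are made up; the point is only that QUOTE_ALL quotes numbers as well as strings, which is what the transform trick emulates in CSV.jl.

```python
import csv
import io

# Made-up sample rows: a header followed by integer data.
rows = [["x1", "x2", "x3"], [6, 10, 5], [1, 9, 6]]

buffer = io.StringIO()
# QUOTE_ALL wraps every field in double quotes, regardless of its type.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(rows)

text = buffer.getvalue()
print(text)
```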
I am looking for a solution to change column's headers to lowercase.
Let's say, I have this dataframe:
df = DataFrame(TIME = ["2021-10-21","2021-10-22","2021-10-23"],
MQ2= [-1.1, -2, 1],
MQ3=[-1, -1, 3.1],
MQ8= [-1, -4.2, 2],
)
>>>df
TIME MQ2 MQ3 MQ8
String Float64 Float64 Float64
1 2021-10-21 -1.1 -1.0 -1.0
2 2021-10-22 -2.0 -1.0 -4.2
3 2021-10-23 1.0 3.1 2.0
I want to change all of my column headers to lowercase, e.g. MQ2 to mq2.
Maybe something like df.columns.str.lower() in Python.
That way I would get this dataframe:
time mq2 mq3 mq8
String Float64 Float64 Float64
1 2021-10-21 -1.1 -1.0 -1.0
2 2021-10-22 -2.0 -1.0 -4.2
3 2021-10-23 1.0 3.1 2.0
I would probably do the following:
julia> using DataFrames
julia> df = DataFrame(TIME = rand(5), MQ2 = rand(5), MQ3 = rand(5), MQ8 = rand(5));
julia> rename!(df, lowercase.(names(df)))
5×4 DataFrame
Row │ time mq2 mq3 mq8
│ Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────
1 │ 0.0796718 0.997022 0.0838867 0.63886
2 │ 0.923035 0.904928 0.993185 0.36081
3 │ 0.392671 0.0577061 0.518647 0.81432
4 │ 0.0377552 0.506528 0.190017 0.488105
5 │ 0.828534 0.731297 0.383561 0.604786
Here I'm using the DataFrames rename function in its mutating version (hence the bang in rename!), with a vector of new column names as the second argument. The new vector is created by getting the current names using names(df), and then broadcasting the lowercase function across each element in that vector.
Note that rename! also accepts pairs of old/new names if you only want to rename specific columns, e.g. rename!(df, "TIME" => "time")
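For reference, the pandas idiom mentioned in the question looks roughly like this (a small sketch with made-up data, assuming pandas is installed). It has the same shape as the Julia version: take the current names, lowercase each one, and assign them back.

```python
import pandas as pd

# Made-up frame with upper-case headers, as in the question.
df = pd.DataFrame({
    "TIME": ["2021-10-21", "2021-10-22"],
    "MQ2": [-1.1, -2.0],
    "MQ3": [-1.0, -1.0],
})

# .str.lower() lowercases each label; assigning back renames the columns.
df.columns = df.columns.str.lower()
print(list(df.columns))  # ['time', 'mq2', 'mq3']
```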
I am looking for a function like Python's .set_index() for Julia dataframes.
I searched and found that NamedArray can give a result similar to .set_index() in Python, as below:
n = NamedArray(rand(2,4))
setnames!(n, ["one", "two"], 1)
n["one", 2:3]
n["two", :] = 11:14
n[Not("two"), :] = 4:7
Out[10]
2×4 Named Matrix{Float64}
A ╲ B │ 1 2 3 4
──────┼───────────────────────
one │ 4.0 5.0 6.0 7.0
two │ 11.0 12.0 13.0 14.0
However, NamedArray returns a matrix, and I could not find an equivalent function for Julia dataframes. Is there any function like .set_index()?
This is what I expect:
>>> df
1 2 3 4
value Int64 Float64 Float64 Float64
one 84 64 42 77
two 24 90 8 33
There is no function similar to set_index in DataFrames.jl. The recommended approach is to store this data as a column of the data frame. You can then, for example, groupby the data frame by this column to get quick lookup.
If you provide more information about what you need the row index for, I can comment on how this can be done in DataFrames.jl.
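To make the comparison concrete, here is a small pandas sketch (made-up data, assuming pandas is installed) of the label lookup that set_index gives you, next to the same lookup done with the key kept as an ordinary column, which is the layout recommended above for DataFrames.jl.

```python
import pandas as pd

# Made-up frame: "name" plays the role of the row label.
df = pd.DataFrame({"name": ["one", "two"], "value": [84, 24]})

# pandas style: promote the column to the index, then look up by label.
indexed = df.set_index("name")
by_label = indexed.loc["two", "value"]

# Column style: keep the key as a regular column and filter on it.
by_column = df.loc[df["name"] == "two", "value"].iloc[0]

print(by_label, by_column)  # 24 24
```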
One way is,
A = Dict("a" => 1, "b" => 2)
Then,
setindex!(A, 11, "c")
df = DataFrame(A)
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 11
I'm pretty new to Julia and I consider myself a beginner in programming in general. I have coded a bit in MATLAB and Python.
I have a bunch of CSVs and I want to combine them to do data analysis. My data look like this:
using DataFrames
using Plots
using CSV
using Glob
using Pipe
file_list = glob("*.csv") #list of all csvs in dir
df = @pipe file_list[1] |> CSV.File(_, header = 2) |> DataFrame # Read file
# I could have used df = CSV.File(file_list[1], header = 2) |> DataFrame but
# I wanted to try piping multiple operations, and it didn't work
This results in: https://i.stack.imgur.com/nZTFy.png
The thing is:
I want to combine the first 5 columns, as they define the time as yyyy-mm-dd-hh-mm-ss.
Ideally, I would also add a column with the name of the file so that everything can be merged into a single dataframe.
As I said, I'm pretty new to Julia and programming in general. Any help is appreciated.
Thank you.
To pipe every item in a list, use .|>
julia> [1,2,3] .|> sqrt
3-element Array{Float64,1}:
1.0
1.4142135623730951
1.7320508075688772
You can add columns like this:
julia> using DataFrames, Dates
julia> df = DataFrame("yr"=>2000, "m"=>1:2, "d"=>[30,1], "h"=>12:13, "min"=>30:31, "sec"=>58:59)
2×6 DataFrame
Row │ yr m d h min sec
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────
1 │ 2000 1 30 12 30 58
2 │ 2000 2 1 13 31 59
julia> df[!,"datetime"] = DateTime.(df[!,"yr"], df[!,"m"], df[!,"d"], df[!,"h"], df[!,"min"], df[!,"sec"])
2-element Array{DateTime,1}:
2000-01-30T12:30:58
2000-02-01T13:31:59
julia> df[!,"file"] .= "file.csv"
2-element Array{String,1}:
"file.csv"
"file.csv"
julia> df
2×8 DataFrame
Row │ yr m d h min sec datetime file
│ Int64 Int64 Int64 Int64 Int64 Int64 DateTime String
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 2000 1 30 12 30 58 2000-01-30T12:30:58 file.csv
2 │ 2000 2 1 13 31 59 2000-02-01T13:31:59 file.csv
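Since the question mentions a Python background, the same two steps can be sketched in pandas for comparison (assuming pandas is installed; the column names are copied from the example above). pd.to_datetime can assemble timestamps from a frame whose columns are named year, month, day, hour, minute and second, so the sketch renames a copy first.

```python
import pandas as pd

df = pd.DataFrame({
    "yr": [2000, 2000], "m": [1, 2], "d": [30, 1],
    "h": [12, 13], "min": [30, 31], "sec": [58, 59],
})

# to_datetime expects these exact component names, so rename a copy.
parts = df.rename(columns={
    "yr": "year", "m": "month", "d": "day",
    "h": "hour", "min": "minute", "sec": "second",
})
df["datetime"] = pd.to_datetime(
    parts[["year", "month", "day", "hour", "minute", "second"]]
)

# A scalar is broadcast to every row, like df[!, "file"] .= "file.csv".
df["file"] = "file.csv"
print(df)
```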
This seems rather obvious, but I can't seem to figure out how to convert the index of a data frame to a column.
For example:
df=
gi ptt_loc
0 384444683 593
1 384444684 594
2 384444686 596
To,
df=
index1 gi ptt_loc
0 0 384444683 593
1 1 384444684 594
2 2 384444686 596
Either:
df['index1'] = df.index
or, .reset_index:
df = df.reset_index(level=0)
So, if you have a multi-index frame with 3 levels of index, like:
>>> df
val
tick tag obs
2016-02-26 C 2 0.0139
2016-02-27 A 2 0.5577
2016-02-28 C 6 0.0303
and you want to convert the 1st (tick) and 3rd (obs) levels in the index into columns, you would do:
>>> df.reset_index(level=['tick', 'obs'])
tick obs val
tag
C 2016-02-26 2 0.0139
A 2016-02-27 2 0.5577
C 2016-02-28 6 0.0303
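The frame above is only shown as printed output, so here is a self-contained sketch that rebuilds it (the construction via MultiIndex.from_tuples is my assumption about how such a frame could be made) and applies the same call:

```python
import pandas as pd

# Rebuild the three-level index (tick, tag, obs) from the printed example.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2016-02-26"), "C", 2),
        (pd.Timestamp("2016-02-27"), "A", 2),
        (pd.Timestamp("2016-02-28"), "C", 6),
    ],
    names=["tick", "tag", "obs"],
)
df = pd.DataFrame({"val": [0.0139, 0.5577, 0.0303]}, index=index)

# Move only 'tick' and 'obs' into columns; 'tag' stays as the index.
out = df.reset_index(level=["tick", "obs"])
print(out)
```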
rename_axis + reset_index
You can first rename your index to a desired label, then elevate to a series:
df = df.rename_axis('index1').reset_index()
print(df)
index1 gi ptt_loc
0 0 384444683 593
1 1 384444684 594
2 2 384444686 596
This works also for MultiIndex dataframes:
print(df)
# val
# tick tag obs
# 2016-02-26 C 2 0.0139
# 2016-02-27 A 2 0.5577
# 2016-02-28 C 6 0.0303
df = df.rename_axis(['index1', 'index2', 'index3']).reset_index()
print(df)
index1 index2 index3 val
0 2016-02-26 C 2 0.0139
1 2016-02-27 A 2 0.5577
2 2016-02-28 C 6 0.0303
To provide a bit more clarity, let's look at a DataFrame with two levels in its index (a MultiIndex).
index = pd.MultiIndex.from_product([['TX', 'FL', 'CA'],
['North', 'South']],
names=['State', 'Direction'])
df = pd.DataFrame(index=index,
data=np.random.randint(0, 10, (6,4)),
columns=list('abcd'))
The reset_index method, called with the default parameters, converts all index levels to columns and uses a simple RangeIndex as new index.
df.reset_index()
Use the level parameter to control which index levels are converted into columns. If possible, use the level name, which is more explicit. If there are no level names, you can refer to each level by its integer location, which begins at 0 from the outside. You can use a scalar value here or a list of all the levels you would like to reset.
df.reset_index(level='State') # same as df.reset_index(level=0)
In the rare event that you want to preserve the index and turn the index into a column, you can do the following:
# for a single level
df.assign(State=df.index.get_level_values('State'))
# for all levels
df.assign(**df.index.to_frame())
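Putting the three variants side by side on the State/Direction frame, as a runnable sketch. I use np.arange instead of random integers so the result is reproducible; that substitution is mine.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["TX", "FL", "CA"], ["North", "South"]],
    names=["State", "Direction"],
)
# Deterministic data in place of np.random.randint.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=index, columns=list("abcd"))

all_reset = df.reset_index()                 # both levels become columns
state_reset = df.reset_index(level="State")  # only 'State' becomes a column
kept = df.assign(**df.index.to_frame())      # levels copied, index preserved

print(list(all_reset.columns))
print(list(state_reset.columns), state_reset.index.name)
print(list(kept.columns), list(kept.index.names))
```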
For MultiIndex you can extract its subindex using
df['si_name'] = df.index.get_level_values('si_name')
where si_name is the name of the subindex.
If you want to use the reset_index method and also preserve your existing index you should use:
df.reset_index().set_index('index', drop=False)
or to change it in place:
df.reset_index(inplace=True)
df.set_index('index', drop=False, inplace=True)
For example:
print(df)
gi ptt_loc
0 384444683 593
4 384444684 594
9 384444686 596
print(df.reset_index())
index gi ptt_loc
0 0 384444683 593
1 4 384444684 594
2 9 384444686 596
print(df.reset_index().set_index('index', drop=False))
index gi ptt_loc
index
0 0 384444683 593
4 4 384444684 594
9 9 384444686 596
And if you want to get rid of the index label you can do:
df2 = df.reset_index().set_index('index', drop=False)
df2.index.name = None
print(df2)
index gi ptt_loc
0 0 384444683 593
4 4 384444684 594
9 9 384444686 596
This should do the trick (for a single-level index):
df.reset_index().rename({'index':'index1'}, axis = 'columns')
Note that this returns a new DataFrame, so assign the result back (df = ...) if you want to keep it. Passing inplace=True to rename in this chained form would only modify the temporary frame returned by reset_index() and return None.
df1 = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
p = df1.index.values
df1.insert( 0, column="new",value = p)
df1
new gi ptt
0 0 232 342
1 1 66 56
2 2 34 662
3 3 43 123
Since pandas 1.5.0, reset_index accepts a names argument that takes the name (or a list of names) to give the columns created from the index. Here is a reproducible example with one index column:
import pandas as pd
df = pd.DataFrame({"gi":[232,66,34,43],"ptt":[342,56,662,123]})
gi ptt
0 232 342
1 66 56
2 34 662
3 43 123
df.reset_index(names=['new'])
Output:
new gi ptt
0 0 232 342
1 1 66 56
2 2 34 662
3 3 43 123
This can also easily be applied with MultiIndex. Just create a list of the names you want.
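A short sketch of the MultiIndex case, assuming pandas 1.5.0 or later (older versions do not have the names argument). The frame and the column names state and direction are made up for illustration.

```python
import pandas as pd

# Two unnamed index levels; without names= they would become level_0 and level_1.
index = pd.MultiIndex.from_product([["TX", "FL"], ["North", "South"]])
df = pd.DataFrame({"val": [1, 2, 3, 4]}, index=index)

# One name per index level, in order from the outermost level.
out = df.reset_index(names=["state", "direction"])
print(out)
```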
I usually do it this way:
df = df.assign(index1=df.index)