quote all during df to csv in julia - dataframe

is there a way to double quote all fields when outputting a DataFrame to a csv in Julia? I am having trouble find an answer with Google.
In python I would add quoting=csv.QUOTE_ALL to df.to_csv(file)
I am having trouble finding something similar with CSV.write(file,df)

You can do the following:
julia> using CSV, DataFrames
julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf,ptr=1, mark=-1)
julia> df = DataFrame(rand(1:10, 3, 5), :auto)
3×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 6 10 5 4 4
2 │ 1 9 6 5 3
3 │ 5 4 5 8 4
julia> CSV.write(io, df; quotestrings=true, transform=(col,val)->string(val)) |> take! |> String |> println
"x1","x2","x3","x4","x5"
"6","10","5","4","4"
"1","9","6","5","3"
"5","4","5","8","4"
The trouble is that quotestrings only forces quoting strings (so that when you read back the file numbers are not quoted and correctly parsed) and therefore you need also transform argument to force every value to be written as string.

Related

Extracting Data from .csv File in Julia

I'm quite new to Julia and i have a .csv File, which is stored inside a gzip, where i want to extract some informations from for educational purposes and to get to know the language better.
In Python there are many helpful functions from Panda to help with that, but i can't seem to get the Problem straight...
This is my Code (I KNOW, VERY WEAK!!!) :
{
import Pkg
#Pkg.add("CSV")
#Pkg.add("DataFrames")
#Pkg.add("CSVFiles")
#Pkg.add("CodecZlib")
#Pkg.add("GZip")
using CSVFiles
using Pkg
using CSV
using DataFrames
using CodecZlib
using GZip
df = CSV.read("Path//to//file//file.csv.gzip", DataFrame)
print(df)
}
I added a Screen to show how the Columns inside the .csv File are looking like.
I would like to extract the Dates and make some sort of a Top 10 most commented users, Top 10 days with the most threads etc.
I would like to point out that this is not an Exercise given to me, but a training i would like to do 4 myself.
I know the Panda Version to this is looking like this:
df['threadcreateddate'] = pd.to_datetine(df['thread_created_utc']).dt.date
or
df['commentcreateddate'] = pd.to_datetime(df['comment_created_utc']).dt.date
And to sort it:
pf_number_of_threads = df.groupby('threadcreateddate')["thread_id'].nunique()
If i were to plot it:
df_number_of_threads.plot(kind='line')
plt.show()
To print:
head = df.head()
print(df_number_of_threads.sort_values(ascending=False).head(10))
Can someone help? The df.select() function didn't work for me.
1. Packages
We obviously need DataFrames.jl. And since we're dealing with dates in the data, and doing a plot later, we'll include Dates and Plots as well.
As this example in CSV.jl's documentation shows, no additional packages are needed for gzipped data. CSV.jl can decompress automatically. So, you can remove the other using statements from your list.
julia> using CSV, DataFrames, Dates, Plots
2. Preparing the Data Frame
You can use CSV.read to load the data into the Data Frame, as in the question. Here, I'll use some sample (simplified) data for illustration, with just 4 columns:
julia> df
6×4 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc
│ Int64 String Int64 String
─────┼─────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00
3. Converting from String to DateTime
To extract the thread dates from the string columns we have, we'll use the Dates standard libary.
Depending on the exact format your dates are in, you might have to add a datefmt argument for conversion to Dates data types (see the Constructors section of Dates in the Julia manual). Here in the sample data, the dates are in ISO standard format, so we don't need to specify the date format explicitly.
In Julia, we can get the date directly without intermediate conversion to a date-time type, but since it's a good idea to have the columns be in the proper type anyway, we'll first convert the existing columns from strings to DateTime:
julia> transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(DateTime), renamecols = false)
6×4 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc
│ Int64 DateTime Int64 DateTime
─────┼─────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00
Though it looks similar, this data frame doesn't use Strings for the date-time columns, instead has proper DateTime type values.
(For an explanation of how this transform! works, see the DataFrames manual: Selecting and transforming columns.)
Edit: Based on the screenshot added to the question now, in your case you'd use transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(s -> DateTime(s, dateformat"yyyy-mm-dd HH:MM:SS.s")), renamecols = false).
4. Creating Date columns
Now, creating the date columns is as easy as:
julia> df.threadcreateddate = Date.(df.thread_created_utc);
julia> df.commentcreateddate = Date.(df.comment_created_utc);
julia> df
6×6 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc commentcreateddate threadcreatedate
│ Int64 DateTime Int64 DateTime Date Date
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00 2022-08-13 2022-08-13
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00 2022-08-14 2022-08-13
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00 2022-08-15 2022-08-13
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00 2022-08-16 2022-08-16
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00 2022-08-17 2022-08-16
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00 2022-08-18 2022-08-16
These could also be written as a transform! call, and in fact the transform! call in the previous code segment could have instead been replaced with df.thread_created_utc = DateTime.(df.thread_created_utc) and df.comment_created_utc = DateTime.(df.comment_created_utc). However, transform offers a very powerful and flexible syntax that can do a lot more, so it's useful to familiarize yourself with it if you're going to work on DataFrames.
5. Getting the number of threads per day
julia> gdf = combine(groupby(df, :threadcreateddate), :thread_id => length ∘ unique => :number_of_threads)
2×2 DataFrame
Row │ threadcreateddate number_of_threads
│ Date Int64
─────┼──────────────────────────────────────
1 │ 2022-08-13 1
2 │ 2022-08-16 1
Note that df.groupby('threadcreateddate') becomes groupby(df, :threadcreateddate), which is a common pattern in Python-to-Julia conversions. Julia doesn't use the . based object-oriented syntax, and instead the data frame is one of the arguments to the function.
length ∘ unique uses the function composition operator ∘, and the result is a function that applies unique and then length. Here we take the unique values of thread_id column in each group, apply length to them (so, the equivalent of nunique), and store the result in number_of_threads column in a new GroupedDataFrame called gdf.
6. Plotting
julia> plot(gdf.threadcreateddate, gdf.number_of_threads)
Since our grouped data frame conveniently contains both the date and the number of threads, we can plot the number_of_threads against the dates, making for a nice and informative visualization.
As Sundar R commented it is hard to give you a precise answer for your data as there might be some relevant details. But here is a general pattern you can follow:
julia> using DataFrames
julia> df = DataFrame(id = [1, 1, 2, 2, 2, 3])
6×1 DataFrame
Row │ id
│ Int64
─────┼───────
1 │ 1
2 │ 1
3 │ 2
4 │ 2
5 │ 2
6 │ 3
julia> first(sort(combine(groupby(df, :id), nrow), :nrow, rev=true), 10)
3×2 DataFrame
Row │ id nrow
│ Int64 Int64
─────┼──────────────
1 │ 2 3
2 │ 1 2
3 │ 3 1
What this code does:
groupby groups data by the column you want to aggregate
combine with nrow argument counts the number of rows in each group and stores it in :nrow column (this is the default, you could choose other column name)
sort sorts data frame by :nrow and rev=true makes the order descending
first picks 10 first rows from this data frame
If you want something more similar to dplyr in R with piping you can use #chain that is exported by DataFramesMeta.jl:
julia> using DataFramesMeta
julia> #chain df begin
groupby(:id)
combine(nrow)
sort(:nrow, rev=true)
first(10)
end
3×2 DataFrame
Row │ id nrow
│ Int64 Int64
─────┼──────────────
1 │ 2 3
2 │ 1 2
3 │ 3 1

set_index() on Julia dataframe

I am looking for a function like .set_index() in python at Julia dataframe.
I've searched and find out NamedArray can give similar result with .set_index() in Python as below:
n = NamedArray(rand(2,4))
setnames!(n, ["one", "two"], 1)
n["one", 2:3]
n["two", :] = 11:14
n[Not("two"), :] = 4:7
Out[10]
2×4 Named Matrix{Float64}
A ╲ B │ 1 2 3 4
──────┼───────────────────────
one │ 4.0 5.0 6.0 7.0
two │ 11.0 12.0 13.0 14.0
However, NamedArray returns as matrix format, and I could not find function injulia dataframe. Is there any function like .set_index()?
Like this is what I expect :
>>> df
1 2 3 4
value Int64 Float64 Float64 Float64
one 84 64 42 77
two 24 90 8 33
There is no function similar to set_index in DataFrames.jl. The recommended thing is to add this data as a column of a data frame. Then you can e.g. groupby the data by this column to have a quick lookup.
If you provided more information about what you need the row index for I can comment how this can be done in DataFrames.jl?
One way is,
A = Dict("a" => 1, "b" => 2)
Then,
setindex!(A, 11, "c")
df = DataFrame(A)
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 11

How to extract column_name String and data Vector from a one-column DataFrame in Julia?

I was able to extract the column of a DataFrame that I want using a regular expression, but now I want to extract from that DataFrame column a String with the column name and a Vector with the data. How can I construct f and g below? Alternate approaches also welcome.
julia> df = DataFrame("x (in)" => 1:3, "y (°C)" => 4:6)
3×2 DataFrame
Row │ x (in) y (°C)
│ Int64 Int64
─────┼────────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> y = df[:, r"y "]
3×1 DataFrame
Row │ y (°C)
│ Int64
─────┼────────
1 │ 4
2 │ 5
3 │ 6
julia> y_units = f(y)
"°C"
julia> y_data = g(y)
3-element Vector{Int64}:
4
5
6
f(df) = only(names(df))
g(df) = only(eachcol(df)) # or df[!, 1] if you do not need to check that this is the only column
(only is used to check that the data frame actually has only one column)
An alternate approach to get the column name without creating an intermediate data frame is just writing:
julia> names(df, r"y ")
1-element Vector{String}:
"y (°C)"
to extract out the column name (you need to get the first element of this vector)

Using Julia, how can I read multiple CSV and combine columns

I'm pretty new to Julia and I consider myself as a beginner in programming in general. I coded a bit of MATLAB and Python.
I have a bunch of CSVs and I want to combine them to do data analysis. My data look like this:
using DataFrames
using Plots
using CSV
using Glob
using Pipe
file_list = glob("*.csv") #list of all csvs in dir
df = #pipe file_list[1] |> CSV.File(_,header = 2) |> DataFrame #Read file
# I could have use df = CSV.File(file_list[1], header = 2) |> DataFrame but
# I wanted to try piping multiple operation but it didn't work
[Results of the code snippet][1]
This results in: https://i.stack.imgur.com/nZTFy.png
The thing is
I want to combine the first 5 colums, as they define the time as yyyy-mm-dd-hh-mm-ss
Ideally, I would add a column with the name of the file so all would merge in a single dataframe.
As I said, I'm pretty new to Julia and programming in general. Any help is appreciated.
Thank you.
To pipe every item in a list, use .|>
julia> [1,2,3] .|> sqrt
3-element Array{Float64,1}:
1.0
1.4142135623730951
1.7320508075688772
you can add columns like that :
julia> using DataFrames, Dates
julia> df = DataFrame("yr"=>2000, "m"=>1:2, "d"=>[30,1], "h"=>12:13, "min"=>30:31, "sec"=>58:59)
2×6 DataFrame
Row │ yr m d h min sec
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────
1 │ 2000 1 30 12 30 58
2 │ 2000 2 1 13 31 59
julia> df[!,"datetime"] = DateTime.(df[!,"yr"], df[!,"m"], df[!,"d"], df[!,"h"], df[!,"min"], df[!,"sec"])
2-element Array{DateTime,1}:
2000-01-30T12:30:58
2000-02-01T13:31:59
julia> df[!,"file"] .= "file.csv"
2-element Array{String,1}:
"file.csv"
"file.csv"
julia> df
2×8 DataFrame
Row │ yr m d h min sec datetime file
│ Int64 Int64 Int64 Int64 Int64 Int64 DateTime String
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 2000 1 30 12 30 58 2000-01-30T12:30:58 file.csv
2 │ 2000 2 1 13 31 59 2000-02-01T13:31:59 file.csv

Julia: Apply function to every cell within a DataFrame (without loosing column names)

I am diving into Julia, hence my "novice"-question.
Coming from R and Python, I am used to apply simple functions (arithmetic or otherwise) to entire pandas.DataFrames and data.frames, respectively.
#both R and Python
df - 1 # returns all values -1, given all values are numeric
df == "someString" # returns a boolean df
a bit more complex
#python
df = df.applymap(lambda v: v - 1 if v > 1 else v)
#R
df[] <- lapply(df, function(x) ifelse(x>1,x-1,x))
The thing is, I don't know how to do this in Julia, I don't find analogue solutions easily on the web. And Stackoverflow helps a lot when using Google. So here it is. How do I do it in Julia?
Thanks for your help!
PS:
So far I have come up with the following solutions, where I loos my column names.
DataFrame(colwise(x -> x .-1, df))
# seems like to much code for only subtracting 1 and loosing col names
Please update your DataFrames.jl installation to version 1.4.2.
You can do all you want using broadcasting like this:
julia> df = DataFrame(rand(2,3), :auto)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.720264 0.759493 0.998702
2 │ 0.726994 0.560153 0.243982
julia> df .+ 1
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.72026 1.75949 1.9987
2 │ 1.72699 1.56015 1.24398
julia> df .< 0.5
2×3 DataFrame
Row │ x1 x2 x3
│ Bool Bool Bool
─────┼─────────────────────
1 │ false false false
2 │ false false true
julia> df2 = string.(df)
2×3 DataFrame
Row │ x1 x2 x3
│ String String String
─────┼────────────────────────────────────────────────────────────
1 │ 0.7202642575401104 0.7594928463144177 0.9987024771396766
2 │ 0.7269944483236035 0.5601527006649413 0.2439815742224939
julia> parse.(Float64, df2)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼──────────────────────────────
1 │ 0.720264 0.759493 0.998702
2 │ 0.726994 0.560153 0.243982
Is this what you wanted?