Julia Pandas - how to append dataframes together

Working with Julia 1.0
I have a large number of data frames which I read into Julia using Pandas (read_csv) and I am looking for a way to append them all together into a single big data frame. For some reason the "append" function does not do the trick. A simplified example below:
using Pandas
df = Pandas.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
df2 = Pandas.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])
df[:append](df2) #fails
df.append(df2) #fails
df[:concat](df2) #fails
vcat(df,df2)
The last step runs but produces a 2-element Array in which each element is a DataFrame, not the single stacked DataFrame I want.
Any ideas on how to stack the two dataframes one under the other?

This seems to work
julia> df = Pandas.DataFrame([[1, 2], [3, 4]], columns=[:A, :B])
A B
0 1 2
1 3 4
julia> df2 = Pandas.DataFrame([[5, 6], [7, 8]], columns=[:A, :B])
A B
0 5 6
1 7 8
julia> df.pyo[:append](df2, ignore_index = true )
PyObject A B
0 1 2
1 3 4
2 5 6
3 7 8
Notes:
I don't know if this is a Pandas.jl thing or a Julia 1.0 PyCall thing, but the object seems to need the .pyo field explicitly before calling a method. If you try df[:append], it is interpreted as indexing the :append column. Try df[:col3] = 3 to see what I mean.
There is a native Julia DataFrames package. There is no need to use Pandas unless you have some "I have ready-made code" issue, and even then you are probably just complicating things by using Pandas via a Python layer in Julia.
For reference, here's the equivalent in julia DataFrames:
julia> df = DataFrames.DataFrame( [1:2, 3:4], [:A, :B]);
julia> df2 = DataFrames.DataFrame( [5:6, 7:8], [:A, :B]);
julia> append!(df, df2)
4×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1 │ 1 │ 3 │
│ 2 │ 2 │ 4 │
│ 3 │ 5 │ 7 │
│ 4 │ 6 │ 8 │

Since you said you have a lot of dataframes, you can collect them in a list and concat them all at once. Assuming they all share the same header, the columns carry over from the first frame, so you don't end up with repeated header rows in the data.
dfs = [df, df2]
df3 = Pandas.concat(dfs, ignore_index=true)
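If you switch to the native DataFrames.jl stack suggested in the notes above, the many-files workflow is straightforward. Here is a self-contained sketch; it fabricates two small CSVs in a temporary directory (stand-ins for your real files) so the code runs anywhere:

```julia
# Sketch: read every CSV in a directory and stack them into one DataFrame.
using CSV, DataFrames

dir = mktempdir()                      # stand-in for your data directory
write(joinpath(dir, "a.csv"), "A,B\n1,2\n3,4\n")
write(joinpath(dir, "b.csv"), "A,B\n5,6\n7,8\n")

files = filter(f -> endswith(f, ".csv"), readdir(dir; join=true))
dfs = [CSV.read(f, DataFrame) for f in files]
big = reduce(vcat, dfs)   # stacks rows; requires matching column names
```

reduce(vcat, dfs) is the idiomatic way to stack many frames; plain vcat(df, df2) also works for two.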

Related

Add thousands separator to column in dataframe in julia

I have a dataframe with two columns a and b. At the moment both look like column a, but I want to add separators so that column b looks like below. I have tried using the Format.jl package, but I haven't gotten the result I'm after. Maybe worth mentioning: both columns are Int64, and the column names a and b are of type Symbol.
a | b
150000 | 1500,00
27 | 27,00
16614 | 166,14
Is there some other way to solve this than using Format.jl? Or is Format.jl the way to go?
Assuming you want the commas in their typical positions rather than how you wrote them, this is one way:
julia> using DataFrames, Format
julia> f(x) = format(x, commas=true)
f (generic function with 1 method)
julia> df = DataFrame(a = [1000000, 200000, 30000])
3×1 DataFrame
Row │ a
│ Int64
─────┼─────────
1 │ 1000000
2 │ 200000
3 │ 30000
julia> transform(df, :a => ByRow(f) => :a_string)
3×2 DataFrame
Row │ a a_string
│ Int64 String
─────┼────────────────────
1 │ 1000000 1,000,000
2 │ 200000 200,000
3 │ 30000 30,000
If you instead want the row replaced, use transform(df, :a => ByRow(f), renamecols=false).
If you just want the output vector rather than changing the DataFrame, you can use format.(df.a, commas=true)
You could write your own function f to achieve the same behavior, but you might as well use the one someone already wrote inside the Format.jl package.
However, once you transform your data to Strings as above, you won't be able to filter/sort/analyze the numerical data in the DataFrame. I would suggest applying the formatting at the printing step (rather than modifying the DataFrame itself to contain strings) by using the PrettyTables package, which can format the entire DataFrame at once.
julia> using DataFrames, PrettyTables
julia> df = DataFrame(a = [1000000, 200000, 30000], b = [500, 6000, 70000])
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼────────────────
1 │ 1000000 500
2 │ 200000 6000
3 │ 30000 70000
julia> pretty_table(df, formatters = ft_printf("%'d"))
┌───────────┬────────┐
│ a │ b │
│ Int64 │ Int64 │
├───────────┼────────┤
│ 1,000,000 │ 500 │
│ 200,000 │ 6,000 │
│ 30,000 │ 70,000 │
└───────────┴────────┘
(Edited to reflect the updated specs in the question)
julia> df = DataFrame(a = [150000, 27, 16614]);
julia> function insertdecimalcomma(n)
if n < 100
return string(n) * ",00"
else
return replace(string(n), r"(..)$" => s",\1")
end
end
insertdecimalcomma (generic function with 1 method)
julia> df.b = insertdecimalcomma.(df.a)
julia> df
3×2 DataFrame
Row │ a b
│ Int64 String
─────┼─────────────────
1 │ 150000 1500,00
2 │ 27 27,00
3 │ 16614 166,14
Note that the b column will necessarily be a String after this change, as integer types cannot store formatting information in them.
If you have a lot of data and find that you need better performance, you may also want to use the InlineStrings package:
julia> # same as before up to the function definition
julia> using InlineStrings
julia> df.b = inlinestrings(insertdecimalcomma.(df.a))
3-element Vector{String7}:
"1500,00"
"27,00"
"166,14"
This stores the b column's data as fixed-size strings (String7 type here), which are generally treated like normal Strings, but can be significantly better for performance.

Filter DataFrame by rows which have no "missing" value

I have a DataFrame that may contain missing values and I want to filter out all the rows that contain at least one missing value, so from this
DataFrame(a = [1, 2, 3, 4], b = [5, missing, 7, 8], c = [9, 10, missing, 12])
4×3 DataFrame
Row │ a b c
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 5 9
2 │ 2 missing 10
3 │ 3 7 missing
4 │ 4 8 12
I want something like
Row │ a b c
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 5 9
4 │ 4 8 12
Ideally, there would be a filter function where I can pass each row into a lambda and then do a combo of collect and findfirst and whatnot, but I can't figure out how to pass lambdas to subset or @subset (from DataFramesMeta), because I don't have only three columns, I have over 200.
Following what @Antonello said, you can do it with dropmissing. You have three options:
dropmissing: create a new data frame with the rows containing missing values dropped;
dropmissing with view=true: create a view of the source data frame with those rows dropped;
dropmissing!: drop the rows with missing values in-place.
By default all columns are considered, but you can change this by passing a column selector specifying which columns to include in the check.
Finally, by default the columns of the result change their eltype to disallow missing values; pass disallowmissing=false if you want them to keep allowing missing.
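The dropmissing variants described above can be sketched as follows, using the example data frame from the question:

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4],
               b = [5, missing, 7, 8],
               c = [9, 10, missing, 12])

df2 = dropmissing(df)                          # new DataFrame, rows 1 and 4 only
v   = dropmissing(df, view=true)               # SubDataFrame view into df
df3 = dropmissing(df, [:b])                    # only check column :b for missing
df4 = dropmissing(df, disallowmissing=false)   # rows dropped, eltypes still allow missing
# dropmissing!(df) would do the same as dropmissing(df), but in-place
```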
Here is how you could perform filtering using subset and ismissing instead:
julia> subset(df, All() .=> ByRow(!ismissing))
2×3 DataFrame
Row │ a b c
│ Int64 Int64? Int64?
─────┼───────────────────────
1 │ 1 5 9
2 │ 4 8 12
(I am using the standard subset from DataFrames.jl)
or if you have a very wide data frame (like thousands of columns):
subset(df, AsTable(All()) => ByRow((x -> all(!ismissing, x))∘collect))
(this is a special syntax optimized for fast row-wise aggregation of wide tables)
OK, this seems to work, but I'm leaving this open for more suggestions.
DataFrame(filter(r -> !any(ismissing, r), eachrow(data[:, before_qs])))

set_index() on Julia dataframe

I am looking for a function like Python's .set_index() for a Julia DataFrame.
I've searched and found that NamedArray can give a result similar to .set_index() in Python, as below:
n = NamedArray(rand(2,4))
setnames!(n, ["one", "two"], 1)
n["one", 2:3]
n["two", :] = 11:14
n[Not("two"), :] = 4:7
Out[10]
2×4 Named Matrix{Float64}
A ╲ B │ 1 2 3 4
──────┼───────────────────────
one │ 4.0 5.0 6.0 7.0
two │ 11.0 12.0 13.0 14.0
However, NamedArray returns a matrix, and I could not find such a function for the Julia DataFrame. Is there any function like .set_index()?
This is what I expect:
>>> df
1 2 3 4
value Int64 Float64 Float64 Float64
one 84 64 42 77
two 24 90 8 33
There is no function similar to set_index in DataFrames.jl. The recommended approach is to add this data as a column of the data frame. Then you can e.g. groupby the data by this column to get a quick lookup.
If you provide more information about what you need the row index for, I can comment on how it can be done in DataFrames.jl.
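A minimal sketch of the groupby-based lookup suggested above, with made-up data matching the question's example (keep the "index" as an ordinary column and index the GroupedDataFrame by key):

```julia
using DataFrames

df = DataFrame(value = ["one", "two"], x = [84, 24], y = [64, 90])

gdf = groupby(df, :value)   # build the lookup once
row = gdf[("one",)]         # fetch the sub-frame keyed by "one"
```

Indexing a GroupedDataFrame by a key tuple like this gives fast repeated lookups, which is the main thing a pandas row index buys you.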
One way is,
A = Dict("a" => 1, "b" => 2)
Then,
setindex!(A, 11, "c")
df = DataFrame(A)
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 11

How to extract column_name String and data Vector from a one-column DataFrame in Julia?

I was able to extract the column of a DataFrame that I want using a regular expression, but now I want to extract from that DataFrame column a String with the column name and a Vector with the data. How can I construct f and g below? Alternate approaches also welcome.
julia> df = DataFrame("x (in)" => 1:3, "y (°C)" => 4:6)
3×2 DataFrame
Row │ x (in) y (°C)
│ Int64 Int64
─────┼────────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> y = df[:, r"y "]
3×1 DataFrame
Row │ y (°C)
│ Int64
─────┼────────
1 │ 4
2 │ 5
3 │ 6
julia> y_units = f(y)
"°C"
julia> y_data = g(y)
3-element Vector{Int64}:
4
5
6
f(df) = only(names(df))
g(df) = only(eachcol(df)) # or df[!, 1] if you do not need to check that this is the only column
(only is used to check that the data frame actually has only one column)
An alternate approach to get the column name without creating an intermediate data frame is just writing:
julia> names(df, r"y ")
1-element Vector{String}:
"y (°C)"
to extract the column name (you then need to take the first element of this vector)

julia create an empty dataframe and append rows to it

I am trying out the Julia DataFrames module. I am interested in it so I can use it to plot simple simulations in Gadfly. I want to be able to iteratively add rows to the dataframe and I want to initialize it as empty.
The tutorials/documentation on how to do this is sparse (most documentation describes how to analyse imported data).
To append to a nonempty dataframe is straightforward:
df = DataFrame(A = [1, 2], B = [4, 5])
push!(df, [3, 6])
This returns:
3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 1 | 4 |
| 2 | 2 | 5 |
| 3 | 3 | 6 |
But for an empty init I get errors.
df = DataFrame(A = [], B = [])
push!(df, [3, 6])
Error message:
ArgumentError("Error adding 3 to column :A. Possible type mis-match.")
while loading In[220], in expression starting on line 2
What is the best way to initialize an empty Julia DataFrame such that you can iteratively add items to it later in a for loop?
A zero-length array defined using only [] lacks element type information.
julia> typeof([])
Array{Any,1}
So to avoid that problem, simply indicate the element type.
julia> typeof(Int64[])
Array{Int64,1}
And you can apply that to your DataFrame problem
julia> df = DataFrame(A = Int64[], B = Int64[])
0x2 DataFrame
julia> push!(df, [3, 6])
julia> df
1x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1 | 3 | 6 |
using CSV, DataFrames
iris = CSV.read(joinpath(dirname(pathof(DataFrames)), "..", "test", "data", "iris.csv"), DataFrame)
new_iris = similar(iris, nrow(iris))
first(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1 │ missing │ missing │ missing │ missing │ missing │
# │ 2 │ missing │ missing │ missing │ missing │ missing │
for (i, row) in enumerate(eachrow(iris))
new_iris[i, :] = row[:]
end
first(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
# │ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
The answer from @waTeim already answers the initial question. But what if I want to dynamically create an empty DataFrame and append rows to it? E.g. what if I don't want hard-coded column names?
In this case, df = DataFrame(A = Int64[], B = Int64[]) is not sufficient.
The NamedTuple (A = Int64[], B = Int64[]) needs to be created dynamically.
Let's assume we have a vector of column names col_names and a vector of column types col_types from which to create an empty DataFrame.
col_names = [:A, :B] # needs to be a vector of Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A=Int64[], ....) by doing
named_tuple = (; zip(col_names, type[] for type in col_types )...)
df = DataFrame(named_tuple) # 0×2 DataFrame
Alternatively, the NamedTuple could be created with
# or by doing
named_tuple = NamedTuple{Tuple(col_names)}(type[] for type in col_types )
I think that, at least in recent versions of Julia, you can achieve this by creating pair objects without specifying a type:
df = DataFrame("A" => [], "B" => [])
push!(df, [5,'f'])
1×2 DataFrame
Row │ A B
│ Any Any
─────┼──────────
1 │ 5 f
As seen in this post by @Bogumił Kamiński, where multiple columns are needed, something like this can be done:
entries = ["A", "B", "C", "D"]
df = DataFrame([name => [] for name in entries])
julia> push!(df,[4,5,'r','p'])
1×4 DataFrame
Row │ A B C D
│ Any Any Any Any
─────┼────────────────────
1 │ 4 5 r p
Or, as pointed out by @Antonello below, if you know the type you can do
df = DataFrame([name => Int[] for name in entries])
which is also in @Bogumił Kamiński's original post.
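Putting the pieces together, here is a minimal sketch of the original use case: initialize an empty DataFrame with typed columns, then append rows inside a loop (column names and values are made up for illustration):

```julia
using DataFrames

# Typed empty columns, so push! knows what to accept.
df = DataFrame(step = Int[], value = Float64[])

for i in 1:3
    # NamedTuples push cleanly by column name; tuples and vectors also work.
    push!(df, (step = i, value = i / 2))
end
```

After the loop, df has 3 rows, and the columns keep their concrete Int/Float64 eltypes rather than falling back to Any.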