Julia readtable from a stream instead of from a file - dataframe

Is there a way to read a table from a network URL or from a piped external command? It seems that DataFrames' readtable only supports reading from a file.
For example in R we could do:
df = read.table(url("http://example.com/data.txt"))
x = read.table(pipe("zcat data.txt | sed /^#/d | cut -f '11-13'"), colClasses=c("integer","integer","integer"), fill=TRUE, row.names=NULL)

using DataFrames, Requests
julia> resp = get("https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv?accessType=DOWNLOAD")
Response(200 OK, 17 headers, 27350 bytes in body)
julia> tbl = readtable(IOBuffer(resp.data));
julia> names(tbl)
46-element Array{Symbol,1}:
:JURISDICTION_NAME
:COUNT_PARTICIPANTS
:COUNT_FEMALE
:PERCENT_FEMALE
:COUNT_MALE
:PERCENT_MALE
:COUNT_GENDER_UNKNOWN
:PERCENT_GENDER_UNKNOWN
:COUNT_GENDER_TOTAL
:PERCENT_GENDER_TOTAL
:COUNT_PACIFIC_ISLANDER
:PERCENT_PACIFIC_ISLANDER
:COUNT_HISPANIC_LATINO
:PERCENT_HISPANIC_LATINO
:COUNT_AMERICAN_INDIAN
:PERCENT_AMERICAN_INDIAN
:COUNT_ASIAN_NON_HISPANIC
⋮
:PERCENT_PERMANENT_RESIDENT_ALIEN
:COUNT_US_CITIZEN
:PERCENT_US_CITIZEN
:COUNT_OTHER_CITIZEN_STATUS
:PERCENT_OTHER_CITIZEN_STATUS
:COUNT_CITIZEN_STATUS_UNKNOWN
:PERCENT_CITIZEN_STATUS_UNKNOWN
:COUNT_CITIZEN_STATUS_TOTAL
:PERCENT_CITIZEN_STATUS_TOTAL
:COUNT_RECEIVES_PUBLIC_ASSISTANCE
:PERCENT_RECEIVES_PUBLIC_ASSISTANCE
:COUNT_NRECEIVES_PUBLIC_ASSISTANCE
:PERCENT_NRECEIVES_PUBLIC_ASSISTANCE
:COUNT_PUBLIC_ASSISTANCE_UNKNOWN
:PERCENT_PUBLIC_ASSISTANCE_UNKNOWN
:COUNT_PUBLIC_ASSISTANCE_TOTAL
:PERCENT_PUBLIC_ASSISTANCE_TOTAL
julia> eltypes(tbl)
46-element Array{Type,1}:
Int64
Int64
Int64
Float64
Int64
Float64
Int64
Int64
Int64
Int64
Int64
Float64
Int64
Float64
Int64
Float64
Int64
⋮
Float64
Int64
Float64
Int64
Float64
Int64
Int64
Int64
Int64
Int64
Float64
Int64
Float64
Int64
Int64
Int64
Int64

With the deprecation of Requests in favor of HTTP, here is an example of how to use HTTP.request and then work with the body of the resulting response.
julia> using CSV, HTTP
julia> res = HTTP.request("GET", "http://users.csc.calpoly.edu/~dekhtyar/365-Winter2015/data/CARS/cars-data.csv")
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Wed, 16 May 2018 12:46:39 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Mon, 05 Jan 2015 23:29:09 GMT
ETag: "330f-50bf00ea05b40"
Accept-Ranges: bytes
Content-Length: 13071
Content-Type: text/csv
Id,MPG,Cylinders,Edispl,Horsepower,Weight,Accelerate,Year
1,18,8,307,130,3504,12,1970
2,15,8,350,165,3693,11.5,1970
3,18,8,318,150,3436,11,1970
⋮
13071-byte body
"""
julia> res_buffer = IOBuffer(res.body)
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=13071, maxsize=Inf, ptr=1, mark=-1)
julia> using DataFrames, DataStreams
julia> df = CSV.read(res_buffer)
406×8 DataFrames.DataFrame
│ Row │ Id │ MPG │ Cylinders │ Edispl │ Horsepower │ Weight │ Accelerate │ Year │
├─────┼─────┼─────┼───────────┼────────┼────────────┼────────┼────────────┼──────┤
│ 1 │ 1 │ 18 │ 8 │ 307.0 │ 130 │ 3504 │ 12.0 │ 1970 │
│ 2 │ 2 │ 15 │ 8 │ 350.0 │ 165 │ 3693 │ 11.5 │ 1970 │
│ 3 │ 3 │ 18 │ 8 │ 318.0 │ 150 │ 3436 │ 11.0 │ 1970 │
⋮
│ 405 │ 405 │ 28 │ 4 │ 120.0 │ 79 │ 2625 │ 18.6 │ 1982 │
│ 406 │ 406 │ 31 │ 4 │ 119.0 │ 82 │ 2720 │ 19.4 │ 1982 │
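For the pipe part of the original question, reading from an external command also works, because a Cmd can be opened as an IO stream. Here is a rough, untested sketch (the gzipped file name data.txt.gz is a placeholder, and a recent CSV.jl that accepts a sink argument is assumed):
using CSV, DataFrames
# pipeline chains the external commands like a shell pipe
cmd = pipeline(`zcat data.txt.gz`, `sed '/^#/d'`, `cut -f 11-13`)
# open on a Cmd yields an IO stream of the command's stdout
df = open(cmd) do io
    CSV.read(io, DataFrame)
end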

Related

transform function on all columns of dataframe

I have a dataframe df and I am trying to apply a function to each of the cells. According to the documentation I should use the transform function.
The function should be applied to each column, so I use [:] as a selector for all columns:
transform(
df, [:] .=> ByRow(x -> (if (x > 1) x else zero(Float64) end)) .=> [:]
)
but it yields an exception
ArgumentError: Unrecognized column selector: Colon() => (DataFrames.ByRow{Main.workspace293.var"#1#2"}(Main.workspace293.var"#1#2"()) => Colon())
although when I am using a single column, it works fine
transform(
df, [:K0] .=> ByRow(x -> (if (x > 1) x else zero(Float64) end)) .=> [:K0]
)
The error occurs because [:] is literally a one-element vector containing Colon(), and this version of transform does not recognize Colon() as a column selector inside a pair; you need to pass actual column names, e.g. names(df). The simplest way to do it, though, is to use broadcasting:
julia> df = DataFrame(2*rand(4,3), [:x1, :x2, :x3])
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼──────────┼──────────┤
│ 1 │ 0.945879 │ 1.59742 │ 0.882428 │
│ 2 │ 0.0963367 │ 0.400404 │ 0.599865 │
│ 3 │ 1.23356 │ 0.807691 │ 0.547917 │
│ 4 │ 0.756098 │ 0.595673 │ 0.29678 │
julia> @. ifelse(df > 1, df, 0.0)
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 1.59742 │ 0.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 1.23356 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │
You can also use transform if you prefer:
julia> transform(df, names(df) .=> ByRow(x -> ifelse(x>1, x, 0.0)) .=> names(df))
4×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 1.59742 │ 0.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 1.23356 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │
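Another option, for reference, is mapcols, which applies a function to each column and returns a new DataFrame (a small sketch using the same df as above):
mapcols(col -> ifelse.(col .> 1, col, 0.0), df)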
Also, looking at the linked pandas solution, DataFrames.jl seems faster in this case:
julia> df = DataFrame(2*rand(2,3), [:x1, :x2, :x3])
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼────────────────────────────
1 │ 1.48781 1.20332 1.08071
2 │ 1.55462 1.66393 0.363993
julia> using BenchmarkTools
julia> @btime @. ifelse($df > 1, $df, 0.0)
6.252 μs (58 allocations: 3.89 KiB)
2×3 DataFrame
Row │ x1 x2 x3
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.48781 1.20332 1.08071
2 │ 1.55462 1.66393 0.0
(in pandas, for a 2×3 data frame, it ranged from 163 µs to 2.26 ms)

Query.jl - create a new column and use it immediately

I have a DataFrame and I want to compute a bunch of group-level summary statistics. Some of those statistics are derived from other statistics I want to compute first.
df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])
df2 = df |>
@groupby(_.a) |>
@map({a = key(_),
bm = mean(_.b),
cs = sum(_.b),
d = _.bm + _.cs}) |>
DataFrame
ERROR: type NamedTuple has no field bm
The closest I can get is this, which works, but gets very repetitive as the number of initial statistics I want to carry forward into the computation of derived statistics grows:
df2 = df |>
@groupby(_.a) |>
@map({a=key(_), bm=mean(_.b), cs=sum(_.b)}) |>
@map({a=_.a, bm=_.bm, cs=_.cs, d=_.bm + _.cs}) |>
DataFrame
3×4 DataFrame
│ Row │ a │ bm │ cs │ d │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼───────┼─────────┤
│ 1 │ 1 │ 4.5 │ 9 │ 13.5 │
│ 2 │ 2 │ 6.0 │ 6 │ 12.0 │
│ 3 │ 3 │ 8.0 │ 8 │ 16.0 │
Another option is to create a new DataFrame of first-order results, run a new @map on that to compute the second-order results, and then join the two afterward. Is there any way in Query, DataFramesMeta, or even bare DataFrames to do it in one relatively concise step?
Just for reference, the "create multiple DataFrames" approach:
df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])
df2 = df |>
@groupby(_.a) |>
@map({a=key(_), bm=mean(_.b), cs=sum(_.b)}) |>
DataFrame
df3 = df2 |>
@map({a=_.a, d=_.bm + _.cs}) |>
DataFrame
df4 = innerjoin(df2, df3, on = :a)
3×4 DataFrame
│ Row │ a │ bm │ cs │ d │
│ │ Int64 │ Float64 │ Int64 │ Float64 │
├─────┼───────┼─────────┼───────┼─────────┤
│ 1 │ 1 │ 4.5 │ 9 │ 13.5 │
│ 2 │ 2 │ 6.0 │ 6 │ 12.0 │
│ 3 │ 3 │ 8.0 │ 8 │ 16.0 │
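For completeness, a plain DataFrames.jl sketch (assuming a recent version) that computes the derived statistic in one step: passing a do-block to combine lets you bind the intermediate statistics to local variables and reuse them immediately.
using DataFrames, Statistics
df = DataFrame(a=[1,1,2,3], b=[4,5,6,8])
# the do-block receives each group as a SubDataFrame; returning a NamedTuple
# produces one row per group with these columns plus the grouping column :a
df2 = combine(groupby(df, :a)) do sdf
    bm = mean(sdf.b)
    cs = sum(sdf.b)
    (bm = bm, cs = cs, d = bm + cs)
end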

Duplicated columns in Julia Dataframes

In Python Pandas and R one can get rid of duplicated columns easily - just load the data, assign the column names, and select those that are not duplicated.
What are the best practices for dealing with such data in Julia DataFrames? Assigning duplicated column names is not allowed here. I understand the only way would be to massage the incoming data more and get rid of such data before constructing a DataFrame?
The thing is that it is almost always easier to deal with duplicated columns in a dataframe that has already been constructed, rather than in the incoming data.
UPD: I meant duplicated column names. I build the dataframe from raw data, where column names (and thus data) could be repeated.
UPD2: Python example added.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.hstack([np.zeros((4,1)), np.ones((4,2))]), columns=["a", "b", "b"])
>>> df
a b b
0 0.0 1.0 1.0
1 0.0 1.0 1.0
2 0.0 1.0 1.0
3 0.0 1.0 1.0
>>> df.loc[:, ~df.columns.duplicated()]
a b
0 0.0 1.0
1 0.0 1.0
2 0.0 1.0
3 0.0 1.0
I build my Julia DataFrame from a Float32 matrix and then assign column names from a vector. That is where I need to get rid of the columns that have duplicated names (already present in the dataframe). That is the nature of the underlying data: sometimes it has duplicates, sometimes not, and I have no control over its creation.
Is this something you are looking for (I was not 100% sure from your description - if this is not what you want then please update the question with an example):
julia> df = DataFrame([zeros(4,3) ones(4,5)])
4×8 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │ x8 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 0.0 │ 0.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │
│ 3 │ 0.0 │ 0.0 │ 0.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │ 1.0 │
julia> DataFrame(unique(last, pairs(eachcol(df))))
4×2 DataFrame
│ Row │ x1 │ x4 │
│ │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ 0.0 │ 1.0 │
│ 2 │ 0.0 │ 1.0 │
│ 3 │ 0.0 │ 1.0 │
│ 4 │ 0.0 │ 1.0 │
EDIT
To deduplicate column names, use the makeunique keyword argument:
julia> DataFrame(rand(3,4), [:x, :x, :x, :x], makeunique=true)
3×4 DataFrame
│ Row │ x │ x_1 │ x_2 │ x_3 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼──────────┼──────────┼───────────┤
│ 1 │ 0.410494 │ 0.775563 │ 0.819916 │ 0.0520466 │
│ 2 │ 0.0503997 │ 0.427499 │ 0.262234 │ 0.965793 │
│ 3 │ 0.838595 │ 0.996305 │ 0.833607 │ 0.953539 │
EDIT 2
So you seem to have access to column names when creating a data frame. In this case I would do:
julia> mat = [ones(3,1) zeros(3,2)]
3×3 Array{Float64,2}:
1.0 0.0 0.0
1.0 0.0 0.0
1.0 0.0 0.0
julia> cols = ["a", "b", "b"]
3-element Array{String,1}:
"a"
"b"
"b"
julia> df = DataFrame(mat, cols, makeunique=true)
3×3 DataFrame
│ Row │ a │ b │ b_1 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1 │ 1.0 │ 0.0 │ 0.0 │
│ 2 │ 1.0 │ 0.0 │ 0.0 │
│ 3 │ 1.0 │ 0.0 │ 0.0 │
julia> select!(df, unique(cols))
3×2 DataFrame
│ Row │ a │ b │
│ │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1 │ 1.0 │ 0.0 │
│ 2 │ 1.0 │ 0.0 │
│ 3 │ 1.0 │ 0.0 │
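If you would rather drop the duplicated columns before construction instead of renaming them, here is a small sketch (same mat and cols as above) that keeps only the first occurrence of each name:
keep = unique(i -> cols[i], eachindex(cols))  # indices of the first occurrence of each name
df = DataFrame(mat[:, keep], cols[keep])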

Different datatypes of dataframe columns are not supported by Impute (missing value handling) in Julia

I did a small experiment and found out that the error occurs just because of the different data types of the columns in the CSV. Please see the following code:
julia> using DataFrames
julia> df = DataFrame(:a => [1.0, 2, missing, missing, 5.0], :b => [1.1, 2.2, 3, missing, 5],:c => [1,3,5,missing,6])
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Int64? │
├─────┼──────────┼──────────┼─────────┤
│ 1 │ 1.0 │ 1.1 │ 1 │
│ 2 │ 2.0 │ 2.2 │ 3 │
│ 3 │ missing │ 3.0 │ 5 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 5.0 │ 5.0 │ 6 │
julia> df
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Int64? │
├─────┼──────────┼──────────┼─────────┤
│ 1 │ 1.0 │ 1.1 │ 1 │
│ 2 │ 2.0 │ 2.2 │ 3 │
│ 3 │ missing │ 3.0 │ 5 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 5.0 │ 5.0 │ 6 │
julia> using Impute
julia> Impute.interp(df)
ERROR: InexactError: Int64(5.5)
Stacktrace:
[1] Int64 at ./float.jl:710 [inlined]
[2] convert at ./number.jl:7 [inlined]
[3] convert at ./missing.jl:69 [inlined]
[4] setindex! at ./array.jl:826 [inlined]
[5] (::Impute.var"#58#59"{Int64,Array{Union{Missing, Int64},1}})(::Impute.Context) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors/interp.jl:67
[6] (::Impute.Context)(::Impute.var"#58#59"{Int64,Array{Union{Missing, Int64},1}}) at /home/synerzip/.julia/packages/Impute/GmIMg/src/context.jl:227
[7] _impute!(::Array{Union{Missing, Int64},1}, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors/interp.jl:49
[8] impute!(::Array{Union{Missing, Int64},1}, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:84
[9] impute!(::DataFrame, ::Impute.Interpolate) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:172
[10] #impute#17 at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:76 [inlined]
[11] impute at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:76 [inlined]
[12] _impute(::DataFrame, ::Type{Impute.Interpolate}) at /home/synerzip/.julia/packages/Impute/GmIMg/src/imputors.jl:58
[13] #interp#105 at /home/synerzip/.julia/packages/Impute/GmIMg/src/Impute.jl:84 [inlined]
[14] interp(::DataFrame) at /home/synerzip/.julia/packages/Impute/GmIMg/src/Impute.jl:84
[15] top-level scope at REPL[15]:1
and this error does not occur when I run the following code
julia> df = DataFrame(:a => [1.0, 2, missing, missing, 5.0], :b => [1.1, 2.2, 3, missing, 5])
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ missing │ 3.0 │
│ 4 │ missing │ missing │
│ 5 │ 5.0 │ 5.0 │
julia> Impute.interp(df)
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ 3.0 │ 3.0 │
│ 4 │ 4.0 │ 4.0 │
│ 5 │ 5.0 │ 5.0 │
Now I know the reason, but I am confused about how to solve it. I cannot pass explicit column types while reading the CSV, because my dataset contains 171 columns and they are typically either Int or Float. I am stuck on how to convert all the columns to Float64.
I assume that:
you want something simple that does not have to be maximally efficient
all your columns are numeric (possibly having missing values)
Then just write:
julia> df
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Int64? │
├─────┼──────────┼──────────┼─────────┤
│ 1 │ 1.5 │ 1.65 │ 1 │
│ 2 │ 3.0 │ 3.3 │ 3 │
│ 3 │ missing │ 4.5 │ 5 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 7.5 │ 7.5 │ 6 │
julia> float.(df)
5×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64? │ Float64? │ Float64? │
├─────┼──────────┼──────────┼──────────┤
│ 1 │ 1.5 │ 1.65 │ 1.0 │
│ 2 │ 3.0 │ 3.3 │ 3.0 │
│ 3 │ missing │ 4.5 │ 5.0 │
│ 4 │ missing │ missing │ missing │
│ 5 │ 7.5 │ 7.5 │ 6.0 │
It is possible to be more efficient (i.e. convert only the columns that are integers in the source data frame), but it requires more code - please comment if you need such a solution.
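For reference, a rough, untested sketch of that kind of targeted conversion (it assumes every integer column should become Float64 and that other columns should be left alone):
for (name, col) in pairs(eachcol(df))
    if eltype(col) <: Union{Missing, Integer}
        df[!, name] = float.(col)  # float(missing) is missing, so missings are preserved
    end
end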
EDIT
Also note that CSV.jl has a typemap keyword argument that should allow you to handle this issue when reading the data in.
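For example (a hedged sketch; the file name data.csv is a placeholder and the exact behavior depends on the CSV.jl version):
using CSV, DataFrames
df = CSV.read("data.csv", DataFrame; typemap=Dict(Int64 => Float64))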

Julia: how to compute a particular operation on certain columns of a Dataframe

I have the following Dataframe
using DataFrames, Statistics
df = DataFrame(name=["John", "Sally", "Kirk"],
age=[23., 42., 59.],
children=[3,5,2], height = [180, 150, 170])
print(df)
3×4 DataFrame
│ Row │ name │ age │ children │ height │
│ │ String │ Float64 │ Int64 │ Int64 │
├─────┼────────┼─────────┼──────────┼────────┤
│ 1 │ John │ 23.0 │ 3 │ 180 │
│ 2 │ Sally │ 42.0 │ 5 │ 150 │
│ 3 │ Kirk │ 59.0 │ 2 │ 170 │
I can compute the mean of a column as follows:
println(mean(df[:4]))
166.66666666666666
Now I want to get the mean of all the numeric columns, so I tried this code:
x = [2,3,4]
for i in x
print(mean(df[:x[i]]))
end
But got the following error message:
MethodError: no method matching getindex(::Symbol, ::Int64)
Stacktrace:
[1] top-level scope at ./In[64]:3
How can I solve the problem?
You are trying to access the DataFrame's columns using integer indices specifying the columns' positions. You should just use the integer value on its own, without putting : in front of it (that would create a Symbol, and you do not have a column with that name).
x = [2,3,4]
for i in x
println(mean(df[i])) # no need for `x[i]`
end
You can also index a DataFrame using a Symbol denoting the column's name.
x = [:age, :children, :height];
for c in x
println(mean(df[c]))
end
You get the following error in your attempt because you are trying to access the ith index of the symbol :x, which is an undefined operation.
MethodError: no method matching getindex(::Symbol, ::Int64)
Note that :4 is just 4.
julia> :4
4
julia> typeof(:4)
Int64
Here is a one-liner that actually selects all Number columns:
julia> mean.(eachcol(df[findall(x-> x<:Number, eltypes(df))]))
3-element Array{Float64,1}:
41.333333333333336
3.3333333333333335
166.66666666666666
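In newer versions of DataFrames.jl (where eltypes and integer data frame indexing are deprecated), a hedged equivalent would use names(df, Real), which selects the columns whose element type is numeric:
mean.(eachcol(select(df, names(df, Real))))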
For many scenarios describe is actually more convenient:
julia> describe(df)
4×8 DataFrame
│ Row │ variable │ mean │ min │ median │ max │ nunique │ nmissing │ eltype │
│ │ Symbol │ Union… │ Any │ Union… │ Any │ Union… │ Nothing │ DataType │
├─────┼──────────┼─────────┼──────┼────────┼───────┼─────────┼──────────┼──────────┤
│ 1 │ name │ │ John │ │ Sally │ 3 │ │ String │
│ 2 │ age │ 41.3333 │ 23.0 │ 42.0 │ 59.0 │ │ │ Float64 │
│ 3 │ children │ 3.33333 │ 2 │ 3.0 │ 5 │ │ │ Int64 │
│ 4 │ height │ 166.667 │ 150 │ 170.0 │ 180 │ │ │ Int64 │
In the question println(mean(df[4])) works as well (instead of println(mean(df[:4]))).
Hence we can write
x = [2,3,4]
for i in x
println(mean(df[i]))
end
which works