Indexing with nested vectors in APL

I have a vector of vectors that contain some indices, and a character vector which I want to use them on.
A←(1 2 3)(3 2 1)
B←'ABC'
I have tried:
B[A]
RANK ERROR
B[A]
∧
A⌷B
LENGTH ERROR
A⌷B
∧
and
A⌷¨B
LENGTH ERROR
A⌷¨B
∧
I would like
┌→────────────┐
│ ┌→──┐ ┌→──┐ │
│ │ABC│ │CBA│ │
│ └───┘ └───┘ │
└∊────────────┘
to be returned, but if I need to find another way, let me know.

The index function ⌷ is a bit odd. To select multiple major cells from an array, you need to enclose the array of indices:
(⊂3 2 1)⌷'ABC'
CBA
In order to use each of two vectors of indices, the array you're selecting from needs to be distributed among the two. You can use APL's scalar extension for this, but then the array you're selecting from needs to be packaged as a scalar:
(⊂1 2 3)(⊂3 2 1)⌷¨⊂'ABC'
┌→────────────┐
│ ┌→──┐ ┌→──┐ │
│ │ABC│ │CBA│ │
│ └───┘ └───┘ │
└∊────────────┘
So to use your variables:
A←(1 2 3)(3 2 1)
B←'ABC'
(⊂¨A)⌷¨⊂B
┌→────────────┐
│ ┌→──┐ ┌→──┐ │
│ │ABC│ │CBA│ │
│ └───┘ └───┘ │
└∊────────────┘

Note that, if you are generating permutations which all have the same length, you may be better off avoiding nested arrays. Nested arrays force the system to follow pointers, while simple arrays allow sequential access to densely packed data. This only really matters when you have a LOT of data, of course:
⎕←SIMPLE←↑A ⍝ A 2×3 matrix of indices
1 2 3
3 2 1
(⊂SIMPLE)⌷B
ABC
CBA
B[SIMPLE] ⍝ IMHO bracket indexing is nicer for this
ABC
CBA
↓B[SIMPLE] ⍝ Split if you must
┌───┬───┐
│ABC│CBA│
└───┴───┘
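The same selection can be expressed outside APL; here is a hedged Python sketch (Python is 0-based, so the index vectors shift down by one):

```python
# Select characters from "ABC" using two index vectors, analogous to
# (⊂¨A)⌷¨⊂B above. The subtraction adjusts 1-based APL indices.
B = "ABC"
A = [(1, 2, 3), (3, 2, 1)]
result = ["".join(B[i - 1] for i in idx) for idx in A]
print(result)  # ['ABC', 'CBA']
```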

In NARS2000, this is easy:
A←(1 3 2)(3 2 1)
B←'ABC'
⎕fmt {B[⍵]}¨¨A
┌2────────────┐
│┌3───┐ ┌3───┐│
││ ACB│ │ CBA││
│└────┘ └────┘2
└∊────────────┘
C←(1 3 2 3 2 1)(3 2 1)
⎕fmt {B[⍵]}¨¨C
┌2───────────────┐
│┌6──────┐ ┌3───┐│
││ ACBCBA│ │ CBA││
│└───────┘ └────┘2
└∊───────────────┘

Related

Extracting Data from .csv File in Julia

I'm quite new to Julia and I have a .csv file, stored inside a gzip, from which I want to extract some information for educational purposes and to get to know the language better.
In Python there are many helpful functions from pandas to help with that, but I can't seem to get the problem sorted out.
This is my code (I know, very weak!):
import Pkg
#Pkg.add("CSV")
#Pkg.add("DataFrames")
#Pkg.add("CSVFiles")
#Pkg.add("CodecZlib")
#Pkg.add("GZip")
using CSVFiles
using Pkg
using CSV
using DataFrames
using CodecZlib
using GZip
df = CSV.read("Path//to//file//file.csv.gzip", DataFrame)
print(df)
I added a screenshot to show what the columns inside the .csv file look like.
I would like to extract the dates and make some sort of top 10 of the most-commenting users, top 10 days with the most threads, etc.
I would like to point out that this is not an exercise given to me, but training I would like to do for myself.
I know the pandas version of this looks like this:
df['threadcreateddate'] = pd.to_datetime(df['thread_created_utc']).dt.date
or
df['commentcreateddate'] = pd.to_datetime(df['comment_created_utc']).dt.date
And to sort it:
df_number_of_threads = df.groupby('threadcreateddate')['thread_id'].nunique()
If I were to plot it:
df_number_of_threads.plot(kind='line')
plt.show()
To print:
head = df.head()
print(df_number_of_threads.sort_values(ascending=False).head(10))
Can someone help? The df.select() function didn't work for me.
1. Packages
We obviously need DataFrames.jl. And since we're dealing with dates in the data, and doing a plot later, we'll include Dates and Plots as well.
As this example in CSV.jl's documentation shows, no additional packages are needed for gzipped data. CSV.jl can decompress automatically. So, you can remove the other using statements from your list.
julia> using CSV, DataFrames, Dates, Plots
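Since the question comes from a pandas background, here is a rough, stdlib-only Python sketch of what CSV.jl does transparently with a gzipped file (the data below is made up for illustration):

```python
import csv
import gzip
import io

# Decompress and parse a gzipped CSV with only the standard library.
# CSV.jl performs the decompression step automatically; here it's explicit.
data = "thread_id,comment_id\n1,1\n1,2\n"
compressed = gzip.compress(data.encode())

with gzip.open(io.BytesIO(compressed), mode="rt", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0])  # {'thread_id': '1', 'comment_id': '1'}
```

With a real file on disk you would pass the path to gzip.open directly instead of wrapping bytes in io.BytesIO.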
2. Preparing the Data Frame
You can use CSV.read to load the data into a data frame, as in the question. Here, I'll use some sample (simplified) data for illustration, with just 4 columns:
julia> df
6×4 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc
│ Int64 String Int64 String
─────┼─────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00
3. Converting from String to DateTime
To extract the thread dates from the string columns we have, we'll use the Dates standard library.
Depending on the exact format your dates are in, you might have to add a date format argument for conversion to Dates data types (see the Constructors section of Dates in the Julia manual). Here in the sample data, the dates are in the ISO standard format, so we don't need to specify the date format explicitly.
In Julia, we can get the date directly without intermediate conversion to a date-time type, but since it's a good idea to have the columns be in the proper type anyway, we'll first convert the existing columns from strings to DateTime:
julia> transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(DateTime), renamecols = false)
6×4 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc
│ Int64 DateTime Int64 DateTime
─────┼─────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00
Though it looks similar, this data frame doesn't use Strings for the date-time columns, but instead has proper DateTime type values.
(For an explanation of how this transform! works, see the DataFrames manual: Selecting and transforming columns.)
Edit: Based on the screenshot added to the question now, in your case you'd use transform!(df, [:thread_created_utc, :comment_created_utc] .=> ByRow(s -> DateTime(s, dateformat"yyyy-mm-dd HH:MM:SS.s")), renamecols = false).
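For readers coming from pandas, the same two steps (parse the timestamp string, then take just the date part) can be sketched in plain Python; the timestamp value is illustrative:

```python
from datetime import date, datetime

# Parse an ISO-format timestamp string, then extract the date part,
# mirroring DateTime(s) followed by Date.() in the Julia answer above.
ts = "2022-08-13T12:00:00"
dt = datetime.fromisoformat(ts)
d = dt.date()
print(d)  # 2022-08-13
```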
4. Creating Date columns
Now, creating the date columns is as easy as:
julia> df.threadcreateddate = Date.(df.thread_created_utc);
julia> df.commentcreateddate = Date.(df.comment_created_utc);
julia> df
6×6 DataFrame
Row │ thread_id thread_created_utc comment_id comment_created_utc commentcreateddate threadcreateddate
│ Int64 DateTime Int64 DateTime Date Date
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 2022-08-13T12:00:00 1 2022-08-13T12:00:00 2022-08-13 2022-08-13
2 │ 1 2022-08-13T12:00:00 2 2022-08-14T12:00:00 2022-08-14 2022-08-13
3 │ 1 2022-08-13T12:00:00 3 2022-08-15T12:00:00 2022-08-15 2022-08-13
4 │ 2 2022-08-16T12:00:00 4 2022-08-16T12:00:00 2022-08-16 2022-08-16
5 │ 2 2022-08-16T12:00:00 5 2022-08-17T12:00:00 2022-08-17 2022-08-16
6 │ 2 2022-08-16T12:00:00 6 2022-08-18T12:00:00 2022-08-18 2022-08-16
These could also be written as transform! calls, and in fact the transform! call in the previous code segment could instead have been replaced with df.thread_created_utc = DateTime.(df.thread_created_utc) and df.comment_created_utc = DateTime.(df.comment_created_utc). However, transform! offers a very powerful and flexible syntax that can do a lot more, so it's useful to familiarize yourself with it if you're going to work with DataFrames.jl.
5. Getting the number of threads per day
julia> gdf = combine(groupby(df, :threadcreateddate), :thread_id => length ∘ unique => :number_of_threads)
2×2 DataFrame
Row │ threadcreateddate number_of_threads
│ Date Int64
─────┼──────────────────────────────────────
1 │ 2022-08-13 1
2 │ 2022-08-16 1
Note that df.groupby('threadcreateddate') becomes groupby(df, :threadcreateddate), which is a common pattern in Python-to-Julia conversions. Julia doesn't use the . based object-oriented syntax, and instead the data frame is one of the arguments to the function.
length ∘ unique uses the function composition operator ∘; the result is a function that applies unique and then length. Here we take the unique values of the thread_id column in each group, apply length to them (so, the equivalent of nunique), and store the result in the number_of_threads column of a new data frame called gdf.
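The length ∘ unique idea is not Julia-specific; as a plain-Python sketch (no pandas required), counting distinct values is just "unique, then length":

```python
# A composition of "unique" then "length": the analogue of Julia's
# length ∘ unique and pandas' nunique. The data is illustrative.
def nunique(values):
    return len(set(values))

thread_ids = [1, 1, 1, 2, 2, 2]
print(nunique(thread_ids))  # 2
```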
6. Plotting
julia> plot(gdf.threadcreateddate, gdf.number_of_threads)
Since our grouped data frame conveniently contains both the date and the number of threads, we can plot the number_of_threads against the dates, making for a nice and informative visualization.
As Sundar R commented it is hard to give you a precise answer for your data as there might be some relevant details. But here is a general pattern you can follow:
julia> using DataFrames
julia> df = DataFrame(id = [1, 1, 2, 2, 2, 3])
6×1 DataFrame
Row │ id
│ Int64
─────┼───────
1 │ 1
2 │ 1
3 │ 2
4 │ 2
5 │ 2
6 │ 3
julia> first(sort(combine(groupby(df, :id), nrow), :nrow, rev=true), 10)
3×2 DataFrame
Row │ id nrow
│ Int64 Int64
─────┼──────────────
1 │ 2 3
2 │ 1 2
3 │ 3 1
What this code does:
groupby groups the data by the column you want to aggregate
combine with the nrow argument counts the number of rows in each group and stores the count in the :nrow column (this is the default; you could choose another column name)
sort sorts the data frame by :nrow, and rev=true makes the order descending
first picks the first 10 rows of this data frame
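The four steps above (group, count, sort descending, take the first 10) can also be sketched with the Python standard library; collections.Counter bundles the first three, and the id values are made up:

```python
from collections import Counter

# Group by id, count rows per group, sort descending, take the top 10.
ids = [1, 1, 2, 2, 2, 3]
top10 = Counter(ids).most_common(10)
print(top10)  # [(2, 3), (1, 2), (3, 1)]
```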
If you want something more similar to dplyr in R, with piping, you can use @chain, which is exported by DataFramesMeta.jl:
julia> using DataFramesMeta
julia> @chain df begin
groupby(:id)
combine(nrow)
sort(:nrow, rev=true)
first(10)
end
3×2 DataFrame
Row │ id nrow
│ Int64 Int64
─────┼──────────────
1 │ 2 3
2 │ 1 2
3 │ 3 1

Add thousands separator to column in dataframe in julia

I have a dataframe with two columns, a and b, and at the moment both look like column a below, but I want to add separators so that column b looks like the b below. I have tried using the package Format.jl, but I haven't gotten the result I'm after. Maybe worth mentioning is that both columns are Int64 and the column names a and b are of type Symbol.
a | b
150000 | 1500,00
27 | 27,00
16614 | 166,14
Is there some other way to solve this than using Format.jl? Or is Format.jl the way to go?
Assuming you want the commas in their typical positions rather than how you wrote them, this is one way:
julia> using DataFrames, Format
julia> f(x) = format(x, commas=true)
f (generic function with 1 method)
julia> df = DataFrame(a = [1000000, 200000, 30000])
3×1 DataFrame
Row │ a
│ Int64
─────┼─────────
1 │ 1000000
2 │ 200000
3 │ 30000
julia> transform(df, :a => ByRow(f) => :a_string)
3×2 DataFrame
Row │ a a_string
│ Int64 String
─────┼────────────────────
1 │ 1000000 1,000,000
2 │ 200000 200,000
3 │ 30000 30,000
If you instead want the column replaced, use transform(df, :a => ByRow(f), renamecols=false).
If you just want the output vector rather than changing the DataFrame, you can use format.(df.a, commas=true).
You could write your own function f to achieve the same behavior, but you might as well use the one someone already wrote inside the Format.jl package.
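If you did want to write such a function yourself, the logic is small; here is a hedged Python sketch of the same comma grouping (Python's format mini-language happens to have it built in):

```python
# Insert thousands separators: the same job Format.jl's
# format(x, commas=true) does in the Julia answer above.
def with_commas(x):
    return f"{x:,}"

print(with_commas(1000000))  # 1,000,000
```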
However, once you transform your data to Strings as above, you won't be able to filter/sort/analyze the numerical data in the DataFrame. I would suggest applying the formatting in the printing step (rather than modifying the DataFrame itself to contain strings) by using the PrettyTables package, which can format the entire DataFrame at once.
julia> using DataFrames, PrettyTables
julia> df = DataFrame(a = [1000000, 200000, 30000], b = [500, 6000, 70000])
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼────────────────
1 │ 1000000 500
2 │ 200000 6000
3 │ 30000 70000
julia> pretty_table(df, formatters = ft_printf("%'d"))
┌───────────┬────────┐
│ a │ b │
│ Int64 │ Int64 │
├───────────┼────────┤
│ 1,000,000 │ 500 │
│ 200,000 │ 6,000 │
│ 30,000 │ 70,000 │
└───────────┴────────┘
(Edited to reflect the updated specs in the question)
julia> df = DataFrame(a = [150000, 27, 16614]);
julia> function insertdecimalcomma(n)
if n < 100
return string(n) * ",00"
else
return replace(string(n), r"(..)$" => s",\1")
end
end
insertdecimalcomma (generic function with 1 method)
julia> df.b = insertdecimalcomma.(df.a)
julia> df
3×2 DataFrame
Row │ a b
│ Int64 String
─────┼─────────────────
1 │ 150000 1500,00
2 │ 27 27,00
3 │ 16614 166,14
Note that the b column will necessarily be a String after this change, as integer types cannot store formatting information in them.
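For readers more comfortable in Python, the same decimal-comma logic can be sketched with re.sub; the function name is mine, and the data is the question's sample:

```python
import re

# Numbers below 100 get ",00" appended; otherwise a comma is inserted
# before the last two digits, matching the Julia insertdecimalcomma above.
def insert_decimal_comma(n):
    if n < 100:
        return f"{n},00"
    return re.sub(r"(\d\d)$", r",\1", str(n))

print([insert_decimal_comma(n) for n in (150000, 27, 16614)])
# ['1500,00', '27,00', '166,14']
```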
If you have a lot of data and find that you need better performance, you may also want to use the InlineStrings package:
julia> # same as before, up to the function definition
julia> using InlineStrings
julia> df.b = inlinestrings(insertdecimalcomma.(df.a))
3-element Vector{String7}:
"1500,00"
"27,00"
"166,14"
This stores the b column's data as fixed-size strings (String7 type here), which are generally treated like normal Strings, but can be significantly better for performance.

Filter DataFrame by rows which have no "missing" value

I have a DataFrame that may contain missing values and I want to filter out all the rows that contain at least one missing value, so from this
DataFrame(a = [1, 2, 3, 4], b = [5, missing, 7, 8], c = [9, 10, missing, 12])
4×3 DataFrame
Row │ a b c
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 5 9
2 │ 2 missing 10
3 │ 3 7 missing
4 │ 4 8 12
I want something like
Row │ a b c
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 5 9
4 │ 4 8 12
Ideally, there would be a filter function where I can pass each row into a lambda and then do a combo of collect and findfirst and whatnot, but I can't figure out how to pass lambdas to subset or @subset (from DataFramesMeta), because I don't only have three columns, I have over 200.
Following what @Antonello said, you can do it with dropmissing. You have three options:
dropmissing: create a new data frame with the rows containing missing values dropped;
dropmissing with view=true: create a view of the source data frame with the rows containing missing values dropped;
dropmissing!: drop the rows with missing values in-place.
By default all columns are considered, but you can change it and pass a column selector specifying which columns you want to include in the check.
Finally, by default, after dropping the rows with missing values, the columns change their eltype so that they no longer allow missing values. You can change this behavior by passing disallowmissing=false, in which case they would still allow them.
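The row-wise rule dropmissing implements ("keep a row only if no cell is missing") is easy to state in any language; here it is as a plain-Python sketch, with None standing in for missing and made-up rows matching the question's example:

```python
# Keep only the rows in which no value is missing
# (None plays the role of Julia's missing).
rows = [
    {"a": 1, "b": 5,    "c": 9},
    {"a": 2, "b": None, "c": 10},
    {"a": 3, "b": 7,    "c": None},
    {"a": 4, "b": 8,    "c": 12},
]
complete = [r for r in rows if all(v is not None for v in r.values())]
print([r["a"] for r in complete])  # [1, 4]
```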
Here is how you could perform filtering using subset and ismissing instead:
julia> subset(df, All() .=> ByRow(!ismissing))
2×3 DataFrame
Row │ a b c
│ Int64 Int64? Int64?
─────┼───────────────────────
1 │ 1 5 9
2 │ 4 8 12
(I am using the standard subset from DataFrames.jl)
or if you have a very wide data frame (like thousands of columns):
subset(df, AsTable(All()) => ByRow((x -> all(!ismissing, x))∘collect))
(this is a special syntax optimized for fast row-wise aggregation of wide tables)
OK, this seems to work but I'm leaving this open for more suggestions.
DataFrame(collect(filter(r -> isnothing(findfirst(collect(ismissing.(collect(r))))), eachrow(data[:, before_qs]))))

Filter/select rows by comparing to previous rows when using DataFrames.jl?

I have a DataFrame that's 659 x 2 in its size, and is sorted according to its Low column. Its first 20 rows can be seen below:
julia> size(dfl)
(659, 2)
julia> first(dfl, 20)
20×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-08-25 0.783125
6 │ 2010-05-25 0.808333
7 │ 2010-06-08 0.820938
8 │ 2010-07-20 0.82375
9 │ 2010-05-21 0.824792
10 │ 2010-08-16 0.842188
11 │ 2010-08-12 0.849688
12 │ 2010-02-25 0.871979
13 │ 2010-02-23 0.879896
14 │ 2010-07-30 0.890729
15 │ 2010-06-01 0.916667
16 │ 2010-08-06 0.949271
17 │ 2010-09-10 0.949792
18 │ 2010-03-04 0.969375
19 │ 2010-05-17 0.9875
20 │ 2010-03-09 1.0349
What I'd like to do is to filter out all rows in this dataframe such that only rows with monotonically increasing dates remain. So if applied to the first 20 rows above, I'd like the output to be the following:
julia> my_filter_or_subset(f, first(dfl, 20))
5×2 DataFrame
Row │ Date Low
│ Date… Float64
─────┼──────────────────────
1 │ 2010-05-06 0.708333
2 │ 2010-07-01 0.717292
3 │ 2010-08-27 0.764583
4 │ 2010-08-31 0.776146
5 │ 2010-09-10 0.949792
Is there some high-level way to achieve this using Julia and DataFrames.jl?
I should also note that I originally prototyped the solution in Python using pandas, and because it was just a PoC I didn't bother to figure out how to achieve this using pandas either (assuming it's even possible). Instead, I just used a Python for loop to iterate over each row of the dataframe, and only appended the rows whose dates are greater than the last date of the growing list.
I'm now trying to write this better in Julia, and looked into filter and subset methods in DataFrames.jl. Intuitively filter doesn't seem like it'd work, since the user supplied filter function can only access contents from each passed row; subset might be feasible since it has access to the entire column of data. But it's not obvious to me how to do this cleanly and efficiently, assuming it's even possible. If not, then guess I'll just have to stick with using a for loop here too.
You need to use a for loop for this task in the end (you have to loop over all the values).
In Julia, loops are fast, so using your own for loop does not hinder performance.
If you are looking for something that is relatively short to type (but it will be slower than a custom for loop as it will perform the operation in several passes) you can use e.g.:
dfl[pushfirst!(diff(accumulate(max, dfl.Date)) .> 0, true), :]
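The single-pass loop that the accumulate trick compresses looks like this; a hedged Python sketch using ISO date strings (which compare correctly as plain strings) and rows taken from the question's sample:

```python
# Keep a row only when its date is strictly greater than the last kept
# date: the one-pass loop equivalent of the accumulate(max, ...) approach.
rows = [
    ("2010-05-06", 0.708333),
    ("2010-07-01", 0.717292),
    ("2010-08-27", 0.764583),
    ("2010-08-31", 0.776146),
    ("2010-08-25", 0.783125),
    ("2010-09-10", 0.949792),
]
kept = []
for day, low in rows:
    if not kept or day > kept[-1][0]:
        kept.append((day, low))

print([d for d, _ in kept])
# ['2010-05-06', '2010-07-01', '2010-08-27', '2010-08-31', '2010-09-10']
```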

Is it possible to set a chosen column as index in a julia dataframe?

DataFrames in pandas are indexed by one or more numerical and/or string columns. In particular, after a groupby operation, the output is a dataframe where the new index is given by the groups.
Similarly, Julia dataframes always have a column named Row, which I think is equivalent to the index in pandas. However, after groupby operations, Julia dataframes don't use the groups as the new index. Here is a working example:
using RDatasets;
using DataFrames;
using StatsBase;
df = dataset("Ecdat","Cigarette");
gdf = groupby(df, "Year");
combine(gdf, "Income" => mean)
Output:
11×2 DataFrame
│ Row │ Year │ Income_mean │
│ │ Int32 │ Float64 │
├─────┼───────┼─────────────┤
│ 1 │ 1985 │ 7.20845e7 │
│ 2 │ 1986 │ 7.61923e7 │
│ 3 │ 1987 │ 8.13253e7 │
│ 4 │ 1988 │ 8.77016e7 │
│ 5 │ 1989 │ 9.44374e7 │
│ 6 │ 1990 │ 1.00666e8 │
│ 7 │ 1991 │ 1.04361e8 │
│ 8 │ 1992 │ 1.10775e8 │
│ 9 │ 1993 │ 1.1534e8 │
│ 10 │ 1994 │ 1.21145e8 │
│ 11 │ 1995 │ 1.27673e8 │
Even if the creation of the new index isn't done automatically, I wonder if there is a way to manually set a chosen column as the index. I discovered the method setindex! while reading the documentation. However, I wasn't able to use this method. I tried:
#create new df
income = combine(gdf, "Income" => mean)
#set index
setindex!(income, "Year")
which gives the error:
ERROR: LoadError: MethodError: no method matching setindex!(::DataFrame, ::String)
I think that I have misused the command. What am I doing wrong here? Is it possible to manually set an index in a Julia dataframe using one or more chosen columns?
DataFrames.jl does not currently allow specifying an index for a data frame. The Row column is just there for printing; it's not actually part of the data frame.
However, DataFrames.jl provides all the usual table operations, such as joins, transformations, filters, aggregations, and pivots. Support for these operations does not require having a table index. A table index is a structure used by databases (and by Pandas) to speed up certain table operations, at the cost of additional memory usage and the cost of creating the index.
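What an index buys can be sketched in a few lines of Python: it is essentially a hash map from the key column to rows, so lookups become O(1) instead of a full scan. The values below are taken from the answer's Year/Income example:

```python
# A dict keyed by the "Year" column plays the role of a table index
# (roughly what pandas' set_index("Year") builds internally).
rows = [
    {"Year": 1985, "Income_mean": 7.20845e7},
    {"Year": 1986, "Income_mean": 7.61923e7},
]
index = {r["Year"]: r for r in rows}
print(index[1986]["Income_mean"])  # 76192300.0
```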
The setindex! function you discovered is actually a method from Base Julia that is used to customize the indexing behavior for custom types. For example, x[1] = 42 is equivalent to setindex!(x, 42, 1). Overloading this method allows you to customize the indexing behavior for types that you create.
The docstrings for Base.setindex! can be found here and here.
If you really need a table with an index, you could try IndexedTables.jl.