How to insert a column in a Julia DataFrame at a specific position (without referring to existing column names)

I have a DataFrame in Julia with hundreds of columns, and I would like to insert a column after the first one.
For example in this DataFrame:
df = DataFrame(
    colour = ["green", "blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
I would like to insert a column area after colour, but without referring specifically to shape and border (which in my real case are hundreds of different columns). The new column can be created with:
df[:area] = [1,2]
but it is appended at the end. In this small example I can reorder it with the following, but it refers specifically to shape and border by name:
df = df[[:colour, :area, :shape, :border]] # with specific reference to the shape and border names

Update: This function has changed. See @DiegoJavierZea's comment.
Well, congratulations on finding a workaround yourself, but there is a built-in function that is semantically clearer and possibly a little faster:
using DataFrames
df = DataFrame(
    colour = ["green", "blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
insert!(df, 3, [1,2], :area)
Where 3 is the expected index for the new column after the insertion, [1,2] is its content, and :area is the name. You can find more detailed documentation by typing ?insert! in the REPL after loading the DataFrames package.
It is worth noting that the ! is a part of the function name. It's a Julia convention to indicate that the function will mutate its argument.
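As the update above notes, this function has since been replaced by insertcols! in DataFrames.jl. A minimal sketch of the modern equivalent, assuming a recent DataFrames.jl where insertcols! takes name => values pairs:
using DataFrames

df = DataFrame(
    colour = ["green", "blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
insertcols!(df, 2, :area => [1, 2])  # :area becomes column 2, right after :colour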

Another option is insertcols!, shown here adding a Day counter as the first column:
rows = size(df)[1]       # the tuple gives you (rows, columns) of the DataFrame
insertcols!(df,          # DataFrame to be changed
    1,                   # insert as column 1
    :Day => 1:rows,      # populate as "Day" with 1, 2, 3, ...
    makeunique=true)     # if a column of that name exists, make it Day_1

While writing the question I also found a solution (as often happens).
I still post the question here to keep it on record (for myself) and for others.
It is enough to save the column names before "adding" the new column:
df = DataFrame(
    colour = ["green", "blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
dfnames = names(df)
df[:area] = [1,2]
df = df[vcat(dfnames[1:1],:area,dfnames[2:end])]
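On recent DataFrames.jl versions the same idea works without saving the names at all, since select! accepts positional indices and a trailing : meaning "all remaining columns". A sketch, assuming a recent DataFrames.jl:
df.area = [1, 2]          # append the new column at the end
select!(df, 1, :area, :)  # reorder: column 1 first, then :area, then the rest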


How can I define and add a legend to this ggplot2 script?

I came up with the following script to bin my data on X values, and plot the means of those bins in overlapping bar graphs. It works fine, but I can't seem to get a legend to generate, probably due to poor understanding of aesthetic mapping.
Here is the script, note that "MOI" and "T_cell_contacts" are two data columns in each DF.
ggplot(mapping = aes(MOI, T_cell_contacts)) +
  stat_summary_bin(data = Cleaned24hr4, fun = "mean", geom = "bar", bins = 100, fill = "#FF6666", alpha = 0.3) +
  stat_summary_bin(data = cleaned24hr8, fun = "mean", geom = "bar", bins = 100, fill = "#3733FF", alpha = 0.3) +
  ylab("mean")
I also added the graph that it plots.
Full disclosure: I was in the middle of writing this when @schumacher posted their response :). Decided to finish anyway.
There are two ways to approach this. One way (more complicated) is to keep the data frames separate and ask ggplot2 to create a legend via mapping, and the second (simpler) way is to combine them into one dataset, similar to what @schumacher posted, and map the fill color to the extra id column created.
I'll show you both, but first, here's a sample dataset:
library(ggplot2)
set.seed(8675309)
df1 <- data.frame(my_x=rep(1:100, 3), my_y=rnorm(300, 40, 4))
df2 <- data.frame(my_x=rep(11:110, 3), my_y=rnorm(300, 110, 10))
# and the plot code similar to OP's question
ggplot(mapping = aes(x = my_x, y = my_y)) +
  stat_summary_bin(data = df1, fun = "mean", geom = "bar", bins = 40, fill = "blue", alpha = 0.3) +
  stat_summary_bin(data = df2, fun = "mean", geom = "bar", bins = 40, fill = "red", alpha = 0.3)
Method 1: Combine Dataframes
This is the preferred method, for a variety of reasons I can't list completely here. There are a lot of options for combining datasets. One is using union() or rbind() after adding some sort of ID column to your data, but you can do it all in one shot using bind_rows() from dplyr:
df <- dplyr::bind_rows(list(dataset1 = df1, dataset2 = df2), .id="id")
The result will bind the rows together and, by specifying the .id argument, create a new column in the dataset called "id" that uses the names of the datasets in the list as its values. In this case, the value in the df$id column is either "dataset1" if the row originated from df1 or "dataset2" if it originated from df2.
Then you use aes(fill=...) to map the fill color to the column "id" in the combined dataset.
p <- ggplot(df, aes(x = my_x, y = my_y)) +
  stat_summary_bin(aes(fill = id), fun = "mean", geom = "bar", bins = 40, alpha = 0.3)
p
This creates a plot with the default colors for fill, so if you want to supply your own, just use scale_fill_manual(values=...) to specify the particular colors. Using a named vector for values= ensures that each color is applied the way you want it to be, but you can just supply an unnamed vector of color names.
p + scale_fill_manual(values = c("dataset1" = "blue", "dataset2" = "red"))
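For instance, an unnamed vector assigns colors in the order of the legend's levels (alphabetical here), so this sketch is equivalent to the named version above:
p + scale_fill_manual(values = c("blue", "red"))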
Method 2: Use mapping to add the legend
While Method 1 is preferred, there is another way that does not force you to combine your data frames. This is also useful to illustrate a bit about how ggplot2 decides to create and draw legends. The legend is created automatically via the mapping= argument, specifically via aes(). If you put any aesthetic inside aes() that would normally impart a different appearance rather than a location (with some exceptions like x, y, and label), this initiates the creation of a legend. You can map either a column in your dataset (like above), or you can supply a single value, which will then be applied to the entire dataset used for the geom. In this case, see what happens when you change the fill= argument for each geom call to be within aes() and assign it a character value:
p1 <- ggplot(mapping = aes(x = my_x, y = my_y)) +
  stat_summary_bin(aes(fill = "first"), data = df1, fun = "mean", geom = "bar", bins = 40, alpha = 0.3) +
  stat_summary_bin(aes(fill = "second"), data = df2, fun = "mean", geom = "bar", bins = 40, alpha = 0.3) +
  scale_fill_manual(values = c("first" = "blue", "second" = "red"))
p1
It works! When you provide a character value for the fill= aesthetic inside aes(), it's basically labeling every observation in that data to have the value "first" or "second" and using that to make the legend. Cool, right?
You notice a problem, though: the alpha value in the legend is not correct. This is because you get overplotting. It's one of the reasons why you shouldn't really do it this way, but... it sort of works. It is only noticeable if you have an alpha value. You can get it to look normal, but you need to use guide_legend() to override the aesthetics. Since the code effectively causes the legend to be drawn completely for each geom, you have to cut the alpha value in half for it to display correctly.
p1 + guides(fill=guide_legend(override.aes = list(alpha=0.15)))
Oh, and the real reason why not to use Method 2 is.... just think about doing that again for 5 datasets... how about 10?... how about 20?.....
I think the difficulty has to do with building a single legend out of two different geoms. My approach was to combine your data into a single data frame, with the records from each set apart by a new category column, which I'll call "cat" for short.
With the popular dplyr package:
Cleaned24hr4 <- mutate(Cleaned24hr4, cat = "hr4")
Cleaned24hr8 <- mutate(Cleaned24hr8, cat = "hr8")
Then put them together:
Cleaned <- union(Cleaned24hr4,Cleaned24hr8)
Define your colors:
colorcode <- c("hr4" = "#FF6666", "hr8" = "#3733FF")
Here's my ggplot statement:
ggplot(Cleaned, mapping = aes(MOI, T_cell_contacts)) +
  stat_summary_bin(fun = "mean", geom = "bar", bins = 100, aes(fill = cat), alpha = 0.3) +
  scale_fill_manual(values = colorcode) +
  ylab("mean")
Output using some dummy data.

Adding a row to a FITS table with astropy

I have a problem which ought to be trivial but seems to have been massively over-complicated by the column-based nature of FITS BinTableHDU.
The script I'm writing should be trivial: iterate through a FITS file and write a subset of rows to an identically formatted FITS file, reducing the row count from about 700k rows (3.6 GB) to about 350 rows. I have processed the input file and have each row that I want to save in a Python list of FITS records:
outarray = []
self._indata = Table.read(self.infile, hdu=1)
for r in self._indata:
    RecPassesFilter = FilterProc(r, self)
    #
    # Add to output array only if it passes all filters...
    #
    if RecPassesFilter:
        outarray.append(r)
Now, I've created an empty BinTableHDU with exactly the same columns and formats, and I want to add the filtered data:
[...much omitted code later...]
mycols = []
for inputcol in self._coldefs:
    mycols.append(fits.Column(name=inputcol.name, format=inputcol.format))
# Next line should produce an empty BinTableHDU in the identical format to the output data
SaveData = fits.BinTableHDU.from_columns(mycols)
for s in self._outdata:
    SaveData.data.append(s)
Now that last line not only fails, but every variant of it (SaveData.append() or .add_row() or whatever) also fails with a "no such method" error. There seems to be a singular lack of documentation on how to do the trivial task of adding a record. Clearly I am missing something, but two days later I'm still drawing a blank.
Can anyone point me in the right direction here?
OK, I managed to resolve this with some brute force and nested iterations, essentially creating the column data arrays on the fly. It's not much in terms of code, and I don't care that it's inefficient as I won't need to run it too often. Example code here:
with fits.open(self._infile) as HDUSet:
    tableHDU = HDUSet[1]
    self._coldefs = tableHDU.columns

FITScols = []
for inputcol in self._coldefs:
    NewColData = []
    for r in self._outdata:
        NewColData.append(r[inputcol.name])
    FITScols.append(fits.Column(name=inputcol.name, format=inputcol.format, array=NewColData))
SaveData = fits.BinTableHDU.from_columns(FITScols)
SaveData.writeto(fname)
This solves my problem for a 350 row subset. I haven't yet dared try it for the 250K row subset that I need for the next part of my project!
I just recalled that BinTableHDU.from_columns takes an nrows argument. If you pass that along with the columns of an existing table HDU, it will copy the column structure but initialize subsequent rows with empty data:
>>> hdul = fits.open('astropy/io/fits/tests/data/table.fits')
>>> table = hdul[1]
>>> table.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
>>> table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '>f4')]))
>>> new_table = fits.BinTableHDU.from_columns(table.columns, nrows=5)
>>> new_table.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
>>> new_table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2),
('', 0. ), ('', 0. )],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
As you can see, this still copies the data from the original columns. I think the idea behind this originally was for adding new rows to an existing table. However, you can also initialize a completely empty new table by passing fill=True:
>>> new_table_zeroed = fits.BinTableHDU.from_columns(table.columns, nrows=5, fill=True)
>>> new_table_zeroed.data
FITS_rec([('', 0.), ('', 0.), ('', 0.), ('', 0.), ('', 0.)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
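Bridging back to the original problem: with nrows= and fill=True you can pre-allocate the output table and copy the filtered rows in by index. A sketch, untested at scale, assuming outarray holds the filtered rows from the earlier loop, and with hypothetical filenames input.fits / filtered.fits standing in for the real ones:
from astropy.io import fits

with fits.open('input.fits') as hdul:
    table = hdul[1]
    # Pre-allocate an empty table with the same column structure.
    new_hdu = fits.BinTableHDU.from_columns(table.columns,
                                            nrows=len(outarray), fill=True)
    # Copy each filtered row into the pre-allocated rows.
    for i, row in enumerate(outarray):
        new_hdu.data[i] = tuple(row)
    new_hdu.writeto('filtered.fits')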

Writing values to an empty dataframe column in Julia

There is a dataframe with some values. I need to add an empty column to it, so that I can fill in its individual cells later, as information becomes available. The type of the values is known in advance; for example, let it be Float64. But if during initialization I set the contents of the column to missing, then the type of the column is inferred as Missing, and no numeric values can be written there later. Here is an example illustrating the problem:
df = DataFrame(a = 1:3, b = 1.0:3.0)
insertcols!(df, 3, :c => missing)
df.b[1] = 5.0 # this works
df.c[1] = 7.0 # this gives an error
What is the right thing to do in this situation? Do I need to change the way the empty column is initialized, or the way values are written into its cells?
One possible way you could try:
insertcols!(df, :c => Vector{Union{Float64,Missing}}(missing,nrow(df)))
or:
df[!,:c] = Vector{Union{Float64,Missing}}(missing,nrow(df))
or, shorter, as mentioned by @Milan Bouchet-Valat:
df[!,:c] = missings(Float64, nrow(df))
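With any of these, the original assignment works, because the column's element type now allows both Float64 and missing. A quick check (missings comes from Missings.jl and is re-exported by DataFrames.jl in recent versions):
using DataFrames

df = DataFrame(a = 1:3, b = 1.0:3.0)
insertcols!(df, 3, :c => missings(Float64, nrow(df)))
df.c[1] = 7.0  # now works: eltype(df.c) is Union{Missing, Float64}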

How to make a scatter plot based on the values of a column in the data set?

I am given a data set that looks something like this
and I am trying to graph all the points with a 1 in the first column separately from the points with a 0, but I want to put them in the same chart.
I know the final result should be something similar to this
But I can't find a way to filter the points in Julia. I'm using LinearAlgebra, CSV, Plots, and DataFrames for my project, and so far I haven't found a way to make DataFrames storage types work nicely with Plots functions. I keep running into errors like "Cannot convert Float64 to series data for plotting" when I try plotting the points individually with a for loop as a filter, as shown in the code below:
filter = select(data, :1)
newData = select(data, 2:3)
# graph one initial point to create the plot
plot(newData[1,1], newData[1,2], seriestype = :scatter, title = "My Scatter Plot")
# add the additional points with the 1 in front
for i in 2:size(newData)
    if filter[i] == 1
        plot!(newData[i, 1], newData[i, 2], seriestype = :scatter, title = "My Scatter Plot")
    end
end
Other approaches have given me other errors, but I haven't recorded those.
I'm using Julia 1.4.0 and the latest versions of all of the packages mentioned.
Quick edit:
It might help to know that I am trying to replicate the "Nonlinear dimensionality reduction" section of this article: https://sebastianraschka.com/Articles/2014_kernel_pca.html#principal-component-analysis
With Plots.jl you can do the following (I am passing a fully reproducible code):
julia> df = DataFrame(c=rand(Bool, 100), x = 2 .* rand(100) .- 1);
julia> df.y = ifelse.(df.c, 1, -1) .* df.x .^ 2;
julia> plot(df.x, df.y, color=ifelse.(df.c, "blue", "red"), seriestype=:scatter, legend=nothing)
However, in this case I would additionally use StatsPlots.jl as then you can just write:
julia> using StatsPlots
julia> @df df plot(:x, :y, group=:c, seriestype=:scatter, legend=nothing)
If you want to do it manually by groups it is easiest to use the groupby function:
julia> gdf = groupby(df, :c);
julia> summary(gdf) # check that we have 2 groups in data
"GroupedDataFrame with 2 groups based on key: c"
julia> plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
julia> plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
Note that the gdf variable is bound to a GroupedDataFrame object, from which you can get the groups defined by the grouping column (:c in this case).

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with Series labels prepended as the first column ("Repo"):
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLib 140.0 47.0
Here are the DataFrame columns:
p(df.columns)
[u'Repo', u'AllTests', u'Restricted']
So the first column holds the string labels, and the second and third columns are data values. We want one series per row, corresponding to the Galactian and Forecast-MLlib repos.
It would seem this is a common task and there would be a straightforward way to simply plot the DataFrame. However, the following related question does not provide any simple way: it essentially throws away the DataFrame's structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series, one that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining columns as series data points?
Update: Here is a self-contained snippet. (npa, p, and ps are shorthand aliases from the original post; defining them as np.array and print here is an assumption.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

npa = np.array  # assumed shorthand for np.array
p = print       # assumed shorthand for print
ps = print      # assumed shorthand for print

runtimes = npa([1860., 410., 140., 47.])
runtimes.shape = (2, 2)
labels = npa(['Galactian', 'Forecast-MLlib'])
labels.shape = (2, 1)
rtlabels = np.concatenate((labels, runtimes), axis=1)
rtlabels.shape = (2, 3)
colnames = ['Repo', 'AllTests', 'Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()
And here is the output:
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
And with piRSquared's help it looks like this:
So the data is showing now... but the series and labels are swapped. I will look further to try to line them up properly.
Another update:
By flipping the columns/labels, the series come out as desired.
The change was:
labels = npa(['AllTests','Restricted'])
..
colnames = ['Repo','Galactian','Forecast-MLlib']
So the updated code is:
runtimes = npa([1860., 410., 140., 47.])
runtimes.shape = (2, 2)
labels = npa(['AllTests', 'Restricted'])
labels.shape = (2, 1)
rtlabels = np.concatenate((labels, runtimes), axis=1)
rtlabels.shape = (2, 3)
colnames = ['Repo', 'Galactian', 'Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
This makes each measurement column (AllTests, Restricted) a series. Or transpose, so that each repo becomes a series:
df.set_index('Repo').T.astype(float).plot()
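For the OP's stated goal (one series per repo), the transposed form is the one that matches. A minimal sketch, assuming the original df with columns Repo, AllTests, and Restricted; the y-axis label is an assumption:
import matplotlib.pyplot as plt

ax = df.set_index('Repo').T.astype(float).plot()  # rows (repos) become series
ax.set_ylabel('test runtime')  # assumed label, not from the OP
plt.show()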