Select rows with missing value in a Julia dataframe

I've just started exploring Julia and am struggling with subsetting dataframes. I would like to select rows where LABEL has the value "B" and VALUE is missing. Selecting rows with "B" works fine, but adding a filter for missing fails. Any suggestions on how to solve this? Tips for good documentation on subsetting/filtering dataframes in Julia are welcome; I haven't found a solution in the Julia documentation.
using DataFrames
df = DataFrame(ID = 1:5, LABEL = ["A", "A", "B", "B", "B"], VALUE = ["A1", "A2", "B1", "B2", missing])
df[df[:LABEL] .== "B", :] # works fine
df[df[:LABEL] .== "B" && df[:VALUE] .== missing, :] # fails

Use:
filter([:LABEL, :VALUE] => (l, v) -> l == "B" && ismissing(v), df)
(a very similar example is given in the documentation of the filter function).
If you want to use getindex then write:
df[(df.LABEL .== "B") .& ismissing.(df.VALUE), :]
The fact that you need to use .& instead of && when working with arrays is not DataFrames.jl specific - this is a common pattern in Julia in general when indexing arrays with booleans.

Related

Cannot append items to end of list and concatenate data frames

I am looping through data frame columns to obtain specific pieces of data - so far I have been successful except for the last three pieces of data I need. When I attempt to append these pieces of data to the list, they are appended to the beginning of the list and not at the end (I need them to be at the end).
Therefore, when I convert this list into a data frame and attempt to concatenate it to another data frame I have prepared, the values are all in the wrong places.
This is my code:
descs = ["a", "b", "c", "d", "e", "f", "g"]
data = []
stats = []
for desc in descs:
    data.append({
        "Description": desc
    })
for column in df:
    if df[column].name == "column1":
        counts = df[column].value_counts()
        stats.extend([
            counts.sum(),
            counts[True],
            counts[False]
        ])
    elif df[column].name == "date_column":
        stats.append(
            df[column].min().date()
        )
    # Everything is fine up until this `elif` block
    # I THINK THIS IS WHERE THE PROBLEM IS I DONT KNOW HOW TO FIX IT
    elif df[column].name == "column2":
        stats.extend([
            df[column].max(),
            round(df[column].loc[df["column1"] == True].agg({column: "mean"}), 2),
            round(df[column].loc[df["column1"] == False].agg({column: "mean"}), 2)
        ])
Up until the second elif block, when I run this code and concatenate data and stats as data frames with pd.concat([pd.DataFrame(data), pd.DataFrame({"Statistic": stats})], axis=1), I get the following output, which is the output I want:
Description  Statistic
"a"          38495
"b"          3459
"c"          234
"d"          1984-06-2
"e"          NaN
"f"          NaN
"g"          NaN
When I run the above code chunk including the second elif block, the output is messed up
Description  Statistic
"a"          [78, [454],[45]]
"b"          38495
"c"          3459
"d"          234
"e"          1984-06-2
"f"          NaN
"g"          NaN
Those values in the first index of the data frame [78, 454, 45] should be in the place (and in that order) where NaNs appear in the first table
What am I doing wrong?
You're really close to making this work the way you want!
A couple things to make your life simpler:
df[column].name isn't needed because you can just use column
Looping through columns and having multiple if statements on their names to calculate summary statistics works, but you'll make your life easier if you look into .groupby() with .agg()
And that brings me to your issue: .agg() returns a pandas Series, and you just want a single number. Try
round(df[column].loc[df["column1"] == False].mean(),2)
instead. :)
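The difference between the two calls can be checked on a toy frame. This is a minimal sketch, assuming a made-up four-row DataFrame that only reuses the column names column1 and column2 from the question; the values are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values); only the column names come from the question.
df = pd.DataFrame({
    "column1": [True, False, True, False],
    "column2": [10.0, 20.0, 30.0, 40.0],
})

# Rows of column2 where column1 is True.
subset = df.loc[df["column1"], "column2"]

# Passing a dict to .agg() yields a Series with one entry per dict key.
as_series = subset.agg({"column2": "mean"})

# .mean() yields a single number, which is what the stats list needs.
as_scalar = round(subset.mean(), 2)

stats = []
stats.append(as_scalar)

print(type(as_series).__name__)
print(as_scalar)
```

Appending the Series instead of the scalar would put a whole pandas object into one slot of stats, which is the kind of nested entry shown in the second table.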
Update: Now it looks like you are hitting the second elif with the first column, so loop over the columns in the order you want them processed:
cols = ["column1", "date_column", "column2"]
for column in cols:
    if column == "column1":
        ...  # rest of the loop body as before

Get first column in dataframe that exists from list of column names

Given a list of column names, only some or none exist in a dataframe, what's the least verbose way of getting the first existing column or None?
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
cols = ["d", "e", "c"]
This is fairly short but fails with StopIteration for no matching columns:
col = next(filter(lambda c: c in df, cols))
df[col]
0 3
1 6
Name: c, dtype: int64
Is there a better way?
You can do it with:
col = next(filter(lambda c: c in df, cols), None)
One idea:
col = next(iter(df.columns.intersection(cols, sort=False)), None)
@Learnings is a mess answered it beautifully, and you should use that solution, but here is another one-line solution using the walrus operator.
col = intersect[0] if (intersect:= [c for c in cols if c in df.columns]) else None
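The default-argument form of next can be wrapped in a small helper. The sketch below is illustrative only: first_existing is a hypothetical name, and a plain list of column names stands in for the DataFrame, on the assumption that membership testing (c in df) is all the lookup needs.

```python
def first_existing(candidates, available):
    """Return the first candidate found in `available`, or None if none match."""
    # The second argument to next() is returned instead of raising StopIteration.
    return next((c for c in candidates if c in available), None)


columns = ["a", "b", "c"]  # stand-in for the DataFrame's column names

print(first_existing(["d", "e", "c"], columns))
print(first_existing(["x", "y"], columns))
```

Because `c in df` on a DataFrame tests column labels, the same helper should also accept the DataFrame itself as `available`.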

How to make a scatter plot based on the values of a column in the data set?

I am given a data set that looks something like this
and I am trying to graph all the points with a 1 on the first column separate from the points with a 0, but I want to put them in the same chart.
I know the final result should be something similar to this
But I can't find a way to filter the points in Julia. I'm using LinearAlgebra, CSV, Plots, DataFrames for my project, and so far I haven't found a way to make DataFrames storage types work nicely with Plots functions. I keep running into errors like Cannot convert Float64 to series data for plotting when I try plotting the points individually with a for loop as a filter as shown in the code below
filter = select(data, :1)
newData = select(data, 2:3)
#graph one initial point to create the plot
plot(newData[1,1], newData[1,2], seriestype = :scatter, title = "My Scatter Plot")
#add the additional points with the 1 in front
for i in 2:size(newData)
    if filter[i] == 1
        plot!(newData[i, 1], newData[i, 2], seriestype = :scatter, title = "My Scatter Plot")
    end
end
Other approaches have given me other errors, but I haven't recorded those.
I'm using Julia 1.4.0 and the latest versions of all of the packages mentioned.
Quick Edit:
It might help to know that I am trying to replicate the Nonlinear dimensionality reduction section of this article https://sebastianraschka.com/Articles/2014_kernel_pca.html#principal-component-analysis
With Plots.jl you can do the following (here is a fully reproducible example):
julia> df = DataFrame(c=rand(Bool, 100), x = 2 .* rand(100) .- 1);
julia> df.y = ifelse.(df.c, 1, -1) .* df.x .^ 2;
julia> plot(df.x, df.y, color=ifelse.(df.c, "blue", "red"), seriestype=:scatter, legend=nothing)
However, in this case I would additionally use StatsPlots.jl as then you can just write:
julia> using StatsPlots
julia> @df df plot(:x, :y, group=:c, seriestype=:scatter, legend=nothing)
If you want to do it manually by groups it is easiest to use the groupby function:
julia> gdf = groupby(df, :c);
julia> summary(gdf) # check that we have 2 groups in data
"GroupedDataFrame with 2 groups based on key: c"
julia> plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
julia> plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
Note that the gdf variable is bound to a GroupedDataFrame object, from which you can get the groups defined by the grouping column (:c in this case).

Discrepancies in Pandas groupby aggregates vs dataframe, particularly on axis=1

import pandas as pd
import numpy as np
def main():
    df = pd.DataFrame([["a", "b", "c", "k"],["d", "e", "f", "l"],['g','h','i', "J"]], columns=["ay", "be", "ce", "jay"])
    print(df)
    gb1 = df.groupby({"ay": "x", "be": "x"}, axis=1)
    gb2 = df.groupby({"ay": "x", "be": "x", "ce": "y", "jay": "y"}, axis=1)
    print("apply sum by axis 0")
    #print(df.apply(sum))
    print("fails")
    print("apply sum by axis 1")
    # print(df.apply(sum, axis=1))
    print("fails")
    print("agg sum by axis 0")
    print(df.agg(sum))
    print("agg sum by axis 1")
    print(df.agg(sum, axis=1))
    print("gb1 apply sum axis 1")
    print(gb1.apply(sum))
    print("gb1 agg sum axis 1")
    print(gb1.agg(sum))
    print("gb2 apply sum axis 1")
    # print(gb2.apply(sum))
    print("fails")
    print("gb2 agg sum axis 1")
    print(gb2.agg(sum))
    print(gb1.agg(lambda x: ";".join([x[0], x[1]]))
if __name__ == "__main__":
    main()
I don't understand the failures occurring and I don't understand why apply on groups fails with 2 groups but not with one.
I've solved my overall goal (I was trying to concatenate some strings of columns together) but I am concerned that I am somewhat bewildered by these failures.
The driving goal for reference was to be able to do
gb1.agg(lambda x: ";".join(x))
and I also don't understand why that doesn't work
especially since
gb1.agg(lambda x: ";".join([x[0], x[1]]) does
There's a lot to unpack in there.
print("apply sum by axis 0")
#print(df.apply(sum))
print("fails")
print("apply sum by axis 1")
# print(df.apply(sum, axis=1))
print("fails")
...the above are failing because you're apply-ing the Python sum function, which requires numerical types. You could use either of the following to fix that (which I think under the hood relies on the ability of numpy to handle the object dtypes that pandas converts them to):
df.apply(np.sum)
df.sum()
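The failure of the built-in sum can be reproduced without pandas. A minimal sketch, assuming a plain Python list stands in for one string column of the example frame: sum starts from the integer 0, so the first addition is 0 + "a", which raises TypeError, while str.join concatenates the same values without a numeric start value.

```python
column = ["a", "d", "g"]  # stand-in for the "ay" column of the example frame

try:
    sum(column)  # the first step evaluates 0 + "a"
    builtin_sum_failed = False
except TypeError:
    builtin_sum_failed = True

# String concatenation that does not depend on a numeric start value.
joined = "".join(column)

print(builtin_sum_failed)
print(joined)
```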
Next, these two items say axis=1 in the print statement, but aren't really:
print("gb1 apply sum axis 1")
print(gb1.apply(sum))
print("gb2 apply sum axis 1")
# print(gb2.apply(sum))
print("fails")
...if you add axis=1 they'll work and give sensible results.
Note that you have a missing closing parenthesis in:
gb1.agg(lambda x: ";".join([x[0], x[1]])
...both in the sample code and in the later comment about it.
It seems like you're saying that the final bit of code is what accomplishes your goal. The previous attempt:
gb1.agg(lambda x: ";".join(x))
...is joining the items in the index of the one group that is present instead of the individual series. Examine:
print(gb1.groups)
Finally, given your dataframe if what you wanted to do was concatenate columns with ";" between them, you could also do:
cols = ['ay','be']
df.apply(lambda x: ";".join((x[c] for c in cols)), axis=1)
or for a small number of items,
df['concat'] = df['ay'] + ";" + df['be']
...rather than using groupby.

Renaming columns in dataframe according to reference key

Let's assume I have a simple data frame:
df <- data.frame("one"= c(1:5), "two" = c(6:10), "three" =c(7:11))
I would like to rename my columns so that they match a reference.
Let my reference be the following:
df2 <- data.frame("Name" = c("A", "B", "C"), "Oldname" = c("one", "two", "three"))
How could I replace the column names of df with those from df2 where they match (so that the column names in df are A, B, C)?
In my original data, df2 is much bigger and I have multiple data sets such as df, so the solution should be as generic as possible. Thanks in advance!
We can use the match function here to map the new names onto the old ones:
names(df) <- df2$Name[match(names(df), df2$Oldname)]
names(df)
[1] "A" "B" "C"