is there a way to subset an AnnData object after reading it in? - pandas

I read in the excel file like so:
data = sc.read_excel('/Users/user/Desktop/CSVB.xlsx', sheet='Sheet1', dtype=object)
There are 3 columns in this data set that I need to work with as .obs but it looks like everything is in the .X data matrix.
Anyone successfully subset after reading in the file or is there something I need to do beforehand?

Okay, so assuming sc stands for the scanpy package, read_excel just takes the first row as .var and the first column as .obs of the AnnData object.
The data returned by read_excel can be tweaked a bit to get what you want.
Let's say the index of the three columns you want in the .obs are stored in idx variable.
idx = [1,2,4]
Now, .obs is just a pandas DataFrame, and data.X is just a NumPy array (see the AnnData documentation), so the job is simple.
# assign some names to the new columns
new_col_names = ['C1', 'C2', 'C3']
# add the columns to data.obs
data.obs[new_col_names] = data.X[:,idx]
If you also wish to remove the idx columns from data.X, I suggest making a new AnnData object for that.
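A minimal sketch of that last step, assuming data and idx are as above (the variable names below are just for illustration):
import anndata as ad

# keep only the columns of .X that were not moved into .obs
keep = [i for i in range(data.X.shape[1]) if i not in idx]

new_data = ad.AnnData(
    X=data.X[:, keep],
    obs=data.obs.copy(),            # carries the new C1, C2, C3 columns
    var=data.var.iloc[keep].copy(), # drop the matching rows of .var
)
This leaves the original data untouched and gives you a clean object with the reduced .X to work with.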

Related

Encoding feature array column from training set and applying to test set later

I have input columns that contain arrays of features. A feature is listed if present and absent if not; order is not guaranteed. E.g.:
features = pd.DataFrame({"cat_features":[['cuddly','short haired'],['short haired','bitey'],['short haired','orange','fat']]})
This works:
feature_table = pd.get_dummies(features['cat_features'].explode()).add_prefix("cat_features_").groupby(level=0).sum()
Problems:
It's not trivial to ensure the same output columns on my test set when features are missing.
My real dataset has multiple such array columns, but I can't explode them all at once (ValueError: columns must have matching element counts), so I have to loop over each array column.
One option: make a dtype and save it for later ("skinny" is added as an example of a category not in our input set):
from pandas.api.types import CategoricalDtype
cat_feature_type = CategoricalDtype([x.replace("cat_features_","") for x in feature_table.columns.to_list()]+ ["skinny"])
pd.get_dummies(features["cat_features"].explode().astype(cat_feature_type)).add_prefix("cat_features_").groupby(level=0).sum()
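For illustration, reusing the saved dtype on a hypothetical test set (the rows below are invented; this continues from the snippets above) keeps the output columns aligned, because get_dummies on a categorical emits one column per category even when that category is absent:
# hypothetical test data; 'skinny' appears here but not in training
test_features = pd.DataFrame({"cat_features": [['orange'], ['cuddly', 'skinny']]})
test_table = (
    pd.get_dummies(test_features["cat_features"].explode().astype(cat_feature_type))
      .add_prefix("cat_features_")
      .groupby(level=0)
      .sum()
)
# test_table now has the same columns, in the same order, as the training table built with cat_feature_type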
Is there a smarter way of doing this?

Loop over list of DataFrame names and save to csv in PySpark

I need to save a bunch of PySpark DataFrames as csv tables. The tables should also have the same names as the DataFrames.
The code should be something like this:
for table in ['ag01', 'a5bg', 'h68chl', 'df42', 'gh63', 'ur55', 'ui99']:
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/' + table + '.csv'
    table.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)
The problem is that in the line table.repartition(1)... I need the actual DataFrame objects, not their names as strings, so in this form the code doesn't work. But if I write for table in [ag01, a5bg, ...], i.e. without quotes in the list, then I cannot build the path, because I cannot concatenate a DataFrame and a string. How can I resolve this dilemma?
Thanks in advance!
Having a bunch of separate variables like this is not considered good coding practice; you should have used a list or a dictionary in the first place. But if you're already stuck with this, you can use eval to get the DataFrame stored in each variable.
for table in ['ag01', 'a5bg', 'h68chl', 'df42', 'gh63', 'ur55', 'ui99']:
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/' + table + '.csv'
    df = eval(table)
    df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)
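As a sketch of the dictionary approach mentioned above (df_ag01, df_a5bg, etc. are placeholders for however your DataFrames are actually named), keeping the DataFrames in a dict avoids eval entirely:
# placeholder variable names; substitute your real DataFrames here
tables = {'ag01': df_ag01, 'a5bg': df_a5bg, 'h68chl': df_h68chl}

for name, df in tables.items():
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/' + name + '.csv'
    df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)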

Pandas str slice in combination with Pandas str index

I have a DataFrame containing a single column with a list of file names. I want to find all rows whose value starts with a prefix from a set of known prefixes.
I know I can run a simple for loop, but I want to do it with DataFrame operations to compare speeds and run benchmarks - it's also a nice exercise.
What I had in mind is combining str.slice with str.index, but I can't get it to work. This is what I have in mind:
import pandas as pd
file_prefixes = {...}
file_df = pd.DataFrame(list_of_file_names)
file_df.loc[file_df.file.str.slice(start=0, stop=file_df.file.str.index('/')-1).isin(file_prefixes), :] # this doesn't work, as str.index returns a Series rather than a scalar
My hope is that this code will return all rows whose value starts with a file prefix from the set above.
In summary, I would like help with 2 things:
Combining slice and index
Thoughts about better ways to achieve this
Thanks
I would use startswith:
file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
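A minimal runnable sketch of that approach (the file names and prefixes below are invented for illustration):
import pandas as pd

file_prefixes = {"logs/", "images/"}
file_df = pd.DataFrame({"file": ["logs/a.txt", "images/b.png", "docs/c.pdf"]})

# keep only rows whose file name starts with one of the known prefixes
matches = file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
print(matches)  # the logs/ and images/ rows only
Note that str.startswith takes a tuple of prefixes, which is why tuple(file_prefixes) is used rather than the set itself.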

Reading .txt into Julia DataFrame as Date Type

Is there a way to read date ("2000-01") variables from text files into a Julia DataFrame directly, as a date? There's no documentation on this from what I have seen.
df = readtable("path/dates.txt", eltypes = [Date, Date])
This doesn't work, even though it seems like it should. My usual process is to read the dates in as strings and then loop over each row to create a new date variable. This has become a bottleneck in some of my processes now, due to the size of the DataFrames.
My usual flow is to do something like this:
full_df[:real_date] = Date(full_df[:temp_dte_string], "m/d/y")
Thank you
I don't think there's currently any way to do the loading in a single step like your first suggested code. However you can speed up the second method somewhat by making a DateFormat object and calling Date with that instead of with a string.
(This is mentioned briefly in the Dates documentation.)
dfmt = Dates.DateFormat("m/d/y")
full_df[:real_date] = Date(full_df[:temp_dte_string], dfmt)
(For some reason I thought Date was not vectorized and have been doing this inside a for loop in all my code. Whoops.)
By delete a variable, do you mean delete a column or a row? If you mean the former, then there are a few other ways to do this, including things like
function vectorin(a, b) # IMHO this should be in Base
    bset = Set(b)
    [i in bset for i in a]
end
df = DataFrame(A1="", A2="", A3="", D="", E="", F="") #Some long list of columns
badCols = [:D, :F] #Some long list of columns you want to remove
df = df[names(df)[!vectorin(names(df), badCols)]]
Sometimes I read in csv files with a lot of extra columns, then just do something like
df = readtable("data.csv")
df = df[[:Only, :the, :cols, :I, :want]]

How to Mutate a DataFrame?

I am trying to remove some columns from my data frame and would prefer not to return the modified data frame and reassign it to the old one. Instead, I would like the function to just modify the data frame. This is what I tried, but it does not seem to be doing what I expect. I was under the impression that arguments are passed by reference and not by value?
function remove_cols!(df::DataFrame, cols)
    df = df[setdiff(names(df), cols)];
end
df = DataFrame(x = [1:10], y = [11:20]);
remove_cols!(df, [:y]); # this does not modify the original data frame
Of course the line below works, but I would prefer it if remove_cols! just changed the df in place:
df = remove_cols!(df, [:y]);
How can I change the df in place inside my function?
Thanks!
As I understand it, Julia uses what is called pass-by-sharing, meaning that the reference is passed by value. So when you pass the DataFrame to the function, a new reference to the DataFrame is created which is local to the function. When you reassign that local df variable to a new reference, it has no effect on the separate global variable and its separate reference to the original DataFrame.
There is also a function in DataFrames.jl for deleting columns from a DataFrame in place (in current versions, select!(df, Not(cols))).
To answer the more general question of how to mutate a dataframe inside your own function, the key is to use functions and operations that mutate the dataframe within the function. For example, see the function below, which builds upon the standard dataframe append! function with some added benefits: it can append from any number of dataframes, the order of columns does not matter, and missing columns will be added to the dataframes:
function append_with_missing!(df1::DataFrame, dfs::AbstractDataFrame...)
    columns = Dict{Symbol, Type}(zip(names(df1), colwise(eltype, df1)))
    for df in dfs
        columns_temp = Dict(zip(names(df), colwise(eltype, df)))
        merge!(columns, columns_temp)
    end
    for (n, t) in columns, df in [df1; [i for i in dfs]]
        n in names(df) || (df[n] = Vector{Union{Missing,t}}(missing, size(df, 1)))
    end
    for df in dfs
        append!(df1, df[names(df1)])
    end
end
Here, the first dataframe passed is itself mutated, with rows added from the other dataframes.
(The functionality for adding missing columns is based upon the answer given by @Bogumił Kamiński here: Breaking change on vcat when columns are missing)