More efficient way to construct a DataFrame through looped function execution - pandas

I have written a parser that extracts information from raw HTML source code and returns it as a tuple. I need to call this function in a loop and use the returned tuples to construct a DataFrame (each call's return value becoming a row). Here's what I have done:
import pandas as pd
import leveldb

for key, value in db.RangeIter():
    html = db.Get(key)
    result = parser(html)
    df = df.append(pd.Series(result, index=index), ignore_index=True)
Note that parser and index are already defined, and db is a leveldb object which stores all links and their corresponding HTML source code. My question is: what is a more efficient way to construct this DataFrame? Thanks!

I would create a DataFrame before the loop starts, then append successive DataFrames to it. Note that if result is a tuple, it needs to be converted to a list before being converted into a DataFrame, and I assume your index is already a list. So:
df = pd.DataFrame()
for key, value in db.RangeIter():
    html = db.Get(key)
    result = parser(html)
    df = df.append(pd.DataFrame(list(result), index=index).transpose())
df.reset_index(inplace=True)
This is not to say your parser could not return data in a form that is more efficient for building a DataFrame, but I'm working within the confines of a single returned tuple.
Also, depending on the number of elements in the tuple, it may be more efficient to build plain Python lists within the loop and then create the DataFrame from those lists once the loop completes, but you don't state the tuple size.
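For illustration, a minimal sketch of that list-accumulation approach (assuming, as in the question, that db, parser, and index are already defined, and treating index as the column labels):
rows = []
for key, value in db.RangeIter():
    html = db.Get(key)
    # accumulate one tuple per row; appending to a plain list is cheap
    rows.append(parser(html))
# build the DataFrame once, avoiding repeated copying on every append
df = pd.DataFrame(rows, columns=index)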

Related

How to find all variable objects in memory that are DataFrames and put them in a dictionary with the same name

I have the following situation:
I get multiple DataFrames stored in memory while the script is running, e.g. by loading them from .csv, fetching Twitter data, etc.
Now I want to store all of them in one new object (a dictionary) whose keys are the variable names.
That means I want to search over all variables, and if a variable is a DataFrame, add it to a dictionary with key = name of the variable and value = the DataFrame itself.
My first approach was to put everything in a list, but in the result the elements are strings and no longer DataFrames:
import sys
dfs = []
for var in dir():
    if isinstance(eval(var), pd.core.frame.DataFrame) and var[0] != '_':
        dfs.append(var)
My second approach is:
v = %who_ls DataFrame
k = []
for var in dir():
    if isinstance(locals()[var], pd.core.frame.DataFrame) and var[0] != '_':
        k.append(var)
dfs = dict(zip(k, v))
But from v = %who_ls DataFrame I receive only a list of strings, not a list of DataFrames.
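A minimal sketch of one way to collect the actual DataFrame objects (an assumption on my part: it uses globals() instead of %who_ls and assumes the DataFrames live at the top level of the session):
import pandas as pd

# iterate over a copy of globals() so the dict doesn't change size mid-loop;
# keep name -> object pairs for every top-level DataFrame
dfs = {name: obj for name, obj in dict(globals()).items()
       if isinstance(obj, pd.DataFrame) and not name.startswith('_')}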

Pandas: using Series.str.slice most efficiently with row-varying parameters

My derived column is a substring of another column, but the new string must be extracted at varying positions. In the code below I have done this using a lambda. However, this is slow. Is it possible to achieve the correct result using str.slice, or is there another fast method?
import pandas as pd
df = pd.DataFrame({'st_col1': ['aa-b', 'aaa-b']})
df['index_dash'] = df['st_col1'].str.find('-')
# gives wrong answer at index 1
df['res_wrong'] = df['st_col1'].str.slice(3)
# what I want to do:
df['res_cant_do'] = df['st_col1'].str.slice(df['index_dash'])
# slow solution
# naively invoking the built-in Python string slicing ... aStr[start:]
# ... accessing two columns from every row in turn
df['slow_sol'] = df.apply(lambda x: x['st_col1'][1 + x['index_dash']:], axis=1)
So can this be sped up, ideally using str.slice, or via another method?
From what I understand, you want to get the value after the "-" in st_col1 and put it in a single column. For that, just use split:
df['slow_sol'] = df['st_col1'].str.split('-').str[-1]
There is no need to identify the index and then slice again at the given dash position. This will surely be more efficient than what you are doing, and it cuts out a lot of steps.
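For illustration, applying this to the sample frame from the question extracts the text after the dash for every row:
import pandas as pd

df = pd.DataFrame({'st_col1': ['aa-b', 'aaa-b']})
# split on the dash and keep the last piece of each string
df['slow_sol'] = df['st_col1'].str.split('-').str[-1]
print(df)
#   st_col1 slow_sol
# 0    aa-b        b
# 1   aaa-b        b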

Is there a way to subset an AnnData object after reading it in?

I read in the Excel file like so:
data = sc.read_excel('/Users/user/Desktop/CSVB.xlsx', sheet='Sheet1', dtype=object)
There are 3 columns in this data set that I need to work with as .obs, but it looks like everything is in the .X data matrix.
Has anyone successfully subset after reading in the file, or is there something I need to do beforehand?
Okay, so assuming sc stands for the scanpy package, read_excel just takes the first row as .var and the first column as .obs of the AnnData object.
The data returned by read_excel can be tweaked a bit to get what you want.
Let's say the indices of the three columns you want in .obs are stored in the idx variable.
idx = [1,2,4]
Now, .obs is just a Pandas DataFrame, and data.X is just a Numpy matrix (see here). Thus, the job is simple.
# assign some names to the new columns
new_col_names = ['C1', 'C2', 'C3']
# add the columns to data.obs
data.obs[new_col_names] = data.X[:, idx]
If you wish to remove the idx columns from data.X, I suggest making a new AnnData object for this, as sketched below.
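A minimal sketch of that step (an assumption on my part, using the anndata constructor directly; adjust to your actual object):
import anndata as ad
import numpy as np

# keep every column of X except the ones moved into .obs,
# and subset .var to match the remaining columns
keep = np.setdiff1d(np.arange(data.X.shape[1]), idx)
data = ad.AnnData(X=data.X[:, keep], obs=data.obs, var=data.var.iloc[keep])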

Pandas str slice in combination with Pandas str index

I have a DataFrame containing a single column with a list of file names. I want to find all rows in the DataFrame whose value starts with one of a set of known prefixes.
I know I can run a simple for loop, but I want to do it with DataFrame operations to check speeds and run benchmarks - it's also a nice exercise.
What I had in mind is combining str.slice with str.index, but I can't get it to work. This is what I have in mind:
import pandas as pd

file_prefixes = {...}
file_df = pd.DataFrame(list_of_file_names)
# this doesn't work, as str.index returns a Series rather than a scalar
file_df.loc[file_df.file.str.slice(start=0, stop=file_df.file.str.index('/') - 1).isin(file_prefixes), :]
My hope is that this code will return all rows whose value starts with a file prefix from the set above.
In summary, I would like help with 2 things:
Combining slice and index
Thoughts about better ways to achieve this
Thanks
I would use startswith:
file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
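For illustration, a quick usage example with hypothetical prefixes and file names (not from the question):
import pandas as pd

file_prefixes = {'logs/', 'data/'}
file_df = pd.DataFrame({'file': ['logs/a.txt', 'data/b.csv', 'tmp/c.bin']})

# str.startswith accepts a tuple of prefixes, so no slicing is needed
matches = file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
print(matches)  # keeps only the 'logs/' and 'data/' rows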

How to Mutate a DataFrame?

I am trying to remove some columns from my data frame and would prefer not to return the modified data frame and reassign it to the old one. Instead, I would like the function to just modify the data frame in place. This is what I tried, but it does not seem to be doing what I expect. I was under the impression that arguments are passed by reference and not by value?
function remove_cols!(df::DataFrame, cols)
    df = df[setdiff(names(df), cols)]
end
df = DataFrame(x = [1:10], y = [11:20]);
remove_cols!(df, [:y]); # this does not modify the original data frame
Of course the below works, but I would prefer if remove_cols! just changed the df in place:
df = remove_cols!(df, [:y]);
How can I change the df in place inside my function?
Thanks!
As I understand it, Julia uses what is called pass-by-sharing, meaning that the reference is passed by value. So when you pass the DataFrame to the function, a new reference to the DataFrame is created which is local to the function. When you reassign the local df variable to its own new DataFrame, it has no effect on the separate global variable and its separate reference to the original DataFrame.
There is a function in DataFrames.jl for deleting columns from DataFrames (in recent versions, select!(df, Not(cols)) mutates df in place).
To answer the general question of how to mutate a dataframe in your own function, the key is to use functions and operations that mutate the dataframe within the function. For example, see the function below, which builds upon the standard dataframe append! function with some added benefits: it can append from any number of dataframes, the order of columns does not matter, and missing columns will be added to the dataframes:
function append_with_missing!(df1::DataFrame, dfs::AbstractDataFrame...)
    columns = Dict{Symbol, Type}(zip(names(df1), colwise(eltype, df1)))
    for df in dfs
        columns_temp = Dict(zip(names(df), colwise(eltype, df)))
        merge!(columns, columns_temp)
    end
    for (n, t) in columns, df in [df1; [i for i in dfs]]
        n in names(df) || (df[n] = Vector{Union{Missing,t}}(missing, size(df, 1)))
    end
    for df in dfs
        append!(df1, df[names(df1)])
    end
end
Here, the first dataframe passed is itself mutated, with rows added from the other dataframes.
(The functionality for adding missing columns is based upon the answer given by Bogumił Kamiński here: Breaking change on vcat when columns are missing)