Select cells in a pandas DataFrame by a Series of its column labels - pandas

Say we have a DataFrame and a Series of its column labels, both (almost) sharing a common index:
df = pd.DataFrame(...)
s = df.idxmax(axis=1).shift(1)
How can I select one cell per row, using the corresponding column label from the joined Series? I'd imagine something like:
values = df[s] # either
values = df.loc[s] # or
In my example I'd like to get, for each row, the value in the column that held the previous row's maximum (I'm doing a poor man's ML :) ).
However I cannot find any interface for selecting cells by a Series of column labels. Any ideas, folks?
Meanwhile I use this monstrous snippet:
def get_by_idxs(df: pd.DataFrame, idxs: pd.Series) -> pd.Series:
    ts_v_pairs = [
        (ts, row[row['idx']])
        for ts, row in df.join(idxs.rename('idx'), how='inner').iterrows()
        if isinstance(row['idx'], str)
    ]
    return pd.Series([v for ts, v in ts_v_pairs], index=[ts for ts, v in ts_v_pairs])

I think you need a DataFrame lookup:
v = s.dropna()
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index), df.columns.get_indexer_for(v)]
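For example (a minimal, self-contained sketch; the idxmax/shift pipeline mirrors the question, the frame's values are made up):

import pandas as pd

df = pd.DataFrame({'a': [1, 5, 3], 'b': [4, 2, 6]}, index=['t1', 't2', 't3'])
s = df.idxmax(axis=1).shift(1)   # t1: NaN, t2: 'b', t3: 'a'

v = s.dropna()
# one value per row: row positions from v's index, column positions from v's values
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index), df.columns.get_indexer_for(v)]
print(v)   # t2 -> 2 (df.loc['t2', 'b']), t3 -> 3 (df.loc['t3', 'a'])

Using df.index.get_indexer_for(v.index) rather than a plain range(len(v)) keeps the row positions aligned after dropna() removes the NaN that shift(1) introduced.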

Related

concatenate multiple dfs with the same dimensions, apply a function to the cell values of all dfs, and store the result in the cell

df1 = pd.DataFrame(np.random.randint(0,9,size=(2, 2)))
df2 = pd.DataFrame(np.random.randint(0,9,size=(2, 2)))
Let's say I concatenate df1 and df2 (in the real case I have many dfs, each 700*200) so that I get something like the table below (I don't need to see this table, it's just for explanation):
        col a   col b
row a   [1,4]   [7,8]
row b   [9,2]   [2,0]
Then I want to pass each cell's values to the compute function below and write the result back to the cell:
def compute(row, column, cell_values):
    baseline_df = [2, 4, 6, 7, 8]
    result = baseline_df
    for values in cell_values:
        if (column - row) != dict[values]:  # dict contains specific values
            result = baseline_df
        else:
            result = result.apply(func, value=values)
    return result.loc[column - row]
def func(df, value):
    # operation
    result_df = df * value
    return result_df
What I want is to take df1 and df2, concatenate them, apply the function above, and get the results, in a really fast way.
In the actual use case the dfs are quite big, and running this for every cell would take a significant amount of time; I need a faster way to perform this.
Note:
This is my idea of how to do it; I hope the requirements are clear, and please let me know if not.
Currently I am using something like the line below, which just takes the max value of each cell so I can do the calculation (func) later. It only gives the max value of all cells combined:
dfs = pd.concat(grid).max(level=0)
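Note that the level argument of max() is deprecated in recent pandas (and removed in 2.0); assuming grid is the list of frames, the equivalent groupby form is:

dfs = pd.concat(grid).groupby(level=0).max()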
The final result should be something like this after the calculation (the same 2D layout with new cell data):
        col a   col b
row a   0.1     0.7
row b   0.9     0.6
Different approaches are also welcome
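One possible direction for the combination step (a sketch only, assuming all frames share the same shape, index, and columns; df1 and df2 are the frames from the question):

import numpy as np
import pandas as pd

frames = [df1, df2]
# stack into a (rows, cols, n_frames) array, so cell (i, j) holds all values for that cell
stacked = np.stack([f.to_numpy() for f in frames], axis=-1)
# any per-cell reduction then runs vectorized over the last axis, e.g. the max:
cell_max = pd.DataFrame(stacked.max(axis=-1), index=df1.index, columns=df1.columns)

A fully vectorized compute() would additionally need the dict lookup expressed as an array, which the question leaves unspecified.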

Create pandas dataframe from series with ordered dict as rows

I am trying to extract lmfit parameter results as DataFrames. I pass one column x and one column data through fit_func with parameters pars, and the valuesdict() of the minimize output comes back as an OrderedDict.
out = minimize(fit_func, pars, method = 'leastsq', args=(x, data))
res = out.params.valuesdict()
res
Output:
OrderedDict([('a1', 12.850309404600393),
('c1', 1346.833513206811),
('s1', 44.22337472274829),
('f1', 1.1275639898142586),
('a2', 77.15732669480884),
('c2', 1580.5712512351947),
('s2', 16.239969775527275),
('f2', 0.8684363668111492)])
For a single fit I achieved the DataFrame output I want with pd.DataFrame(res, index=[0]).
I have 3 data columns that I want to quickly fit:
x = d.iloc[:, 0]
fit_odict = pd.DataFrame(
    d.iloc[:, 1:4]
    .apply(lambda y: minimize(fit_func, pars, method='leastsq', args=(x, y))
                     .params.valuesdict()),
    index=[1])
But I get a Series of OrderedDicts as the rows of the DataFrame.
How do I get the output I want with the three parameter results as rows ? Is there a better way to apply the function?
UPDATE:
I incorporated M Newville's answer into my solution. It might be helpful for those who want to quickly extract lmfit parameter results from multiple data columns d1.iloc[:,1:]:
def fff(cols):
    out = minimize(fit_func, pars, method='leastsq', args=(x, cols))
    return {key: par.value for key, par in out.params.items()}

results = d1.iloc[:, 1:].apply(fff, result_type='expand').transpose()
Output: a DataFrame with one row per data column and one column per fitted parameter.
For a single fit, this would probably be what you are looking for:
out = minimize(fit_func, pars, method = 'leastsq', args=(x, data))
fit_odict = pd.DataFrame({key: [par.value] for key, par in out.params.items()})
I think you probably are looking for something like this:
results = {key: [] for key in pars}
for data in datasets:
    out = minimize(fit_func, pars, method='leastsq', args=(x, data))
    for par_name, val_list in results.items():
        val_list.append(out.params[par_name].value)
results = pd.DataFrame(results)
You could probably stuff that all into a single long line, but I wouldn't recommend it -- someone may want to read that code ;).
This is a quick workaround you can do. The code is not efficient, but you can optimize it. Note that the index starts at 1, but you are welcome to re-index using pandas.
import pandas as pd

# Your output is a list of tuples
OrderedDict = [('a1', 12.850309404600393), ('c1', 1346.833513206811),
               ('s1', 44.22337472274829), ('f1', 1.1275639898142586),
               ('a2', 77.15732669480884), ('c2', 1580.5712512351947),
               ('s2', 16.239969775527275), ('f2', 0.8684363668111492)]

# Create a DataFrame from the list of tuples and transpose it
df = pd.DataFrame(OrderedDict).T

# Use the first row as the DataFrame column names
columns = df.loc[0].values.tolist()
df.columns = columns
output = df.drop(df.index[0])
output
        a1       c1       s1       f1       a2       c2     s2        f2
1  12.8503  1346.83  44.2234  1.12756  77.1573  1580.57  16.24  0.868436
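A shorter route to the same one-row frame (a sketch, reusing the list of tuples above): wrapping a plain dict in a list makes each key a column.

import pandas as pd

res = dict(OrderedDict)       # the list of tuples from above, as a dict
output = pd.DataFrame([res])  # one row, parameter names as columns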

pandas: appending a row to a dataframe with values derived using a user defined formula applied on selected columns

I have a dataframe as
df = pd.DataFrame(np.random.randn(5,4),columns=list('ABCD'))
I can use the following to apply a traditional calculation like mean(), sum(), etc.:
df.loc['calc'] = df[['A','D']].iloc[2:4].mean(axis=0)
Now I have two questions:
1. How can I apply a formula (like exp(mean()) or 2.5*mean()/sqrt(max())) to columns 'A' and 'D' for rows 2 to 4?
2. How can I append a row to the existing df where two values are the mean() of A and D, and two values are the result of a specific formula applied to C and B?
Q1:
You can use .apply() and lambda functions.
df.iloc[2:4,[0,3]].apply(lambda x: np.exp(np.mean(x)))
df.iloc[2:4,[0,3]].apply(lambda x: 2.5*np.mean(x)/np.sqrt(max(x)))
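To attach such a result as a new row, the same .loc pattern from the question works; columns missing from the result become NaN:

df.loc['calc'] = df.iloc[2:4, [0, 3]].apply(lambda x: np.exp(np.mean(x)))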
Q2:
You can build dictionaries, combine them, and append the result as a row.
The first one holds the means; the second one holds some custom function:
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum()*45))
Combine them:
ad.update(bc)
df = df.append(ad, ignore_index=True)
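Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer pandas the same row append can be written with pd.concat:

df = pd.concat([df, pd.DataFrame([ad])], ignore_index=True)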

How to insert a column in a julia DataFrame at specific position (without referring to existing column names)

I have a DataFrame in Julia with hundreds of columns, and I would like to insert a column after the first one.
For example in this DataFrame:
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
I would like to insert a column area after colour, but without referring specifically to shape and border (which in my real case are hundreds of different columns):
df[:area] = [1,2]
In this example I can use (but referring specifically to shape and border):
df = df[[:colour, :area, :shape, :border]] # with specific reference to shape and border names
Update: this function has changed; see @DiegoJavierZea's comment.
Well, congratulations on finding a workaround yourself, but there is a built-in function that is semantically clearer and possibly a little bit faster:
using DataFrames
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
insert!(df, 3, [1,2], :area)
Here 3 is the index the new column will have after insertion, [1,2] is its content, and :area is the name. You can find more detailed documentation by typing ?insert! in the REPL after loading the DataFrames package.
It is worth noting that the ! is a part of the function name. It's a Julia convention to indicate that the function will mutate its argument.
rows = size(df)[1]      # the tuple gives you (rows, columns) of the DataFrame
insertcols!(df,         # DataFrame to be changed
    1,                  # insert as column 1
    :Day => 1:rows,     # populate as "Day" with 1,2,3,..
    makeunique=true)    # if the column name already exists, make it Day_1
While making the question I also found a solution (as often happens).
I still post the question here to keep it on record (for myself) and for others.
It is enough to save the column names before "adding" the new column:
df = DataFrame(
    colour = ["green","blue"],
    shape = ["circle", "triangle"],
    border = ["dotted", "line"]
)
dfnames = names(df)
df[:area] = [1,2]
df = df[vcat(dfnames[1:1],:area,dfnames[2:end])]

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with Series labels prepended as the first column ("Repo"):
             Repo  AllTests  Restricted
0       Galactian    1860.0       410.0
1  Forecast-MLlib     140.0        47.0
Here are the DataFrame columns:
p(df.columns)
[u'Repo', u'AllTests', u'Restricted']
So the first column holds the string labels and the second and third columns hold data values. We want one series per row, corresponding to the Galactian and Forecast-MLlib repos.
It would seem this is a common task and there would be a straightforward way to simply plot the DataFrame. However the following related question does not provide any simple way: it essentially throws away the DataFrame's structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series, one that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining columns as series data points?
Update: here is a self-contained snippet (npa and ps are shorthand helpers for np.array and printing):
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['Galactian','Forecast-MLlib'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','AllTests','Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()
And here is the output:
             Repo AllTests Restricted
0       Galactian   1860.0      410.0
1  Forecast-MLlib    140.0       47.0
And with piRSquared's help the data is showing now, but the series and labels are swapped. I will look further to try to line them up properly.
Another update: by flipping the columns/labels, the series come out as desired. The change was:
labels = npa(['AllTests','Restricted'])
..
colnames = ['Repo','Galactian','Forecast-MLlib']
So the updated code is
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['AllTests','Restricted'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','Galactian','Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
Or
df.set_index('Repo').T.astype(float).plot()
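Put together for the data above (a sketch; set_index moves the labels out of the data columns, and .T flips which axis becomes the plotted series, so each repo gets its own line):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Repo': ['Galactian', 'Forecast-MLlib'],
                   'AllTests': [1860.0, 140.0],
                   'Restricted': [410.0, 47.0]})
df.set_index('Repo').T.plot()   # x axis: AllTests / Restricted; one line per repo
plt.show()

Here astype(float) is unnecessary because the frame is built with numeric columns directly; in the question it was needed only because concatenating labels and numbers into one array made everything a string.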