Create pandas dataframe from series with ordered dict as rows - pandas

I am trying to extract lmfit parameter results as DataFrames. I pass one column x and one column data through a fit_func with parameters pars, and the minimize function in lmfit returns its parameters as an OrderedDict:
out = minimize(fit_func, pars, method = 'leastsq', args=(x, data))
res = out.params.valuesdict()
res
Output:
OrderedDict([('a1', 12.850309404600393),
             ('c1', 1346.833513206811),
             ('s1', 44.22337472274829),
             ('f1', 1.1275639898142586),
             ('a2', 77.15732669480884),
             ('c2', 1580.5712512351947),
             ('s2', 16.239969775527275),
             ('f2', 0.8684363668111492)])
The output I want as a DataFrame I achieved with pd.DataFrame(res, index=[0]):
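For reference, a minimal, self-contained reproduction of that one-row conversion (truncated to the first four parameters; the plain floats stand in for the lmfit values):

```python
import pandas as pd
from collections import OrderedDict

# Stand-in for out.params.valuesdict() from the question
res = OrderedDict([('a1', 12.850309404600393), ('c1', 1346.833513206811),
                   ('s1', 44.22337472274829), ('f1', 1.1275639898142586)])

# Scalar dict values need an explicit index to build a one-row frame
single_row = pd.DataFrame(res, index=[0])
print(single_row)
```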
I have 3 data columns that I want to quickly fit:
x = d.iloc[:, 0]
fit_odict = pd.DataFrame(
    d.iloc[:, 1:4].apply(
        lambda y: minimize(fit_func, pars, method='leastsq', args=(x, y))
                  .params.valuesdict()
    ),
    index=[1],
)
But I get a Series of OrderedDicts as rows in the DataFrame:
How do I get the output I want, with the three parameter results as rows? Is there a better way to apply the function?
UPDATE:
Adapted @M Newville's answer into my solution. It might be helpful for those who want to quickly extract lmfit parameter results from multiple data columns d1.iloc[:,1:]:
def fff(cols):
    out = minimize(fit_func, pars, method='leastsq', args=(x, cols))
    return {key: par.value for key, par in out.params.items()}

results = d1.iloc[:, 1:].apply(fff, result_type='expand').transpose()
Output:

For a single fit, this would probably be what you are looking for:
out = minimize(fit_func, pars, method = 'leastsq', args=(x, data))
fit_odict = pd.DataFrame({key: [par.value] for key, par in out.params.items()})
I think you probably are looking for something like this:
results = {key: [] for key in pars}
for data in datasets:
    out = minimize(fit_func, pars, method='leastsq', args=(x, data))
    for par_name, val_list in results.items():
        val_list.append(out.params[par_name].value)

results = pd.DataFrame(results)
You could probably stuff that all into a single long line, but I wouldn't recommend it -- someone may want to read that code ;).
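To make that loop concrete without lmfit installed, here is the same accumulation pattern with hard-coded dicts standing in for out.params (parameter names from the question; the values are invented):

```python
import pandas as pd

pars = ['a1', 'c1', 's1', 'f1']

# Pretend each "fit" produced these parameter values (out.params stand-ins)
fake_fits = [
    {'a1': 12.85, 'c1': 1346.83, 's1': 44.22, 'f1': 1.13},
    {'a1': 77.16, 'c1': 1580.57, 's1': 16.24, 'f1': 0.87},
]

results = {key: [] for key in pars}
for fit in fake_fits:                 # one iteration per dataset
    for par_name, val_list in results.items():
        val_list.append(fit[par_name])

results = pd.DataFrame(results)       # one row per fitted dataset
print(results)
```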

This is a quick workaround. The code is not efficient, but you can optimize it. Note that the index starts at 1; you are welcome to re-index using pandas.
import pandas as pd

# Your output is a list of tuples
OrderedDict = [('a1', 12.850309404600393), ('c1', 1346.833513206811),
               ('s1', 44.22337472274829), ('f1', 1.1275639898142586),
               ('a2', 77.15732669480884), ('c2', 1580.5712512351947),
               ('s2', 16.239969775527275), ('f2', 0.8684363668111492)]

# Create a dataframe from the list of tuples and transpose it
df = pd.DataFrame(OrderedDict).T

# Use the first row as the dataframe's column names
columns = df.loc[0].values.tolist()
df.columns = columns
output = df.drop(df.index[0])
output
a1 c1 s1 f1 a2 c2 s2 f2
1 12.8503 1346.83 44.2234 1.12756 77.1573 1580.57 16.24 0.868436
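If the list of tuples is all you have, a shorter route is to turn it into a dict and build a one-row frame directly; this skips the transpose-and-drop dance and keeps numeric dtypes (same data as above):

```python
import pandas as pd

pairs = [('a1', 12.850309404600393), ('c1', 1346.833513206811),
         ('s1', 44.22337472274829), ('f1', 1.1275639898142586),
         ('a2', 77.15732669480884), ('c2', 1580.5712512351947),
         ('s2', 16.239969775527275), ('f2', 0.8684363668111492)]

# A list containing one dict gives a single row with the keys as columns
output = pd.DataFrame([dict(pairs)])
print(output)
```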

Related

Concatenate rows in Pandas

I have 12 months of sales data, one file per month. I want to analyze the dataset as a whole.
I have tried using the concat function, but it produces not-a-number (NaN) values in my dataframe fields.
In R, the cbind function solves this. How do I approach this differently in Python?
I tried using the df.concat function to bind the rows, since all the column names are the same for the datasets.
What other options can I explore?
sales_1 = pd.read_csv('Sales_January_2019.csv')
sales_2 = pd.read_csv('Sales_February_2019.csv')
sales_3 = pd.read_csv('Sales_March_2019.csv')
sales_4 = pd.read_csv('Sales_April_2019.csv')
sales_5 = pd.read_csv('Sales_May_2019.csv')
sales_6 = pd.read_csv('Sales_June_2019.csv')
sales_7 = pd.read_csv('Sales_July_2019.csv')
sales_8 = pd.read_csv('Sales_August_2019.csv')
sales_9 = pd.read_csv('Sales_September_2019.csv')
sales_10 = pd.read_csv('Sales_October_2019.csv')
sales_11 = pd.read_csv('Sales_November_2019.csv')
sales_12 = pd.read_csv('Sales_December_2019.csv')
I expect all data frame to be merged into one since the column names are the same for all
Perhaps:
# use concat with the list of DataFrames you already read in to combine them into a single DataFrame
pd.concat([sales_1, sales_2, sales_3, sales_4, sales_5, sales_6,
           sales_7, sales_8, sales_9, sales_10, sales_11, sales_12])
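A tiny runnable version of that idea, with toy frames standing in for the monthly CSVs. If NaNs still appear after concat, the usual culprit is column names that differ slightly between files (e.g. trailing spaces), which the assert below would catch:

```python
import pandas as pd

# Toy stand-ins for two of the monthly files
jan = pd.DataFrame({'Order ID': [1, 2], 'Sales': [10.0, 20.0]})
feb = pd.DataFrame({'Order ID': [3], 'Sales': [30.0]})

# Columns must match exactly, otherwise concat fills the gaps with NaN
assert list(jan.columns) == list(feb.columns)

# ignore_index=True rebuilds a clean 0..n-1 index for the combined frame
all_sales = pd.concat([jan, feb], ignore_index=True)
print(all_sales)
```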

Working on multiple data frames with data for NBA players during the season, how can I modify all the dataframes at the same time?

I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position by their stats or if I can determine their total points during the season based on their stats.
What I would like to do is modify the list (df_list) of these dataframes, unless there's a better solution, instead of modifying each dataframe individually, to:
Change the datatype of the MP (minutes played) column from str to int.
Filter the dataframe so that it contains only players with 1000 or more MP and no duplicate players (Rk).
(For instance, a player (Rk) can play for three teams in a season and have 200 MP, 300 MP, and 400 MP with each team. He'll have a row for each team plus a row called TOT, which renders his MP as 900 (200+300+400), for a total of four rows in the dataframe. I only need the TOT row.)
Use simple algebra on various individual columns, for example: totaling the MP column and the PTS column and then dividing the sum of the PTS column by the sum of the MP column.
Or dividing the total of the PTS column by the len of the PTS column.
What I've done so far is this:
Import my libraries and create 16 dataframes using pd.read_html(url).
The first dataframes were created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four data frames had to be created using a few additional lines of code (I received an error that said "html5lib not found, please install it", so I downloaded both html5lib and requests). I say that because this distinction in creating the DFs may have to be considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
These four data frames look like this:
I tried this but it didn't do anything:
df_list = [
    eightyfour, eightyfive, eightysix, eightyseven,
    eightyeight, eightynine, ninety, ninetyone,
    ninetytwo, ninetyfour, ninetyfive,
    ninetysix, ninetyseven, ninetyeight, owe_one, owe_two
]

for df in df_list:
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]

owe_two
============================UPDATE===================================
This code solves a portion of problem #2:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
==================== UPDATE 10/11/22 ================================
Let's say I take the rows with the value "TOT" in the "Tm" column and create a new DF with them, taking these rows from the original data frame...
could I then compare the new DF with the original data frame and remove the names from the original data IF they match the names from the new data frame?
The problem is that the df you are working on in the loop is not the same df that is in df_list. You could solve this by saving the new df back to the list, overwriting the old df:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
    df_list[i] = df
These two lines are probably wrong as well:
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
Perhaps you want this:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    #df = list(df[df['MP'] >= 1000]['Rk'])
    #df = df[df['Rk'].isin(df)]
    # just the rows where MP >= 1000
    df_list[i] = df[df['MP'] >= 1000]
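Here is that write-back pattern on a toy frame (the Rk/Tm/MP columns come from the question; the rows are invented), showing that df_list ends up holding the filtered frame:

```python
import pandas as pd

# One toy season frame: player 1 has a TOT row over 1000 MP, player 2 doesn't
df_list = [pd.DataFrame({'Rk': ['1', '1', '2'],
                         'Tm': ['TOT', 'LAL', 'TOT'],
                         'MP': ['1200', '700', '900']})]

for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT'].copy()   # keep only the TOT rows
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df_list[i] = df[df['MP'] >= 1000]       # overwrite the entry in the list

print(df_list[0])
```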

Select cells in a pandas DataFrame by a Series of its column labels

Say we have a DataFrame and a Series of its column labels, both (almost) sharing a common index:
df = pd.DataFrame(...)
s = df.idxmax(axis=1).shift(1)
How can I obtain the cells given a series of columns, getting the value from every row using the corresponding column label from the joined series? I'd imagine it would be:
values = df[s] # either
values = df.loc[s] # or
In my example I'd like to have the values that sit under the biggest-in-their-row values (I'm doing a poor man's ML :) ).
However, I cannot find any interface for selecting cells by a series of column labels. Any ideas, folks?
Meanwhile I use this monstrous snippet:
def get_by_idxs(df: pd.DataFrame, idxs: pd.Series) -> pd.Series:
    ts_v_pairs = [
        (ts, row[row['idx']])
        for ts, row in df.join(idxs.rename('idx'), how='inner').iterrows()
        if isinstance(row['idx'], str)
    ]
    return pd.Series([v for ts, v in ts_v_pairs], index=[ts for ts, v in ts_v_pairs])
I think you need a DataFrame lookup:
v = s.dropna()
rows = df.index.get_indexer_for(v.index)  # row positions; range(len(v)) would misalign after dropna
v[:] = df.to_numpy()[rows, df.columns.get_indexer_for(v)]
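A worked example of that lookup on a small frame (the index labels t0..t2 are invented for the demo). The row positions are taken from v's own index because dropna removed the NaN that shift introduced:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 9, 2], 'b': [5, 3, 8]},
                  index=['t0', 't1', 't2'])

# Column label of the previous row's maximum, as in the question
s = df.idxmax(axis=1).shift(1)      # t0: NaN, t1: 'b', t2: 'a'

v = s.dropna()
rows = df.index.get_indexer_for(v.index)
v[:] = df.to_numpy()[rows, df.columns.get_indexer_for(v)]
print(v)   # t1 -> df.loc['t1', 'b'] = 3, t2 -> df.loc['t2', 'a'] = 2
```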

To join complicated pandas tables

I'm trying to join a dataframe of results from a statsmodels GLM to a dataframe designed to hold both univariate data and model results as models are iterated through. I'm having trouble figuring out how to programmatically join the two data sets.
I've consulted the pandas documentation below with no luck:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
This is difficult because of the output of the model compared to the final table, which holds values of each unique level of each unique variable.
See an example of what the data looks like with the code below:
import pandas as pd

df = {'variable': ['CLded_model', 'CLded_model', 'CLded_model', 'CLded_model',
                   'CLded_model', 'CLded_model', 'CLded_model',
                   'channel_model', 'channel_model', 'channel_model'],
      'level': [0, 100, 200, 250, 500, 750, 1000, 'DIR', 'EA', 'IA'],
      'value': [460955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87,
                1477842.472, 14587994.92, 10493740.36, 36388470.44, 31805316.37]}
final_table = pd.DataFrame(df)

df2 = {'variable': ['intercept', 'C(channel_model)[T.EA]', 'C(channel_model)[T.IA]', 'CLded_model'],
       'coefficient': [-2.36E-14, -0.091195797, -0.244225888, 0.00174356]}
model_results = pd.DataFrame(df2)
After this is run you can see that, for categorical variables, the value is encased in a few extra layers compared to the final_table. Numerical variables such as CLded_model need to be joined with the one coefficient they're associated with.
There is a lot to this and I'm not sure where to start.
Update: The following code produces the desired result:
d3 = {'variable': ['intercept', 'CLded_model', 'CLded_model', 'CLded_model',
                   'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model',
                   'channel_model', 'channel_model', 'channel_model'],
      'level': [None, 0, 100, 200, 250, 500, 750, 1000, 'DIR', 'EA', 'IA'],
      'value': [None, 60955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87,
                1477842.472, 14587994.92, 10493740.36, 36388470.44, 31805316.37],
      'coefficient': [-2.36E-14, 0.00174356, 0.00174356, 0.00174356, 0.00174356,
                      0.00174356, 0.00174356, 0.00174356, None, -0.091195797, -0.244225888]}
desired_result = pd.DataFrame(d3)
First you have to clean df2 (recent pandas requires regex=True for pattern replacement):
df2['variable'] = df2['variable'].str.replace(r"C\(", "", regex=True)\
                                 .str.replace(r"\)\[T\.", "-", regex=True)\
                                 .str.strip("]")
df2
variable coefficient
0 intercept -2.360000e-14
1 channel_model-EA -9.119580e-02
2 channel_model-IA -2.442259e-01
3 CLded_model 1.743560e-03
Because you want to merge some of df1 on the level column and others not, we need to change df1 slightly to match df2:
mask = df1['variable'] == 'channel_model'
df1.loc[mask, 'variable'] = "channel_model-" + df1.loc[mask, 'level']
df1
#snippet of what changed
variable level value
6 CLded_model 1000 1.458799e+07
7 channel_model-DIR DIR 1.049374e+07
8 channel_model-EA EA 3.638847e+07
9 channel_model-IA IA 3.180532e+07
Then we merge them:
df4 = df1.merge(df2, how='outer', on='variable')
And we get your result (except for the minor change in the variable name)
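The whole idea, condensed into a self-contained sketch with a trimmed-down version of the question's data (three rows are enough to show the cleaning and the merge):

```python
import pandas as pd

df1 = pd.DataFrame({'variable': ['CLded_model', 'channel_model', 'channel_model'],
                    'level': [1000, 'EA', 'IA'],
                    'value': [14587994.92, 36388470.44, 31805316.37]})
df2 = pd.DataFrame({'variable': ['C(channel_model)[T.EA]', 'C(channel_model)[T.IA]', 'CLded_model'],
                    'coefficient': [-0.091195797, -0.244225888, 0.00174356]})

# Strip the patsy wrapper: C(channel_model)[T.EA] -> channel_model-EA
df2['variable'] = (df2['variable'].str.replace(r'C\(', '', regex=True)
                                  .str.replace(r'\)\[T\.', '-', regex=True)
                                  .str.strip(']'))

# Tag the categorical rows of df1 with their level so the keys line up
mask = df1['variable'] == 'channel_model'
df1.loc[mask, 'variable'] = 'channel_model-' + df1.loc[mask, 'level']

df4 = df1.merge(df2, how='outer', on='variable')
print(df4)
```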

Extracting value and creating new column out of it

I would like to extract a certain section of a URL, residing in a column of a pandas DataFrame, and make that a new column. This
ref = df['REFERRERURL']
ref.str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE)
returns me a Series with tuples in it. How can I take out only one part of each tuple before the Series is created, so I can simply turn that into a column? Sample data for REFERRERURL is
http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....
In this example I am interested in creating a column that only has 'someproduct_step2' in it.
Thanks,
In [25]: df = DataFrame([['http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....']],columns=['A'])
In [26]: df['A'].str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE).apply(lambda x: Series(x[0][0],index=['first']))
Out[26]:
first
0 someproduct_step2
In 0.11.1, here is a neat way of doing this as well:
In [34]: df.replace({ 'A' : "http:.+\d\d\/(.*?)(;|\\?).*$"}, { 'A' : r'\1'} ,regex=True)
Out[34]:
A
0 someproduct_step2
This also worked:
import re

def extract(x):
    res = re.findall("\\d\\d\\/(.*?)(;|\\?)", x)
    if res:
        return res[0][0]

session['RU_2'] = session['REFERRERURL'].apply(extract)
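In modern pandas, Series.str.extract returns just the capture group, which avoids the tuple problem entirely (the frame and the session-id value below are made up to mirror the question's sample URL):

```python
import pandas as pd

df = pd.DataFrame({'REFERRERURL': [
    'http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=abc123'
]})

# One capture group + expand=False -> a plain Series of strings
df['RU_2'] = df['REFERRERURL'].str.extract(r'\d\d/(.*?)[;?]', expand=False)
print(df['RU_2'])
```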