pandas.DataFrame: passing a DataFrame as data returns NaN?

df is the original DataFrame, read from a CSV file.
a = df.head(3)  # take the first three rows of df
This is table a.
b = a.loc[1:3, '22':'41']  # select part of a
c = pd.DataFrame(data=b, index=['a', 'b'], columns=['v', 'g'])  # assign new index and columns
b is 2x2 and shows four values.
c is 2x2 but shows four NaN.
Why doesn't c contain any numbers?

Try using .values; you are running into pandas' intrinsic data alignment.
c = pd.DataFrame(data=b.values, index=['a', 'b'], columns=['v', 'g'])  # assign new index and columns
pandas aligns on index and column labels. Converting b to a NumPy array drops those labels, so the DataFrame constructor builds a new 2x2 DataFrame from the raw values and applies the new index and columns.

Your DataFrame b already has row and column labels, so when you create DataFrame c and pass the index and columns keyword arguments, pandas reindexes b against those labels. None of them exist in b, so every cell comes back as NaN.
If all you want to do is relabel b, why not do it directly?
b = b.copy()
b.index = ['a', 'b']
b.columns = ['v', 'g']
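To make the alignment behaviour concrete, here is a minimal sketch. The values in b are made up for illustration (the original CSV is not shown); only the labels mirror the question.

```python
import pandas as pd

# Hypothetical stand-in for the 2x2 slice b from the question.
b = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=[1, 2], columns=["22", "41"])

# Passing a DataFrame as data makes the constructor reindex it:
# labels 'a', 'b', 'v', 'g' do not exist in b, so everything is NaN.
aligned = pd.DataFrame(data=b, index=["a", "b"], columns=["v", "g"])

# Passing the raw array drops the old labels, so the values are kept.
relabelled = pd.DataFrame(data=b.to_numpy(), index=["a", "b"], columns=["v", "g"])

# Relabelling without going through the constructor at all.
renamed = b.set_axis(["a", "b"], axis=0).set_axis(["v", "g"], axis=1)

print(aligned)
print(relabelled)
```

Both relabelled and renamed hold the original four values; aligned holds four NaN.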

Related

Converting list of nested dicts to Dataframe

I am trying to convert a list of dicts with the following format to a single DataFrame where each row contains a specific type of betting odds offered by one sportsbook (meaning 'h2h' odds and 'spread' odds are in separate rows):
temp = [{"id":"e4cb60c1cd96813bbf67450007cb2a10",
"sport_key":"americanfootball",
"sport_title":"NFL",
"commence_time":"2022-11-15T01:15:31Z",
"home_team":"Philadelphia Eagles",
"away_team":"Washington Commanders",
"bookmakers":
[{"key":"fanduel","title":"FanDuel",
"last_update":"2022-11-15T04:00:35Z",
"markets":[{"key":"h2h","outcomes":[{"name":"Philadelphia Eagles","price":630},
{"name":"Washington Commanders","price":-1200}]}]},
{"key":"draftkings","title":"DraftKings",
"last_update":"2022-11-15T04:00:30Z",
"markets":[{"key":"h2h","outcomes":[{"name":"Philadelphia Eagles","price":600},
{"name":"Washington Commanders","price":-950}]}]},
There are many more bookmaker entries of the same format. I have tried:
df = pd.DataFrame(temp)
# normalize the column of dicts
normalized = pd.json_normalize(df['bookmakers'])
# join the normalized column to df
df = df.join(normalized,).drop(columns=['bookmakers'])
# join the normalized column to df
df = df.join(normalized, lsuffix = 'key')
However, this results in a Dataframe with repeated columns and columns that contain dictionaries.
Thanks for any help in advance!
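One possible approach (a sketch, not an answer from the original thread) is to walk the nesting explicitly and build one flat dict per game, bookmaker and market, then hand the list of dicts to pd.DataFrame. The sample below is a shortened copy of the data in the question; the output column names (bookmaker, market, and one price column per team name) are my own choice.

```python
import pandas as pd

# Shortened copy of the structure from the question.
temp = [{
    "id": "e4cb60c1cd96813bbf67450007cb2a10",
    "home_team": "Philadelphia Eagles",
    "away_team": "Washington Commanders",
    "bookmakers": [
        {"key": "fanduel", "last_update": "2022-11-15T04:00:35Z",
         "markets": [{"key": "h2h", "outcomes": [
             {"name": "Philadelphia Eagles", "price": 630},
             {"name": "Washington Commanders", "price": -1200}]}]},
        {"key": "draftkings", "last_update": "2022-11-15T04:00:30Z",
         "markets": [{"key": "h2h", "outcomes": [
             {"name": "Philadelphia Eagles", "price": 600},
             {"name": "Washington Commanders", "price": -950}]}]},
    ],
}]

rows = []
for game in temp:
    for book in game["bookmakers"]:
        for market in book["markets"]:
            # One row per (game, bookmaker, market).
            row = {
                "id": game["id"],
                "home_team": game["home_team"],
                "away_team": game["away_team"],
                "bookmaker": book["key"],
                "last_update": book["last_update"],
                "market": market["key"],
            }
            # Spread each outcome into its own column, keyed by team name.
            for outcome in market["outcomes"]:
                row[outcome["name"]] = outcome["price"]
            rows.append(row)

odds = pd.DataFrame(rows)
print(odds)
```

Because every row is a flat dict, no column ends up holding a dictionary, and a second market such as 'spread' would simply become another row for the same bookmaker.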

Add/Sum values of a column vector to a matrix with pandas

I have 2 dataframes (without headers or index). One is of size 100x20 (Dataframe A) and the other of size 100x1 (Dataframe B). I would like to add the values of Dataframe B to the first 5 columns in Dataframe A.
I tried to do this with
C = A.iloc[:,:5].add(B,axis=0)
Now C is of size 100x5, but only the first column holds A[:,0]+B; the other 4 columns of C are NaN. What am I doing wrong?
This is because of index alignment. DataFrames always carry row and column labels. B has a single column labelled 0, so when it is aligned with A during the addition, only column 0 of A finds a match and the other columns become NaN.
Use an array to bypass it:
C = A.iloc[:,:5].add(B.to_numpy(), axis=0)
Or slice B as Series:
A.iloc[:,:5].add(B[0], axis=0)
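A small sketch of the three variants, using made-up 4x6 and 4x1 frames instead of the 100x20 and 100x1 ones from the question, and the first 3 columns instead of the first 5:

```python
import numpy as np
import pandas as pd

# Hypothetical small frames standing in for A and B.
A = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6))
B = pd.DataFrame(np.full((4, 1), 100.0))

# DataFrame + DataFrame aligns on column labels: only column 0 matches.
aligned = A.iloc[:, :3].add(B)

# A plain array has no labels, so the (4, 1) column is applied to every column.
via_array = A.iloc[:, :3].add(B.to_numpy(), axis=0)

# A Series is matched against the row index when axis=0.
via_series = A.iloc[:, :3].add(B[0], axis=0)

print(aligned)
print(via_array)
```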

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays the data for the selected columns. I am trying to see how I could apply value_counts/groupby to this output in the Streamlit app.
If I try the following:
st.table(new_df.value_counts())
I get this error:
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a DataFrame. When you pass a single column name in [], you get back a pandas.Series, which has the value_counts method. When you pass a list of columns, you get back a pandas.DataFrame, which has no value_counts method in pandas versions before 1.1 (DataFrame.value_counts was added in pandas 1.1.0).
Can you try st.table(new_df[col_name].value_counts())?
I think the error occurs because value_counts() is available on a Series and, in your pandas version, not on a DataFrame.
You can convert the .value_counts() output to a DataFrame.
If you want to apply it to a single column:
def value_counts_df(df, col):
    """
    Return df[col].value_counts() as a DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame on which to run value_counts(); must have column `col`.
    col : str
        Name of the column in `df` for which to generate counts.

    Returns
    -------
    pandas.DataFrame
        A single column named "count" holding the value_counts()
        for each unique value of df[col]. The index name is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the DataFrame:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
Finally, you can use st.table or st.dataframe to show the resulting DataFrame in your Streamlit app.
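If the goal is to count combinations across all selected columns at once, groupby(...).size() is an alternative that does not depend on DataFrame.value_counts being available. The sketch below is pandas-only; the sample data and the helper name count_combinations are made up, and selected_columns stands in for the list returned by st.multiselect.

```python
import pandas as pd

def count_combinations(df, columns):
    """Count rows per unique combination of `columns` (hypothetical helper)."""
    if not columns:
        # st.multiselect returns an empty list until something is picked.
        return pd.DataFrame(columns=["count"])
    return (
        df.groupby(columns)
          .size()
          .reset_index(name="count")
          .sort_values("count", ascending=False, ignore_index=True)
    )

# Made-up data for illustration.
df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "LA"],
    "plan": ["a", "b", "a", "a", "a"],
})
selected_columns = ["city", "plan"]  # stand-in for the multiselect result

counts = count_combinations(df, selected_columns)
print(counts)
```

The result is an ordinary DataFrame, so it can be passed to st.table or st.dataframe as it is.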

pandas: appending a row to a dataframe with values derived using a user defined formula applied on selected columns

I have a dataframe as
df = pd.DataFrame(np.random.randn(5,4),columns=list('ABCD'))
I can use the following for standard calculations like mean(), sum(), etc.
df.loc['calc'] = df[['A','D']].iloc[2:4].mean(axis=0)
Now I have two questions:
1. How can I apply a formula (like exp(mean()) or 2.5*mean()/sqrt(max())) to columns 'A' and 'D' for rows 2 to 4?
2. How can I append a row to the existing df where two values are the mean() of A and D and the other two are the result of a specific formula applied to C and B?
Q1:
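You can use .apply() with lambda functions. Note that iloc[2:4] selects rows 2 and 3 only; use iloc[2:5] to include row 4. The second formula returns NaN for a column whose maximum is negative.
df.iloc[2:4, [0, 3]].apply(lambda x: np.exp(np.mean(x)))
df.iloc[2:4, [0, 3]].apply(lambda x: 2.5 * np.mean(x) / np.sqrt(np.max(x)))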
Q2:
You can build two dictionaries, combine them, and add the result as a row.
The first holds the means; the second holds the result of some custom function.
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum()*45))
Combine them:
ad.update(bc)
df = pd.concat([df, pd.DataFrame([ad])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
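Put together as a runnable sketch, using Series instead of dicts. The seeded generator and the sum()*45 formula are placeholders, as in the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the run repeatable
df = pd.DataFrame(rng.standard_normal((5, 4)), columns=list("ABCD"))

# Means for A and D, a custom formula for B and C.
ad = df[["A", "D"]].mean()
bc = df[["B", "C"]].apply(lambda s: s.sum() * 45)

# Combine into one labelled row and append it; concat aligns on column names,
# so the order A, D, B, C in new_row does not matter.
new_row = pd.concat([ad, bc])
result = pd.concat([df, new_row.to_frame().T], ignore_index=True)

print(result)
```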

pandas merge multiple dataframes

For example: I have multiple dataframes. Each data frame has columns: variable_code, variable_description, year.
df1:
variable_code, variable_description
N1, Number of returns
N2, Number of Exemptions
df2:
variable_code, variable_description
N1, Number of returns
NUMDEP, # of dependent
I want to merge these two dataframes to get all variable_codes in both df1 and df2:
variable_code, variable_description
N1, Number of returns
N2, Number of Exemptions
NUMDEP, # of dependent
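See the pandas documentation for merge.
Since the column you want to merge on is called "variable_code" in both frames, you can use on='variable_code',
so the whole thing would be:
df1.merge(df2, on='variable_code')
You can specify how='outer' if you want blanks where you have data in only one of the tables. Use how='inner' if you want only data that is in both tables (no blanks). Note that variable_description also exists in both frames, so merging on variable_code alone produces variable_description_x and variable_description_y columns; include it in on to get a single column.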
To attain your requirement, try this:
from functools import reduce
import pandas as pd
#Create the first dataframe, through a dictionary - several other possibilities exist.
data1 = {'variable_code': ['N1','N2'], 'variable_description': ['Number of returns','Number of Exemptions']}
df1 = pd.DataFrame(data=data1)
#Create second dataframe
data2 = {'variable_code': ['N1','NUMDEP'], 'variable_description': ['Number of returns','# of dependent']}
df2 = pd.DataFrame(data=data2)
#place the dataframes on a list.
dfs = [df1,df2] #additional dfs can be added here.
#You can loop over the list,merging the dfs. But here reduce and a lambda is used.
resultant_df = reduce(lambda left,right: pd.merge(left,right,on=['variable_code','variable_description'],how='outer'), dfs)
This gives:
>>> resultant_df
  variable_code  variable_description
0            N1     Number of returns
1            N2  Number of Exemptions
2        NUMDEP        # of dependent
There are several options available for how, each catering to different needs. outer, used here, keeps rows that appear in only one of the frames. See the docs for a detailed explanation of the other options.
First, concatenate df1 and df2:
final_df = pd.concat([df1, df2])
Then convert the columns variable_code and variable_description into a dictionary, with variable_code as keys and variable_description as values:
d = dict(zip(final_df['variable_code'], final_df['variable_description']))
Because dictionary keys are unique, each variable_code is kept once (with the last description seen for it). Then convert d into a DataFrame:
d_df = pd.DataFrame(list(d.items()), columns=['variable_code', 'variable_description'])
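A shorter variant of the same idea, sketched here as an alternative rather than taken from the answers above, is to concatenate and then drop duplicate codes. It assumes that a code appearing in both frames carries the same description, as N1 does in the example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "variable_code": ["N1", "N2"],
    "variable_description": ["Number of returns", "Number of Exemptions"],
})
df2 = pd.DataFrame({
    "variable_code": ["N1", "NUMDEP"],
    "variable_description": ["Number of returns", "# of dependent"],
})

# Stack the frames, then keep the first row seen for each variable_code.
combined = (
    pd.concat([df1, df2], ignore_index=True)
      .drop_duplicates(subset="variable_code")
      .reset_index(drop=True)
)

print(combined)
```

Unlike the dictionary route, drop_duplicates keeps the first description seen for a code and preserves any extra columns, such as year.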