Converting list of nested dicts to Dataframe - pandas

I am trying to convert a list of dicts with the following format into a single DataFrame where each row contains a specific type of betting odds offered by one sportsbook (meaning ‘h2h’ odds and ‘spread’ odds are in separate rows):
temp = [{"id":"e4cb60c1cd96813bbf67450007cb2a10",
"sport_key":"americanfootball",
"sport_title":"NFL",
"commence_time":"2022-11-15T01:15:31Z",
"home_team":"Philadelphia Eagles",
"away_team":"Washington Commanders",
"bookmakers":
[{"key":"fanduel","title":"FanDuel",
"last_update":"2022-11-15T04:00:35Z",
"markets":[{"key":"h2h","outcomes":[{"name":"Philadelphia
Eagles","price":630},{"name":"Washington Commanders","price":-1200}]}]},
{"key":"draftkings","title":"DraftKings",
"last_update":"2022-11-15T04:00:30Z",
"markets":[{"key":"h2h","outcomes":[{"name":"Philadelphia Eagles","price":600},
{"name":"Washington Commanders","price":-950}]}]},
There are many more bookmaker entries of the same format. I have tried:
df = pd.DataFrame(temp)
# normalize the column of dicts
normalized = pd.json_normalize(df['bookmakers'])
# join the normalized column to df, then drop the original column
df = df.join(normalized).drop(columns=['bookmakers'])
df = df.join(normalized, lsuffix='key')
However, this results in a Dataframe with repeated columns and columns that contain dictionaries.
Thanks for any help in advance!
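One approach worth trying (a sketch, not verified against the full API payload) is to let pandas.json_normalize walk the nested lists itself: use bookmakers -> markets as the record path, so each row becomes one market (bet type) from one bookmaker, and carry the event-level fields along as meta:
import pandas as pd

# One row per market per bookmaker; event-level fields are repeated as meta.
df = pd.json_normalize(
    temp,
    record_path=["bookmakers", "markets"],
    meta=[
        "id", "sport_key", "sport_title", "commence_time",
        "home_team", "away_team",
        ["bookmakers", "key"], ["bookmakers", "title"], ["bookmakers", "last_update"],
    ],
)

# 'outcomes' is still a list of dicts in each row; if you also want it flat,
# explode it and normalize the resulting dicts:
df = df.explode("outcomes").reset_index(drop=True)
df = pd.concat([df.drop(columns="outcomes"),
                pd.json_normalize(df["outcomes"].tolist())], axis=1)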

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row i in the series corresponds to row i in my original DataFrame, but each value must be the size of the group that the row belongs to in the grouped DataFrame. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
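A vectorised sketch (assuming the same df_k1 and columns_for_groups as above): groupby plus transform('size') broadcasts each group's size back onto the original rows, so no Python loop over row/group pairs is needed.
# Size of the group each row belongs to, aligned with df_k1's index.
last_col = df_k1.columns[-1]
group_size = df_k1.groupby(columns_for_groups)[last_col].transform('size')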

How to build a loop for converting entries of categorical columns to numerical values in Pandas?

I have a Pandas data frame with several columns, with some columns comprising categorical entries. I am 'manually' converting these entries to numerical values. For example,
df['gender'] = pd.Series(df['gender'].factorize()[0])
df['race'] = pd.Series(df['race'].factorize()[0])
df['city'] = pd.Series(df['city'].factorize()[0])
df['state'] = pd.Series(df['state'].factorize()[0])
If the number of columns is huge, this method is obviously inefficient. Is there a way to do this by constructing a loop over all columns (only those columns with categorical entries)?
Use DataFrame.apply over the columns stored in the variable cols:
cols = df.select_dtypes(['category']).columns
df[cols] = df[cols].apply(lambda x: x.factorize()[0])
EDIT:
Your solution can be simplified:
for column in df.select_dtypes(['category']):
df[column] = df[column].factorize()[0]
I tried the following, which seems to work fine:
for column in df.select_dtypes(['category']):
df[column] = pd.Series(df[column].factorize()[0])
where 'category' could be 'bool', 'object', etc.
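A small self-contained illustration of that loop on made-up data with object columns:
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "M", "F"],
    "city": ["Austin", "Boston", "Austin", "Chicago"],
    "age": [34, 29, 41, 23],   # numeric column, left untouched
})

# Factorize only the non-numeric columns.
for column in df.select_dtypes(["object", "bool"]):
    df[column] = df[column].factorize()[0]

print(df)
# gender and city are now integer codes; age is unchanged.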

Pandas - How best to combine dataframes based on specific column values

I have my main data frame (df) with the six columns defined in 'columns_main'.
The needed data comes from two much larger DataFrames; let's call them df1 and df2.
df1 and df2 do not have the same column labels, but they both include the required df columns.
The df just has the few pieces that are needed from each of the two bigger ones. And by bigger, I mean many times the columns.
Since it is all going into a DB I want to get rid of all the unwanted columns.
How do I combine/merge/join/mask the needed data from the large data frames into the main (smaller) data frame? Or maybe drop the columns not covered by 'columns_main'?
df = pd.DataFrame(columns = columns_main)
The other two df's are coming from excel workbooks with a lot of unwanted trash.
wb = load_workbook(filename = filename )
ws = wb[_sheets[0]]
df1 = pd.DataFrame(ws.values)
ws = wb[_sheets[1]]
df2 = pd.DataFrame(ws.values)
How can I do this without some sort of crazy looping?
Thank you.
You can subset the other DataFrames with the list of wanted columns:
df1[columns_main]
df2[columns_main]
If some columns might not match, use Index.intersection:
cols = columns_main
df1[df1.columns.intersection(cols)]
df2[df2.columns.intersection(cols)]
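Putting it together, a sketch (assuming columns_main holds the six wanted labels and that the first worksheet row carries the column headers, since pd.DataFrame(ws.values) leaves the header row as ordinary data with integer column labels):
# Promote the first row of each workbook frame to its header.
for frame in (df1, df2):
    frame.columns = frame.iloc[0]
df1 = df1.iloc[1:].reset_index(drop=True)
df2 = df2.iloc[1:].reset_index(drop=True)

# Keep only the wanted columns and stack the rows into the main frame.
df = pd.concat([df1[columns_main], df2[columns_main]], ignore_index=True)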

How to concat 3 dataframes with each into sequential columns

I'm trying to understand how to concat three individual dataframes (i.e. df1, df2, df3) into a new dataframe, say df4, whereby each individual dataframe occupies its own column(s) in left-to-right order.
I've tried using concat with axis = 1 to do this, but it appears not possible to automate this with a single action.
Table1_updated = pd.DataFrame(columns=['3P','2PG-3Io','3Io'])
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io])
Note that, with the exception of get_table1_2P_max_3Io, which has two columns, all other dataframes have one column.
(The example contents of get_table1_3P, get_table1_2P_max_3Io, get_table1_3Io and the desired result were shown as images in the original post.)
I believe you need to concat first and then change the order using a list of column names:
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P','2PG-3Io','3Io']]
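For example, with toy frames (made-up values; 'extra' stands in for the second column of get_table1_2P_max_3Io, whose real name isn't shown above):
import pandas as pd

get_table1_3P = pd.DataFrame({'3P': [1, 2, 3]})
get_table1_2P_max_3Io = pd.DataFrame({'2PG-3Io': [4, 5, 6], 'extra': [7, 8, 9]})
get_table1_3Io = pd.DataFrame({'3Io': [10, 11, 12]})

# Place the three frames side by side, then keep/reorder only the wanted columns.
Table1_updated = pd.concat(
    [get_table1_3P, get_table1_2P_max_3Io, get_table1_3Io], axis=1
)
Table1_updated = Table1_updated[['3P', '2PG-3Io', '3Io']]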

pandas.DataFrame: passing a DataFrame as input but getting NaN?

df is the original DataFrame, read from a CSV file.
a = df.head(3)  # get part of df (this is table a)
b = a.loc[1:3, '22':'41']  # select part of a
c = pd.DataFrame(data=b, index=['a', 'b'], columns=['v', 'g'])  # give new index and columns
b shows a 2x2 table, so I get four values.
c shows a 2x2 table of NaN, so I get four NaN.
Why doesn't c contain any numbers?
Try using .values; you are running into 'intrinsic data alignment'.
c = pd.DataFrame(data=b.values, index=['a', 'b'], columns=['v', 'g'])  # give index and columns
Pandas likes to align indexes. By converting your 'b' dataframe into a NumPy array, you can then use the DataFrame constructor to build a new dataframe from those 2x2 values while assigning new indexing.
Your DataFrame b already contains row and column indices, so when you try to create DataFrame c and you pass index and columns keyword arguments, you are implicitly indexing out of the original DataFrame b.
If all you want to do is re-index b, why not do it directly?
b = b.copy()
b.index = ['a', 'b']
b.columns = ['v', 'g']
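A quick illustration of the alignment behaviour with a made-up 2x2 frame:
import pandas as pd

b = pd.DataFrame([[1, 2], [3, 4]], index=[1, 2], columns=['22', '41'])

# Passing b itself re-aligns on the new labels, which don't exist in b,
# so every cell becomes NaN:
c_nan = pd.DataFrame(data=b, index=['a', 'b'], columns=['v', 'g'])

# Passing the raw values (or relabelling b directly, as above) keeps the numbers:
c_ok = pd.DataFrame(data=b.values, index=['a', 'b'], columns=['v', 'g'])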