How to convert a column of text values to a numeric column before modelling - pandas

I am working on a dataset that has
How can I replace the values in the Elements and Area columns with numbers before modelling?

If you need to factorize the string columns, use:
cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda x: pd.factorize(x)[0])
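For illustration, here is a minimal sketch of the same idea on made-up data. The sample values and the extra value column are assumptions, since the original dataset is not shown; only the Elements and Area column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; dtype=object is forced so the example behaves
# the same regardless of the default string dtype of the pandas version.
df = pd.DataFrame({
    "Elements": pd.Series(["Fe", "Cu", "Fe", "Zn"], dtype=object),
    "Area": pd.Series(["north", "south", "south", "north"], dtype=object),
    "value": [1.0, 2.5, 3.0, 4.2],
})

# Pick the object (string) columns and replace each distinct value
# with an integer code, numbered in order of first appearance.
cols = df.select_dtypes(include="object").columns
df[cols] = df[cols].apply(lambda x: pd.factorize(x)[0])

print(df)
```

Note that the codes carry no ordering or distance information, so for many models one-hot encoding (for example pd.get_dummies) may be a better fit than plain integer codes.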

Related

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator) and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic doesn't contain a translation, but the same English value (e.g. for the language column both records might have ['ger'] which becomes ['ger', 'ger']) so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you have no control over the input values, you need to normalize them before aggregating.
Something like this: here, I convert the plain string values in subject_name_namePart to lists of strings, so that every cell in the column has the same type.
from ast import literal_eval
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then you can explode the lists into rows and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
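As an alternative sketch (not the method above, but a variant of it), the normalization and the de-duplication can be folded into one aggregation function, which avoids building a quoted string by hand and so also copes with names that contain apostrophes. The helper names to_list and merge_unique and the shortened sample values are made up for illustration.

```python
from ast import literal_eval

import pandas as pd


def to_list(value):
    """Return value as a list: parse list-like strings, wrap anything else."""
    if isinstance(value, list):
        return value
    if isinstance(value, str) and value.startswith("["):
        return literal_eval(value)
    return [value]


def merge_unique(series):
    """Flatten every cell of a group to a list, keeping first occurrences."""
    merged = []
    for cell in series:
        for item in to_list(cell):
            if item not in merged:
                merged.append(item)
    return merged


# Hypothetical, shortened stand-in for the sample data in the question.
df = pd.DataFrame({
    "location_shelfLocator": ["a1", "a1"],
    "language_languageTerm": ["ara", "ara"],
    "subject_name_namePart": ["Ptolemy", "['Ptolemy', \"'Abd al-Rahman\"]"],
})

df_merged = df.groupby("location_shelfLocator").agg(merge_unique)
print(df_merged)
```

Unlike np.unique, this keeps the original order of the values and returns plain lists, so it works whether a group holds strings, lists, or a mix of both.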

Converting list of nested dicts to Dataframe

I am trying to convert a list of dicts with the following format to a single DataFrame where each row contains a specific type of betting odds offered by one sportsbook (meaning 'h2h' odds and 'spread' odds are in separate rows):
temp = [{"id":"e4cb60c1cd96813bbf67450007cb2a10",
  "sport_key":"americanfootball",
  "sport_title":"NFL",
  "commence_time":"2022-11-15T01:15:31Z",
  "home_team":"Philadelphia Eagles",
  "away_team":"Washington Commanders",
  "bookmakers":
    [{"key":"fanduel","title":"FanDuel",
      "last_update":"2022-11-15T04:00:35Z",
      "markets":[{"key":"h2h","outcomes":[{"name":"Philadelphia Eagles","price":630},
                                          {"name":"Washington Commanders","price":-1200}]}]},
     {"key":"draftkings","title":"DraftKings",
      "last_update":"2022-11-15T04:00:30Z",
      "markets":[{"key":"h2h","outcomes":[{"name":"Philadelphia Eagles","price":600},
                                          {"name":"Washington Commanders","price":-950}]}]},
There are many more bookmaker entries of the same format. I have tried:
df = pd.DataFrame(temp)
# normalize the column of dicts
normalized = pd.json_normalize(df['bookmakers'])
# join the normalized column to df
df = df.join(normalized,).drop(columns=['bookmakers'])
# join the normalized column to df
df = df.join(normalized, lsuffix = 'key')
However, this results in a Dataframe with repeated columns and columns that contain dictionaries.
Thanks for any help in advance!
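One possible approach, sketched under assumptions: walk the nesting explicitly (game, then bookmaker, then market) and build one flat dict per bookmaker and market, so no column is left holding dictionaries. The output column names (bookmaker, market, home_price, away_price) are made up for illustration, and the sample below is a trimmed copy of the data in the question.

```python
import pandas as pd

# Trimmed copy of the structure from the question (two bookmakers only).
temp = [{
    "id": "e4cb60c1cd96813bbf67450007cb2a10",
    "home_team": "Philadelphia Eagles",
    "away_team": "Washington Commanders",
    "bookmakers": [
        {"key": "fanduel", "title": "FanDuel",
         "markets": [{"key": "h2h", "outcomes": [
             {"name": "Philadelphia Eagles", "price": 630},
             {"name": "Washington Commanders", "price": -1200}]}]},
        {"key": "draftkings", "title": "DraftKings",
         "markets": [{"key": "h2h", "outcomes": [
             {"name": "Philadelphia Eagles", "price": 600},
             {"name": "Washington Commanders", "price": -950}]}]},
    ],
}]

rows = []
for game in temp:
    for book in game["bookmakers"]:
        for market in book["markets"]:
            # One row per (game, bookmaker, market) combination.
            row = {
                "id": game["id"],
                "home_team": game["home_team"],
                "away_team": game["away_team"],
                "bookmaker": book["key"],
                "market": market["key"],
            }
            # Spread the outcomes into columns keyed by home/away side.
            for outcome in market["outcomes"]:
                side = "home" if outcome["name"] == game["home_team"] else "away"
                row[f"{side}_price"] = outcome["price"]
            rows.append(row)

df = pd.DataFrame(rows)
print(df)
```

Markets with extra outcome fields (a spread market would presumably carry a point value as well) would need those fields copied into the row in the same inner loop.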

How to build a loop for converting entries of categorical columns to numerical values in Pandas?

I have a Pandas data frame with several columns, with some columns comprising categorical entries. I am 'manually' converting these entries to numerical values. For example,
df['gender'] = pd.Series(df['gender'].factorize()[0])
df['race'] = pd.Series(df['race'].factorize()[0])
df['city'] = pd.Series(df['city'].factorize()[0])
df['state'] = pd.Series(df['state'].factorize()[0])
If the number of columns is huge, this method is obviously inefficient. Is there a way to do this by constructing a loop over all columns (only those columns with categorical entries)?
Select the categorical columns into a variable cols and use DataFrame.apply on them:
cols = df.select_dtypes(['category']).columns
df[cols] = df[cols].apply(lambda x: x.factorize()[0])
EDIT:
Your solution can be simplified:
for column in df.select_dtypes(['category']):
    df[column] = df[column].factorize()[0]
I tried the following, which seems to work fine:
for column in df.select_dtypes(['category']):
    df[column] = pd.Series(df[column].factorize()[0])
where 'category' could be 'bool', 'object', etc.
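A small sketch of why dropping the pd.Series wrapper is more than a cosmetic change: assigning a Series aligns on the index, so the wrapped version only works when the frame has the default 0..n-1 index. The sample frame and its index values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a non-default index and categorical columns.
df = pd.DataFrame(
    {"gender": ["F", "M", "F"], "city": ["Oslo", "Rome", "Oslo"]},
    index=[10, 20, 30],
).astype("category")

# The wrapped codes get a fresh 0..2 index, which shares no labels with
# 10/20/30, so index alignment turns every value into NaN.
wrapped = pd.Series(df["gender"].factorize()[0])
misaligned = df.assign(gender=wrapped)

# Assigning the raw code array is positional, so it is safe for any index.
for column in df.select_dtypes(["category"]):
    df[column] = df[column].factorize()[0]

print(misaligned)
print(df)
```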

Pandas split list inside a column into separate columns

I have a dataset with 71 columns and 113 rows. Each column holds an array of values. I want to split these arrays into separate columns. Then rename the columns with the prefix
!wget https://raw.githubusercontent.com/pranavn91/sample/master/audioonly.csv
audio = pd.read_csv("audioonly.csv")
zcr = pd.DataFrame(audio['zcr'].str.split().values.tolist())
zcr.columns = ['zcr_' + str(col) for col in zcr.columns]
I can do it for each column individually and combine as single dataframe.
Please propose a faster method.
You can use concat and a list comprehension:
audio_exploded = pd.concat(
    [pd.DataFrame(audio[col].str.split().values.tolist()).add_prefix(f'{col}_')
     for col in audio.columns],
    axis=1)
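A self-contained sketch of the same pattern on made-up data, since the CSV itself is not reproduced here. It swaps in str.split(expand=True), which returns the split pieces as a DataFrame directly, in place of the values.tolist() round trip; the rms column and all values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for audioonly.csv: each cell is a
# whitespace-separated string of numbers.
audio = pd.DataFrame({
    "zcr": ["0.1 0.2 0.3", "0.4 0.5 0.6"],
    "rms": ["1 2", "3 4"],
})

# Split every column into its pieces and prefix the new columns
# with the name of the column they came from.
audio_exploded = pd.concat(
    [audio[col].str.split(expand=True).add_prefix(f"{col}_")
     for col in audio.columns],
    axis=1,
)

print(audio_exploded)
```

The pieces are still strings after the split; call astype(float) on the result if numeric values are needed.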

How to concatenate numerous column names in pandas?

I would like to concatenate all the columns into one comma-delimited string per row in pandas.
But as you can see, this is a very laborious task, since I typed all the column indices manually.
de = data[3]+","+data[4]+","+data[5]+....+","+data[1511]
Do you have any idea how to avoid the above procedure in pandas with Python 3?
First convert all columns to strings with DataFrame.astype, and then join the values of each row:
df = data.astype(str).apply(','.join, axis=1)
Or, after converting to strings, append a comma to every value, sum across each row, and finally strip the trailing comma with Series.str.rstrip:
df = data.astype(str).add(',').sum(axis=1).str.rstrip(',')
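A minimal sketch of the first approach on made-up data, restricted to a range of column labels as in the question (3 to 5 here, standing in for 3 to 1511). The sample frame is an assumption, and the label slice relies on the integer column labels being sorted.

```python
import pandas as pd

# Hypothetical frame with integer column labels, as in the question.
data = pd.DataFrame({
    0: ["skip", "skip"],
    3: ["a", "b"],
    4: [1, 2],
    5: [0.5, 1.5],
})

# Label-based slice: selects columns 3 through 5 inclusive, so the
# individual column indices never have to be typed out.
subset = data.loc[:, 3:5]

# Convert everything to strings, then join each row with commas.
joined = subset.astype(str).apply(",".join, axis=1)

print(joined)
```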