Convert string (object datatype) to categorical data - pandas

I have run into this issue before and hope you can help. I am trying to convert columns of 'object' datatype into categorical values.
So:
print(df_train['aus_heiz_befeuerung'].unique())
['Gas' 'Unbekannt' 'Alternativ' 'Öl' 'Elektro' 'Kohle']
The values in this column should be converted to numbers, e.g. 1, 2, 4, 5, 3.
Unfortunately I cannot figure out how.
I have tried different astype versions and the following code block:
# string label to categorical values
from sklearn.preprocessing import LabelEncoder

for i in range(df_train.shape[1]):
    if df_train.iloc[:, i].dtypes == object:
        lbl = LabelEncoder()
        lbl.fit(list(df_train.iloc[:, i].values) + list(df_test.iloc[:, i].values))
        df_train.iloc[:, i] = lbl.transform(list(df_train.iloc[:, i].values))
        df_test.iloc[:, i] = lbl.transform(list(df_test.iloc[:, i].values))

print(df_train['aus_heiz_befeuerung'].unique())
It leads to:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Happy for any ideas.

You can use the pandas.Categorical() function to convert the values in a column to categorical values. For example, to convert the values in the aus_heiz_befeuerung column to categorical values, you can use the following code:
df_train['aus_heiz_befeuerung'] = pd.Categorical(df_train['aus_heiz_befeuerung'])
This assigns an integer code to each unique category in the column. The column itself still displays the original string labels, but the codes are stored internally and can be retrieved via cat.codes. You can control the order in which codes are assigned by passing a list of category names to the categories parameter of the pandas.Categorical() function. For example, to assign the categories in the order given in your question ('Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle'), you can use the following code:
df_train['aus_heiz_befeuerung'] = pd.Categorical(df_train['aus_heiz_befeuerung'], categories=['Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle'])
After you have converted the values in the column to categorical values, you can use the cat.codes property to access the integer values that have been assigned to each category. For example:
df_train['aus_heiz_befeuerung'].cat.codes
This will return a pandas.Series object containing the integer values that have been assigned to each category in the aus_heiz_befeuerung column.
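Putting that together, here is a minimal end-to-end sketch (the toy frame below stands in for df_train; note that cat.codes maps missing values to -1, which also sidesteps the kind of isnan TypeError that LabelEncoder tends to raise on object columns containing NaN):
import pandas as pd

# Toy frame standing in for df_train (values are hypothetical)
df_train = pd.DataFrame(
    {'aus_heiz_befeuerung': ['Gas', 'Unbekannt', 'Öl', None, 'Gas']})

df_train['aus_heiz_befeuerung'] = pd.Categorical(
    df_train['aus_heiz_befeuerung'],
    categories=['Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle'])

# cat.codes yields 0-based integer codes; missing values become -1
codes = df_train['aus_heiz_befeuerung'].cat.codes
print(codes.tolist())  # [0, 1, 3, -1, 0]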

Related

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator) and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic doesn't contain a translation, but the same English value (e.g. for the language column both records might have ['ger'] which becomes ['ger', 'ger']) so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot control the input values, you need to fix them somehow.
Something like this: here, I am converting the string values in subject_name_namePart into lists of strings.
from ast import literal_eval

# Rows whose value is a bare string (does not start with '[')
mask = df.subject_name_namePart.str[0] != '['
# Wrap bare strings so every value looks like a stringified list
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
# Parse the stringified lists into real Python lists
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then you can explode and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
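If every value column may mix bare strings and stringified lists, a generalized version of the same idea could look like this (to_list and merge_unique are hypothetical helpers, not pandas API; this assumes location_shelfLocator is the grouping key, as in your sample):
from ast import literal_eval

def to_list(value):
    # Parse "['a', 'b']"-style strings into lists; pass real lists
    # through; wrap anything else (bare strings) in a one-element list
    if isinstance(value, list):
        return value
    if isinstance(value, str) and value.startswith('['):
        return literal_eval(value)
    return [value]

def merge_unique(series):
    # Flatten the lists within a group and drop duplicates, keeping order
    flat = [item for lst in series for item in lst]
    return list(dict.fromkeys(flat))

value_cols = [c for c in df.columns if c != 'location_shelfLocator']
for col in value_cols:
    df[col] = df[col].apply(to_list)

df_merged = df.groupby('location_shelfLocator').agg(merge_unique)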

Selecting two sets of columns from a dataFrame with all rows

I have a dataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns 0-11 and 16-27 (the slices 0:12 and 16:); that is, I don't want columns 12-15.
I wrote the following code, but it doesn't work and throws a syntax error at : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways for selecting these rows, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
This seems to return an incorrect result: I have already treated the NaN values of df, but when I use this code, X includes lots of NaNs!
Thanks for your feedback in advance.
You can use np.r_ to concatenate the slices (note that np.r_ needs explicit endpoints, so the open-ended 16: becomes 16:28 for your 28-column frame):
x = df.iloc[:, np.r_[0:12, 16:28]]
iloc has these allowed inputs (from the docs):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
What you're writing in X = df.iloc[:,[0:12,16:]] is not a list of integers or a slice of ints; in fact it is not valid Python at all, because slice notation like 0:12 is only allowed directly inside subscript brackets, not inside a list literal, which is why you get a SyntaxError. You need to turn those ranges into a single array of integers, and a convenient way to do that is the numpy.r_ indexer:
X = df.iloc[:, np.r_[0:12, 16:28]]
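As an aside, the reason your X = X + df.iloc[:,16:] workaround filled X with NaNs is that + performs element-wise addition aligned on row and column labels, not concatenation: cells that exist in only one operand come out as NaN. If you want the concatenation behavior explicitly, a short sketch would be:
import pandas as pd

# Concatenate the two column blocks side by side instead of adding them
X = pd.concat([df.iloc[:, 0:12], df.iloc[:, 16:]], axis=1)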

Check multiple columns for multiple values and return a dataframe

I have a list of strings, and my dataframe has several columns that I need to search (each of type object).
I need to return all rows where any of the selected columns contains any of the string items, in whole or in part.
How do I check whether 4 columns in my dataframe contain any one of the items in the list of strings? The string inside the column may contain part of a string provided in the list, but probably won't contain all of it.
I've tried list both as a tuple and as a Python list:
list = ("25110", "25910", "25990", "30110", "33110", "43999")
new_df = df.loc[(df['column1'].isin(list))
| (df['column2'].isin(list))
| (df['column3'].isin(list))
| (df['column4'].isin(list))]
When I run new_df.shape, I get (0, 12).
I'm new to pandas, have a mountain of analysis to do for an intense uni project, and can't get this to work. Do I need to convert each column to a string datatype first? (I've actually already tried that as well, but each dtype is still stubbornly 'object'.)
IIUC:
try:
lst = ["25110", "25910", "25990", "30110", "33110", "43999"]
cols = ['column1', 'column2', 'column3', 'column4']
Finally:
m = df[cols].astype(str).agg(lambda x: x.str.contains('|'.join(lst)), axis=1).any(axis=1)
# you can also use apply() in place of agg()
df[m]
# OR
df.loc[m]
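For context: .isin() tests whole-cell equality, which is why your original code returned zero rows when the codes only appear as substrings; str.contains does the partial match. A self-contained sketch with toy data (the column names and cell values below are made up):
import pandas as pd

df = pd.DataFrame({
    'column1': ['code 25110 xyz', 'nothing here'],
    'column2': ['-', '-'],
    'column3': ['-', 'ref 43999b'],
    'column4': ['-', '-'],
})
lst = ["25110", "25910", "25990", "30110", "33110", "43999"]
cols = ['column1', 'column2', 'column3', 'column4']

# True wherever any of the four columns contains any search string
pattern = '|'.join(lst)
m = df[cols].astype(str).apply(lambda c: c.str.contains(pattern)).any(axis=1)
print(df[m])  # both rows match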

Converting multiple types of values to integer values

I have a Pandas DF in which I want to convert column values to integer values. The information should be stored in meters but can be stored in kilometers as well, resulting in the following possible values:
23145 (correct)
23145.0 (.0 should be removed)
101.1 (should be multiplied *1000)
47,587 (should be multiplied *1000)
'No value known'
I tried different options for converting data types, but I always seem to break the existing integers, and I cannot check for them correctly because the dtype is 'object'. Sometimes faulty values or strings block the conversion as well.
Any ideas how to check whether a value is already an integer (and leave it alone), remove the .0 where applicable, and multiply where applicable?
I also have some other columns with integers (e.g. 22321323) where a .0 is randomly appended (e.g. 22321323.0). How can I convert these values so they don't include the .0?
If you use .apply on the column, you can easily convert these values while casing on their type. For example:
import pandas as pd

def convert(x):
    if isinstance(x, int):
        return x
    elif isinstance(x, float):
        return int(x)
    else:
        # Defaults to 0 when not convertible
        print(x)
        return 0

df = pd.DataFrame({'col': [23145, 23145.0, 'No value known']})
df['col'] = df['col'].apply(convert)
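The convert function above only distinguishes int from float. To also cover the kilometer and decimal-comma cases from your list, one possible extension is sketched below; it assumes floats with a fractional part and comma-strings are kilometers, which your examples suggest but the question does not guarantee:
def to_meters(x):
    if isinstance(x, int):
        return x                              # already meters
    if isinstance(x, float):
        # 23145.0 -> 23145 (meters); 101.1 -> 101100 (kilometers)
        return int(x) if x.is_integer() else round(x * 1000)
    if isinstance(x, str):
        try:
            # '47,587' -> 47.587 km -> 47587 m (decimal-comma assumption)
            return round(float(x.replace(',', '.')) * 1000)
        except ValueError:
            return 0                          # e.g. 'No value known'
    return 0

df['col'] = df['col'].apply(to_meters)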

Group by multiple columns creating new column in pandas dataframe

I have a pandas dataframe with two columns: ['company'], which is a string, and ['publication_datetime'], which is a datetime.
I want to group by company and publication date, adding a new column with the maximum publication_datetime for each record.
So far I have tried:
issuers = news[['company','publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max()
My group by does not appear to work; I get the following error:
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to broadcast the result back to the original shape of the dataframe:
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
Your groupby() was returning a DataFrame with three columns (the two group keys plus the aggregated values, because of as_index=False), which is why pandas complains about 3 items being placed where 1 is expected. And even without the key columns, the aggregation collapses each group to a single value, so you would have fewer values than rows.
The transform() method instead returns the group result for each row of the dataframe, which makes it easy to create a new column: the returned values form a Series whose index matches the original issuers dataframe.
Hope this helps! See the pandas documentation for transform.
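A minimal, self-contained illustration (the companies and timestamps below are made up):
import pandas as pd

news = pd.DataFrame({
    'company': ['A', 'A', 'B'],
    'publication_datetime': pd.to_datetime(
        ['2023-01-01 09:00', '2023-01-01 17:00', '2023-01-02 08:00']),
})
issuers = news[['company', 'publication_datetime']].copy()
issuers['publication_date'] = issuers['publication_datetime'].dt.date

# transform('max') yields one value per original row, so it assigns
# cleanly as a new column
issuers['publication_datetime_max'] = issuers.groupby(
    ['company', 'publication_date'])['publication_datetime'].transform('max')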
The thing is, by doing what you are doing, you are trying to assign a whole DataFrame to a single column.
The following extracts only the values, without the two key columns:
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'])['publication_datetime'].max().tolist()
Note that this only lines up if each (company, publication_date) group corresponds to exactly one row; when groups span multiple rows, the list is shorter than the dataframe and the transform() approach above is the robust one.
Hope this helps!