Panda DataFrame Combine Unique Values in Two Column for OrdinalEncoder Fit - pandas

I have Titanic dataset and columns in a dataframe I would like to use are 'Embarked' and 'Sex'.
df['Embarked'] and df['Sex'] have Unique value: Embarked['C','Q','S'] and Sex['male','female']
What I would like to do is create a list like below:
[['S','female'],['S','male'],['C','female'],['c','male'],['Q','female'],['Q','male']]
I need unique value combination in a list format so that I can pass to OrdinalEncoder to fit.
Scikit Learn OrdinalEncoder example:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)
enc.categories_
enc.transform([['Female', 3], ['Male', 1],['Female',2],['Male',3]])
encoder transform only takes list

If what you'd like is to find the product from the unique values of two columns in a dataframe then turn them into a list, then this will do that!
import pandas as pd
from itertools import product
data = pd.DataFrame([['Q', 'male'], ['Q', 'male'], ['S', 'female'],
['S', 'female'], ['S', 'male'], ['C', 'female'],
['C', 'female'], ['C', 'male'], ['C', 'male']],
columns=['Embarked', 'Sex'])
print([list(x) for x in product(data['Embarked'].unique(), data['Sex'].unique())])
itertools.product gives you the Cartesian Product of a sequence of iterables. Our iterables here are lists created by calling Series.unique() on each of the DataFrame's columns to get their unique values.
Finally, the list comprehension turns itertools.product's typical return of a list of tuples into a list of lists.

a way of doing it is:
list_1 = ['C','Q','S']
list_2 = ['male','female']
X = [[x, y] for x in list_1 for y in list_2]

Related

how to add dataframe columns based on value of loop

I have a dataframe in python, called df. It contains two variables, Name and Age. I want to do a loop in python to generate 10 new column dataframes, called Age_1, Age_2, Age_3....Age_10 which contain the values of Age.
So far I have tried:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
for i in range(1,11):
df[Age_'i'] = df['Age']
Just use this for loop:
for x in range(0,11):
df['Age_'+str(x)]=df['Age']
OR
for x in range(0,11):
df['Age_{}'.format(x)]=df['Age']
OR
for x in range(0,11):
df['Age_%s'%(x)]=df['Age']
Now if you print df you will get your desired output:
you can use .assign and ** unpacking.
df.assign(**{f'Age_{i}' : df['Age'] for i in range(11)})

How to concatenate values from multiple rows using Pandas?

In the screenshot, 'Ctrl' column contains a key value. I have two duplicate rows for OTC-07 which I need to consolidate. I would like to concat the rest of column values for OTC-07. i.e, OTC-07 should have Type A,B and Assertion a,b,c,d after consolidation.. Can anyone help me on this? :o
First define a dataframe of given structure:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Ctrl': ['OTC-05', 'OTC-06', 'OTC-07', 'OTC-07', 'OTC-08'],
'Type': ['A', 'A', 'A', 'B', np.NaN],
'Assertion': ['a,b,c', 'c,b', 'a,c', 'b,c,d', 'a,b,c']
})
df
Output:
Then replace NaN values with empty strings:
df = df.replace(np.NaN, '', regex=True)
Then group by column 'Ctrl' and aggregate columns 'Type' and 'Assertion'. Please not that assertion aggregation is a bit tricky as you need not a simple concatenation, but sorted list of unique letters:
df.groupby(['Ctrl']).agg({
'Type': ','.join,
'Assertion': lambda x: ','.join(list(sorted(set(','.join(x).split(',')))))
})
Output:

Does sklearn use pandas index as a feature?

I'm passing a pandas DataFrame containing various features to sklearn and I do not want the estimator to use the dataframe index as one of the features. Does sklearn use the index as one of the features?
df_features = pd.DataFrame(columns=["feat1", "feat2", "target"])
# Populate the dataframe (not shown here)
y = df_features["target"]
X = df_features.drop(columns=["target"])
estimator = RandomForestClassifier()
estimator.fit(X, y)
No, sklearn doesn't use the index as one of your feature. It essentially happens here, when you call the fit method the check_array function will be applied. And now if you dig deep into the check_array function, you can find that you are converting your input into array using np.array function which essentially strips the indices from your dataframe as shown below:
import pandas as pd
import numpy as np
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df
Name Age
0 tom 10
1 nick 15
2 juli 14
np.array(df)
array([['tom', 10],
['nick', 15],
['juli', 14]], dtype=object)

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame, which contains 610 rows, and every row contains a nested list of coordinate pairs, it looks like that:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from this question Split nested array values from Pandas Dataframe cell over multiple rows but unfortunately that didn't work.
Please any ideas? Many thanks in advance!
Here my new answer to your problem. I used "reduce" to flatten your nested array and then I used "itertools chain" to turn everything into a 1d list. After that I reshaped the list into a 2d array which allows you to convert it to the dataframe that you need. I tried to be as generic as possible. Please let me know if there are any problems.
#libraries
import operator
from functools import reduce
from itertools import chain
#flatten lists of lists using reduce. Then turn everything into a 1d list using
#itertools chain.
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat,
geometry_list)))
#reshape the coordinates 1d list to a 2d and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
One thing you can do is use numpy. It allows you to perform a lot of list/ array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to pandas dataframe.
For example,
import numpy as np
#your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999],[6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]
#convert list to array
coordinate_array = numpy.array(coordinate_list)
#print shape of array
coordinate_array.shape
#reshape array into pairs of
reshaped_array = np.reshape(coordinate_array, (3, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np
data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
new_data.append([item])
df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())
def expand(row):
row['x'] = row.coord[0]
row['y'] = row.coord[1]
return row
df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
coord
0 [0, 1]
1 [2, 3]
2 [4, 5]
3 [6, 7]
4 [8, 9]
x y
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9

Seaborn groupby pandas Series

I want to visualize my data into box plots that are grouped by another variable shown here in my terrible drawing:
So what I do is to use a pandas series variable to tell pandas that I have grouped variables so this is what I do:
import pandas as pd
import seaborn as sns
#example data for reproduciblity
a = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
])
#converting second column to Series
a.ix[:,1] = pd.Series(a.ix[:,1])
#Plotting by seaborn
sns.boxplot(a, groupby=a.ix[:,1])
And this is what I get:
However, what I would have expected to get was to have two boxplots each describing only the first column, grouped by their corresponding column in the second column (the column converted to Series), while the above plot shows each column separately which is not what I want.
A column in a Dataframe is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should only pass that to Seaborn.
So:
#example data for reproduciblity
df = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit, giving columns a label makes it a bit more clear in my opinion.
edit:
If you want to plot all columns separately you (i think) basically want all combinations of the values in your groupby column and any other column. So if you Dataframe looks like this:
a b grouper
0 2 5 1
1 4 9 2
2 5 3 1
3 10 6 2
4 9 7 2
5 3 11 1
And you want boxplots for columns a and b while grouped by the column grouper. You should flatten the columns and change the groupby column to contain values like a1, a2, b1 etc.
Here is a crude way which i think should work, given the Dataframe shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are more fancy ways of restructuring the Dataframe. Especially the flattening of the hierarchy after pivoting is hard to read, i dont like it.
This is a new answer for an old question because in seaborn and pandas are some changes through version updates. Because of this changes the answer of Rutger is not working anymore.
The most important changes are from seaborn==v0.5.x to seaborn==v0.6.0. I quote the log:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1] ,[4, 2],[5, 1],
[10, 2],[9, 2],[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn with x and y as parameter
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
# usinge pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
Example 3: rearanging the DataFrame to pass is directly to seaborn
def df_rename_by_group(data:pd.DataFrame, col:str)->pd.DataFrame:
'''This function takes a DataFrame, groups by one column and returns
a new DataFrame where the old columnnames are extended by the group item.
'''
grouper = df.groupby(col)
max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
_df = pd.DataFrame(index=range(max_length_of_group))
for i in grouper.groups.keys():
helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
helper.reset_index(drop=True, inplace=True)
_df = _df.join(helper)
return _df
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.
sns.boxplot() doesnot take groupby.
Probably you are gonna see
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
The best idea to group data and use in boxplot passing the data as groupby dataframe value.
import seaborn as sns
grouDataFrame = nameDataFrame(['A'])['B'].agg(sum).reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)
Here B column data contains numeric value and grouped is done on the basis of A. All the grouped value with their respective column are added and boxplot diagram is plotted. Hope this helps.