Making a dataframe with columns as subsets of another dataframe's columns - pandas

Suppose I have a dataframe df, and it has columns with names 'a', 'b', 'c', 'd', 'e'. Then I make all combinations of length three (order doesn't matter) from this list to generate the following list of lists:
Combinations_of_3 = [['a','b','c'], ['a','b','d'], ..., ['c','d','e']]
Now I wish to create a for loop to populate a second data frame, and to do this I want to loop over Combinations_of_3 and use the current entry to select the corresponding columns of df.
For example, if I wanted to select only the 'a', 'b' and 'e' columns of df, I would normally write df[['a','b','e']]; but now I would like to do this in a for loop using Combinations_of_3. I'm writing this code using pandas / python. Thank you.

Just do as you described, using a variable:
Combinations_of_3 = [['a','b','c'], ['a','b','d'], ['c','d','e']]
for cols in Combinations_of_3:
    # do something with the selected columns
    print(df[cols])
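NB. To create Combinations_of_3 you could use itertools. Note that combinations yields tuples, and df[('a','b','c')] would be treated as a single column key, so convert each tuple to a list:
from itertools import combinations
Combinations_of_3 = [list(c) for c in combinations(df.columns, r=3)]
# or lazily, converting inside the loop
for cols in combinations(df.columns, r=3):
    print(df[list(cols)])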
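To make the "populate a second data frame" step concrete, here is a minimal sketch. The example data and the row-wise sum are assumptions for illustration only; the question does not say what should be computed for each combination.

```python
from itertools import combinations

import pandas as pd

# Hypothetical example data with the five columns from the question.
df = pd.DataFrame({
    'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8], 'e': [9, 10],
})

# combinations() yields tuples; convert each to a list so that
# df[cols] selects several columns instead of looking up one tuple key.
combinations_of_3 = [list(c) for c in combinations(df.columns, 3)]

# Populate a second dataframe: one column per combination, holding the
# row-wise sum of its three columns (an arbitrary stand-in computation).
result = pd.DataFrame({
    '+'.join(cols): df[cols].sum(axis=1) for cols in combinations_of_3
})
print(result)
```

Building the dict first and constructing the dataframe once tends to be cheaper than adding columns one at a time inside the loop.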

Related

pandas df creation of list within a function to be used outside the function

I want to create a list of the first three values in a column of a df, but this df is created within a function that will be called several times with different input variables. Every time I call this function, I want the new first three values to be added to the list of old first three values. Then I would like to be able to use this list outside the function, as an input list when calling a different function.
So within the function, with the first call, the df that is created is like below:
col1 col2
A 1
B 2
C 3
D 4
And the list should look like this:
['A', 'B', 'C']
then with the next iteration with changed input variable, the table will look like this
col1 col2
E 5
F 6
G 7
H 8
I 9
then the list should look like this:
['A', 'B', 'C', 'E', 'F', 'G']
then I should be able to use this list outside this function (as an input for a different function). Could someone please help me with this? Thanks in advance for your help
You can collect your column as a list:
my_list = df['col1'].tolist()
then take the first three elements:
selected_items = my_list[:3]
then concatenate them with your previous list:
previous_list = previous_list + selected_items
Obviously, previous_list should be initialized (for example as an empty list) before the first call.
You can repeat this process at each new iteration.
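Put together, the steps might look like the sketch below. The function name build_df and its arguments are hypothetical stand-ins for the asker's real function; the list is created outside the function and passed in, so it remains available afterwards.

```python
import pandas as pd

collected = []  # initialised once, outside the function


def build_df(labels, collected):
    """Hypothetical stand-in for the function that creates the dataframe."""
    df = pd.DataFrame({'col1': labels, 'col2': range(1, len(labels) + 1)})
    # Extend the shared list in place with the first three values of col1.
    collected.extend(df['col1'].head(3).tolist())
    return df


build_df(list('ABCD'), collected)
build_df(list('EFGHI'), collected)
print(collected)  # ['A', 'B', 'C', 'E', 'F', 'G']
```

Because list.extend mutates the list in place, no global statement or return value is needed to see the accumulated values outside the function.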

split content of a column pandas

I have the following Pandas Dataframe
Which can also be generated using this list of dictionaries:
list_of_dictionaries = [
{'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
{'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
{'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
{'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]
I have implemented kind of what I need, but adding columns vertically:
df['people_id1']=[x[0] for x in df['people_ids'].tolist()]
df['people_id2']=[x[1] for x in df['people_ids'].tolist()]
That gives me a separate column for each people_id, but only up to the second element: when I try to extract a 3rd element into a third column, it crashes because there is no 3rd element in the first row.
What I am actually trying to do is extract every people_id from the people_ids column, each with its associated values from the Project and Hours columns, so I get a dataset like this one:
Any idea on how could I get this output?
I think what you are looking for is explode on 'people_ids' column.
df = df.explode('people_ids', ignore_index=True)
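As a quick check of what explode produces on the data from the question, here is a small sketch. The rename to people_id is an optional extra, not part of the answer, and ignore_index requires pandas 1.1 or later.

```python
import pandas as pd

list_of_dictionaries = [
    {'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
    {'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
    {'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
    {'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]},
]
df = pd.DataFrame(list_of_dictionaries)

# One output row per list element; Project and Hours are repeated alongside.
long_df = df.explode('people_ids', ignore_index=True)
long_df = long_df.rename(columns={'people_ids': 'people_id'})
print(long_df)
```

The exploded column has object dtype; if integers are needed, something like long_df['people_id'].astype('int64') can follow.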

Get names of dummy variables created by get_dummies

I have a dataframe with a very large number of columns of different types. I want to encode the categorical variables in my dataframe using get_dummies(). The question is: is there a way to get the column headers of the encoded categorical columns created by get_dummies()?
The hard way to do this would be to extract a list of all categorical variables in the dataframe, then append the different text labels associated to each categorical variable to the corresponding column headers. I wonder if there is an easier way to achieve the same end.
I think the way that should work with all the different uses of get_dummies would be:
#example data
import pandas as pd
df = pd.DataFrame({'P': ['p', 'q', 'p'], 'Q': ['q', 'p', 'r'],
                   'R': [2, 3, 4]})
dummies = pd.get_dummies(df)
#get column names that were not in the original dataframe
new_cols = dummies.columns[~dummies.columns.isin(df.columns)]
new_cols gives:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
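get_dummies leaves the non-categorical columns unchanged and, in current pandas versions, places them before the new dummy columns. In this example only one column ('R') is preserved, so you could also just take the column names after the first column:
dummies.columns[1:]
which on this test data gives the same result (with more preserved columns, the slice would have to start later):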
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
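To illustrate why comparing against the original column names is the more general approach, here is a sketch with two numeric columns. The extra column 'S' is an assumption added for the example.

```python
import pandas as pd

# Same idea as above, but with two numeric columns ('R' and 'S').
df = pd.DataFrame({
    'P': ['p', 'q', 'p'],
    'R': [2, 3, 4],
    'Q': ['q', 'p', 'r'],
    'S': [0.1, 0.2, 0.3],
})
dummies = pd.get_dummies(df)

# Keep only the names that did not exist in the original dataframe.
new_cols = [c for c in dummies.columns if c not in df.columns]
print(new_cols)

# Slicing off just the first column would wrongly include 'S' here.
print(list(dummies.columns[1:]))
```

The comparison works however many columns are preserved, whereas a positional slice depends on knowing that count in advance.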

pandas: appending a row to a dataframe with values derived using a user defined formula applied on selected columns

I have a dataframe as
df = pd.DataFrame(np.random.randn(5,4),columns=list('ABCD'))
I can use the following for standard calculations like mean(), sum(), etc.
df.loc['calc'] = df[['A','D']].iloc[2:4].mean(axis=0)
Now I have two questions
How can I apply a formula (like exp(mean()) or 2.5*mean()/sqrt(max())) to columns 'A' and 'D' for rows 2 to 4?
How can I append a row to the existing df where two values would be the mean() of A and D, and two values would be the result of a specific formula applied to C and B?
Q1:
You can use .apply() and lambda functions.
df.iloc[2:4,[0,3]].apply(lambda x: np.exp(np.mean(x)))
df.iloc[2:4,[0,3]].apply(lambda x: 2.5*np.mean(x)/np.sqrt(max(x)))
Q2:
You can use dictionaries and combine them and add it as a row.
First one is mean, the second one is some custom function.
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum()*45))
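Combine them and append the result as a new row (DataFrame.append was removed in pandas 2.0, so use pd.concat):
ad.update(bc)
df = pd.concat([df, pd.DataFrame([ad])], ignore_index=True)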
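Both parts can be combined into one runnable sketch. The seeded random data and the formula used for B and C are assumptions for illustration. Note that iloc[2:4] selects positions 2 and 3 only, because the end of the slice is exclusive.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is reproducible
df = pd.DataFrame(rng.standard_normal((5, 4)), columns=list('ABCD'))

rows = df.iloc[2:4]  # positions 2 and 3

# Q1: a custom formula applied column-wise to A and D.
exp_mean = rows[['A', 'D']].apply(lambda s: np.exp(s.mean()))

# Q2: mean for A and D, a different (arbitrary) formula for B and C.
new_row = {
    **rows[['A', 'D']].mean().to_dict(),
    **rows[['B', 'C']].apply(lambda s: 2.5 * s.abs().max()).to_dict(),
}

# pd.concat aligns the new row on column names, so key order is irrelevant.
out = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(out)
```

The new row is computed from rows before it is appended, so it does not feed back into its own statistics.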

Pandas dynamic column creation

I am attempting to dynamically create a new column based on the values of another column.
Say I have the following dataframe
A|B
11|1
22|0
33|1
44|1
55|0
I want to create a new column.
If the value of column B is 1, insert 'Y' else insert 'N'.
The resulting dataframe should look like this:
A|B|C
11|1|Y
22|0|N
33|1|Y
44|1|Y
55|0|N
I could do this by iterating through the column values,
values = []
for i in dataframe['B'].values:
    if i == 1:
        values.append('Y')
    else:
        values.append('N')
dataframe['C'] = values
However I am afraid this will severely reduce performance especially since my dataset contains 500,000+ rows.
Any help will be greatly appreciated.
Thank you.
Avoid chained indexing by using loc. There are some subtleties with returning a view versus a copy in pandas that are related to numpy:
df['C'] = 'N'
df.loc[df.B == 1, 'C'] = 'Y'
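Try this:
df['C'] = 'N'
df['C'][df['B']==1] = 'Y'
This is vectorized, so it should be faster than the loop. Note, however, that it uses chained assignment, which triggers SettingWithCopyWarning in older pandas and does not modify df under copy-on-write (the default from pandas 3.0), so the single loc assignment is the safer form.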
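Another vectorized option, not mentioned in the answers above, is numpy.where, which builds the whole column in one step. A minimal sketch using the data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33, 44, 55], 'B': [1, 0, 1, 1, 0]})

# 'Y' where B equals 1, 'N' everywhere else, evaluated for all rows at once.
df['C'] = np.where(df['B'] == 1, 'Y', 'N')
print(df)
```

This avoids both the Python-level loop and any chained assignment, which matters with 500,000+ rows.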