pandas df creation of list within a function to be used outside the function

I want to create a list of the first three values in a column of a df, but this df is created within a function that will be called several times with different input variables. Every time I call this function, I want the new first three values to be appended to the list of old first three values. Then I would like to be able to use this list outside the function, as an input list when calling a different function.
So within the function, on the first call, the df that is created looks like this:
col1 col2
A 1
B 2
C 3
D 4
And the list should look like this:
['A', 'B', 'C']
Then on the next iteration, with a changed input variable, the table will look like this:
col1 col2
E 5
F 6
G 7
H 8
I 9
then the list should look like this:
['A', 'B', 'C', 'E', 'F', 'G']
Then I should be able to use this list outside the function (as an input for a different function). Could someone please help me with this? Thanks in advance for your help.

You can collect your column as a list:
my_list = df['col1'].tolist()
then take the first three elements:
selected_items = my_list[:3]
then concatenate with your previous list:
previous_list = previous_list + selected_items
Obviously previous_list should be initialized beforehand. You can repeat this process at each new iteration.
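Putting it all together, a minimal runnable sketch; the toy DataFrame built inside the function just stands in for however your real function builds its df:

import pandas as pd

collected = []  # initialized once, outside the function

def collect_first_three(values):
    # stand-in for however the real function builds its df from its inputs
    df = pd.DataFrame({'col1': values, 'col2': range(1, len(values) + 1)})
    # append only the first three values of col1 to the shared list
    collected.extend(df['col1'].tolist()[:3])
    return df

collect_first_three(list('ABCD'))
collect_first_three(list('EFGHI'))
print(collected)  # ['A', 'B', 'C', 'E', 'F', 'G']

Because collected is mutated in place with extend rather than reassigned, no global statement is needed, and the list can then be passed to any other function.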

Related

split content of a column pandas

I have the following Pandas DataFrame, which can also be generated using this list of dictionaries:
list_of_dictionaries = [
{'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
{'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
{'Project': 'C', 'Hours': 3, 'people_ids': [17097009, 17530240, 17530242, 17543865, 17584457, 17595079]},
{'Project': 'D', 'Hours': 2, 'people_ids': [17097009, 17584457, 17702185]}]
I have implemented kind of what I need, but adding columns vertically:
df['people_id1']=[x[0] for x in df['people_ids'].tolist()]
df['people_id2']=[x[1] for x in df['people_ids'].tolist()]
And then I get a separate column for each people_id, but only up to the second element, because when I add the extraction of the 3rd element into a third column, it crashes: there is no 3rd element to extract from the first row.
However, what I am trying to do is to extract every people_id from the people_ids column, so that each one keeps its associated values from the Project and Hours columns, giving a dataset like this one:
Any idea on how could I get this output?
I think what you are looking for is explode on 'people_ids' column.
df = df.explode('people_ids', ignore_index=True)
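For a complete, runnable sketch using the list of dictionaries above (output shown as comments, truncated):

import pandas as pd

list_of_dictionaries = [
    {'Project': 'A', 'Hours': 2, 'people_ids': [16986725, 17612732]},
    {'Project': 'B', 'Hours': 2, 'people_ids': [17254707, 17567393, 17571668, 17613773]},
]

df = pd.DataFrame(list_of_dictionaries)
df = df.explode('people_ids', ignore_index=True)  # one row per people_id
print(df)
#   Project  Hours people_ids
# 0       A      2   16986725
# 1       A      2   17612732
# 2       B      2   17254707
# ...

Each element of a people_ids list gets its own row, with the Project and Hours values repeated alongside it.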

Making a dataframe with columns as subsets of another dataframe's columns

Suppose I have a dataframe df, and it has columns with names 'a', 'b', 'c', 'd', 'e'. Then I make all combinations of length three (order doesn't matter) from this list to generate the following list of lists:
Combinations_of_3 = [['a','b','c'], ['a','b','d'], ..., ['c','d','e']]
Now I wish to create a for loop to populate a second data frame, and to do this I want to loop over Combinations_of_3 and use the current entry to select the corresponding columns of df.
For example, if I wanted to select only the 'a', 'b' and 'e' columns of df, I would normally write df[['a','b','e']]; but now I would like to do this in a for loop using Combinations_of_3. I'm writing this code using pandas / python. Thank you.
Just do as you described, using a variable:
Combinations_of_3 = [['a','b','c'], ['a','b','d'], ['c','d','e']]
for cols in Combinations_of_3:
    # do something with the current subset
    print(df[cols])
NB. To create Combinations_of_3 you could use the following (converting each tuple to a list, so that df[cols] selects several columns rather than looking up a single tuple key):
from itertools import combinations
Combinations_of_3 = [list(c) for c in combinations(df.columns, 3)]
# or using a generator
Combinations_of_3 = (list(c) for c in combinations(df.columns, 3))
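Since the question is about populating a second data frame from these subsets, a small runnable sketch may help; the aggregation used here (summing each 3-column subset) is only a placeholder for whatever you actually compute:

import pandas as pd
from itertools import combinations

df = pd.DataFrame({c: range(5) for c in 'abcde'})

results = {}
for cols in combinations(df.columns, 3):
    subset = df[list(cols)]                       # select the three columns
    results['+'.join(cols)] = subset.sum(axis=1)  # placeholder computation

second_df = pd.DataFrame(results)                 # one column per combination
print(second_df.head())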

How to pass function parameters if using one or more parameters

Thank you in advance for your assistance.
# Create df.
import pandas as pd
d = {'dep_var': pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd']),
     'one': pd.Series([9, 23, 37, 41], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([1, 6, 5, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
   dep_var  one  two
a       10    9    1
b       20   23    6
c       30   37    5
d       40   41    4
# Define function.
def df_two(dep_var, ind_var_1, ind_var_2):
    global two
    data = {
        dep_var: df[dep_var],
        ind_var_1: df[ind_var_1],
        ind_var_2: df[ind_var_2]
    }
    two = pd.DataFrame(data)
    return two
# Execute function.
df_two("dep_var", "one", "two")
   dep_var  one  two
a       10    9    1
b       20   23    6
c       30   37    5
d       40   41    4
Works perfectly. I'm fairly new at this, and I'd like to be able to use a single function when passing, say, three or four parameters; of course, with the above code I get an error message when I add a third independent variable.
So, rookie move, I define another function with 3 parameters.
def df_three(dep_var, ind_var_1, ind_var_2, ind_var_3):
    global three
    data = {
        dep_var: df[dep_var],
        ind_var_1: df[ind_var_1],
        ind_var_2: df[ind_var_2],
        ind_var_3: df[ind_var_3]
    }
    three = pd.DataFrame(data)
    return three
I've tried *args, **kwargs, mapping and a host of other things with no luck. My sense is that I'm close, but I need a way to tell the function that there might be one, two, or three parameters, and then map those one, two, or three parameters to the created dataframe.
Use argument unpacking with *args:
def foo(dep_var, *args):
    global df
    data = {dep_var: df[dep_var]}
    for a in args:
        data[a] = df[a]
    return pd.DataFrame(data)
And then you can call
foo('dep_var', 'one')
foo('dep_var', 'one', 'two')
To eliminate the need for the global variable, I'd pass df to the function as well:
def foo(df, dep_var, *args):
    data = {dep_var: df[dep_var]}
    for a in args:
        data[a] = df[a]
    return pd.DataFrame(data)
More information on *args.
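For a quick check against the df built above, calling the second version should give the following (output shown as comments):

result = foo(df, 'dep_var', 'one', 'two')
print(result)
#    dep_var  one  two
# a       10    9    1
# b       20   23    6
# c       30   37    5
# d       40   41    4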
It sounds like you want to select only some columns from a data frame, in a certain order. You can just pass a list of the column names for that:
two[["dep_var", "one", "two"]]
If you want to, you can pack that into a function, using tuple unpacking to have a variable number of arguments.
def select(df, *columns):
    return df[list(columns)]
This should directly work with your use cases:
select(two, "dep_var", "one", "two")
select(three, "dep_var", "one", "two", "three")
Note that I also passed the data frame variable, so you don't need to rely on a global variable.
The call to list is needed, because tuple unpacking produces, well, a tuple. And using a tuple as an index to the data frame produces different results than using a list.
You might want to append a .copy() to the return line, depending on how you use the return value of this.
A variable number of arguments also includes zero, so you might want to add a check for that.
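A minimal sketch of both suggestions combined (raising a ValueError here is just one possible way to handle an empty selection):

def select(df, *columns):
    if not columns:
        raise ValueError('select() needs at least one column name')
    return df[list(columns)].copy()  # .copy() per the note above

select(df, 'dep_var', 'one')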

Advanced condition lookup in pandas(numpy)

given:
a list of elements 'ls' and a big df 'df'; all the elements of 'ls' are in 'df'.
ls = ['a0','a1','a2','b0','b2','c0',...,'c_k']
df = [['a0','b0','c0'],
['a0','b0','c1'],
['a0','b0','c2'],
...
['a_i','b_j','c_k']]
goal:
I want to collect the set of rows of 'df' that contain the most elements of 'ls'; for example, ['a0','b0','c0'] would be the best one. But some rows contain at most 2 of the elements.
tried:
I tried enumerating combinations of 3 or 2 elements of 'ls', but it was too expensive and could return nothing, since some rows contain only 2 of the elements.
I also tried to use a dictionary to count, but it didn't work either.
I've been puzzling over this problem all day, any help will be greatly appreciated.
I would go like this:
row_id = df.apply(lambda x: x.isin(ls).sum(), axis=1).idxmax()
This will give you the index of the row with the most entries from the list.
The desired row can then be obtained with:
df.loc[row_id]
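A small runnable sketch of the same idea, on toy data matching the example above:

import pandas as pd

ls = ['a0', 'b0', 'c0', 'a1']
df = pd.DataFrame([['a0', 'b0', 'c0'],
                   ['a0', 'b0', 'c1'],
                   ['a2', 'b3', 'c2']], columns=['A', 'B', 'C'])

counts = df.apply(lambda row: row.isin(ls).sum(), axis=1)  # matches per row
print(df.loc[counts.idxmax()])       # row 0: all three values are in ls

# to keep every row tied for the maximum, not just the first one:
print(df[counts == counts.max()])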

Error in using Pandas groupby.apply to drop duplication

I have a Pandas data frame which has some duplicate values, not rows. I want to use groupby.apply to remove the duplication. An example is as follows.
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
   A  B  C
0  a  1  1
1  a  1  2
2  b  1  1
# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df
df = df.groupby('A').apply(get_uniq_t)
Then I get the following ValueError. The issue seems to have to do with creating the new column D: if I create column D outside the function, the code runs fine. Can someone help explain what caused the error?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify the original group.
The other problem is that this function should return a single row, not a DataFrame.
Change your function to:
def get_uniq_t(df):
    iMax = (df.C * 10 + df.B).idxmax()
    return df.loc[iMax]
Then its application returns:
   A  B  C
A
a  a  1  2
b  b  1  1
Edit following the comment
In my opinion, it is not allowed to modify the original group,
as it would indirectly modify the original DataFrame.
At least pandas displays a warning about this, and it is considered bad practice.
Search the Web for SettingWithCopyWarning for a more extensive description.
My code (get_uniq_t function) does not modify the original group.
It only returns one row from the current group.
The returned row is the one that gives the greatest value of
df.C * 10 + df.B. So when you apply this function, the result is a new
DataFrame whose consecutive rows are the results of this function
for the consecutive groups.
You can perform an operation equivalent to a modification by creating
some new content, e.g. as the result of a groupby, and then saving it
under the same variable that so far held the source DataFrame.
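A complete runnable sketch of the fixed flow (recent pandas versions may warn that apply operates on the grouping column, but the result is the same):

import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])

def get_uniq_t(g):
    # pick the row with the greatest C*10 + B within the group
    return g.loc[(g.C * 10 + g.B).idxmax()]

df = df.groupby('A').apply(get_uniq_t)
print(df)
#    A  B  C
# A
# a  a  1  2
# b  b  1  1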