Pandas: assign a series to groupby results

Hi, I'm a bit clueless about how to assign a series to groupby results.
I have dataframes A and B:
A = pd.DataFrame({'ID':[1,1,1,2,2,2],'TW':[0,1,0,0,1,0]})
B = pd.DataFrame({1:['A','B','C'], 2:['A','B','C']})
B's columns correspond to A's ID values. I want to group A by ID and replace each group's TW data with the matching column of B. Here is what I want:
C = pd.DataFrame({'Date':[1,1,1,2,2,2],'TW':['A','B','C','A','B','C']})
Could someone please help on this one?

Couldn't you just melt B?
>>> pd.melt(B, var_name='Date', value_name='TW')
   Date TW
0     1  A
1     1  B
2     1  C
3     2  A
4     2  B
5     2  C

@Alexander's answer is the obvious one, but this is another way to go about it.
B.rename_axis(columns='Date').stack() \
.reset_index('Date', name='TW') \
.reset_index(drop=True)
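Note that stack walks row by row, so the dates come out interleaved (1, 2, 1, 2, ...) rather than grouped as in C; a stable sort restores that order. A minimal end-to-end sketch of this approach:
import pandas as pd
B = pd.DataFrame({1: ['A', 'B', 'C'], 2: ['A', 'B', 'C']})
# Name the columns axis 'Date', stack it into the row index,
# then move it back out as a regular column.
out = (B.rename_axis(columns='Date')
        .stack()
        .reset_index('Date', name='TW')
        .reset_index(drop=True)
        .sort_values('Date', kind='stable')  # match C's 1,1,1,2,2,2 order
        .reset_index(drop=True))
print(out)
   Date TW
0     1  A
1     1  B
2     1  C
3     2  A
4     2  B
5     2  C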

Related

Convert transactions with several products from rows to columns

I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES   VALUE
john_1      1
john_2      2
john_3      3
bro_1       4
bro_2       5
bro_3       6
guy_1       7
guy_2       8
guy_3       9
And I would like to go to:
NAMES  VALUE1  VALUE2  VALUE3
john        1       2       3
bro         4       5       6
guy         7       8       9
I have tried this with pandas: I first split the NAMES column, and I can create the new columns, but I have trouble getting the values into the right columns.
Can someone at least point me in the right direction? I don't expect full code (I know that isn't appreciated), but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
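For completeness, here is a runnable sketch of the whole pipeline on the sample data; note that pivot sorts the index, so the rows come out alphabetically (bro, guy, john) rather than in their original order:
import pandas as pd
df = pd.DataFrame({
    'NAMES': ['john_1', 'john_2', 'john_3', 'bro_1', 'bro_2',
              'bro_3', 'guy_1', 'guy_2', 'guy_3'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
# Split once, unpacking into the base name and the number suffix.
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df = df.reset_index()
print(df)
  NAMES  VALUE1  VALUE2  VALUE3
0   bro       4       5       6
1   guy       7       8       9
2  john       1       2       3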

How can I apply an expanding window to groupby results?

I would like to use pandas to group a dataframe by one column, and then run an expanding window calculation on the groups. Imagine the following dataframe:
G  Val
A    0
A    1
A    2
B    3
B    4
C    5
C    6
C    7
What I am looking for is a way to group the data by column G (resulting in groups ['A', 'B', 'C']) and then apply a function first to the items in group A, then to the items in groups A and B, and finally to the items in groups A to C.
For example, if the function is sum, then the result would be:
A     3
B    10
C    28
For my problem the function that is applied needs to be able to access all original items in the dataframe, not only the aggregates from the groupby.
For example, when applying mean, the expected result would be:
A    1
B    2
C    3.5
A: mean([0,1,2]), B: mean([0,1,2,3,4]), C: mean([0,1,2,3,4,5,6,7]).
There is no cummean, so one possible solution is to aggregate per-group counts and sums, take the cumulative sum of both, and divide them to get the mean:
# Per-group counts and sums, then cumulative sums of both.
df = df.groupby('G')['Val'].agg(['size', 'sum']).cumsum()
s = df['sum'].div(df['size'])
print(s)
A    1.0
B    2.0
C    3.5
dtype: float64
If you need a general solution, you can build the expanding groups and then apply the function inside a dict comprehension:
# Wrap each unique label in a one-element list; cumsum then concatenates
# the lists, yielding the expanding groups ['A'], ['A','B'], ['A','B','C'].
g = df['G'].drop_duplicates().apply(list).cumsum()
s = pd.Series({x[-1]: df.loc[df['G'].isin(x), 'Val'].mean() for x in g})
print(s)
A    1.0
B    2.0
C    3.5
dtype: float64
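As a sanity check, swapping mean for sum in the same dict comprehension reproduces the expanding sums from the question; a minimal sketch:
import pandas as pd
df = pd.DataFrame({'G': list('AAABBCCC'), 'Val': range(8)})
# Expanding groups: ['A'], ['A', 'B'], ['A', 'B', 'C'].
g = df['G'].drop_duplicates().apply(list).cumsum()
s = pd.Series({x[-1]: df.loc[df['G'].isin(x), 'Val'].sum() for x in g})
print(s)
A     3
B    10
C    28
dtype: int64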

matching consecutive pairs in pd.Series

I have a DataFrame which looks like this:
ID  act
1   A
1   B
1   C
1   D
2   A
2   B
3   A
3   C
I am trying to get the IDs where an activity act1 is immediately followed by another activity act2; for example, A is followed by B. In that case, I want to get [1, 2] as the IDs. How do I go about this in a vectorized manner?
Edit: Expected output: for the sample df defined above, the output should be a list/Series of all the IDs where A is immediately followed by B
IDs
1
2
Here is a simple, vectorised way to do it!
df.loc[(df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1)), 'ID']
Output:
0    1
4    2
Name: ID, dtype: int64
Another way of writing this, possibly clearer:
conditions = (df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1))
df.loc[conditions, 'ID']
NumPy makes it easy to filter on one or more boolean conditions, and the resulting boolean vector is then used to filter your dataframe.
Here is one approach: group by ID without sorting, since we need to track B immediately following A based on the current row order. Then aggregate each group's activities into a single string with str.cat, check whether 'A,B' is present, and return the matching IDs from the index as a list:
(df
 .groupby('ID', sort=False)
 .act
 .agg(lambda x: x.str.cat(sep=','))
 .str.contains('A,B')
 .loc[lambda x: x]
 .index.tolist()
)
[1, 2]
Another approach uses the shift function and filtering. Shifting within each ID group avoids a false match where a B at the start of one ID follows an A at the end of the previous one:
df['prev'] = df.groupby('ID')['act'].shift()
df.loc[lambda x: (x['act'] == 'B') & (x['prev'] == 'A')].ID
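Putting it together on the sample frame confirms both approaches return IDs 1 and 2; a minimal sketch:
import pandas as pd
df = pd.DataFrame({'ID':  [1, 1, 1, 1, 2, 2, 3, 3],
                   'act': ['A', 'B', 'C', 'D', 'A', 'B', 'A', 'C']})
# Shift within each ID so a pair can never straddle two different IDs.
prev = df.groupby('ID')['act'].shift()
print(df.loc[(df['act'] == 'B') & (prev == 'A'), 'ID'].tolist())
[1, 2]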

pandas get columns without copy

I have a data frame with multiple columns, and I want to keep some of them and drop the others, without copying to a new dataframe.
I suppose it should be
df = df['col_a','col_b']
but I'm not sure whether it copies a new one or not. Is there any better way to do this?
Your approach should work, apart from one minor issue:
df = df['col_a','col_b']
should be:
df = df[['col_a','col_b']]
Because you assign the subset df back to df, it's essentially equivalent to dropping the other columns.
If you would like to drop other columns in place, you can do:
df.drop(columns=df.columns.difference(['col_a', 'col_b']), inplace=True)
Let me know if this is what you want.
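If you want to check whether the selection actually copied the data, numpy can compare the underlying buffers; a minimal sketch (the result can vary by pandas version, especially once copy-on-write is enabled):
import numpy as np
import pandas as pd
df = pd.DataFrame({'col_a': [1, 2], 'col_b': [3, 4], 'col_c': [5, 6]})
sub = df[['col_a', 'col_b']]
# True if the subset shares memory with the original, False if it copied.
print(np.shares_memory(sub['col_a'].to_numpy(), df['col_a'].to_numpy()))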
Say you have a dataframe df with multiple columns a, b, c, d and e, and you want to select, say, a and b and store them back in df. To achieve this, you can do:
df=df[['a', 'b']]
Input dataframe df:
a  b  c  d  e
1  1  1  1  1
3  2  3  1  4
When you do:
df=df[['a', 'b']]
the output will be:
a  b
1  1
3  2

Slicing and Setting Values in Pandas, with a composite of position and labels

I want to set a value in a specific cell in a pandas dataFrame.
I know which position the row is in (I can even get the row by using df.iloc[i], for example), and I know the name of the column, but I can't work out how to select the cell so that I can set a value to it.
df.loc[i,'columnName']=val
won't work because I want the row in position i, not labelled with index i. Also
df.iloc[i, 'columnName'] = val
obviously doesn't like being given a column name. So, short of converting to a dict and back, how do I go about this? Help very much appreciated, as I can't find anything that helps me in the pandas documentation.
Historically you could use ix to set a specific cell (.ix was deprecated in pandas 0.20 and removed in 1.0; on current versions, use the get_loc approach shown at the end):
In [209]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[209]:
          a         b         c
0  1.366340  1.643899 -0.264142
1  0.052825  0.363385  0.024520
2  0.526718 -0.230459  1.481025
3  1.068833 -0.558976  0.812986
4  0.208232  0.405090  0.704971
In [210]:
df.ix[1,'b'] = 0
df
Out[210]:
          a         b         c
0  1.366340  1.643899 -0.264142
1  0.052825  0.000000  0.024520
2  0.526718 -0.230459  1.481025
3  1.068833 -0.558976  0.812986
4  0.208232  0.405090  0.704971
You can also call iloc on the column of interest, though note this is chained assignment: it can raise SettingWithCopyWarning and does not work under copy-on-write in recent pandas:
In [211]:
df['b'].iloc[2] = 0
df
Out[211]:
          a         b         c
0  1.366340  1.643899 -0.264142
1  0.052825  0.000000  0.024520
2  0.526718  0.000000  1.481025
3  1.068833 -0.558976  0.812986
4  0.208232  0.405090  0.704971
You can get the position of the column with get_loc:
df.iloc[i, df.columns.get_loc('columnName')] = val
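If you prefer label-based assignment, you can also translate the row position into its index label and use loc; a minimal sketch (this assumes the index labels are unique, otherwise every row carrying that label would be set):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5, 3), columns=list('abc'))
i = 1
# df.index[i] turns the position i into the corresponding label.
df.loc[df.index[i], 'b'] = 0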