When replicating a dataframe using concat with index (see example here), is there a way to assign a count variable for each iteration in a new column c (where column c is the count variable)?
Orig df:
   a  b
0  1  2
1  2  3
df replicated with pd.concat([df] * 5) and with an additional column c:
   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
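You could use np.arange and np.repeat:
import numpy as np
import pandas as pd
N = 5
new_df = pd.concat([df] * N)
# repeat each iteration number 0..N-1 once per row of df, then shift to start at 1
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1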
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
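A possible alternative is to tag each copy before concatenating, so the repeat counts do not have to be computed by hand. This is only a sketch, assuming the two-row df from the question; it uses DataFrame.assign to add the constant column c to each copy.

```python
import pandas as pd

# sample frame from the question
df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]})

N = 5
# tag each copy with its iteration number (1..N) before concatenating
new_df = pd.concat([df.assign(c=i) for i in range(1, N + 1)])
print(new_df)
```

This builds N small frames instead of one array, so for very large N the np.repeat version above is likely cheaper.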
P  T1  T2  T3
0   1   2   3
1   1   2   0
2   3   1   2
3   1   0   2
In the above pandas dataframe df,
I want to add columns on the basis of the value of column 'P'.
if df['P'] == 0: 0
if df['P'] == 1: T1 (=1)
if df['P'] == 2: T1+T2 (=3+1=4)
if df['P'] == 3: T1+T2+T3 (=1+0+2=3)
In other words, I want to add from T1 to TN if df['P'] == N.
How can I implement this with Python code?
EDIT:
To sum values according to the P column, select the relevant columns with DataFrame.filter, build a mask by broadcasting np.arange over the number of selected columns and comparing it with the P values, pass this mask to DataFrame.where, and finally sum per row:
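import numpy as np
import pandas as pd
np.random.seed(20)
cols = [f'{x}{i + 1}' for x in ['T','U','V'] for i in range(3)]
df = pd.DataFrame(np.random.randint(4, size=(10,10)), columns=['P'] + cols)
# P as a column vector so it broadcasts against the column positions
arrP = df['P'].to_numpy()[:, None]
for c in ['T','U','V']:
    df1 = df.filter(regex=rf'^{c}')
    # keep only the first P columns of each row, replace the rest with 0
    df[f'{c}_SUM'] = df1.where(np.arange(len(df1.columns)) < arrP, 0).sum(axis=1)
print (df)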
P T1 T2 T3 U1 U2 U3 V1 V2 V3 T_SUM U_SUM V_SUM
0 3 2 3 3 0 2 1 0 3 2 8 3 5
1 3 2 0 2 0 1 2 2 3 3 4 3 8
2 0 1 2 2 2 0 1 1 3 1 0 0 0
3 3 2 2 2 1 3 2 1 3 2 6 6 6
4 3 1 1 3 1 2 2 0 2 3 5 5 5
5 2 3 2 3 1 1 1 0 3 0 5 2 3
6 2 3 2 3 3 3 2 1 1 2 5 6 2
7 3 2 0 2 1 1 2 2 2 3 4 4 7
8 2 2 1 0 2 2 0 3 3 0 3 4 6
9 2 2 3 2 2 3 2 2 1 1 5 5 3
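To see the mask at work on the small table from the question, here is a minimal sketch. The column name T_SUM and the stricter regex are illustrative choices, not part of the original post.

```python
import numpy as np
import pandas as pd

# the four-row table from the question
df = pd.DataFrame({'P':  [0, 1, 2, 3],
                   'T1': [1, 1, 3, 1],
                   'T2': [2, 2, 1, 0],
                   'T3': [3, 0, 2, 2]})

t_cols = df.filter(regex=r'^T\d+$')
# column positions 0..2 compared with P as a column vector:
# row i keeps the first P[i] columns
mask = np.arange(t_cols.shape[1]) < df['P'].to_numpy()[:, None]
df['T_SUM'] = t_cols.where(mask, 0).sum(axis=1)
print(df)
```

The resulting T_SUM values are 0, 1, 4 and 3, matching the sums spelled out in the question.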
I would like to sum distinct values per group. Pardon the wordy post...
Context. Suppose I have a table of the form:
ID Foo Value
A 1 2
B 0 2
C 0 3
A 1 2
A 1 2
C 0 3
B 0 2
Each ID/Foo combo has a distinct value. I'd like to join this table onto another CTE that has a cumulative field, e.g. suppose that after joining using rows unbounded preceding I have a new field called Cumulative. Same data, just duplicated 3 times with the Cumulative value:
ID Foo Value Cumulative
A 1 2 1
B 0 2 1
C 0 3 1
A 1 2 1
A 1 2 1
C 0 3 1
B 0 2 1
A 1 2 2
B 0 2 2
C 0 3 2
A 1 2 2
A 1 2 2
C 0 3 2
B 0 2 2
A 1 2 3
B 0 2 3
C 0 3 3
A 1 2 3
A 1 2 3
C 0 3 3
B 0 2 3
I want to add a new field 'segment_value' that, for each Foo, gets the sum of the distinct ID values. E.g. the distinct ID/Foo combinations are:
ID Foo Value
A 1 2
B 0 2
C 0 3
I would therefore like a new field, 'segment_value', that returns 2 for Foo=1 and 5 for Foo=0. Desired result:
ID Foo Value Cumulative segment_value
A 1 2 1 2
B 0 2 1 5
C 0 3 1 5
A 1 2 1 2
A 1 2 1 2
C 0 3 1 5
B 0 2 1 5
A 1 2 2 2
B 0 2 2 5
C 0 3 2 5
A 1 2 2 2
A 1 2 2 2
C 0 3 2 5
B 0 2 2 5
A 1 2 3 2
B 0 2 3 5
C 0 3 3 5
A 1 2 3 2
A 1 2 3 2
C 0 3 3 5
B 0 2 3 5
How can I achieve this?
I don't think you explained your problem very well and I might have misunderstood something, but can't you extract the segment_value using a query such as this one:
select
    foo,
    sum(val) as segment_value
from (
    select distinct foo, val from table
) tab
group by foo
This would return the following result:
foo segment_value
1 2
0 5
Then you could join this to the rest of your query and use it as per your needs.
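As a rough sketch of that join, the query can be tried out against an in-memory SQLite database from Python. The table name t and the column names id, foo and val are assumptions made for this example. The inner query here also selects id in the distinct list, which is a change from the query above, so that two different IDs sharing the same value within one foo would still both be counted.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# hypothetical table mirroring the ID / Foo / Value sample from the question
conn.execute('create table t (id text, foo integer, val integer)')
rows = [('A', 1, 2), ('B', 0, 2), ('C', 0, 3), ('A', 1, 2),
        ('A', 1, 2), ('C', 0, 3), ('B', 0, 2)]
conn.executemany('insert into t values (?, ?, ?)', rows)

query = '''
select t.id, t.foo, t.val, seg.segment_value
from t
join (
    select foo, sum(val) as segment_value
    from (select distinct id, foo, val from t) d
    group by foo
) seg on seg.foo = t.foo
order by t.rowid
'''
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Every row with foo = 1 gets segment_value 2 and every row with foo = 0 gets 5, as in the desired result.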
I have a DataFrame:-
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 C 1
6 C 2
7 C 3
8 C 4
I want to create a new variable named Flag that marks the last occurrence of each value (B, A, C) in the col variable. Reference df:-
col count Flag
0 B 1 0
1 B 2 1
2 A 1 0
3 A 2 0
4 A 3 1
5 C 1 0
6 C 2 0
7 C 3 0
8 C 4 1
TIA
Use Series.duplicated with numpy.where:
df['Flag'] = np.where(df['col'].duplicated(keep='last'), 0, 1)
Or invert the mask with ~ and cast it to integers with Series.astype (Series.view is deprecated in recent pandas versions):
df['Flag'] = (~df['col'].duplicated(keep='last')).astype(int)
print (df)
col count Flag
0 B 1 0
1 B 2 1
2 A 1 0
3 A 2 0
4 A 3 1
5 C 1 0
6 C 2 0
7 C 3 0
8 C 4 1
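One caveat: duplicated(keep='last') flags the last occurrence of each value in the whole column. If a value could reappear in a later, non-adjacent run and each run should get its own flag, comparing each row with the next one may be closer to what is wanted. The sketch below rebuilds the sample frame and shows both variants; the column name Flag_run is made up for the comparison.

```python
import pandas as pd

# sample frame from the question
df = pd.DataFrame({'col': list('BBAAACCCC'),
                   'count': [1, 2, 1, 2, 3, 1, 2, 3, 4]})

# last occurrence of each value in the whole column
df['Flag'] = (~df['col'].duplicated(keep='last')).astype(int)

# last row of each consecutive run: the value differs from the next row
df['Flag_run'] = df['col'].ne(df['col'].shift(-1)).astype(int)
print(df)
```

On this data the two columns are identical, because every value forms a single run.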
I have got a database with a lot of columns. Some of the rows are duplicates (on a certain subset).
Now I want to find out which row duplicates which row and put them together.
For instance, let's suppose that the data frame is
id A B C
0 0 1 2 0
1 1 2 3 4
2 2 1 4 8
3 3 1 2 3
4 4 2 3 5
5 5 5 6 2
and subset is
['A','B']
I expect something like this:
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False to get a mask of all dupes, then filter by boolean indexing, sort with DataFrame.sort_values and join the parts together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print (df)
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
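If it also helps to see which rows belong together, a group label can be attached before reordering. This is a sketch on the sample data; the column name group is an assumption, and groupby(...).ngroup() numbers each distinct A/B combination in order of first appearance.

```python
import pandas as pd

# sample frame from the question
df = pd.DataFrame({'id': [0, 1, 2, 3, 4, 5],
                   'A':  [1, 2, 1, 1, 2, 5],
                   'B':  [2, 3, 4, 2, 3, 6],
                   'C':  [0, 4, 8, 3, 5, 2]})

L = ['A', 'B']
# label every distinct (A, B) combination with an integer
df['group'] = df.groupby(L, sort=False).ngroup()

# duplicated rows first (sorted so dupes sit together), unique rows after
m = df.duplicated(L, keep=False)
result = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print(result)
```

Rows sharing a group number are duplicates of each other on the subset L.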
I have 2 data frames like:
df_out:
a b c d
1 1 2 1
2 1 2 3
3 1 3 5
df_fin:
a e f g
1 0 2 1
2 5 2 3
3 1 3 5
5 2 4 6
7 3 2 5
I want to get result as :
a b c d a e f g
1 1 2 1 1 0 2 1
2 1 2 3 2 5 2 3
3 1 3 5 3 1 3 5
In other words, I have two different data frames that share one column (a). I want to compare these two columns (df_fin.a and df_out.a), select the rows from df_fin that have the same value in column a, and create a new dataframe that has the selected rows from df_fin plus the added columns from df_out.
I think you need merge with left join:
df = pd.merge(df_out, df_fin, on='a', how='left')
print (df)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
EDIT:
df1 = df_fin[df_fin['a'].isin(df_out['a'])]
df2 = df_out.join(df1.set_index('a'), on='a')
print (df2)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
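For completeness, here is a self-contained sketch with the sample frames from the question. It uses how='inner' instead of how='left', which is an assumption about the intent: an inner join keeps only values of a present in both frames, so rows of df_out without a match in df_fin would be dropped rather than filled with NaN. On this data both give the same three rows.

```python
import pandas as pd

# sample frames from the question
df_out = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 1, 1],
                       'c': [2, 2, 3], 'd': [1, 3, 5]})
df_fin = pd.DataFrame({'a': [1, 2, 3, 5, 7], 'e': [0, 5, 1, 2, 3],
                       'f': [2, 2, 3, 4, 2], 'g': [1, 3, 5, 6, 5]})

# keep only the values of a that occur in both frames
df = pd.merge(df_out, df_fin, on='a', how='inner')
print(df)
```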