pandas dataframe how to sum all value of bigger columns per row - pandas

I have a dataframe:
0. 1. 2. 3
2. 3. 5. 9
5. 1. 0. 3
And for columns 1,2,3 - I want the value for each row to be the sum of higher columns
So the new df will be:
0. 1. 2. 3
2. 17 14. 9
5. 4. 3. 3
What is the best way to do so?

Use inverse DataFrame.cumsum:
L = [1,2,3]
df[L[::-1]] = df[L[::-1]].cumsum(axis=1)
print (df)
0 1 2 3
0 2.0 17.0 14.0 9.0
1 5.0 4.0 3.0 3.0

Let us try cumsum along axis=1 after reverting the columns 1, 2, 3:
c = [1, 2, 3]
df.loc[:, c] = df.loc[:, c[::-1]].cumsum(axis=1)
0 1 2 3
0 2 17 14 9
1 5 4 3 3

Related

pandas function to find index of the first future instance where column is less than each row's value

I'm new to Stack Overflow, and I just have a question about solving a problem in pandas. I am looking to create a function that returns the index of the first future instance where a column is less than each row's value for that column.
For example, consider the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index = np.arange(0,9))
df
Index
Val
0
1
1
2
2
3
3
4
4
0
5
1
6
-1
7
-2
8
-3
I am looking for the output:
Index
F(Val)
0
4
1
4
2
4
3
4
4
6
5
6
6
7
7
8
8
NaN
Or the series/array equivalent of F(Val).
I've been able to solve this quite easily using for loops, but obviously this is extremely slow on the large dataset I am working with an not a very elegant or optimal solution. My hope is that the solution is an efficient pandas function that employs vectorization.
Also, as a bonus question (if anyone can assist), how might the maximum value between each row's index and the F(Val) index be computed using vectorization? The output should look like:
Index
G(Val)
0
4
1
4
2
4
3
4
4
1
5
1
6
-1
7
-2
8
NaN
Thanks!
You can use:
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# Output
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
Using numpy broadcasting and the lower triangle:
a = df['Val'].to_numpy()
m = np.tril(a[:,None]<=a, k=-1)
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
Same logic with expanding:
df['F(Val)'] = (df.loc[::-1, 'Val'].expanding()
.apply(lambda x: s.idxmax() if len(s:=(x.iloc[-2::-1]<=x.iloc[-1]))
else np.nan)
)
Output (with a difference to the provided one):
Val F(Val)
0 1 5.0 # here the next is 5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN

What does this line do (uniques) in dataframes

can someone help me understand this line. Thanks in advance
uniques = data.apply(lambda x: [x.unique])
I believe that the syntax that you're looking for is data.apply(lambda x: [x.unique()]). If so, then it outputs a DataFrame with a single row. Each element is a list containing the unique elements in that column. An example:-
In [14]: data.head()
Out[14]:
ipip wage1992 prtyid ideol gender
1 3 0.614226 7.0 5.000000 1
2 4 0.674754 5.0 5.000000 1
3 4 0.267990 7.0 6.000000 1
4 1 0.261490 1.0 4.000000 1
5 3 0.109723 2.0 1.760314 1
In [15]: data.apply(lambda x: [x.unique()])
Out[15]:
ipip ... gender
0 [3, 4, 1, 5, 2] ... [1, 0]
[1 rows x 5 columns]
As an example, the gender column contains two unique values, 1 and 0.

Drop a column based on the existence of another column

I'm actually trying to figure out how to drop a column based on the existence of another column. Here is my problem :
I start with this DataFrame. Each "X" column is associated with a "Y" column using a number. (X_1,Y_1 / X_2,Y_2 ...)
Index X_1 X_2 Y_1 Y_2
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
I drop NaN values using pd.dropna(). The result I get is this DataFrame :
Index X_1 X_2 Y_1
1 4 0 A
2 7 0 A
3 6 0 B
4 2 0 B
5 8 0 A
The problem is that I want to delete the "X" column associated to the "Y" column that just got dropped. I would like to use a condition that basically says :
"If Y_2 is not in the DataFrame, drop the X_2 column"
I used a for loop combined to if, but it doesn't seem to work. Any ideas ?
Thanks and have a good day.
Setup
>>> df
CHA_COEXPM1_COR CHA_COEXPM2_COR CHA_COFMAT1_COR CHA_COFMAT2_COR
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Solution
Identify the columns having NaN values in any row
Group the identified columns using the numeric identifier and transform using any
Filter the columns using the boolean mask created in the previous step
m = df.isna().any()
m = m.groupby(m.index.str.extract(r'(\d+)_')[0]).transform('any')
Result
>>> df.loc[:, ~m]
CHA_COEXPM1_COR CHA_COFMAT1_COR
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
Slightly modified example to be closer to actual DataFrame:
df = pd.DataFrame({
'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'X_V1_C': {0: 4, 1: 7, 2: 6, 3: 2, 4: 8},
'X_V2_C': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Y_V1_C': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A'},
'Y_V2_C': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
})
Index X_V1_C X_V2_C Y_V1_C Y_V2_C
0 1 4 0 A NaN
1 2 7 0 A NaN
2 3 6 0 B NaN
3 4 2 0 B NaN
4 5 8 0 A NaN
set_index on any columns to be "saved"
Extract the numbers from the columns and create a MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
0 1 2 1 2 # Numbers Extracted From Columns
X_V1_C X_V2_C Y_V1_C Y_V2_C
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Check where There are groups with all NaN columns with DataFrame.isna all on axis=0 (columns) then any relative to level=0 (the number that was extracted)
col_mask = ~df.isna().all(axis=0).any(level=0)
0
1 True # Keep 1 Group
2 False # Don't Keep 2 Group
dtype: bool
4.filter the DataFrame with the mask using loc then droplevel on the added number level
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
All Together
df = df.set_index('Index')
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
col_mask = ~df.isna().all(axis=0).any(level=0)
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
df:
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
drop nas
df.dropna(axis=1, inplace=True)
compute suffixes and columns with both suffixes
suffixes = [i[2:] for i in df.columns]
cols = [c for c in df.columns if suffixes.count(c[2:]) == 2]
filter columns
df[cols]
full code:
df = df.set_index('Index').dropna(axis=1)
suffixes = [i[2:] for i in df2.columns]
df[[c for c in df2.columns if suffixes.count(c[2:]) == 2]]

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
val
A B C
1 1 1 3
2 2 1 4
3 3 2 5
if the required values for C are [1,2,3], the desired output would be:
val
A B C
1 1 1 3
2 0
3 0
2 2 1 4
2 0
3 0
3 3 1 0
2 5
3 0
I know how to achieve this using groupby and applying a defined function for each group, but I was wondering if there was a cleaner way of doing this with reindex (I couldn't make this one work for a MultiIndex case, but perhaps I'm missing something)
Use -
partial_indices = [ i[0:2] for i in df.index.values ]
C_reqd = [1, 2, 3]
final_indices = [j+(i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)
Output
df2
val
A B C
1 1 1 3.0
2 0.0
3 0.0
2 2 1 4.0
2 0.0
3 0.0
3 3 1 0.0
2 5.0
3 0.0

Get data from multiindexed dataframe with given a list of index

I want to get information from from a column called 'index' in a certain pandas dataframe with multiple index. However, the index shall be listed. Please see the following example.
index ID
0 0 1.0 17226815
1 0 2.0 17226807
2 0 3.0 17226816
3 0 4.0 17226808
4 0 5.0 17231739
5 0 6.0 17231739
6 0 1.0 17226815
1 2.0 17226807
7 0 1.0 17226815
1 3.0 17226816
filtered_list = [3, 5, 7]
with the following line I can get the filtered data.
print(df.loc[df.index.isin(filtered_list, level=0)]['index'])
out:
3 0 4.0
5 0 6.0
7 0 1.0
1 3.0
what I want to get is a list consisting of the 'index' value. It will be then as additional information next to the filtered index. It is shown as follow:
0 3 4
1 5 6
2 7 (1, 3)
how can I get this list?
thank you in advance.
If I understand correctly,
df.loc[filtered_list,'index'].groupby(level=0).apply(tuple).reset_index()
Output:
0 index
0 3 (4.0,)
1 5 (6.0,)
2 7 (1.0, 3.0)
Going further:
df.loc[filtered_list,'index']\
.groupby(level=0)\
.apply(lambda x: tuple(x)[0] if len(x.index)==1 else tuple(x))\
.reset_index()
OUtput:
0 index
0 3 4
1 5 6
2 7 (1.0, 3.0)