Sum specific columns of pandas dataframe based on if column name ends with string and begins with value in another column - pandas

I have a pandas dataframe like this
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([['X', 2, 3, 4, 5, 6, 7], ['Y', 8, 9, 10, 11, 12, 13], ['X', 14, 15, 16, 17, 18, 19]],
                          columns=['name', 'X 1_V1', 'X 1_V2', 'Y 1_V1', 'Y 1_V2', 'X 2_V1', 'X 2_V2'])
In [3]: print(df)
  name  X 1_V1  X 1_V2  Y 1_V1  Y 1_V2  X 2_V1  X 2_V2
0    X       2       3       4       5       6       7
1    Y       8       9      10      11      12      13
2    X      14      15      16      17      18      19
I want to sum the columns that begin with the value in the 'name' column and end with 'V1'. So the 1st and 3rd rows would sum the 2nd and 6th columns ('X 1_V1' and 'X 2_V1'), while the 2nd row would sum only the 4th column ('Y 1_V1'). The expected result is:
In [4]: df['sum']
Out[4]:
0 8
1 10
2 32
Name: sum, dtype: int64
I have tried
df["sum_Area"] = df[[x for x in df.columns if (x.split(' ')[0] == df['name']) and (x.endswith('peak_area'))]].sum(axis = "columns")
But I receive the error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). The column names are strings.

df['sum'] = df.apply(lambda x: sum(x[c] for c in df.columns if c.split()[0] == x['name'] and c.endswith('V1')), axis=1)
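The ValueError in the question comes from comparing one string (x.split(' ')[0]) with the whole df['name'] Series inside a plain `if`, which produces a Series of booleans rather than a single True or False. The apply-based line above avoids this by working one row at a time. For larger frames a vectorised variant may be worth trying. The following is only a sketch on the sample data from the question; the names v1_cols, prefixes and mask are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['X', 2, 3, 4, 5, 6, 7], ['Y', 8, 9, 10, 11, 12, 13], ['X', 14, 15, 16, 17, 18, 19]],
    columns=['name', 'X 1_V1', 'X 1_V2', 'Y 1_V1', 'Y 1_V2', 'X 2_V1', 'X 2_V2'],
)

# Columns that end with 'V1', and the prefix (text before the first space) of each.
v1_cols = [c for c in df.columns if c.endswith('V1')]
prefixes = np.array([c.split()[0] for c in v1_cols], dtype=object)

# mask[i, j] is True when the prefix of column j equals the name in row i.
names = df['name'].to_numpy(dtype=object)
mask = names[:, None] == prefixes[None, :]

# Zero out the non-matching cells, then sum across each row.
df['sum'] = (df[v1_cols].to_numpy() * mask).sum(axis=1)
print(df['sum'].tolist())  # [8, 10, 32]
```

The mask is built once for the whole frame, so no Python-level loop over rows is needed.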

Related

Combine unequal length lists to dataframe pandas with values repeating

How to add a list to a dataframe column such that the values repeat for every row of the dataframe?
mylist = ['one error','delay error']
df['error'] = mylist
This raises a length-mismatch error because df has 2000 rows. I can still add it if I convert mylist to a Series, but that only fills the first row, and the output looks like this:
d = {'col1': [1, 2, 3, 4, 5],
'col2': [3, 4, 9, 11, 17],
'error':['one error',np.nan,np.nan,np.nan,np.nan]}
df = pd.DataFrame(data=d)
However I would want the solution to look like this:
d = {'col1': [1, 2, 3, 4, 5],
'col2': [3, 4, 9, 11, 17],
'error':["'one error','delay error'","'one error','delay error'","'one error','delay error'","'one error','delay error'","'one error','delay error'"]}
df = pd.DataFrame(data=d)
I have tried ffill() but it didn't work.
You can assign into the array returned by df['error'].to_numpy() (the 'error' column must already exist for this to work). Note that you'll have to use [mylist] instead of mylist, even though it's already a list ;)
>>> mylist = ['one error']
>>> df['error'].to_numpy()[:] = [mylist]
>>> df
col1 col2 error
0 1 3 [one error]
1 2 4 [one error]
2 3 9 [one error]
3 4 11 [one error]
4 5 17 [one error]
>>> mylist = ['abc', 'def', 'ghi']
>>> df['error'].to_numpy()[:] = [mylist]
>>> df
col1 col2 error
0 1 3 [abc, def, ghi]
1 2 4 [abc, def, ghi]
2 3 9 [abc, def, ghi]
3 4 11 [abc, def, ghi]
4 5 17 [abc, def, ghi]
It's not a very clean way to do it, but you can first build a list that holds one copy of mylist per row of the dataframe, and only then assign it to the dataframe.
mylist = ['one error','delay error']
new_mylist = [mylist for _ in range(len(df))]
df['error'] = new_mylist
Repeat mylist len(df) // len(mylist) + 1 times, which always yields at least len(df) elements, then slice the repeated list to len(df) so that it matches the length of the column before assigning it:
df['error'] = (mylist * (len(df) // len(mylist) + 1))[:len(df)]
col1 col2 error
0 1 3 one error
1 2 4 delay error
2 3 9 one error
3 4 11 delay error
4 5 17 one error
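The same cyclic fill can be written without the length arithmetic. This is a minimal sketch, assuming the five-row sample frame from the question and using itertools from the standard library:

```python
from itertools import cycle, islice

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 9, 11, 17]})
mylist = ['one error', 'delay error']

# cycle() repeats mylist endlessly; islice() stops after len(df) items,
# so the result always has exactly one value per row.
df['error'] = list(islice(cycle(mylist), len(df)))
print(df)
```

This works for any combination of list length and frame length, including a list longer than the frame.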
df.assign(error=str(mylist))

Numpy vs Pandas axis

Why does axis differ between NumPy and pandas?
Example:
If I want to get rid of a column in pandas I could do this:
df.drop("column", axis = 1, inplace = True)
Here, we are using axis = 1 to drop a column (vertically in a DF).
In Numpy, if I want to sum a matrix A vertically I would use:
A.sum(axis = 0)
Here I use axis = 0.
axis isn't used that often in pandas. A dataframe has 2 dimensions, which are often treated quite differently. In drop the axis definition is well documented, and actually corresponds to the numpy usage.
Make a simple array and data frame:
In [180]: x = np.arange(9).reshape(3,3)
In [181]: df = pd.DataFrame(x)
In [182]: df
Out[182]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
Delete a row from the array, or a column:
In [183]: np.delete(x, 1, 0)
Out[183]:
array([[0, 1, 2],
       [6, 7, 8]])
In [184]: np.delete(x, 1, 1)
Out[184]:
array([[0, 2],
       [3, 5],
       [6, 8]])
Drop does the same thing for the same axis:
In [185]: df.drop(1, axis=0)
Out[185]:
   0  1  2
0  0  1  2
2  6  7  8
In [186]: df.drop(1, axis=1)
Out[186]:
   0  2
0  0  2
1  3  5
2  6  8
In sum, the definitions are the same as well:
In [188]: x.sum(axis=0)
Out[188]: array([ 9, 12, 15])
In [189]: df.sum(axis=0)
Out[189]:
0 9
1 12
2 15
dtype: int64
In [190]: x.sum(axis=1)
Out[190]: array([ 3, 12, 21])
In [191]: df.sum(axis=1)
Out[191]:
0 3
1 12
2 21
dtype: int64
The pandas sums are Series, which are the pandas equivalent of a 1d array.
Visualizing what axis does with reduction operations like sum is a bit tricky - especially with 2d arrays. Is the axis kept or removed? It can help to think about axis for 1d arrays (the only axis is removed), or 3d arrays, where one axis is removed leaving two.
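To make the "which axis is removed" point concrete, here is a small sketch that only inspects shapes. The array names a1, a2 and a3 are chosen for illustration and are not from the session above.

```python
import numpy as np

a1 = np.arange(4)                    # shape (4,)
a2 = np.arange(9).reshape(3, 3)      # shape (3, 3)
a3 = np.arange(24).reshape(2, 3, 4)  # shape (2, 3, 4)

# 1d: the only axis is removed, leaving a scalar with shape ().
shape_1d = a1.sum(axis=0).shape

# 3d: the summed axis disappears and the other two remain.
shapes_3d = [a3.sum(axis=k).shape for k in range(3)]

# keepdims=True keeps the reduced axis with length 1, which shows
# exactly which axis the reduction collapsed.
kept = a2.sum(axis=0, keepdims=True).shape

print(shape_1d)   # ()
print(shapes_3d)  # [(3, 4), (2, 4), (2, 3)]
print(kept)       # (1, 3)
```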
When you drop a column, its name is looked up along axis 1, the horizontal axis that holds the column labels. When you sum along axis 0, you add the values vertically, down the rows, which leaves one total per column.

Adding the lower levels of two Pandas MultiIndex columns

I have the following DataFrame:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
['p', 'm', 'p', 'm']])
values = [
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
]
df = pd.DataFrame(values, columns=columns)
  n1      n2
   p   m   p   m
0  1   2   3   4
1  5   6   7   8
2  9  10  11  12
Now I want to add another column (n3) to this DataFrame whose lower-level columns p and m should be the sums of the corresponding lower-level columns of n1 and n2:
  n1      n2      n3
   p   m   p   m   p   m
0  1   2   3   4   4   6
1  5   6   7   8  12  14
2  9  10  11  12  20  22
Here's the code I came up with:
n3 = df[['n1', 'n2']].sum(axis=1, level=1)
level1 = df.columns.levels[1]
n3.columns = pd.MultiIndex.from_arrays([['n3'] * len(level1), level1])
df = pd.concat([df, n3], axis=1)
This does what I want, but feels very cumbersome compared to code that doesn't use MultiIndex columns:
df['n3'] = df[['n1', 'n2']].sum(axis=1)
My current code also only works for a column MultiIndex consisting of two levels, and I'd be interested in doing this for arbitrary levels.
What's a better way of doing this?
One way to do so with stack and unstack:
new_df = df.stack(level=1)
new_df['n3'] = new_df.sum(axis=1)
new_df.unstack(level=-1)
Output:
   n1      n2      n3
    m   p   m   p   m   p
0   2   1   4   3   6   4
1   6   5   8   7  14  12
2  10   9  12  11  22  20
If you build the structure like:
df['n3','p']=1
df['n3','m']=1
then you can write:
df['n3'] = df[['n1', 'n2']].sum(axis=1, level=1)
Here's another way that I just discovered which does not reorder the columns:
# Sum column-wise on level 1
s = df.loc[:, ['n1', 'n2']].sum(axis=1, level=1)
# Prepend a column level
s = pd.concat([s], keys=['n3'], axis=1)
# Add column to DataFrame
df = pd.concat([df, s], axis=1)
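On recent pandas versions (2.0 and later), sum() no longer accepts the level argument, so the snippets above need a groupby instead. The following sketch adapts the last approach under that assumption; the transpose and groupby step is my substitution, not part of the original answers.

```python
import pandas as pd

columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'],
                                     ['p', 'm', 'p', 'm']])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=columns)

# Transpose so the column levels become index levels, group by the lower
# level, sum, and transpose back. sort=False keeps the original p, m order.
s = df[['n1', 'n2']].T.groupby(level=1, sort=False).sum().T

# Prepend the new top-level label and attach the result to the frame.
s = pd.concat([s], keys=['n3'], axis=1)
df = pd.concat([df, s], axis=1)
print(df)
```

For more than two column levels, passing a list of all the lower levels to groupby(level=...) should extend the same idea, though that case is not tested here.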

How to convert a pandas column containing lists into a dataframe

I have a pandas dataframe.
One of its columns contains a list of 60 elements, constant across its rows.
How do I convert each of these lists into a row of a new dataframe?
Just to be clearer: say A is the original dataframe with n rows. One of its columns contains a list of 60 elements.
I need to create a new dataframe nx60.
My attempt:
def expand(x):
    return pd.DataFrame(np.array(x).reshape(-1, len(x)))
df["col"].apply(lambda x: expand(x))
It gives funny results...
The weird thing is that if I call the function "expand" on a single row, it does exactly what I expect from it:
expand(df["col"][0])
To ChootsMagoots: this is the result when I try to apply your suggestion. It does not work.
Sample data
df = pd.DataFrame()
df['col'] = np.arange(4*5).reshape(4,5).tolist()
df
Output:
col
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11, 12, 13, 14]
3 [15, 16, 17, 18, 19]
Now extract a DataFrame from col:
df.col.apply(pd.Series)
Output:
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
Try this:
new_df = pd.DataFrame(df["col"].tolist())
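One detail the one-liner leaves open is the row index: building the frame from a plain list of lists gives a fresh 0..n-1 index. If the original frame has a meaningful index, it can be passed along explicitly. A minimal sketch, where the letter index is an assumption made only for this example:

```python
import numpy as np
import pandas as pd

# Sample frame with a non-default index and one list per row.
df = pd.DataFrame({'col': np.arange(4 * 5).reshape(4, 5).tolist()},
                  index=list('abcd'))

# One row per list, one column per list element, original index preserved.
new_df = pd.DataFrame(df['col'].tolist(), index=df.index)
print(new_df.shape)  # (4, 5)
```

Keeping the index makes it straightforward to join new_df back onto the remaining columns of df later.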
This is a little frankensteinish, but you could also try:
import numpy as np
np.savetxt('outfile.csv', np.array(df['col'].tolist()), delimiter=',')
new_df = pd.read_csv('outfile.csv', header=None)
You can try this as well:
newCol = pd.Series(yourList)
df['colD'] = newCol.values
The above code:
1. Creates a pandas series.
2. Maps the series value to columns in original dataframe.

Check whether a column in a dataframe is an integer or not, and perform operation

I want to check whether each column in a dataframe has an integer dtype and, if it does, multiply that column by 10.
import numpy as np
import pandas as pd
df = pd.DataFrame(....)
# function to check and multiply if a column is integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col]*10
        else:
            return x[col]
# using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
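You could use select_dtypes to get the numeric columns and then multiply. Note that np.number also matches float columns, so the line below scales floats as well as integers.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
To limit this to particular dtypes, pass your own list: include=[... np.int64, ..., etc]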
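The error in the question most likely arises because df.apply(xtimes) passes each column to xtimes as a Series, so `for col in x` iterates over cell values such as 'GP', and x[col] then tries to look that value up as an index label. An integer-only version of the select_dtypes idea could look like the sketch below. The column names and values are invented for illustration; only 'school' and 'GP' are hinted at by the error message.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame: one string, one integer and one float column.
df = pd.DataFrame({
    'school': ['GP', 'MS'],
    'age': [18, 17],
    'score': [12.5, 14.0],
})

# np.integer matches integer dtypes of any width, but not floats or objects.
int_cols = df.select_dtypes(include=[np.integer]).columns
df[int_cols] = df[int_cols] * 10
print(df)
```

Here only 'age' is scaled; 'score' and 'school' are left unchanged.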
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. Comparison operators on dtype objects test whether one dtype can be safely cast to the other, which is used here as a check for integer dtypes. See the NumPy documentation for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
[1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
   0  1  2    3  4  5
0  1  2  3  4.0  5  6
1  1  2  3  4.0  5  6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
    0   1   2    3  4  5
0  10  20  30  4.0  5  6
1  10  20  30  4.0  5  6
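An alternative that names the check explicitly is pandas.api.types.is_integer_dtype, which also covers unsigned integers. This is a sketch on a small made-up frame, not the demo frame above; the column names are assumptions.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical frame mixing signed, unsigned, float and object columns.
df = pd.DataFrame({
    'i32': np.array([1, 2], dtype=np.int32),
    'u8': np.array([1, 2], dtype=np.uint8),
    'f64': [4.0, 4.0],
    'obj': pd.Series([5, 6], dtype=object),
})

# One boolean per column: True where the column dtype is an integer type.
mask = np.array([is_integer_dtype(dt) for dt in df.dtypes])
int_cols = df.columns[mask]

df[int_cols] = df[int_cols] * 10
print(mask.tolist())  # [True, True, False, False]
```

The object column holds Python integers but is not an integer dtype, so it is skipped, just like column 4 in the demo above.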