Using pandas to group consecutive values in a column and sum corresponding values in another column

For example, if I have the dataset in the first table below...

Name  Code  Thickness
CH1   3     0.5
CH1   3     0.3
CH1   4     0.4
CH1   3     0.2
CH1   5     0.6
CH1   5     0.4
.... and I want to achieve the result in the next table by grouping consecutive runs of the same "Code" value and summing the "Thickness" column:

Name  Code  Thickness  Grp_Thickness
CH1   3     0.5        0.8
CH1   3     0.3        0.8
CH1   4     0.4        0.4
CH1   3     0.2        0.2
CH1   5     0.6        1.0
CH1   5     0.4        1.0
How do I go about this?

It's a gaps-and-islands problem. Any time the Code value changes, a new island starts. You can solve these problems with a cumsum:
s = (df["Code"] != df["Code"].shift()).cumsum()
df["Grp_Thickness"] = df.groupby(s)["Thickness"].transform("sum")
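A runnable sketch of the whole example (the sample frame is rebuilt here from the question's table):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["CH1"] * 6,
    "Code": [3, 3, 4, 3, 5, 5],
    "Thickness": [0.5, 0.3, 0.4, 0.2, 0.6, 0.4],
})

# a new island starts whenever Code differs from the previous row
s = (df["Code"] != df["Code"].shift()).cumsum()
df["Grp_Thickness"] = df.groupby(s)["Thickness"].transform("sum")
```

Rows 0-1 (Code 3) sum to 0.8, the lone Code 4 and Code 3 rows keep their own values, and rows 4-5 (Code 5) sum to 1.0.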

Related

How to select row index and variable name based on value in a data frame?

I have a large data frame made of float numbers between -1.0 and 1.0. I would like to create a new list containing the index rows, the variable names and the values for all the cells having a number higher than 0.59.
Here is an example:
A B C D ... FD
0 0.34 -0.23 0.6 0.7 ... 0.3
1 -0.5 0.99 0.8 0.2 ... 0.8
...
45 0.8 0.13 0.34 0.4 ... -0.9
output:
0 C 0.6
0 D 0.7
1 B 0.99
1 C 0.8
...
1 FD 0.8
etc..
Thanks!
I am sure there must be a better solution than mine, as mine has awful performance (iterating cell by cell). But here is my attempt:
import numpy as np
import pandas as pd

# creating a sample df
df = pd.DataFrame(np.random.uniform(-1, 1, size=(10, 4)), columns=list('abcd'))

new_list = []
for tup in df.itertuples():
    for i in range(1, len(tup)):
        if tup[i] > 0.59:
            new_list.append([tup.Index, df.columns[i - 1], tup[i]])

new_df = pd.DataFrame(new_list, columns=['index', 'column', 'value'])
new_df = new_df.set_index('index')
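A vectorized alternative (a sketch, not from the thread): stack the frame into a long Series indexed by (row, column), filter with a boolean mask, and reset the MultiIndex into columns. The sample data here is hard-coded from the question's example:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0.34, -0.5], "B": [-0.23, 0.99], "C": [0.6, 0.8], "D": [0.7, 0.2]}
)

# stack -> long Series keyed by (row index, column name); keep values above the threshold
long = df.stack()
out = long[long > 0.59].reset_index()
out.columns = ["index", "column", "value"]
```

This avoids the cell-by-cell Python loop entirely, so it scales much better on a large frame.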

Excel SUMPRODUCT function in pandas dataframes

OK, as a Python beginner I have found matrix multiplication in pandas DataFrames very difficult to carry out.
I have two tables look like:
df1
Id lifetime 0 1 2 3 4 5 .... 30
0 1 4 0.1 0.2 0.1 0.4 0.5 0.4... 0.2
1 2 7 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
2 3 8 0.5 0.2 0.1 0.4 0.5 0.4... 0.6
.......
9 6 10 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
df2
Group lifetime 0 1 2 3 4 5 .... 30
0 2 4 0.9 0.8 0.9 0.8 0.8 0.8... 0.9
1 2 7 0.8 0.9 0.9 0.9 0.8 0.8... 0.9
2 3 8 0.9 0.7 0.8 0.8 0.9 0.9... 0.9
.......
9 5 10 0.8 0.9 0.7 0.7 0.9 0.7... 0.9
I want to perform Excel's SUMPRODUCT function in my code, where the number of columns to multiply and sum is given by the lifetime column of both dfs, e.g.:
for row 0 in df1 & df2, lifetime = 4:
sumproduct(df1 row 0 from column 0 to column 3,
           df2 row 0 from column 0 to column 3)
for row 1 in df1 & df2, lifetime = 7:
sumproduct(df1 row 1 from column 0 to column 6,
           df2 row 1 from column 0 to column 6)
.......
How can I do this?
You can use .iloc to access rows and columns by integer position.
So where lifetime == 4 is row 0; counting column positions (with Id at position 0), the column labeled 0 sits at position 2 and the column labeled 3 at position 5, so to get that interval you would slice 2:6.
Once you have the correct data in both data frames with .iloc[0, 2:6], you run np.dot.
See below:
import numpy as np
np.dot(df1.iloc[0, 2:6], df2.iloc[0, 2:6])
Just to make sure you have the right data, first try running just
df1.iloc[0, 2:6]
Then try the np.dot product. You can read up on "pandas iloc" and "slicing" for more info.
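Extending that to every row at once, a loop sketch; df1/df2 here are small hypothetical stand-ins with value columns 0..3 (the real frames run to column 30), so the column lists would need to grow accordingly:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Id": [1, 2], "lifetime": [2, 4],
                    0: [0.1, 0.3], 1: [0.2, 0.2], 2: [0.1, 0.5], 3: [0.4, 0.4]})
df2 = pd.DataFrame({"Group": [2, 2], "lifetime": [2, 4],
                    0: [0.9, 0.8], 1: [0.8, 0.9], 2: [0.9, 0.9], 3: [0.8, 0.9]})

# only the first `lifetime` value columns participate in each row's sumproduct
value_cols = [0, 1, 2, 3]
result = [
    float(np.dot(df1.loc[i, value_cols[:n]], df2.loc[i, value_cols[:n]]))
    for i, n in df1["lifetime"].items()
]
```

Row 0 uses two columns (0.1*0.9 + 0.2*0.8 = 0.25), row 1 uses all four.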

Trying to group by but only specific rows based on their value

I am finding this issue quite complex:
I have the following df:
values_1 values_2 values_3 id name
0.1 0.2 0.3 1 AAAA_living_thing
0.1 0.2 0.3 1 AAA_mammals
0.1 0.2 0.3 1 AA_dog
0.2 0.4 0.6 2 AAAA_living_thing
0.2 0.4 0.6 2 AAA_something
0.2 0.4 0.6 2 AA_dog
The output should be:
values_1 values_2 values_3 id name
0.3 0.6 0.9 3 AAAA_living_thing
0.1 0.2 0.3 1 AAA_mammals
0.1 0.2 0.3 1 AA_dog
0.2 0.4 0.6 2 AAA_something
0.2 0.4 0.6 2 AA_dog
It would be like a group_by().sum(), but applied only to AAAA_living_thing, since the rows below it are children of AAAA_living_thing.
First separate the dataframe with query into the rows with AAAA_living_thing and the rows without. Then use groupby, and finally concat them back together (note that str methods inside query need engine='python'):
temp = df.query('name.str.startswith("AAAA")', engine='python').groupby('name', as_index=False).sum()
temp2 = df.query('~name.str.startswith("AAAA")', engine='python')
final = pd.concat([temp, temp2])
Output
id name values_1 values_2 values_3
0 3 AAAA_living_thing 0.3 0.6 0.9
1 1 AAA_mammals 0.1 0.2 0.3
2 1 AA_dog 0.1 0.2 0.3
4 2 AAA_something 0.2 0.4 0.6
5 2 AA_dog 0.2 0.4 0.6
Another way would be to make a unique group key for the rows that are not AAAA_living_thing with np.where, and then group by name plus that key:
s = np.where(df['name'].str.startswith('AAAA'), 0, df.index)
final = df.groupby(['name', s], as_index=False).sum()
Output
name values_1 values_2 values_3 id
0 AAAA_living_thing 0.3 0.6 0.9 3
1 AAA_mammals 0.1 0.2 0.3 1
2 AAA_something 0.2 0.4 0.6 2
3 AA_dog 0.1 0.2 0.3 1
4 AA_dog 0.2 0.4 0.6 2
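A runnable check of that second approach, with the sample frame hard-coded from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "values_1": [0.1, 0.1, 0.1, 0.2, 0.2, 0.2],
    "values_2": [0.2, 0.2, 0.2, 0.4, 0.4, 0.4],
    "values_3": [0.3, 0.3, 0.3, 0.6, 0.6, 0.6],
    "id": [1, 1, 1, 2, 2, 2],
    "name": ["AAAA_living_thing", "AAA_mammals", "AA_dog",
             "AAAA_living_thing", "AAA_something", "AA_dog"],
})

# AAAA rows share key 0 and collapse into one group; every other row keeps its own index
s = np.where(df["name"].str.startswith("AAAA"), 0, df.index)
final = df.groupby(["name", s], as_index=False).sum()
```

Only the two AAAA_living_thing rows merge; the five remaining output rows match the table above.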

Pandas apply a function at fixed interval

Is there a straightforward existing method to apply a function over fixed-size intervals with pandas (or numpy, scipy)?
Example
A pd.DataFrame of length 11
0 0.2
1 0.3
2 0.4
3 0.4
4 0.4
5 0.4
6 0.4
7 0.4
8 0.4
9 0.4
10 0.6
For instance applying a min function with interval = 5 would result in
0 0.2 # Beginning of interval
1 0.2
2 0.2
3 0.2
4 0.2 # End of interval
5 0.4 # Beginning of interval
6 0.4
7 0.4
8 0.4
9 0.4 # End of interval
10 0.6 # Beginning of interval (takes the min function of the remaining values)
So far I can do it with
import numpy as np
import pandas as pd

df = pd.read_clipboard(index_col=0, header=None)  # copying the above data
df['intervals'] = np.arange(len(df)) // 5
mapper = df.groupby('intervals').min()
result = df['intervals'].apply(lambda x: mapper.loc[x])
print(result)
But I wonder if there exists fixed interval filters already built in pandas/numpy/scipy.
One of the various possibilities would be to use groupby.transform after grouping rows into the necessary window intervals.
When you apply min through groupby's transform method, every row of a sub-group gets filled with the smallest value present in its group.
Assuming the single columned DF to be represented by s:
s.groupby(np.arange(len(s.index)) // 5).transform('min')
produces:
0 0.2
1 0.2
2 0.2
3 0.2
4 0.2
5 0.4
6 0.4
7 0.4
8 0.4
9 0.4
10 0.6
dtype: float64
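To reproduce that output with concrete data rather than an existing s:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.3, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.6])

# integer-divide the positions by 5 to form fixed-size interval labels 0, 1, 2, ...
result = s.groupby(np.arange(len(s.index)) // 5).transform('min')
```

The trailing partial interval (position 10) simply forms its own group, so no special casing is needed.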

Transpose table then set and rename index

I want to transpose a table and rename the index.
If I display the df with existing index Time I get
Time v1 v2
1 0.5 0.3
2 0.2 0.1
3 0.3 0.3
and after df.transpose() I'm at
Time 1 2 3
v1 0.5 0.2 0.3
v2 0.3 0.1 0.3
Interestingly if I do now df.Time I get
AttributeError: 'DataFrame' object has no attribute 'Time'
although it gets displayed in the output.
I can't find a way to easily rename the column Time to Variable and set that as the new index.
I tried df.reset_index().set_index("index"), but what I get looks like this:
Time 1 2 3
index
v1 0.5 0.2 0.3
v2 0.3 0.1 0.3
You only need to rename the columns' axis name with rename_axis:
print(df.transpose().rename_axis('Variable', axis=1))
Variable 1 2 3
v1 0.5 0.2 0.3
v2 0.3 0.1 0.3
Or set the new columns name by assigning to the name attribute:
df1 = df.transpose()
df1.columns.name = 'Var'
print(df1)
Var 1 2 3
v1 0.5 0.2 0.3
v2 0.3 0.1 0.3
But I think you need to create a new column from the index, rename that column to var, and also reset the columns' name to None:
df1 = df.transpose().reset_index().rename(columns={'index': 'var'})
df1.columns.name = None
print(df1)
var 1 2 3
0 v1 0.5 0.2 0.3
1 v2 0.3 0.1 0.3
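A runnable sketch of that last variant, with the sample Time table hard-coded:

```python
import pandas as pd

df = pd.DataFrame({"v1": [0.5, 0.2, 0.3], "v2": [0.3, 0.1, 0.3]},
                  index=pd.Index([1, 2, 3], name="Time"))

# transpose, turn the old column labels into a 'var' column,
# then drop the leftover 'Time' columns-axis label
df1 = df.transpose().reset_index().rename(columns={"index": "var"})
df1.columns.name = None
```

After this, df1.var works as an ordinary column, which is what the original df.Time access was reaching for.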