How to extract the maximum and minimum times from successive column values? - pandas

I have dataframe below.
I want extract the maximum and minimum times from successive column values.
How can I do it?
import pandas as pd
import numpy as np
raw_data = {'Time':[281.54385,298.64380,321.29645,321.39640,419.58545,430.68540,
533.96025,580.37990,590.85605,634.06015,724.16010,750.26000,
777.87955,830.97945,850.07940],
'CF_A': [1,1,1,0,0,0,1,1,1,2,2,2,0,0,0],
'CF_B': [1,1,1,1,1,1,0,0,0,0,1,1,1,0,0],
'CF_C': [0,0,2,2,3,3,3,3,1,1,1,1,0,0,0],
}
data = pd.DataFrame(raw_data)
dataframe - Input (see picture)
Variables in each column appear in succession and I want to new dataframe
sum up the time corresponding to the start and end of the sequence.
Wanted result is below.
result (see picture)

I suggest use index with cases for avoid multiple columns names with same values:
#filter column with CF
cols = data.filter(like='CF').columns
#output list of Series
L = []
for col in cols:
#create groups by consecutive values
s = data[col].ne(data[col].shift()).cumsum().rename('g')
#grouping by each column with helper groups
g = data.groupby([s, data[col]])['Time']
#difference by first and last value
d = g.last() - g.first()
#append sum by second level of MultiIndex
L.append(d.sum(level=1))
#join all Series together, cases are index values
df = pd.concat(L, axis=1, keys=cols).fillna(0)
print (df)
CF_A CF_B CF_C
0 181.48885 119.19985 89.29980
1 96.64840 202.86100 159.40395
2 116.19985 0.00000 0.09995
3 0.00000 0.00000 160.79445
But if really need expected output:
#filter column with CF
df1 = data.filter(like='CF')
#flatten all values of cases, get sorted unique values
idx = np.sort(np.unique(df1.values.ravel()))
print (idx)
#output list of Dataframes
L = []
for col in df1.columns:
#create groups by consecutive values
s = data[col].ne(data[col].shift()).cumsum().rename('g')
#grouping by each column with helper groups
g = data.groupby([s, data[col]])['Time']
#difference by first and last value
d = g.last() - g.first()
#sum by second level of MultiIndex, add missing rows by reindex
df = d.sum(level=1).rename_axis('Case').reindex(idx, fill_value=0).reset_index()
#append df with renamed columns names
L.append(df.add_prefix(col + '_'))
#join all DataFrames together
df = pd.concat(L, axis=1)
print (df)
CF_A_Case CF_A_Time CF_B_Case CF_B_Time CF_C_Case CF_C_Time
0 0 181.48885 0 119.19985 0 89.29980
1 1 96.64840 1 202.86100 1 159.40395
2 2 116.19985 2 0.00000 2 0.09995
3 3 0.00000 3 0.00000 3 160.79445

Related

How do I offset a dataframe with values in another dataframe?

I have two dataframes. One is the basevales (df) and the other is an offset (df2).
How do I create a third dataframe that is the first dataframe offset by matching values (the ID) in the second dataframe?
This post doesn't seem to do the offset... Update only some values in a dataframe using another dataframe
import pandas as pd
# initialize list of lists
data = [['1092', 10.02], ['18723754', 15.76], ['28635', 147.87]]
df = pd.DataFrame(data, columns = ['ID', 'Price'])
offsets = [['1092', 100.00], ['28635', 1000.00], ['88273', 10.]]
df2 = pd.DataFrame(offsets, columns = ['ID', 'Offset'])
print (df)
print (df2)
>>> print (df)
ID Price
0 1092 10.02
1 18723754 15.76 # no offset to affect it
2 28635 147.87
>>> print (df2)
ID Offset
0 1092 100.00
1 28635 1000.00
2 88273 10.00 # < no match
This is want I want to produce: The price has been offset by matching
ID Price
0 1092 110.02
1 18723754 15.76
2 28635 1147.87
I've also looked at Pandas Merging 101
I don't want to add columns to the dataframe, and I don;t want to just replace column values with values from another dataframe.
What I want is to add (sum) column values from the other dataframe to this dataframe, where the IDs match.
The closest I come is df_add=df.reindex_like(df2) + df2 but the problem is that it sums all columns - even the ID column.
Try this :
df['Price'] = pd.merge(df, df2, on=["ID"], how="left")[['Price','Offset']].sum(axis=1)

pandas add one column to many others

I want to add the values of one column
import pandas as pd
df= pd.DataFrame(data={"a":[1,2],"b":[102,4], "c":[4,5]})
# what I intended to do
df[["a","b"]] = df[["a","b"]] + df[["c"]]
Expected result:
df["a"] = df["a"] + df["c"]
df["b"] = df["b"] + df["c"]
You can assume a list of columns is available (["a", "b"]). is there a non loop / non line by line way of doing this? must be...
Use DataFrame.add with axis=0 and select c column only one [] for Series:
df[["a","b"]] = df[["a","b"]].add(df["c"], axis=0)
print (df)
a b c
0 5 106 4
1 7 9 5

saving dataframe groupby rows to exactly two lines

I got a dataframe and I want to groupby the rows based on a specific column. Number of rows in each group will be at least 4 and at most 50. I want to save one column from the group into two lines. If the groupsize is even, let us say 2n, then n rows in one line and the remaining n in the second line. If it is odd, n+1 and n or n and n+1 will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to groupby id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
columns='separate_column')
This is a bit convoluted approach but it does the work;
def func(s: pd.Series):
mid = max(s.shape[0]//2 ,1)
l1 = ' '.join(list(s[:mid]))
l2 = ' '.join(list(s[mid:]))
return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"]= df_new["name"].apply(lambda x: x[0])
df_new["name2"]= df_new["name"].apply(lambda x: x[1])
df = df_new.drop(labels="name", axis=1).stack().reset_index().drop(labels = ["level_1"], axis=1).rename(columns={0:"name"}).set_index("id")

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numbered column what I do is as follows (i.e. remove C first, normalise data and add the column).
df_new = df.drop('concept', axis=1)
df_concept = df[['concept']]
from sklearn import preprocessing
x = df_new.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['concept'] = df_concept
However, I am sure that there is more easy way of doing this in pandas (given the column names that I do not need to normalise, then do the normalisation straightforward).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes for DataFrame with numeric columns and then normalize with division by minimal and maximal values and then assign back only normalized columns:
df1 = df.select_dtypes(np.number)
df[df1.columns]=(df1-df1.min())/(df1.max()-df1.min())
print (df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).

Pandas: Selecting rows by list

I tried following code to select columns from a dataframe. My dataframe has about 50 values. At the end, I want to create the sum of selected columns, create a new column with these sum values and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it said AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: As I don't want to write for the sum
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns
you can then do df['sum_1'] = df[columns_selected].sum(axis=1)
To filter the df to just the cols of interest pass a list of the columns, df = df[columns_selected] note that it's a common error to just a list of strings: df = df['a','b','c'] which will raise a KeyError.
Note that you had a typo in your original attempt:
df = df.loc[:,df.columns.isin(columns_selected)]
The above would've worked, firstly you needed columns not column, secondly you can use the boolean mask as a mask against the columns by passing to loc or ix as the column selection arg:
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
a b c d e
0 -0.778207 0.480142 0.537778 -1.889803 -0.851594
1 2.095032 1.121238 1.076626 -0.476918 -0.282883
2 0.974032 0.595543 -0.628023 0.491030 0.171819
3 0.983545 -0.870126 1.100803 0.139678 0.919193
4 -1.854717 -2.151808 1.124028 0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.ix[:, df.columns.isin(cols)]
Out[50]:
a b c
0 -0.778207 0.480142 0.537778
1 2.095032 1.121238 1.076626
2 0.974032 0.595543 -0.628023
3 0.983545 -0.870126 1.100803
4 -1.854717 -2.151808 1.124028