How to select and calculate with a value from a specific variable in a dataframe with pandas

I am running the code below and get this:
import pandas as pd
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0].count()*100/1892
x
id 0.528541
date 0.528541
count 0.528541
idade 0.528541
site 0.528541
baseline 0.528541
fuv1 0.528541
fuv2 0.475687
fuv3 0.528541
fuv4 0.475687
dtype: float64
What I want is to get just the single result 0.528541 and ignore all the other results above.
How do I do that?
Thanks.

If you want to count the number of 0 values in column fuv1, use sum to count the True values, which are treated as 1s:
print ((pf['fuv1'] == 0).sum())
10
x = (pf['fuv1'] == 0).sum()*100/1892
print (x)
0.528541226216
Explanation of why the outputs differ: count excludes NaNs:
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0]
print (x)
id date count idade site baseline fuv1 fuv2 fuv3 fuv4
0 0 4/1/2016 10 13 A 1 0.0 1.0 0.0 1.0
2 2 4/3/2016 9 5 C 1 0.0 NaN 0.0 1.0
3 3 4/4/2016 108 96 D 1 0.0 1.0 0.0 NaN
11 11 4/12/2016 6 13 C 1 0.0 1.0 1.0 0.0
13 13 4/14/2016 12 4 C 1 0.0 1.0 1.0 0.0
40 40 5/11/2016 14 7 C 1 0.0 1.0 1.0 1.0
41 41 5/12/2016 0 26 C 1 0.0 1.0 1.0 1.0
42 42 5/13/2016 10 15 C 1 0.0 1.0 1.0 1.0
60 60 5/31/2016 13 3 D 1 0.0 1.0 1.0 1.0
74 74 6/14/2016 15 7 B 1 0.0 1.0 1.0 1.0
print (x.count())
id 10
date 10
count 10
idade 10
site 10
baseline 10
fuv1 10
fuv2 9
fuv3 10
fuv4 9
dtype: int64
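As a minimal illustration on a toy Series (not the original data): count skips NaN, while summing a boolean comparison counts the matches, because True is treated as 1:
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan, 0.0])
print (s.count())       # 3 - count ignores the NaN
print ((s == 0).sum())  # 2 - True values of the comparison are summed as 1s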

In [282]: pf.loc[pf['fuv1'] == 0, 'id'].count()*100/1892
Out[282]: 0.5285412262156448
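If the 1892 in the denominator is just the total number of rows (an assumption here, since the CSV isn't shown), the percentage can also be written with mean, because the mean of a boolean Series is the fraction of True values:
# equivalent to (pf['fuv1'] == 0).sum()*100/len(pf) when len(pf) == 1892
x = (pf['fuv1'] == 0).mean()*100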

import pandas as pd
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x = (pf['fuv1'] == 0).sum()*100/1892
y=pf["idade"].mean()
l = "Performance"
k = "LTFU"
def test(l1, k1):
    return pd.DataFrame({'a': [l1, k1], 'b': [x, y]})
df1 = test(l,k)
df1.columns = [''] * len(df1.columns)
df1.index = [''] * len(df1.index)
print(round(df1, 2))
Performance 0.53
LTFU 14.13
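A possible simplification (a sketch, not part of the original answer): build the summary as a labelled Series, which avoids blanking out the column and index names:
# reuses l, k, x and y from the code above
summary = pd.Series({l: x, k: y}).round(2)
print(summary.to_string())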

Related

Dynamic sum of one column based on NA values of another column in Pandas

I've got an ordered dataframe, df. It's grouped by 'ID' and ordered by 'order'
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
     'order': [1, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 19, 25, 8, 10, 15, 17, 20, 25, 29, 31],
     'col1': [1, 2, np.nan, 1, 2, 3, 4, 5, np.nan, np.nan, 6, 7, 8, 9, np.nan, np.nan, np.nan, 10, 11, 12, np.nan, 13],
     'col2': [1, 5, 6, np.nan, 1, 2, 3, np.nan, 2, 3, np.nan, np.nan, 3, 1, 5, np.nan, np.nan, np.nan, 2, 3, np.nan, np.nan],
     }
)
Within each ID group, I need to sum col1 over the rows where col2 is NA. The sum should also include the col1 value of the row where col2 becomes non-missing again:
I would prefer a vectorised solution to make it fast, but it could be difficult.
I need to use this in a groupby (as col1_dynamic_sum should be grouped by ID).
What I have done so far is define a function that counts the number of consecutive NAs in the previous rows:
def count_prev_consec_na(input_col):
    """
    This function takes a DataFrame Series (column) and outputs the number
    of consecutive missing values in the previous rows.
    """
    try:
        a1 = input_col.isna() + 0    # 1 where missing
        a2 = ~input_col.isna() + 0   # 1 where not missing
        b1 = a1.shift().fillna(0)    # previous row missing
        d = a1.cumsum()
        e = b1 * a2
        f = d * e
        g = f.replace(0, np.nan)
        h = g.ffill()
        h = h.fillna(0)
        i = h.shift()
        result = h - i
        result = result.fillna(0)
        return result
    except Exception as exc:
        print(exc)   # exceptions have no .message attribute in Python 3
        return None
I think one solution is to use this to get a dynamic number of rows that need to be rolled back for the sum:
df['roll_back_count'] = df.groupby(['ID'], as_index = False).col2.transform(count_prev_consec_na)
ID order col1 col2 roll_back_count
A 1 1.0 1.0 0.0
A 3 2.0 5.0 0.0
A 4 NaN 6.0 0.0
A 6 1.0 NaN 0.0
A 7 2.0 1.0 1.0 ## I want to sum col1 of order 6 and 7 and remove order 6 row
A 9 3.0 2.0 0.0
A 11 4.0 3.0 0.0
A 12 5.0 NaN 0.0
A 13 NaN 2.0 1.0 ## I want to sum col1 of order 12 and 13 and remove order 12 row
A 14 NaN 3.0 0.0
A 15 6.0 NaN 0.0
A 16 7.0 NaN 0.0
A 19 8.0 3.0 2.0 ## I want to sum col1 of order 15,16,19 and remove order 15 and 16 rows
A 25 9.0 1.0 0.0
B 8 NaN 5.0 0.0
B 10 NaN NaN 0.0
B 15 NaN NaN 0.0
B 17 10.0 NaN 0.0 ## I want to sum col1 of order 10,15,17,20 and remove order 10,15,17 rows
B 20 11.0 2.0 3.0
B 25 12.0 3.0 0.0
B 29 NaN NaN 0.0
B 31 13.0 NaN 0.0
this is my desired output:
desired_output:
ID order col1_dynamic_sum col2
A 1 1.0 1
A 3 2.0 5
A 4 NaN 6
A 7 3.0 1
A 9 3.0 2
A 11 4.0 3
A 13 5.0 2
B 14 NaN 3
B 19 21.0 3
B 25 9.0 1
B 8 NaN 5
B 20 21.0 2
B 25 12.0 3
Note: the sums should ignore NAs.
Again, I prefer a vectorised solution, but it might not be possible due to the rolling effect.
Gah, I think I found a solution that doesn't involve rolling at all!
I created a new grouping ID based on the NA pattern of col2, using the index of the rows whose col2 is not missing, backfilled over the gaps. I then use this grouping ID to aggregate!
def create_na_group(rollback_col):
    a = ~rollback_col.isna() + 0
    b = a.replace(0, np.nan)
    c = rollback_col.index
    d = c * b
    d = d.bfill()
    return d
df['na_group'] = df.groupby(['ID'], as_index=False).col2.transform(create_na_group)
df = df.loc[~df.na_group.isna()]
desired_output = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_dyn_sum=('col1', 'sum'),
    col2=('col2', 'sum'),
)
I just have to find a way to make sure NaNs don't become 0, as in rows 2, 7 and 10.
ID na_group order col1_dyn_sum col2
0 A 0.0 1 1.0 1.0
1 A 1.0 3 2.0 5.0
2 A 2.0 4 0.0 6.0
3 A 4.0 7 3.0 1.0
4 A 5.0 9 3.0 2.0
5 A 6.0 11 4.0 3.0
6 A 8.0 13 5.0 2.0
7 A 9.0 14 0.0 3.0
8 A 12.0 19 21.0 3.0
9 A 13.0 25 9.0 1.0
10 B 14.0 8 0.0 5.0
11 B 18.0 20 21.0 2.0
12 B 19.0 25 12.0 3.0
I'll just create two separate sum columns with lambda x: x.sum(skipna=False) and lambda x: x.sum(skipna=True); then, if the skipna=True sum column is 0 and the skipna=False sum column is NA, I'll leave the final sum as NA; otherwise, I use the skipna=True sum column as the final desired output.
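A sketch of that idea (the intermediate column names are mine, and it assumes the filtered df with the na_group column from the code above):
agg = df.groupby(['ID', 'na_group'], as_index=False).agg(
    order=('order', 'last'),
    col1_sum_skip=('col1', lambda x: x.sum(skipna=True)),
    col1_sum_noskip=('col1', lambda x: x.sum(skipna=False)),
    col2=('col2', 'sum'),
)
# keep NaN where every col1 value in the group was missing
# (skipna sum is 0 and non-skipna sum is NaN), otherwise take the skipna sum
all_missing = agg['col1_sum_skip'].eq(0) & agg['col1_sum_noskip'].isna()
agg['col1_dynamic_sum'] = agg['col1_sum_skip'].mask(all_missing)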

group dataframe if the column has the same value in consecutive order

Let's say I have a dataframe that looks like the one below:
I want to assign my assets to one group when their treatments are consecutive. If there are two consecutive assets without treatment after them, we can still assign them to the same group. However, if there are more than two assets without treatment, then those assets (without treatment) get an empty group, and the next assets that have treatment are assigned to a new group.
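The example frame isn't reproduced in the question; a minimal reconstruction matching the Treatment sequence in the N=2 output further down (my reconstruction, not the original data) is:
import pandas as pd

# Treatment sequence read off the N=2 output below
df = pd.DataFrame({'Treatment': list('YYYNNYYYNNNYYYYN')})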
You can use a rolling check of whether there was at least one Y in the last N occurrences.
I am providing two options depending on whether or not it's important not to label the leading/trailing Ns:
# maximal number of days without treatment
# to remain in same group
N = 2
m = df['Treatment'].eq('Y')
group = m.rolling(N+1, min_periods=1).max().eq(0)
group = (group & ~group.shift(fill_value=False)).cumsum().add(1)
df['group'] = group
# don't label leading/trailing N
m1 = m.groupby(group).cummax()
m2 = m[::-1].groupby(group).cummax()
df['group2'] = group.where(m1&m2)
print(df)
To handle the last NaNs separately:
m3 = ~m[::-1].cummax()
df['group3'] = group.where(m1&m2|m3)
Output:
Treatment group group2 group3
0 Y 1 1.0 1.0
1 Y 1 1.0 1.0
2 Y 1 1.0 1.0
3 N 1 1.0 1.0
4 N 1 1.0 1.0
5 Y 1 1.0 1.0
6 Y 1 1.0 1.0
7 Y 1 1.0 1.0
8 N 1 NaN NaN
9 N 1 NaN NaN
10 N 2 NaN NaN
11 Y 2 2.0 2.0
12 Y 2 2.0 2.0
13 Y 2 2.0 2.0
14 Y 2 2.0 2.0
15 N 2 NaN 2.0
Other example for N=1:
Treatment group group2 group3
0 Y 1 1.0 1.0
1 Y 1 1.0 1.0
2 Y 1 1.0 1.0
3 N 1 NaN NaN
4 N 2 NaN NaN
5 Y 2 2.0 2.0
6 Y 2 2.0 2.0
7 Y 2 2.0 2.0
8 N 2 NaN NaN
9 N 3 NaN NaN
10 N 3 NaN NaN
11 Y 3 3.0 3.0
12 Y 3 3.0 3.0
13 Y 3 3.0 3.0
14 Y 3 3.0 3.0
15 N 3 NaN 3.0

How to fill nans with multiple if-else conditions?

I have a dataset:
value score
0 0.0 8
1 0.0 7
2 NaN 4
3 1.0 11
4 2.0 22
5 NaN 12
6 0.0 4
7 NaN 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 NaN 28
There are some NaNs in it. I want to fill those NaNs with these conditions:
If 'score' is less than 10, then fill nan with 0.0
If 'score' is between 10 and 20, then fill nan with 1.0
If 'score' is greater than 20, then fill nan with 2.0
How do I do this in pandas?
Here is an example dataframe:
import numpy as np
import pandas as pd

value = [0, 0, np.nan, 1, 2, np.nan, 0, np.nan, 0, 2, 1, 1, 0, 2, np.nan]
score = [8, 7, 4, 11, 22, 12, 4, 15, 5, 24, 12, 15, 5, 26, 28]
df = pd.DataFrame({'value': value, 'score': score})
Do it with cut, then fillna:
df.value.fillna(pd.cut(df.score, [-np.inf, 10, 20, np.inf], labels=[0, 1, 2]).astype(int), inplace=True)
df
Out[6]:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
You could use numpy.select with conditions on <10, 10≤score<20, etc., but a more efficient version is to use floor division so that values below 10 become 0, values below 20 become 1, etc.
df['value'] = df['value'].fillna(df['score'].floordiv(10))
with numpy.select:
df['value'] = df['value'].fillna(np.select([df['score'].lt(10),
                                            df['score'].between(10, 20),
                                            df['score'].ge(20)],
                                           [0, 1, 2])
                                 )
output:
value score
0 0.0 8
1 0.0 7
2 0.0 4
3 1.0 11
4 2.0 22
5 1.0 12
6 0.0 4
7 1.0 15
8 0.0 5
9 2.0 24
10 1.0 12
11 1.0 15
12 0.0 5
13 2.0 26
14 2.0 28
use np.select or pd.cut to map the intervals to values, then fillna:
mapping = np.select((df['score'] < 10, df['score'] > 20),
                    (0, 2), 1)
df['value'] = df['value'].fillna(mapping)
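As a quick sanity check (a sketch using the example frame above; scores of exactly 10 or 20 would be edge cases where the approaches can differ), the cut-based and floor-division fills agree on this data:
filled_cut = df['value'].fillna(
    pd.cut(df['score'], [-np.inf, 10, 20, np.inf], labels=[0, 1, 2]).astype(int))
filled_div = df['value'].fillna(df['score'].floordiv(10))
print(filled_cut.equals(filled_div))  # True on this sample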

Converting a flat table of records to an aggregate dataframe in Pandas [duplicate]

This question already has answers here:
How can I pivot a dataframe?
I have a flat table of records about objects. Objects have a type (ObjType) and are hosted in containers (ContainerId). The records also have some other attributes about the objects; however, those are not of interest at present. So, basically, the data looks like this:
Id ObjName XT ObjType ContainerId
2 name1 x1 A 2
3 name2 x5 B 2
22 name5 x3 D 7
25 name6 x2 E 7
35 name7 x3 G 7
..
..
92 name23 x2 A 17
95 name24 x8 B 17
99 name25 x5 A 21
What I am trying to do is 're-pivot' this data to further analyze which containers are 'similar' by looking at the types of objects they host in aggregate.
So, I am looking to convert the above data to the form below:
ObjType A B C D E F G
ContainerId
2 2.0 1.0 1.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 1.0 2.0 1.0 1.0
9 1.0 1.0 0.0 1.0 0.0 0.0 0.0
11 0.0 0.0 0.0 2.0 3.0 1.0 1.0
14 1.0 1.0 0.0 1.0 0.0 0.0 0.0
17 1.0 1.0 0.0 0.0 0.0 0.0 0.0
21 1.0 0.0 0.0 0.0 0.0 0.0 0.0
This is how I have managed to do it currently (after a lot of stumbling and using various tips from questions such as this one). I am getting the right results but, being new to Pandas and Python, I feel that I must be taking a long route. (I have added a few comments to explain the pain points.)
import pandas as pd
rdf = pd.read_csv('.\\testdata.csv')
#The data in the below group-by is all that is needed but in a re-pivoted format...
rdfg = rdf.groupby('ContainerId').ObjType.value_counts()
#Remove 'ContainerId' and 'ObjType' from the index
#Had to do reset_index in two steps because otherwise there's a conflict with 'ObjType'.
#That is, just rdfg.reset_index() does not work!
rdx = rdfg.reset_index(level=['ContainerId'])
#Renaming the 'ObjType' column helps get around the conflict so the 2nd reset_index works.
rdx.rename(columns={'ObjType':'Count'}, inplace=True)
cdx = rdx.reset_index()
#After this a simple pivot seems to do it
cdf = cdx.pivot(index='ContainerId', columns='ObjType',values='Count')
#Replacing the NaNs because not all containers have all object types
cdf.fillna(0, inplace=True)
Ask: Can someone please share other possible approaches that could perform this transformation?
This is a use case for pd.crosstab (see the pandas docs).
e.g.
In [539]: pd.crosstab(df.ContainerId, df.ObjType)
Out[539]:
ObjType A B D E G
ContainerId
2 1 1 0 0 0
7 0 0 1 1 1
17 1 1 0 0 0
21 1 0 0 0 0
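An equivalent formulation (a sketch; which one to use is mostly a matter of preference) is groupby with size and unstack:
# same counts as pd.crosstab(df.ContainerId, df.ObjType)
counts = df.groupby(['ContainerId', 'ObjType']).size().unstack(fill_value=0)
print(counts)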

How to calculate how many times a value changes in a column

How can I calculate, in the easiest way, how many value changes there are in a specific DataFrame column? For example, I have the following DF:
a b
0 1
1 1
2 1
3 2
4 1
5 2
6 2
7 3
8 3
9 3
In this DataFrame the values in column b change 4 times (at rows 4, 5, 6 and 8).
My very simple solution is:
a = 0
for i in range(df.shape[0] - 1):
    if df['b'].iloc[i] != df['b'].iloc[i+1]:
        a += 1
I think you need boolean indexing with the index:
idx = df.index[df['b'].diff().shift().fillna(0).ne(0)]
print (idx)
Int64Index([4, 5, 6, 8], dtype='int64')
For a more general solution, it is possible to index by arange:
a = np.arange(len(df))[df['b'].diff().shift().bfill().ne(0)].tolist()
print (a)
[4, 5, 6, 8]
Explanation:
First get difference by Series.diff:
print (df['b'].diff())
0 NaN
1 0.0
2 0.0
3 1.0
4 -1.0
5 1.0
6 0.0
7 1.0
8 0.0
9 0.0
Name: b, dtype: float64
Then shift by one value:
print (df['b'].diff().shift())
0 NaN
1 NaN
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
Replace first NaNs by fillna:
print (df['b'].diff().shift().fillna(0))
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
And compare for not equal to 0
print (df['b'].diff().shift().fillna(0).ne(0))
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
9 False
Name: b, dtype: bool
If a is a column and not the index:
idx = df['a'].loc[df['b'].diff().shift().fillna(0) != 0]
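If only the number of changes is needed rather than their positions, a short vectorised count is possible; a minimal sketch on the example column b:
# compare each value with the previous one and skip the first row, which has no predecessor
n_changes = df['b'].ne(df['b'].shift()).iloc[1:].sum()
print (n_changes)  # 4 for the example column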