Pandas iterate max value of a variable length slice in a series - pandas

Let's assume I have a Pandas DataFrame as follows:
import pandas as pd
idx = ['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-07',
'2003-01-08', '2003-01-09', '2003-01-10', '2003-01-13',
'2003-01-14', '2003-01-15', '2003-01-16', '2003-01-17',
'2003-01-21', '2003-01-22', '2003-01-23', '2003-01-24',
'2003-01-27']
a = pd.DataFrame([1,2,0,0,1,2,3,0,0,0,1,2,3,4,5,0,1],
columns = ['original'], index = pd.to_datetime(idx))
I am trying to get the max of each slice of that DataFrame between two zeros, placed at the position where the max occurs.
In this example I would get:
a['result'] = [0,2,0,0,0,0,3,0,0,0,0,0,0,0,5,0,1]
that is:
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1

The steps:
- find the zeros
- take the cumsum of the zero mask to build group labels
- mask the zeros themselves into their own group, -1
- find the location of the max in each group with idxmax
- drop the entry for group -1, since that group only held the zeros
- take a.original at the found max locations, reindex to the full index, and fill with zeros
m = a.original.eq(0)
g = a.original.groupby(m.cumsum().mask(m, -1))
i = g.idxmax().drop(-1)
a.assign(result=a.loc[i, 'original'].reindex(a.index, fill_value=0))
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
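The same steps can be wrapped in a small function and checked against the expected result from the question. This is only a sketch: the name max_between_zeros is hypothetical, and the errors="ignore" argument is an addition to cover a series that contains no zeros at all.

```python
import pandas as pd

def max_between_zeros(s: pd.Series) -> pd.Series:
    """Keep the max of each zero-delimited run at its position, 0 elsewhere."""
    is_zero = s.eq(0)
    # Every zero bumps the running count, so a run of non-zeros shares one label.
    # The zeros themselves are pushed into a throwaway group, -1.
    groups = is_zero.cumsum().mask(is_zero, -1)
    # Index label of the (first) max inside each group.
    peak_idx = s.groupby(groups).idxmax()
    # Drop the zeros' group; errors="ignore" handles input without any zeros.
    peak_idx = peak_idx.drop(-1, errors="ignore")
    return s.loc[peak_idx.to_numpy()].reindex(s.index, fill_value=0)

idx = ['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-07',
       '2003-01-08', '2003-01-09', '2003-01-10', '2003-01-13',
       '2003-01-14', '2003-01-15', '2003-01-16', '2003-01-17',
       '2003-01-21', '2003-01-22', '2003-01-23', '2003-01-24',
       '2003-01-27']
s = pd.Series([1, 2, 0, 0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 4, 5, 0, 1],
              index=pd.to_datetime(idx), name='original')

result = max_between_zeros(s)
print(result.tolist())
```

If a run contains its max more than once, idxmax keeps only the first occurrence, so later ties are reported as 0.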

Related

How to select row value from given columns based on comparison of other column values in Pandas data frame?

I have the following Pandas DataFrame:
true_y m1_labels m1_probs_0 m1_probs_1 m2_labels m2_probs_0 m2_probs_1
0 0 0.628205 0.371795 1 0.491648 0.508352
0 0 0.564113 0.435887 1 0.474973 0.525027
0 1 0.463897 0.536103 0 0.660307 0.339693
0 1 0.454559 0.545441 0 0.512349 0.487651
0 0 0.608345 0.391655 1 0.499531 0.500469
0 0 0.816127 0.183873 1 0.456669 0.543331
0 1 0.442693 0.557307 0 0.573354 0.426646
1 0 0.653497 0.346503 1 0.487212 0.512788
0 1 0.392380 0.607620 0 0.627419 0.372581
0 1 0.375816 0.624184 0 0.631532 0.368468
This is a collection of disagreeing ML model predictions with labels and label probabilities of two models (m1, m2) and the actual label (true_y).
For each row, I would like to pick whichever hard label prediction (m1_labels or m2_labels) has the higher probability for its own model's predicted class. So for row #1, I expect 0 (as the m1 model has a higher probability for its prediction 0 than the m2 model has for its prediction 1). Basically, this is intended to be a manual voting ensemble of the two models.
How can I get this vector with a Pandas query?
You can use the apply function for this:
df.apply(lambda x: x["m1_labels"] if max(x["m1_probs_0"], x["m1_probs_1"]) > max(x["m2_probs_0"], x["m2_probs_1"]) else x["m2_labels"], axis=1)
This selects the first model's label if the probability of its predicted class is higher than the probability of the second model's predicted class. Otherwise, it selects the label from the second model.
You can use:
# get max probability for m1
p1 = df.filter(like='m1_probs').max(axis=1)
# get max probability for m2
p2 = df.filter(like='m2_probs').max(axis=1)
# m1_label if it has a greater probability, else m2_label
df['best'] = df['m1_labels'].where(p1.gt(p2), df['m2_labels'])
output:
true_y m1_labels m1_probs_0 m1_probs_1 m2_labels m2_probs_0 m2_probs_1 best
0 0 0 0.628205 0.371795 1 0.491648 0.508352 0
1 0 0 0.564113 0.435887 1 0.474973 0.525027 0
2 0 1 0.463897 0.536103 0 0.660307 0.339693 0
3 0 1 0.454559 0.545441 0 0.512349 0.487651 1
4 0 0 0.608345 0.391655 1 0.499531 0.500469 0
5 0 0 0.816127 0.183873 1 0.456669 0.543331 0
6 0 1 0.442693 0.557307 0 0.573354 0.426646 0
7 1 0 0.653497 0.346503 1 0.487212 0.512788 0
8 0 1 0.392380 0.607620 0 0.627419 0.372581 0
9 0 1 0.375816 0.624184 0 0.631532 0.368468 0
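Both answers take the row-wise max of each model's probabilities, which equals the probability of the predicted class only when the label is the argmax. A slightly more defensive sketch looks up the probability of the label each model actually predicted. It assumes binary 0/1 labels and the column naming from the question; the helper name predicted_class_prob is hypothetical, and only the first four rows of the sample data are used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "true_y":     [0, 0, 0, 0],
    "m1_labels":  [0, 0, 1, 1],
    "m1_probs_0": [0.628205, 0.564113, 0.463897, 0.454559],
    "m1_probs_1": [0.371795, 0.435887, 0.536103, 0.545441],
    "m2_labels":  [1, 1, 0, 0],
    "m2_probs_0": [0.491648, 0.474973, 0.660307, 0.512349],
    "m2_probs_1": [0.508352, 0.525027, 0.339693, 0.487651],
})

def predicted_class_prob(frame: pd.DataFrame, model: str) -> pd.Series:
    """Probability that `model` assigned to the label it predicted (0 or 1)."""
    labels = frame[f"{model}_labels"]
    probs = np.where(labels.eq(1),
                     frame[f"{model}_probs_1"],
                     frame[f"{model}_probs_0"])
    return pd.Series(probs, index=frame.index)

p1 = predicted_class_prob(df, "m1")
p2 = predicted_class_prob(df, "m2")
# Take m1's label where m1 is strictly more confident, otherwise m2's label.
df["best"] = df["m1_labels"].where(p1.gt(p2), df["m2_labels"])
print(df["best"].tolist())
```

Because the comparison is strict, an exact tie between the two probabilities falls back to the m2 label.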

Pandas dataframe aggregating data into counts per group

I'm new to pandas and was looking for some advice on how to reshape my pandas dataframe:
Currently, I have a dataframe like this.
panelist_id  type  refer_sm  refer_se  refer_non_n
1            HP    1         0         0
1            HP    1         0         0
1            HP    0         0         1
1            PB    0         1         0
2            PB    0         1         0
2            PB    1         0         0
2            HP    1         0         0
Ideally, I want to group by panelist_id, and aggregate the other columns by count:
panelist_id
type
type_count
refer_sm_count
refer_se_count
refer_non_n_count
1
HP
2
2
1
1
PB
1
0
1
0
2
HP
1
1
0
0
PB
2
1
1
0
0
I've tried using groupby to group by panelist, which works, however I'm a little stuck on the aggregation part. Any help would be much appreciated.
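One option is to group by both panelist_id and type and use named aggregation:
df.groupby(['panelist_id', 'type']).agg(
    type_count=('type', 'size'),
    refer_sm_count=('refer_sm', 'sum'),
    refer_se_count=('refer_se', 'sum'),
    refer_non_n_count=('refer_non_n', 'sum'),
)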
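To try the named-aggregation idea on the sample rows from the question, a self-contained sketch could look like this. It assumes the refer_* columns are 0/1 flags, so that summing them gives a count, and it adds reset_index() to turn the group keys back into ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({
    "panelist_id": [1, 1, 1, 1, 2, 2, 2],
    "type":        ["HP", "HP", "HP", "PB", "PB", "PB", "HP"],
    "refer_sm":    [1, 1, 0, 0, 0, 1, 1],
    "refer_se":    [0, 0, 0, 1, 1, 0, 0],
    "refer_non_n": [0, 0, 1, 0, 0, 0, 0],
})

counts = (
    df.groupby(["panelist_id", "type"])
      .agg(
          type_count=("type", "size"),            # rows per (panelist, type)
          refer_sm_count=("refer_sm", "sum"),     # sum of 0/1 flags = count of 1s
          refer_se_count=("refer_se", "sum"),
          refer_non_n_count=("refer_non_n", "sum"),
      )
      .reset_index()
)
print(counts)
```

Each keyword in agg names an output column and maps it to a (source column, aggregation) pair, which keeps the result flat instead of producing a MultiIndex on the columns.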

How to split a column in a data frame containing only numbers into multiple columns in pandas

I have a .dat file containing the following data:
0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011
I need to count the number of zeros and ones in each row.
I have tried this with pandas:
Step 1: Read the data file
Step 2: Give the column a name
Step 3: Try to split the values into multiple columns, which did not succeed
df1=pd.read_csv('data.dat',header=None)
df1.head()
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
df1.columns=['kirti']
df1.head()
Kirti
_______________________
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
I need to split the data frame into multiple columns, one per digit (0 or 1) in each row.
The number of columns will equal the number of digits in the longest row of the data frame.
First create a one-column DataFrame, using the names parameter to label the column and dtype=str so the values are read as strings:
import pandas as pd
from io import StringIO
temp="""0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011"""
# after testing, replace 'StringIO(temp)' with the file name
df = pd.read_csv(StringIO(temp), header=None, names=['kirti'], dtype=str)
print (df)
kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
Then create a new DataFrame by converting each string to a list of characters:
df = pd.DataFrame([list(x) for x in df['kirti']])
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0
1 1 1 0 1 0 1 0 0 0 0 0 1 1 1 1 None None None None
2 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 None
3 0 1 1 1 1 1 1 0 1 0 1 0 0 None None None None None None
4 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 None None None
If your data is in a list of strings, then use the count method:
>> data = ["0001100000101010100", "110101000001111", "101100011001110111", "0111111010100", "1010111111100011"]
>> for i in data:
       print(i.count("0"))
13
7
7
5
5
If your data is in a .dat file with whitespace separation as you described, then I would recommend loading your data as follows:
data = pd.read_csv("data.dat", lineterminator=" ",dtype="str", header=None, names=["Kirti"])
Kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
The lineterminator argument ensures that every entry is in a new row. The dtype argument ensures that it's read as a string. Otherwise you will lose the leading zeros.
If your data is in a DataFrame, you can use the str.count method:
>> data["Kirti"].str.count("0")
0 13
1 7
2 7
3 5
4 5
Name: Kirti, dtype: int64
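Since the question asks for both zeros and ones, the str.count call can simply be applied twice. The following is a minimal sketch that builds the column in memory rather than reading data.dat; the output column names zeros and ones are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Kirti": [
    "0001100000101010100",
    "110101000001111",
    "101100011001110111",
    "0111111010100",
    "1010111111100011",
]})

# One pass per digit; the strings stay intact, so leading zeros are counted too.
counts = pd.DataFrame({
    "zeros": df["Kirti"].str.count("0"),
    "ones":  df["Kirti"].str.count("1"),
})
print(counts)
```

This avoids splitting the strings into one column per digit, which is only needed if the individual positions matter.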

Calculating ratio value within a line which contain binary numbers "0" & "1"

I have a data file which contains more than 2000 lines and 45001 columns.
The first column is actually a "string" which explains the data type.
From column #2 up to column #45001, the data is represented as "1" or "0".
For example, the pattern of data in a line is
(0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0)
The total number of data points is 25. Within this data line, there are 5 sub-groups which are made up only of "1"s, e.g. (11 111 1111 1 111). The "0"s in between the sub-groups are treated as delimiters. The total of all "1"s is 13.
I would like to calculate the ratio (total of all "1"s / number of sub-groups made only of "1"s), that is, 13/5.
I tried this code to calculate the total of all "1"s:
awk -F '0' '{print NF}' < inputfile.in
This gives value 13.
But I don't know how to go further from here to calculate the ratio that I want.
I don't know how to find the number of sub-groups within each line because the occurrences of "1"s and "0"s are random.
I would appreciate any help with this problem.
It is not clear to me from the description what the format of the input file is. Assume the input looks like:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
To count up the number of ones and the number of groups of ones and take their ratio:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; print s1/s2}' file
2.6
Update: Handling all zeros
Suppose one of the lines in the file has all zeros:
$ cat file
0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
For the second line, both sums are zero, which would lead to a divide-by-zero error. We can avoid that by adding an if statement which prints the ratio if one exists, or 0/0 if it doesn't:
if (s2>0)print s1/s2; else print s1"/"s2
The complete code is now:
$ awk '{f=0;s1=0;s2=0;for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}; if (s2>0)print s1/s2; else print s1"/"s2}' file
2.6
0/0
How it works
The code uses three variables. f is a flag which is true (1) if we are currently in a group of ones and is false (0) otherwise. s1 is the number of ones on the line. s2 is the number of groups of ones on the line.
f=0;s1=0;s2=0
At the beginning of each line, we initialize the variables.
for (i=2;i<=NF;i++){s1+=$i;if ($i && !f)s2++;f=$i}
We loop over each field on the line starting with field 2. If the field contains a 1, we increment counter s1. If the field is 1 and is the start of a new group, we increment s2.
if (s2>0)print s1/s2; else print s1"/"s2}
If we encountered at least one one, we print the ratio s1/s2. Otherwise, we print 0/0.
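For readers who find the flag logic easier to follow outside a one-liner, here is the same idea sketched in Python. It is a translation for illustration rather than part of the awk answer: the function name ones_per_group is hypothetical, and it returns None instead of printing 0/0 when a line has no ones.

```python
def ones_per_group(bits):
    """Return (number of ones, number of groups of ones, ratio or None)."""
    total_ones = 0    # plays the role of s1
    groups = 0        # plays the role of s2
    in_group = False  # plays the role of the flag f
    for bit in bits:
        if bit == 1:
            total_ones += 1
            if not in_group:
                groups += 1  # a 1 right after a 0 (or at the start) opens a group
        in_group = (bit == 1)
    ratio = total_ones / groups if groups else None
    return total_ones, groups, ratio

line = "data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0"
# Skip the first field (the label), as the awk loop does by starting at field 2.
bits = [int(tok) for tok in line.split()[1:]]
ones, groups, ratio = ones_per_group(bits)
print(ones, groups, ratio)
```

For a file with thousands of wide lines, the function can be called once per line while iterating over the file, so only one line is held in memory at a time.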
Here is an awk that does what you need:
cat file
data 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0
data 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
data 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
data 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BMR_10#O24-BMR_6#O13-H13 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
data 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1
awk '{$1="";$0="0 "$0" 0";t=split($0,b,"1")-1;gsub(/ +/,"");n=split($0,a,"[^1]+")-2;print (n?t/n:0)}' file
2.6
0
25
11
5.5
3

Setting values in a matrix in bulk

The question is about bulk-changing values in a matrix based on data contained in a vector.
Suppose I have a 5x4 matrix of zeroes.
octave> Z = zeros(5,4)
Z =
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
And a column vector of length equal to the number of rows in Z, that is, 5. The rows in the vector y correspond to rows in the matrix Z.
octave> y = [1; 3; 2; 1; 3]
y =
1
3
2
1
3
What I want is to set 1's in the matrix Z in the columns whose indices are contained as values in the corresponding row of the vector y. Namely, I'd like to have Z matrix like this:
Z = # y =
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
0 1 0 0 # <-- 2 nd column
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
Is there a concise way of doing it? I know I can implement it using a loop over y, but I have a feeling Octave could have a more laconic way. I am new to Octave.
Since Octave has automatic broadcasting (you'll need Octave 3.6.0 or later), the easiest way I can think of is to use it with a comparison. Here's how:
octave> 1:5 == [1 3 2 1 3]'
ans =
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0
0 0 1 0 0
Broadcasting is explained in the Octave manual, but Scipy also has a good explanation of it with nice pictures.
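For comparison, the same broadcasting trick can be sketched in NumPy. This is a Python analogue, not Octave code: the column count of 4 is taken from the Z matrix in the question, and the variable names are chosen for illustration.

```python
import numpy as np

y = np.array([1, 3, 2, 1, 3])  # 1-based column indices, one per row
n_cols = 4                     # width of Z in the question

# Shape (5, 1) compared against shape (4,) broadcasts to a (5, 4) boolean grid
# that is True exactly where the column number equals the row's entry in y.
Z = (np.arange(1, n_cols + 1) == y[:, None]).astype(int)
print(Z)
```

The comparison never materialises an explicit loop or an identity matrix; the row vector of column numbers is stretched down the rows and y is stretched across the columns.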
Found another solution that does not use broadcasting. It does not need a matrix of zeroes either.
octave> y = [1; 3; 2; 1; 3]
octave> eye(5)(y,:)
ans =
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0
0 0 1 0 0
Relevant reading here:
http://www.gnu.org/software/octave/doc/interpreter/Creating-Permutation-Matrices.html