Division of column values from 2 files for matching id and header - awk

I have two large files.
test.txt
Id sub_id s_1 s_2 s_3 s_4 s_5 c_1 c_2 ct_1 ct_2
A a 1 4 3 0 0 1 2 1 1
A b 0 0 3 4 3 3 3 1 2
A c 4 4 4 1 1 0 9 7 8
B d 1 3 2 7 0 5 2 8 5
B e 8 7 4 0 8 4 2 11 30
test1.txt
Id s_1 s_2 s_3 s_4 s_5 c_1 c_2 ct_1 ct_2
A 5 8 10 5 4 4 14 9 11
B 9 10 6 7 8 9 4 19 35
expected output
Id sub_id s_1 s_2 s_3 s_4 s_5 c_1 c_2 ct_1 ct_2
A a 0.2 0.5 0.3 0 0 0.25 0.142857 0.111111 0.0909091
A b 0 0 0.3 0.8 0.75 0.75 0.214286 0.111111 0.181818
A c 0.8 0.5 0.4 0.2 0.25 0 0.642857 0.777778 0.727273
B d 0.111111 0.3 0.333333 1 0 0.555556 0.5 0.421053 0.142857
B e 0.888889 0.7 0.666667 0 1 0.444444 0.5 0.578947 0.857143
I am comparing the 1st column of test1.txt with test.txt and, where it matches, calculating values by dividing the columns of test.txt by those of test1.txt. For a smaller file, and without considering the column headers, I can do this with:
awk -v OFS='\t' 'NR==FNR{A[$1]=$1;B[$1]=$2; C[$1]=$3; D[$1]=$4; E[$1]=$5; F[$1]=$6; G[$1]=$7; H[$1]=$8; I[$1]=$9; J[$1]=$10; next}FNR==1{print $0}(FNR>1 && A[$1]){print $1, $2, $3/B[$1], $4/C[$1], $5/D[$1], $6/E[$1], $7/F[$1], $8/G[$1], $9/H[$1], $10/I[$1], $11/J[$1]}' test1.txt test.txt
But for files with thousands of columns, what is the best way to do this? Also, can the division be done between columns with matching headers in the two files?
INPUT FILES EDITED to show a different column order
test11.txt
Id sub_id s_1 s_2 s_3 s_4
A a 1 4 3 0
A b 0 0 3 0
A c 4 4 4 0
B d 1 3 2 7
B e 8 7 4 0
test12.txt
Id s_1 s_2 s_4 s_3
A 5 8 0 10
B 9 10 7 6
EXPECTED OUTPUT
Id sub_id s_1 s_2 s_3 s_4
A a 0.2 0.5 0.3 0
A b 0 0 0.3 0
A c 0.8 0.5 0.4 0
B d 0.111111 0.3 0.333333 1
B e 0.888889 0.7 0.666667 0

You may use this awk, which stores each test12.txt value under an (Id, header) key on the first pass and divides the matching columns of test11.txt on the second pass, so differing column order between the files does not matter:
awk 'NR == FNR {                  # first file: test12.txt (denominators)
    for (i=2; i<=NF; ++i)
        if (FNR==1)
            h1[i] = $i            # remember the header name of column i
        else
            map[$1,h1[i]] = ($i != 0 ? $i : 1)  # key by (Id, header); zero -> 1 to avoid division by zero
    next
}
{                                 # second file: test11.txt (numerators)
    for (i=3; i<=NF; ++i)
        if (FNR==1)
            h2[i] = $i            # header name of column i
        else
            $i /= map[$1,h2[i]]   # divide by the value with the same Id and header
} 1' test12.txt test11.txt | column -t
Id sub_id s_1 s_2 s_3 s_4
A a 0.2 0.5 0.3 0
A b 0 0 0.3 0
A c 0.8 0.5 0.4 0
B d 0.111111 0.3 0.333333 1
B e 0.888889 0.7 0.666667 0
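For comparison, here is a minimal pandas sketch of the same header-matched division (an addition of mine, not part of the original answer; it assumes whitespace-separated files and mirrors the awk's zero-denominator guard):

import pandas as pd

num = pd.read_csv('test11.txt', sep=r'\s+')   # numerators: Id, sub_id, data columns
den = pd.read_csv('test12.txt', sep=r'\s+')   # denominators: Id, data columns

den = den.set_index('Id').replace(0, 1)       # zero denominators -> 1, like the awk guard
cols = num.columns[2:]                        # data columns, in test11.txt's header order

out = num.copy()
# den.loc[num['Id'], cols] repeats each Id's denominator row and reorders its
# columns to match test11.txt's headers, so columns are matched by name.
out[cols] = num[cols].values / den.loc[num['Id'], cols].values
print(out.to_string(index=False))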

Related

Put next month's start as previous month's end - pandas

I have a dataframe in long format (panel data). Each person has a start month along with variables. It looks something like:
Data description:

person_id  month_start  Var1  Var2
1          1            0.4   1.4
1          2            0.3   0.131
1          3            0.34  0.434
2          2            0.49  0.949
2          3            0.53  1.53
2          5            0.38  0.738
3          1            1.12  1.34
3          4            1.89  1.02
3          5            0.83  0.27
and I need it to look like:
person_id  month_start  month_end  Var1  Var2
1          1            2          0.4   1.4
1          2            3          0.3   0.131
1          3            4          0.34  0.434
2          2            3          0.49  0.949
2          3            5          0.53  1.53
2          5            6          0.38  0.738
3          1            4          1.12  1.34
3          4            5          1.89  1.02
3          5            6          0.83  0.27
Where month_end is the month_start of that person's next entry.
I was able to make this:
a = pd.DataFrame({'person_id': [1,1,1,2,2,2,3,3,3],
                  'var1': [0.4, 0.3, 0.34, 0.49, 0.53, 0.38, 1.12, 1.89, 0.83],
                  'var2': [1.4, 0.131, 0.434, 0.949, 1.53, 0.738, 1.34, 1.02, 0.27],
                  'month_start': [1,2,3,2,3,5,1,4,5]})

def add_end_date(df_in, object_id, start_col, end_col):
    df = df_in.copy()
    prev_person_id = -1
    prev_index = -1
    df[end_col] = [-1]*len(df)
    for idx, row in df.iterrows():
        p_id = row[object_id]
        p_idx = idx
        if prev_person_id == p_id:
            df.loc[prev_index, end_col] = int(row[start_col])  # put in start date as last entry's end date
        if row[end_col] == -1:
            df.loc[idx, end_col] = int(row[start_col] + 1)
        prev_person_id = p_id
        prev_index = p_idx
    return df

add_end_date(a, 'person_id', 'month_start', 'month_end')
Is there a better/optimized way to accomplish this?
Try groupby.shift: within each person, shift(-1) pulls the next row's month_start up to become the current row's month_end, and fillna(df.month_start + 1) supplies each person's final row:
df['month_end'] = df.groupby('person_id').month_start.shift(-1)\
.fillna(df.month_start + 1).astype(int)
df
person_id month_start Var1 Var2 month_end
0 1 1 0.40 1.400 2
1 1 2 0.30 0.131 3
2 1 3 0.34 0.434 4
3 2 2 0.49 0.949 3
4 2 3 0.53 1.530 5
5 2 5 0.38 0.738 6
6 3 1 1.12 1.340 4
7 3 4 1.89 1.020 5
8 3 5 0.83 0.270 6

How to group by ID and get the count per category

Here I am again.
I have a df like this:
id c1 c2 c3
0 0 11 12 0
1 0 15 15 1
2 0 4 24 2
3 0 5 13 2
4 0 3 15 1
5 0 5 7 0
6 0 3 18 2
7 0 17 9 3
8 0 0 17 1
9 0 12 0 0
10 1 17 9 3
11 1 1 21 2
12 1 0 3 1
13 1 4 20 3
14 1 8 22 0
15 1 16 23 2
16 1 0 3 1
17 1 4 20 3
18 1 19 17 1
19 1 12 0 0
For each ID, I want to count the values in c3 (treat them as categories) and then divide each count by the number of rows for that id.
For example:
ID = 0 has 10 observations, 3 in c3.0, 3 in c3.1, 3 in c3.2, 1 in c3.3
ID = 1 has 10 observations, 2 in c3.0, 3 in c3.1, 2 in c3.2, 3 in c3.3
I want to obtain something like this :
ID c3.0 c3.1 c3.2 c3.3
0 0.3 0.3 0.3 0.1
1 0.2 0.3 0.2 0.3
The names of the columns are not relevant
Thanks for the help!
We can use groupby value_counts with normalize=True to count the occurrences of 'c3' per 'id', normalized by the total length of the group. Then unstack to get wide form:
out = df.groupby('id')['c3'].value_counts(normalize=True).unstack()
out:
c3 0 1 2 3
id
0 0.3 0.3 0.3 0.1
1 0.2 0.3 0.2 0.3
Some cleanup with add_prefix to update the column headers, and reset_index to make id a column:
out = (
    df.groupby('id')['c3'].value_counts(normalize=True)
      .unstack()
      .rename_axis(columns=None)
      .add_prefix('c3.')
      .reset_index()
)
out:
id c3.0 c3.1 c3.2 c3.3
0 0 0.3 0.3 0.3 0.1
1 1 0.2 0.3 0.2 0.3
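A caveat worth adding (my note, not from the original answer): if some id never has a given c3 category, unstack leaves NaN in that cell. Passing fill_value=0 keeps the table numeric:
out = (
    df.groupby('id')['c3'].value_counts(normalize=True)
      .unstack(fill_value=0)      # 0 instead of NaN for categories absent in a group
      .rename_axis(columns=None)
      .add_prefix('c3.')
      .reset_index()
)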
You can use crosstab:
result = pd.crosstab(df.id, df.c3, normalize='index')
Rename the columns:
result.columns = [f'{result.columns.name}.{label}' for label in result.columns]
result.rename_axis(None)
c3.0 c3.1 c3.2 c3.3
0 0.3 0.3 0.3 0.1
1 0.2 0.3 0.2 0.3

divide a column based on groupby or looping conditions in pandas

I have a data frame as shown below
B_ID No_Show Session slot_num Patient_count
1 0.2 S1 1 1
2 0.3 S1 2 1
3 0.8 S1 3 1
4 0.3 S1 3 2
5 0.6 S1 4 1
6 0.8 S1 5 1
7 0.9 S1 5 2
8 0.4 S1 5 3
9 0.6 S1 5 4
12 0.9 S2 1 1
13 0.5 S2 1 2
14 0.3 S2 2 1
15 0.7 S2 3 1
20 0.7 S2 4 1
16 0.6 S2 5 1
17 0.8 S2 5 2
19 0.3 S2 5 3
where
No_Show = Probability of no show
Assume that
threshold probability = 0.2
Duration for each slot = 30 (minutes)
From the above I would like to calculate the data frame below.
Step 1: sort the dataframe by Session, slot_num and Patient_count
df = df.sort_values(['Session', 'slot_num', 'Patient_count'], ascending=False)
Step 2: calculate the cut-off using the conditions below
If Patient_count = 1: divide No_Show by the threshold probability.
Example: for B_ID = 3, Patient_count = 1, cut_off = 0.8/0.2 = 4
Else if Patient_count = 2: multiply the previous No_Show by the current No_Show and divide by the threshold.
Example: for B_ID = 4, Patient_count = 2, cut_off = (0.3*0.8)/0.2 = 1.2
Else if Patient_count = 3: multiply the previous 2 No_Show values by the current No_Show and divide by the threshold.
Example: for B_ID = 8, Patient_count = 3, cut_off = (0.4*0.9*0.8)/0.2 = 1.44
And so on: in general, the cut-off for Patient_count = k is the product of the k No_Show values so far in that Session and slot_num, divided by the threshold.
The Expected Output:
B_ID No_Show Session slot_num Patient_count Cut_off
1 0.2 S1 1 1 1
2 0.3 S1 2 1 1.5
3 0.8 S1 3 1 4
4 0.3 S1 3 2 1.2
5 0.6 S1 4 1 3
6 0.8 S1 5 1 4
7 0.9 S1 5 2 3.6
8 0.4 S1 5 3 1.44
9 0.6 S1 5 4 0.864
12 0.9 S2 1 1 4.5
13 0.5 S2 1 2 2.25
14 0.3 S2 2 1 1.5
15 0.7 S2 3 1 3.5
20 0.7 S2 4 1 3.5
16 0.6 S2 5 1 3
17 0.8 S2 5 2 2.4
19 0.3 S2 5 3 0.72
Use GroupBy.cumprod to build the running product of No_Show within each (Session, slot_num) group, then divide by the threshold probability with Series.div:
probability = 0.2
df['new'] = df.groupby(['Session','slot_num'])['No_Show'].cumprod().div(probability)
print (df)
B_ID No_Show Session slot_num Patient_count new
0 1 0.2 S1 1 1 1.000
1 2 0.3 S1 2 1 1.500
2 3 0.8 S1 3 1 4.000
3 4 0.3 S1 3 2 1.200
4 5 0.6 S1 4 1 3.000
5 6 0.8 S1 5 1 4.000
6 7 0.9 S1 5 2 3.600
7 8 0.4 S1 5 3 1.440
8 9 0.6 S1 5 4 0.864
9 12 0.9 S2 1 1 4.500
10 13 0.5 S2 1 2 2.250
11 14 0.3 S2 2 1 1.500
12 15 0.7 S2 3 1 3.500
13 20 0.7 S2 4 1 3.500
14 16 0.6 S2 5 1 3.000
15 17 0.8 S2 5 2 2.400
16 19 0.3 S2 5 3 0.720
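One caveat to add (not part of the original answer): cumprod multiplies in row order, so if the frame is not already ordered like the sample, sort it ascending first:
df = df.sort_values(['Session', 'slot_num', 'Patient_count'])  # cumprod follows row order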

How to fix the domain violation error in GAMS

When I run this code it shows a domain violation error for an element. How do I remove the error?
...
table data (i, coef)
a b c
1 0.0016 2 0
2 0.01 2.5 0
3 0.0625 1.0 0
4 0.00834 3.25 0
5 0.025 3 0
6 0.025 3 0;
table Losscoef(i,j)
1 2 3 4 5 6
1 0.000218 0.000103 0.000009 -0.00001 0.000002 0.000027
2 0.000103 0.000181 0.000004 -0.000015 0.000002 0.00003
3 0.000009 0.000004 0.000417 -0.000131 -0.000153 -0.000107
4 -0.00014 -0.000015 -0.000131 0.000221 0.000094 0.00005
5 0.000002 0.000002 -0.000153 0.000094 0.000243 0
6 0.000027 0.00003 -0.000107 0.00005 0 0.000358;
...
There is no error once you declare the sets first, so every row and column label in the tables is a known element:
set i /1*6/
coef /a,b,c/;
alias(i,j);
table data (i, coef)
a b c
1 0.0016 2 0
2 0.01 2.5 0
3 0.0625 1.0 0
4 0.00834 3.25 0
5 0.025 3 0
6 0.025 3 0;
table Losscoef(i,j)
1 2 3 4 5 6
1 0.000218 0.000103 0.000009 -0.00001 0.000002 0.000027
2 0.000103 0.000181 0.000004 -0.000015 0.000002 0.00003
3 0.000009 0.000004 0.000417 -0.000131 -0.000153 -0.000107
4 -0.00014 -0.000015 -0.000131 0.000221 0.000094 0.00005
5 0.000002 0.000002 -0.000153 0.000094 0.000243 0
6 0.000027 0.00003 -0.000107 0.00005 0 0.000358;

Delete row of a dataframe on a condition

Here is my first dataframe df1:
269 270 271 346
0 1 153.00 2.14 1
1 1 153.21 3.89 2
2 1 153.90 2.02 1
3 1 154.18 3.02 1
4 1 154.47 2.30 1
5 1 154.66 2.73 1
6 1 155.35 2.82 1
7 1 155.70 2.32 1
8 1 220.00 15.50 1
9 0 152.64 1.44 1
10 0 152.04 2.20 1
11 0 150.48 1.59 1
12 0 149.88 1.73 1
13 0 129.00 0.01 1
Here is my second dataframe df2:
269 270 271 346
0 0 149.88 2.0 1
I would like the row at index 12 to be removed, because df1 and df2 have the same values in columns ['269'] and ['270'].
Hope the solutions below match your requirement.
Using anti_join from dplyr (note: this and the next solution are R, not pandas):
library(dplyr)
anti_join(df1, df2, by = c("269", "270"))
Using the %in% operator, pairing the two columns so values are matched row-wise (non-syntactic names such as 269 need backticks in R):
df1[!(paste(df1$`269`, df1$`270`) %in% paste(df2$`269`, df2$`270`)), ]
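Since the question's frames look like pandas rather than R, here is a minimal pandas sketch of the same anti-join (my addition; it assumes df1 and df2 from the question, with column labels the strings '269' and '270' as printed):

key = ['269', '270']
# True where a row's ('269', '270') pair also appears in df2
mask = df1.set_index(key).index.isin(df2.set_index(key).index)
result = df1[~mask]   # drops the row at index 12, keeps the original index labels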