How can I merge two files while printing a given value on resulting empty fields using AWK?

I have two files:
01File:
1 2051
2 1244
7 917
X 850
22 444
21 233
Y 47
KI270728_1 6
KI270727_1 4
KI270734_1 3
KI270726_1 2
KI270713_1 2
GL000195_1 2
GL000194_1 2
KI270731_1 1
KI270721_1 1
KI270711_1 1
GL000219_1 1
GL000218_1 1
GL000213_1 1
GL000205_2 1
GL000009_2 1
and 02File:
1 248956422
2 242193529
7 159345973
X 156040895
Y 56887902
22 50818468
21 46709983
KI270728_1 1872759
KI270727_1 448248
KI270726_1 43739
GL000009_2 201709
KI270322_1 21476
GL000226_1 15008
KI270311_1 12399
KI270366_1 8320
KI270511_1 8127
KI270448_1 7992
I need to merge these two files on field 1 and print "0" in the resulting empty fields.
I was trying to accomplish this using the following command:
awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0 "\t" a[$1]}' 01File 02File
Which results in the following output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476
GL000226_1 15008
KI270311_1 12399
KI270366_1 8320
KI270511_1 8127
KI270448_1 7992
However, I am having trouble adapting the command so that it prints a value of zero ("0") in the resulting empty fields, generating the following output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476 0
GL000226_1 15008 0
KI270311_1 12399 0
KI270366_1 8320 0
KI270511_1 8127 0
KI270448_1 7992 0
I would be grateful if you could get me going in the right direction.

Use a conditional expression in place of a[$1]: instead of the empty string, "0" will be printed if no line matched.
awk 'FNR==NR{a[$1]=$2;next} {print $0 "\t" ($1 in a? a[$1]: "0")}' 01File 02File
Also, I simplified the first action, since the files have only two fields (the original a[$1]=$2 FS $3 references a nonexistent $3).
Output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476 0
GL000226_1 15008 0
KI270311_1 12399 0
KI270366_1 8320 0
KI270511_1 8127 0
KI270448_1 7992 0
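For comparison, the same lookup-with-default idea as a short Python sketch (assuming whitespace-separated, two-column files named as in the question):
# Build a key -> count lookup from 01File, then left-join 02File against it,
# printing "0" where a key from 02File has no entry in 01File.
lookup = {}
with open("01File") as f:
    for line in f:
        key, value = line.split()
        lookup[key] = value

with open("02File") as f:
    for line in f:
        key, value = line.split()
        print(key, value, lookup.get(key, "0"), sep="\t")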

Related

Reindex kmeans clustered dataframe in ascending order of values

I have created a set of 4 clusters using kmeans, but I'd like to reorder the clusters in an ascending manner to have a predictable way of outputting an analysis every time the script is executed.
The resulting df with the clusters is something like:
customer_id recency frequency monetary_value recency_cluster \
0 44792907512250289 21 1 43.76 0
1 4277896431638207047 443 1 73.13 1
2 1509512561185834874 559 1 37.50 1
3 -8259919882769629944 437 1 34.38 1
4 8269311313560571571 133 2 324.78 0
5 6521698907264712834 311 1 6.32 3
6 9102795320443090762 340 1 174.99 3
7 6203217338400763719 39 1 77.50 0
8 7633758030510673403 625 1 95.26 2
9 -2417721548925747504 644 1 76.84 2
frequency_cluster monetary_value_cluster
0 1 0
1 1 0
2 1 0
3 1 0
4 0 1
5 1 0
6 1 1
7 1 0
8 1 0
9 1 0
The recency clusters are not ordered by the data. I'd like, for example, recency cluster 0 to be the one with the minimum value of 1.0 (which is currently recency cluster 1).
recency_cluster count mean std min 25% 50% 75% max
0 17609.0 700.900960 56.895995 609.0 651.0 697.0 749.0 807.0
1 16458.0 102.692672 62.952229 1.0 47.0 101.0 159.0 210.0
2 17166.0 515.971746 56.592490 418.0 466.0 517.0 567.0 608.0
3 18634.0 317.599227 58.852980 211.0 269.0 319.0 367.0 416.0
Using something like:
rfm_df.groupby('recency_cluster')['recency'].transform('min')
will return a column with the min value of each cluster:
0 1
1 418
2 418
3 418
4 1
...
69862 609
69863 1
69864 211
69865 609
69866 211
I guess there's got to be a way to convert these categories [1, 211, 418, 609] into [0, 1, 2, 3] in order to get the desired result, but I can't come up with a solution.
Or maybe there's a better approach to the problem.
Edit: I did this and I think it's working:
rfm_df['recency_normalized_cluster'] = rfm_df.groupby('recency_cluster')['recency'].transform('min').astype('category').cat.codes
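That works; an equivalent, arguably more explicit route is to rank each cluster's minimum recency and map the old labels onto the new ones (a sketch reusing the column names from the question):
# Minimum recency per cluster, ranked ascending: the cluster whose min is 1
# becomes 0, the next smallest becomes 1, and so on.
cluster_min = rfm_df.groupby('recency_cluster')['recency'].min()
relabel = cluster_min.rank(method='dense').sub(1).astype(int)
rfm_df['recency_normalized_cluster'] = rfm_df['recency_cluster'].map(relabel)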

Pandas, Replace values of a column with a variable (negative) if it is less than that variable, else keep the values as is

say:
m = 170000; v = -(m/100)  # so v == -1700
d = {'01-09-2021': 631, '02-09-2021': -442, '08-09-2021': 6, '09-09-2021': 1528, '13-09-2021': 2042, '14-09-2021': 1098, '15-09-2021': -2092, '16-09-2021': -6718, '20-09-2021': -595, '22-09-2021': 268, '23-09-2021': -2464, '28-09-2021': 611, '29-09-2021': -1700, '30-09-2021': 4392}
I want to replace values in column 'Final' with v if the value is less than v, else keep the original value. I tried numpy.where, df.loc, etc., but couldn't get it to work.
You can use clip (the lower bound here is v, i.e. -1700):
df['Final'] = df['Final'].clip(lower=-1700)
print(df)
# Output:
Date Final
0 01-09-2021 631
1 02-09-2021 -442
2 08-09-2021 6
3 09-09-2021 1528
4 13-09-2021 2042
5 14-09-2021 1098
6 15-09-2021 -1700
7 16-09-2021 -1700
8 20-09-2021 -595
9 22-09-2021 268
10 23-09-2021 -1700
11 28-09-2021 611
12 29-09-2021 -1700
13 30-09-2021 4392
Or the classic np.where:
df['Final'] = np.where(df['Final'] < -1700, -1700, df['Final'])
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': list(d.keys()), 'Final': list(d.values())})
You can try:
df.loc[df['Final']<v, 'Final'] = v
Output:
Date Final
0 01-09-2021 631
1 02-09-2021 -442
2 08-09-2021 6
3 09-09-2021 1528
4 13-09-2021 2042
5 14-09-2021 1098
6 15-09-2021 -1700
7 16-09-2021 -1700
8 20-09-2021 -595
9 22-09-2021 268
10 23-09-2021 -1700
11 28-09-2021 611
12 29-09-2021 -1700
13 30-09-2021 4392

How to combine two groupby into one

I have two GroupBy results.
The first one:
ser2 = ser.groupby(pd.cut(ser, 10)).sum()
(-2620.137, 476638.7] 12393813
(476638.7, 951152.4] 9479666
(951152.4, 1425666.1] 14381033
(1425666.1, 1900179.8] 5113056
(1900179.8, 2374693.5] 4114429
(2374693.5, 2849207.2] 4929537
(2849207.2, 3323720.9] 0
(3323720.9, 3798234.6] 0
(3798234.6, 4272748.3] 3978230
(4272748.3, 4747262.0] 4747262
And the second:
ser1= pd.cut(ser, 10)
print(ser1.value_counts())
(-2620.137, 476638.7] 110
(476638.7, 951152.4] 15
(951152.4, 1425666.1] 12
(1425666.1, 1900179.8] 3
(2374693.5, 2849207.2] 2
(1900179.8, 2374693.5] 2
(4272748.3, 4747262.0] 1
(3798234.6, 4272748.3] 1
(3323720.9, 3798234.6] 0
(2849207.2, 3323720.9] 0
Question: is there a way to combine these operations into a single piece of code, so both calculations land in the same pivot table?
Use GroupBy.agg; instead of value_counts, use GroupBy.size:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg(['sum','size'])
print (df)
sum size
(-0.039, 3.9] 27 14
(3.9, 7.8] 49 9
(7.8, 11.7] 142 15
(11.7, 15.6] 151 11
(15.6, 19.5] 159 9
(19.5, 23.4] 187 9
(23.4, 27.3] 253 10
(27.3, 31.2] 176 6
(31.2, 35.1] 231 7
(35.1, 39.0] 375 10
If you need custom column names:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg([('col1','sum'),('col2','size')])
print (df)
col1 col2
(-0.039, 3.9] 27 14
(3.9, 7.8] 49 9
(7.8, 11.7] 142 15
(11.7, 15.6] 151 11
(15.6, 19.5] 159 9
(19.5, 23.4] 187 9
(23.4, 27.3] 253 10
(27.3, 31.2] 176 6
(31.2, 35.1] 231 7
(35.1, 39.0] 375 10
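If your pandas is 0.25 or newer (an assumption about the environment), named aggregation expresses the same thing without the list of tuples:
import numpy as np
import pandas as pd

np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg(col1='sum', col2='size')  # named aggregation
print(df)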

SQL: Insert Rows and interpolate

I have a large SQL table and I want to add rows so that all issue ages 40-75 are present, with each issue age given a db_perk and accel_perk filled in via linear interpolation.
Here is a small portion of my data
class gender iss_age dur db_perk accel_perk ext_perk
111 F 40 1 0.1961 0.0025 0
111 F 45 1 0.2985 0.0033 0
111 F 50 1 0.472 0.0065 0
111 F 55 1 0.7075 0.01 0
111 F 60 1 1.0226 0.0238 0
111 F 65 1 1.5208 0.0551 0
111 F 70 1 2.3808 0.1296 0
111 F 75 1 4.0748 0.3242 0
I want my output to look something like this
class gender iss_age dur db_perk accel_perk ext_perk
111 F 40 1 0.1961 0.0025 0
111 F 41 1 0.21656 0.00266 0
111 F 42 1 0.23702 0.00282 0
111 F 43 1 0.25748 0.00298 0
111 F 44 1 0.27794 0.00314 0
111 F 45 1 0.2985 0.0033 0
I basically want all the columns other than iss_age, db_perk, and accel_perk to repeat the values from the row above.
Is there any way to do this?
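No SQL answer was posted; purely to illustrate the linear-interpolation step, here is a hedged pandas sketch (column names follow the question; loading the real table, e.g. via pd.read_sql, is an assumption):
import pandas as pd

# Sketch only: two anchor rows from the question. In practice you would load
# the full table and apply this per (class, gender) group, with range(40, 76).
df = pd.DataFrame({
    'class': [111, 111], 'gender': ['F', 'F'], 'iss_age': [40, 45],
    'dur': [1, 1], 'db_perk': [0.1961, 0.2985],
    'accel_perk': [0.0025, 0.0033], 'ext_perk': [0, 0],
})

out = df.set_index('iss_age').reindex(range(40, 46))  # insert missing ages 41-44
out[['db_perk', 'accel_perk']] = out[['db_perk', 'accel_perk']].interpolate()  # linear fill
out = out.ffill().reset_index()  # class/gender/dur/ext_perk repeat the row above
print(out)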

calculating sum and average for selected data only

I have a dataset as below:
col-1 col-2 col-3 col-4 col-5 col-6 col-7 col-8
0 17 215 55.7059 947 BMR_42 O22-BMR_1 O23-H23
1 1 1 1.0000 1 BMR_42 O23-BMR_1 O23-H23
2 31 3 1.0968 34 BMR_31 O22-BMR_1 O26-H26
3 11 2 1.0909 12 BMR_31 O13-BMR_1 O26-H26
4 20 5 1.8500 37 BMR_49 O22-BMR_1 O26-H26
5 24 4 1.7917 43 BMR_49 O23-BMR_1 O26-H26
6 41 2 1.0488 43 BMR_49 O12-BMR_1 O12-H12
7 28 2 1.0357 29 BMR_49 O22-BMR_1 O13-H13
8 1 1000 1000.0000 1000 BMR_49 O13-BMR_1 O13-H13
9 1 1 1.0000 1 BMR_22 O12-BMR_2 O22-H22
10 50 62 18.9400 947 BMR_59 O13-BMR_2 O22-H22
11 1 1 1.0000 1 BMR_59 O25-BMR_2 O23-H23
12 34 5 1.1471 39 BMR_59 O13-BMR_2 O23-H23
13 7 6 2.1429 15 BMR_59 O26-BMR_2 O24-H24
14 6 8 3.6667 22 BMR_59 O25-BMR_2 O24-H24
15 28 2 1.1071 31 BMR_10 O26-BMR_2 O26-H26
16 52 121 15.1346 787 BMR_10 O25-BMR_2 O26-H26
17 65 9 1.9231 125 BMR_10 O13-BMR_2 O26-H26
18 4 4 2.2500 9 BMR_59 O26-BMR_2 O26-H26
19 9 1 1.0000 9 BMR_22 O15-BMR_2 O13-H13
20 1 1 1.0000 1 BMR_10 O11-BMR_2 O16-H16
21 7 2 1.1429 8 BMR_53 O13-BMR_2 O16-H16
22 2 3 2.5000 5 BMR_33 O13-BMR_3 O22-H22
23 97 54 6.8247 662 BMR_61 O26-BMR_3 O22-H22
24 1 1 1.0000 1 BMR_29 O26-BMR_3 O23-H23
25 31 36 3.3226 103 BMR_29 O16-BMR_3 O23-H23
(The real file contains over 2000 lines).
I want to select data under certain criteria and find the sum and average of that selection. For example, I want to select lines containing O22 in both columns $7 and $8 and calculate the sum and average of the values in column $4.
I tried a script as below:
awk '$7 ~ /O22/ && $8 ~ /O22/ {sum += $4} END {print sum, (sum/NR) }' hhsolute.lifetime2.dat
This code selects the lines correctly, but the average (sum/NR) doesn't come out right.
How could I get the sum and average only for the data lines I want? Appreciate any help in advance.
Divide by a count of the matching lines rather than NR, since NR counts every line in the file, matched or not:
awk -v tgt="O22" '
$7 ~ tgt && $8 ~ tgt { sum+=$4; cnt++ }   # sum and count only the matching lines
END { print sum+0, (cnt ? sum/cnt : 0) }  # sum+0 prints 0 rather than blank if nothing matched
' file
Try this:
awk '$7~/O22/ && $8~/O22/{++n;sum+=$4}END{if(n) print "Sum = " (sum), "Average= "(sum/n)}' File
If the 7th and 8th fields both contain the pattern O22, add the 4th field's value to sum and increment n. In the END block, print the sum and average.