How can I merge two files while printing a given value on resulting empty fields using AWK?

I have two files:
01File:
1 2051
2 1244
7 917
X 850
22 444
21 233
Y 47
KI270728_1 6
KI270727_1 4
KI270734_1 3
KI270726_1 2
KI270713_1 2
GL000195_1 2
GL000194_1 2
KI270731_1 1
KI270721_1 1
KI270711_1 1
GL000219_1 1
GL000218_1 1
GL000213_1 1
GL000205_2 1
GL000009_2 1
and 02File:
1 248956422
2 242193529
7 159345973
X 156040895
Y 56887902
22 50818468
21 46709983
KI270728_1 1872759
KI270727_1 448248
KI270726_1 43739
GL000009_2 201709
KI270322_1 21476
GL000226_1 15008
KI270311_1 12399
KI270366_1 8320
KI270511_1 8127
KI270448_1 7992
I need to merge these two files on field 1 and print "0" in the resulting empty fields.
I was trying to accomplish this using the following command:
awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0 "\t" a[$1]}' 01File 02File
Which results in the following output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476
GL000226_1 15008
KI270311_1 12399
KI270366_1 8320
KI270511_1 8127
KI270448_1 7992
However, I am having trouble adapting the command so that it prints a value of zero ("0") in the resulting empty fields, generating the following output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476 0
GL000226_1 15008 0
KI270311_1 12399 0
KI270366_1 8320 0
KI270511_1 8127 0
KI270448_1 7992 0
I would be grateful if you could get me going in the right direction.

Use a conditional expression in place of a[$1]: instead of the empty string, "0" will be printed if no line matched.
awk 'FNR==NR{a[$1]=$2;next} {print $0 "\t" ($1 in a? a[$1]: "0")}' 01File 02File
Also, I simplified the first action, since the files have only two fields (the original a[$1]=$2 FS $3 references a nonexistent $3).
Output:
1 248956422 2051
2 242193529 1244
7 159345973 917
X 156040895 850
Y 56887902 47
22 50818468 444
21 46709983 233
KI270728_1 1872759 6
KI270727_1 448248 4
KI270726_1 43739 2
GL000009_2 201709 1
KI270322_1 21476 0
GL000226_1 15008 0
KI270311_1 12399 0
KI270366_1 8320 0
KI270511_1 8127 0
KI270448_1 7992 0
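For comparison, the same lookup-with-default idea as a short Python sketch (assuming whitespace-separated, two-column files named as in the question):
# Build a key -> count lookup from 01File, then left-join 02File against it,
# printing "0" where a key from 02File has no entry in 01File.
lookup = {}
with open("01File") as f:
    for line in f:
        key, value = line.split()
        lookup[key] = value

with open("02File") as f:
    for line in f:
        key, value = line.split()
        print(key, value, lookup.get(key, "0"), sep="\t")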

Related

Reindex kmeans clustered dataframe in ascending order of values

I have created a set of 4 clusters using kmeans, but I'd like to reorder the clusters in an ascending manner to have a predictable way of outputting an analysis every time the script is executed.
The resulting df with the clusters is something like:
customer_id recency frequency monetary_value recency_cluster \
0 44792907512250289 21 1 43.76 0
1 4277896431638207047 443 1 73.13 1
2 1509512561185834874 559 1 37.50 1
3 -8259919882769629944 437 1 34.38 1
4 8269311313560571571 133 2 324.78 0
5 6521698907264712834 311 1 6.32 3
6 9102795320443090762 340 1 174.99 3
7 6203217338400763719 39 1 77.50 0
8 7633758030510673403 625 1 95.26 2
9 -2417721548925747504 644 1 76.84 2
frequency_cluster monetary_value_cluster
0 1 0
1 1 0
2 1 0
3 1 0
4 0 1
5 1 0
6 1 1
7 1 0
8 1 0
9 1 0
The recency clusters are not ordered by the data. I'd like, for example, recency cluster 0 to be the one with the minimum value of 1.0 (which is currently recency cluster 1).
recency_cluster count mean std min 25% 50% 75% max
0 17609.0 700.900960 56.895995 609.0 651.0 697.0 749.0 807.0
1 16458.0 102.692672 62.952229 1.0 47.0 101.0 159.0 210.0
2 17166.0 515.971746 56.592490 418.0 466.0 517.0 567.0 608.0
3 18634.0 317.599227 58.852980 211.0 269.0 319.0 367.0 416.0
Using something like:
rfm_df.groupby('recency_cluster')['recency'].transform('min')
will return a column with the min value of each cluster:
0 1
1 418
2 418
3 418
4 1
...
69862 609
69863 1
69864 211
69865 609
69866 211
I guess there's got to be a way to convert these categories [1, 211, 418, 609] into [0, 1, 2, 3] in order to get the desired result, but I can't come up with a solution.
Or maybe there's a better approach to the problem.
Edit: I did this and I think it's working:
rfm_df['recency_normalized_cluster'] = rfm_df.groupby('recency_cluster')['recency'].transform('min').astype('category').cat.codes
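That works; an equivalent, arguably more explicit route is to rank each cluster's minimum recency and map the old labels onto the new ones (a sketch reusing the column names from the question):
# Minimum recency per cluster, ranked ascending: the cluster whose min is 1
# becomes 0, the next smallest becomes 1, and so on.
cluster_min = rfm_df.groupby('recency_cluster')['recency'].min()
relabel = cluster_min.rank(method='dense').sub(1).astype(int)
rfm_df['recency_normalized_cluster'] = rfm_df['recency_cluster'].map(relabel)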

Pandas, Replace values of a column with a variable (negative) if it is less than that variable, else keep the values as is

say:
m = 170000; v = -(m/100)  # so v == -1700
d = {'01-09-2021': 631, '02-09-2021': -442, '08-09-2021': 6, '09-09-2021': 1528, '13-09-2021': 2042, '14-09-2021': 1098, '15-09-2021': -2092, '16-09-2021': -6718, '20-09-2021': -595, '22-09-2021': 268, '23-09-2021': -2464, '28-09-2021': 611, '29-09-2021': -1700, '30-09-2021': 4392}
I want to replace values in column 'Final' with v if the value is less than v, else keep the original value. I tried numpy.where, df.loc, etc., but couldn't get it to work.
You can use clip (the lower bound here is v, i.e. -1700):
df['Final'] = df['Final'].clip(lower=-1700)
print(df)
# Output:
Date Final
0 01-09-2021 631
1 02-09-2021 -442
2 08-09-2021 6
3 09-09-2021 1528
4 13-09-2021 2042
5 14-09-2021 1098
6 15-09-2021 -1700
7 16-09-2021 -1700
8 20-09-2021 -595
9 22-09-2021 268
10 23-09-2021 -1700
11 28-09-2021 611
12 29-09-2021 -1700
13 30-09-2021 4392
Or the classic np.where:
df['Final'] = np.where(df['Final'] < -1700, -1700, df['Final'])
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': list(d.keys()), 'Final': list(d.values())})
You can try:
df.loc[df['Final']<v, 'Final'] = v
Output:
Date Final
0 01-09-2021 631
1 02-09-2021 -442
2 08-09-2021 6
3 09-09-2021 1528
4 13-09-2021 2042
5 14-09-2021 1098
6 15-09-2021 -1700
7 16-09-2021 -1700
8 20-09-2021 -595
9 22-09-2021 268
10 23-09-2021 -1700
11 28-09-2021 611
12 29-09-2021 -1700
13 30-09-2021 4392

How to combine two groupby into one

I have two GroupBy results.
The first one:
ser2 = ser.groupby(pd.cut(ser, 10)).sum()
(-2620.137, 476638.7] 12393813
(476638.7, 951152.4] 9479666
(951152.4, 1425666.1] 14381033
(1425666.1, 1900179.8] 5113056
(1900179.8, 2374693.5] 4114429
(2374693.5, 2849207.2] 4929537
(2849207.2, 3323720.9] 0
(3323720.9, 3798234.6] 0
(3798234.6, 4272748.3] 3978230
(4272748.3, 4747262.0] 4747262
And the second:
ser1= pd.cut(ser, 10)
print(ser1.value_counts())
(-2620.137, 476638.7] 110
(476638.7, 951152.4] 15
(951152.4, 1425666.1] 12
(1425666.1, 1900179.8] 3
(2374693.5, 2849207.2] 2
(1900179.8, 2374693.5] 2
(4272748.3, 4747262.0] 1
(3798234.6, 4272748.3] 1
(3323720.9, 3798234.6] 0
(2849207.2, 3323720.9] 0
Question: is there a way to combine these operations into a single piece of code, so both calculations land in the same pivot table?
Use GroupBy.agg; instead of value_counts, use GroupBy.size:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg(['sum','size'])
print (df)
sum size
(-0.039, 3.9] 27 14
(3.9, 7.8] 49 9
(7.8, 11.7] 142 15
(11.7, 15.6] 151 11
(15.6, 19.5] 159 9
(19.5, 23.4] 187 9
(23.4, 27.3] 253 10
(27.3, 31.2] 176 6
(31.2, 35.1] 231 7
(35.1, 39.0] 375 10
If you need custom column names:
np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg([('col1','sum'),('col2','size')])
print (df)
col1 col2
(-0.039, 3.9] 27 14
(3.9, 7.8] 49 9
(7.8, 11.7] 142 15
(11.7, 15.6] 151 11
(15.6, 19.5] 159 9
(19.5, 23.4] 187 9
(23.4, 27.3] 253 10
(27.3, 31.2] 176 6
(31.2, 35.1] 231 7
(35.1, 39.0] 375 10
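If your pandas is 0.25 or newer (an assumption about the environment), named aggregation expresses the same thing without the list of tuples:
import numpy as np
import pandas as pd

np.random.seed(2020)
ser = pd.Series(np.random.randint(40, size=100))
df = ser.groupby(pd.cut(ser, 10)).agg(col1='sum', col2='size')  # named aggregation
print(df)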

SQL: Insert Rows and interpolate

I have a large SQL table and I want to add rows so that all issue ages 40-75 are present, with each issue age given a db_perk and accel_perk filled in via linear interpolation.
Here is a small portion of my data
class gender iss_age dur db_perk accel_perk ext_perk
111 F 40 1 0.1961 0.0025 0
111 F 45 1 0.2985 0.0033 0
111 F 50 1 0.472 0.0065 0
111 F 55 1 0.7075 0.01 0
111 F 60 1 1.0226 0.0238 0
111 F 65 1 1.5208 0.0551 0
111 F 70 1 2.3808 0.1296 0
111 F 75 1 4.0748 0.3242 0
I want my output to look something like this
class gender iss_age dur db_perk accel_perk ext_perk
111 F 40 1 0.1961 0.0025 0
111 F 41 1 0.21656 0.00266 0
111 F 42 1 0.23702 0.00282 0
111 F 43 1 0.25748 0.00298 0
111 F 44 1 0.27794 0.00314 0
111 F 45 1 0.2985 0.0033 0
I basically want all the columns other than iss_age, db_perk, and accel_perk to repeat the values from the row above.
Is there any way to do this?
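No SQL answer was posted; purely to illustrate the linear-interpolation step, here is a hedged pandas sketch (column names follow the question; loading the real table, e.g. via pd.read_sql, is an assumption):
import pandas as pd

# Sketch only: two anchor rows from the question. In practice you would load
# the full table and apply this per (class, gender) group, with range(40, 76).
df = pd.DataFrame({
    'class': [111, 111], 'gender': ['F', 'F'], 'iss_age': [40, 45],
    'dur': [1, 1], 'db_perk': [0.1961, 0.2985],
    'accel_perk': [0.0025, 0.0033], 'ext_perk': [0, 0],
})

out = df.set_index('iss_age').reindex(range(40, 46))  # insert missing ages 41-44
out[['db_perk', 'accel_perk']] = out[['db_perk', 'accel_perk']].interpolate()  # linear fill
out = out.ffill().reset_index()  # class/gender/dur/ext_perk repeat the row above
print(out)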

calculating sum and average for selected data only

I have a dataset as below:
col-1 col-2 col-3 col-4 col-5 col-6 col-7 col-8
0 17 215 55.7059 947 BMR_42 O22-BMR_1 O23-H23
1 1 1 1.0000 1 BMR_42 O23-BMR_1 O23-H23
2 31 3 1.0968 34 BMR_31 O22-BMR_1 O26-H26
3 11 2 1.0909 12 BMR_31 O13-BMR_1 O26-H26
4 20 5 1.8500 37 BMR_49 O22-BMR_1 O26-H26
5 24 4 1.7917 43 BMR_49 O23-BMR_1 O26-H26
6 41 2 1.0488 43 BMR_49 O12-BMR_1 O12-H12
7 28 2 1.0357 29 BMR_49 O22-BMR_1 O13-H13
8 1 1000 1000.0000 1000 BMR_49 O13-BMR_1 O13-H13
9 1 1 1.0000 1 BMR_22 O12-BMR_2 O22-H22
10 50 62 18.9400 947 BMR_59 O13-BMR_2 O22-H22
11 1 1 1.0000 1 BMR_59 O25-BMR_2 O23-H23
12 34 5 1.1471 39 BMR_59 O13-BMR_2 O23-H23
13 7 6 2.1429 15 BMR_59 O26-BMR_2 O24-H24
14 6 8 3.6667 22 BMR_59 O25-BMR_2 O24-H24
15 28 2 1.1071 31 BMR_10 O26-BMR_2 O26-H26
16 52 121 15.1346 787 BMR_10 O25-BMR_2 O26-H26
17 65 9 1.9231 125 BMR_10 O13-BMR_2 O26-H26
18 4 4 2.2500 9 BMR_59 O26-BMR_2 O26-H26
19 9 1 1.0000 9 BMR_22 O15-BMR_2 O13-H13
20 1 1 1.0000 1 BMR_10 O11-BMR_2 O16-H16
21 7 2 1.1429 8 BMR_53 O13-BMR_2 O16-H16
22 2 3 2.5000 5 BMR_33 O13-BMR_3 O22-H22
23 97 54 6.8247 662 BMR_61 O26-BMR_3 O22-H22
24 1 1 1.0000 1 BMR_29 O26-BMR_3 O23-H23
25 31 36 3.3226 103 BMR_29 O16-BMR_3 O23-H23
(The real file contains over 2000 lines).
I want to select data under certain criteria and find the sum and average of that selection. For example, I want to select lines containing O22 in both columns $7 and $8 and calculate the sum and average of the values in column $4.
I tried a script as below:
awk '$7 ~ /O22/ && $8 ~ /O22/ {sum += $4} END {print sum, (sum/NR) }' hhsolute.lifetime2.dat
This code selects the lines correctly, but the average (sum/NR) doesn't come out right.
How could I get the sum and average only for the data lines I want? Appreciate any help in advance.
Divide by a count of the matching lines rather than NR, since NR counts every line in the file, matched or not:
awk -v tgt="O22" '
$7 ~ tgt && $8 ~ tgt { sum+=$4; cnt++ }   # sum and count only the matching lines
END { print sum+0, (cnt ? sum/cnt : 0) }  # sum+0 prints 0 rather than blank if nothing matched
' file
Try this:
awk '$7~/O22/ && $8~/O22/{++n;sum+=$4}END{if(n) print "Sum = " (sum), "Average= "(sum/n)}' File
If the 7th and 8th fields both contain the pattern O22, add the 4th field's value to sum and increment n. In the END block, print the sum and average.