How to merge two files together matching exactly by 2 columns? - awk

I have a first file, output_armlympho.txt, with 5778 lines and 15 columns.
Sample from output_armlympho.txt:
NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
I have another file, named file1, with 25958247 lines and 16 columns.
Sample from file1:
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16
1 chr1_10000044_A_T_b38 ENS 171773 29 30 0.02 0.33 0.144 0.14 chr1 10000044 A T chr1 10060102
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645
I want to merge these files together so that the second and third columns from output_armlympho.txt (CHROM, POS) exactly match the 15th and 16th columns from file1 (column15, column16). However, a problem is that in column15 the format is chr[number], e.g. chr1, whereas in output_armlympho.txt it is just 1. So I need a way to match 1 to chr1 or 7 to chr7, and also match by position. There may also be repeated lines in file1, i.e. lines with the same values in column15 and column16. The two files are not ordered in the same way.
Expected output (all columns from file1 followed by all columns from output_armlympho.txt):
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16 NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 18054785 A T chr7 18054785 42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638 42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645 42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
Current attempt:
awk 'NR==FNR {Tmp[$3] = $16 FS $4; next} ($16 in Tmp) {print $0 FS Tmp[$16]}' output_armlympho.txt file1 > test
(This keys only on position, ignores the chromosome, and refers to a 16th field that output_armlympho.txt does not have, so it cannot produce the merge described above.)

Assumptions:
Within the file output_armlympho.txt, the combination of the 2nd and 3rd columns is unique.
One awk idea:
awk '
# the first header line seen (from output_armlympho.txt) is saved; the header of the
# second file is then printed with the saved header appended
FNR==1 { if (header) print $0,header; else header=$0; next }
# first file (output_armlympho.txt): index each data line by "chr"CHROM and POS
FNR==NR { lines["chr" $2,$3]=$0; next }
# second file (file1): when column15,column16 match a stored key, print both lines
($15,$16) in lines { print $0, lines[$15,$16] }
' output_armlympho.txt file1
This generates:
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16 NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785 42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638 42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645 42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
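For cross-checking the awk result on a manageable subset, here is a minimal pandas sketch of the same join. It assumes both files are whitespace-separated and that file1 fits in memory (with 25958247 lines you may want to work on chunks); the two helper columns added to the smaller file exist only to build a chr-prefixed key:

import pandas as pd

# file1: 16 columns, keys are column15/column16
left = pd.read_csv("file1", sep=r"\s+")
# output_armlympho.txt: 15 columns, keys are CHROM/POS
right = pd.read_csv("output_armlympho.txt", sep=r"\s+")

# build a "chrN" key on the smaller file so it matches column15's format
right["column15"] = "chr" + right["CHROM"].astype(str)
right["column16"] = right["POS"]

merged = left.merge(right, on=["column15", "column16"], how="inner")
merged.to_csv("test", sep=" ", index=False)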

Related

Data Imputation in Pandas Dataframe column

I have 2 tables which I am merging (left join) on a common column, but the other table does not have exactly matching values in that column, and hence some of the merged values are blank. I want to fill the missing values with the closest ten. For example, I have these two dataframes:
d = {'col1': [1.31, 2.22,3.33,4.44,5.55,6.66], 'col2': ['010100', '010101','101011','110000','114000','120000']}
df1=pd.DataFrame(data=d)
d2 = {'col2': ['010100', '010102','010144','114218','121212','166110'],'col4': ['a','b','c','d','e','f']}
df2=pd.DataFrame(data=d2)
# df1
col1 col2
0 1.31 010100
1 2.22 010101
2 3.33 101011
3 4.44 110000
4 5.55 114000
5 6.66 120000
# df2
col2 col4
0 010100 a
1 010102 b
2 010144 c
3 114218 d
4 121212 e
5 166110 f
After left merging on col2,
I get:
df1.merge(df2,how='left',on='col2')
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 NaN
2 3.33 101011 NaN
3 4.44 111100 NaN
4 5.55 114100 NaN
5 6.66 166100 NaN
Versus what I want: for all values where col4 is NaN, my col2 value should first be converted to the closest ten and then matched against col2 of the other table; if there is a match, col4 is placed accordingly; if not, then the closest hundred is tried, then the closest thousand, ten thousand, and so on.
Ideally my answer should be:
col1 col2 col4
0 1.31 010100 a
1 2.22 010101 a
2 3.33 101011 f
3 4.44 111100 d
4 5.55 114100 d
5 6.66 166100 f
Please help me in coding this
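For what it's worth, here is a minimal sketch of one reading of that rule (the helper name fill_by_rounding is made up): round the unmatched col2 to the nearest 10, then 100, and so on, and look the rounded value up among df2's col2 values. It implements the literal rule stated above, so it may not reproduce every row of the quoted expected output:

import pandas as pd

d = {'col1': [1.31, 2.22, 3.33, 4.44, 5.55, 6.66],
     'col2': ['010100', '010101', '101011', '110000', '114000', '120000']}
df1 = pd.DataFrame(data=d)
d2 = {'col2': ['010100', '010102', '010144', '114218', '121212', '166110'],
      'col4': ['a', 'b', 'c', 'd', 'e', 'f']}
df2 = pd.DataFrame(data=d2)

# lookup from the integer value of df2's col2 to col4
lookup = dict(zip(df2['col2'].astype(int), df2['col4']))

def fill_by_rounding(code, max_power=6):
    # round to the nearest 10, 100, 1000, ... until a df2 value matches
    value = int(code)
    for power in range(1, max_power + 1):
        rounded = round(value, -power)
        if rounded in lookup:
            return lookup[rounded]
    return None  # no match at any rounding level

merged = df1.merge(df2, how='left', on='col2')
missing = merged['col4'].isna()
merged.loc[missing, 'col4'] = merged.loc[missing, 'col2'].map(fill_by_rounding)
print(merged)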

Division of column values from 2 files for matching id and header

I have two large files.
test.txt
Id sub_id s_1 s_2 s_3 s_4 s_5 c_1 c_2 ct_1 ct_2
A a 1 4 3 0 0 1 2 1 1
A b 0 0 3 4 3 3 3 1 2
A c 4 4 4 1 1 0 9 7 8
B d 1 3 2 7 0 5 2 8 5
B e 8 7 4 0 8 4 2 11 30
test1.txt
Id s_1 s_2 s_3 s_4 s_5 c_1 c_2 ct_1 ct_2
A 5 8 10 5 4 4 14 9 11
B 9 10 6 7 8 9 4 19 35
expected output
Id sub_id s_1 s_2 s_3 s_4 s_5 c_1 c_2 ct_1 ct_2
A a 0.2 0.5 0.3 0 0 0.25 0.142857 0.111111 0.0909091
A b 0 0 0.3 0.8 0.75 0.75 0.214286 0.111111 0.181818
A c 0.8 0.5 0.4 0.2 0.25 0 0.642857 0.777778 0.727273
B d 0.111111 0.3 0.333333 1 0 0.555556 0.5 0.421053 0.142857
B e 0.888889 0.7 0.666667 0 1 0.444444 0.5 0.578947 0.857143
I am comparing the 1st column of test1.txt with test.txt, and if it matches I calculate values by dividing the columns from test.txt by those from test1.txt. For a smaller file, and without considering the column headers, I can do this with
awk -v OFS='\t' 'NR==FNR{A[$1]=$1;B[$1]=$2; C[$1]=$3; D[$1]=$4; E[$1]=$5; F[$1]=$6; G[$1]=$7; H[$1]=$8; I[$1]=$9; J[$1]=$10; next}FNR==1{print $0}(FNR>1 && A[$1]){print $1, $2, $3/B[$1], $4/C[$1], $5/D[$1], $6/E[$1], $7/F[$1], $8/G[$1], $9/H[$1], $10/I[$1], $11/J[$1]}' test1.txt test.txt
But for files with thousands of columns, what's the best way to do this? Also, can the division be done between columns with matching headers in the two files?
INPUT FILES EDITED to show an example with a different column order
test11.txt
Id sub_id s_1 s_2 s_3 s_4
A a 1 4 3 0
A b 0 0 3 0
A c 4 4 4 0
B d 1 3 2 7
B e 8 7 4 0
test12.txt
Id s_1 s_2 s_4 s_3
A 5 8 0 10
B 9 10 7 6
EXPECTED OUTPUT
Id sub_id s_1 s_2 s_3 s_4
A a 0.2 0.5 0.3 0
A b 0 0 0.3 0
A c 0.8 0.5 0.4 0
B d 0.111111 0.3 0.333333 1
B e 0.888889 0.7 0.666667 0
You may use this awk:
awk 'NR == FNR {
    # first file (test12.txt): remember the header names, then store each total
    # keyed by Id and header name; zero totals are stored as 1 to avoid division by zero
    for (i=2; i<=NF; ++i)
        if (FNR==1)
            h1[i] = $i
        else
            map[$1,h1[i]] = ($i != 0 ? $i : 1)
    next
}
{
    # second file (test11.txt): remember its header names, then divide each value
    # by the total stored for the same Id and header name
    for (i=3; i<=NF; ++i)
        if (FNR==1)
            h2[i] = $i
        else
            $i /= map[$1,h2[i]]
} 1' test12.txt test11.txt | column -t
Id sub_id s_1 s_2 s_3 s_4
A a 0.2 0.5 0.3 0
A b 0 0 0.3 0
A c 0.8 0.5 0.4 0
B d 0.111111 0.3 0.333333 1
B e 0.888889 0.7 0.666667 0
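For completeness, the same header-matched division can be sketched in pandas (file names taken from the edited example above; this assumes whitespace-separated input and identifies the shared columns purely by name, so their order does not matter):

import pandas as pd

detail = pd.read_csv("test11.txt", sep=r"\s+")
totals = pd.read_csv("test12.txt", sep=r"\s+").set_index("Id")

# columns shared by name, regardless of their order in either file
shared = [c for c in detail.columns if c in totals.columns]

# pick each row's denominators by Id; zero totals become 1,
# mirroring the awk answer, to avoid division by zero
denom = totals.loc[detail["Id"], shared].replace(0, 1).to_numpy()
detail[shared] = detail[shared].to_numpy() / denom
print(detail.to_string(index=False))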

How to transfer (sum up) the counts from a set of ranges to ranges that are englobing those ranges?

I am working with sequencing data, but I think the problem applies to different range-value datatypes.
I want to combine several experiments of read counts (values) from a set of DNA regions that have a start and end position (ranges), into summed-up counts for another set of DNA regions, which generally englobe (encompass) many of the primary regions. Like in the following example:
Given the following table A with ranges and counts:
feature start end count1 count2 count3
gene1 1 10 100 30 22
gene2 15 40 20 10 6
gene3 50 70 40 11 7
gene4 100 150 23 15 9
and the following table B (with new ranges):
feature start end
range1 1 45
range2 55 160
I would like to get the following count table with the new ranges:
feature start end count1 count2 count3
range1 1 45 120 40 28
range2 55 160 63 26 16
Just to simplify: if there is at least some overlap (at least a fraction of a feature in table A is contained in a feature in table B), its counts should be added up. Any idea of a tool available that does this, or a script in Perl, Python or R? I am counting the sequencing reads with bedtools multicov, but as far as I have searched there is no other functionality doing what I want. Any idea?
Thanks.
We can do this by:
Creating an artificial key column
Performing an outer join (m x n)
Filtering on the start OR end value being between our ranges
Using pandas.DataFrame.groupby on feature and summing the count columns
Finally concatenating the output to df2, to get the desired output
df1['key'] = 'A'
df2['key'] = 'A'
df3 = pd.merge(df1,df2, on='key', how='outer')
df4 = df3[(df3.start_x.between(df3.start_y, df3.end_y)) | (df3.end_x.between(df3.start_y, df3.end_y))]
df5 = df4.groupby('feature_y').agg({'count1': 'sum',
                                    'count2': 'sum',
                                    'count3': 'sum'}).reset_index()
df_final = pd.concat([df2.drop(['key'], axis=1), df5.drop(['feature_y'], axis=1)], axis=1)
output
print(df_final)
feature start end count1 count2 count3
0 range1 1 45 120 40 28
1 range2 55 160 63 26 16
You can use apply() and pd.concat() with a custom function where a corresponds to your first dataframe and b corresponds to your second dataframe:
def find_englobed(x):
    englobed = a[(a['start'].between(x['start'], x['end'])) | (a['end'].between(x['start'], x['end']))]
    return englobed[['count1','count2','count3']].sum()

pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)
Yields:
feature start end count1 count2 count3
0 range1 1 45 120 40 28
1 range2 55 160 63 26 16
If it can help somebody, based on #rahlf23's answer, I modified it to make it more general, considering that, on one side, there can be more count columns, and that, besides the range, it is also important to be on the right chromosome.
So if table "a" is:
feature Chromosome start end count1 count2 count3
gene1 Chr1 1 10 100 30 22
gene2 Chr1 15 40 20 10 6
gene3 Chr1 50 70 40 11 7
gene4 Chr1 100 150 23 15 9
gene5 Chr2 5 30 24 17 2
gene5 Chr2 40 80 4 28 16
and table "b" is:
feature Chromosome start end
range1 Chr1 1 45
range2 Chr1 55 160
range3 Chr2 10 90
range4 Chr2 100 200
with the following python script:
import pandas as pd
def find_englobed(x):
    englobed = a[(a['Chromosome'] == x['Chromosome']) & (a['start'].between(x['start'], x['end']) | (a['end'].between(x['start'], x['end'])))]
    return englobed[list(a.columns[4:])].sum()

pd.concat([b, b.apply(find_englobed, axis=1)], axis=1)
Now with a['Chromosome'] == x['Chromosome'] & I require them to be on the same chromosome, and with list(a.columns[4:]) I take all the columns from the 5th to the end, so it is independent of the number of count columns.
I obtain the following result:
feature Chromosome start end count1 count2 count3
range1 Chr1 1 45 120.0 40.0 28.0
range2 Chr1 55 160 63.0 26.0 16.0
range3 Chr2 10 90 28.0 45.0 18.0
range4 Chr2 100 200 0.0 0.0 0.0
I am not sure why the obtained counts come out as floating point numbers... any comment?
If you are doing genomics in pandas you might want to look into pyranges:
import pyranges as pr
c = """feature Chromosome Start End count1 count2 count3
gene1 Chr1 1 10 100 30 22
gene2 Chr1 15 40 20 10 6
gene3 Chr1 50 70 40 11 7
gene4 Chr1 100 150 23 15 9
gene5 Chr2 5 30 24 17 2
gene5 Chr2 40 80 4 28 16
"""
c2 = """feature Chromosome Start End
range1 Chr1 1 45
range2 Chr1 55 160
range3 Chr2 10 90
range4 Chr2 100 200 """
gr, gr2 = pr.from_string(c), pr.from_string(c2)
j = gr2.join(gr).drop(like="_b")
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# | feature | Chromosome | Start | End | count1 | count2 | count3 |
# | (object) | (category) | (int32) | (int32) | (int64) | (int64) | (int64) |
# |------------+--------------+-----------+-----------+-----------+-----------+-----------|
# | range1 | Chr1 | 1 | 45 | 100 | 30 | 22 |
# | range1 | Chr1 | 1 | 45 | 20 | 10 | 6 |
# | range2 | Chr1 | 55 | 160 | 40 | 11 | 7 |
# | range2 | Chr1 | 55 | 160 | 23 | 15 | 9 |
# | range3 | Chr2 | 10 | 90 | 24 | 17 | 2 |
# | range3 | Chr2 | 10 | 90 | 4 | 28 | 16 |
# +------------+--------------+-----------+-----------+-----------+-----------+-----------+
# Unstranded PyRanges object has 6 rows and 7 columns from 2 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
df = j.df
fs = {"Chromosome": "first", "Start":
"first", "End": "first", "count1": "sum", "count2": "sum", "count3": "sum"}
result = df.groupby("feature".split()).agg(fs)
# Chromosome Start End count1 count2 count3
# feature
# range1 Chr1 1 45 120 40 28
# range2 Chr1 55 160 63 26 16
# range3 Chr2 10 90 28 45 18

Repetitive grabbing with awk

I am trying to compare two files and combine different columns of each. The example files are:
1.txt
chr8 12 24 . . + chr1 11 12 XX4 -
chr3 22 33 . . + chr4 60 61 XXX9 -
2.txt
chr1 11 1 X1 X2 11 12 2.443 0.843 +1 SXSD 1.3020000
chr1 11 2 X3 X4 11 12 0.888 0.833 -1 XXSD -28.887787
chr1 11 3 X5 X6 11 12 0.888 0.843 +1 XXSD 2.4909883
chr1 12 4 X7 X8 11 12 0.888 0.813 -1 CMKY 0.0009223
chr1 12 5 X9 X10 11 12 0.888 0.010 -1 XASD 0.0009223
chr1 12 6 X11 X12 11 12 0.888 0.813 -1 XUPS 0.10176998
I want to compare the 1st, 6th and 7th columns of 2.txt with the 7th, 8th and 9th columns of 1.txt, and if there is a match, I want to print the whole line of 1.txt together with the 3rd and 12th columns of 2.txt.
The expected output is :
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
My trial is with awk:
awk 'NR==FNR{ a[$1,$6,$7]=$3"\t"$12; next } { s=SUBSEP; k=$7 s $8 s $9 }k in a{ print $0,a[k] }' 2.txt 1.txt
It outputs only the last match and I cannot make it print all matches:
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
How can I repetitively search and print all matches?
You're making it much harder than it has to be by reading the 2nd file first.
$ cat tst.awk
NR==FNR { a[$7,$8,$9] = $0; next }
($1,$6,$7) in a { print a[$1,$6,$7], $3, $12 }
$ awk -f tst.awk 1.txt 2.txt
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
Extended AWK solution:
awk 'NR==FNR{ s=SUBSEP; k=$1 s $6 s $7; a[k]=(k in a? a[k]"#":"")$3"\t"$12; next }
{ s=SUBSEP; k=$7 s $8 s $9 }
k in a{ len=split(a[k], b, "#"); for (i=1;i<=len;i++) print $0,b[i] }' 2.txt 1.txt
s=SUBSEP; k=$1 s $6 s $7 - constructs the key k from the 1st, 6th and 7th fields of the file 2.txt
a[k]=(k in a? a[k]"#":"")$3"\t"$12 - concatenates the $3"\t"$12 sequences with the custom separator # within the same group (a group is represented by k)
s=SUBSEP; k=$7 s $8 s $9 - constructs the key k from the 7th, 8th and 9th fields of the file 1.txt
len=split(a[k], b, "#"); - splits the previously accumulated sequences into array b on the separator #
The output:
chr8 12 24 . . + chr1 11 12 XX4 - 1 1.3020000
chr8 12 24 . . + chr1 11 12 XX4 - 2 -28.887787
chr8 12 24 . . + chr1 11 12 XX4 - 3 2.4909883
chr8 12 24 . . + chr1 11 12 XX4 - 4 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 5 0.0009223
chr8 12 24 . . + chr1 11 12 XX4 - 6 0.10176998
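For comparison, the same many-to-many lookup in a short Python sketch, assuming whitespace-separated fields as in the samples above (1.txt is indexed by columns 7-9, then 2.txt is streamed and every match is printed):

from collections import defaultdict

index = defaultdict(list)
with open("1.txt") as f1:
    for line in f1:
        fields = line.split()
        # key on the 7th, 8th and 9th columns of 1.txt
        index[(fields[6], fields[7], fields[8])].append(line.rstrip("\n"))

with open("2.txt") as f2:
    for line in f2:
        fields = line.split()
        # key on the 1st, 6th and 7th columns of 2.txt
        key = (fields[0], fields[5], fields[6])
        for matched in index.get(key, []):
            # whole line of 1.txt plus the 3rd and 12th columns of 2.txt
            print(matched, fields[2], fields[11])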

How to extract specific lines from a text file using awk?

I have a text file that looks like this.
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
I need to extract all the lines that start with B or H, and the two lines after H. How can I do this using awk?
The expected output would be
B 456

H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Any suggestions please.
Ignoring the blank line after B in your output (your problem specifications give no indication as to why that blank line is in the output, so I'm assuming it should not be there):
awk '/^H/{t=3} /^B/ || t-- >0' input.file
will print all lines that start with B and each line that starts with H along with the next two lines.
awk '/^[BH]/ || /^[[:blank:]]*[[:digit:]]/' inputfile
bash-3.00$ cat t
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
bash-3.00$ awk '{if(( $1 == "B") || ($1 == "H") || ($0 ~ /^ / )) print;}' t
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
OR in short
awk '{if($0 ~ /^[BH ]/ ) print;}' t
OR even shorter
awk '/^[BH ]/' t
If H and B aren't the only headers that are sent before tabular data and you intend to omit those blocks of data (you don't specify the requirements fully) you have to use a flip-flop to remember if you're currently in a block you want to keep or not:
awk '/^[^ 0-9]/ {inblock=0}; /^[BH]/ {inblock=1}; { if (inblock) print }' d.txt
cat filename.txt | awk '/^H/{n=3} /^B/ || n-- > 0' > output.txt
EDIT: Updated for OP's edit