Modify tab delimited txt file - awk

I want to modify tab delimited txt file using linux commands sed/awk/or any other method
This is an example of tab delimited txt file which I want to modify for R boxplot input:
----start of input format---------
chr8 38277027 38277127 Ex8_inner
25425 8 100 0.0800000
chr8 38277027 38277127 Ex8_inner
25426 4 100 0.0400000
chr9 38277027 38277127 Ex9_inner
25427 9 100 0.0900000
chr9 38277027 38277127 Ex9_inner
25428 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
30935 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
31584 1 100 0.0100000
all 687 1 1000 0.0010000
all 694 1 1000 0.0010000
all 695 1 1000 0.0010000
all 697 1 1000 0.0010000
all 699 6 1000 0.0060000
all 700 2 1000 0.0020000
all 723 7 1000 0.0070000
all 740 8 1000 0.0080000
all 742 1 1000 0.0010000
all 761 5 1000 0.0050000
all 814 2 1000 0.0020000
all 821 48 1000 0.0480000
------end of input file format------
I want it to be modified so that 4th column of odd rows becomes 1st column and 2nd column of the even rows (1st column is blank) becomes 2nd column. Rows starting with "all" gets deleted.
This is how output file should look:
-----start of the output file----
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
-----end of the output file----

EDIT: As OP has changed Input_file sample a bit so adding code too it.
awk --re-interval 'match($0,/Exon[0-9]{1,}/){val=substr($0,RSTART,RLENGTH);getline;sub(/^ +/,"",$1);print val,$1}' Input_file
NOTE: My awk is old version to I added --re-interval to it you need not to add it in case you have recent version of it too.
With single awk following may help you on same too.
awk '/Ex[0-9]+_inner/{val=$NF;getline;sub(/^ +/,"",$1);print val,$1}' Input_file
Explanation: Adding explanation too here for same.
awk '
/Ex[0-9]+_inner/{ ##Checking condition here if a line contains string Ex then digits _inner if yes then do following actions.
val=$NF; ##Creating variable named val whose value is $NF(last field of current line).
getline; ##using getline which is out of the box keyword of awk to take the cursor to the next line from current line.
sub(/^ +/,"",$1); ##Using sub utility of awk to substitute initial space of first field with NULL.
print val,$1 ##Printing variable named val and first field value here.
}
' Input_file ##Mentioning the Input_file name here.

another awk
$ awk '/^all/{next}
!/^chr/{printf "%s\n", $1; next}
{printf "%s ", $NF}' file
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
or perhaps
$ awk '!/^all/{if(/^chr/) printf "%s", $NF OFS; else print $1}' file

Related

grep file matching specific column

I want to keep only the lines in results.txt that matched the IDs in uniq.txt based on matches in column 3 of results.txt. Usually I would use grep -f uniq.txt results.txt, but this does not specify column 3.
uniq.txt
9606
234831
131
31313
results.txt
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014973.1 443143 121 121 26 151 3
With your shown samples, please try following code.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' uniq.txt results.txt
Explanation:
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when uniq.txt is being read.
arr[$0] ##Creating arrar with index of current line.
next ##next will skip all further statements from here.
}
($3 in arr) ##If 3rd field is present in arr then print line from results.txt here.
' uniq.txt results.txt ##Mentioning Input_file names here.
2nd solution: In case your field number is not set in results.txt and you want to search values in whole line then try following.
awk 'FNR==NR{arr[$0];next} {for(key in arr){if(index($0,key)){print;next}}}' uniq.txt results.txt
You can use grep in combination with sed to manipulate the input patterns and achieve what you're looking for
grep -Ef <(sed -e 's/^/^(\\S+\\s+){2}/;s/$/\\s*/' uniq.txt) result.txt
If you want to match nth column, replace 2 in above command with n-1
outputs
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3

dividing a data file to new files based on data on a particular column

I have a data file (data.txt) as shown below:
0 25 10 25000
1 25 7 18000
1 25 9 15000
0 20 9 1000
1 20 8 800
0 20 8 900
0 50 10 4000
0 50 5 2500
1 50 10 5000
I want to copy the rows with same value in the second column to separate files. I want to get following three files:
data.txt_25
0 25 10 25000
1 25 7 18000
1 25 9 15000
data.txt_20
0 20 9 1000
1 20 8 800
0 20 8 900
data.txt_50
0 50 10 4000
0 50 5 2500
1 50 10 5000
I have just started learning awk. I have tried the following bash script:
1 #!/bin/bash
2
3 for var in 20 25 50
4 do
5 awk -v var="$var" '$2==var { print $0 }' data.txt > data.txt_$var
6 done
While the bash script does what I want it to do, it is time consuming as I have to put the values of second column data in line 3 manually.
So I would like to do this using awk. How can I achieve this using awk ?
Thanks in advance.
Could you please try following, this considers that your 2nd column numbers are NOT in sorted form.
sort -k2 Input_file |
awk '
prev!=$2{
close(output_file)
output_file="data.txt_"$2
}
{
print > (output_file)
prev=$2
}'
In case your Input_file's 2nd column is sorted then no need to use sort you could directly use like:
awk '
prev!=$2{
close(output_file)
output_file="data.txt_"$2
}
{
print > (output_file)
prev=$2
}' Input_file
Explanation: Adding a detailed explanation for above.
sort -k2 Input_file | ##Sorting Input_file with respect to 2nd column then passing output to awk
awk ' ##Starting awk program from here.
prev!=$2{ ##Checking if prev variable is NOT equal to $2 then do following.
close(output_file) ##Closing output_file in back-end to avoid "too many files opened" errors.
output_file="data.txt_"$2 ##Creating variable output_file to data.txt_ with $2 here.
}
{
print > (output_file) ##Printing current line to output_file here.
prev=$2 ##Setting variable prev to $2 here.
}'
For the given sample, you can also use this:
awk -v RS= '{f = "data.txt_" $2; print > f; close(f)}' data.txt
-v RS= paragraph mode, empty lines are used to separate input records
f = "data.txt_" $2 construct filename using second column value (by default awk split input record on spaces/tabs/newlines)
print > f write input record contents to filename
close(f) close the file

awk to update file with sum of matching `another file

In the awk below I am trying to add a penalty to a score to each matching $1 in file2 based on the sum of $3+$4 (variable TL) in file1. Then the $4 value in file1 is divided by TL and multiplied by 100 (this valvue is variable S). Finally, $2 in file2 -S gives the updated $2 result in file2. Since math is not my strong suit there probaly is a better way of doing this, but this is what I could think off. Thank you :).
file1 space delimited
ACP5 4 1058 0
ACTB5 10 1708 79
ORAI1 2 952 0
TBX1 9 1932 300
file2 tab-delimited
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
desired output tab-delimited the --- is an example calculation and not part of the output
ACP5 100.00
ACTB 89.59 ---- $3+$4=1787 this is TL (comes from file1), $4/TL*100 is 4.42, $2 in file2 is 100 - 4.42 = 95.58 ----
ORAI1 94.01
TBX1 63.79
awk
awk '
FNR==NR{ # process each line
TL[$1]=($3+$4);next} ($1 in TL) # from file1 store sum of $3 and $4 in TL
{S=(P[$4]/TL)*100;printf("%s\t %.2f\n",$1, $2-S) # store $4/TL from file1 in S and subtract S from $2 in file2, output two decimal places
}1' OFS="\t" file1 FS="\t" file2 # update and define input
current output
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
As pointed out in the comments, the question is not completely clear. Since I can't comment yet I will give a solution that calculates the values as requested.
awk '
NF==4 { S[$1] = 100 * $4 / ($3 + $4) }
NF==2 { printf("%s\t%.2f\n", $1, $2 - S[$1]) }
' file1 file2
file1
ACP5 4 1058 0
ACTB 10 1708 79
ORAI1 2 952 0
TBX1 9 1932 300
file2
ACP5 100.00
ACTB 100.00
ORAI1 94.01
TBX1 77.23
output
ACP5 100.00
ACTB 95.58
ORAI1 94.01
TBX1 63.79
Explanation:
The script works by calculating and storing the S value in a associative array using $1 as the key. This is done in a block filtered by NF==4, so it will only runs for the first file (the only one with 4 fields). Finally, for NF==2 representing the second file, the result is printed using a printf and by subtracting the corresponding S value from $2.
Observation: Keep in mind that as #kvantour pointed out the example you provided does not follow the indications in the question. For example, where did the 89.59 value come from? The explanation ends up with 95.58 as the result just like the output of the script I provided

Comparing two lists and printing select columns from each list

I want to compare two lists and print some columns from one, and some from the other if two match. I suspect I'm close but I suppose it's better to check..
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st of file1 and print the 1st column of file2 and all of file1, I tried awk 'NR==FNR {a[$3];next}$1 in a{print$0}' file2 file1
but that only prints matches in file1. I tried adding x=$1 in the awk. i.e. awk 'NR==FNR {x=$1;a[$3];next}$1 in a{print x $0} file2 file1 but that saves only one value of $1 and outputs that value every line. I also tried adding $1 into a[$3], which is obviously wrong thus giving zero output.
Ideally I'd like to get this output:
blue 145 0.119
ted 500 0.88
which is the 1st column of file2 and the 3rd column of file2 matched to 1st column of file1, and the rest of file1.
You had it almost exactly in your second attempt. Just instead of assigning the value of $1 to a scalar you can stash it in the array for later use.
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88

Concatenate files based off unique titles in their first column

I have many files that are of two column format with a label in the first column and a number in the second column. The number is positive (never zero):
AGS 3
KET 45
WEGWET 12
FEW 56
Within each file, the labels are not repeated.
I would like to concatenate these many files into one file with many+1 columns, such that the first column includes the unique set of all labels across all files, and the last five columns include the number for each label of each file. If the label did not exist in a certain file (and hence there is no number for it), I would like it to default to zero. For instance, if the second file contains this:
AGS 5
KET 14
KJV 2
FEW 3
then the final output would look like:
AGS 3 5
KET 45 14
WEGWET 12 0
KJV 0 2
FEW 56 3
I am new to Linux, and have been playing around with sed and awk, but realize this probably requires multiple steps...
*Edit note: I had to change it from just 2 files to many files. Even though my example only shows 2 files, I would like to do this in case of >2 files as well. Thank you...
Here is one way using awk:
awk '
NR==FNR {a[$1]=$0;next}
{
print (($1 in a)?a[$1] FS $2: $1 FS "0" FS $2)
delete a[$1]
}
END{
for (x in a) print a[x],"0"
}' file1 file2 | column -t
AGS 3 5
KET 45 14
KJV 0 2
FEW 56 3
WEGWET 12 0
You read file1 in to an array indexed at column 1 and assign entire line as it's value
For the file2, check if column 1 is present in our array. If it is print the value from file1 along with value from file2. If it is not present print 0 as value for file1.
Delete the array element as we go along to get only what was unique in file1.
In the END block print what was unique in file1 and print 0 for file2.
Pipe the output to column -t for pretty format.
Assuming that your data are in files named file1 and file2:
$ awk 'FNR==NR {a[$1]=$2; b[$1]=0; next} {a[$1]+=0; b[$1]=$2} END{for (x in b) {printf "%-15s%3s%3s\n",x,a[x],b[x]}}' file1 file2
KJV 0 2
WEGWET 12 0
KET 45 14
AGS 3 5
FEW 56 3
To understand the above, we have to understand an awk trick.
In awk, NR is the number of records (lines) that have been processed and FNR is the number of records that we have processed in the current file. Consequently, the condition FNR==NR is true only when we are processing in the first file. In this case, the associative array a gets all the values from the first file and associative array b gets placeholder, i.e. zero, values. When we process the second file, its values go in array b and we make sure that array a at least has a placeholder value of zero. When we are done with the second file, the data is printed.
More than two files using GNU Awk
I created a file3:
$ cat file3
AGS 3
KET 45
WEGWET 12
FEW 56
AGS 17
ABC 100
The awk program extended to work with any number of files is:
$ awk 'FNR==1 {n+=1} {a[$1][n]=$2} END{for (x in a) {printf "%-15s",x; for (i=1;i<=n;i++) {printf "%5s",a[x][i]};print ""}}' file1 file2 file3
KJV 2
ABC 100
WEGWET 12 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56
This code works creates a file counter. We know that we are in a new file every time that FNR is 1 and a counter, n, is incremented. For every line we encounter, we put the data in a 2-D array. The first dimension of a is the label and the second is the number of the file that we encountered it in. In the end, we just loop over all the labels and all the files, from 1 to n and print the data.
More than 2 files without GNU Awk
Without requiring GNU's awk, we can solve the problem using simulated two-dimensional arrays:
$ awk 'FNR==1 {n+=1} {b[$1]=1; a[$1,":",n]=$2} END{for (x in b) {printf "%-15s",x; for (i=1;i<=n;i++) {q=a[x,":",i]+0; printf "%5s",q};print ""}}' file1 file2 file3
KJV 0 2 0
ABC 0 0 100
WEGWET 12 0 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56