awk to remove lines in file with character in it - awk

I'm trying to use awk to remove each line that has an _ in $5. The formatting of the file makes it look like it's $4, but neither works. I also tried sed '/_/d', but that removed all lines. Thank you :).
file
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 114713789 114713988 NRAS_3
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
chr1 247423836 247425609 NLRP3_3
desired output
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
awk
awk -F\t '$4 !~ /_/'
awk -F\t '$5 !~ /_/'

Could you please try the following, written for GNU awk.
awk -F'[[:blank:]]+' '$5!~/_/' Input_file
Explanation: This sets the [[:blank:]] character class (one or more spaces or tabs) as the field separator for every line of Input_file, then checks whether the 5th field does NOT contain _. If so, the line is printed (no action is given, so the default action of printing the line applies). Note that on the sample as shown, the rows ending in NRAS_3 and NLRP3_3 have only four fields, so $5 is empty there and those rows pass the test as well; the second solution avoids that.
2nd solution: Or, if it's always the last field in your Input_file, try the following.
awk '$NF!~/_/' Input_file

You may use $NF as that field is last field in every line:
awk -F '\t' '$NF !~ /_/' file
chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
Or you can avoid regex:
awk -F '\t' 'index($NF, "_") == 0' file
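As a rough check of why a fixed field number can mislead here, the sketch below runs both tests on the question's sample rows. It assumes the rows are separated by single spaces exactly as displayed, which may not match the real file.

```shell
# Sample rows from the question, assumed to be space-separated as displayed.
input='chr1 114713907 114713907 chr1:115256528-115256528 NRAS
chr1 114713789 114713988 NRAS_3
chr1 247424106 247424106 chr1:247587408-247587408 NLRP3
chr1 247423836 247425609 NLRP3_3'

# Fixed field number: the short rows have no $5, and an empty string never
# matches /_/, so every row passes. Count the rows that get through.
by_field=$(printf '%s\n' "$input" | awk '$5 !~ /_/ {n++} END {print n+0}')

# Last field: works whether a row has four or five fields.
by_last=$(printf '%s\n' "$input" | awk '$NF !~ /_/')

printf 'rows kept by the $5 test: %s\n' "$by_field"
printf '%s\n' "$by_last"
```

Under that assumption the $5 test keeps all four rows, while the $NF test keeps only the two rows without an underscore.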

Related

Print line modified and the line after using awk

I want to modify lines in a file using awk and print the new lines with the following line.
My file is like this
Name_Name2_ Name3_Name4
ASHRGSJFSJRGDJRG
Name5_Name6_Name7_Name8
ADGTHEGHGTJKLGRTIWRK
I want
Name-Name2
ASHRGSJFSJRGDJRG
Name5-Name6
ADGTHEGHGTJKLGRTIWRK
I have used awk to modify my file:
awk -F'_' '{print $1 "-" $2}' file > newfile
but I don't know how to tell it to also print the line just after (ABDJRH).
Surely it is possible with awk, something like x=NR+1 NR<=x?
Thanks
The following awk may help you with this.
awk -F"_" '/_/{print $1"-"$2;next} 1' Input_file
Assuming the structure in your sample (no whitespace inside the "data" lines):
awk '$0=$1' Input_file
# or with sed
sed 's/[[:space:]].*//' Input_file
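To see the "rewrite the header, pass the next line through" pattern on the question's own sample, here is a minimal sketch; the data is copied from the question and fed through a pipe instead of a file.

```shell
# Sample lines from the question.
input='Name_Name2_ Name3_Name4
ASHRGSJFSJRGDJRG
Name5_Name6_Name7_Name8
ADGTHEGHGTJKLGRTIWRK'

# Lines containing "_" are rewritten as "field1-field2"; "next" skips the
# trailing "1" rule, which prints every other line unchanged.
result=$(printf '%s\n' "$input" | awk -F'_' '/_/ {print $1 "-" $2; next} 1')
printf '%s\n' "$result"
```

No look-ahead on NR is needed: each line is handled by whichever rule applies to it.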

How to delete first three columns in a delimited file

For example, I have a csv file as follow,
12345432|1346283301|5676438284971|13564357342151697 ...
87540258|1356433301|1125438284971|135643643462151697 ...
67323266|1356563471|1823543828471|13564386436651697 ...
and hundreds more columns, but I want to remove the first three columns and save the result to a new file (the same file would be even better, if possible).
This is the result I want.
13564357342151697 ...
135643643462151697 ...
13564386436651697 ...
I have been looking and trying but I am not able to do it. And below is the code I have.
awk -F'|' '{print $1 > "newfile"; sub(/^[^|]+\|/,"")}1' old.csv > new.csv
Appreciate if someone can help me. Thank you.
You can use cut:
cut -f4- -d'|' old.csv > new.csv
@Heng: try:
awk -F"|" '{for(i=4;i<=NF;i++){printf("%s%s",$i,i==NF?"":"|")};print ""}' Input_file
OR
awk -F"|" '{for(i=4;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":"|")};}' Input_file
You can redirect this command's output into a file as needed.
EDIT:
awk -F"|" 'FNR==1{++e;fi="REPORT_A1_"e;} {for(i=4;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":"|") > fi}}' Input_file1 Input_file2 Input_file3
This is what you're looking for:
awk -F '|' '{$1=$2=$3=""; print $0}' oldfile > newfile
But the output will have leading separators, and with the default OFS the remaining fields are joined by spaces rather than '|', so set OFS to '|' and add the following substitution:
sub(/^[ \t|]+/,"")
awk 'BEGIN{FS=OFS="|"} {$1=$2=$3=""; sub(/^[ \t|]+/,""); print $0}' oldFile > newFile
awk -F\| '{print $NF}' file >newfile
13564357342151697 ...
135643643462151697 ...
13564386436651697 ...
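The question also asks about saving back to the same file. One possible sketch, assuming a temporary file next to the original is acceptable: write the trimmed output to a scratch file and move it over the original only if the command succeeded. The rows are shortened from the question and the fifth column is made up for illustration.

```shell
# Work in a scratch directory; old.csv is the file name used in the question.
tmpdir=$(mktemp -d)
old="$tmpdir/old.csv"

# Shortened sample rows; the fifth column (x, y) is hypothetical.
printf '%s\n' '12345432|1346283301|5676438284971|13564357342151697|x' \
              '87540258|1356433301|1125438284971|135643643462151697|y' > "$old"

# Write fields 4 onward to a temporary file, and replace the original
# only if cut succeeded.
cut -d'|' -f4- "$old" > "$old.tmp" && mv "$old.tmp" "$old"

result=$(cat "$old")
rm -rf "$tmpdir"
printf '%s\n' "$result"
```

Redirecting straight back onto old.csv would truncate it before it is read, which is why the intermediate file is used.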

Using file redirects to input a variable search pattern to awk

I'm attempting to write a small script in bash. The script's purpose is to pull out a search pattern from file1.txt and to print the line number of the matching search from file2.txt. I know the exact place of the pattern that I want in file1.txt, and I can pull that out quite easily with sed and awk e.g.
sed -n 3p file1.txt | awk '{print $4}'
The part that I'm having trouble with is passing that information again to awk to use as a search pattern in file2.txt. Something along the lines of:
awk '/search_pattern/{print NR}' file2.txt
I was able to get this code working in two lines of code by storing the output of the first line as a variable, and passing that variable to awk in the second line,
myVariable=`sed -n 3p file1.txt | awk '{print $4}'`
awk '/'"$myVariable"'/{print NR}' file2.txt
but this seems "inelegant". I was hoping there was a way to do this in one line of code using file redirects (or something similar?). Any help is greatly appreciated!
You can avoid sed | awk with
awk 'NR==3{print $4; exit 0}' file1.txt
You can do your search with:
search=$(awk 'NR==3{print $4; exit 0}' file1.txt)
awk -v search="$search" '$0 ~ search { print NR }' file2.txt
You could even write that all on one line, but I don't recommend that; clarity is more important than brevity.
In principle, you could use:
awk 'NR==3{search = $4; next} FNR!=NR && $0 ~ search {print FNR}' file1.txt file2.txt
This scans file1.txt and finds the search pattern; then it scans file2.txt and finds the lines that match. One line — even moderately clear. There'll be lots of matches if there isn't a column 4 on line 3 of file1.txt.
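A small sketch of the two-file form on made-up data (the file contents and the word "needle" are hypothetical). It prints FNR so that the reported numbers are relative to file2.txt; NR would keep counting on from file1.txt.

```shell
tmpdir=$(mktemp -d)

# Hypothetical inputs: the pattern sits in field 4 of line 3 of file1.txt.
printf '%s\n' 'a b c d' 'e f g h' 'i j k needle' > "$tmpdir/file1.txt"
printf '%s\n' 'hay' 'a needle here' 'hay' 'needle again' > "$tmpdir/file2.txt"

# NR==3 can only be true in the first file; FNR!=NR is true only once awk
# has moved on to the second file.
lines=$(awk 'NR==3 {search = $4; next}
             FNR!=NR && $0 ~ search {print FNR}' \
        "$tmpdir/file1.txt" "$tmpdir/file2.txt")

rm -rf "$tmpdir"
printf '%s\n' "$lines"
```

Since search is used as a dynamic regular expression, any regex metacharacters in the field are interpreted as such.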

initialising field separators on condition in awk

I know that initialising FS in BEGIN is the correct practice, but what if I need different field separators for different lines (lines containing a particular pattern)? E.g., my awk script is
{if($0 ~ /.*youtube.*/){FS="=";print $2}}
This code is not processing the first line. How do I fix this?
You can use split. E.g., to get the middle value (green) from the third field:
echo "on,cat ,blue|green|red,more" | awk -F, '{split($3,a,"|");print a[2]}'
green
And the BEGIN block is not the only place where you can set the field separator:
echo "on,two,three" | awk -F, '{print $2}'
echo "on,two,three" | awk '{print $2}' FS=,
echo "on,two,three" | awk 'BEGIN{FS=","} {print $2}'
echo "on,two,three" | awk -v FS=, '{print $2}'
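All of these print two.
But they differ in when the assignment takes effect. With -F (or -v), FS is already set inside the BEGIN block:
awk -F, 'BEGIN{print FS}'
,
whereas an FS=, operand is only processed after BEGIN has run, so this prints the default FS (a single space) instead of a comma: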
awk 'BEGIN{print FS}' FS=,
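A quick way to make the timing difference visible, assuming a POSIX-style awk, is to print FS between brackets so that the default single space shows up:

```shell
# -v (like -F) assigns FS before BEGIN runs.
with_v=$(awk -v FS=, 'BEGIN {printf "[%s]", FS}')

# An FS=, operand is handled only when awk reaches it in the operand list,
# which is after BEGIN, so BEGIN still sees the default single space.
as_operand=$(awk 'BEGIN {printf "[%s]", FS}' FS=,)

printf '%s %s\n' "$with_v" "$as_operand"
```

Neither command reads any input, because a program consisting only of BEGIN exits once that block has finished.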
Back to your problem:
This:
awk '{if($0 ~ /.*youtube.*/){FS="=";print $2}}' file
should be:
awk '{if($0 ~ /.*youtube.*/){split($0,a,"=");print a[2]}}' file
You do not need to test for any characters before and after regex, so:
awk '{if($0 ~ /youtube/){split($0,a,"=");print a[2]}}' file
And this could be simplified even more:
awk '/youtube/ {split($0,a,"=");print a[2]}' file
If data is like this:
cat file
youtube=thisisyoutube1 //starts here
youtube=thisisyoutube2
youtube=thisisyoutube3
youtube=thisisyoutube4
yautube=thisisnottobeprinted
Then do like this:
awk -F= '/youtube/ {split($2,a," ");print a[1]}' file
thisisyoutube1
thisisyoutube2
thisisyoutube3
thisisyoutube4
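The reason the original script skips the first line can be sketched as follows, assuming a POSIX-conforming awk: a change to FS inside a rule only applies from the next record on, whereas split() works on the current record straight away. The two input lines are made up.

```shell
# Two hypothetical input lines.
input='youtube=one
youtube=two'

# FS is changed while record 1 is already split on whitespace, so $2 is
# empty there; only record 2 is split on "=".
late=$(printf '%s\n' "$input" | awk '/youtube/ {FS = "="; print "[" $2 "]"}')

# split() re-splits the current record immediately, so both lines work.
now=$(printf '%s\n' "$input" | awk '/youtube/ {split($0, a, "="); print a[2]}')

printf '%s\n' "$late" "$now"
```

The brackets around $2 are only there to make the empty field on the first line visible.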

use awk to print a column, adding a comma

I have a file, from which I want to retrieve the first column, and add a comma between each value.
Example:
AAAA 12345 xccvbn
BBBB 43431 fkodks
CCCC 51234 plafad
to obtain
AAAA,BBBB,CCCC
I decided to use awk, so I did
awk '{ $1=$1","; print $1 }'
The problem is: this adds a comma on the last value too, which is not what I want, and I also get a space between values.
How do I remove the comma on the last element, and how do I remove the space? Spent 20 minutes looking at the manual without luck.
$ awk '{printf "%s%s",sep,$1; sep=","} END{print ""}' file
AAAA,BBBB,CCCC
or if you prefer:
$ awk '{printf "%s%s",(NR>1?",":""),$1} END{print ""}' file
AAAA,BBBB,CCCC
or if you like golf and don't mind it being inefficient for large files:
$ awk '{r=r s $1;s=","} END{print r}' file
AAAA,BBBB,CCCC
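If this is needed in more than one place, the separator-variable idiom can be wrapped in a small shell function. This is only a sketch; join_col1 is a hypothetical name, and the sample rows are the ones from the question.

```shell
# join_col1 is a hypothetical helper name.
join_col1() {
    # sep is empty for the first record and "," afterwards, so no
    # trailing comma is produced; END adds the final newline.
    awk '{printf "%s%s", sep, $1; sep = ","} END {print ""}' "$@"
}

joined=$(printf '%s\n' 'AAAA 12345 xccvbn' 'BBBB 43431 fkodks' 'CCCC 51234 plafad' | join_col1)
printf '%s\n' "$joined"
```

Because "$@" is passed through to awk, the function reads standard input when called without arguments and the named files otherwise.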
awk {'print $1","$2","$3'} file_name
This is the shortest I know
Why make it complicated :) (as long as file is not too large)
awk '{a=NR==1?$1:a","$1} END {print a}' file
AAAA,BBBB,CCCC
For better portability:
awk '{a=(NR>1?a",":"")$1} END {print a}' file
You can do this:
awk 'a++{printf ","}{printf "%s", $1}' file
a++ is interpreted as a condition. In the first row its value is 0, so the comma is not added.
EDIT:
If you want a newline, you have to add END{printf "\n"}. If you have problems reading in the file, you can also try:
cat file | awk 'a++{printf ","}{printf "%s", $1}'
awk 'NR==1{printf "%s",$1;next;}{printf "%s%s",",",$1;}' input.txt
It says: if it is the first line, print only the first field; for the other lines, first print a comma, then print the first field.
Output:
AAAA,BBBB,CCCC
In this case, a simple cut and paste solution:
cut -d" " -f1 file | paste -s -d, -
In case somebody like me wants to use awk for cleaning docker images:
docker image ls | grep tag_name | awk '{print $1":"$2}'
Surprised that no one is using OFS (output field separator). Here is probably the simplest solution that sticks with awk and works on Linux and Mac: use "-v OFS=," to output with a comma as the delimiter:
$ echo '1:2:3:4' | awk -F: -v OFS=, '{print $1, $2, $4, $3}' generates:
1,2,4,3
It works with a multi-character separator too:
$ echo '1:2:3:4' | awk -F: -v OFS=., '{print $1, $2, $4, $3}' outputs:
1.,2.,4.,3
Using Perl
$ cat group_col.txt
AAAA 12345 xccvbn
BBBB 43431 fkodks
CCCC 51234 plafad
$ perl -lane ' push(@x,$F[0]); END { print join(",",@x) } ' group_col.txt
AAAA,BBBB,CCCC
$
This can be as simple as this:
awk -F',' '{print $1","$1","$2","$3}' inputFile
where the input file is:
1,2,3
2,3,4
etc.
I used the following, because it lists the api-resource names with it, which is useful, if you want to access it directly. I also use a label "application" to find specific apps in a namespace:
kubectl -n ops-tools get $(kubectl api-resources --no-headers=true --sort-by=name | awk '{printf "%s%s",sep,$1; sep=","}') -l app.kubernetes.io/instance=application