copy everything after matched pattern till end of each line - file-io

i have a file like
abcjjsdxsaxsaskjakjxas, sham nkhhjkllllhhghhkjlkll
jjjusdiolsjshsnsjsusjus sham ooushsjsysghsjsjksksls
ilsjsusynshshssjsjgsgtsttwfwgwywuwnwhww owuwywtwbwwh sham
losuyeeegftgsyshshsh sjsisusns sisisusus lckcncncmcn owueyete
sisuysyshsbs sham hdndhgdgebeheodjdjdhdgdgd
loshsbvsgshjsjssmms twrqeqqgtw wtwrfsxvxvzflld
spsishdvd dkdididjd shsh shshsh llll sham iiiiyhh
I want to print everything after "sham" on each line into a different file.
I want the output line-wise only. I have tried various things using awk on Unix.
Thanks,
sham

I got the answer here:
awk '{for(i=1;i<=NF;i++)if($i~/sham/)print $(i) $(i+1) $(i+2) $(i+3) $(i+4) $(i+5) $(i+6) $(i+7)}' file
You can adjust the print command per your requirement.
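If the goal is the text after "sham" rather than a fixed number of fields, a simpler approach is to delete everything up to and including the match with sub(). A minimal sketch (file names are placeholders):

```shell
# Delete everything up to and including "sham " on each matching line;
# sub() returns 1 on success, so only the modified lines are printed.
awk 'sub(/.*sham */, "")' file > newfile
```

Lines without "sham" are skipped entirely, which matches the line-wise output requested above.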

Related

AWK - Replace between two csv files and print correctly

I have 2 files:
f1.csv:
CL*VIN
AV*AZA
PS*LUG
f2.csv:
2100-12-31*1234A*Thomas*Frederuc*1931-02-20*6791237*6791238*test1*1*0*0*CL*Jame 12*13*a1*zz3*D*13*1234*Tex*F
2100-12-31*1235A*Jack*Borbin*1931-02-21*7791238*7791239*test2*1*0*0*PS*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F
2100-12-31*1236A*Pierce*Matheus*1931-02-22*8791239*8791240*test3*1*1*1*AV*Magnum st*15*a3*zz5*A*13*1236*Euo*F
And I want this output:
2100-12-31*1234A*Thomas*Frederuc*1931-02-20*6791237*6791238*test1*1*0*0*VIN*Jame 12*13*a1*zz3*D*13*1234*Tex*F
2100-12-31*1235A*Jack*Borbin*1931-02-21*7791238*7791239*test2*1*0*0*LUG*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F
2100-12-31*1236A*Pierce*Matheus*1931-02-22*8791239*8791240*test3*1*1*1*AZA*Magnum st*15*a3*zz5*A*13*1236*Euo*F
I have the following code:
awk -F"*" 'FNR==NR{ A[$1]=$2;next} ($12 in A){$12=A[$12];print}' OFS='*' f1.csv f2.csv
But the output is:
*Jame 12*13*a1*zz3*D*13*1234*Tex*F931-02-20*6791237*6791238*test1*1*0*0*VIN
*Willliams Hou*14*a2*zz4*A*13*1235*Barc*F791238*7791239*test2*1*0*0*LUG
*Magnum st*15*a3*zz5*A*13*1236*Euo*F-02-22*8791239*8791240*test3*1*1*1*VIN
How can I obtain my desired output?
Your code works perfectly fine here; what are your system/environment and awk version?
It seems to be something to do with carriage returns, so it's better to run this before dealing with these files:
dos2unix files
However, you can try this:
awk 'BEGIN{FS=OFS="*";RS="\r\n|\r|\n";}FNR==NR{A[$1]=$2;next}($12 in A){$12=A[$12];print}' f1.csv f2.csv
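To confirm the carriage-return diagnosis before reaching for either fix, you can make the line endings visible. A quick check, assuming GNU cat (the -A flag shows a CR as ^M and the end of line as $):

```shell
# A file saved with DOS line endings carries an invisible CR before each newline.
printf 'CL*VIN\r\n' > demo.csv   # demo file with a CRLF ending
cat -A demo.csv                  # prints: CL*VIN^M$  -- the ^M is the stray CR
```

If the last field of f1.csv carries that CR, the lookup value in A[] never equals the clean $12 of f2.csv, and the CR also makes the terminal overwrite the line from the left, producing exactly the scrambled output shown above.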

Keeping lines in a file that do not match a list of words in another file

I want to keep lines in a file that do not match a list of words kept in another file (not whole-line matching). For a toy dataset, I have created a list_file.txt which contains:
BGC0001184
BGC0000853
And a large_file.txt that contains:
contig com1_25_species_1.25M_idxstats.txt
BGC0000853 0
BGC0000853 14
BGC0000853 2
BGC0000854 6
BGC0001185 7
BGC0001185 13
BGC0001184 31
BGC0001186 11
BGC0001184 31
BGC0001184 31
And I use grep as follows:
grep -vf list_file.txt large_file.txt
All good. I get the desired output:
contig com1_25_species_1.25M_idxstats.txt
BGC0000854 6
BGC0001185 7
BGC0001185 13
BGC0001186 11
Now, when I try to apply the same to the real large dataset (same format, large files), it's not working.
What am I missing here? Please let me know if you have any awk/sed suggestions.
Thanks.
Links for my large dataset files are below:
List File:
https://drive.google.com/file/d/14wa6iopzgZUz56C8a3eWRvLPyU_PkCMK/view?usp=sharing
Large File:
https://drive.google.com/file/d/1O3LYE15o9wJmMmsdxcb4xzjaIw1E9VYa/view?usp=sharing
For your shown samples, could you please try the following, written and tested in GNU awk.
awk '{sub(/\r$/,"")} FNR==NR{arr[$0];next} !($1 in arr)' list_file.txt large_file.txt
Second solution: in case your values are NOT exactly the same (the 1st fields of the two files) and you want to do partial matching, then try the following.
awk '{sub(/\r$/,"")} FNR==NR{arr[$0];next} {for(i in arr){if(index($0,i)){next}}} 1' list_file.txt large_file.txt
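Since the toy data worked but the real files did not, the usual culprit is Windows line endings in one of the files, which the sub(/\r$/,"") above handles. If you would rather stay with grep, a hedged alternative is to strip the CRs first and switch to fixed-string, whole-word matching (-F avoids regex interpretation of the patterns, -w stops BGC0001184 from also excluding a hypothetical BGC00011845):

```shell
# Clean the pattern list of stray carriage returns, then exclude
# lines containing any listed ID as a whole, literal word.
tr -d '\r' < list_file.txt > list_clean.txt
grep -vFwf list_clean.txt large_file.txt
```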

print from match & process several input files

If you look at my questions from the past weeks, you will find I have asked questions similar to this one. I had problems asking in the demanded format since I did not really know where my problems came from. E. Morton tells me not to use range expressions; well, I do not know exactly what they are. I found in this forum many questions like mine with working answers,
like "How to print following line from a match" (e.g.).
But all the solutions I found stop working when I process more than one input file, and I need to process many.
I use this command:
gawk -f 1.awk print*.csv > new.txt
while 1.awk contains:
BEGIN { OFS=FS=";"
pattern="row4"
}
go {print} $0 ~ pattern {go = 1}
input file 1 print1.csv contains:
row1;something;in;this;row;;;;;;;
row2;something;in;this;row;;;;;;;
row3;something;in;this;row;;;;;;;
row4;don't;need;to;match;the;whole;line,;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
Input file 2 print2.csv contains the same just for illustration purpose.
The 1.awk (and several other ways I found in this forum to print from a match) works for one file. Output:
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
BUT not when I process more input files.
Each time I process more than one input file this way, the awk commands 'to print from match' seem to be ignored.
As said, I was told not to use range expressions. I do not know how, and maybe the problem is linked to the way I input several files?
Just reset your match indicator at the beginning of each file:
$ awk 'FNR==1{p=0} p; /row4/{p=1} ' file1 file2
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
UPDATE
From the comments
is it possible to combine your awk with: "if $1=="row5", then write "row5" into $6 and delete the value "row5" in $5"? In other words, to move the content "row5", if found there, to a new column 6? I could do this with another awk, but a combination into one would be nicer
... $1=="row5"{$6=$5; $5=""} ...
or, if you want to use another field instead of $5 replace $5 with the corresponding field number.
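Putting the two pieces together, the combined one-liner might look like this (a sketch; it assumes the same ";"-separated input as 1.awk above):

```shell
# Reset the flag per file, apply the field move to printed "row5"
# records, print everything after the match, then set the flag.
awk 'BEGIN{FS=OFS=";"} FNR==1{p=0} p && $1=="row5"{$6=$5; $5=""} p; /row4/{p=1}' print1.csv print2.csv
```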

How can I remove lines from a file with more than a certain number of entries

I've looked at the similar question about removing lines with more than a certain number of characters and my problem is similar but a bit trickier. I have a file that is generated after analyzing some data and each line is supposed to contain 29 numbers. For example:
53.0399 0.203827 7.28285 0.0139936 129.537 0.313907 11.3814 0.0137903 355.008 0.160464 12.2717 0.120802 55.7404 0.0875189 11.3311 0.0841887 536.66 0.256761 19.4495 0.197625 46.4401 2.38957 15.8914 17.1149 240.192 0.270649 19.348 0.230402 23001028 23800855
53.4843 0.198886 7.31329 0.0135975 129.215 0.335697 11.3673 0.014766 355.091 0.155786 11.9938 0.118147 55.567 0.368255 11.449 0.0842612 536.91 0.251735 18.9639 0.184361 47.2451 0.119655 18.6589 0.592563 240.477 0.298805 20.7409 0.254856 23001585
50.7302 0.226066 7.12251 0.0158698 237.335 1.83226 15.4057 0.059467 -164.075 5.14639 146.619 1.37761 55.6474 0.289037 11.4864 0.0857042 536.34 0.252356 19.391 0.198221 46.7011 0.139855 20.1464 0.668163 240.664 0.284125 20.3799 0.24696 23002153
But every once in a while, a line like the first one appears that has an extra 8-digit number at the end, from analyzing an empty file (it just returns the file ID number, but not on a new line like it should). So I just want to find lines that have this extra 30th number and remove just that 30th entry. I figure I could do this with awk, but since I have little experience with it, I'm not sure how. So if anyone can help, I'd appreciate it.
Thanks
Summary: Want to find lines in a text file with an extra entry in a row and remove the last extra entry so all rows have same number of entries.
With awk, you can tell it how many fields there are per record; the extras are dropped:
awk '{NF = 29; print}' filename
If you want to save that back to the file, you have to do a little extra work
awk '{NF = 29; print}' filename > filename.new && mv filename.new filename
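One caveat: assigning NF=29 on every line forces awk to rebuild each record, which also squeezes runs of whitespace down to single OFS separators. If you want to leave the 29-field lines byte-for-byte untouched, guard the assignment (a sketch; shrinking NF is well-defined in GNU awk but varies in some very old awks):

```shell
# Only truncate records that actually have more than 29 fields.
awk 'NF > 29 { NF = 29 } 1' filename
```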

awk getline skipping to last line -- possible newline character issue

I'm using
while( (getline line < "filename") > 0 )
within my BEGIN statement, but this while loop only seems to read the last line of the file instead of each line. I think it may be a newline character problem, but really I don't know. Any ideas?
I'm trying to read the data in from a file other than the main input file.
The same syntax actually works for one file, but not another, and the only difference I see is that the one for which it DOES work has "^M" at the end of each line when I look at it in Vim, and the one for which it DOESN'T work doesn't have ^M. But this seems like an odd problem to be having on my (UNIX based) Mac.
I wish I understood what was going with getline a lot better than I do.
You would have to set RS to something more lenient.
Here is an ugly hack to get things working:
RS="[\x0d\x0a\x0d]"
Now, this may require some explanation.
Different systems use different ways to mark the end of a line.
Read http://en.wikipedia.org/wiki/Carriage_return and http://en.wikipedia.org/wiki/Newline if you are
interested.
Normally awk handles this gracefully, but it appears that in your environment, some files are being naughty.
0x0d or 0x0a or 0x0d 0x0a (CR+LF) should be there, but not mixed.
So let's try an example of a mixed data stream:
$ echo -e "foo\x0d\x0abar\x0d\x0adoe\x0arar\x0azoe\x0dqwe\x0dtry" |awk 'BEGIN{while((getline r )>0){print "r=["r"]";}}'
Result:
r=[foo]
r=[bar]
r=[doe]
r=[rar]
try]oe
We can see that the last lines are lost.
Now using the ugly hack to RS
$ echo -e "foo\x0d\x0abar\x0d\x0adoe\x0arar\x0azoe\x0dqwe\x0dtry" |awk 'BEGIN{RS="[\x0d\x0a\x0d]";while((getline r )>0){print "r=["r"]";}}'
Result:
r=[foo]
r=[bar]
r=[doe]
r=[rar]
r=[zoe]
r=[qwe]
r=[try]
We can see every line is obtained, regardless of the 0x0d 0x0a junk :-)
Maybe you should preprocess your input file with, for example, the dos2unix (http://sourceforge.net/projects/dos2unix/) utility?
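If dos2unix isn't installed, tr is a portable stand-in for stripping the carriage returns before awk sees the file (a sketch; the filenames are placeholders):

```shell
# Delete every CR (0x0d) byte, leaving plain LF line endings.
tr -d '\r' < filename > filename.unix
```

Note that for old Mac-style files that use a lone CR as the line ending, deleting the CRs would join all lines into one; there, `tr '\r' '\n'` (convert instead of delete) is the right preprocessing step.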