What operations is the given code performing? [closed] - awk
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I have a shell script with the code below:
awk 'NR==FNR{a[$0]; next} !($0 in a){print "Fail: "$0 " is not found"}' <(cat file3 <(grep -r names file2)) <(grep -r present file1)
Can someone explain what the awk program in the above code is doing?
This is the kind of question where you can take it apart piece by piece:
do grep -r present file1 on its own and see what it outputs
(although if "file1" is truly a file and not a directory, the -r option is useless)
<(...) is a Process Substitution -- it takes the output of the script and lets you handle that as a file
Similarly, <(cat file3 <(grep -r names file2)) concatenates the contents of "file3" and the output of the grep command.
now, the awk script
awk 'NR==FNR {do something; next} some more awk code' fileA fileB is a very common awk idiom
NR == FNR means "the current record number (out of all files processed so far) is equal to the record number within the current file being processed" -- ordinarily this is only true while the first file in the list is being processed (the classic caveat: if the first file is empty, the condition also holds for the next non-empty file).
so, do something only for the first file, because next won't allow the "some more awk code" to be reached.
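To see the idiom in action, here is a minimal sketch that mirrors the structure of the command in the question; the file names fileA/fileB and their contents are made up for the demo:

```shell
# Toy stand-ins for the process substitutions in the question.
printf 'alpha\nbeta\ngamma\n' > fileA
printf 'beta\ndelta\n' > fileB

# Pass 1 (fileA): remember every line as a key of array a, then skip.
# Pass 2 (fileB): report lines that were never seen in fileA.
result=$(awk 'NR==FNR{a[$0]; next} !($0 in a){print "Fail: "$0" is not found"}' fileA fileB)

echo "$result"
rm -f fileA fileB
```

Since "beta" occurs in fileA it is silently accepted, while "delta" triggers the Fail message.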
Without showing us the contents of the files, there's not much more to say. If you were to show the inputs and the output, we could help you understand exactly why you see the results you see.
Related
how to remove part of the string if the condition exists [closed]
Closed 2 years ago.

I have a file similar to this:

A*01:03:05
B1*02:06:08
F2*03:01:06
R5*02:01
S1*02:08

and would like to remove the last 2 numbers and the colon, but only when there are 2 colon separators, so it will be:

A*01:03
B1*02:06
F2*03:01
R5*02:01
S1*02:08

The last 2 lines remain unchanged because they do not have 2 colon separators after the *, so no changes are made to those values. I used sed and gsub to remove everything after the last underscore, but was not sure how to add a condition that exempts values which do not have 2 colons after the *.
This might work for you (GNU sed):

sed 's/:..//2' file

This removes the second occurrence of a : followed by 2 characters. If this is too lax, use:

sed -E 's/^([^:]*:[^:]*):[0-9]{2}/\1/' file
With cut, you can set : as the delimiter and print only up to the first two fields:

cut -d: -f-2 ip.txt

Similar logic can be done with awk, assuming the implementation supports manipulating NF:

awk 'BEGIN{FS=OFS=":"} NF==3{NF=2} 1' ip.txt
This works:

$ sed -E 's/^([^:]*:[^:]*):[0-9][0-9]$/\1/' file

The [^:] means "any character other than a :", so it makes the deletion at the end only if there are two leading colons.

This awk works too:

$ awk 'gsub(/:/,":")==2 {sub(/:[0-9][0-9]$/,"")} 1' file

In this case, gsub returns the number of replacements made. So if there are two colons, delete the ending.

You can also use GNU grep (with PCRE) to only match the template of what you are looking for:

$ grep -oP '^\w+\*\d\d:\d\d' file

Or perl the same way:

$ perl -lnE 'say "$1" if /(^\w+\*\d\d:\d\d)/' file
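As a sanity check, here is a small sketch recreating the question's sample as ip.txt (the file name used in the cut answer) and showing that the sed, cut and awk variants agree; the NF-assignment form assumes an awk that rebuilds the record when NF is assigned (e.g. GNU awk):

```shell
# Recreate the question's sample input.
printf 'A*01:03:05\nB1*02:06:08\nF2*03:01:06\nR5*02:01\nS1*02:08\n' > ip.txt

out_sed=$(sed 's/:..//2' ip.txt)                          # drop the 2nd ":NN" occurrence
out_cut=$(cut -d: -f-2 ip.txt)                            # keep the first two ':'-fields
out_awk=$(awk 'BEGIN{FS=OFS=":"} NF==3{NF=2} 1' ip.txt)   # rebuild record with 2 fields

echo "$out_sed"
rm -f ip.txt
```

Lines with only one colon (R5*02:01, S1*02:08) pass through all three untouched.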
Search for a string which is stretched over 2 lines [closed]
Closed 2 years ago.

I am searching for the string ABCDEFGH in a very large file like the one below, and I don't know at which position the line breaks fall. My first thought was to remove all \n, but the file is over 3 GB... I think there is a smarter way to do this (sed, awk, ...?):

efhashivia
hjasjdjasd
oqkfoflABC
DEFGHqpclq
pasdpapsda
Assuming that your search string cannot span more than 2 lines, you can use this awk:

awk -v p="ABCDEFGH" 's $0 ~ p {print NR,s $0} {s=$0}' file

Or you can paste each line with its next one, and grep the result. This way you have to create a file with double the size of your large input:

tail -n +2 file | paste -d '' file - > output.txt

> cat output.txt
efhashiviahjasjdjasd
hjasjdjasdoqkfoflABC
oqkfoflABCDEFGHqpclq
DEFGHqpclqpasdpapsda
pasdpapsda

> grep -n ABCDEFGH output.txt
3:oqkfoflABCDEFGHqpclq
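A minimal sketch of the awk approach, recreating the question's sample file (the name big.txt is made up); each line is matched together with its predecessor, so a pattern straddling one line break is still found:

```shell
# Recreate the sample file: ABCDEFGH is split across lines 3 and 4.
printf 'efhashivia\nhjasjdjasd\noqkfoflABC\nDEFGHqpclq\npasdpapsda\n' > big.txt

# s holds the previous line; s $0 concatenates it with the current line,
# and the concatenation is matched against the pattern p.
found=$(awk -v p="ABCDEFGH" 's $0 ~ p {print NR": "s $0} {s=$0}' big.txt)

echo "$found"
rm -f big.txt
```

The reported line number is the second of the two lines that together contain the match.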
Combine 2 Conditions of AWK in 1 set of forward slashes [closed]
Closed 3 years ago.

I have a file where I need only the 18th column, and that 18th column must not contain any of 30 words like AAA, BBB, CCC, etc.

Sample file:

$ cat a.csv
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,Aaa
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,BBB

awk -F, '!($18 ~ /AAA/) && !($18 ~ /BBB/) {print $18}'

Is it possible to write something like:

awk -F, '!($18 ~ /AAA, BBB/) {print $18}'

EDIT: If I use

i=$("AAA|BBB")
awk -F, '!($18 ~ /$i/) {print $18}'

it produces the error "command not found".
You could use the alternation operator | and write something like:

awk -F',' '$18 !~ /AAA|BBB|CCC/{print $18}' a.csv
If you want to simply strip lines where a field is one of a set of blacklist entries, you can create the blacklist once in the BEGIN section, then simply use ~ to see if that blacklist contains your field. Perhaps the easiest way to do this is to construct the blacklist using the input field separator (so you know it won't be part of the field).

With an input.csv file of:

1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,AAA
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,BBB
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,CCC
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,DDD
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,EEE

Let's say you don't want lines where field 18 is AAA, BBB or DDD:

pax> awk -F, 'BEGIN{ss=",AAA,BBB,DDD,"}ss!~","$18","{print $18}' input.csv
CCC
EEE

Below, we'll break down how it works:

BEGIN {
    ss = ",AAA,BBB,DDD,"   # This is the blacklist; note the FS separator and the start/end delimiters.
}
ss !~ ","$18"," {          # If ",<column 18>," is not in the blacklist, print.
    print $18
}

The trick is to create a string which is the column we're checking surrounded by the delimiters (which cannot be in the column). If we find that in the blacklist (which is every unwanted item surrounded by the delimiter), we can discard it.

Note that you're not restricted to a fixed blacklist (either in your string or if you decide to use a regex solution); you can, if you wish, read the entries from a file and dynamically construct the list. For example, consider the file blacklist.txt:

AAA
BBB
DDD

and the input.csv file as shown above. The following awk command can dynamically create the blacklist from that file thusly:

pax> awk -F, 'BEGIN{ss=","}NR==FNR{ss=ss""$1",";next}ss!~","$18","{print}' blacklist.txt input.csv
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,CCC
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,EEE

Again, breaking it down:

BEGIN {
    ss = ","          # Start the blacklist.
}
NR==FNR {             # Only true for the first file in the list (the blacklist).
    ss = ss""$1","    # Extend the blacklist.
    next              # Go get the next line.
}
ss !~ ","$18"," {     # Only reached for the second file (the input).
    print
}

Here, we process the first file to construct the blacklist (rather than having a fixed one). Lines in the second file are treated as per my original script above.
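For completeness, here is a self-contained sketch of the dynamic-blacklist variant, recreating blacklist.txt and input.csv exactly as in the answer:

```shell
# Recreate the answer's blacklist and 18-column sample input.
printf 'AAA\nBBB\nDDD\n' > blacklist.txt
for x in AAA BBB CCC DDD EEE; do
    echo "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,$x"
done > input.csv

# Pass 1 (blacklist.txt) appends each word plus a comma to ss; pass 2
# (input.csv) prints rows whose delimiter-wrapped field 18 is absent from ss.
kept=$(awk -F, 'BEGIN{ss=","}NR==FNR{ss=ss""$1",";next}ss!~","$18","{print}' blacklist.txt input.csv)

echo "$kept"
rm -f blacklist.txt input.csv
```

Only the CCC and EEE rows survive, since AAA, BBB and DDD are blacklisted.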
How to remove dupe lines from 1 file and check each line against all files in same folder for other dupes? [closed]
Closed 6 years ago.

Got a hard one. I have 1/2 TB of text files in a folder. I want to keep the text file names and not merge everything into one file. How can I go through a text file and compare each line to all the rest of the other files, removing all the word dups for the entire directory, and so on until all files are done? Some of the files are large (38 GB).

E.g. textfile1.txt has the dupe word "power"; textfile2.txt also has this word "power" and it needs to be removed there, etc., until all the files in that same dir are finished. Either on Linux or Windows.

Edit: all words are newline separated.
awk -i inplace '!seen[$0]++' *

The above uses GNU awk 4.* for "inplace" editing. You'll need enough memory to make a copy of your largest file and to keep a list of all unique words in memory. The above also assumes your "words" are newline separated, since you didn't tell us anything otherwise.

If you don't have enough memory to copy your largest file, you could try something like:

for file in *
do
    while [ -s "$file" ]; do
        # copy the first 100 lines from "$file" into tmp
        head -n 100 "$file" > tmp

        # inplace remove the first 100 lines from "$file"
        count=$(head -100 "$file" | wc -c)
        dd if="$file" bs="$count" skip=1 of="$file"
        truncate -s "-$count" "$file"

        # somehow get a subset of words to check in tmp
        awk 'magic happens' tmp >> "${file}.new" && rm -f tmp
    done
done

but you'll have to figure out how to come up with groups of words to check at a time (e.g. see below). This will be slow; tread carefully and make a backup of your files first!

If you CAN make a copy of each file but can't fit all of the "words" in memory at one time, then you could do something like:

for a in {a..z}
do
    awk -v start="^$a" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
done

to look for groups of words based on some characteristic; e.g. the above looks for all words that start with a, then with b, etc.

If those batches are too big, add an inner loop (the outer pass, now anchored as "^$a$", handles only the single-letter word itself, while the inner pass handles words starting with that letter plus one more):

for a in {a..z}
do
    awk -v start="^$a$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
    for b in {a..z}
    do
        awk -v start="^$a$b" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
    done
done

or more levels (note the expanding regexp pattern):

for a in {a..z}
do
    awk -v start="^$a$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
    for b in {a..z}
    do
        awk -v start="^$a$b$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
        for c in {a..z}
        do
            awk -v start="^$a$b$c" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
        done
    done
done

The more nested loops, the fewer words it'll process at a time and the slower it'll execute.
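Before running anything inplace on 1/2 TB of data, the core filter can be sketched safely on throwaway files (the names below are hypothetical), with output going to stdout instead of back into the files:

```shell
# Hypothetical files matching the question's example: "power" appears in both.
printf 'power\nalpha\n' > textfile1.txt
printf 'power\nbeta\n'  > textfile2.txt

# Same filter as the inplace command, but nothing is modified: the first
# occurrence of each word survives, every later duplicate is dropped.
deduped=$(awk '!seen[$0]++' textfile1.txt textfile2.txt)

echo "$deduped"
rm -f textfile1.txt textfile2.txt
```

"power" from textfile2.txt is suppressed because it was already seen in textfile1.txt; this is the behavior the -i inplace version writes back into each file.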
Add header to line using awk [closed]
Closed 8 years ago.

I have a file with the following format:

AACCCGTAGATCCGAACTTGTG
ACCCGTAGATCCGAACTTGTG
CCGTAGATCCGAACTTGTG
CGTAGATCCGAACTTGT

I want to give a header to each line, using awk, where the header is equal to the line that follows, like this:

>AACCCGTAGATCCGAACTTGTG
AACCCGTAGATCCGAACTTGTG
>ACCCGTAGATCCGAACTTGTG
ACCCGTAGATCCGAACTTGTG
>CCGTAGATCCGAACTTGTG
CCGTAGATCCGAACTTGTG
>CGTAGATCCGAACTTGT
CGTAGATCCGAACTTGT
Simply:

$ awk '{print ">"$0;print}' file
>AACCCGTAGATCCGAACTTGTG
AACCCGTAGATCCGAACTTGTG
>ACCCGTAGATCCGAACTTGTG
ACCCGTAGATCCGAACTTGTG
>CCGTAGATCCGAACTTGTG
CCGTAGATCCGAACTTGTG
>CGTAGATCCGAACTTGT
CGTAGATCCGAACTTGT

Or:

$ awk '{printf ">%s\n%s\n",$0,$0}' file
>AACCCGTAGATCCGAACTTGTG
AACCCGTAGATCCGAACTTGTG
>ACCCGTAGATCCGAACTTGTG
ACCCGTAGATCCGAACTTGTG
>CCGTAGATCCGAACTTGTG
CCGTAGATCCGAACTTGTG
>CGTAGATCCGAACTTGT
CGTAGATCCGAACTTGT
The -v flag allows you to set a variable. Then, for each line in the file, print that variable followed by the line, and then the line itself:

awk -v c=">" '{ print c $0; print $0; }' <file>
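All three one-liners in this thread are equivalent; here is a small sketch (using a hypothetical shortened input file seqs.txt) showing them producing identical output:

```shell
# Hypothetical short input, standing in for the question's sequences.
printf 'AACCC\nACCC\n' > seqs.txt

# Capture each answer's output for comparison.
out1=$(awk '{print ">"$0; print}' seqs.txt)
out2=$(awk '{printf ">%s\n%s\n",$0,$0}' seqs.txt)
out3=$(awk -v c=">" '{ print c $0; print $0; }' seqs.txt)

echo "$out1"
rm -f seqs.txt
```

Each line comes out twice: once prefixed with > as the header, once verbatim.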