How to match "field 5 through the end of the line" (for example, in awk) - awk

I want to pretty-print the output of a find-like script that would take input like this:
- 2015-10-02 19:45 102 /My Directory/some file.txt
and produce something like this:
- 102 /My Directory/some file.txt
In other words: "f" (for "file"), file size (right-justified), then pathname (with an arbitrary number of spaces).
This would be easy in awk if I could write a script that takes $1, $4, and "everything from $5 through the end of the line".
I tried using the awk construct substr($0, index($0, $8)), which I thought meant "everything starting with field $8 to the end of $0".
Using index() in this way is offered as a solution on linuxquestions.org and was upvoted 29 times in a stackoverflow.com thread.
On closer inspection, however, I found that index() does not achieve this effect if the starting field happens to match an earlier point in the string. For example, given:
-rw-r--r-- 1 tbaker staff 3024 2015-10-01 14:39 calendar
-rw-r--r-- 1 tbaker staff 4062 2015-10-01 14:39 b
-rw-r--r-- 1 tbaker staff 2374 2015-10-01 14:39 now or later
Gawk (and awk) get the following results:
$ gawk '{ print index($0, $8) }' test.txt
49
15
49
In other words, the value of $8 ('b') matches at index 15 instead of 49 (i.e., like most of the other filenames).
My issue, then is how to specify "everything from field X to the end of the string".
I have re-written this question in order to make this clear.

Looks to me like you should just be using the "stat" command rather than "ls", for the reasons already commented upon:
stat -c "f%15s %n" *
But you should double-check how your "stat" operates; it apparently can be shell-specific.

The built-in awk function index() is sometimes recommended as a way
to print "from field 5 through the end of the string" [1, 2, 3].
In awk, index($0, $8) does not mean "the index of the first character of
field 8 in string $0". Rather, it means "the index of the first occurrence in
string $0 of the string value of field 8". In many cases, that first
occurrence will indeed be the first character in field 8 but this is not the
case in the example above.
It has been pointed out that parsing the output of ls is generally a bad
idea [4], in part because implementations of ls significantly differ in output.
Since the author of that note recommends find as a replacement for ls for some uses,
here is a script using find:
find $# -ls |
sed -e 's/^ *//' -e 's/ */ /g' -e 's/ /|/2' -e 's/ /|/2' -e 's/ /|/4' -e 's/ /|/4' -e 's/ /|/6' |
gawk -F'|' '{ $2 = substr($2, 1, 1) ; gsub(/^-/, "f", $2) }
{ printf("%s %15s %s\n", $2, $4, $6) }'
...which yields the required output:
f 4639 /Users/foobar/uu/a
f 3024 /Users/foobar/uu/calendar
f 2374 /Users/foobar/uu/xpect
This approach recursively walks through a file tree. However, there may of course be implementation differences between versions of find as well.
http://www.linuxquestions.org/questions/linux-newbie-8/awk-print-field-to-end-and-character-count-179078/
How to print third column to last column?
Print Field 'N' to End of Line
http://mywiki.wooledge.org/ParsingLs

Maybe some variation of find -printf | awk is what you're looking for?
$ ls -l tmp
total 2
-rw-r--r-- 1 Ed None 7 Oct 2 14:35 bar
-rw-r--r-- 1 Ed None 2 Oct 2 14:35 foo
-rw-r--r-- 1 Ed None 0 May 3 09:55 foo bar
$ find tmp -type f -printf "f %s %p\n" | awk '{sub(/^[^ ]+ +[^ ]/,sprintf("%s %10d",$1,$2))}1'
f 7 tmp/bar
f 2 tmp/foo
f 0 tmp/foo bar
or
$ find tmp -type f -printf "%s %p\n" | awk '{sub(/^[^ ]+/,sprintf("f %10d",$1))}1'
f 7 tmp/bar
f 2 tmp/foo
f 0 tmp/foo bar
It won't work with file names that contain newlines.

Related

How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You might use awk checking that the row number > 3 and then check for an even row number with NR%2==0.
Note that you don't have to use cat
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors
generated either.
You do not need cat whilst using GNU sed as it can read file on its' own, in this case it would be sed 's/\|/ /' file.txt.
You should consider if you need that part at all, your sample input does not have pipe character at all, so it would do nothing to it. You might also drop that part if lines holding values you want to print do not have that character.
Output is empty as NR%2==4 does never hold, remainder of division by x is always smaller than x (in particular case of %2 only 2 values are possible: 0 and 1)
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce back slashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. The \2 represents the last inhabitant of that back reference which in conjunction with the {3} means the above.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file

Count number of elements that match one file with another using AWK

First of all, thank you for your help. I have the file letter.txt:
A
B
C
And I have the file number.txt
B 10
D 20
A 15
C 18
E 23
A 12
B 14
I want to count how many times does each letter in letter.txt appears in number.txt so the output will be:
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
I know I can do it using this code, but I want to do it generally with any file.
cat number.txt | awk 'BEGIN {A=0;B=0;C=0;count=0}; {count++};{if ($1 == "A")A++};{if ($1 == "B")B++};{if ($1 == "C")C++}END{print "We have found" A "A\n" "We have found" B "B\n" "We have found" C "C"}
You basically want to do an inner join (easy enough to google) and group by the join key and return the count for each group.
awk 'NR==FNR { count[$1] = 0; next }
$1 in count { ++count[$1]; ++total}
END { for(k in count)
print "We have found", count[k], k
print "Total", total, "letters"}' letters.txt numbers.txt
All of this should be easy to find in a basic Awk tutorial, but in brief, the line number within the file FNR is equal to the overall line number NR when you are reading the first input file. We initialize count to contain the keys we want to look for. If we fall through, we are reading the second file; if we see a key we want, we increase its count. When we are done, report what we found.
Consider starting with:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c
2 A
2 B
1 C
Then:
$ join letter.txt <(cut -d' ' -f1 number.txt | sort) | uniq -c |
awk '
{ print "We have found", $1, $2; tot+=$1 }
END { print "Total letter found:", tot+0 }
'
We have found 2 A
We have found 2 B
We have found 1 C
Total letter found: 5
although in reality I'd probably just do it all in awk, just wanted to show an alternative.
Don't know if you need awk
to me easier (but slower execution as you read in comments) to use grep -c
cat file1 | while read line; do
c=`grep -c $line file2 | sed 's/ //g'`;
echo We have found $c $line;
done
it's a cycle, where
$c is the count taken with grep -c, and sed remove spaces in grep -c output
grep and coreutils can also do this:
grep -f letter.txt number.txt | cut -d' ' -f1 | sort | uniq -c
Output:
2 A
2 B
1 C

Delete "0" or "1" from the end of each line, except the first line

the input file looks like
Kick-off team 68 0
Ball safe 69 1
Attack 77 8
Attack 81 4
Throw-in 83 0
Ball possession 86 3
Goal kick 100 10
Ball possession 101 1
Ball safe 114 13
Throw-in 123 9
Ball safe 134 11
Ball safe 135 1
Ball safe 137 2
and at the end it should look like this:
Kick-off team 68 0
Attack 77 8
Attack 81 4
Ball possession 86 3
Goal kick 100 10
Ball safe 114 13
Throw-in 123 9
Ball safe 134 11
Ball safe 137 2
my solution is
awk '{print $NF}' test.txt | sed -re '2,${/(^0$|^1$)/d}'
how can i directly change the file, e.g. sed -i?
sed -i '2,$ {/[^0-9][01]$/d}' test.txt
2,$ lines to act upon, this one says 2nd line to end of file
{/[^0-9][01]$/d} from filtered lines, delete those ending with 0 or 1
'2,$ {/ [01]$/d}' can be also used if character before last column is always a space
With awk which is better suited for column manipulations:
awk 'NR==1 || ($NF!=1 && $NF!=0)' test.txt > tmp && mv tmp test.txt
NR==1 first line
($NF!=1 && $NF!=0) last column shouldn't be 0 or 1
can also use $NF>1 if last column only have non-negative numbers
> tmp && mv tmp test.txt save output to temporary file and then move it back as original file
With GNU awk, there is inplace option awk -i inplace 'NR==1 || ($NF!=1 && $NF!=0)' test.txt
Here's my take on this.
sed -i.bak -e '1p;/[^0-9][01]$/d' file.txt
The sed script prints the first line, then deletes all subsequent lines that match the pattern you described. This assumes that your first line would be a candidate for deletion; if it contains something other than 0 or 1 in the last field, this script will print it twice. And the -i option is what tells sed to edit in-place (with a backup file).
Awk doesn't have an equivalent option for editing files in-place -- if you want that kind of functionality, you need to implement it in a shell wrapper around your awk script, as #sundeep suggested.
Note that I'm not using GNU sed, but this command should work equally well with it.
awk to the rescue!
$ awk 'NR==1 || $NF && $NF!=1' file
or more cryptic
$ awk 'NR==1 || $NF*($NF-1)' file
This might work for you (GNU sed):
sed -i '1b;/\s[01]$/d' file
Other than the first line, delete any line ending in 0 or 1.

Print every second consequtive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do, is print in two columns, the fields after line 6. This can be done using NR. The tricky part is the following : Every second field, should go in one column as well as adding an E before the sign, so that the output file will look like this
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output file you see that I want to keep in $6 only length($6)=10 characters.
How is it possible to do it in awk?
can do all in awk but perhaps easier with the unix toolset
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is a awk only solution or comparison
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps shorter version,
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
to convert format to standard scientific notation, you can pipe the result to
sed or embed something similar in awk script (using gsub).
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there's various ways to do it (simplest being gsub(/ /,"",$<field number>) on the relevant fields) but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine #karafka 's answer using substr, so the following does the trick!
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

awk associative array grows fast

I have a file that assigns numbers to md5sums like follows:
0 0000001732816557DE23435780915F75
1 00000035552C6F8B9E7D70F1E4E8D500
2 00000051D63FACEF571C09D98659DC55
3 0000006D7695939200D57D3FBC30D46C
4 0000006E501F5CBD4DB56CA48634A935
5 00000090B9750D99297911A0496B5134
6 000000B5AEA2C9EA7CC155F6EBCEF97F
7 00000100AD8A7F039E8F48425D9CB389
8 0000011ADE49679AEC057E07A53208C1
Another file containts three md5sums in each line like follows:
00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
What I want to to is replace the first and third md5sums in the second file with the integers of the first file. Currently i am trying the following awk script:
awk '{OFS="\t"}FNR==NR{map[$2]=$1;next}
{print map[$1],$2,map[$3]}' mapping.txt relation.txt
The problem is that the script needs more that 16g ram despite the fact that the first file is only 5.7g on the hard drive.
If you don't have enough memory to store the first file, then you need to write something like this to look up the 1st file for each value in the 2nd file:
awk 'BEGIN{OFS="\t"}
{
val1 = val3 = ""
while ( (getline line < "mapping.txt") > 0 ) {
split(line,flds)
if (flds[2] == $1) {
val1 = flds[1]
}
if (flds[2] == $3) {
val3 = flds[1]
}
if ( (val1 != "") && (val3 != "") ) {
break
}
}
close("mapping.txt")
print val1,$2,val3
}' relation.txt
It will be slow. You could add a cache of N getline-d lines to speed it up if you like.
This problem could be solved, as follows (file1.txt is the file with the integers and md5sums while file2.txt is the file with the three columns of md5sums):
#!/bin/sh
# First sort each of file 1 and the first and third columns of file 2 by MD5
awk '{ print $2 "\t" $1}' file1.txt | sort >file1_n.txt
# Before we sort the file 2 columns, we number the rows so we can put them
# back into the original order later
cut -f1 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_1n.txt
cut -f3 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_3n.txt
# Now do a join between them, extract the two columns we want, and put them back in order
join -t' ' file2_1n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_1.txt
join -t' ' file2_3n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_3.txt
cut -f2 file2.txt | paste file2_1.txt - file2_3.txt >file2_new1.txt
For a case where file1.txt and file2.txt are each 1 million lines long, this solution and Ed Morton's awk-only solution take about the same length of time on my system. My system would take a very long time to solve the problem for 140 million lines, regardless of the approach used but I ran a test case for files with 10 million lines.
I had assumed that a solution that relied on sort (which automatically uses temporary files when required) should be faster for large numbers of lines because it would be O(N log N) runtime, whereas a solution that re-reads the mapping file for each line of the input would be O(N^2) if the two files are of similar size.
Timing results
My assumption with respect to the performance relationship of the two candidate solutions turned out to be faulty for the test cases that I've tried. On my system, the sort-based solution and the awk-only solution took similar (within 30%) amounts of time to each other for each of 1 million and 10 million line input files, with the awk-only solution being faster in each case. I don't know if that relationship will hold true when the input file size goes up by another factor of more than 10, of course.
Strangely, the 10 million line problem took about 10 times as long to run with both solutions as the 1 million line problem, which puzzles me as I would have expected a non-linear relationship with file length for both solutions.
If the size of a file causes awk to run out of memory, then either use another tool, or another approach entirely.
The sed command might succeed with much less memory usage. The idea is to read the index file and create a sed script which performs the remapping, and then invoke sed on the generated sedscript.
The bash script below is a implementation of this idea. It includes some STDERR output to help track progress. I like to produce progress-tracking output for problems with large data sets or other kinds of time-consuming processing.
This script has been tested on a small set of data; it may work on your data. Please give it a try.
#!/bin/bash
# md5-indexes.txt
# 0 0000001732816557DE23435780915F75
# 1 00000035552C6F8B9E7D70F1E4E8D500
# 2 00000051D63FACEF571C09D98659DC55
# 3 0000006D7695939200D57D3FBC30D46C
# 4 0000006E501F5CBD4DB56CA48634A935
# 5 00000090B9750D99297911A0496B5134
# 6 000000B5AEA2C9EA7CC155F6EBCEF97F
# 7 00000100AD8A7F039E8F48425D9CB389
# 8 0000011ADE49679AEC057E07A53208C1
# md5-data.txt
# 00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
# 00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
# Goal replace field 1 and field 3 with indexes to md5 checksums from md5-indexes
md5_indexes='md5-indexes.txt'
md5_data='md5-data.txt'
talk() { echo 1>&2 "$*" ; }
talkf() { printf 1>&2 "$#" ; }
track() {
local var="$1" interval="$2"
local val
eval "val=\$$var"
if (( interval == 0 || val % interval == 0 )); then
shift 2
talkf "$#"
fi
eval "(( $var++ ))" # increment the counter
}
# Build a sedscript to translate all occurances of the 1st & 3rd MD5 sums into their
# corresponding indexes
talk "Building the sedscript from the md5 indexes.."
sedscript=/tmp/$$.sed
linenum=0
lines=`wc -l <$md5_indexes`
interval=$(( lines / 100 ))
while read index md5sum ; do
track linenum $interval "..$linenum"
echo "s/^[[:space:]]*[[:<:]]$md5sum[[:>:]]/$index/" >>$sedscript
echo "s/[[:<:]]$md5sum[[:>:]]\$/$index/" >>$sedscript
done <$md5_indexes
talk ''
sedlength=`wc -l <$sedscript`
talkf "The sedscript is %d lines\n" $sedlength
cmd="sed -E -f $sedscript -i .bak $md5_data"
talk "Invoking: $cmd"
$cmd
changes=`diff -U 0 $md5_data.bak $md5_data | tail +3 | grep -c '^+'`
talkf "%d lines changed in $md5_data\n" $changes
exit
Here are the two files:
cat md5-indexes.txt
0 0000001732816557DE23435780915F75
1 00000035552C6F8B9E7D70F1E4E8D500
2 00000051D63FACEF571C09D98659DC55
3 0000006D7695939200D57D3FBC30D46C
4 0000006E501F5CBD4DB56CA48634A935
5 00000090B9750D99297911A0496B5134
6 000000B5AEA2C9EA7CC155F6EBCEF97F
7 00000100AD8A7F039E8F48425D9CB389
8 0000011ADE49679AEC057E07A53208C1
cat md5-data.txt
00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
Here is the sample run:
$ ./md5-reindex.sh
Building the sedscript from the md5 indexes..
..0..1..2..3..4..5..6..7..8
The sedscript is 18 lines
Invoking: sed -E -f /tmp/83800.sed -i .bak md5-data.txt
2 lines changed in md5-data.txt
Finally, the resulting file:
$ cat md5-data.txt
1 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
1 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857