Different result when counting the lines of a file using wc -l and cat -n

I heard that wc -l could count the number of lines in a file. However, when I use it to count the lines of a file that was generated by Python, it gives a different result, off by one line.
Here is the MWE.
#!/usr/bin/env python
import random

def getRandomLines(in_str, num):
    res = list()
    lstr = len(in_str)
    for i in range(num):
        res.append(''.join(random.sample(in_str, lstr)))
    return res

def writeRandomLines(rd_lines, fname):
    lines = '\n'.join(rd_lines)
    with open(fname, 'w') as fout:
        fout.write(lines)

if __name__ == '__main__':
    writeRandomLines(getRandomLines("foobarbazqux", 20), "example.txt")
This gives a file, example.txt, that contains 20 lines of random strings. The expectation, then, is that example.txt has 20 lines. However, when one applies wc -l to it, it reports 19:
$ wc -l example.txt
19 example.txt
When one uses cat -n to show the content of the file with line numbers, one can see
$ cat -n example.txt
1 oaxruzaqobfb
2 ozbarboaufqx
3 fbzarbuoxoaq
4 obqfarbozaxu
5 xoqbrauboazf
6 ufqooxrababz
7 rqoxafuzboab
8 bfuaqoxaorbz
9 baxroazfouqb
10 rqzafoobxaub
11 xqaoabbufzor
12 aobxbaoruzfq
13 buozaqbrafxo
14 aobzoubfarxq
15 aquofrboazbx
16 uaoqrfobbaxz
17 bxqubarfoazo
18 aaxruzofbboq
19 xuaoarzoqfbb
20 bqouzxraobfa
Why does wc -l miscount by one line, and what can I do to fix this?
Any clues or hints will be appreciated.

In your python code, you have:
lines = '\n'.join(rd_lines)
So what you are really writing is:
word1\nword2\n...wordX-1\nwordX
Unfortunately, in man wc:
-l, --lines
print the newline counts
hence your difference.

Apparently wc -l needs to see a \n at the end of a line to count it as one. Your current format leaves the last line without a trailing \n, so it is not counted by wc -l. Add the newline and it should be fixed.

wc -l only counts the number of newline characters.
Since you are joining lines with a '\n' character, joining 20 lines uses only 19 '\n' characters. Hence the result of 19.
If you need the correct count, terminate each line with '\n'.
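A minimal sketch of that fix: write each line with its own trailing '\n' instead of joining. (The helper name and the three sample lines below are illustrative, not from the question.)

```python
import os
import tempfile

def write_lines(lines, fname):
    # Terminate every line, including the last, with '\n';
    # wc -l counts newline characters, so every line is then counted.
    with open(fname, 'w') as fout:
        for line in lines:
            fout.write(line + '\n')

path = os.path.join(tempfile.mkdtemp(), 'example.txt')
write_lines(['foo', 'bar', 'baz'], path)
with open(path) as f:
    print(f.read().count('\n'))  # 3, matching what wc -l would report
```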


How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this does not result in anything being printed, and no errors are generated either.
You can use awk, checking that the row number is > 3 and then checking for an even row number with NR%2==0.
Note that you don't have to use cat:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
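To check this against the sample (a here-doc reproduction of file.txt, with the trailing `.` continuation lines omitted):

```shell
cat > file.txt <<'EOF'
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
EOF
awk 'NR > 3 && NR%2==0 {print $3}' file.txt
```

This prints 30, 20 and 25, one per line.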
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this does not result in anything being printed, and no errors are generated either.
You do not need cat when using GNU sed, as it can read the file on its own; in this case it would be sed 's/\|/ /' file.txt.
You should consider whether you need that part at all: your sample input does not contain a pipe character, so the substitution would do nothing to it. You might also drop that part if the lines holding the values you want to print do not contain that character.
The output is empty because NR%2==4 never holds: the remainder of a division by x is always smaller than x (in the particular case of %2, only two values are possible, 0 and 1).
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option, and reduce the backslashes in regexps by turning on -E.
From the fourth line, and then every second line thereafter, capture the third column and print it.
N.B. The \2 represents the last iteration of that back reference, which in conjunction with the {3} captures the third column.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file

Compare and Replace inline using Awk/Sed command

I have a file (fixed length) in which I search for 2 consecutive lines starting with the number 30, compare the value at positions 183-187, and if both match, print the line number. I am able to achieve the desired result up to this stage. But I would then like to replace the value on that line with empty spaces without tampering with the fixed length.
awk '/^30*/{a=substr($0,183,5);getline;b=substr($0,183,5); if(a==b) print NR}' file
Explanation of the command above:
the line starts with 30*
assign to a the value at positions 183 to 187
get the next line
assign to b the value at positions 183 to 187
compare a & b; if they match, the value at positions 183 to 187 is the same in 2 consecutive lines which start with 30
print the line number (this is the line number of the 2nd matching line)
The command above works as expected and prints the line number.
Example record (just for explanation purposes, hence not using the fixed length)
10 ABC
20 XXX
30 XYZ
30 XYZ
30 XYZ
30 XYZ
40 YYY
10 ABC
20 XXX
30 XYZ
30 XYZ
40 YYY
With the above command I am able to get line numbers 3 and 4, but I am unable to replace the value on the 4th line with empty spaces (in place), so that the fixed width is not compromised.
Expected Output
10 ABC
20 XXX
30 XYZ
30
30
30
40 YYY
10 ABC
20 XXX
30 XYZ
30
40 YYY
The length of all the above lines should be 255 characters; when the replacement happens it has to be in place, without changing the line length.
Any help will be highly appreciated. Thanks.
I would use GNU AWK and treat every character as a field. Consider the following example; let file.txt content be
10 ABC
20 XXX
30 XYZ
30 XYZ
40 YYY
then
awk 'BEGIN{FPAT=".";OFS=""}prev~/^30*/{a=substr(prev,4,3);b=substr($0,4,3);if(a==b){$4=$5=$6=" "}}{print}{prev=$0}' file.txt
output
10 ABC
20 XXX
30 XYZ
30
40 YYY
Explanation: I elected to store the whole line in a variable called prev rather than using getline, thus I do {prev=$0} as the last action. I set FPAT to . indicating that any single character should be treated as a field, and OFS (the output field separator) to the empty string so no unwanted characters are added when the line is rebuilt. If prev (the previous line, or the empty string for the first line) starts with 3, I take the substring with characters 4, 5, 6 from the previous line (prev) and store it in variable a, and the substring with characters 4, 5, 6 from the current line ($0) and store it in variable b; if a and b are equal I change the 4th, 5th and 6th characters to a space each. Whether it was changed or not, I print the line. Disclaimer: this assumes you want to deal with at most 2 subsequent lines having an equal substring. Note that /^30*/ does not check whether the string starts with 30 but rather whether it starts with 3, e.g. it will match 312; you should probably use /^30/ instead. I elected to use your pattern unchanged as you imply it works as intended for your data.
(tested in gawk 4.2.1)
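For comparison, the same fixed-width blanking can be sketched in Python (assuming, as in the sample, that the key occupies character positions 4-6; substitute 183-187 for the real data). Like the awk version, it handles at most 2 subsequent equal lines:

```python
def blank_duplicates(lines, start, end):
    # Blank the key field (0-based slice [start:end)) on any line whose
    # previous line starts with '30' and carries the same key, replacing
    # it with spaces so the overall line width is preserved.
    out = []
    prev = ''
    for line in lines:
        if prev.startswith('30') and line[start:end] == prev[start:end]:
            line = line[:start] + ' ' * (end - start) + line[end:]
        out.append(line)
        prev = line  # compare against the possibly blanked line, as the awk does
    return out

data = ['10 ABC', '20 XXX', '30 XYZ', '30 XYZ', '40 YYY']
print(blank_duplicates(data, 3, 6))
```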
This might work for you (GNU sed):
sed -E '/^30/{N;s/^(.{182}(.{5}).*\n.{182})\2/\1     /}' file
Match on a line beginning 30 and append the following line.
Using pattern matching, if the 5 characters from 183-187 for both lines are the same, replace the second group of 5 characters with 5 spaces.
For multiple adjacent lines use:
sed -E '/^30/{:a;N;s/^(.{182}(.{5}).*\n.{182})\2/\1     /;ta}' file
Or alternative:
sed -E ':a;$!N;s/^(30.{180}(\S{5}).*\n.{182})\2/\1     /;ta;P;D' file
It sounds like this is what you want, using any awk in any shell on every Unix box:
$ awk -v tgt=30 -v beg=11 -v end=13 '
($1==tgt) && seen[$1]++ { $0=substr($0,1,beg-1) sprintf("%*s",end-beg+1,"") substr($0,end+1) }
1' file
10 ABC
20 XXX
30 XYZ
30
40 YYY
Just change -v beg=11 -v end=13 to -v beg=183 -v end=187 for your real data.
If you're ever again tempted to use getline, make sure to read awk.freeshell.org/AllAboutGetline first, as it's usually the wrong approach.

Extract all numbers from string in list

Given some string s, I would like to extract only the numbers from that string. I would like the output numbers to each be separated by a single space.
Example input -> output
IN: 1,2,3
OUT: 1 2 3
IN: 1 2 a b c 3
OUT: 1 2 3
IN: ab#35jh71 1,2,3 kj$d3kjl23
OUT: 35 71 1 2 3 3 23
I have tried combinations of grep -o [0-9] and grep -v [a-z] -v [A-Z], but the issue is that other characters like - and # can appear between the numbers. Regardless of the number of non-numeric characters between the numbers, I need them to be replaced with a single space.
I have also been experimenting with awk and sed but have had little luck.
Not sure about the spaces in your expected output; based on your shown samples, could you please try the following.
awk '{gsub(/[^0-9]+/," ")} 1' Input_file
Explanation: globally substitute anything that is not a digit with a space. The 1 prints the current line.
In case you want to remove initial/starting space and ending space in output then try following.
awk '{gsub(/[^0-9]+/," ");gsub(/^ +| +$/,"")} 1' Input_file
Explanation: globally substitute everything apart from digits with a space in the current line, then globally substitute leading and trailing spaces with nothing. The 1 prints the edited/unedited current line.
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | grep -o '[[:digit:]]*'
35
71
1
2
3
3
23
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | tr -sc '[:digit:]' ' '
35 71 1 2 3 3 23
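The same extraction in Python, for reference, is a re.findall one-liner (sketch):

```python
import re

def extract_numbers(s):
    # Pull out every maximal run of digits, then join with single spaces.
    return ' '.join(re.findall(r'[0-9]+', s))

print(extract_numbers('ab#35jh71 1,2,3 kj$d3kjl23'))  # 35 71 1 2 3 3 23
print(extract_numbers('1 2 a b c 3'))                 # 1 2 3
```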

Get the line number of the first line matching second pattern

Is it possible using awk or sed to get the line number of a line such that it is the first line matching a regex after another line matching another regex?
In other words:
Find line l1 matching regex r1. l1 is the first line matching r1.
Find line l2 below l1. l2 matches regex r2. l2 is the first line matching r2, ignoring lines l1 and above.
Clarification: By match I mean partial match, for most general solution.
A partial match can of course be turned into a full-word match with \<...\> or a full-line match with ^...$.
Example input:
- - '787928'
  - stuff
- - '810790'
  - more stuff
- - '787927'
  - yet more stuff
- - '828055'
  - some more stuff
- - '828472'
  - some other stuff
If r1 is ^-.*787927.* and r2 is ^- I'd expect the output to be 7, i.e. the number of the line that says - - '828055'.
Input example :
world
zekfzlefkzl
fezekzevnkzjnz
hello
zeniznejkglz
world
eznkflznfkel
hello
zenilzligeegz
world
Command :
pat1="hello"; pat2="world";
awk -v pat1="$pat1" -v pat2="$pat2" '$0 ~ pat1{pat1_match = 1} ($0 ~ pat2) && pat1_match{print NR; exit}' <input>
Output :
6
For an input file that looks like this:
1 pat2
2 x
3 pat1
4 x
5 pat2
6 x
7 pat1
8 x
9 pat2
you could use sed as follows:
$ sed -n '/pat1/,${/pat2/{=;q;};}' infile
5
which works like this:
sed -n ' # suppress output with -n
/pat1/,$ { # for all lines from the first occurrence of "pat1" on...
/pat2/ { # if the line matches "pat2"
= # print line number
q # quit
}
}' infile
The above fails if the first occurrence of pat1 is on the same line as pat2:
1 pat2
2 x
3 pat1 pat2
4 x
5 pat2
6 x
7 pat1
8 x
9 pat2
would print 3. With GNU sed, we can use this instead:
$ sed -n '0,/pat1/!{/pat2/{=;q;};}' infile
5
sed -n ' # suppress output
0,/pat1/! { # for all lines after the first occurrence of "pat1"
/pat2/ { # if the line matches "pat2"
= # print line number
q # quit
}
}' infile
The 0 address is a GNU extension; using 1 instead would break if pat1 was on the first line.
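Reproducing that edge case with a here-doc (GNU sed assumed, for the 0 address):

```shell
cat > infile <<'EOF'
1 pat2
2 x
3 pat1 pat2
4 x
5 pat2
6 x
7 pat1
8 x
9 pat2
EOF
sed -n '0,/pat1/!{/pat2/{=;q;};}' infile   # prints 5, not 3
```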
This might work for you (GNU sed):
sed -n '/^-.*787927.*/{:a;n;/^-/!ba;=;q}' file
On encountering a line that matches ^-.*787927.*, start a loop that replaces the current line with the next until a line begins with -, then print the line number and quit.
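The logic generalizes beyond sed; here is a small Python sketch of the same "first r2 match strictly after the first r1 match" search, using the question's sample (indentation of the nested items assumed, since the expected answer is line 7):

```python
import re

def first_after(lines, r1, r2):
    # 1-based number of the first line matching r2 strictly after
    # the first line matching r1 (partial matches, as with grep).
    p1, p2 = re.compile(r1), re.compile(r2)
    seen_r1 = False
    for n, line in enumerate(lines, 1):
        if seen_r1 and p2.search(line):
            return n
        if not seen_r1 and p1.search(line):
            seen_r1 = True
    return None

lines = [
    "- - '787928'",
    "  - stuff",
    "- - '810790'",
    "  - more stuff",
    "- - '787927'",
    "  - yet more stuff",
    "- - '828055'",
    "  - some more stuff",
    "- - '828472'",
    "  - some other stuff",
]
print(first_after(lines, r"^-.*787927", r"^-"))  # 7
```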

Find duplicates and give sum of values in column next to it (UNIX) (with solution -> need faster way)

I am writing a script for bioinformatics use. I have a file with 2 columns, in which column A holds a number and column B a specific string. I need a script that searches the file for duplicate strings in column B; if any duplicates are found, the numbers in column A should be added up, the duplicates removed, and only one line kept, with column A holding the sum and column B the string.
I have written something that does exactly that, but because I am not really a programmer I am sure there is a much faster way. My files sometimes contain 500k lines and my code takes too long for such files. Please have a look at it and see what I could change to speed things up. Also, I can't use uniq because I'd have to sort first, and the order of the lines has to stay the way it is!
13 ABCD
15 BGDA
12 ABCD
10 BGDA
10 KLMN
17 BGDA
should become
25 ABCD
42 BGDA
10 KLMN
This does it but for a file with 500k lines it takes too long:
for AASEQUENCE in file.txt;
do
    #see how many lines the file has and save that number in $LN
    LN="$(wc -l $AASEQUENCE | cut -d " " -f 1)"
    for ((i=1;i<=${LN};i++));
    do
        #Create a variable that will have just the string from column B
        #save it in $STRING
        STRING="$(cut -f2 $AASEQUENCE | head -n $i | tail -n 1 | cut -f1)";
        #create $UNIQ: a variable that will have number+string of that
        #line. This will be used in the ELSE-statement, IF there are no
        #duplicates of the string, it will just be added to the
        #output file without further processing
        UNIQ="$(head -n $i $AASEQUENCE | tail -n 1)"
        for DUPLICATE in $AASEQUENCE;
        do
            #create variable that will display the number of lines
            #of duplicates. IF its 1 the IF-statement will jump to the ELSE
            #part as there are no duplicates
            VAR="$(grep -w "${STRING}" $DUPLICATE | wc -l)"
            #Now add up all the numbers from column A that have $STRING in
            #column B
            TOTALCOUNT="$(grep -w "${STRING}" $DUPLICATE | cut -f1 | awk '{SUM += $1} END {print SUM}')"
            #Create a file that the single line can be put into
            touch MERGED_`basename $AASEQUENCE`
            #The IF-statement checks if the AA occurs more than once
            #If it does a second IF-statement checks if this AA-sequence has
            #already been added.
            #If it hasnt been added, it will be, if not nothing happens.
            ALREADYMATCHED="$(grep -w "${STRING}" MERGED_`basename $AASEQUENCE` | wc -l)"
            if [[ "$VAR" > 1 ]];
            then if [[ "$ALREADYMATCHED" != 0 ]]; then paste <(echo "$TOTALCOUNT") <(echo "$STRING") --delimiters ' ' >> MERGED_`basename $AASEQUENCE`; fi;
            else echo $UNIQ >> MERGED_`basename $AASEQUENCE`; fi
        done;
    done;
done;
P.S.: When I have fileA.txt fileB.txt ... and use file* in the loop, it still always stops after the first file. Any suggestions why?
Maybe a pure awk solution?
$ cat > in
13 ABCD
15 BGDA
12 ABCD
10 BGDA
10 KLMN
17 BGDA
$ awk '{dc[$2] += $1} END{for (seq in dc) {print dc[seq], seq}}' in
25 ABCD
42 BGDA
10 KLMN
$
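One caveat with the awk above: for (seq in dc) iterates in unspecified order, while the question requires the original line order. A Python sketch that sums per key while keeping first-seen order (plain dicts preserve insertion order in Python 3.7+; whitespace-separated columns assumed):

```python
def sum_by_key(lines):
    # Sum column A per column-B string, keeping first-seen order.
    totals = {}
    for line in lines:
        num, seq = line.split()
        totals[seq] = totals.get(seq, 0) + int(num)
    return ['{} {}'.format(n, s) for s, n in totals.items()]

data = ['13 ABCD', '15 BGDA', '12 ABCD', '10 BGDA', '10 KLMN', '17 BGDA']
print(sum_by_key(data))  # ['25 ABCD', '42 BGDA', '10 KLMN']
```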