I want the number of lines of each of my Python files, along with the relative path.
I get that like this:
$ find ./ -name "*.py" -exec wc -l {} \;| awk '{print $1, $2}'
29 ./setup.py
28 ./proj_one/setup.py
896 ./proj_one/proj_one/data_ns.py
169 ./proj_one/proj_one/lib.py
310 ./proj_one/proj_one/base.py
0 ./proj_one/proj_one/__init__.py
72 ./proj_one/tests/lib_test.py
How could I get the counts formatted (as aligned integers) like this:
29 ./setup.py
28 ./proj_one/setup.py
896 ./proj_one/proj_one/data_ns.py
169 ./proj_one/proj_one/lib.py
310 ./proj_one/proj_one/base.py
0 ./proj_one/proj_one/__init__.py
72 ./proj_one/tests/lib_test.py
You can use printf with a width format modifier to make a formatted column:
$ find ./ -name "*.py" -exec wc -l {} \;| awk '{printf "%10s %s\n", $1, $2}'
On most platforms you can also print with comma separators, using the ' flag in the format specifier (handy if you have BIG files), but the quoting can be challenging for command-line use:
$ echo 10000000 | awk '{printf "%'\''d\n", $1}'
10,000,000
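Combining the two, something like this should give right-justified counts with thousands separators (a sketch that assumes, as above, that your awk passes the ' flag through to printf and that your locale defines a grouping character):
$ find ./ -name "*.py" -exec wc -l {} \;| awk '{printf "%'\''10d %s\n", $1, $2}'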
How about just piping the output of find to column -t?
The column utility formats its input into multiple columns.
-t  Determine the number of columns the input contains and create a table. (See the man page.)
$ find . -name "*rb" -exec wc -l {} \; | column -t
20 ./dirIO.rb
314 ./file_info.rb
53 ./file_santizer.rb
154 ./file_writer.rb
58 ./temp/maps.rb
248 ./usefu_ruby.rb
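Applied to the Python files from the question that would be (note that column -t left-aligns the count column rather than right-justifying it):
$ find ./ -name "*.py" -exec wc -l {} \; | column -t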
I was wondering if there is any way to pass a file list to awk. The file list has thousands of files, and I am using grep -l to find the subset of files I am interested in passing to awk.
E.g.,
grep -l id file-*.csv
file-1.csv
file-2.csv
$ cat file-1.csv
id,col_1,col_2
1,abc,100
2,def,200
$ cat file-2.csv
id,col_1,col_2
3,xyz,1000
4,hij,2000
If I do
$ awk -F, '{print $2,$3}' file-1.csv file-2.csv | grep -v col
abc 100
def 200
xyz 1000
hij 2000
it works how I would want, but there are too many files to list manually like this:
file-1.csv file-2.csv
so I was wondering if there is a way to pass in the result of the...
grep -l id file-*.csv
Edit:
grep -l id
is the condition. Each file has a header, but only some have 'id' in the header, so I can't just use the file-*.csv wildcard in the awk statement.
If I did an ls on file-*.csv I would end up with more than just file-1.csv and file-2.csv.
e.g.,
$ cat file-3.csv
name,col,num
a1,hij,3000
b2,lmn,50000
$ ls -l file-*.csv
-rw-r--r-- 1 tp staff 35 20 Sep 18:50 file-1.csv
-rw-r--r-- 1 tp staff 37 20 Sep 18:51 file-2.csv
-rw-r--r-- 1 tp staff 38 20 Sep 18:52 file-3.csv
$ grep -l id file-*.csv
file-1.csv
file-2.csv
Based on the output you show under "If I do", it sounds like this might be what you're trying to do:
awk -F, 'FNR>1{print $2,$3}' file-*.csv
but your question isn't clear so it's a guess.
Given your updated question, all you need with GNU awk (for nextfile) is:
awk -F, 'FNR==1{if ($1 != "id") nextfile; next} {print $2,$3}' file-*.csv
and with any awk (but less efficiently than with GNU awk):
awk -F, 'FNR==1{f=($1=="id"?1:0); next} f{print $2,$3}' file-*.csv
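For example, run against the three sample files shown above, either command should print only the data rows from the files whose header starts with id:
$ awk -F, 'FNR==1{f=($1=="id"?1:0); next} f{print $2,$3}' file-*.csv
abc 100
def 200
xyz 1000
hij 2000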
awk -F, 'FNR > 1{print $2,$3}' $(grep -l id file-*.csv)
(This will not work if any of your filenames contain whitespace.)
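If your grep and xargs are the GNU versions, a whitespace-safe variant of the same idea (a sketch using NUL-separated filenames) is:
grep -lZ id file-*.csv | xargs -0 awk -F, 'FNR > 1 {print $2,$3}'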
To find the files with an id field and merge their contents, excluding the header lines, here is a grep trick:
grep --no-group-separator -hA 1000000 'id' file-*.csv | grep -v 'id'
-h - suppress prefixing the file names on output
-A num - print num lines of trailing context after the matching line(s); 1000000 is used as a maximum number of lines that, presumably, will not be exceeded (adjust it if you really do have files with more than 1000000 lines)
The output (for 2 sample files from the question):
1,abc,100
2,def,200
3,xyz,1000
4,hij,2000
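An alternative sketch that avoids the arbitrary -A limit, assuming no data line starts with id, and that the filenames contain no whitespace:
grep -l id file-*.csv | xargs grep -hv '^id,'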
I am trying to grep for a pattern in the contents of several files in a directory and append the matches to a single file. In the output I also want a column with the filename, so I can tell which file each entry was picked up from. I was trying to use awk for this but it did not work.
for i in *_2.5kb.txt; do more $i | grep "NM_001080771" | echo `basename $i` | awk -F'[_.]' '{print $1"_"$2}' | head >> prom_genes_2.5kb.txt; done
The file names are like this (I have around 50 files):
48hrs_CT_merged_peaks_2.5kb.txt
48hrs_TAMO_merged_peaks_2.5kb.txt
72hrs_TAMO_merged_peaks_2.5kb.txt
72hrs_CT_merged_peaks_2.5kb.txt
5D_CT_merged_peaks_2.5kb.txt
5D_TAMO_merged_peaks_2.5kb.txt
Each file contains several lines like these:
chr1 3663275 3663483 14 2.55788 2.99631 1.40767 NM_001011874 -
chr1 4481687 4488063 264 7.85098 28.25170 26.41094 NM_011441 -
chr1 5008006 5013929 243 8.20677 26.17854 24.37907 NM_021374 -
chr1 5578362 5579949 65 3.48568 7.83501 6.57570 NM_011011 +
chr1 5905702 5908002 148 5.84647 16.53171 14.88463 NM_010342 -
chr1 9288507 9290352 77 4.04459 9.12442 7.77642 NM_027671 -
chr1 9291742 9292528 142 5.74749 16.21792 14.28185 NM_027671 -
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_021511 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_175236 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NR_027664 +
When I get a match for "NM_001080771" I print the entire matching line to a new file; this is done for each file, appending each match to one output file. I also want to add a column with the filename, as shown below, so that I know which file each entry came from.
Desired output:
chr4 21610972 21618492 193 7.28409 21.01724 19.35525 NM_001080771 - 48hrs_CT
chr4 21605096 21618696 76 4.22442 9.32981 7.68131 NM_001080771 - 48hrs_TAMO
chr4 21604864 21618713 12 1.78194 2.36793 1.25883 NM_001080771 - 72hrs_CT
chr4 21610305 21615717 26 2.90579 4.47333 2.65353 NM_001080771 - 72hrs_TAMO
chr4 21609924 21618600 23 2.63778 4.0642 2.33685 NM_001080771 - 5D_CT
chr4 21609936 21618680 30 5.63778 3.0642 8.33685 NM_001080771 - 5D_TAMO
This is not working. Basically, I want to append a column with the filename, either as the first or the last column. How do I do that?
Or you can do it all in awk:
awk '/NM_001080771/ {print $0, FILENAME}' *_2.5kb.txt
This trims the filename into the desired format:
$ awk '/NM_001080771/{sub(/_merged_peaks_2.5kb.txt/,"",FILENAME);
print $0, FILENAME}' *_2.5kb.txt
As long as the number of files is not huge, why not just:
grep NM_001080771 *_2.5kb.txt | awk -F: '{print $2,$1}'
If you have too many files for that to work, here's a script-based approach that uses awk to append the filename:
#!/bin/sh
for i in *_2.5kb.txt; do
< $i grep "NM_001080771" | \
awk -v where=`basename $i` '{print $0,where}'
done
./thatscript | head > prom_genes_2.5kb.txt
Here we are using awk's -v VAR=VALUE command-line feature to pass in the filename (because we are reading from stdin, awk's built-in FILENAME variable holds nothing useful here).
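A quick illustration of -v:
$ echo hello | awk -v where=somefile.txt '{print $0, where}'
hello somefile.txt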
You can also use such a loop around #karakfa's elegant awk-only approach:
#!/bin/sh
for i in *_2.5kb.txt; do
awk '/NM_001080771/ {print $0, FILENAME}' $i
done
And finally, here's a version with the desired filename munging:
#!/bin/sh
for i in *_2.5kb.txt; do
awk -v TAG=${i%_merged_peaks_2.5kb.txt} '/NM_001080771/ {print $0, TAG}' $i
done
(this uses the shell's variable substitution ${variable%pattern} to trim pattern from the end of variable)
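For example:
$ i=48hrs_CT_merged_peaks_2.5kb.txt
$ echo "${i%_merged_peaks_2.5kb.txt}"
48hrs_CT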
Bonus
Guessing you might want to search for other strings in the future, so why don't we pass in the search string like so:
#!/bin/sh
what=${1?Need search string}
for i in *_2.5kb.txt; do
awk -v TAG=${i%_merged_peaks_2.5kb.txt} /${what}/' {print $0, TAG}' $i
done
./thatscript NM_001080771 | head > prom_genes_2.5kb.txt
YET ANOTHER EDIT
Or if you have a pathological need to over-complicate and pedantically quote things, even in 5-line "throwaway" scripts:
#!/bin/bash
shopt -s nullglob
what="${1?Need search string}"
filematch="*_2.5kb.txt"
trimsuffix="_merged_peaks_2.5kb.txt"
for filename in $filematch; do
awk -v tag="${filename%${trimsuffix}}" \
-v what="${what}" \
'$0 ~ what {print $0, tag}' "${filename}"
done
I want to pretty-print the output of a find-like script that would take input like this:
- 2015-10-02 19:45 102 /My Directory/some file.txt
and produce something like this:
- 102 /My Directory/some file.txt
In other words: "f" (for "file"), file size (right-justified), then pathname (with an arbitrary number of spaces).
This would be easy in awk if I could write a script that takes $1, $4, and "everything from $5 through the end of the line".
I tried using the awk construct substr($0, index($0, $8)), which I thought meant "everything starting with field $8 to the end of $0".
Using index() in this way is offered as a solution on linuxquestions.org and was upvoted 29 times in a stackoverflow.com thread.
On closer inspection, however, I found that index() does not achieve this effect if the starting field happens to match an earlier point in the string. For example, given:
-rw-r--r-- 1 tbaker staff 3024 2015-10-01 14:39 calendar
-rw-r--r-- 1 tbaker staff 4062 2015-10-01 14:39 b
-rw-r--r-- 1 tbaker staff 2374 2015-10-01 14:39 now or later
Gawk (and awk) get the following results:
$ gawk '{ print index($0, $8) }' test.txt
49
15
49
In other words, the value of $8 ('b') matches at index 15 (inside 'tbaker') rather than at index 49, where field 8 actually begins, as it does for the other filenames.
My issue, then, is how to specify "everything from field X to the end of the string".
I have re-written this question in order to make this clear.
Looks to me like you should just be using the "stat" command rather than "ls", for the reasons already commented upon:
stat -c "f%15s %n" *
But you should double-check how your "stat" operates; its options differ between platforms (the above is the GNU coreutils syntax).
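For instance, on BSD/macOS stat takes -f instead of -c, so the equivalent should be roughly this (an untested sketch):
stat -f "f%15z %N" *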
The built-in awk function index() is sometimes recommended as a way
to print "from field 5 through the end of the string" [1, 2, 3].
In awk, index($0, $8) does not mean "the index of the first character of
field 8 in string $0". Rather, it means "the index of the first occurrence in
string $0 of the string value of field 8". In many cases, that first
occurrence will indeed be the first character in field 8 but this is not the
case in the example above.
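One robust alternative (a sketch of my own, not taken from the linked threads, assuming the default FS and no leading whitespace) is to strip the first seven fields off a copy of $0 rather than searching for the value of $8:
$ gawk '{ rest = $0
          for (i = 1; i < 8; i++)
              sub(/^[^[:space:]]+[[:space:]]+/, "", rest)    # drop fields 1 through 7
          print rest }' test.txt
calendar
b
now or later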
It has been pointed out that parsing the output of ls is generally a bad
idea [4], in part because implementations of ls significantly differ in output.
Since the author of that note recommends find as a replacement for ls for some uses,
here is a script using find:
find "$@" -ls |
sed -e 's/^ *//' -e 's/ */ /g' -e 's/ /|/2' -e 's/ /|/2' -e 's/ /|/4' -e 's/ /|/4' -e 's/ /|/6' |
gawk -F'|' '{ $2 = substr($2, 1, 1) ; gsub(/^-/, "f", $2) }
{ printf("%s %15s %s\n", $2, $4, $6) }'
...which yields the required output:
f 4639 /Users/foobar/uu/a
f 3024 /Users/foobar/uu/calendar
f 2374 /Users/foobar/uu/xpect
This approach recursively walks through a file tree. However, there may of course be implementation differences between versions of find as well.
[1] http://www.linuxquestions.org/questions/linux-newbie-8/awk-print-field-to-end-and-character-count-179078/
[2] How to print third column to last column?
[3] Print Field 'N' to End of Line
[4] http://mywiki.wooledge.org/ParsingLs
Maybe some variation of find -printf | awk is what you're looking for?
$ ls -l tmp
total 2
-rw-r--r-- 1 Ed None 7 Oct 2 14:35 bar
-rw-r--r-- 1 Ed None 2 Oct 2 14:35 foo
-rw-r--r-- 1 Ed None 0 May 3 09:55 foo bar
$ find tmp -type f -printf "f %s %p\n" | awk '{sub(/^[^ ]+ +[^ ]/,sprintf("%s %10d",$1,$2))}1'
f 7 tmp/bar
f 2 tmp/foo
f 0 tmp/foo bar
or
$ find tmp -type f -printf "%s %p\n" | awk '{sub(/^[^ ]+/,sprintf("f %10d",$1))}1'
f 7 tmp/bar
f 2 tmp/foo
f 0 tmp/foo bar
It won't work with file names that contain newlines.
I want to sum up the occurrence counts in the output of the "uniq -c" command.
How can I do that on the command line?
For example, if I get the following output, I would need 250:
45 a4
55 a3
1 a1
149 a5
awk '{sum+=$1} END{ print sum}'
This should do the trick:
awk '{s+=$1} END {print s}' file
Or just pipe it into awk with
uniq -c whatever | awk '{s+=$1} END {print s}'
For each line, add the value of the first column to SUM; at the end, print the value of SUM.
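For example, with the sample output from the question:
$ printf '45 a4\n55 a3\n1 a1\n149 a5\n' | awk '{s+=$1} END {print s}'
250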
awk is a better choice
uniq -c somefile | awk '{SUM+=$1}END{print SUM}'
but you can also implement the logic in bash (note that the echo has to run in the same subshell as the loop, since each stage of a pipeline gets its own subshell):
uniq -c somefile | { SUM=0
while read num other
do
  let SUM+=num
done
echo $SUM; }
uniq -c is slow compared to awk. Like, REALLY slow.
{mawk/mawk2/gawk} 'BEGIN { OFS = "\t" }      # modify FS for the column you want to uniq -c upon
  { freqL[$1]++ }                            # count occurrences of that column
  END { for (x in freqL) printf("%8s %s\n", freqL[x], x) }'
If your input isn't large (say 100MB+), then gawk suffices after adding this to the BEGIN block:
PROCINFO["sorted_in"] = "#ind_num_asc"       # gawk-specific; just use gawk -b mode
If it's really large, it's far faster to use mawk2 and then pipe to GNU sort:
{ mawk/mawk2 stuff... } | gnusort -t'\t' -k 2,2
While the aforementioned answer, uniq -c example-file | awk '{SUM+=$1}END{print SUM}', does sum the left column of the uniq -c output, note that this sum is simply the total number of input lines, so wc -l somefile (as mentioned in the comments) gives the same number.
If what you are looking for is the number of unique lines in your file, then you can use this command:
sort -h example-file | uniq | wc -l
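Equivalently:
sort -u example-file | wc -l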
I have a file with a single line in it containing a record that is delimited by semicolons. So far I have figured out that I can use tr by issuing:
tr ';' '\n' < t
However, since the record has 140 fields, I'd like to show the field number alongside each value when displaying it, like the following:
1 23
2 324234
3 AAA
.
.
140 Blah
Help is greatly appreciated!
tr \; '\n' < t | nl
or
awk -v RS=';' '$1=++i" "$1' file
test:
kent$ echo "a;b;c;d"|awk -v RS=';' '$1=++i" "$1'
1 a
2 b
3 c
4 d
You could just run it through cat -n.
tr \; '\n' < t | cat -n
Since this is tagged awk, you could do it that way, too; it's just a little wordier:
awk -F\; '{for (i=1;i<=NF;++i) { print i" "$i }}'
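For example, with the same test input as above:
$ echo "a;b;c;d" | awk -F\; '{for (i=1;i<=NF;++i) { print i" "$i }}'
1 a
2 b
3 c
4 d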
In the shell you can use IFS to specify the field separator, like so:
IFS=";"
i=0
for s in $(<file)
do
  ((i++))
  echo $i $s
done
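For example, if the record from the question is in a file named file (count_fields.sh is a hypothetical name for a script containing the loop above):
$ printf '23;324234;AAA\n' > file
$ bash count_fields.sh
1 23
2 324234
3 AAA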