Piping grep result to awk

I would like to delete a column from CSV files, where the column contains some text as its header.
I would like the output file to have the same name as the file found by grep.
I did the following:
grep -l "pattern" * | xargs -0 awk -F'\t' '{print $1"\t"$2}' > output_file
How can I output the result to the same file that grep found?
Thank you.

Just do this:
grep -l "pattern" * | xargs awk -F'\t' '{print $1"\t"$2 > FILENAME}'
FILENAME is the awk built-in variable holding the name of the current input file.
Example:
$ cat file1
ABC zzz
EFG xxx
HIJ yyy
$ cat file2
123 aaa
456 bbb
789 ccc
grep -l "123" * | xargs awk '{print $2"\t"$1 > FILENAME}'
I switch columns 1 and 2 in the file containing "123" and overwrite file2.
$ cat file1
ABC zzz
EFG xxx
HIJ yyy
$ cat file2
aaa 123
bbb 456
ccc 789
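If the matched file names can contain spaces, the NUL-separated handoff the question was reaching for works too, assuming GNU grep and xargs (grep -lZ prints NUL-terminated names that xargs -0 consumes):
grep -lZ "pattern" * | xargs -0 awk -F'\t' '{print $1"\t"$2 > FILENAME}'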

Related

Using multiple delimiters when one of them is a pipe character

I have a text file where fields are separated by a pipe character. Since it is human-readable text, spaces are used for column alignment.
Here is a sample input:
+------------------------------------------+----------------+------------------+
| Column1 | Column2 | Column3 | Column4 | Last Column |
+------------------------------------------+----------------+------------------+
| some_text | other_text | third_text | fourth_text | last_text |
<more such lines>
+------------------------------------------+----------------+------------------+
How can I use awk to extract the third field in this case?
I tried:
awk -F '[ |]' '{print $3}' file
awk -F '[\|| ]' '{print $3}' file
awk -F '[\| ]' '{print $3}' file
The expected result is:
<blank>
Column3
<more column 3 values>
<blank>
third_text
I am trying to achieve this with a single awk command. Isn't that possible?
The following post talks about using pipe as a delimiter in awk but it doesn't talk about the case of multiple delimiters where one of them is a pipe character:
Using pipe character as a field separator
Am I missing something?
Example input:
+------------------------------------------+----------------+------------------+
| Column1 | Column2 | Column3 | Column4 | Last Column |
+------------------------------------------+----------------+------------------+
| some_text | other_text | third_text | fourth_text | last_text |
| some_text2| other_text2 | third_text2 | fourth_text2 | last_text2 |
+------------------------------------------+----------------+------------------+
Command:
gawk -F '[| ]*' '{print $4}' <file>
Output:
<blank>
Column3
<blank>
third_text
third_text2
<blank>
This works for every column: to get column i, print field i+1, because the leading pipe makes the first field empty (or a +------ chunk on the separator lines).
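For example, column 2 comes out of field 3 on the same sample input (file being the input name):
gawk -F '[| ]*' '{print $3}' file
This prints Column2 on the header line, other_text and other_text2 on the data lines, and blanks on the +------ lines.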
perl is better suited for this use case:
$ perl -F'\s*\|\s*' -lane 'print $F[3]' File
#      ^^^^^^^^^^^^
#      full regex support with the -F switch (a field delimiter, like awk's, but more powerful)
First preprocess with sed: delete the first, third, and last lines, squeeze every spaces+|+spaces run down to a single |, and strip the leading |. Then just split with awk on | (it could really be cut -d'|' -f3):
sed '1d;3d;$d;s/ *| */|/g;s/^|//' file |
awk -F'|' '{print $3}'
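And the parenthetical cut variant spelled out (same sed preprocessing):
sed '1d;3d;$d;s/ *| */|/g;s/^|//' file | cut -d'|' -f3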

Feed these particular commands to xargs to run in parallel

Here is the correctly working sequential version:
tail -n +2 movieratings.csv | cut -d "|" -f 1 | sort | uniq | wc -l
tail -n +2 movieratings.csv | cut -d "|" -f 2 | sort | uniq | wc -l
tail -n +2 movieratings.csv | cut -d "|" -f 3 | sort | uniq | wc -l
and so on for many more -f values; the field number is the only thing that changes. Parallelism is required for an answer to be acceptable.
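Spelled out as a loop, since only the field number changes (the sample data below has 30 |-separated fields):
for f in $(seq 1 30); do tail -n +2 movieratings.csv | cut -d "|" -f "$f" | sort | uniq | wc -l; done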
I tried lots of things, as follows (and even more than shown), if you want to wade through them, but none of them work right. GNU parallel is seemingly not available on this host, otherwise I would use it and be done with it already. The host is Google Colaboratory. Perhaps there is a way to install GNU parallel on Colab; that answer would also be acceptable to me, even if not to the rest of Stack Overflow. It's my question and I own it.
for i in {1..1}; do echo $i; done | xargs -L 30 -I x -n 1 -P 8 cut movieratings.csv -d "|" -f x | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -I xxx -P 8 cut movieratings.csv -d "|" -f xxx | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -d $'\n' -I xxx -P 8 cut movieratings.csv -d "|" -f xxx | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -d $'\n' -n 1 -I xxx -P 8 cut -d "|" -f xxx movieratings.csv | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -n 1 -I xxx -P 8 cut -d "|" -f xxx movieratings.csv | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -I xxx -P 8 'cut -d "|" -f xxx movieratings.csv' | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -I xxx -P 8 cut movieratings.csv -d "|" -f xxx | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -n 1 -I xxx -P 8 cut movieratings.csv -d "|" -f xxx | sort | uniq | wc -l
for i in {1..2}; do echo $i; done | xargs -I xxx -P 8 "cut movieratings.csv -d '|'" -f xxx | sort | uniq | wc -l
Here is some sample data on which to run commands. Hope this helps.
userid|itemid|rating|rating_year|title|unknown|action|adventure|animation|childrens|comedy|crime|documentary|drama|fantasy|film_noir|horror|musical|mystery|romance|scifi|thriller|war|western|movie_year|movie_age|user_age|gender|job|zipcode
196|242|3.0|1997|Kolya (1996)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1997|0|49|M|writer|55105
186|302|3.0|1998|L.A. Confidential (1997)|0|0|0|0|0|0|1|0|0|0|1|0|0|1|0|0|1|0|0|1997|1|39|F|executive|00000
22|377|1.0|1997|Heavyweights (1994)|0|0|0|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1994|3|25|M|writer|40206
244|51|2.0|1997|Legends of the Fall (1994)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|1|0|0|1|1|1994|3|28|M|technician|80525
166|346|1.0|1998|Jackie Brown (1997)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|0|0|0|1997|1|47|M|educator|55113
298|474|4.0|1998|Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|1|0|1963|35|44|M|executive|01581
115|265|2.0|1997|Hunt for Red October, The (1990)|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|1990|7|31|M|engineer|17110
253|465|5.0|1998|Jungle Book, The (1994)|0|0|1|0|1|0|0|0|0|0|0|0|0|0|1|0|0|0|0|1994|4|26|F|librarian|22903
305|451|3.0|1998|Grease (1978)|0|0|0|0|0|1|0|0|0|0|0|0|1|0|1|0|0|0|0|1978|20|23|M|programmer|94086
I do not know Colab, but did you read https://oletange.wordpress.com/2018/03/28/excuses-for-not-installing-gnu-parallel/ (especially parallel --embed)?
tail -n +2 movieratings.csv |
parallel --tee --pipe 'cut -d "|" -f {} | sort | uniq | wc -l' ::: 1 2 3
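If parallel really cannot be installed, plain xargs can do the job too. The attempts in the question fail because everything after the first | following xargs runs once, on xargs' combined output, rather than once per field; wrapping the whole per-field pipeline in sh -c fixes that. A sketch (30 is the field count of the sample data; printf keeps each result on one line so the parallel outputs do not interleave):
seq 1 30 | xargs -P 8 -I{} sh -c 'printf "%s\t%s\n" {} "$(tail -n +2 movieratings.csv | cut -d "|" -f {} | sort | uniq | wc -l)"'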

Separate two words in the same column in awk CLI?

On the CLI I type:
$ timedatectl | grep Time | awk '{print $3}'
Which gives me the correct output:
Country/City
How can I print just the City without the Country?
Use this command instead:
timedatectl | grep Time | awk '{print $3}' | cut -d/ -f2
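A single awk can also do the split itself, with no grep or cut (a sketch; it assumes the Country/City value is the third field of the "Time zone" line, as above):
timedatectl | awk '/Time zone/ {split($3, tz, "/"); print tz[2]}'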

Get first word of every line with AWK or SED?

I have a text file like this:
aaaa bbbb cccc
dddd eeee ffff
gggg hhhh iiii
...
..
.
How can I create a text file containing only the first column of every line, with awk or sed, like this?
aaaa
dddd
gggg
...
..
.
I have seen similar topics but I could not solve my problem.
If your input is
aaaa bbbb cccc
dddd eeee ffff
gggg hhhh iiii
and what you want is:
aaaa
dddd
gggg
then you can use any of:
awk NF=1 input.file
sed 's/ .*//' input.file
cut -d' ' -f1 input.file
Using awk, by setting the number of fields to 1:
awk '{NF=1}1' inputfile
Using grep with -o and a Perl-compatible regex:
grep -oP '^[^ ]+' inputfile
Using sed backreferencing:
sed -r 's/^([^ ]+).*/\1/' inputfile

Using multicharacter field separator using AWK

I'm having problems with awk's field delimiter. The input file appears as below:
1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria | scientific name |
2 | Monera | Monera | in-part |
2 | Procaryotae | Procaryotae | in-part |
2 | Prokaryota | Prokaryota | in-part |
2 | Prokaryotae | Prokaryotae | in-part |
2 | bacteria | bacteria | blast name |
The field delimiter here is tab, pipe, tab (\t|\t), so here is my attempt to print just the 1st and 2nd columns:
awk -F'\t|\t' '{print $1 "\t" $2}' nodes.dmp | less
Instead of the desired output, the result is the 1st column followed by the pipe character. I tried escaping the pipe (\t\|\t), but the output remains the same.
1 |
1 |
2 |
2 |
2 |
2 |
Printing the 1st and 3rd columns gave me the originally intended output:
awk -F'\t|\t' '{print $1 "\t" $3}' nodes.dmp | less
but I'm puzzled as to why this is not working as intended.
I understand that the perl one-liner below will work, but what I really want is to use awk.
perl -aln -F"\t\|\t" -e 'print $F[0],"\t",$F[1]' nodes.dmp | less
In a regex, the pipe is the alternation operator, so awk reads \t|\t as "\t or \t", i.e. just a single tab. Tell awk to interpret the | literally by putting it in a bracket expression:
$ awk -F'\t[|]\t' '{print $1 "\t" $2}' file
1 all
1 root
2 Bacteria
2 Monera
2 Procaryotae
2 Prokaryota
2 Prokaryotae
2 bacteria
From your posted input:
your lines can end in |, not |\t, and
you have cases (the first 2 lines) where the input contains |\t|, and
your lines start with a tab
So an FS of tab-pipe-tab is wrong, since it won't handle any of the above cases: the first ends in just tab-pipe; in the second, the middle tab is consumed by the tab-pipe-tab of the preceding field, leaving only pipe-tab for the field that follows; and the third leaves you with an undesirable leading tab on the first field.
What you actually need is to set the FS to just tab-pipe and then strip off the leading tab from each field:
awk -F'\t[|]' -v OFS='\t' '{for (i=1; i<=NF; i++) sub(/^\t/, "", $i); print $1, $2}' file
That way you can handle all fields from 1 to NF-1 exactly the same as each other.
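For example, to print every record with clean tab separation (a sketch; decrementing NF to drop the empty field left after the trailing tab-pipe relies on gawk rebuilding the record):
awk -F'\t[|]' -v OFS='\t' '{for (i = 1; i < NF; i++) sub(/^\t/, "", $i); NF--; print}' file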
Using the cut command:
cut -f1,2 -d'|' file.txt
Without the pipe in the output:
cut -f1,2 -d'|' file.txt | tr -d '|'