Replace all lines after a match using sed or awk

I have the following txt file:
Col1,,Col2,,Col 3,,Session,,Time
Mike,,Rg,,Tx,,32658723,,2:00
,,,,,,,,
,,,,,,23623623,,
,,,,,,,,
Joe,,Tx,,Rg,,47235623,,1:00
,,,,,,,,
Peter,,Un,,Xs,,6523,,1:00
,,,,,,,,
Nick,,Xe,,Lk,,67286734,,3:00
,,,,,,,,
,,,,,,,,
,,,,,,32623,,
,,,,,,,,
Bob Li,,Yh,,Xa,,2362,,3:00
,,,,,,,,
,,,,,,,,
,,,,,,,,
,,,,,,,,
,,,,,,323,,
,,,,,,,,
,,,,,,,,
,,,,,,,,
,,,,,,,,
Lin Xu,,Rw,,NB,,1352362,,1:00
,,,,,,,,
The most important value in this file is the one in the 7th column. I would like to fill all the empty lines using the values from the most recent populated line, leaving any existing 7th-column value untouched.
I have been trying some sed commands like:
sed -n '/^,/{g;};h;p'
But it replaces all the empty lines, even the ones that already have a value in the 7th column.
I would like the above file to become this:
Col1,,Col2,,Col 3,,Session,,Time
Mike,,Rg,,Tx,,32658723,,2:00
Mike,,Rg,,Tx,,32658723,,2:00
Mike,,Rg,,Tx,,23623623,,2:00
Mike,,Rg,,Tx,,23623623,,2:00
Joe,,Tx,,Rg,,47235623,,1:00
Joe,,Tx,,Rg,,47235623,,1:00
Peter,,Un,,Xs,,6523,,1:00
Peter,,Un,,Xs,,6523,,1:00
Nick,,Xe,,Lk,,67286734,,3:00
Nick,,Xe,,Lk,,67286734,,3:00
Nick,,Xe,,Lk,,67286734,,3:00
Nick,,Xe,,Lk,,32623,,3:00
Nick,,Xe,,Lk,,32623,,3:00
Bob Li,,Yh,,Xa,,2362,,3:00
Bob Li,,Yh,,Xa,,2362,,3:00
Bob Li,,Yh,,Xa,,323,,3:00
Bob Li,,Yh,,Xa,,323,,3:00
Bob Li,,Yh,,Xa,,323,,3:00
Bob Li,,Yh,,Xa,,323,,3:00
Bob Li,,Yh,,Xa,,323,,3:00
Lin Xu,,Rw,,NB,,1352362,,1:00
Lin Xu,,Rw,,NB,,1352362,,1:00

This might work for you (GNU sed):
sed -E '1b
/^,{8}$/{g;b}
/^,{6}\S+,,$/{G;s/^,{6}(\S+),,\n(([^,]*,){6})[^,]*/\2\1/}
h' file
Ignore the header line.
If the line consists entirely of empty fields, replace it with the previous good line and branch to the end (so it is printed).
If all fields are empty except the seventh, append the last good line and replace its seventh field with the current line's.
This line is now good; store a copy in the hold space.
All lines are printed by default.
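For comparison, a rough awk equivalent (a sketch only, checked against the sample above): splitting on the double-comma delimiter makes the Session column field 4, so the script remembers the last complete line, updates its Session field when a line carries only that value, and re-emits it for every empty line:
awk -F',,' -v OFS=',,' '
NR==1    { print; next }                                  # header: pass through untouched
$1 != "" { for (i=1; i<=NF; i++) f[i]=$i; print; next }   # complete line: remember it and print
$4 != "" { f[4]=$4 }                                      # only Session present: update saved line
         { print f[1], f[2], f[3], f[4], f[5] }           # empty or Session-only line: re-emit saved line
' file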

How to get the string which is less than 4 characters using SED or AWK or GREP

I'm trying to get strings which are less than 4 (0,3) characters and which might include some special characters too.
The issue here is I'm not really sure which special characters are involved.
The input can contain names of any length with some special characters; I'm not sure which ones are included.
Sample Input data is as below
r#nger
d!nger
'iterr
4#e
c#nuidig
c#niting
c^neres
sample
Sample Output should be like this
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
I have tried the two commands below, which both work, but both have flaws: apart from the 0-3 character strings I'm also getting 1-character outputs, which is incorrect.
For example just "c", which I don't have in the input by itself.
grep -iE '^[a-z0-9\.-+?$_,#]{0,3}$'
sed -n '/^.\{0,3\}$/p'
grep uid: file.csv | awk '{print $2}' | sed -En 's/^([^[:space:]]{3}).*/\1/p' | sort -f > output
Sample Output from above
r#n
d!n
'it
4#e
c#n
c
c
sam
s
I'm thinking there might be some special character after the first character which is making it break and print only the first character.
Can someone please suggest how to get this working as expected?
Thanks,
To get the output you posted from the input you posted, all you need is:
$ cut -c1-3 file
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
If that's not all you need then edit your question to more clearly state your requirements and provide more truly representative sample input/output including cases where this doesn't work.
You may use this grep with -o and -E options:
grep -oE '^[^[:blank:]]{1,3}' file
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
The regex ^[^[:blank:]]{1,3} matches 1 to 3 non-blank characters from the start of the line, and -o prints only the matched part.
Using awk:
awk '{print (length($0)<3) ? $0 : substr($0,1,3)}' src.dat
Output:
r#n
d!n
'it
4#e
c#n
c#n
c^n
sam
1
11
-1
.
Contents of src.dat:
r#nger
d!nger
'iterr
4#e
c#nuidig
c#niting
c^neres
sample
1
11
-1
.
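Since substr() simply returns whatever characters exist when the string is shorter than the requested length, the ternary isn't strictly necessary; this should behave the same:
awk '{print substr($0,1,3)}' src.dat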
sed 's/.//4g' file
Delete every character starting at the 4th until there are no more. This relies on GNU sed, whose manual says:
Note: the POSIX standard does not specify what should happen when you mix the g and number modifiers, and currently there is no widely agreed upon meaning across sed implementations. For GNU sed, the interaction is defined to be: ignore matches before the numberth, and then match and replace all matches from the numberth on.
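A quick demonstration of that interaction (GNU sed assumed):
$ echo abcdef | sed 's/.//4g'
abc
$ echo abcdef | sed 's/.//4'
abcef
Without the g, only the 4th match is deleted.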
Also: grep -o '^...' file

How to delete text/word or character from a text file?

I'm working with this file of data that looks like this:
text in file
hello random text in file
example text in file
words in file hello
more words in file
hello text in file can be
more text in file
I'm trying to replace all lines that do not contain the string hello with match using sed, so the output would be:
match
hello random text in file
match
words in file hello
match
hello text in file can be
match
I tried using sed '/hello/!d' but that deletes the line. Also, I read that I can match using ! within sed but I'm not sure how to match every line and replace properly. If you could give me some direction, I would really appreciate it.
You can do it like this:
$ sed '/hello/!s/.*/match/' infile
match
hello random text in file
match
words in file hello
match
hello text in file can be
match
/hello/! makes sure we're substituting only on lines not containing hello (you had that right), and the substitution then replaces the complete pattern space (.*) with match.
awk to the rescue!
$ awk '!/hello/{$0="match"}1' file
replace the lines not matching "hello" with "match"; the trailing 1 is an always-true condition whose default action is to print each line.
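Run on the sample input, this produces the same output as above:
$ awk '!/hello/{$0="match"}1' file
match
hello random text in file
match
words in file hello
match
hello text in file can be
match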
Sed with the c (change) command:
$ sed '/hello/!c match' file
match
hello random text in file
match
words in file hello
match
hello text in file can be
match
Just use awk for clarity, simplicity, etc.:
awk '{print (/hello/ ? $0 : "match")}' file

How to remove all lines after a line containing some string?

I need to remove all lines after the first line that contains the word "fox".
For example, for the following input file "in.txt":
The quick
brown fox jumps
over
the
lazy dog
the second fox
is hunting
The result will be:
The quick
brown fox jumps
I would prefer a script in awk or sed, but any other command line tool is fine too, like perl or php or python.
I am using gnuwin32 tools in Windows, and the solution I could find was this one:
grep -m1 -n fox in.txt | cut -f1 -d":" > var.txt
set /p MyNumber=<var.txt
head -%MyNumber% in.txt > out.txt
However, I am looking for a solution that is shorter and that is portable (this one contains the Windows-specific command set /p).
Similar questions:
How to delete all the lines after the last occurence of pattern?
How to delete lines before a match perserving it?
Remove all lines before a match with sed
How to delete the lines starting from the 1st line till line before encountering the pattern '[ERROR] -17-12-2015' using sed?
How to delete all lines before the first and after the last occurrence of a string?
awk '{print} /fox/{exit}' file
With GNU sed:
sed '0,/fox/!d' file
or
sed -n '0,/fox/p' file
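Since any command line tool is acceptable, the same idea is just as short in perl (print each line, then exit as soon as the line matches):
perl -ne 'print; exit if /fox/' in.txt > out.txt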

Is there a way to replace all the commas except those in quotes using sed

# Director, Movie Title, Year, Comment
Ethan Coen, No Country for Old Men, 2007, none
Ethan Coen, "O Brother, Where Art Thou?", 2000, none
Here the commas outside the quotes have to be replaced with |.
Ethan Coen| "O Brother, Where Art Thou?"| 2000| none
I did try this command
sed -e 's/,(?=(?:[^"]*"[^"]*")*[^"]*$)/|/g'
to first match those strings, but I am getting RE error: repetition-operator operand invalid.
I don't know how to convert the regex; I'm totally new to shell and regex, and I would be grateful for any help.
You may try this gnu awk command with FPAT to split fields using a custom regex:
awk -v OFS='|' -v FPAT=' *"[^"]*"|[^",]+' '{$1=$1} 1' file
# Director| Movie Title| Year| Comment
Ethan Coen| No Country for Old Men| 2007| none
Ethan Coen| "O Brother, Where Art Thou?"| 2000| none
The proper way to deal with CSV is to use a proper csv parser.
Given:
$ cat file
# Director, Movie Title, Year, Comment
Ethan Coen, No Country for Old Men, 2007, none
Ethan Coen, "O Brother, Where Art Thou?", 2000, none
First issue: your example has a two-character delimiter of ', ' rather than a single comma ','. That will throw off most csv parsers unless they support multi-character delimiters, and csvkit, for example, does not. (Where it fails is on quoted fields: a parser looking for ," as the start of a quoted field instead finds , " with a space in between.)
The lightest weight but full featured csv parser commonly available at the command line is in ruby.
With ruby, you can do:
$ ruby -rcsv -ne 'puts (CSV.parse $_, col_sep:", ").join("|")' file
# Director|Movie Title|Year|Comment
Ethan Coen|No Country for Old Men|2007|none
Ethan Coen|O Brother, Where Art Thou?|2000|none
If you want the replacement delimiter to also be '| ' rather than the single character '|', you can do:
$ ruby -rcsv -ne 'puts (CSV.parse $_, col_sep:", ").join("| ")' file
# Director| Movie Title| Year| Comment
Ethan Coen| No Country for Old Men| 2007| none
Ethan Coen| O Brother, Where Art Thou?| 2000| none
Note that O Brother, Where Art Thou? is no longer quoted since the ', ' is no longer a delimiter.
To be even more proper, you would use the csv module to re-encode back into a proper RFC 4180 compliant file.
Suppose you wanted to fix the ', ' into a compliant ',' and maintain the quoted fields. Our single line of Ruby does not do that.
Instead:
$ ruby -rcsv -ne 'out=(CSV.parse $_, col_sep:", ").map do |row|
row.to_csv(:col_sep=>",")
end
puts out' file
# Director,Movie Title,Year,Comment
Ethan Coen,No Country for Old Men,2007,none
Ethan Coen,"O Brother, Where Art Thou?",2000,none
Or into a '|':
$ ruby -rcsv -ne 'out=(CSV.parse $_, col_sep:", ").map do |row|
row.to_csv(:col_sep=>"|")
end
puts out' file
# Director|Movie Title|Year|Comment
Ethan Coen|No Country for Old Men|2007|none
Ethan Coen|O Brother, Where Art Thou?|2000|none
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1\n/;ta;y/,\n/|,/' file
Replace all ,'s between "'s with \n's, then translate the remaining ,'s to |'s and the \n's back to ,'s.
You may find perl better suited to handling complex regular expressions like this, when you need to avoid matching a pattern in specific contexts (the (?=...) lookahead in your attempt is Perl-style syntax that sed's POSIX regexes do not support, which is why you got that error):
perl -i -pe 's/"[^"]*"(*SKIP)(*F)|,/|/g' file.txt
Or, if you need to match across lines,
perl -0777 -i -pe 's/"[^"]*"(*SKIP)(*F)|,/|/g' file.txt
Here,
-0777 slurps the file contents into a single string (without it, the file is treated as a list of lines, each of which is fed to the regex engine separately)
-i makes changes to the input file in place
"[^"]*"(*SKIP)(*F)|, - the regex matches a ", then zero or more chars other than ", then a closing ", skips that match and resumes searching from the position where it failed; or (|) it matches a , in any other context
| is the replacement
g - replaces all occurrences.
If your file is UTF8, and you need to make some manipulation with Unicode text, you would need to add -CSD -Mutf8, say, before -i.
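Run against the sample file above (dropping -i so the result goes to stdout rather than back into the file), this produces:
$ perl -pe 's/"[^"]*"(*SKIP)(*F)|,/|/g' file
# Director| Movie Title| Year| Comment
Ethan Coen| No Country for Old Men| 2007| none
Ethan Coen| "O Brother, Where Art Thou?"| 2000| none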

Convert capitalized words only into lower case to cancel capitalization of nouns

Given :
$ cat input
Hello
Welcome
strIng
North Korea
USA
U.K.
I want to obtain:
$ cat output
hello
welcome
strIng
North Korea
USA
U.K.
How to convert capitalized words* only to lower case?
*: first letter is capitalized.
Note: I am looking for a command which cancels the capitalization of nouns while not altering acronyms and unusual words.
Something like this will cover the sample input, but I'm not sure it covers all the other implied conditions:
$ awk '/^[A-Z][^A-Z]+$/{$1=tolower(substr($1,1,1)) substr($1,2)}1' file
hello
welcome
strIng
North Korea
USA
U.K.
if the first char is upper case and all subsequent chars are not, convert the first char to lower case.
A sed solution:
sed '/^[A-Z][^A-Z]*$/ {
  y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
}' input.txt > output.txt
Tested and confirmed to work for your example. You would need to modify it to handle accented characters (and I'm not sure how well the awk answer's tolower would do with them either).
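If GNU sed is available, its case-conversion escapes make this shorter; a minimal sketch using \l, which lowercases only the next character of the replacement:
$ sed 's/^\([A-Z]\)\([^A-Z]*\)$/\l\1\2/' input
hello
welcome
strIng
North Korea
USA
U.K.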