Concatenating multiple lines with a discriminator - scripting

I have input like this.
Input:
a,b,c
d,e,f
g,h,i
k,l,m
n,o,p
q,r,s
I want to be able to concatenate the lines with a discriminator like "|".
Output:
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
The file has 1 million lines, and I want to concatenate lines as in the example above.
Any ideas about how to approach this?

@OP, if you want to group them for every 3 records:
$ awk 'ORS=(NR%3==0)?"\n":"|"' file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
with Perl,
$ perl -lne 'print $_ if $\ = ($. % 3 == 0) ? "\n" : "|"' file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
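The same ORS trick generalizes to any group size; a minimal sketch, with the group size passed in an awk variable n (a name chosen here for illustration):
$ awk -v n=3 'ORS = (NR % n == 0) ? "\n" : "|"' file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s
Note that if the line count is not a multiple of n, the final group ends with "|" rather than a newline.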

Since your tags include sed here is a way to use it:
sed 'N;N;s/\n/|/g' datafile
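Be aware that the behaviour of N at end of input differs between implementations: GNU sed prints the pattern space when there is no next line, while POSIX sed exits without printing, so a trailing group of fewer than three lines may be dropped.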

gawk:
BEGIN {
    state = 0
}
state == 0 {
    line = $0
    state = 1
    next
}
state == 1 {
    line = line "|" $0
    state = 2
    next
}
state == 2 {
    print line "|" $0
    state = 0
    next
}
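To run it, assuming the script above is saved as group3.awk (a file name chosen here for illustration):
$ gawk -f group3.awk file
a,b,c|d,e,f|g,h,i
k,l,m|n,o,p|q,r,s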

If Perl is fine, you can try:
my @buf;
while (<>) {
    chomp;
    push @buf, $_;
    if (@buf == 3) {
        print join("|", @buf), "\n";
        @buf = ();
    }
}
print join("|", @buf), "\n" if @buf;
to run:
perl perlfile.pl 1millionlinesfile.txt

$ paste -sd'|' input | sed -re 's/([^|]+\|[^|]+\|[^|]+)\|/\1\n/g'
With paste, we join the lines together, and then sed dices them up. The pattern grabs runs of 3 pipe-terminated fields and replaces their respective final pipes with newlines.
With Perl:
#! /usr/bin/perl -ln
push @a => $_;
if (@a == 3) {
    print join "|" => @a;
    @a = ();
}
END { print join "|" => @a if @a }

change field value of one file based on another input file using awk

I have a sparse matrix ("matrix.csv") with 10k rows and 5 columns (the 1st column is "user", and the remaining columns, called "slots", contain 0s or 1s), like this:
user1,0,1,0,0
user2,0,1,0,1
user3,1,0,0,0
Some of the slots that contain a "0" should be changed to contain a "1".
I have another file ("slots2change.csv") that tells me which slots should be changed, like this:
user1,3
user3,2
user3,4
So for user1, I need to change slot3 to contain a "1" instead of a "0", and for user3 I should change slot2 and slot4 to contain a "1" instead of a "0", and so on.
Expected result:
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
How can I achieve this using awk or sed?
Looking at this post: awk or sed change field of file based on another input file, a user proposed an answer that is valid only if the "slots2change.csv" file does not contain the same user in different rows, which is not the case here.
The solution proposed was:
awk 'BEGIN{FS=OFS=","}
NR==FNR{arr[$1]=$2;next}
NR!=FNR {for (i in arr)
if ($1 == i) {
F=arr[i] + 1
$F=1
}
print
}
' slots2change.csv matrix.csv
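(It fails because arr[$1]=$2 keeps only one slot per user, so a later row for the same user overwrites the earlier one.)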
But that answer doesn't apply in the case where the "slots2change.csv" file contains the same user in different rows, as is now the case.
Any ideas?
Using GNU awk for arrays of arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    users2slots[$1][$2]
    next
}
$1 in users2slots {
    for ( slot in users2slots[$1] ) {
        $(slot+1) = 1
    }
}
{ print }
$ awk -f tst.awk slots2change.csv matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
or using any awk:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    if ( !seen[$0]++ ) {
        users2slots[$1] = ($1 in users2slots ? users2slots[$1] FS : "") $2
    }
    next
}
$1 in users2slots {
    split(users2slots[$1],slots)
    for ( idx in slots ) {
        slot = slots[idx]
        $(slot+1) = 1
    }
}
{ print }
$ awk -f tst.awk slots2change.csv matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
Using sed
while IFS="," read -r user slot; do
sed -Ei "/$user/{s/(([^,]*,){$slot})[^,]*/\11/}" matrix.csv
done < slots2change.csv
$ cat matrix.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
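Note that this rewrites matrix.csv in place once for every line of slots2change.csv, so on large change files it will be much slower than the single-pass awk solutions above.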
If the order in which the users are outputted doesn't matter then you could do something like this:
awk '
BEGIN { FS = OFS = "," }
FNR == NR {
    fieldsCount[$1] = NF
    for (i = 1; i <= NF; i++)
        matrix[$1,i] = $i
    next
}
{ matrix[$1,$2+1] = 1 }
END {
    for ( id in fieldsCount ) {
        nf = fieldsCount[id]
        for (i = 1; i <= nf; i++)
            printf "%s%s", matrix[id,i], (i < nf ? OFS : ORS)
    }
}
' matrix.csv slots2change.csv
user1,0,1,1,0
user2,0,1,0,1
user3,1,1,0,1
This might work for you (GNU sed):
sed -E 's#(.*),(.*)#/^\1/s/,[01]/,1/\2#' fileChanges | sed -f - fileCsv
Create a sed script from the file containing the changes and apply it to the intended file. The solution manufactures a match-and-substitute command for each line of the changes file; this is then piped to a second invocation of sed, which applies the generated script to the csv file.
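For illustration, the intermediate script that the first sed generates from the sample slots2change.csv is:
/^user1/s/,[01]/,1/3
/^user3/s/,[01]/,1/2
/^user3/s/,[01]/,1/4
Each generated command anchors on a user and replaces the Nth occurrence of ,0 or ,1, i.e. slot N, with ,1.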

How to merge lines using awk command so that there should be specific fields in a line

I want to merge some rows in a file so that each line contains exactly 22 fields separated by ~.
Input file looks like this.
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~
UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~
~76~23~2523~Y~UN~~223~223~~~~A~123
and So on
The first line looks fine. The 2nd and 3rd lines need to be merged so that they become one line with 22 fields. The 4th and 5th lines should be merged, and so on.
Expected output:
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The file has 10 GB of data, but the code I wrote (using a while loop) takes too much time to execute. How can I solve this problem using an awk/sed command?
Code Used:
IFS=$'\n'
set -f
while read line
do
    count_tild=`echo $line | grep -o '~' | wc -l`
    if [ $count_tild == 21 ]
    then
        echo $line
    else
        checkLine
    fi
done < file.txt

function checkLine
{
    current_line=$line
    read line1
    next_line=$line1
    new_line=`echo "$current_line$next_line"`
    count_tild_mod=`echo $new_line | grep -o '~' | wc -l`
    if [ $count_tild_mod == 21 ]
    then
        echo "$new_line"
    else
        line=$new_line
        checkLine
    fi
}
Using only the shell for this is slow, error-prone, and frustrating. Try Awk instead.
awk -F '~' 'NF==1 { next }  # Hack; see below
NF<22 {
    for (i=1; i<=NF; i++) f[++a]=$i
}
a==22 {
    for (i=1; i<=a; ++i) printf "%s%s", f[i], (i==22 ? "\n" : "~")
    a=0
}
NF==22
END {
    if (a) for (i=1; i<=a; i++) printf "%s%s", f[i], (i==a ? "\n" : "~")
}' file.txt > file.new
This assumes that consecutive lines with too few fields will always add up to exactly 22 when you merge them. You might want to check this assumption (or perhaps accept this answer and ask a new question with more and better details). Or maybe just add something like
a>22 {
    print FILENAME ":" FNR ": Too many fields " a >"/dev/stderr"
    exit 1
}
The NF==1 block is a hack to bypass the weirdness of the completely empty line in your sample.
Your attempt contained multiple errors and inefficiencies; for a start, try http://shellcheck.net/ to diagnose many of them.
$ cat tst.awk
BEGIN { FS="~" }
{
    gsub(/[[:space:]]+/,"")
    $0 = prev $0
    if ( NF == 22 ) {
        print
        prev = ""
    }
    else {
        prev = $0
    }
}
$ awk -f tst.awk file
200269~7414~0027001~VALTD~OM3500~963~~~~716~423~2523~Y~UN~~2423~223~~~~A~200423
2269~744~2701~VALD~3500~93~~~~76~423~223~Y~UN~~243~223~~~~A~200123
209~7414~7001~VALD~OM30~963~~~~76~23~2523~Y~UN~~223~223~~~~A~123
The assumption above is that you never have more than 22 fields on one line, nor does any concatenation of contiguous short lines exceed 22 fields, just as in your sample input.
You can try this awk
awk '
BEGIN {
    FS=OFS="~"
}
{
    while (NF<22) {
        if (NF==0)
            break
        a=$0
        getline
        $0=a$0
    }
    if (NF!=0)
        print
}
' infile
or this sed
sed -E '
:A
# a line with 21 "~" separators already has 22 fields: jump to print
s/((.*~){21})([^~]*)/\1\3/
tB
# otherwise append the next input line and test again
N
bA
:B
# join the accumulated lines into one record
s/\n//g
' infile

Convert rows into columns using awk

Not all columns (and their data) are present for all records, so whenever fields are missing they should be replaced with nulls.
My Input format:
.set 1000
EMP_NAME="Rob"
EMP_DES="Developer"
EMP_DEP="Sales"
EMP_DOJ="20-10-2010"
EMR_MGR="Jack"
.set 1001
EMP_NAME="Koster"
EMP_DEP="Promotions"
EMP_DOJ="20-10-2011"
.set 1002
EMP_NAME="Boua"
EMP_DES="TA"
EMR_MGR="James"
My desired output Format:
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
I tried the below:
awk 'NR>1{printf "%s"(/^\.set/?RS:"~"),a} {a=substr($0,index($0,"=")+1)} END {print a}' $line
This is printing:
Rob~Developer~Sales~20-10-2010~Jack
Koster~Promotions~20-10-2011~
Boua~TA~James~
This awk script produces the desired output:
BEGIN { FS = "[=\"]+"; OFS = "~" }
/\.set/ { ++records; next }
NR > 1 { f[records,$1] = $2 }
END {
    for (i = 1; i <= records; ++i) {
        print f[i,"EMP_NAME"], f[i,"EMP_DES"], f[i,"EMP_DEP"], f[i,"EMP_DOJ"], f[i,"EMR_MGR"]
    }
}
A two-dimensional array is used to store all of the values that are defined for each record.
After the whole file has been processed, the loop goes through each row of the array and prints all of the values; elements that are undefined evaluate to an empty string.
Specifying the elements explicitly allows you to control the order in which they are printed. Using print rather than printf allows you to make correct use of the OFS variable, which has been set to ~, as well as the ORS, which is a newline character by default.
Thanks to @Ed for his helpful comments that pointed out some flaws in my original script.
Output:
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
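To run it, assuming the script is saved as, say, records.awk (a name chosen here for illustration):
$ awk -f records.awk file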
$ cat tst.awk
BEGIN{ FS="[=\"]+"; OFS="~" }
/\.set/ { ++numRecs; next }
{ name2val[numRecs,$1] = $2 }
!seen[$1]++ { names[++numNames] = $1 }
END {
    for (recNr=1; recNr<=numRecs; recNr++)
        for (nameNr=1; nameNr<=numNames; nameNr++)
            printf "%s%s", name2val[recNr,names[nameNr]], (nameNr<numNames?OFS:ORS)
}
$ awk -f tst.awk file
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
If you want a pre-defined order of fields in your output, rather than deriving it on the fly from the rows of each record as they're read, just populate the names[] array explicitly in the BEGIN section. If you have that situation AND don't want to hold the whole file in memory:
$ cat tst.awk
BEGIN {
    FS="[=\"]+"; OFS="~"
    numNames=split("EMP_NAME EMP_DES EMP_DEP EMP_DOJ EMR_MGR",names,/ /)
}
function prtName2val(   nameNr, i) {
    if ( length(name2val) ) {
        for (nameNr=1; nameNr<=numNames; nameNr++)
            printf "%s%s", name2val[names[nameNr]], (nameNr<numNames?OFS:ORS)
        delete name2val
    }
}
/\.set/ { prtName2val(); next }
{ name2val[$1] = $2 }
END { prtName2val() }
$ awk -f tst.awk file
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
The above uses GNU awk for length(name2val) and delete name2val; if you don't have that, then use for (i in name2val) { do stuff; break } to test for a non-empty array and split("",name2val) to clear it instead.
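A sketch of that portable variant of the function, under the same assumptions as the script above:
function prtName2val(   nameNr, nonEmpty, i) {
    nonEmpty = 0
    for (i in name2val) { nonEmpty = 1; break }  # portable test for a non-empty array
    if (nonEmpty) {
        for (nameNr=1; nameNr<=numNames; nameNr++)
            printf "%s%s", name2val[names[nameNr]], (nameNr<numNames?OFS:ORS)
        split("", name2val)  # portable way to clear the array
    }
}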
This is all I can suggest:
awk '{ t = $0; sub(/^[^"]*"/, "", t); gsub(/"[^"]*"/, "~", t); sub(/".*/, "", t); print t }' file
Or sed:
sed -re 's|^[^"]*"||; s|"[^"]*"|~|g; s|".*||' file
Output:
Rob~Developer~Sales~20-10-2010~Jack~Koster~Promotions~20-10-2011~Boua~TA~James

awk | Rearrange fields of CSV file on the basis of column value

I need your help in writing awk for the below problem. I have one source file and the required output for it.
Source File
a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
Output File
session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3
Notes:
Fields are not organised in the source file.
In the output file, fields are organised in a specific order: for example, all a values are in the 2nd column, then b, and then c.
The value c appears multiple times in the second line, so in the output its values are merged with a PIPE symbol.
Please help.
Will work in any modern awk:
$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
    split("",val) # or delete(val) in gawk
    for (i=1;i<NF;i+=2) {
        val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
    }
    for (i=1;i in order;i++) {
        name = order[i]
        printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
    }
    print ""
}
$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
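For example, to print the e values last, the split() line would become:
BEGIN{ FS="[,:]"; split("session,a,b,c,e",order) }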
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
Try with:
awk '
BEGIN {
    FS = "[,:]"
    OFS = ","
}
{
    for ( i = 1; i <= NF; i += 2 ) {
        if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
        hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
    }
    asorti( hash, hash_orig )
    for ( i = 1; i <= length(hash); i++ ) {
        printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
    }
    printf "\n"
    delete hash
    delete hash_orig
}
' infile
It splits each line on any comma or colon and traverses the odd-numbered fields, saving them and their values in a hash that is printed at the end of each line. It yields:
session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9
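Note that asorti() and calling length() on an array are GNU awk extensions, so this version requires gawk.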

Remove lines between the first and last occurrence of a marker line

Using Linux tools like awk, how can I get all lines that are not between the # NUMBERS lines in the example below? In other words, I only want the lines before the first # NUMBERS and after the last # NUMBERS.
Note: The marker is not always exactly # NUMBERS; there may be any number of spaces between the # and NUMBERS.
Input
param1=23
param2=34
param3=4
# NUMBERS
343546
3454
657
534
# NUMBERS
5454
# NUMBERS
param4=41
Expected output
param1=23
param2=34
param3=4
param4=41
The main ideas are:
Print lines seen before the first "# NUMBERS" marker;
After a marker, buffer lines instead of printing them (resetting the buffer at each new marker);
At the end, print the buffer.
So
$> cat ./printOutsideNumbers.awk
/#( )*NUMBERS/ {
    if (insideSection == 0) {
        insideSection = 1
    } else {
        sectionBuffer = ""
    }
}
! insideSection {
    print $0
}
insideSection && ! /#( )*NUMBERS/ {
    sectionBuffer = sectionBuffer (sectionBuffer == "" ? "" : "\n") $0
}
END {
    if (sectionBuffer != "") print sectionBuffer
}
And
$> awk -f ./printOutsideNumbers.awk file.data.txt
param1=23
param2=34
param3=4
param4=41
Here's an approach using tac and awk:
(
    awk '/# *NUMBERS/ { nextfile } 1' data.txt
    tac data.txt | awk '/# *NUMBERS/ { nextfile } 1' | tac
)
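If your awk lacks nextfile (a common extension, but not available everywhere), exit behaves identically here because the scripts have no END block:
(
    awk '/# *NUMBERS/ { exit } 1' data.txt
    tac data.txt | awk '/# *NUMBERS/ { exit } 1' | tac
)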
There is surely simpler code than this.
awk '/#[ ]*NUMBERS/{i++; buf=""; next} i==0{print} i>0{buf = buf $0 ORS} END{printf "%s", buf}' data.txt
Output
param1=23
param2=34
param3=4
param4=41
Try the next awk command. It divides the file on the '# NUMBERS' marker (using a regular expression as the record separator, which requires GNU awk) and prints the first and last records, trimming surrounding whitespace.
awk 'BEGIN { RS = "#[[:space:]]+NUMBERS"; }
FNR == 1 { sub(/^[[:space:]]*/, "", $0); sub(/[[:space:]]*$/, "", $0); print }
END { sub(/^[[:space:]]*/, "", $0); sub(/[[:space:]]*$/, "", $0); print }' infile
Output:
param1=23
param2=34
param3=4
param4=41
An interesting "greedy" line-matching question.
The "tac" one is nice and dirty :) Mine is much like #3:
BEGIN { buffer = ""; ignore = 0 }
/^# *NUMBERS/ {
    if (ignore == 0) {
        printf "%s", buffer
        ignore += 1
    }
    buffer = ""
    next
}
NF > 0 {
    buffer = sprintf("%s%s\n", buffer, $0)
}
END { printf "%s", buffer }
I think FS="\n" with RS="" and a greedy regex would also do the trick, but that would be sed-ish, not awk-ish...
This might work for you:
sed '1{h;d};H;${x;s/[^\n]*\(NUMBERS\).*\1.*\n//;p};d' file.data.txt
Explanation:
Slurp the file into sed's hold space; then, at end of file, swap the hold space into the pattern space, delete everything from the first NUMBERS line through the last, and print the remainder.
Or this:
sed '/NUMBERS/=;d' file.data.txt |
sed -n '1h;${x;G;s/\n/,/;s,.*,sed &d file.data.txt,p}' | sh
Explanation:
Note line numbers of NUMBERS. Use the first and last addresses to build a sed script to delete the unwanted lines. Pass the script to the shell to run.
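With the sample input the markers fall on lines 4, 9 and 11, so the command that the second sed builds and hands to the shell is simply:
sed 4,11d file.data.txt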