infile dlm='##' truncates an email address - file-io

I am trying to use:
infile dlm='##' dsd missover;
to copy SAS code to a new location, but it truncates an email address (the SAS code contains an address such as abc#xyz.com): only the username 'abc' shows up in the new code, and the '#xyz.com' part is cut off.
So I removed the infile option
dlm='##'
and re-ran the code. The email address was then read correctly, but some regular lines went missing.
Are there infile options I can use to read all the lines correctly and also read the email address intact?
Thanks!
An example:
*91,87,95 abc#xyz.com test hudpiwaHUOV0
97,,92% bmno[aej0i34hmbtgkoersw934bnrtui9sdobn vnbud9rw0aq598vnfjipa
njuio9rpep0snhtui9es000
from="mjerrt_thpian#wedoo.com"
fjsui123,1,1 0 ;
data a;
  infile "/.../email.xlsx"
    missover dsd lrecl=32767 firstobs=1; * dlm='#'; * delimiter = '##';
  informat all $char50. ;
  input all $ ;
  pk=_n_;
run;

Looks like your data is using space as the delimiter.
Let's convert your example text into a file so we have something to test against.
filename txt temp;
options parmcards=txt;
parmcards4;
*91,87,95 abc#xyz.com test hudpiwaHUOV0
97,,92% bmno[aej0i34hmbtgkoersw934bnrtui9sdobn vnbud9rw0aq598vnfjipa
njuio9rpep0snhtui9es000
from="mjerrt_thpian#wedoo.com"
fjsui123,1,1 0 ;
;;;;
Now we can read the file and parse it into the individual "words".
data parse ;
  infile txt dlm=' ' length=llen column=ccol ;
  lineno+1;
  do wordno=1 by 1 until(ccol>llen);
    length word $200 ;
    input word @ ;  * trailing @ holds the line for the next INPUT ;
    output;
  end;
run;
Results:
Obs  lineno  wordno  word
  1       1       1  *91,87,95
  2       1       2  abc#xyz.com
  3       1       3  test
  4       1       4  hudpiwaHUOV0
  5       2       1  97,,92%
  6       2       2  bmno[aej0i34hmbtgkoersw934bnrtui9sdobn
  7       2       3  vnbud9rw0aq598vnfjipa
  8       3       1  njuio9rpep0snhtui9es000
  9       4       1  from="mjerrt_thpian#wedoo.com"
 10       5       1  fjsui123,1,1
 11       5       2  0
 12       5       3  ;
If you add the DSD option to the INFILE statement you will get more words since adjacent (or leading) spaces will indicate an empty word.

Use
infile 'email.xlsx' dlm='00'x;
if you really need no delimiter ('00'x is a hexadecimal null byte, which is very unlikely to occur in a text file, so nothing gets split).
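As a cross-check outside SAS, the same space-delimited word split can be reproduced with a short awk one-liner; a minimal sketch, assuming the sample text has been saved to a (hypothetical) file email.txt:
$ awk '{ for (i = 1; i <= NF; i++) print NR, i, $i }' email.txt
This prints the same lineno/wordno/word triples as the PARSE data set above, since awk's default field splitting also treats runs of spaces as one separator.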

Related

awk with empty field in columns

Here is my file.dat (the spacing is significant; the separators are described below):
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
Running awk '{print $2}' file.dat gives:
A
2
4
7
U
But I would like to keep the empty fields:
A

4

U
How can I do it?
I must add that:
between column 1 and column 2 there is a 3-whitespace field separator
between column 2 and column 3, and between column 3 and column 4, there is a one-whitespace field separator
So in column 2 there are 2 fields missing (lines 2 and 4), and in column 4 there are also 2 fields missing (lines 3 and 5).
If this isn't all you need:
$ awk -F'[ ]' '{print $4}' file
A

4

U
then edit your question to provide a more truly representative example and clearer requirements. (With -F'[ ]' every single space is a field separator, so consecutive spaces produce empty fields instead of being collapsed as they are with the default FS.)
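The difference is easy to demonstrate: the default FS strips leading blanks and treats runs of blanks as one separator, while an explicit single-space FS preserves empty fields. For example (note the two spaces between a and b):
$ echo 'a  b' | awk '{ print NF }'
2
$ echo 'a  b' | awk -F'[ ]' '{ print NF }'
3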
If the input is fixed-width columns, you can use substr to extract the slice you want. I have assumed that you want a single character at index 5:
awk '{ print(substr($0,5,1)) }' file
Your awk code is missing a field separator.
Your example file doesn't clearly show what that field separator is.
From observation your file appears to have 5 columns.
You need to determine what your field separator is first.
This example code expects \t, which means <TAB>, as the field separator.
awk -F'\t' '{print $3}' OFS='\t' file.dat
This outputs the 3rd column from the file: -F'\t' is the 'read in' field separator, and OFS='\t' is the 'read out' one.
A
4
U
For GNU awk. It processes the file twice. On the first pass it examines all records to find the string indexes that hold only space in every record, and treats those continuous space sequences as separator strings, building up the FIELDWIDTHS variable. On the second pass it uses that variable for fixed-width processing of the data.
The a[i] entries get values 0/1, and h (header) with this input will be 100010101, which leads to FIELDWIDTHS="4 2 2 1":
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
|   | | |
100010101  - while(match(h,/10*/))
\  /|/|/|
 4  2 2 1  - FIELDWIDTHS="4 2 2 1"
Script:
$ awk '
NR==FNR {                            # first pass
    for(i=1;i<=length;i++)           # all record chars
        a[i]=((a[i]!~/^(0|)$/) || substr($0,i,1)!=" ")  # keep track of all space places
    if(--i>m)
        m=i                          # max record length...
    next
}
BEGINFILE {
    if(NR!=0) {                      # only do this once
        for(i=1;i<=m;i++)            # ... used here
            h=h a[i]                 # h=100010101
        while(match(h,/10*/)) {      # build FIELDWIDTHS
            FIELDWIDTHS=FIELDWIDTHS " " RLENGTH  # qnd
            h=substr(h,RSTART+RLENGTH)
        }
    }
}
{
    print $2                         # and output
}' file file
And output:
A

4

U
You need to trim off the space from the fields, though.
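Once the widths are known, the second pass can also be run standalone with the trimming built in; a minimal sketch, assuming the FIELDWIDTHS="4 2 2 1" computed above (GNU awk only):
$ gawk 'BEGIN { FIELDWIDTHS = "4 2 2 1" }   # the widths derived above
        { gsub(/ +$/, "", $2)               # strip the padding spaces
          print $2 }' file
A

4

U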

Counting the number of lines in each column

Is it possible to count the number of lines in each column of a file? For example, I've been trying to use awk to separate columns on the semi-colon symbol, specify each column individually, and use the wc command to count any and all occurrences within that column.
With the command below I am trying to find the number of items in column 3 without counting blank entries. Unfortunately, this command just counts the entire file. I could copy the column to a different file and count that file, but I just want to know if there is a much quicker way of going about this?
awk -F ';' '{print $3}' file.txt | wc -l
Data file format
; 1 ; 2 ; 3 ; 4 ; 5 ; 6 ;
; 3 ; 4 ; 5 ; 6 ; ; 4 ;
; ; 3 ; 5 ; 6 ; 9 ; 8 ;
; 1 ; 6 ; 3 ; ; ; 4 ;
; 2 ; 3 ; ; 3 ; ; 5 ;
Example output wanted
Column 1 = 4 (i.e. the four non-blank values 1, 3, 1, 2)
Column 2 = 5
Column 3 = 4
Column 4 = 4
Column 5 = 2
Column 6 = 5
Keep separate counts for each field using an array, then print the totals when you're done:
$ awk -F' *; *' '{ for (i = 2; i < NF; ++i) if ($i != "") ++count[i] }
END { for (i = 2; i < NF; ++i) print "Column", i-1, "=", count[i] }' file
Column 1 = 4
Column 2 = 5
Column 3 = 4
Column 4 = 4
Column 5 = 2
Column 6 = 5
Set the field separator to consume the semicolons as well as any surrounding spaces.
Loop through each field (except the first and last ones, which are always empty) and increment a counter for non-empty fields.
It would be tempting to use if ($i), but this would fail for a column containing a 0.
Print the counts in the END block, offsetting by -1 to start from 1 instead of 2.
One assumption made here is that the number of columns in each line is uniform throughout the file, so that NF from the last line can safely be used in the END block.
A slight variation, using a simpler field separator:
$ awk -F';' '{ for (i = 2; i < NF; ++i) count[i] += ($i ~ /[^ ]/) }
END { for (i = 2; i < NF; ++i) print "Column", i-1, "=", count[i] }' file
$i ~ /[^ ]/ evaluates to 1 if any non-space characters exist in the ith field, and to 0 otherwise.
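Both versions assume a uniform column count per line (see the note above). If the number of columns might vary from line to line, a defensive sketch would track the maximum field count instead of relying on NF in the END block:
$ awk -F' *; *' '{ if (NF > maxnf) maxnf = NF
                   for (i = 2; i < NF; ++i) if ($i != "") ++count[i] }
             END { for (i = 2; i < maxnf; ++i) print "Column", i-1, "=", count[i]+0 }' file
The count[i]+0 coerces never-incremented counters to 0, so sparse columns still print a total.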

Count a number of records in a field using awk

I have a big file like this one:
C1,C2,C3
C1,C2
C5
C3,C5
I expect output like this:
C1,C2,C3 3
C1,C2 2
C5 1
C3,C5 2
I would like to do this using the shell. Could you help me, please?
Thank you
Something like
awk 'BEGIN{FS=","}{printf "%-20s\t%d\n",$0,NF;}' file
should give
C1,C2,C3             3
C1,C2                2
C5                   1
C3,C5                2
Note: you need to adjust the width according to the maximum length of your lines.
Another in awk:
$ awk '{
    m=(m<(n=length($0))?n:m)    # get the max record length
    a[NR]=$0 }                  # hash to a
END {
    for(i=1;i<=NR;i++)          # iterate and (below) output nicely
        printf "%s%"(m-length(a[i])+4)"s\n",a[i],gsub(/,/,"&",a[i])+1 }
' file
C1,C2,C3   3
C1,C2      2
C5         1
C3,C5      2
If you want to change the distance between the fields, toy with that +4 in the printf. (gsub(/,/,"&",a[i]) replaces every comma with itself and returns the number of replacements, so gsub(...)+1 is the number of fields.)
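The other trick in that printf is that the format string itself is assembled at run time by concatenating "%s%", the computed width, and "s\n". A stripped-down illustration of the same idea (the 8 stands in for the max line length):
$ echo 'C1,C2' | awk '{ w = 8 - length($0) + 4      # pad so the count lands in a fixed column
                        printf "%s%" w "s\n", $0, gsub(/,/,"&") + 1 }'
C1,C2      2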

AWK - sum particular fields after match

I have a txt file that is tens to hundreds of lines long, and I need to sum a particular field in each line (and output it) if a preceding field matches.
Here is an example dataset:
Sample4;6a0f64d2;size=1;,Sample4;f1cb4733a;size=6;,Sample3;aa44410feb29210c1156;size=2;
Sample2;5b91bef2329bd87f4c7;size=2;,Sample1;909cd4e2940f328b3;size=2;
The structure is
<sample ID>;<random id>;size=<numeric>;, then the next entry. There could be hundreds of entries in a line (this is just a small example).
Basically, I want to sum the "size" numbers for each entry across a line (entries separated by ','), but only those that match a particular sample identifier (e.g. Sample4).
So, if we wanted to match just the 'Sample4' entries, the script would produce this:
awk '{some-code for sample4}' example.txt
7
0
Because the entries with 'Sample4' add up to 7 in line 1, while in line 2 there are no matching Sample4 entries.
This could be done for each sample ID or, ideally, for all sample IDs provided in a list (perhaps a simple file, one sample ID per line), which would then output the sums for each row, with each sample ID having its own column; e.g. for the example file above, the results of the script would be:
Sample1  Sample2  Sample3  Sample4
0        0        2        7
2        2        0        0
Any ideas on how to get started?
Thanks!
Another awk:
awk -F';' '{for(i=1;i<NF-1;i+=3) {
                split($(i+2),e,"=")   # e[2] holds the numeric size value
                sub(/,/,"",$i)        # strip the comma glued to the sample ID
                header[$i]            # remember every sample ID seen
                a[$i,NR]+=e[2]}}      # per-sample, per-line sums
     END {for(h in header) printf "%s", h OFS
          print ""
          for(i=1;i<=NR;i++) {
              for(h in header) printf "%s", a[h,i]+0 OFS
              print ""}}' file | column -t
Sample1  Sample2  Sample3  Sample4
0        0        2        7
2        2        0        0
PS: the order of the columns is not guaranteed.
Explanation
To simplify parsing I used ';' as the delimiter and got rid of the ',' before the names. Using that structure, the sum of the size values is accumulated per name and per line in the multi-dimensional array a, while the header array separately keeps track of all names seen. Once the lines are consumed, the END block prints the header and then, for each line, the value for each corresponding name (or 0 if missing). Pretty-print with column -t.
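If a deterministic column order matters, GNU awk can make every for (h in header) loop visit keys in sorted order via PROCINFO["sorted_in"]; a small gawk-only demonstration of the mechanism:
$ echo x | gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }
    { h["Sample4"]; h["Sample1"]; h["Sample3"]
      for (k in h) printf "%s ", k
      print "" }'
Sample1 Sample3 Sample4
Setting that one variable in a BEGIN block of the answer above would pin down its column order.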
If I am understanding this correctly, you can do:
$ awk '{ split($0,samp,/,/)                 # one entry per array element
         for (i=1; i in samp; i++){
             sub(/;$/, "", samp[i])         # drop the trailing semicolon
             split(samp[i], fields, /;/)    # fields[1]=sample ID, fields[3]=size=<n>
             split(fields[3], ns, /=/)
             data[fields[1]]+=ns[2]
         }
         printf "For line %s:\n", NR
         for (e in data)
             print e, data[e]
         split("", data)                    # reset the sums for the next line
       }' file
Prints:
For line 1:
Sample3 2
Sample4 7
For line 2:
Sample1 2
Sample2 2

How can I use sed or awk to delete lines matching certain field criteria?

I have the following kind of data:
1 abc xyz - - 2 mno
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
I would like to delete the lines which satisfy both of the following conditions:
the 4th & 5th columns are blank (shown here as -)
the row number of the corresponding line is not contained in the 6th column of any other line
In this case line 1 should be deleted. How could I do it in sed/awk, or in whatever scripting language is most suitable for this case?
Maybe something like this could work:
awk 'NR==FNR{a[$6];next}
($4 ~ /[- ]/ && $5 ~ /[- ]/) && !($1 in a){next}1' file file
Condition:
If Column 4 and Column 5 are blank AND the row number is not present in Column 6 of any line, we skip that line; we print everything else.
Explanation:
We use the NR and FNR built-in variables and pass the same file twice. In the first pass, we scan through the file and store Column 6 in an array. next prevents the second pattern{action} statement from running while the first copy of the file is being read. Once the file has been read completely, we test the same file against your condition: if Column 4 and Column 5 are blank, we look up the row number, and if it is not in the array we skip the line using next; otherwise we print it.
Test:
[jaypal:~/Temp] cat file
1 abc xyz - - 2 mno
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
[jaypal:~/Temp] awk 'NR==FNR{a[$6];next} ($4 ~ /[- ]/ && $5 ~ /[- ]/) && !($1 in a){next}1' file file
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
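If reading the file twice is not an option (say, the data arrives on a pipe), a single-pass sketch that buffers the lines in memory instead:
awk '{ seen[$6]                                # remember every value of column 6
       line[NR] = $0
       blank[NR] = ($4 == "-" && $5 == "-") }  # mark lines with blank 4th/5th columns
 END { for (i = 1; i <= NR; i++)
           if (!blank[i] || (i in seen)) print line[i] }' file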
A possible solution using perl:
Content of script.pl:
use warnings;
use strict;

## Accept one argument, the input file.
@ARGV == 1 or die qq[Usage: perl $0 input-file\n];

my ($lines, %hash);

## Process file.
while ( <> ) {
    ## Remove leading and trailing spaces from each line.
    s/^\s*//;
    s/\s*$//;
    ## Get both indexes.
    my ($idx1, $idx2) = (split)[0,5];
    ## Save line and index1.
    push @{$lines}, [$_, $idx1];
    ## Save index2.
    $hash{ $idx2 } = 1;
}

## Process the lines a second time.
for ( @{$lines} ) {
    ## Get fields of the line.
    my @f = split /\s+/, $_->[0];
    ## If the fourth and fifth fields are empty (-) and the first index does not
    ## exist as a second index, go to the next line without printing.
    if ( $f[3] eq qq[-] && $f[4] eq qq[-] && ! exists $hash{ $_->[1] } ) {
        next;
    }
    ## Print line.
    printf qq[%s\n], $_->[0];
}
Run the script (infile has the data to process):
perl script.pl infile
And results:
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
This might work for you:
sed -rn 's/^.*(\S+)\s+\S+$/\1/;H;${x;s/^|\n/:/gp}' file |
sed -r '1{h;d};/^(\s*\S*){3}\s*-\s*-/{G;/^\s*(\S*).*:\1:/!d;s/\n.*//}' - file
2 lnm dse - - 3 pqr
3 ebe aaa xhd asw 4 pow
4 abc fww wrw ffp 3 ffw
Explanation:
Read the file and build a look-up table from column 6, delimited by :
Read the table (first line) into the hold space (HS) and then read the file again.
When columns 4 and 5 contain - only:
Append the look-up table to the pattern space (PS).
Do a look-up using the first column as the key and, if it fails, delete that line.
For all remaining lines remove the look-up table.