How can you tell which characters are in which character classes? - awk

We often see questions about why a string doesn't match a regexp. Sometimes the answer is that the asker expected a character to be part of a character class when it isn't, or used a shorthand for a character class (e.g. \d for [[:digit:]]) that exists in some other tool but simply isn't part of the awk language. With that in mind, I'm creating a canonical answer to the question of which characters exist in which character classes in awk.

The following script generates the set of chars in each character class (plus the \s, \S, \w, and \W extensions if your awk supports them) for your locale, for the chars in the numeric range 0-127 as listed in the first table at http://www.asciitable.com/ and at https://en.wikipedia.org/wiki/ASCII. For the horizontal tab character, as output by print "\t", the first reference uses the abbreviation TAB and the second uses HT; I prefer TAB so I used it below. Both use Space to represent the char output by print " ", so I did that too, even though I more commonly refer to it as a "blank char":
$ cat prtCharClasses.awk
# From the gawk manual, https://www.gnu.org/software/gawk/manual/gawk.html#Bracket-Expressions:
# [:alnum:] Alphanumeric characters
# [:alpha:] Alphabetic characters
# [:blank:] Space and TAB characters
# [:cntrl:] Control characters
# [:digit:] Numeric characters
# [:graph:] Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
# [:lower:] Lowercase alphabetic characters
# [:print:] Printable characters (characters that are not control characters)
# [:punct:] Punctuation characters (characters that are not letters, digits, control characters, or space characters)
# [:space:] Space characters (these are: space, TAB, newline, carriage return, formfeed and vertical tab)
# [:upper:] Uppercase alphabetic characters
# [:xdigit:] Characters that are hexadecimal digits
# \s Matches any whitespace character. Think of it as shorthand for ‘[[:space:]]’.
# \S Matches any character that is not whitespace. Think of it as shorthand for ‘[^[:space:]]’.
# \w Matches any word-constituent character—that is, it matches any letter, digit, or underscore. Think of it as shorthand for ‘[[:alnum:]_]’.
# \W Matches any character that is not word-constituent. Think of it as shorthand for ‘[^[:alnum:]_]’.
BEGIN {
    asciiMax = (asciiMax == "" ? 127 : asciiMax)
    numClasses = split("\
        [[:alpha:]] \
        [[:digit:]] \
        [[:alnum:]] \
        [[:lower:]] \
        [[:upper:]] \
        [[:xdigit:]] \
        [[:punct:]] \
        [[:cntrl:]] \
        [[:graph:]] \
        [[:print:]] \
        [[:blank:]] \
        [[:space:]] \
        \\s \
        \\S \
        \\w \
        \\W \
    ", classes)
    # Map the control chars and white space in the 0-127 range to
    # their abbreviations to make them visible in the output:
    split("NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US Space", map)
    map[128] = "DEL"
    for (asciiNr=0; asciiNr<=asciiMax; asciiNr++) {
        char = sprintf("%c", asciiNr)
        chars[++numChars] = char
    }
    for (classNr in classes) {
        class = classes[classNr]
        for (charNr in chars) {
            char = chars[charNr]
            if ( char ~ class ) {
                classChars[classNr,charNr]
            }
        }
    }
    for (classNr=1; classNr<=numClasses; classNr++) {
        class = classes[classNr]
        printf "%-12s =", class
        for (charNr=1; charNr<=numChars; charNr++) {
            if ( (classNr,charNr) in classChars ) {
                char = chars[charNr]
                printf " %s", (charNr in map ? map[charNr] : char)
            }
        }
        print ""
    }
}
Here is its output for chars 0-127 in the C locale. If you have a different locale then the output will be different, so run the above script to see what it is in your locale:
$ awk -f prtCharClasses.awk file
[[:alpha:]] = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
[[:digit:]] = 0 1 2 3 4 5 6 7 8 9
[[:alnum:]] = 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
[[:lower:]] = a b c d e f g h i j k l m n o p q r s t u v w x y z
[[:upper:]] = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
[[:xdigit:]] = 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
[[:punct:]] = ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:cntrl:]] = NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US DEL
[[:graph:]] = ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
[[:print:]] = Space ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
[[:blank:]] = TAB Space
[[:space:]] = TAB LF VT FF CR Space
\s = TAB LF VT FF CR Space
\S = NUL SOH STX ETX EOT ENQ ACK BEL BS SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ DEL
\w = 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
\W = NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US Space ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | } ~ DEL
Note that \s, \S, \w, and \W are extensions only available in some tools, e.g. GNU awk. \d and \D are not present above: those are shorthands for [[:digit:]] and [^[:digit:]] in some tools that support PCREs, but that does not include any variant of awk. If you want a shorthand for [[:digit:]] then [0-9] appears to be portable across locales, but I stand to be corrected.
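To illustrate the note above, here is a minimal sketch that tests whether a string consists only of digits using the bracket expression [0-9]. The helper name all_digits is made up for this example. In a POSIX awk, swapping in [[:digit:]] should behave the same; \d is deliberately not used because awk does not define it.

```shell
# all_digits is a hypothetical helper, not part of any awk distribution.
# It prints "yes" if its argument is one or more digits, otherwise "no".
all_digits() {
    printf '%s\n' "$1" |
    awk '{ print (($0 ~ /^[0-9]+$/) ? "yes" : "no") }'
}

all_digits 2021     # digits only
all_digits 20x21    # contains a letter
```

The regexp is anchored with ^ and $ so that the whole string has to be digits, not just some part of it.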
If you need to see the chars past number 127, then you can set asciiMax on the command line, e.g.:
$ awk -v asciiMax=255 -f prtCharClasses.awk
[[:alpha:]] = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
[[:digit:]] = 0 1 2 3 4 5 6 7 8 9
[[:alnum:]] = 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
[[:lower:]] = a b c d e f g h i j k l m n o p q r s t u v w x y z µ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
[[:upper:]] = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ
[[:xdigit:]] = 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
[[:punct:]] = ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © « ¬ ® ¯ ° ± ² ³ ´ ¶ · ¸ ¹ » ¼ ½ ¾ ¿ × ÷
[[:cntrl:]] = NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US DEL
[[:graph:]] = ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
[[:print:]] = Space ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
[[:blank:]] = TAB Space  
[[:space:]] = TAB LF VT FF CR Space  
\s = TAB LF VT FF CR Space  
\S = NUL SOH STX ETX EOT ENQ ACK BEL BS SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ DEL ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
\w = 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
\W = NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US Space ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | } ~ DEL   ¡ ¢ £ ¤ ¥ ¦ § ¨ © « ¬ ­ ® ¯ ° ± ² ³ ´ ¶ · ¸ ¹ » ¼ ½ ¾ ¿ × ÷

Related

Sed command issue

I have this file inside a MariaDB that looks like this:
name callerid secret context type host
1000 Omar Al-Ani <1000> op1000DIR MANAGEMENT friend dynamic
1001 Ammar Zigderly <1001> 1001 MANAGEMENT peer dynamic
1002 Lubna COO Office <1002> 1002 ELdefault peer dynamic
I want to convert it, using sed and awk, to look like this:
[1000]
callerid=Omar Al-Ani <1000>
secret=op1000DIR
context=MANAGEMENT
type=friend
host=dynamic
[1001]
callerid=Ammar Zigderly <1001>
secret=1001
context=MANAGEMENT
type=peer
host=dynamic
[1002]
callerid=Lubna COO Office <1002>
secret=1002
context=ELdefault
type=peer
host=dynamic
This is the output of the command head -3 filename | od -c on the input file:
0000000 n a m e \t c a l l e r i d \t s e
0000020 c r e t \t c o n t e x t \t t y p
0000040 e \t h o s t \n 1 0 0 0 \t O m a
0000060 r A l - A n i < 1 0 0 0 >
0000100 \t o p 1 0 0 0 D I R \t M A N A
0000120 G E M E N T \t f r i e n d \t d y
0000140 n a m i c \n 1 0 0 1 \t A m m
0000160 a r Z i g d e r l y < 1 0 0
0000200 1 > \t 1 0 0 1 \t M A N A G E
0000220 M E N T \t p e e r \t d y n a m i
0000240 c \n
0000243
Any idea would be helpful!
I think awk is going to be a bit simpler and easier (?) to modify if requirements change:
awk -F'\t' '
BEGIN { labels[2]="callerid"
        labels[3]="secret"
        labels[4]="context"
        labels[5]="type"
        labels[6]="host"
      }
FNR>1 { gsub(/ /,"",$1)                 # remove spaces from 1st column
        printf "[%s]\n",$1
        for (i=2;i<=6;i++)
            printf "\t%s=%s\n", labels[i],$i
        print ""
      }
' names.dat
This generates:
[1000]
callerid=Omar Al-Ani <1000>
secret=op1000DIR
context=MANAGEMENT
type=friend
host=dynamic
[1001]
callerid=Ammar Zigderly <1001>
secret=1001
context=MANAGEMENT
type=peer
host=dynamic
[1002]
callerid=Lubna COO Office <1002>
secret=1002
context=ELdefault
type=peer
host=dynamic
Assuming tab-separated fields:
$ awk -F'\t' 'NR==1 {split($0,h); next}
{print "[" $1 "]";
for(i=2;i<=NF;i++) print "\t" h[i] ":" $i}' file.tcv
[1000]
callerid:Omar Al-Ani <1000>
secret:op1000DIR
context:MANAGEMENT
type:friend
host:dynamic
[1001]
callerid:Ammar Zigderly <1001>
secret:1001
context:MANAGEMENT
type:peer
host:dynamic
[1002]
callerid:Lubna COO Office <1002>
secret:1002
context:ELdefault
type:peer
host:dynamic
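As a further sketch that combines the two answers above, the labels can be taken from the header line while still printing label=value as the question asks. It assumes tab-separated input; the function name tsv_to_ini is made up for this example, and it prints no indentation or blank lines so the result matches the format requested in the question.

```shell
# tsv_to_ini is a hypothetical name used only for this sketch.
# Reads tab-separated text on stdin; the first line supplies the labels.
tsv_to_ini() {
    awk -F'\t' '
        NR == 1 {                               # header line: remember the labels
            for (i = 1; i <= NF; i++) label[i] = $i
            next
        }
        {
            printf "[%s]\n", $1                 # section name from the 1st column
            for (i = 2; i <= NF; i++)
                printf "%s=%s\n", label[i], $i  # label=value for the other columns
        }
    '
}

printf 'name\tcallerid\tsecret\n1000\tOmar Al-Ani <1000>\top1000DIR\n' | tsv_to_ini
```

Because the labels come from the header, adding or renaming a column in the input needs no change to the script.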

AWK - Adding new line based on text in previous line?

So I have some database table info in a file that looks like this:
2.6 G 7.7 G abc-def-ghi_2021-09-19_random_letters_random_numbers
2.6 G 7.7 G abc-def-ghi_2021-09-20_random_letters_random_numbers
2.6 G 7.8 G abc-def-ghi_2021-09-21_random_letters_random_numbers
18.9 G 56.8 G def-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-27_random_letters_random_numbers
110.4 M 331.2 M ghi-abc-def_2021-09-28_random_letters_random_numbers
55.1 K 165.3 K jkl-ghi-def_2021-09-20_random_letters_random_numbers
50.7 K 152.1 K jkl-ghi-def_2021-09-24_random_letters_random_numbers
49.6 K 148.8 K jkl-ghi-def_2021-09-25_random_letters_random_numbers
48.6 K 138.8 K jkl-ghi-def_2021-09-26_random_letters_random_numbers
Basically, I need it to look like this:
2.6 G 7.7 G abc-def-ghi_2021-09-19_random_letters_random_numbers
2.6 G 7.7 G abc-def-ghi_2021-09-20_random_letters_random_numbers
2.6 G 7.8 G abc-def-ghi_2021-09-21_random_letters_random_numbers

18.9 G 56.8 G def-abc-def_2021-09-21_random_letters_random_numbers

110.3 M 331.0 M ghi-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-27_random_letters_random_numbers
110.4 M 331.2 M ghi-abc-def_2021-09-28_random_letters_random_numbers

55.1 K 165.3 K jkl-ghi-def_2021-09-20_random_letters_random_numbers
50.7 K 152.1 K jkl-ghi-def_2021-09-24_random_letters_random_numbers
49.6 K 148.8 K jkl-ghi-def_2021-09-25_random_letters_random_numbers
48.6 K 138.8 K jkl-ghi-def_2021-09-26_random_letters_random_numbers
That is, with a blank line before the start of each new table-name prefix. Right now I'm having to do all of this manually for hundreds of table names. Also, if there is a way to count how many times each table name occurs, that would be great too.
Here is the code I have so far, from @Cyrus:
awk 'BEGIN{FS="[ _]+"} NR==1{last=$(NF-1)} NR>1 && last!=$(NF-1){printf RS} {last=$(NF-1); print}' test2.txt
Here is the output
2.6 G 7.7 G abc-def-ghi_2021-09-19_random_letters_random_numbers
2.6 G 7.7 G abc-def-ghi_2021-09-20_random_letters_random_numbers
2.6 G 7.8 G abc-def-ghi_2021-09-21_random_letters_random_numbers
18.9 G 56.8 G def-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-27_random_letters_random_numbers
110.4 M 331.2 M ghi-abc-def_2021-09-28_random_letters_random_numbers
55.1 K 165.3 K jkl-ghi-def_2021-09-20_random_letters_random_numbers
50.7 K 152.1 K jkl-ghi-def_2021-09-24_random_letters_random_numbers
49.6 K 148.8 K jkl-ghi-def_2021-09-25_random_letters_random_numbers
48.6 K 138.8 K jkl-ghi-def_2021-09-26_random_letters_random_numbers
This command works for table names like these:
2.6 G 7.7 G abc-def-ghi_2021-09-19
2.6 G 7.7 G abc-def-ghi_2021-09-20
2.6 G 7.8 G abc-def-ghi_2021-09-21
18.9 G 56.8 G def-abc-def_2021-09-21
110.3 M 331.0 M ghi-abc-def_2021-09-21
110.3 M 331.0 M ghi-abc-def_2021-09-27
110.4 M 331.2 M ghi-abc-def_2021-09-28
55.1 K 165.3 K jkl-ghi-def_2021-09-20
50.7 K 152.1 K jkl-ghi-def_2021-09-24
49.6 K 148.8 K jkl-ghi-def_2021-09-25
48.6 K 138.8 K jkl-ghi-def_2021-09-26
Seems like you can just do:
awk 'last && $5 != last { print count; count=0 } {last = $5; count++ } 1' FS='[ _]*'
e.g.:
$ cat input
2.6 G 7.7 G abc-def-ghi_2021-09-19_random_letters_random_numbers
2.6 G 7.7 G abc-def-ghi_2021-09-20_random_letters_random_numbers
2.6 G 7.8 G abc-def-ghi_2021-09-21_random_letters_random_numbers
18.9 G 56.8 G def-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-27_random_letters_random_numbers
110.4 M 331.2 M ghi-abc-def_2021-09-28_random_letters_random_numbers
55.1 K 165.3 K jkl-ghi-def_2021-09-20_random_letters_random_numbers
50.7 K 152.1 K jkl-ghi-def_2021-09-24_random_letters_random_numbers
49.6 K 148.8 K jkl-ghi-def_2021-09-25_random_letters_random_numbers
48.6 K 138.8 K jkl-ghi-def_2021-09-26_random_letters_random_numbers
$ awk 'last && $5 != last { print count; count=0 } {last = $5; count++ } 1' FS='[ _]*' input
2.6 G 7.7 G abc-def-ghi_2021-09-19_random_letters_random_numbers
2.6 G 7.7 G abc-def-ghi_2021-09-20_random_letters_random_numbers
2.6 G 7.8 G abc-def-ghi_2021-09-21_random_letters_random_numbers
3
18.9 G 56.8 G def-abc-def_2021-09-21_random_letters_random_numbers
1
110.3 M 331.0 M ghi-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-27_random_letters_random_numbers
110.4 M 331.2 M ghi-abc-def_2021-09-28_random_letters_random_numbers
3
55.1 K 165.3 K jkl-ghi-def_2021-09-20_random_letters_random_numbers
50.7 K 152.1 K jkl-ghi-def_2021-09-24_random_letters_random_numbers
49.6 K 148.8 K jkl-ghi-def_2021-09-25_random_letters_random_numbers
48.6 K 138.8 K jkl-ghi-def_2021-09-26_random_letters_random_numbers
$ awk -F'[[:space:]_]+' '(NR>1) && ($5 != prev){print ""} {print; prev=$5}' file
2.6 G 7.7 G abc-def-ghi_2021-09-19_random_letters_random_numbers
2.6 G 7.7 G abc-def-ghi_2021-09-20_random_letters_random_numbers
2.6 G 7.8 G abc-def-ghi_2021-09-21_random_letters_random_numbers

18.9 G 56.8 G def-abc-def_2021-09-21_random_letters_random_numbers

110.3 M 331.0 M ghi-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-27_random_letters_random_numbers
110.4 M 331.2 M ghi-abc-def_2021-09-28_random_letters_random_numbers

55.1 K 165.3 K jkl-ghi-def_2021-09-20_random_letters_random_numbers
50.7 K 152.1 K jkl-ghi-def_2021-09-24_random_letters_random_numbers
49.6 K 148.8 K jkl-ghi-def_2021-09-25_random_letters_random_numbers
48.6 K 138.8 K jkl-ghi-def_2021-09-26_random_letters_random_numbers
$ awk -F'[[:space:]_]+' '(NR>1) && ($5 != prev){print cnt; cnt=0} {print; prev=$5; cnt++} END{if (cnt) print cnt}' file
2.6 G 7.7 G abc-def-ghi_2021-09-19_random_letters_random_numbers
2.6 G 7.7 G abc-def-ghi_2021-09-20_random_letters_random_numbers
2.6 G 7.8 G abc-def-ghi_2021-09-21_random_letters_random_numbers
3
18.9 G 56.8 G def-abc-def_2021-09-21_random_letters_random_numbers
1
110.3 M 331.0 M ghi-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-27_random_letters_random_numbers
110.4 M 331.2 M ghi-abc-def_2021-09-28_random_letters_random_numbers
3
55.1 K 165.3 K jkl-ghi-def_2021-09-20_random_letters_random_numbers
50.7 K 152.1 K jkl-ghi-def_2021-09-24_random_letters_random_numbers
49.6 K 148.8 K jkl-ghi-def_2021-09-25_random_letters_random_numbers
48.6 K 138.8 K jkl-ghi-def_2021-09-26_random_letters_random_numbers
4
The if (cnt) in the END section is just so you don't print a null string or zero if the input file is empty.
Assumptions:
all input lines have a total of 5 space-delimited fields that look like: #.# {G,K,M} #.# {G,K,M} <table-name-prefix>_YYYY-MM-DD_random_letters_random_numbers
input file has already been sorted by <table-name-prefix>
objective is to add a blank line to the output before we process a line with a new/different <table-name-prefix>
OP mentions keeping track of how many times each unique <table-name-prefix> is seen, but since there's no mention of what to do with said number I'll just print it on the new 'blank' line (for now)
expected output is to look like the 2nd block of data provided in the question (with proviso that <table-name-prefix> counts are placed in the 'blank' lines)
One awk idea:
awk '{ split($5,arr,"_")                               # split field #5 on "_" delimiter
       tabname=arr[1]                                  # grab table name
       if (tabname != prevname && prevname != "" ) {   # if this is a new table name then ...
           printf "%s\n", count                        # print a new line with a count of the last table name and then ...
           count=0
       }
       print                                           # print current line
       count++                                         # increment count for table name
       prevname = tabname                              # keep track of previous table name
     }
END  { printf "%s\n", count }                          # flush last table name count to stdout
' test2.txt
This generates:
2.6 G 7.7 G abc-def-ghi_2021-09-19_random_letters_random_numbers
2.6 G 7.7 G abc-def-ghi_2021-09-20_random_letters_random_numbers
2.6 G 7.8 G abc-def-ghi_2021-09-21_random_letters_random_numbers
3
18.9 G 56.8 G def-abc-def_2021-09-21_random_letters_random_numbers
1
110.3 M 331.0 M ghi-abc-def_2021-09-21_random_letters_random_numbers
110.3 M 331.0 M ghi-abc-def_2021-09-27_random_letters_random_numbers
110.4 M 331.2 M ghi-abc-def_2021-09-28_random_letters_random_numbers
3
55.1 K 165.3 K jkl-ghi-def_2021-09-20_random_letters_random_numbers
50.7 K 152.1 K jkl-ghi-def_2021-09-24_random_letters_random_numbers
49.6 K 148.8 K jkl-ghi-def_2021-09-25_random_letters_random_numbers
48.6 K 138.8 K jkl-ghi-def_2021-09-26_random_letters_random_numbers
4
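If the counts are wanted as a separate summary rather than on the separator lines, a variation along these lines might do. It is only a sketch: the function name group_tables is made up, the table-name prefix is assumed to be everything before the first "_" in the last field, and the input is assumed to be sorted by prefix already.

```shell
# group_tables is a hypothetical name used only for this sketch.
# Prints a blank line between prefixes, then one "prefix count" line each.
group_tables() {
    awk '
        {
            split($NF, part, "_")          # last field: <prefix>_<date>_...
            name = part[1]
            if (NR > 1 && name != prev)
                print ""                   # blank line before a new prefix
            print
            if (!(name in count))
                order[++n] = name          # remember first-seen order
            count[name]++
            prev = name
        }
        END {
            if (n) print ""                # separate the summary from the data
            for (i = 1; i <= n; i++)
                printf "%s %d\n", order[i], count[order[i]]
        }
    '
}

printf '%s\n' \
    '2.6 G 7.7 G abc-def-ghi_2021-09-19_x' \
    '2.6 G 7.7 G abc-def-ghi_2021-09-20_x' \
    '18.9 G 56.8 G def-abc-def_2021-09-21_x' | group_tables
```

Keeping the names in a separate order array means the summary is printed in input order, which a plain for (name in count) loop would not guarantee.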

Priority 8-to-3 encoder in Verilog (case, casex)

I'm trying to describe a SN54LS348 element (8-line to 3-line priority encoder).
The truth table is:
INPUTS OUTPUTS
E | 0 1 2 3 4 5 6 7 ** A2 A1 A0 | GS EO
///////////////////////////////////////
H | X X X X X X X X ** Z Z Z | H H
L | H H H H H H H H ** Z Z Z | H L
L | X X X X X X X L ** L L L | L H
L | X X X X X X L H ** L L H | L H
L | X X X X X L H H ** L H L | L H
L | X X X X L H H H ** L H H | L H
L | X X X L H H H H ** H L L | L H
L | X X L H H H H H ** H L H | L H
L | X L H H H H H H ** H H L | L H
L | L H H H H H H H ** H H H | L H
Here's my implementation:
module L348 (E, D0, D1, D2, D3, D4, D5, D6, D7, A0, A1, A2, GS, EO);
input E, D0, D1, D2, D3, D4, D5, D6, D7;
output A0, A1, A2, GS, EO;
assign D = {D0, D1, D2, D3, D4, D5, D6, D7};
parameter HIGH_IMPEDANCE = 3'bz;
reg [7:0] MASK_1 = 8'b0000_0001;
reg [7:0] MASK_2 = 8'b0000_0011;
reg [7:0] MASK_3 = 8'b0000_0111;
reg [7:0] MASK_4 = 8'b0000_1111;
reg [7:0] MASK_5 = 8'b0001_1111;
reg [7:0] MASK_6 = 8'b0011_1111;
reg [7:0] MASK_7 = 8'b0111_1111;
reg [7:0] MASK_8 = 8'b1111_1111;
reg [2:0] A;
reg [1:0] GS_EO;
reg [7:0] temp;
reg [7:0] mem [7:0];
initial
begin
mem[0] = MASK_1;
mem[1] = MASK_2;
mem[2] = MASK_3;
mem[3] = MASK_4;
mem[4] = MASK_5;
mem[5] = MASK_6;
mem[6] = MASK_7;
mem[7] = MASK_8;
temp = 8'bxxxx_xxxx;
end
assign {A2, A1, A0} = A;
assign {GS, EO} = GS_EO;
integer i;
always @(*)
begin
for (i = 7; i > 0; i = i - 1)
if (mem[i] & D == mem[i])
begin
temp = mem[i];
i = -1;
end
if (E)
begin
A = HIGH_IMPEDANCE;
GS_EO = 2'b11;
end
else
begin
if (temp == 8'b1111_1111)
begin
A = HIGH_IMPEDANCE;
GS_EO = 2'b10;
end
else
begin
GS_EO = 2'b01;
case (temp)
8'b0000_0001: A = 3'b001;
8'b0000_0011: A = 3'b010;
8'b0000_0111: A = 3'b011;
8'b0000_1111: A = 3'b100;
8'b0001_1111: A = 3'b101;
8'b0011_1111: A = 3'b110;
8'b0111_1111: A = 3'b111;
endcase
end
end
end
endmodule
It fails to achieve the switching of signals A2-A0 which are always in a X-state (except when E = H). I've tried many solutions, but it feels like simulator can't manage 'case' block ( I tried also 'casex' block). There is a bug somewhere, but I can't figure it out. Does anyone have ideas?
You've got quite a few things going on here, but your most immediate problem is probably this:
assign D = {D0, D1, D2, D3, D4, D5, D6, D7};
This implicitly declared wire is only 1 bit wide, so the upper 7 bits are dropped. Isn't Verilog fun?
There are other logical problems but the easiest way of doing a priority encoder with a case statement is as follows:
casez (in)
4'b???1 : out = 0;
4'b??10 : out = 1;
4'b?100 : out = 2;
4'b1000 : out = 3;
default : out = 0; //no match
endcase
casez allows you to put in ? for don't-care conditions, similar to your truth table. The first matching entry is taken, which gives you the priority behavior.
Adapt as needed for your case: width, direction of priority, width of IO, etc.
Finally, as a stylistic concern, your early loop termination should use break (available in SystemVerilog) rather than directly modifying the loop variable.

PDFBox: keep whitespace at the beginning of a line

I have a PDF and I want to extract its text with PDFBox 2.x while keeping all whitespace at the beginning of each line.
Here are the PDF examples:
First example, with seven whitespace characters at the beginning of the line.
When I select the text in the PDF and copy it to an editor I get this:
> Erster Testuebertrag auf die Neuentwicklung fuer die PSA Direktbank ma
> l mit sehr langen Verwendungszweck gleich zum testen wann dieser cuted
EDIT: here is the stream dump for this section:
BT
/F19 8.9664 Tf 96.197 606.119 Td [(Kommunikation)]TJ
ET
q
1 0 0 1 85.238 594.35 cm
[]0 d 0 J 0.398 w 0 0 m 0 7.352 l S
Q
BT
/F19 8.9664 Tf 133.856 595.758 Td [(Erster)-600(Testuebertrag)-600(auf)-600(die)-600(Neuentwicklung)-600(fuer)-600(die)-600(PSA)-600(Direktbank)-600(ma)]TJ
ET
q
1 0 0 1 85.238 583.989 cm
[]0 d 0 J 0.398 w 0 0 m 0 7.352 l S
Q
BT
/F19 8.9664 Tf 133.856 585.397 Td [(l)-600(mit)-600(sehr)-600(langen)-600(Verwendungszweck)-600(gleich)-600(zum)-600(testen)-600(wann)-600(dieser)-600(cuted)]TJ
ET
Second example, with five whitespace characters at the beginning of the second line.
When I select the text in the PDF and copy it to an editor I get this:
> Rueckzahlung mal mit Umlauten obwohl das ganze ja gar nicht mehr gehen
> duerte, aber hier haben wir jetzt sogar die zweite Zeile erreicht
EDIT: this is the stream dump for section #2:
BT
/F19 8.9664 Tf 96.197 267.821 Td [(Kommunikation)-8400(:)]TJ
ET
q
1 0 0 1 85.238 256.052 cm
[]0 d 0 J 0.398 w 0 0 m 0 7.352 l S
Q
BT
/F19 8.9664 Tf 117.716 257.46 Td [(Rueckzahlung)-600(mal)-600(mit)-600(Umlauten)-600(obwohl)-600(das)-600(ganze)-600(ja)-600(gar)-600(nicht)-600(mehr)-600(gehen)]TJ
ET
q
1 0 0 1 85.238 245.691 cm
[]0 d 0 J 0.398 w 0 0 m 0 7.352 l S
Q
BT
/F19 8.9664 Tf 123.096 247.099 Td [(duerte,)-600(aber)-600(hier)-600(haben)-600(wir)-600(jetzt)-600(sogar)-600(die)-600(zweite)-600(Zeile)-600(erreicht)]TJ
ET
So, when I extract the text with
PDFText2HTML Stripper = new PDFText2HTML();
or
PDFTextStripper Stripper = new PDFTextStripper();
I always get the text without the leading whitespace at the beginning of each line, but I need it.
Example:
> Erster Testuebertrag auf die Neuentwicklung fuer die PSA Direktbank ma
> l mit sehr langen Verwendungszweck gleich zum testen wann dieser cuted
>
> Rueckzahlung mal mit Umlauten obwohl das ganze ja gar nicht mehr gehen
> duerte, aber hier haben wir jetzt sogar die zweite Zeile erreicht
Is there any way to keep the whitespace with PDFBox 2.x?
Regards

Extract helix residues from DSSP with awk

I would like to extract helix (H) residues from DSSP files.
1CRN.dssp
31 37 A K H < S+
32 38 A V H < S+
33 39 A F H >< S-
34 40 A G G >< S+
35 41 A K G > S+
1GB5.dssp
113 242 B G H 3>>S+
114 243 B I H <45S+
115 244 B L H X45S+
116 245 B S H 3<5S+
117 246 B K T >X5S+
I want to save the output in the following format.
>1CRN
KVF
>1GB5
GILS
How can I do this with awk? Your suggestions would be appreciated!
Is it the 'H' in the 5th column that indicates "helix(H) residues"?
awk '{
    if (FNR == 1) print ">" FILENAME
    if ($5 == "H") {
        printf "%s", $4    # never use input data as the printf format string
    }
}
END { printf "\n" }' file
output
>tstDat.txt
KVF
IHTH
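For several input files at once, a variation like the following might get closer to the requested format. It is a sketch under the same assumption as above (an "H" in the 5th column marks a helix residue and the 4th column holds the residue letter); the function name helix_seq is made up for this example.

```shell
# helix_seq is a hypothetical name used only for this sketch.
# Prints ">name" for each file, then its helix residues on one line.
helix_seq() {
    awk '
        FNR == 1 {
            if (NR > 1) print ""           # end the sequence line of the previous file
            name = FILENAME
            sub(/.*\//, "", name)          # drop any directory part
            sub(/\.[^.]*$/, "", name)      # drop the extension, e.g. ".dssp"
            print ">" name
        }
        $5 == "H" { printf "%s", $4 }      # residue letter of a helix row
        END { if (NR) print "" }           # end the last sequence line
    ' "$@"
}

tmp=$(mktemp -d)
printf '%s\n' '31 37 A K H < S+' '32 38 A V H < S+' '33 39 A F H >< S-' \
    '34 40 A G G >< S+' '35 41 A K G > S+' > "$tmp/1CRN.dssp"
printf '%s\n' '113 242 B G H 3>>S+' '114 243 B I H <45S+' '115 244 B L H X45S+' \
    '116 245 B S H 3<5S+' '117 246 B K T >X5S+' > "$tmp/1GB5.dssp"
helix_seq "$tmp/1CRN.dssp" "$tmp/1GB5.dssp"
rm -rf "$tmp"
```

With the two sample files from the question this prints >1CRN, KVF, >1GB5 and GILS on separate lines. Real DSSP files have a long header and fixed-width columns, so the field numbers may need adjusting.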