Removing SOME line breaks from srt/txt file - awk

I have a text file which has numbered entries, a timecode and a transcript. I am trying to remove the line breaks in the transcript and leave the others. I'm trying to use grep or awk.
The file looks like this:
1
00:00:27,160 --> 00:00:29,054
Sometimes there's not much dialogue.

2
00:00:30,100 --> 00:00:31,090
But other times there is quite a bit,
and it's formatted into two lines

3
00:00:31,500 --> 00:00:33,700
I want to remove the line breaks only on
these long lines, leaving all other formatting.

4
00:00:33,805 --> 00:00:37,285
So that all dialogue ends up being on a single
line no matter how long that line.
Output would look like:
1
00:00:27,160 --> 00:00:29,054
Sometimes there's not much dialogue.

2
00:00:30,100 --> 00:00:31,090
But other times there is quite a bit, and it's formatted into two lines

3
00:00:31,500 --> 00:00:33,700
I want to remove the line breaks only on these long lines, leaving all other formatting.

4
00:00:33,805 --> 00:00:37,285
So that all dialogue ends up being on a single line no matter how long that line.
Thanks to all who have provided help.

Don't rely on lines starting (or not) with any specific characters - just attach the 4th and subsequent lines in each record to the end of the 3rd line of that record:
$ awk '
BEGIN { RS=ORS=""; FS=OFS="\n" }
{
    print $1,$2,$3
    for (i=4;i<=NF;i++)
        printf " %s", $i
    print "\n\n"
}
' file
1
00:00:27,160 --> 00:00:29,054
Sometimes there's not much dialogue.

2
00:00:30,100 --> 00:00:31,090
But other times there is quite a bit, and it's formatted into two lines

3
00:00:31,500 --> 00:00:33,700
I want to remove the line breaks only on these long lines, leaving all other formatting.

4
00:00:33,805 --> 00:00:37,285
So that all dialogue ends up being on a single line no matter how long that line.
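
The paragraph-mode approach above depends on records being separated by blank lines, as in a standard .srt file. A minimal sketch for checking that on a small sample follows. The function name join_srt and the two-cue sample are made up for illustration, and printf is used throughout instead of setting ORS and OFS, which should behave the same way.

```shell
# join_srt is a hypothetical wrapper name, not part of the answer above.
join_srt() {
  awk '
    BEGIN { RS = ""; FS = "\n" }          # paragraph mode: one record per subtitle
    {
      printf "%s\n%s\n%s", $1, $2, $3     # index, timecode, first text line
      for (i = 4; i <= NF; i++)
        printf " %s", $i                  # append any further text lines
      printf "\n\n"                       # end the line, then a blank separator
    }
  ' "$@"
}

# Made-up sample: one single-line cue and one two-line cue.
printf '1\n00:00:27,160 --> 00:00:29,054\nOne line.\n\n2\n00:00:30,100 --> 00:00:31,090\nFirst half,\nsecond half\n' | join_srt
```

If the input has no blank lines between cues, the whole file becomes a single record and this approach will not work as intended.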

I think you need something like
awk '/[0-9]+/,/^$/{ if(NR<3) print $0; else {while($0!=""){ printf $0;next; }}}' file
It's not working, but you may get the idea.

You can try something like this with awk:
awk '!NF{print}/[a-z]/{printf "%s ", $0;next}1' file
$ cat file
1
00:00:27,160 --> 00:00:29,054
Sometimes there's not much dialogue.

2
00:00:30,100 --> 00:00:31,090
But other times there is quite a bit,
and it's formatted into two lines

3
00:00:31,500 --> 00:00:33,700
I want to remove the line breaks only on
these long lines, leaving all other formatting.

4
00:00:33,805 --> 00:00:37,285
So that all dialogue ends up being on a single
line no matter how long that line.
$ awk '!NF{print}/[a-z]/{printf "%s ", $0;next}1' file
1
00:00:27,160 --> 00:00:29,054
Sometimes there's not much dialogue.

2
00:00:30,100 --> 00:00:31,090
But other times there is quite a bit, and it's formatted into two lines

3
00:00:31,500 --> 00:00:33,700
I want to remove the line breaks only on these long lines, leaving all other formatting.

4
00:00:33,805 --> 00:00:37,285
So that all dialogue ends up being on a single line no matter how long that line.

Delete every newline that is preceded by a letter, a space, or a tab:
perl -pe 's/([a-zA-Z \t])\n$/$1/'

I had the same problem and wrote this little code, which solved my problem:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[]) {
FILE *quelle,*ziel;
int i;
long maxsub,count,tmp,sub;
char puffer[10][200], *ptr,line[400];
if(argc != 3)
{
printf("Usage: srtlinejoin Filename CountOfSubtitles\n");
return EXIT_FAILURE;
}
maxsub = strtol( argv[2], &ptr, 10);
if( (quelle=fopen(argv[1],"r")) == NULL) {
fprintf(stderr, "Can't open %s\n", argv[1]);
return EXIT_FAILURE;
}
if( (ziel=fopen("out.srt","w")) == NULL) {
fprintf(stderr, "Can't open out.srt\n");
fclose(quelle);
return EXIT_FAILURE;
}
//read and write first line
fgets(puffer[0], 200, quelle);
fputs(puffer[0], ziel);
for(count=1; count < maxsub;count++)
//for(count=1; count <= 3;count++)
{
//printf("Processing subtitle %d\n",count);
tmp=0;
//Read and write time
fgets(puffer[0], 200, quelle);
fputs(puffer[0], ziel);
do {
fgets(puffer[tmp], 200, quelle);
//Scan for next Subtitle
sub = strtol( puffer[tmp], &ptr, 10);
tmp++;
}
while(sub != (count+1));
//The subtitle has only one line
if (strlen(puffer[1]) == 2)
{
fputs(puffer[0], ziel); //New Subtitle
fputs(puffer[1], ziel); //Next empty line
fputs(puffer[2], ziel); //Next number
}
//The subtitle has two lines
if ((strlen(puffer[1]) > 2) && (strlen(puffer[2]) == 2))
{
for(i=0;i<400;i++)
line[i]=0;
strncpy(line,puffer[0],(strlen(puffer[0])-2));
strcat(line," ");
strcat(line,puffer[1]);
fputs(line, ziel); //New Subtitle
fputs(puffer[2], ziel); //Next empty line
fputs(puffer[3], ziel); //Next number
}
//The subtitle has more than two lines
if ((strlen(puffer[1]) > 2) && (strlen(puffer[2]) > 2))
{
printf("Attention: a subtitle has more than two lines\n");
}
}
printf("Check last subtitle!\n");
fclose(quelle);
fclose(ziel);
return EXIT_SUCCESS;
}

Related

Fail to continue parsing after correct input

I have two input numbers separated by ','.
The program works fine for the first try, but for the second try it always ends with error.
How do I keep parsing?
lex file snippet:
%{
#include <stdlib.h>
#include "y.tab.h"
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
. return yytext[0];
%%
yacc file snippet:
%{
#include <stdio.h>
int yylex();
int yyerror();
%}
%start s
%token NUMBER
%%
s: NUMBER ',' NUMBER{
if(($1 % 3 == 0) && ($3 % 2 == 0)) {printf("OK");}
else{printf("NOT OK, try again.");}
};
%%
int main(){ return yyparse(); }
int yyerror() { printf("Error Occured.\n"); return 0; }
output snippet:
benjamin#benjamin-VirtualBox:~$ ./ex1
15,4
OK
15,4
Error Occured.
Your start rule (indeed, your only rule) is:
s: NUMBER ',' NUMBER
That means that an input consists of a NUMBER, a ',' and another NUMBER.
That's it. After the parser encounters those three things, it expects an end of input indicator, because that's what you've told it a complete input looks like.
If you want to accept multiple lines, each consisting of two numbers separated by a comma, you'll need to write a grammar which describes that input. (And in order to describe the fact that they are lines, you'll probably want to make a newline character a token. Right now, it falls through to the scanner's default rule, because in (f)lex . doesn't match a newline character.) You'll also probably want to include an error production so that your parser doesn't suddenly terminate on the first error.
Alternatively, you could parse your input one line at a time by reading the lines yourself, perhaps using fgets or the POSIX-standard getline function, and then passing each line to your scanner using yy_scan_string.

Print smallest integer from file using awk custom function?

The awk function looks like this, in a file named fun.awk:
{
print small()
}
function small()
{
a[NR]=$0
smal=0
for(i=1;i<=3;i++)
{
if( a[i]<a[i+1])
smal=a[i]
else
smal=a[i+1]
}
return smal
}
The contents of awk.write:
1
23
32
The awk command is:
awk -f fun.awk awk.write
It gives me no result? Why?
I think you are going about this the wrong way. In awk, one approach might be:
NR == 1 {
    small = $0
}
$0 < small {
    small = $0
}
END {
    print small
}
which simply sets small to the smallest integer seen so far on each line and prints it at the end. (Note: you need to start by initializing small on the first line.)
A simpler approach might just be to sort the lines as numbers with sort, and pick the first one.
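
Both suggestions can be tried side by side on the sample data from the question. This is only a sketch: the temporary file stands in for awk.write, and the + 0 is an addition that forces a numeric comparison in case a line is not recognized as a number.

```shell
# Sample data from the question, written to a temporary file.
tmp=$(mktemp)
printf '1\n23\n32\n' > "$tmp"

# Sort numerically and keep the first line.
smallest_sort=$(sort -n "$tmp" | head -n 1)

# Same idea in awk: remember the smallest value seen so far, print it at END.
smallest_awk=$(awk 'NR == 1 || $0 + 0 < small { small = $0 + 0 } END { print small }' "$tmp")

echo "sort: $smallest_sort  awk: $smallest_awk"
rm -f "$tmp"
```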

Reinitialization of awk variables

I am struggling with resetting some awk variables. I have multiple lines of the form:
one two three ... ten
with various appearances of each word in every line. I am trying to count the number of times each word appears on each line, separate from the counts from the other lines.
This is what I have so far:
{ for(i=length(Num); i>0; i--)
if( Num[i] == "one" )
{
oneCount++
}
else if( Num[i] == "two" )
{
twoCount++
}
else if( Num[i] == "three" )
{
threeCount++
}
...
}
When I print out the count values, the counts don't reset with each new line. How do I fix this?
Any help is much appreciated.
You seem very confused. To get a count of each field in a ;-separated line, you could use:
awk -F';' '{
    split("",cnt) # or "delete cnt" if using GNU awk.
    for (i=1;i<=NF;i++) {
        cnt[$i]++
    }
    for (word in cnt) {
        print word, cnt[word]
    }
}' file
Now is there anything else you need it to do?
Try initializing an array in the BEGIN portion to however many variables you'd like to count. You can run a loop in the main portion to clear the array at the beginning of every new line.
Alternatively, you could just reset the value of each variable to 0 or null in the portion of the program that executes every line, but I'm guessing you have many variables.
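
A minimal sketch of the per-line reset on whitespace-separated input like the question's, assuming the goal is one count per word per line. The helper name count_words_per_line and the sample input are hypothetical. The line number is printed first so counts from different lines stay distinguishable.

```shell
# count_words_per_line is a hypothetical helper name.
count_words_per_line() {
  awk '{
    split("", cnt)                 # clear the array at the start of every line
    for (i = 1; i <= NF; i++)
      cnt[$i]++                    # one counter per distinct word
    for (word in cnt)              # iteration order is unspecified in awk
      print NR, word, cnt[word]
  }' "$@"
}

# Made-up sample input; sort makes the unordered output stable.
printf 'one two one\ntwo two three\n' | count_words_per_line | sort
```

Using one array keyed by word avoids a separate variable (oneCount, twoCount, ...) for each word, so there is only one thing to reset.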

How to detect the last line in awk before END?

I'm trying to concatenate String values and print them, but if the last types are Strings and there is no change of type then the concatenation won't print:
input.txt:
String 1
String 2
Number 5
Number 2
String 3
String 3
awk:
awk '
BEGIN { tot=0; ant_t=""; }
{
t = $1; val=$2;
#if string, concatenate its value
if (t == "String") {
tot+=val;
nx=1;
} else {
nx=0;
}
#if type change, add tot to res
if (t != "String" && ant_t == "String") {
res=res tot;
tot=0;
}
ant_t=t;
#if string, go next
if (nx == 1) {
next;
}
res=res"\n"val;
}
END { print res; }' input.txt
Current output:
3
5
2
Expected output:
3
5
2
6
How can I detect if awk is reading last line, so if there won't be change of type it will check if it is the last line?
awk reads its input line by line, so it cannot tell whether the line it is currently reading is the last one. The END block is useful for performing actions once the end of the input has been reached.
To get the output you expect:
awk '/String/{sum+=$2} /Number/{if(sum) print sum; sum=0; print $2} END{if(sum) print sum}' input.txt
This produces:
3
5
2
6
What does it do?
/String/ selects lines that match String; /Number/ likewise selects lines that match Number.
sum+=$2 accumulates the values from the String lines. When a Number line occurs, the sum is printed and reset to zero.
Like this maybe:
awk -v lines="$(wc -l < /etc/hosts)" 'NR==lines{print "LAST"};1' /etc/hosts
I am pre-calculating the number of lines (using wc) and passing that into awk as a variable called lines, if that is unclear.
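
If a separate pass with wc is not convenient (for example when reading from a pipe), a different technique from the one above is to hold each line back by one, so that whatever is still held when END runs must be the last line. A rough sketch, with mark_last as a made-up name:

```shell
# mark_last is a hypothetical helper name.
mark_last() {
  awk '
    NR > 1 { print prev }                 # the held line was not the last one
    { prev = $0 }                         # hold the current line back
    END { if (NR) print "LAST: " prev }   # the line still held is the last one
  ' "$@"
}

printf 'a\nb\nc\n' | mark_last
```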
Just change the last line to:
END { print res; print tot;}'
awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
Explanation
y is used as a boolean; in END I check whether the last line was a String and, if so, print the sum.
You can actually use x as the boolean, as nu11p01n73R does, which is smarter.
Test
$ cat file
String 1
String 2
Number 5
Number 2
String 3
String 3
$ awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
3
5
2
6

looks like widechar input from getline in awk

I'm having trouble with AWK that I've never seen before.
I'm reading in a file, no special chars, and printing it back out.
When I read a text file, it prints out with a NUL between every char.
Reading an HTML file works exactly as expected and prints out what was read in.
Code snippet:
while ((getline line < In) > 0) {
print ":0:", line, ":0:" > "out";
}
reads the line "signature1"
and prints
":0: xFFxFEsNULiNULgNULnNULaNULtNULuNULrNULeNUL1NUL/r
NUL :0:/r/n"
as viewed in Notepad++.
"In" is the input filename.
I assume it is some Language setting on my machine, but I can't find anything.
A second print line, redirected to a file, prints every other line in Chinese.
TL;DR: complete text of the app:
BEGIN { ProcessFile(); }
function ProcessFile() {
In = "default.txt";
Works = "NoProblem.html";
Out = "quote.txt";
RS = "/n";
while ((getline textLine < In) > 0) {
print "*0*", textLine, "*0:*" > "out.txt";
print textLine > Out; # prints every other line in Chinese ???
}
close(In);
close(Out);
}
Output of the second print line:
signature1
਍猀椀最渀愀琀甀爀攀㈀ഀഀ