How can I find the total number of lines in a file and also detect empty lines using CGI and Perl?

I have a script which reads a text file and prints it. How can I detect the empty lines in the file and ignore them?
Also, is there any way to find the total number of lines in the file without running a loop like
while (<$file>) {
    $linenumbers++;
}

To print the number of non-empty lines in a file:
perl -le 'print scalar(grep{/./}<>)'
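If you also want the total line count in the same pass, a small extension of that one-liner (file.txt is just a placeholder name) reports both, using the same "non-empty" test:
perl -lne '$total++; $nonempty++ if /./; END { print "total: $total, non-empty: $nonempty" }' file.txt   # file.txt is a placeholder
Inside a CGI script you can do the same thing in the existing read loop: increment a total counter for every line and skip printing any line that fails the /./ test.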

Related

Comparing a .TTL file to a CSV file and extracting "similar" results into a new file

I have a large CSV file filled with millions of lines, each of which has the following format:
/resource/example
I also have a .TTL file in which each line possibly contains the exact same text. I want to extract every line from that .TTL file that contains the same text as my current CSV file into a new CSV file.
I think this is possible using grep, but that is a Linux command and I am very, very inexperienced with it. Is it possible to do this in Windows? I could write a Python script that compares the two files, but since both files contain millions of lines, I think that would literally take days to execute. Could anyone point me in the right direction on how to do this?
Thanks in advance! :)
Edit:
Example line from .TTL file:
<nl.dbpedia.org/resource/Algoritme>; <purl.org/dc/terms/subject>; <nl.dbpedia.org/resource/Categorie:Algoritme>; .
Example line from current CSV file:
/resource/algoritme
So with these two example lines it should export the line from the .TTL file into a new CSV file.
Using GNU awk. First read the CSV and hash its entries into array a, then compare each entry of a against each record of the TTL file:
$ awk 'BEGIN { IGNORECASE = 1 }      # ignore case when matching
NR==FNR { a[$1]; next }              # first file: store each CSV line as a key of array a
{
    for(i in a)                      # for each CSV entry
        if($0 ~ i) {                 # check it against this TTL record
            print                    # if it matches, output the TTL record
            next                     # and skip to the next TTL record
        }
}' file.csv file.ttl
<nl.dbpedia.org/resource/Algoritme>; <purl.org/dc/terms/subject>; <nl.dbpedia.org/resource/Categorie:Algoritme>; .
Depending on the sizes of the files this may be slow; it could probably be made faster, but not based on the information offered in the OP.
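If the regex scan over every CSV entry is too slow, one possible speed-up is to pull the /resource/... part out of each TTL record and do a direct hash lookup instead. This is only a sketch, under the assumption that every CSV entry is exactly the lowercased /resource/... tail of a URI, as in the examples; matches.csv is a placeholder name:
awk '# assumes each CSV entry is exactly a lowercased /resource/... URI tail
NR==FNR { a[tolower($1)]; next }               # hash the CSV entries, lowercased
{
    line = $0
    while (match(line, /\/resource\/[^>]+/)) { # extract each /resource/... tail from the URIs
        if (tolower(substr(line, RSTART, RLENGTH)) in a) { print; break }
        line = substr(line, RSTART + RLENGTH)  # keep scanning the rest of the record
    }
}' file.csv file.ttl > matches.csv
This looks up each extracted tail in the hash instead of testing every CSV entry as a regex against every TTL record.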

How to count patterns in multiple files using awk

I have multiple log files and I need to count the number of occurrences of certain patterns in all those files.
#!/usr/bin/awk
match($0,/New connection from user \[[a-z_]*\] for company \[([a-z_]*)\]/, a)
{instance[a[1]]++}
END {
for(i in instance)
print i":"instance[i]
}
I am running this script like this:
awk -f script *
But it looks like the count is not correct. Is my approach above correct for handling multiple files?
Try moving the curly brace up to the same line as the match() call; otherwise the instance[a[1]]++ action runs for every line, not just the matching ones. As written it also prints the full line of every match, because match() on a line of its own is a pattern with the default action {print}.
#!/usr/bin/awk -f
match($0, /pattern/, a) {
    instance[a[1]]++
}
END {
    for(i in instance)
        print i ":" instance[i]
}
Further details of how individual file names are read are available at the GNU awk site, but apply generally: BEGIN runs before any files have been read, END runs after all of them, and variables keep their values across files, apart from a few built-ins (FNR, for example, which is the record number within the current file).
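A small illustration of those semantics (the log file names and the /ERROR/ pattern are placeholders): FNR restarts at 1 in each file, while NR and user variables carry across all of them.
awk 'FNR == 1 { print "reading", FILENAME }        # runs once at the start of each input file
     /ERROR/  { hits++ }                           # accumulates across every file
     END      { print "total matches:", hits + 0, "in", NR, "lines" }' app1.log app2.log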

While read loop and command with file output

I have run into an issue making a while loop (yes, I am new at this).
I have a file, lines_to_find.txt, containing a list of names which I would like to find in another (large) file, file_to_search.fasta.
When the lines in lines_to_find.txt are found in file_to_search.fasta, I would like the matching lines to be printed to a new file: output_file.fasta.
So I have a command, similar to grep, that takes the sequences (for that is what's in the large file) and prints them to a new file:
obigrep -D SEARCHWORD INPUTFILE.fasta > OUTPUTFILE.fasta
Now I would like SEARCHWORD to be replaced by each line of lines_to_find.txt, with each line matched against file_to_search.fasta. Output should preferably be a single file containing the sequence hits for all lines in lines_to_find.txt.
I tried this:
while read line
do
obigrep -D '$line' file_to_search.fasta >> outputfile.fasta
done < lines_to_find.txt
But my output file just comes back empty.
What am I doing wrong?
Am I just building the while read loop wrong?
Are there other ways to do it?
I'm open to all suggestions, and as I am new, please point out obvious beginner flaws.
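The most likely culprit is the single quotes around $line: the shell does not expand variables inside single quotes, so obigrep is literally searching for the text $line. A sketch of the loop with the quoting fixed (obigrep's -D usage is taken exactly as given in the question):
# obigrep invocation as in the question; only the quoting changes
while IFS= read -r line
do
    obigrep -D "$line" file_to_search.fasta >> outputfile.fasta
done < lines_to_find.txt
IFS= and -r simply keep read from trimming whitespace or mangling backslashes in the names.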

Buffering output with AWK

I have an input file which consists of three parts:
inputFirst
inputMiddle
inputLast
Currently I have an AWK script which with this input creates an output file which consists of two parts:
outputFirst
outputLast
where outputFirst and outputLast are generated (on the fly) from inputFirst and inputLast respectively. However, to calculate the outputMiddle part (which is only one line) I need to scan the entire input, so I store it in a variable. The problem is that the value of this variable should go between outputFirst and outputLast in the output file.
Is there a way to solve this using a single portable AWK script that takes no arguments? Is there a portable way to create temporary files in an AWK script or should I store the output from outputFirst and outputLast in two variables? I suspect that using variables will be quite inefficient for large files.
All versions of AWK (since at least 1985) can do basic I/O redirection to files or pipelines, just like the shell can, as well as run external commands without I/O redirection.
So there are any number of ways to approach your problem and solve it without having to read the entire input file into memory. The best solution will depend on exactly what you're trying to do and what constraints you must honour.
A simple approach to the more precise example problem you describe in your comment above would perhaps go something like this. First, in the BEGIN clause, form two unique filenames with rand() (and define your variables). Then read and sum the first 50 numbers from standard input while also writing them to a temporary file, and continue by reading and summing the next 50 numbers and writing them to a second file. Finally, in an END clause, use a loop to read the first temporary file with getline and write it to standard output, print the total sum, read the second temporary file the same way and write it to standard output, and call system("rm " file1 " " file2) to remove the temporary files.
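A sketch of that approach, with the 50-line split and the temporary-file names purely illustrative:
awk 'BEGIN { srand(); f1 = "part1." rand(); f2 = "part2." rand() }   # two "unique" temp names
     { s += $1 }                                                     # running sum for the middle line
     NR <= 50 { print > f1 }                                         # first block, buffered in file 1
     NR >  50 { print > f2 }                                         # remainder, buffered in file 2
     END {
         close(f1); close(f2)
         while ((getline line < f1) > 0) print line                  # replay the first block
         print s                                                     # the middle line: the total sum
         while ((getline line < f2) > 0) print line                  # replay the remainder
         system("rm " f1 " " f2)                                     # remove the temp files
     }'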
If the output file is not too large (whatever that is), saving outputLast in a variable is quite reasonable. The first part, outputFirst, can (as described) be generated on the fly. I tried this approach and it worked fine.
Print the "first" output while processing the file, then write the remainder to a temporary file until you have written the middle.
Here is a self-contained shell script which processes its input files and writes to standard output.
#!/bin/sh
t=$(mktemp -t middle.XXXXXXXXX) || exit 127
trap 'rm -f "$t"' EXIT
trap 'exit 126' HUP INT TERM
awk -v temp="$t" 'NR < 500000 { print }    # first part goes straight to standard output
    { s += $1 }                            # running sum for the single middle line
    NR >= 500000 { print >> temp }         # remainder is buffered in the temp file
    END { print s }' "$@"
cat "$t"
For illustration purposes, I used really big line numbers. I'm afraid your question is still too vague to really obtain a less general answer, but perhaps this can help you find the right direction.

Use awk to split one file into many files

I have a Master file (Master.txt) where each row is a string defining an HTML page and each field is tab-delimited.
The record layout is as follows:
<item_ID> <field_1> <field_2> <field_3>
1 1.html <html>[content for 1.html in HTML format]</html> <EOF>
2 2.html <html>[content for 2.html in HTML format]</html> <EOF>
3 3.html <html>[content for 3.html in HTML format]</html> <EOF>
The HTML page is defined in <field_2>. <field_3> may not be necessary, but included here to indicate the logical location of end_of_file.
How can I use awk to generate a file for each row (which begins with <item_ID>), where the content of the new file is <field_2> and the name of the new file is <field_1>?
I am running GNUwin32 under Windows 7 and will configure the awk solution to execute in a .bat file. Unfortunately I can't do pipelining in Windows, so I am hoping for a single-awk-program solution.
TY in advance.
Since the HTML in the third field may itself contain tabs, match against the whole record rather than printing a single field:
awk -F'\t' 'match($0,/<html>.*<\/html>/){print substr($0,RSTART,RLENGTH) > $2}' file
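If Master.txt has a very large number of rows, a variant of the same command that closes each output file after writing it avoids running into the per-process open-file limit:
awk -F'\t' 'match($0, /<html>.*<\/html>/) {
    print substr($0, RSTART, RLENGTH) > $2     # write the HTML block to the file named in field 2
    close($2)                                  # release the file handle before the next row
}' Master.txt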