Use awk to split one file into many files - awk

I have a Master file (Master.txt) where each row is a string defining an HTML page and the fields are tab delimited.
The record layout is as follows:
<item_ID> <field_1> <field_2> <field_3>
1 1.html <html>[content for 1.html in HTML format]</html> <EOF>
2 2.html <html>[content for 2.html in HTML format]</html> <EOF>
3 3.html <html>[content for 3.html in HTML format]</html> <EOF>
The HTML page is defined in <field_2>. <field_3> may not be necessary, but it is included here to indicate the logical location of end-of-file.
How can I use awk to generate a file for each row (each of which begins with <item_ID>), where the content of the new file is <field_2> and the name of the new file is <field_1>?
I am running GnuWin32 under Windows 7 and will configure the awk solution to execute in a .bat file. Unfortunately I can't do pipelining in Windows, so I am hoping for a single-awk-program solution.
Thanks in advance.

Assuming the HTML in field 3 may or may not contain tabs:
awk -F'\t' 'match($0,/<html>.*<\/html>/){print substr($0,RSTART,RLENGTH) > $2}' file
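If the HTML field turns out never to contain embedded tabs, a simpler sketch (my own assumption, not part of the answer above) is to print field 3 straight to the file named in field 2; close() keeps the number of open files down when Master.txt has many rows:
awk -F'\t' '{ print $3 > $2; close($2) }' Master.txt
When run from a .bat file the single quotes will not survive cmd.exe's quoting, so putting the program in its own file (say split.awk, a name made up here) and running awk -F"\t" -f split.awk Master.txt is safer.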

Related

Rename txt file name in Pentaho

I have a problem. I have created a few txt files in a directory:
file1.txt
file2.txt
file3.txt
Next I write the file names to a text file, filenames.txt, using a Shell step:
ls D:\test\prep\ > filename.txt
Now I have the names of all the files in the directory. My filenames.txt looks like this:
file1.txt
file2.txt
file3.txt
Later I read the values from the file in a Text file input step, and I write the values I get to a Copy rows to result step.
Next I use Get rows from result and a Transformation Executor.
I would like to get a new file name for each file via the Get rows from result step: instead of file1.txt I want file.txt. I think that in the Transformation Executor I must have a table input fed with the names from the Get rows from result step, but I don't know what comes next.
Does anyone have an idea?
You need to use the steps below if you want to read the files of a directory based on another configuration file (which contains the directory's file information).
Step-1, Step-2, Step-3: [screenshots of the three transformation steps were shown here]
You can find all of the transformation/Job files HERE.
Please let me know if it's OK with you.

How to print contents of file as well as filename in linux using some adhoc command

I have multiple files starting with DUMP_*.
Each file has data for a particular dump.
I want to print the filename as well as the contents of the file to stdout.
The expected output should be
FILENAME
ALL CONTENTS OF FILE
and so on
The closest thing I have tried is:
cat $(ll DUMP_* | awk -F ' ' '{print $9}' ) | less
With this I am not able to figure out which content belongs to which file.
Also, I am reluctant to use a shell script; an ad hoc command is preferred.
This answer is not fully in line with your expected output, but it shows the link between a filename and its content even better:
Situation:
Prompt>cat DUMP_1
Info
More Info
Prompt>cat DUMP_2
Info
Solution:
Prompt>grep "" DUMP_*
DUMP_1:Info
DUMP_1:More Info
DUMP_2:Info
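If you want the exact FILENAME-then-contents layout from the question, a small awk sketch (a suggestion of mine, not part of the answer above) prints the file name once before each file's contents:
awk 'FNR==1 { print FILENAME } { print }' DUMP_*
FNR resets to 1 at the start of each input file, so FILENAME is printed exactly once per file, followed by every line of that file.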

Comparing a .TTL file to a CSV file and extract "similar" results into a new file

I have a large CSV file that is filled with millions of different lines, each of which has the following format:
/resource/example
Now I also have a .TTL file in which each line possibly contains the exact same text. I want to extract every line from that .TTL file that contains the same text as a line in my current CSV file into a new CSV file.
I think this is possible using grep, but that is a Linux command and I am very, very inexperienced with it. Is it possible to do this in Windows? I could write a Python script that compares the two files, but since both files contain millions of lines I think that would literally take days to execute. Could anyone point me in the right direction on how to do this?
Thanks in advance! :)
Edit:
Example line from .TTL file:
<nl.dbpedia.org/resource/Algoritme>; <purl.org/dc/terms/subject>; <nl.dbpedia.org/resource/Categorie:Algoritme>; .
Example line from current CSV file:
/resource/algoritme
So with these two example lines it should export the line from the .TTL file into a new CSV file.
Using GNU awk. First read the CSV and hash it into array a. Then compare each entry of a against each row in the TTL file:
$ awk 'BEGIN { IGNORECASE = 1 }   # ignore the case (GNU awk)
NR==FNR { a[$1]; next }           # hash csv entries into a
{
    for(i in a)                   # for each entry in a
        if($0 ~ i) {              # check against this record of the ttl
            print                 # if it matches, output the ttl record
            next                  # and skip to the next ttl record
        }
}' file.csv file.ttl
<nl.dbpedia.org/resource/Algoritme>; <purl.org/dc/terms/subject>; <nl.dbpedia.org/resource/Categorie:Algoritme>; .
Depending on the sizes of the files it might be slow; it could probably be made faster, but not based on the info offered in the OP.
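Since the CSV entries are fixed strings rather than regular expressions, one alternative worth trying (my own suggestion, assuming GNU grep is available on the Windows box, e.g. via GnuWin32) is grep's fixed-string mode, which reads all the patterns from a file in a single pass:
grep -i -F -f file.csv file.ttl > matches.csv
Here -F treats each line of file.csv as a literal substring, -f reads the patterns from that file, and -i ignores case. With millions of patterns this needs a fair amount of memory, but it avoids the per-line loop of the awk version.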

While read loop and command with file output

I have run into an issue making a while loop (yes, I am new at this..).
I have a file $lines_to_find.txt, containing a list of names which I would like to find in another (large) file $file_to_search.fasta.
When the lines in lines_to_find.txt are found in file_to_search.fasta, I would like the lines with search hits to be printed to a new file: output_file.fasta.
So I have a command, similar to grep, that takes the sequences (for that is what's in the large file) and prints them to a new file:
obigrep -D SEARCHWORD INPUTFILE.fasta > OUPUTFILE.fasta
Now I would like the search word to be replaced with the file lines_to_find.txt, so that each line is read and matched against file_to_search.fasta. Output should preferably be one file, containing the sequence hits from all lines in lines_to_find.txt.
I tried this:
while read line
do
obigrep -D '$line' file_to_search.fasta >> outputfile.fasta
done < lines_to_find.txt
But my outputfile just returns empty.
What am I doing wrong?
Am I just building the while read loop wrong?
Are there other ways to do it?
I'm open to all suggestions, and as I am new, please point out obvious beginner flaws.
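One likely culprit (an observation of mine, assuming obigrep's -D option otherwise works as used in the question): '$line' is wrapped in single quotes, so the shell never expands it and obigrep searches for the literal string $line instead of each name. A minimal sketch of the corrected loop:
while read -r line
do
    obigrep -D "$line" file_to_search.fasta >> outputfile.fasta   # double quotes let $line expand
done < lines_to_find.txt
Redirecting once after the loop (done < lines_to_find.txt > outputfile.fasta) would also avoid appending to a stale output file from a previous run.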

How can I find the total number of lines in a file and also detect the empty lines by using CGI and Perl

I have a script which reads a text file and prints it. How can I detect the empty lines in the file and ignore them?
Is there any way to find out the total number of lines in the file without running
while (<$file>) {
    $linenumbers++;
}
To print the number of non-empty lines in a file:
perl -le 'print scalar(grep{/./}<>)'
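To cover both parts of the question in one pass, here is a small sketch of my own (the blank-line test /^\s*$/ and the file name are assumptions, not from the answer above): it prints the file while skipping blank lines, then reports the total line count from Perl's built-in $. variable, so no explicit counter is needed:
perl -ne 'print unless /^\s*$/; END { print "Total lines: $.\n" }' input.txt
$. holds the number of the last line read, so in the END block it equals the total number of lines in input.txt.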