I have a master file (Master.txt) where each row is a string defining an HTML page and the fields are tab delimited.
The record layout is as follows:
<item_ID> <field_1> <field_2> <field_3>
1 1.html <html>[content for 1.html in HTML format]</html> <EOF>
2 2.html <html>[content for 2.html in HTML format]</html> <EOF>
3 3.html <html>[content for 3.html in HTML format]</html> <EOF>
The HTML page is defined in <field_2>. <field_3> may not be necessary, but it is included here to indicate the logical location of end_of_file.
How can I use awk to generate a file for each row (each row begins with <item_ID>), where the content of the new file is <field_2> and the name of the new file is <field_1>?
I am running GnuWin32 under Windows 7 and will configure an awk solution to execute from a .bat file. Unfortunately I can't do pipelining in Windows, so I am hoping for a single-awk-program solution.
TY in advance.
Assuming the HTML in the third tab-separated field may or may not itself contain tabs:
awk -F'\t' 'match($0,/<html>.*<\/html>/){print substr($0,RSTART,RLENGTH) > $2}' file
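If you run this from a .bat file, note that cmd.exe does not handle single quotes the way a Unix shell does, so one hedged option (assuming GnuWin32 gawk; split.awk is just an illustrative name) is to save the program in its own file, e.g. split.awk:
match($0, /<html>.*<\/html>/) { print substr($0, RSTART, RLENGTH) > $2; close($2) }
and call it from the .bat file with:
awk -F"\t" -f split.awk Master.txt
The added close($2) keeps a long Master.txt from running out of open file handles.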
I have a problem. I have created a few txt files in a directory:
file1.txt
file2.txt
file3.txt
Next I write the file names to a text file, filenames.txt, using a Shell step:
ls D:\test\prep\ > filenames.txt
It then contains the names of all the files in that directory. My filenames.txt looks like this:
file1.txt
file2.txt
file3.txt
Later I read the values from that file with a Text file input step, and the values I get are written with a Copy rows to result step.
Next I use Get rows from result and a Transformation Executor.
I would like to get a new file name for each file with the Get rows from result step: instead of file1.txt I want file.txt. I think that in the Transformation Executor I must have a Table input with the name from the Get rows from result step, but I don't know what to do next.
Does anyone have an idea?
You need to use the steps below if you want to read a directory's files based on another configuration file (which contains the directory file information).
Step-1:
Step-2:
Step-3:
You can find all the transformations/Jobs HERE.
Please let me know if this works for you.
I have multiple files starting with DUMP_*.
Each file has data for a particular dump.
I want to print the filename as well as the contents of each file to stdout.
The expected output should be
FILENAME
ALL CONTENTS OF FILE
and so on
The closest thing I have tried is:
cat $(ll DUMP_* | awk -F ' ' '{print $9}' ) | less
With this I am not able to figure out which content belongs to which file.
Also, I am reluctant to use a shell script; an ad-hoc command is preferred.
This answer is not fully in line with your expected output, but it shows the link between a filename and its content even better:
Situation:
Prompt>cat DUMP_1
Info
More Info
Prompt>cat DUMP_2
Info
Solution:
Prompt>grep "" DUMP_*
DUMP_1:Info
DUMP_1:More Info
DUMP_2:Info
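If you do want exactly the FILENAME-then-contents layout from your question, a minimal awk sketch (any POSIX awk) would be:
awk 'FNR==1 { print FILENAME } { print }' DUMP_* | less
FNR resets to 1 at the start of each input file, so each filename is printed once, right before that file's contents.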
I have a large CSV file that is filled with millions of different lines, each of which has the following format:
/resource/example
I also have a .TTL file in which each line possibly contains that exact same text. I want to extract every line from that .TTL file that contains text from my current CSV file into a new CSV file.
I think this is possible using grep, but that is a Linux command and I am very, very inexperienced with it. Is it possible to do this in Windows? I could write a Python script that compares the two files, but since both files contain millions of lines, I think that would literally take days to execute. Could anyone point me in the right direction?
Thanks in advance! :)
Edit:
Example line from .TTL file:
<nl.dbpedia.org/resource/Algoritme>; <purl.org/dc/terms/subject>; <nl.dbpedia.org/resource/Categorie:Algoritme>; .
Example line from current CSV file:
/resource/algoritme
So with these two example lines it should export the line from the .TTL file into a new CSV file.
Using GNU awk: first read the CSV and hash it into array a, then compare each entry in a against each record in the TTL file:
$ awk 'BEGIN { IGNORECASE = 1 }    # ignoring the case
NR==FNR { a[$1]; next }            # hash csv to a hash
{
    for(i in a)                    # each entry in a
        if($0 ~ i) {               # check against every record of ttl
            print                  # if match, output matched ttl record
            next                   # and skip to next ttl record
        }
}' file.csv file.ttl
<nl.dbpedia.org/resource/Algoritme>; <purl.org/dc/terms/subject>; <nl.dbpedia.org/resource/Categorie:Algoritme>; .
Depending on the file sizes it might be slow; it could probably be made faster, but not based on the information offered in the OP.
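Since the CSV entries are literal substrings rather than regular expressions, a hedged alternative (assuming GNU grep is available on Windows, e.g. through GnuWin32, Cygwin or Git Bash) is to treat the CSV as a list of fixed-string patterns:
grep -i -F -f file.csv file.ttl > matches.csv
Here -f reads one pattern per line from file.csv, -F matches them as fixed strings instead of regexes, and -i ignores case; with millions of patterns this is usually considerably faster than a per-pattern regex loop.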
I have run into an issue making a while loop (yes, I am new at this..).
I have a file, lines_to_find.txt, containing a list of names which I would like to find in another (large) file, file_to_search.fasta.
When the lines in lines_to_find.txt are found in file_to_search.fasta, I would like the matching lines to be printed to a new file: output_file.fasta.
So I have a command, similar to grep, that takes the sequences (for that is what's in the large file) and prints them to a new file:
obigrep -D SEARCHWORD INPUTFILE.fasta > OUPUTFILE.fasta
Now I would like SEARCHWORD to be replaced by each line of lines_to_find.txt in turn, with each line read and matched against file_to_search.fasta. The output should preferably be one file containing the sequence hits for all lines in lines_to_find.txt.
I tried this:
while read line
do
obigrep -D '$line' file_to_search.fasta >> outputfile.fasta
done < lines_to_find.txt
But my output file just ends up empty.
What am I doing wrong?
Am I just building the while read loop wrong?
Are there other ways to do it?
I'm open to all suggestions, and as I am new, please point out obvious beginner flaws.
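For what it's worth, the most likely culprit is the quoting: single quotes stop the shell from expanding $line, so obigrep is literally searching for the string $line. A minimal sketch of the corrected loop (assuming obigrep works as in the single-word command above) is:
while read -r line
do
    # double quotes let $line expand while still protecting spaces
    obigrep -D "$line" file_to_search.fasta >> outputfile.fasta
done < lines_to_find.txt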
I have a script which reads a text file and prints it. How can I detect the empty lines in the file and ignore them?
Is there any way to find out the total number of lines in the file without running something like:
while (<$file>) {
    $linenumbers++;
}
To print the number of non-empty lines in a file:
perl -le 'print scalar(grep{/./}<>)'
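And to print the file while skipping empty (or whitespace-only) lines, a similar hedged one-liner:
perl -ne 'print unless /^\s*$/' file.txt
The -n switch wraps the code in a while (<>) loop, and /^\s*$/ matches lines containing nothing but whitespace.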