I have a HUGE file with a lot of HL7 segments. It must be split into 1000 (or so) smaller files.
Since it contains HL7 data, there is a pattern (logic) to go by. Each data chunk starts with "MSH|" and ends where the next line starting with "MSH|" begins.
The script must be Windows (cmd) based or VBScript, as I cannot install any software on that machine.
File structure:
MSH|abc|123|....
s2|sdsd|2323|
...
..
MSH|ns|43|...
...
..
..
MSH|sdfns|4343|...
...
..
asds|sds
MSH|sfns|3|...
...
..
as|ss
The file in the example above must be split into 2 or 3 files. Also, the file comes from UNIX, so newlines must remain as they are in the source file.
Any help?
This is a sample script that I used to split large HL7 files into separate files, with the new file names derived from the name of the data file. It uses REBOL, which does not require installation, i.e. the core version does not make any registry entries.
I have a more generalised version that scans an incoming directory, splits each file into single messages, and then waits for the next file to arrive.
Rebol [
    file: %split-hl7.r
    author: "Graham Chiu"
    date: 17-Feb-2010
    purpose: {split HL7 messages into single messages}
]
fn: %05112010_0730.dat
outdir: %05112010_0730/
if not exists? outdir [
    make-dir outdir
]
data: read fn
cnt: 0
filename: join copy/part form fn -4 + length? form fn "-"
separator: rejoin [newline "MSH"]
parse/all data [
    some [
        [copy result to separator | copy result to end]
        (
            write to-file rejoin [outdir filename cnt ".txt"] result
            print "Got result"
            ?? result
            cnt: cnt + 1
        )
        1 skip
    ]
]
HL7 has a lot of segments - I assume that you know that your file has only MSH segments. So, have you tried parsing the file for the string "(newline)MSH|"? Just keep a running buffer and dump that into an output file when it gets too big.
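The running-buffer idea can be sketched as follows. This is only an illustration in Python, which the asker's machine may not have; the function name `split_hl7`, the `part-NNNN.hl7` naming, and the `max_bytes` threshold are assumptions made for the example, not part of the question. It cuts only in front of a line that starts with "MSH|" and reads and writes in binary mode so the UNIX newlines are left untouched.

```python
from pathlib import Path


def split_hl7(src, out_dir, max_bytes=1_000_000):
    """Split src into numbered files, cutting only in front of a line starting with b"MSH|".

    max_bytes is a hypothetical threshold: once the buffer holds at least that
    many bytes, the next "MSH|" line starts a new output file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []   # paths of the files produced so far
    buffer = []    # lines collected for the current output file
    size = 0       # number of bytes currently held in the buffer

    def flush():
        nonlocal buffer, size
        if buffer:
            target = out_dir / f"part-{len(written):04d}.hl7"
            target.write_bytes(b"".join(buffer))
            written.append(target)
            buffer, size = [], 0

    # Binary mode keeps the "\n" line endings byte-for-byte.
    with open(src, "rb") as fh:
        for line in fh:
            if line.startswith(b"MSH|") and size >= max_bytes:
                flush()
            buffer.append(line)
            size += len(line)
    flush()
    return written
```

Calling `split_hl7("big.dat", "out", max_bytes=1)` would put every message in its own file, while a larger threshold keeps whole messages together in fewer, bigger files.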
I am working with a nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), and resulting in a channel that looks like this:
[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]
Note that this is the same channel structure that you get from .fromFilePairs().
I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.
The Nextflow documentation is somewhat cryptic about this, and it is hard to find out how to declare this type of input for a process, so I thought I'd create a question on Stack Overflow and then answer it myself for anyone who ends up looking for this answer.
How does one declare the inner tuple in the input section of a nextflow process?
In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:
input:
tuple \
val(sample_id), \
file(inner_tuple) \
from Input_channel
This will ensure that the tuple content is read as file (one by one), the same way as performing .collect() on a channel of files, in the sense that all files will then be available in the nextflow temp directory where the process is executed.
The question is how you come up with sample_id, but in case they just have different file extensions you might use something like this:
all_files = Channel.fromPath("/path/to/your/files/*")
all_files.map { it -> [it.simpleName, it] }
.groupTuple()
.set { grouped_files }
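To make the shape of the grouped channel concrete, here is a plain-Python sketch of what the map and groupTuple steps above compute. It is not Nextflow code; the helper name `group_by_simple_name` and the sample paths are made up for the illustration, and it assumes that `simpleName` is the file name up to its first dot.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_simple_name(paths):
    """Group file names by the part before the first dot, mimicking map + groupTuple."""
    groups = defaultdict(list)
    for p in map(PurePosixPath, paths):
        # Assumed equivalent of Nextflow's simpleName: strip every extension.
        simple_name = p.name.split(".", 1)[0]
        groups[simple_name].append(p.name)
    # One (key, [files...]) item per key, in order of first appearance.
    return [(key, files) for key, files in groups.items()]


if __name__ == "__main__":
    for item in group_by_simple_name(["/d/a.bam", "/d/a.bam.bai", "/d/b.vcf.gz"]):
        print(item)
```

Each resulting item has the same `[sample_id, [file_A, ..., file_N]]` layout as the channel described in the question.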
The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:
The normal file input constructs introduced in the input of files
section are valid for collections of multiple files as well.
This means, you can use a script variable, e.g.:
input:
tuple val(sample_id), path(my_files)
In which case, the variable will hold the list of files (preserving the original filenames). You could use it directly to refer to all of the files in the list, or, you could access specific (file) elements (if you need them) using square bracket (slice) notation.
This is the syntax you will want most of the time. However, if you need predictable filenames, or if you need to deal with files that have identical filenames, you may need a different approach.
In that case, you could specify a target filename, e.g.:
input:
tuple val(sample_id), path('my_file')
In the case where a single file is received by the process, the file would be staged with the target filename. However, when a collection of files is received by the process, the filename will be appended with a numerical suffix representing its ordinal position in the list. For example:
process test {
    tag { sample_id }
    debug true
    stageInMode 'rellink'
    input:
    tuple val(sample_id), path('fastq')
    """
    echo "${sample_id}:"
    ls -g --time-style=+"" fastq*
    """
}
workflow {
    readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )
    test( readgroups )
}
Results:
$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run .
N E X T F L O W ~ version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor > local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../baz_2.fastq
foo:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../foo_2.fastq
bar:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../bar_2.fastq
Note that the names of staged files can be controlled using the * and ? wildcards. See the links above for a table that shows how the wildcards are replaced depending on the cardinality of the input collection.
I have a pipeline with two processes. I want to pass the result files of the first process to the second process, and save the output files of the second process in a separate directory for each input file:
process one {
.
.
output:
file "split.*.json" into groups
.
.
}
process two {
.
.
publishDir params.output_path + "/exec_logs/train/${split.baseName}", mode: 'copy'
input:
set split from groups.flatten()
output:
file ".command.out"
file ".command.err"
.
.
"""
echo $split
"""
}
When I try to run the pipeline, I get the following error:
No such variable: split in the publishDir ... line, even though I have access to the split variable in the run section.
How can I get access to the split variable in publishDir?
Thanks
I think the problem is actually the string concatenation. Try supplying instead a single GString:
publishDir "${params.output_path}/exec_logs/train/${split.baseName}", mode: 'copy'
For the example above, ensure also that the 'split' variable is provided using the path qualifier:
input:
path split from groups.flatten()
I have a few .txt files containing JSON data that need to be loaded into a Google BigQuery table. Along with the columns in the text files, I need to insert the filename and the current timestamp for each row. This runs in GCP Dataflow with Python 3.7.
I accessed the FileMetadata containing the file path and size using GCSFileSystem.match and metadata_list.
I believe I need to get the pipeline code to run in a loop, pass the file path to ReadFromText, and call a FileNameReadFunction ParDo.
(p
| "read from file" >> ReadFromText(known_args.input)
| "parse" >> beam.Map(json.loads)
| "Add FileName" >> beam.ParDo(AddFilenamesFn(), GCSFilePath)
| "WriteToBigQuery" >> beam.io.WriteToBigQuery(known_args.output,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
I followed the steps in "Dataflow/apache beam - how to access current filename when passing in pattern?" but I can't quite make it work.
Any help is appreciated.
You can use textio.ReadFromTextWithFilename instead of ReadFromText. That will produce a PCollection of (filename,line) tuples.
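To include the filename and timestamp in each output JSON record, you could change your "parse" step to the following (Python 3 does not allow tuple parameters in a lambda, so beam.MapTuple is used to unpack each pair):
| "parse" >> beam.MapTuple(lambda file, line: {
    **json.loads(line),
    "filename": file,
    "timestamp": datetime.now()})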
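The per-record step can also be written as a named function, which is easier to test outside a pipeline. The sketch below uses only the standard library; the name `to_row` is made up for the example, and storing the timestamp as an ISO-8601 string in UTC is an assumption about what the target table column accepts.

```python
import json
from datetime import datetime, timezone


def to_row(filename, line):
    """Build one output row from a (filename, line) pair.

    Hypothetical helper: in a pipeline it would be applied to the
    (filename, line) tuples, for example with beam.MapTuple(to_row).
    """
    row = json.loads(line)
    row["filename"] = filename
    # Assumption: an ISO-8601 UTC string is acceptable for the timestamp column.
    row["timestamp"] = datetime.now(timezone.utc).isoformat()
    return row


if __name__ == "__main__":
    print(to_row("gs://some-bucket/a.txt", '{"id": 1, "name": "x"}'))
```

Keeping the function free of Beam imports means it can be unit-tested with plain strings before it is wired into the pipeline.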
How to export monetdb query result (e.g. to csv file)?
Manual says:
Copy into File
The COPY INTO command with a file name argument allows for fast
dumping of a result set into an ASCII file. The file must be
accessible by the server and a full path name may be required. The
file STDOUT can be used to direct the result to the primary output
channel.
The delimiters and NULL AS arguments provide control over the layout
required.
COPY subquery INTO file_name [ [USING] DELIMITERS
field_separator [',' record_separator [ ',' string_quote ]]] [ NULL AS
null_string ]
https://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto
I'm trying with various syntax but with no result.
example query:
select * from test;
example failures:
copy select * from test into test.csv;
copy "select * from test" into test.csv;
OK, solved. The file name must be a quoted string and must be a full path. Specifying delimiters is also useful:
copy select * from test into '/home/user/test.csv' using delimiters ',';
In Rebol, there are words for directory and file management, like make-dir, what-dir, rename, create-link, etc.
But I cannot find a word to simply copy a file to another location or to a newly created file.
A solution is to READ and WRITE. For example, I can do:
>> source: %.bash_history
== %.bash_history
>> target: %nothing
== %nothing
>> write/binary target (read/binary source)
And it works well. But what if I have a file larger than the available memory? Is there any way to copy a file without loading it into memory?
At the moment, I do it with a CALL to the underlying OS:
>> call rejoin ["cp " to-string source " " to-string target]
But this is not portable to platforms other than mine (GNU/Linux Mint): it will run on all Unices and Mac OS X, but not on the rest.
I suppose it shouldn't be too hard to write a small function to do this, guessing the running operating system, and adapting the command line accordingly.
So my question: is there already a rebol standard word to copy files? If not, is there a plan to make one, in a module or something?
I don't recall a built-in way to do it aside from what's in the question, but you can do that by using file ports without buffering:
source: open/direct/binary/read %source
target: open/direct/binary/write %target
bytes_per: 1024 * 100
while [not none? data: copy/part source bytes_per][
insert target data
]
close target
close source
(Note: This answer is for Rebol 2)
You can also use system/version to detect which OS your script runs on:
call rejoin either 3 = system/version/4 [
;windows
[{copy "} to-local-file source {" "} to-local-file target {"}]
] [
;others
["cp " to-string source " " to-string target]
]
Check this script as well: http://www.rebol.org/view-script.r?script=environ.r
If there are other cases, you can use:
switch/default system/version/4 [
2 [] ;mac
3 [] ;win
;...
] [
;default
]
Also check these links for a few other takes on this problem:
Carl implemented something (I'm surprised it is not included in the core of Rebol):
http://www.rebol.com/article/0281.html
And Patrick was as surprised as you, a decade and some days ago:
http://www.mail-archive.com/rebol-list#rebol.com/msg16473.html