Interlacing text from multiple files based on contents of line - optimization

I'm trying to take N files, which, incidentally, are all syslog log files, and interlace them based on the timestamp, which is the first part of each line. I can do this naively, but I fear that my approach will not scale well with any more than just a handful of these files.
So let's say I just have two files, 1.log and 2.log. 1.log looks like this:
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.384521+00:00 bar 1
and 2.log looks like this:
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
Given that example, I would want the output to be:
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
2016-04-06T21:13:24.384521+00:00 bar 1
As that would be the lines of the files, combined, and sorted by the timestamp with which each line begins.
We can assume that each file is internally sorted before the program is run. (If it isn't, rsyslog and I have some talking to do.)
So quite naively I could write something like this, ignoring memory concerns and whatnot:
interlaced_lines = []
first_lines = [[f.readline(), f] for f in files]
while first_lines:
    # Put the file whose pending line is oldest at the front.
    first_lines.sort()
    oldest_line, f = first_lines[0]
    # Keep emitting lines from that file while they stay older than the
    # next file's pending line.
    while oldest_line and (len(first_lines) == 1
                           or (first_lines[1][0] and oldest_line < first_lines[1][0])):
        interlaced_lines.append(oldest_line)
        oldest_line = f.readline()
    if oldest_line:
        first_lines[0][0] = oldest_line
    else:
        # This file is exhausted; drop it.
        first_lines = first_lines[1:]
I fear that this might be quite slow, reading line by line like this. However, I'm not sure how else to do it. Can I perform this task faster with a different algorithm or through parallelization? I am largely indifferent to which languages and tools to use.

As it turns out, since each file is internally presorted, I can get pretty far with sort --merge. With over 2GB of logs it sorted them in 15 seconds. Using my example:
% sort --merge 1.log 2.log
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
2016-04-06T21:13:24.384521+00:00 bar 1
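If you'd rather stay in Python, the standard library's heapq.merge does the same k-way merge over presorted iterables, reading lazily instead of loading whole files. A minimal sketch (not the code from above; the file names are just the ones from the example):

import heapq
from contextlib import ExitStack

log_paths = ["1.log", "2.log"]  # the example files; extend as needed

with ExitStack() as stack, open("merged.log", "w") as out:
    files = [stack.enter_context(open(p)) for p in log_paths]
    # heapq.merge lazily yields lines in sorted order; lexical comparison of
    # these ISO-style timestamps matches chronological order.
    out.writelines(heapq.merge(*files))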

Related

snakemake dry run for a single wildcard in order of execution

Is it possible to do a dry run for snakemake for a single wildcard, in the order of execution?
When I call a dry run, I get the following at the bottom:
Job counts:
count jobs
1 all
1 assembly_eval
5 cat_fastq
1 createGenLogDir
5 createLogDir
5 flye
5 medaka_first
5 medaka_second
5 minimap_first
5 quast_medaka_first
5 quast_medaka_second
5 quast_racon_first
5 racon_first
5 symLinkFQ
58
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
So I guess it would be useful to:
get the dry-run commands for a single wildcard (except for the aggregate rules, obviously); after all, the only thing that differs among the commands of any of those rules is the wildcard in the input, output and params directives.
get the workflow printed in the order of execution, for enhanced visualisation.
I did not find a suitable option in snakemake -h; I'm looking for something that does what --rulegraph does compared to --dag, which is to avoid redundancy.
If there is no solution to this, or if the solution is too cumbersome, I guess I will suggest this as enhancement in their github page.
Here are some possible solutions:
You can specify a target file with the specific wildcard you want, e.g. snakemake -nq output_wc1.txt
If your wildcards are stored in a list/dataframe, limit it to just the first entry. I frequently do this while developing, e.g. chroms = range(1,2) # was range(1, 23) (see the sketch after this list).
If you have a single job for each rule and dependencies are simple (A -> B -> C), the jobs should be listed in order of execution. This is not true when your workflow has concurrent or branching rules.
Have you also checked --filegraph and --summary?
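For the second option above, a minimal hypothetical Snakefile sketch (the rule and file names are made up, not taken from your workflow) showing that trimming the wildcard list trims the dry run to a single wildcard:

# Hypothetical Snakefile: shrinking SAMPLES to one entry makes
# snakemake -n plan and print jobs for that single wildcard only.
SAMPLES = ["sample1"]          # was: ["sample1", "sample2", ..., "sample5"]

rule all:
    input:
        expand("results/{sample}.txt", sample=SAMPLES)

rule process:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.txt"
    shell:
        "cp {input} {output}"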

Reading and handling many small CSVs to concatenate into one large DataFrame

I have two folders, each containing about 8,000 small CSV files: one with an aggregate size of around 2 GB and another of around 200 GB.
The files are stored this way to make daily updates easier. However, when I conduct EDA, I would like them all assigned to a single variable. For example:
path = "some random path"
df = pd.concat([pd.read_csv(f"{path}//{files}") for files in os.listdir(path)])
Reading the dataset with 2 GB total size takes much less time on my local machine than on the supercomputer cluster. And it is impossible to read the 200 GB dataset on the local machine unless I use some sort of scaling-Pandas solution. The situation does not seem to improve on the cluster even when using popular open-source tools like Dask and Modin.
Is there an effective way to read those CSV files effectively in this situation?
Q :"Is there an effective way that enables to read those csv files effectively ... ?"
A :Oh, sure, there is :
CSV format ( standard attempts in RFC4180 ) is not unambiguous and is not obeyed under all circumstances ( commas inside fields, header present or not ), so some caution & care is needed here. Given you are your own data curator, you shall be able to decide plausible steps for handling your own data properly.
So, the as-is state is :
# in <_folder_1_>
:::::::: # 8000 CSV-files ~ 2GB in total
||||||||||||||||||||||||||||||||||||||||||| # 8000 CSV-files ~ 200GB in total
# in <_folder_2_>
Speaking of efficiency, the O/S coreutils provide the best, most stable, proven and most efficient tools (as system tools have long been) for the phase of merging thousands and thousands of plain CSV files' content:
###################### if need be,
###################### use an in-place remove of all CSV-file headers first :
for F in $( ls *.csv ); do sed -i '1d' $F; done
This helps in cases where we cannot avoid headers on the CSV-exporter side. It works like this:
(base):~$ cat ?.csv
HEADER
1
2
3
HEADER
4
5
6
HEADER
7
8
9
(base):~$ for i in $( ls ?.csv ); do sed -i '1d' $i; done
(base):~$ cat ?.csv
1
2
3
4
5
6
7
8
9
Now, the merging phase:
###################### join
cat *.csv > __all_CSVs_JOINED.csv
Given the nature of the said file-storage policy, performance can be boosted by using more processes to handle the small files and the large files independently, as laid out above, after putting the logic into a pair of conversion_script_?.sh shell scripts:
parallel --jobs 2 conversion_script_{1}.sh ::: $( seq -f "%1g" 1 2 )
The header removal is a "just"-[CONCURRENT] flow of processing, while the final concatenation is pure-[SERIAL]. (For a much larger number of files it might become interesting to use a multi-stage tree of merges: several stages of [SERIAL] collections over [CONCURRENT]ly pre-processed leaves. Yet for just 8,000 files, and not knowing the actual file-system details, the latency masking gained from processing the two directories concurrently and independently will be fine to start with.)
Last but not least, the resulting pair of ___all_CSVs_JOINED.csv files can be opened in a way that avoids moving all the disk-stored data into RAM at once, using a chunk-sized file-reading iterator in memory-mapped mode as a context manager:
import pandas

# SAFE_CHUNK_SIZE is a placeholder: pick a row count per chunk that fits in RAM.
with pandas.read_csv( "<_folder_1_>/___all_CSVs_JOINED.csv",
                      chunksize  = SAFE_CHUNK_SIZE,   # iterate in chunks instead of one big read
                      memory_map = True,              # mmap the file rather than buffering it
                      # ... other read_csv options as needed
                      ) as df_reader_MMAPer_CtxMGR:
    for chunk in df_reader_MMAPer_CtxMGR:
        ...                                           # per-chunk processing
When tweaking for ultimate performance, details matter and depend on the physical hardware bottlenecks (disk-I/O-wise, filesystem-wise, RAM-I/O-wise), so due care may bring further improvement in minimising the repetitively performed end-to-end processing times. Sometimes it even pays to turn the data into a compressed/zipped form, in cases where CPU/RAM resources permit: moving fewer bytes can be so much faster that the CPU/RAM decompression costs stay lower than the cost of moving 200+ [GB] of uncompressed plain-text data.
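For instance, with a reasonably recent pandas the joined file can stay gzip-compressed on disk and still be streamed chunk by chunk; a minimal hypothetical sketch (the file name and chunk size are assumptions, not taken from the steps above):

import pandas as pd

# Hypothetical: read a gzip-compressed joined CSV in chunks, trading some
# CPU time for decompression against far less disk I/O.
with pd.read_csv("___all_CSVs_JOINED.csv.gz",
                 compression="gzip",
                 chunksize=1_000_000) as reader:
    for chunk in reader:
        ...  # per-chunk processing goes here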
Details matter: tweak options, benchmark, tweak options, benchmark, tweak options, benchmark.
It would be nice if you posted your progress on testing the performance:
end-2-end duration of strategy ... [s] AS-IS now
end-2-end duration of strategy ... [s] with parallel --jobs 2 ...
end-2-end duration of strategy ... [s] with parallel --jobs 4 ...
end-2-end duration of strategy ... [s] with parallel --jobs N ... + compression ...
Keep us posted.

Nextflow: add unique ID, hash, or row number to tuple

ch_files = Channel.fromPath("myfiles/*.csv")
ch_parameters = Channel.from(['A', 'B', 'C', 'D'])
ch_samplesize = Channel.from([4, 16, 128])

process makeGrid {
    input:
    path input_file from ch_files
    each parameter from ch_parameters
    each samplesize from ch_samplesize

    output:
    tuple path(input_file), parameter, samplesize, path("config_file.ini") into settings_grid

    """
    echo "parameter=$parameter;sampleSize=$samplesize" > config_file.ini
    """
}
gives me a number_of_files * 4 * 3 grid of settings files, so I can run some script for each combination of parameters and input files.
How do I add some ID to each line of this grid? A row ID would be OK, but I would even prefer some unique 6-digit alphanumeric code without a "meaning", because the order in the table doesn't matter. I could extract the last part of the working folder, which is seemingly unique per process, but I don't think it is ideal to rely on sed and $PWD for this, and I didn't see it provided as a runtime metadata variable (plus it's a bit long, but OK). In a former setup I had a job ID from the LSF cluster system for this purpose, but I want this to be portable.
Every combination is not guaranteed to be unique (e.g. having parameter 'A' twice in the input channel should be valid).
To be clear, I would like this output:
file1.csv A 4 pathto/config.ini 1ac5r
file1.csv A 16 pathto/config.ini 7zfge
file1.csv A 128 pathto/config.ini ztgg4
file2.csv A 4 pathto/config.ini 123js
etc.
Given the input declaration, which uses the each qualifier as an input repeater, it will be difficult to append some unique id to the grid without some refactoring to use either the combine or cross operators. If the inputs are just files or simple values (like in your example code), refactoring doesn't make much sense.
To get a unique code, the simple options are:
Like you mentioned, there's unfortunately no way to access the unique task hash without some hack to parse $PWD. However, it might be possible to use Bash parameter substitution to avoid sed/awk/cut (assuming Bash is your shell, of course); you could try using: "${PWD##*/}"
You might instead prefer using ${task.index}, which is a unique index within the same task. Although the task index is not guaranteed to be unique across executions, it should be sufficient in most cases. It can also be formatted, for example:
process example {
    ...

    script:
    def idx = String.format("%06d", task.index)
    """
    echo "${idx}"
    """
}
Alternatively, create your own UUID. You might be able to take the first N characters but this will of course decrease the likelihood of the IDs being unique (not that there was any guarantee of that anyway). This might not really matter though for a small finite set of inputs:
process example {
    ...

    script:
    def uuid = UUID.randomUUID().toString()
    """
    echo "${uuid}"
    echo "${uuid.take(6)}"
    echo "${uuid.takeBefore('-')}"
    """
}

Why does R Markdown show different random numbers in the PDF output than the ones in the Rmd file?

I use set.seed() in the Rmd file to generate random numbers, but when I knit the document I get different random numbers. Here is a screenshot of the Rmd and PDF documents side by side.
In R 3.6.0 the internal algorithm used by sample() changed. The default for a new session is
> set.seed(2345)
> sample(1:10, 5)
[1] 3 7 10 2 4
which is what you get in the PDF file. One can manually change to the old "Rounding" method, though:
> set.seed(2345, sample.kind="Rounding")
Warning message:
In set.seed(2345, sample.kind = "Rounding") :
non-uniform 'Rounding' sampler used
> sample(1:10, 5)
[1] 2 10 6 1 3
You have at some point made this change in your R session, as can be seen from the output of sessionInfo(). You can either change this back with RNGkind(sample.kind="Rejection") or by starting a new R session.
BTW, in general please include code samples as text, not as images.

Regex speed in Perl 6

I've previously worked only with bash regular expressions, grep, sed, awk, etc. After trying Perl 6 regexes I've got the impression that they work slower than I would expect, but probably the reason is that I'm handling them incorrectly.
I've made a simple test to compare similar operations in Perl 6 and in bash. Here is the Perl 6 code:
my @array = "aaaaa" .. "fffff";
say +@array; # 7776 = 6 ** 5

my @search = <abcde cdeff fabcd>;
my token search {
    @search
}

my @new_array = @array.grep({/ <search> /});
say @new_array;
Then I printed @array into a file named array (with 7776 lines), made a file named search with 3 lines (abcde, cdeff, fabcd) and ran a simple grep search.
$ grep -f search array
After both programs produced the same result, as expected, I measured the time they were working.
$ time perl6 search.p6
real 0m6,683s
user 0m6,724s
sys 0m0,044s
$ time grep -f search array
real 0m0,009s
user 0m0,008s
sys 0m0,000s
So, what am I doing wrong in my Perl 6 code?
UPD: If I pass the search tokens to grep, looping through the @search array, the program works much faster:
my @array = "aaaaa" .. "fffff";
say +@array;

my @search = <abcde cdeff fabcd>;
for @search -> $token {
    say ~@array.grep({/$token/});
}
$ time perl6 search.p6
real 0m1,378s
user 0m1,400s
sys 0m0,052s
And if I define each search pattern manually, it works even faster:
my @array = "aaaaa" .. "fffff";
say +@array; # 7776 = 6 ** 5

say ~@array.grep({/abcde/});
say ~@array.grep({/cdeff/});
say ~@array.grep({/fabcd/});
$ time perl6 search.p6
real 0m0,587s
user 0m0,632s
sys 0m0,036s
The grep command is much simpler than Perl 6's regular expressions, and it has had many more years to get optimized. Regexes are also one of the areas that haven't seen as much optimization in Rakudo, partly because they are seen as a difficult thing to work on.
For a more performant example, you could pre-compile the regex:
my $search = "/@search.join('|')/".EVAL;
# $search = /abcde|cdeff|fabcd/;
say ~@array.grep($search);
That change causes it to run in about half a second.
If there is any chance of malicious data in @search, and you have to do this, it may be safer to use:
"/@search».Str».perl.join('|')/".EVAL
The compiler can't quite generate that optimized code for /@search/, as @search could change after the regex gets compiled. What could happen is that the first time the regex is used it gets recompiled into the better form, and that is then cached for as long as @search doesn't get modified.
(I think Perl 5 does something similar)
One important fact you have to keep in mind is that a regex in Perl 6 is just a method written in a domain-specific sub-language.