Nextflow: structured inputs with files - nextflow

I have an array of structured data similar to:
- name: foobar
  sex: male
  fastqs:
    - r1: /path/to/foobar_R1.fastq.gz
      r2: /path/to/foobar_R2.fastq.gz
    - r1: /path/to/more/foobar_R1.fastq.gz
      r2: /path/to/more/foobar_R2.fastq.gz
- name: bazquux
  sex: female
  fastqs:
    - r1: /path/to/bazquux_R1.fastq.gz
      r2: /path/to/bazquux_R2.fastq.gz
Note that fastqs come in pairs, and the number of pairs per "sample" may be variable.
I want to write a process in nextflow that processes one sample at a time.
For the Nextflow executor to properly marshal the files, they must somehow be typed as path (or file). Thus typed, the executor will copy the files to the compute node for processing. Simply typing the file paths as val treats them as plain strings, and no files are copied.
A trivial example of a path input from the docs:
process foo {
    input:
    path x from '/some/data/file.txt'

    """
    your_command --in $x
    """
}
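For contrast, here is a hypothetical variant using the val qualifier (my own sketch, not from the docs): the process receives only the string, and nothing is staged into the work directory.
process foo_unstaged {
    input:
    val x from '/some/data/file.txt'   // x is just the string; the file is never copied

    """
    your_command --in $x
    """
}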
How should I go about declaring the process input so that the files are properly marshaled to the compute node? So far I haven't found any examples in the docs for how to handle structured inputs.

Your structured data looks a lot like YAML. If you can include a top-level object so that your file looks something like this:
samples:
  - name: foobar
    sex: male
    fastqs:
      - r1: ./path/to/foobar_R1.fastq.gz
        r2: ./path/to/foobar_R2.fastq.gz
      - r1: ./path/to/more/foobar_R1.fastq.gz
        r2: ./path/to/more/foobar_R2.fastq.gz
  - name: bazquux
    sex: female
    fastqs:
      - r1: ./path/to/bazquux_R1.fastq.gz
        r2: ./path/to/bazquux_R2.fastq.gz
Then, we can use Nextflow's -params-file option to load the params when we run our workflow. We can access the top-level object from the params, which gives us a list that we can use to create a Channel using the fromList factory method. The following example uses the new DSL 2:
process test_proc {
    tag { sample_name }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(fastqs)

    """
    echo "${sample_name},${sex}:"
    ls -g *.fastq.gz
    """
}

workflow {
    Channel.fromList( params.samples )
        | flatMap { rec ->
            rec.fastqs.collect { rg ->
                readgroup = tuple( file(rg.r1), file(rg.r2) )
                tuple( rec.name, rec.sex, readgroup )
            }
        }
        | test_proc
}
Results:
$ mkdir -p ./path/to/more
$ touch ./path/to/foobar_R{1,2}.fastq.gz
$ touch ./path/to/more/foobar_R{1,2}.fastq.gz
$ touch ./path/to/bazquux_R{1,2}.fastq.gz
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [desperate_colden] DSL2 - revision: 391a9a3b3a
executor > local (3)
[ed/61c5c3] process > test_proc (foobar) [100%] 3 of 3 ✔
foobar,male:
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/foobar_R2.fastq.gz
bazquux,female:
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R1.fastq.gz -> ../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R2.fastq.gz -> ../../../path/to/bazquux_R2.fastq.gz
foobar,male:
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/more/foobar_R2.fastq.gz
As requested, here's a solution that runs per sample. The problem we have is that we cannot simply feed in a list of lists using the path qualifier (since an ArrayList is not a valid path value). We could flatten() the list of file pairs, but that makes it difficult to access each of the file pairs if we need them. You may not necessarily need the file-pair relationship, but assuming you do, I think the right solution is to feed in the R1 and R2 files separately (i.e. using one path qualifier for R1 and another path qualifier for R2).
The following example introspects the instance type to (re-)create the list of readgroups. We can use the stageAs option to localize the files into progressively indexed subdirectories, since some files in the YAML have identical names.
process test_proc {
    tag { sample_name }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(r1, stageAs:'*/*'), path(r2, stageAs:'*/*')

    script:
    if( [r1, r2].every { it instanceof List } )
        readgroups = [r1, r2].transpose()
    else if( [r1, r2].every { it instanceof Path } )
        readgroups = [[r1, r2], ]
    else
        error "Invalid readgroup configuration"

    read_pairs = readgroups.collect { r1, r2 -> "${r1},${r2}" }

    """
    echo "${sample_name},${sex}:"
    echo ${read_pairs.join(' ')}
    ls -g */*.fastq.gz
    """
}

workflow {
    Channel.fromList( params.samples )
        | map { rec ->
            def r1 = rec.fastqs.r1.collect { file(it) }
            def r2 = rec.fastqs.r2.collect { file(it) }
            tuple( rec.name, rec.sex, r1, r2 )
        }
        | test_proc
}
Results:
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [berserk_sanger] DSL2 - revision: 2f317a8cee
executor > local (2)
[93/6345c9] process > test_proc (bazquux) [100%] 2 of 2 ✔
foobar,male:
1/foobar_R1.fastq.gz,1/foobar_R2.fastq.gz 2/foobar_R1.fastq.gz,2/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R1.fastq.gz -> ../../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R2.fastq.gz -> ../../../../path/to/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R1.fastq.gz -> ../../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R2.fastq.gz -> ../../../../path/to/more/foobar_R2.fastq.gz
bazquux,female:
1/bazquux_R1.fastq.gz,1/bazquux_R2.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R1.fastq.gz -> ../../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R2.fastq.gz -> ../../../../path/to/bazquux_R2.fastq.gz

Related

How do I get Snakemake to apply all samples to a single rule, before proceeding to the next rule?

On a machine with j cores, given a RuleB that depends on a RuleA, I expect Snakemake to execute my workflow as follows:
RuleA Sample1 using j threads
RuleA Sample2 using j threads
...
RuleA SampleN using j threads
RuleB Sample1 using 1 thread
RuleB Sample2 using 1 thread
...
RuleB SampleN using 1 thread
With RuleB being executed on j samples simultaneously.
Instead the workflow is executed as follows:
RuleA Sample1 using j threads
RuleB Sample1 using 1 thread
RuleA Sample2 using j threads
RuleB Sample2 using 1 thread
...
with RuleB being executed on one sample at a time.
Executed in that order, RuleB can't be parallelised, and the workflow runs much slower than it could.
More specifically, I want to align reads to a genome using STAR and quantify them using RNA-SeQC. RNA-SeQC is single-threaded, while STAR can use multiple threads on a single sample.
This results in Snakemake aligning the reads in sample1 and then quantifying them with RNA-SeQC, after which it proceeds to do the same in sample2. I'd like it to align reads in all samples first, and then proceed to quantify them (this way, it would be able to run several instances of the single-threaded RNA-SeQC tool at once).
The relevant excerpt from the Snakemake file:
sample_basename = ["RNA-seq_L{}_S{}".format(x, y) for x, y in zip(range(1, 41), range(1, 41))]
sample_lane = [seq + "_L00{}".format(x) for x in [1, 2] for seq in sample_basename]

rule all:
    input:
        expand("rnaseqc/{s_l}/{s_l}.gene_tpm.gct", s_l=sample_lane)

rule run_star:
    input:
        index_dir=rules.star_index.output.index_dir,
        fq1 = "data/fastq/{sample}_R1_001.fastq.gz",
        fq2 = "data/fastq/{sample}_R2_001.fastq.gz",
    output:
        "star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        "star/{sample}/{sample}Aligned.toTranscriptome.out.bam",
        "star/{sample}/{sample}ReadsPerGene.out.tab",
        "star/{sample}/{sample}Log.final.out"
    log:
        "logs/star/{sample}.log"
    params:
        extra="--quantMode GeneCounts TranscriptomeSAM --chimSegmentMin 20 --outSAMtype BAM SortedByCoordinate",
        sample_name = "{sample}"
    threads: 18
    script:
        "scripts/star_align.py"

rule rnaseqc:
    input:
        bam="star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        gtf="data/gencode.v19.annotation.patched.collapsed.gtf"
    output:
        "rnaseqc/{sample}/{sample}.exon_reads.gct",
        "rnaseqc/{sample}/{sample}.gene_fragments.gct",
        "rnaseqc/{sample}/{sample}.gene_reads.gct",
        "rnaseqc/{sample}/{sample}.gene_tpm.gct",
        "rnaseqc/{sample}/{sample}.metrics.tsv"
    params:
        extra="-s {sample} --legacy",
        output_dir="rnaseqc/{sample}"
    log:
        "logs/rnaseqc/{sample}"
    shell:
        "rnaseqc.v2.3.4.linux {params.extra} {input.gtf} {input.bam} {params.output_dir} 2> {log}"
Weirdly enough, doing a dry run with snakemake -np -j does the correct thing:
[Mon Oct 21 13:08:11 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L182_S16_L002_R1_001.fastq.gz, data/fastq/RNA-seq_L182_S16_L002_R2_001.fastq.gz
output: star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.sortedByCoord.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.toTranscriptome.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002ReadsPerGene.out.tab, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Log.final.out
log: logs/star/RNA-seq_L182_S16_L002.log
jobid: 1026
wildcards: sample=RNA-seq_L182_S16_L002
threads: 18
[Mon Oct 21 13:08:11 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L173_S7_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L173_S7_L001_R2_001.fastq.gz
output: star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001ReadsPerGene.out.tab, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Log.final.out
log: logs/star/RNA-seq_L173_S7_L001.log
jobid: 737
wildcards: sample=RNA-seq_L173_S7_L001
threads: 18
...
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
input: star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.exon_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_fragments.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_tpm.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L221_S15_L001
jobid: 215
wildcards: sample=RNA-seq_L221_S15_L001
rnaseqc.v2.3.4.linux -s RNA-seq_L221_S15_L001 --legacy data/gencode.v19.annotation.patched.collapsed.gtf star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam rnaseqc/RNA-seq_L221_S15_L001 2> logs/rnaseqc/RNA-seq_L221_S15_L001
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
input: star/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.exon_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_fragments.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_tpm.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L284_S38_L001
jobid: 278
wildcards: sample=RNA-seq_L284_S38_L001
but executing snakemake -j without the -np flag does not.
[Mon Oct 21 13:13:49 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L249_S3_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L249_S3_L001_R2_001.fastq.gz
output: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001ReadsPerGene.out.tab, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Log.final.out
log: logs/star/RNA-seq_L249_S3_L001.log
jobid: 813
wildcards: sample=RNA-seq_L249_S3_L001
threads: 18
Aligning RNA-seq_L249_S3_L001
[Mon Oct 21 13:21:33 2019]
Finished job 813.
2 of 478 steps (0.42%) done
[Mon Oct 21 13:21:33 2019]
rule rnaseqc:
input: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.exon_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_fragments.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_tpm.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L249_S3_L001
jobid: 243
wildcards: sample=RNA-seq_L249_S3_L001
I'm using the latest version of Snakemake available through Conda:
5.5.2
Maybe what you are looking for is to give a higher priority to the rule running STAR than to the rule running rnaseqc. If so, have a look at the priority directive, e.g.:
rule star:
    priority: 50
    ...

rule rnaseqc:
    priority: 0
    ...
(Not tested) This should first run all the star jobs, one at a time since they each need 18 cores, and then run all the rnaseqc jobs in parallel.
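Applied to the rule names in the question, that would look something like this (again untested; only the priority lines are new, everything else stays as in your Snakefile):

rule run_star:
    priority: 50
    input:
        index_dir=rules.star_index.output.index_dir,
        fq1 = "data/fastq/{sample}_R1_001.fastq.gz",
        fq2 = "data/fastq/{sample}_R2_001.fastq.gz",
    # ... output, log, params, threads and script unchanged ...

rule rnaseqc:
    priority: 0
    input:
        bam="star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        gtf="data/gencode.v19.annotation.patched.collapsed.gtf"
    # ... output, params, log and shell unchanged ...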

Binding a scalar to a sigilless variable (Perl 6)

Let me start by saying that I understand that what I'm asking about in the title is dubious practice (as explained here), but my lack of understanding concerns the syntax involved.
When I first tried to bind a scalar to a sigilless symbol, I did this:
my \a = $(3);
thinking that $(...) would package the Int 3 in a Scalar (as seemingly suggested in the documentation), which would then be bound to symbol a. This doesn't seem to work though: the Scalar is nowhere to be found (a.VAR.WHAT returns (Int), not (Scalar)).
In the above-referenced post, raiph mentions that the desired binding can be performed using a different syntax:
my \a = $ = 3;
which works. Given the result, I suspect that the statement can be phrased equivalently, though less concisely, as: my \a = (my $ = 3), which I could then understand.
That leaves the question: why does the attempt with $(...) not work, and what does it do instead?
What $(…) does is turn a value into an item.
(A value in a scalar variable ($a) also gets marked as being an item)
say flat (1,2, (3,4) );
# (1 2 3 4)
say flat (1,2, $((3,4)) );
# (1 2 (3 4))
say flat (1,2, item((3,4)) );
# (1 2 (3 4))
Basically it is there to prevent a value from flattening. The reason for its existence is that Perl 6 does not flatten lists as much as most other languages, and sometimes you need a little more control over flattening.
The following only sort-of does what you want it to do
my \a = $ = 3;
A bare $ is an anonymous state variable.
my \a = (state $) = 3;
The problem shows up when you run that same bit of code more than once.
sub foo ( $init ) {
    my \a = $ = $init; # my \a = (state $) = $init;

    (^10).map: {
        sleep 0.1;
        ++a
    }
}

.say for await (start foo(0)), (start foo(42));
# (43 44 45 46 47 48 49 50 51 52)
# (53 54 55 56 57 58 59 60 61 62)

# If foo(42) beat out foo(0) instead it would result in:
# (1 2 3 4 5 6 7 8 9 10)
# (11 12 13 14 15 16 17 18 19 20)
Note that the variable is shared between calls.
The first Promise halts at the sleep call, and then the second sets the state variable before the first runs ++a.
If you use my $ instead, it now works properly.
sub foo ( $init ) {
    my \a = my $ = $init;

    (^10).map: {
        sleep 0.1;
        ++a
    }
}

.say for await (start foo(0)), (start foo(42));
# (1 2 3 4 5 6 7 8 9 10)
# (43 44 45 46 47 48 49 50 51 52)
The thing is that sigilless “variables” aren't really variables (they don't vary); they are more akin to lexically scoped (non)constants.
constant \foo = (1..10).pick; # only pick one value and never change it
say foo;

for ^5 {
    my \foo = (1..10).pick; # pick a new one each time through
    say foo;
}
Basically the whole point of them is to be as close as possible to referring to the value you assign to it. (Static Single Assignment)
# these work basically the same
-> \a {…}
-> \a is raw {…}
-> $a is raw {…}
# as do these
my \a = $i;
my \a := $i;
my $a := $i;
Note that above I wrote the following:
my \a = (state $) = 3;
Normally in the declaration of a state var, the assignment only happens the first time the code gets run. Bare $ doesn't have a declaration as such, so I had to prevent that behaviour by putting the declaration in parens.
# bare $
for (5 ... 1) {
    my \a = $ = $_;          # set each time through the loop
    say a *= 3;              # 15 12 9 6 3
}

# state in parens
for (5 ... 1) {
    my \a = (state $) = $_;  # set each time through the loop
    say a *= 3;              # 15 12 9 6 3
}

# normal state declaration
for (5 ... 1) {
    my \a = state $ = $_;    # set it only on the first time through the loop
    say a *= 3;              # 15 45 135 405 1215
}
Sigilless variables are not actually variables; they are more of an alias: they are not containers, but bind to the values they get on the right-hand side.
my \a = $(3);
say a.WHAT; # OUTPUT: «(Int)␤»
say a.VAR.WHAT; # OUTPUT: «(Int)␤»
Here, by doing $(3) you are actually putting in scalar context what is already in scalar context:
my \a = 3; say a.WHAT; say a.VAR.WHAT; # OUTPUT: «(Int)␤(Int)␤»
However, the second form in your question does something different. You're binding to an anonymous variable, which is a container:
my \a = $ = 3;
say a.WHAT; # OUTPUT: «(Int)␤»
say a.VAR.WHAT;# OUTPUT: «(Scalar)␤»
In the first case, a was an alias for 3 (or $(3), which is the same); in the second, a is an alias for $, which is a container, whose value is 3. This last case is equivalent to:
my $anon = 3; say $anon.WHAT; say $anon.VAR.WHAT; # OUTPUT: «(Int)␤(Scalar)␤»
(If you have some suggestion on how to improve the documentation, I'd be happy to follow up on it)

iSpin LTL property evaluation only with activated "assertion violations"?

I am trying to get used to iSpin/Promela. I am using:
Spin Version 6.4.3 -- 16 December 2014,
iSpin Version 1.1.4 -- 27 November 2014,
TclTk Version 8.6/8.6,
Windows 8.1.
Here is an example where I try to use LTL. The verification of the LTL property should produce an error if the two steps in the for loop are non-atomic:
#define ten ((n != 10) && (finished == 2))

int n = 0;
int finished = 0;

active [2] proctype P() {
    //assert(_pid == 0 || _pid == 1);

    int t = 0;
    byte j;
    for (j : 1 .. 5) {
        atomic {
            t = n;
            n = t+1;
        }
    }
    finished = finished+1;
}

ltl alwaysten { [] ! ten }
In the verification tab I just want to test the LTL property, so I disable all safety properties and activate "use claim". The claim name is "alwaysten".
But it seems that the LTL property is only evaluated if I also activate "assertion violations". Why? A colleague is using iSpin v1.1.0 and he does not need to activate this. What am I doing wrong? I want to prove assertions and LTL properties independently...
Here is the trace:
pan: elapsed time 0.002 seconds
To replay the error-trail, goto Simulate/Replay and select "Run"
spin -a 1_2_ConcurrentCounters_8.pml
ltl alwaysten: [] (! (((n!=10)) && ((finished==2))))
C:/cygwin/bin/gcc -DMEMLIM=1024 -O2 -DXUSAFE -w -o pan pan.c
./pan -m10000 -E -a -N alwaysten
Pid: 6980
warning: only one claim defined, -N ignored
(Spin Version 6.4.3 -- 16 December 2014)
+ Partial Order Reduction
Full statespace search for:
never claim + (alwaysten)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by -E flag)
State-vector 36 byte, depth reached 57, errors: 0
475 states, stored
162 states, matched
637 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.024 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
64.000 memory used for hash table (-w24)
0.343 memory used for DFS stack (-m10000)
64.539 total actual memory usage
unreached in proctype P
(0 of 13 states)
unreached in claim alwaysten
_spin_nvr.tmp:8, state 10, "-end-"
(1 of 10 states)
pan: elapsed time 0.001 seconds
No errors found -- did you verify all claims?
This is because your LTL property is translated into a claim that contains an assert statement, as can be seen in the claim automaton that Spin generates (not reproduced here).
So, without checking for assertion violations, no error can be found.
(A possible explanation of different behaviors: previous versions of Spin might translate this differently, perhaps using accept instead of assert.)
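You can also see the effect from the command line (a sketch, assuming the verifier has already been generated and compiled as in the trace above): pan's runtime option -A tells it to ignore assert() violations, which is essentially what unchecking "assertion violations" does, so the claim can never report an error; drop -A and the assert inside the generated claim is checked again.

# assert() violations ignored: the safety claim cannot trip
./pan -m10000 -a -N alwaysten -A

# assert() violations checked: the claim can now report the error
./pan -m10000 -a -N alwaysten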

How to output data with Hive-style directory structure in Scalding?

We are using Scalding to do ETL and generate the output as a Hive table with partitions. Consequently, we want the directory names for partitions to be something like "state=CA" for example. We are using TemplatedTsv as follows:
pipe
  // some other ETL
  .map('STATE -> 'hdfs_state) { state: Int => "State=" + state }
  .groupBy('hdfs_state) { _.pass }
  .write(TemplatedTsv(baseOutputPath, "%s", 'hdfs_state,
    writeHeader = false,
    sinkMode = SinkMode.UPDATE,
    fields = ('all except 'hdfs_state)))
We adopt the code sample from How to bucket outputs in Scalding.
Here are two issues we have:
1. except can't be resolved by IntelliJ: am I missing some imports? We don't want to list all the fields explicitly in the fields = (...) argument, because the fields are derived from the code inside the groupBy statement; if entered explicitly, they could easily get out of sync.
2. This approach looks too hacky, as we are creating an extra column just so the directory names can be processed by Hive/HCatalog. We are wondering what the right way to accomplish this would be.
Many thanks!
Sorry, the previous example was pseudocode. Below is a small working example with sample input data.
Please note that this only works with Scalding version 0.12.0 or above
Let's imagine we have the input below, which defines some purchase data:
user1 1384034400 6 75
user1 1384038000 6 175
user2 1383984000 48 3
user3 1383958800 48 281
user3 1384027200 9 7
user3 1384027200 9 11
user4 1383955200 37 705
user4 1383955200 37 15
user4 1383969600 36 41
user4 1383969600 36 21
It is tab-separated and the 3rd column is a State number. Here we use an integer, but for string-based States you can easily adapt.
This code will read the input and put them in 'State=stateid' output folder buckets.
import com.twitter.scalding._
import cascading.tap.SinkMode

class TemplatedTsvExample(args: Args) extends Job(args) {

  val purchasesPath = args("purchases")
  val outputPath = args("output")

  // defines both input & output schema; you can also make a separate one for each
  val ioSchema = ('USERID, 'TIMESTAMP, 'STATE, 'PURCHASE)

  val Purchases =
    Tsv(purchasesPath, ioSchema)
      .read
      .map('STATE -> 'STATENAME) { state: Int => "State=" + state } // here you can make necessary changes
      .groupBy('STATENAME) { _.pass } // this is optional
      .write(TemplatedTsv(outputPath, "%s", 'STATENAME, false, SinkMode.REPLACE, ioSchema))
}
I hope this is helpful. Please ask me if anything is not clear.
You can find full code here.
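For reference, a hypothetical way to launch the job above on a Hadoop cluster (the jar name and the HDFS paths are placeholders; Scalding jobs are typically run through com.twitter.scalding.Tool):

hadoop jar your-assembly.jar com.twitter.scalding.Tool TemplatedTsvExample --hdfs \
  --purchases /input/purchases.tsv --output /output/purchases_by_state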

How to find the "lexical file" in Wordnet?

If you look at the original Wordnet search and select "Display options: Show Lexical File Info", you'll see an extremely useful classification of words called lexical file. Eg for "filling" we have:
<noun.substance>S: (n) filling, fill (any material that fills a space or container)
<noun.process>S: (n) filling (flow into something (as a container))
<noun.food>S: (n) filling (a food mixture used to fill pastry or sandwiches etc.)
<noun.artifact>S: (n) woof, weft, filling, pick (the yarn woven across the warp yarn in weaving)
<noun.artifact>S: (n) filling ((dentistry) a dental appliance consisting of ...)
<noun.act>S: (n) filling (the act of filling something)
The first thing in brackets is the "lexical file". Unfortunately, I have not been able to find a SPARQL endpoint that provides this info.
The latest RDF translation of Wordnet 3.0 points to two things:
Talis SPARQL endpoint. Use eg this query to check there's no such info:
DESCRIBE <http://purl.org/vocabularies/princeton/wn30/synset-chair-noun-1>
W3C's mapping description. Appendix D "Conversion details" describes something useful: wn:classifiedByTopic.
But it's not the same as lexical file, and it is quite incomplete. Eg "chair" has nothing, while one of the senses of "completion" is in the topic "American Football":
DESCRIBE <http://purl.org/vocabularies/princeton/wn30/synset-completion-noun-1> ->
<j.1:classifiedByTopic rdf:resource="http://purl.org/vocabularies/princeton/wn30/synset-American_football-noun-1"/>
The question: is there a public Wordnet query API, or a database, that provides the lexical file information?
Using the Python NLTK interface:
from nltk.corpus import wordnet as wn

for synset in wn.synsets('can'):
    print synset.lexname
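With current NLTK releases (Python 3), lexname is a method rather than an attribute; a minimal sketch, assuming the wordnet corpus has already been downloaded via nltk.download('wordnet'):

from nltk.corpus import wordnet as wn

# print each synset of "filling" together with its lexicographer file, e.g. noun.food
for synset in wn.synsets('filling'):
    print(synset.name(), synset.lexname())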
I don't think you can find it in the RDF/OWL Representation of WordNet. It's in the WordNet distribution though: dict/lexnames. Here is the content of the file as of WordNet 3.0:
00 adj.all 3
01 adj.pert 3
02 adv.all 4
03 noun.Tops 1
04 noun.act 1
05 noun.animal 1
06 noun.artifact 1
07 noun.attribute 1
08 noun.body 1
09 noun.cognition 1
10 noun.communication 1
11 noun.event 1
12 noun.feeling 1
13 noun.food 1
14 noun.group 1
15 noun.location 1
16 noun.motive 1
17 noun.object 1
18 noun.person 1
19 noun.phenomenon 1
20 noun.plant 1
21 noun.possession 1
22 noun.process 1
23 noun.quantity 1
24 noun.relation 1
25 noun.shape 1
26 noun.state 1
27 noun.substance 1
28 noun.time 1
29 verb.body 2
30 verb.change 2
31 verb.cognition 2
32 verb.communication 2
33 verb.competition 2
34 verb.consumption 2
35 verb.contact 2
36 verb.creation 2
37 verb.emotion 2
38 verb.motion 2
39 verb.perception 2
40 verb.possession 2
41 verb.social 2
42 verb.stative 2
43 verb.weather 2
44 adj.ppl 3
For each entry in dict/data.*, the second field is the lexical file number. For example, this filling entry contains the number 13, which is noun.food:
07883031 13 n 01 filling 0 002 # 07882497 n 0000 ~ 07883156 n 0000 | a food mixture used to fill pastry or sandwiches etc.
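If you are parsing the database files yourself, mapping that number back to a name is a small lookup over dict/lexnames; a minimal Python sketch (the file paths are assumptions, adjust them to your WordNet installation):

# build {number: name} from dict/lexnames, whose lines look like "13 noun.food 1"
lexnames = {}
with open('dict/lexnames') as fh:
    for line in fh:
        num, name, _ = line.split()
        lexnames[int(num)] = name

# the second field of a dict/data.* entry is the lexicographer file number
entry = "07883031 13 n 01 filling 0 002 ..."
print(lexnames[int(entry.split()[1])])  # noun.food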
It can also be done through MIT JWI (MIT Java Wordnet Interface), a Java API for querying Wordnet. There's a topic in this link showing how to implement a Java class to access the lexicographic file information.
This is what worked for me:
Synset[] synsets = database.getSynsets(wordStr);
ReferenceSynset referenceSynset = (ReferenceSynset) synsets[i];
int lexicalCode = referenceSynset.getLexicalFileNumber();
Then use the table above to deduce the lexname, e.g. noun.time.
If you're on Windows, chances are it is in your AppData directory. To get there, open your file browser, go to the address bar at the top, and type in %appdata%.
Next, click on Roaming and find the nltk_data directory. In there you will have your corpora folder. The full path is something like:
C:\Users\yourname\AppData\Roaming\nltk_data\corpora
and lexnames will be present under
C:\Users\yourname\AppData\Roaming\nltk_data\corpora\wordnet.