Input function for technical / biological replicates in snakemake - snakemake

I´m currently trying to write a Snakemake workflow that can check automatically via a sample.tsv file if a given sample is a biological or technical replicate. And then use in this case at some point of my workflow a rule to merge technical/biological replicates.
My tsv file looks like this:
|sample | unit_bio | unit_tech | fq1 | fq2 |
|----------|----------|-----------|-----|-----|
| bCalAnn1 | 1 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_1_R2.fastq.gz |
| bCalAnn1 | 1 | 2 | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_2_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_2_R2.fastq.gz |
| bCalAnn2 | 1 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_1_R2.fastq.gz |
| bCalAnn2 | 1 | 2 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_2_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_2_R2.fastq.gz |
| bCalAnn2 | 2 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_2_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_2_1_R2.fastq.gz |
| bCalAnn2 | 3 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_3_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_3_1_R2.fastq.gz |
My Pipeline looks like this:
import pandas as pd
import os
import yaml
configfile: "config.yaml"
samples = pd.read_table(config["samples"], dtype=str)
rule all:
input:
expand(config["arima_mapping"] + "final/{sample}_{unit_bio}_{unit_tech}.bam", zip,
sample=samples["sample"], unit_bio=samples["unit_bio"], unit_tech=samples["unit_tech"])
..
some rules
..
rule add_read_groups:
input:
config["arima_mapping"] + "paired/{sample}_{unit_bio}_{unit_tech}.bam"
output:
config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
params:
platform = "ILLUMINA",
sampleName = "{sample}",
library = "{sample}",
platform_unit ="None"
conda:
"../envs/arima_mapping.yaml"
log:
config["logs"] + "arima_mapping/paired_read_groups/{sample}_{unit_bio}_{unit_tech}.log"
shell:
"picard AddOrReplaceReadGroups I={input} O={output} SM={params.sampleName} LB={params.library} PU={params.platform_unit} PL={params.platform} 2> {log}"
rule merge_tech_repl:
input:
config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
output:
config["arima_mapping"] + "merge_tech_repl/{sample}_{unit_bio}_{unit_tech}.bam"
params:
val_string = "SILENT"
conda:
"../envs/arima_mapping.yaml"
log:
config["logs"] + "arima_mapping/merged_tech_repl/{sample}_{unit_bio}_{unit_tech}.log"
threads:
2 #verwendet nur maximal 2
shell:
"picard MergeSamFiles -I {input} -O {output} --ASSUME_SORTED true --USE_THREADING true --VALIDATION_STRINGENCY {params.val_string} 2> {log}"
rule mark_duplicates:
input:
config["arima_mapping"] + "merge_tech_repl/{sample}_{unit_bio}_{unit_tech}.bam" if config["tech_repl"] else config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
output:
bam = config["arima_mapping"] + "final/{sample}_{unit_bio}_{unit_tech}.bam",
metric = config["arima_mapping"] + "final/metric_{sample}_{unit_bio}_{unit_tech}.txt"
#params:
conda:
"../envs/arima_mapping.yaml"
log:
config["logs"] + "arima_mapping/mark_duplicates/{sample}_{unit_bio}_{unit_tech}.log"
shell:
"picard MarkDuplicates I={input} O={output.bam} M={output.metric} 2> {log}"
At the moment I have set a boolean in a config file that tells the mark_duplicates rule whether to take its input from the add_read_group or the merge_technical_replicates rule. This is of course not optimal since it could be that some samples may have duplicates (of any numbers) while others don´t. Therefore I want to create a syntax that checks the tsv table if a given sample name and unit_bio number are identical while the unit_tech number is different (and later analog to this for biological replicates), thus merging these specific samples while nonduplicate samples skip the merging rule.
EDIT
For clarification since I think I explained my goal confusingly.
My first attempt looks like this, I want "i" to be flexible, in case the duplicate number changes. I don't think that my input function returns all duplicates together that match each other but gives them one by one which is not what I want. I´m also unsure on how I would have to handle samples that do not have duplicates since they would have to skip this rule somehow.
input_function(wildcards):
return expand({sample}_{unit_bio}_{i}.bam", sample = wildcards.sample,
unit_bio = wildcards.unit_bio,
i = samples["sample"].str.count(wildcards.sample))
rule tech_duplicate_check:
input:
input_function #(that returns a list of 2-n duplicates, where n could be different for each sample)
output:
{sample}_{unit_bio}.bam
shell:
MergeTechDupl_tool {input} # input is a list

Therefore I want to create a syntax that checks the tsv table if a given sample name and unit_bio number are identical while the unit_tech number is different (and later analog to this for biological replicates), thus merging these specific samples while nonduplicate samples skip the merging rule.
rule gather_techdups_of_a_biodup:
output: "{sample}/{unit_bio}"
input: gather_techdups_of_a_biodup_input_fn
shell: "true" # Fill this in
rule gather_biodips_of_a_techdup:
output: "{sample}/{unit_tech}"
input: gather_biodips_of_a_techdup_input_fn
shell: "true" # Fill this in
After some attempts my main problem I struggle with is the table checking. As far as I know snakemake takes templates as input and checks for all samples that match this. But I would need to check the table for every sample that shares (e.g. for technical replicate) the sample name and the unit_bio number take all these samples and give them as input for the first rule run. Then I would have to take the next sample which was not already part of a previous run to prevent merging the same samples multiple times.
The logic you describe here can be implemented in the gather_techdups_of_a_biodup_input_fn and gather_biodips_of_a_techdup_input_fn functions above. For example, read your sample TSV file with pandas, filter for wildcards.sample and wildcards.unit_bio (or wildcards.unit_tech), then extract columns fq1 and fq2 from the filtered data frame.

Related

Using multiple wildcards in snakemake for differentiate replicates in a tsv file

I've successfully implemented a Snakemake workflow and now want to add the possibility to merge specific samples, whether they are technical or biological replicates. In this case, I was told I should use multiple wildcards and work with a proper data frame. However, I'm pretty unsure what the syntax would even look like and how I can differentiate between entries properly.
My samples.tsv used to look like this:
sample
fq1
fq2
bCalAnn1_1_1
path/to/file
path/to/file
bCalAnn1_1_2
path/to/file
path/to/file
bCalAnn2_1_1
path/to/file
path/to/file
bCalAnn2_1_2
path/to/file
path/to/file
bCalAnn2_2_1
path/to/file
path/to/file
bCalAnn2_3_1
path/to/file
path/to/file
Now it looks like this:
sample
bio_unit
tech_unit
fq1
fq2
bCalAnn1
1
1
path/to/file
path/to/file
bCalAnn1
1
2
path/to/file
path/to/file
bCalAnn2
1
1
path/to/file
path/to/file
bCalAnn2
1
2
path/to/file
path/to/file
bCalAnn2
2
1
path/to/file
path/to/file
bCalAnn2
3
1
path/to/file
path/to/file
Where bio_unit contains the index of the library for one sample and tech_unit is the index of the lanes for one library.
My Snakefile looks like this:
import pandas as pd
import os
import yaml
configfile: "config.yaml"
samples = pd.read_table(config["samples"], index_col="sample")
rule all:
input:
expand(config["arima_mapping"] + "{sample}_{unit_bio}_{unit_tech}_R1.bam",
sample=samples.sample, unit_bio=samples.unit_bio, unit_tech=samples.unit_tech)
rule r1_mapping:
input:
read1 = lambda wc:
samples[(samples.sample == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
ref = config["PacBio_assembly"],
linker = config["index"] + "pac_bio_assembly.fna.amb"
output:
config["arima_mapping"] + "unprocessed_bam/({sample},{unit_bio},{unit_tech})_R1.bam"
params:
conda:
"../envs/arima_mapping.yaml"
log:
config["logs"] + "arima_mapping/mapping/({sample},{unit_bio},{unit_tech})_R1.log"
threads:
12
shell:
"bwa mem -t {threads} {input.ref} {input.read1} | samtools view --threads {threads} -h -b -o {output} 2> {log}"
r1_mapping is basically my first rule which requires no differentiating between replicates. Therefore, I would want to use it for every row. I experimented a little bit with the syntax but ultimately ran into this error:
MissingInputException in rule r1_mapping in line 20 of /home/path/to/assembly_downstream/workflow/rules/arima_mapping.smk:
Missing input files for rule r1_mapping:
output: /home/path/to/assembly_downstream/intermed/arima_mapping/unprocessed_bam/('bCalAnn1', 1, 1)_R1.bam
wildcards: sample='bCalAnn1', unit_bio= 1, unit_tech= 1
affected files:
/home/path/to/assembly_downstream/data/arima_HiC/('bCalAnn1', 1, 1)_R1.fastq.gz
Do I even read the table properly? I´m currently very confused, can anyone give me a hint on where to start fixing this?
Fixed by changing:
samples = pd.read_table(config["samples"], index_col=["sample","unit_bio","unit_tech"])
to
samples = pd.read_table(config["samples"], index_col="sample")
EDIT:
New error
Missing input files for rule all:
affected files:
/home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ... fq2
sample ...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 2 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 3 ... /home/path/to/assembly_downstream/data/ari...
[6 rows x 4 columns]>_3_2_R1.bam
/home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ... fq2
sample ...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 2 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 3 ... /home/path/to/assembly_downstream/data/ari...
[6 rows x 4 columns]>_3_1_R1.bam
/home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ...
This goes on to a total of 6 times which sounds like I´m currently getting all rows 6 times each red. I guess this has something to do with how expand() works..? I´m gonna keep researching for now.
This is what I would do, slightly simplified to keep things shorter - not tested, there may be errors!
import pandas as pd
import os
import yaml
samples = pd.read_table('samples.tsv', dtype=str)
wildcard_constraints:
# Constraints maybe not needed but in my opinion better to set them
sample='|'.join([re.escape(x) for x in samples['sample']]),
unit_bio='|'.join([re.escape(x) for x in samples.unit_bio]),
unit_tech='|'.join([re.escape(x) for x in samples.unit_tech]),
rule all:
input:
# Assume for now you just want to produce these bam files,
# one per row of the sample sheet:
expand("scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", zip,
sample=samples['sample'], unit_bio=samples.unit_bio, unit_tech=samples.unit_tech),
rule r1_mapping:
input:
read1 = lambda wc:
samples[(samples['sample'] == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
# Other input/params that don;t need wildcards...
output:
"scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam",
The main issue is to use an input function (lambda here) to query the sample sheet and get the fastq file corresponding to a given sample, unit_bio, and unit_tech.
If this doesn't work or you can't make sense of it, post a fully reproducible example that we can easily replicate.

Karate - How to construct two tables, using lines from each to validate against the other [duplicate]

I want to use single row under examples in cucumber like below:
Examples:
| data1 | data2|paymentOp|
| MySql | uk1 |??????????|
Where paymentOp is a number which I am getting from java method which has List as an argument. The method returns each of the numbers which I want to pass it under paymentOp.
There is an absolute way to iterate it by copy the row and paste it again in the table but I don't want that because the method has a dynamic result which may return 2 or 5 set of numbers.
Is it possible to achieve it using Karate?
How to proceed further. Any lead here would be much appreciated!
You can combine Examples: with dynamic behavior. Please read this example (especially the second one): https://github.com/intuit/karate/blob/master/karate-demo/src/test/java/demo/outline/examples.feature
Since you have difficulties reading the docs and examples (:P) here is a simple example. Take some time to understand it carefully.
Background:
* def data = { one: 1, two: 2, three: 3 }
Scenario Outline:
* match data.<key> == <value>
Examples:
| key | value |
| one | 1 |
| two | 2 |
| three | 3 |

Cypher: How to create a recursive cost query with alternates?

I have the following structure:
(:pattern)-[:contains]->(:pattern)
...basically a hierarchy of patterns that use other patterns as content. These constitute trees.
Certain patterns are generated by certain generators:
(:generator)-[:canProduce]->(:pattern)
The canProduce relationship has a cost value associated with it as a property. Multiple generators can create the same pattern.
I would like to figure out, with a query, what patterns I need to generate to produce a particular output - and which generators to choose to have the lowest cost. I started like this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern) RETURN ps
so far so good. The results don't contain the starting pattern, so I made this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as patterns
RETURN patterns
That does not feel elegant, but it also does not provide the hierarchy
I can of course do a path query (MATCH path = MATCH...) but the results don't seem very useful.
Also, now I need to connect the cost from the generator relationship.
I tried this:
MATCH (p:pattern {name: 'awesome'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as rec
CALL {
WITH rec
MATCH (rec)-[r:canGenerate]-(g:generator)
return r.GenCost as GenCost, g.name AS GenName
}
return rec.name, GenCost , GenName
The problem I have now is that if any of the patterns that are part of another pattern can be generated by multiple generators, I just get double entries in the list, but what I want is separate lists for each alternative possibility, so that I can generate the cost.
This is my pattern tree:
Awesome
input1
input2
input 3
Input 3 can be generated by 2 different generators. I now get:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
input4 | 1.4 | TestGen4
What I want is this: Two lists (or n, in the general case, where I might have n possible paths), one
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
and one:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input4 | 1.4 | TestGen4
each set representing one alternative set, so that I can calculate the costs and compare.
I have no idea how to do something like that. Any suggestions?

Value to table header in Pentaho

Hi I'm quite new in Pentaho Spoon and I have a problem:
I have a table like this:
model | type | color| q
--1---| --1-- | blue | 1
--1---| --2-- | blue | 2
--1---| --1-- | red | 1
--1---| --2-- | red | 3
--2---| --1-- | blue | 4
--2---| --2-- | blue | 5
And I would like to create a single table (to export in csv or excel) for each model grouped by type with the value of the group as header and as value the q value:
table-1.csv
type | blue | red
--1--| -1-- | -1-
--2--| -2-- | -3-
table-2.csv
type | blue
--1--| -4-
--2--| -5-
I tried with row denormalizer but nothing.
Any suggestion?
Typically it's helpful to see what you have done in order to offer help, but I know how counterintuitive the "help" on this step is.
Make sure you sort the rows on Model and Type before sending them to the denormalizer step. Then give this a try:
As for splitting the output into files, there are a few ways to handle that. Take a look at the Switch/Case step using the Model field.
Also, if you haven't found them already, take a look at the sample files that come with the PDI download. They should be in ...pdi-ce-6.1.0.1-196\data-integration\samples. They can be more helpful than the online documentation sometimes.
Row denormalizer can't be used here if number of colors is unknown, also, you can't define text output fields dynamically.
There are few ways that I can see without using java and js steps. One of them is based on the following idea: we can prepare rows with two columns:
Row Model
type|blue|red 1
1|1|1 1
2|2|3 1
type|blue 2
1|4 2
2|5 2
Then we can prepare filename for each row using Model field and then easily output all rows using text output where file name is taken from filename field. In this case all records will be exported into two files without additional efforts.
Here you can find sample transformation: copy-paste me into new transformation
Please note that it's a sample solution that works only with csv. Also it works only if you have the same number of colors for each type inside model. It's just a hint how to use spoon, it's not a complete solution.

How can I efficiently create unique relationships in Neo4j?

Following up on my question here, I would like to create a constraint on relationships. That is, I would like there to be multiple nodes that share the same "neighborhood" name, but each uniquely point to a particular city in which they reside.
As encouraged in user2194039's answer, I am using the following index:
CREATE INDEX ON :Neighborhood(name)
Also, I have the following constraint:
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
The following code fails to create unique relationships, and takes an excessively long period of time:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});
Note that there is a uniqueness constraint on City, but NOT on Neighborhood (because there should be multiple ones).
Profile with Limit 10,000:
+--------------+------+--------+---------------------------+------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+--------------+------+--------+---------------------------+------------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph | 9750 | 3360 | anon[307], b, neighborhood, line | MergePattern |
| SchemaIndex | 9750 | 19500 | b, line | line.City; :City(name) |
| ColumnFilter | 9750 | 0 | line | keep columns line |
| Filter | 9750 | 0 | anon[220], line | anon[220] |
| Extract | 10000 | 0 | anon[220], line | anon[220] |
| Slice | 10000 | 0 | line | { AUTOINT0} |
| LoadCSV | 10000 | 0 | line | |
+--------------+------+--------+---------------------------+------------------------------+
Total database accesses: 22860
Following Guilherme's recommendation below, I implemented the helper yet it is raising the error py2neo.error.Finished. I've searched the documentation, and wasn't able to determine a work around from this. It looks like there's an open SO post about this exception.
def run_batch_query(queries, timeout=None):
if timeout:
http.socket_timeout = timeout
try:
graph = Graph()
authenticate("localhost:7474", "account", "password")
tx = graph.cypher.begin()
for query in queries:
statement, params = query
tx.append(statement, params)
results = tx.process()
tx.commit()
except http.SocketError as err:
raise err
except error.Finished as err:
raise err
collection = []
for result in results:
records = []
for record in result:
records.append(record)
collection.append(records)
return collection
main:
queries = []
template = ["MERGE (city:City {Name:{city}})", "Merge (city)<-[:IN]-(n:Neighborhood {Name : {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()
# city_neighborhood_map is a defaultdict that maps city-> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
for neighborhood in neighborhoods:
params = dict(city=city, neighborhood=neighborhood)
queries.append((statement, params))
c +=1
if c % batch == 0:
print "running batch"
print c
s = time.time()*1000
r = run_batch_query(queries, 10)
e = time.time()*1000
print("\t{0}, {1:.00f}ms".format(c, e-s))
del queries[:]
print c
if queries:
s = time.time()*1000
r = run_batch_query(queries, 300)
e = time.time()*1000
print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))
If you want to create unique relationships you have 2 options:
Prevent the path from being duplicated, using MERGE, just like #user2194039 suggested. I think this is the simplest, and best approach you can take.
Turn your relationship into a node, and create an unique constraint on it. But it's hardly necessary for most cases.
If you're having trouble with speed, try using the transactional endpoint. I tried importing your data (random cities and neighbourhoods) through IMPORT CSV in 2.2.1, and I it was slow as well, though I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process, and probably gain a performance boost.
I managed to import 1M rows in just under 11 minutes.
I used an INDEX for Neighbourhood(name) and a unique constraint for City(name).
Give it a try and see if it works for you.
Edit:
The transactional endpoint is a restful endpoint that allows you do execute transactions in batch. You can read about it here.
Basically, it allows you to stream a bunch of queries to the server at once.
I don't know what programming language/stack you're using, but in python, using a package like py2neo, it would be something like this:
with open("city.csv", "r") as fp:
reader = csv.reader(fp)
queries = []
template = ["MERGE (c :`City` {name: {city}})",
"MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()
for row in reader:
city, neighborhood = row
params = dict(city=city, neighborhood=neighborhood)
queries.append((statement, params))
if c % batch == 0:
s = time.time()*1000
r = neo4j.run_batch_query(queries, 10)
e = time.time()*1000
print("\t{0}, {1:.00f}ms".format(c, e-s))
del queries[:]
c += 1
if queries:
s = time.time()*1000
r = neo4j.run_batch_query(queries, 300)
e = time.time()*1000
print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))
Helper functions:
def run_batch_query(queries, timeout=None):
if timeout:
http.socket_timeout = timeout
try:
graph = Graph(uri) # "{protocol}://{host}:{port}/db/data/"
tx = graph.cypher.begin()
for query in queries:
statement, params = query
tx.append(statement, params)
results = tx.process()
tx.commit()
except http.SocketError as err:
raise err
collection = []
for result in results:
records = []
for record in result:
records.append(record)
collection.append(records)
return collection
You will monitor how long each transaction takes, and you can tweak the number of queries per transactions, as well as the timeout.
To be sure we're on the same page, this is how I understand your model: Each city is unique and should have some number of neighborhoods pointing to it. The neighborhoods are unique within the context of a city, but not globally. So if you have a neighborhood 3 [IN] city Boston, you could also have a neighborhood 3 [IN] city Seattle, and both of those neighborhoods are represented by different nodes, even though they have the same name property. Is that correct?
Before importing, I would recommend adding an index to your neighborhood nodes. You can add the index without enforcing uniqueness. I have found that this greatly increases speeds on even small databases.
CREATE INDEX ON :Neighborhood(name)
And for the import:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})
If you are importing a large amount of data, it may be best to use the USING PERIODIC COMMIT command to commit periodically while importing. This will reduce the memory used in the process, and if your server is memory-constrained, I could see it helping performance. In your case, with almost a million records, this is recommended by Neo4j. You can even adjust how often the commit happens by doing USING PERIODIC COMMIT 10000 or such. The docs say 1000 is the default. Just understand that this will break the import into several transactions.
Best of luck!