I have a Snakemake rule that aggregates several result files into a single file per study. To make it a bit more understandable: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule that aggregates the phenotype results from each study into a single summary file per study (merging the phenotypes into a single table). In the merge_results rule I give the rule a list of files (per study and role), aggregate them into a pandas DataFrame, and write the result out as a single file.
In the process of merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not part of the output path, and as a consequence it is also not available in the wildcards object. To get hold of the pheno I currently parse the filename, but this feels very hacky and I suspect there is something here I have not understood properly. Is there a better way to grab wildcards from input files that are not used in the output files?
runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole = ['big','small']

rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)

rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os

        # Iterate over input files, read each into a pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')
            # getting the pheno from the input file and adding it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno
            tmplist.append(data)
        resmerged = pd.concat(tmplist)
        resmerged.to_csv(output[0], sep='\t')
You are doing it the right way!
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
you have to understand that role and study are wildcards. pheno is not a wildcard and is set by the second argument of the expand function.
In order to get the phenotype in your for loop, you can either parse the file name like you are doing, or reconstruct the file name directly, since you know the different values that pheno takes and you can access the wildcards:
    run:
        import pandas as pd
        import os

        # Iterate over phenotypes, read each file into a pandas df
        tmplist = []
        for pheno in runpheno:
            # 'output' clashes between the global variable and the rule's output;
            # the global variable is renamed to outputDir here, for example
            file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno + '.summary')
            data = pd.read_csv(file, sep='\t')
            data['pheno'] = pheno
            tmplist.append(data)
        resmerged = pd.concat(tmplist)
        resmerged.to_csv(output[0], sep='\t')
I don't know if this is better than parsing the file name like you were doing though. I wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.
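As a further option (not from the original answer): since expand() generates the input files in the same order as runpheno, you could also pair the two up with zip() and avoid both the filename parsing and the path reconstruction. A minimal sketch, assuming the same rule as above:

    run:
        import pandas as pd

        tmplist = []
        # expand() preserves the order of runpheno, so inputs and phenotypes line up
        for pheno, f in zip(runpheno, input):
            data = pd.read_csv(f, sep='\t')
            data['pheno'] = pheno
            tmplist.append(data)
        pd.concat(tmplist).to_csv(output[0], sep='\t')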
I have a 700 MB txt file that I want to import into my Postgres database.
However, I keep getting the error below. "Jurs" is the last column header.
Note that the error happened on line 10014, not line 1, so I believe I have created a proper schema for the new table. But I noticed that on line 10014 there is a "\" next to BRUNSVILLE. Is that an issue? I can't figure out what the problem is.
db1=# \COPY harris_2018 FROM 'C:\Users\testu\Downloads\harris_2018\original\Real_acct_owner\real_acct.txt' with DELIMITER E'\t';
ERROR: missing data for column "jurs"
CONTEXT: COPY harris_2018, line 10014: "0081300000008 2018 STATE OF TEXAS PO BOX 1386
HOUSTON TX 77251-1386 N 0 ..."
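One way to narrow this down (a hypothetical diagnostic, not part of the original question): in COPY's text format a backslash acts as an escape character, so a stray "\" directly in front of a tab may turn that tab into literal data instead of a field separator, leaving the row one column short. A quick sketch to count the tab-separated fields on the offending line (the path is taken from the question; the encoding is a guess):

import io

path = r'C:\Users\testu\Downloads\harris_2018\original\Real_acct_owner\real_acct.txt'
with io.open(path, encoding='utf-8', errors='replace') as fh:
    for lineno, line in enumerate(fh, start=1):
        if lineno == 10014:
            fields = line.rstrip('\n').split('\t')
            # Compare this count against the number of columns in harris_2018
            print(len(fields))
            print(fields)
            break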
Below is a txt file reader.
I have to use the sys module for this. What I have so far is this:
import sys
file = sys.argv[1]
fp1 = open(file, 'r+')
fp2 = open(file + 'cl.', 'w+')
lines = fp1.readlines()
for line in lines:
    if len(line)>1 and line[0]=='Query':
        print line.split('|')[0:2]
fp1.close()
Basically when I run this on the command line:
python homework4.py sqout
It gives me nothing, but if I take away the line[0]=='Query':
it prints the first 2 splits of every line (which I want it to do), just not for every line. I only want it to print the first line which starts with Query. Thanks
line[0] is just the first character of the string line. You could use line[0:5]=='Query' or line[:5]=='Query'.
Before doing this, I suggest first checking that len(line)>4, or using an exception.
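For reference, here is a minimal corrected sketch of the script along those lines (keeping the question's Python 2 style; the slice [0:2] just reproduces the "first 2 splits" mentioned above):

import sys

file = sys.argv[1]
fp1 = open(file, 'r')
for line in fp1:
    # compare the first five characters, not just the first one
    if len(line) > 4 and line[:5] == 'Query':
        print line.split('|')[0:2]
fp1.close()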
I am trying to write Chinese words like 花花公子昊天鞋类专营店 to a CSV file in Python, but I am not able to do it. I tried the solution given here ("issues with writing Chinese to csv file in Python"). Any help will be appreciated.
The module unicodecsv helps with that (you can install that with pip):
import unicodecsv
w = unicodecsv.writer(open("test.csv", "w"))
w.writerow((u"花花公子昊天鞋类专营店", 78.10))
del w
The resulting csv file opens successfully in OpenOffice.
You can also read it back in Python:
r = unicodecsv.reader(open("test.csv", "rb"))
for row in r:
print row[0], row[1]
And when run, it should print:
(user@motoom) ~/Prj/python $ python chinesecsv.py
花花公子昊天鞋类专营店 78.1
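As a side note (not part of the original answer): on Python 3 the built-in csv module handles Unicode directly, so unicodecsv is not needed. A minimal sketch, using the same example file name:

import csv

with open("test.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["花花公子昊天鞋类专营店", 78.10])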
I have about 350,000 one-column CSV files, which are essentially 200 - 2000 numbers printed one under another. The numbers are formatted like this: "-1.32%" (no quotes). I want to merge the files into a monster of a CSV file where each file becomes a separate column. The merged file will have at most 2000 rows (each column may have a different length) and 350,000 columns.
I thought of doing it with MySQL, but there is a 30000-column limit. An awk or sed script could do the job, but I don't know them all that well and I am afraid it will take a very long time. I could use a server if the solution requires it. Any suggestions?
This Python script will do what you want:
#!/usr/bin/env python2
import sys
import codecs

# Open every file named on the command line
fhs = []
for filename in sys.argv[1:]:
    fhs.append(codecs.open(filename, 'r', 'utf-8'))

while True:
    # Read one line from each file; exhausted files yield ''
    lines = [fh.readline() for fh in fhs]
    if not any(lines):
        # All files are exhausted
        break
    # Files that have already run out simply contribute an empty field
    sys.stdout.write(','.join(line.rstrip() for line in lines))
    sys.stdout.write('\n')

for fh in fhs:
    fh.close()
Call it with all the CSV files you want to merge and it will print a new file to stdout.
Note that you can't merge all the files at once: for one thing, you can't pass 350,000 file names as arguments to a single process, and for another, a process can typically only have on the order of 1024 files open at once.
So you'll have to do it in several passes, e.g. merge files 1-1000, then 1001-2000, and so on. You should then be able to merge the 350 resulting intermediate files in one go.
Or you could write a wrapper script which uses os.listdir() to get the names of all the files and calls this script several times.
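A rough sketch of such a wrapper (a hypothetical example, not from the original answer; merge.py stands for the script above, csvdir for the input directory, and the batch size of 1000 is arbitrary):

import os
import subprocess

indir = 'csvdir'   # directory holding the input files (assumed name)
batch = 1000       # arbitrary batch size, kept below the open-files limit

files = sorted(os.listdir(indir))
for i in range(0, len(files), batch):
    chunk = [os.path.join(indir, f) for f in files[i:i + batch]]
    # Run the merge script on this batch and send its stdout to an intermediate file
    with open('merged_%06d.csv' % i, 'w') as out:
        subprocess.check_call(['python', 'merge.py'] + chunk, stdout=out)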