How can I read two input files and create one output file via Python? - file-io

I have two input files and need to combine them into one output file. My professor wants me to use a while loop to process the first file and a for loop for the second file, and to wrap the reading and writing in a try/except block. I have a general idea for the initial code, but I'm still lost.
#reads the file
n1 = open('nameslist1.txt', 'r')
n2 = open('nameslist2.txt', 'r')
print(n1.read())
print(n2.read())
n1.close()
n2.close()
#writes the file
n1_o = open('allnames.txt', 'w')
n2_o = open('allnames.txt', 'w')
n1_o.write('nameslist1.txt')
n2_o.write('nameslist2.txt')
n1_o.close()
n2_o.close()
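For reference, here is a minimal sketch of one way to satisfy those requirements, reusing the file names from the code above (the exact loop and exception structure the professor expects may differ):
try:
    # Process the first file with a while loop
    with open('nameslist1.txt', 'r') as n1, open('allnames.txt', 'w') as out:
        line = n1.readline()
        while line:
            out.write(line)
            line = n1.readline()

    # Process the second file with a for loop, appending to the same output file
    with open('nameslist2.txt', 'r') as n2, open('allnames.txt', 'a') as out:
        for line in n2:
            out.write(line)
except IOError as e:
    print('Could not read or write a file:', e)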

Related

Skip csv header row using boto3 in lambda and copy_from in psycopg2

I'm loading a CSV from S3 into memory and then need to insert it into Postgres. I think the problem is that I'm not using the right call on the S3 object, because I can't seem to skip the header line. On my local machine I would just load the file from a directory:
cur = DBCONN.cursor()
for filename in absolute_file_paths('/path/to/file/csv.log'):
    print('Importing: ' + filename)
    with open(filename, 'r') as log:
        next(log)  # Skip the header row.
        cur.copy_from(log, 'vesta', sep='\t')
        DBCONN.commit()
I have the code below in Lambda, which I would like to work roughly like the code above, but it behaves differently with S3. What is the correct way to make the code below work like the code above? Or, perhaps, what IS the correct way to do this?
s3 = boto3.client('s3')

# Load the file from S3 into memory
obj = s3.get_object(Bucket=bucket, Key=key)
contents = obj['Body']
next(contents, None)  # Skip the header row - this does not seem to work

cur = DBCONN.cursor()
cur.copy_from(contents, 'my_table', sep='\t')
DBCONN.commit()
Seemingly, my problem had something to do with an incredibly wide CSV file (over 200 columns), which somehow kept next() from returning the next row. So, if your file is not that wide, the code I placed in the question should work. Below, however, is how I got it to work: basically by reading the file into memory and then writing it back to an in-memory file after skipping the header row. This honestly seems like overkill, so I'd be happy if someone could provide something more efficient, but seeing as I spent the last eight hours on this, I'm just happy to have SOMETHING that works.
s3 = boto3.client('s3')
...
def remove_header(contents):
    # Reformat the file, removing the header row
    data = csv.reader(io.StringIO(contents), delimiter='\t')  # read the data in
    mem_file = io.StringIO()  # create an in-memory file object
    next(data)  # skip the header row
    writer = csv.writer(mem_file, delimiter='\t')  # set up the csv writer
    writer.writerows(data)  # write the remaining rows to the in-memory file
    mem_file.seek(0)  # go back to the beginning of the memory stream
    return mem_file
...
# Load the file from S3 into memory
obj = s3.get_object(Bucket=bucket, Key=key)
contents = obj['Body'].read().decode('utf-8')
mem_file = remove_header(contents)

# Insert into Postgres
try:
    cur = DBCONN.cursor()
    cur.copy_from(mem_file, 'my_table', sep='\t')
    DBCONN.commit()
except BaseException as e:
    DBCONN.rollback()
    raise e
Or, if you want to do it with pandas:
def remove_header_pandas(contents):
    df = pd.read_csv(io.StringIO(contents), sep='\t')
    mem_file = io.StringIO()
    df.to_csv(mem_file, sep='\t', header=False, index=False)  # write back tab-separated, without the header
    mem_file.seek(0)
    return mem_file
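For what it's worth, once the decoded string is in memory, a simpler sketch (reusing the same contents, DBCONN and my_table as above) is to skip the header directly on a StringIO and hand the buffer to copy_from; io.StringIO iterates line by line, so next() only consumes the header row no matter how wide the rows are:
buf = io.StringIO(contents)
next(buf)  # advance past the header line; copy_from reads from the current position
cur = DBCONN.cursor()
cur.copy_from(buf, 'my_table', sep='\t')
DBCONN.commit()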

Getting wildcard from input files when not used in output files

I have a Snakemake rule aggregating several result files into a single file, per study. To make it a bit more understandable: I have two roles ['big','small'] that each produce data for 5 studies ['a','b','c','d','e'], and each study produces 3 output files, one per phenotype ['xxx','yyy','zzz']. What I want is a rule that aggregates the phenotype results from each study into a single summary file per study (merging the phenotypes into a single table). In the merge_results rule I give the rule a list of files (per study and role), aggregate these using a pandas DataFrame, and then write out the result as a single file.
In the process of merging the results I need the 'pheno' variable from the input file being iterated over. Since pheno is not needed in the aggregated output file, it is not provided in output and, as a consequence, it is also not available in the wildcards object. To get hold of the pheno I parse the filename, but this all feels very hacky and I suspect there is something here I have not understood properly. Is there a better way to grab wildcards from input files that are not used in output files?
runstudy = ['a','b','c','d','e']
runpheno = ['xxx','yyy','zzz']
runrole = ['big','small']

rule all:
    input:
        expand(os.path.join(output, '{role}-additive', '{study}', '{study}-summary-merge.txt'), role=runrole, study=runstudy)

rule merge_results:
    input:
        expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
    output:
        os.path.join(output, '{role}', '{study}', '{study}-summary-merge.txt')
    run:
        import pandas as pd
        import os
        # Iterate over input files, read into pandas df
        tmplist = []
        for f in input:
            data = pd.read_csv(f, sep='\t')
            # getting the pheno from the input file and adding it to the data frame
            pheno = os.path.split(f)[1].split('.')[0]
            data['pheno'] = pheno
            tmplist.append(data)
        resmerged = pd.concat(tmplist)
        resmerged.to_csv(output, sep='\t')
You are doing it the right way!
In your line:
expand(os.path.join(output, '{{role}}', '{{study}}', '{pheno}', '{pheno}.summary'), pheno=runpheno)
you have to understand that role and study are wildcards, while pheno is not a wildcard and is set by the second argument of the expand function.
In order to get the phenotype in your for loop, you can either parse the file name like you are doing, or directly reconstruct the file name, since you know the different values that pheno takes and you can access the wildcards:
run:
    import pandas as pd
    import os
    # Iterate over phenotypes, read into pandas df
    tmplist = []
    for pheno in runpheno:
        # Conflicting variable name 'output' between a global variable and the rule variable here;
        # the global variable is renamed outputDir for this example.
        file = os.path.join(outputDir, wildcards.role, wildcards.study, pheno, pheno + '.summary')
        data = pd.read_csv(file, sep='\t')
        data['pheno'] = pheno
        tmplist.append(data)
    resmerged = pd.concat(tmplist)
    resmerged.to_csv(output, sep='\t')
I don't know if this is better than parsing the file name like you were doing though. I wanted to show that you can access wildcards in the code. Either way, you are defining the input and output correctly.

How to load lists in pelicanconf.py from external file

There are different lists available in pelicanconf.py, such as:
SOCIAL = (('Facebook','www.facebook.com'),)
LINKS =
etc.
I want to manage this content and create my own lists by loading these values from an external file that can be edited independently. I tried importing the data as a text file using Python, but it doesn't work. Is there any other way?
What exactly did not work? Can you provide code?
You can execute arbitrary Python code in your pelicanconf.py.
Here is an example of a very simple CSV reader:
# in pelicanconf.py
def fn_to_list(fn):
    with open(fn, 'r') as res:
        return tuple(map(lambda line: tuple(line[:-1].split(';')), res.readlines()))

print(fn_to_list("data"))
CSV file data:
A;1
B;2
C;3
D;4
E;5
F;6
Together, this yields the following when running pelican:
# ...
((u'A', u'1'), (u'B', u'2'), (u'C', u'3'), (u'D', u'4'), (u'E', u'5'), (u'F', u'6'))
# ...
Instead of printing, you can also assign this list to a variable, say LINKS.
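For example, reusing the fn_to_list helper and the data file shown above:
# in pelicanconf.py
LINKS = fn_to_list("data")  # e.g. (('A', '1'), ('B', '2'), ...)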

How to run same syntax on multiple spss files

I have 24 SPSS files in .sav format in a single folder. All of these files have the same structure. I want to run the same syntax on all of them. Is it possible to write code in SPSS for this?
You can use the SPSSINC PROCESS FILES user-submitted command to do this, or write your own macro. First, let's create some very simple fake data to work with.
*FILE HANDLE save /NAME = "Your Handle Here!".
*Creating some fake data.
DATA LIST FREE / X Y.
BEGIN DATA
1 2
3 4
END DATA.
DATASET NAME Test.
SAVE OUTFILE = "save\X1.sav".
SAVE OUTFILE = "save\X2.sav".
SAVE OUTFILE = "save\X3.sav".
EXECUTE.
*Creating a syntax file to call.
DO IF $casenum = 1.
PRINT OUTFILE = "save\TestProcess_SHOWN.sps" /"FREQ X Y.".
END IF.
EXECUTE.
Now we can use the SPSSINC PROCESS FILES command to specify the sav files in the folder and apply the TestProcess_SHOWN.sps syntax to each of those files.
*Now example calling the syntax.
SPSSINC PROCESS FILES INPUTDATA="save\X*.sav"
SYNTAX="save\TestProcess_SHOWN.sps"
OUTPUTDATADIR="save" CONTINUEONERROR=YES
VIEWERFILE= "save\Results.spv" CLOSEDATA=NO
MACRONAME="!JOB"
/MACRODEFS ITEMS.
Another (less advanced) way is to use the INSERT command. To do so, repeatedly GET each sav file, run the syntax with INSERT, and save the file. Probably something like this:
get 'file1.sav'.
insert file='syntax.sps'.
save outf='file1_v2.sav'.
dataset close all.
get 'file2.sav'.
insert file='syntax.sps'.
save outf='file2_v2.sav'.
etc etc.
Good luck!
If the syntax you need to run is completely independent of the files, then you can either use INSERT FILE = 'Syntax.sps' or put the code in a macro, e.g.:
Define !Syntax ()
* Put Syntax here
!EndDefine.
You can then run either of these 'manually':
get file = 'file1.sav'.
insert file='syntax.sps'.
save outfile ='file1_v2.sav'.
Or
get file = 'file1.sav'.
!Syntax.
save outfile ='file1_v2.sav'.
Or, if the files follow a reasonably strict naming structure, you can embed either of the above in a simple bit of Python:
Begin Program.
import spss
for i in range(1, 24 + 1):
    syntax = "get file = 'file" + str(i) + ".sav'.\n"
    syntax += "insert file='syntax.sps'.\n"
    syntax += "save outfile = 'file" + str(i) + "_v2.sav'.\n"
    print(syntax)
    spss.Submit(syntax)
End Program.

How to modify a line in a file with Erlang OTP module

I have a big file and I would like to replace the first line with other content.
When I use {ok, IoDev} = file:open("/root/FileName", [write, raw, binary]), the whole content is removed.
But when I use {ok, IoDev} = file:open("/root/FileName", [append, raw, binary]) and file:pwrite(S, {bof,0}, <<"new content\n">>), I get the result {error, badarg}.
If I set Location to 0, as in file:pwrite(S, 0, <<"new content\n">>), the string is appended at the tail of the file.
You seem to be confused about the actual file API.
file:open/2 will truncate the file if you pass [write, raw, binary] as you do:
(about write mode): The file is opened for writing. It is created if it does not exist. If the file exists, and if write is not combined with read, the file will be truncated.
So you need to pass either [write, read] or [write, append] as documented.
file:pwrite/3 also works exactly as documented. It allows you to write at a given position in the file. In particular, you cannot pass {bof, 0} as second argument since you opened the file in raw mode:
If IoDevice has been opened in raw mode, some restrictions apply: Location is only allowed to be an integer; and the current position of the file is undefined after the operation.
The following sample code shows how they work:
ok = file:write_file("/tmp/file", "This is line 1.\nThis is line 2.\n"),
{ok, F} = file:open("/tmp/file", [read, write, raw, binary]),
ok = file:pwrite(F, 0, <<"This is line A.\n">>),
ok = file:close(F),
{ok, Content} = file:read_file("/tmp/file"),
io:put_chars(Content),
ok = file:delete("/tmp/file").
It will output:
This is line A.
This is line 2.
This works because text "This is line A.\n" is exactly as long as "This is line 1.\n". It does not really replace the line, but just bytes. If you need to replace the first line with content that has a different length, you need to rewrite the whole content of the file. A common approach is indeed to write a new file and swap them eventually. If the file is small enough, however, you can read it entirely in memory and rewrite it. file:read_file/1 and file:write_file/2 would work:
replace_first_line(Path, NewLine) ->
    {ok, Content} = file:read_file(Path),
    [FirstLine | Tail] = binary:split(Content, <<"\n">>),
    NewContent = [NewLine, <<"\n">> | Tail],
    ok = file:write_file(Path, NewContent).
The question is not really related to Erlang but to general file operations.
Replacing a line in a file requires rewriting the whole file. The easiest way to do so is to write all the new content to a new file and then move that file over the original, as sketched below.
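As an illustration of that pattern (shown in Python rather than Erlang; the path and replacement text are the ones from the question, and the helper name is made up), a minimal sketch might be:
import os
import tempfile

def replace_first_line(path, new_line):
    # Write the new content to a temporary file in the same directory,
    # then move it over the original file.
    dir_name = os.path.dirname(os.path.abspath(path))
    with open(path, 'r') as src, \
         tempfile.NamedTemporaryFile('w', dir=dir_name, delete=False) as tmp:
        src.readline()              # discard the original first line
        tmp.write(new_line + '\n')  # write the replacement line
        for line in src:            # copy the rest of the file unchanged
            tmp.write(line)
    os.replace(tmp.name, path)      # swap the new file into place

replace_first_line('/root/FileName', 'new content')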