How can I print a certain line using for line in lines and line length in Python?

I have to use the import sys module for this syntax. What I have so far is this:
import sys
file = sys.argv[1]
fp1 = open(file, 'r+')
fp2 = open(file + 'cl.', 'w+')
lines = fp1.readlines()
for line in lines:
    if len(line) > 1 and line[0] == 'Query':
        print line.split('|')[0:2]
fp1.close()
Basically, when I run this on the command line:
python homework4.py sqout
it gives me nothing, but if I take away the line[0]=='Query' condition it prints the first 2 splits of every line (which I want it to do), just not for every line. I only want it to print the first line which starts with Query. Thanks

line[0] is just the first character of the string line, so it can never equal the five-character string 'Query'. You could use line[0:5]=='Query' or line[:5]=='Query'.
Before doing this, I suggest first checking that len(line)>4, or using an exception.
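As an aside (not part of the original answer), str.startswith avoids both pitfalls at once, since it neither raises on short lines nor needs an explicit length check; a minimal sketch with made-up sample lines:

```python
# Hypothetical sample lines; only those beginning with 'Query' are kept,
# and the first two '|'-separated fields are collected.
lines = ["Query|gi|12345\n", "Hit|gi|67890\n", "Q\n"]
selected = []
for line in lines:
    # startswith() is safe even when the line is shorter than 5 characters
    if line.startswith('Query'):
        selected.append(line.rstrip('\n').split('|')[0:2])
print(selected)
```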

Related

Using Python UDF with Hive

I am trying to learn using Python UDF's with Hive.
I have a very basic python UDF here:
import sys
for line in sys.stdin:
    line = line.strip()
    print line
Then I add the file in Hive:
ADD FILE /home/hadoop/test2.py;
Now I call the Hive Query:
SELECT TRANSFORM (admission_type_id, description)
USING 'python test2.py'
FROM admission_type;
This works as expected: no changes are made to the fields and the output is printed as is.
Now, when I modify the UDF by introducing the split function, I get an execution error. How do I debug here, and what am I doing wrong?
New UDF:
import sys
for line in sys.stdin:
    line = line.strip()
    fields = line.split('\t')  # when this line is introduced, I get an execution error
    print line
import sys
for line in sys.stdin:
    line = line.strip()
    field1, field2 = line.split('\t')
    print '\t'.join([str(field1), str(field2)])
SELECT TRANSFORM (admission_type_id, description)
USING 'python test2.py' AS (admission_type_id_new, description_new)
FROM admission_type;
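Since Hive's TRANSFORM simply pipes rows through the script's stdin/stdout as tab-separated text, one way to debug is to run the UDF logic locally against sample rows. The sketch below (hypothetical sample data, not the poster's table) also guards the split so a malformed row cannot crash the unpacking:

```python
def transform_row(line):
    """Mimic the streaming UDF: split a tab-separated row and rejoin it.
    Rows without exactly two fields are passed through unchanged, so a
    bad row cannot raise an unpacking error (a defensive assumption)."""
    parts = line.strip().split('\t')
    if len(parts) == 2:
        return '\t'.join(parts)
    return line.strip()

# Hypothetical rows standing in for (admission_type_id, description)
rows = ["1\tEmergency", "2\tUrgent", "malformed-row"]
out = [transform_row(r) for r in rows]
print(out)
```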

awk/sed - generate an error if 2nd address of range is missing

We are currently using sed to filter output of regression runs. Sometimes we have a filter that looks like this:
/copyright/,/end copyright/d
If that end copyright is ever missing, the rest of the file is deleted. I'm wondering if there's some way to generate an error for this? awk would also be okay to use. I don't really want to add code that reads the file line by line and issues an error if it hits EOF.
here's a string
copyright
2016 jan 15
end copyright
date 2016 jan 5 time 15:36
last one
I'd like to get an error if end copyright is missing. The real filter would also replace the date line with DATE, so it's more than just ripping out the copyright.
You can persuade sed to generate an error if you reach end of input (i.e. see address $) between your start and end, but it won't be a very helpful message:
/copyright/,/end copyright/{
$s//\1/ # here
d
}
This will error if end copyright is missing or on the last line, with an exit status of 1 and the helpful message:
sed: -e expression #1, char 0: invalid reference \1 on `s' command's RHS
If you're using this in a makefile, you might want to echo a helpful message first, or (better) to wrap this in something that catches the error and produces a more useful one.
I tested this with GNU sed; though if you are using GNU sed, you could more easily use its useful extension:
q [EXIT-CODE]
This command only accepts a single address.
Exit 'sed' without processing any more commands or input. Note
that the current pattern space is printed if auto-print is not
disabled with the -n options. The ability to return an exit code
from the 'sed' script is a GNU 'sed' extension.
Q [EXIT-CODE]
This command only accepts a single address.
This command is the same as 'q', but will not print the contents of
pattern space. Like 'q', it provides the ability to return an exit
code to the caller.
So you could simply write
/copyright/,/end copyright/{
$Q 42
d
}
Never use range expressions like /start/,/end/: they make trivial code very slightly briefer, but they require a complete rewrite or duplicated conditions when you have the tiniest requirements change. Always use a flag instead. Note that since sed doesn't support variables, it doesn't support flag variables either, so you shouldn't be using sed here; you should be using awk instead.
In this case your original code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0}' file
And your modified code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0} END{if (f) print "Missing end copyright"}' file
The above is obviously untested since you didn't provide any sample input/output we could test a potential solution against.
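For illustration, the same flag-based logic can be sketched in Python (a hypothetical translation of the awk one-liner, using the question's sample text): set a flag on the start marker, clear it on the end marker, and complain at end of input if it is still set:

```python
def strip_copyright(lines):
    """Drop everything from 'copyright' through 'end copyright' inclusive;
    raise at EOF if the closing marker never arrived (mirrors the awk END block)."""
    out, in_block = [], False
    for line in lines:
        if not in_block and 'copyright' in line:
            in_block = True          # start of the block; the marker line is dropped too
            continue
        if in_block:
            if 'end copyright' in line:
                in_block = False     # end marker found; drop it and resume output
            continue
        out.append(line)
    if in_block:
        raise ValueError("Missing end copyright")
    return out

text = ["here's a string", "copyright", "2016 jan 15",
        "end copyright", "date 2016 jan 5 time 15:36", "last one"]
print(strip_copyright(text))
```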
With sed you can build a loop:
sed -e '/copyright/{:a;/end copyright/d;N;ba;};' file
:a defines the label "a"
/end copyright/d deletes the pattern space, but only when "end copyright" matches
N appends the next line to the pattern space
ba jumps back to the label "a"
Note that d also ends the loop.
In this way you avoid deleting the text all the way to the end of the file.
If you don't want the text to be displayed at all and prefer an error message when a "copyright" block stays unclosed, you obviously need to wait for the end of the file. You can do that with sed too, by storing all the lines in the hold space until the end:
sed -n -e '/copyright/{:a;/end copyright/d;${c\ERROR MESSAGE
;};N;ba;};H;${g;p};' file
H appends the current line to the hold space
g puts the content of the hold space into the pattern space
The file content is only displayed once the last line is reached, via ${g;p}; otherwise, when the closing "end copyright" is missing, the current line is changed into the error message by ${c\ERROR MESSAGE\n;} inside the loop.
This way you can check sed's output and exit status before redirecting it to wherever you want.

file seek in wlst / Jython 2.2.1 fails for lines longer than 8091 characters

For a CSV file generated in WLST / Jython 2.2.1 I want to update the header, the first line of the output file, when new metrics have been detected. This works fine by using seek to go to the first line and overwriting it, but it fails when the number of characters in the first line exceeds 8091.
I made a simplified script which reproduces the issue I am facing:
#!/usr/bin/python
#
import sys
global maxheaderlength
global initheader
maxheaderlength=8092
logFilename = "test.csv"
# Create (overwrite existing) file
logfileAppender = open(logFilename,"w",0)
logfileAppender.write("." * maxheaderlength)
logfileAppender.write("\n")
logfileAppender.close()
# Append some lines
logfileAppender = open(logFilename,"a",0)
logfileAppender.write("2nd line\n")
logfileAppender.write("3rd line\n")
logfileAppender.write("4th line\n")
logfileAppender.write("5th line\n")
logfileAppender.close()
# Seek back to beginning of file and add data
logfileAppender = open(logFilename, "r+", 0)
logfileAppender.seek(0)
header = "New Header Line" + "." * maxheaderlength
header = header[:maxheaderlength]
logfileAppender.write(header)
logfileAppender.close()
When maxheaderlength is 8091 or lower I get the results as expected. The file test.csv starts with "New Header Line" followed by 8076 dots and
followed by the lines
2nd line
3rd line
4th line
5th line
When maxheaderlength is 8092 or higher, test.csv results in a file starting with 8092 dots, followed by "New Header Line" and then 8077 more dots. The 2nd ... 5th lines are gone, probably overwritten by the dots.
Any idea how to work around or fix this ?
I too was able to reproduce this extremely odd behaviour, and indeed it works correctly in Jython 2.5.3, so I think we can safely say this is a bug in 2.2.1 (which unfortunately you're stuck with for WLST).
My usual recourse in these circumstances is to fall back to native Java methods. Changing the last block of code as follows seems to work as expected:
# Seek back to beginning of file and add data
from java.io import RandomAccessFile
logfileAppender = RandomAccessFile(logFilename, "rw")
logfileAppender.seek(0)
header = "New Header Line" + "." * maxheaderlength
header = header[:maxheaderlength]
logfileAppender.writeBytes(header)
logfileAppender.close()
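For comparison (a sanity check, not WLST-specific), the same seek-and-overwrite pattern behaves correctly in ordinary Python, even past the 8091-character mark; a self-contained sketch using a temporary file:

```python
import os
import tempfile

# Build a file whose first line is 8092 dots (the length that triggers
# the Jython 2.2.1 bug), then seek back and overwrite the header in place.
maxheaderlength = 8092
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'w') as f:
    f.write('.' * maxheaderlength + '\n')
    f.write('2nd line\n')

with open(path, 'r+') as f:
    f.seek(0)
    header = ('New Header Line' + '.' * maxheaderlength)[:maxheaderlength]
    f.write(header)  # exactly as long as the old header, so nothing shifts

with open(path) as f:
    first = f.readline().rstrip('\n')
    second = f.readline().rstrip('\n')
os.remove(path)
print(first[:15], second)
```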

File line splitting in Jython

I am trying to read a file and populate the values in a DB with the help of Jython in ODI.
For this, I read the lines one by one and split each line on the ',' present.
Now I have a line as
4JGBB8GB5AA557812,,Miss,Maria,Cruz,,"266 Faller Drive Apt. B",
New Milford,NJ,07646,2015054604,2015054604,20091029,51133,,,
N,LESSEE,"MERCEDES-BENZ USA, LLC",N,N
The field "MERCEDES-BENZ USA, LLC" has a , within the double quotes, due to which it gets split into two fields whereas it should be considered only one. Can someone please tell me how I should avoid this?
fields = valueList.split(',')
I use this for splitting, where valueList is the individual line read from the file.
You can use the csv module, which takes care of quotes:
line = '4JGBB8GB5AA557812,,Miss,Maria,Cruz,,"266 Faller Drive Apt. B",New Milford,NJ,07646,2015054604,2015054604,20091029,51133,,,N,LESSEE,"MERCEDES-BENZ USA, LLC",N,N'
import StringIO
import csv
f = StringIO.StringIO(line)
reader = csv.reader(f, delimiter=',')
for row in reader:
    print('\n'.join(row))
result:
...
266 Faller Drive Apt. B
...
LESSEE
MERCEDES-BENZ USA, LLC
...
My example uses StringIO because the test line is a string in the code; you can simply use an open file handle as f instead.
You will find more examples at "Module of the Month": http://pymotw.com/2/csv/index.html#module-csv
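A side note (not in the original answer): in Python 3 the StringIO class lives in the io module; the csv call itself is unchanged:

```python
import csv
import io

# Shortened sample line; the quoted field still contains a comma.
line = '4JGBB8GB5AA557812,,Miss,"MERCEDES-BENZ USA, LLC",N,N'
reader = csv.reader(io.StringIO(line), delimiter=',')
row = next(reader)
print(row)
```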

How can I delete a specific line (e.g. line 102,206,973) from a 30gb csv file?

What method can I use to delete a specific line from a csv/txt file that is too big to load into memory and edit manually?
Background
My question is actually an indirect solution to a problem related with importing csv into sql databases.
I have a series of 10-30gb csv files that I want to import into an SQLite table from within R (since they are too large to import into R as whole data frames). I am using the 'RSQLite' package for this.
A couple fail because of an error related to one of the lines being badly formatted, and the populating process is then cancelled. R returns the line number which caused the process to fail.
The error given is:
./csvfilename line 102206973 expected 9 columns of data but found 3)
So I know exactly the line which causes the error.
I see 2 potential 'indirect' solutions which I was hoping someone could help me with.
(i) Deleting the line causing the error in 20+gb files. e.g. line 102,206,973 in the example above.
I am not concerned with 'losing' the data in line 102,206,973 by just skipping or deleting it. However, I have tried and failed to find a way to access the csv file and remove the line.
(ii) Using sqlite directly (or anything else?) to import the csv in a way that allows you to skip lines or errors.
Although not likely to be related directly to the solution, here is the R code used.
db <- dbConnect(SQLite(), dbname=name_of_table)
dbWriteTable(conn = db, name ="currentdata", value = csvfilename, row.names = FALSE, header = TRUE)
Thanks!
To delete a specific line you can use sed:
sed -e '102206973d' your_file
If you want the replacement to be done in-place, do
sed -i.bak -e '102206973d' your_file
This will create a backup named your_file.bak, and your_file will have the specified line removed.
Example
$ cat a
1
2
3
4
5
$ sed -i.bak -e '3d' a
$ cat a
1
2
4
5
$ cat a.bak
1
2
3
4
5
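If sed is not available (say, on Windows), the same constant-memory deletion can be sketched in Python: stream the file line by line and skip the offending line number. (delete_line and its arguments are illustrative names, not from the original answer.)

```python
def delete_line(src, dst, lineno):
    """Copy src to dst, skipping the 1-based line number `lineno`.
    Streams one line at a time, so memory use stays flat even for 30 GB files."""
    with open(src) as fin, open(dst, 'w') as fout:
        for i, line in enumerate(fin, start=1):
            if i != lineno:
                fout.write(line)
```

Afterwards the destination file can replace the original (e.g. with os.replace), mirroring sed -i's backup-and-swap behaviour.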