How to find a word in an ASCII file using Python - indexing

I want to find a word and its index, but I am only getting its first position even though the word appears more than once in the file. The file's content is:
[MAKE DATA:STUDENT1=AENIE:AGE14,STUDENT2=JOHN:AGE15,STUDENT3=KELLY:AGE14,STUDENT4=JACK:AGE16,STUDENT5=SNOW:AGE16;SET RECORD:STUDENT1=GOOD,STUDENT2=,STUDENT3=BAD,STTUDENT4=,STUDENT5=GOOD]
Here is my code:
import sys,os,csv
x = str(raw_input("Enter file name :")) + '.ASCII'
fp = open(x,'r')
data = fp.read()
fp.close()
found = data.find("STUDENT1")
print found
Here the word "STUDENT1" appears twice, but my code gives only its first index position. I want its second index position too. Similarly, a word may appear several times in a file, so how can I find all of its index positions?

Use the optional start parameter to str.find() to search the string again starting after the previous match:
found = data.find("STUDENT1")
while found != -1:
    print found
    found = data.find("STUDENT1", found+1)
It would be slightly more efficient (but less concise) to use found+len("STUDENT1") instead of found+1.
Alternatively, you could use re.finditer():
import re
for match in re.finditer("STUDENT1", data):
    print match.start()
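Both approaches can be wrapped in small helpers. The following is a minimal Python 3 sketch (the snippets above use Python 2 print statements); the function names find_all and find_all_regex are illustrative and do not come from the original answer.

```python
import re


def find_all(text, word):
    """Return the start index of every occurrence of word, overlaps included."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        # Resume one character past the previous match.
        start = text.find(word, start + 1)
    return positions


def find_all_regex(text, word):
    """Same idea with re.finditer, which reports non-overlapping matches only."""
    # re.escape keeps characters such as '.' or '[' in word literal.
    return [match.start() for match in re.finditer(re.escape(word), text)]


sample = "[MAKE DATA:STUDENT1=AENIE:AGE14;SET RECORD:STUDENT1=GOOD]"
print(find_all(sample, "STUDENT1"))
print(find_all_regex(sample, "STUDENT1"))
```

The two helpers agree unless the word can overlap with itself (for example "aa" inside "aaa"), in which case only the str.find() loop reports every position.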

EOL while scanning SyntaxError for document word count code

My lecturer handed me some code that is meant to work out the word count for my document (markdown only). When I used it, it worked out the digit count, not the word count. I believe the problem is in the penultimate line, in .split(): the code initially had '' in the brackets, which I removed to make use of the default (split on whitespace), but now I get an error.
Any help greatly appreciated, novice coder problems.
import io
from nbformat import read
filepath='DSRM Report 2.ipynb'
with io.open(filepath, 'r', encoding='utf-8') as f:
    nb=read(f, 4)
word_count = 0
for cell in nb['cells']:
    if cell.cell_type == 'markdown':
        word_count += len(cell['source'].replace('#',').lstrip().split())
print("Submission length is {}".format(word_count))
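The "EOL while scanning string literal" error most likely comes from .replace('#','): that call contains three quote characters, so the last string literal is never closed. Below is a minimal sketch of the counting logic with the quotes balanced. To keep it self-contained it assumes the cells are plain dicts with cell_type and source keys (as in notebook JSON) rather than objects loaded through nbformat, and the helper name count_markdown_words is hypothetical.

```python
def count_markdown_words(cells):
    """Count whitespace-separated words in the markdown cells only."""
    total = 0
    for cell in cells:
        if cell.get("cell_type") == "markdown":
            # Drop heading markers, then split on any run of whitespace.
            text = cell["source"].replace("#", "")
            total += len(text.split())
    return total


# Hypothetical cells standing in for a loaded notebook.
cells = [
    {"cell_type": "markdown", "source": "# Title here\nsome words"},
    {"cell_type": "code", "source": "x = 1"},
]
print("Submission length is {}".format(count_markdown_words(cells)))
```

Calling split() with no argument splits on whitespace, which gives a word count; passing an empty string as the separator is not allowed in Python and raises ValueError.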

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters in a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(lambda x: x.encode('unicode_escape').
                               decode('utf-8') if isinstance(x, str) else x)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
The same problem happened to me. I solved it as follows:
Install the Python package xlsxwriter:
pip install xlsxwriter
Replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem:
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern which causes the IllegalCharacterError error to be raised.
Open cell.py, which is found at /path/to/your/python/site-packages/openpyxl/cell/, and look for the check_string function. You'll see it uses a regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Its definition is this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy this line to your program and execute the below code before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source Excel spreadsheets contain those characters, you will still face this problem. So if you can control how the source spreadsheets are generated, try to remove these characters there to begin with.
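To see what the pattern does without needing pandas or openpyxl installed, here is a minimal standard-library sketch that applies the same regular expression to a nested list of cell values. The helper name strip_illegal and the sample rows are illustrative assumptions.

```python
import re

# Pattern copied from the openpyxl line quoted above.
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')


def strip_illegal(value):
    """Remove the matched control characters from strings; pass other values through."""
    if isinstance(value, str):
        return ILLEGAL_CHARACTERS_RE.sub('', value)
    return value


# Hypothetical rows standing in for dataframe cells.
rows = [["ok", "bad\x00value"], [42, "tab\tand\nnewline stay"]]
cleaned = [[strip_illegal(cell) for cell in row] for row in rows]
print(cleaned)
```

The pattern deliberately skips tab, newline and carriage return (octal 011, 012 and 015), so ordinary multi-line cell text is left intact.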
I was also struggling with some weird characters in a data frame when writing it to HTML or CSV. For example, I couldn't write accented characters to an HTML file, so I needed to convert them into their unaccented equivalents.
My method may not be the best, but it helps me convert unicode strings into ASCII-compatible ones.
# install unidecode first
from unidecode import unidecode
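def FormatString(s):
    if isinstance(s, unicode):
        try:
            # Pure-ASCII strings are returned unchanged
            s.encode('ascii')
            return s
        except UnicodeEncodeError:
            # Otherwise transliterate to the closest ASCII form
            return unidecode(s)
    else:
        return s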
df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return 'StringYouWantToReplace'.
Hope this gives you some ideas for dealing with your problem.
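You can use the built-in strip() method of Python strings. Note that strip() only removes leading and trailing whitespace, so it will not remove characters from the middle of a string.
For each cell:
text = str(illegal_text).strip()
For the entire data frame: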
dataframe = dataframe.applymap(lambda t: str(t).strip())
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()

How to split a CSV file into groups using Pentaho?

I am new to Pentaho and am trying to read a CSV file (which I already did) and create blocks of data based on an identifier.
E.g.:
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
1|N|O|P
4|Q|R|S|T
5|U|V|W
I need to split and group this as such:
(each block starts when the first column is equal to '1')
Block a)
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
Block b)
1|N|O|P
4|Q|R|S|T
5|U|V|W
E.g., with the block label added as a new first column:
a |1|A|B|C
a |2|D|E|F
a |8|G|H|I|J|K
a |4|L|M
b |1|N|O|P
b |4|Q|R|S|T
b |5|U|V|W
How can this be achieved using Pentaho? Thanks.
I found a similar question, but the answers don't really help in my case:
Pentaho Kettle split CSV into multiple records
I think I got the answer.
I created the transformation in this zip, which transforms your "csv" file into rows almost as you described, but I don't know what you intend to do next, so maybe you can give us more details. =)
I'll explain what I did:
1) First, we grab the row full text with a Text input step
When you look at the configuration of the Text input step, you'll see I used ';' as the separator, while your input file uses '|'. So I'm not splitting columns on '|' but loading the whole line into one column: grabbing the row's full text, nothing else.
2) Next we apply a regex eval to separate the ID from the rest of our string.
^(\d+)\|(.*)
Which means: in the beginning of the text I expect one or more digits followed by a pipe and anything after that. Capture the digits in the beginning of the string in one column and everything after the pipe to another column.
That gives you this output: (blue is the first capture group, red is the second)
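The same regular expression can be tried outside Pentaho. This is a small Python sketch rather than part of the transformation itself; the helper name split_row is illustrative.

```python
import re

# Same pattern as in step 2: leading digits, a pipe, then the rest of the line.
ROW_RE = re.compile(r"^(\d+)\|(.*)")


def split_row(line):
    """Return (row_id, rest) for a matching line, or None if it does not match."""
    match = ROW_RE.match(line)
    return match.groups() if match else None


for line in ["1|A|B|C", "8|G|H|I|J|K", "4|L|M"]:
    print(split_row(line))
```

Only the first pipe is consumed by the pattern, so the remaining fields stay together in the second capture group.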
3) Now you need to add a 'sequence' that only goes up when row_id = 1. I did this in the Mod JS Value step with the following code:
var sequence
//if it's the first row, set sequence to 1
if(sequence == null){
    sequence = 1;
}else{
    //if it's not the first row, check if the row_id is equal to 1 (string)
    if(row_id == '1'){
        // increment the sequence
        sequence++;
    }else{
        //nothing
    }
}
And that will give you this output, which seems to be what you expected: (green, the group/sequence done)
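For readers without Pentaho at hand, the sequence logic can be sketched in plain Python. This mirrors the script above under the assumption that rows arrive in file order; the function name label_blocks is hypothetical, and it uses numeric block labels where the question used letters.

```python
def label_blocks(lines):
    """Prefix each row with a block number that increases whenever the id is '1'."""
    sequence = 0
    labelled = []
    for line in lines:
        # Split off the id only; the remaining fields stay joined by '|'.
        row_id, _, rest = line.partition("|")
        # The first row always opens block 1; later rows open a new block
        # only when their id is the string '1'.
        if sequence == 0 or row_id == "1":
            sequence += 1
        labelled.append((sequence, row_id, rest))
    return labelled


rows = ["1|A|B|C", "2|D|E|F", "8|G|H|I|J|K", "4|L|M",
        "1|N|O|P", "4|Q|R|S|T", "5|U|V|W"]
for block, row_id, rest in label_blocks(rows):
    print("{} |{}|{}".format(block, row_id, rest))
```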
Hope it helps =)

Fortran runtime error: Bad integer for item 0 in list input?

How do I fix the Fortran runtime error: Bad integer for item 0 in list input?
Below is the Fortran program which generates a runtime error.
      CHARACTER CNFILE*(*)
      REAL BOX
      INTEGER CNUNIT
      PARAMETER ( CNUNIT = 10 )
      INTEGER NN
      OPEN ( UNIT = CNUNIT, FILE = CNFILE, STATUS = 'OLD' )
      READ ( CNUNIT,* ) NN, BOX
The error message received from gdb is :
At line 688 of file MCNPT.f (unit = 10, file = 'LATTICE-256.txt')
Fortran runtime error: Bad integer for item 0 in list input
[Inferior 1 (process 3052) exited with code 02]
(gdb)
I am not sure what options must be specified for READ() to read two numbers from the text file. Does it matter whether the two numbers on the same line are written as integers or as reals in the text file?
Below is the gdb execution of the program using a break point at the open call
Breakpoint 1, readcn (
cnfile=<error reading variable: Cannot access memory at address 0x7fffffffdff0>,
box=-3.37898272e+33, _cnfile=30) at MCNPT.f:686
Since you did not specify form="unformatted" on the open statement, the unit / file is opened for formatted IO. This is appropriate for a human-readable text file. ("unformatted" would be used for a non-human readable file in computer-native format, sometimes called "binary".) Therefore you should provide a format on the read, or use list-directed read, i.e., read(unit, *). To advise on a particular format we would have to know the layout of the numbers in the file. A possible read with format is: read (CNUNIT, '(I4, 2X, F6.2)' ) NN, BOX
P.S. I'm answering the question in your question and not the title, which seems unrelated.
EDIT: now that you have shown the text data file, a list-directed read looks easier, because the data doesn't line up in columns. It seems that the file has two integers on the first line, then three real numbers on each of the following lines. Most likely you need a different read for the first line. Is the code sample that you are showing us trying to read the first line, or one of the later lines? If the first line, it would seem plausible to read into two integer variables. If a later line, into two or three real variables (two if you wish to skip the third data item on the line).
EDIT 2: the question has been substantially altered several times, which is very confusing. The first line of the text file that was shown in one version of the question contained integers, with later lines having reals. Since the list-directed read is reading into an integer and a floating-point variable, it will have problems if you attempt to use it on the later lines, which contain real values.
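The mismatch described in EDIT 2 can be illustrated with a rough Python analogy. This is not a model of Fortran's list-directed input rules; the helper name read_int_and_real and the sample lines are hypothetical, since the actual file contents are not shown here.

```python
def read_int_and_real(line):
    """Parse the first two space- or comma-separated items as (int, float)."""
    items = line.replace(",", " ").split()
    nn = int(items[0])     # fails if the first item is not written as an integer
    box = float(items[1])  # accepts both "6" and "6.75"
    return nn, box


# A line starting with an integer parses as expected.
print(read_int_and_real("256 6.75"))

# A line of reals fails on the first item, much like "Bad integer for item ...".
try:
    read_int_and_real("0.5 0.5 0.5")
except ValueError as err:
    print("bad integer:", err)
```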

How do associative arrays work in awk?

I wanted to remove duplicate lines from a file based on a column. A quick search led me to this page, which had the following solution:
awk '!x[$1]++' filename
It works, but I am not sure how it works. I know it uses associative arrays in awk, but I am not able to infer anything beyond that.
Update:
Thanks everyone for the explanation. With my new knowledge, I have written a blog post with further explanation of how it works.
That awk script !x[$1]++ fills an array named x. Suppose the first word ($1 refers to the first word in a line of text) in a line of text is line1. It effectively results in this operation on the array:
x["line1"]++
The "index" (the key) of the array is the text encountered in the file (line1 in this example), and the value associated with that key is an integer that is incremented by 1.
When a key is encountered for the first time, the current value in the array is zero, which is then post-incremented to 1. The not operator ! evaluates to non-zero (true) for each new key, and so the line is printed. The next time the same key is encountered, the value in the array is non-zero, so the not operation results in zero (false) and the line is not printed.
A less "clever" way of writing the same thing (but possibly more clear and less fun) would be this:
{
    if (x[$1] == 0)
        print
    x[$1]++
}
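For comparison, the same idea can be sketched in Python with a dictionary of counters. This is a paraphrase of the awk logic, not a replacement for it; the helper name dedupe_by_first_field is hypothetical.

```python
from collections import defaultdict


def dedupe_by_first_field(lines):
    """Keep only the first line seen for each distinct first field."""
    seen = defaultdict(int)  # missing keys start at 0, like awk array elements
    kept = []
    for line in lines:
        fields = line.split()
        key = fields[0] if fields else ""  # awk's $1 is empty on a blank line
        if seen[key] == 0:                 # the !x[$1] test
            kept.append(line)
        seen[key] += 1                     # the ++ post-increment
    return kept


lines = ["line1 a", "line2 b", "line1 c", "line3 d", "line2 e"]
print(dedupe_by_first_field(lines))
```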