i am extracting selected pages from a pdf file. and want to assign dataframe name based on the pages extracted:
file = "abc"
selected_pages = ['10','11'] #can be any combination eg ['6','14','20]
for i in selected_pages():
df{str(i)} = read_pdf(path + file + ".pdf",encoding = 'ISO-8859-1', stream = True,area = [100,10,740,950],pages= (i), index = False)
print (df{str(i)} )
The idea, ultimately, as in above example, is to have dataframes: df10, df11. I have tried "df" + str(i), "df" & str(i) & df{str(i)}. however all are giving error msg: SyntaxError: invalid syntax
Or any better way of doing it is most welcome. thanks
This is where a dictionary would be a much better option.
Also note the error you have at the start of the loop. selected_pages is a list, so you can't do selected_pages().
file = "abc"
selected_pages = ['10','11'] #can be any combination eg ['6','14','20]
df = {}
for i in selected_pages:
df[i] = read_pdf(path + file + ".pdf",encoding = 'ISO-8859-1', stream = True, area = [100,10,740,950], pages= (i), index = False)
i = int(i) - 1 # this will bring it to 10
dfB = df[str(i)]
#select row number to drop: 0:4
dfB.drop(dfB.index[0:4],axis =0, inplace = True)
dfB.columns = ['col1','col2','col3','col4','col5']
Related
I am trying to create vCards (Email contacts) unsing the vobject library and pandas for python.
When serializing the values I get new lines in the "notes" of my output(no new lines in source). In every new line, created by ".serialize()", there is also a space in the beginning. I would need to get rid of both.
Example of output:
BEGIN:VCARD
VERSION:3.0
EMAIL;TYPE=INTERNET:test#example.at
FN:Valentina test
N:Valentina;test;;;
NOTE:PelletiererIn Mitglieder 2 Preiserhebung Aussendung 2 Pressespiegelver
sand 2 GeschäftsführerIn PPA_PelletiererInnen GeschäftsführerIn_Pellet
iererIn
ORG:Test Company
TEL;TYPE=CELL:
TEL;TYPE=CELL:
TEL;TYPE=CELL:
END:VCARD
Is there a way that I can force the output in a single line?
output = ""
for _,row in df.iterrows():
j = vobject.vCard()
j.add('n')
j.n.value = vobject.vcard.Name(row["First Name"],row["Last Name"])
j.add('fn')
j.fn.value = (str(row["First Name"]) + " " + row["Last Name"])
o = j.add("email")
o.value = str((row["E-mail Address"]))
o.type_param = "INTERNET"
#o = j.add("email")
#o.value = str((row["E-mail 2 Address"]))
#o.type_param = "INTERNET"
j.add('org')
j.org.value = [row["Organization"]]
k = j.add("tel")
k.value = str(row["Home Phone"])
k.type_param = "CELL"
k = j.add("tel")
k.value = str(row["Business Phone"])
k.type_param = "CELL"
k = j.add("tel")
k.value = str(row["Mobile Phone"])
k.type_param = "CELL"
j.add("note")
j.note.value = row["Notiz für Kontaktexport"]
output += j.serialize()
print(output)
I am doing the following in my python script and I want to hide the index column when I print the dataframe. So I used .to_string(index = False) and then use len() to see if its zero or not. However, when i do to_string(), if the dataframe is empty the len() doesn't return zero. If i print the procinject1 it says "Empty DataFrame". Any help to fix this would be greatly appreciated.
procinject1=dfmalfind[dfmalfind["Hexdump"].str.contains("MZ") == True].to_string(index = False)
if len(procinject1) == 0:
print(Fore.GREEN + "[✓]No MZ header detected in malfind preview output")
else:
print(Fore.RED + "[!]MZ header detected within malfind preview (Process Injection indicator)")
print(procinject1)
That's the expected behaviour in Pandas DataFrame.
In your case, procinject1 stores the string representation of the dataframe, which is non-empty even if the corresponding dataframe is empty.
For example, check the below code snippet, where I create an empty dataframe df and check it's string representation:
df = pd.DataFrame()
print(df.to_string(index = False))
print(df.to_string(index = True))
For both index = False and index = True cases, the output will be the same, which is given below (and that is the expected behaviour). So your corresponding len() will always return non-zero.
Empty DataFrame
Columns: []
Index: []
But if you use a non-empty dataframe, then the outputs for index = False and index = True cases will be different as given below:
data = [{'A': 10, 'B': 20, 'C':30}, {'A':5, 'B': 10, 'C': 15}]
df = pd.DataFrame(data)
print(df.to_string(index = False))
print(df.to_string(index = True))
Then the outputs for index = False and index = True cases respectively will be -
A B C
10 20 30
5 10 15
A B C
0 10 20 30
1 5 10 15
Since pandas handles empty dataframes differently, to solve your problem, you should first check whether your dataframe is empty or not, using pandas.DataFrame.empty.
Then if the dataframe is actually non-empty, you could print the string representation of that dataframe, while keeping index = False to hide the index column.
I have xpaths as follow:
/html/body/div[1]/table[3]/tbody/tr[1]/td[3]/a
/html/body/div[1]/table[3]/tbody/tr[2]/td[3]/a
/html/body/div[1]/table[3]/tbody/tr[3]/td[3]/a
/html/body/div[1]/table[3]/tbody/tr[4]/td[3]/a
As you can see, the tr[] values are changing. I want iterate over these values.
Below is the code I have used
search_input = driver.find_elements_by_xpath('/html/body/div[1]/table[3]/tbody/tr[3]/td[3]/a')
Please let me know How can I iterate over them.
This may not be the exact solution you are looking for but this is the idea.
tableRows = driver.find_elements_by_xpath("/html/body/div[1]/table[3]/tbody/tr")
for e in tableRows:
e.find_element_by_xpath(".//td[3]/a")
if you want all third td for every row, use this:
search_input = driver.find_elements_by_xpath('/html/body/div[1]/table[3]/tbody/tr/td[3]/a')
If you want only the first 3 rows use this:
search_input = driver.find_elements_by_xpath('/html/body/div[1]/table[3]/tbody/tr[position() < 4]/td[3]/a')
For looping through tds look I.e. this answer
Another alternative, assuming 4 elements:
for elem in range(1,5):
element = f"/html/body/div[1]/table[3]/tbody/tr[{elem}]/td[3]/a"
#e = driver.find_element_by_xpath(element)
#e.click()
print(element)
Prints:
/html/body/div[1]/table[3]/tbody/tr[1]/td[3]/a
/html/body/div[1]/table[3]/tbody/tr[2]/td[3]/a
/html/body/div[1]/table[3]/tbody/tr[3]/td[3]/a
/html/body/div[1]/table[3]/tbody/tr[4]/td[3]/a
You could do whatever you wanted with the elements in the loop, I just printed to show the value
Option 1: Fixed Column number like 3 needs to be be iterated:
rows = len(self.driver.find_elements_by_xpath("/html/body/div[1]/table[3]/tbody/tr"))
for row in range(1, (rows + 1)):
local_xpath = ""/html/body/div[1]/table[3]/tbody/tr[" + str(row) + "]/td[3]"
# do something with element
# cell_text = self.driver.find_element_by_xpath(local_xpath ).text
Option 2: Both Row and Col needs to be iterated:
rows = len(self.driver.find_elements_by_xpath("/html/body/div[1]/table[3]/tbody/tr"))
columns = len(self.driver.find_elements_by_xpath("/html/body/div[1]/table[3]/tbody/tr/td"))
for row in range(1, (rows + 1)):
for column in range(1, (columns + 1)):
local_xpath = ""/html/body/div[1]/table[3]/tbody/tr[" + str(row) + "]/td[" + str(column) + "]"
# do something with element
# cell_text = self.driver.find_element_by_xpath(local_xpath ).text
I'm iterating through PDF's to obtain the text entered in the form fields. When I send the rows to a csv file it only exports the last row. When I print results from the Dataframe, all the row indexes are 0's. I have tried various solutions from stackoverflow, but I can't get anything to work, what should be 0, 1, 2, 3...etc. are coming in as 0, 0, 0, 0...etc.
Here is what I get when printing results, only the last row exports to csv file:
0
0 1938282828
0
0 1938282828
0
0 22222222
infile = glob.glob('./*.pdf')
for i in infile:
if i.endswith('.pdf'):
pdreader = PdfFileReader(open(i,'rb'))
diction = pdreader.getFormTextFields()
myfieldvalue2 = str(diction['ID'])
df = pd.DataFrame([myfieldvalue2])
print(df)`
Thank you for any help!
You are replacing the same dataframe each time:
infile = glob.glob('./*.pdf')
for i in infile:
if i.endswith('.pdf'):
pdreader = PdfFileReader(open(i,'rb'))
diction = pdreader.getFormTextFields()
myfieldvalue2 = str(diction['ID'])
df = pd.DataFrame([myfieldvalue2]) # this creates new df each time
print(df)
Correct Code:
infile = glob.glob('./*.pdf')
df = pd.DataFrame()
for i in infile:
if i.endswith('.pdf'):
pdreader = PdfFileReader(open(i,'rb'))
diction = pdreader.getFormTextFields()
myfieldvalue2 = str(diction['ID'])
df = df.append([myfieldvalue2])
print(df)
I have several files (yml, tf, xml) for which I need to find a string i.e. var1, and then insert a new line with foo2, the rest of the line is unchanged.
Example
variable "my_vars" {
type = "map"
default = {
var1 = "10.48.225.160/28"
var2 = "10.48.225.160/28"
var3 = "10.48.225.160/28"
var4 = "10.48.225.160/28"
}
}
I tried the code below but I need the edit in place.
import sys
import string
def find(substr, replstr, infile):
f = open(infile,"rw")
lines = f.readlines()
for i in range(len(lines)):
if substr in lines[i]:
j = string.replace(lines[i], substr, replstr)
lines.insert(i + 1, j)
print "\n".join(lines)
old_env = sys.argv[1]
new_env = sys.argv[2]
file = sys.argv[3]
find(old_env, new_env, file)
import sys
import string
def find(substr, replstr, infile):
f = open(infile,"r")
lines = f.readlines()
for i in range(len(lines)):
if substr in lines[i]:
j = string.replace(lines[i], substr, replstr)
lines.insert(i + 1, j)
print "".join(lines)
f.close()
f = open(infile,"w")
k = "".join(lines)
f.writelines(k)
f.close()
old_env = sys.argv[1]
new_env = sys.argv[2]
file = sys.argv[3]
find(old_env, new_env, file)
The one caveat is there is a match on the last line of the file, the iterator will miss this.