How to read text line by line from a PDF file

import pyPdf
f= open('jayabal_appt.pdf','rb')
pdfl = pyPdf.PdfFileReader(f)
output = pyPdf.PdfFileWriter()
content=""
for i in range(0,1):
    content += pdfl.getPage(i).extractText() + "\n"
outpu = open('b.txt','wb')
outpu.write(content)
f.close()
outpu.close()
This does not write the content of the PDF to a text file. What should I do?

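Iterate through every page (not just the first one) and call extractText() on each, like so:
content = ""
# getNumPages() returns the number of pages in the opened PDF
for i in range(0, pdfl.getNumPages()):
    content += pdfl.getPage(i).extractText() + "\n"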
Once you have the full contents you can easily split the lines via the '\n' separator.
EDIT:
After the for loop, check whether the variable content actually contains any text. Not all PDF files contain text information.
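A minimal sketch of the splitting and emptiness check described above, assuming the extracted text has already been collected into a string. The helper name non_empty_lines and the sample string are hypothetical and stand in for real extractText() output:

```python
def non_empty_lines(content):
    """Split extracted text on line breaks and drop blank lines."""
    return [line for line in content.splitlines() if line.strip()]


# Hypothetical stand-in for the text accumulated from extractText().
content = "First line\nSecond line\n\nThird line\n"

lines = non_empty_lines(content)
if not lines:
    # Nothing usable came back, e.g. a scanned PDF that holds only images.
    print("No text was extracted from the PDF.")
for number, line in enumerate(lines, start=1):
    print(number, line)
```

Dropping blank lines is a choice made for this sketch; keep them if line positions matter.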

Related

Is there any way to insert an image logo or text before saving to_html output in pandas?

I am saving pandas output with to_html().
Is there any way to add a logo or text at the top of the HTML page before saving?
to_html returns a string containing the HTML if the first parameter, buf, is None. You can then prepend your image or text HTML to this string and write the resulting string to a file.
output = '<img src="logo.jpg" alt="logo"><br><b>some text</b><br>' + df.to_html()
with open('output.html', 'w') as f:
    f.write(output)

How can I read values from a configuration file (text)?

I need to read values (text) from a configuration file named .env and assign them to variables so I can use them later in my program.
The .env file contains name/value pairs and looks something like this:
ENVIRONMENT_VARIABLE_ONE = AC9157847d72b1aa5370fdef36786863d9
ENVIRONMENT_VARIABLE_TWO = 73cad721b8cad6718d469acc42ffdb1f
ENVIRONMENT_VARIABLE_THREE = +13335557777
What I have tried so far
read-values.red
Red [
]
contents: read/lines %.env
env-one: first contents
env-two: second contents
env-three: third contents
print env-one ; ENVIRONMENT_VARIABLE_ONE = AC9157847d72b1aa5370fdef36786863d9
print env-two ; ENVIRONMENT_VARIABLE_TWO = 73cad721b8cad6718d469acc42ffdb1f
print env-three ; ENVIRONMENT_VARIABLE_THREE = +13335557777
What I'm looking for
print env-one ; AC9157847d72b1aa5370fdef36786863d9
print env-two ; 73cad721b8cad6718d469acc42ffdb1f
print env-three ; +13335557777
How do I continue or change my code to parse these strings so that the env- variables contain just the values?
env-one: skip find first contents " = " 3
See help for find and skip
Another solution using parse could be:
foreach [word value] parse read %.env [collect some [keep to "=" skip keep to newline skip]] [set load word trim value]
This one adds the words to the global context: ENVIRONMENT_VARIABLE_ONE will be AC9157847d72b1aa5370fdef36786863d9, and so on.
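For comparison only, the same idea (split each line at the equals sign and keep the trimmed value) can be sketched in Python rather than Red. The function name parse_env is hypothetical, and the sample text reuses the values from the question instead of reading a real .env file:

```python
def parse_env(text):
    """Return a dict mapping each name to its value for NAME = VALUE lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not a name/value pair.
        if not line or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip()
    return values


# Sample contents taken from the question; a real program would read the file.
sample = (
    "ENVIRONMENT_VARIABLE_ONE = AC9157847d72b1aa5370fdef36786863d9\n"
    "ENVIRONMENT_VARIABLE_TWO = 73cad721b8cad6718d469acc42ffdb1f\n"
    "ENVIRONMENT_VARIABLE_THREE = +13335557777\n"
)

env = parse_env(sample)
print(env["ENVIRONMENT_VARIABLE_ONE"])
print(env["ENVIRONMENT_VARIABLE_THREE"])
```

Unlike the parse answer above, this keeps the values in a dictionary rather than defining global words.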

How to read PHP output row by row and create HTML-links from each row

I have this PHP script:
<?php
$old_path = getcwd();
chdir('/var/www/html/SEARCHTOOLS/');
$term1 = $_POST['query1'];
$term2 = $_POST['query2'];
$var = "{$term1} {$term2}";
$outcome = shell_exec("searcher $var");
chdir($old_path);
echo "<pre>$outcome</pre>";
?>
On a search page, the user enters two search words and presses the search button. The search result is shown as a web page like this:
/var/www/html/SEARCHTOOLS/1974-1991.pdf:1
/var/www/html/SEARCHTOOLS/1974-1991.pdf:3
/var/www/html/SEARCHTOOLS/1974-1991.pdf:7
/var/www/html/SEARCHTOOLS/1974-1991.pdf:7
/var/www/html/SEARCHTOOLS/1974-1991.pdf:9
/var/www/html/SEARCHTOOLS/1974-1991.pdf:13
/var/www/html/SEARCHTOOLS/1974-1991.pdf:13
The result shows the path to each PDF file and the page number within that file, but the entries are not clickable.
Is there a way to make these links clickable so that each one opens, for instance in Evince or Acrobat, at the correct page number?
Many thanks in advance.
/Paul
I found a correct answer to my problem. It took some time, but here it is:
<?php
// Get the current working directory and store it in a variable
$old_path = getcwd();
// Change directory
chdir('/var/www/html/SEARCHTOOLS/');
// Create the first variable from the first search word on the search page
$term1 = $_POST['query1'];
// Create the second variable from the second search word on the search page
$term2 = $_POST['query2'];
// Create a variable combining the first AND second variable
$var = "{$term1} {$term2}";
// Create a variable holding the result of the search run with the command "sokare" and the variable "$var"
$outcome = shell_exec("sokare $var");
// Return to the starting directory
chdir($old_path);
// Split the variable "$outcome" per line; each line represents a page in the PDF file where "$var" was found
foreach(preg_split("/((\r?\n)|(\r\n?))/", $outcome) as $line){
    // Create a variable from the page number given for the PDF file
    $end = substr($line, strpos($line, ":") + 1);
    // Trim the line by removing the leading directories
    $line2 = str_replace('/var/www/html', '', $line);
    // Change the directory name from lower to upper case
    $line2 = str_replace('searchtools', 'SEARCHTOOLS', $line2);
    // Remove the colon and anything after it (indexing the array avoids the by-reference notice from array_shift)
    $line2 = explode(':', $line2)[0];
    // Add suffix to line to facilitate linking to pagenumber in PDF-file
    $line3 = str_replace(" ", "_", $line2).'#page=';
    // Add the page number from the variable "$end"
    $line3 = str_replace(" ", "_", $line3).$end;
    // Print each line as a URL link (the href value is quoted so the attribute is valid HTML)
    echo "<pre><a href=\"$line3\">$line3</a></pre>";
}
?>
The search results will now turn up as (and are clickable):
/SEARCHTOOLS/1974-1991.pdf#page=1
/SEARCHTOOLS/1974-1991.pdf#page=3
/SEARCHTOOLS/1974-1991.pdf#page=7
Just a small edit. The line ....
// Add suffix to line to facilitate linking to pagenumber in PDF-file
$line3 = str_replace(" ", "_", $line2).'#page=';
...works better with:
// Add suffix to line to facilitate linking to pagenumber in PDF-file
if (substr($line2, -3) == 'pdf') {
    $line3 = $line2.'#page=';
}

How do I preserve the order of the dependencies?

I have the following code that opens files in a directory, runs spaCy NLP on them, and then writes the dependency parse info to a file in a new directory.
import spacy, os
nlp = spacy.load('en')
path1 = 'C:/Path/to/my/input'
path2 = '../output'
for file in os.listdir(path1):
    with open(file, encoding='utf-8') as text:
        txt = text.read()
        doc = nlp(txt)
        for sent in doc.sents:
            f = open(path2 + '/' + file, 'a+')
            for token in sent:
                f.write(file + '\t' + str(token.dep_) + '\t' + str(token.head) + '\t' + str(token.right_edge) + '\n')
            f.close()
The trouble is that this won't preserve the order of the dependencies in the output file. I can't seem to find any references to character positions in the API documentation.
The character index is at token.idx. The word index is at token.i. I know this isn't particularly intuitive.
Tokens also compare by position, so you could do:
for child in sent:
    word1, word2 = sorted((child, child.head))
This would get you each dependency arc, arranged in document order. I'm not sure what you're trying to do with the right edge there, though, so I'm not sure if this does quite what you want.

Scrapy: handling Hebrew (non-English) text

I am using Scrapy to scrape a Hebrew website. However, even after encoding the scraped data as UTF-8, I am not able to get the Hebrew characters.
I get a garbled string (× ×¨×¡×™ בעמ) in the CSV. However, if I print the same item, I see the correct string in the terminal.
This is the website I am scraping:
http://www.moch.gov.il/rasham_hakablanim/Pages/pinkas_hakablanim.aspx
class Spider(BaseSpider):
    name = "moch"
    allowed_domains = ["www.moch.gov.il"]
    start_urls = ["http://www.moch.gov.il/rasham_hakablanim/Pages/pinkas_hakablanim.aspx"]
    def parse(self, response):
        data = {'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$cboAnaf': unicode(140),
                'SearchFreeText:': u'חפש',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$txtShemKablan': u'',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$txtMisparYeshut': u'',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$txtShemYeshuv': u'הקלד יישוב',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$txtMisparKablan': u'',
                'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$btnSearch': u'חפש',
                'ctl00$ScriptManager1': u'ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$UpdatePanel1|ctl00$ctl13$g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d$ctl00$btnSearch'}
        yield FormRequest.from_response(response,
                                        formdata=data,
                                        callback=self.fetch_details,
                                        dont_click=True)
    def fetch_details(self, response):
        # print response.body
        hxs = HtmlXPathSelector(response)
        item = MochItem()
        names = hxs.select("//table[@id='ctl00_ctl13_g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d_ctl00_gridRashamDetails']//tr/td[2]/font/text()").extract()
        phones = hxs.select("//table[@id='ctl00_ctl13_g_dbcc924d_5066_4fee_bc5c_6671d3e2c06d_ctl00_gridRashamDetails']//tr/td[6]/font/text()").extract()
        index = 0
        for name in names:
            item['name'] = name.encode('utf-8')
            item['phone'] = phones[index].encode('utf-8')
            index += 1
            print item # This is printed correctly on the terminal.
            yield item # If I create a CSV output file, I do not see the proper Hebrew string.
The weird thing is that if I open the same CSV in Notepad++, I see the correct output. So, as a workaround, I opened the CSV in Notepad++, changed the encoding to UTF-8, and saved it. Now when I open the CSV in Excel again, it shows the correct Hebrew string.
Is there any way to specify the CSV encoding from within Scrapy?
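The Notepad++ workaround suggests that Excel only recognizes the file as UTF-8 when it starts with a byte order mark (BOM). A minimal sketch of that idea, assuming Python 3 and the standard csv module rather than Scrapy's own exporter. The function name write_rows_for_excel, the file name, and the sample row are hypothetical:

```python
import csv


def write_rows_for_excel(path, rows):
    """Write rows to a CSV file encoded as UTF-8 with a leading BOM."""
    # The "utf-8-sig" codec prepends the BOM bytes EF BB BF when writing.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "phone"])
        writer.writerows(rows)


# Hypothetical sample row, not taken from the scraped site.
write_rows_for_excel("contractors.csv", [("שלום", "03-5550000")])
```

This only illustrates the encoding difference. How to apply it inside a Scrapy feed export depends on the Scrapy version in use.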