Read pdf object from S3

Read pdf object from S3 - pdf

I am trying to create a lambda function that will access a pdf form uploaded to s3 and strip out the data entered into the form and send it elsewhere.
I am able to do this when I can download the file locally. So the below script works and allows me to read the data from the pdf into my pandas dataframe.:
import PyPDF2 as pypdf
import pandas as pd
s3 = boto3.resource('s3')
s3.meta.client.download_file(bucket_name, asset_key, './target.pdf')
pdfobject = open("./target.pdf", 'rb')
pdf = pypdf.PdfFileReader(pdfobject)
data = pdf.getFormTextFields()
pdf_df = pd.DataFrame(data, columns=get_cols(data), index=[0])
But with lambda I cannot save the file locally because I get a "read only filesystem" error.
I have tried using the s3.get_object() method like below:
s3_response_object= s3.get_object(
Bucket='pdf-forms-bucket',
Key='target.pdf',
)
pdf_bytes = s3_response_object['Body'].read()
But I have no idea how to convert the resulting bytes into an object that can be parsed with PyDF2. The output that I need and that PyDF2 will produce is like below:
{'form1[0].#subform[0].nameandmail[0]': 'Burt Lancaster',
'form1[0].#subform[0].mailaddress[0]': '675 Creighton Ave, Washington DC',
'form1[0].#subform[0].Principal[0]': 'David St. Hubbins',
'Principal[1]': None,
'form1[0].#subform[0].Principal[2]': 'Bart Simpson',
'Principal[3]': None}
So in summary, I need o be able to read a pdf with fillable forms, into memory and parse it without downloading the file because my lambda function environment won't allow local temp files.

Solved:
This does the trick:
import boto3
from PyPDF2 import PdfFileReader
from io import BytesIO
bucket_name ="pdf-forms-bucket"
item_name = "form.pdf"
s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, item_name)
fs = obj.get()['Body'].read()
pdf = PdfFileReader(BytesIO(fs))
data = pdf.getFormTextFields()

Related

How to use drive with external account in colab

So, i have all important files in my account and i need to import that files to the local machine of colab and share the files can be dangerous and for any rason the files copy and paste theirself in the drive filling it, i try to use pydrive but i cant "automatize" the process of login with a user account. I need a form to automatize the process of login using colab, ideas?
`
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
from google_drive_downloader import GoogleDriveDownloader as gdd
gauth = GoogleAuth()
drive = GoogleDrive(gauth)
file = open("client_secrets.json", "w")
file.write('{"web": __')
file.close()
file = open("credentials.json", "w")
file.write('{"access_token": __"}')
file.close()
file = open("settings.json", "w")
file.write('')
if gauth.credentials is None:
gauth.CommandLineAuth()
elif gauth.access_token_expired:
gauth.Refresh()
else:
gauth.Authorize()
( __ = private information)

How to open Russian-language PDFs for NLTK processing

I'm trying to extract text from a pdf file in Russian, and use this text as data for tokenisation, lemmatisation etc. with NLTK on Jupyter Notebook. I'm using PyPDF2, but I keep running into problems.
I am creating a function and passing to it the pdf as the input:
from PyPDF2 import PdfFileReader
def getTextPDF(pdfFileName):
pdf_file = open(pdfFileName, "rb")
read_pdf = PdfFileReader(pdf_file)
text = []
for i in range(0, read_pdf.getNumPages()):
text.append(read_pdf.getPage(i).extractText())
return "\n".join(text)
Then I call the function:
pdfFile = "sample_russian.pdf"
print("PDF: \n", myreader_pdf.getTextPDF(pdfFile))
But I get a long pink list of the same error warning:
PdfReadWarning: Superfluous whitespace found in object header b'1' b'0' [pdf.py:.....]
Any ideas would be very helpful! Thanks in advance!

Convert PDF to .txt file with Google Cloud Storage

I have this code for Python on a local file system.
What is the equivalent Python object API for os.getcwd(), os.listdir?
I want this code to work using files from GCS?
In order to use GCS folders, I include this code
from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
pdfDir = bucket.get_blob('uploads/pdf/')
txtDir = bucket.get_blob('uploads/txt/')
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt
#converts pdf, returns its text content as a string
def convert(fname, pages=None):
if not pages:
pagenums = set()
else:
pagenums = set(pages)
output = StringIO()
manager = PDFResourceManager()
converter = TextConverter(manager, output, laparams=LAParams())
interpreter = PDFPageInterpreter(manager, converter)
infile = file(fname, 'rb')
for page in PDFPage.get_pages(infile, pagenums):
interpreter.process_page(page)
infile.close()
converter.close()
text = output.getvalue()
output.close
return text
#converts all pdfs in directory pdfDir, saves all resulting txt files to
txtdir
def PDF2txt(pdfDir, txtDir):
if pdfDir == "": pdfDir = os.getcwd() + "\\" #if no pdfDir passed in
for pdf in os.listdir(pdfDir): #iterate through pdfs in pdf directory
fileExtension = pdf.split(".")[-1]
if fileExtension == "pdf":
pdfFilename = pdfDir + pdf
text = convert(pdfFilename) #get string of text content of pdf
textFilename = txtDir + pdf + ".txt"
textFile = open(textFilename, "w") #make text file
textFile.write(text) #write text to text file
pdfDir = "C:/pdftotxt/pdfs/"
txtDir = "C:/pdftotxt/txt/"
PDF2txt(pdfDir, txtDir)

I assume that what you want is to list objects in a bucket and objects in particular folders inside a bucket. For doing that you can use directly the Python Client Libraries that Google Cloud Storage provide. Use bucket.list_blobs() for listing the whole bucket and bucket.list_blobs(prefix=prefix, delimiter=delimiter) for listing a particular folder or object.
A more detailed documentation can be found here [1] and the Git repository containing the whole libraries here [2].

minimal example of how to export a jupyter notebook to pdf using nbconvert and PDFExporter()

I am trying to export a pdf copy of a jupyter notebook using nbconvert from within a notebook cell. I have read the documentation, but I just cannot find some basic code to actually execute the nbconvert command and export to pdf.
I was able to get this far, but I was hoping that someone could just fill in the final gaps.
from nbconvert import PDFExporter
notebook_pdf = PDFExporter()
notebook_pdf.template_file = '../print_script/pdf_nocode.tplx'
Note sure how to get from here to actually getting the pdf created.
Any help would be appreciated.

I'm no expert, but managed to get this working. The key is that you need to preprocess the notebook which will allow you to use the PDFExporter.from_notebook_node() function. This will give you your pdf_data in byte format that can then be written to file:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
from nbconvert import PDFExporter
notebook_filename = "notebook.ipynb"
with open(notebook_filename) as f:
nb = nbformat.read(f, as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
ep.preprocess(nb, {'metadata': {'path': 'notebooks/'}})
pdf_exporter = PDFExporter()
pdf_data, resources = pdf_exporter.from_notebook_node(nb)
with open("notebook.pdf", "wb") as f:
f.write(pdf_data)
f.close()
It's worth noting that the ExecutePreprocessor requires the resources dict, but we don't use it in this example.

Following is rest api that convert .ipynb file into .html
POST: http://URL/export/<id>
Get: http://URL/export/<id> will return a id.html
import os
from flask import Flask, render_template, make_response
from flask_cors import CORS
from flask_restful import reqparse, abort, Api, Resource
from nbconvert.exporters import HTMLExporter
exporter = HTMLExporter()
app = Flask(__name__)
cors = CORS(app, resources={r"/export/*": {"origins": "*"}})
api = Api(app)
parser = reqparse.RequestParser()
parser.add_argument('path')
notebook_file_srv = '/path of your .ipynb file'
def notebook_doesnt_exist(nb):
abort(404, message="Notebook {} doesn't exist".format(nb))
class Notebook(Resource):
def get(self, id):
headers = {'Content-Type': 'text/html'}
return make_response(render_template(id + '.html'), 200, headers)
def post(self, id):
args = parser.parse_args()
notebook_file = args['path']
notebook_file = notebook_file_srv + id + '.ipynb'
if not os.path.exists(notebook_file):
return 'notebook \'.ipynb\' file not found', 404
else:
nb_name, _ = os.path.splitext(os.path.basename(notebook_file))
# dirname = os.path.dirname(notebook_file)
output_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'templates')
output_path = os.path.join(output_path, '{}.html'.format(nb_name))
output, resources = exporter.from_filename(notebook_file)
f = open(output_path, 'wb')
f.write(output.encode('utf8'))
f.close()
return 'done', 201
api.add_resource(Notebook, '/export/<id>')
if __name__ == '__main__':
app.run(debug=True)

How to read every file in folder to dataframe named after filename and overlay column names?

I am working on a project where I am downloading public data from (http://pdata.hcad.org/download/) and more particularly downloading the zip files "real_acct_ownership" and "real_building_land".
Each of these zip files contains data on homes built in the houston area, such as addresses, fixtures, sq ft, etc.
My goal is to organize the data so that all the files in the zip folder are data frames indexable by the column "account".
I am running into the issue as to how to create a function or for loop that will read and write the data into a data frame based on file name and how to overlay column names as the data in the zip folders does not contain the column names. The column names can be found in the access zip folder at the top left hand corner labeled "access.zip" of the website.
In my code so far I am calling each file from the above two folders and specifying each column name. I want this to be a iterative process as I will have to do this for other counties and would like a way to create a loop over the files in the folder.
my code so far with NO loops:
import pandas as pd
fixtures = pd.read_csv('/Users/Desktop/Real_building_land/fixtures.txt',header = None,
encoding= 'cp037', error_bad_lines=False, sep='\t')
real_acct =pd.read_csv('/Users/Desktop/Real_acct_owner/real_acct.txt', header = None,
encoding = 'cp037', error_bad_lines=False, sep='\t')
exterior = pd.read_csv('/Users/Desktop/Real_building_land/exterior.txt', header = None,
encoding = 'cp037', error_bad_lines=False, sep='\t')
fixtures.columns = ('ACCOUNT','BUILDING_NUMBER','FIXTURE_TYPE','FIXTURE_DESCRIPTION','UNITS')
real_acct.columns = ("ACCOUNT","TAX_YEAR","MAILTO","MAIL_ADDR_1","MAIL_ADDR_2","MAIL_CITY","MAIL_STATE",
"MAIL_ZIP","MAIL_COUNTRY","UNDELIVERABLE","STR_PFX" ,"STR_NUM", "STR_NUM_SFX","STR_NAME",
"STR_SFX","STR_SFX_DIR","STR_UNIT","SITE_ADDR_1","SITE_ADDR_2","SITE_ADDR_3","STATE_CLASS",
"SCHOOL_DIST","MAP_FACET","KEY_MAP","NEIGHBORHOOD_CODE","NEIGHBORHOOD_GROUP","MARKET_AREA_1",
"MARKET_AREA_1_DSCR","MARKET_AREA_2","MARKET_AREA_2_DSCR","ECON_AREA","ECON_BLD_CLASS",
"CENTER_CODE","YR_IMPR","YR_ANNEXED","SPLT_DT","DSC_CD","NXT_BUILDING","TOTAL_BUILDING_AREA",
"TOTAL_LAND_AREA","ACREAGE","CAP_ACCOUNT","SHARED_CAD_CODE","LAND_VALUE","IMPROVEMENT_VALUE",
"EXTRA_FEATURES_VALUE" ,"AG_VALUE","ASSESSED_VALUE","TOTAL_APPRAISED_VALUE","TOTAL_MARKET_VALUE",
"PRIOR_LND_VALUE","PRIOR_IMPR_VALUE","PRIOR_X_FEATURES_VALUE","PRIOR_AG_VALUE",
"PRIOR_TOTAL_APPRAISED_VALUE","PRIOR_TOTAL_MARKET_VALUE","NEW_CONSTRUCTION_VALUE",
"TOTAL_RCN_VALUE","VALUE_STATUS","NOTICED","NOTICE_DATE","PROTESTED","CERTIFIED_DATE",
"LAST_INSPECTED_DATE","LAST_INSPECTED_BY","NEW_OWNER_DATE","LEGAL_DSCR_1","LEGAL_DSCR_2",
"LEGAL_DSCR_3","LEGAL_DSCR_4","JURS")
exterior.columns = ("ACCOUNT","BUILDING_NUMBER","EXTERIOR_TYPE","EXTERIOR_DESCRIPTION","AREA")
df = fixtures.merge(real_acct,on='ACCOUNT').merge(exterior,on='ACCOUNT')
#df = df.loc[df['ACCOUNT'] == 10020000015]
print(df.shape)
Code with Few trials with loops nothing worked:
import pandas as pd
import glob
import os
dfs = {os.path.basename(f): pd.read_csv(f, sep='\t', header=None,encoding='cp037',
error_bad_lines=False) for f in glob.glob('/Users/Desktop/Real_building_land/*.txt')}
print(dfs)
path =r'path' # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
df = pd.read_csv(file_,index_col=None, header=0)
list_.append(df)
frame = pd.concat(list_)
Thank you in advance.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Read pdf object from S3 - pdf

Related

How to use drive with external account in colab

How to open Russian-language PDFs for NLTK processing

Convert PDF to .txt file with Google Cloud Storage

minimal example of how to export a jupyter notebook to pdf using nbconvert and PDFExporter()

How to read every file in folder to dataframe named after filename and overlay column names?

Categories

Resources