I have the following code that opens files in a directory, runs spaCy NLP on them, and then outputs dependency parse info into a file in a new directory.
import spacy, os

nlp = spacy.load('en')
path1 = 'C:/Path/to/my/input'
path2 = '../output'
for file in os.listdir(path1):
    with open(os.path.join(path1, file), encoding='utf-8') as text:
        txt = text.read()
    doc = nlp(txt)
    for sent in doc.sents:
        f = open(path2 + '/' + file, 'a+')
        for token in sent:
            f.write(file + '\t' + str(token.dep_) + '\t' + str(token.head) + '\t' + str(token.right_edge) + '\n')
        f.close()
The trouble is that this won't preserve the order of the dependencies in the output file. I can't seem to find any references to character positions in the API documentation.
The character index is at token.idx. The word index is at token.i. I know this isn't particularly intuitive.
Tokens also compare by position, so you could do:
for child in sent:
    word1, word2 = sorted((child, child.head))
This would get you each dependency arc, arranged in document order. I'm not sure what you're trying to do with the right edge there, though, so I'm not sure if this does quite what you want.
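Putting the above together, a minimal sketch (reusing the doc from the question's code, and the token.i and token.idx attributes described above) that prints each dependency arc in document order, with word and character positions:

for sent in doc.sents:
    for token in sent:
        # tokens compare by position, so sorted() puts the arc endpoints in document order
        first, second = sorted((token, token.head))
        print(token.i, token.idx, token.dep_, first.text, second.text)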
I know that automatic token refreshing is not a new topic.
This is the use case that generates my problem: let's say we want to extract data from Dropbox. Below you can find the code. The first time, it works perfectly: 1) the user goes to the generated link; 2) after allowing the app, they copy and paste the authorization code into the input box.
The problem arises when, some hours later, the user wants to do the same operation. How can I avoid or bypass generating a new authorization code and go straight to the operation?
As you can see in the code, for a short period it is possible to re-inject the auth code (see the commented line). But after an hour or more this is no longer possible.
Any help is welcome.
#!/usr/bin/env python3
import dropbox
import pandas as pd  # needed by dropbox_list_files below
from dropbox import DropboxOAuth2FlowNoRedirect

'''
Populate your app key in order to run this locally
'''
APP_KEY = ""

auth_flow = DropboxOAuth2FlowNoRedirect(APP_KEY, use_pkce=True, token_access_type='offline')

target = '/DVR/DVR/'

authorize_url = auth_flow.start()
print("1. Go to: " + authorize_url)
print("2. Click \"Allow\" (you might have to log in first).")
print("3. Copy the authorization code.")
auth_code = input("Enter the authorization code here: ").strip()
#auth_code = "3NIcPps_UxAAAAAAAAAEin1sp5jUjrErQ6787_RUbJU"

try:
    oauth_result = auth_flow.finish(auth_code)
except Exception as e:
    print('Error: %s' % (e,))
    exit(1)

with dropbox.Dropbox(oauth2_refresh_token=oauth_result.refresh_token, app_key=APP_KEY) as dbx:
    dbx.users_get_current_account()
    print("Successfully set up client!")

    for entry in dbx.files_list_folder(target).entries:
        print(entry.name)

    # function to get the list of files in a folder
    def dropbox_list_files(path):
        try:
            files = dbx.files_list_folder(path).entries
            files_list = []
            for file in files:
                if isinstance(file, dropbox.files.FileMetadata):
                    metadata = {
                        'name': file.name,
                        'path_display': file.path_display,
                        'client_modified': file.client_modified,
                        'server_modified': file.server_modified
                    }
                    files_list.append(metadata)
            df = pd.DataFrame.from_records(files_list)
            return df.sort_values(by='server_modified', ascending=False)
        except Exception as e:
            print('Error getting list of files from Dropbox: ' + str(e))

    def create_links(target, csvfile):
        filesList = []
        print("creating links for folder " + target)
        files = dbx.files_list_folder('/' + target)
        filesList.extend(files.entries)
        print(len(files.entries))
        while files.has_more:
            files = dbx.files_list_folder_continue(files.cursor)
            filesList.extend(files.entries)
            print(len(files.entries))
        for file in filesList:
            if isinstance(file, dropbox.files.FileMetadata):
                filename = file.name + ',' + file.path_display + ',' + str(file.size) + ','
                link_data = dbx.sharing_create_shared_link(file.path_lower)
                filename += link_data.url + '\n'
                csvfile.write(filename)
                print(file.name)
            else:
                create_links(target + '/' + file.name, csvfile)

    # create links for all files in the folder belgeler
    create_links(target, open('links.csv', 'w', encoding='utf-8'))

    listing = dbx.files_list_folder(target)
    # todo: add implementation for files_list_folder_continue
    for entry in listing.entries:
        if entry.name.endswith(".pdf"):
            # note: this simple implementation only works for files in the root of the folder
            res = dbx.sharing_get_shared_links(target + entry.name)
            #f.write(res.content)
            print('\r', res)
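For what it's worth, the usual pattern here is to persist the refresh token after the first authorization and reuse it on later runs, since the SDK refreshes the short-lived access token automatically when constructed with oauth2_refresh_token (as the code above already does). A minimal sketch; the token file location is a hypothetical choice:

import os
import dropbox
from dropbox import DropboxOAuth2FlowNoRedirect

APP_KEY = ""
TOKEN_FILE = "refresh_token.txt"  # hypothetical location for the persisted token

if os.path.exists(TOKEN_FILE):
    # Later runs: reuse the stored refresh token, no authorization code needed.
    with open(TOKEN_FILE) as fp:
        refresh_token = fp.read().strip()
else:
    # First run: do the one-time PKCE flow and store the refresh token.
    auth_flow = DropboxOAuth2FlowNoRedirect(APP_KEY, use_pkce=True, token_access_type='offline')
    print("1. Go to: " + auth_flow.start())
    auth_code = input("Enter the authorization code here: ").strip()
    refresh_token = auth_flow.finish(auth_code).refresh_token
    with open(TOKEN_FILE, "w") as fp:
        fp.write(refresh_token)

# The SDK refreshes the short-lived access token automatically from here on.
with dropbox.Dropbox(oauth2_refresh_token=refresh_token, app_key=APP_KEY) as dbx:
    dbx.users_get_current_account()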
When text is tagged by spaCy, the vertical bar "|" is assigned different POS tags depending on the context, such as "ADV", "DEL"..., but I want "|" to be recognized as "PUNC". Is there a way to force this POS for "|"?
I tried this command and it didn't work.
nlp.tokenizer.add_special_case('|', [{ORTH: '|', POS: PUNC}])
I would add a simple pipe into the pipeline, right after the tagger:
def pos_postprocessor_pipe(doc):
    for token in doc:
        if token.text == '|':
            token.pos_ = 'PUNCT'
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(pos_postprocessor_pipe, name="pos_postprocessor", after='tagger')
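Note that passing a function object to add_pipe like this is the spaCy 2.x API. In spaCy 3.x the component has to be registered first and then added by name; a sketch of the equivalent:

import spacy
from spacy.language import Language

@Language.component("pos_postprocessor")
def pos_postprocessor_pipe(doc):
    # overwrite the coarse-grained POS for "|" after the tagger has run
    for token in doc:
        if token.text == '|':
            token.pos_ = 'PUNCT'
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("pos_postprocessor", after="tagger")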
It looks like you're using the method from this question to overwrite the tags, but that seems to be for an old version of spaCy; it doesn't work in 2.3.1.
You can set values on tokens, so you can do something like this:
import spacy
from spacy.symbols import NOUN

nlp = spacy.load('en')
words = nlp("...")
for word in words:
    if word.text == "|":
        word.pos = NOUN
However, it's likely that it's easier to just add an exception for the pipe wherever it's causing you issues.
Adding an exception for the pipe would look like this:
for word in nlp(...):
    pos = word.pos_
    if word.text == "|":
        pos = "PUNCT"
    do_stuff(word, pos)
I am working on a requirement where I have to save the logs of my ETL scripts to an S3 location.
So far I am able to store the logs on my local system, and now I need to upload them to S3.
For this I have written the following code:
import logging
import datetime
import boto3
from boto3.s3.transfer import S3Transfer
from etl import CONFIG

FORMAT = '%(asctime)s [%(levelname)s] %(filename)s:%(lineno)s %(funcName)s() : %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'

logger = logging.getLogger()
logger.setLevel(logging.INFO)

S3_DOMAIN = 'https://s3-ap-southeast-1.amazonaws.com'
S3_BUCKET = CONFIG['S3_BUCKET']
filepath = ''
folder_name = 'etl_log'
filename = ''

def log_file_conf(merchant_name, table_name):
    log_filename = (datetime.datetime.now().strftime('%Y-%m-%dT%H-%M-%S')
                    + '_' + table_name + '.log')
    fh = logging.FileHandler("E:/test/etl_log/" + merchant_name + "/" + log_filename)
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(logging.Formatter(FORMAT, DATETIME_FORMAT))
    logger.addHandler(fh)

client = boto3.client('s3',
                      aws_access_key_id=CONFIG['S3_KEY'],
                      aws_secret_access_key=CONFIG['S3_SECRET'])
transfer = S3Transfer(client)
transfer.upload_file(filepath, S3_BUCKET, folder_name + "/" + filename)
The issue I am facing is that logs are generated for different merchants, so their names depend on the merchant; I have taken care of this when saving locally.
But for uploading to S3 I don't know how to select the log file name.
Can anyone please help me to achieve my goal?
S3 is an object store; it doesn't have a "real" path. The so-called path (the "/" separator) is actually cosmetic, so nothing prevents you from using something similar to your local file naming convention, e.g.
transfer.upload_file(filepath, S3_BUCKET, folder_name + "/" + merchant_name + "/" + filename)
To list all the files under an arbitrary path (called a "prefix"), you just do this:
# simple list_objects call, not handling pagination; max 1000 objects listed
client.list_objects(
    Bucket=S3_BUCKET,
    Prefix=folder_name + "/" + merchant_name
)
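If a prefix can hold more than 1000 objects, a paginator will follow the continuation tokens for you; a minimal sketch using the same client and names as above:

# paginate through all objects under the prefix, 1000 per page
paginator = client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=folder_name + "/" + merchant_name):
    for obj in page.get('Contents', []):
        print(obj['Key'])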
Recently I got into MEL programming to help out a few friends in Maya. They wanted to have their files referenced on a server, so I needed to change the reference strings. I have put together a solution, using another example as a guide, but when I run the script it says:
// Error: int $nt=tokenize $TexturePath "\" $buff;
//
// Error: Line 12.43: Unterminated string. //
What gives?
P.S. Full code below for anyone who wants to use it.
string $SceneTextures[] = `ls -tex`;
string $plus = ""; // Place a file type here to be saved in that subfolder
for ($i = 0; $i < (`size $SceneTextures`); $i++)
{
    $Test = catchQuiet(`getAttr ($SceneTextures[$i] + ".fileTextureName")`);
    if ($Test == 0)
    {
        string $TexturePath = `getAttr ($SceneTextures[$i] + ".fileTextureName")`;
        string $buff[];
        int $nt = `tokenize $TexturePath "\\" $buff`;
        string $newPath = ("${ARC_SURF}\\" + $plus + "\\" + $buff[$nt-3] + "\\" + $buff[$nt-2] + "\\" + $buff[$nt-1]);
        setAttr -type "string" ($SceneTextures[$i] + ".fileTextureName") $newPath;
        catchQuiet (AEfileTextureReloadCmd ($SceneTextures[$i] + ".fileTextureName"));
        //print $TexturePath;
    }//end if
}//end for i
EDIT: Fixed the code as it should be; now it only throws // Error: line 14: Invalid negative index used to reference array "$buff".
But I think that probably just one texture is screwing things up; I will check and report back.
I'm no expert in MEL, but in many languages \ is used to start escape sequences, so I would guess you want "\\" instead of "\" in the many places it appears.
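As a quick illustration of the same rule in Python (just an analogy; the actual fix in the MEL code above is the "\\" on the tokenize line):

s = "C:\\textures\\wood.jpg"  # "\\" is one literal backslash character
print(len("\\"))              # 1
print(s.split("\\"))          # ['C:', 'textures', 'wood.jpg']
# A bare "\" before the closing quote would escape the quote itself,
# which is exactly the "Unterminated string" error MEL reports.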
import pyPdf

f = open('jayabal_appt.pdf', 'rb')
pdfl = pyPdf.PdfFileReader(f)
output = pyPdf.PdfFileWriter()
content = ""
for i in range(0, 1):
    content += pdfl.getPage(i).extractText() + "\n"
outpu = open('b.txt', 'wb')
outpu.write(content)
f.close()
outpu.close()
This is not writing the content of the PDF to a txt file... what should I do?
Iterate through every page and call extractText() like so:
content = ""
for i in range(0, num_pages):
content += pdfl.getPage(i).extractText() + "\n"
Once you have the full contents you can easily split the lines via the '\n' separator.
EDIT:
Check after the for-loop whether the variable content actually contains any text. Not all PDF files contain text information.
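Putting it together, a minimal end-to-end sketch that derives the page count with getNumPages() (same pyPdf API as in the question) and checks for empty output:

import pyPdf

f = open('jayabal_appt.pdf', 'rb')
pdfl = pyPdf.PdfFileReader(f)
content = ""
for i in range(pdfl.getNumPages()):  # iterate over every page, not just page 0
    content += pdfl.getPage(i).extractText() + "\n"
f.close()

if content.strip():
    outpu = open('b.txt', 'w')
    outpu.write(content)
    outpu.close()
else:
    print("No extractable text found in this PDF.")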