I know: automatic token refreshing is not a new topic.
This is the use case that generates my problem: let's say we want to extract data from Dropbox. Below you can find the code. The first time it works perfectly: 1) the user goes to the generated link; 2) after allowing the app, they copy and paste the authorization code into the input box.
The problem arises when, some hours later, the user wants to do the same operation. How can I avoid or bypass generating a new authorization code and go straight to the operation?
As you can see in the code, within a short period it is possible to re-inject the auth code (see the commented line). But after an hour or more this is no longer possible.
Any help is welcome.
#!/usr/bin/env python3
import dropbox
from dropbox import DropboxOAuth2FlowNoRedirect
import pandas as pd  # needed by dropbox_list_files() below

'''
Populate your app key in order to run this locally
'''
APP_KEY = ""

auth_flow = DropboxOAuth2FlowNoRedirect(APP_KEY, use_pkce=True, token_access_type='offline')
target = '/DVR/DVR/'

authorize_url = auth_flow.start()
print("1. Go to: " + authorize_url)
print("2. Click \"Allow\" (you might have to log in first).")
print("3. Copy the authorization code.")
auth_code = input("Enter the authorization code here: ").strip()
#auth_code="3NIcPps_UxAAAAAAAAAEin1sp5jUjrErQ6787_RUbJU"

try:
    oauth_result = auth_flow.finish(auth_code)
except Exception as e:
    print('Error: %s' % (e,))
    exit(1)

with dropbox.Dropbox(oauth2_refresh_token=oauth_result.refresh_token, app_key=APP_KEY) as dbx:
    dbx.users_get_current_account()
    print("Successfully set up client!")
    for entry in dbx.files_list_folder(target).entries:
        print(entry.name)
    # function to get the list of files in a folder as a dataframe
    def dropbox_list_files(path):
        try:
            files = dbx.files_list_folder(path).entries
            files_list = []
            for file in files:
                if isinstance(file, dropbox.files.FileMetadata):
                    metadata = {
                        'name': file.name,
                        'path_display': file.path_display,
                        'client_modified': file.client_modified,
                        'server_modified': file.server_modified
                    }
                    files_list.append(metadata)
            df = pd.DataFrame.from_records(files_list)
            return df.sort_values(by='server_modified', ascending=False)
        except Exception as e:
            print('Error getting list of files from Dropbox: ' + str(e))

    # function to create shared links for all files in a folder (recursively)
    def create_links(target, csvfile):
        filesList = []
        print("creating links for folder " + target)
        files = dbx.files_list_folder('/' + target)
        filesList.extend(files.entries)
        print(len(files.entries))
        while files.has_more:
            files = dbx.files_list_folder_continue(files.cursor)
            filesList.extend(files.entries)
            print(len(files.entries))
        for file in filesList:
            if isinstance(file, dropbox.files.FileMetadata):
                filename = file.name + ',' + file.path_display + ',' + str(file.size) + ','
                link_data = dbx.sharing_create_shared_link(file.path_lower)
                filename += link_data.url + '\n'
                csvfile.write(filename)
                print(file.name)
            else:
                create_links(target + '/' + file.name, csvfile)

    # create links for all files in the folder belgeler
    create_links(target, open('links.csv', 'w', encoding='utf-8'))

    listing = dbx.files_list_folder(target)
    # todo: add implementation for files_list_folder_continue
    for entry in listing.entries:
        if entry.name.endswith(".pdf"):
            # note: this simple implementation only works for files in the root of the folder
            res = dbx.sharing_get_shared_links(target + entry.name)
            # f.write(res.content)
            print('\r', res)
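For clarity, here is a minimal sketch of the kind of flow being asked about (the refresh_token.txt file and the get_client() helper are illustrative assumptions, not part of the Dropbox SDK): persist the refresh token returned by auth_flow.finish() after the first run, then rebuild the client from it on later runs so the browser/authorization-code step is skipped.

import os
import dropbox
from dropbox import DropboxOAuth2FlowNoRedirect

TOKEN_FILE = "refresh_token.txt"  # illustrative location for the persisted refresh token

def get_client(app_key):
    # Reuse a previously stored refresh token if one exists...
    if os.path.exists(TOKEN_FILE):
        with open(TOKEN_FILE) as f:
            refresh_token = f.read().strip()
    else:
        # ...otherwise run the one-time PKCE flow and store the token for next time.
        flow = DropboxOAuth2FlowNoRedirect(app_key, use_pkce=True, token_access_type='offline')
        print("1. Go to: " + flow.start())
        code = input("Enter the authorization code here: ").strip()
        refresh_token = flow.finish(code).refresh_token
        with open(TOKEN_FILE, "w") as f:
            f.write(refresh_token)
    # The SDK exchanges the refresh token for short-lived access tokens automatically.
    return dropbox.Dropbox(oauth2_refresh_token=refresh_token, app_key=app_key)

On the second run the token file already exists, so the script would go straight to the Dropbox operations.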
I need to get the full path of the folders where a file is located in Google Drive. I'm getting the files themselves using the Google Drive API, but I need information about their parent folders.
I'm using the following code to get the list of spreadsheets in a Shared Drive:
from googleapiclient import discovery
from httplib2 import Http
from oauth2client import file, client, tools
# Change the value of SCOPES to 'https://www.googleapis.com/auth/drive'
# if you want to be able to read and write to the user's Google Drive.
SCOPES = 'https://www.googleapis.com/auth/drive'
store = file.Storage('storage.json')
creds = store.get()
if not creds or creds.invalid:
flow = client.flow_from_clientsecrets('client_secret.json', SCOPES)
creds = tools.run_flow(flow, store)
DRIVE = discovery.build('drive', 'v3', http=creds.authorize(Http()))
folder_id = "1Z1GzY-D3I3qwQu3oxIW-L1a9nXgD0PXl"
query = "mimeType='application/vnd.google-apps.spreadsheet'"
query+= "and fullText contains 'CLAS' and trashed = false"
# query += " and parents in '" + folder_id + "'"
spreadsheets = []
# Initialize the page token
next_page_token = None
# Loop until all pages of results have been retrieved
while True:
    # Execute the list request
    response = DRIVE.files().list(
        q=query,
        corpora='drive',
        includeItemsFromAllDrives=True,
        driveId='0AEJNMySKcEzsUk9PVA',
        supportsAllDrives=True,
        # orderBy='folder',
        pageSize=1000,
        fields='nextPageToken, files(id, name, parents, mimeType, webViewLink)',
        pageToken=next_page_token,
    ).execute()
    # Append the results to the list
    spreadsheets.extend(response.get('files', []))
    # Check if there is another page of results
    next_page_token = response.get('nextPageToken', None)
    if next_page_token is None:
        break
    # Set the page token for the next iteration
    # parameters['pageToken'] = next_page_token

# Print the number of results
print(f'Last spreadsheet found: {spreadsheets[-1]["name"]}. Number of spreadsheets: {len(spreadsheets)}')
This returns a list of dictionaries with the specified fields. I would like to know the names of the parent folders for each file, for which I'm trying:
from googleapiclient.errors import HttpError

for item in spreadsheets:
    if 'parents' in item:
        parent_folders_list = []
        parent_id = item['parents'][0]
        try:
            while parent_id:
                folder = DRIVE.files().get(fileId=parent_id, fields='name, id, parents').execute()
                parent_folders_list.append(folder.get("parents", []))
                if parent_id:
                    parent_id = parent_id[0]
        except HttpError as error:
            print('An error occurred: %s' % error)
        print(f'{item["name"]} is in {parent_folders_list}')
I've verified that parent_id is retrieved correctly and that I can access it, since I was able to open it in the browser. However, I get 'File not found' errors for every parent_id. I wonder whether DRIVE.files().get(fileId=...) is the correct way to retrieve a folder through the API.
Any help would be greatly appreciated.
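One variant worth trying, sketched below under two assumptions (that the items live in the shared drive queried above, so supportsAllDrives=True is also passed to files().get(), and that the loop should follow the parents field returned by each get() call rather than re-indexing parent_id):

for item in spreadsheets:
    parent_names = []
    parent_id = item.get('parents', [None])[0]
    while parent_id:
        # supportsAllDrives mirrors the list() call above (assumption: shared-drive items)
        folder = DRIVE.files().get(
            fileId=parent_id,
            fields='name, id, parents',
            supportsAllDrives=True,
        ).execute()
        parent_names.append(folder['name'])
        parents = folder.get('parents', [])
        parent_id = parents[0] if parents else None
    print(f'{item["name"]} is in {parent_names}')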
I would like to download a picture into a blob folder.
Before that I need to create the folder first.
The code below is what I am doing.
The issue is that the folder seems to need time to be created: when execution reaches with open(abs_file_name, "wb") as f: it cannot find the folder.
I am wondering whether there is an 'await'-like way to know that the folder creation has completed before doing the write operation.
for index, row in data.iterrows():
    url = row['Creatives']
    file_name = url.split('/')[-1]
    r = requests.get(url)
    abs_file_name = lake_root + file_name
    dbutils.fs.mkdirs(abs_file_name)
    if r.status_code == 200:
        with open(abs_file_name, "wb") as f:
            f.write(r.content)
The final sub folder will not be created when using dbutils.fs.mkdirs() on blob storage.
It creates a file with the final sub-folder's name; this file appears to be a directory, but it is not one. Look at the following demonstration:
dbutils.fs.mkdirs('/mnt/repro/s1/s2/s3.csv')
When I try to open this file, the error says that this is a directory.
This might be the issue with the code. So, try using the following code instead:
for index, row in data.iterrows():
    url = row['Creatives']
    file_name = url.split('/')[-1]
    r = requests.get(url)
    abs_file_name = lake_root + 'fail'  # creates the fake directory (to counter the problem we are facing above)
    dbutils.fs.mkdirs(abs_file_name)
    if r.status_code == 200:
        with open(lake_root + file_name, "wb") as f:
            f.write(r.content)
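A small variation on the same idea, as a sketch (it keeps the answer's 'fail' placeholder assumption): the fake directory only needs to be created once, so the mkdirs call can be hoisted out of the loop:

# create the placeholder once so the parent path exists (per the workaround above)
dbutils.fs.mkdirs(lake_root + 'fail')

for index, row in data.iterrows():
    url = row['Creatives']
    file_name = url.split('/')[-1]
    r = requests.get(url)
    if r.status_code == 200:
        with open(lake_root + file_name, "wb") as f:
            f.write(r.content)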
I am writing a test that needs to read a file. I wrote the following function:
upload_file = (file_name, input_class) ->
  await browser.executeScript("window.document.getElementsByClassName('" +
    input_class + "')[0].style.display = 'block'", [])
  file_path = path.join(FILE_PATH, file_name)
  await $('input.' + input_class).then((res) ->
    return res.setValue(file_path)
  )
However, the test fails at this step.
In the code that runs after the file is uploaded to the site, I log the file that was read. All of its information is correct except the file size, which is 0.
Please help me.
I am working on a requirement where I have to save the logs of my ETL scripts to an S3 location.
For this I am able to store the logs on my local system, and now I need to upload them to S3.
For this I have written the following code:
import logging
import datetime
import boto3
from boto3.s3.transfer import S3Transfer
from etl import CONFIG

FORMAT = '%(asctime)s [%(levelname)s] %(filename)s:%(lineno)s %(funcName)s() : %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'

logger = logging.getLogger()
logger.setLevel(logging.INFO)

S3_DOMAIN = 'https://s3-ap-southeast-1.amazonaws.com'
S3_BUCKET = CONFIG['S3_BUCKET']
filepath = ''
folder_name = 'etl_log'
filename = ''

def log_file_conf(merchant_name, table_name):
    log_filename = datetime.datetime.now().strftime('%Y-%m-%dT%H-%M-%S') + '_' + table_name + '.log'
    fh = logging.FileHandler("E:/test/etl_log/" + merchant_name + "/" + log_filename)
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(logging.Formatter(FORMAT, DATETIME_FORMAT))
    logger.addHandler(fh)

client = boto3.client('s3',
                      aws_access_key_id=CONFIG['S3_KEY'],
                      aws_secret_access_key=CONFIG['S3_SECRET'])
transfer = S3Transfer(client)
transfer.upload_file(filepath, S3_BUCKET, folder_name + "/" + filename)
The issue I am facing is that logs are generated for different merchants, so their names are based on the merchant; I have taken care of this when saving locally.
But for the upload to S3 I don't know how to build the log file name.
Can anyone please help me achieve my goal?
S3 is an object store; it doesn't have a "real" path, and the so-called path (the "/" separator) is actually cosmetic. So nothing prevents you from using something similar to your local file naming convention, e.g.
transfer.upload_file(filepath, S3_BUCKET, folder_name+"/" + merchant_name + "/" + filename)
To list all the files under an arbitrary path (it is called a "prefix"), you just do this:
# simple list object, not handling pagination. max 1000 objects listed
client.list_objects(
    Bucket=S3_BUCKET,
    Prefix=folder_name + "/" + merchant_name
)
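Tying that back to the logging setup in the question, here is a sketch of how the same merchant-based naming could be reused for the upload (the upload_log() helper, and the assumption that it runs after the log file is fully written, are illustrative):

def upload_log(merchant_name, log_filename):
    # local path mirrors the FileHandler path used in log_file_conf()
    local_path = "E:/test/etl_log/" + merchant_name + "/" + log_filename
    # the S3 key reuses the same naming convention under the etl_log prefix
    s3_key = folder_name + "/" + merchant_name + "/" + log_filename
    transfer.upload_file(local_path, S3_BUCKET, s3_key)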
I am trying to scrape data from this link https://www.flatstats.co.uk/racing-system-builder.php using scrapy.
I want to automate the ajax call using scrapy.
When I click "Full SP" Button (inspect in Firebug) the post parameter has the sql string which is "strange"
race|2|eq|Ordinary|0|~tRIDER_TYPE
What dialect is this?
My code :
import scrapy
import urllib

class FlatStat(scrapy.Spider):
    name = "flatstat"
    allowed_domains = ["flatstats.co.uk"]
    start_urls = ["https://www.flatstats.co.uk/racing-system-builder.php"]

    def parse(self, response):
        query_lst = response.xpath('//table[@id="system"]//tr/td[last()]/text()').extract()
        query_str = ' '.join(query_lst)

        url = 'https://www.flatstats.co.uk/ajax/sb_report.php'
        body_dict = {
            'a_e_max': '9.99',
            'a_e_min': '0',
            'arch_min': '0',
            'exp_min': '0',
            'report_type': 'S',
            # copied from the POST parameters by inspecting. Actually I tried everything.
            'sqlFullString': u'Type%20(Rider)%7C%3D%7COrdinary%20(Exclude%20Amatr%2C%20App%2C%20Lady%20Races)%7CAND%7Crace%7C2%7C0%7COrdinary%7C0%7C~tRIDER_TYPE%7C-t%7Ceq',
            # I tried copying this from the POST parameters as well but no success.
            # I also tried the sql from the table //td text() which is "normal" sql but no success.
            'sqlString': query_str
        }

        # here I tried everything, FormRequest as well, though there is no form.
        return scrapy.Request(url, method="POST", body=urllib.urlencode(body_dict), callback=self.parse_page)

    def parse_page(self, response):
        with open("response.html", "w") as f:
            f.write(response.body)
So the questions are:
What is this sql?
Why isn't it returning the required page? How can I run the right query?
I tried Selenium as well, to click the button and let it do the work itself, but that is another unsuccessful story. :(
It's not easy to say what the website creator is doing with the submitted sqlString. It probably means something very specific to how the data is processed by their backend.
This is an extract of the page's inline JavaScript code:
...
function system_report(type) {
    sqlString = '', sqlFullString = '', rowcount = 0;
    $('#system tr').each(function() {
        if(rowcount > 0) {
            var editdata = this.cells[6].innerHTML.split("|");
            sqlString += editdata[0] + '|' + editdata[1] + '|' + editdata[7] + '|' + editdata[3] + '|' + editdata[4] + '|' + editdata[5] + '^';
            sqlFullString += this.cells[0].innerHTML + '|' + encodeURIComponent(this.cells[1].innerHTML) + '|' + this.cells[2].innerHTML + '|' + this.cells[3].innerHTML + '|' + this.cells[6].innerHTML + '^';
        }
        rowcount++;
    });
    sqlString = sqlString.slice(0, -1)
...
Looks non trivial to reverse-engineer.
Although it's not a solution to your "sql" question above, I suggest that you try using splash (an alternative to selenium in some cases).
You can launch it with docker (the easiest way):
$ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
With the following script:
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    -- this clicks the "Full SP" button
    assert(splash:runjs("$('#b-full-report').click()"))
    -- loading the report takes some time
    assert(splash:wait(5))
    return {
        html = splash:html()
    }
end
you can get the page HTML with the report popup loaded.
You can integrate Splash with Scrapy using scrapyjs (a.k.a. scrapy-splash).
See https://stackoverflow.com/a/35851072/ for an example of how to do so with a custom script.
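For reference, here is a sketch of how the Lua script above could be wired into a spider with scrapy-splash (it assumes SPLASH_URL and the scrapy-splash downloader middlewares are configured as described in that project's README; response.data holds the table returned by the script):

import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))
    assert(splash:runjs("$('#b-full-report').click()"))
    assert(splash:wait(5))
    return {html = splash:html()}
end
"""

class FlatStatSplash(scrapy.Spider):
    name = "flatstat_splash"
    start_urls = ["https://www.flatstats.co.uk/racing-system-builder.php"]

    def start_requests(self):
        for url in self.start_urls:
            # run the custom Lua script through Splash's /execute endpoint
            yield SplashRequest(url, self.parse_result,
                                endpoint='execute',
                                args={'lua_source': LUA_SCRIPT})

    def parse_result(self, response):
        # the 'html' key comes from the table returned by the Lua script
        with open("response.html", "w") as f:
            f.write(response.data.get('html', ''))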