Importing OneDrive files in Streamlit based on conditions in the URL - urllib

I have created an app that generates automatic reports for my team and me based on data spread across multiple files (> 200). On my localhost Streamlit app, I could input a few parameters (year, deployment number, etc.) and the app would automatically use the correct files (3 out of 200 for each set of parameters) and generate the desired report.
However, now that I have deployed my app, I want it to select the desired files from a shared OneDrive to which my whole team has access. This way the data would all be stored online in one location and the app would automatically pick up only the files needed, depending on the parameters entered by the user.
I have two problems:
1. I would like to open a csv file from a OneDrive URL. The method below gives me an error "urllib.error.HTTPError: HTTP Error 400: Bad Request":
'''
import base64
import csv
import urllib.request
from contextlib import closing

import pandas as pd
import requests


def create_onedrive_directdownload(onedrive_link):
    # Base64-encode the sharing link and turn it into a direct-download URL
    data_bytes64 = base64.b64encode(bytes(onedrive_link, 'utf-8'))
    data_bytes64_String = data_bytes64.decode('utf-8').replace('/', '_').replace('+', '-').rstrip("=")
    resultUrl = f"https://api.onedrive.com/v1.0/shares/u!{data_bytes64_String}/root/content"
    return resultUrl


onedrive_link = "https://my.sharepoint.com/:x:/s/myteam/..."
onedrive_direct_link = create_onedrive_directdownload(onedrive_link)
df = pd.read_csv(onedrive_direct_link)  # this raises HTTP Error 400: Bad Request

# Second attempt, fetching the share link directly with requests
r = requests.get(onedrive_link)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
'''
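For reference, this is roughly what I think the direct-download fetch should look like. It is only a sketch (the share URL is still the placeholder above), but raise_for_status() at least surfaces the real HTTP status instead of a bare urllib error buried inside pandas:
'''
import base64
import io

import pandas as pd
import requests

def create_onedrive_directdownload(onedrive_link):
    # urlsafe_b64encode already does the '+/' -> '-_' substitution
    encoded = base64.urlsafe_b64encode(onedrive_link.encode('utf-8')).decode('utf-8').rstrip('=')
    return f"https://api.onedrive.com/v1.0/shares/u!{encoded}/root/content"

onedrive_link = "https://my.sharepoint.com/:x:/s/myteam/..."
resp = requests.get(create_onedrive_directdownload(onedrive_link))
resp.raise_for_status()                   # shows the real status code (400, 401, 403, ...)
df = pd.read_csv(io.StringIO(resp.text))  # only reached if the download succeeded
'''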
2. I would like the app to select the right files based on the first part of the URL only, since the end of the URL is a random string of numbers and letters but the beginning is predictable (all the files have a formatted name that includes the input parameters, i.e. year, deployment number, instrument). So what I am trying to do is something like this:
'''
folder_path = "<URL of the OneDrive folder>"
file_prefix_number = "062"
year = 2013
deployment = ...  # comes from the user inputs in the app
urlADP = f"{folder_path}/{file_prefix_number}_ADP_{year}-{deployment}.csv"
# pseudocode: if a file matching urlADP exists in the folder, read it; otherwise ignore it
df = pd.read_csv(urlADP)
'''
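In other words, something like the sketch below, assuming I can somehow get a listing of the file links in the shared folder (candidate_urls is a made-up name for that listing):
'''
# candidate_urls is hypothetical: a list of file links obtained by listing the shared folder
expected_name = f"{file_prefix_number}_ADP_{year}-{deployment}.csv"

matching = [url for url in candidate_urls if expected_name in url]
if matching:
    df = pd.read_csv(matching[0])  # read the first (and normally only) match
# files that do not match the expected name are simply ignored
'''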
Any advice would be very welcome; I have tried many methods without success, but I am afraid my Python knowledge is not that good.
Thank you in advance!

Related

Godot: game gets stuck when trying to access local files

I have a function that gets an id, finds the relevant local text file, and returns the text from the file.
func load_file(book_id):
    # placeholder_file and text_from_file are presumably declared elsewhere in the script
    var file = placeholder_file % str(book_id)
    var f = File.new()
    f.open(file, File.READ)
    var index = 1
    while not f.eof_reached():
        var line = f.get_line()
        text_from_file += line
        index += 1
    f.close()
    return text_from_file
This function seems to work fine when I run the game in Godot, but when exporting to HTML or Mac the game gets stuck at the exact moment when the function is triggered.
Found the source of the issue. Godot doesn't recognize .txt files and doesn't include them in the exported game unless explicitly instructed to do so.
To include .txt files in your exported game, use the field in the export window called "Filters to export non-resource files/folders".
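For example, entering *.txt in that field (it accepts comma-separated glob patterns) should make the export include every .txt file in the project.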

Scrapy upload files to dynamically created directories in S3 based on field

I've been experimenting with Scrapy for some time now and recently have been trying to upload files (data and images) to an S3 bucket. If the directory is static, it is pretty straightforward and I didn't hit any roadblocks. But what I want to achieve is to dynamically create directories based on a certain field from the extracted data and place the data and media in those directories. The template path, if you will, is below:
s3://<bucket-name>/crawl_data/<account_id>/<media_type>/<file_name>
For example if the account_id is 123, then the images should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/images/file_name.jpeg
and the data file should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/data/file_name.json
I have been able to achieve this for the media downloads (kind of a crude way to segregate media types, as of now), with the following custom File Pipeline:
import os
from urllib.parse import urlparse

from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline


class CustomFilepathPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        file_name = os.path.basename(urlparse(request.url).path)
        if ".mp4" in file_name:
            media_type = "video"
        else:
            media_type = "image"
        file_path = f"crawl_data/{account_id}/{media_type}/{file_name}"
        return file_path
The following settings have been configured at a spider level with custom_settings:
custom_settings = {
    'FILES_STORE': 's3://<my_s3_bucket_name>/',
    'FILES_RESULT_FIELD': 's3_media_url',
    'DOWNLOAD_WARNSIZE': 0,
    'AWS_ACCESS_KEY_ID': <my_access_key>,
    'AWS_SECRET_ACCESS_KEY': <my_secret_key>,
}
So the media part works flawlessly: I have been able to download the images and videos into their separate directories in the S3 bucket, based on the account_id. My question is:
Is there a way to achieve the same results with the data files as well? Maybe another custom pipeline?
I have tried to experiment with the first example on the Item Exporters page but couldn't make any headway. One thing that I thought might help is to use boto3 to establish a connection and then upload the files, but that would probably require me to segregate the files locally first and upload them afterwards, using a combination of pipelines (to split the data) and signals (to upload the files to S3 once the spider is closed). A rough sketch of the boto3 idea is below.
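This is only how I imagine it, not working code; S3DataExportPipeline is a made-up name, the bucket and key layout are placeholders following the template path above:
import json

import boto3
from itemadapter import ItemAdapter

class S3DataExportPipeline:
    """Hypothetical pipeline: write each item as JSON under crawl_data/<account_id>/data/."""

    def open_spider(self, spider):
        self.s3 = boto3.client("s3")          # credentials picked up from the environment/settings
        self.bucket = "<my_s3_bucket_name>"   # placeholder, same bucket as FILES_STORE

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        key = f"crawl_data/{account_id}/data/{account_id}.json"  # placeholder file name
        self.s3.put_object(Bucket=self.bucket, Key=key,
                           Body=json.dumps(adapter.asdict()).encode("utf-8"))
        return item
(It would still need to be enabled in ITEM_PIPELINES, and I have not actually run this.)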
Any thoughts and/or guidance on this or a better approach would be greatly appreciated.

Fetching data in Splunk using REST API

I want to import XML data into Splunk using the .py script below.
My concerns are:
Can I directly configure the .py script's output to be indexed in Splunk using inputs.conf, or do I need to save the output to a .csv file first? If the latter, can anyone please suggest an approach so that the data does not get changed when it is stored in the new .csv file?
How can I configure that .py file to fetch data every 5 minutes?
import requests
import xmltodict
import json
url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
content = xmltodict.parse(response.text)
print(content)
If you put your Python script into a [script://] stanza in inputs.conf then not only can you have Splunk launch the script automatically every 5 minutes, but anything the script writes to stdout will be indexed in Splunk.
[script:///path/to/the/script.py]
interval = */5 * * * *
index = main
sourcetype = foo
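As an illustration, a minimal sketch of what the scripted-input version of the script above could look like (converting the parsed XML to JSON is just one way to get a clean single-line event; this is an untested sketch):
import json

import requests
import xmltodict

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
content = xmltodict.parse(response.text)

# anything printed to stdout by a [script://] input is indexed by Splunk
print(json.dumps(content))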

Load data only once into RAM using Python

Hopefully someone can help me. I have a set of static data files for some data analysis; however, every time I run my script it takes a really long time to see what is happening, because the data is loaded every time. Is there a way to load the data once and afterwards just work with it?
I have been using Jupyter notebooks and it works really well, but I would like a way to fix this problem using plain Python code.
The sequence of my code is:
File 1: contains all the functions;
File 2: contains all the variables and calls File 1 in order to know what to do with the data.
File 1 (functions.py):
import numpy as np

def dict_files(filepath_lst):
    dictoffiles = {}
    for namefile in filepath_lst:
        content_file = np.loadtxt(namefile)
        dictoffiles[namefile] = content_file
    ## Sorting files according to smallest timestamp to largest ##
    sorted_dictoffiles = {keys: values for keys, values in sorted(dictoffiles.items(), key=lambda item: item[1][0, 0])}
    return sorted_dictoffiles
File 2:
import glob
from os.path import join as filejoin  # assumption: filejoin refers to os.path.join

import functions as f

### ----------File Path -----------###
directory = 'some_file_path'
file_path = glob.glob(filejoin(directory, '*.dat'))
dictionary_of_files = f.dict_files(file_path)

Issue in getting iteration data from rally API

I am using the following URL to get the iteration data from Rally.
I then parse the JSON data received.
def query = URLEncoder.encode("(Project.Name contains \"1 Prime Infrastructure\")", "UTF-8")
def rallyURL = "https://us1.rallydev.com/slm/webservice/v2.0/iteration?query="+query+"&fetch=true&start=1&pagesize=200"
The issue is that it gives 0 records, but when I change the name to some other project the data comes through.
It is probably because of the default workspace for my username and password. I want project data from a different workspace.
I have access to all of these workspaces.
Can someone tell me how to set the workspace before making an API call so that I can get the iteration data?
Thanks,
You can simply include a workspace parameter in your URL to override the default:
&workspace=/workspace/12345
You can also always further refine your results to a specific project:
&project=/project/12345
Or to a specific hierarchy:
&projectScopeUp=true
&projectScopeDown=true
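Applied to the URL from the question, it would look something like this (the workspace OID 12345 is a placeholder for your actual workspace's object ID):
https://us1.rallydev.com/slm/webservice/v2.0/iteration?query=<encoded-query>&fetch=true&start=1&pagesize=200&workspace=/workspace/12345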