Load data into RAM only once using Python - numpy

Hopefully someone can help me. I have a set of static data files for some data analysis; however, every time I run my script it takes a really long time to see what is happening, because the data is reloaded every time. Is there a way to load the data once and then just work with it?
I have been using Jupyter notebooks and that works really well, but I would like a way to fix this problem using plain Python code.
The sequence of my code is:
File 1: contains all the functions.
File 2: contains all the variables and calls file 1 in order to know what to do with the data.
File 1 = functions.py
import numpy as np

def dict_files(filepath_lst):
    dictoffiles = {}
    for namefile in filepath_lst:
        content_file = np.loadtxt(namefile)
        dictoffiles[namefile] = content_file
    ## Sorting files according to smallest timestamp to largest ##
    sorted_dictoffiles = {keys: values for keys, values in sorted(dictoffiles.items(), key=lambda item: item[1][0, 0])}
    return sorted_dictoffiles
File 2
import glob
from os.path import join as filejoin  # assumed import for the filejoin used below

import functions as f

### ----------File Path -----------###
directory = 'some_file_path'
file_path = glob.glob(filejoin(directory, '*.dat'))
dictionary_of_files = f.dict_files(file_path)
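One common approach (not from the original question) is to cache each parsed array to disk in NumPy's binary format the first time, so later runs skip the slow np.loadtxt() parsing. A minimal sketch; the .npy cache files are written next to the original .dat files:

import os
import glob
from os.path import join as filejoin

import numpy as np

def load_cached(namefile):
    """Load a .dat file, caching the parsed array as .npy for faster reruns."""
    cache = namefile + '.npy'
    if os.path.exists(cache):
        return np.load(cache)        # fast binary load on later runs
    data = np.loadtxt(namefile)      # slow text parse, first run only
    np.save(cache, data)
    return data

directory = 'some_file_path'
files = glob.glob(filejoin(directory, '*.dat'))
dictionary_of_files = {name: load_cached(name) for name in files}
# Sort by the first timestamp, as dict_files() does
dictionary_of_files = {k: v for k, v in sorted(dictionary_of_files.items(), key=lambda item: item[1][0, 0])}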

Related

Importing OneDrive files in Streamlit based on conditions in the URL

I have created an app that generates automatic reports for my team and me based on data located in multiple files (> 200). On my localhost Streamlit app, I could input a few parameters (year, deployment number, etc.) and the app would automatically use the correct files (3 out of 200 for each set of parameters) and generate the desired report.
However, now that I have deployed my app, I want it to select the desired files from a general OneDrive to which my whole team has access. This means the data would all be stored online in one location, and the app would automatically take only the files needed, depending on the input parameters entered by the user.
I have two problems:
1. I would like to open a csv file from a OneDrive URL. The method below gives me an error "urllib.error.HTTPError: HTTP Error 400: Bad Request":
'''
import base64
import urllib.request
import requests
import pandas as pd  # needed for pd.read_csv below
from contextlib import closing
import csv

def create_onedrive_directdownload(onedrive_link):
    data_bytes64 = base64.b64encode(bytes(onedrive_link, 'utf-8'))
    data_bytes64_String = data_bytes64.decode('utf-8').replace('/', '_').replace('+', '-').rstrip("=")
    resultUrl = f"https://api.onedrive.com/v1.0/shares/u!{data_bytes64_String}/root/content"
    return resultUrl

onedrive_link = "https://my.sharepoint.com/:x:/s/myteam/..."
onedrive_direct_link = create_onedrive_directdownload(onedrive_link)
df = pd.read_csv(onedrive_direct_link)

r = requests.get(onedrive_link)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
'''
2. I would like the app to select the right files based on the first part of the URL only, since the end of the URL is a random string of numbers and letters but the beginning is predictable (all the files have a formatted name which includes the input parameters, i.e. year, deployment number, instrument). So what I am trying to do is something like this:
'''
folder_path = url to OneDrive
file_prefix_number = 062
year = 2013
if url contains "'+str(folder_path)+'/'+str(file_prefix_number)+'_ADP_'+str(year)+'-'+str(deployment)+'.csv'" then df = pd.read_csv(urlADP);
else ignore
'''
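For reference, a minimal sketch of that file-selection idea in plain Python; the base URL, parameter values, and file-name pattern below are hypothetical, mirroring the pseudocode above rather than any real OneDrive layout:

import pandas as pd

# Hypothetical inputs mirroring the pseudocode above
folder_path = "https://my.sharepoint.com/sites/myteam/data"  # assumed base URL
file_prefix_number = "062"
year = 2013
deployment = 1

# Build the predictable part of the file name from the input parameters
expected_name = f"{file_prefix_number}_ADP_{year}-{deployment}.csv"

# Only read URLs whose name matches the expected pattern
candidate_urls = [folder_path + "/" + expected_name]  # in practice, the list of share links
for url in candidate_urls:
    if expected_name in url:
        df = pd.read_csv(url)
        break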
Any advice would be very welcome; I have been trying many methods unsuccessfully, but I am afraid my Python knowledge is not that good.
Thank you in advance!

fetching data in Splunk using rest api

I want to import XML data into Splunk using the .py script below.
My concerns are:
Can I directly configure the .py script output to index data in Splunk using inputs.conf, or do I need to save the output into a .csv file first? If so, can anyone please suggest an approach so that the data does not get changed after storing it in a new .csv file.
How can I configure that .py file to fetch data every 5 minutes?
import requests
import xmltodict
import json

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
content = xmltodict.parse(response.text)
print(content)
If you put your Python script into a [script://] stanza in inputs.conf then not only can you have Splunk launch the script automatically every 5 minutes, but anything the script writes to stdout will be indexed in Splunk.
[script:///path/to/the/script.py]
interval = */5 * * * *
index = main
sourcetype = foo
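To address the first concern: no intermediate .csv file is needed. A scripted input just has to print its events to stdout. A minimal sketch, assuming the w3schools XML parses into a CATALOG/PLANT structure, printing one JSON object per plant so that each record becomes its own event:

import json
import requests
import xmltodict

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
content = xmltodict.parse(response.text)

# Splunk indexes whatever the scripted input writes to stdout;
# emitting one JSON object per record keeps the events cleanly separated
for plant in content["CATALOG"]["PLANT"]:
    print(json.dumps(plant))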

Denormalize a GCS file before uploading to BigQuery

I have written a Cloud Run API in .NET Core that reads files from a GCS location and is then supposed to denormalize them (i.e. add more information to each row, including textual descriptions) and write the result to a BigQuery table. I have two options:
My cloud run API could create denormalized CSV files and write them to another GCS location. Then another cloud run API could pick up those denormalized CSV files and write them straight to BigQuery.
My cloud run API could read the original CSV file, denormalize them in memory (filestream) and then somehow write from the in memory filestream straight to the BigQuery table.
What is the best way to write to BigQuery in this scenario if performance (speed) and cost (monetary) are my goals? These files are roughly 10 KB each before denormalizing, and each row is roughly 1000 characters; after denormalizing it is about three times as much. I do not need to keep the denormalized files after they are successfully loaded into BigQuery. I am concerned about performance, as well as any specific BigQuery daily quotas around inserts/writes. I don't think there are any unless you are doing DML statements, but correct me if I'm wrong.
I would use Cloud Functions that are triggered when you upload a file to a bucket.
It is so common that Google has a repo with a tutorial just for this for JSON files: Streaming data from Cloud Storage into BigQuery using Cloud Functions.
Then, I would modify the example main.py file from:
def streaming(data, context):
    '''This function is executed whenever a file is added to Cloud Storage'''
    bucket_name = data['bucket']
    file_name = data['name']
    db_ref = DB.document(u'streaming_files/%s' % file_name)
    if _was_already_ingested(db_ref):
        _handle_duplication(db_ref)
    else:
        try:
            _insert_into_bigquery(bucket_name, file_name)
            _handle_success(db_ref)
        except Exception:
            _handle_error(db_ref)
To this, which accepts CSV files:
import json
import csv
import logging
import os
import traceback
from datetime import datetime

from google.api_core import retry
from google.cloud import bigquery
from google.cloud import storage
import pytz

PROJECT_ID = os.getenv('GCP_PROJECT')
BQ_DATASET = 'fromCloudFunction'
BQ_TABLE = 'mytable'

CS = storage.Client()
BQ = bigquery.Client()

def streaming(data, context):
    '''This function is executed whenever a file is added to Cloud Storage'''
    bucket_name = data['bucket']
    file_name = data['name']

    newRows = postProcessing(bucket_name, file_name)

    # It is recommended that you save
    # what you process for debugging reasons.
    destination_bucket = 'post-processed'  # gs://post-processed/
    destination_name = file_name
    # saveRowsToBucket(newRows, destination_bucket, destination_name)
    rowsInsertIntoBigquery(newRows)

class BigQueryError(Exception):
    '''Exception raised whenever a BigQuery error happened'''

    def __init__(self, errors):
        super().__init__(self._format(errors))
        self.errors = errors

    def _format(self, errors):
        err = []
        for error in errors:
            err.extend(error['errors'])
        return json.dumps(err)

def postProcessing(bucket_name, file_name):
    blob = CS.get_bucket(bucket_name).blob(file_name)
    my_str = blob.download_as_string().decode('utf-8')
    csv_reader = csv.DictReader(my_str.split('\n'))
    newRows = []
    for row in csv_reader:
        modified_row = row  # Add your logic
        newRows.append(modified_row)
    return newRows

def rowsInsertIntoBigquery(rows):
    table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
    errors = BQ.insert_rows_json(table, rows)
    if errors != []:
        raise BigQueryError(errors)
It would still be necessary to define your map (row -> newRow) and the function saveRowsToBucket if you needed it.
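As a purely illustrative sketch of what that map could look like (the lookup table and column names here are hypothetical, not from the question):

# Hypothetical lookup table mapping a code in the CSV to a textual description
DESCRIPTIONS = {
    '001': 'Basic widget',
    '002': 'Premium widget',
}

def denormalize_row(row):
    '''Return a copy of the row with an extra textual description column.'''
    new_row = dict(row)
    new_row['item_description'] = DESCRIPTIONS.get(row.get('item_code'), 'unknown')
    return new_row

# Inside postProcessing(): modified_row = denormalize_row(row)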

reading from hive table and updating same table in pyspark - using checkpoint

I am using Spark version 2.3 and trying to read a Hive table in Spark as:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
df = spark.table("emp.emptable")
Here I am adding a new column with the current system date to the existing dataframe:
import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())
and am now facing an issue when I try to write this dataframe as a Hive table:
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'
So I am checkpointing the dataframe to break the lineage, since I am reading and writing from the same dataframe:
checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
This way it works fine and the new column is added to the Hive table, but I have to delete the checkpoint files every time they get created. Is there a better way to break the lineage, write the same dataframe with the updated column details, and save it to an HDFS location or as a Hive table?
Or is there any way to specify a temp location for the checkpoint directory, which will get deleted after the Spark session completes?
As we discussed in this post, setting the property below is the way to go.
spark.conf.set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
That question had a different context: we wanted to retain the checkpointed dataset, so we did not care to add a cleanup solution.
Setting the above property works sometimes (tested in Scala, Java and Python), but it is hard to rely on it. The official documentation says that this property "Controls whether to clean checkpoint files if the reference is out of scope." I don't know what exactly that means, because my understanding is that once the Spark session/context is stopped it should clean them up. It would be great if someone could shed light on it.
Regarding
Is there a better way to break the lineage
Check this question; @BiS found a way to cut the lineage using the createDataFrame(RDD, Schema) method. I haven't tested it myself though.
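For reference, a minimal PySpark sketch of that lineage-cutting idea (untested, as noted above; it simply rebuilds the dataframe from its RDD and schema instead of checkpointing):

import pyspark.sql.functions as F

df = spark.table("emp.emptable")
# Rebuilding the dataframe from its RDD and schema drops the logical plan that
# still references emp.emptable, so the overwrite is no longer self-referential
newdf = spark.createDataFrame(df.rdd, df.schema).withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")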
Just FYI, I usually don't rely on the above property and instead delete the checkpointed directory in code itself, to be on the safe side.
We can get the checkpointed directory like below:
Scala :
//Set directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3
//It gives String so we can use org.apache.hadoop.fs to delete path
PySpark:
# Set directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'
# Notice the 'u' prefix, which means it returns a unicode object; use str(t)

# Below are the steps to get the Hadoop file system object and delete the path
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
False

AWS SageMaker pd.read_pickle() doesn't work but read_csv() does?

I've recently been trying to train some models on an AWS SageMaker jupyter notebook instance.
Everything worked very well until I tried to load in a custom dataset (REDD) through files.
I have the dataframes stored in pickle (.pkl) files in an S3 bucket. I couldn't manage to read them into SageMaker, so I decided to convert them to CSVs, as this seemed to work, but I ran into a problem. This data has an index of type datetime64, and when using .to_csv() this index gets converted to plain text and loses its data structure (and I need to keep this specific index for correct plotting).
So I decided to try the pickle files again, but I can't get it to work and have no idea why.
The following code for CSVs works, but I can't use it due to the index problem:
import pandas as pd

bucket = 'sagemaker-peno'
houses_dfs = {}
data_key = 'compressed_data/'
data_location = 's3://{}/{}'.format(bucket, data_key)

for file in range(6):
    houses_dfs[file+1] = pd.read_csv(data_location + 'house_' + str(file+1) + '.csv', index_col='Unnamed: 0')
But this code does NOT work even though it uses almost the exact same syntax:
bucket = 'sagemaker-peno'
houses_dfs = {}
data_key = 'compressed_data/'
data_location = 's3://{}/{}'.format(bucket, data_key)

for file in range(6):
    houses_dfs[file+1] = pd.read_pickle(data_location + 'house_' + str(file+1) + '.pkl')
Yes, it's 100% the correct path, because the .csv and .pkl files are stored in the same directory (compressed_data).
It throws this error when using the pickle method:
FileNotFoundError: [Errno 2] No such file or directory: 's3://sagemaker-peno/compressed_data/house_1.pkl'
I hope to find someone who has dealt with this before and can solve the read_pickle() issue or, as an alternative, fix my datetime64 type issue with CSVs.
Thanks in advance!
read_pickle() prefers the full path over a relative path from where it is run. This fixed my issue.
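If passing the s3:// URL to read_pickle() still fails (support for S3 paths depends on the pandas/s3fs versions installed), one workaround is to download each object first and unpickle it locally. A minimal sketch, reusing the bucket and key names from the question (assumes boto3 credentials are available on the notebook instance):

import boto3
import pandas as pd

s3 = boto3.client('s3')
houses_dfs = {}

for file in range(6):
    key = 'compressed_data/house_' + str(file+1) + '.pkl'
    local_path = '/tmp/house_' + str(file+1) + '.pkl'
    # Download the pickle to local disk, then read it with pandas
    s3.download_file('sagemaker-peno', key, local_path)
    houses_dfs[file+1] = pd.read_pickle(local_path)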