Refactoring code so I don't have to implement 100+ functions - pandas

I'm making a crypto scanner that has to scan 100+ different crypto coins at the same time. I'm having a really hard time simplifying this code, because if I don't, I'm going to end up with more than 100 functions for something really simple. I'll post below what I'm trying to refactor.
import pandas as pd
from binance import ThreadedWebsocketManager  # from the python-binance package

def main():
    # api_key and api_secret are defined elsewhere
    twm = ThreadedWebsocketManager(api_key=api_key, api_secret=api_secret)
    twm.start()

    dic = {'close': [], 'low': [], 'high': []}
    dic2 = {'close': [], 'low': [], 'high': []}

    def handle_socket_message(msg):
        candle = msg['k']
        close_price = candle['c']
        highest_price = candle['h']
        lowest_price = candle['l']
        status = candle['x']  # True once the candle has closed
        if status:
            dic['close'].append(close_price)
            dic['low'].append(lowest_price)
            dic['high'].append(highest_price)
            df = pd.DataFrame(dic)
            print(df)

    def handle_socket_message2(msg):
        candle = msg['k']
        close_price = candle['c']
        highest_price = candle['h']
        lowest_price = candle['l']
        status = candle['x']
        if status:
            dic2['close'].append(close_price)
            dic2['low'].append(lowest_price)
            dic2['high'].append(highest_price)
            df = pd.DataFrame(dic2)
            print(df)

    twm.start_kline_socket(callback=handle_socket_message, symbol='ETHUSDT')
    twm.start_kline_socket(callback=handle_socket_message2, symbol='BTCUSDT')
    twm.join()
As you can see, I'm getting live data from BTCUSDT and ETHUSDT. I append the close, low, and high prices to a dictionary and then make a DataFrame out of those dictionaries. I tried to do this with one dictionary and one handle_socket_message function, but then it merges the values of both coins into one DataFrame, which is not what I want. Does anyone know how I can refactor this piece of code? I was thinking about something with a loop, but I can't figure it out myself.
If you have any questions, ask away! Thanks in advance!

I don't know exactly what you are trying to do, but the following code might get you started (basically use a dict of dicts):
twm = ThreadedWebsocketManager(api_key=api_key, api_secret=api_secret)
twm.start()

symbols = ['ETHUSDT', 'BTCUSDT']
symbolToMessageKeys = {
    'close': 'c',
    'high': 'h',
    'low': 'l'
}

# build one dict of lists per symbol
dictPerSymbol = dict()
for sym in symbols:
    d = dict()
    dictPerSymbol[sym] = d
    for key in symbolToMessageKeys.keys():
        d[key] = list()
print(dictPerSymbol)

def handle_socket_message(msg):
    candle = msg['k']
    if candle['x']:  # only record completed candles
        d = dictPerSymbol[msg['s']]  # 's' carries the symbol, e.g. 'ETHUSDT'
        for symbolKey, msgKey in symbolToMessageKeys.items():
            d[symbolKey].append(candle[msgKey])
        df = pd.DataFrame(d)
        print(df)

# one socket per symbol, all sharing the same callback
for sym in symbols:
    twm.start_kline_socket(callback=handle_socket_message, symbol=sym)
twm.join()
Luckily, appending to lists appears to be thread safe. Warning: if it is not, the code in this answer has a major race condition. I should also note that I have used neither ThreadedWebsocketManager nor DataFrames, so the latter may also introduce thread-safety issues if it writes into the provided dictionary.
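If you would rather not rely on that, here is a minimal sketch of the same handler guarded by an explicit lock (the lock is my addition; the other names reuse the snippet above):
import threading

lock = threading.Lock()

def handle_socket_message(msg):
    candle = msg['k']
    if candle['x']:
        with lock:  # serialize writers so the per-symbol lists never interleave
            d = dictPerSymbol[msg['s']]
            for symbolKey, msgKey in symbolToMessageKeys.items():
                d[symbolKey].append(candle[msgKey])
            df = pd.DataFrame(d)
        print(df)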

Related

sort dataframe by string and set a new id

Is there a way to sort the strings in numeric order, for example 1.wav, 2.wav, 3.wav, etc., and set the ID accordingly (ID: 1, 2, 3, etc.)?
I have already tried several sorting options. Do any of you have any ideas?
Thank you in advance.
[screenshot: dataframe output]
import os
from pathlib import Path
import pandas as pd

def createSampleDF(audioPath):
    data = []
    for file in Path(audioPath).glob('*.wav'):
        print(file)
        data.append([os.path.basename(file), file])
    df_dataSet = pd.DataFrame(data, columns=['audio_name', 'filePath'])
    df_dataSet['ID'] = df_dataSet.index + 1
    df_dataSet = df_dataSet[['ID', 'audio_name', 'filePath']]
    df_dataSet.sort_values(by=['audio_name'], inplace=True)
    return df_dataSet

def createSamples(myAudioPath, savePath, sampleLength, overlap=0):
    # cutSamples is defined elsewhere in the project
    cutSamples(myAudioPath=myAudioPath, savePath=savePath, sampleLength=sampleLength)
    df_dataSet = createSampleDF(audioPath=savePath)
    return df_dataSet
You can split the string, make it an integer, and then sort on multiple columns. See pandas.DataFrame.sort_values for more info. If your names are more complicated, you may need to design a regex to pull out the integers you want to sort on, using pandas.Series.str.extract.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'audio_name': ['1.wav', '10.wav', '96.wav', '3.wav', '55.wav']})
(df
 .assign(audio_name=lambda df_: df_.audio_name.str.split('.', expand=True).iloc[:, 0].astype('int'))
 .sort_values(by=['audio_name', 'ID']))
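For messier file names, here is a sketch of the str.extract variant mentioned above (the regex and sample names are my own assumptions; adjust them to your actual data):
import pandas as pd

df = pd.DataFrame({'audio_name': ['take_1.wav', 'take_10.wav', 'take_3.wav']})

# pull the first run of digits out of each name and sort on it numerically
df = (df
      .assign(num=df['audio_name'].str.extract(r'(\d+)', expand=False).astype('int'))
      .sort_values('num')
      .drop(columns='num')
      .reset_index(drop=True))
df['ID'] = df.index + 1  # re-number the IDs to match the new order
print(df)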

TypeError: 'Value' object is not iterable: iterating over a DataFrame for prediction with a GCP Natural Language model

I'm trying to iterate over a dataframe in order to apply a predict function, which calls a Natural Language model hosted on GCP. Here is the loop code:
model = 'XXXXXXXXXXXXXXXX'
barometre_df_processed = barometre_df
barometre_df_processed['theme'] = ''
barometre_df_processed['proba'] = ''

print('DEBUT BOUCLE FOR')
for ind in barometre_df.index:
    if barometre_df.verbatim[ind] is np.nan:
        barometre_df_processed.theme[ind] = "RAS"
        barometre_df_processed.proba[ind] = "1"
    else:
        print(barometre_df.verbatim[ind])
        print(type(barometre_df.verbatim[ind]))
        res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]},
                                        'mime_type': 'text/plain'},
                             model_name=model)
        print(res)
        theme = res['displayNames']
        proba = res["classification"]["score"]
        barometre_df_processed.theme[ind] = theme
        barometre_df_processed.proba[ind] = proba
and the get_prediction function that I took from the Natural Language AI documentation:
def get_prediction(file_path, model_name):
    options = ClientOptions(api_endpoint='eu-automl.googleapis.com:443')
    prediction_client = automl_v1.PredictionServiceClient(client_options=options)
    payload = file_path
    # Uncomment the following line (and comment the above line) if you want to predict on PDFs.
    # payload = pdf_payload(file_path)
    parameters_dict = {}
    params = json_format.ParseDict(parameters_dict, Value())
    request = prediction_client.predict(name=model_name, payload=payload, params=params)
    print("fonction prediction")
    print(request)
    return (resultat[0]["displayName"], resultat[0]["classification"]["score"],
            resultat[1]["displayName"], resultat[1]["classification"]["score"],
            resultat[2]["displayName"], resultat[2]["classification"]["score"])
I'm looping this way because I want each [displayName, score] pair to create a new line in my final dataframe, to get something like this:
verbatim1, theme1, proba1
verbatim1, theme2, proba2
verbatim1, theme3, proba3
verbatim2, theme1, proba1
verbatim2, theme2, proba2
...
The if barometre_df.verbatim[ind] is np.nan check is not causing problems; I just use it to deal with NaNs, so don't worry about it.
The error that I have is this one :
TypeError: 'Value' object is not iterable
I guess the issue is with
res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]}}, model_name=model)
but I can't figure out what's going wrong here.
I already tried to remove
,'mime_type': 'text/plain'}
from my get_prediction parameters, but it doesn't change anything.
Does someone know how to deal with this issue?
Thank you in advance.
I think you are not iterating correctly.
The way to iterate through a dataframe is:
for index, row in df.iterrows():
    print(row['col1'])
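Building on that, here is a sketch of how the question's desired long format could be assembled row by row (predict_fn is a hypothetical stand-in for the GCP prediction call, not part of the original code; assume it returns a list of (theme, proba) pairs for one verbatim):
import pandas as pd

rows = []
for index, row in barometre_df.iterrows():
    # predict_fn is hypothetical: one (theme, proba) pair per prediction
    for theme, proba in predict_fn(row['verbatim']):
        rows.append({'verbatim': row['verbatim'], 'theme': theme, 'proba': proba})

result_df = pd.DataFrame(rows, columns=['verbatim', 'theme', 'proba'])
print(result_df)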

What exactly to test (unittest) in a larger function containing several dataframe manipulations

Perhaps this is a constraint of my understanding of unit tests, but I get quite confused as to what should be tested, patched, etc. in a method that has several pandas dataframe manipulations. Many of the unittest examples out there focus on classes and methods that are typically small. For larger methods, I get a bit lost on the typical unittest paradigm. For example:
myscript.py
class Pivot:
    def prepare_dfs(self):
        df = pd.read_csv(self.file, sep=self.delimiter)
        g = df.groupby("Other_Location")
        df1 = g.apply(lambda x: x[x["PRN"] == "Free"].count())
        locations = ["O12-03-01", "O12-03-02"]
        cp = df1["PRN"]
        cp = cp[locations].tolist()
        data = [locations, cp]
        new_df = pd.DataFrame({"Other_Location": data[0], "Free": data[1]})
        return new_df, df
test_myscript.py
class TestPivot(unittest.TestCase):
    def setUp(self):
        args = parse_args(["-f", "test1", "-d", ","])
        self.pivot = Pivot(args)
        self.pivot.path = "Pivot/path"

    @mock.patch("myscript.cp[locations].tolist()", return_value=None)
    @mock.patch("myscript.pd.read_csv", return_value=df)
    def test_prepare_dfs_1(self, mock_read_csv, mock_cp):
        new_df, df = self.pivot.prepare_dfs()
        # Here I get a bit lost
For example, here I try to circumvent the following error message:
ModuleNotFoundError: No module named 'myscript.cp[locations]'; 'myscript' is not a package
I managed to mock pd.read_csv correctly in my method; however, further down in the code there are groupby, apply, tolist, etc. The error message is thrown at the following line:
cp = cp[locations].tolist()
What is the best way to approach unit testing when your method involves several manipulations on a dataframe? Is refactoring the code into smaller chunks always advised? In this case, how can I correctly mock the tolist call?
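One commonly recommended way to sidestep this (my sketch, not from the original thread) is to patch only the I/O boundary, pd.read_csv, with a small in-memory frame and let the groupby/apply/tolist chain run for real, asserting on the returned frames instead of mocking pandas internals. The sample data and expected counts below are my own assumptions:
import unittest
from unittest import mock
import pandas as pd
from myscript import Pivot, parse_args

class TestPreparedDfs(unittest.TestCase):
    @mock.patch("myscript.pd.read_csv")
    def test_prepare_dfs_counts_free_prns(self, mock_read_csv):
        # a small in-memory frame stands in for the CSV; nothing else is patched
        mock_read_csv.return_value = pd.DataFrame({
            "Other_Location": ["O12-03-01", "O12-03-01", "O12-03-02"],
            "PRN": ["Free", "Used", "Free"],
        })
        pivot = Pivot(parse_args(["-f", "test1", "-d", ","]))
        new_df, _ = pivot.prepare_dfs()
        # one "Free" PRN per location in the sample data above
        self.assertEqual(new_df["Free"].tolist(), [1, 1])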

Should I use classes for pandas.DataFrame?

I have more of a general question. I've written a couple of functions that transform data successively:
def func1(df):
    pass

...

def main():
    df = pd.read_csv()
    df1 = func1(df)
    df2 = func2(df1)
    df3 = func3(df2)
    df4 = func4(df3)
    df4.to_csv()

if __name__ == "__main__":
    main()
Is there a better way of organizing the logic of my script?
Should I use classes for cases like this when everything is tied to one dataset?
It depends on your use case. From what I understand, I would use a dictionary of the functions that process a df.
For instance:
function_returning_a_df = {"f1": func1, "f2": func2, "f3": func3}

df = pd.read_csv(csv)

# if this df needs three functions applied to it:
df_processing = ["f1", "f2", "f3"]  # functions will be applied in this order

# if you need to keep the df at every step you can make a list
dfs_processed = []
for func in df_processing:
    dfs_processed.append(df)  # if you want to save all steps
    df = function_returning_a_df[func](df)
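As a side note (my suggestion, not part of the answer above): if the steps are fixed, pandas' built-in DataFrame.pipe chains them without the lookup dict, assuming each funcN takes and returns a DataFrame:
import pandas as pd

df4 = (pd.read_csv("input.csv")  # file names here are placeholders
       .pipe(func1)
       .pipe(func2)
       .pipe(func3)
       .pipe(func4))
df4.to_csv("output.csv")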

How to deal with sublists and dataframe with pandas?

My project is composed of several lists that I put together in a dataframe with pandas and then export to Excel.
But one of my lists contains sublists, and I don't know how to deal with that.
my_dataframe = pd.DataFrame({
    "V1": list1,
    "V2": list2,
    "V3": list3
})
my_dataframe.to_excel("test.xlsx", sheet_name="Sheet 1", index=False, encoding='utf8')
Let's say that:
list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']
list3 = ['d', ['a', 'b', 'c'], 'e']
I would like my Excel file to end up with the sublist values spread over their own rows.
I really have no idea how to proceed, or whether this is even possible?
Any help is welcomed :) Thanks!
Try this before calling to_excel:
my_dataframe = (my_dataframe["V3"].apply(pd.Series)
                .merge(my_dataframe.drop("V3", axis=1), right_index=True, left_index=True)
                .melt(id_vars=['V1', 'V2'], value_name="V3")
                .drop("variable", axis=1)
                .dropna()
                .sort_values("V1"))
Credit to Bartosz.
Hope this helps.
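As a further side note (my addition, not from the original answer): on pandas 0.25+, DataFrame.explode does the same row expansion in one step:
import pandas as pd

my_dataframe = pd.DataFrame({
    "V1": [1, 2, 3],
    "V2": ['a', 'b', 'c'],
    "V3": ['d', ['a', 'b', 'c'], 'e']
})

# every element of a list in V3 gets its own row; scalar values stay as they are
exploded = my_dataframe.explode("V3").reset_index(drop=True)
print(exploded)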