I'm making a crypto scanner which has to scan 100+ different cryptocoins at the same time. Now I'm having a really hard time simplifying this code because if I don't I'm gonna end up with more than 100 functions for something really easy. I'll post down here what I'm trying to refactor.
def main():
twm = ThreadedWebsocketManager(api_key=api_key,api_secret=api_secret)
twm.start()
dic = {'close': [], 'low': [], 'high': []}
dic2 = {'close': [], 'low': [], 'high': []}
def handle_socket_message(msg):
candle = msg['k']
close_price = candle['c']
highest_price = candle['h']
lowest_price = candle['l']
status = candle['x']
if status:
dic['close'].append(close_price)
dic['low'].append(lowest_price)
dic['high'].append(highest_price)
df = pd.DataFrame(dic)
print(df)
def handle_socket_message2(msg):
candle = msg['k']
close_price = candle['c']
highest_price = candle['h']
lowest_price = candle['l']
status = candle['x']
if status:
dic2['close'].append(close_price)
dic2['low'].append(lowest_price)
dic2['high'].append(highest_price)
df = pd.DataFrame(dic2)
print(df)
twm.start_kline_socket(callback=handle_socket_message, symbol='ETHUSDT')
twm.start_kline_socket(callback=handle_socket_message2, symbol='BTCUSDT')
twm.join()
As you can see I getting live data from BTCUSDT and ETHUSDT. Now I append the close,low and high prices to a dictionary and then I make a DataFrame out of those dictionaries. I tried to do this with 1 dictionary and 1 handle_socket_message function. But then it merges the values of both cryptocoins into 1 dataframe which is not what I want. Does anyone know how I can refactor this piece of code? I was thinking about something with a loop but I can't figure it out myself.
If you have any questions, ask away! Thanks in advance!
I don't know exactly what you are trying to do, but the following code might get you started (basically use a dict of dicts):
twm = ThreadedWebsocketManager(api_key=api_key,api_secret=api_secret)
twm.start()
symbols = ['ETHUSDT', 'BTCUSDT']
symbolToMessageKeys = {
'close': 'c',
'high': 'h',
'low': 'l'
}
dictPerSymbol = dict()
for sym in symbols:
d = dict()
dictPerSymbol[sym] = d
for key in symbolToMessageKeys.keys():
d[key] = list()
print(dictPerSymbol)
def handle_socket_message(msg):
candle = msg['k']
if candle['x']:
d = dictPerSymbol[msg['s']]
for (symbolKey, msgKey) in symbolToMessageKeys.items():
d[symbolKey].append(candle[msgKey])
df = pd.DataFrame(d)
print(df)
for sym in symbols:
twm.start_kline_socket(callback=handle_socket_message, symbol=sym)
twm.join()
Luckily, appending to lists seems thread safe. Warning: if it is not, then we have a major race condition in the code of this answer. I should also note that I haven't used neither ThreadedWebsocketManagers nor DataFrames (so the latter may as well introduce thread safety issues if it is meant to write in the provided dictionary).
Related
is there a possibility to adjust the strings according to the order for example 1.wav, 2.wav 3.wav etc. and the ID accordingly with ID: 1, 2, 3 etc?
i have already tried several sorting options do any of you have any ideas?
Thank you in advance
dataframe output
def createSampleDF(audioPath):
data = []
for file in Path(audioPath).glob('*.wav'):
print(file)
data.append([os.path.basename(file), file])
df_dataSet = pd.DataFrame(data, columns= ['audio_name', 'filePath'])
df_dataSet['ID'] = df_dataSet.index+1
df_dataSet = df_dataSet[['ID','audio_name','filePath']]
df_dataSet.sort_values(by=['audio_name'],inplace=True)
return df_dataSet
def createSamples(myAudioPath,savePath, sampleLength, overlap = 0):
cutSamples(myAudioPath=myAudioPath,savePath=savePath,sampleLength=sampleLength)
df_dataSet=createSampleDF(audioPath=savePath)
return df_dataSet
You can split the string, make it an integer, and then sort on multiple columns. See the pandas.Dataframe.sort_values for more info. If your links are more complicated you may need to design a regex to pull out the integers you want to sort on using pandas.Series.str.extract.
df = pd.DataFrame({
'ID':[1,2,3,4, 5],
'audio_name' : ['1.wav','10.wav','96.wav','3.wav','55.wav']})
(df
.assign(audio_name=lambda df_ : df_.audio_name.str.split('.', expand=True).iloc[:,0].astype('int'))
.sort_values(by=['audio_name','ID']))
I'm trying to iterate over a dataframe in order to apply a predict function, which calls a Natural Language Model located on GCP. Here is the loop code :
model = 'XXXXXXXXXXXXXXXX'
barometre_df_processed = barometre_df
barometre_df_processed['theme'] = ''
barometre_df_processed['proba'] = ''
print('DEBUT BOUCLE FOR')
for ind in barometre_df.index:
if barometre_df.verbatim[ind] is np.nan :
barometre_df_processed.theme[ind]="RAS"
barometre_df_processed.proba[ind]="1"
else:
print(barometre_df.verbatim[ind])
print(type(barometre_df.verbatim[ind]))
res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]},'mime_type': 'text/plain'} },model_name=model)
print(res)
theme = res['displayNames']
proba = res["classification"]["score"]
barometre_df_processed.theme[ind]=theme
barometre_df_processed.proba[ind]=proba
and the get_prediction function that I took from the Natural Language AI Documentation :
def get_prediction(file_path, model_name):
options = ClientOptions(api_endpoint='eu-automl.googleapis.com:443')
prediction_client = automl_v1.PredictionServiceClient(client_options=options)
payload = file_path
# Uncomment the following line (and comment the above line) if want to predict on PDFs.
# payload = pdf_payload(file_path)
parameters_dict = {}
params = json_format.ParseDict(parameters_dict, Value())
request = prediction_client.predict(name=model_name, payload=payload, params=params)
print("fonction prediction")
print(request)
return resultat[0]["displayName"], resultat[0]["classification"]["score"], resultat[1]["displayName"], resultat[1]["classification"]["score"], resultat[2]["displayName"], resultat[2]["classification"]["score"]
I'm doing a loop this way because I want each of my couple [displayNames, score] to create a new line on my final dataframe, to have something like this :
verbatim1, theme1, proba1
verbatim1, theme2, proba2
verbatim1, theme3, proba3
verbatim2, theme1, proba1
verbatim2, theme2, proba2
...
The if barometre_df.verbatim[ind] is np.nan is not causing problems, I just use it to deal with nans, don't take care of it.
The error that I have is this one :
TypeError: 'Value' object is not iterable
I guess the issues is about
res = get_prediction(file_path={'text_snippet': {'content': barometre_df.verbatim[ind]} },model_name=model)
but I can't figure what's goign wrong here.
I already try to remove
,'mime_type': 'text/plain'}
from my get_prediction parameters, but it doesn't change anything.
Does someone knows how to deal with this issue ?
Thank you already.
I think you are not iterating correctly.
The way to iterate through a dataframe is:
for index, row in df.iterrows():
print(row['col1'])
Perhaps this is a constraint of my understanding of unittests, but I get quite confused as to what should be tested, patched, etc in a method that has several pandas dataframe manipulations. Many of the unittest examples out there focus on classes and methods that are typically small. For larger methods, I get a bit lost on the typical unittest paradigm. For example:
myscript.py
class Pivot:
def prepare_dfs(self):
df = pd.read_csv(self.file, sep=self.delimiter)
g = df.groupby("Other_Location")
df1 = g.apply(lambda x: x[x["PRN"] == "Free"].count())
locations = ["O12-03-01", "O12-03-02"]
cp = df1["PRN"]
cp = cp[locations].tolist()
data = [locations, cp]
new_df = pd.DataFrame({"Other_Location": data[0], "Free": data[1]})
return new_df, df
test_myscript.py
class TestPivot(unittest.TestCase):
def setUp(self):
args = parse_args(["-f", "test1", "-d", ","])
self.pivot = Pivot(args)
self.pivot.path = "Pivot/path"
#mock.patch("myscript.cp[locations].tolist()", return_value=None)
#mock.patch("myscript.pd.read_csv", return_value=df)
def test_prepare_dfs_1(self, mock_read_csv, mock_cp):
new_df, df = self.pivot.prepare_dfs()
# Here I get a bit lost
For example here I try to circumvent the following error message:
ModuleNotFoundError: No module named 'myscript.cp[locations]'; 'myscript' is not a package
I managed to mock correctly the pd.read_csv in my method, however further down in the code there are groupy, apply, tolist etc. The error message is thrown at the following line:
cp = cp[locations].tolist()
What is the best way to approach unittesting when your method involves several manipulations on a dataframe? Is refactoring the code always advised (into smaller chunks)? In this case, how can I mock correctly the tolist ?
I have more of a general question. I've written a couple of functions that transform data successively:
def func1(df):
pass
...
def main():
df = pd.read_csv()
df1 = func1(df)
df2 = func2(df1)
df3 = func3(df2)
df4 = func4(df3)
df4.to_csv()
if __name__ == "__main__":
main()
Is there a better way of organizing the logic of my script?
Should I use classes for cases like this when everything is tied to one dataset?
It depends of your usecase. For what I understand, I would use dictionary of your functions that process a df.
For instance:
function_returning_a_df = { "f1": func1, "f2": func2, "f3": func3}
df = pd.read_csv(csv)
if this df needs 3 functions to be applied
df_processing = ["f1","f2","f3"] #function will be applied in this order
# If you need to keep df at every step you can make a list
dfs_processed = []
for func in df_processing:
dfs_processed.append(df) # if you want to save all steps
df = function_returning_a_df[func](df)
My project is composed by several lists - that I put all together in a dataframe with pandas, to excel.
But one of my list contains sublists, and I don't know how to deal with that.
my_dataframe = pd.DataFrame({
"V1": list1,
"V2": list2,
"V3": list3
})
my_dataframe.to_excel("test.xlsx", sheet_name="Sheet 1", index=False, encoding='utf8')
Let's says that:
list1=[1,2,3]
list2=['a','b','c']
list3=['d',['a','b','c'],'e']
I would like to end in my excel file file with:
I have really no idea how to proceed - if this is even possible?
Any help is welcomed :) Thanks!
Try this before calling to_excel :
my_dataframe = (my_dataframe["V3"].apply(pd.Series)
.merge(my_dataframe.drop("V3", axis = 1), right_index = True, left_index = True)
.melt(id_vars = ['V1', 'V2'], value_name = "V3")
.drop("variable", axis = 1)
.dropna()
.sort_values("V1"))
credits to Bartosz
Hope this helps.