My project is composed of several lists that I put together in a pandas DataFrame and then export to Excel.
But one of my lists contains sublists, and I don't know how to deal with that.
my_dataframe = pd.DataFrame({
    "V1": list1,
    "V2": list2,
    "V3": list3
})
my_dataframe.to_excel("test.xlsx", sheet_name="Sheet 1", index=False)
Let's say that:
list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']
list3 = ['d', ['a', 'b', 'c'], 'e']
I would like to end up in my Excel file with:
I really have no idea how to proceed - is this even possible?
Any help is welcome :) Thanks!
Try this before calling to_excel:
my_dataframe = (my_dataframe["V3"].apply(pd.Series)
.merge(my_dataframe.drop("V3", axis = 1), right_index = True, left_index = True)
.melt(id_vars = ['V1', 'V2'], value_name = "V3")
.drop("variable", axis = 1)
.dropna()
.sort_values("V1"))
credits to Bartosz
Hope this helps.
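As a side note, newer pandas versions (0.25+, with `ignore_index` from 1.1) can do this directly with `DataFrame.explode`, which handles a column that mixes scalars and lists; a minimal sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "V1": [1, 2, 3],
    "V2": ['a', 'b', 'c'],
    "V3": ['d', ['a', 'b', 'c'], 'e'],  # mixed scalars and one sublist
})

# each sublist element gets its own row; scalar rows pass through unchanged,
# and V1/V2 values are repeated for the rows coming from the sublist
out = df.explode("V3", ignore_index=True)
print(out)
```

The result can be written with `out.to_excel(...)` exactly as before.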
Is there a possibility to adjust the strings according to their order, for example 1.wav, 2.wav, 3.wav etc., and the ID accordingly with ID: 1, 2, 3 etc.?
I have already tried several sorting options; do any of you have any ideas?
Thank you in advance
dataframe output
import os
from pathlib import Path

import pandas as pd

def createSampleDF(audioPath):
    data = []
    for file in Path(audioPath).glob('*.wav'):
        print(file)
        data.append([os.path.basename(file), file])
    df_dataSet = pd.DataFrame(data, columns=['audio_name', 'filePath'])
    df_dataSet['ID'] = df_dataSet.index + 1
    df_dataSet = df_dataSet[['ID', 'audio_name', 'filePath']]
    df_dataSet.sort_values(by=['audio_name'], inplace=True)
    return df_dataSet

def createSamples(myAudioPath, savePath, sampleLength, overlap=0):
    cutSamples(myAudioPath=myAudioPath, savePath=savePath, sampleLength=sampleLength)
    df_dataSet = createSampleDF(audioPath=savePath)
    return df_dataSet
You can split the string, make it an integer, and then sort on multiple columns. See pandas.DataFrame.sort_values for more info. If your file names are more complicated, you may need to design a regex to pull out the integers you want to sort on, using pandas.Series.str.extract.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'audio_name': ['1.wav', '10.wav', '96.wav', '3.wav', '55.wav']})
(df
 .assign(audio_name=lambda df_: df_.audio_name.str.split('.', expand=True).iloc[:, 0].astype('int'))
 .sort_values(by=['audio_name', 'ID']))
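If the goal is for the IDs to follow the numeric order of the file names, one option is to sort first and assign the IDs afterwards. A sketch, assuming each name starts with the number:

```python
import pandas as pd

df = pd.DataFrame({'audio_name': ['10.wav', '1.wav', '96.wav', '3.wav']})

# pull the leading integer out of each name and sort on it numerically
key = df['audio_name'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.loc[key.sort_values().index].reset_index(drop=True)

# assign IDs only after sorting, so they match the final order
df['ID'] = df.index + 1
print(df)
```

This avoids the pitfall in `createSampleDF` above, where the IDs are assigned before the sort and therefore end up out of order.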
My table (df) looks like this:

category    product_in_cat
cat1        [A, B, C]
cat2        [E, F, G]
"category" is a str column, and product_in_cat is a list column. I have a list: product = [A, B, G].
I want to get a final list of dicts (str: list) that looks like:
[{cat1: [A, B]}, {cat2: [G]}]
I think I can use code like this:

list1 = []
for index, row in df.iterrows():
    list1.append({row['category']: row['product_in_cat'] in product})

I know the row['product_in_cat'] in product part is not correct, but I am not sure how to filter the list column based on the given "product" list. Please help, and thank you in advance!
You can use np.intersect1d to find the common part of two lists:
import numpy as np
df_ = df['product_in_cat'].apply(lambda x: np.intersect1d(x, product).tolist())
l = [{k: v} for k, v in zip(df['category'], df_)]
print(l)
[{'cat1': ['A', 'B']}, {'cat2': ['G']}]
You can convert each list in the column to a set and use intersection with the external product list:
import pandas as pd

lst = ['A', 'B', 'G']
data = {'category': ['cat 1', 'cat 2'],
        'product_in_cat': [['A', 'B', 'C'], ['E', 'F', 'G']]}
df = pd.DataFrame(data)
dict(zip(df['category'], df['product_in_cat'].apply(lambda x: set(x).intersection(lst))))
# output
{'cat 1': {'A', 'B'}, 'cat 2': {'G'}}
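Note that sets do not preserve element order. If the result needs plain lists in the original order, in the exact `[{cat: [...]}]` shape the question asks for, a plain list comprehension is enough; a sketch with the same sample data:

```python
import pandas as pd

product = ['A', 'B', 'G']
df = pd.DataFrame({
    'category': ['cat1', 'cat2'],
    'product_in_cat': [['A', 'B', 'C'], ['E', 'F', 'G']],
})

wanted = set(product)  # set membership tests are O(1)
result = [{cat: [p for p in items if p in wanted]}
          for cat, items in zip(df['category'], df['product_in_cat'])]
print(result)
```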
I'm making a crypto scanner which has to scan 100+ different cryptocoins at the same time. I'm having a really hard time simplifying this code, because if I don't, I'm going to end up with more than 100 functions for something really easy. I'll post below what I'm trying to refactor.
def main():
    twm = ThreadedWebsocketManager(api_key=api_key, api_secret=api_secret)
    twm.start()
    dic = {'close': [], 'low': [], 'high': []}
    dic2 = {'close': [], 'low': [], 'high': []}

    def handle_socket_message(msg):
        candle = msg['k']
        close_price = candle['c']
        highest_price = candle['h']
        lowest_price = candle['l']
        status = candle['x']
        if status:
            dic['close'].append(close_price)
            dic['low'].append(lowest_price)
            dic['high'].append(highest_price)
            df = pd.DataFrame(dic)
            print(df)

    def handle_socket_message2(msg):
        candle = msg['k']
        close_price = candle['c']
        highest_price = candle['h']
        lowest_price = candle['l']
        status = candle['x']
        if status:
            dic2['close'].append(close_price)
            dic2['low'].append(lowest_price)
            dic2['high'].append(highest_price)
            df = pd.DataFrame(dic2)
            print(df)

    twm.start_kline_socket(callback=handle_socket_message, symbol='ETHUSDT')
    twm.start_kline_socket(callback=handle_socket_message2, symbol='BTCUSDT')
    twm.join()
As you can see, I'm getting live data from BTCUSDT and ETHUSDT. I append the close, low and high prices to a dictionary and then make a DataFrame out of each dictionary. I tried to do this with one dictionary and one handle_socket_message function, but then it merges the values of both cryptocoins into one DataFrame, which is not what I want. Does anyone know how I can refactor this piece of code? I was thinking about something with a loop, but I can't figure it out myself.
If you have any questions, ask away! Thanks in advance!
I don't know exactly what you are trying to do, but the following code might get you started (basically use a dict of dicts):
twm = ThreadedWebsocketManager(api_key=api_key, api_secret=api_secret)
twm.start()

symbols = ['ETHUSDT', 'BTCUSDT']
symbolToMessageKeys = {
    'close': 'c',
    'high': 'h',
    'low': 'l'
}
dictPerSymbol = dict()
for sym in symbols:
    d = dict()
    dictPerSymbol[sym] = d
    for key in symbolToMessageKeys.keys():
        d[key] = list()
print(dictPerSymbol)

def handle_socket_message(msg):
    candle = msg['k']
    if candle['x']:
        # route the message to the right dict via the symbol field
        d = dictPerSymbol[msg['s']]
        for (symbolKey, msgKey) in symbolToMessageKeys.items():
            d[symbolKey].append(candle[msgKey])
        df = pd.DataFrame(d)
        print(df)

for sym in symbols:
    twm.start_kline_socket(callback=handle_socket_message, symbol=sym)
twm.join()
Luckily, appending to lists appears to be thread safe. Warning: if it is not, then we have a major race condition in the code of this answer. I should also note that I have used neither ThreadedWebsocketManager nor DataFrames (so the latter may also introduce thread safety issues if it is meant to write into the provided dictionary).
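Another way to avoid the duplicated functions is a closure factory: each symbol gets its own handler bound to its own dict, so no routing on `msg['s']` is needed. This is only a sketch; it assumes the same message layout as the question, and the `twm` wiring (hypothetical here) is shown commented out:

```python
import pandas as pd

def make_handler(store):
    """Build a kline callback that appends closed candles into `store`."""
    def handle_socket_message(msg):
        candle = msg['k']
        if candle['x']:  # only record closed candles
            store['close'].append(candle['c'])
            store['low'].append(candle['l'])
            store['high'].append(candle['h'])
            print(pd.DataFrame(store))
    return handle_socket_message

symbols = ['ETHUSDT', 'BTCUSDT']
stores = {sym: {'close': [], 'low': [], 'high': []} for sym in symbols}

# hypothetical wiring, mirroring the question's code:
# for sym in symbols:
#     twm.start_kline_socket(callback=make_handler(stores[sym]), symbol=sym)
```

Scaling to 100+ coins is then just a longer `symbols` list.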
I have a dictionary like this:

dd = {888202515573088257: tweepy.error.TweepError([{'code': 144,
          'message': 'No status found with that ID.'}]),
      873697596434513921: tweepy.error.TweepError([{'code': 144,
          'message': 'No status found with that ID.'}]),
      ....,
      680055455951884288: tweepy.error.TweepError([{'code': 144,
          'message': 'No status found with that ID.'}])}
I want to make a DataFrame from this dictionary, like so:

df = pd.DataFrame(columns=['twid', 'msg'])
for k, v in dd:
    df = df.append({'twid': k, 'msg': v}, ignore_index=True)

But I get TypeError: 'numpy.int64' object is not iterable. Can someone help me solve this please?
Thanks!
By default, iterating over a dictionary will iterate over the keys. If you want to unpack the (key, value) pairs, you can use dd.items().
In this case, it looks like you don't need the values, so the below should work.
df = pd.DataFrame(columns=['twid'])
for k in dd:
    df = df.append({'twid': k}, ignore_index=True)
Alternatively, you can just pass the keys in when creating the DataFrame.
df = pd.DataFrame(list(dd.keys()), columns=['twid'])
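As a side note, `DataFrame.append` was deprecated and later removed in pandas 2.0; collecting the rows first and building the frame once also avoids repeated copying. A sketch with placeholder strings standing in for the TweepError objects:

```python
import pandas as pd

# placeholder values; the real dict maps tweet IDs to TweepError instances
dd = {888202515573088257: 'No status found with that ID.',
      873697596434513921: 'No status found with that ID.'}

rows = [{'twid': k, 'msg': v} for k, v in dd.items()]
df = pd.DataFrame(rows, columns=['twid', 'msg'])
print(df)
```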
I did this and it works:
df=pd.DataFrame(list(dd.items()), columns=['twid', 'msg'])
df
I have a problem when I try to concatenate multiple DataFrames (a data structure from the DataFrames package!) with the same columns but different numbers of rows. Here's my code:
using DataFrames

DF = DataFrame()
DF[:x1] = 1:1000
DF[:x2] = rand(1000)
DF[:time] = append!([0], cumsum(diff(DF[:x1]) .< 0)) + 1

DF1 = DF[DF[:time] .== 1, :]
DF2 = DF[DF[:time] .== round(maximum(DF[:time])), :]
DF3 = DF[DF[:time] .== round(maximum(DF[:time])/4), :]
DF4 = DF[DF[:time] .== round(maximum(DF[:time])/2), :]

DF1[:T] = "initial"
DF2[:T] = "final"
DF3[:T] = "1/4"
DF4[:T] = "1/2"

DF = [DF1; DF2; DF3; DF4]
The last line gives me the error:
MethodError: Cannot `convert` an object of type DataFrames.DataFrame to an object of type LastMain.LastMain.LastMain.DataFrames.AbstractDataFrame
This may have arisen from a call to the constructor LastMain.LastMain.LastMain.DataFrames.AbstractDataFrame(...),
since type constructors fall back to convert methods.
I don't understand this error message. Can you help me out? Thanks!
I just ran into this exact problem on Julia 0.5.0 x86_64-linux-gnu, DataFrames 0.8.5, with both hcat and vcat.
Neither clearing the workspace nor reloading DataFrames solved the problem, but restarting the REPL fixed it immediately.