My dataset has following features: "description", "word_count", "char_count", "stopwords". The feature "description" has datatype as string which contains some text. I am doing IBM tone_analysis on this feature which gives me correct output and looks like this:
[{'document_tone': {'tones': [{'score': 0.677676,
'tone_id': 'analytical',
'tone_name': 'Analytical'}]}},
{'document_tone': {'tones': [{'score': 0.620279,
'tone_id': 'analytical',
'tone_name': 'Analytical'}]}},
The code for above is given as below:
result =[]
for i in new_df['description']:
tone_analysis = ta.tone(
{'text': i},
# 'application/json'
).get_result()
result.append(tone_analysis)
I need to keep the above output in pandas data frame.
Use lambda function in Series.apply:
new_df['new'] = new_df['description'].apply(lambda i: ta.tone({'text': i}).get_result())
EDIT:
def f(i):
x = ta.tone({'text': i}).get_result()['document_tone']['tones']
return pd.Series(x[0])
new_df = new_df.join(new_df['description'].apply(f).drop('tone_id', axis=1))
print (new_df)
If need also remove description column:
new_df = new_df.join(new_df.pop('description').apply(f).drop('tone_id', axis=1))
Related
I'm currently trying to replace .append in my code since it won't be supported in the future and I have some trouble with the custom index I'm using
I read the names of every .shp files in a directory and extract some date from it
To make the link with an excel file I have, I use the name I extract from the title of the file
df = pd.DataFrame(columns = ['date','fichier'])
for i in glob.glob("*.shp"):
nom_parcelle = i.split("_")[2]
if not nom_parcelle in df.index:
# print(df.last_valid_index())
date_recolte = i.split("_")[-1]
new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle)
df = df.append(new_row, ignore_index=False)
This works exactly as I want it to be
Sadly, I can't find a way to replace it with .concat
I looked for ways to keep the index whith concat but didn't find anything that worked as I intended
Did I miss anything?
Try the approach below with pandas.concat based on your code :
import glob
import pandas as pd
df = pd.DataFrame(columns = ['date','fichier'])
dico_dfs={}
for i in glob.glob("*.shp"):
nom_parcelle = i.split("_")[2]
if not nom_parcelle in df.index:
# print(df.last_valid_index())
date_recolte = i.split("_")[-1]
new_row = pd.Series(data={'date':date_recolte.split(".")[0], 'fichier':i}, name = nom_parcelle)
dico_dfs[i]= new_row.to_frame()
df= pd.concat(dico_dfs, ignore_index=False, axis=1).T.droplevel(0)
# Output :
print(df)
date fichier
nom1 20220101 a_xx_nom1_20220101.shp
nom2 20220102 b_yy_nom2_20220102.shp
nom3 20220103 c_zz_nom3_20220103.shp
I need to find the arithmetic mean of each columns by returning res?
def ave(df, name):
df = {
'Courses':["Spark","PySpark","Python","pandas",None],
'Fee' :[20000,25000,22000,None,30000],
'Duration':['30days','40days','35days','None','50days'],
'Discount':[1000,2300,1200,2000,None]}
#CODE HERE
res = []
for i in df.columns:
res.append(col_ave(df, i))
I tried individually creating codes for the mean but Im having trouble
is there a possibility to adjust the strings according to the order for example 1.wav, 2.wav 3.wav etc. and the ID accordingly with ID: 1, 2, 3 etc?
i have already tried several sorting options do any of you have any ideas?
Thank you in advance
dataframe output
def createSampleDF(audioPath):
data = []
for file in Path(audioPath).glob('*.wav'):
print(file)
data.append([os.path.basename(file), file])
df_dataSet = pd.DataFrame(data, columns= ['audio_name', 'filePath'])
df_dataSet['ID'] = df_dataSet.index+1
df_dataSet = df_dataSet[['ID','audio_name','filePath']]
df_dataSet.sort_values(by=['audio_name'],inplace=True)
return df_dataSet
def createSamples(myAudioPath,savePath, sampleLength, overlap = 0):
cutSamples(myAudioPath=myAudioPath,savePath=savePath,sampleLength=sampleLength)
df_dataSet=createSampleDF(audioPath=savePath)
return df_dataSet
You can split the string, make it an integer, and then sort on multiple columns. See the pandas.Dataframe.sort_values for more info. If your links are more complicated you may need to design a regex to pull out the integers you want to sort on using pandas.Series.str.extract.
df = pd.DataFrame({
'ID':[1,2,3,4, 5],
'audio_name' : ['1.wav','10.wav','96.wav','3.wav','55.wav']})
(df
.assign(audio_name=lambda df_ : df_.audio_name.str.split('.', expand=True).iloc[:,0].astype('int'))
.sort_values(by=['audio_name','ID']))
For the below code:
# Reading the input
import ast,sys
input_str = sys.stdin.read()
input_list = ast.literal_eval(input_str)
# Storing the names in a variable 'name'
name = input_list[0]
# Storing the responses in a variable 'repsonse'
response = input_list[1]
import pandas as pd
df = pd.DataFrame({'Name': name,'Response': response})
This input provided is as:
['Reetesh', 'Shruti', 'Kaustubh', 'Vikas', 'Mahima', 'Akshay']
['No', 'Maybe', 'yes', 'Yes', 'maybe', 'Yes']
And the output needed is like:
[![enter image description here][1]][1]
[1]: https://i.stack.imgur.com/d3FIi.png
And when I tried something like below:
def res_map(x):
return x.map({'Yes': 1, 'yes': 1, 'No': 0, 'no': 0, 'Maybe': 0.5, 'maybe': 0.5})
df[['Response']] = df[['Response']].apply(res_map)
# Print the final DataFrame
print(df)
It worked well..
but when I tried something like this:
df['Response'] = df['Response'].apply(lambda x: x.str.lower().map({'yes':1.0, 'no':0.0, 'maybe':0.5}))
I got an error "str" object does not hap method "map"...
What I did wrong here?
When you apply on a DataFrame like df[['Response']].apply(...), the lambda receives the whole df.Response column at once so you can use Series methods like str and map.
If you instead apply on a Series like df['Response'].apply(...), the lambda receives individual strings of df.Response instead of the whole df.Response column, so the mapping needs to be changed accordingly:
mapping = {'yes':1.0, 'no':0.0, 'maybe':0.5}
df['Response'] = df['Response'].apply(lambda x: mapping[x.lower()])
# Name Response
# 0 Reetesh 0.0
# 1 Shruti 0.5
# 2 Kaustubh 1.0
# 3 Vikas 1.0
# 4 Mahima 0.5
# 5 Akshay 1.0
I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns, now this is an example in which you will end up with a list of column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']
Explanation:
df.columns returns a list of column names
[col for col in df.columns if 'spike' in col] iterates over the list df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
spike-2 spiked-in
0 1 7
1 2 8
2 3 9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
Will output just 'spike-2'. You can also use regex, as some people suggested in comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name, regular expression. Refer to: pandas.DataFrame.filter
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You also can use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)