Changing values in pandas dataframe does not work - pandas

I’m having a problem changing values in a dataframe. I’d also like advice on a problem I need to solve and the proper way to use pandas to solve it. I’d appreciate help on both.
I have a file containing information about how well audio files match speakers. The file looks something like this:
wave_path spk_name spk_example# score mark comments isUsed
190 122_65_02.04.51.800.wav idoD idoD 88 NaN NaN False
191 121_110_20.17.27.400.wav idoD idoD 87 NaN NaN False
192 121_111_00.34.57.300.wav idoD idoD 87 NaN NaN False
193 103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
194 131_101_02.08.06.500.wav idoD idoD_0 96 HIT VP False
What I need to do is some kind of sophisticated counting. I need to group the results by speaker and perform a calculation for each speaker. I then proceed with the speaker whose calculation comes out best, but before proceeding I need to mark all the files I used for that calculation as used, i.e. change the isUsed value to True in every row in which they appear (files can appear more than once). Then I make another iteration: calculate for each speaker, mark the used files, and so on until no speakers are left to be calculated.
I thought a lot about how to implement this process using pandas. (It is quite easy to implement in plain Python, but my guess is that all the looping and data structuring would slow the process down significantly; I’m also using this process to learn pandas’ abilities more deeply.)
I came up with the following solution. As preparation steps, I’ll group by speaker name and set the file name as index via the set_index method. I will then iterate over the groupby object and apply the calculation function, which will return the selected speaker and the files to be marked as used.
Then I’ll iterate over those files and mark them as used (this should be fast and simple since I set them as the index beforehand), and so on until I finish calculating.
First, I’m not sure about this solution, so feel free to tell me your thoughts on it.
Now, I’ve tried implementing this, and got into trouble:
First I indexed by file name, no problem here:
In [53]:
marked_results['isUsed'] = False
ind_res = marked_results.set_index('wave_path')
ind_res.head()
Out[53]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
131_101_02.08.06.500.wav idoD idoD 99 HIT VP False
144_35_22.46.38.700.wav idoD idoD 96 HIT VP False
41_09_17.10.11.700.wav idoD idoD 93 HIT TEST False
122_188_03.19.20.400.wav idoD idoD 93 NaN NaN False
Then I chose a file and checked that I get the entries relevant to that file:
In [54]:
example_file = ind_res.index[0];
ind_res.ix[example_file]
Out[54]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_1 97 HIT VP False
103_31_18.59.12.800.wav idoD idoD_2 95 HIT VP False
No problems here either. Then I tried to change the isUsed value for that file to True, and that’s where I ran into trouble:
In [56]:
ind_res.ix[example_file]['isUsed'] = True
ind_res.ix[example_file].isUsed = True
ind_res.ix[example_file]
Out[56]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_1 97 HIT VP False
103_31_18.59.12.800.wav idoD idoD_2 95 HIT VP False
So, you see the problem: nothing has changed. What am I doing wrong? Should the problem described above be solved using pandas at all?
And also:
1. How can I access a specific group in a groupby object? Because I thought that, instead of setting the files as the index, I could group by file name and then use that groupby object to apply a changing function to all of a file's occurrences. But I didn't find a way to access a specific group, and passing the group name as a parameter, calling apply on all the groups, and then acting on only one of them didn't seem "right" to me.
I hope this is not too long... :)

Indexing pandas objects can return two fundamentally different objects: a view or a copy.
If mask is a basic slice, then df.ix[mask] returns a view of df. Views share the same underlying data as the original object (df), so modifying the view also modifies the original object.
If mask is something more complicated, such as an arbitrary sequence of indices, then df.ix[mask] returns a copy of some rows in df. Modifying the copy has no effect on the original.
In your case, since the rows which share the same wave_path occur at arbitrary locations, ind_res.ix[example_file] returns a copy. So
ind_res.ix[example_file]['isUsed'] = True
has no effect on ind_res.
Instead, you could use
ind_res.ix[example_file, 'isUsed'] = True
to modify ind_res. However, see below for a groupby suggestion which I think might be closer to what you really want.
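(As a side note, and not part of the original answer: .ix has since been deprecated and removed from pandas; on current versions the same single-step assignment is written with .loc. A minimal sketch, assuming the ind_res frame from above:)
# equivalent on modern pandas; .loc with a (row label, column label) pair assigns in place
ind_res.loc[example_file, 'isUsed'] = True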
Jeff has already provided a link to the Pandas docs which state that
The rules about when a view on the data is returned are entirely
dependent on NumPy.
Here are the (complicated) rules which describe when a view or copy is returned. Basically, however, the rule is if the index is requesting a regularly spaced slice of the underlying array then a view is returned, otherwise a copy (out of necessity) is returned.
Here is a simple example which uses basic slice. A view is returned by df.ix, so modifying subdf modifies df as well:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(4,3),
                  columns=list('ABC'), index=[0,1,2,3])
subdf = df.ix[0]
print(subdf.values)
# [0 1 2]
subdf.values[0] = 100
print(subdf)
# A 100
# B 1
# C 2
# Name: 0, dtype: int32
print(df) # df is modified
# A B C
# 0 100 1 2
# 1 3 4 5
# 2 6 7 8
# 3 9 10 11
Here is a simple example which uses "fancy indexing" (arbitrary rows selected). A copy is returned by df.ix. So modifying subdf does not affect df.
df = pd.DataFrame(np.arange(12).reshape(4,3),
                  columns=list('ABC'), index=[0,1,0,3])
subdf = df.ix[0]
print(subdf.values)
# [[0 1 2]
# [6 7 8]]
subdf.values[0] = 100
print(subdf)
# A B C
# 0 100 100 100
# 0 6 7 8
print(df) # df is NOT modified
# A B C
# 0 0 1 2
# 1 3 4 5
# 0 6 7 8
# 3 9 10 11
Notice the only difference between the two examples is that in the first, where a view is returned, the index was [0,1,2,3], whereas in the second, where a copy is returned, the index was [0,1,0,3].
Since we are selecting rows where the index is 0, in the first example we can do that with a basic slice. In the second example, the rows where the index equals 0 could appear at arbitrary locations, so a copy has to be returned.
Despite having ranted on about the subtlety of Pandas/NumPy slicing, I really don't think that
ind_res.ix[example_file, 'isUsed'] = True
is what you are ultimately looking for. You probably want to do something more like
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape(4,3),
                  columns=list('ABC'))
df['A'] = df['A']%2
print(df)
# A B C
# 0 0 1 2
# 1 1 4 5
# 2 0 7 8
# 3 1 10 11
def calculation(grp):
    grp['C'] = True
    return grp
newdf = df.groupby('A').apply(calculation)
print(newdf)
which yields
A B C
0 0 1 True
1 1 4 True
2 0 7 True
3 1 10 True
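To tie this back to the question's own frame, here is a minimal sketch, assuming marked_results still has wave_path as a regular column; files_to_mark is a hypothetical list produced by the winning speaker's calculation, and the column names are taken from the sample data:
# files can occur in several rows, so mark every matching row at once
files_to_mark = ['103_31_18.59.12.800.wav']              # hypothetical result
mask = marked_results['wave_path'].isin(files_to_mark)
marked_results.loc[mask, 'isUsed'] = True

# side note for question 1: a single group can be pulled out of a groupby
# object by its key with get_group, without going through apply
by_speaker = marked_results.groupby('spk_name')
ido_rows = by_speaker.get_group('idoD')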

Related

Removing the .0 from a pandas column

After a simple merge of two dataframes, the following X column becomes an object and a ".0" is added at the end for no apparent reason. I tried replacing the NaN values with an integer and then converting the whole column to an integer, hoping the .0 would be gone. The code runs but it doesn't actually change the dtype of that column. I also tried removing the .0 with rstrip, but that removes everything, and even values like 249123.0 become NaN, which doesn't make sense. I know this is a very basic issue, but I am not sure what else I could try at this point.
Input:
Age ID
22 23105.0
34 214541.0
51 0
8 62341.0
Desired output:
Age ID
22 23105
34 214541
51 0
8 62341
Any ideas would be much appreciated.
One of the ways to get rid of the trailing .0 in an object column is to use pandas.Series.replace:
import numpy as np

df['ID'] = df['ID'].replace(r'\.0$', '', regex=True).astype(np.int64)
# Output:
print(df)
Age ID
0 22 23105
1 34 214541
2 51 0
3 8 62341
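If the column may also contain missing values, another route (an alternative sketch, not part of the answer above) is to cast through float and pandas' nullable Int64 dtype, which drops the trailing .0 while still tolerating NA:
import pandas as pd

df = pd.DataFrame({'Age': [22, 34, 51, 8],
                   'ID': ['23105.0', '214541.0', '0', '62341.0']})

# float first (parses the '.0' strings), then the nullable integer dtype
df['ID'] = df['ID'].astype(float).astype('Int64')
print(df)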

Compare Values of 2 dataframes conditionally

I have the following problem. I have a dataframe which looks like this.
Dataframe1
start end
0 0 2
1 3 7
2 8 9
and another dataframe which looks like this.
Dataframe2
data
1 ...
4 ...
8 ...
11 ...
What I am trying to achieve is the following:
For each row in Dataframe1, I want to check whether there is any index value in Dataframe2 that lies in range(start, end) of Dataframe1.
If the condition is True, I want to store the outcome in a new column ["condition"].
Since I may have to deal with large amounts of data, I tried using numpy.select, like this:
range_start = df1.start
range_end = df1.end
condition = [
    df2.index.to_series().between(range_start, range_end)
]
choice = ["True"]
df1["condition"] = np.select(condition, choice, default=0)
This gives me an error:
ValueError: Can only compare identically-labeled Series objects
I also tried a list comprehension; that didn't work either. Everything I tried fails because I am dealing with Series (range_start, range_end). There has to be a way to make this work, I think.
I already searched Stack Overflow for this particular problem, but I wasn't able to find a solution. It could be that I'm just too inexperienced with this type of problem to search for the right solution.
So maybe you can help me out here.
Thank you!
expected output:
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
Use DataFrame.drop_duplicates to remove duplicates by both columns and by index, create all combinations with a cross join via DataFrame.merge, and finally test for at least one match with GroupBy.any:
df3 = (df1.drop_duplicates(['start','end'])
          .merge(df2.index.drop_duplicates().to_frame(), how='cross'))
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df1.join(df3.groupby(['start','end'])['condition'].any(), on=['start','end'])
print (df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
If all pairs in df1 are unique, it is possible to use:
df3 = (df1.merge(df2.index.to_frame(), how='cross'))
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df3.groupby(['start','end'], as_index=False)['condition'].any()
print (df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
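If the frames fit in memory, the same check can also be done without the cross join by broadcasting the index of Dataframe2 against the start/end columns. A sketch only, using small stand-ins for the example frames above:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'start': [0, 3, 8], 'end': [2, 7, 9]})
df2 = pd.DataFrame({'data': ['a', 'b', 'c', 'd']}, index=[1, 4, 8, 11])

idx = df2.index.to_numpy()
# shape (len(df1), len(df2)): True where an index value falls inside [start, end]
hits = (idx >= df1['start'].to_numpy()[:, None]) & (idx <= df1['end'].to_numpy()[:, None])
df1['condition'] = hits.any(axis=1)
print(df1)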

Search values in a Pandas DataFrame with values from another DataFrame

I have 2 dataframes.
df_dora
    content         feature              id
1   cyber hygien    risk management       1
2   cyber risk      risk management       2
..  ...             ...                  ..
59  intellig share  information sharing  63
60  inform share    information sharing  64
df_corpus
        content                                           id                                meta.name          meta._split_id
0       market grow cyber attack...                       56a2a2e28954537131a4aa734f49e361  14_Group_AG_2021   0
1       sec form file index                               7aedfd4df02687d3dff9897c925da508  14_Group_AG_2021   1
...     ...                                               ...                               ...                ...
213769  cyber secur alert parent compani fina...          ab10325601597f203f3f0af7aa647112  17_La_Banque_2021  8581
213770  intellig share statement parent compani fina...   6af5687ac31849d19d2048e0b2ca472d  17_La_Banque_2021  8582
I am trying to extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus['meta.name'].
I tried to use isin:
df = df_corpus[df_corpus.content.isin(df_dora.content)]
len(df)
This returns only 17 rows:
        content      id                                meta.name                  meta._split_id
41474   incid        a4c478e0fad1b9775c05e01d871b3aaf  3_Agricole_2021            10185
68690   oper risk    2e5139d82c242c89523110cc1110647a  10_Banking_Group_PLC_2021  5525
...     ...          ...                               ...                        ...
99259   risk report  a84eefb9a4772d13eb67f2d6ae5215cb  31_Building_Society_2021   4820
105662  risk manag   e8050be841fedb6dd10599e8b4892a9f  43_Bank_SA_2021            131
df_corpus.loc[df_corpus.content.isin(df_dora.content), 'content'].tolist()
also returns only 17 rows.
If I search directly in df_corpus for two of the terms that exist in df_dora:
resiliency_term = df_corpus.loc[df_corpus['content'].str.contains("cyber risk|inform share", case=False)]
print(resiliency_term)
I get 243 rows (which matches what is in the original file).
So, given the above, my question is: how do I extract a count of each term listed in df_dora.content within df_corpus.content, grouped by df_corpus['meta.name']?
Thanks in advance for any help.
unique_vals = '|'.join(df_dora.content.unique())
df_corpus.groupby('meta.name').apply(lambda x: x.content.str.findall(unique_vals).explode().value_counts())
Output given your four lines of each:
17_La_Banque_2021 intellig share 1
Name: content, dtype: int64
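One caveat worth adding (not part of the original answer): str.findall treats the joined pattern as a regular expression, so if any term in df_dora.content contains regex metacharacters it is safer to escape each term first, for example:
import re

# escape each term so it is matched literally inside the alternation
unique_vals = '|'.join(map(re.escape, df_dora.content.unique()))
counts = (df_corpus.groupby('meta.name')
                   .apply(lambda x: x.content.str.findall(unique_vals)
                                             .explode()
                                             .value_counts()))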

Averaging dataframes with many string columns and display back all the columns

I have struggled with this even after looking at the various past answers to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display my data on the GUI together with the information in the non-numeric columns. The non-numeric columns have info such as names, rollno and stream, while the numeric columns contain students' marks for various subjects. It works well when dealing with one dataframe, but it fails when I combine two or more dataframes: it returns only the average of the numeric columns and displays that, leaving the non-numeric columns undisplayed. Below is one of the codes I've tried so far.
df = pd.concat((df3, df5))
dfs = df.groupby(df.index, level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
Working code should return averages in something like this
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})
Your question is not clear; however, I am guessing its origin based on the content. I have modified your dataframes, which were not well formed, by adding a stream called 'CENTRAL', see:
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
I have assumed you want to combine the two dataframes and find the average:
df3 = Df2.append(Df1)
df3.groupby(['STREAM','ADM','NAME'], as_index=False).mean()
Outcome:
    STREAM  ADM   NAME   KCPE   ENG   KIS   MAT
0  CENTRAL  439  PETER  349.0  67.5  61.0  26.0
1    NORTH  437  JAMES  233.0  55.0  35.0  44.5
2    SOUTH  238   MARK  168.0  20.0  62.5  48.0

Get coherent subsets from pandas series

I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting maybe in a list of DataFrames in the form of (in this case):
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks is important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and cleaner.
This is code representing my current workaround, adapted to the above example:
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 > 10 using 'loc[]'
subset = df.loc[df['col2'] > 10]
block_start = 0
block_end = None
# loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i] - subset.index[i-1] > 1:
        # ... this is the current block's end
        next_block_start = i
        # extract the according block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        # the next_block_start index is now the new block's starting index
        block_start = next_block_start
# close and add the last block
blocks.append(subset[block_start:])
Edit: I was previously referring, by mistake, to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First you check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum will generate a step function, which we set to zero (via the mask column) where this is not a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames:
grp = {}
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
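Putting the pieces together, a minimal end-to-end sketch using the example frame from the question (a list comprehension instead of the dict, same grouping idea):
import pandas as pd

df = pd.DataFrame({'col1': [3, 7, 9, 11, 13, 16, 19, 23, 27, 32],
                   'col2': [11, 15, 1, 2, 2, 16, 17, 13, 4, 3]})

mask = df['col2'] >= 10
# label each contiguous run of qualifying rows with its own group number
grp = mask.gt(mask.shift(fill_value=False)).cumsum() * mask

# one DataFrame per coherent block, original indices preserved
blocks = [g for _, g in df[mask].groupby(grp[mask])]
for block in blocks:
    print(block, end='\n\n')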