What am I trying to do?
I'm trying to generate a total_score column based on two other columns.
The level_count column ranges from 1 to 3.
The range_bins column ranges from 'low' to 'very_high'.
To sum the two, I created a temporary range_score column with values from 1 to 4. Is there a better way than creating this temp column? (A sketch follows the desired output below.)
How do I normalise the scores so that level_count and range_bins carry the same weight, even though their ranges differ (3 values vs 4)?
Data
data = { 'level_count': {0: 2, 1: 2, 2: 3, 3: 1},
'range_bins': {0: 'high', 1: 'medium', 2: 'low', 3: 'very_high'}}
df = pd.DataFrame(data)
df["range_score"] = df.range_bins.replace({"low": 1, "medium": 2,"high":3,"very_high":4})
df["total_score"] = df[["level_count","range_score"]].sum(axis=1)
Drop the temp column (drop returns a copy) and show the output:
df.drop(columns="range_score")
level_count range_bins total_score
0 2 high 5
1 2 medium 4
2 3 low 4
3 1 very_high 5
Desired output
Rows 2 and 3 have equal importance and the total_score should reflect this. I may also need to add other similar columns with maybe only two categories in a similar way.
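On the temp-column question: the intermediate range_score column can be avoided by mapping the categories and adding in one step. A minimal sketch, reusing the mapping dict from the code above:

# map the categories to scores and sum directly, no temporary column needed
mapping = {"low": 1, "medium": 2, "high": 3, "very_high": 4}
df["total_score"] = df["level_count"] + df["range_bins"].map(mapping)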
To achieve equal weighting of rows 2 and 3, you could create a function that takes a row of the dataframe and encodes the full scoring logic, then apply it with df.apply. Since I cannot infer the complete logic from your description, I can only provide half the solution; you will have to fill in the remaining cases.
def total_score(row):
    # note: use 'and', not '&', for scalar comparisons inside a row-wise function
    if row.level_count == 1 and row.range_bins == 'very_high':
        return 4
    elif row.level_count == 2 and row.range_bins == 'very_high':
        return ...  # fill in
    # ... add the remaining level_count / range_bins combinations here

df['total_score'] = df.apply(total_score, axis=1)   # axis=1 passes rows, not columns
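On the normalisation question: one common approach, assuming a min-max rescale is acceptable, is to map each column onto [0, 1] using its known range before summing, so a 3-value column and a 4-value column contribute equally. A minimal sketch:

# rescale each score to [0, 1] using the stated ranges (1..3 and 1..4),
# so both columns carry the same weight in the sum
level_norm = (df.level_count - 1) / (3 - 1)
range_norm = (df.range_score - 1) / (4 - 1)
df["total_score"] = level_norm + range_norm

A two-category column added later would be rescaled the same way, e.g. (score - 1) / (2 - 1).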
Related
I have a DataFrame which looks like this :-
ID | act
1 A
1 B
1 C
1 D
2 A
2 B
3 A
3 C
I am trying to get the IDs where an activity act1 is followed by another act2, for example, A is followed by B. In that case, I want to get [1,2] as the ids. How do I go about this in a vectorized manner?
Edit: For the sample df defined above, the output should be a list/Series of all the IDs where A is followed immediately by B:
IDs
1
2
Here is a simple, vectorised way to do it!
df.loc[(df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1)), 'ID']
Output:
0 1
4 2
Name: ID, dtype: int64
Another way of writing this, possibly clearer:
conditions = (df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1))
df.loc[conditions, 'ID']
NumPy (which pandas uses under the hood) makes it easy to combine one or many boolean conditions; the resulting boolean vector is then used to filter your dataframe.
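If you need the result as a plain list of IDs rather than a Series, a small follow-up to the snippet above:

ids = df.loc[conditions, 'ID'].tolist()   # [1, 2]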
Here is one approach: groupby with sort=False, since we need B to immediately follow A in the current row order. Then aggregate each group's activities into a single string with str.cat, check whether 'A,B' is present, keep the matching IDs, and return them as a list:
(df
 .groupby('ID', sort=False)
 .act
 .agg(lambda x: x.str.cat(sep=','))   # e.g. 'A,B,C,D' for ID 1
 .str.contains('A,B')
 .loc[lambda x: x]                    # keep only IDs where the pattern occurs
 .index.tolist()
)
[1, 2]
Another approach uses the shift function and filtering (the extra ID check prevents a match that spans two different IDs):
df['prev_act'] = df.act.shift()
df.loc[(df.act == 'B') & (df.prev_act == 'A') & (df.ID == df.ID.shift()), 'ID']
I have a MultiIndex DataFrame, and I want to change its first level to a given list of index values. Suppose the first level is initially [2, 4, 1]; I want to change it to [1, 2, 100]. What is the simplest way to achieve this? My current approach would involve reset_index, changing the column values, and set_index again.
One way is to create a dictionary of the old values to the replacement values, then iterate through the index as tuples replacing the values, and assign the new index back to the DataFrame:
new_vals = {2: 1, 4: 2, 1: 100}
df.index = pd.MultiIndex.from_tuples([(new_vals[tup[0]], tup[1]) for tup in df.index.to_list()])
(This assumes your MultiIndex has only 2 levels, for every additional level that you want to keep you'd need to add tup[2] etc into the list comprehension.)
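A variant of the same idea that keeps any number of trailing levels, using tuple unpacking (a sketch; new_vals as defined above):

# replace the first element of each index tuple, keep the rest unchanged
df.index = pd.MultiIndex.from_tuples(
    [(new_vals[tup[0]], *tup[1:]) for tup in df.index]
)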
Use df.reindex()
data.reindex([1,2,100])
Use rename:
Setup
import pandas as pd
index = pd.MultiIndex.from_tuples([(e, i) for i, e in enumerate([2, 4, 1])])
df = pd.DataFrame([1, 2, 3], index=index)
print(df)
Output (of setup)
0
2 0 1
4 1 2
1 2 3
Code
new_index = [1, 2, 100]
# use the order of appearance (2, 4, 1) rather than the sorted .levels
new_vals = dict(zip(df.index.get_level_values(0).unique(), new_index))
print(df.rename(new_vals, level=0))
Output
0
1 0 1
2 1 2
100 2 3
I have a Pandas data frame which has some duplicate values, not rows. I want to use groupby.apply to remove the duplication. An example is as follows.
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
A B C
0 a 1 1
1 a 1 2
2 b 1 1
# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df
df = df.groupby('A').apply(get_uniq_t)
Then I get the following ValueError. The issue seems to be related to creating the new column D: if I create column D outside the function, the code runs fine. Can someone explain what causes the error?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify the original group. Another problem is that the function should return a single row, not a DataFrame.
Change your function to:
def get_uniq_t(df):
    iMax = (df.C * 10 + df.B).idxmax()   # index of the row with the highest score
    return df.loc[iMax]
Then its application returns:
A B C
A
a a 1 2
b b 1 1
Edit following the comment
In my opinion you should not modify the original group, as that would indirectly modify the original DataFrame. Pandas at least displays a warning about it, and it is considered bad practice; search the web for SettingWithCopyWarning for a more extensive description.
My get_uniq_t function does not modify the original group. It only returns one row from the current group, selected as the row with the greatest value of df.C * 10 + df.B. So when you apply this function, the result is a new DataFrame whose consecutive rows are the results of this function for consecutive groups.
You can perform an operation equivalent to modification when you create some new content, e.g. as the result of a groupby instruction, and then save it under the same variable that so far held the source DataFrame.
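If you prefer to keep the column-building style of the original function, a safe variant is to work on a copy of each group, so the source DataFrame is never touched. A sketch (get_uniq_row is just an illustrative name):

def get_uniq_row(g):
    g = g.copy()                       # copy, so the original group is not modified
    g['D'] = g.C * 10 + g.B
    return g.loc[g['D'].idxmax()].drop('D')

df.groupby('A').apply(get_uniq_row)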
I am new to pandas. I know how to use drop_duplicates to take the last observed row in a dataframe. Is there any way to take only the second-to-last observed row instead, or another way of doing it?
For example:
I would like to go from
df = pd.DataFrame(data={'A':[1,1,1,2,2,2],'B':[1,2,3,4,5,6]}) to
df1 = pd.DataFrame(data={'A':[1,2],'B':[2,5]})
The idea is to group the data by the duplicated column, then check the length of each group: if it has two or more rows, take the second element; if it has only one row (i.e. the value is not duplicated), take index 0, the only element in the group.
df.groupby(df['A']).apply(lambda x: x.iloc[1] if len(x) >= 2 else x.iloc[0])
The first answer I think was on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations, and an 'A' group with one observation, for the sake of completeness.
import pandas as pd
df = pd.DataFrame(data={'A':[1,1,1,2,2,2, 3, 3, 4],'B':[1,2,3,4,5,6, 7, 8, 9]})
def user_apply_func(x):
    if len(x) == 2:
        return x.iloc[0]
    if len(x) > 2:
        return x.iloc[-2]
    return   # one-row groups return None, which shows up as NaN below
df.groupby('A').apply(user_apply_func)
Out[7]:
A B
A
1 1 2
2 2 5
3 3 7
4 NaN NaN
For your reference, apply automatically passes each group (as a DataFrame) to the function as its first argument.
Also, since you are always reducing each group to a single observation, you could use the agg (aggregate) method instead. apply is more flexible in the length of the sequences it can return, whereas agg must reduce each group to a single value.
df.groupby('A').agg(user_apply_func)
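For this particular task there may also be a shorter built-in route: GroupBy.nth accepts negative positions, so nth(-2) takes the second-to-last row of each group. Note that, unlike the apply version above, it simply drops groups with fewer than two rows instead of returning a NaN row:

df.groupby('A').nth(-2)   # second-to-last row per group; one-row groups are dropped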
I am trying to repair a csv file.
Some data rows need to be removed based on a couple conditions.
Say you have the following dataframe:
A    B  C
000  0  0
000  1  0
001  0  1
011  1  0
001  1  1
If two or more rows have column A in common, i want to keep the row that has column B set to 1.
The resulting dataframe should look like this:
A    B  C
000  1  0
011  1  0
001  1  1
I've experimented with merges and drop_duplicates but cannot seem to get the result I need. It is not certain that the row with column B = 1 will be after a row with B = 0. The take_last argument of drop_duplicates seemed attractive but I don't think it applies here.
Any advice will be greatly appreciated. Thank you.
Not straightforward, but this should work:
DF = pd.DataFrame({'A': [0, 0, 1, 11, 1], 'B': [0, 1, 0, 1, 1], 'C': [0, 0, 1, 0, 1]})
DF.loc[DF.groupby('A').apply(lambda df: df[df.B == 1].index[0] if len(df) > 1 else df.index[0])]
A B C
1 0 1 0
4 1 1 1
3 11 1 0
Notes:
groupby divides DF into groups of rows with unique A values i.e. groups with A = 0 (2 rows), A=1 (2 rows) and A=11 (1 row)
apply then calls the function on each group and combines the results
In the function (lambda) I'm looking for the index of the row with B == 1 if there is more than one row in the group; otherwise I use the index of the group's only row
The result of apply is a list of index values representing, for each A, the row with B == 1 if the group has more than one row, else the group's single row
Those index values are then used to access the corresponding rows with the loc indexer (the old ix indexer is deprecated)
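For the simple case in the question (keep the B == 1 row per A), a shorter vectorised alternative, assuming B only takes the values 0 and 1: sort so the preferred rows come last within each A, then drop duplicates.

# rows with B == 1 sort last within each A, so keep='last' picks them
DF.sort_values('B').drop_duplicates('A', keep='last').sort_index()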
I was able to weave my way around pandas to get the result I want.
It's not pretty, but it gets the job done:
# assumes df has CARD_NO and STATUS columns and grouped = df.groupby('CARD_NO')
res = pd.DataFrame(columns=('CARD_NO', 'STATUS'))
for i in grouped.groups:
    if len(grouped.groups[i]) > 1:
        card_no = i
        print(card_no)
        for a in grouped.groups[card_no]:
            status = df.iloc[a]['STATUS']
            print('iloc:' + str(a) + '\t' + 'status:' + str(status))
            if status == 1:
                print('yes')
                row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
                res = pd.concat([res, row], ignore_index=True)   # DataFrame.append was removed in pandas 2.0
            else:
                print('no')
    else:
        # only 1 record found; could be a status of 0 or 1 -- add to dataframe
        print('UNIQUE RECORD')
        card_no = i
        print(card_no)
        status = df.iloc[grouped.groups[card_no][0]]['STATUS']
        print(grouped.groups[card_no][0])
        print('iloc:' + str(grouped.groups[card_no][0]) + '\t' + 'status:' + str(status))
        row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
        res = pd.concat([res, row], ignore_index=True)
print(res)