Remove a string from certain column values and then operate on them in Pandas - pandas

I have a dataframe with a column named months (as below), but it contains some values given as "x years". I want to remove the word "years" and multiply those values by 12 so that the whole column is consistent.
index months
1 5
2 7
3 3 years
3 9
4 10 years
I tried with
if df['months'].str.contains("years")==True:
df['df'].str.rstrip('years').astype(float) * 12
But it's not working

You can build a multiplier array that is 12 for the rows containing "years" and 1 otherwise, then multiply the cleaned values by it:
import numpy as np
multiplier = np.where(df['months'].str.contains('years'), 12, 1)
df['months'] = df['months'].str.replace('years', '').astype(int) * multiplier
You get
index months
0 1 5
1 2 7
2 3 36
3 3 9
4 4 120

Select the rows that contain "years" with a boolean mask, then use replace():
mask = df['months'].str.contains("years")
df.loc[mask, 'months'] = df['months'].str.replace("years", "").astype(float) * 12
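
For reference, here is a self-contained sketch of the same idea. It assumes the column holds strings such as "5" and "3 years"; the sample frame and the helper name to_months are illustrative and not taken from the question. It extracts the number with a regular expression instead of stripping the word, so stray whitespace around the number does not matter:

```python
import pandas as pd

def to_months(col: pd.Series) -> pd.Series:
    """Convert strings like '5' or '3 years' to a number of months (hypothetical helper)."""
    text = col.astype(str)
    # Pull out the first run of digits; expand=False returns a Series.
    number = text.str.extract(r"(\d+)", expand=False).astype(int)
    # Rows mentioning "year" are multiplied by 12, all others by 1.
    factor = text.str.contains("year").map({True: 12, False: 1})
    return number * factor

# Sample data assumed from the question's table.
df = pd.DataFrame({"months": ["5", "7", "3 years", "9", "10 years"]})
df["months"] = to_months(df["months"])
print(df["months"].tolist())  # [5, 7, 36, 9, 120]
```

A value with no digits at all would make the integer conversion fail, which is probably the desired behaviour for unexpected input.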

Related

Print Pandas Unique Rows by Column Condition

I am trying to print the rows whereby a data condition is met in a pandas DF based on the unique values in the DF. For example, I have data that looks like this:
DF:
site temp month day
A 15 7 18
A 11 6 12
A 22 9 3
B 9 4 23
B 3 2 11
B -1 5 18
I need the result to print the rows where the max in the 'temp' column occurs such as this for the final result:
A 22
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site temp month day
A 22 9 3
B 9 4 23
thank you!
A possible solution:
df.groupby('site', as_index=False)['temp'].max()
Output:
site temp
0 A 22
1 B 9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
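Alternatively, take the maximum of every column per site. Note that each column's maximum is computed independently, so a row of the result may mix values from different original rows:
df.groupby('site').max()
Output: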
temp month day
site
A 22 9 18
B 9 5 23
Another option is sort_values + drop_duplicates, which keeps the whole row holding each site's highest temp:
df = df.sort_values('temp',ascending=False).drop_duplicates('site')
Output:
site temp month day
2 A 22 9 3
3 B 9 4 23
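
A further option, not among the answers above, is to look up the index label of each site's maximum with idxmax and select those rows. This is only a sketch; the frame below is rebuilt from the sample data in the question, and the variable name hottest is illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "temp": [15, 11, 22, 9, 3, -1],
    "month": [7, 6, 9, 4, 2, 5],
    "day": [18, 12, 3, 23, 11, 18],
})

# idxmax gives, for each site, the index label of the row with the highest temp.
hottest = df.loc[df.groupby("site")["temp"].idxmax()]
print(hottest)
```

Unlike groupby(...).max(), this returns complete original rows, so month and day belong to the same observation as the maximum temp. If a site has tied maxima, only the first occurrence is returned.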

Create a new column pandas based on another column condition [duplicate]

I want to create a new column, say "group id", built as follows:
Compare the nth row with the (n-1)th row.
If both records are equal, copy the previous "group id".
If the records are not equal, add 1 to the previous "group id".
I would like the result to look like this:
The expected result:
Column A Column B
6-Aug-10 0
30-Aug-11 1
31-Aug-11 2
31-Aug-11 2
6-Sep-12 3
30-Aug-13 4
I am looking for a result similar to this Excel formula:
=IF(T3=T2, U2, U2+1)
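You can use ngroup. Pass sort=False so that groups are numbered in order of first appearance rather than in sorted key order:
df['Group ID'] = df.groupby('DOB', sort=False).ngroup()
# according to your example
df['Group ID'] = df.groupby('Column A', sort=False).ngroup()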
Use factorize if repeated values should share one ID even when they are not consecutive. To count each consecutive run separately, compare the column with its shifted values, then apply Series.cumsum and subtract 1:
print (df)
Column A Column B
0 6-Aug-10 0
1 30-Aug-11 1
2 31-Aug-11 2
3 31-Aug-11 2
4 6-Sep-12 3
5 30-Aug-13 4
6 30-Aug-11 5 <- added data to show the difference
7 31-Aug-11 6 <- added data to show the difference
df['Group ID1'] = pd.factorize(df['Column A'])[0]
df['Group ID2'] = df['Column A'].ne(df['Column A'].shift()).cumsum().sub(1)
print (df)
Column A Column B Group ID1 Group ID2
0 6-Aug-10 0 0 0
1 30-Aug-11 1 1 1
2 31-Aug-11 2 2 2
3 31-Aug-11 2 2 2
4 6-Sep-12 3 3 3
5 30-Aug-13 4 4 4
6 30-Aug-11 5 1 5
7 31-Aug-11 6 2 6

Calculate new column using relative row references

I would like to turn a data frame like this:
DF
Nrow a
1 5
2 6
3 7
4 11
5 16
Into this:
DF
Nrow a b
1 5 NA
2 6 NA
3 7 2
4 11 5
5 16 9
Column 'b' is the value in column 'a' minus the value of column 'a' two rows earlier. For example, b4 = a4 - a2.
I have had no success so far with indexing or loops. Is there a tool or command for this or some obvious notation that I am missing? I need to do this continuously without splitting into groups.
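
The question does not say which tool is in use. Assuming the data lives in a pandas DataFrame (an assumption, as is the frame built below from the sample values), a difference against the row two positions earlier can be sketched with Series.diff:

```python
import pandas as pd

# Sample data from the question; Nrow is used as the index here.
df = pd.DataFrame({"a": [5, 6, 7, 11, 16]}, index=[1, 2, 3, 4, 5])

# diff(2) computes a[i] - a[i-2]; the first two rows have no partner and become NaN.
df["b"] = df["a"].diff(2)
print(df)
```

The equivalent explicit form is df["a"] - df["a"].shift(2), which is handy when the two operands come from different columns.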

Remove all rows in which every value is the same [duplicate]

I want to drop all rows whose values are all the same. I tried drop_duplicates(subset=['other_things','Dist_1','Dist_2']) but could not get it to work.
Input
id other_things Dist_1 Dist_2
1 a a a
2 a b a
3 10 10 10
4 a b a
5 8 12 48
6 8 12 48
Expected
id other_things Dist_1 Dist_2
2 a b a
4 a b a
5 8 12 48
6 8 12 48
Try
df = df.drop_duplicates()
It looks like the 'id' column could be causing problems.
I would recommend using the 'subset' parameter of drop_duplicates, as described in the drop_duplicates documentation.
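
Note that drop_duplicates compares rows with each other, whereas the expected output removes rows whose own three values are identical. A minimal sketch of that row-wise check follows; it assumes the three columns hold strings (the sample frame is rebuilt from the question, and the names cols and result are illustrative):

```python
import pandas as pd

# Sample data from the question, with the value columns stored as strings (assumption).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "other_things": ["a", "a", "10", "a", "8", "8"],
    "Dist_1": ["a", "b", "10", "b", "12", "12"],
    "Dist_2": ["a", "a", "10", "a", "48", "48"],
})
cols = ["other_things", "Dist_1", "Dist_2"]

# nunique(axis=1) counts the distinct values within each row;
# a count of 1 means all three columns hold the same value.
result = df[df[cols].nunique(axis=1) > 1]
print(result)
```

Rows 1 and 3 are dropped because each holds a single repeated value, while rows 2, 4, 5 and 6 are kept even though some of them duplicate one another.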

Create new ID based on cumulative sum in excel vba

I need to create a new transport ID based on the cumulative sum of the volume being transported. Let's say that originally everything was transported in truck A with a capacity of 25. Now I want to assign these items to shipments with truck B (capacity 15).
The only real constraint is that the amount shipped cannot exceed the capacity.
I can't post a picture because of the restrictions, but the overall setup would be like this:
Old Trans # Volume New Trans # Cumulative Volume for Trans
1 1
1 9
1 3
1 7
1 4
2 9
2 10
3 8
3 5
3 9
4 4
4 6
4 8
5 9
5 1
5 5
5 8
6 3
6 4
6 3
6 4
6 4
6 7
7 7
7 10
7 4
8 10
8 6
8 7
9 4
9 9
9 6
10 7
10 4
10 1
10 1
10 5
10 2
11 9
11 3
11 9
12 8
12 5
12 9
13 9
The expected output would be that the first three entries get a new shipment ID of 1, the next two entries get a new shipment ID of 2, and so on. I've tried everything that I know (excluding VBA): INDEX/LOOKUP/IF functions. My VBA skills are very limited, though. Any tips? Thanks!
I think I see what you're trying to do here. You can do it with an IF formula, plus a new column to keep track of the running total.
In columns C and D, insert these formulas in row 3 and copy them down (change 15 to whatever the new capacity should be; <= lets a shipment reach the capacity exactly):
Column C: =IF(B3+C2<=15,B3+C2,B3)
Column D: =IF(B3+C2<=15,D2,D2+1)
And for the cells C2 and D2:
C2: = B2
D2: = A2
Is this what you're looking to do?
A simple formula could be written that 'floats' the range totals for each successive load ID.
In the following, I've typed 25 and 15 in D1:E1 and applied a custom number format of I\D 0. This way the column is labeled, yet the cell can still be referenced as a true numeric load limit. You can hard-code the limit into the formula instead by overwriting D$1, but then you lose the one-size-fits-all formula that can be filled right for alternate load limits, as in my example.
The formula in D2 is,
=IF(ROW()=2, 1, (SUM(INDEX($B:$B, MATCH(D1, D1:D$1, 0)):$B2)>D$1)+ D1)
Fill right to E2 then down as necessary.
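
The running-total logic can also be checked outside Excel. The following Python sketch is only an illustration of the greedy rule described in the question (fill a shipment until the next item would push it over capacity, then start a new one); the function name assign_shipments is hypothetical, and the input is the first seven volumes from the table:

```python
def assign_shipments(volumes, capacity):
    """Greedily assign consecutive items to shipments without exceeding capacity."""
    ids, totals = [], []
    current_id, running = 1, 0
    for volume in volumes:
        if running and running + volume > capacity:
            # The item does not fit in the current shipment: start a new one.
            current_id += 1
            running = 0
        running += volume
        ids.append(current_id)
        totals.append(running)
    return ids, totals

ids, totals = assign_shipments([1, 9, 3, 7, 4, 9, 10], capacity=15)
print(ids)     # [1, 1, 1, 2, 2, 3, 4]
print(totals)  # [1, 10, 13, 7, 11, 9, 10]
```

The first three items share shipment 1 and the next two share shipment 2, matching the expected output described in the question. A single item larger than the capacity would still get its own shipment here, so such inputs need separate handling.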