How to delete "1" followed by trailing zeros from Data Frame row values ? - pandas

From my "Id" Column I want to remove the one and zero's from the left.
That is
1000003 becomes 3
1000005 becomes 5
1000011 becomes 11 and so on
Ignore -1, 10 and 1000000, they will be handled as special cases. but from the remaining rows I want to remove the "1" followed by zeros.

Well you can use modulus to get the end of the numbers (they will be the remainder). So just exclude the rows with ids of [-1,10,1000000] and then compute the modulus of 1000000:
print df
Id
0 -1
1 10
2 1000000
3 1000003
4 1000005
5 1000007
6 1000009
7 1000011
keep = df.Id.isin([-1,10,1000000])
df.Id[~keep] = df.Id[~keep] % 1000000
print df
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
Edit: Here is a fully vectorized string slice version as an alternative (Like Alex' method but takes advantage of pandas' vectorized string methods):
keep = df.Id.isin([-1,10,1000000])
df.Id[~keep] = df.Id[~keep].astype(str).str[1:].astype(int)
print df
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11

Here is another way you could try to do it:
def f(x):
"""convert the value to a string, then select only the characters
after the first one in the string, which is 1. For example,
100005 would be 00005 and I believe it's returning 00005.0 from
dataframe, which is why the float() is there. Then just convert
it to an int, and you'll have 5, etc.
"""
return int(float(str(x)[1:]))
# apply the function "f" to the dataframe and pass in the column 'Id'
df.apply(lambda row: f(row['Id']), axis=1)

I get that this question is satisfactory answered. But for future visitors, what I like about alex' answer is that it does not depend on there to be exactly four zeros. The accepted answer will fail if you sometimes have 10005, sometimes 1000005 and whatever.
However, to add something more to the way we think about it. If you know it's always going to be 10000, you can do
# backup all values
foo = df.id
#now, some will be negative or zero
df.id = df.id - 10000
#back in those that are negative or zero (here, first three rows)
df.if[df.if <= 0] = foo[df.id <= 0]
It gives you the same as Karl's answer, but I typically prefer these kind of methods for their readability.

Related

Changing column name and it's values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is mile per gallon,
Now I need to replace that 'MPG' column to 'litre per 100 km' and change those values to litre per 100 km' at the same time. Any help? Thanks beforehand.
-Tom
I changed the name of the column but doing both simultaneously,i could not.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (1 mpg = 1/235.15 liter/100km):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.

Filling previous value by field - Pandas apply function filling None

I am trying to fill each row in a new column (Previous time) with a value from previous row of the specific subset (when condition is met). The thing is, that if I interrupt kernel and check values, it is ok. But if it runs to the end, then all rows in new column are filled with None. If previous row doesnt exist, than I will fill it with first value.
Name First round Previous time
Runner 1 2 2
Runner 2 5 5
Runner 3 5 5
Runner 1 6 2
Runner 2 8 5
Runner 3 4 5
Runner 1 2 6
Runner 2 5 8
Runner 3 5 4
What I tried:
df.insert(column = "Previous time", value = 999)
def fce(arg):
runner= arg[0]
stat = arg[1]
if stat == 999:
# I used this to avoid filling all rows in a new column again for the same runner
first = df.loc[df['Name'] == runner,"First round"].iloc[0]
df.loc[df['Name'] == runner,"Previous time"] = df.loc[df['Name'] == runner]["First round"].shift(1, fill_value = first)
df["Previous time"] = df[['Name', "Previous time"]].apply(fce, axis=1)
Condut gruopby shift for each Name and fill the missing values with the original series.
df['Previous time'] = (df.groupby('Name')['First round']
.shift()
.fillna(df['First round'], downcast='infer'))
The problem is that your function fce returns None for every row, so the Series produced by the term df[['Name', "Previous time"]].apply(fce, axis=1) is a Series of None.
That is, instead of overriding the Dataframe with df.loc inside the function, you need to return the value to fill for this position. Unfortunately, this is impossible since then you need to know which indices you already calculated.
A better way to do it would be to use groupby. This is a more natural way, since you want to perform an action on each group. If you use apply after groupby and you to return a series, you, in fact, define a value for each row. Just remember to remove the extra index "Name" that groupby adds.
def fce(g):
first = g["First round"].iloc[0]
return g["First round"].shift(1, fill_value=first)
df["Previous time"] == df.groupby("Name").apply(fce).reset_index("Name", drop=True)
Thank you very much. Please can you answer me one more question? How does it work with group by on multiple columns if I want to return mean of all rounds based on specific runner a sleeping time before race.
Expected output:
Name First round Sleep before race Mean
Runner 1 2 8 4
Runner 2 5 7 6
Runner 3 5 8 5
Runner 1 6 8 4
Runner 2 8 7 6
Runner 3 4 9 4,5
Runner 1 2 9 2
Runner 2 5 7 6
Runner 3 5 9 4,5
This does not work for me.
def last_season(g):
aa = g["First round"].mean()
df["Mean"] = df.groupby(["Name", "Sleep before race"]).apply(g).reset_index(["Name", "Sleep before race"], drop=True)

find closest match within a vector to fill missing values using dplyr

A dummy dataset is :
data <- data.frame(
group = c(1,1,1,1,1,2),
dates = as.Date(c("2005-01-01", "2006-05-01", "2007-05-01","2004-08-01",
"2005-03-01","2010-02-01")),
value = c(10,20,NA,40,NA,5)
)
For each group, the missing values need to be filled with the non-missing value corresponding to the nearest date within same group. In case of a tie, pick any.
I am using dplyr. which.closest from birk but it needs a vector and a value. How to look up within a vector without writing loops. Even if there is an SQL solution, will do.
Any pointers to the solution?
May be something like: value = value[match(which.closest(dates,THISdate) & !is.na(value))]
Not sure how to specify Thisdate.
Edit: The expected value vector should look like:
value = c(10,20,20,40,10,5)
Using knn1 (nearest neighbor) from the class package (which comes with R -- don't need to install it) and dplyr define an na.knn1 function which replaces each NA value in x with the non-NA x value having the closest time.
library(class)
na.knn1 <- function(x, time) {
is_na <- is.na(x)
if (sum(is_na) == 0 || all(is_na)) return(x)
train <- matrix(time[!is_na])
test <- matrix(time[is_na])
cl <- x[!is_na]
x[is_na] <- as.numeric(as.character(knn1(train, test, cl)))
x
}
data %>% mutate(value = na.knn1(value, dates))
giving:
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
Add an appropriate group_by if the intention was to do this by group.
You can try the use of sapply to find the values closest since the x argument in `which.closest only takes a single value.
first create a vect whereby the dates with no values are replaced with NA and use it within the which.closest function.
library(birk)
vect=replace(data$dates,which(is.na(data$value)),NA)
transform(data,value=value[sapply(dates,which.closest,vec=vect)])
group dates value
1 1 2005-01-01 10
2 1 2006-05-01 20
3 1 2007-05-01 20
4 1 2004-08-01 40
5 1 2005-03-01 10
6 2 2010-02-01 5
if which.closest was to take a vector then there would be no need of sapply. But this is not the case.
Using the dplyr package:
library(birk)
library(dplyr)
data%>%mutate(vect=`is.na<-`(dates,is.na(value)),
value=value[sapply(dates,which.closest,vect)])%>%
select(-vect)

Calculate diff() between selected rows

I have a dataframe with ordered times (in seconds) and a column that is either 0 or 1:
time bit
index
0 0.24 0
1 0.245 0
2 0.47 1
3 0.471 1
4 0.479 0
5 0.58 1
... ... ...
I want to select those rows where the time difference is, let's say <0.01 s. But only those differences between rows with bit 1 and bit 0. So in the above example I would only select row 3 and 4 (or any one of them). I thought that I would calculate the diff() of the time column. But I need to somehow select on the 0/1 bit.
Coming from the future to answer this one. You can apply a function to the dataframe that finds the indices of the rows that adhere to the condition and returns the row pairs accordingly:
def filter_(x, threshold = 0.01):
indices = df.index[(df.time.diff() < threshold) & (df.bit.diff().abs() == 1)]
mask = indices | indices - 1
return x[mask]
print(df.apply(filter_, args = (0.01,)))
Output:
time bit
3 0.471 1
4 0.479 0

Understanding The Modulus Operator %

I understand the Modulus operator in terms of the following expression:
7 % 5
This would return 2 due to the fact that 5 goes into 7 once and then gives the 2 that is left over, however my confusion comes when you reverse this statement to read:
5 % 7
This gives me the value of 5 which confuses me slightly. Although the whole of 7 doesn't go into 5, part of it does so why is there either no remainder or a remainder of positive or negative 2?
If it is calculating the value of 5 based on the fact that 7 doesn't go into 5 at all why is the remainder then not 7 instead of 5?
I feel like there is something I'm missing here in my understanding of the modulus operator.
(This explanation is only for positive numbers since it depends on the language otherwise)
Definition
The Modulus is the remainder of the euclidean division of one number by another. % is called the modulo operation.
For instance, 9 divided by 4 equals 2 but it remains 1. Here, 9 / 4 = 2 and 9 % 4 = 1.
In your example: 5 divided by 7 gives 0 but it remains 5 (5 % 7 == 5).
Calculation
The modulo operation can be calculated using this equation:
a % b = a - floor(a / b) * b
floor(a / b) represents the number of times you can divide a by b
floor(a / b) * b is the amount that was successfully shared entirely
The total (a) minus what was shared equals the remainder of the division
Applied to the last example, this gives:
5 % 7 = 5 - floor(5 / 7) * 7 = 5
Modular Arithmetic
That said, your intuition was that it could be -2 and not 5. Actually, in modular arithmetic, -2 = 5 (mod 7) because it exists k in Z such that 7k - 2 = 5.
You may not have learned modular arithmetic, but you have probably used angles and know that -90° is the same as 270° because it is modulo 360. It's similar, it wraps! So take a circle, and say that its perimeter is 7. Then you read where is 5. And if you try with 10, it should be at 3 because 10 % 7 is 3.
Two Steps Solution.
Some of the answers here are complicated for me to understand. I will try to add one more answer in an attempt to simplify the way how to look at this.
Short Answer:
Example 1:
7 % 5 = 2
Each person should get one pizza slice.
Divide 7 slices on 5 people and every one of the 5 people will get one pizza slice and we will end up with 2 slices (remaining). 7 % 5 equals 2 is because 7 is larger than 5.
Example 2:
5 % 7 = 5
Each person should get one pizza slice
It gives 5 because 5 is less than 7. So by definition, you cannot divide whole 5items on 7 people. So the division doesn't take place at all and you end up with the same amount you started with which is 5.
Programmatic Answer:
The process is basically to ask two questions:
Example A: (7 % 5)
(Q.1) What number to multiply 5 in order to get 7?
Two Conditions: Multiplier starts from `0`. Output result should not exceed `7`.
Let's try:
Multiplier is zero 0 so, 0 x 5 = 0
Still, we are short so we add one (+1) to multiplier.
1 so, 1 x 5 = 5
We did not get 7 yet, so we add one (+1).
2 so, 2 x 5 = 10
Now we exceeded 7. So 2 is not the correct multiplier.
Let's go back one step (where we used 1) and hold in mind the result which is5. Number 5 is the key here.
(Q.2) How much do we need to add to the 5 (the number we just got from step 1) to get 7?
We deduct the two numbers: 7-5 = 2.
So the answer for: 7 % 5 is 2;
Example B: (5 % 7)
1- What number we use to multiply 7 in order to get 5?
Two Conditions: Multiplier starts from `0`. Output result and should not exceed `5`.
Let's try:
0 so, 0 x 7 = 0
We did not get 5 yet, let's try a higher number.
1 so, 1 x 7 = 7
Oh no, we exceeded 5, let's get back to the previous step where we used 0 and got the result 0.
2- How much we need to add to 0 (the number we just got from step 1) in order to reach the value of the number on the left 5?
It's clear that the number is 5. 5-0 = 5
5 % 7 = 5
Hope that helps.
As others have pointed out modulus is based on remainder system.
I think an easier way to think about modulus is what remains after a dividend (number to be divided) has been fully divided by a divisor. So if we think about 5%7, when you divide 5 by 7, 7 can go into 5 only 0 times and when you subtract 0 (7*0) from 5 (just like we learnt back in elementary school), then the remainder would be 5 ( the mod). See the illustration below.
0
______
7) 5
__-0____
5
With the same logic, -5 mod 7 will be -5 ( only 0 7s can go in -5 and -5-0*7 = -5). With the same token -5 mod -7 will also be -5.
A few more interesting cases:
5 mod (-3) = 2 i.e. 5 - (-3*-1)
(-5) mod (-3) = -2 i.e. -5 - (-3*1) = -5+3
It's just about the remainders. Let me show you how
10 % 5=0
9 % 5=4 (because the remainder of 9 when divided by 5 is 4)
8 % 5=3
7 % 5=2
6 % 5=1
5 % 5=0 (because it is fully divisible by 5)
Now we should remember one thing, mod means remainder so
4 % 5=4
but why 4?
because 5 X 0 = 0
so 0 is the nearest multiple which is less than 4
hence 4-0=4
modulus is remainders system.
So 7 % 5 = 2.
5 % 7 = 5
3 % 7 = 3
2 % 7 = 2
1 % 7 = 1
When used inside a function to determine the array index. Is it safe programming ? That is a different question. I guess.
Step 1 : 5/7 = 0.71
Step 2 : Take the left side of the decimal , so we take 0 from 0.71 and multiply by 7
0*7 = 0;
Step # : 5-0 = 5 ; Therefore , 5%7 =5
Modulus operator gives you the result in 'reduced residue system'. For example for mod 5 there are 5 integers counted: 0,1,2,3,4. In fact 19=12=5=-2=-9 (mod 7). The main difference that the answer is given by programming languages by 'reduced residue system'.
lets put it in this way:
actually Modulus operator does the same division but it does not care about the answer , it DOES CARE ABOUT reminder for example if you divide 7 to 5 ,
so , lets me take you through a simple example:
think 5 is a block, then for example we going to have 3 blocks in 15 (WITH Nothing Left) , but when that loginc comes to this kinda numbers {1,3,5,7,9,11,...} , here is where the Modulus comes out , so take that logic that i said before and apply it for 7 , so the answer gonna be that we have 1 block of 5 in 7 => with 2 reminds in our hand! that is the modulus!!!
but you were asking about 5 % 7 , right ?
so take the logic that i said , how many 7 blocks do we have in 5 ???? 0
so the modulus returns 0...
that's it ...
A novel way to find out the remainder is given below
Statement : Remainder is always constant
ex : 26 divided by 7 gives R : 5
This can be found out easily by finding the number that completely divides 26 which is closer to the
divisor and taking the difference of the both
13 is the next number after 7 that completely divides 26 because after 7 comes 8, 9, 10, 11, 12 where none of them divides 26 completely and give remainder 0.
So 13 is the closest number to 7 which divides to give remainder 0.
Now take the difference (13 ~ 7) = 5 which is the temainder.
Note: for this to work divisor should be reduced to its simplest form ex: if 14 is the divisor, 7 has to be chosen to find the closest number dividing the dividend.
As you say, the % sign is used to take the modulus (division remainder).
In w3schools' JavaScript Arithmetic page we can read in the Remainder section what I think to be a great explanation
In arithmetic, the division of two integers produces a quotient and a
remainder.
In mathematics, the result of a modulo operation is the
remainder of an arithmetic division.
So, in your specific case, when you try to divide 7 bananas into a group of 5 bananas, you're able to create 1 group of 5 (quotient) and you'll be left with 2 bananas (remainder).
If 5 bananas into a group of 7, you won't be able to and so you're left with again the 5 bananas (remainder).