fill null values with conditional statement using .shift() - pandas

I want to fill null values of a column based on values in another column.
A B
1 21
0 21
0 21
1 25
1 28
0 28
My B value increases only if the A value is 1.
So I have some null values in column A, like:
A B
1 21
0 21
NAN 21
1 25
1 28
0 28
I want to fill this null value with 0 because the corresponding value of B didn't increase.
df['A'] = np.where(df['A'].isnull() & (df['B'] == df['B'].shift()), 0, df['A'])
This isn't giving the correct results. Where am I going wrong?

loc might work better here. Note that df['A'] == np.nan is always False, because NaN never compares equal to anything; use isna() instead:
df.loc[df['A'].isna() & (df['B'] == df['B'].shift()), 'A'] = 0
# shift() with its default of 1 compares each B with the previous row's B:
# since B only increases when A is 1, an unchanged B means the missing A is 0
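For reference, here is that fix run end to end on the sample frame from the question (a minimal sketch):
import numpy as np
import pandas as pd

# The sample frame from the question, with the missing A value.
df = pd.DataFrame({'A': [1, 0, np.nan, 1, 1, 0],
                   'B': [21, 21, 21, 25, 28, 28]})

# B only increases when A is 1, so a row whose B matches the previous
# row's B should get A = 0.
df.loc[df['A'].isna() & (df['B'] == df['B'].shift()), 'A'] = 0
print(df)
#      A   B
# 0  1.0  21
# 1  0.0  21
# 2  0.0  21
# 3  1.0  25
# 4  1.0  28
# 5  0.0  28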


Find the third field if two fields have non-zero values

My dataset -
A B C
abc 0 12
ert 0 45
ghj 14 0
kli 56 78
qas 0 0
I want to find the values of A for which values of B and C together are non-zero.
Expected output-
A B C
kli 56 78
I tried:
aggr(
sum({<[B]={"<>0"},[C]={"<>0"}>}A)
,[B],[C])
It depends on where you are doing this, in the Data Load script or through Set Analysis on the front end, but this will work in both the load editor and a table:
if("B" <> 0 and "C" <> 0, 'Non-Zero Value', 'Zero Value')
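If the same check ever needs to happen outside Qlik, the equivalent row filter in pandas is a one-liner (a sketch on the question's sample data, not Qlik syntax):
import pandas as pd

# The sample data from the question.
df = pd.DataFrame({'A': ['abc', 'ert', 'ghj', 'kli', 'qas'],
                   'B': [0, 0, 14, 56, 0],
                   'C': [12, 45, 0, 78, 0]})

# Keep only rows where both B and C are non-zero.
print(df[(df['B'] != 0) & (df['C'] != 0)])
#      A   B   C
# 3  kli  56  78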

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe with currently 22 rows
index value
0 23
1 22
2 19
...
21 20
To this dataframe, I want to add 78 rows to make the dataframe exactly 100 rows. So I need to fill loc[22:99] with a certain value, let's say 100.
I tried something like this
uncon_dstn_2021['balance'].loc[22:99] = 100
but it did not work. Any ideas?
You can use reindex:
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
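To make the reindex behaviour concrete, here is a sketch on a three-row stand-in padded out to ten rows (hypothetical values); passing a range directly avoids building the index list by hand:
import pandas as pd

# Toy three-row frame standing in for the 22-row original
# (hypothetical values).
df = pd.DataFrame({'balance': [23, 22, 19]})

# reindex keeps the existing rows and fills the new labels 3..9
# with fill_value.
out = df.reindex(range(10), fill_value=100)
print(out.tail(3))
#    balance
# 7      100
# 8      100
# 9      100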
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
balance
0 1
1 14
2 11
3 11
4 10
.. ...
96 100
97 100
98 100
99 100
[100 rows x 1 columns]

Check if condition is true and if so add value to another column in sql

I have a postgres table that looks like this:
A B
5 4
10 10
13 15
100 25
20 Null
Using SQL, I would like to check whether the value in column A is larger than the value in column B and, if so, add a 1 to the column True. If the value in column A is smaller than or equal to the value in column B, or if column B contains a [NULL] value, I would like to add a 1 to the column False, like so:
A B True False
5 4 1 0
10 10 0 1
13 15 0 1
100 25 1 0
20 [NULL] 0 1
What is the best way to achieve this?
You can use case logic. Note that when B is null, A > B evaluates to unknown, so the else branch fires and the row is counted as false, which is exactly the NULL handling you asked for:
select t.*,
       (case when A > B then 1 else 0 end) as true_col,
       (case when A > B then 0 else 1 end) as false_col
from t;
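For comparison, the same flagging in pandas behaves identically, since comparisons against NaN are False just as SQL comparisons against NULL are not true (a sketch on the sample rows):
import numpy as np
import pandas as pd

# The sample rows from the question; the missing B becomes NaN.
df = pd.DataFrame({'A': [5, 10, 13, 100, 20],
                   'B': [4, 10, 15, 25, np.nan]})

# A > B is False when B is NaN, so those rows land in the "false" column,
# mirroring the SQL case logic above.
df['true_col'] = (df['A'] > df['B']).astype(int)
df['false_col'] = 1 - df['true_col']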

rolling sum of a column in pandas dataframe at variable intervals

I have a list of index numbers that represent index locations for a DF. list_index = [2,7,12]
I want to sum from a single column in the DF by rolling through each number in list_index and totaling the counts between the index points (restarting the count at 0 at each index point). Here is a mini example.
The desired output is in the OUTPUT column, which increments every time there is another 1 in COL 1 and restarts the count at 0 on the row after each number in list_index.
I was able to get it to work with a loop, but there are millions of rows in the DF and the loop takes a while to run. It seems like I need a lambda function with a sum, but I would need to pass a start and end point in the index.
Something like lambda x: x.rolling(start_index, end_index).sum()? Can anyone help me out with this?
You can try a cumulative sum and keep only the information related to the 1 values; a rolling sum with different interval lengths is not possible:
# running count of the 1s seen so far
a = df['col'].eq(1).cumsum()
# subtract the count as of the most recent 0 so the count restarts after each 0
df['output'] = a - a.mask(df['col'].eq(1)).ffill().fillna(0).astype(int)
Out:
col output
0 0 0
1 1 1
2 1 2
3 0 0
4 1 1
5 1 2
6 1 3
7 0 0
8 0 0
9 0 0
10 0 0
11 1 1
12 1 2
13 0 0
14 0 0
15 1 1
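If the restarts really must be keyed to the positions in list_index rather than to the zeros, one option is to label each row with its segment and take a grouped cumulative sum. A sketch, assuming each entry of list_index marks the last row of its segment:
import numpy as np
import pandas as pd

# The col values from the output above and the list_index from the question.
df = pd.DataFrame({'col': [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1]})
list_index = [2, 7, 12]

# Rows 0-2 fall in segment 0, 3-7 in segment 1, 8-12 in segment 2, and the
# rest in segment 3; a cumsum within each segment restarts the count.
segment = np.searchsorted(list_index, df.index, side='left')
df['output'] = df['col'].groupby(segment).cumsum()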

How would you do this task using SQL or the R library sqldf?

I need to implement the following function (ideally in R or SQL): given two data frames (each has a column for userid, and the rest of the columns are boolean attributes, permitted to be only 0 or 1), I need to return a new data frame with two columns (userid and count), where count is the number of matching 0s and 1s for each user in both tables. A user could occur in both data frames or in just one; in the latter case, I need to return NA for that user's count. Here is an example:
DF1
ID c1 c2 c3 c4 c5
1 0 1 0 1 1
10 1 0 1 0 0
5 0 1 1 1 0
20 1 1 0 0 1
3 1 1 0 0 1
6 0 0 1 1 1
71 1 0 1 0 0
15 0 1 1 1 0
80 0 0 0 1 0
DF2
ID c1 c2 c3 c4 c5
5 1 0 1 1 0
6 0 1 0 0 1
15 1 0 0 1 1
80 1 1 1 0 0
78 1 1 1 0 0
98 0 0 1 1 1
1 0 1 0 0 1
2 1 0 0 1 1
9 0 0 0 1 0
My function must return something like this: (the following is a subset)
DF_Return
ID Count
1 4
2 NA
80 1
20 NA
.
.
.
Could you give me any suggestions for carrying this out? I'm not an expert in SQL.
Here is the R code to generate the example data used above:
id1=c(1,10,5,20,3,6,71,15,80)
c1=c(0,1,0,1,1,0,1,0,0)
c2=c(1,0,1,1,1,0,0,1,0)
c3=c(0,1,1,0,0,1,1,1,0)
c4=c(1,0,1,0,0,1,0,1,1)
c5=c(1,0,0,1,1,1,0,0,0)
DF1=data.frame(ID=id1,c1=c1,c2=c2,c3=c3,c4=c4,c5=c5)
DF2=data.frame(ID=c(5,6,15,80,78,98,1,2,9),c1=c2,c2=c1,c3=c5,c4=c4,c5=c3)
Many thanks in advance.
Best Regards!
Here are two approaches. The first hardcodes the columns to compare, while the second is more general and agnostic to how many columns DF1 and DF2 have:
#Merge together using all = TRUE for the equivalent of an outer join
DF3 <- merge(DF1, DF2, by = "ID", all = TRUE, suffixes= c(".1", ".2"))
#Calculate the rowSums where the same columns match
out1 <- data.frame(ID = DF3[, 1], count = rowSums(DF3[, 2:6] == DF3[, 7:ncol(DF3)]))
#Approach that is agnostic to the number of columns you have
library(reshape2)
library(plyr)
DF3.m <- melt(DF3, id.vars = 1)
DF3.m[, c("level", "DF")] <- with(DF3.m, colsplit(variable, "\\.", c("level", "DF")))
out2 <- dcast(data = DF3.m, ID + level ~ DF, value.var="value")
colnames(out2)[3:4] <- c("DF1", "DF2")
out2 <- ddply(out2, "ID", summarize, count = sum(DF1 == DF2))
#Are they the same?
all.equal(out1, out2)
#[1] TRUE
> head(out1)
ID count
1 1 4
2 2 NA
3 3 NA
4 5 3
5 6 2
6 9 NA
SELECT
    COALESCE(DF1.ID, DF2.ID) AS ID,
    CASE WHEN DF1.ID IS NULL OR DF2.ID IS NULL THEN NULL -- NA for users in only one table
         ELSE CASE WHEN DF1.c1 = DF2.c1 THEN 1 ELSE 0 END +
              CASE WHEN DF1.c2 = DF2.c2 THEN 1 ELSE 0 END +
              CASE WHEN DF1.c3 = DF2.c3 THEN 1 ELSE 0 END +
              CASE WHEN DF1.c4 = DF2.c4 THEN 1 ELSE 0 END +
              CASE WHEN DF1.c5 = DF2.c5 THEN 1 ELSE 0 END
    END AS count_of_matches
FROM DF1
FULL OUTER JOIN DF2
    ON DF1.ID = DF2.ID
There's probably a more elegant way, but this works:
x <- merge(DF1,DF2,by="ID",all=TRUE)
pre <- paste("c",1:5,sep="")
x$Count <- rowSums(x[,paste(pre,"x",sep=".")]==x[,paste(pre,"y",sep=".")])
DF_Return <- x[,c("ID","Count")]
We could use safe_full_join from my package safejoin, and apply ==
between conflicting columns. This will yield a new data frame with logical
c* columns that we can use rowSums on.
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
safe_full_join(DF1, DF2, by = "ID", conflict = `==`) %>%
transmute(ID, count = rowSums(.[-1]))
# ID count
# 1 1 4
# 2 10 NA
# 3 5 3
# 4 20 NA
# 5 3 NA
# 6 6 2
# 7 71 NA
# 8 15 1
# 9 80 1
# 10 78 NA
# 11 98 NA
# 12 2 NA
# 13 9 NA
You can use the apply function to handle this. To get the sum of each row, you can use:
sums <- apply(df1[2:ncol(df1)], 1, sum)
cbind(df1[1], sums)
which will return the sum of all but the first column, then bind that to the first column to get the ID back.
You could do that on both data frames. I'm not really clear what the desired behavior is after that, but maybe look at the merge function.
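If pandas happens to be an option, the merge-and-rowSums idea from the first answer translates directly. A sketch with hypothetical two-column data:
import numpy as np
import pandas as pd

# Small stand-ins for DF1/DF2 (hypothetical values, two attribute
# columns instead of five).
df1 = pd.DataFrame({'ID': [1, 10, 5], 'c1': [0, 1, 0], 'c2': [1, 0, 1]})
df2 = pd.DataFrame({'ID': [5, 1, 2], 'c1': [1, 0, 1], 'c2': [0, 1, 0]})

# Outer-join on ID, then count positions where the paired columns agree.
m = df1.merge(df2, on='ID', how='outer', suffixes=('_1', '_2'))
left = m[[c for c in m.columns if c.endswith('_1')]].to_numpy()
right = m[[c for c in m.columns if c.endswith('_2')]].to_numpy()

# Equality against NaN is False, so explicitly restore NA for users
# that appear in only one frame, as the question asks.
count = (left == right).sum(axis=1).astype(float)
count[np.isnan(left).any(axis=1) | np.isnan(right).any(axis=1)] = np.nan
out = pd.DataFrame({'ID': m['ID'], 'count': count})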