left outer join in R with conditions - sql

Is there a way to merge (left outer join) data frames by multiple columns, but with OR condition?
Example: There are two data frames df1 and df2 with columns x, y, num. I would like to have a data frame with all rows from df1, but only those rows from df2 which satisfy the condition: df1$x == df2$x OR df1$y == df2$y.
Here are sample data:
df1 <- data.frame(x = LETTERS[1:5],
y = 1:5,
num = rnorm(5), stringsAsFactors = F)
df1
x y num
1 A 1 0.4209480
2 B 2 0.4687401
3 C 3 0.3018787
4 D 4 0.0669793
5 E 5 0.9231559
df2 <- data.frame(x = LETTERS[3:7],
y = 3:7,
num = rnorm(5), stringsAsFactors = F)
df2$x[2] <- NA
df2$y[1] <- NA
df2
x y num
1 C NA -0.7160824
2 <NA> 4 -0.3283618
3 E 5 -1.8775298
4 F 6 -0.9821082
5 G 7 1.8726288
Then, the result is expected to be:
x y num x y num
1 A 1 0.4209480 <NA> NA NA
2 B 2 0.4687401 <NA> NA NA
3 C 3 0.3018787 C NA -0.7160824
4 D 4 0.0669793 <NA> 4 -0.3283618
5 E 5 0.9231559 E 5 -1.8775298
The most obvious solution is to use the sqldf package:
mergedData <- sqldf::sqldf("SELECT * FROM df1
LEFT OUTER JOIN df2
ON df1.x = df2.x
OR df1.y = df2.y")
Unfortunately this simple solution is extremely slow, and it will take ages to merge data frames with more than 100k rows each.
Another option is to split the right data frame and merge by parts, but is there any more elegant or even "out of the box" solution?

Here's one approach using data.table. For each column, we perform a join, but only extract the matching indices (as opposed to materialising the entire join). Then we combine these indices from all the columns (this part would need some changes if there can be multiple matches).
require(data.table)
setDT(df1)
setDT(df2)
foo <- function(dx, dy, cols) {
  ix = lapply(cols, function(col) {
    dy[dx, on=col, which=TRUE] # for each row in dx, get matching indices of dy
                               # by matching on column specified in "col"
  })
  ix = do.call(function(...) pmax(..., na.rm=TRUE), ix)
}
ix = foo(df1, df2, c("x", "y")) # obtain matching indices of df2 for each row in df1
df1[, paste0("col", 1:3) := df2[ix]] # update df1 by reference
df1
# x y num col1 col2 col3
# 1: A 1 2.09611034 NA NA NA
# 2: B 2 -1.06795571 NA NA NA
# 3: C 3 1.38254433 C 3 1.0173476
# 4: D 4 -0.09367922 D 4 -0.6379496
# 5: E 5 0.47552072 E NA -0.1962038
You can use setDF(df1) to convert it back to a data.frame, if necessary.

Related

Pandas Dataframe transformation - Understanding problems with functions I should use and logic I should opt for

I've got a hard problem with transforming a dataframe into another one.
I don't know what functions I should use to do what I want. I had some ideas that didn't work at all.
For example, I do not understand how I should use append (or if I should use it or something else) to do what I want.
Here is my original dataframe:
df1 = pd.DataFrame({
'Key': ['K0', 'K1', 'K2'],
'X0': ['a','b','a'],
'Y0': ['c','d','c'],
'X1': ['e','f','f'],
'Y1': ['g','h','h']
})
Key X0 Y0 X1 Y1
0 K0 a c e g
1 K1 b d f h
2 K2 a c f h
This dataframe represents every link attached to an ID in the Key column. The links follow each other: X0->Y0 is the parent of X1->Y1.
It's easy to read, and the real dataframe I'm working with has 6500 rows by 21 columns and represents a tree of links. So this dataframe follows an end-to-end link logic.
I want to transform it into another one that follows a unitary-link-and-ID logic (because it's a tree of links, some unitary links may be part of multiple end-to-end links).
So I want to get each individual link into X->Y, and I need to get the list of the Keys attached to each unitary link into Key.
And here is what I want:
df3 = pd.DataFrame({
'Key':[['K0','K2'],'K1','K0',['K1','K2']],
'X':['a','b','e','f'],
'Y':['c','d','g','h']
})
Key X Y
0 [K0, K2] a c
1 K1 b d
2 K0 e g
3 [K1, K2] f h
To do this, I first have to combine X0 and X1 into a single X column, and likewise Y0 and Y1 into a single Y column. At the same time I need to keep the keys attached to the links. This first transformation leads to a new dataframe containing all the original information, with duplicates which I will deal with afterwards to obtain df3.
Here is the transition dataframe :
df2 = pd.DataFrame({
'Key':['K0','K1','K2','K0','K1','K2'],
'X':['a','b','a','e','f','f'],
'Y':['c','d','c','g','h','h']
})
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 K0 e g
4 K1 f h
5 K2 f h
Transition from df1 to df2
For now, I did this to put X0, X1 and Y0, Y1 into X and Y:
Key = pd.Series(dtype=str)
X = pd.Series(dtype=str)
Y = pd.Series(dtype=str)
for i in df1.columns:
    if 'K' in i:
        Key = Key.append(df1[i], ignore_index=True)
    elif 'X' in i:
        X = X.append(df1[i], ignore_index=True)
    elif 'Y' in i:
        Y = Y.append(df1[i], ignore_index=True)
0 K0
1 K1
2 K2
dtype: object
0 a
1 b
2 a
3 e
4 f
5 f
dtype: object
0 c
1 d
2 c
3 g
4 h
5 h
dtype: object
But I do not know how to get the keys to keep them in front of the right links.
Also, I do this to construct df2, but it feels wrong and I do not understand how I should do it:
df2 = pd.DataFrame({
'Key':Key,
'X':X,
'Y':Y
})
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 NaN e g
4 NaN f h
5 NaN f h
I tried to use append to combine the X0, X1 and Y0, Y1 columns directly into df2, but it turned out to be a complete mess, not filling the df2 columns with the df1 columns' content. I also tried to use append to put the Series Key, X and Y directly into df2, but it gave me X and Y in rows instead of columns.
In short, I'm quite lost with it. I know there may be a lot to program to take df1, turn it into df2 and then into df3. I'm not asking you to solve the problem for me, but I really need help with the functions I should use or the logic I should put in place to achieve my goal.
To convert df1 to df2 you want to look into pandas.wide_to_long.
https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
>>> df2 = pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
>>> df2
X Y
Key num
K0 0 a c
K1 0 b d
K2 0 a c
K0 1 e g
K1 1 f h
K2 1 f h
You can drop the unwanted level "num" from the index using droplevel and turn the index level "Key" into a column using reset_index. Chaining everything:
>>> df2 = (
pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
.droplevel(level='num')
.reset_index()
)
>>> df2
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 K0 e g
4 K1 f h
5 K2 f h
Finally, to get df3 you just need to group df2 by "X" and "Y", and aggregate the "Key" groups into lists.
>>> df3 = df2.groupby(['X','Y'], as_index=False).agg(list)
>>> df3
X Y Key
0 a c [K0, K2]
1 b d [K1]
2 e g [K0]
3 f h [K1, K2]
If you don't want single keys to be lists you can do this instead
>>> df3 = (
df2.groupby(['X','Y'], as_index=False)
.agg(lambda g: g.iloc[0] if len(g) == 1 else list(g))
)
>>> df3
X Y Key
0 a c [K0, K2]
1 b d K1
2 e g K0
3 f h [K1, K2]
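The same chain scales to the wider layout described in the question (Key plus X0..X9 and Y0..Y9), because wide_to_long picks up every numeric suffix after each stubname. A minimal sketch with made-up data of that shape:
import numpy as np
import pandas as pd

# hypothetical wide frame shaped like the real data: Key plus X0..X9 and Y0..Y9
rng = np.random.default_rng(0)
wide = pd.DataFrame({'Key': [f'K{i}' for i in range(4)]})
for j in range(10):
    wide[f'X{j}'] = rng.choice(list('abcdef'), 4)
    wide[f'Y{j}'] = rng.choice(list('uvwxyz'), 4)

long = (
    pd.wide_to_long(wide, stubnames=['X', 'Y'], i='Key', j='num')
    .droplevel('num')
    .reset_index()
)
df3 = long.groupby(['X', 'Y'], as_index=False).agg(list)
print(df3.head())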

if looping over a dataframe and creating new columns inside that loop, will it be endless?

I want to loop over a dataframe and manipulate each column. Say I do so by:
for feature in df:
df[feature] = df[feature].apply(lambda x: manipulate(x))
print (str(feature) + ' ready!')
Will this make me end up in an endless loop because python will iterate over all columns, including those that are newly created, or only the ones from my initial input-df?
No, this will only loop over the initial columns in the dataframe. Example:
df = pd.DataFrame({ 'x': [1,2,3,4,5], 'y': ['a','b','c','d','e']})
for col in df:
    df[col + '1'] = df[col]
returns:
x y x1 y1
0 1 a 1 a
1 2 b 2 b
2 3 c 3 c
3 4 d 4 d
4 5 e 5 e
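The reason is that iterating over a DataFrame walks over the column labels captured when the loop starts, so columns created inside the loop are never visited. If you want to make that explicit (and be robust to how pandas happens to implement iteration), you can loop over a snapshot of the column names; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': ['a', 'b', 'c', 'd', 'e']})

# iterate over an explicit snapshot of the column names, so it is obvious
# that newly created columns are never part of the loop
for col in list(df.columns):
    df[col + '1'] = df[col]

print(df.columns.tolist())  # ['x', 'y', 'x1', 'y1']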

subset df by masking between specific rows

I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is these values can be at different rows so I can't select fixed rows.
Specifically, I want to remove rows that fall between ABC xxx and the integer 5. These values could fall anywhere in the df and be of unequal length.
Note: The string ABC will be followed by different values.
I thought about returning all the indexes that contain these two values.
But mask could work better if I could return all rows between these two values?
df = pd.DataFrame({
'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})
mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)
Intended Output:
Val
0 None
8 X
9 1
10 2
15 Y
16 1
17 2
If ABC always comes before 5 and the values always occur in pairs (ABC, 5): get the indices of both values with np.where, zip them and generate the index values in between, then filter with isin, inverting the mask with ~:
#2 values of ABC, 5 in data
df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,
             'None','ABC','None',1,2,3,4,5,'X',1,2]
})
m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
m2 = np.where(df['Val'] == 5)[0]
print (m1)
[ 1 12]
print (m2)
[ 7 18]
idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
print (df[~df.index.isin(idx)])
Val
0 None
8 X
9 1
10 2
11 None
19 X
20 1
21 2
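Another way to mark the rows between paired markers, without zipping index positions, is to compare cumulative counts of the start and end flags. This is only a sketch and, like the solution above, assumes every 5 closes an ABC block:
import pandas as pd

df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,
             'None','ABC','None',1,2,3,4,5,'X',1,2]
})

start = df['Val'].astype(str).str.contains('ABC')  # rows that open a block
end = df['Val'] == 5                                # rows that close a block

# a row is inside a block while more blocks have been opened than closed;
# the closing 5 itself is caught by the extra "| end"
inside = (start.cumsum() > end.cumsum()) | end
print(df[~inside])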
a = df.index[df['Val'].str.contains('ABC') == True][0]
b = df.index[df['Val'] == 5][0] + 1
c = np.array(range(a, b))
bad_df = df.index.isin(c)
df[~bad_df]
Output
Val
0 None
8 X
9 1
10 2
If there is more than one 'ABC' and 5, then use the version below.
With this you get the df excluding everything from the first ABC to the last 5:
a = (df['Val'].str.contains('ABC') == True).idxmax()
b = df['Val'].where(df['Val'] == 5).last_valid_index() + 1
c = np.array(range(a, b))
bad_df = df.index.isin(c)
df[~bad_df]

Adding a column that's the result of the difference in consecutive rows in pandas

Let's say I have a dataframe like this:
A B
0 a b
1 c d
2 e f
3 g h
0, 1, 2, 3 are times; a, c, e, g form one time series and b, d, f, h form another time series.
I need to be able to add two columns to the original dataframe, obtained by computing the differences of consecutive rows for certain columns.
So I need something like this:
A B dA
0 a b (a-c)
1 c d (c-e)
2 e f (e-g)
3 g h Nan
I saw something called diff on the dataframe/series, but it works slightly differently: the first element becomes NaN rather than the last.
Use shift.
df['dA'] = df['A'] - df['A'].shift(-1)
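With a small numeric frame standing in for the letters above, shift(-1) pairs each value with the one in the row below it, so the last difference ends up NaN:
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# shift(-1) moves the column up by one row, so each value has its successor subtracted from it
df["dA"] = df["A"] - df["A"].shift(-1)
print(df)
#    A   B   dA
# 0  9  12  5.0
# 1  4   7  2.0
# 2  2   5  1.0
# 3  1   4  NaN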
You could use diff and pass -1 as the periods argument:
>>> df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
>>> df["dA"] = df["A"].diff(-1)
>>> df
A B dA
0 9 12 5
1 4 7 2
2 2 5 1
3 1 4 NaN
[4 rows x 3 columns]
When using data from a CSV file, this works the same way (note that .diff(1) subtracts the previous row; pass -1 instead, as shown above, to subtract the next row as in the question):
my_data = pd.read_csv('sale_data.csv')
df = pd.DataFrame(my_data)
df['New_column'] = df['target_column'].diff(1)
print(df)  # for the console, but not necessary
Rolling differences can also be calculated this way:
my_data = pd.read_csv('sale_data.csv')
df = pd.DataFrame(my_data)

i = 0
while i < len(df['target_column']) - 1:
    # difference between two consecutive values in the column
    diff_value = df['target_column'][i+1] - df['target_column'][i]
    print(diff_value)
    i += 1  # move to the next value in the column

Calculating Growth-Rates by applying log-differences

I am trying to transform my data.frame by calculating the log-differences of each column
while controlling for the row id. So basically I would like to calculate the growth rates of each id's variables.
So here is a random df with an id column, a time period column p and three variable columns:
df <- data.frame (id = c("a","a","a","c","c","d","d","d","d","d"),
p = c(1,2,3,1,2,1,2,3,4,5),
var1 = rnorm(10, 5),
var2 = rnorm(10, 5),
var3 = rnorm(10, 5)
)
df
id p var1 var2 var3
1 a 1 5.375797 4.110324 5.773473
2 a 2 4.574700 6.541862 6.116153
3 a 3 3.029428 4.931924 5.631847
4 c 1 5.375855 4.181034 5.756510
5 c 2 5.067131 6.053009 6.746442
6 d 1 3.846438 4.515268 6.920389
7 d 2 4.910792 5.525340 4.625942
8 d 3 6.410238 5.138040 7.404533
9 d 4 4.637469 3.522542 3.661668
10 d 5 5.519138 4.599829 5.566892
Now I have written a function which does exactly what I want BUT I had to take a detour which is possibly unnecessary and can be removed. However, somehow I am not able to locate
the shortcut.
Here is the function and the output for the posted data frame:
require(plyr)
fct.logDiff <- function (df) {
  df.log <- dlply(df, "id", function(x) data.frame(p = x$p, log(x[, -c(1,2)])))
  list.nalog <- llply(df.log, function(x) data.frame(p = x$p, rbind(NA, sapply(x[,-1], diff))))
  ldply(list.nalog, data.frame)
}
fct.logDiff(df)
id p var1 var2 var3
1 a 1 NA NA NA
2 a 2 -0.16136569 0.46472004 0.05765945
3 a 3 -0.41216720 -0.28249264 -0.08249587
4 c 1 NA NA NA
5 c 2 -0.05914281 0.36999681 0.15868378
6 d 1 NA NA NA
7 d 2 0.24428771 0.20188025 -0.40279188
8 d 3 0.26646102 -0.07267311 0.47041227
9 d 4 -0.32372771 -0.37748866 -0.70417351
10 d 5 0.17405309 0.26683625 0.41891802
The trouble is due to the added NA rows. I don't want to collapse the frame and reduce it, which the diff() function would do automatically. So I had 10 rows in my original frame and keep the same number of rows after the transformation. In order to keep the same length I had to add some NAs. I took a detour by transforming the data.frame into a list, adding the NAs to each id's first line, and afterwards transforming the list back into a data.frame. That looks tedious.
Any ideas to avoid the data.frame-list-data.frame class transformation and optimize the function?
How about this?
nadiff <- function(x, ...) c(NA, diff(x, ...))
ddply(df, "code", colwise(nadiff, c("var1", "var2", "var3")))