r analogous to sql inner join selection - sql

Suppose we have the contents of tables x and y in two dataframes in R. Which is the suggested way to perform an operation like the following in sql:
Select x.X1, x.X2, y.X3
into z
from x inner join y on x.X1 = y.X1
I tried the following in R. Is there a better way?
Thank you
x<-data.frame(cbind('X1'=c(5,9,7,6,4,8,3,1,10,2),'X2'=c(5,9,7,6,4,8,3,1,10,2)^2))
y<-data.frame(cbind('X1'=c(9,5,8,2),'X3'=c('nine','five','eight','two')))
z<-cbind(x[which(x$X1 %in% (y$X1)), c(1:2)][order(x[which(x$X1 %in% (y$X1)), c(1:2)]$X1),],y[order(y$X1),2])

This was already answered on stackoverflow.
Beyond merge, if you're more comfortable with SQL you should check out the sqldf package, which allows you to run SQL queries on data frames.
library(sqldf)
z <- sqldf("SELECT X1, X2, X3 FROM x JOIN y
USING(X1)")
That said, you will be better off learning the base R functions (merge, intersect, union, etc.) in the long run.

Ok, it was easy
merge(x,y)

Related

vectorize join condition in pandas

This code is working correctly as expected. But it takes a lot of time for large dataframes.
for i in excel_df['name_of_college_school'] :
for y in mysql_df['college_name'] :
if SequenceMatcher(None, i.lower(), y.lower() ).ratio() > 0.8:
excel_df.loc[excel_df['name_of_college_school'] == i, 'dupmark4'] = y
I guess, I can not use a function on join clause to compare values like this.
How do I vectorize this?
Update:
Is it possible to update with the highest score? This loop will overwrite the earlier match and it is possible that the earlier match was more relevant than current one.
What you are looking for is fuzzy merging.
a = excel_df.as_matrix()
b = mysql_df.as_matrix()
for i in a:
for j in b:
if SequenceMatcher(None,
i[college_index_a].lower(), y[college_index_b].lower() ).ratio() > 0.8:
i[dupmark_index] = j
Never use loc in a loop, it has a huge overhead. And btw, get the index of the respective columns, (the numerical one). Use this -
df.columns.get_loc("college name")
You could avoid one of the loops using apply and instead of MxN .loc operations, now it'll be M operations.
for y in mysql_df['college_name']:
match = excel_df['name_of_college_school'].apply(lambda x: SequenceMatcher(
None, x.lower(), y.lower()).ratio() > 0.8)
excel_df.loc[match, 'dupmark4'] = y

find ranges to create Uniform histogram

I need to find ranges in order to create a Uniform histogram
i.e: ages
to 4 ranges
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
is there a function that gives me the ranges so the histogram is uniform?
in this case
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy, I'd like to know if there is an algorithm that deals with complex data sets as well
thanks
You are looking for the quantiles that split up your data equally. This combined with cutshould work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts in the ends.
Edit2: I don't quite understand the downvote. But this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
With some minor renaming nessesary. For slightly better level names, you could also put in min(x) and max(x) instead of -Inf and Inf.

Number sort using Min, Max and Variables

I am new to programming and I am trying to create a program that will take 3 random numbers X Y and Z and will sort them into ascending order X being the lowest and Z the highest using Min, Max functions and a Variable (tmp)
I know that there is a particular strategy that I need to use that effects the (X,Y) pair first then (Y,Z) then (X,Y) again but I can't grasp the logic.
The closest I have got so far is...
y=min(y,z)
x=min(x,y)
tmp=max(y,z)
z=tmp
tmp=max(x,y)
y=tmp
x=min(x,y)
tmp=max(x,y)
y=tmp
I've tried so many different combinations but it seems that the problem is UNSOLVABLE can anybody else help?
You need to get sort the X,Y Pair first
tmp=min(x,y)
y=max(x,y)
x=tmp
Then sort the Y,Z pair
tmp = min(y,z)
z=max(y,z)
y=tmp
Then, resort the X,Y pair (in case the original Z was the lowest value...
tmp=min(x,y)
y=max(x,y)
x=tmp
If the commands you have mentioned are the only ones available on the website, and you can only use each one once try:
# Sort X,Y pair
tmp=max(x,y)
x=min(x,y)
y=tmp
# Sort Y,Z pair
tmp=max(y,z)
y=min(y,z)
z=tmp
# Sort X,Y pair again.
tmp=max(x,y)
x=min(x,y)
y=tmp
Hope that helps.
I'm not sure if I understood your question correctly, but you are over righting your variables. Or are you trying to solve some homework with the restriction to only use min() and max() functions?
What about using a list?
tmp = [x, y, z]
tmp.sort()
x, y, z = tmp

relational algebra natural join

Hi all I have an exam coming up and am not getting much help from the lecturer on two questions on the practice exam. She has provided the answer but has not responded to my questions about the answer, I'm hoping someone here would be able to explain why the answer is the way it is.
Consider the following two tables R and S with their instances:
R S
A B C D E
a x y x y
a z w z w
b x k
b m j
c x y
f g h
a) πA(R[natural join]B=D S)
the answer being (a,b,c), why isn't it (a,a,b,c)? does a projection make it distinct?
b) π A(R[natural join] B<>D S)
the answer being (a,b,c,f), why is a an answer? b=d both times when values are x and z, so why is this being printed out?
a)In Relation Algebra, the projection operator provides duplicate elimination. In SQL this is not the default operation, but it is for relational algebra. Here is my source. At the moment, I can't recall why it does duplicate elimination, but this was my professor for databases and he is very knowledgeable. (I think it's because Relation Algebra uses set-logic and sets do not have duplicates.)
b)The joining of 2 tables creates a CROSS PRODUCT between the 2 tables. You have 6 rows and 2 rows. So the cross product is 6x2 = 12 rows. For row 1 of table R, you have a x y. This will be paired with x y AND z w resulting in [a x y x y] and [a x y z w]. The second pairing is valid for this relational algebra statement. Columns B and D do not match x != z.
a) πA(R[natural join]B=D S)
the answer being (a,b,c), why isn't it (a,a,b,c)? does a projection make it distinct?
In relational algebra, duplicate tuples are not permitted; that a main difference between sql (where distinct is needed) and relational algebra
b) π A(R[natural join] B<>D S)
the answer being (a,b,c,f), why is a an answer? b=d both times when values are x and z, so why is this being printed out?
Natural join operation returns the set of all combinations of tuples in R and S, so in this case returns also tuples (a x y z w) and (a z w x y); thus a has to be in the resulting projection.
[natural join] B=D
This is not a natural join because "natural join" is a join that joins relations exclusively over attributes of the same name. The construct you describe might in some places be labeled/termed an "equijoin" or so, but it is certainly noy a "natural join".
[natural join] B<>D
This is not a natural join because "natural join" is a join that joins together tuples of the argument relations if and only if the attribute values are equal.
You are being hopelessly mistaught and miseducated. Reference material : "an introduction to database systems", C.J.Date. It won't do you any good for your exams, but if you seek a later career in database technology it might be worthwhile to remember this.
But to answer your actual questions (in line with preceding answers) :
a) The attribute value 'a' cannot appear twice in the result of a projection, because a projection produces a relation, and a relation is defined to be a set, and sets cannot contain duplicates.
b) The [non-] natural join contains both the tuples (axyzw) and (azwxy). "First" tuple from R with "second" tuple from S, and other way round. The projection includes the result (a).

Avoid the use of for loops

I'm working with R and I have a code like this:
for (i in 1:10)
for (j in 1:100)
if (data[i] == paths[j,1])
cluster[i,4] <- paths[j,2]
where:
data is a vector with 100 rows and 1 column
paths is a matrix with 100 rows and 5 columns
cluster is a matrix with 100 rows and 5 columns
My question is: how could I avoid the use of "for" loops to iterate through the matrix? I don't know whether apply functions (lapply, tapply...) are useful in this case.
This is a problem when j=10000 for example because the execution time is very long.
Thank you
Inner loop could be vectorized
cluster[i,4] <- paths[max(which(data[i]==paths[,1])),2]
but check Musa's comment. I think you indented something else.
Second (outer) loop could be vectorize either, by replicating vectors but
if i is only 100 your speed-up don't be large
it will need more RAM
[edit]
As I understood your comment can you just use logical indexing?
indx <- data==paths[, 1]
cluster[indx, 4] <- paths[indx, 2]
I think that both loops can be vectorized using the following:
cluster[na.omit(match(paths[1:100,1],data[1:10])),4] = paths[!is.na(match(paths[1:100,1],data[1:10])),2]