How to do a basic left outer join with data.table in R?

I have a data.table of a and b that I've partitioned into below with b < .5 and above with b > .5:
DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
above = DT[DT$b > .5]
below = DT[DT$b < .5, list(a=a)]
I'd like to do a left outer join between above and below: for each a in above, count the number of rows in below. This is equivalent to the following in SQL:
with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 union select 2, 1 union select 3, 1 union select 3, 1),
above as (select a, b from dt where b > .5),
below as (select a, b from dt where b < .5)
select above.a, count(below.a) from above left outer join below on (above.a = below.a) group by above.a;
a | count
---+-------
3 | 0
2 | 1
(2 rows)
How do I accomplish the same thing with data.tables? This is what I tried so far:
> key(below) = 'a'
> below[above, list(count=length(b))]
a count
[1,] 2 1
[2,] 3 1
[3,] 3 1
> below[above, list(count=length(b)), by=a]
Error in eval(expr, envir, enclos) : object 'b' not found
> below[, list(count=length(a)), by=a][above]
a count b
[1,] 2 1 1
[2,] 3 NA 1
[3,] 3 NA 1
I should also be more specific in that I already tried merge but that blows through the memory on my system (and the dataset takes only about 20% of my memory).

See if this is giving you something useful. Your example is too sparse to let me know what you want, but it appears it might be a tabulation of values of above$a that are also in below$a
table(above$a[above$a %in% below$a])
If you also want the converse ... values not in below, then this would do it:
table(above$a[!above$a %in% below$a])
And you can concatenate them:
> c(table(above$a[above$a %in% below$a]),table(above$a[!above$a %in% below$a]) )
2 3
1 2
Generally table and %in% run in reasonably small footprints and are quick.

Since you appear to be using package data.table: check ?merge.data.table.
I haven't used it, but it appears this might do what you want:
merge(above, below, by="a", all.x=TRUE, all.y=FALSE)

I think this is easier:
setkey(above,a)
setkey(below,a)
Left outer join:
above[below, .N]
Regular (inner) join:
above[below, .N, nomatch=0]
Full outer join with counts:
merge(above, below, all=TRUE)[, .N, by=a]

I eventually found a way to do this with data.table, which I felt is more natural for me to understand than DWin's table, though YMMV:
result = below[, list(count=length(b)), by=a]
setkey(result, a)
result = result[J(unique(above$a))]
result$count[is.na(result$count)] = 0
I don't know if this could be more compact, though. I especially wanted to be able to do something like result = below[J(unique(above$a)), list(count=length(b))], but that doesn't work.
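As a language-neutral cross-check of what the join is supposed to return, here is a minimal Python sketch of the same counting logic. It is not data.table code; the lists `above_a` and `below_a` and the helper `left_join_counts` are hypothetical names that just mirror the `a` columns of `above` and `below` from the question.

```python
from collections import Counter

# Column a of the two partitions from the question (assumed copied by hand).
above_a = [2, 3, 3]  # rows with b > .5
below_a = [1, 1, 2]  # rows with b < .5


def left_join_counts(left_keys, right_keys):
    """For each distinct key on the left, count matching rows on the right."""
    right_counts = Counter(right_keys)
    # Counter returns 0 for missing keys, which plays the role of the
    # NA -> 0 replacement in the data.table version.
    return {key: right_counts[key] for key in sorted(set(left_keys))}


print(left_join_counts(above_a, below_a))  # {2: 1, 3: 0}
```

The result agrees with the SQL output in the question: a = 2 has one match in below and a = 3 has none.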

Related

Oracle: How do I only return the rows of the 2nd union if they exist?

I have two selects combined with union in Oracle. I just want to return the rows of the 2nd union, if it returns anything, even though the 1st union might also be returning rows.
I was thinking about using nvl but I'm not sure how to implement it.
select 1 seq,
x coord1,
y coord2,
z coord3
from tableA a
where a.prodRef = 4711
union
select 2 seq,
a coord1,
b coord2,
c coord3
from tableB b
where b.prodRef = 4711
Now the select returns rows from both queries (seq 1 and seq 2). If the query with seq 2 produces any output, I want to see only those rows and exclude the seq 1 rows. However, there may be cases in which the seq 2 query returns no rows; in that case I just want the seq 1 data.
Do you guys have any ideas how to solve this? My mind is completely empty.
Use not exists:
select 2 as seq, a as coord1, b as coord2, c as coord3
from tableB b
where b.prodRef = 4711
union all
select 1 as seq, x as coord1, y as coord2, z as coord3
from tableA a
where a.prodRef = 4711 and
not exists (select 1 from tableB b where b.prodRef = a.prodRef);
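The pattern can be tried out without an Oracle instance. The following is a sketch only: it runs the same UNION ALL plus NOT EXISTS shape against an in-memory SQLite database through Python's `sqlite3` module, so it shows the logic rather than Oracle-specific behaviour. The column types, the sample rows, the table aliases `ta`/`tb`/`t2` and the helper names are assumptions made for the demo.

```python
import sqlite3

# Same shape as the query above; aliases are renamed to avoid clashing with
# the column names a and b.
QUERY = """
SELECT 2 AS seq, tb.a AS coord1, tb.b AS coord2, tb.c AS coord3
FROM tableB tb
WHERE tb.prodRef = :ref
UNION ALL
SELECT 1 AS seq, ta.x AS coord1, ta.y AS coord2, ta.z AS coord3
FROM tableA ta
WHERE ta.prodRef = :ref
  AND NOT EXISTS (SELECT 1 FROM tableB t2 WHERE t2.prodRef = ta.prodRef)
"""


def make_demo_db():
    """Build a throwaway database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tableA (prodRef INTEGER, x INTEGER, y INTEGER, z INTEGER);
        CREATE TABLE tableB (prodRef INTEGER, a INTEGER, b INTEGER, c INTEGER);
        INSERT INTO tableA VALUES (4711, 1, 2, 3), (4712, 4, 5, 6);
        INSERT INTO tableB VALUES (4711, 7, 8, 9);
    """)
    return conn


def preferred_rows(conn, prod_ref):
    """Return tableB rows if any exist for prod_ref, otherwise tableA rows."""
    return conn.execute(QUERY, {"ref": prod_ref}).fetchall()


conn = make_demo_db()
print(preferred_rows(conn, 4711))  # tableB has a row, so only seq 2 comes back
print(preferred_rows(conn, 4712))  # tableB has none, so the seq 1 row comes back
```

Because the second branch is filtered by NOT EXISTS, at most one of the two branches can contribute rows for a given prodRef.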

Sum over multiple levels of nested repeated fields

I have several order-detail tables in the source database: Order Header -> Order Line -> Shipped Line -> Received Line
I created a BQ table with two levels of nested repeated fields. Here is what some sample data looks like:
WITH stol as (
SELECT 1 AS stol_id, "stol-1.1" AS stol_number, 1 AS stol_transfer_order_line_id, 3 AS stol_quantity
UNION ALL
SELECT 2 AS stol_id, "stol-2.1" AS stol_number, 2 AS stol_transfer_order_line_id, 2 AS stol_quantity
UNION ALL
SELECT 3 AS stol_id, "stol-2.2" AS stol_number, 2 AS stol_transfer_order_line_id, 2 AS stol_quantity
UNION ALL
SELECT 4 AS stol_id, "stol-2.3" AS stol_number, 2 AS stol_transfer_order_line_id, 1 AS stol_quantity
),
rtol as (
SELECT 1 AS stol_id, "rtol-1.1" as rtol_number, 2 as rtol_quantity
UNION ALL
SELECT 1 as stol_id, "rtol-1.2" as rtol_number, 1 AS rtol_quantity
UNION ALL
SELECT 2 as stol_id, "rtol-2.1" as rtol_number, 2 AS rtol_quantity
UNION ALL
SELECT 3 as stol_id, "rtol-2.2" as rtol_number, 1 AS rtol_quantity
),
tol as (
SELECT 1 as tol_id, "tol-1" as tol_number, 3 as tol_transfer_quantity
UNION ALL
SELECT 2 as tol_id, "tol-2" AS tol_number, 5 AS tol_transfer_quantity
),
nest AS (
SELECT s.stol_id,
s.stol_number,
s.stol_quantity,
s.stol_transfer_order_line_id,
ARRAY_AGG(STRUCT(r.rtol_number, r.rtol_quantity)) as received
FROM stol s
LEFT JOIN rtol r ON s.stol_id = r.stol_id
GROUP BY 1, 2, 3, 4
),
final as (
SELECT t.tol_id
,t.tol_number
,t.tol_transfer_quantity
,ARRAY_AGG(STRUCT(n.stol_number, n.stol_quantity, n.received)) as shipped
FROM tol t
LEFT JOIN nest n ON t.tol_id = n.stol_transfer_order_line_id
GROUP BY 1, 2, 3
)
I want to sum the shipped and received quantities for each order line. I can get the correct result like so:
shipped as (
SELECT tol_number
,SUM(stol_quantity) as shipped_q
FROM final t, t.shipped
GROUP BY 1
),
received as (
SELECT tol_number
,SUM(rtol_quantity) as received_q
FROM final t, t.shipped s, s.received
GROUP BY 1
)
SELECT t.tol_number
,t.tol_transfer_quantity
,s.shipped_q
,r.received_q
FROM final t
LEFT JOIN shipped s on t.tol_number = s.tol_number
LEFT JOIN received r ON t.tol_number = r.tol_number
Correct results:
Row tol_number tol_transfer_quantity shipped_q received_q
1 tol-1 3 3 3
2 tol-2 5 5 3
What I am wondering is whether there is a better way to do this. Something like the following overcounts the first level of nesting, but it feels and looks a lot cleaner:
SELECT tol_number
,tol_transfer_quantity
,SUM(stol_quantity) as shipped_q
,SUM(rtol_quantity) as shipped_r
FROM final t, t.shipped s, s.received
GROUP BY 1, 2
Wrong result for shipped_q:
Row tol_number tol_transfer_quantity shipped_q shipped_r
1 tol-2 5 5 3
2 tol-1 3 6 3
Many thanks for any ideas.
#standardSQL
SELECT
tol_id,
tol_transfer_quantity,
(SELECT SUM(stol_quantity) FROM final.shipped) shipped_q,
(SELECT SUM(rtol_quantity) FROM final.shipped s, s.received) shipped_r
FROM final
I'd suggest you use sub-selects in which you treat your arrays like tables:
SELECT
tol_id,
SUM(tol_transfer_quantity),
SUM( (SELECT SUM(stol_quantity) FROM final.shipped) ) shipped_q,
SUM( (SELECT SUM(rtol_quantity) FROM final.shipped s, s.received) ) shipped_r
FROM
final
GROUP BY
1
hth!
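To see why the flattened query overcounts while the per-array sub-selects do not, here is a small Python sketch that models the nested rows as plain dicts and lists. It is an illustration, not BigQuery code: the `orders` structure and both helper functions are hypothetical, and the `None` entry stands in for the all-NULL struct that the LEFT JOIN leaves behind for stol-2.3.

```python
# Nested sample data, copied by hand from the question's CTEs.
orders = [
    {"tol_number": "tol-1", "shipped": [
        {"stol_quantity": 3, "received": [2, 1]},
    ]},
    {"tol_number": "tol-2", "shipped": [
        {"stol_quantity": 2, "received": [2]},
        {"stol_quantity": 2, "received": [1]},
        {"stol_quantity": 1, "received": [None]},  # no receipt: NULL from LEFT JOIN
    ]},
]


def per_level_sums(order):
    """Sum each nesting level on its own, like the correlated sub-selects."""
    shipped_q = sum(s["stol_quantity"] for s in order["shipped"])
    received_q = sum(r for s in order["shipped"]
                     for r in s["received"] if r is not None)
    return shipped_q, received_q


def flattened_sums(order):
    """Flatten both levels first, like FROM final t, t.shipped s, s.received."""
    rows = [(s["stol_quantity"], r)
            for s in order["shipped"] for r in s["received"]]
    # Each shipped quantity is repeated once per received entry.
    shipped_q = sum(q for q, _ in rows)
    received_q = sum(r for _, r in rows if r is not None)
    return shipped_q, received_q


for order in orders:
    print(order["tol_number"], per_level_sums(order), flattened_sums(order))
# tol-1 (3, 3) (6, 3)
# tol-2 (5, 3) (5, 3)
```

The shipped quantity of stol-1.1 is counted twice in the flattened version because it has two received lines, which reproduces the wrong shipped_q of 6 for tol-1.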

Using IN with convert in sql

I would like to use the IN clause, but with the convert function.
Basically, I have a table (A) with the column of type int.
But in the other table (B) I Have values which are of type varchar.
Essentially, what I am looking for something like this
select *
from B
where myB_Column IN (select myA_Columng from A)
However, I am not sure if the int from table A, would map / convert / evaluate properly for the varchar in B.
I am using SQL Server 2008.
You can use a CASE expression in the WHERE clause like this and CAST only if the value is numeric, falling back to 0 or NULL otherwise, depending on your requirements.
SELECT *
FROM B
WHERE CASE ISNUMERIC(myB_Column) WHEN 1 THEN CAST(myB_Column AS INT) ELSE 0 END
IN (SELECT myA_Columng FROM A)
ISNUMERIC also returns 1 (true) for decimal values, so ideally you should implement your own IsInteger UDF. To do that, look at this question:
T-sql - determine if value is integer
Option #1
Select * from B where myB_Column IN
(
Select Cast(myA_Columng As Int) from A Where ISNUMERIC(myA_Columng) = 1
)
Option #2
Select B.* from B
Inner Join
(
Select Cast(myA_Columng As Int) As myA_Columng from A
Where ISNUMERIC(myA_Columng) = 1
) T
On T.myA_Columng = B.myB_Column
Option #3
Select B.* from B
Left Join
(
Select Cast(myA_Columng As Int) As myA_Columng from A
Where ISNUMERIC(myA_Columng) = 1
) T
On T.myA_Columng = B.myB_Column
I would opt for the third one, for the reason given below.
Disadvantages of IN Predicate
Suppose I have two list objects.
List 1 List 2
1 12
2 7
3 8
4 98
5 9
6 10
7 6
Using Contains, each List 1 item is searched for in List 2, which means up to 49 iterations (7 x 7)!
You can also use an EXISTS clause:
select *
from B
where EXISTS (select 1 from A WHERE CAST(myA_Columng AS VARCHAR) = myB_Column)
You can use below query :
select B.*
from B
inner join (Select distinct MyA_Columng from A) AS X ON B.MyB_Column = CAST(x.MyA_Columng as NVARCHAR(50))
Try it by using CAST()
SELECT *
FROM B
WHERE CAST(myB_Column AS INT) IN (
SELECT myA_Columng
FROM A
)
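The answers above differ in which side gets converted: some cast the varchar column of B to int, others cast the int column of A to varchar. The two directions do not always match the same rows. The Python sketch below illustrates the difference outside the database; the sample values and the helper names are made up for the example, and Python's `int()` is only a rough stand-in for a guarded CAST, not a model of SQL Server's conversion rules.

```python
def to_int_or_none(text):
    """Return int(text), or None when text is not an integer literal."""
    try:
        return int(text)
    except ValueError:
        return None


def match_by_int(b_values, a_values):
    """Convert each B value to int (skipping non-integers), then test membership."""
    a_set = set(a_values)  # set lookup avoids rescanning A for every B value
    return [b for b in b_values if to_int_or_none(b) in a_set]


def match_by_text(b_values, a_values):
    """Convert each A value to text, then compare strings exactly."""
    a_text = {str(a) for a in a_values}
    return [b for b in b_values if b in a_text]


a_values = [1, 2, 7]                      # int column of A (assumed sample)
b_values = ["1", "07", "abc", "2.5", "7"]  # varchar column of B (assumed sample)

print(match_by_int(b_values, a_values))   # ['1', '07', '7']
print(match_by_text(b_values, a_values))  # ['1', '7']
```

Converting B to int treats "07" and "7" as the same value, while comparing as text does not, so the choice depends on how the varchar data was written.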

In R, How Do I Create a data.frame with Unique Values from One Column of another data.frame?

I'm trying to learn R, but I'm stuck on something that seems simple. I know SQL, and the easiest way for me to communicate my question is with that language. Can someone help me with a translation from SQL to R?
I've figured out that this:
SELECT col1, sum(col2) FROM table1 GROUP BY col1
translates into this:
aggregate(x=table1$col2, by=list(table1$col1), FUN=sum)
And I've figured out that this:
SELECT col1, col2 FROM table1 GROUP BY col1, col2
translates into this:
unique(table1[,c("col1","col2")])
But what is the translation for this?
SELECT col1 FROM table1 GROUP BY col1
For some reason, the "unique" function seems to switch to a different return type when working on only one column, so it doesn't work as I would expect.
-TC
I'm guessing that you are referring to the fact that calling unique on a vector will return a vector, rather than a data frame. Here are a couple of examples that may help:
#Some example data
dat <- data.frame(x = rep(letters[1:2],times = 5),
y = rep(letters[3:4],each = 5))
> dat
x y
1 a c
2 b c
3 a c
4 b c
5 a c
6 b d
7 a d
8 b d
9 a d
10 b d
> unique(dat)
x y
1 a c
2 b c
6 b d
7 a d
#Unique => vector
> unique(dat$x)
[1] "a" "b"
#Same thing
> unique(dat[,'x'])
[1] "a" "b"
#drop = FALSE preserves the data frame structure
> unique(dat[,'x',drop = FALSE])
x
1 a
2 b
#Or you can just convert it back (although the default column name is ugly)
> data.frame(unique(dat$x))
unique.dat.x.
1 a
2 b
If you know SQL then try packages sqldf and data.table.

Max sum for the continous N rows

I've the following table (both A and B are integers):
Update 1 - Could anyone do me a favour and run the solution on a set of 1M records with B being a random decimal (to avoid overflows) residing in [0 to 1] range for N=> 10, 100 and 1000? I'd like to get a flavor of the time, required to run the solution query. Thanks a lot in advance.
Sample data:
A B
1 1
2 8
3 1
4 11
5 1
6 1
7 6
8 1
9 1
10 2
How do I get the maximum sum of B values for any N sequential A's? The solution mustn't use cursors, and any use of table variables or temp tables has to be strongly justified.
I can use SQLCLR in case if it'll give a distinct performance boost.
Some clarifications:
Max Sum for 1 element is 11 (see A = 4)
Max Sum for 2 elements is 12 (it's either A=> 3 & 4 or A=> 4 & 5),
Max Sum for 3 elements is 20 (A=>2, 3, 4),
Max Sum for 4 is 21 (A=>1,2,3,4 or A=>2,3,4,5) etc.
Since the A values are guaranteed to be consecutive integers, given N we know for any particular A which values we are interested in. So
SELECT
A,
(SELECT SUM(B) FROM Table T2 WHERE T.A <= T2.A AND T2.A <= T.A + N - 1)
AS SumOfBs
FROM Table T
WHERE A + N - 1 <= (SELECT COUNT(*) FROM Table)
gives, for each A, the sum of the B values for the N rows starting there. The WHERE restricts us to rows that do actually have N rows starting there. Put this in a subquery and we can get the maximum:
SELECT
MAX(SumOfBs) AS DesiredValue
FROM
(
SELECT
A,
(SELECT SUM(B) FROM Table T2 WHERE T.A <= T2.A AND T2.A <= T.A + N - 1)
AS SumOfBs
FROM Table T
WHERE A + N - 1 <= (SELECT COUNT(*) FROM Table)
) Intermediate
should do the job.
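The expected answers for the sample data can be cross-checked outside the database. The following Python sketch computes the same maximum with a sliding window in a single pass; it is not a T-SQL solution, and the function name `max_window_sum` is a hypothetical helper. It assumes, as the question does, that the rows are already ordered by consecutive A values.

```python
def max_window_sum(values, n):
    """Return the largest sum of n consecutive items in values."""
    if n <= 0 or n > len(values):
        raise ValueError("n must be between 1 and len(values)")
    window = sum(values[:n])  # sum of the first full window
    best = window
    for i in range(n, len(values)):
        # Slide by one: add the entering value, drop the leaving one.
        window += values[i] - values[i - n]
        best = max(best, window)
    return best


# Column B of the sample data, in order of A = 1..10.
b_values = [1, 8, 1, 11, 1, 1, 6, 1, 1, 2]

for n in (1, 2, 3, 4):
    print(n, max_window_sum(b_values, n))
# 1 11
# 2 12
# 3 20
# 4 21
```

Each row is added once and subtracted once, so the work grows linearly with the number of rows regardless of N, whereas the correlated subquery re-sums N rows for every starting row.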
I've loaded your test data into a table called data.
The following SQL gives me the answer 20 for N=3:
declare @N int
set @N = 3
select max(SumB)
from data d
cross apply (select SumB = SUM(B) from data sub where sub.A between d.A - (@N-1) and d.A) x
Try:
with cte as
(select 1 window_count union all
select window_count+1 window_count from cte where window_count<@N)
select max(sum_B) from
(select T1.A,
sum(T2.B) sum_B
from MyTable T1
cross join cte
join MyTable T2 on T1.A = T2.A + cte.window_count - 1
group by T1.A) sq
I'm possibly not understanding the question fully, but it looks to me like...
SELECT SUM(B) FROM table WHERE A <= n
If not correct, can you explain a bit more?