In R, How Do I Create a data.frame with Unique Values from One Column of another data.frame?

I'm trying to learn R, but I'm stuck on something that seems simple. I know SQL, and the easiest way for me to communicate my question is with that language. Can someone help me with a translation from SQL to R?
I've figured out that this:
SELECT col1, sum(col2) FROM table1 GROUP BY col1
translates into this:
aggregate(x=table1$col2, by=list(table1$col1), FUN=sum)
And I've figured out that this:
SELECT col1, col2 FROM table1 GROUP BY col1, col2
translates into this:
unique(table1[,c("col1","col2")])
But what is the translation for this?
SELECT col1 FROM table1 GROUP BY col1
For some reason, the "unique" function seems to switch to a different return type when working on only one column, so it doesn't work as I would expect.
-TC

I'm guessing that you are referring to the fact that calling unique on a vector will return a vector, rather than a data frame. Here are a couple of examples that may help:
#Some example data
dat <- data.frame(x = rep(letters[1:2],times = 5),
y = rep(letters[3:4],each = 5))
> dat
x y
1 a c
2 b c
3 a c
4 b c
5 a c
6 b d
7 a d
8 b d
9 a d
10 b d
> unique(dat)
x y
1 a c
2 b c
6 b d
7 a d
#Unique => vector
> unique(dat$x)
[1] "a" "b"
#Same thing
> unique(dat[,'x'])
[1] "a" "b"
#drop = FALSE preserves the data frame structure
> unique(dat[,'x',drop = FALSE])
x
1 a
2 b
#Or you can just convert it back (although the default column name is ugly)
> data.frame(unique(dat$x))
unique.dat.x.
1 a
2 b

If you know SQL, try the sqldf and data.table packages.
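Because sqldf runs SQL against data frames (through SQLite by default), the queries from the question can be tried as written. The sketch below uses Python's sqlite3 module only as a convenient way to run them; the rows in table1 are made-up sample data, not taken from the question.

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)],
)

# One-column GROUP BY: the result is still a table, just one column wide.
rows = conn.execute(
    "SELECT col1 FROM table1 GROUP BY col1 ORDER BY col1"
).fetchall()
print(rows)  # [('a',), ('b',)]

# The aggregate query from the question, for comparison.
sums = conn.execute(
    "SELECT col1, SUM(col2) FROM table1 GROUP BY col1 ORDER BY col1"
).fetchall()
print(sums)  # [('a', 9), ('b', 6)]
```

Each result row is a one-element tuple, which is the SQL counterpart of the one-column data frame that unique(dat[,'x',drop = FALSE]) returns in R.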

Related

How can I construct the following query in SQL Server?

CREATE TABLE (
A INT NOT NULL,
B INT NOT NULL
)
A is an enumerated value from 1, 2, 3, 4, 5.
B can be any value.
I would like to count the number of B groups that contain a specific subset of A, e.g. {1, 2}.
Example:
A B
1 7 *
2 7 *
3 7
1 8 *
2 8 *
1 9
3 9
When B = 7, A = 1, 2, 3. Good.
When B = 8, A = 1, 2. Good.
When B = 9, A = 1, 3. Not satisfied: 2 is missing.
So the count will be 2 (B = 7 and B = 8).
If I've understood you correctly, we want to find B values for which we have both a 1 and a 2 in A, and then we want to know how many of those we have.
This query does this:
declare @t table (A int not null, B int not null)
insert into @t(A,B) values
(1,7),
(2,7),
(3,7),
(1,8),
(2,8),
(1,9),
(3,9)
select COUNT(DISTINCT B) from (
select B
from @t
where A in (1,2)
group by B
having COUNT(DISTINCT A) = 2
) t
One or both of the DISTINCTs may be unnecessary - it depends on whether your data can contain repeating values.
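As a rough way to check the logic without a SQL Server instance, the same filter-then-HAVING pattern can be run in SQLite. This is only a sketch: the table name t and the variable required are illustrative, and SQLite syntax differs from T-SQL in places (no table variables, for instance).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A INTEGER NOT NULL, B INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 7), (2, 7), (3, 7), (1, 8), (2, 8), (1, 9), (3, 9)],
)

# The subset of A values that every counted B group must contain.
required = (1, 2)
placeholders = ", ".join("?" for _ in required)

query = f"""
    SELECT COUNT(*) FROM (
        SELECT B
        FROM t
        WHERE A IN ({placeholders})      -- keep only the A values of interest
        GROUP BY B
        HAVING COUNT(DISTINCT A) = ?     -- all required values are present
    ) AS matched
"""
count = conn.execute(query, (*required, len(required))).fetchone()[0]
print(count)  # 2
```

Passing the size of the required set as a parameter keeps the HAVING test in step with the IN list if the subset changes.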
If I understand correctly and the requirement is to find Bs with a series of As that doesn't have any "gaps", you could compare the difference between the minimal and maximal A with number of records (per B, of course):
SELECT b
FROM mytable
GROUP BY b
HAVING COUNT(*) - 1 = MAX(a) - MIN(a)
SELECT COUNT(DISTINCT B) FROM TEMP T WHERE T.B NOT IN
(SELECT B FROM
(SELECT B,A,
LAG (A,1) OVER (PARTITION BY B ORDER BY A) AS PRE_A
FROM Temp) K
WHERE K.PRE_A IS NOT NULL AND K.A<>K.PRE_A+1);
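The MIN/MAX variant of the gap test rests on simple arithmetic: a run of n consecutive integers spans exactly n - 1, so a group has no gaps when COUNT(*) - 1 equals MAX(a) - MIN(a), assuming no duplicate (a, b) rows. A small sketch in SQLite, using the sample data from the question and the table name mytable from that answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (a INTEGER NOT NULL, b INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 7), (2, 7), (3, 7), (1, 8), (2, 8), (1, 9), (3, 9)],
)

# n gap-free values span n - 1, e.g. {1, 2, 3}: 3 - 1 == 3 - 1.
gap_free = [
    row[0]
    for row in conn.execute(
        """
        SELECT b
        FROM mytable
        GROUP BY b
        HAVING COUNT(*) - 1 = MAX(a) - MIN(a)
        ORDER BY b
        """
    )
]
print(gap_free)  # [7, 8]
```

Note that this answers a slightly different question from the first query: it accepts any gap-free run of A values, not only groups that contain both 1 and 2.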

Counting characters in an Access database column using SQL

I have the following table
col1 col2 col3 col4
==== ==== ==== ====
1233 4566 ABCD CDEF
1233 4566 ACD1 CDEF
1233 4566 D1AF CDEF
I need to count the characters in col3, so from the data in the previous table it would be:
char count
==== =====
A 3
B 1
C 2
D 3
F 1
1 2
Is this possible to achieve by using SQL only?
At the moment I am thinking of passing a parameter into the SQL query, counting the characters one position at a time, and then summing the results. However, I have not started the VBA part yet, and frankly would rather not do that.
This is my query at the moment:
PARAMETERS X Long;
SELECT First(Mid(TABLE.col3,X,1)) AS [col3 Field], Count(Mid(TABLE.col3,X,1)) AS Dcount
FROM TEST
GROUP BY Mid(TABLE.col3,X,1)
HAVING (((Count(Mid([TABLE].[col3],[X],1)))>=1));
Ideas and help are much appreciated, as I don't usually work with Access and SQL.
You can accomplish your task in pure Access SQL by using a Numbers table. In this case, the Numbers table must contain integer values from 1 to some number larger than the longest string of characters in your source data. In this example, the strings of characters to be processed are in [CharacterData]:
CharacterList
-------------
GORD
WAS
HERE
and the [Numbers] table is simply
n
--
1
2
3
4
5
If we use a cross join to extract the characters (eliminating any empty strings that result from n exceeding Len(CharacterList))...
SELECT
Mid(cd.CharacterList, nb.n, 1) AS c
FROM
CharacterData cd,
Numbers nb
WHERE Mid(cd.CharacterList, nb.n, 1) <> ""
...we get ...
c
--
G
W
H
O
A
E
R
S
R
D
E
Now we can just wrap that in an aggregation query
SELECT c AS Character, COUNT(*) AS CountOfCharacter
FROM
(
SELECT
Mid(cd.CharacterList, nb.n, 1) AS c
FROM
CharacterData cd,
Numbers nb
WHERE Mid(cd.CharacterList, nb.n, 1) <> ""
)
GROUP BY c
which gives us
Character CountOfCharacter
--------- ----------------
A 1
D 1
E 2
G 1
H 1
O 1
R 2
S 1
W 1
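The Numbers-table technique is not specific to Access. As a hedged illustration, here is the same cross join in SQLite, where substr() plays the role of Mid(); the table and column names are the ones used in the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CharacterData (CharacterList TEXT)")
conn.executemany(
    "INSERT INTO CharacterData VALUES (?)",
    [("GORD",), ("WAS",), ("HERE",)],
)

# Numbers table: 1 .. 5, at least as large as the longest string.
conn.execute("CREATE TABLE Numbers (n INTEGER)")
conn.executemany("INSERT INTO Numbers VALUES (?)", [(n,) for n in range(1, 6)])

counts = conn.execute(
    """
    SELECT c, COUNT(*) AS CountOfCharacter
    FROM (
        -- Cross join: one row per (string, position) pair.
        SELECT substr(cd.CharacterList, nb.n, 1) AS c
        FROM CharacterData AS cd, Numbers AS nb
        -- Positions past the end of a string yield '' and are dropped.
        WHERE substr(cd.CharacterList, nb.n, 1) <> ''
    )
    GROUP BY c
    ORDER BY c
    """
).fetchall()
print(counts)
# [('A', 1), ('D', 1), ('E', 2), ('G', 1), ('H', 1), ('O', 1), ('R', 2), ('S', 1), ('W', 1)]
```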
Knowing that col3 has a fixed length of 4, this problem is quite easy.
Assume there is a view V with four columns, each holding one character of col3.
V(c1, c2, c3, c4)
Unfortunately, I'm not familiar with Access-specific SQL, but this is the general SQL statement you would need:
SELECT c, COUNT(*) FROM
(
SELECT c1 AS c FROM V
UNION ALL
SELECT c2 FROM V
UNION ALL
SELECT c3 FROM V
UNION ALL
SELECT c4 FROM V
)
GROUP BY c
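To make the view V concrete, here is one possible sketch in SQLite, fed with the three col3 values from the question. The view definition is an assumption about how V might be built; in Access, Mid() would take the place of substr().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (col3 TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?)", [("ABCD",), ("ACD1",), ("D1AF",)]
)

# Hypothetical definition of V: one column per character position.
conn.execute(
    """
    CREATE VIEW V AS
    SELECT substr(col3, 1, 1) AS c1,
           substr(col3, 2, 1) AS c2,
           substr(col3, 3, 1) AS c3,
           substr(col3, 4, 1) AS c4
    FROM Table1
    """
)

# Stack the four columns into one, then count each character.
counts = conn.execute(
    """
    SELECT c, COUNT(*) FROM (
        SELECT c1 AS c FROM V
        UNION ALL SELECT c2 FROM V
        UNION ALL SELECT c3 FROM V
        UNION ALL SELECT c4 FROM V
    )
    GROUP BY c
    ORDER BY c
    """
).fetchall()
print(counts)
# [('1', 2), ('A', 3), ('B', 1), ('C', 2), ('D', 3), ('F', 1)]
```

UNION ALL matters here: a plain UNION would remove duplicate characters before they are counted.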
It's a shame that you don't want to consider using VBA; you don't need as much as you might think:
Public charCounts As Dictionary 'requires a reference to Microsoft Scripting Runtime
'Declared as a Function (not a Sub) so that a query can call it
Public Function LoadCounts(s As String) As Long
If charCounts Is Nothing Then Init
Dim length As Integer, i As Variant
length = Len(s)
For i = 1 To length
Dim currentChar As String
currentChar = Mid(s, i, 1)
If Not charCounts.Exists(currentChar) Then charCounts(currentChar) = 0
charCounts(currentChar) = charCounts(currentChar) + 1
Next
LoadCounts = length 'the return value only exists so the query has something to select
End Function
Sub Init()
Set charCounts = New Scripting.Dictionary
charCounts.CompareMode = TextCompare 'for case-insensitive comparisons; otherwise use BinaryCompare
End Sub
Then, you execute the query once:
SELECT LoadCounts(col3)
FROM Table1
Finally, you read out the values in the Dictionary:
Dim key As Variant
For Each key In charCounts
Debug.Print key, charCounts(key)
Next
Note that between query executions you have to call Init to clear out the old values.
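Please try this; I hope it works. Note that it is T-SQL (SQL Server) syntax, and that the CTE generates its numbers from the table's own rows, so the table needs at least as many rows as the longest string has characters.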
with cte as
(
select row_number() over(order by (select null)) as i from Charactor_Count
)
select substring( name, i, 1 ) as char, count(*) as count
from Charactor_Count, cte
where cte.i <= len(Charactor_Count.name)
group by substring(name,i,1)
order by substring(name,i,1)

Can I add multiple columns to Totals

Using MS SQL 2012
I want to do something like
select a, b, c, a+b+c d
However, a, b, and c are complex computed columns. Let's take a simple example:
select case when x > 4 then 4 else x end a,
( select count(*) somethingElse) b,
a + b c
order by c
I hope that makes sense
You can use a nested query or a common table expression (CTE) for that. The CTE syntax is slightly cleaner - here it is:
WITH CTE (a, b)
AS
(
select
case when x > 4 then 4 else x end a,
(select count(*) somethingElse) b
from my_table
)
SELECT
a, b, (a+b) as c
FROM CTE
ORDER BY c
I would probably do this:
SELECT
sub.a,
sub.b,
(sub.a + sub.b) as c
FROM
(
select
case when x > 4 then 4 else x end a,
(select count(*) somethingElse) b
FROM MyTable
) sub
ORDER BY c
The easiest way is to do this:
select a,b,c,a+b+c d
from (select <whatever your calcs are for a,b,c>) x
order by c
That just creates a derived table consisting of your calculations for a, b, and c, and allows you to easily reference and sum them up!
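The derived-table idea can be tried out in SQLite as well. In this sketch the table my_table, its x values, and the count subquery are illustrative stand-ins for the "complex computed columns" in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (x INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?)", [(2,), (7,), (3,)])

rows = conn.execute(
    """
    SELECT a, b, a + b AS c          -- a and b can be referenced by name here
    FROM (
        SELECT CASE WHEN x > 4 THEN 4 ELSE x END AS a,   -- x capped at 4
               (SELECT COUNT(*) FROM my_table) AS b      -- stand-in computation
        FROM my_table
    ) AS sub
    ORDER BY c
    """
).fetchall()
print(rows)  # [(2, 3, 5), (3, 3, 6), (4, 3, 7)]
```

The inner query computes each expression once; the outer query can then reuse the aliases in further expressions, which a single SELECT list does not allow.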

Using IN with convert in sql

I would like to use the IN clause, but with the convert function.
Basically, I have a table (A) with a column of type int.
But in the other table (B) I have values of type varchar.
Essentially, what I am looking for is something like this:
select *
from B
where myB_Column IN (select myA_Columng from A)
However, I am not sure if the int from table A, would map / convert / evaluate properly for the varchar in B.
I am using SQL Server 2008.
You can use a CASE expression in the WHERE clause like this, and CAST only if the value is numeric;
otherwise return 0 or NULL, depending on your requirements.
SELECT *
FROM B
WHERE CASE ISNUMERIC(myB_Column) WHEN 1 THEN CAST(myB_Column AS INT) ELSE 0 END
IN (SELECT myA_Columng FROM A)
ISNUMERIC also returns 1 (true) for decimal values, so ideally you should implement your own IsInteger UDF. To do that, look at this question:
T-sql - determine if value is integer
Option #1
Select * from B where myB_Column IN
(
Select Cast(myA_Columng As Int) from A Where ISNUMERIC(myA_Columng) = 1
)
Option #2
Select B.* from B
Inner Join
(
Select Cast(myA_Columng As Int) As myA_Columng from A
Where ISNUMERIC(myA_Columng) = 1
) T
On T.myA_Columng = B.myB_Column
Option #3
Select B.* from B
Left Join
(
Select Cast(myA_Columng As Int) As myA_Columng from A
Where ISNUMERIC(myA_Columng) = 1
) T
On T.myA_Columng = B.myB_Column
I would opt for the third one, for the reason given below.
Disadvantages of IN Predicate
Suppose I have two list objects.
List 1 List 2
1 12
2 7
3 8
4 98
5 9
6 10
7 6
Using Contains, each List 1 item is searched for in List 2, which means up to 49 comparisons (7 × 7).
You can also use an EXISTS clause:
select *
from B
where EXISTS (select 1 from A WHERE CAST(myA_Columng AS VARCHAR) = myB_Column)
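Casting the int side to text, as the EXISTS version does, sidesteps conversion errors on non-numeric strings in B. A minimal sketch of that behavior in SQLite follows; the sample values are invented, and SQLite's CAST(... AS TEXT) stands in for SQL Server's CAST(... AS VARCHAR).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (myA_Columng INTEGER)")
conn.execute("CREATE TABLE B (myB_Column TEXT)")
conn.executemany("INSERT INTO A VALUES (?)", [(1,), (2,), (10,)])
conn.executemany("INSERT INTO B VALUES (?)", [("1",), ("10",), ("abc",), ("7",)])

matches = [
    row[0]
    for row in conn.execute(
        """
        SELECT myB_Column
        FROM B
        WHERE EXISTS (
            SELECT 1 FROM A
            -- Compare as text, so 'abc' is simply a non-match, not an error.
            WHERE CAST(A.myA_Columng AS TEXT) = B.myB_Column
        )
        ORDER BY myB_Column
        """
    )
]
print(matches)  # ['1', '10']
```

The trade-off is that the comparison is textual: a value such as '01' or ' 1' in B would not match the integer 1, whereas casting B's column to int would treat them as equal.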
You can use the query below:
select B.*
from B
inner join (Select distinct MyA_Columng from A) AS X ON B.MyB_Column = CAST(x.MyA_Columng as NVARCHAR(50))
Try using CAST():
SELECT *
FROM B
WHERE CAST(myB_Column AS INT) IN (
SELECT myA_Columng
FROM A
)

How to do a basic left outer join with data.table in R?

I have a data.table of a and b that I've partitioned into below with b < .5 and above with b > .5:
DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
above = DT[DT$b > .5]
below = DT[DT$b < .5, list(a=a)]
I'd like to do a left outer join between above and below: for each a in above, count the number of rows in below. This is equivalent to the following in SQL:
with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 union select 2, 1 union select 3, 1 union select 3, 1),
above as (select a, b from dt where b > .5),
below as (select a, b from dt where b < .5)
select above.a, count(below.a) from above left outer join below on (above.a = below.a) group by above.a;
a | count
---+-------
3 | 0
2 | 1
(2 rows)
How do I accomplish the same thing with data.tables? This is what I tried so far:
> key(below) = 'a'
> below[above, list(count=length(b))]
a count
[1,] 2 1
[2,] 3 1
[3,] 3 1
> below[above, list(count=length(b)), by=a]
Error in eval(expr, envir, enclos) : object 'b' not found
> below[, list(count=length(a)), by=a][above]
a count b
[1,] 2 1 1
[2,] 3 NA 1
[3,] 3 NA 1
I should also be more specific in that I already tried merge but that blows through the memory on my system (and the dataset takes only about 20% of my memory).
See if this is giving you something useful. Your example is too sparse to let me know what you want, but it appears it might be a tabulation of values of above$a that are also in below$a
table(above$a[above$a %in% below$a])
If you also want the converse ... values not in below, then this would do it:
table(above$a[!above$a %in% below$a])
And you can concatenate them:
> c(table(above$a[above$a %in% below$a]),table(above$a[!above$a %in% below$a]) )
2 3
1 2
Generally table and %in% run in reasonably small footprints and are quick.
Since you appear to be using package data.table: check ?merge.data.table.
I haven't used it, but it appears this might do what you want:
merge(above, below, by="a", all.x=TRUE, all.y=FALSE)
I think this is easier:
setkey(above,a)
setkey(below,a)
Left outer join:
above[below, .N]
regular join:
above[below, .N, nomatch=0]
full outer join with counts:
merge(above,below, all=T)[,.N, by=a]
I eventually found a way to do this with data.table, which I felt is more natural for me to understand than DWin's table, though YMMV:
result = below[, list(count=length(a)), by=a]  # below as defined in the question only has column a
setkey(result, a)  # replaces the older key(result) = 'a' syntax
result = result[J(unique(above$a))]
result$count[is.na(result$count)] = 0
I don't know if this could be more compact, though. I especially wanted to be able to do something like result = below[J(unique(above$a)), list(count=length(b))], but that doesn't work.
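For readers who want to check the expected result independently of data.table, the "left outer join plus count" logic can be written out by hand. This is a plain Python sketch using the six rows of DT from the question, not a data.table equivalent.

```python
from collections import Counter

# (a, b) rows of DT from the question.
rows = [(1, 0), (1, 0), (2, 0), (2, 1), (3, 1), (3, 1)]

above = [a for a, b in rows if b > 0.5]  # [2, 3, 3]
below = [a for a, b in rows if b < 0.5]  # [1, 1, 2]

# Count rows per key in below once, up front.
below_counts = Counter(below)

# Left outer join semantics: every distinct key in above appears in the
# result, and keys with no match in below get a count of 0.
result = {a: below_counts.get(a, 0) for a in sorted(set(above))}
print(result)  # {2: 1, 3: 0}
```

This matches the SQL output in the question (a = 2 has one row in below, a = 3 has none), and it is the result the final data.table code above aims for after replacing NA with 0.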