I found this online for practice, but don't know where to begin.
Given a table with three columns (id, category, value), where each id has at most three category/value pairs (price, size, color), how can I find the pairs of ids for which two or more of the values match?
For example:
ID1 (price 10, size M, color Red),
ID2 (price 10, size L, color Red),
ID3 (price 15, size L, color Red)
Then the output should be two rows:
ID1 ID2
and ID2 ID3
Using the data frame DF shown reproducibly in the Note at the end, self-join DF and count how many of the three attributes agree; each equality comparison evaluates to 1 or 0 in SQLite (the backend sqldf uses), so a sum greater than 1 means at least two values match:
library(sqldf)
sqldf("select a.ID first, b.ID second
from DF a
join DF b on (a.price = b.price) +
(a.size = b.size) +
(a.color = b.color) > 1 and
a.ID < b.ID")
giving:
first second
1 1 2
2 2 3
Note
DF <- data.frame(ID = 1:3,
                 price = c(10, 10, 15),
                 size = c("M", "L", "L"),
                 color = "Red",
                 stringsAsFactors = FALSE)
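For readers who want to try the same logic outside R, here is a minimal standalone SQLite sketch of the equivalent table and self-join (the table name DF mirrors the data frame above; the column types are assumptions):

-- hypothetical standalone version of the sqldf call above
CREATE TABLE DF (ID INTEGER, price INTEGER, size TEXT, color TEXT);
INSERT INTO DF VALUES (1, 10, 'M', 'Red'), (2, 10, 'L', 'Red'), (3, 15, 'L', 'Red');

SELECT a.ID AS first, b.ID AS second
FROM DF a
JOIN DF b
  ON (a.price = b.price) + (a.size = b.size) + (a.color = b.color) > 1
 AND a.ID < b.ID;
-- returns (1, 2) and (2, 3)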
Related
I have a PostgreSQL database table that contains two columns a and b.
When I query all the entries of the table I get:
{1, 2},
{2, 3},
{2, 3}
So the value:
(1) appeared in field a 1 time and in field b 0 times
(2) appeared in field a 2 times and in field b 1 time
(3) appeared in field a 0 times and in field b 2 times
I want to get the following output:
{1, 1},
{2, 1},
{3, -2}
where the first field is the value stored in the database and the second field is the difference (occurrences in a minus occurrences in b).
How can I achieve that?
I first query the database (the result is in query_result), then I get the frequencies of the first and second elements:
f0 = query_result |> Enum.frequencies_by(&elem(&1, 0))
f1 = query_result |> Enum.frequencies_by(&elem(&1, 1))
((f0 |> Map.keys) ++ (f1 |> Map.keys))
|> Enum.uniq()
|> Enum.into(%{}, fn key -> {key, (f0[key] || 0) - (f1[key] || 0)} end)
I am looking for a simpler way to do this.
Use a single query:
SELECT val, COALESCE(a.ct, 0) - COALESCE(b.ct, 0) AS freq_diff
FROM (
    SELECT a AS val, count(*) AS ct
    FROM   tbl
    GROUP  BY 1
) a
FULL JOIN (
    SELECT b AS val, count(*) AS ct
    FROM   tbl
    GROUP  BY 1
) b USING (val);
FULL [OUTER] JOIN, because either value may be missing in the other column.
COALESCE to defend against NULL values resulting from the join.
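To check it against the sample data from the question (assuming the table is named tbl, as in the query, with columns a and b):

CREATE TABLE tbl (a int NOT NULL, b int NOT NULL);
INSERT INTO tbl (a, b) VALUES (1, 2), (2, 3), (2, 3);

-- the query above returns:
--  val | freq_diff
-- -----+-----------
--    1 |         1
--    2 |         1
--    3 |        -2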
CREATE TABLE (
A INT NOT NULL,
B INT NOT NULL
)
A holds enumerated values 1, 2, 3, 4, 5
B can be any value
I would like to count() the number of B groups whose A values contain a specific subset of A, e.g. {1, 2}
Example:
A B
1 7 *
2 7 *
3 7
1 8 *
2 8 *
1 9
3 9
When B = 7, A = 1, 2, 3. Good
When B = 8, A = 1, 2. Good
When B = 9, A = 1, 3. Not satisfied; 2 is missing
So the count will be 2 (when B = 7 and 8)
If I've understood you correctly, we want to find B values for which we have both a 1 and a 2 in A, and then we want to know how many of those we have.
This query does this:
declare #t table (A int not null, B int not null)
insert into #t(A,B) values
(1,7),
(2,7),
(3,7),
(1,8),
(2,8),
(1,9),
(3,9)
select COUNT(DISTINCT B) from (
select B
from #t
where A in (1,2)
group by B
having COUNT(DISTINCT A) = 2
) t
One or both of the DISTINCTs may be unnecessary - it depends on whether your data can contain repeating values.
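For what it's worth, the outer DISTINCT can always be dropped here: the inner GROUP BY B already returns each B at most once, so only the inner DISTINCT depends on whether (A, B) rows can repeat:

select COUNT(*) from (
    select B
    from #t
    where A in (1,2)
    group by B
    having COUNT(DISTINCT A) = 2
) t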
If I understand correctly and the requirement is to find Bs with a series of As that doesn't have any "gaps", you could compare the number of records with the difference between the maximal and minimal A (per B, of course):

SELECT b
FROM mytable
GROUP BY b
HAVING COUNT(*) = MAX(a) - MIN(a) + 1

This assumes each (a, b) pair appears at most once; with duplicates, use COUNT(DISTINCT a) instead.
Another way to detect gaps is with LAG(): flag the Bs where, ordering by A, some value is not exactly one more than the previous one, then count the remaining Bs:

SELECT COUNT(DISTINCT B)
FROM Temp T
WHERE T.B NOT IN (
    SELECT B
    FROM (
        SELECT B, A,
               LAG(A, 1) OVER (PARTITION BY B ORDER BY A) AS PRE_A
        FROM Temp
    ) K
    WHERE K.PRE_A IS NOT NULL
      AND K.A <> K.PRE_A + 1
);
I have a table with the following columns...
TestName - StepNumber - Data_1
I'm trying to write a query that can look for Data_1 results and average them for one day. The TestNames are unique tests we're running, and StepNumbers are the individual steps inside of the test. Normally, I would use something like
select Data_1 from table
where TestName in(1,2,3,4)
and StepNumber in(1)
to return all of the Data_1 results I need. However, sometimes the data I need is located in different steps across the table. Test 1 might have the required data in Step 2, Test 2 in Step 10, etc., and in the end I need an average of the Data_1 results for all of those StepNumber results. I'm not sure how I can capture this data in a single query. There's a separate part of the query where I'm breaking it down by geography, and doing it individually would take a long time.
I'd be looking for something like...
select avg(Data_1) from table
where TestName = 1 and StepNumber = 2
and TestName = 2 and StepNumber = 10
and TestName = 3 and StepNumber = 5
If I can clarify, please let me know. Thank you!
Combine each (TestName, StepNumber) pair with AND and join the pairs with OR:

select avg(Data_1)
from table
where (TestName = 1 and StepNumber = 2)
   or (TestName = 2 and StepNumber = 10)
   or (TestName = 3 and StepNumber = 5)
If I have understood correctly, you have three (or more) sets of data in your table:

TestName = 1 and StepNumber = 2 - cardinality N1
TestName = 2 and StepNumber = 10 - cardinality N2
TestName = 3 and StepNumber = 5 - cardinality N3

If you want the average for each of the three separately, you must select three columns. A plain AVG cannot help you here, as it would run the average over a cardinality of N4 = N1 + N2 + N3, i.e. the union of the three groups. So you have to do this by hand. I do not know how this would perform vs. three separate queries:
SELECT
    AVG(Data_1) AS OverallAverage,
    SUM(IF(TestName = 1 AND StepNumber = 2,  Data_1, 0))
        / SUM(IF(TestName = 1 AND StepNumber = 2,  1, 0)) AS AvgGroup1,
    SUM(IF(TestName = 2 AND StepNumber = 10, Data_1, 0))
        / SUM(IF(TestName = 2 AND StepNumber = 10, 1, 0)) AS AvgGroup2,
    SUM(IF(TestName = 3 AND StepNumber = 5,  Data_1, 0))
        / SUM(IF(TestName = 3 AND StepNumber = 5,  1, 0)) AS AvgGroup3
FROM table
WHERE (TestName = 1 AND StepNumber = 2)
   OR (TestName = 2 AND StepNumber = 10)
   OR (TestName = 3 AND StepNumber = 5);
This kind of query can be assembled from components, i.e. programmatically depending on the groups.
This is a SQLFiddle to show the results.
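Another way to keep the pair list in one place (a sketch using the same placeholder table name as above) is to join against a derived table of the wanted (TestName, StepNumber) pairs, so only that list has to be built programmatically:

-- 'table' is the placeholder name from the question; substitute the real table name
SELECT AVG(t.Data_1) AS OverallAverage
FROM table t
JOIN (
    SELECT 1 AS TestName, 2 AS StepNumber
    UNION ALL SELECT 2, 10
    UNION ALL SELECT 3, 5
) wanted ON wanted.TestName = t.TestName
        AND wanted.StepNumber = t.StepNumber;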
I'm trying to learn R, but I'm stuck on something that seems simple. I know SQL, and the easiest way for me to communicate my question is with that language. Can someone help me with a translation from SQL to R?
I've figured out that this:
SELECT col1, sum(col2) FROM table1 GROUP BY col1
translates into this:
aggregate(x=table1$col2, by=list(table1$col1), FUN=sum)
And I've figured out that this:
SELECT col1, col2 FROM table1 GROUP BY col1, col2
translates into this:
unique(table1[,c("col1","col2")])
But what is the translation for this?
SELECT col1 FROM table1 GROUP BY col1
For some reason, the "unique" function seems to switch to a different return type when working on only one column, so it doesn't work as I would expect.
I'm guessing that you are referring to the fact that calling unique on a vector will return a vector, rather than a data frame. Here are a couple of examples that may help:
#Some example data
dat <- data.frame(x = rep(letters[1:2], times = 5),
                  y = rep(letters[3:4], each = 5))
> dat
x y
1 a c
2 b c
3 a c
4 b c
5 a c
6 b d
7 a d
8 b d
9 a d
10 b d
> unique(dat)
x y
1 a c
2 b c
6 b d
7 a d
#Unique => vector
> unique(dat$x)
[1] "a" "b"
#Same thing
> unique(dat[,'x'])
[1] "a" "b"
#drop = FALSE preserves the data frame structure
> unique(dat[,'x',drop = FALSE])
x
1 a
2 b
#Or you can just convert it back (although the default column name is ugly)
> data.frame(unique(dat$x))
unique.dat.x.
1 a
2 b
If you know SQL then try packages sqldf and data.table.
Having this table:
Row Pos Outdata Mismatch Other
1 10 S 0 A
2 10 S 5 A
3 10 R 0 B
4 10 R 7 B
5 20 24 0 A
6 20 24 5 B
6 20 32 10 C
How can I select all rows where Pos = 10 with a unique Outdata? If more than one such row exists, I would like the row where the field Mismatch is smallest, i.e. I would like to get rows 1 and 3, not 2 and 4.
In that select I would also like to do the same for all Pos = 20, so the total result should be rows 1, 3, 5, 6.
(And I want to then access the "Other" field, so I can't only SELECT DISTINCT on Pos, OutData and Mismatch.)
Is there a query to do this in MySQL?
Here I am assuming that (Pos, OutData, Mismatch) is not unique, but that (Row, Pos, OutData, Mismatch) is unique:
SELECT T3.*
FROM Codes T3
JOIN (
    SELECT MIN(Row) AS Row
    FROM (
        SELECT Pos, OutData, MIN(Mismatch) AS Mismatch
        FROM Codes
        GROUP BY Pos, OutData
    ) T1
    JOIN Codes T2
      ON T1.Pos = T2.Pos AND T1.OutData = T2.OutData AND T1.Mismatch = T2.Mismatch
    GROUP BY T2.Pos, T2.OutData, T2.Mismatch
) T4 ON T3.Row = T4.Row
Result:
1, 10, 'S', 0, 'A'
3, 10, 'R', 0, 'B'
5, 20, '24', 0, 'A'
7, 20, '32', 10, 'C'
Note that I have also corrected the second row 6 to become row 7, as I believe that this was a mistake in the question.
Rationale is to create a table with all values of Pos, OutData and the lowest Mismatch and use the combination of these fields as a unique key into your actual table.
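For reference, a minimal setup to reproduce this (assuming the table is named Codes as in the query above, with the duplicate Row 6 corrected to 7):

CREATE TABLE Codes (`Row` INT, Pos INT, OutData VARCHAR(8), Mismatch INT, Other CHAR(1));
-- `Row` is back-quoted because ROW is a reserved word in newer MySQL versions
INSERT INTO Codes VALUES
    (1, 10, 'S',  0,  'A'),
    (2, 10, 'S',  5,  'A'),
    (3, 10, 'R',  0,  'B'),
    (4, 10, 'R',  7,  'B'),
    (5, 20, '24', 0,  'A'),
    (6, 20, '24', 5,  'B'),
    (7, 20, '32', 10, 'C');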
SELECT t1.*
FROM MyTable t1
INNER JOIN (
SELECT Pos, OutData, MIN(Mismatch) AS Mismatch
FROM MyTable
GROUP BY Pos, OutData
) t2 ON t2.Pos = t1.Pos
AND t2.OutData = t1.OutData
AND t2.Mismatch = t1.Mismatch
Try this:
Select * From Table ot
Where pos = 10
And MisMatch =
(Select Min(MisMatch) From Table
Where pos = 10
And Outdata = ot.OutData)
This should work for you:
SELECT *
FROM table T1
GROUP BY Pos, Outdata
HAVING Mismatch = (
SELECT MIN(Mismatch)
FROM table T2
WHERE Pos = T1.Pos AND
Outdata = T1.Outdata
)