I need to apply a three-way split test to a subset of my data. The way I have been doing this is to create a test table,
e.g.:
Select [Group], List, Urn
into tbl_TestSplit
from tbl_AllRecords
where ApplicableToTest = 'Y'
order by List, Urn
Then I add some fields:
alter table tbl_testsplit
add
[ID][int] identity (1,1) not null,
[Split] [nvarchar] (20) null
then I update the split field as follows:
update tbl_testsplit set split = {fn MOD(id,3)}
However, when I check the results, the records are not split correctly: usually at least one of the lists is a few records out. When I investigated, I noticed that the table had not actually been created in the order I had specified.
I am sure there is an easier (i.e. smarter) way to go about this. Any help would be gratefully appreciated.
Thanks
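A table has no inherent order, so identity values added after a SELECT ... INTO need not follow the ORDER BY of that SELECT. You can use ROW_NUMBER to calculate the row number in exactly the order you want: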
Select [Group], List, Urn
, split = (ROW_NUMBER() OVER (ORDER BY List, Urn)) % 3
from tbl_AllRecords
where ApplicableToTest = 'Y'
order by List, Urn
% is the modulo operator.
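If the goal is an even three-way split within each list rather than across the whole subset, one option is to restart the numbering per list with PARTITION BY. The sketch below is only an illustration and not part of the original answer: it runs the idea through Python's sqlite3 module (assuming SQLite 3.25 or later for window functions) against a made-up tbl_AllRecords with 7 applicable rows in list A and 5 in list B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_AllRecords (List TEXT, Urn INTEGER, ApplicableToTest TEXT)"
)
# Hypothetical sample data: 7 rows in list A, 5 in list B, plus one excluded row.
rows = (
    [("A", urn, "Y") for urn in range(1, 8)]
    + [("B", urn, "Y") for urn in range(1, 6)]
    + [("B", 99, "N")]
)
conn.executemany("INSERT INTO tbl_AllRecords VALUES (?, ?, ?)", rows)

# Number the rows separately inside each list, then take the remainder mod 3.
split_rows = conn.execute("""
    SELECT List, Urn,
           ROW_NUMBER() OVER (PARTITION BY List ORDER BY Urn) % 3 AS Split
    FROM tbl_AllRecords
    WHERE ApplicableToTest = 'Y'
    ORDER BY List, Urn
""").fetchall()

# Count how many rows of each list landed in each split group.
counts = {}
for list_name, _urn, split in split_rows:
    counts[(list_name, split)] = counts.get((list_name, split), 0) + 1

for key in sorted(counts):
    print(key, counts[key])
```

Within each list the group sizes differ by at most one row. In T-SQL the equivalent expression would be ROW_NUMBER() OVER (PARTITION BY List ORDER BY Urn) % 3.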
I have a table like this
CREATE TABLE userinteractions
(
userid bigint,
dobyr int,
-- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table. These cases need to be removed from the pipeline.
I want to be able to create a clean table that contains unique userid and dobyr limited to the cases where there is only one value of dobyr for the userid in userinteractions.
For example I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single sql query?
I am using postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.
Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are many rows with the same userid value and the same dobyr value, one of them is kept (it doesn't matter which one) and the rest are discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid,dobyr) *
from userinteractions
where userid in (
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1 )
order by userid,dobyr;
This could also be done with NOT IN, NOT EXISTS or EXISTS conditions. You can also choose which row of each combination is kept by adding columns at the end of the ORDER BY.
If you don't need the other columns and only want something to use later as a filter/whitelist, the plain userid values that meet your criteria are enough, since each of them already identifies exactly one (userid, dobyr) pair:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1
Just use a HAVING clause to assert that all rows in a group must have the same dobyr.
SELECT
userid,
MAX(dobyr) AS dobyr
FROM
userinteractions
GROUP BY
userid
HAVING
COUNT(DISTINCT dobyr) = 1
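As a quick check of the HAVING approach, here is a small sketch that loads the sample rows from the question into an in-memory SQLite database via Python's sqlite3 module. SQLite stands in for Postgres here purely for convenience; the query itself is the one from the answer above, with an ORDER BY added so the output is stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userinteractions (userid INTEGER, dobyr INTEGER)")
# Sample data from the question: user 3 has two conflicting dobyr values.
conn.executemany(
    "INSERT INTO userinteractions VALUES (?, ?)",
    [(1, 1995), (1, 1995), (2, 1999), (3, 1990), (3, 1999), (4, 1989), (4, 1989)],
)

# Keep one row per user, but only for users with a single distinct dobyr.
clean = conn.execute("""
    SELECT userid, MAX(dobyr) AS dobyr
    FROM userinteractions
    GROUP BY userid
    HAVING COUNT(DISTINCT dobyr) = 1
    ORDER BY userid
""").fetchall()

print(clean)  # [(1, 1995), (2, 1999), (4, 1989)]
```

One thing to keep in mind: COUNT(DISTINCT dobyr) ignores NULLs, so a user with one real dobyr plus some NULL rows would still pass the filter.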
I have a table with a column that can have values separated by ",".
Example of the group column:
id  group
1   10,20,30
2   280
3   20
I want to write a SELECT with a WHERE condition on the group column so that searching for, say, 20 returns rows 1 and 3, and searching for 20,280 returns rows 1 and 2.
Can you help me please?
As pointed out in the comments, storing multiple values in a single column is not a good idea.
Coming to your question, you can use a split-string function (dbo.SplitStrings_Numbers in the example below) to split the comma-separated values into a table and then query them:
create table #temp
(
id int,
columnss varchar(100)
)
insert into #temp
values
(1,'10,20,30'),
(2, '280'),
(3, '20')
select *
from #temp
cross apply
(
select * from dbo.SplitStrings_Numbers(columnss,',')
)b
where item in (20)
id  columnss  Item
1   10,20,30  20
3   20        20
The short answer is: don't do it.
Instead normalize your tables to at least 3NF. If you don't know what database normalization is, you need to do some reading.
If you absolutely have to do it (e.g. this is a legacy system and you cannot change the table structure), there are several articles on string splitting with TSQL and at least a couple that have done extensive benchmarks on various methods available (e.g. see: http://sqlperformance.com/2012/07/t-sql-queries/split-strings)
Since you only want to search, you don't really need to split the strings, so you can write something like:
SELECT id, list
FROM t
WHERE ','+list+',' LIKE '%,'+@searchValue+',%'
Where t(id int, list varchar(max)) is the table to search and @searchValue is the value you are looking for. Wrapping both sides in commas ensures that only whole values match. If you need to search for more than one value, put the values in a table and use a join or subquery.
E.g. if s(searchValue varchar(max)) is the table of values to search then:
SELECT distinct t.id, t.list
FROM t INNER JOIN s
ON ','+t.list+',' LIKE '%,'+s.searchValue+',%'
If you need to pass those search values from ADO.NET, consider table-valued parameters.
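To see the comma-wrapping trick in action, here is a minimal sketch using Python's sqlite3 module with the three sample rows from the question. It is an illustration only: SQLite concatenates with || where T-SQL uses +, and the helper name ids_matching is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, list TEXT)")
conn.execute("CREATE TABLE s (searchValue TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "10,20,30"), (2, "280"), (3, "20")],
)

def ids_matching(values):
    """Return ids of rows in t whose list contains any of the given values."""
    conn.execute("DELETE FROM s")
    conn.executemany("INSERT INTO s VALUES (?)", [(v,) for v in values])
    rows = conn.execute("""
        SELECT DISTINCT t.id
        FROM t INNER JOIN s
          ON ',' || t.list || ',' LIKE '%,' || s.searchValue || ',%'
        ORDER BY t.id
    """).fetchall()
    return [row[0] for row in rows]

print(ids_matching(["20"]))         # [1, 3]
print(ids_matching(["20", "280"]))  # [1, 2, 3]
print(ids_matching(["80"]))         # [] ('80' does not match inside '280')
```

Note that the join returns every row containing any of the search values, so searching for 20 and 280 together also returns row 3. Search values containing the LIKE wildcards % or _ would need escaping.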
I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, sometimes these users will be on multiple rows and if a randomly selected userID is contained in other parts of the table I would like to extract those rows too.
I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:
SELECT * FROM source
TABLESAMPLE (1 PERCENT) s;
but I am not sure how to add the constraint where I would like all other instances of those 1% userIDs selected too.
You can use rand() to split the data randomly, with the proportion of userIDs you want in each category. I recommend rand() with a fixed seed because that makes the results repeatable. The query below puts roughly 10% of users in the test group (0.1); use 0.01 for a 1% sample.
select c.*
from
(select userID
, if(rand(5555) < 0.1, 'test', 'train') as type
from
(select userID
from mytable
group by userID
) a
) b
right outer join
(select *
from mytable
) c
on b.userID = c.userID
where b.type = 'test'
;
This is set up for entity level modeling purposes, which is why I have test and train as types.
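The same entity-level idea can be sketched outside Hive: draw the random number once per distinct userID, then keep every row belonging to the chosen users. The snippet below is a plain-Python illustration, not Hive code; the function name sample_users and the toy rows are made up for the example.

```python
import random

# Toy data: (userID, payload). Users 1 and 3 appear on several rows.
rows = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (3, "f"), (4, "g")]

def sample_users(rows, fraction, seed):
    """Pick roughly `fraction` of the distinct users, then return all their rows."""
    rng = random.Random(seed)
    # Sort the distinct ids so the same seed always maps to the same users.
    user_ids = sorted({user_id for user_id, _ in rows})
    chosen = {user_id for user_id in user_ids if rng.random() < fraction}
    return [row for row in rows if row[0] in chosen]

sample = sample_users(rows, 0.5, seed=5555)
print(sample)
```

Because the draw happens per user rather than per row, a selected user always brings all of their rows along, and re-running with the same seed gives the same sample.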
I am trying to select a single row at random from a table. I am curious as to why the two statements below don't work:
select LastName from DataGeneratorNameLast where id = (ABS(CHECKSUM(NewId())) % 3)+1
select LastName from DataGeneratorNameLast where id = cast(Ceiling(RAND(convert(varbinary, newid())) *4) as int)
Both statements return, at random, either 1 row, no rows, or multiple rows. For the life of me I can't figure out why. Just adding top 1 to the query only solves the problem of multiple rows - but not of no rows returned.
Yes I could do the same thing by selecting top 1 and ordering by newid(). But the mystery of why this does not work is driving me crazy.
Thoughts on why I get multiple rows back?
Here is the table I am using to select from:
Create Table dbo.DataGeneratorNameLast
(
[Id] [int] IDENTITY(1,1) NOT NULL,
LastName varchar(50) NOT NULL,
)
Go
insert into DataGeneratorNameLast (LastName) values ('SMITH')
insert into DataGeneratorNameLast (LastName) values ('JOHNSON')
insert into DataGeneratorNameLast (LastName) values ('Booger')
insert into DataGeneratorNameLast (LastName) values ('Tiger')
The newid() gets evaluated for every row it is compared against, generating a different number. To do what you want, you should generate the random value into a variable before the select and then reference the variable.
Declare @randId int = (abs(checksum(newid())) % 3) + 1;
select LastName from DataGeneratorNameLast where id = @randId;
As Martin said in the comments, RAND() without a per-row seed would behave differently, since it is evaluated only once per query.
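The per-row evaluation is easy to reproduce elsewhere. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for SQL Server (an assumption made only so the example is runnable): SQLite's random() is likewise re-evaluated for every row the WHERE clause tests, whereas a value chosen once and bound as a parameter always matches exactly one row. The table deliberately has no key or index on Id, so every row is scanned.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key or index on Id, so the filter is checked against every row.
conn.execute("CREATE TABLE DataGeneratorNameLast (Id INTEGER, LastName TEXT)")
conn.executemany(
    "INSERT INTO DataGeneratorNameLast VALUES (?, ?)",
    [(1, "SMITH"), (2, "JOHNSON"), (3, "Booger"), (4, "Tiger")],
)

def rows_with_per_row_random():
    # random() produces a new value for each row that is compared.
    return conn.execute(
        "SELECT LastName FROM DataGeneratorNameLast "
        "WHERE Id = abs(random() % 4) + 1"
    ).fetchall()

def rows_with_fixed_random():
    # Choose the id once, up front, then bind it as a parameter.
    rand_id = random.randint(1, 4)
    return conn.execute(
        "SELECT LastName FROM DataGeneratorNameLast WHERE Id = ?", (rand_id,)
    ).fetchall()

per_row_counts = {len(rows_with_per_row_random()) for _ in range(200)}
fixed_counts = {len(rows_with_fixed_random()) for _ in range(200)}

print(sorted(per_row_counts))  # several different row counts, 0 included
print(sorted(fixed_counts))    # [1]
```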
If the table has at least one row, this query is guaranteed to return exactly one row:
select TOP (1) LastName
from DataGeneratorNameLast
ORDER BY NEWID()
Note that this solution can be slow if the table has a large number of rows.
About select LastName from DataGeneratorNameLast where id = @Rand: this solution does not guarantee that a row with that id exists, because even an IDENTITY column can contain gaps. If you definitely need one row, do a preliminary check first: IF EXISTS (select * from DataGeneratorNameLast where id = @Rand) SELECT ...
I had a similar issue and fixed it by making the ID a PRIMARY KEY.
NEWID() is computed per-row. Without a primary key, there is no access pattern other than a table scan, and the filter is checked for each row, so a different value is computed for each row, and you get however many rows match.
With the key, a seek is available, so the predicate is computed once and used as a search argument for a seek.
I have a complex query which may return more than one record per group. One of the fields holds a sequential number. If more than one record is returned for a group, I just want the record with the highest sequential number.
I’ve tried using the SQL MAX function, but if I try to add more than one field it returns all records, instead of the one with the highest sequential field in that group.
I am trying to accomplish this in MS Access.
Edit: 4/5/11
Trying to create a table as an example of what I am trying to do
I have the following table:
tblItemTrans
ItemID(PK)
Eventseq(PK)
ItemTypeID
UserID
Eventseq is a number field that increments for each ItemID. (Don’t ask me why, that’s how the table was created.) Each ItemID can have one or many Eventseq values. I only need the last record (max(Eventseq)) for each ItemTypeID.
Hope this helps any.
SELECT A.*
FROM YourTable AS A
INNER JOIN (SELECT GroupColumn, MAX(SequentialColumn) AS MaxSeq
FROM YourTable
GROUP BY GroupColumn) AS B
ON A.GroupColumn = B.GroupColumn AND A.SequentialColumn = B.MaxSeq
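To make the join pattern concrete, here is a small sketch against the tblItemTrans layout from the question, run in SQLite via Python's sqlite3 module rather than Access. The rows are invented, and grouping by ItemID is an assumption: swap in ItemTypeID if that is the grouping you need, bearing in mind that ties on Eventseq would then return more than one row per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tblItemTrans (
        ItemID INTEGER,
        Eventseq INTEGER,
        ItemTypeID INTEGER,
        UserID INTEGER,
        PRIMARY KEY (ItemID, Eventseq)
    )
""")
# Hypothetical rows: items 1 and 3 have several events, item 2 has one.
conn.executemany(
    "INSERT INTO tblItemTrans VALUES (?, ?, ?, ?)",
    [
        (1, 1, 10, 100), (1, 2, 10, 101), (1, 3, 11, 102),
        (2, 1, 10, 100),
        (3, 1, 12, 103), (3, 2, 12, 104),
    ],
)

# Join each row to its group's maximum Eventseq and keep only the matches.
latest = conn.execute("""
    SELECT A.ItemID, A.Eventseq, A.ItemTypeID, A.UserID
    FROM tblItemTrans AS A
    INNER JOIN (SELECT ItemID, MAX(Eventseq) AS MaxSeq
                FROM tblItemTrans
                GROUP BY ItemID) AS B
      ON A.ItemID = B.ItemID AND A.Eventseq = B.MaxSeq
    ORDER BY A.ItemID
""").fetchall()

print(latest)  # [(1, 3, 11, 102), (2, 1, 10, 100), (3, 2, 12, 104)]
```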
If your SequentialNumber is an ID (unique across the table), then you could use
select *
from tbl
where seqnum in (
select max(seqnum) from tbl
group by groupcolumn)
If it is not unique, an alternative to Lamak's query above is the Access domain aggregate function DMAX:
select *
from tbl
where seqnum = DMAX("seqnum", "tbl", "groupcolumn='" & groupcolumn & "'")
Note: if groupcolumn is a date, use # instead of the single quotes ' in the above; if it is numeric, remove the single quotes.