How to randomize order of data in 3 columns - sql

I have 3 columns of data in SQL Server 2005 :
LASTNAME
FIRSTNAME
CITY
I want to randomly re-order these 3 columns (and munge the data) so that the data is no longer meaningful. Is there an easy way to do this? I don't want to change any data, I just want to re-order the index randomly.

When you say "re-order" these columns, do you mean that you want some of the last names to end up in the first name column? Or do you mean that you want some of the last names to get associated with a different first name and city?
I suspect you mean the latter, in which case you might find a programmatic solution easier (as opposed to a straight SQL solution). Sticking with SQL, you can do something like:
UPDATE the_table
SET lastname = (SELECT lastname FROM the_table ORDER BY RAND())
Depending on what DBMS you're using, this may work for only one line, may make all the last names the same, or may require some variation of syntax to work at all, but the basic approach is about right. Certainly some trials on a copy of the table are warranted before trying it on the real thing.
Of course, to get the first names and cities to also be randomly reordered, you could apply a similar query to either of those columns. (Applying it to all three doesn't make much sense, but wouldn't hurt either.)
Since you don't want to change your original data, you could do this in a temporary table populated with all rows.
Finally, if you just need a single random value from each column, you could do it in place without making a copy of the data, with three separate queries: one to pick a random first name, one a random last name, and the last a random city.
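For the programmatic route mentioned above, a minimal Python sketch might look like the following. It assumes the rows have already been fetched into a list of (lastname, firstname, city) tuples; the function name `shuffle_columns` and the sample rows are made up for illustration. Each column is shuffled independently, so every value survives but ends up attached to different neighbours.

```python
import random

def shuffle_columns(rows, seed=None):
    """Shuffle each column independently; values are kept, pairings are not."""
    if not rows:
        return []
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]  # transpose rows -> columns
    for col in columns:
        rng.shuffle(col)
    return list(zip(*columns))  # transpose back to rows

# Hypothetical sample data standing in for the real table.
rows = [
    ("Doe", "John", "Chicago"),
    ("Foe", "James", "Austin"),
    ("Fan", "Django", "Portland"),
]
scrambled = shuffle_columns(rows, seed=42)
```

A shuffle can leave some rows intact purely by chance, so it is worth inspecting the output before treating it as scrambled.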

I suggest using NEWID() with CHECKSUM() for the randomization:
SELECT LASTNAME, FIRSTNAME, CITY FROM table ORDER BY CHECKSUM(NEWID())

In SQL Server 2005+ you could prepare a ranked rowset containing the three target columns and three additional computed columns filled with random rankings (one for each of the three target columns). Then the ranked rowset would be joined with itself three times using the ranking columns, and finally each of the three target columns would be pulled from their own instance of the ranked rowset. Here's an illustration:
WITH sampledata (FirstName, LastName, CityName) AS (
SELECT 'John', 'Doe', 'Chicago' UNION ALL
SELECT 'James', 'Foe', 'Austin' UNION ALL
SELECT 'Django', 'Fan', 'Portland'
),
ranked AS (
SELECT
*,
FirstNameRank = ROW_NUMBER() OVER (ORDER BY NEWID()),
LastNameRank = ROW_NUMBER() OVER (ORDER BY NEWID()),
CityNameRank = ROW_NUMBER() OVER (ORDER BY NEWID())
FROM sampledata
)
SELECT
fnr.FirstName,
lnr.LastName,
cnr.CityName
FROM ranked fnr
INNER JOIN ranked lnr ON fnr.FirstNameRank = lnr.LastNameRank
INNER JOIN ranked cnr ON fnr.FirstNameRank = cnr.CityNameRank
This is the result:
FirstName LastName CityName
--------- -------- --------
James Fan Chicago
John Doe Portland
Django Foe Austin

select *, rand() from table order by rand();
I understand some versions of SQL have a rand() that doesn't change for each line. Check for yours. Works on MySQL.

Related

How can I select all fields except for those with non-distinct values?

I have a table which represents data for people who have applied. Each person has one PERSON_ID, but can have multiple APP_IDs. I want to select all of the columns except for APP_ID (because its values aren't distinct) for all of the distinct people in the table.
I can list every field individually in both the select and group by clauses.
This works:
select PERSON_ID, FIRST,LAST,MIDDLE,BIRTHDATE,SEX,EMAIL,PRIMARY_PHONE from
applications
where first = 'Rob' and last='Robot'
group by PERSON_ID,FIRST,LAST,MIDDLE,BIRTHDATE,SEX,EMAIL,PRIMARY_PHONE
But there are twenty more fields that I may or may not use at any given time.
Is there a shorter way to achieve this sort of selection without being so verbose?
select distinct is shorter:
select distinct PERSON_ID, FIRST, LAST, MIDDLE, BIRTHDATE, SEX, EMAIL, PRIMARY_PHONE
from applications
where first = 'Rob' and last = 'Robot';
But you still have to list out the columns once.
Some more modern databases support an except clause that lets you remove columns from the wildcard list. To the best of my knowledge, Oracle has no similar concept.
You could write a query to bring the columns together from the system tables. That could simplify writing the query and help prevent misspellings.
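As a rough sketch of that idea, the Python below builds the select distinct statement from a list of column names while skipping the ones you don't want. The column list is hard-coded here; in practice it would presumably come from a data dictionary view such as Oracle's USER_TAB_COLUMNS. `build_distinct_select` is a hypothetical helper name, not part of any library.

```python
def build_distinct_select(table, columns, exclude=()):
    """Return a SELECT DISTINCT statement over every column not in `exclude`."""
    skip = {name.upper() for name in exclude}
    kept = [name for name in columns if name.upper() not in skip]
    if not kept:
        raise ValueError("no columns left to select")
    return "select distinct " + ", ".join(kept) + " from " + table

# Hard-coded for illustration; assumed to come from the system tables.
columns = ["PERSON_ID", "APP_ID", "FIRST", "LAST", "EMAIL"]
sql = build_distinct_select("applications", columns, exclude=["app_id"])
print(sql)
```

The identifiers are interpolated directly into the statement, so this is only suitable for column names you trust.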

SQL Merge two rows with same ID but different column values (Oracle)

I am trying to merge different rows into one when they have the same id but different column values.
For example :
(table1)
id colour
1 red
1 blue
2 green
2 red
I would like these to be combined so that the result is:
id colour1 colour2
1 red blue
2 green red
Or
id colour
1 red, blue
2 green, red
Or any other variation of the above so that the rows are joined together some way.
Any help would be appreciated! Thanks in advance.
Please read my Comment first - you shouldn't even think about doing this unless it is ONLY for reporting purposes, and you want to see how this can be done in plain SQL (as opposed to the correct solution, which is to use your reporting tool for this job).
The second format is easiest, especially if you don't care about the order in which the colors appear:
select id, listagg(colour, ', ') within group (order by null)
from table1
group by id
order by null means the order within each list is arbitrary (unspecified, rather than truly random). If you want to order by something else, use that in order by with listagg(). For example, to order the colours alphabetically, you could say within group (order by colour).
For the first format, you need to have an a priori limit on the number of columns, and how you do it depends on the version of Oracle you are using (which you should always include in every question you post here and on other discussion boards). The concept is called "pivoting"; since version 11, Oracle has an explicit PIVOT operator that you can use.
The following would solve your problem in the first of the two ways that you proposed. Listagg is what you would use to solve it the second of the two ways (as pointed out in the other answer):
select id,
min(decode(rn,1,colour,null)) as colour1,
min(decode(rn,2,colour,null)) as colour2,
min(decode(rn,3,colour,null)) as colour3
from (
select id,
colour,
row_number() over(partition by id order by colour) as rn
from table1
)
group by id;
In this approach, you need to add additional decode expressions up to the maximum number of possible colours for a given ID (this solution is not dynamic).
Additionally, this puts the colours into colour1, colour2, etc. based on the alphabetical order of the colour names. If you prefer a random order, or some other order, you need to change the order by.
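If the pivot ends up being done outside the database after all, the same row-numbering idea can be sketched in Python. This assumes the (id, colour) rows have already been fetched into a list; `pivot_colours` and `max_cols` are illustrative names, and like the SQL it is capped at a fixed number of columns.

```python
from collections import defaultdict

def pivot_colours(rows, max_cols=3):
    """Turn (id, colour) rows into (id, colour1, ..., colourN) rows."""
    grouped = defaultdict(list)
    for row_id, colour in rows:
        grouped[row_id].append(colour)
    result = []
    for row_id in sorted(grouped):
        colours = sorted(grouped[row_id])[:max_cols]  # alphabetical, like the SQL
        padding = [None] * (max_cols - len(colours))  # None stands in for NULL
        result.append((row_id, *colours, *padding))
    return result

# Sample rows taken from the question's table1.
rows = [(1, "red"), (1, "blue"), (2, "green"), (2, "red")]
pivoted = pivot_colours(rows)
```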
Try this, it works for me:
Here student is the name of the table and studentId is a column. We can merge all subjects for a particular student using GROUP_CONCAT, which is MySQL syntax (Oracle's counterpart is LISTAGG).
SELECT studentId, GROUP_CONCAT(subjects) FROM student GROUP BY studentId

Controlling result of oracle query

I have a schema like this
create table sample(id number ,name varchar2(30),mark number);
Now I have to return the names with the top three marks. How can I write a SQL query for this?
If I use max(mark) it will return only the maximum, and
select name from sample
returns all the names! I tried many ways but was unable to limit the result to 3 rows.
Please suggest a way to solve this problem.
How do you want to handle ties? If Mary gets a mark of 100, Tom gets a mark of 95, and John and Dave both get a mark of 90, what results do you want, for example? Do you want both John and Dave to be returned since they both tied for third? Or do you want to pick one of the two so that the result always has exactly three rows? What happens if Beth also tied for second with a score of 95? Do you still consider John and Dave tied for third place or do you consider them tied for fourth place?
You can use analytic functions to get the top N results, though which analytic function you pick depends on how you want to resolve ties.
SELECT id,
name,
mark
FROM (SELECT id,
name,
mark,
rank() over (order by mark desc) rnk
FROM sample)
WHERE rnk <= 3
will return the top three rows using the RANK analytic function to rank them by MARK. RANK returns the same rank for people that are tied and uses the standard sports approach to determining your rank so that if two people tie for second, the next competitor is in fourth place, not third. DENSE_RANK ensures that numeric ranks are not skipped so that if two people tie for second, the next row is third. ROW_NUMBER assigns each row a different rank by arbitrarily breaking ties.
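To make the tie-handling difference concrete, here is a small Python sketch that mimics what RANK, DENSE_RANK and ROW_NUMBER would assign for marks sorted in descending order. It is an emulation for illustration only, not how Oracle computes them, and the `rankings` function name is made up. The marks are the ones from the Mary/Tom/Beth/John/Dave scenario above.

```python
def rankings(marks):
    """Return (mark, rank, dense_rank, row_number) tuples, highest mark first."""
    ordered = sorted(marks, reverse=True)
    out = []
    rank = dense = 0
    prev = None
    for row_number, mark in enumerate(ordered, start=1):
        if mark != prev:
            rank = row_number  # RANK jumps to the row position after a tie
            dense += 1         # DENSE_RANK never skips a number
            prev = mark
        out.append((mark, rank, dense, row_number))
    return out

# Mary 100, Tom 95, Beth 95, John 90, Dave 90
result = rankings([100, 95, 95, 90, 90])
top3_rank = [r for r in result if r[1] <= 3]        # Mary, Tom, Beth
top3_dense = [r for r in result if r[2] <= 3]       # all five people
top3_row_number = [r for r in result if r[3] <= 3]  # exactly three rows
```

With these marks, filtering on rank <= 3 drops John and Dave (they rank fourth), while filtering on dense rank <= 3 keeps them.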
If you really want to use ROWNUM rather than analytic functions, you can also do
SELECT id,
name,
mark
FROM (SELECT id,
name,
mark
FROM sample
ORDER BY mark DESC)
WHERE rownum <= 3
You cannot, however, have the ROWNUM predicate at the same level as the ORDER BY clause since the predicate is applied before the ordering.
SELECT t2.name FROM
(
SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.mark DESC) rn
FROM sample t
) t2
WHERE t2.rn <= 3

Best way to randomly select rows *per* column in SQL Server

A search of SO yields many results describing how to select random rows of data from a database table. My requirement is a bit different, though, in that I'd like to select individual columns from across random rows in the most efficient/random/interesting way possible.
To better illustrate: I have a large Customers table, and from that I'd like to generate a bunch of fictitious demo Customer records that aren't real people. I'm thinking of just querying randomly from the Customers table, and then randomly pairing FirstNames with LastNames, Address, City, State, etc.
So if this is my real Customer data (simplified):
FirstName LastName State
==========================
Sally Simpson SD
Will Warren WI
Mike Malone MN
Kelly Kline KS
Then I'd generate several records that look like this:
FirstName LastName State
==========================
Sally Warren MN
Kelly Malone SD
Etc.
My initial approach works, but it lacks the elegance that I'm hoping the final answer will provide. (I'm particularly unhappy with the repetitiveness of the subqueries, and the fact that this solution requires a known/fixed number of fields and therefore isn't reusable.)
SELECT
FirstName = (SELECT TOP 1 FirstName FROM Customer ORDER BY newid()),
LastName = (SELECT TOP 1 LastName FROM Customer ORDER BY newid()),
State = (SELECT TOP 1 State FROM Customer ORDER BY newid())
Thanks!
ORDER BY NEWID() works with ROW_NUMBER in SQL Server 2008. Not sure about SQL Server 2005.
This is needed to generate values to join the 3 separate queries. It's slightly counterintuitive, because you'd think it would always take the first 100 rows in a different order, but it doesn't...
;With F AS
(
SELECT TOP 100
FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS Foo
FROM Customer
), L AS
(
SELECT TOP 100
LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS Foo
FROM Customer
), S AS
(
SELECT TOP 100
State, ROW_NUMBER() OVER (ORDER BY NEWID()) AS Foo
FROM Customer
)
SELECT
F.FirstName, L.LastName, S.State
FROM
F
JOIN L ON F.Foo = L.Foo
JOIN S ON F.Foo = S.Foo
You could select the top N random rows at once (where N=3 is the number of columns), and then take column 1 from row 1, column 2 from row 2, etc. I'm not sure exactly how to do that last step in SQL, but if you're willing to do the last step in some other language I'm sure it would be simple.
Also, by selecting N rows at once you would have the new property that you would never be selecting two columns from the same row (though this could cause trouble if there are more columns than rows).
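The "last step in some other language" could be sketched in Python roughly as follows, assuming the candidate rows have already been fetched (for example with a random-order query). `mix_one_record` is a hypothetical name, and the sample rows are the ones from the question.

```python
import random

def mix_one_record(rows, rng=random):
    """Pick one distinct random row per column, then take column i from row i."""
    n_columns = len(rows[0])
    if len(rows) < n_columns:
        raise ValueError("need at least as many rows as columns")
    picked = rng.sample(rows, n_columns)  # distinct rows, no repeats
    return tuple(picked[i][i] for i in range(n_columns))

customers = [
    ("Sally", "Simpson", "SD"),
    ("Will", "Warren", "WI"),
    ("Mike", "Malone", "MN"),
    ("Kelly", "Kline", "KS"),
]
record = mix_one_record(customers, random.Random(7))
```

Because the rows are sampled without replacement, no two fields of the generated record come from the same source row, which is also why the function refuses to run when there are fewer rows than columns.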
It seems to me that you are actually trying to generate random data -- the fact that you already have a bunch that is non-random is really just a side note. If I were in your shoes, I would look at generating random customers by choosing random words from the dictionary to use as FName, LName, City, etc. That seems easier and more random anyway.

Return all Fields and Distinct Rows

What's the best way to do this when looking for distinct rows?
SELECT DISTINCT name, address
FROM table;
I still want to return all fields, i.e. address1, city, etc., but not include them in the DISTINCT row check.
Then you have to decide what to do when there are multiple rows with the same value in the column you want the distinct check to run against, but with different values in the other columns. In that case, how does the query processor know which of the multiple values in the other columns to output? If you don't care, just write a GROUP BY on the distinct column, with MIN() or MAX() on all the other ones.
EDIT: I agree with the comments from others that as long as you have multiple dependent columns in the same table (e.g., Address1, Address2, City, State), this approach is going to give you mixed (and therefore inconsistent) results. If each column attribute in the table is independent (e.g., if addresses are all in an Address table and only an AddressId is in this table), then it's not as significant an issue, because at least all the columns from a join to the Address table will contain data for the same address. But you are still getting a more or less random selection of one of the set of multiple addresses.
This will not mix and match your city, state, etc. and should give you the last one added even:
select b.*
from (
select max(id) id, Name, Address
from table a
group by Name, Address) as a
inner join table b
on a.id = b.id
When you have a mixed set of fields, some of which you want to be DISTINCT and others that you just want to appear, you require an aggregate query rather than DISTINCT. DISTINCT is only for returning single copies of identical fieldsets. Something like this might work:
SELECT name,
GROUP_CONCAT(DISTINCT address) AS addresses,
GROUP_CONCAT(DISTINCT city) AS cities
FROM the_table
GROUP BY name;
The above will get one row for each name. addresses contains a comma-delimited string of all the addresses for that name, each listed once. cities does the same for all the cities.
However, I don't see how the results of this query are going to be useful. It will be impossible to tell which address belongs to which city.
If, as is often the case, you are trying to create a query that will output rows in the format you require for presentation, you're much better off accepting multiple rows and then processing the query results in your application layer.
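As an illustration of handling this in the application layer, the Python sketch below groups plain (name, address, city) rows by name while keeping each address paired with its city. The function name `group_by_name` and the sample rows are assumptions made for the example.

```python
def group_by_name(rows):
    """Collect (address, city) pairs per name, keeping each pair together."""
    grouped = {}
    for name, address, city in rows:
        pairs = grouped.setdefault(name, [])
        if (address, city) not in pairs:  # drop exact duplicates only
            pairs.append((address, city))
    return grouped

# Hypothetical query results, one tuple per row.
rows = [
    ("abc", "123", "def"),
    ("abc", "123", "hij"),
    ("abc", "123", "def"),
    ("xyz", "456", "def"),
]
grouped = group_by_name(rows)
```

Unlike the two separate GROUP_CONCAT columns, each address here stays attached to the city it came with.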
I don't think you can do this because it doesn't really make sense.
name | address | city | etc...
abc | 123 | def | ...
abc | 123 | hij | ...
If you were to include city, but not have it as part of the distinct clause, the value of city would be unpredictable unless you did something like Max(city).
You can do
SELECT Name, Address, MAX(Address1), MAX(City)
FROM table
GROUP BY Name, Address
Use @JBrooks' answer below. He has a better answer.
If you're using SQL Server 2005 or above you can use the ROW_NUMBER function. This will get you the row with the lowest ID for each name. If you want to 'group' by more columns, add them in the PARTITION BY section of the ROW_NUMBER call.
SELECT id, Name, Address, ...
FROM (select id, Name, Address, ...,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
from table) sub
WHERE RowNo = 1