suggest a method for updating data in many tables with random data? - sql

I've got about 25 tables that I'd like to update with random data that's picked from a subset of data. I'd like the data to be picked at random but meaningful -- like changing all the first names in a database to new first names at random. So I don't want random garbage in the fields, I'd like to pull from a temp table that's populated ahead of time.
The only way I can think of to do this is with a loop and some dynamic sql.
insert pick-from names into a temp table with an id field
foreach table name in a list of tables:
    build dynamic sql that updates all first-name fields to a name picked at random from the temp table, based on rand() * max(id) from the temp table
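In T-SQL that might look roughly like the sketch below. The helper tables (#FirstNames, #Targets) and column names are placeholders, and note that RAND() is evaluated only once per statement in SQL Server, so a NEWID()-based pick is used to get a different name per row.

-- Rough sketch only; #FirstNames and #Targets are assumed helper tables.
-- #FirstNames(Id int identity(1,1), FirstName varchar(50))
-- #Targets(TableName sysname, ColumnName sysname)  -- which columns to scrub
DECLARE @cnt int = (SELECT COUNT(*) FROM #FirstNames);
DECLARE @tbl sysname, @col sysname, @sql nvarchar(max);

DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT TableName, ColumnName FROM #Targets;
OPEN c;
FETCH NEXT FROM c INTO @tbl, @col;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- ABS(CHECKSUM(NEWID())) % @cnt + 1 yields a different random Id per row,
    -- assuming #FirstNames has contiguous Ids 1..@cnt.
    SET @sql = N'UPDATE t SET ' + QUOTENAME(@col) + N' =
                    (SELECT FirstName FROM #FirstNames
                     WHERE Id = ABS(CHECKSUM(NEWID())) % @cnt + 1)
                 FROM ' + QUOTENAME(@tbl) + N' AS t;';
    EXEC sys.sp_executesql @sql, N'@cnt int', @cnt = @cnt;
    FETCH NEXT FROM c INTO @tbl, @col;
END
CLOSE c; DEALLOCATE c;

It's worth spot-checking a table afterwards to confirm the plan really re-evaluated the subquery per row rather than setting every row to the same name.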
But anytime I think "loop" in SQL I figure I'm doing something wrong.
The database in question has a lot of denormalized tables in it, so that's why I think I'd need a loop (the first name fields are scattered across the database).
Is there a better way?

Red Gate have a product called SQL Data Generator that can generate fake names and other fake data for testing purposes. It's not free, but they have a trial so you can test it out, and it might be faster than trying to do it yourself.
(Disclaimer: I have never used this product, but I've been very happy with some of their other products.)

I wrote a stored procedure to do something like this a while back. It is not as good as the Red Gate product and only does names, but if you need something quick and dirty, you can download it from
http://www.joebooth-consulting.com/products/
The script name is GenRandNames.sql
Hope this helps

Breaking the 4th wall a bit by answering my own question.
I did try this as a SQL script. What I learned is that SQL pretty much sucks at random. The script was slow and awkward -- it needed functions that referenced views that existed only for the script's sake and couldn't be created in tempdb.
So I made a console app.
1. Generate your random data; easy to do with the Random class (just remember to use only one instance of Random).
2. Figure out which columns and table names you'd like to update via a script that looks at information_schema (see the query sketch after this list).
3. Get the IDs for all the tables that you're going to update, if possible (and wow, will it be slow if you have a large table that doesn't have any good PKs).
4. Update each table 100 rows at a time. Why 100? No idea. Could be 1000. I just picked a number. Dictionary is handy here: pick a random ID from the dict using the Random class.
5. Wash, rinse, repeat. I updated about 2.2 million rows in an hour this way. Maybe it could be faster, but it was doing many small updates so it didn't get in anyone's way.
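For step 2, a query along these lines against INFORMATION_SCHEMA will list the candidate columns; the LIKE pattern is only a guess at how the first-name fields are named in your schema.

-- List candidate first-name columns across the database (naming pattern is an assumption).
SELECT c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS AS c
JOIN INFORMATION_SCHEMA.TABLES AS t
  ON  t.TABLE_SCHEMA = c.TABLE_SCHEMA
  AND t.TABLE_NAME   = c.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
  AND c.DATA_TYPE IN ('varchar', 'nvarchar')
  AND c.COLUMN_NAME LIKE '%first%name%';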

Related

SQL where statement thousands of values

I have a spreadsheet with 12,290 unique reference numbers on it, and I need to find any payment transactions against them.
For a handful I'd just do this manually. But for a large number like this, what would be the best way?
Can I reference a Notepad file in a query in any way?
Thanks,
This is a bit long for a comment.
Although you could put the values into a giant IN statement, I would recommend that you load the data into a separate table. That table can have the reference number as a primary key. You can then use a query (JOIN or EXISTS) to get matching values.
This also has the nice side effect that you have the reference numbers in the database, so they can be used for another purpose -- or archived so you can keep track of what your application is doing.
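A minimal sketch of that approach, with assumed names (RefNumbers for the spreadsheet values, Payments.ReferenceNumber for the transactions):

-- Holds the 12,290 values from the spreadsheet.
CREATE TABLE RefNumbers (ReferenceNumber varchar(50) NOT NULL PRIMARY KEY);

-- ...load the spreadsheet into RefNumbers (import wizard, BULK INSERT, etc.)...

-- Then pull the matching transactions.
SELECT p.*
FROM Payments AS p
WHERE EXISTS (SELECT 1
              FROM RefNumbers AS r
              WHERE r.ReferenceNumber = p.ReferenceNumber);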
Building on @Gordon Linoff's answer, there are various ways to upload the data:
As suggested in the comments: write a formula in Excel to generate the INSERT statements, copy them, and run the script in Management Studio. (I would keep the data in a permanent table to avoid doing the same again in the future, if the list is not going to change often.) I wouldn't recommend this for 12,000 records, but for fewer than 100 it is OK.
Use the Import/Export wizard in Management Studio.
Create a table to hold the data, open it in Management Studio in edit mode, and copy-paste the data into the table.
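Since the question asks about referencing a Notepad file, one more option (not mentioned above) would be BULK INSERT from a text file; a rough sketch, with a made-up path and assuming one reference number per line:

-- File path and format are assumptions; adjust for the real export.
BULK INSERT RefNumbers
FROM 'C:\data\reference_numbers.txt'
WITH (ROWTERMINATOR = '\n');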

How can I divide a single table into multiple tables in access?

Here is the problem: I am currently working with a Microsoft Access database that the previous employee created by just adding all the data into one table (yes, all the data into one table). There are about 186 columns in that one table.
I am now responsible for dividing each category of data into its own table. Everything is going fine although progress is too slow. Is there perhaps an SQL command that will somehow divide each category of data into its proper table? As of now I am manually looking at the main table and carefully transferring groups of data into each respective table along with its proper IDs making sure data is not corrupted. Here is the layout I have so far:
Note: I am probably one of the very few at my campus with database experience.
I would approach this as a classic normalisation process. Your single, hugely wide table should contain all of the entities within your domain, so as long as you understand the domain you should be able to normalise the structure until you're happy with it.
To create your foreign key lookups, run DISTINCT queries against the columns you're going to remove, and then add the key values back in.
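As a rough sketch of that lookup step in Access SQL, with invented names (a Category column being split out of the big MainTable into its own Categories table):

-- 1. Make-table query: build the lookup from the distinct values.
SELECT DISTINCT Category INTO Categories
FROM MainTable;

-- (Add an AutoNumber CategoryID to Categories, and a CategoryID column to MainTable,
--  in the table designer.)

-- 2. Write the new key back into the main table.
UPDATE MainTable INNER JOIN Categories
  ON MainTable.Category = Categories.Category
SET MainTable.CategoryID = Categories.CategoryID;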
It sounds like you know what you're doing already? Are you just looking for reassurance that you're on the right track? (Which it looks like you are.)
Good luck though, and enjoy it - it sounds like a good little piece of work.

Regarding the dividing of PL/SQL apps into several units

Here's my application workflow.
I have a ref cursor that is populated with all my employees' IDs. It's just an identification number, really.
But now I want to fetch a lot of information for every employee (as fetched from the ref cursor). It's not simply data, but a lot of computed, derived data too. The sort of derivation that's more easily done via cursors and procedures and so on...
For example, the sum of all the time intervals during which an employee was stationed in Department 78...(that could be just one of the columns for each employee).
So I think I could accomplish this with a really large SQL query (by large, I mean really difficult to maintain, difficult to understand, difficult to optimize, difficult to reuse, refactor, etc.), but that really isn't something I'd do except as a real last resort.
So I'm trying to find ways to use all of PL/SQL's might to split this into as many separate units (perhaps functions or procedures) as possible so as to be able to handle this in a simple and elegant way...
I think that some way to merge datasets (ref cursors, probably) would solve my problems... I've looked at some stuff on the internet so far and some things looked promising, namely pipelining... although I'm not really sure that's what I need.
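From the little I've read, a pipelined table function would look roughly like the sketch below; the type names and the compute_dept78_days helper are invented, it's just meant to show the shape.

-- Hypothetical object/collection types for one row of the result.
create type emp_summary_t as object (
  emp_id       number,
  dept78_days  number
  -- ...the other ~40 derived columns...
);
/
create type emp_summary_tab as table of emp_summary_t;
/
-- Pipelined function: each derived value can come from its own function/procedure.
create or replace function emp_summaries return emp_summary_tab pipelined is
begin
  for r in (select emp_id from emps) loop
    pipe row (emp_summary_t(r.emp_id, compute_dept78_days(r.emp_id)));
  end loop;
  return;
end;
/
-- The ref cursor for the server-side app then becomes a plain query:
-- open :rc for select * from table(emp_summaries);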
To sum up, what I think I need is some way to compose the resulting ref cursor (a really big table: one column for the ID and about 40 other columns, each with a specific bit of information about that ID's owner), using many procedures, which I can then send back to my server-side app and deal with (export to Excel, in this case).
I'm at a loss, really... Hope someone with more experience can help me with this.
FA
I'm not sure if that is what you want, or how often you need to run this thing.
But since it sounds very heavy, maybe you don't need the data to be up to date to the second.
If it's once a day or less, you can create a table with the employee IDs, and use separate MERGE updates to calculate the different fields.
Then the application can get the data from that table.
You can have a job that calculates this every time you need updated data.
You can read about the MERGE command on Wikipedia, and specifically for Oracle in the Oracle documentation. Since you use separate commands, you can of course do it in different procedures if that is convenient.
for example:
begin
  execute immediate 'truncate table temp_table';

  insert into temp_table (emp_id)
  select emp_id from emps;

  -- one MERGE per derived field; the USING query stands in for whatever
  -- computes that field (here it just pulls the name from emps)
  merge into temp_table a
  using (
    select emp_id, name
    from emps
  ) b
  on (a.emp_id = b.emp_id)
  when matched then
    update set a.name = b.name;

  -- ...further MERGEs for the other calculated columns...
end;
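And if you go the scheduled-job route mentioned above, a minimal sketch (assuming the block above is wrapped in a procedure called refresh_emp_summary, which is an invented name):

begin
  dbms_scheduler.create_job(
    job_name        => 'REFRESH_EMP_SUMMARY',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'refresh_emp_summary',
    repeat_interval => 'FREQ=DAILY; BYHOUR=2',  -- once a day at 2 AM; adjust as needed
    enabled         => true);
end;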

Should I use a separate table for repetitive values (varchar)?

I have a table to which 3 rows of data are added per second, and in which I intend to keep around 30M rows (older data will be removed).
I need to add a column: varchar(1000). I can't tell in advance what its content will be, but I do know it will be very repetitive: thousands to millions of rows will have the same value. It is usually around 200 characters long.
Since data is being added using a stored procedure, I see two options:
Option 1: add a varchar(1000) column.
Option 2: create a table (int id, varchar(1000) value). Within the stored procedure, look up whether the value exists in that other table, or create it.
I would expect this other table to have a maximum of 100 values at all times.
I know some of the tradeoffs between these two options, but I have difficulty making up my mind on the question.
Option 1 is heavier, but I get faster inserts. It requires fewer joins, hence queries are simpler.
Option 2 is lighter; inserts take longer, but queries have the potential to be faster. I think I'm closer to normal form, but then I also have a table with a single meaningful column.
From the information I gave you, which option seems better? (You can also come up with another option).
You should also investigate page compression; perhaps you can do the simple thing and still get a small(ish) table. Although, if it is SQL Express as you say, you won't be able to use it, as it is an Enterprise Edition feature.
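If the edition does support it, enabling it is a one-liner; the table name here is a placeholder:

-- Rebuild the table with page compression (requires an edition that supports it).
ALTER TABLE dbo.Measurements REBUILD WITH (DATA_COMPRESSION = PAGE);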
I have repeatedly used your second approach in my projects. Every insert goes through a stored procedure that gets the lookup value's id, or inserts a new one if not found and returns the id. Especially for such large columns as yours seems to be, with plenty of rows yet so few distinct values, the saving in space should trump the additional overhead of the foreign key and the lookup cost in query joins. See also Disk is Cheap... That's not the point!.
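A minimal sketch of that lookup-or-insert procedure, with assumed names (ValueLookup as the lookup table); under heavy concurrency you would also want a unique constraint on Value plus retry or locking logic, which is omitted here:

-- Assumes dbo.ValueLookup (Id int identity primary key, Value varchar(1000)).
CREATE PROCEDURE dbo.GetValueId
    @value varchar(1000),
    @id    int OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT @id = Id
    FROM dbo.ValueLookup
    WHERE Value = @value;

    IF @id IS NULL
    BEGIN
        INSERT INTO dbo.ValueLookup (Value) VALUES (@value);
        SET @id = SCOPE_IDENTITY();
    END
END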

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.
NO, NO, NO!! Now repeat after me: I will not do this, because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. They use indexes to quickly find what you are after. Think of a phone book: how effective is its index? Would it be better to have a different book for each last name?
This will not give you anything performance-wise. Keep a single table, but be sure to index on userID and you'll be able to get the data fast. However, if you split the table up, it becomes impossible (or really, really hard) to get any info that spans multiple users, like searching all users for a certain widget, counting all widgets of a certain type, etc., and you need to have every query built dynamically.
If deleting rows is slow, look into that. How many rows at one time are we talking about: 10, 1,000, 100,000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to 'D' to mark the row as deleted? Can you delete the rows at a later time, with less database activity? Is the delete slow because it is being blocked by other activity? Look into those before you break up the table.
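For illustration, assuming SQL Server and a table shaped like widgets(widgetID, userID, ...), the indexing and batched-delete ideas might look like this:

-- Index to support per-user queries and deletes.
CREATE INDEX IX_widgets_userID ON dbo.widgets (userID);

-- Delete one user's rows in small batches to limit blocking and log growth.
DECLARE @userID int = 42;      -- example value
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (5000) FROM dbo.widgets WHERE userID = @userID;
    SET @rows = @@ROWCOUNT;
END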
No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.
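A rough sketch of what that looks like in Oracle (column names are invented), just to show that the partitioning is declared once in the schema rather than created at run time:

create table widgets (
  widget_id  number        primary key,
  user_id    number        not null,
  name       varchar2(100)
)
partition by hash (user_id) partitions 16;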
Using an index on userID should result in nearly the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...
Is it reasonable for an application to create database tables dynamically as a means of partitioning?
No. (smile)