how would you generate fake account numbers for a company project with SQL?

I have a big data set in Redshift which my company will share with university students to analyze. I need to mask the real customer account numbers.
I've looked at the random function but there's one catch: some customers are repeated, so I need to retain that for the analysis to be useful. Also, with a random number there's a small possibility you would repeat account numbers, right?
How would you achieve this? Generate a new_random_id. It must be unique from all others in the table (there are over 4 million in the table), but must be the same for those rows where the actual account ID is the same.
+-------------------+---------------+---------+
| actual_account_id | new_random_id | status  |
+-------------------+---------------+---------+
|               100 |           123 | new     |
|               100 |           123 | upgrade |
|               200 |           249 | new     |
|               300 |           401 | upgrade |
+-------------------+---------------+---------+
I realize I could first generate a mapping table like this below, and then join to the main table, but it still doesn't solve the problem of possibly repeating new random IDs.
select distinct actual_account_id, cast(random()*1000000 as int) as new_random_id
into mapping_table
from t1;

I would create a mapping table using window functions:
select actual_account_id,
       row_number() over (order by random()) as fake_account_id
from t1
group by actual_account_id;
Because the rows are numbered in random order, this yields a meaningless sequential number, and ROW_NUMBER() guarantees it is unique across accounts.
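To tie this together with the asker's plan, a minimal sketch of materializing the mapping and joining it back (table and column names follow the question; CREATE TABLE AS is used here in place of the asker's SELECT INTO, which Redshift also accepts):
-- Build the mapping once: one unique, meaningless id per distinct account.
create table mapping_table as
select actual_account_id,
       row_number() over (order by random()) as new_random_id
from t1
group by actual_account_id;

-- Share this result instead of the raw table: real ids never leave it.
select m.new_random_id, t.status
from t1 t
join mapping_table m
  on m.actual_account_id = t.actual_account_id;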
Redshift might be a bit slow on the ROW_NUMBER() with no PARTITION BY. If performance is an issue, you can seed a random bucket first and combine it with a partitioned row number, something like this:
select actual_account_id,
       tmp * 100 + row_number() over (partition by tmp order by random()) as fake_account_number
from (select actual_account_id,
             cast(random()*1000000 as int) as tmp
      from t1
      group by actual_account_id
     ) t;
This stays unique as long as no tmp bucket holds 100 or more accounts; with roughly 4 million accounts spread over a million buckets, a collision is vanishingly unlikely, and the sanity check below would catch it.
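Whichever variant you use, it may be worth a quick sanity check that the finished mapping really is one-to-one before handing the data over (this assumes the mapping_table built above; it should return zero rows):
-- Any fake id assigned to more than one real account is a collision.
select new_random_id, count(*) as n
from mapping_table
group by new_random_id
having count(*) > 1;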

Related

Google Big Query : Window Function Row Wise Cumulative Sum Across Columns

I am looking to calculate cumulative sum across columns in Google Big Query.
Assume there are five columns (NAME,A,B,C,D) with two rows of integers, for example:
NAME | A | B | C | D
---------------------
Bob  | 1 | 2 | 3 | 4
Carl | 5 | 6 | 7 | 8
I am looking for a windowing function or UDF to calculate the cumulative sum across the columns of each row to generate this output:
NAME | A | B  | C  | D
----------------------
Bob  | 1 | 3  | 6  | 10
Carl | 5 | 11 | 18 | 26
Any thoughts or suggestions greatly appreciated!
I think there are a number of reasonable workarounds for your requirements, mostly in the area of designing your table better. It all depends on how you input your data and, most importantly, how you then consume it.
Still, staying with the requirements as presented: the below is not exactly the output you expect in your question, but it might be useful as an example (legacy BigQuery SQL):
SELECT name, GROUP_CONCAT(STRING(cum)) AS all
FROM (
  SELECT name,
         SUM(INTEGER(num)) OVER(PARTITION BY name
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum
  FROM (
    SELECT name, SPLIT(all) AS num
    FROM (
      SELECT name,
             CONCAT(STRING(a),',',STRING(b),',',STRING(c),',',STRING(d)) AS all
      FROM yourtable
    )
  )
)
GROUP BY name
Output is:
name  all
Bob   1,3,6,10
Carl  5,11,18,26
Depending on how you then consume this data, it can still work for you.
Note that you now avoid writing something like col1 + col2 + .. + col89 + col90, but you still need to mention each column explicitly, just once.
In case you have the "luxury" of implementing your requirements outside of the GBQ UI, in some client, you can use the BigQuery API to programmatically acquire the table schema, build your logic/query on the fly, and then execute it.
Take a look at below APIs to start with:
To get table schema - https://cloud.google.com/bigquery/docs/reference/v2/tables/get
To issue query job - https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert
There's no need for a UDF:
SELECT name, a, a+b, a+b+c, a+b+c+d
FROM tab
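If the output must keep the original column shape, the running sums can simply be aliased back to the column names (a sketch against the same placeholder tab table):
SELECT name,
       a,
       a + b AS b,
       a + b + c AS c,
       a + b + c + d AS d
FROM tab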

How to find most-correlated X for each Y?

I have a query I can run, which produces rows like this:
ID | category | property_A | property_B
----+----------+------------+------------
1 | X | tall | old
2 | X | short | old
3 | X | tall | old
4 | X | short | young
5 | Y | short | old
6 | Y | short | old
7 | Y | tall | old
I'd like to find, for each category and property_B, what is the most common property_A, and put that into another table somewhere for later use. So here I'd like to know that in category X, old people tend to be tall and young people short, while in category Y, old people tend to be short.
The domain of each column is finite, and not too large - there are something like 200 categories, and a dozen or so values each for property_A and property_B. So I could write a dumb script on my client, which queries the database 200*12*12 times doing a limited query, but that seems like it must be the wrong approach, as well as wasteful given that it's expensive to produce this table and then throw most of it away.
But I don't even know what words to look up to find the right approach: "sql find correlated rows" shows how to find integer correlations, but I'm not interested in integers. So what do I do instead?
You can readily do this with aggregation and the window/analytic functions. You want the top ranked one by count. The following returns the most popular A:
select category, property_b, property_a as MostPopularA
from (select category, property_b, property_a, count(*) as cnt,
             row_number() over (partition by category, property_b order by count(*) desc) as seqnum
      from tbl t
      group by category, property_b, property_a
     ) t
where seqnum = 1;
If you want to get all values when there is a tie, then use dense_rank() instead of row_number().
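For reference, a sketch of that tie-friendly variant; it is identical to the query above except for the ranking function:
select category, property_b, property_a as MostPopularA
from (select category, property_b, property_a,
             dense_rank() over (partition by category, property_b order by count(*) desc) as seqnum
      from tbl t
      group by category, property_b, property_a
     ) t
where seqnum = 1;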
I suggest a combination of GROUP BY and DISTINCT ON, which is faster / simpler / more elegant in Postgres:
SELECT DISTINCT ON (category, property_b)
       category, property_b, property_a, count(*) AS ct
FROM   tbl
GROUP  BY category, property_b, property_a
ORDER  BY category, property_b, ct DESC;
Returns:
category | property_b | property_a | ct
---------+------------+------------+----
X        | old        | tall       |  2
X        | young      | short      |  1
Y        | old        | short      |  2
If multiple peers tie for the most common value, only one arbitrary pick is returned.
This works in a single query level without subquery, since aggregation (GROUP BY) is applied before the DISTINCT step. Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
SQL Fiddle.
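Since the asker wants to "put that into another table somewhere for later use", either query can feed a CREATE TABLE AS directly (a minimal sketch in Postgres; most_common_a is a hypothetical table name):
CREATE TABLE most_common_a AS
SELECT DISTINCT ON (category, property_b)
       category, property_b, property_a, count(*) AS ct
FROM   tbl
GROUP  BY category, property_b, property_a
ORDER  BY category, property_b, ct DESC;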

Remove redundant SQL price cost records

I have a table costhistory with fields id, invid, vendorid, cost, timestamp, chdeleted. It looks like it was populated by a trigger every time a vendor updated their price list.
It has redundant records - since it was populated regardless of whether price changed or not since last record.
Example:
id | invid | vendorid | cost | timestamp | chdeleted
 1 |   123 |        1 |  100 |    1/1/01 |         0
 2 |   123 |        1 |  100 |    1/2/01 |         0
 3 |   123 |        1 |  100 |    1/3/01 |         0
 4 |   123 |        1 |  500 |    1/4/01 |         0
 5 |   123 |        1 |  500 |    1/5/01 |         0
 6 |   123 |        1 |  100 |    1/6/01 |         0
I would want to remove records with ID 2,3,5 since they do not reflect any change since the last price update.
I'm sure it can be done, though it might take several steps.
Just to be clear, this table has swelled to 100gb and contains 600M rows. I am confident that a proper cleanup will take this table's size down by 90% - 95%.
Thanks!
The approach you take will vary depending on the database you are using. For SQL Server 2005+, the following query should give you the records you want to remove:
select id
from (
    select id,
           rank() over (partition by invid, vendorid, cost
                        order by timestamp) as rnk
    from costhistory
) tmp
where rnk > 1
You can then delete them like this:
delete from costhistory
where id in (
    select id
    from (
        select id,
               rank() over (partition by invid, vendorid, cost
                            order by timestamp) as rnk
        from costhistory
    ) tmp
    where rnk > 1
)
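One caveat with this approach: because it partitions by cost alone, a price that changes and later changes back (id 6 in the example) falls into the same partition as the earlier rows and would be flagged for deletion too, which contradicts the stated requirement. A gaps-and-islands variant that compares each row only with the immediately preceding one avoids this (a sketch; LAG() needs SQL Server 2012+):
delete from costhistory
where id in (
    select id
    from (
        select id, cost,
               lag(cost) over (partition by invid, vendorid
                               order by timestamp) as prev_cost
        from costhistory
    ) t
    where cost = prev_cost  -- unchanged since the previous record: redundant
)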
I would suggest that you recreate the table using a group by query. Also, I assume that the "id" column is not used in any other tables. If it is used elsewhere, you will need to fix those tables as well.
Deleting such a large quantity of records is likely to take a long, long time.
The query would look like:
insert into newversionoftable(invid, vendorid, cost, timestamp, chdeleted)
    select invid, vendorid, cost, min(timestamp), chdeleted
    from costhistory
    group by invid, vendorid, cost, chdeleted
(timestamp has to stay out of the GROUP BY, otherwise every row is its own group and nothing is removed; note this variant also collapses a price that changes and later changes back.)
If you do opt for a delete, I would suggest:
(1) Fix the code first, so no duplicates are going in.
(2) Determine the duplicate ids and place them in a separate table.
(3) Delete in batches (see the sketch after the query below).
To find the duplicate ids, use something like:
select *
from (select id,
             row_number() over (partition by invid, vendorid, cost, chdeleted
                                order by timestamp) as seqnum
      from costhistory
     ) t
where seqnum > 1
If you want to keep the most recent version instead, then use "timestamp desc" in the order by clause.
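For step (3), a batched delete keeps the transaction log and locking manageable on a 600M-row table (SQL Server syntax; duplicate_ids is the hypothetical staging table from step (2)):
-- Run repeatedly until it affects zero rows.
delete top (100000) from costhistory
where id in (select id from duplicate_ids);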

Access 2007 select first value of query results

I am running into a rather annoying thingy in Access (2007) and I am not sure if this is a feature or if I am asking for the impossible.
Although the actual database structure is more complex, my problem boils down to this:
I have a table with data about Units for specific years. This data comes from different sources and might overlap.
Unit | IYR  | X1  | Source |
----------------------------
A    | 2009 |  55 |      1 |
A    | 2010 |  80 |      1 |
A    | 2010 | 101 |      2 |
A    | 2010 | 150 |      3 |
A    | 2011 |  90 |      1 |
...
Now I would like the user to select certain sources, order them by priority and then extract one data value for each year.
For example, if the user selects source 1, 2 and 3 and orders them by (3, 1, 2), then I would like the following result:
Unit | IYR  | X1  | Source |
----------------------------
A    | 2009 |  55 |      1 |
A    | 2010 | 150 |      3 |
A    | 2011 |  90 |      1 |
I am able to order the initial table, based on a specific order. I do this with the following query
SELECT Unit, IYR, X1, Source
FROM TestTable
WHERE Source In (1,2,3)
ORDER BY Unit, IYR,
IIf(Source=3,1,IIf(Source=1,2,IIf(Source=2,3,4)))
This gives me the following intermediate result:
Unit | IYR  | X1  | Source |
----------------------------
A    | 2009 |  55 |      1 |
A    | 2010 | 150 |      3 |
A    | 2010 |  80 |      1 |
A    | 2010 | 101 |      2 |
A    | 2011 |  90 |      1 |
Next step is to only get the first value of each year. I was thinking to use the following query:
SELECT X.Unit, X.IYR, first(X.X1) as FirstX1
FROM (...) AS X
GROUP BY X.Unit, X.IYR
Where (…) is the above query.
Now Access goes bananas. Whatever order I give to the intermediate results, the result of this query is:
Unit | IYR  | X1 |
------------------
A    | 2009 | 55 |
A    | 2010 | 80 |
A    | 2011 | 90 |
In other words, for year 2010 it shows the value of source 1 instead of 3. It seems that Access does not care about the ordering of the nested query when it applies the FIRST() function and sticks to the original ordering of the data.
Is this a feature of Access or is there a different way of achieving the desired results?
Ps: Next step would be to use a self join to add the source column to the results again, but I first need to resolve above problem.
Rather than use FIRST(), it may be better to determine the MIN priority and then join back, e.g.:
SELECT
t.UNIT,
t.IYR,
t.X1,
t.Source ,
t.PrioritySource
FROM
(SELECT
Unit,
IYR,
X1,
Source,
SWITCH ( [Source]=3, 1,
[Source]=1, 2,
[Source]=2, 3) as PrioritySource
FROM
TestTable
WHERE
Source In (1,2,3)
) as t
INNER JOIN
(SELECT
Unit,
IYR,
MIN(SWITCH ( [Source]=3, 1,
[Source]=1, 2,
[Source]=2, 3)) as PrioritySource
FROM
TestTable
WHERE
Source In (1,2,3)
GROUP BY
Unit,
IYR ) as MinPriority
ON t.Unit = MinPriority.Unit and
t.IYR = MinPriority.IYR and
t.PrioritySource = MinPriority.PrioritySource
which will produce this result (Note I include Source and priority source for demonstration purposes only)
UNIT | IYR  | X1  | Source | PrioritySource
-------------------------------------------
A    | 2009 |  55 |      1 |              2
A    | 2010 | 150 |      3 |              1
A    | 2011 |  90 |      1 |              2
Note the first subquery is there to handle the fact that Access won't let you join on a Switch() expression.
Yes, FIRST() does use an arbitrary ordering. From the Access Help:
These functions return the value of a specified field in the first or
last record, respectively, of the result set returned by a query. If
the query does not include an ORDER BY clause, the values returned by
these functions will be arbitrary because records are usually returned
in no particular order.
I don't know whether FROM (...) AS X means you are using an inline ORDER BY (assuming that is actually possible) or a saved query ('stored Query object') here, but either way I assume the ORDER BY is being disregarded, because an ORDER BY should only apply to the final result.
The alternative is to use MIN() (or possibly MAX()).
This is the most concise way I have found to write such queries in Access that require pulling back all columns that correspond to the first row in a group of records that are ordered in a particular way.
First, I added a UniqueID to your table. In this case, it's just an AutoNumber field. You may already have a unique value in your table, in which case you can use that.
This will choose the row with a Source 3 first, then Source 1, then Source 2. If there is a tie, it picks the one with the higher X1 value. If there is a further tie, it is broken by the UniqueID value:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=
(SELECT TOP 1 [UniqueID] FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, UniqueID)
This yields:
Unit  IYR   X1   Source  UniqueID
A     2009  55   1       1
A     2010  150  3       4
A     2011  90   1       5
I recommend (1) you create an index on the IYR field -- this will dramatically increase your performance for this type of query, and (2) if you have a lot (>~100K) records, this isn't the best choice. I find it works quite well for tables in the 1-70K range. For larger datasets, I like to use my GroupIncrement function to partition each group (similar to SQL Server's ROW_NUMBER() OVER statement).
The Choose() function is a VBA function and may not be clear here: Choose([Source],2,3,1) returns the Source-th item from the list (2,3,1), i.e. priority 2 for Source 1, priority 3 for Source 2, and priority 1 for Source 3. In your case, it sounds like there is some interactivity required. For that, you could create a second table called "Choices", like so:
Rank  Choice
1     3
2     1
3     2
Then, you could substitute the following:
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.UniqueID=(SELECT TOP 1 [UniqueID] FROM
[TestTable] t2 INNER JOIN [Choices] c
ON t2.Source=c.Choice
WHERE t.IYR=t2.IYR ORDER BY c.[Rank], t2.X1 DESC, t2.UniqueID);
Indexing Source on TestTable and Choice on the Choices table may be helpful here, too, depending on the number of choices required.
Q:
Can you get this to work without the need for surrogate key? For
example what if the unique key is the composite of
{Unit,IYR,X1,Source}
A:
If you have a compound key, you can do it like this-- however I think that if you have a large dataset, it will totally kill the performance of the query. It may help to index all four columns, but I can't say for sure because I don't regularly use this method.
SELECT t.* INTO [Chosen Rows]
FROM TestTable AS t
WHERE t.Unit & t.IYR & t.X1 & t.Source =
(SELECT TOP 1 Unit & IYR & X1 & Source FROM [TestTable]
WHERE t.IYR=IYR ORDER BY Choose([Source],2,3,1), X1 DESC, Unit, IYR)
In certain cases, you may have to coalesce some of the individual parts of the key as follows (though Access generally will coalesce values automatically):
t.Unit & CStr(t.IYR) & CStr(t.X1) & CStr(t.Source)
You could also use a query in your FROM statements instead of the actual table. The query itself would build a composite of the four fields used in the key, and then you'd use the new key name in the WHERE clause of the top SELECT statement, and in the SELECT TOP 1 [key] of the subquery.
In general, though, I will either: (a) create a new table with an AutoNumber field, (b) add an AutoNumber field, (c) add an integer and populate it with a unique number using VBA - this is useful when you get a MaxLocks error when trying to add an AutoNumber, or (d) use an already indexed unique key.
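For what it's worth, option (b) can also be done in Access DDL rather than through the table designer (a sketch; COUNTER is Access SQL's AutoNumber type, and TestTable/UniqueID are the names used above):
ALTER TABLE TestTable ADD COLUMN UniqueID COUNTER;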

Select ID given the list of members

I have a table for the link/relationship between two other tables: a table of customers and a table of groups. A group is made up of one or more customers. The link table looks like:
APP_ID | GROUP_ID | CUSTOMER_ID
     1 |        1 |         123
     1 |        1 |         124
     1 |        1 |         125
     1 |        2 |         123
     1 |        2 |         125
     2 |        3 |         123
     3 |        1 |         123
     3 |        1 |         124
     3 |        1 |         125
I now need, given a list of customer IDs, to get the group ID for that list of customer IDs. A group ID may not be unique: the same group ID will contain the same list of customer IDs, but that group may exist in more than one app_id.
I'm thinking that
SELECT APP_ID, GROUP_ID, COUNT(CUSTOMER_ID) AS COUNT
FROM GROUP_CUST_REL
WHERE CUSTOMER_ID IN ( <list of ids> )
GROUP BY APP_ID, GROUP_ID
HAVING COUNT(CUSTOMER_ID) = <number of ids in list>
will return all of the (app_id, group_id) pairs whose groups contain every customer ID in the given list. Note that this also matches supersets: for a list of (123, 125), group 1 (which contains 123, 124 and 125) would be returned along with group 2, so an exact-match check is needed if that matters (see the sketch at the end of this section).
I will then have to link with the app table to use its created timestamp to identify the most recent application that the group existed in so that I can then pull the correct/most up to date info from the group table.
Does anyone have any thoughts on whether this is the most efficient way to do this? If there is another quicker/cleaner way I'd appreciate your thoughts.
This smells like a division:
Division sample
Other related stack overflow question
Taking a look at the provided links, you'll see the solution to similar issues from relational algebra's point of view; it doesn't seem to be quicker, though it's arguably cleaner.
I didn't look at your solution at first, and when I tackled this myself I turned out to have solved it the same way you did.
Actually, I thought this:
<number of ids in list>
could be turned into something like this (so that you don't need the extra parameter):
select count(*) from (<list of ids>) as t
But clearly, I was wrong. I'd stay with your current solution if I were you.
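As a follow-up to the superset caveat noted in the question: if the group's member set must match the list exactly, compare the count of matching members against the group's total size as well (a minimal sketch using the list (123, 125); it assumes no duplicate customer rows per group):
SELECT APP_ID, GROUP_ID
FROM GROUP_CUST_REL
GROUP BY APP_ID, GROUP_ID
HAVING SUM(CASE WHEN CUSTOMER_ID IN (123, 125) THEN 1 ELSE 0 END) = 2  -- every listed id is present
   AND COUNT(*) = 2                                                    -- and no members outside the list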