Automatic group creation in R or SQL

I have an R data frame with a column AS_ID as given below:
AS_ID
A8653654
B7653655
C5653650
C5653650
A8653654
D1658645
D1658645
C5653650
C5653650
D1658645
C5653650
E4568640
F796740
A8653654
F796740
E4568640
I am trying to group similar records as A1, A2, A3, and so on. For example, all records having AS_ID "A8653654" should be grouped as A1 and entered into a new column as given below:
AS_ID AS
A8653654 A1
B7653655 A2
C5653650 A3
C5653650 A3
A8653654 A1
D1658645 A4
D1658645 A4
C5653650 A3
C5653650 A3
D1658645 A4
C5653650 A3
E4568640 A5
F796740 A6
A8653654 A1
F796740 A6
E4568640 A5
I am fine with either R or Oracle code, since I can run SQL code from R too. Any help will be highly appreciated. My data is a bit more dynamic than the sample data above, so generic code will help more.

If you've read those values into an R data.frame, it's likely they are already of class "factor" (with read.table this was the default before R 4.0; since then stringsAsFactors defaults to FALSE). If not, you can convert them to a factor. Each factor value is automatically assigned a unique integer ID already. Here's a sample data.frame:
dd<-read.table(text=c("A8653654", "B7653655", "C5653650", "C5653650", "A8653654",
"D1658645", "D1658645", "C5653650", "C5653650", "D1658645", "C5653650",
"E4568640", "F796740", "A8653654", "F796740", "E4568640"), col.names="AS_ID")
Observe that
class(dd$AS_ID)
# [1] "factor"
If it was character, you could do
dd$AS_ID <- factor(dd$AS_ID)
To get the unique IDs, just use as.numeric and then paste an A in front of that
dd <- cbind(dd, AS=paste0("A",as.numeric(dd$AS_ID)))
and that gives
#> head(dd)
AS_ID AS
1 A8653654 A1
2 B7653655 A2
3 C5653650 A3
4 C5653650 A3
5 A8653654 A1
6 D1658645 A4

You can get a group identifier in Oracle using dense_rank():
select AS_ID, dense_rank() over (order by AS_ID)
from table t;
If you want an 'A' in front, then concatenate it:
select AS_ID, 'A' || dense_rank() over (order by AS_ID)
from table t;

Related

How to put all the records of a fixed interval into a specified file

I have a large dataframe with the columns below:
df:
rowBatchNum colA colB colC colD
1 A1, B1, C1, D1
1 A2, B2, C2, D2
1 A3, B3, C3, D3
.....
1 A90, B90, C90, D90
....
2 A21, B21, C21, D21
....
2 A290, B290, C290, D290
...
61,000,000, A61,000,000,1, B61,000,000,1, C61,000,000,1, D61,000,000,1
I've tried partitionBy on rowBatchNum, but that outputs one folder per value (1-61M). I want to specify a range, say rowBatchNum=1 to rowBatchNum=100, and output all the data in that range to one folder, either as Parquet or CSV.
Expect:
rowBatchNum1_100/:
part-00001-e469db91-aa99-4b51-84ca-0145ed937f69.c000.snappy
part-00001-e469db91-aa99-4b51-84ca-0145ed937f69.c001.snappy
...
part-00001-e469db91-aa99-4b51-84ca-0145ed937f69.c099.snappy
rowBatchNum101_200/:
part-00002-e469db91-aa99-4b51-84ca-0145ed937f69.c000.snappy
part-00002-e469db91-aa99-4b51-84ca-0145ed937f69.c001.snappy
...
part-00002-e469db91-aa99-4b51-84ca-0145ed937f69.c099.snappy
rowBatchNum60999900_61000000/:
part-01999-e469db91-aa99-4b51-84ca-0145ed937f69.c000.snappy
part-01999-e469db91-aa99-4b51-84ca-0145ed937f69.c001.snappy
...
part-01999-e469db91-aa99-4b51-84ca-0145ed937f69.c099.snappy
How can I do this? Any answers will be super helpful! Thanks
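One possible direction (just a sketch, not a tested solution): derive a bucket label from rowBatchNum and partition the output by that label, so each output folder covers 100 consecutive batch numbers. This assumes PySpark, a DataFrame df with an integer rowBatchNum column, and illustrative paths; note that Spark will name the folders rowBatchNumRange=1_100 and so on, which is close to, but not exactly, the layout sketched above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("range-partition").getOrCreate()
df = spark.read.parquet("/path/to/input")  # assumed source

BUCKET = 100  # rowBatchNum values per output folder

# Map rowBatchNum 1..100 to "1_100", 101..200 to "101_200", and so on.
start = F.floor((F.col("rowBatchNum") - 1) / BUCKET) * BUCKET + 1
bucketed = df.withColumn(
    "rowBatchNumRange",
    F.concat_ws("_", start.cast("string"), (start + BUCKET - 1).cast("string")),
)

# One folder per bucket instead of one folder per rowBatchNum value.
bucketed.write.mode("overwrite").partitionBy("rowBatchNumRange").parquet("/path/to/output")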

Plotting CDF for ranking distribution

I have a pandas dataframe that looks like this; it is generated with the groupby command and then sorted by number of users, giving the user count for each top feature combination.
count_28day, Feature1, Feature2, Feature3
5000 a1 b1 c1
1000 a2 b2 c2
50 a3 b3 c3
I'm trying to plot the CDF of the user distribution. I don't need to know the features; I just want to show the top X feature combinations that cover 90% of total users.
I'm doing this in a very hacky way.
topx = table.count_28day.sort_values(ascending=False).cumsum()/sum(table.count_28day)
ser_cdf = pd.Series(topx.tolist()[1:100], index=pd.Series(range(1,100)))
ser_cdf.plot(drawstyle='steps')
Is there a more elegant way to do this using histogram or ecdf or something?
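A somewhat less hacky sketch (assuming matplotlib is available and table is the dataframe above; the names and the 90% threshold are just illustrative): sort the counts, take the normalized cumulative sum, and plot it directly as a step function, reading off how many top combinations are needed to reach 90% of users.
import matplotlib.pyplot as plt

# Assumed input: a dataframe `table` with a count_28day column, as above.
counts = table["count_28day"].sort_values(ascending=False).reset_index(drop=True)
cdf = counts.cumsum() / counts.sum()

top_x = (cdf < 0.90).sum() + 1  # number of top combinations covering 90% of users

ax = cdf.plot(drawstyle="steps-post")
ax.axhline(0.90, linestyle="--")
ax.axvline(top_x, linestyle="--")
ax.set_xlabel("top feature combinations (ranked)")
ax.set_ylabel("cumulative share of users")
plt.show()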

Improve efficiency of PIG Script

DATASET:
I have a data set (data.txt) in (ID, Category) format as given below:
01,X
02,Y
03,X
04,Y
05,X
06,X
07,Z
08,Z
09,X
10,Z
Objective:
The objective is to find out which category has the maximum number of IDs without using UDF.
One Approach:
I have tried multiple times and concluded that this can be achieved with the following set of Pig statements:
A1 = LOAD 'data.txt' USING PigStorage(',') AS (ID:int , Category:chararray);
A2 = DISTINCT A1;
A3 = GROUP A2 BY Category;
A4 = FOREACH A3 GENERATE group AS Category, COUNT(A2.ID) AS Number;
A5 = GROUP A4 ALL;
A6 = FOREACH A5 GENERATE MAX(A4.Number);
A7 = FILTER A4 by Number == A6.$0;
A8 = FOREACH A7 GENERATE Category;
DUMP A8;
Request:
Although these statements give the desired result, I am not convinced of their efficiency.
As I am new to Pig, I am not sure whether there are any built-in functions that can return the rows corresponding to the minimum or maximum value in a table.
My request is to know whether this can be achieved in fewer steps.
Many Thanks
After grouping, sort the groups by count in descending order and take the topmost record.
A1 = LOAD 'data.txt' USING PigStorage(',') AS (ID:int , Category:chararray);
A2 = DISTINCT A1;
A3 = GROUP A2 BY Category;
A4 = FOREACH A3 GENERATE group AS Category, COUNT(A2.ID) AS Number;
A5 = ORDER A4 BY Number DESC;
A6 = LIMIT A5 1;
A7 = FOREACH A6 GENERATE Category;
DUMP A7;

Database transactions theory

No book seems to be able to answer this.
Suppose I have two transactions:
T1: Lock A, Lock B, Unlock A
T2: Lock B, Unlock B, Lock A, Unlock A
Q1. How many ways are there to plan these transactions? (Is it just a simple graph and the result is 3! * 4! ?)
Q2. How many of these ways are serializable?
I would really like to know what is the thinking process, how do you get to the answer?
Q1 is 7.
Proof: First of all, we have to merge the sequence 'Lock A', 'Lock B', 'Unlock A' (denote the items A1, A2, A3) into the sequence 'Lock B', ..., 'Unlock A' (denote them B1..B4), keeping each transaction's internal order. That means putting 3 items into the 5 places around the B's, with repetitions allowed, i.e. the binomial coefficient "choose 3 from (5-1+3)" = 7!/(3!*4!) = 35.
Next, we have to drop the 'bad' interleavings, the ones prevented by the locking conditions:
- A1 between B3 and B4 (T2 holds A): 3 interleavings;
- A2 between B1 and B2 (T2 holds B): 2*4 = 8 interleavings;
- B3 between A1 and A3 (T1 holds A): 3*3 = 9 with B3 between A1 and A2, plus 6*2 = 12 with B3 between A2 and A3.
That leaves 35 - 3 - 8 - 9 - 12 = 3. But by the inclusion-exclusion principle we must also add back the interleavings that violate two rules simultaneously. They can only look like B1 A2 B2 B3 B4, with A1 in either of the two leftmost positions and A3 in either of the two rightmost ones: 4 in total. So the final answer is 35 - 3 - 8 - 9 - 12 + 4 = 7.
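For anyone who wants to sanity-check that count, here is a small brute-force enumeration (a sketch in Python; the operation labels mirror the A1..A3, B1..B4 notation above). It generates all 35 interleavings, drops exactly the three kinds of 'bad' schedules excluded in the proof, and prints 7.
from itertools import combinations

T1 = ["A1", "A2", "A3"]        # Lock A, Lock B, Unlock A
T2 = ["B1", "B2", "B3", "B4"]  # Lock B, Unlock B, Lock A, Unlock A

def interleavings():
    # Each schedule keeps each transaction's internal order, so it is
    # determined by which 3 of the 7 slots belong to T1: C(7,3) = 35 schedules.
    n = len(T1) + len(T2)
    for slots in combinations(range(n), len(T1)):
        t1, t2 = iter(T1), iter(T2)
        yield [next(t1) if i in slots else next(t2) for i in range(n)]

def strictly_between(sched, x, lo, hi):
    return sched.index(lo) < sched.index(x) < sched.index(hi)

valid = 0
for s in interleavings():
    if strictly_between(s, "A1", "B3", "B4"):  # T1 locks A while T2 holds A
        continue
    if strictly_between(s, "A2", "B1", "B2"):  # T1 locks B while T2 holds B
        continue
    if strictly_between(s, "B3", "A1", "A3"):  # T2 locks A while T1 holds A
        continue
    valid += 1

print(valid)  # prints 7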

Redis and linked hashes

I would like to ask the community for help in finding a way to cache our huge flat table, by splitting it into multiple hashes or otherwise.
A sample of the table, as an example of the structure:
A1 B1 C1 D1 E1 X1
A1 B1 C1 D1 E1 X2
A7 B5 C2 D1 E2 X3
A8 B1 C1 D1 E2 X4
A1 B6 C3 D2 E2 X5
A1 B1 C1 D2 E1 X6
This is our denormalized data; we have no way to normalize it.
So currently we must perform a 'group by' to get the required items; for instance, to get all D* we run data.GroupBy(A1).GroupBy(B1).GroupBy(C1), and it takes a lot of time.
As a temporary workaround we create composite string keys:
A1 -> 'list of lines begin A1'
A1:B1 -> 'list of lines begin A1:B1'
A1:B1:C1 -> 'list of lines begin A1:B1:C1'
...
as a cache of results of grouping operations.
The question is: how can this be stored efficiently?
The estimated number of lines in the denormalized data is around 10M records, and with 6 columns as in my example that becomes about 60M entries in the hash. So I'm looking for an approach to look up values in O(N), if possible.
Thanks.
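One way to lay this out in Redis (a sketch using redis-py; the key names and sample rows are illustrative, not a drop-in implementation): store each line once in a hash keyed by its ID, and keep a set of line IDs per composite prefix. A lookup is then one SMEMBERS on the prefix plus one HGETALL per matching line, i.e. proportional to the size of the result.
import redis

r = redis.Redis()  # assumed local Redis instance

# Illustrative rows in (A, B, C, D, E, row_id) form, as in the sample table.
rows = [
    ("A1", "B1", "C1", "D1", "E1", "X1"),
    ("A1", "B1", "C1", "D1", "E1", "X2"),
    ("A7", "B5", "C2", "D1", "E2", "X3"),
]

pipe = r.pipeline()
for a, b, c, d, e, row_id in rows:
    # Store the row itself once.
    pipe.hset(f"row:{row_id}", mapping={"A": a, "B": b, "C": c, "D": d, "E": e})
    # Index the row ID under every prefix: A1, A1:B1, A1:B1:C1, ...
    prefix = ""
    for part in (a, b, c, d, e):
        prefix = part if not prefix else f"{prefix}:{part}"
        pipe.sadd(f"idx:{prefix}", row_id)
pipe.execute()

# Lookup: all rows under the A1:B1:C1 prefix.
ids = r.smembers("idx:A1:B1:C1")
result = [r.hgetall(f"row:{rid.decode()}") for rid in ids]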