How to write all records within a fixed rowBatchNum interval to a specified folder - dataframe

I have a large dataframe with the columns below:
df:
rowBatchNum colA colB colC colD
1 A1, B1, C1, D1
1 A2, B2, C2, D2
1 A3, B3, C3, D3
.....
1 A90, B90, C90, D90
....
2 A21, B21, C21, D21
....
2 A290, B290, C290, D290
...
61,000,000, A61,000,000,1, B61,000,000,1, C61,000,000,1, D61,000,000,1
I've tried partitionBy on rowBatchNum, but it outputs one folder per value (1 to 61M). Instead, I want to specify a range, e.g. rowBatchNum=1 to rowBatchNum=100, and write all the data in that range to a single folder, either as parquet or CSV.
Expect:
rowBatchNum1_100/:
part-00001-e469db91-aa99-4b51-84ca-0145ed937f69.c000.snappy
part-00001-e469db91-aa99-4b51-84ca-0145ed937f69.c001.snappy
...
part-00001-e469db91-aa99-4b51-84ca-0145ed937f69.c099.snappy
rowBatchNum101_200/:
part-00002-e469db91-aa99-4b51-84ca-0145ed937f69.c000.snappy
part-00002-e469db91-aa99-4b51-84ca-0145ed937f69.c001.snappy
...
part-00002-e469db91-aa99-4b51-84ca-0145ed937f69.c099.snappy
rowBatchNum60999900_61000000/:
part-01999-e469db91-aa99-4b51-84ca-0145ed937f69.c000.snappy
part-01999-e469db91-aa99-4b51-84ca-0145ed937f69.c001.snappy
...
part-01999-e469db91-aa99-4b51-84ca-0145ed937f69.c099.snappy
How can I do this? Any answers will be super helpful! Thanks
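A minimal PySpark sketch of one way to do this (the bucket size of 100 and the input/output paths are placeholders): derive a range label from rowBatchNum with integer arithmetic and pass that column to partitionBy, so each output folder covers a whole interval of 100 batch numbers.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("input/")   # placeholder input path

bucket_size = 100
# rowBatchNum 1..100 -> "1_100", 101..200 -> "101_200", and so on.
start = (F.floor((F.col("rowBatchNum") - 1) / bucket_size) * bucket_size + 1).cast("long")
end = (start + bucket_size - 1).cast("long")

df = df.withColumn("rowBatchRange",
                   F.concat_ws("_", start.cast("string"), end.cast("string")))

(df.write
   .mode("overwrite")
   .partitionBy("rowBatchRange")    # one folder per 100-batch interval
   .parquet("output/"))             # or use .csv("output/") for CSV

Note that partitionBy names the folders rowBatchRange=1_100, rowBatchRange=101_200, and so on; if the exact folder names from the question are required, you could instead loop over the ranges, filter the dataframe for each one, and write it to an explicitly named path.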

Related

Convert string into very large integer in Hive

I have numeric values stored in a string column named "hst" in a Hive table with 1636626 rows, but to perform arithmetic operations (comparison, difference) I need to convert the hst column into very large integers that preserve all the digits. Here's a sample of my data:
id hst
A1 155836976724851034470045871285935636480
A2 55836976724791053359504802768816491263
B1 55836977111335639658316742086388875264
A3 55836977111354662261430576153184174079
C2 55836926053814078414548020414090575872
C4 55836926053833373226361854480885874687
B2 55836926013959368986746057541906857984
B4 55836926013959368635392801615616409599
C3 55836976724870256360155492454040600576
I tried decimal type:
SELECT cast('55836976724791053359504802768816491263' as DECIMAL(38, 0))
but since the longest value has 39 digits and the decimal type allows at most 38, this doesn't work for the first value in the sample, 155836976724851034470045871285935636480.
Does anyone have an idea how to achieve this?
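Not a full Hive answer, but here is a small Python sketch of the usual workaround (Python only because its integers have unlimited precision): left-pad the values to a fixed width and compare them as strings, which preserves numeric order for non-negative integers. In Hive the same ordering trick can be written as lpad(hst, 39, '0') inside comparisons, while exact differences generally need a UDF or an external step; the width 39 and the column name hst come from the question.

# Demonstration that fixed-width zero-padding makes lexicographic order
# match numeric order; values taken from the sample above.
values = [
    "155836976724851034470045871285935636480",
    "55836976724791053359504802768816491263",
    "55836977111335639658316742086388875264",
]

WIDTH = 39  # number of digits in the longest value

def pad(v):
    return v.rjust(WIDTH, "0")

for a in values:
    for b in values:
        # String comparison on padded values agrees with integer comparison.
        assert (pad(a) < pad(b)) == (int(a) < int(b))

# Differences are exact once the values are parsed as arbitrary-precision ints.
print(int(values[0]) - int(values[1]))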

Turning one-hot encoded table into 2D table of counts

I think I can solve this problem without too much difficulty but suspect that any solution I come up with will be sub-optimal, so am interested in how the real pandas experts would do it; I'm sure I could learn something from that.
I have a table of data that is one-hot encoded, something like:
Index  A1     A2     A3    B1    B2    C1     C2    C3     C4
0      True   False  True  True  True  False  True  False  False
...
So every entry is a Boolean and my columns consist of several groups of categories (the A's, B's and C's).
What I want to create is new DataFrames where I pick any two categories and get a table of counts of how many people are in the pair of categories corresponding to that row/column. So, if I was looking at categories A and B, I would generate a table:
Index  A1   A2   A3   None  Total
B1     x11  x12  x13  x1N   x1T
B2     x21  x22  x23  x2N   x2T
None   xN1  xN2  xN3  xNN   xNT
Total  xT1  xT2  xT3  xTN   xTT
where x11 is the count of rows in the original table that have both A1 and B1 True, x12 is the count of those rows that have A1 and B2 True, and so on.
I'm also interested in the counts of those entries where all the A values were False and/or all the B values were false, which are accounted for in the None columns.
Finally, I would also like the totals of rows where any of the columns in the corresponding category were True. So x1T would be the number of rows where B1 was True and any of A1, A2 or A3 were True, and so on (note that this is not just the sum of x11, x12 and x13 as the categories are not always mutually exclusive; a row could have both A1 True and A2 True for example). xNN is the number of rows that have all false values for A1, A2, A3, B1, B2, and xTT is the number of rows that have at least one true value for any of A1, A2, A3, B1 and B2, so xNN + xTT would equal the total number of rows in the original table.
Thanks
Graham
This is my approach:
import numpy as np
import pandas as pd

def get_table(data, prefix):
    '''
    Get the columns in the respective category
    and assign `none` and `Total` columns.
    '''
    return (data.filter(like=prefix)
                .assign(none=lambda x: (1 - x).prod(axis=1),
                        Total=lambda x: x.any(axis=1))
            )

# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.choice([True, False], size=(5, 9), p=(0.6, 0.4)),
                  columns=[f'{x}{y}' for x in 'ABC' for y in '123'])

# the output: cross-tabulate the B categories against the A categories
df = df.astype(int)
get_table(df, 'B').T @ get_table(df, 'A')
Output:
A1 A2 A3 none Total
B1 3 2 1 0 3
B2 3 2 1 0 3
B3 2 1 1 0 2
none 0 0 1 0 1
Total 4 3 2 0 5
Here I don't understand why (none, Total) must be zero, since none corresponds to all False in B and Total corresponds to some True in A.

Database transactions theory

No book seems to be able to answer this.
Suppose I have two transactions:
T1: Lock A, Lock B, Unlock A
T2: Lock B, Unlock B, Lock A, Unlock A
Q1. How many ways are there to plan these transactions? (Is it just a simple graph and the result is 3! * 4! ?)
Q2. How many of these ways are serializable?
I would really like to know what is the thinking process, how do you get to the answer?
Q1 is 7.
Proof: First of all, we have to merge the sequence 'Lock A', 'Lock B', 'Unlock A' (I denote these items A1, A2, A3) into the sequence 'Lock B', ..., 'Unlock A' (I denote them B1..B4). That means placing 3 items into 5 slots (the gaps around the B's) with repetition allowed, i.e. the binomial coefficient "choose 3 from (5-1+3)", which equals 7!/(3!*4!) = 35.
Next, we have to drop the 'bad' solutions (the ones prevented by the locking conditions): those where A1 stands between B3 and B4 (3 solutions) and those where A2 stands between B1 and B2 (2*4 = 8). We also have to exclude the solutions with B3 between A1 and A3: there are 3*3 = 9 with B3 between A1 and A2, and 6*2 = 12 with B3 between A2 and A3. That gives 35 - 3 - 8 - 9 - 12 = 3. But by the inclusion-exclusion principle we must add back the solutions that violate two rules simultaneously. These can only look like B1 A2 B2 B3 B4, with A1 in either of the two left positions and A3 in either of the two right ones: 4 in total. So the final answer is 35 - 3 - 8 - 9 - 12 + 4 = 7.
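As a sanity check of the counting argument above, here is a short brute-force sketch (Python, written for this answer): it enumerates all 35 interleavings of the two operation sequences and discards exactly the three 'bad' patterns named in the proof, so it should report 35 and 7.

from itertools import combinations

T1 = ["A1", "A2", "A3"]           # Lock A, Lock B, Unlock A
T2 = ["B1", "B2", "B3", "B4"]     # Lock B, Unlock B, Lock A, Unlock A

def interleavings():
    # Choose which 3 of the 7 positions are taken by T1's operations.
    n = len(T1) + len(T2)
    for pos in combinations(range(n), len(T1)):
        t1, t2 = iter(T1), iter(T2)
        yield [next(t1) if i in pos else next(t2) for i in range(n)]

def is_bad(s):
    i = {op: k for k, op in enumerate(s)}
    return (i["B3"] < i["A1"] < i["B4"]       # A1 between B3 and B4
            or i["B1"] < i["A2"] < i["B2"]    # A2 between B1 and B2
            or i["A1"] < i["B3"] < i["A3"])   # B3 between A1 and A3

schedules = list(interleavings())
good = [s for s in schedules if not is_bad(s)]
print(len(schedules), len(good))  # 35 7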

Automatic group creation in R or SQL

I have an R data frame with a column AS_ID as given below:
AS_ID
A8653654
B7653655
C5653650
C5653650
A8653654
D1658645
D1658645
C5653650
C5653650
D1658645
C5653650
E4568640
F796740
A8653654
F796740
E4568640
I am trying to group similar records as A1, A2, A3 and so on. For example, all records having AS_ID "A8653654" should be grouped as A1 and entered into a new column as given below:
AS_ID AS
A8653654 A1
B7653655 A2
C5653650 A3
C5653650 A3
A8653654 A1
D1658645 A4
D1658645 A4
C5653650 A3
C5653650 A3
D1658645 A4
C5653650 A3
E4568640 A5
F796740 A6
A8653654 A1
F796740 A6
E4568640 A5
I am fine with either R or Oracle code, since I can write SQL code in R too. Any help will be highly appreciated. My data is a bit more dynamic than the sample above, so generic code would help more.
If you've read those values into an R data.frame, they may already be of class "factor" (before R 4.0, read.table converted strings to factors by default). If not, you can convert them to a factor. Each factor value is automatically assigned a unique integer code. Here's a sample data.frame
dd<-read.table(text=c("A8653654", "B7653655", "C5653650", "C5653650", "A8653654",
"D1658645", "D1658645", "C5653650", "C5653650", "D1658645", "C5653650",
"E4568640", "F796740", "A8653654", "F796740", "E4568640"), col.names="AS_ID")
Observe that
class(dd$AS_ID)
# [1] "factor"
If it was character, you could do
dd$AS_ID <- factor(dd$AS_ID)
To get the unique IDs, just use as.numeric and then paste an A in front of that
dd <- cbind(dd, AS=paste0("A",as.numeric(dd$AS_ID)))
and that gives
#> head(dd)
AS_ID AS
1 A8653654 A1
2 B7653655 A2
3 C5653650 A3
4 C5653650 A3
5 A8653654 A1
6 D1658645 A4
You can get a group identifier in Oracle using dense_rank():
select AS_ID, dense_rank() over (order by AS_ID)
from table t;
If you want an 'A' in front, then concatenate it:
select AS_ID, 'A' || dense_rank() over (order by AS_ID)
from table t;

Redis and linked hashes

Hi everyone,
I would like to ask the community for help in finding a way to cache our huge flat table by splitting it into multiple hashes, or otherwise.
A sample of the table, as an example of the structure:
A1 B1 C1 D1 E1 X1
A1 B1 C1 D1 E1 X2
A7 B5 C2 D1 E2 X3
A8 B1 C1 D1 E2 X4
A1 B6 C3 D2 E2 X5
A1 B1 C1 D2 E1 X6
This is our denormalized data; we have no way to normalize it.
So currently we must perform 'group by' operations to get the required items; for instance, to get all D* we run data.GroupBy(A1).GroupBy(B1).GroupBy(C1), and it takes a lot of time.
As a temporary workaround, we create composite string keys:
A1 -> 'list of lines begin A1'
A1:B1 -> 'list of lines begin A1:B1'
A1:B1:C1 -> 'list of lines begin A1:B1:C1'
...
as a cache of results of grouping operations.
The question is: how can this be stored efficiently?
The denormalized data is estimated at around 10M records, and since there are 6 columns as in my example, that would be about 60M entries in the hash. So I'm looking for an approach to look up values in O(N), if that's possible.
Thanks.
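One possible layout, sketched below with redis-py (the connection settings, key names, and row-ID scheme are assumptions for illustration): store each row once in a hash and index every composite prefix A1, A1:B1, A1:B1:C1 as a Redis set of row IDs. A lookup such as A1:B1:C1 then becomes a single SMEMBERS call plus fetching the matching rows, instead of repeated GroupBy passes.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

rows = [
    ("A1", "B1", "C1", "D1", "E1", "X1"),
    ("A1", "B1", "C1", "D1", "E1", "X2"),
    ("A7", "B5", "C2", "D1", "E2", "X3"),
    ("A8", "B1", "C1", "D1", "E2", "X4"),
    ("A1", "B6", "C3", "D2", "E2", "X5"),
    ("A1", "B1", "C1", "D2", "E1", "X6"),
]

pipe = r.pipeline()
for row_id, row in enumerate(rows):
    # Store the full row once, under its own key.
    pipe.hset(f"row:{row_id}", mapping=dict(zip("ABCDEF", row)))
    # Index every prefix A1, A1:B1, A1:B1:C1 as a set of row IDs.
    for depth in range(1, 4):
        prefix = ":".join(row[:depth])
        pipe.sadd(f"idx:{prefix}", row_id)
pipe.execute()

# Lookup: all rows whose first three columns are A1, B1, C1.
ids = r.smembers("idx:A1:B1:C1")
matches = [r.hgetall(f"row:{i}") for i in ids]
print(matches)

Each lookup then costs one key access plus time proportional to the size of the result set; the trade-off is the extra memory for the prefix sets (roughly one set entry per row per indexed prefix depth).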