Split a value from a data.frame and create an additional row to store its component

In R, I have a data frame called df such as the following:
A B C D
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5 - 7
a4 b4 c4 2.5
I want to split the value in the third row of column D on the dash and create another row for the second value, retaining the other values of that row.
So I want this:
A B C D
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5
a3 b3 c3 7
a4 b4 c4 2.5
Any idea how this can be achieved?
Ideally, I would also like an extra column specifying whether each split value is a minimum or a maximum.
So this:
A B C D E
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5 min
a3 b3 c3 7 max
a4 b4 c4 2.5
Thanks.

One option would be to use sub to paste 'min' and 'max' into the 'D' column where a - is found, and then use cSplit to split the 'D' column.
library(splitstackshape)
# tag the two halves with ',min' and ',max' before splitting
df1$D <- sub('(\\d+) - (\\d+)', '\\1,min - \\2,max', df1$D)
# long-split on ' - ', then split the resulting 'value,label' pairs on ','
res <- cSplit(cSplit(df1, 'D', ' - ', 'long'), 'D', ',')[is.na(D_2), D_2 := '']
setnames(res, 4:5, LETTERS[4:5])
res
# A B C D E
#1: a1 b1 c1 2.5
#2: a2 b2 c2 3.5
#3: a3 b3 c3 5.0 min
#4: a3 b3 c3 7.0 max
#5: a4 b4 c4 2.5

Here's a dplyrish way:
DF %>%
  group_by(A, B, C) %>%
  do(data.frame(D = as.numeric(strsplit(as.character(.$D), " - ")[[1]]))) %>%
  mutate(E = if (n() == 2) c("min", "max") else "")
A B C D E
(fctr) (fctr) (fctr) (dbl) (chr)
1 a1 b1 c1 2.5
2 a2 b2 c2 3.5
3 a3 b3 c3 5.0 min
4 a3 b3 c3 7.0 max
5 a4 b4 c4 2.5
Dplyr has a policy against expanding rows, as far as I can tell, so the ugly
do(data.frame(... .$ ...))
construct is required. If you are open to data.table, it's arguably simpler here:
library(data.table)
setDT(DF)[, {
  D = as.numeric(strsplit(as.character(D), " - ")[[1]])
  list(D = D, E = if (length(D) == 2) c("min", "max") else "")
}, by = .(A, B, C)]
A B C D E
1: a1 b1 c1 2.5
2: a2 b2 c2 3.5
3: a3 b3 c3 5.0 min
4: a3 b3 c3 7.0 max
5: a4 b4 c4 2.5

We can use tidyr::separate_rows. I altered the input to include a negative value to make it more general:
df <- read.table(header=TRUE,stringsAsFactors=FALSE,text=
"A B C D
a1 b1 c1 -2.5
a2 b2 c2 3.5
a3 b3 c3 '5 - 7'
a4 b4 c4 2.5")
library(dplyr)
library(tidyr)
df %>%
  mutate(E = "", E = replace(E, grepl("[^^]-", D), "min - max")) %>%
  separate_rows(D, E, sep = "[^^]-", convert = TRUE)
#> A B C D E
#> 1 a1 b1 c1 -2.5
#> 2 a2 b2 c2 3.5
#> 3 a3 b3 c3 5.0 min
#> 4 a3 b3 c3 7.0 max
#> 5 a4 b4 c4 2.5
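The same split-and-label idea also translates outside R. Here is a rough pandas sketch of the logic (a cross-language illustration only; the data frame construction below is assumed from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a1", "a2", "a3", "a4"],
                   "B": ["b1", "b2", "b3", "b4"],
                   "C": ["c1", "c2", "c3", "c4"],
                   "D": ["2.5", "3.5", "5 - 7", "2.5"]})

# split 'D' into lists and emit one row per list element
out = df.assign(D=df["D"].str.split(" - ")).explode("D")
# rows sharing an original index came from the same split cell:
# the first piece is the minimum, the second the maximum
grp = out.groupby(level=0)
out["E"] = np.where(grp["D"].transform("size") == 1, "",
                    np.where(grp.cumcount() == 0, "min", "max"))
out["D"] = out["D"].astype(float)
print(out.reset_index(drop=True))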

Related

Create a new column for table B based on information from table A

I have this problem. I want to create a report that keeps everything in table B, but adds another column from table A (QtyRecv).
Condition: If RunningTotalQtyUsed (from table B) < QtyRecv, take that QtyRecv for the new column.
For example, for item A1: 55 (RunningTotalQtyUsed) < 100 (QtyRecv) -> ExpectedQtyRecv = 100.
But if RunningTotalQtyUsed exceeds QtyRecv, we take the next QtyRecv to cover the used quantity.
For example, 101 > 100 -> ExpectedQtyRecv = 138.
149 (RunningTotalQtyUsed) < 100 + 138 (cumulative QtyRecv) -> get 138.
250 < 100 + 138 + 121 -> get 121.
The same logic applies to item A2.
If total QtyRecv = 6 + 4 + 10 = 20, but RunningTotalQtyUsed = 31 -> result should be 99999 to notify an error that QtyRecv can't cover QtyUsed.
Table A:
Item QtyRecv
A1 100
A1 138
A1 121
A2 6
A2 4
A2 10
Table B:
Item RunningTotalQtyUsed
A1 55
A1 101
A1 149
A1 250
A2 1
A2 5
A2 9
A2 19
A2 31
Expected result:
Item RunningTotalQtyUsed ExpectedQtyRecv
A1 55 100
A1 101 138
A1 149 138
A1 250 121
A2 1 6
A2 5 6
A2 9 4
A2 19 10
A2 31 99999
What I attempted:
SELECT b.*
FROM tableB b LEFT JOIN tableA a
ON b.item = a.item
item RunningTotalQtyUsed
A1 55
A1 55
A1 55
A1 101
A1 101
A1 101
A1 149
A1 149
A1 149
A1 250
A1 250
A1 250
A2 1
A2 1
A2 1
A2 5
A2 5
A2 5
A2 9
A2 9
A2 9
A2 19
A2 19
A2 19
A2 31
A2 31
A2 31
It doesn't keep the same number of rows as table B. How can I keep table B as-is but add the ExpectedQtyRecv from table A? Thank you so much for all the help!
SELECT B_TOTAL.ITEM, B_TOTAL.SUM_RunningTotalQtyUsed, A_TOTAL.SUM_QtyRecv
FROM
(
  SELECT B.ITEM, SUM(B.RunningTotalQtyUsed) AS SUM_RunningTotalQtyUsed
  FROM TABLE_B AS B
  GROUP BY B.ITEM
) B_TOTAL
LEFT JOIN
(
  SELECT A.ITEM, SUM(A.QtyRecv) AS SUM_QtyRecv
  FROM TABLE_A AS A
  GROUP BY A.ITEM
) A_TOTAL ON B_TOTAL.ITEM = A_TOTAL.ITEM
I can't be sure, but maybe you need something like the above?
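Note that the aggregate join above only compares per-item totals. The rule in the question needs, for each table B row, the first table A receipt whose running total of QtyRecv covers the usage; in SQL this is commonly built from a windowed SUM over table A plus a filtered join. As a sanity check of that rule, here is a rough pandas sketch on the sample data (it assumes receipts are consumed in the order listed):

import pandas as pd

table_a = pd.DataFrame({"Item": ["A1"] * 3 + ["A2"] * 3,
                        "QtyRecv": [100, 138, 121, 6, 4, 10]})
table_b = pd.DataFrame({"Item": ["A1"] * 4 + ["A2"] * 5,
                        "RunningTotalQtyUsed": [55, 101, 149, 250, 1, 5, 9, 19, 31]})

# running total of received quantity per item, in row order
table_a["CumRecv"] = table_a.groupby("Item")["QtyRecv"].cumsum()

def expected_recv(row):
    recv = table_a[table_a["Item"] == row["Item"]]
    # first receipt whose running total covers the usage; 99999 if none can
    covering = recv[recv["CumRecv"] >= row["RunningTotalQtyUsed"]]
    return covering["QtyRecv"].iloc[0] if len(covering) else 99999

table_b["ExpectedQtyRecv"] = table_b.apply(expected_recv, axis=1)
print(table_b)  # matches the expected result, including 99999 for A2 / 31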

SUM based on another column's sign in Oracle

I have two tables.
Table A has only last-level id (leaf_id) information, along with sum_data.
id sum_data
A5 40
B3 -50
C2 90
Table B has hierarchy information for the ids and the sign to be applied to each id.
id Z has three children: A2, B2 and C2.
id parent_id leaf_id level sign
Z NULL A5 1 +
A2 Z A5 2 +
A3 A2 A5 3 -
A4 A3 A5 4 +
A5 A4 A5 5 +
Z NULL B3 1 +
B2 Z B3 2 -
B3 B2 B3 3 +
Z NULL C2 1 +
C2 Z C2 2 ignore
I need to calculate the sum_data of Z based on the sign operator; the calculation goes like this:
id parent_id leaf_id level sign sum_data
Z NULL A5 1 + -40 --(rolled up sum_data from A2 * sign = -40 * +)
A2 Z A5 2 + -40 --(rolled up sum_data from A3 * sign = -40 * +)
A3 A2 A5 3 - -40 --(rolled up sum_data from A4 * sign = 40 * -)
A4 A3 A5 4 + +40 --(rolled up sum_data from A5)
A5 A4 A5 5 + 40 --got this from Table A
Z NULL B3 1 + 50 --(rolled up sum_data from B2 * sign = 50 * +)
B2 Z B3 2 - 50 --(rolled up sum_data from B3 * sign = -50 * -)
B3 B2 B3 3 + -50 --got this from Table A
Z NULL C2 1 + 0
C2 Z C2 2 ignore 0 --(90 comes from Table A; as the sign is ignore, it is 0)
My output should be
id sum_data
Z 10 ( -40 from A5 hierarchy + 50 from B3 hierarchy + 0 from C2 hierarchy)
Can you please help me derive the sum_data in Oracle SQL?
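Reading the worked example, each level's value is its child's value multiplied by that level's own sign, so a chain's contribution to Z is the leaf's sum_data times the product of the signs along its path, and an 'ignore' sign zeroes the chain. In Oracle this would typically be done by walking the hierarchy with CONNECT BY or a recursive CTE; as a sketch of just the arithmetic, in Python:

import pandas as pd

# leaf-level sums (table A) and one row per (node, leaf) path entry (table B)
table_a = pd.DataFrame({"id": ["A5", "B3", "C2"], "sum_data": [40, -50, 90]})
table_b = pd.DataFrame({
    "id":      ["Z", "A2", "A3", "A4", "A5", "Z", "B2", "B3", "Z", "C2"],
    "leaf_id": ["A5", "A5", "A5", "A5", "A5", "B3", "B3", "B3", "C2", "C2"],
    "sign":    ["+", "+", "-", "+", "+", "+", "-", "+", "+", "ignore"],
})

leaf_sum = dict(zip(table_a["id"], table_a["sum_data"]))

total = 0
for leaf, chain in table_b.groupby("leaf_id"):
    if (chain["sign"] == "ignore").any():
        continue  # an 'ignore' sign zeroes the whole chain (the C2 case)
    # each '-' on the path flips the sign of the leaf value once
    total += leaf_sum[leaf] * (-1) ** (chain["sign"] == "-").sum()

print(total)  # -40 + 50 + 0 = 10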

Need to find number of matches in Python

I want to compare 2 data columns and see if I can find a match. When I get a match, I want to show how many occurrences of that match were found. For instance:
df1
Col_A Col_B
A0 B0
A1 B1
A2 B2
df2
Col_A Col_B
A0 B0
A1 B1
A0 B0
A4 B4
I want to check Col_A of df2 against Col_A in df1. If I find a match, I should include the row in my output table, together with a count of how many times it has matched so far. The output should be:
Col_A Col_B Result
A0 B0 1
A1 B1 1
A0 B0 2
How to achieve this in Python?
Use merge and cumcount:
df2.assign(Result=df2.groupby([*df2]).cumcount() + 1).merge(df1)
Col_A Col_B Result
0 A0 B0 1
1 A0 B0 2
2 A1 B1 1
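To see what the two steps do, here is the same answer unrolled on the sample data (a sketch using the column names from the question):

import pandas as pd

df1 = pd.DataFrame({"Col_A": ["A0", "A1", "A2"], "Col_B": ["B0", "B1", "B2"]})
df2 = pd.DataFrame({"Col_A": ["A0", "A1", "A0", "A4"], "Col_B": ["B0", "B1", "B0", "B4"]})

# cumcount numbers repeated (Col_A, Col_B) rows within df2 starting at 0,
# so adding 1 turns it into an occurrence counter
numbered = df2.assign(Result=df2.groupby([*df2]).cumcount() + 1)
# the inner merge on the shared columns then drops rows absent from df1 (A4, B4)
print(numbered.merge(df1))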

Compare Pandas dataframes and add column

I have two dataframes, as below:
df1:
A
A1
A2
A3
A1
A2
A3
A4
df2:
A C
A1 C1
A2 C2
A3 C3
A4 C4
Each value of column 'A' has a corresponding value defined in column 'C' of df2.
I want to add a new column 'B' to df1, taking its value from column 'C' of df2.
The final df1 should look like this
df1
A B
A1 C1
A2 C2
A3 C3
A1 C1
A2 C2
A3 C3
A4 C4
I can loop over df2 and add the value to df1, but it's time consuming as the data is huge.
for index, row in df2.iterrows():
    df1.loc[df1.A.isin([row['A']]), 'B'] = row['C']
Can someone help me understand how I can solve this without looping over df2?
Thanks
You can use map by Series:
df1['B'] = df1.A.map(df2.set_index('A')['C'])
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
It is the same as map by dict:
d = df2.set_index('A')['C'].to_dict()
print (d)
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'}
df1['B'] = df1.A.map(d)
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
Timings:
len(df1)=7:
In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
1000 loops, best of 3: 1.73 ms per loop
In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 873 µs per loop
len(df1)=70k:
In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
100 loops, best of 3: 12.8 ms per loop
In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
100 loops, best of 3: 6.05 ms per loop
IIUC you can just merge and rename the col
df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
In [103]:
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']})
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C3','C4']})
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
merged
Out[103]:
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
Based on the searchsorted method (which assumes 'A' in df2 is sorted), here are three approaches with different indexing schemes -
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True)
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
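A quick check of the last variant on the question's data (again assuming 'A' in df2 is sorted, which searchsorted requires):

import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4']})
df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'], 'C': ['C1', 'C2', 'C3', 'C4']})

# searchsorted returns, for each df1.A, the position of its match in the
# sorted df2.A; .values strips df2's index so assignment aligns on df1's index
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
print(df1)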

Generating binary variables in Pig

I am a newbie to the world of Pig and I need to implement the following scenario.
Problem: the input to the Pig script is an arbitrary relation, say the table below.
A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3
We have to generate binary columns based on B and C, so my output will look something like this:
output
A B C B.b1 B.b2 C.c1 C.c2 C.c3
a1 b1 c1 1 0 1 0 0
a2 b2 c2 0 1 0 1 0
a1 b1 c3 1 0 0 0 1
Can someone let me know how to achieve this in Pig? I know this can be easily achieved using an R script, but my requirement is to do it via Pig.
Your help will be highly appreciated.
Can you try this?
input
a1 b1 c1
a2 b2 c2
a1 b1 c3
PigScript:
X = LOAD 'input' USING PigStorage() AS (A:chararray,B:chararray,C:chararray);
Y = FOREACH X GENERATE A,B,C,
((B=='b1')?1:0) AS Bb1,
((B=='b2')?1:0) AS Bb2,
((C=='c1')?1:0) AS Cc1,
((C=='c2')?1:0) AS Cc2,
((C=='c3')?1:0) AS Cc3;
DUMP Y;
Output:
(a1,b1,c1,1,0,1,0,0)
(a2,b2,c2,0,1,0,1,0)
(a1,b1,c3,1,0,0,0,1)
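Note that the ternaries above hard-code the observed values, so new values of B or C require extra lines. Purely for comparison (the requirement here is Pig), the same one-hot encoding in pandas picks up the values automatically:

import pandas as pd

df = pd.DataFrame({"A": ["a1", "a2", "a1"],
                   "B": ["b1", "b2", "b1"],
                   "C": ["c1", "c2", "c3"]})

# one indicator column per observed value of B and C, named B.b1, C.c3, ...
dummies = pd.get_dummies(df[["B", "C"]], prefix_sep=".").astype(int)
print(pd.concat([df, dummies], axis=1))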