Using string matching like grepl in a dbplyr pipeline - sql

dbplyr is very handy as it converts dplyr code into SQL. This works really well, except when it doesn't. For example, I am trying to subset rows by partially matching a string against values in a column. With the exception of Postgres, it appears as though this isn't yet implemented in dbplyr. Am I missing some {stringr} function that would accomplish the below?
library(dplyr, warn.conflicts = FALSE)
library(DBI)
data("flights", package = "nycflights13")
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", flights)
## works in dplyr
flights %>%
  filter(grepl("N", tailnum))
#> # A tibble: 334,264 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # … with 334,254 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## no function translation to grepl
tbl(con, "flights") %>%
  filter(grepl("N", tailnum)) %>%
  collect()
#> Error: no such function: grepl
## Also not implemented for stringr
library(stringr)
tbl(con, "flights") %>%
  filter(str_detect(tailnum, "N")) %>%
  collect()
#> Error: str_detect() is not available in this SQL variant
dbDisconnect(con)

We may use %like% (run this while the connection is still open, i.e. before dbDisconnect(con)):
tbl(con, "flights") %>%
  dplyr::filter(tailnum %like% "%N%") %>%
  collect()
Output:
# A tibble: 334,264 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116
6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150
7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158
8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53
9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140
10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138
# … with 334,254 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Related

Left_join in R using modified column

I want to join a dataframe using a function within the 'by' argument of left_join in R.
Savings_YC <- Savings_YC %>%
  left_join(WAP_FULL, by = c("Client", "CompanyCode", "Material", "STOCK_UOM"))
What I want is to join Savings_YC to WAP_FULL on LEFT(Savings_YC$CompanyCode, 4) = WAP_FULL$CompanyCode. As an example this is what it would be in SQL:
SELECT * FROM Savings_YC syc
LEFT JOIN WAP_FULL w ON syc.Client = w.Client and LEFT(syc.CompanyCode, 4) = w.CompanyCode AND syc.Material = w.Material ...
I am assuming I will want to use substr(x, 1, 4) for the LEFT function, but how do I include that within the left_join function? I have tried the following and I'm getting an error.
Savings_YC <- Savings_YC %>% # Not sure this will work
  left_join(WAP_FULL, by = c("Client", substr("Savings_YC.CompanyCode",1,4) = "WAP_FULL.CompanyCode", "Material", "STOCK_UOM"))
Three options. Sample data:
dmnds <- head(ggplot2::diamonds, n=10)
other <- data.frame(cut = c("Very", "Prem"), newcol = 1:2)
If you can create a new column on the LHS,
dmnds %>%
  mutate(cut2 = substring(cut, 1, 4)) %>%
  left_join(other, by = c(cut2 = "cut"))
# # A tibble: 10 x 12
# carat cut color clarity depth table price x y z cut2 newcol
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <int>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 Idea NA
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Prem 2
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 Good NA
# 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Prem 2
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 Good NA
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very 1
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very 1
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very 1
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 Fair NA
# 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very 1
Using fuzzyjoin:
fuzzyjoin::regex_left_join(dmnds, other, by = "cut")
# # A tibble: 10 x 12
# carat cut.x color clarity depth table price x y z cut.y newcol
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <int>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 NA NA
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 Prem 2
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 NA NA
# 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 Prem 2
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 NA NA
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 Very 1
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Very 1
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 Very 1
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 NA NA
# 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 Very 1
If you can use sqldf then
### pattern-based
sqldf::sqldf("
select d.*, o.newcol
from dmnds d
left join other o on d.cut like (o.cut || '%')")
# carat cut color clarity depth table price x y z newcol
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 NA
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 2
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 NA
# 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 2
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 NA
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 1
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 1
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 1
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 NA
# 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39 1
### substring-based
sqldf::sqldf("
select d.*, o.newcol
from dmnds d
left join other o on substring(d.cut, 1, 4) = o.cut")

python reshape every nth column

I have just started with Python and need some help. I have a dataframe which looks like "Input Data" below. What I want is to stack it by every nth column. In other words, I want a dataframe where every nth column is appended below the first m rows.
Input Data
id  city  Col 1  Col 2  Col 3  Col 4  Col 5  Col 6  Col 7  Col 8  Col 9  Col 10
 1     1     51    155    255    355    455    666    777    955     55     553
 2     0     52    155    255    355    455    666    777    595     55     553
 3   NAN     53    155    255    355    455    666    777    559     55     535
 4     1     54    155    255    355    545    666    777    559     55     535
 5     7     55    155    255    355    455    666    777    955     55     535
Required Output
id  city  Col 1  Col 2  Col 3  Col 4  Col 5
 1     1     51    155    255    355    455
 2     0     52    155    255    355    455
 3   NAN     53    155    255    355    455
 4     1     54    155    255    355    545
 5     7     55    155    255    355    455
 1     1    666    777    955     55    553
 2     0    666    777    595     55    553
 3   NAN    666    777    559     55    535
 4     1    666    777    559     55    535
 5     7    666    777    955     55    535
I am trying to do the opposite of this (the input and required output are shown above).
In [74]: column_list = [df.columns[k:k+5] for k in range(2, len(df.columns), 5)]
In [75]: column_list
Out[75]:
[Index(['Col 1', 'Col 2', 'Col 3', 'Col 4', 'Col 5'], dtype='object'),
Index(['Col 6', 'Col 7', 'Col 8', 'Col 9', 'Col 10'], dtype='object')]
In [76]: dfs = [df[['id', 'city'] + columns.tolist()].rename(columns=dict(zip(columns, range(5)))) for columns in column_list]
In [77]: dfs
Out[77]:
[ id city 0 1 2 3 4
0 1 1.0 51 155 255 355 455
1 2 0.0 52 155 255 355 455
2 3 NaN 53 155 255 355 455
3 4 1.0 54 155 255 355 545
4 5 7.0 55 155 255 355 455,
id city 0 1 2 3 4
0 1 1.0 666 777 955 55 553
1 2 0.0 666 777 595 55 553
2 3 NaN 666 777 559 55 535
3 4 1.0 666 777 559 55 535
4 5 7.0 666 777 955 55 535]
In [78]: pd.concat(dfs, ignore_index=True)
Out[78]:
id city 0 1 2 3 4
0 1 1.0 51 155 255 355 455
1 2 0.0 52 155 255 355 455
2 3 NaN 53 155 255 355 455
3 4 1.0 54 155 255 355 545
4 5 7.0 55 155 255 355 455
5 1 1.0 666 777 955 55 553
6 2 0.0 666 777 595 55 553
7 3 NaN 666 777 559 55 535
8 4 1.0 666 777 559 55 535
9 5 7.0 666 777 955 55 535
To explain:
First, generate the required columns for each slice.
pd.concat requires the column names of all the dataframes in the list to be the same, hence the rename in rename(columns=dict(zip(columns, range(5)))): we are just renaming the sliced columns to 0, 1, 2, 3, 4 (illustrated below).
The last step is to concat everything.
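For concreteness, this is what that rename mapping looks like for the first slice (a small illustration using the column_list computed above, not an extra step of the method):
columns = column_list[0]
dict(zip(columns, range(5)))
# {'Col 1': 0, 'Col 2': 1, 'Col 3': 2, 'Col 4': 3, 'Col 5': 4}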
EDIT
Based on the comments by OP:
Sorry @Asish M., but how to add a column for the dataset number in each dataset of dfs? E.g. here we split our dataset into 2, so I need one column which says, for the first 1 to 5 ids, 'first' (or 1), then again for the next 1 to 5 ids, 'second' (or 2) in the output. I hope it makes sense.
dfs = [df[['id', 'city'] + columns.tolist()].assign(split_group=idx).rename(columns=dict(zip(columns, range(5)))) for idx, columns in enumerate(column_list)]
df.assign(split_group=idx) creates a column 'split_group' with value = idx; you get idx from enumerating column_list.
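If the group label should be 1-based ('first' = 1, 'second' = 2) rather than the 0-based default of enumerate, a small tweak of the line above (my variation, not the original answer) is:
dfs = [df[['id', 'city'] + columns.tolist()]
         .assign(split_group=idx + 1)
         .rename(columns=dict(zip(columns, range(5))))
       for idx, columns in enumerate(column_list)]
pd.concat(dfs, ignore_index=True)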

How to calculate shift and rolling sum over missing dates without adding them to data frame in Pandas?

I have a data set with dates, customers and income:
Date  Customer  Income
0 1/1/2018 A 53
1 2/1/2018 A 36
2 3/1/2018 A 53
3 5/1/2018 A 89
4 6/1/2018 A 84
5 8/1/2018 A 84
6 9/1/2018 A 54
7 10/1/2018 A 19
8 11/1/2018 A 44
9 12/1/2018 A 80
10 1/1/2018 B 24
11 2/1/2018 B 100
12 9/1/2018 B 40
13 10/1/2018 B 47
14 12/1/2018 B 10
15 2/1/2019 B 5
For both customers there are missing dates, as they purchased nothing in some months.
I want to add, per customer, the income of the previous month and also the rolling sum of income over the last year.
Meaning, if there's a missing month, I'll see '0' in the shift(1) column of the following month that has income. And I'll see a rolling sum over 12 months even if there weren't 12 observations.
This is the expected result:
Date  Customer  Income  S(1)  R(12)
0 1/1/2018 A 53 0 53
1 2/1/2018 A 36 53 89
2 3/1/2018 A 53 36 142
3 5/1/2018 A 89 0 231
4 6/1/2018 A 84 89 315
5 8/1/2018 A 84 0 399
6 9/1/2018 A 54 84 453
7 10/1/2018 A 19 54 472
8 11/1/2018 A 44 19 516
9 12/1/2018 A 80 44 596
10 1/1/2018 B 24 0 24
11 2/1/2018 B 100 24 124
12 9/1/2018 B 40 0 164
13 10/1/2018 B 47 40 211
14 12/1/2018 B 10 0 221
15 2/1/2019 B 5 0 102
So far, I've added the rows with missing dates using stack and unstack, but with many dates and customers this explodes the data to millions of rows, crashing the kernel, with most rows being 0's.
You can use .shift, but add logic so that if the gap to the previous row is > 31 days, S(1) is set to 0.
The rolling 12-month calculation requires figuring out the "Rolling Date" and doing a somewhat complicated list comprehension to decide whether or not to include a value. Then, take the sum of each list per row.
import pandas as pd
import numpy as np

df['Date'] = pd.to_datetime(df['Date']).dt.date
df['S(1)'] = df.groupby('Customer')['Income'].transform('shift').fillna(0)
s = (df['Date'] - df['Date'].shift())/np.timedelta64(1, '31D') <= 1
df['S(1)'] = df['S(1)'].where(s,0).astype(int)
df['Rolling Date'] = (df['Date'] - pd.Timedelta('1Y'))
df['R(12)'] = df.apply(lambda d: sum([z for x, y, z in
                                      zip(df['Customer'], df['Date'], df['Income'])
                                      if y > d['Rolling Date']
                                      if y <= d['Date']
                                      if x == d['Customer']]), axis=1)
df = df.drop('Rolling Date', axis=1)
df
Out[1]:
Date Customer Income S(1) R(12)
0 2018-01-01 A 53 0 53
1 2018-02-01 A 36 53 89
2 2018-03-01 A 53 36 142
3 2018-05-01 A 89 0 231
4 2018-06-01 A 84 89 315
5 2018-08-01 A 84 0 399
6 2018-09-01 A 54 84 453
7 2018-10-01 A 19 54 472
8 2018-11-01 A 44 19 516
9 2018-12-01 A 80 44 596
10 2018-01-01 B 24 0 24
11 2018-02-01 B 100 24 124
12 2018-09-01 B 40 0 164
13 2018-10-01 B 47 40 211
14 2018-12-01 B 10 0 221
15 2019-02-01 B 5 0 102
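As an aside, a per-customer time-based rolling window gets to the same two columns with less machinery. This is my own sketch on a made-up miniature of the data (column names and the 31-day / 365-day cut-offs are assumptions to adjust), not the method above:
import pandas as pd

# miniature of the question's layout (values are illustrative only)
df = pd.DataFrame({
    'Date': pd.to_datetime(['1/1/2018', '2/1/2018', '5/1/2018', '1/1/2018', '9/1/2018']),
    'Customer': ['A', 'A', 'A', 'B', 'B'],
    'Income': [53, 36, 89, 24, 40],
})
df = df.sort_values(['Customer', 'Date']).reset_index(drop=True)

# S(1): previous row's income within each customer, zeroed out when the gap
# to that previous row is more than ~one month
prev = df.groupby('Customer')['Income'].shift().fillna(0)
gap = df.groupby('Customer')['Date'].diff()
df['S(1)'] = prev.where(gap <= pd.Timedelta(days=31), 0).astype(int)

# R(12): 365-day rolling sum per customer on a DatetimeIndex; groupby().rolling()
# returns rows in the same (Customer, Date) order as the sorted frame, so the
# values can be assigned back positionally
df['R(12)'] = (
    df.set_index('Date')
      .groupby('Customer')['Income']
      .rolling('365D')
      .sum()
      .to_numpy()
      .astype(int)
)
On the question's full data this should reproduce the S(1) and R(12) columns shown above, since the 365-day window and the 31-day gap rule mirror the answer's "Rolling Date" and shift logic.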

create new column from divided columns over iteration

I am working with the following code:
import pandas as pd

url = 'https://raw.githubusercontent.com/dothemathonthatone/maps/master/fertility.csv'
df = pd.read_csv(url)
year regional_schlüssel Aus15 Deu15 Aus16 Deu16 Aus17 Deu17 Aus18 Deu18 ... aus36 aus37 aus38 aus39 aus40 aus41 aus42 aus43 aus44 aus45
0 2000 5111000 0 4 8 25 20 45 56 89 ... 935 862 746 732 792 660 687 663 623 722
1 2000 5113000 1 1 4 14 13 33 19 48 ... 614 602 498 461 521 470 393 411 397 400
2 2000 5114000 0 11 0 5 2 13 7 20 ... 317 278 265 235 259 228 204 173 213 192
3 2000 5116000 0 2 2 7 3 28 13 26 ... 264 217 206 207 197 177 171 146 181 169
4 2000 5117000 0 0 3 1 2 4 4 7 ... 135 129 118 116 128 148 89 110 124 83
I would like to create a new set of columns fertility_deu15, ..., fertility_deu45 and fertility_aus15, ..., fertility_aus45 such that fertility_aus15 = aus15 / Aus15 and fertility_deu15 = deu15 / Deu15, and so on for each matching pair ausi/Ausi and deui/Deui with i in 15-45.
I'm not sure what is up with that data, but we need to fix it to make it numeric. I'll end up doing that while filtering:
numerator = df.filter(regex=r'^[a-z]+\d+$')  # Lower case ones
numerator = numerator.apply(pd.to_numeric, errors='coerce')  # Fix numbers
denominator = df.filter(regex=r'^[A-Z][a-z]+\d+$').rename(columns=str.lower)
denominator = denominator.apply(pd.to_numeric, errors='coerce')
numerator.div(denominator).add_prefix('fertility_')
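If you then want those ratios attached back onto the original frame, one small extra step on top of the last line (the fertility name here is just mine):
fertility = numerator.div(denominator).add_prefix('fertility_')
df = pd.concat([df, fertility], axis=1)  # same row index, so the new columns line up row by row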

How to aggregate multiple columns - Pandas

I have this df:
ID Date XXX 123_Var 456_Var 789_Var 123_P 456_P 789_P
A 07/16/2019 1 987 551 313 22 12 94
A 07/16/2019 9 135 748 403 92 40 41
A 07/18/2019 8 376 938 825 14 69 96
A 07/18/2019 5 259 176 674 52 75 72
B 07/16/2019 9 690 304 948 56 14 78
B 07/16/2019 8 819 185 699 33 81 83
B 07/18/2019 1 580 210 847 51 64 87
I want to group the df by ID and Date, aggregate the XXX column by the maximum value, and aggregate 123_Var, 456_Var, 789_Var columns by the minimum value.
* Note: The df contains many of these columns. The naming pattern is: {some int}_Var.
This is the current code I've started to write:
df = (df.groupby(['ID','Date'], as_index=False)
        .agg({'XXX':'max', list(df.filter(regex='_Var')): 'min'}))
Expected result:
ID Date XXX 123_Var 456_Var 789_Var
A 07/16/2019 9 135 551 313
A 07/18/2019 8 259 176 674
B 07/16/2019 9 690 185 699
B 07/18/2019 1 580 210 847
Create the dictionary dynamically with dict.fromkeys, then merge it with the {'XXX':'max'} dict and pass it to GroupBy.agg:
d = dict.fromkeys(df.filter(regex='_Var').columns, 'min')
df = df.groupby(['ID','Date'], as_index=False).agg({**{'XXX':'max'}, **d})
print (df)
ID Date XXX 123_Var 456_Var 789_Var
0 A 07/16/2019 9 135 551 313
1 A 07/18/2019 8 259 176 674
2 B 07/16/2019 9 690 185 699
3 B 07/18/2019 1 580 210 847
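The same mapping can also be built in one expression with a dict comprehension; this is just a stylistic variant of the answer above, not a different method:
agg_map = {'XXX': 'max', **{c: 'min' for c in df.filter(regex='_Var').columns}}
df.groupby(['ID', 'Date'], as_index=False).agg(agg_map)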