SQL query - restriction doesn't work as I'd expect

I have the following query:
SELECT
(substr(REGEXP_SUBSTR(qot_sup_xpressfeed, ',[^,]+,'), 2, length(REGEXP_SUBSTR(qot_sup_xpressfeed, ',[^,]+,')) - 2)),
xex.xex_xexc_k
FROM qot, sec, exc, ctr, xcx, xex, xexc, xcrr, crr
WHERE qot_source = 'X'
AND qot_id = 2029557521
AND nvl(qot_real_exc_id, qot_exc_id) = exc_id
AND exc_ctr_id = ctr_id
AND qot_sec_id = sec_id
AND nvl(qot_real_exc_id, qot_exc_id) = xex_exc_id(+)
AND qot_crr_id = xcx_crr_id(+)
AND xex_xexc_k = xexc_k(+)
AND xcx_xcrr_k = xcrr_k(+)
AND qot_crr_id = crr_id(+)
AND qot_status IN (1, 2)
AND (qot_sup_xpressfeed IS NULL OR to_number(substr(REGEXP_SUBSTR(qot_sup_xpressfeed, ',[^,]+,'), 2, length(REGEXP_SUBSTR(qot_sup_xpressfeed, ',[^,]+,')) - 2)) = xexc_k);
When I comment out the last restriction
AND (qot_sup_xpressfeed IS NULL OR to_number(substr(REGEXP_SUBSTR(qot_sup_xpressfeed, ',[^,]+,'), 2, length(REGEXP_SUBSTR(qot_sup_xpressfeed, ',[^,]+,')) - 2)) = xexc_k)
I get these results:
1135 67
1135 111
1135 549
1135 246
1135 103
1135 564
1135 1135
1135 21
So as you can see I have a row:
1135 1135
But I'm getting no results when I add the last restriction:
AND (qot_sup_xpressfeed IS NULL OR to_number(substr(REGEXP_SUBSTR(qot_sup_xpressfeed, ',[^,]+,'), 2, length(REGEXP_SUBSTR(qot_sup_xpressfeed, ',[^,]+,')) - 2)) = xexc_k);
I'd expect to get one result (the 1135 1135 row mentioned above).
What am I doing wrong?

You are selecting xex.xex_xexc_k, but your test is on xexc_k. Since xexc is outer-joined, the two can differ: whenever there is no matching xexc row, xexc_k is NULL, so the equality in the last restriction can never be satisfied. What happens when you compare against xex.xex_xexc_k instead?

Related

pandas 0.25.x to 1.5.2, nested renamer migration

[In]:
pd.set_option('display.max_colwidth', 200)
topic_stats_df = corpus_topic_df.groupby('Dominant Topic').agg({
    'Dominant Topic': {
        'Doc Count': np.size,
        '% Total Docs': np.size}
})
topic_stats_df = topic_stats_df['Dominant Topic'].reset_index()
topic_stats_df['% Total Docs'] = topic_stats_df['% Total Docs'].apply(lambda row: round((row*100) / len(papers), 2))
topic_stats_df['Topic Desc'] = [topics_df.iloc[t]['Terms per Topic'] for t in range(len(topic_stats_df))]
topic_stats_df
[Out]:
---------------------------------------------------------------------------
SpecificationError Traceback (most recent call last)
Cell In[47], line 2
1 pd.set_option('display.max_colwidth', 200)
----> 2 topic_stats_df = corpus_topic_df.groupby('Dominant Topic').agg({
3 'Dominant Topic': {
4 'Doc Count': np.size,
5 '% Total Docs': np.size }
6 })
7 topic_stats_df = topic_stats_df['Dominant Topic'].reset_index()
8 topic_stats_df['% Total Docs'] = topic_stats_df['% Total Docs'].apply(lambda row: round((row*100) / len(papers), 2))
File ~/miniconda3/envs/nlp/lib/python3.8/site-packages/pandas/core/groupby/generic.py:894, in DataFrameGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
891 func = maybe_mangle_lambdas(func)
893 op = GroupByApply(self, func, args, kwargs)
--> 894 result = op.agg()
895 if not is_dict_like(func) and result is not None:
896 return result
File ~/miniconda3/envs/nlp/lib/python3.8/site-packages/pandas/core/apply.py:169, in Apply.agg(self)
166 return self.apply_str()
168 if is_dict_like(arg):
--> 169 return self.agg_dict_like()
170 elif is_list_like(arg):
171 # we require a list, but not a 'str'
172 return self.agg_list_like()
File ~/miniconda3/envs/nlp/lib/python3.8/site-packages/pandas/core/apply.py:478, in Apply.agg_dict_like(self)
475 selected_obj = obj._selected_obj
476 selection = obj._selection
--> 478 arg = self.normalize_dictlike_arg("agg", selected_obj, arg)
480 if selected_obj.ndim == 1:
481 # key only used for output
482 colg = obj._gotitem(selection, ndim=1)
File ~/miniconda3/envs/nlp/lib/python3.8/site-packages/pandas/core/apply.py:594, in Apply.normalize_dictlike_arg(self, how, obj, func)
587 # Can't use func.values(); wouldn't work for a Series
588 if (
589 how == "agg"
590 and isinstance(obj, ABCSeries)
591 and any(is_list_like(v) for _, v in func.items())
592 ) or (any(is_dict_like(v) for _, v in func.items())):
593 # GH 15931 - deprecation of renaming keys
--> 594 raise SpecificationError("nested renamer is not supported")
596 if obj.ndim != 1:
597 # Check for missing columns on a frame
598 cols = set(func.keys()) - set(obj.columns)
SpecificationError: nested renamer is not supported
The code is credited to Sarkar, D. (2019), Text Analytics with Python (Apress), topic modeling section.
Installing pandas 0.25.3 with pip fails because I'm on an M1 Mac.
I have tried: pip install pandas==0.25.3
I have also tried: arch -x86_64 pip install pandas==0.25.3
Pandas removed nested renaming a while back in favor of named aggregation (pd.NamedAgg), which is why this raises SpecificationError:
topic_stats_df = corpus_topic_df.groupby('Dominant Topic').agg({
    'Dominant Topic': {
        'Doc Count': np.size,
        '% Total Docs': np.size}
})
This statement can be rewritten as follows:
topic_stats_df = corpus_topic_df.groupby('Dominant Topic')\
    .agg(Doc_count=('Dominant Topic', np.size),
         Pct_Total_Docs=('Dominant Topic', np.size))
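If you also want to keep the original column names (which contain spaces) and the percentage calculation from the book's code, the named aggregations can be supplied via dict unpacking. A minimal sketch, using hypothetical stand-ins for corpus_topic_df and papers:
import pandas as pd

# hypothetical stand-in data: one row per document, labelled with its dominant topic
corpus_topic_df = pd.DataFrame({'Dominant Topic': [0, 0, 1, 1, 1, 2]})
papers = range(len(corpus_topic_df))   # stand-in for the original papers list

# ** unpacking lets the new column names contain spaces and '%'
topic_stats_df = (corpus_topic_df
                  .groupby('Dominant Topic')
                  .agg(**{'Doc Count': ('Dominant Topic', 'size'),
                          '% Total Docs': ('Dominant Topic', 'size')})
                  .reset_index())
topic_stats_df['% Total Docs'] = (topic_stats_df['% Total Docs'] * 100 / len(papers)).round(2)
print(topic_stats_df)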

How do I get a time delta that is closest to 0 days?

I have the following dataframe:
import pandas as pd

gp_columns = {
    'name': ['companyA', 'companyB'],
    'firm_ID': [1, 2],
    'timestamp_one': ['2016-04-01', '2017-09-01']
}
fund_columns = {
    'firm_ID': [1, 1, 2, 2, 2],
    'department_ID': [10, 11, 20, 21, 22],
    'timestamp_mult': ['2015-01-01', '2016-03-01', '2016-10-01', '2017-02-01', '2018-11-01'],
    'number': [400, 500, 1000, 3000, 4000]
}
gp_df = pd.DataFrame(gp_columns)
fund_df = pd.DataFrame(fund_columns)
gp_df['timestamp_one'] = pd.to_datetime(gp_df['timestamp_one'])
fund_df['timestamp_mult'] = pd.to_datetime(fund_df['timestamp_mult'])
merged_df = gp_df.merge(fund_df)
merged_df
merged_df_v1 = merged_df.copy()
merged_df_v1['incidence_num'] = merged_df.groupby('firm_ID')['department_ID']\
    .transform('cumcount')
merged_df_v1['incidence_num'] = merged_df_v1['incidence_num'] + 1
merged_df_v1['time_delta'] = merged_df_v1['timestamp_mult'] - merged_df_v1['timestamp_one']
merged_wide = pd.pivot(merged_df_v1, index=['name', 'firm_ID', 'timestamp_one'],
                       columns='incidence_num',
                       values=['department_ID', 'time_delta', 'timestamp_mult', 'number'])
merged_wide.reset_index()
that looks as follows:
My question is: how do I get a column that holds the minimum time delta (the one closest to 0)? Note that the time delta can be negative or positive, so .abs() alone does not work for me here.
I want a dataframe with this particular output:
You can stack (which removes NaTs) and groupby.first after sorting the rows by absolute value (with the key parameter of sort_values):
df = merged_wide.reset_index()
df['time_delta_min'] = (df['time_delta'].stack()       # drops NaT, one value per (row, incidence_num)
                        .sort_values(key=abs)          # order by distance from 0, keeping the sign
                        .groupby(level=0).first()      # first value per original row = closest to 0
                        )
output:
name firm_ID timestamp_one department_ID \
incidence_num 1 2 3
0 companyA 1 2016-04-01 10 11 NaN
1 companyB 2 2017-09-01 20 21 22
time_delta timestamp_mult \
incidence_num 1 2 3 1 2
0 -456 days -31 days NaT 2015-01-01 2016-03-01
1 -335 days -212 days 426 days 2016-10-01 2017-02-01
number time_delta_min
incidence_num 3 1 2 3
0 NaT 400 500 NaN -31 days
1 2018-11-01 1000 3000 4000 -212 days
Another option: use a lookup based on the column labels of the minimum absolute values, obtained with DataFrame.idxmin:
# column label of the smallest absolute time delta per row -> integer codes + unique labels
idx, cols = pd.factorize(df['time_delta'].abs().idxmin(axis=1))
# pick, for each row, the signed value from that column
df['time_delta_min'] = (df['time_delta'].reindex(cols, axis=1).to_numpy()
                        [np.arange(len(df)), idx])
print(df)
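For what it is worth, here is a minimal, self-contained sketch of what that lookup does, on hypothetical toy data (plain integers standing in for the timedeltas):
import numpy as np
import pandas as pd

# toy stand-in for df['time_delta']: one candidate value per incidence_num column
td = pd.DataFrame({1: [-456, -335], 2: [-31, -212], 3: [np.nan, 426]})

closest = td.abs().idxmin(axis=1)   # column label of the smallest absolute value per row
idx, cols = pd.factorize(closest)   # integer codes plus the unique labels they refer to
# fancy indexing pulls one signed value per row from the matching column
td['time_delta_min'] = td.reindex(cols, axis=1).to_numpy()[np.arange(len(td)), idx]
print(td)   # -> -31 for the first row, -212 for the second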

SpecificationError: Function names must be unique if there is no new column names assigned

I want to create a new column in the clin dataframe based on the following conditions:
1 if the value >= 2*365 or is NaN,
otherwise 0.
I then assign the new column the name SURV.
import numpy as np
vals = clin['days_to_death'].astype(np.float32)
# non-LTS is 0, LTS is 1
surv = [1 if ( v>=2*365 or np.isnan(v) ) else 0 for v in vals ]
clin['SURV'] = clin.apply(surv, axis=1)
Traceback:
SpecificationError: Function names must be unique if there is no new column names assigned
---------------------------------------------------------------------------
SpecificationError Traceback (most recent call last)
<ipython-input-31-603dee8413ce> in <module>
5 # non-LTS is 0, LTS is 1
6 surv = [1 if ( v>=2*365 or np.isnan(v) ) else 0 for v in vals ]
----> 7 clin['SURV'] = clin.apply(surv, axis=1)
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
146 # multiple values for keyword argument "axis"
147 return self.obj.aggregate( # type: ignore[misc]
--> 148 self.f, axis=self.axis, *self.args, **self.kwds
149 )
150
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
7572 axis = self._get_axis_number(axis)
7573
-> 7574 relabeling, func, columns, order = reconstruct_func(func, **kwargs)
7575
7576 result = None
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/core/aggregation.py in reconstruct_func(func, **kwargs)
93 # there is no reassigned name
94 raise SpecificationError(
---> 95 "Function names must be unique if there is no new column names "
96 "assigned"
97 )
SpecificationError: Function names must be unique if there is no new column names assigned
clin
clin = pd.DataFrame([[1, '466', '47', 0, '90'],
                     [1, '357', '54', 1, '80'],
                     [1, '108', '72', 1, '60'],
                     [1, '254', '51', 0, '80'],
                     [1, '138', '78', 1, '80'],
                     [0, np.nan, '67', 0, '60']],
                    columns=['vital_status', 'days_to_death', 'age_at_initial_pathologic_diagnosis',
                             'gender', 'karnofsky_performance_score'],
                    index=['TCGA-06-1806', 'TCGA-06-5408', 'TCGA-06-5410', 'TCGA-06-5411',
                           'TCGA-06-5412', 'TCGA-06-5413'])
Expected output:
vital_status days_to_death age_at_initial_pathologic_diagnosis gender karnofsky_performance_score SURV
TCGA-06-1806 1 466 47 0 90 0
TCGA-06-5408 1 357 54 1 80 0
TCGA-06-5410 1 108 72 1 60 0
TCGA-06-5411 1 254 51 0 80 0
TCGA-06-5412 1 138 78 1 80 0
TCGA-06-5413 0 nan 67 0 60 1
Make a new column of all 0s, then update it where your conditions are met:
clin['SURV'] = 0
clin.loc[pd.to_numeric(clin.days_to_death).ge(2*365) | clin.days_to_death.isna(), 'SURV'] = 1
print(clin)
Output:
vital_status days_to_death age_at_initial_pathologic_diagnosis gender karnofsky_performance_score SURV
TCGA-06-1806 1 466 47 0 90 0
TCGA-06-5408 1 357 54 1 80 0
TCGA-06-5410 1 108 72 1 60 0
TCGA-06-5411 1 254 51 0 80 0
TCGA-06-5412 1 138 78 1 80 0
TCGA-06-5413 0 NaN 67 0 60 1
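Note that the list comprehension in the question already computes the flags; the SpecificationError only appears because a plain list is passed to apply, which pandas hands off to aggregate as a list of "functions" and then rejects for having duplicate names. Assigning the list directly also works; a minimal sketch, reusing the clin frame from the question:
import numpy as np

vals = clin['days_to_death'].astype(np.float32)
# non-LTS is 0, LTS is 1; a missing days_to_death counts as a long-term survivor
surv = [1 if (v >= 2 * 365 or np.isnan(v)) else 0 for v in vals]
clin['SURV'] = surv   # plain assignment instead of clin.apply(surv, axis=1)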

np.where condition is not getting satisfied

In the following line of code, I get the error shown below.
d3["WOE"] = np.where(((d3.DIST_EVENT==0) | (d3.DIST_NON_EVENT ==0)) ,np.nan ,np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT))
If the numerator or denominator is 0, the condition for np.nan should be satisfied and d3["WOE"] should be NaN. Why is the following error being produced?
---------------------------------------------------------------------------
FloatingPointError Traceback (most recent call last)
<ipython-input-56-a9b015683238> in <module>
----> 1 final_iv, IV = data_vars(df_leads_short,df_leads_short.close_flag)
2 IV.sort_values('IV')
<ipython-input-55-5530ad13fa5a> in data_vars(df1, target)
122 count = count + 1
123 else:
--> 124 conv = char_bin(target, df1[i])
125 conv["VAR_NAME"] = i
126 count = count + 1
<ipython-input-55-5530ad13fa5a> in char_bin(Y, X)
92 d3["DIST_EVENT"] = d3.EVENT/d3.sum().EVENT
93 d3["DIST_NON_EVENT"] = d3.NONEVENT/d3.sum().NONEVENT
---> 94 d3["WOE"] = np.where(((d3.DIST_EVENT==0) | (d3.DIST_NON_EVENT ==0)) ,np.nan ,np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT))
95 #d3["WOE"] = np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
96 d3["IV"] = np.where((d3.DIST_EVENT==0) | (d3.DIST_NON_EVENT ==0 ),np.nan ,(d3.DIST_EVENT-d3.DIST_NON_EVENT)*np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT))
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
1934 self, ufunc: Callable, method: str, *inputs: Any, **kwargs: Any
1935 ):
-> 1936 return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
1937
1938 # ideally we would define this to avoid the getattr checks, but
/opt/conda/lib/python3.7/site-packages/pandas/core/arraylike.py in array_ufunc(self, ufunc, method, *inputs, **kwargs)
356 # ufunc(series, ...)
357 inputs = tuple(extract_array(x, extract_numpy=True) for x in inputs)
--> 358 result = getattr(ufunc, method)(*inputs, **kwargs)
359 else:
360 # ufunc(dataframe)
FloatingPointError: divide by zero encountered in log
We can do:
cond = (d3.DIST_EVENT == 0) | (d3.DIST_NON_EVENT == 0)
d3.loc[~cond, "WOE"] = np.log(d3.loc[~cond, "DIST_EVENT"] / d3.loc[~cond, "DIST_NON_EVENT"])
np.where still evaluates np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT) for every row before it selects anything, so it raises the same error; np.where only chooses between already-computed values.
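To see why the mask matters, here is a minimal self-contained sketch on hypothetical toy data; the log is only ever evaluated on the safe rows:
import numpy as np
import pandas as pd

d3 = pd.DataFrame({'DIST_EVENT': [0.2, 0.0, 0.5],
                   'DIST_NON_EVENT': [0.1, 0.3, 0.0]})

cond = (d3.DIST_EVENT == 0) | (d3.DIST_NON_EVENT == 0)
d3['WOE'] = np.nan   # rows matching cond stay NaN
d3.loc[~cond, 'WOE'] = np.log(d3.loc[~cond, 'DIST_EVENT'] / d3.loc[~cond, 'DIST_NON_EVENT'])
print(d3)            # only the first row gets a WOE value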

Sorting pandas dataframe by groups

I would like to sort a dataframe by certain priority rules.
I've achieved this in the code below but I think this is a very hacky solution.
Is there a more proper Pandas way of doing this?
import pandas as pd
import numpy as np
df = pd.DataFrame({"Primary Metric": [80, 100, 90, 100, 80, 100, 80, 90, 90, 100, 90, 90, 80, 90, 90, 80, 80, 80, 90, 90, 100, 80, 80, 100, 80],
                   "Secondary Metric Flag": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
                   "Secondary Value": [15, 59, 70, 56, 73, 88, 83, 64, 12, 90, 64, 18, 100, 79, 7, 71, 83, 3, 26, 73, 44, 46, 99, 24, 20],
                   "Final Metric": [222, 883, 830, 907, 589, 93, 479, 498, 636, 761, 851, 349, 25, 405, 132, 491, 253, 318, 183, 635, 419, 885, 305, 258, 924]})
Primary_List = list(np.unique(df['Primary Metric']))
Primary_List.sort(reverse=True)
df_sorted = pd.DataFrame()
for p in Primary_List:
    lol = df[df["Primary Metric"] == p]
    lol.sort_values(["Secondary Metric Flag"], ascending=False)
    pt1 = lol[lol["Secondary Metric Flag"] == 1].sort_values(by=['Secondary Value', 'Final Metric'], ascending=[False, False])
    pt0 = lol[lol["Secondary Metric Flag"] == 0].sort_values(["Final Metric"], ascending=False)
    df_sorted = df_sorted.append(pt1)
    df_sorted = df_sorted.append(pt0)
df_sorted
The priority rules are:
First sort by the 'Primary Metric', then by the 'Secondary Metric Flag'.
If the 'Secondary Metric Flag' == 1, sort by 'Secondary Value' then the 'Final Metric'.
If it == 0, go straight to the 'Final Metric'.
Appreciate any feedback.
You do not need a for loop or groupby here; just split the frame and use sort_values:
df1=df.loc[df['Secondary Metric Flag']==1].sort_values(by=['Primary Metric','Secondary Value', 'Final Metric'], ascending=[True,False, False])
df0=df.loc[df['Secondary Metric Flag']==0].sort_values(["Primary Metric","Final Metric"],ascending = [True,False])
df=pd.concat([df1,df0]).sort_values('Primary Metric')
sorted with loc
def k(t):
    p, s, v, f = df.loc[t]
    # sort keys: Primary Metric desc, Flag desc, Secondary Value desc only when the
    # flag is 1 (-s * v is 0 whenever the flag is 0), then Final Metric desc
    return (-p, -s, -s * v, -f)

df.loc[sorted(df.index, key=k)]
Primary Metric Secondary Metric Flag Secondary Value Final Metric
9 100 1 90 761
5 100 1 88 93
1 100 1 59 883
3 100 1 56 907
23 100 1 24 258
20 100 0 44 419
13 90 1 79 405
19 90 1 73 635
7 90 1 64 498
11 90 1 18 349
10 90 0 64 851
2 90 0 70 830
8 90 0 12 636
18 90 0 26 183
14 90 0 7 132
15 80 1 71 491
21 80 1 46 885
17 80 1 3 318
24 80 0 20 924
4 80 0 73 589
6 80 0 83 479
22 80 0 99 305
16 80 0 83 253
0 80 0 15 222
12 80 0 100 25
sorted with itertuples
def k(t):
    _, p, s, v, f = t
    return (-p, -s, -s * v, -f)

idx, *tups = zip(*sorted(df.itertuples(), key=k))
pd.DataFrame(dict(zip(df, tups)), idx)
lexsort
p = df['Primary Metric']
s = df['Secondary Metric Flag']
v = df['Secondary Value']
f = df['Final Metric']
# np.lexsort sorts by the last key first, so the [::-1] keeps the same
# priority order as the tuples above
a = np.lexsort([
    -p, -s, -s * v, -f
][::-1])
df.iloc[a]
Construct New DataFrame
# negate Primary Metric, Flag and Final Metric, fold the (negated) flag into
# Secondary Value, sort ascending on every column, then reuse that row order
# on the original frame
df.mul([-1, -1, 1, -1]).assign(
    **{'Secondary Value': lambda d: d['Secondary Metric Flag'] * d['Secondary Value']}
).pipe(
    lambda d: df.loc[d.sort_values([*d]).index]
)