I have the following DataFrame.
df = pd.DataFrame({'1': ['A', '.', '.', 'X', '.', '.'],
                   '2': ['.', '.', '.', '.', 'A', '.'],
                   '3': ['.', '.', '.', '.', '.', '.'],
                   '4': ['.', '.', '.', '.', '.', 'X']})
I want to identify all instances where 'A' occurs and check to see if 'X' occurs within the next 3 rows.
After doing that I would like to execute a command based on these conditions.
An example of what I am trying to do would be...
for i, idx in df.iterrows():
    if idx == A:
        if X exists within next 3 rows:
            x = idx['1']
            y = idx['2']
Any help would be greatly appreciated.
I am sure that the other answer could work if you were to explain what you really want to do, and it would be more efficient, as iterating over rows is slow.
However, here is a solution based on iterrows:
mask = df.eq('X').any(axis=1)
mask = mask.where(mask).bfill(limit=3).fillna(False)

for idx, row in df.iterrows():
    if 'A' in row.values and mask[idx]:
        x = row['1']
        y = row['2']
        print(f'row {idx} matches: {x=}, {y=}')
example input (slightly different from yours):
1 2 3 4
0 A . . .
1 . . . .
2 . . A .
3 . . . .
4 X A . .
5 . . X .
output:
row 2 matches: x='.', y='.'
row 4 matches: x='X', y='A'
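For reference, here is a self-contained version of the above, using the example input and spelling out `any(axis=1)` (the positional `any(1)` form is gone in newer pandas):

```python
import pandas as pd

# example input from above (columns '1'-'4', rows 0-5)
df = pd.DataFrame({'1': ['A', '.', '.', '.', 'X', '.'],
                   '2': ['.', '.', '.', '.', 'A', '.'],
                   '3': ['.', '.', 'A', '.', '.', 'X'],
                   '4': ['.', '.', '.', '.', '.', '.']})

# rows containing an 'X' anywhere
mask = df.eq('X').any(axis=1)
# extend each True backwards over the 3 preceding rows
mask = mask.where(mask).bfill(limit=3).fillna(False)

matches = []
for idx, row in df.iterrows():
    if 'A' in row.values and mask[idx]:
        matches.append((idx, row['1'], row['2']))
```

`matches` then holds rows 2 and 4, as in the printed output above.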
IIUC, you want to identify the cells where there is a value 'A' and, within the next 3 rows, there is also a value 'X'.
I will use a more visual example for clarity (A/X/.):
0 1 2 3 4 5
0 A . . A . A
1 . . X . A .
2 . A . . . A
3 . X . X . X
4 X . . . . .
One can use eq to find the searched values, and where + bfill(limit=3) + fillna to extend the second mask to the previous rows.
# mask for the A
m1 = df.eq('A')
# mask for the X in the next 3 lines
m2 = df.eq('X')
m2 = m2.where(m2).bfill(limit=3).fillna(False)
# example of how to use the masks: replacing A with O
df[m1&m2] = 'O'
Example output:
0 1 2 3 4 5
0 A . . O . O
1 . . X . A .
2 . O . . . O
3 . X . X . X
4 X . . . . .
checking X for any column
Just change the second mask to a per-row one (note that m2 is now a Series, so combine it with m1 row-wise, e.g. df[m1.mul(m2, axis=0)] = 'O'):
m2 = df.eq('X').any(axis=1)
m2 = m2.where(m2).bfill(limit=3).fillna(False)
output with this mask:
0 1 2 3 4 5
0 O . . O . O
1 . . X . O .
2 . O . . . O
3 . X . X . X
4 X . . . . .
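A runnable sketch of the per-column variant on the question's own DataFrame (the `.astype(bool)` cast is added so the combined mask stays boolean in recent pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'1': ['A', '.', '.', 'X', '.', '.'],
                   '2': ['.', '.', '.', '.', 'A', '.'],
                   '3': ['.', '.', '.', '.', '.', '.'],
                   '4': ['.', '.', '.', '.', '.', 'X']})

m1 = df.eq('A')                                  # where the A's are
m2 = df.eq('X')                                  # where the X's are
# extend each X upwards over the 3 preceding rows (same column)
m2 = m2.where(m2).bfill(limit=3).fillna(False).astype(bool)

df[m1 & m2] = 'O'   # mark the A's that have an X in the next 3 rows
```

Here only the 'A' in row 0 of column '1' is followed by an 'X' within 3 rows (row 3), so it alone becomes 'O'.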
I have a table called Process. It has 22 columns.
Column 1 is Id, which is serial and the primary key.
Column 2 is Process Name, character(50).
Column 3 is "Amount 1", character varying.
Column 4 is "Time 1", integer.
The rest of the columns are the same as 3 & 4 but going up in number, i.e. column 5 is "Amount 2", column 6 is "Time 2".
What I need is a query which looks in the amount columns for 'normal' and then displays the ID and the Time column. For example:
Process Table
ID . Process Name . Amount 1 . Time 1 . Amount 2 Time 2
1 . Pick . normal . 20 . normal . 40
2 . Pack . normal . 40 . 3 . 10
3 . Pull . 3 . 20 . 1 . 60
4 . Play . normal . 40
Result
ID . Time 1 . Time 2
1 . 20 . 40
2 . 40
4 . 40
I have tried the following codes :
select public."Process", amount_1 from
(select ID, time_1 FROM public."Process" AS normal_tasks);

select public."Process", amount_1 from
select id, Time_1 from public."Process" where Amount_1 != 'normal';
but I'm getting syntax errors.
Any Help will be greatly appreciated
Many Thanks
Dave
I think you are looking for CASE
SELECT Id,
       CASE WHEN "Amount 1" = 'normal'
            THEN "Time 1"
       END AS time_1,
       CASE WHEN "Amount 2" = 'normal'
            THEN "Time 2"
       END AS time_2
FROM public."Process"
WHERE 'normal' IN ("Amount 1", "Amount 2")
You have to add a CASE expression for each Amount/Time pair across the 22 columns. Note that the comparison is case-sensitive ('normal' must match the stored value) and that column names containing spaces must be double-quoted.
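A quick way to check the shape of this query is to run it against a toy table; here via Python's sqlite3 for convenience, with underscored column names standing in for the quoted "Amount 1" style names (quoting rules differ slightly from Postgres):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("""CREATE TABLE Process
               (Id INTEGER PRIMARY KEY, Name TEXT,
                Amount_1 TEXT, Time_1 INTEGER,
                Amount_2 TEXT, Time_2 INTEGER)""")
con.executemany("INSERT INTO Process VALUES (?,?,?,?,?,?)",
                [(1, 'Pick', 'normal', 20, 'normal', 40),
                 (2, 'Pack', 'normal', 40, '3', 10),
                 (3, 'Pull', '3', 20, '1', 60),
                 (4, 'Play', 'normal', 40, None, None)])

# CASE blanks out the Time where the Amount is not 'normal';
# the WHERE drops rows with no 'normal' at all (row 3)
rows = con.execute("""
    SELECT Id,
           CASE WHEN Amount_1 = 'normal' THEN Time_1 END AS Time_1,
           CASE WHEN Amount_2 = 'normal' THEN Time_2 END AS Time_2
    FROM Process
    WHERE 'normal' IN (Amount_1, Amount_2)
    ORDER BY Id
""").fetchall()
```

This reproduces the desired result table: IDs 1, 2 and 4, with the non-normal Time slots left NULL.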
I have a dataset with a hierarchical codelist variable.
The logic of the hierarchy is determined by the LEVEL variable and the prefix structure of the CODE character variable.
There are 6 "aggregate" levels (code lengths 1 to 6) and a terminal level (code length of 10 characters).
I need to update the NODES variable (the count of terminal nodes - the aggregate levels do not count towards the "higher" aggregates, only the terminal nodes do), so the total count is the same on every level; for example, every level 5 total count is the same as every level 6 total count.
And I need to sum up the WEIGHT to the "higher" level nodes.
NOTE: I offset the output table's NODES and WEIGHT variable so you can see better what I am talking about (just add up the numbers in each offset and you get the same value).
EDIT1: there can be multiple observations with the same code. A unique observation is a combination of 3 variables: code + var1 + var2.
Input table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 999 999 999
2 2 11 . . 999 999 999
3 3 111 . . 999 999 999
4 4 1111 . . 999 999 999
5 5 11111 . . 999 999 999
6 6 111111 . . 999 999 999
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 999 999 999
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 999 999 999
12 6 111190 . . 999 999 999
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
Desired output table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 5 1.0 98,1
2 2 11 . . 5 1.0 98,1
3 3 111 . . 5 1.0 98,1
4 4 1111 . . 5 1.0 98,1
5 5 11111 . . 3 0.7 98,5
6 6 111111 . . 2 0.2 107,3
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 1 0.5 95,0
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 2 0.3 97,2
12 6 111190 . . 2 0.3 97,2
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
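The aggregation rule itself is simple to state: a node's NODES/WEIGHT are the count/sum over all terminal (level 10) rows whose code starts with that node's code, and myIndex is the weight-weighted mean of those terminals, rounded to 0.1. A small Python sketch of just that rule, using the terminal rows from the tables above (illustration only, not the SAS solution):

```python
# (code, weight, myIndex) for the terminal (level 10) rows
terminals = [('1111119999', 0.1, 105.5),
             ('1111119999', 0.1, 109.1),
             ('1111120000', 0.5, 95.0),
             ('1111901000', 0.1, 80.7),
             ('1111901000', 0.2, 105.5)]

def rollup(prefix):
    """Count, total weight and weighted mean index of the terminals under prefix."""
    hits = [(w, x) for code, w, x in terminals if code.startswith(prefix)]
    nodes = len(hits)
    weight = sum(w for w, _ in hits)
    my_index = round(sum(w * x for w, x in hits) / weight, 1)
    return nodes, round(weight, 1), my_index
```

For example, rollup('11111') gives (3, 0.7, 98.5) and rollup('1') gives (5, 1.0, 98.1), matching the desired output rows.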
And here's the code I came up with. It works just like I wanted, but man, it is really slow. I need something way faster, because this is a part of a webservice which has to run "instantly" on request.
Any suggestions on speeding up the code, or any other solutions are welcome.
%macro doit;
    data temporary;
        set have;
    run;
    %do i=6 %to 2 %by -1;
        %if &i = 6 %then %let x = 10;
        %else %let x = %eval(&i + 1);
        proc sql noprint;
            select count(code)
                into :cc trimmed
                from have
                where level = &i;
            select code
                into :id1 - :id&cc
                from have
                where level = &i;
        quit;
        %do j=1 %to &cc.;
            %let idd = &&id&j;
            proc sql;
                update have t1
                    set nodes = (
                            select sum(nodes)
                            from temporary t2
                            where t2.level = &x and t2.code like ("&idd" || "%")),
                        weight = (
                            select sum(weight)
                            from temporary t2
                            where t2.level = &x and t2.code like ("&idd" || "%"))
                    where (t1.level = &i and t1.code like "&idd");
            quit;
        %end;
    %end;
%mend doit;
Current code based on #Quentin's solution:
data have;
input ID level code : $10. nodes weight myIndex;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
data want (drop=_:);
    * hash table of terminal nodes;
    if (_n_ = 1) then do;
        if (0) then set have (rename=(code=_code weight=_weight myIndex=_myIndex));
        declare hash h(dataset: 'have(where=(level=10) rename=(code=_code weight=_weight myIndex=_myIndex))');
        declare hiter iter('h');
        h.definekey('ID');
        h.definedata('_code', '_weight', '_myIndex');
        h.definedone();
    end;
    set have;
    * for each non-terminal node, iterate through the;
    * hash table of all terminal nodes, looking for children;
    if level ne 10 then do;
        call missing(weight, nodes, myIndex);
        do _n_ = iter.first() by 0 while (_n_ = 0);
            if trim(code) =: _code then do;
                weight = sum(weight, _weight);
                nodes = sum(nodes, 1);
                myIndex = sum(myIndex, _myIndex * _weight);
            end;
            _n_ = iter.next();
        end;
        myIndex = round(myIndex / weight, .1);
    end;
    output;
run;
Here's an alternative hash approach.
Rather than using a hash object to do a cartesian join, this adds the nodes & weight from each level 10 node to each of the 6 applicable parent nodes as it goes along. This may be marginally faster than Quentin's approach as there are no redundant hash lookups.
It takes a bit longer than Quentin's approach when constructing the hash object, and uses a bit more memory, as each terminal node is added 6 times with different keys and existing entries often have to be updated. Afterwards, however, each parent node only has to look up its own individual stats rather than looping through all the terminal nodes, which is a substantial saving.
Weighted stats are possible as well, but you have to update both loops, not just the second one.
data want;
    if 0 then set have;
    dcl hash h();
    h.definekey('code');
    h.definedata('nodes', 'weight', 'myIndex');
    h.definedone();
    length t_code $10;
    do until(eof);
        set have(where = (level = 10)) end = eof;
        t_nodes = nodes;
        t_weight = weight;
        t_myindex = weight * myIndex;
        do _n_ = 1 to 6;
            t_code = substr(code, 1, _n_);
            if h.find(key: t_code) ne 0 then
                h.add(key: t_code, data: t_nodes, data: t_weight, data: t_myIndex);
            else do;
                nodes + t_nodes;
                weight + t_weight;
                myIndex + t_myIndex;
                h.replace(key: t_code, data: nodes, data: weight, data: myIndex);
            end;
        end;
    end;
    do until(eof2);
        set have end = eof2;
        if level ne 10 then do;
            h.find();
            myIndex = round(myIndex / weight, 0.1);
        end;
        output;
    end;
    drop t_:;
run;
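The accumulate-into-every-parent-prefix idea is independent of the SAS hash object; in Python terms it is one pass over the terminals, updating a dict keyed by prefix (illustration only, using the sample figures from the posted data step):

```python
from collections import defaultdict

# terminal (level 10) rows: (code, weight) from the sample data
terminals = [('1111110000', 0.1), ('1111119999', 0.1),
             ('1111129999', 0.5), ('1111900000', 0.1),
             ('1111901000', 0.2)]

stats = defaultdict(lambda: [0, 0.0])        # prefix -> [nodes, weight]
for code, weight in terminals:
    for i in range(1, 7):                    # the 6 aggregate levels
        key = code[:i]
        stats[key][0] += 1                   # node count
        stats[key][1] += weight              # accumulated weight
```

Afterwards each aggregate node reads off its own entry, e.g. stats['11111'] holds 3 nodes with total weight 0.7; no per-node scan of the terminals is needed.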
Below is a brute-force hash approach to doing a similar Cartesian product as in the SQL. Load a hash table of the terminal nodes. Then read through the dataset of nodes, and for each node that is not a terminal node, iterate through the hash table, identifying all of the child terminal nodes.
I think the approach #joop is describing may be more efficient, as this approach doesn't take advantage of the tree structure, so there is a lot of re-calculating. With 5000 records and 3000 terminal nodes, this would do 2000*3000 comparisons. But it might not be that slow, since the hash table is in memory, so you're not going to have excessive I/O.
data want (drop=_:);
    * hash table of terminal nodes;
    if (_n_ = 1) then do;
        if (0) then set have (rename=(code=_code weight=_weight));
        declare hash h(dataset: 'have(where=(level=10) rename=(code=_code weight=_weight))');
        declare hiter iter('h');
        h.definekey('ID');
        h.definedata('_code', '_weight');
        h.definedone();
    end;
    set have;
    * for each non-terminal node, iterate through the;
    * hash table of all terminal nodes, looking for children;
    if level ne 10 then do;
        call missing(weight, nodes);
        do _n_ = iter.first() by 0 while (_n_ = 0);
            if trim(code) =: _code then do;
                weight = sum(weight, _weight);
                nodes = sum(nodes, 1);
            end;
            _n_ = iter.next();
        end;
    end;
    output;
run;
It seems pretty simple. Just join back with itself and count/sum.
proc sql ;
create table want as
select a.id, a.level, a.code , a.var1, a.var2
, count(b.id) as nodes
, sum(b.weight) as weight
from have a
left join have b
on a.code eqt b.code
and b.level=10
group by 1,2,3,4,5
order by 1
;
quit;
If you don't want to use the EQT operator then you can use the SUBSTR() function instead.
on a.code = substr(b.code,1,a.level)
and b.level=10
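The same join can be sanity-checked in any SQL engine; for instance in SQLite (via Python), where the EQT prefix match is spelled with LIKE and a trailing wildcard (toy table with made-up column subset, illustration only):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE have (id INT, level INT, code TEXT, weight REAL)")
con.executemany("INSERT INTO have VALUES (?,?,?,?)",
                [(1, 1, '1', None), (5, 5, '11111', None), (6, 6, '111111', None),
                 (7, 10, '1111110000', 0.1), (8, 10, '1111119999', 0.1),
                 (10, 10, '1111129999', 0.5)])

# each node joins to every level-10 row whose code starts with its own code
rows = con.execute("""
    SELECT a.id, a.level, a.code,
           COUNT(b.id)   AS nodes,
           SUM(b.weight) AS weight
    FROM have a
    LEFT JOIN have b
      ON b.level = 10 AND b.code LIKE a.code || '%'
    GROUP BY a.id, a.level, a.code
    ORDER BY a.id
""").fetchall()
```

As in the SAS version, terminal nodes match themselves, so they come back with nodes = 1 and their own weight.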
Since you're using SAS, how about using proc summary to do the heavy lifting here? No cartesian joins required!
One advantage of this option over some of the others is that it's a bit easier to generalise if you want to calculate more complex statistics for multiple variables.
data have;
input ID level code : $10. nodes weight myIndex;
format myIndex 5.1;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
run;
data v_have / view=v_have;
    set have(where = (level = 10));
    array lvl[6] $6;
    do i = 1 to 6;
        lvl[i] = substr(code, 1, i);
    end;
    drop i;
run;
proc summary data = v_have;
    class lvl1-lvl6;
    var nodes weight;
    var myIndex / weight = weight;
    ways 1;
    output out = summary(drop = _:) sum(nodes weight)= mean(myIndex)=;
run;
data v_summary / view=v_summary;
    set summary;
    length code $10;
    code = cats(of lvl:);
    drop lvl:;
run;

data have;
    modify have v_summary;
    by code;
    replace;
run;
In theory a hash of hashes might also be an appropriate data structure, but that would be extremely complicated for a relatively small benefit. I might have a go anyway just as a learning exercise...
One approach (I think) would be to make the Cartesian product, and find all of the terminal nodes that are a "match" to each of the nodes, then sum the weights.
Something like:
data have;
input ID level code : $10. nodes weight ;
cards;
1 1 1 . .
2 2 11 . .
3 3 111 . .
4 4 1111 . .
5 5 11111 . .
6 6 111111 . .
7 10 1111110000 1 0.1
8 10 1111119999 1 0.1
9 6 111112 . .
10 10 1111129999 1 0.5
11 5 11119 . .
12 6 111190 . .
13 10 1111900000 1 0.1
14 10 1111901000 1 0.2
;
proc sql;
select min(id) as id
, min(level) as level
, a.code
, count(b.weight) as nodes /*count of terminal nodes*/
, sum(b.weight) as weight /*sum of weights of terminal nodes*/
from
have as a
,(select code , weight
from have
where level=10 /*selects terminal nodes*/
) as b
where a.code eqt b.code /*EQT is equivalent to =: */
group by a.code
;
quit;
I'm not sure that is correct, but it gives the desired results for the sample data.
This is the SQL needed to estimate the parent record for every record. It only uses string functions (POSITION and LENGTH), so it should be adaptable to any dialect of SQL, maybe even SAS (the CTE might need to be rewritten to subqueries or a view). The idea is to:
add a parent_id field to the dataset
find the record with the longest substring of code
and use its id as the value for our parent_id
(after that) update the records from the sum(nodes),sum(weight) of their direct children (the ones with child.parent_id = this.id )
BTW: I could have used LEVEL instead of LENGTH(code); the data is a bit redundant in this aspect.
WITH sub AS (
SELECT id, length(code) AS len
, code
FROM tree)
UPDATE tree t
SET parent_id = s.id
FROM sub s
WHERE length(t.code) > s.len AND POSITION (s.code IN t.code) = 1
AND NOT EXISTS (
    SELECT *
    FROM sub nx
    WHERE nx.len > s.len AND nx.len < length(t.code)
      AND POSITION (nx.code IN t.code) = 1
    )
;
SELECT * FROM tree
ORDER BY parent_id DESC NULLS LAST
, id
;
After finding the parents, the whole table should be updated (repeatedly) from itself, like:
-- PREPARE omg( integer) AS
UPDATE tree t
SET nodes = s.nodes , weight = s.weight
FROM ( SELECT parent_id , SUM(nodes) AS nodes , SUM(weight) AS weight
FROM tree GROUP BY parent_id) s
WHERE s.parent_id = t.id
;
In SAS, this could probably be done by sorting on {0-parent_id, id} and doing some retain+summation magic. (My SAS is a bit rusty in this area.)
UPDATE: if only the leaf nodes have non-NULL (non-missing) values for {nodes, weight}, the aggregation can be done in one sweep for the entire tree, without first computing the parent_ids:
UPDATE tree t
SET nodes = s.nodes , weight = s.weight
FROM ( SELECT p.id , SUM(c.nodes) AS nodes , SUM(c.weight) AS weight
FROM tree p
JOIN tree c ON c.lev > p.lev AND POSITION (p.code IN c.code ) = 1
GROUP BY p.id
) s
WHERE s.id = t.id
;
An index on {lev,code} will probably speed up things. (assuming an index on id)
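The one-sweep variant translates readily to other dialects. Here is a SQLite version via Python (correlated subqueries instead of UPDATE ... FROM, since older SQLite lacks the latter; the subqueries are restricted to the lev = 10 leaves so mid-statement updates cannot feed back into later rows; toy data, illustration only):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE tree (id INT, lev INT, code TEXT, nodes INT, weight REAL)")
con.executemany("INSERT INTO tree VALUES (?,?,?,?,?)",
                [(1, 1, '1', None, None), (5, 5, '11111', None, None),
                 (7, 10, '1111110000', 1, 0.1), (8, 10, '1111119999', 1, 0.1),
                 (10, 10, '1111129999', 1, 0.5)])

# aggregate every leaf into every ancestor in one statement
con.execute("""
    UPDATE tree
    SET nodes  = (SELECT SUM(c.nodes)  FROM tree c
                  WHERE c.lev = 10 AND c.code LIKE tree.code || '%'),
        weight = (SELECT SUM(c.weight) FROM tree c
                  WHERE c.lev = 10 AND c.code LIKE tree.code || '%')
    WHERE lev < 10
""")
rows = con.execute("SELECT id, nodes, weight FROM tree ORDER BY id").fetchall()
```

The outer WHERE lev < 10 leaves the leaf rows untouched, matching the note above that only the leaves carry non-NULL input values.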
I have two tables I'm doing a full outer join on. They are joined on the variable accsnnum, which they both share. But I'm having an issue: the accsnnum only prints out when it is present in both sets, but I need it to print regardless of whether it's in both, because I'm comparing counts of a variable grouped by accsnum.
To illustrate, I have a labtable that looks like this (there are only these 2 variables):
accsnum lab_lbcount
A12 3
A11 5
A14 7
A13 5
And a rslttable that looks like this (there are only these 2 variables):
accsnum rslt_lbcount
A15 5
A12 2
A14 7
A16 3
I combine these two using the following code:
proc sql;
    create table both as
    select Labtable.accsnnum label="Lab_accsnnum",
           Rslttable.accsnnum label="Rslt_accsnnum",
           Labtable.Lab_lbcount label="Lab_lbtestcd_count",
           Rslttable.Rslt_lbcount label="Rslt_lbtestcd_count",
           Lab_lbcount - Rslt_lbcount as difference
    from work.Labtable
    full outer join work.Rslttable
        on Labtable.accsnnum = Rslttable.accsnnum;
quit;
This produces the following table:
accsnum lab_lbcount rslt_lbcount difference
5 . .
A12 3 2 1
5 . .
A14 7 7 0
. 5 .
. 3 .
Note that it only prints out the accsnum where the two tables join, but includes counts for lab_lbcount and rslt_lbcount. I need to either have it print the accsnum even if it's not in both tables, or figure out a way to join by accsnum but still print the accsnum from each of the two tables. Basically, I want the output to look something like this:
accsnum lab_lbcount rslt_lbcount difference
A11 5 . .
A12 3 2 1
A13 5 . .
A14 7 7 0
A15 . 5 .
A16 . 3 .
Or even something like this would be fine:
accsnum lab_accsum rslt_accsum lab_lbcount rslt_lbcount difference
A11 . 5 . .
A12 A12 A12 3 2 1
A13 . 5 . .
A14 A14 A14 7 7 0
. A15 . 5 .
. A16 . 3 .
Any help is greatly appreciated.
Edit: I'm doing SQL code inside a SAS environment, in case that wasn't clear. But that should have little bearing on the code. Thanks again!
COALESCE the two accsnnums. This function takes the first nonmissing value, so if labtable.accsnnum is available it takes that; otherwise it takes rslttable.accsnnum.
proc sql;
    create table both as
    select coalesce(Labtable.accsnnum, Rslttable.accsnnum) as accsnnum label="Access Number",
           Labtable.Lab_lbcount label="Lab_lbtestcd_count",
           Rslttable.Rslt_lbcount label="Rslt_lbtestcd_count",
           Lab_lbcount - Rslt_lbcount as difference
    from work.Labtable
    full outer join work.Rslttable
        on Labtable.accsnnum = Rslttable.accsnnum;
quit;
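The effect of the full outer join plus COALESCE is easy to see outside SQL as well: it is simply a union of the two key sets, with the counts looked up on each side (toy data from the question, in plain Python):

```python
# accsnum -> count, from the two tables in the question
lab  = {'A12': 3, 'A11': 5, 'A14': 7, 'A13': 5}
rslt = {'A15': 5, 'A12': 2, 'A14': 7, 'A16': 3}

both = {}
for key in sorted(lab.keys() | rslt.keys()):   # union of keys = the COALESCE'd accsnum
    l, r = lab.get(key), rslt.get(key)         # None plays the role of SQL NULL
    diff = l - r if l is not None and r is not None else None
    both[key] = (l, r, diff)
```

Every accsnum appears exactly once, with missing sides left as None, matching the first desired output table.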
I would like to make an SQL query to extract aggregate statistics from the following table:
Company | Product X1 | ProdX2 | ... | ProdX10 | ProdY1 | ProdY2 | ... | ProdY10
ABC 5 3 ... 6 5 8 ... 12
EDF 2 NULL ... 5 Null 1 ... 6
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
XYZ NULL 3 ... 14 7 2 ... 8
The result of the query should look something like this (other design suggestions appreciated)
Product | Average | Min | Covariance with corresponding X or Y Product
ProdX1 Avg(ProdX1) Min(ProdX1) Covar(ProdX1,ProdY1)
ProdX2 Avg(ProdX2) Min(ProdX2) Covar(ProdX2,ProdY2)
.
.
.
ProdY10 Avg(ProdY10) Min(ProdY10) Covar(ProdY10,ProdX10)
I am OK with the different aggregate functions; of course, Covar(X1,Y1) = Covar(Y1,X1).
However, I am not sure how to create a query that returns the desired result.
Any suggestions are much appreciated.
Thank you very much.
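For concreteness, the per-pair statistics being asked for (average, min, and covariance over the rows where both values are present) can be sketched in plain Python; the data below is hypothetical, and most SQL dialects would build the covariance from AVG terms, e.g. the population form covar(x,y) = avg(x*y) - avg(x)*avg(y):

```python
def stats_for_pair(xs, ys):
    """Average and min of xs, plus population covariance of (x, y)
    over rows where both values are present (None stands in for NULL)."""
    x_vals = [x for x in xs if x is not None]
    avg_x, min_x = sum(x_vals) / len(x_vals), min(x_vals)
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs) / len(pairs)
    return avg_x, min_x, cov

# e.g. the ProdX1 / ProdY1 columns across three companies (made-up values)
x1 = [5, 2, None]
y1 = [5, 1, 7]
```

stats_for_pair(x1, y1) then pairs only the rows where both columns are non-NULL, and swapping the arguments leaves the covariance unchanged, as noted above.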