Create two indexes at same time - indexing

Is it possible to create more than one index at the same time in a SAS data step?
I have the following data:
DATA DADOS;
INFILE DATALINES DELIMITER=',';
INPUT SETOR $ DIRETORIA $ 13.;
DATALINES;
SETOR1,DIRETORIA1
SETOR1,DIRETORIA2
SETOR2,DIRETORIA1
SETOR2,DIRETORIA2
SETOR2,DIRETORIA4
SETOR3,DIRETORIA4
SETOR4,DIRETORIA2
SETOR5,DIRETORIA2
SETOR5,DIRETORIA3
;RUN;
Then I need to add two indexes, one for the column SETOR and another for DIRETORIA, like this:
SETOR   DIRETORIA   SETOR_ID  DIRETORIA_ID
SETOR1  DIRETORIA1  1         1
SETOR1  DIRETORIA2  1         2
SETOR2  DIRETORIA1  2         1
SETOR2  DIRETORIA2  2         2
SETOR2  DIRETORIA3  2         3
SETOR3  DIRETORIA4  3         4
SETOR4  DIRETORIA2  4         2
SETOR5  DIRETORIA2  5         2
SETOR5  DIRETORIA3  5         3
I've tried this, but it doesn't work:
DATA DETALHE_1;
SET DADOS;
BY SETOR DIRETORIA;
RETAIN SETOR_ID DIRETORIA_ID;
IF FIRST.SETOR THEN
SETOR_ID + 1;
IF FIRST.DIRETORIA THEN
DIRETORIA_ID + 1;
RUN;
And this is what I got:
SETOR   DIRETORIA   SETOR_ID  DIRETORIA_ID
SETOR1  DIRETORIA1  1         1
SETOR1  DIRETORIA2  1         2
SETOR2  DIRETORIA1  2         3
SETOR2  DIRETORIA2  2         4
SETOR2  DIRETORIA3  2         5
SETOR3  DIRETORIA4  3         6
SETOR4  DIRETORIA2  4         7
SETOR5  DIRETORIA2  5         8
SETOR5  DIRETORIA3  5         9
The SETOR_ID is correct, but DIRETORIA_ID is not.
How can I solve this?

Looks to me like you just want to count, not index in any sense of the word.
You are counting setor across the whole dataset and diretoria within the current value of setor. Your sample data does not repeat any diretoria value within a setor, but let's code it so that if it did, each distinct value within a given setor would be assigned the same count id. The results below include a duplicated SETOR5/DIRETORIA3 row to show this.
data want;
set DADOS;
by setor diretoria;
setor_id + first.setor;
diretoria_id + first.diretoria;
if first.setor then diretoria_id=1;
run;
Results
Obs  SETOR   DIRETORIA   setor_id  diretoria_id
  1  SETOR1  DIRETORIA1  1         1
  2  SETOR1  DIRETORIA2  1         2
  3  SETOR2  DIRETORIA1  2         1
  4  SETOR2  DIRETORIA2  2         2
  5  SETOR2  DIRETORIA4  2         3
  6  SETOR3  DIRETORIA4  3         1
  7  SETOR4  DIRETORIA2  4         1
  8  SETOR5  DIRETORIA2  5         1
  9  SETOR5  DIRETORIA3  5         2
 10  SETOR5  DIRETORIA3  5         2
If you need to assign each value of diretoria the same number across the different values of setor, then you need a different method: one that keeps track of which values of diretoria have already been assigned a number, for example a HASH object. In that case the data need not be sorted by either variable, but the distinct setor and diretoria values must fit into memory.
data want ;
set dados;
if _n_=1 then do;
declare hash s();
declare hash d();
_rc=s.definekey('setor');
_rc=d.definekey('diretoria');
_rc=s.definedata('setor_id');
_rc=d.definedata('diretoria_id');
_rc=s.definedone();
_rc=d.definedone();
end;
if s.find() then do; _setor+1; setor_id=_setor; _rc=s.add(); end;
if d.find() then do; _diretoria+1; diretoria_id=_diretoria; _rc=d.add(); end;
drop _:;
run;
Results:
Obs  SETOR   DIRETORIA   setor_id  diretoria_id
  1  SETOR1  DIRETORIA1  1         1
  2  SETOR1  DIRETORIA2  1         2
  3  SETOR2  DIRETORIA1  2         1
  4  SETOR2  DIRETORIA2  2         2
  5  SETOR2  DIRETORIA4  2         3
  6  SETOR3  DIRETORIA4  3         3
  7  SETOR4  DIRETORIA2  4         2
  8  SETOR5  DIRETORIA2  5         2
  9  SETOR5  DIRETORIA3  5         4
 10  SETOR5  DIRETORIA3  5         4
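For readers who do not use SAS, the hash-object idea can be sketched in Python with two dictionaries. This is only an illustration of the lookup-or-assign logic, not a translation of the data step; the function name assign_ids is made up for the example, and the rows are the nine rows of DADOS from the question.

```python
# Illustration only: two dicts play the role of the two SAS hash objects.
rows = [
    ("SETOR1", "DIRETORIA1"), ("SETOR1", "DIRETORIA2"),
    ("SETOR2", "DIRETORIA1"), ("SETOR2", "DIRETORIA2"),
    ("SETOR2", "DIRETORIA4"), ("SETOR3", "DIRETORIA4"),
    ("SETOR4", "DIRETORIA2"), ("SETOR5", "DIRETORIA2"),
    ("SETOR5", "DIRETORIA3"),
]

def assign_ids(rows):
    """Give each distinct value an id in order of first appearance."""
    setor_ids, diretoria_ids = {}, {}
    out = []
    for setor, diretoria in rows:
        # setdefault stores a new id only the first time a key is seen;
        # len(...) + 1 is evaluated before the key is inserted.
        s_id = setor_ids.setdefault(setor, len(setor_ids) + 1)
        d_id = diretoria_ids.setdefault(diretoria, len(diretoria_ids) + 1)
        out.append((setor, diretoria, s_id, d_id))
    return out

result = assign_ids(rows)
for row in result:
    print(*row)
```

As with the hash version, ids follow the order in which values first appear, so DIRETORIA4 gets 3 and DIRETORIA3 gets 4. The input does not need to be sorted.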

Create Temporary Function in Hive

I have a query like this:
CREATE TEMPORARY MACRO AAA (input VARCHAR(16))
input||
SUBSTR(
LPAD(
(CASE WHEN SUBSTR(LPAD(input,16,0),15,1) = 1 THEN 2
WHEN SUBSTR(LPAD(input,16,0),15,1) = 2 THEN 4
WHEN SUBSTR(LPAD(input,16,0),15,1) = 3 THEN 6
WHEN SUBSTR(LPAD(input,16,0),15,1) = 4 THEN 8
WHEN SUBSTR(LPAD(input,16,0),15,1) = 5 THEN 10
WHEN SUBSTR(LPAD(input,16,0),15,1) = 6 THEN 1
WHEN SUBSTR(LPAD(input,16,0),15,1) = 7 THEN 3
WHEN SUBSTR(LPAD(input,16,0),15,1) = 8 THEN 5
WHEN SUBSTR(LPAD(input,16,0),15,1) = 9 THEN 7
ELSE 0
END) +
(CASE WHEN SUBSTR(LPAD(input,16,0),16,1) = 1 THEN 1
WHEN SUBSTR(LPAD(input,16,0),16,1) = 2 THEN 2
WHEN SUBSTR(LPAD(input,16,0),16,1) = 3 THEN 3
WHEN SUBSTR(LPAD(input,16,0),16,1) = 4 THEN 4
WHEN SUBSTR(LPAD(input,16,0),16,1) = 5 THEN 5
WHEN SUBSTR(LPAD(input,16,0),16,1) = 6 THEN 6
WHEN SUBSTR(LPAD(input,16,0),16,1) = 7 THEN 7
WHEN SUBSTR(LPAD(input,16,0),16,1) = 8 THEN 8
WHEN SUBSTR(LPAD(input,16,0),16,1) = 9 THEN 9
ELSE 0
END)
,3,0), 3, 1);
But I want to create a temporary function in Hive instead.
What needs to change, and what format does a temporary function take?
You will have to write your own UDF.
Ref : https://cwiki.apache.org/confluence/display/Hive/HivePlugins
Once you have jar file with custom logic you can create function like below:
Tested in Hive version apache-hive-4.0.0-alpha-1
CREATE TEMPORARY FUNCTION aa_fun as 'AAA' using jar 'hdfs://localhost:9000/user/hive/aaa_func.jar';
select aa_fun('fwfrgwre12a3') fun_res, AAA('fwfrgwre12a3') macro_res;
+----------------+----------------+
| fun_res | macro_res |
+----------------+----------------+
| fwfrgwre12a33 | fwfrgwre12a33 |
+----------------+----------------+
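Before porting the macro to a UDF, it can help to pin down what it computes. The sketch below mirrors the macro's logic in Python purely as a reference for checking a UDF's output; the UDF itself would normally be written in Java as described on the linked page. The names MAP_15, MAP_16 and aaa are made up for this example, and it assumes that a non-digit character falls through to ELSE 0, which is consistent with the 'fwfrgwre12a3' result shown above.

```python
# Digit maps copied from the two CASE expressions in the macro.
MAP_15 = {"1": 2, "2": 4, "3": 6, "4": 8, "5": 10,
          "6": 1, "7": 3, "8": 5, "9": 7}
MAP_16 = {str(d): d for d in range(1, 10)}

def aaa(value: str) -> str:
    """Reference version of the AAA macro (illustrative, not the UDF)."""
    # LPAD(input, 16, 0): left-pad with '0' to 16 characters,
    # truncating if the input is longer.
    padded = value.rjust(16, "0")[:16]
    # Characters 15 and 16 (1-based); anything not in the maps
    # is assumed to hit ELSE 0.
    total = MAP_15.get(padded[14], 0) + MAP_16.get(padded[15], 0)
    # SUBSTR(LPAD(total, 3, 0), 3, 1): third character of the padded sum.
    check = str(total).rjust(3, "0")[2]
    return value + check

print(aaa("fwfrgwre12a3"))  # fwfrgwre12a33, matching the query output above
```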

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I would like to sum the columns by prefix (C1, H2, K3) and output three new columns with the totals. The final result should be:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
idx = df.columns.str.startswith(i)
for j in lst2:
df[j] = df.iloc[:,idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I keep getting this error:
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total column to produce my end result in the loop.
Any help will be appreciated (or a better suggestion, as opposed to a loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(df[cols].groupby(col_groups, axis=1)
.sum()
.add_prefix('total_')
)
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Let us split the column names, then group by the result with axis=1:
out = df.groupby(df.columns.str.split('-').str[0],axis=1).sum().set_index('sn').add_prefix('Total_').reset_index()
Out[84]:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3

python3: How to sum column values over every three rows

There is a dataframe like the following:
id  t_id  y1  y2
1   1     2   1
1   2     2   1
1   3     2   1
1   4     2   1
1   ...
1   15    2   1
2   1     2   8
2   2     5   6
2   3     5   7
2   4     5   5
2   ...
2   15    5   10
...
There are several ids (1, 2, ...), each with t_id values 1 to 15, and each t_id has a y1 and a y2. I want to sum y1 and y2 for each id over every three t_id values (1-3, 4-6, 7-9, 10-12, 13-15), as in the following dataframe (just an example):
id  t_id      y1  y2
1   1,2,3     6   3
1   4,5,6     6   3
1   7,8,9     6   3
1   10,11,12  6   3
1   13,14,15  6   3
...
Thanks in advance!
You can map t_id and groupby:
mapped_t_id = (df['t_id']-1)//3
(df.groupby(['id', mapped_t_id])
.agg({'t_id':set, 'y1':'sum', 'y2':'sum'})
)
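To make that concrete, here is a small self-contained sketch of the same mapping on made-up data shaped like the question's (two ids, t_id from 1 to 15). It joins the t_id values into a string instead of collecting a set, which is a presentation choice of this example rather than part of the answer above; the name block is also just illustrative.

```python
import pandas as pd

# Made-up data in the shape described in the question.
df = pd.DataFrame({
    "id": [1] * 15 + [2] * 15,
    "t_id": list(range(1, 16)) * 2,
    "y1": [2] * 15 + [5] * 15,
    "y2": [1] * 15 + list(range(15, 0, -1)),
})

# t_id 1-3 -> block 0, 4-6 -> block 1, and so on.
block = ((df["t_id"] - 1) // 3).rename("block")

out = (df.groupby(["id", block])
         .agg(t_id=("t_id", lambda s: ",".join(map(str, s))),
              y1=("y1", "sum"),
              y2=("y2", "sum"))
         .reset_index()
         .drop(columns="block"))

print(out)
```

Because the grouping uses the t_id value rather than the row position, this still works if some rows are missing or the frame is not sorted.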
You could process the entire dataframe by grouping it by 3 rows at a time. Here's how to do it.
import pandas as pd
df = pd.DataFrame({'id' : [1]*15 + [2]*15,
                   't_id':[i for i in range(1,16)]*2,
                   'y1': [2]*15 + [5]*15,
                   'y2':[1]*15 + [i for i in range (15,0,-1)]})
print (df)
df1 = pd.DataFrame(data=None, columns=df.columns)
# walk through the frame three rows at a time
for index, g in df.groupby(df.index // 3):
    df_id = g.iloc[0,0]
    df_tid = ','.join(str(g.iloc[i,1]) for i in range (3))
    df_y1 = g.sum(axis=0)['y1']
    df_y2 = g.sum(axis=0)['y2']
    df1.loc[index] = [df_id,df_tid,df_y1,df_y2]
print (df1)
The output of this will be:
id t_id y1 y2
0 1 1,2,3 6 3
1 1 4,5,6 6 3
2 1 7,8,9 6 3
3 1 10,11,12 6 3
4 1 13,14,15 6 3
5 2 1,2,3 15 42
6 2 4,5,6 15 33
7 2 7,8,9 15 24
8 2 10,11,12 15 15
9 2 13,14,15 15 6
While this works, I do like how @Quang Hoang solved it. It is more concise.

Insert new rows based on conditions - Oracle SQL

I have a table:
table1
unique_id  col_id  col_nm   col_val  sequ
1          1       testq 1  100      1
1          2       testc 1  abc      1
1          1       testq 1  101      2
1          2       testc 1  xyz      2
1          5       test 5   10       1
1          8       test 6   100      1
2          1       testq 1  100      1
2          2       testc 1  pqr      1
2          1       testq 1  101      2
2          2       testc 1  xxy      2
2          5       test 5   qqw      1
2          8       test 6   100      1
I need to insert new rows in the table based on the following condition:
1. Find unique_id and sequ of col_id = 1 and col_nm = 'testq 1' and col_val = 100
2. Find col_val of col_id = 2 and col_nm = 'testc 1' and sequ = {sequ of step 1} and unique_id = {unique_id of step 1}.
3. Insert a new row for the corresponding unique_id, with col_id = 100, col_nm = 'test q100c', col_val = {col_val found in step 2}, sequ = {sequ found in step 2}
The output would be:
unique_id  col_id  col_nm      col_val  sequ
1          1       testq 1     100      1
1          2       testc 1     abc      1
1          1       testq 1     101      2
1          2       testc 1     xyz      2
1          5       test 5      10       1
1          8       test 6      100      1
1          100     test q100c  abc      1
2          1       testq 1     100      2
2          2       testc 1     pqr      2
2          1       testq 1     101      2
2          2       testc 1     xxy      2
2          5       test 5      qqw      1
2          8       test 6      100      1
2          100     test q100c  pqr      2
Is there any way in SQL to achieve this?
We can use WITH clause in an INSERT … SELECT construct. So something like this?
insert into table1
with s1 as (
select t.unique_id
, t.sequ
from table1 t
where t.col_id = 1
and t.col_nm = 'testq 1'
and t.col_val = '100' )
, s2 as (
select s1.*
, t.col_val
from s1
join table1 t
on t.sequ = s1.sequ
and t.unique_id = s1.unique_id
where t.col_id = 2
and t.col_nm = 'testc 1'
)
select s2.unique_id
,100 as col_id
,'test q100c' as col_nm
,s2.col_val
,s2.sequ
from s2
/
I'm not sure I have entirely understood your rules - I used the col_val from step #2 (which is what your expected output shows) rather than the value from step #1 as your rule 3 states - but I hope this gives you a start. Also, this may not be a very efficient approach. I offer no guarantees regarding performance over a large volume of data.
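If you want to try the logic without an Oracle instance, the sketch below runs the same three steps against an in-memory SQLite database from Python. It is an illustration only: it uses SQLite rather than Oracle, writes the two steps as a self-join instead of a WITH clause, and assumes col_val is a text column (it holds values such as 'abc').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (unique_id INTEGER, col_id INTEGER, "
    "col_nm TEXT, col_val TEXT, sequ INTEGER)"
)

# The twelve input rows from the question.
rows = [
    (1, 1, "testq 1", "100", 1), (1, 2, "testc 1", "abc", 1),
    (1, 1, "testq 1", "101", 2), (1, 2, "testc 1", "xyz", 2),
    (1, 5, "test 5", "10", 1),   (1, 8, "test 6", "100", 1),
    (2, 1, "testq 1", "100", 1), (2, 2, "testc 1", "pqr", 1),
    (2, 1, "testq 1", "101", 2), (2, 2, "testc 1", "xxy", 2),
    (2, 5, "test 5", "qqw", 1),  (2, 8, "test 6", "100", 1),
]
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", rows)

# q = the step 1 row, c = the matching step 2 row with the same
# unique_id and sequ; the SELECT produces the step 3 row.
conn.execute("""
    INSERT INTO table1 (unique_id, col_id, col_nm, col_val, sequ)
    SELECT q.unique_id, 100, 'test q100c', c.col_val, c.sequ
    FROM table1 AS q
    JOIN table1 AS c
      ON c.unique_id = q.unique_id AND c.sequ = q.sequ
    WHERE q.col_id = 1 AND q.col_nm = 'testq 1' AND q.col_val = '100'
      AND c.col_id = 2 AND c.col_nm = 'testc 1'
""")

new_rows = conn.execute(
    "SELECT unique_id, col_id, col_nm, col_val, sequ "
    "FROM table1 WHERE col_id = 100 ORDER BY unique_id"
).fetchall()
print(new_rows)
```

With the input table as posted, both new rows get sequ = 1, because that is the sequ of the rows where col_val is 100.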

reorder sort_order in table with sqlite

I have this table:
id sort_ord
0 6
1 7
2 2
3 3
4 4
5 5
6 8
Why does this query:
UPDATE table
SET sort_ord=(
SELECT count(*)
FROM table AS pq
WHERE sort_ord<table.sort_ord
ORDER BY sort_ord
)
WHERE(sort_ord>=0)
Produce:
id sort_ord
0 4
1 5
2 0
3 1
4 2
5 4
6 6
I was expecting every sort_ord value to be reduced by 2.
This behavior is described here: https://www.sqlite.org/isolation.html
As I read that page, the UPDATE and the SELECT count(*) subquery are not isolated from each other: the subquery sees the rows that the same UPDATE has already changed.
By the time the update reaches id 5, the earlier rows already hold their new sort_ord values, so the count for id 5 comes out as 4 rather than 3.
Every evaluation of the subquery reads the data as it stands at that moment. This is the table just before id 5 is updated (** marks rows not yet updated):
id sort_ord
0 4
1 5
2 0
3 1
4 2
5 5**
6 8**
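The effect can be reproduced without SQLite. The sketch below is a plain Python simulation, not SQLite itself: it assumes rows are updated one at a time in id order and that each count reads the current state of the table. The function names are made up for the example.

```python
original = [6, 7, 2, 3, 4, 5, 8]  # sort_ord for ids 0..6

def update_in_place(values):
    """Each row's count sees the rows that were already updated."""
    current = list(values)
    for i in range(len(current)):
        current[i] = sum(1 for v in current if v < current[i])
    return current

def update_from_snapshot(values):
    """Each row's count is taken against the original, unchanged values."""
    snapshot = list(values)
    return [sum(1 for v in snapshot if v < x) for x in snapshot]

print(update_in_place(original))       # [4, 5, 0, 1, 2, 4, 6]
print(update_from_snapshot(original))  # [4, 5, 0, 1, 2, 3, 6]
```

The in-place version matches the result in the question, and the snapshot version gives the expected "minus 2" values. One way to get snapshot behavior in SQL is to compute the new values into a separate table (or temporary table) first and update from that.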