Remove duplicates from left join - sql

Table A:
MSISDN A B C
990 1 2 3
992 1 2 3
995 1 2 3
993 1 2 3
991 1 2 3
994 1 2 3
Table B:
MSISDN A B C
990 2 2 3
992 2 2 4
993 1 2 3
994 1 2 3
995 1 6 3
990 1 2 3
991 2 2 3
992 2 2 3
995 1 2 3
select msis1.msisdn,msis1.a,msis2.c from msis1 left join msis2 on msis1.msisdn=msis2.msisdn;
MSISDN A C
990 1 3
992 1 4
993 1 3
994 1 3
995 1 3
990 1 3
991 1 3
992 1 3
995 1 3
I want to modify the above query so that it does not return duplicate records.

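Try SELECT DISTINCT; it returns each distinct combination of the selected columns only once. Note that 992 will still appear twice, because its two matching rows in Table B have different C values (4 and 3).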
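A minimal sketch of the DISTINCT suggestion, using Python's built-in sqlite3 module as a stand-in for the actual database, which the question does not name. The column names of msis2 (a, b, c) are an assumption taken from Table A's header.

```python
import sqlite3

# In-memory SQLite database standing in for the asker's (unnamed) DBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE msis1 (msisdn INTEGER, a INTEGER, b INTEGER, c INTEGER)")
cur.execute("CREATE TABLE msis2 (msisdn INTEGER, a INTEGER, b INTEGER, c INTEGER)")

# Table A: every MSISDN has a=1, b=2, c=3.
cur.executemany(
    "INSERT INTO msis1 VALUES (?, ?, ?, ?)",
    [(m, 1, 2, 3) for m in (990, 992, 995, 993, 991, 994)],
)
# Table B, copied from the question.
cur.executemany(
    "INSERT INTO msis2 VALUES (?, ?, ?, ?)",
    [
        (990, 2, 2, 3), (992, 2, 2, 4), (993, 1, 2, 3),
        (994, 1, 2, 3), (995, 1, 6, 3), (990, 1, 2, 3),
        (991, 2, 2, 3), (992, 2, 2, 3), (995, 1, 2, 3),
    ],
)

join_sql = (
    "SELECT {distinct} msis1.msisdn, msis1.a, msis2.c "
    "FROM msis1 LEFT JOIN msis2 ON msis1.msisdn = msis2.msisdn "
    "ORDER BY 1, 3"
)
plain = cur.execute(join_sql.format(distinct="")).fetchall()
distinct_rows = cur.execute(join_sql.format(distinct="DISTINCT")).fetchall()

print(len(plain), "rows without DISTINCT")
print(len(distinct_rows), "rows with DISTINCT")
for row in distinct_rows:
    print(row)
```

If exactly one row per MSISDN is required, DISTINCT alone is not enough; you would have to decide which C value to keep, for example with GROUP BY msisdn and MAX(c).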

How can I add rows iteratively to a select result set in pl sql?

In the work_order table there is a wo_no column. When I query the work_order table, I want two additional columns (task_no, task_step_no) in the result set.
This should be repeated for every wo_no in the work_order table: task_no should go up to 5 and task_step_no up to 2000. (The original post illustrated the expected result set with an attached image.)
Any idea how to get such a result set in PL/SQL?
One option is to use 2 row generators cross joined to your current table.
with
  work_order (wo_no) as
    (select 1 from dual union all
     select 2 from dual
    ),
  task (task_no) as
    (select level from dual connect by level <= 5),
  step (task_step_no) as
    (select level from dual connect by level <= 20)  --> you'd have 2000 here
select y.wo_no, t.task_no, s.task_step_no
from work_order y cross join task t cross join step s
order by 1, 2, 3;
Result:
WO_NO TASK_NO TASK_STEP_NO
---------- ---------- ------------
1 1 1
1 1 2
1 1 3
1 1 4
1 1 5
1 1 6
1 1 7
1 1 8
1 1 9
1 1 10
1 1 11
1 1 12
1 1 13
1 1 14
1 1 15
1 1 16
1 1 17
1 1 18
1 1 19
1 1 20
1 2 1
1 2 2
1 2 3
1 2 4
1 2 5
1 2 6
1 2 7
1 2 8
1 2 9
1 2 10
1 2 11
1 2 12
1 2 13
1 2 14
1 2 15
1 2 16
1 2 17
1 2 18
1 2 19
1 2 20
1 3 1
1 3 2
1 3 3
1 3 4
1 3 5
1 3 6
1 3 7
1 3 8
1 3 9
1 3 10
1 3 11
1 3 12
1 3 13
1 3 14
1 3 15
1 3 16
1 3 17
1 3 18
1 3 19
1 3 20
1 4 1
1 4 2
1 4 3
1 4 4
1 4 5
1 4 6
1 4 7
1 4 8
1 4 9
1 4 10
1 4 11
1 4 12
1 4 13
1 4 14
1 4 15
1 4 16
1 4 17
1 4 18
1 4 19
1 4 20
1 5 1
1 5 2
1 5 3
1 5 4
1 5 5
1 5 6
1 5 7
1 5 8
1 5 9
1 5 10
1 5 11
1 5 12
1 5 13
1 5 14
1 5 15
1 5 16
1 5 17
1 5 18
1 5 19
1 5 20
2 1 1
2 1 2
2 1 3
2 1 4
2 1 5
2 1 6
2 1 7
2 1 8
2 1 9
2 1 10
2 1 11
2 1 12
2 1 13
2 1 14
2 1 15
2 1 16
2 1 17
2 1 18
2 1 19
2 1 20
2 2 1
2 2 2
2 2 3
2 2 4
2 2 5
2 2 6
2 2 7
2 2 8
2 2 9
2 2 10
2 2 11
2 2 12
2 2 13
2 2 14
2 2 15
2 2 16
2 2 17
2 2 18
2 2 19
2 2 20
2 3 1
2 3 2
2 3 3
2 3 4
2 3 5
2 3 6
2 3 7
2 3 8
2 3 9
2 3 10
2 3 11
2 3 12
2 3 13
2 3 14
2 3 15
2 3 16
2 3 17
2 3 18
2 3 19
2 3 20
2 4 1
2 4 2
2 4 3
2 4 4
2 4 5
2 4 6
2 4 7
2 4 8
2 4 9
2 4 10
2 4 11
2 4 12
2 4 13
2 4 14
2 4 15
2 4 16
2 4 17
2 4 18
2 4 19
2 4 20
2 5 1
2 5 2
2 5 3
2 5 4
2 5 5
2 5 6
2 5 7
2 5 8
2 5 9
2 5 10
2 5 11
2 5 12
2 5 13
2 5 14
2 5 15
2 5 16
2 5 17
2 5 18
2 5 19
2 5 20
200 rows selected.
As you already have the work_order table, you'd just use it in the FROM clause (not as a CTE):
with
  task (task_no) as
    (select level from dual connect by level <= 5),
  step (task_step_no) as
    (select level from dual connect by level <= 20)  --> you'd have 2000 here
select y.wo_no, t.task_no, s.task_step_no
from work_order y cross join task t cross join step s
order by 1, 2, 3;
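For readers who want to check the shape of the result outside the database, the same cross-join logic can be sketched in Python with itertools.product. The wo_nos list is a hypothetical stand-in for the contents of work_order, and MAX_STEP is kept at 20 to match the sample output above.

```python
from itertools import product

# Hypothetical stand-in for: select wo_no from work_order
wo_nos = [1, 2]
MAX_TASK = 5
MAX_STEP = 20  # would be 2000 in the real case

# product() iterates the rightmost range fastest, which matches
# "order by 1, 2, 3" as long as wo_nos is already sorted.
rows = list(product(wo_nos, range(1, MAX_TASK + 1), range(1, MAX_STEP + 1)))

print(len(rows), "rows")
for wo_no, task_no, task_step_no in rows[:3]:
    print(wo_no, task_no, task_step_no)
```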

pandas join of tables that have the same columns, based on a single key

I have two dataframes:
df1 =
K C1 C2 C3 ... Cn D1 D2 D3
k1 1 2 4 7 1 2 3
k2 3 5 6 1 2 3 4
df2 =
K C1 C2 C3 ... Cn B1 P1
k1 1 2 4 7 0 0
k1 1 2 4 7 0 1
k1 1 2 4 7 1 0
k1 1 2 4 7 1 1
k2 3 5 6 1 0 0
k2 3 5 6 1 0 1
k2 3 5 6 1 1 0
k2 3 5 6 1 1 1
I want to join in order to get:
df_merged =
K C1 C2 C3 ... Cn B1 P1 D1 D2 D3
k1 1 2 4 7 0 0 1 2 3
k1 1 2 4 7 0 1 1 2 3
k1 1 2 4 7 1 0 1 2 3
k1 1 2 4 7 1 1 1 2 3
k2 3 5 6 1 0 0 2 3 4
k2 3 5 6 1 0 1 2 3 4
k2 3 5 6 1 1 0 2 3 4
k2 3 5 6 1 1 1 2 3 4
I don't want to do a left join on all the columns [K, ..., Cn] because it would be very expensive.
What is the best way to do so?
Just a thought: if we can find sub-arrays in the matrix and treat each one as a single value, that could shrink the number of combinations.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array(np.meshgrid(['k1 1 2 4 7','k2 3 5 6 1'],[0,1],[0,1],['1 2 3', '2 3 4'])).T.reshape(-1,4), columns=['KC','B','P','D'])
df
###
front = pd.DataFrame(df['KC'].str.split(' ').values.tolist()).add_prefix('C_')
front
###
C_0 C_1 C_2 C_3 C_4
0 k1 1 2 4 7
1 k1 1 2 4 7
2 k2 3 5 6 1
3 k2 3 5 6 1
4 k1 1 2 4 7
5 k1 1 2 4 7
6 k2 3 5 6 1
7 k2 3 5 6 1
8 k1 1 2 4 7
9 k1 1 2 4 7
10 k2 3 5 6 1
11 k2 3 5 6 1
12 k1 1 2 4 7
13 k1 1 2 4 7
14 k2 3 5 6 1
15 k2 3 5 6 1
rear = pd.DataFrame(df['D'].str.split(' ').values.tolist()).add_prefix('D_')
rear
###
D_0 D_1 D_2
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
6 1 2 3
7 1 2 3
8 2 3 4
9 2 3 4
10 2 3 4
11 2 3 4
12 2 3 4
13 2 3 4
14 2 3 4
15 2 3 4
# output = front.join(df[['B','P']]).join(rear)
output = pd.concat([front,df[['B','P']],rear],axis=1)
output.rename(columns={'C_0':'K'},inplace=True)
output
###
K C_1 C_2 C_3 C_4 B P D_0 D_1 D_2
0 k1 1 2 4 7 0 0 1 2 3
1 k1 1 2 4 7 1 0 1 2 3
2 k2 3 5 6 1 0 0 1 2 3
3 k2 3 5 6 1 1 0 1 2 3
4 k1 1 2 4 7 0 1 1 2 3
5 k1 1 2 4 7 1 1 1 2 3
6 k2 3 5 6 1 0 1 1 2 3
7 k2 3 5 6 1 1 1 1 2 3
8 k1 1 2 4 7 0 0 2 3 4
9 k1 1 2 4 7 1 0 2 3 4
10 k2 3 5 6 1 0 0 2 3 4
11 k2 3 5 6 1 1 0 2 3 4
12 k1 1 2 4 7 0 1 2 3 4
13 k1 1 2 4 7 1 1 2 3 4
14 k2 3 5 6 1 0 1 2 3 4
15 k2 3 5 6 1 1 1 2 3 4
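A different approach, sketched here as an alternative rather than as what the answer above does: join on K alone and bring over only the D columns from df1. This assumes K uniquely identifies a row in df1 and that the C columns already agree between the two frames for a given K, as they do in the sample. The concrete column list (with Cn as a literal column name) is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "K": ["k1", "k2"],
    "C1": [1, 3], "C2": [2, 5], "C3": [4, 6], "Cn": [7, 1],
    "D1": [1, 2], "D2": [2, 3], "D3": [3, 4],
})
df2 = pd.DataFrame({
    "K": ["k1"] * 4 + ["k2"] * 4,
    "C1": [1] * 4 + [3] * 4, "C2": [2] * 4 + [5] * 4,
    "C3": [4] * 4 + [6] * 4, "Cn": [7] * 4 + [1] * 4,
    "B1": [0, 0, 1, 1] * 2, "P1": [0, 1, 0, 1] * 2,
})

d_cols = ["D1", "D2", "D3"]
# Only K and the D columns take part in the merge, so the shared C columns
# are neither compared nor duplicated. validate raises if K is not unique
# in the right-hand frame.
df_merged = df2.merge(df1[["K"] + d_cols], on="K", how="left",
                      validate="many_to_one")
print(df_merged)
```

Selecting the columns before merging keeps the join key to a single column, which is the cost the question was worried about.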

How to count the duplicates of rows in a dataframe with multiple columns of integers

I have a question about counting duplicate rows in a dataframe. For example, I have the following dataframe.
df1 =
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
6 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
9 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Is there a way to count the duplicates and get the following dataframe?
df1_duplicates =
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Count
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4
1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4
2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
I have tried using the following code,
df_duplicates = df1.groupby(df1.columns.tolist()).size().rename(columns={0:'count'})
It does give me the count, but the output becomes a single-column dataframe, as shown below.
df_I_dont_want_this =
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
Is this what you want?
df1.groupby(df1.columns.tolist()).size().to_frame('count').reset_index()
Out[28]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 count
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4
1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4
2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
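A self-contained version of the same idea, rebuilding the sample frame from the question so it can be run as is. It uses reset_index(name=...) instead of to_frame(...).reset_index(); both give the grouping columns back as ordinary columns. The Count column name follows the desired output in the question.

```python
import pandas as pd

# Rebuild the question's df1: ten rows, each repeating one value 16 times.
values = [2, 3, 1, 1, 1, 1, 2, 2, 2, 3]
df1 = pd.DataFrame([[v] * 16 for v in values])

# Group on every column, count group sizes, then turn the group keys
# (the index of the resulting Series) back into columns.
df1_duplicates = (
    df1.groupby(df1.columns.tolist())
       .size()
       .reset_index(name="Count")
)
print(df1_duplicates)
```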

How to add aggregated rows based on other rows in Pandas dataframe

I have a dataframe similar to this:
index a b c d
0 1 1 1 3
1 1 1 2 1
2 1 2 1 4
3 1 2 2 1
4 2 2 1 5
5 2 2 2 9
6 2 2 1 2
7 2 3 2 6
I want to add new rows where c is 0, and d is replaced with the maximum value of d of existing rows where a and b are the same:
index a b c d
8 1 1 0 3
9 1 2 0 4
10 2 2 0 9
11 2 3 0 6
What command can I use? Thanks!
It seems you can chain sort_values with drop_duplicates and then append the result. (DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)
pd.concat([df, df.sort_values('d').drop_duplicates(['a','b'],keep='last').assign(c=0)])
Out[77]:
a b c d
index
0 1 1 1 3
1 1 1 2 1
2 1 2 1 4
3 1 2 2 1
4 2 2 1 5
5 2 2 2 9
6 2 2 1 2
7 2 3 2 6
0 1 1 0 3
2 1 2 0 4
7 2 3 0 6
5 2 2 0 9
I came up with a solution using groupby and pd.concat:
pd.concat([df, df.groupby(['a', 'b'])['d'].max().reset_index().assign(c=0)], ignore_index=True)
Out[1668]:
a b c d
0 1 1 1 3
1 1 1 2 1
2 1 2 1 4
3 1 2 2 1
4 2 2 1 5
5 2 2 2 9
6 2 2 1 2
7 2 3 2 6
8 1 1 0 3
9 1 2 0 4
10 2 2 0 9
11 2 3 0 6
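For anyone who wants to try this without retyping the data, here is a runnable sketch of the groupby approach with the question's sample frame built inline. The intermediate name maxima is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 1, 2, 2, 2, 2],
    "b": [1, 1, 2, 2, 2, 2, 2, 3],
    "c": [1, 2, 1, 2, 1, 2, 1, 2],
    "d": [3, 1, 4, 1, 5, 9, 2, 6],
})

# One row per (a, b) pair holding the maximum d, with c set to 0.
maxima = df.groupby(["a", "b"], as_index=False)["d"].max().assign(c=0)

# Reorder to the original column order and append with a fresh index.
result = pd.concat([df, maxima[df.columns]], ignore_index=True)
print(result)
```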

Paste two files into one using Linux

I was trying to paste two files into one in Linux:
file1
1 101 0 0 0 -9
1 102 0 0 0 -9
1 103 0 0 0 -9
1 104 0 0 0 -9
1 105 0 0 0 -9
1 106 0 0 0 -9
1 107 0 0 0 -9
1 108 0 0 0 -9
1 109 0 0 0 -9
1 110 0 0 0 -9
file2:
2 2 1 3 1 3 3 3 1 3 1 1 1 3 1 2 1 2 1 3 1 3 1 2 1
1 2 3 3 1 1 3 3 1 1 1 1 3 3 2 2 1 1 1 1 3 3 1 1 1
2 2 1 3 3 3 1 3 1 3 1 3 1 3 1 2 1 2 1 3 1 3 1 2 1
1 2 3 3 3 3 1 1 1 1 3 3 3 3 2 2 1 1 1 1 3 3 1 1 1
1 2 1 3 3 3 1 3 1 3 1 3 1 3 1 2 1 2 1 3 1 3 1 1 3
1 2 3 3 1 3 1 3 1 1 1 3 3 3 2 2 1 1 1 1 3 3 1 1 1
2 2 3 3 1 3 1 3 1 1 1 3 3 3 2 2 1 1 1 1 3 3 1 1 1
1 2 3 3 1 3 1 3 1 1 1 3 3 3 2 2 1 1 1 1 3 3 1 1 1
2 2 3 3 1 3 1 3 1 1 1 3 3 3 2 2 1 1 1 1 3 3 1 1 1
1 1 3 3 3 3 1 1 1 1 3 3 3 3 2 2 1 1 1 1 3 3 1 1 3
I have tried paste -d " " file1 file2 > output and paste file1 file2 | sed 's/\t/ /' > file3, but for some reason neither worked. I only see the content of file2 in the output.
The desired output is:
1 101 0 0 0 -9 2 2 1 3 1 3 3 3 1 3 1 1 1 3 1 2 1 2 1 3 1 3 1 2 1
1 102 0 0 0 -9 1 2 3 3 1 1 3 3 1 1 1 1 3 3 2 2 1 1 1 1 3 3 1 1 1
1 103 0 0 0 -9 2 2 1 3 3 3 1 3 1 3 1 3 1 3 1 2 1 2 1 3 1 3 1 2 1
1 104 0 0 0 -9 1 2 3 3 3 3 1 1 1 1 3 3 3 3 2 2 1 1 1 1 3 3 1 1 1
1 105 0 0 0 -9 1 2 1 3 3 3 1 3 1 3 1 3 1 3 1 2 1 2 1 3 1 3 1 1 3
1 106 0 0 0 -9 1 2 3 3 1 3 1 3 1 1 1 3 3 3 2 2 1 1 1 1 3 3 1 1 1
1 107 0 0 0 -9 2 2 3 3 1 3 1 3 1 1 1 3 3 3 2 2 1 1 1 1 3 3 1 1 1
1 108 0 0 0 -9 1 2 3 3 1 3 1 3 1 1 1 3 3 3 2 2 1 1 1 1 3 3 1 1 1
1 109 0 0 0 -9 2 2 3 3 1 3 1 3 1 1 1 3 3 3 2 2 1 1 1 1 3 3 1 1 1
1 110 0 0 0 -9 1 1 3 3 3 3 1 1 1 1 3 3 3 3 2 2 1 1 1 1 3 3 1 1 3
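Note that when I view the output with less, the two files are joined, but with ^M between them.
Any help is highly appreciated.
The ^M is a carriage return: file1 has CRLF line endings. When a pasted line is printed to the terminal, the CR moves the cursor back to the start of the line and file2's text overwrites file1's, which is why only file2 seems to appear. Run dos2unix on file1 to remove the CRs, then paste again to get the expected output.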
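If dos2unix is not available, the same fix can be sketched in Python: strip the trailing CR/LF from each line before joining. The helper name paste_files and the shortened sample rows are illustrative only.

```python
import os
import tempfile


def paste_files(path1, path2, sep=" "):
    """Join corresponding lines of two files, dropping CR and LF endings."""
    # newline="" keeps line endings untranslated so we can strip them ourselves.
    with open(path1, newline="") as f1, open(path2, newline="") as f2:
        return [a.rstrip("\r\n") + sep + b.rstrip("\r\n") for a, b in zip(f1, f2)]


with tempfile.TemporaryDirectory() as tmp:
    p1 = os.path.join(tmp, "file1")
    p2 = os.path.join(tmp, "file2")
    # file1 gets CRLF endings, like the one in the question.
    with open(p1, "wb") as f:
        f.write(b"1 101 0 0 0 -9\r\n1 102 0 0 0 -9\r\n")
    # file2 uses plain LF endings (rows shortened for the example).
    with open(p2, "wb") as f:
        f.write(b"2 2 1 3\n1 2 3 3\n")
    merged = paste_files(p1, p2)

for line in merged:
    print(line)
```

Like paste, zip pairs lines positionally; unlike paste, it stops at the shorter file, so this sketch assumes both files have the same number of lines.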