Join information from table A with information from another table B multiple times - SQL

This may be a simple question, but I would like to learn whether this can be done in one query.
Table A: contains gene information
gene start end
1 a 5 0
2 b 6 1
3 c 7 2
4 d 8 3
5 e 9 4
6 f 10 5
7 g 11 6
8 h 12 7
9 i 13 8
10 j 14 9
Table B: contains calculated gene information.
gene1 gene2 cor
1 d j -0.7600805
2 c i 0.4274278
3 e g -0.9249361
4 a f 0.8567928
5 b h -0.3018518
6 d j -0.3723553
7 c i 0.1617981
8 e g 0.8575933
9 a f 0.8409788
10 b h 0.1506035
The result table I'm trying to get is:
gene1 gene2 cor start1 end1 start2 end2
1 d j -0.7600805 8 3 14 9
2 c i 0.4274278 7 2 13 8
3 e g -0.9249361
4 a f 0.8567928
5 b h -0.3018518
6 d j -0.3723553 etc.
7 c i 0.1617981
8 e g 0.8575933
9 a f 0.8409788
10 b h 0.1506035
The method I can think of is to join table A onto table B twice, first by gene1 and then by gene2, which would require an intermediate table. Is there a simpler way to achieve this in one step?

Yes, two joins will do it
You simply need to do this:
SELECT b.Gene1
,b.Gene2
,b.cor
,a1.Start AS Start1
,a1.End AS End1
,a2.Start AS Start2
,a2.End AS End2
FROM TableB b
INNER JOIN TableA a1
ON a1.Gene = b.Gene1
INNER JOIN TableA a2
ON a2.Gene = b.Gene2
Depending on your DBMS you may need to tweak the syntax a bit.
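For what it's worth, the same double join translates directly to pandas. A minimal sketch, assuming two frames built from the question's tables (the frame names and the use of only the first five TableB rows are mine):

import pandas as pd

# Hypothetical frames mirroring TableA and the first five rows of TableB.
table_a = pd.DataFrame({'gene': list('abcdefghij'),
                        'start': range(5, 15),
                        'end': range(0, 10)})
table_b = pd.DataFrame({'gene1': list('dceab'),
                        'gene2': list('jigfh'),
                        'cor': [-0.7600805, 0.4274278, -0.9249361,
                                0.8567928, -0.3018518]})

# Same idea as the SQL: merge table A once per gene column. add_suffix
# renames gene/start/end to gene1/start1/end1 (then gene2/start2/end2),
# so both sets of coordinates survive the merges.
result = (table_b
          .merge(table_a.add_suffix('1'), on='gene1')
          .merge(table_a.add_suffix('2'), on='gene2'))
print(result)  # gene1 gene2 cor start1 end1 start2 end2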

Related

How do I add specific columns from different tables onto an existing table in PostgreSQL?

I have an original table (TABLE 1):
A  B  C  D
1  3  5  7
2  4  6  8
I want to add column F from the table below (Table 2) onto table 1:
A  F   G  H
1  29  5  7
2  30  6  8
As well as adding columns J, L, and O from the table below (Table 3) onto Table 1:
A  I   J   K   L   M   N   O
1  9   11  13  15  17  19  21
2  10  12  14  16  18  20  22
How do I go about adding only the specific columns onto table 1?
Expected Result:
A  B  C  D  F   J   L   O
1  3  5  7  29  11  15  21
2  4  6  8  30  12  16  22
Use the following query:
SELECT T1.A,
       B,
       C,
       D,
       F,
       J,
       L,
       O
FROM table1 T1
JOIN table2 T2 ON T1.A = T2.A
JOIN table3 T3 ON T1.A = T3.A

pandas groupby apply optimizing a loop

For the following data:
index bond stock investor_bond investor_stock
0 1 2 A B
1 1 2 A E
2 1 2 A F
3 1 2 B B
4 1 2 B E
5 1 2 B F
6 1 3 A A
7 1 3 A E
8 1 3 A G
9 1 3 B A
10 1 3 B E
11 1 3 B G
12 2 4 C F
13 2 4 C A
14 2 4 C C
15 2 5 B E
16 2 5 B B
17 2 5 B H
Bond 1 has two investors, A and B. Stock 2 has three investors, B, E, and F. For each investor pair (investor_bond, investor_stock), we want to filter the row out if the two investors have ever invested in the same bond or stock.
For example, the pair (B, F) at index 5 should be filtered out because both of them invested in stock 2.
Sample output should be like:
index bond stock investor_bond investor_stock
11 1 3 B G
So far I have tried using two loops.
A1 = A1.groupby('bond').apply(lambda x: x[~x.investor_stock.isin(x.bond)]).reset_index(drop=True)
stock_list=A1.groupby(['bond','stock']).apply(lambda x: x.investor_stock.unique()).reset_index()
stock_list=stock_list.rename(columns={0:'s'})
stock_list=stock_list.groupby('bond').apply(lambda x: list(x.s)).reset_index()
stock_list=stock_list.rename(columns={0:'s'})
A1=pd.merge(A1,stock_list,on='bond',how='left')
A1['in_out']=False
for j in range(0, len(A1)):
    for i in range(0, len(A1.s[j])):
        A1['in_out'] = A1.in_out | (
            A1.investor_bond.isin(A1.s[j][i]) & A1.investor_stock.isin(A1.s[j][i]))
    print(j)
The loop is running forever due to the data size, and I am seeking a faster way.
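One way to avoid the nested loops is to precompute each investor's set of holdings once, then test each pair with a set intersection instead of scanning all rows per pair. A minimal sketch against the sample data above (the variable names are mine, and reading "holdings" as bonds-where-investor_bond plus stocks-where-investor_stock is an assumption inferred from the example):

import pandas as pd

# Sample data from the question.
A1 = pd.DataFrame({
    'bond':           [1] * 12 + [2] * 6,
    'stock':          [2] * 6 + [3] * 6 + [4] * 3 + [5] * 3,
    'investor_bond':  list('AAABBBAAABBBCCCBBB'),
    'investor_stock': list('BEFBEFAEGAEGFACEBH'),
})

# Each investor's holdings: the bonds where they appear as investor_bond
# and the stocks where they appear as investor_stock (tagged so that
# bond 2 and stock 2 stay distinct).
bonds = A1.groupby('investor_bond')['bond'].agg(set)
stocks = A1.groupby('investor_stock')['stock'].agg(set)
investors = set(A1.investor_bond) | set(A1.investor_stock)
holdings = {inv: {('bond', b) for b in bonds.get(inv, set())}
                 | {('stock', s) for s in stocks.get(inv, set())}
            for inv in investors}

# Keep a pair only if the two investors share no holding.
keep = [not (holdings[ib] & holdings[st])
        for ib, st in zip(A1.investor_bond, A1.investor_stock)]
print(A1[keep])  # only the (B, G) row at index 11 survives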

How to remove duplicate rows, keeping only the row where two columns match?

For my graduation project, I would like to remove duplicate rows and keep only the row where columns b and c are equal for each value in column a. I have tried a lot of things (groupby, merge combinations, and duplicates), but nothing has worked so far. Can you please help me? Many thanks!
input:
a b c
0 1 A B
1 1 A A
2 1 A C
3 2 B A
4 2 B B
result:
a b c
1 1 A A
4 2 B B
I believe you need:
print (df)
a b c
0 1 A B
1 1 A A
2 1 A C
3 2 B A
4 2 B B
5 3 C C
6 4 C NaN
7 4 C E
7 5 NaN E
Replace NaNs by forward and back filling:
df1 = df[['b','c']].bfill(axis=1).ffill(axis=1)
print (df1)
b c
0 A B
1 A A
2 A C
3 B A
4 B B
5 C C
6 C C
7 C E
7 E E
Check the condition in df1; because the index is the same, it is possible to filter df with it:
df = df[df1['b'] == df1['c']]
print (df)
a b c
1 1 A A
4 2 B B
5 3 C C
6 4 C NaN
7 5 NaN E
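The bfill/ffill step is what makes the NaN rows work: NaN never compares equal to anything, itself included, so without the fill the raw comparison would drop rows 6 and 7. A quick illustration of just that corner case:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [4, 5], 'b': ['C', np.nan], 'c': [np.nan, 'E']})
print(df[df['b'] == df['c']])  # empty: NaN != NaN, so both rows fail the raw test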

Extracting Sequential Pattern

Can anyone help me write a script to extract sequential lines?
I was able to find and get a script working to create all the permutations of the given inputs, but that's not what I need.
awk 'function perm(p, s,   i) {
         for (i = 1; i <= n; i++)
             if (p == 1)
                 printf "%s%s\n", s, A[i]
             else
                 perm(p - 1, s A[i] ", ")
     }
     { A[++n] = $1 }
     END { perm(n) }' infile
Unfortunately, I don't understand the script well enough to make a modification (not for lack of trying).
I need to extract 2 to 5 sequential lines/word patterns.
An illustration of what I need is as follows:
Eg.
inputfile.txt:
A
B
C
D
E
F
G
outputfile.txt:
A B
B C
C D
D E
E F
F G
A B C
B C D
C D E
D E F
E F G
A B C D
B C D E
C D E F
D E F G
A B C D E
B C D E F
C D E F G
Here's a Python answer.
General algorithm:
Load all letters into a list
For n = 2..5, where n is the size of the "window", slide that window over the list and print those n characters.
Python is nice for this because of list slicing.
with open('input.txt') as f_in, open('output.txt', 'w') as f_out:
    chars = f_in.read().splitlines()
    for n in range(2, 6):
        for start_window in range(len(chars) - n + 1):
            f_out.write(' '.join(chars[start_window:start_window + n]))
            f_out.write('\n')
awk to the rescue!
$ awk 'BEGIN{n=1}
FNR==1{n++}
{a[c++]=$0; c=c%n}
FNR>n-1{for(i=c;i<c+n-1;i++) printf "%s ",a[i%n];
print}' file{,,,}
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
This makes multiple scans of the input file (one scan per comma in file{,,,}). seq 9 was used as the input file.
Another in awk:
{ a[NR] = $0 }
END {
    o[0] = ORS
    for (i = 2; i <= 5; i++)
        for (j = 1; j <= length(a); j++) {
            printf "%s%s", a[j], (++k == i ? o[k = 0] : OFS)
            if (!k && j != length(a)) j -= (i - 1)
        }
}

Dedup using HiveQL

I have a Hive table with fields 'a' (int), 'b' (string), 'c' (bigint), 'd' (bigint), and 'e' (string).
I have data like:
a b c d e
---------------
1 a 10 18 i
2 b 11 19 j
3 c 12 20 k
4 d 13 21 l
1 e 14 22 m
4 f 15 23 n
2 g 16 24 o
3 h 17 25 p
The table is sorted on key 'b'.
Now we want output like below:
a b c d e
---------------
1 e 14 22 m
4 f 15 23 n
2 g 16 24 o
3 h 17 25 p
which is deduplicated on key 'a' but keeps the last (latest) 'b'.
Is this possible using a Hive query (HiveQL)?
If column b is unique, try the following HQL:
select *
from (
    select max(b) as max_b
    from table
    group by a
) table1
join table
    on table1.max_b = table.b
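For comparison only (this is pandas, not part of the Hive answer): if you are prototyping the same keep-the-latest-'b' dedup outside Hive, it is a short sketch, assuming the sample data from the question:

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 1, 4, 2, 3],
    'b': list('abcdefgh'),
    'c': list(range(10, 18)),
    'd': list(range(18, 26)),
    'e': list('ijklmnop'),
})

# Keep the row with the greatest 'b' per 'a', mirroring max(b) ... group by a.
latest = df.sort_values('b').drop_duplicates('a', keep='last')
print(latest)  # rows e, f, g, h, matching the expected output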