Merge or join dataframes on numerical condition - pandas

I have 2 dataframes which I'd like to join based on equivalence in one column, and based on a numeric difference in the second column.
For example:
import pandas as pd
d = {'col1': ['A', 'B'], 'col2': [30, 40]}
df = pd.DataFrame(data=d)
d1 = {'col1': ['A', 'B'], 'col2': [35, 400]}
df1 = pd.DataFrame(data=d1)
  col1  col2
0    A    30
1    B    40
  col1  col2
0    A    35
1    B   400
Is there a way to merge on equivalence in col1, and a condition such as "absolute difference in col2 < 10"?
The only solutions I have seen discussed involve a general merge, on col1, and creating a filter based on the difference in col2.

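Since the expected shape of the output wasn't defined in the question, I concatenated both dataframes side by side (axis=1) and then filtered on the given condition. Note that this pairs rows by position, so it assumes both dataframes list col1 in the same order.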
import pandas as pd

d1 = {'col1': ["A", "B"], 'col2': [30, 40]}
df1 = pd.DataFrame(data=d1)
d2 = {'col1': ["A", "B"], 'col2': [35, 400]}
df2 = pd.DataFrame(data=d2)
# Place the frames side by side; ignore_index=True renumbers the columns 0..3.
out = pd.concat([df1, df2], axis=1, ignore_index=True)
out = out.rename(columns={0: "df1_col1", 1: "df1_col2", 2: "df2_col1", 3: "df2_col2"})
# Keep only rows whose col2 values differ by less than 10.
out = out[(out["df1_col2"] - out["df2_col2"]).abs() < 10]
print(out)
  df1_col1  df1_col2 df2_col1  df2_col2
0        A        30        A        35
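The concatenation above only works when both frames have matching rows in the same positions. As a minimal sketch of the merge-then-filter approach mentioned in the question, the following joins on col1 and then keeps rows whose col2 values are close. The suffixes `_left` and `_right` and the variable names `merged` and `close` are illustrative choices, not taken from the original posts.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B"], "col2": [30, 40]})
df1 = pd.DataFrame({"col1": ["A", "B"], "col2": [35, 400]})

# Join on equality in col1; the overlapping col2 columns get suffixes.
merged = df.merge(df1, on="col1", suffixes=("_left", "_right"))

# Keep only pairs whose col2 values differ by less than 10.
close = merged[(merged["col2_left"] - merged["col2_right"]).abs() < 10]
print(close)
```

Because the join is on col1 rather than on row position, this also handles frames whose rows are in different orders or that contain repeated col1 values, at the cost of building the full joined table before filtering.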


How to compute cosine similarity between two texts in presto?

Hello everyone: I want to use COSINE_SIMILARITY in Presto SQL to compute the similarity between two texts. Unfortunately, COSINE_SIMILARITY does not take texts as inputs; it takes maps instead. I am not sure how to convert the texts into those maps in Presto. Suppose we have a table like this:
id | text1 | text2
1 | a b b | b c
Then we can compute the cosine similarity as:
COSINE_SIMILARITY(
MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 0]),
MAP(ARRAY['a', 'b', 'c'], ARRAY[0, 1, 1])
)
That is, the two texts combined contain three words: 'a', 'b', and 'c'. text1 has 1 'a', 2 'b's, and 0 'c's, which gives the first MAP; similarly, text2 has 0 'a's, 1 'b', and 1 'c', which gives the second MAP.
The final table should look like this:
id | text1 | text2 | all_unique_words | map1 | map2 | similarity
1 | a b b | b c | [a b c] | [1, 2, 0] | [0, 1, 1] | 0.63
How can we convert two texts into two such maps in Presto? Thanks in advance!
Use split to turn each string into an array, and then, depending on the Presto version, use either the unnest + histogram trick or array_frequency:
-- sample data
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))
-- query
select id, COSINE_SIMILARITY(histogram(t1), histogram(t2))
from dataset,
unnest (split(text1, ' '), split(text2, ' ')) as t(t1, t2)
group by id;
Output:
id | _col1
1 | 0.6324555320336759
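To sanity-check the value returned by the query, here is a minimal Python sketch of the same calculation: count the words in each text, then take the dot product of the counts divided by the product of the vector norms. The function name `cosine_similarity` and its signature are illustrative, and the sketch assumes both texts are non-empty and whitespace-separated.

```python
import math
from collections import Counter


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine similarity of two whitespace-separated texts, based on word counts."""
    counts1 = Counter(text1.split())
    counts2 = Counter(text2.split())
    # A Counter returns 0 for a missing word, so words absent from one text
    # contribute nothing to the dot product.
    dot = sum(counts1[word] * counts2[word] for word in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    return dot / (norm1 * norm2)


similarity = cosine_similarity("a b b", "b c")
print(round(similarity, 4))
```

For the sample row this is 2 / (sqrt(5) * sqrt(2)), which agrees with the query result above to the displayed precision.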

How to get proportion of two pandas df

I have the following problem. I have two datasets:
import pandas as pd
TableA = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                       'views': [10, 10, 20, 25, 25]})
TableB = pd.DataFrame({'c': ['A', 'A', 'B', 'B']})
I would like to know what percentage of the views in TableA belong to categories that are present in TableB. In this case the result is 30/55, because A and B are present in TableB (views 10 + 20) and the views per category sum to 55 (10 + 20 + 25).
Is there an elegant way to do this in pandas? I don't want to drop duplicates in both tables and then use some kind of anti-join.
You can use drop_duplicates on TableA only:
s = TableA.drop_duplicates('c')
s.loc[s.c.isin(TableB.c),'views'].sum()/s.views.sum()
Out[51]: 0.5454545454545454
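If an explicit drop_duplicates call is unwanted, one possible variant is to collapse TableA with groupby instead. This is only a sketch; it assumes, as in the sample data, that views is constant within each category, and the names `per_category` and `share` are illustrative.

```python
import pandas as pd

TableA = pd.DataFrame({"c": ["A", "A", "B", "C", "C"],
                       "views": [10, 10, 20, 25, 25]})
TableB = pd.DataFrame({"c": ["A", "A", "B", "B"]})

# One views value per category (assumes views is constant within a category).
per_category = TableA.groupby("c")["views"].first()

# Share of views that belong to categories also occurring in TableB.
in_b = per_category.index.isin(TableB["c"])
share = per_category[in_b].sum() / per_category.sum()
print(share)
```

TableB needs no deduplication in either version, because isin only tests membership.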

How to get all the elements in an array but not in another array in HIVE?

For example, I have two columns of arrays now:
id col1 col2
A [1, 3] [1, 2, 3]
B [2] [1, 2, 3]
what I want is all the elements in col2 but not in col1:
id output
A [2]
B [1, 3]
How can I achieve this?
Explode the col2 array, use array_contains to check whether each element is in the col1 array, and then collect the elements that are not in col1 back into an array (collect_set skips the NULLs):
select t.id,
collect_set(case when array_contains(t.col1, e.elem) then NULL else e.elem end) as result
from my_table t
lateral view explode(t.col2) e as elem
group by t.id
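The logic of the query can be sketched outside Hive as well. The Python below is only an illustration of the explode / filter / collect steps on the sample rows; the helper name `array_except` and the dictionary `rows` are hypothetical and not part of the original answer.

```python
# Sample rows from the question: id -> (col1, col2)
rows = {"A": ([1, 3], [1, 2, 3]), "B": ([2], [1, 2, 3])}


def array_except(col1, col2):
    """Elements of col2 that are not in col1, deduplicated, in first-seen order."""
    seen = set(col1)
    result = []
    for elem in col2:          # plays the role of explode(col2)
        if elem not in seen:   # plays the role of NOT array_contains(col1, elem)
            seen.add(elem)     # avoid duplicates, as collect_set would
            result.append(elem)
    return result


output = {row_id: array_except(c1, c2) for row_id, (c1, c2) in rows.items()}
print(output)  # {'A': [2], 'B': [1, 3]}
```

Unlike this sketch, collect_set in Hive does not guarantee any particular element order in the result.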

how to groupby statement in pandas other than crosstab

Let's say there is a dataframe:
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'], 'col2': ['x', 'x', 'y', 'y', 'y']})
I want to show this as a table whose index is a, with column x equal to 2 and column y equal to 1, as in the first attached picture (https://imgur.com/MiWmdIz).
I used crosstab, but I get two separate rows, as in the second attached picture (https://imgur.com/WiJWT15).
You can use:
df = df.set_index('col1')
Here is the result:
col2
col1
a x
a x
a y
b y
b y
Same as your first attached picture.
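Note that set_index only relabels the rows and does not count anything. If the goal is a table of counts per (col1, col2) pair built without crosstab (an assumption, since the linked pictures are not reproduced here), a minimal sketch with groupby could look like this; the variable name `counts` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "a", "b", "b"],
                   "col2": ["x", "x", "y", "y", "y"]})

# Count rows per (col1, col2) pair, then pivot col2 values into columns.
# fill_value=0 fills in combinations that never occur (here b/x).
counts = df.groupby(["col1", "col2"]).size().unstack(fill_value=0)
print(counts)
```

To show only the row for a, as described in the question, select it with `counts.loc[["a"]]`.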

SQL - How to Match Values?

I found this online for practice, but don't know where to begin.
Given a table with three columns (id, category, value), where each id has up to three categories (price, size, color).
Now, how can I find the pairs of ids for which two or more of these values match?
For example:
ID1 (price 10, size M, color Red),
ID2 (price 10, Size L, Color Red),
ID3 (price 15, size L, color Red)
Then the output should be two rows:
ID1 ID2
and ID2 ID3
Using the data frame DF shown reproducibly in the Note at the end:
library(sqldf)
sqldf("select a.ID first, b.ID second
from DF a
join DF b on (a.price = b.price) +
(a.size = b.size) +
(a.color = b.color) > 1 and
a.ID < b.ID")
giving:
first second
1 1 2
2 2 3
Note
DF <- data.frame(ID = 1:3,
price = c(10, 10, 15),
size = c("M", "L", "L"),
color = "Red",
stringsAsFactors = FALSE)
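The join condition above adds up three boolean comparisons and keeps pairs where at least two are true. The same idea can be sketched in Python for the sample data; the dictionary `items` and the list `pairs` are illustrative names, and the data is assumed to be in the same wide layout as DF.

```python
from itertools import combinations

# Same sample data as DF, keyed by ID.
items = {
    1: {"price": 10, "size": "M", "color": "Red"},
    2: {"price": 10, "size": "L", "color": "Red"},
    3: {"price": 15, "size": "L", "color": "Red"},
}

# combinations() yields each unordered pair once, like the a.ID < b.ID condition.
pairs = [
    (a, b)
    for a, b in combinations(sorted(items), 2)
    # Each comparison is True/False; summing counts the matching attributes.
    if sum(items[a][k] == items[b][k] for k in ("price", "size", "color")) > 1
]
print(pairs)  # [(1, 2), (2, 3)]
```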