How to group by in pandas other than crosstab - pandas

Let's say there is a dataframe:
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'], 'col2': ['x', 'x', 'y', 'y', 'y']})
I want to show this in a table whose index is a and where column x is 2 and y is 1, just like in the first attached picture (https://imgur.com/MiWmdIz).
I used crosstab, but I am getting two separate rows, like in the second attached picture (https://imgur.com/WiJWT15).

You can use:
df = df.set_index('col1')
Here is the result:
     col2
col1
a       x
a       x
a       y
b       y
b       y
Same as your first attached picture.
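If what you actually want is the count table (x is 2 and y is 1 for index a), a groupby-based alternative to crosstab is a minimal sketch like this, reusing the df from the question:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'],
                   'col2': ['x', 'x', 'y', 'y', 'y']})

# Count each (col1, col2) pair, then pivot the col2 values into columns.
counts = df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
print(counts)
# col2  x  y
# col1
# a     2  1
# b     0  2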

Related

How to get all overlapping (ordered) 3-tuples from an array in BigQuery

Given a table like the following:
elems
['a', 'b', 'c', 'd', 'e']
['v', 'w', 'x', 'y']
I'd like to transform it into something like this:
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['v', 'w', 'x']
['w', 'x', 'y']
I.e., I'd like to get all overlapping 3-tuples.
My current attempt looks as follows:
WITH foo AS (
  SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
  SELECT ['v', 'w', 'x', 'y']),
single AS (
  SELECT * FROM
    foo,
    UNNEST(elems) elem
),
tuples AS (
  SELECT ARRAY_AGG(elem) OVER (ROWS BETWEEN 2 PRECEDING AND 0 FOLLOWING) AS tuple
  FROM single
)
SELECT * FROM tuples
WHERE ARRAY_LENGTH(tuple) >= 3
But the problem is that it also returns some unwanted rows, i.e., the ones that are "between" the original rows from the foo table.
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['d', 'e', 'v'] <--- unwanted
['e', 'v', 'w'] <--- unwanted
['v', 'w', 'x']
['w', 'x', 'y']
Also, is it guaranteed that the order of rows in single is correct, or does it only work in my minimal example by chance because of the low cardinality? (I guess there may be a simple solution without this step in between.)
Consider the approach below:
select [elems[offset(index - 1)], elems[offset(index)], elems[offset(index + 1)]] as tuple
from your_table, unnest([array_length(elems)]) len,
unnest(generate_array(1, len - 2)) index
If applied to the sample data in your question, the output is the five expected tuples.
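For intuition, the same index arithmetic in Python (a minimal sketch, not part of the original answer):
def overlapping_3_tuples(elems):
    # For each middle index i, take the elements at i-1, i, and i+1,
    # mirroring elems[offset(index - 1)] .. elems[offset(index + 1)].
    return [[elems[i - 1], elems[i], elems[i + 1]]
            for i in range(1, len(elems) - 1)]

print(overlapping_3_tuples(['a', 'b', 'c', 'd', 'e']))
# [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e']]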
You might consider the query below.
Also, is it guaranteed that the order of rows in single is correct, or does it only work in my minimal example by chance because of the low cardinality?
AFAIK, it's not guaranteed without explicitly using WITH OFFSET in the query.
WITH foo AS (
  SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
  SELECT ['v', 'w', 'x', 'y']),
single AS (
  SELECT * FROM
    foo,
    UNNEST(elems) elem WITH OFFSET
),
tuples AS (
  SELECT ARRAY_AGG(elem) OVER (PARTITION BY FORMAT('%t', elems) ORDER BY offset ROWS BETWEEN 2 PRECEDING AND 0 FOLLOWING) AS tuple
  FROM single
)
SELECT * FROM tuples
WHERE ARRAY_LENGTH(tuple) >= 3;
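The PARTITION BY FORMAT('%t', elems) clause is what removes the unwanted tuples: it keeps each window inside one source row instead of running over a flattened stream of all elements. A rough Python analogy of the difference, using the two sample arrays:
rows = [['a', 'b', 'c', 'd', 'e'], ['v', 'w', 'x', 'y']]

# One window over the flattened stream: produces tuples that straddle
# row boundaries, such as ['d', 'e', 'v'] (the unwanted rows).
flat = [e for row in rows for e in row]
unpartitioned = [flat[i:i + 3] for i in range(len(flat) - 2)]

# Windows computed per source row, like the PARTITION BY version.
partitioned = [row[i:i + 3] for row in rows for i in range(len(row) - 2)]
print(partitioned)
# [['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['v', 'w', 'x'], ['w', 'x', 'y']]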
Just to give you another idea
create temp function slice(arr ARRAY<string>, pos float64, len float64)
returns array<string> language js as
r"return arr.slice(pos, pos + len);";
select slice(elems, index, 3) as tuple
from foo, unnest([array_length(elems)]) len,
unnest(generate_array(0, len - 3)) index
Leaving it up to you to refactor the above query to the point where it looks something like:
select tuple
from foo, unnest(slices(elems, 3)) as tuple
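A slices helper of that shape would just enumerate every overlapping window of length n. A minimal Python sketch of the intended behavior (the SQL version is left as the refactoring exercise above):
def slices(arr, n):
    # Every overlapping window of length n, in original order;
    # returns an empty list if the array has fewer than n elements.
    return [arr[i:i + n] for i in range(len(arr) - n + 1)]

print(slices(['v', 'w', 'x', 'y'], 3))
# [['v', 'w', 'x'], ['w', 'x', 'y']]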

How to compute cosine similarity between two texts in presto?

Hello everyone: I wanted to use COSINE_SIMILARITY in Presto SQL to compute the similarity between two texts. Unfortunately, COSINE_SIMILARITY does not take the texts as inputs; it takes maps instead. I am not sure how to convert the texts into those maps in Presto. I want the following: if we have a table like this:
id  text1  text2
1   a b b  b c
Then we can compute the cosine similarity as:
COSINE_SIMILARITY(
MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 0]),
MAP(ARRAY['a', 'b', 'c'], ARRAY[0, 1, 1])
)
i.e., the two texts combined have three words: 'a', 'b', and 'c'; text1 has 1 count of 'a', 2 counts of 'b', and 0 counts of 'c', which goes in as the first MAP; similarly, text2 has 0 counts of 'a', 1 count of 'b', and 1 count of 'c', which goes in as the second MAP.
The final table should look like this:
id  text1  text2  all_unique_words  map1       map2       similarity
1   a b b  b c    [a b c]           [1, 2, 0]  [0, 1, 1]  0.63
How can we convert two texts into two such maps in presto? Thanks in advance!
Use split to transform the string into an array, and then, depending on the Presto version, use either the unnest + histogram trick or array_frequency:
-- sample data
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))
-- query
select id, COSINE_SIMILARITY(histogram(t1), histogram(t2))
from dataset,
unnest (split(text1, ' '), split(text2, ' ')) as t(t1, t2)
group by id;
Output:
id  _col1
1   0.6324555320336759
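For reference, the same computation outside of Presto, as a minimal Python sketch of the count-map construction the question describes:
from collections import Counter
from math import sqrt

def cosine_similarity(text1, text2):
    # Word-count maps, analogous to histogram() over the split words.
    c1, c2 = Counter(text1.split()), Counter(text2.split())
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

print(cosine_similarity('a b b', 'b c'))  # 0.6324555320336759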

How to get a proportion from two pandas DataFrames

I have the following problem. I have two datasets:
TableA = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                       'views': [10, 10, 20, 25, 25]})
TableB = pd.DataFrame({'c': ['A', 'A', 'B', 'B']})
I would like to know what percentage of the views from TableA is present in TableB. In this case the result will be 30/55, because A and B are present in TableB (views 10 + 20) and the total sum of views per category is 55 (10 + 20 + 25).
Is there any elegant way to do this in pandas? I don't want to "drop duplicates" in both tables and then use some "antijoin".
You can do drop_duplicates:
s = TableA.drop_duplicates('c')
s.loc[s.c.isin(TableB.c),'views'].sum()/s.views.sum()
Out[51]: 0.5454545454545454
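If you'd rather avoid drop_duplicates entirely, a groupby-based sketch of the same computation (assuming views is constant within each category, as in the sample data):
import pandas as pd

TableA = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                       'views': [10, 10, 20, 25, 25]})
TableB = pd.DataFrame({'c': ['A', 'A', 'B', 'B']})

per_cat = TableA.groupby('c')['views'].first()  # one views value per category
ratio = per_cat[per_cat.index.isin(TableB['c'])].sum() / per_cat.sum()
print(ratio)  # 0.5454545454545454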

Merge or join dataframes on numerical condition

I have 2 dataframes which I'd like to join based on equivalence in one column, and based on a numeric difference in the second column.
For example:
d = {'col1': ['A', 'B'], 'col2': [30, 40]}
df = pd.DataFrame(data=d)
d1 = {'col1': ['A', 'B'], 'col2': [35, 400]}
df1 = pd.DataFrame(data=d1)
  col1  col2
0    A    30
1    B    40

  col1  col2
0    A    35
1    B   400
Is there a way to merge on equivalence in col1, and a condition such as "absolute difference in col2 < 10"?
The only solutions I have seen discussed involve a general merge, on col1, and creating a filter based on the difference in col2.
Since the expected shape of the output wasn't defined in the question, I concatenated both dataframes side by side and filtered on the given condition.
d1 = {'col1': ["A", "B"], 'col2': [30, 40]}
df1 = pd.DataFrame(data=d1)
d2 = {'col1': ["A", "B"], 'col2': [35, 400]}
df2 = pd.DataFrame(data=d2)
out = pd.concat([df1, df2], axis=1, ignore_index=True)
out = out.rename(columns={0: "df1_col1", 1: "df1_col2", 2: "df2_col1", 3: "df2_col2"})
out = out[abs(out["df1_col2"] - out["df2_col2"]) < 10]
print(out)
  df1_col1  df1_col2  df2_col1  df2_col2
0        A        30         A        35
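The merge-then-filter approach mentioned in the question would look something like this sketch: merge on col1, then keep only the pairs whose col2 values differ by less than 10:
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'B'], 'col2': [30, 40]})
df2 = pd.DataFrame({'col1': ['A', 'B'], 'col2': [35, 400]})

# Merge on col1, then filter on the absolute col2 difference.
merged = df1.merge(df2, on='col1', suffixes=('_left', '_right'))
out = merged[(merged['col2_left'] - merged['col2_right']).abs() < 10]
print(out)
#   col1  col2_left  col2_right
# 0    A         30          35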

Extract last N elements of an array in SQL (hive)

I have a column with arrays and I want to extract the last X elements of each array.
An example, trying to extract the last two elements:
Column A
['a', 'b', 'c']
['d', 'e']
['f', 'g', 'h', 'i']
Expected output:
Column A
['b', 'c']
['d', 'e']
['h', 'i']
The best-case scenario would be to do it without using a UDF.
One method, using reverse, explode, filtering, and re-assembling the array:
with your_table as (
  select stack(4,
    0, array(), -- empty array, to check it works with no elements or fewer than n
    1, array('a', 'b', 'c'),
    2, array('d', 'e'),
    3, array('f', 'g', 'h', 'i')
  ) as (id, col_A)
)
select s.id, collect_list(s.value) as col_A
from
  (select s.id, a.value, a.pos
   from your_table s
   -- note: reverse() on the joined string also reverses the characters inside
   -- each element, so this works as-is only for single-character elements
   lateral view outer posexplode(split(reverse(concat_ws(',', s.col_A)), ',')) a as pos, value
   where a.pos between 0 and 1 -- last two (use n-1 instead of 1 if you want the last n)
   distribute by s.id sort by a.pos desc -- restore original order
  ) s
group by s.id
Result:
s.id col_a
0 []
1 ["b","c"]
2 ["d","e"]
3 ["h","i"]
A more elegant way uses the Brickhouse numeric_range UDF, as in this answer.
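For reference, the intended behavior is plain last-N slicing; a minimal Python sketch:
def last_n(arr, n=2):
    # A negative slice index keeps the last n elements (or fewer, if the
    # array is shorter); guard n = 0 because arr[-0:] would return everything.
    return arr[-n:] if n > 0 else []

for arr in [[], ['a', 'b', 'c'], ['d', 'e'], ['f', 'g', 'h', 'i']]:
    print(last_n(arr))
# []
# ['b', 'c']
# ['d', 'e']
# ['h', 'i']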