I am attempting to convert a dataframe into a sparse co-occurrence dataframe using SQL. The original dataframe contains a list of IDs and products associated with each ID. The minimum number of products associated with an ID is 2.
To generate the sparse co-occurrence dataframe, I go through the original dataframe and count the occurrences of each unique product pair. In the resulting sparse dataframe, the "row" is a unique product, the "column" is a product that co-occurs with it, and the value is the number of times that product pair occurs together.
Here's the original dataframe:
ID  Product
1   A
1   B
1   C
1   D
2   A
2   D
3   B
3   D
4   D
4   B
4   C
5   B
5   D
6   A
6   B
7   A
7   C
7   D
The expected sparse co-occurrence result:
Row  Column  Value
A    B       2
A    C       2
A    D       3
B    A       2
B    C       2
B    D       4
C    A       2
C    B       2
C    D       3
D    A       3
D    B       4
D    C       3
The possible pairs for ID 1 (from the original dataframe):
Products for ID 1: A, B, C, D
Possible combinations:
A -> B, B -> A
A -> C, C -> A
A -> D, D -> A
B -> C, C -> B
B -> D, D -> B
C -> D, D -> C
So essentially, starting from the original dataframe, we construct a sparse co-occurrence dataframe that captures the product associations within each ID's transaction list. For every unique product pair within a transaction, the cell at that pair's row and column is incremented by 1. For instance, the combination "A -> B" in the transaction list of ID 1 adds 1 to the cell at row "A", column "B", indicating that this product pair appeared once for that ID.
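To make the counting rule concrete, the ordered pairs for a single ID can be generated with itertools.permutations (this is only an illustration of the pair-generation step, not the proposed solution; the variable names are made up):

from itertools import permutations

products_for_id_1 = ['A', 'B', 'C', 'D']
# every ordered pair (row, column) within one ID contributes +1 to that cell
pairs = list(permutations(products_for_id_1, 2))
print(pairs)
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'A'), ..., ('D', 'C')] -> 12 ordered pairs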
I've found a solution using plain Python, but the performance is too slow for the amount of data involved. Can anyone provide guidance on how to accomplish this task using SQL?
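One way to express this in SQL is a self-join of the table with itself on ID (excluding identical products), followed by a GROUP BY on the product pair. A minimal sketch using pandas and an in-memory SQLite database (the table and column names are placeholders, and it assumes each product appears at most once per ID):

import sqlite3
import pandas as pd

# toy version of the original dataframe
df = pd.DataFrame({
    'ID':      [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 7],
    'Product': list('ABCDADBDDBCBDABACD'),
})

conn = sqlite3.connect(':memory:')
df.to_sql('transactions', conn, index=False)

# self-join on ID: each row is paired with every other product of the same ID,
# then the pairs are counted across all IDs
query = '''
SELECT a.Product AS "Row",
       b.Product AS "Column",
       COUNT(*)  AS "Value"
FROM transactions a
JOIN transactions b
  ON a.ID = b.ID
 AND a.Product <> b.Product
GROUP BY a.Product, b.Product
ORDER BY a.Product, b.Product
'''
cooc = pd.read_sql_query(query, conn)
print(cooc)

On the sample data this reproduces the 12-row result shown above (A/B = 2, A/C = 2, A/D = 3, and so on).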
Related
I have a pandas dataframe which contains duplicate values according to two columns (A and B):
A B C
1 2 1
1 2 4
2 7 1
3 4 0
3 4 8
I want to remove the duplicates, collecting the values of column C into a list of at most N values (N = 2 in this example). This would lead to:
A B C
1 2 [1,4]
2 7 1
3 4 [0,8]
I cannot figure out how to do that. Maybe use groupby and drop_duplicates?
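A minimal sketch of the groupby idea (the truncation to N values and the collapse of single-element lists back to scalars are assumptions based on the expected output):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

N = 2  # keep at most N values of C per (A, B) pair

out = (df.groupby(['A', 'B'])['C']
         .apply(lambda s: list(s)[:N])   # collect C values into a (truncated) list
         .reset_index())

# optionally collapse single-element lists back to scalars, as in the expected output
out['C'] = out['C'].apply(lambda lst: lst[0] if len(lst) == 1 else lst)
print(out)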
I have a dataframe of the form :
ID | COL
1 A
1 B
1 C
1 D
2 A
2 C
2 D
3 A
3 B
3 C
I also have a list of lists containing sequences, for example seq = [['A','B','C'], ['A','C','D']].
I am trying to count the number of IDs in the dataframe whose values in COL exactly match an entry in seq. I am currently doing it the following way:
df.groupby('ID')['COL'].apply(lambda x: x.reset_index(drop = True).equals(pd.Series(vs))).reset_index()['COL'].count()
iterating over vs, where vs is a list from seq.
Expected Output :-
ID | is_in_seq
1 0
2 1
3 1
Since the sequence in COL for ID 1 is ABCD, which is not a sequence in seq, the value against it is 0.
Questions:-
1.) Is there a vectorized way of doing this operation? The approach I've outlined above takes a lot of time even for a single entry from seq, seeing that there can be up to 30-40 values in COL per ID, and maintaining the order in COL is critical.
IIUC:
You will only ever produce a zero or a one, because you are checking whether the group as a whole is in seq. Assuming the entries of seq are unique, each group is either in seq or it isn't.
First step is to make seq a set of tuples
seq = set(map(tuple, seq))
Second step is to produce an aggregated pandas object that contains tuples
tups = df.groupby('ID')['COL'].agg(tuple)
tups
ID
1 (A, B, C, D)
2 (A, C, D)
3 (A, B, C)
Name: COL, dtype: object
Step three, we can use isin
tups.isin(seq).astype(int).reset_index(name='is_in_seq')
ID is_in_seq
0 1 0
1 2 1
2 3 1
IIUC, use groupby.sum
to get a string with the complete sequence for each ID. Then use map and ''.join with Series.isin to check for matches:
new_df = (df.groupby('ID')['COL']
            .sum()
            .isin(map(''.join, seq))
            # .isin(list(map(''.join, seq)))  # wrap in list if necessary
            .astype(int)
            .reset_index(name='is_in_seq')
          )
print(new_df)
ID is_in_seq
0 1 0
1 2 1
2 3 1
Detail
df.groupby('ID')['COL'].sum()
ID
1 ABCD
2 ACD
3 ABC
Name: COL, dtype: object
Consider the following.
d=pd.DataFrame([[1,'a'],[1,'b'],[2,'c'],[2,'a'],[3,'c'],[4,'a'],[4,'c']],columns=['A','B'])
I want to find values in column A that correspond to 'a' and 'c' in column B ({2,4}). So I wrote the following query.
d[d.A.isin(set(d[d.B=='c'].A)) & d.B=='a'].A
My logic is that since
set(d[d.B=='c'].A)
returns all values in A that have 'c' associated with them, it should return {2, 3, 4}, and it does. I then take all the rows whose A value is in {2, 3, 4}, and of these choose the ones that have 'a' in B, so that I get all the values in A that have both 'c' and 'a' associated with them. But my query returns an empty set. It should return {2, 4}. Can someone help debug? Thank you.
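A note on the query itself (my own observation, not from the original posts): in Python, & binds more tightly than ==, so the filter is parsed as (d.A.isin(...) & d.B) == 'a' rather than as two separate conditions. Parenthesizing each comparison should give the intended result:

import pandas as pd

d = pd.DataFrame([[1,'a'],[1,'b'],[2,'c'],[2,'a'],[3,'c'],[4,'a'],[4,'c']],
                 columns=['A', 'B'])

# d.B == 'a' needs its own parentheses, because & binds before ==
result = d[d.A.isin(set(d[d.B == 'c'].A)) & (d.B == 'a')].A
print(result.tolist())  # [2, 4]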
We can use filter
d.groupby('A').filter(lambda x : pd.Series(['a','c']).isin(x['B']).all()).A.unique()
Out[213]: array([2, 4], dtype=int64)
Use DataFrame.groupby
to check whether each unique value in A has both the value 'a' and the value 'c' associated with it in B:
new_df=d[d.groupby('A')['B'].transform(lambda x: x.eq('a').any()&x.eq('c').any())]
print(new_df)
A B
2 2 c
3 2 a
5 4 a
6 4 c
unique_values=new_df['A'].unique()
print(unique_values)
#[2 4]
Details:
You want to find which values in A have both 'a' and 'c' associated with them in B. You can use groupby('A') to perform operations on the dataframe based on the unique values in column A:
d.groupby('A')
This lets you operate on groups defined by the values in A:
A B
0 1 a
1 1 b
2 2 c
3 2 a
4 3 c
5 4 a
6 4 c
Now, for each group, we use groupby.transform to check whether both 'a' and 'c' appear in column B:
d.groupby('A')['B'].transform(lambda x: x.eq('a').any()&x.eq('c').any())
0 False
1 False
2 True
3 True
4 False
5 True
6 True
Name: B, dtype: bool
Series.any is used to check, within each group, whether any value of B is 'c' and whether any value of B is 'a'.
Series.eq is equivalent to using '=='.
The resulting Boolean series is then used for Boolean indexing:
new_df=d[d.groupby('A')['B'].transform(lambda x: x.eq('a').any()&x.eq('c').any())]
print(new_df)
A B
2 2 c
3 2 a
5 4 a
6 4 c
Finally, using Series.unique,
we get the unique values of A in the dataframe new_df:
unique_values=new_df['A'].unique()
print(unique_values)
#[2 4]
I have a pandas dataframe and would like to add data columns using one common column as the index. If the new data does not have a value for an index entry, it should enter 0. The new column has a different length. Is there a better way than using a loop? Example below.
main Dataframe:
index_column date value
1 1 A
2 2 B
3 3 C
4 4 D
add new column:
date value
2 G
3 J
Result:
index_column date value new value
1 1 A 0
2 2 B G
3 3 C J
4 4 D 0
Many thanks!
Rolf
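A minimal sketch of a loop-free approach using a left merge on date followed by fillna (column names are taken from the example; 'new value' is the assumed name of the added column):

import pandas as pd

main = pd.DataFrame({'index_column': [1, 2, 3, 4],
                     'date': [1, 2, 3, 4],
                     'value': list('ABCD')})

new = pd.DataFrame({'date': [2, 3],
                    'new value': ['G', 'J']})

# a left merge keeps every row of the main frame; dates missing from the new
# data become NaN, which is then replaced by 0
result = main.merge(new, on='date', how='left')
result['new value'] = result['new value'].fillna(0)
print(result)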
I'm trying to order a table such that every 'page' of N records (i.e. the page size) includes a maximum of X records of each field value.
Example table values:
Value
-----
a
a
c
a
b
b
a
c
b
a
c
c
c
d (and so on..)
(Value can be any text, in random order. For simplicity, I've used single letters.)
If the page size is 5 and the maximum per value is 2, the results can be as listed below (or in a different random order), as long as every 5 consecutive records contain at most 2 records of each value:
Value
-----
a ┐
a |
c |── first page of size 5 with max 2 records of a value
b |
b ┘
a ┐
a |
c |── second page of size 5 with max 2 records of a value
b |
c ┘
a ┐
c |── last page of size 5 (or less) with max 2 records of a value
c |
d ┘
Any help is appreciated.
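The question reads like a SQL/ordering problem, but as a starting point here is a plain-Python sketch of one possible greedy strategy (my own assumption, not from the original post): build each page by repeatedly taking the value with the most remaining occurrences, subject to the per-page limit. A pure SQL formulation that guarantees the constraint is considerably more involved.

from collections import Counter

def paginate(values, page_size=5, max_per_value=2):
    # split values into pages of at most page_size records,
    # with at most max_per_value occurrences of each value per page
    remaining = Counter(values)
    pages = []
    while sum(remaining.values()) > 0:
        page = []
        used = Counter()
        while len(page) < page_size:
            # values still available and still under the per-page limit
            candidates = [v for v, c in remaining.items()
                          if c > 0 and used[v] < max_per_value]
            if not candidates:
                break  # the page stays short if nothing else fits
            v = max(candidates, key=lambda x: remaining[x])  # most copies left first
            page.append(v)
            used[v] += 1
            remaining[v] -= 1
        pages.append(page)
    return pages

data = list('aacabbacbacccd')  # the example values above
for page in paginate(data):
    print(page)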