Select all possible string combinations from within a single field - sql

Searching for this question, I found references to cross join, but that only goes two wide. Suppose I have a single field fruits with the values apple, pear, banana. I could have the following combinations (each a single string with spaces):
apple
apple pear
apple pear banana
apple banana
pear
banana
pear banana
I don't want doubles or triples, e.g. apple apple.
How can I select all potential combinations from field fruits in this way?
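One possible approach, sketched here under assumptions, is a recursive common table expression that extends each combination only with values sorting after the last one added, which rules out repeats and reordered duplicates. The table name fruit_table and the column alias last_fruit are hypothetical, and the query is run through Python's sqlite3 module only so the sketch is self-contained; other databases may need different syntax.

```python
import sqlite3

# In-memory database with a hypothetical table holding the single field "fruits".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit_table (fruits TEXT)")
conn.executemany(
    "INSERT INTO fruit_table VALUES (?)",
    [("apple",), ("pear",), ("banana",)],
)

# Start from every single value, then keep appending values that sort
# after the last one added, so no value repeats within a combination.
query = """
WITH RECURSIVE combos(combo, last_fruit) AS (
    SELECT fruits, fruits FROM fruit_table
    UNION ALL
    SELECT c.combo || ' ' || f.fruits, f.fruits
    FROM combos AS c
    JOIN fruit_table AS f ON f.fruits > c.last_fruit
)
SELECT combo FROM combos
"""
combinations = [row[0] for row in conn.execute(query)]
print(combinations)
```

Note that values inside each combination come out in alphabetical order (apple banana pear rather than apple pear banana), since the ordering is what prevents duplicates.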

Related

Retrieve match from list of strings and add as column in dataframe

I have a dataframe which contains columns text and user.
user   text
Tom    I love bananas
Dick   I love apples
Harry  I love apples and bananas
I want to find the rows whose text contains any fruit from a list. For each matched string, a new row is added, with the match in the new columns fruits and fruits_with_colors. The expected output (showing fruits only) is below:
user   text                       fruits
Tom    I love bananas             bananas
Dick   I love apples              apples
Harry  I love apples and bananas  apples
Harry  I love apples and bananas  bananas
I'm having some trouble thinking about how to do this. I was doing the following using pandas:
fruits = ['apples', 'bananas']
df_with_matches = df[df['text'].str.contains('|'.join(fruits))]
but I get the error: sequence item 0: expected str instance, list found
You can use str.findall to extract fruits into a list and then explode it:
df.assign(fruits = df.text.str.findall('|'.join(fruits))).explode('fruits')
user text fruits
0 Tom I love bananas bananas
1 Dick I love apples apples
2 Harry I love apples and bananas apples
2 Harry I love apples and bananas bananas
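For reference, here is a self-contained sketch of the findall and explode approach, rebuilding the sample dataframe from the question. The use of re.escape is an addition not in the answer; it only matters if a search term contains regex metacharacters.

```python
import re

import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "user": ["Tom", "Dick", "Harry"],
    "text": ["I love bananas", "I love apples", "I love apples and bananas"],
})
fruits = ["apples", "bananas"]

# Build one alternation pattern; escaping guards against special characters.
pattern = "|".join(re.escape(fruit) for fruit in fruits)

# findall returns a list of matches per row; explode gives one row per match.
result = df.assign(fruits=df["text"].str.findall(pattern)).explode("fruits")
print(result)
```

Rows without any match would come through explode with NaN in fruits, so a dropna on that column may be needed on real data.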

How to find the maximum for each type

I have a table like this
fruit   cost
apple   5
apple   6
banana  8
pear    3
banana  9
pear    7
and I want the maximum cost of each type of fruit, like
fruit   cost
apple   6
banana  9
pear    7
GROUP BY is your friend here: it finds the maximum within each group of records.
select fruit, max(cost) as cost
from _table
group by fruit;
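To try the query without a database server, the following sketch loads the sample rows into an in-memory SQLite database through Python's sqlite3 module. SQLite is an assumption here, since the question does not say which database is in use; the table name _table is taken from the answer.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _table (fruit TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO _table VALUES (?, ?)",
    [("apple", 5), ("apple", 6), ("banana", 8),
     ("pear", 3), ("banana", 9), ("pear", 7)],
)

# One output row per fruit, carrying the largest cost seen for that fruit.
rows = conn.execute(
    "SELECT fruit, MAX(cost) AS cost FROM _table GROUP BY fruit"
)
max_cost = dict(rows)
print(max_cost)
```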

Iteration through two Pandas Dataframes + create new column

I am new to using Pandas and I am trying to iterate through two columns from different Dataframes and if both columns have the same word, to append "yes" to another column. If not, append the word "no".
This is what I have:
for row in df1.iterrows():
    for word in df2.iterrows():
        if df1['word1'] == df2['word2']:
            df1.column1.append('Yes') #I just want to have two columns in binary form, if one is yes the other must be no
            df2.column2.append('No')
        else:
            df1.column1.append('No')
            df2.column2.append('Yes')
I have now:
column1 column2 column3
apple None None
orange None None
banana None None
tomato None None
sugar None None
grapes None None
fig None None
I want:
column1 column2 column3
apple Yes No
orange No No
banana No No
tomato No No
sugar No Yes
grapes No Yes
fig No Yes
Sample of words from df1: apple, orange, pear
Sample of words from df2: yellow, orange, green
I get this error:
Can only compare identically-labeled Series objects
Note: df2 contains 2500 words, while df1 contains 500.
Any help is appreciated!
Actually, you want to fill:
df1.column1 with:
Yes - if word1 from this row occurs in df2.word2 (in any row),
No - otherwise,
df2.column2 with:
Yes - if word2 from this row occurs in df1.word1 (in any row),
No - otherwise.
To do it, you can run:
import numpy as np
df1['column1'] = np.where(df1.word1.isin(df2.word2), 'Yes', 'No')
df2['column2'] = np.where(df2.word2.isin(df1.word1), 'Yes', 'No')
To test my code I used the following DataFrames:
df1: df2:
word1 word2
0 apple 0 yellow
1 orange 1 orange
2 pear 2 green
3 strawberry 3 strawberry
4 plum
The result of my code is:
df1: df2:
word1 column1 word2 column2
0 apple No 0 yellow No
1 orange Yes 1 orange Yes
2 pear No 2 green No
3 strawberry Yes 3 strawberry Yes
4 plum No
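The test above can be reproduced end to end with the sketch below, which rebuilds both test DataFrames and applies the two np.where lines as given.

```python
import numpy as np
import pandas as pd

# Test DataFrames as listed in the answer.
df1 = pd.DataFrame({"word1": ["apple", "orange", "pear", "strawberry", "plum"]})
df2 = pd.DataFrame({"word2": ["yellow", "orange", "green", "strawberry"]})

# isin checks membership against every row of the other column,
# so no row-by-row loop (and no index alignment) is involved.
df1["column1"] = np.where(df1["word1"].isin(df2["word2"]), "Yes", "No")
df2["column2"] = np.where(df2["word2"].isin(df1["word1"]), "Yes", "No")

print(df1)
print(df2)
```

Because isin ignores the index, the two frames can have different lengths, which is what caused the "identically-labeled Series" error in the element-wise comparison.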
I think it might be a better idea to build a set of the words from each column and then do the lookup against those sets. It should be faster as well. Something like this:
words_df1 = set(df1['word1'].tolist())
words_df2 = set(df2['word2'].tolist())
Then do
df1['has_word2'] = df1['word1'].isin(words_df2)
df2['has_word1'] = df2['word2'].isin(words_df1)
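Note that isin returns booleans rather than the Yes/No strings the question asks for. A minimal sketch of the set-based lookup with that conversion added, using the sample words from the question, might look like this; the map step is an addition, not part of the answer.

```python
import pandas as pd

# Sample words taken from the question.
df1 = pd.DataFrame({"word1": ["apple", "orange", "pear"]})
df2 = pd.DataFrame({"word2": ["yellow", "orange", "green"]})

# Collect the distinct words of each column once.
words_df1 = set(df1["word1"].tolist())
words_df2 = set(df2["word2"].tolist())

# isin yields True/False; map turns that into the requested labels.
labels = {True: "Yes", False: "No"}
df1["has_word2"] = df1["word1"].isin(words_df2).map(labels)
df2["has_word1"] = df2["word2"].isin(words_df1).map(labels)

print(df1)
print(df2)
```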

Convert multiple rows into a single comma separated string without XML

I want to combine data from multiple rows into a single comma separated string without using the XML function as it is not supported.
Here is an example:
ID Food
1 Apple
1 orange
2 wine
2 whiskey
2 beer
3 rice
3 wheat
3 maize
3 quinoa
Note that the number of rows per ID is not fixed.
Output:
ID Foods
1 Apple, orange
2 wine, whiskey, beer
3 rice, wheat, maize, quinoa
#standardSQL
select id, string_agg(food, ', ') as foods
from `project.dataset.table`
group by id
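The same grouping idea can be tried locally with SQLite, whose group_concat aggregate plays a role similar to string_agg. This is a sketch in a different SQL dialect than the #standardSQL answer, and the table name food_table is hypothetical.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE food_table (id INTEGER, food TEXT)")
conn.executemany(
    "INSERT INTO food_table VALUES (?, ?)",
    [(1, "Apple"), (1, "orange"),
     (2, "wine"), (2, "whiskey"), (2, "beer"),
     (3, "rice"), (3, "wheat"), (3, "maize"), (3, "quinoa")],
)

# group_concat joins every food in a group using the given separator.
rows = conn.execute(
    "SELECT id, group_concat(food, ', ') AS foods FROM food_table GROUP BY id"
)
foods_by_id = dict(rows)
print(foods_by_id)
```

SQLite does not guarantee the order of items inside each concatenated string, so code that depends on a specific order should not rely on this query as written.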

pandas search a value in a dataframe column

I have the following dataframe, and I want to search for apple in the column fruits and display all the rows of each group in which apple is found.
Before :
number fruits purchase
0 apple yes
mango
banana
1 apple no
cheery
2 mango yes
banana
3 apple yes
orange
4 grapes no
pear
After:
number fruits purchase
0 apple yes
mango
banana
1 apple no
cheery
3 apple yes
orange
Use groupby and filter to filter groups that contain 'apple':
import numpy as np
df['number'] = df['number'].ffill()
df_out = df.groupby('number').filter(lambda x: (x['fruits'] == 'apple').any())
df_out.assign(number = df_out['number'].mask(df_out['number'].duplicated()))\
.replace(np.nan,'')
Output:
number fruits purchase
0 0 apple yes
1 mango
2 banana
3 1 apple no
4 cheery
7 3 apple yes
8 orange
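A runnable sketch of the ffill and groupby/filter steps follows. It assumes the blank cells in number are NaN (ffill would not fill empty strings) and that blank purchase cells are empty strings; both are guesses about how the original dataframe was loaded.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN marks the blank "number" cells (assumption).
df = pd.DataFrame({
    "number": [0, np.nan, np.nan, 1, np.nan, 2, np.nan, 3, np.nan, 4, np.nan],
    "fruits": ["apple", "mango", "banana", "apple", "cheery", "mango",
               "banana", "apple", "orange", "grapes", "pear"],
    "purchase": ["yes", "", "", "no", "", "yes", "", "yes", "", "no", ""],
})

# Propagate each group number down to its continuation rows.
df["number"] = df["number"].ffill()

# Keep only the groups in which at least one row is 'apple'.
df_out = df.groupby("number").filter(lambda g: (g["fruits"] == "apple").any())
print(df_out)
```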
It looks like you're using 'number' as the index, so I'm going to assume that.
Get the numbers where 'apple' is present, and slice into those:
idx = df.index[df.fruits == 'apple']
df.loc[idx]
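Here is a small sketch of that index-based variant, assuming number is a non-unique index that repeats the group label on every row of the group. That layout is an assumption; the question does not show how the frame was built.

```python
import pandas as pd

# 'number' as a repeated index label, one label per group (assumption).
df = pd.DataFrame(
    {
        "fruits": ["apple", "mango", "banana", "apple", "cheery", "mango",
                   "banana", "apple", "orange", "grapes", "pear"],
        "purchase": ["yes", "", "", "no", "", "yes", "", "yes", "", "no", ""],
    },
    index=pd.Index([0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4], name="number"),
)

# Index labels of the rows where the fruit is 'apple'.
idx = df.index[df["fruits"] == "apple"]

# Selecting by label on a non-unique index returns every row with that label.
result = df.loc[idx]
print(result)
```

If a group contained apple more than once, idx would hold that label twice and the group's rows would be returned twice, so calling idx.unique() first may be safer.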