Pandas DataFrame: rename duplicate values in a row

I have a dataframe
COL1 COL2 COL3
Red Blue Green
Red Yellow Blue
Blue Red Blue
I want to rename a value in the dataframe if it appears twice (or more) in the same row.
So the expected output is
COL1 COL2 COL3
Red Blue Green
Red Yellow Blue
Blue Red 2Blue

We can use a custom function that checks whether values are duplicated within a row and, using Series.mask, prefixes each repeat with an incremental counter:
def myf(x):
    # 1-based occurrence number of each value within the row, as strings
    counter = x.groupby(x).cumcount().add(1).astype(str)
    # keep first occurrences; replace repeats with counter + value
    return x.mask(x.duplicated(), x.radd(counter))
print(df.apply(myf, axis=1))
# or df.T.apply(myf).T
COL1 COL2 COL3
0 Red Blue Green
1 Red Yellow Blue
2 Blue Red 2Blue
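For reference, here is a self-contained sketch of the same idea that builds the sample frame first. The function name `number_repeats` is made up for illustration, and the fourth all-Blue row is an added assumption (not in the question) to show what happens when a value appears three times.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "COL1": ["Red", "Red", "Blue", "Blue"],
        "COL2": ["Blue", "Yellow", "Red", "Blue"],
        "COL3": ["Green", "Blue", "Blue", "Blue"],  # last row is an added test case
    }
)

def number_repeats(row):
    # 1-based occurrence number of each value within this row, as strings
    counter = row.groupby(row).cumcount().add(1).astype(str)
    # first occurrences stay as they are; repeats become "<n><value>"
    return row.mask(row.duplicated(), counter + row)

result = df.apply(number_repeats, axis=1)
print(result)
```

The first occurrence in a row is never renamed, so only the second and later copies get a prefix.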

Related

How to count the occurrences of a value

How to count the number of occurrences for a histogram using dataframes
d = {'color': ["blue", "green", "yellow", "red, blue", "green, yellow", "yellow, red, blue"],}
df = pd.DataFrame(data=d)
How do you go from
color
blue
green
yellow
red, blue
green, yellow
yellow, red, blue
to
color occurance
blue 3
green 2
yellow 3
Let's try splitting on the regex ,\s* (a comma followed by zero or more whitespace characters), then explode into rows and use value_counts to count the values:
s = (
    df['color'].str.split(r',\s*')
    .explode()
    .value_counts()
    .rename_axis('color')
    .reset_index(name='occurance')
)
Or split with expand=True and then stack:
s = (
    df['color'].str.split(r',\s*', expand=True)
    .stack()
    .value_counts()
    .rename_axis('color')
    .reset_index(name='occurance')
)
s:
color occurance
0 blue 3
1 yellow 3
2 green 2
3 red 2
Here is another way using .str.get_dummies() (this assumes the separator is always exactly a comma followed by one space):
df['color'].str.get_dummies(sep=', ').sum()
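As a quick check that the approaches agree on the sample data, the sketch below rebuilds the frame from the question and compares the explode-based counts with the get_dummies-based counts. The variable names `counts` and `dummy_counts` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"color": ["blue", "green", "yellow", "red, blue", "green, yellow", "yellow, red, blue"]}
)

# Split on comma plus optional whitespace, one value per row, then count.
counts = df["color"].str.split(r",\s*").explode().value_counts()

# One indicator column per distinct value; column sums are the counts.
dummy_counts = df["color"].str.get_dummies(sep=", ").sum()

print(counts.sort_index().to_dict())
print(dummy_counts.sort_index().to_dict())
```

Note that value_counts sorts by frequency, while the get_dummies result is ordered by column name, so compare them by label rather than by position.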

Iteration through two Pandas Dataframes + create new column

I am new to Pandas and I am trying to iterate through two columns from different DataFrames. If both columns contain the same word, I want to put "Yes" in another column; if not, "No".
This is what I have:
for row in df1.iterrows():
    for word in df2.iterrows():
        if df1['word1'] == df2['word2']:
            df1.column1.append('Yes') #I just want to have two columns in binary form, if one is yes the other must be no
            df2.column2.append('No')
        else:
            df1.column1.append('No')
            df2.column2.append('Yes')
I have now:
column1 column2 column3
apple None None
orange None None
banana None None
tomato None None
sugar None None
grapes None None
fig None None
I want:
column1 column2 column3
apple Yes No
orange No No
banana No No
tomato No No
sugar No Yes
grapes No Yes
figs No Yes
Sample of words from df1: apple, orange, pear
Sample of words from df2: yellow, orange, green
I get this error:
Can only compare identically-labeled Series objects
Note: df2 contains about 2500 words, while df1 contains about 500.
Any help is appreciated!
Actually, you want to fill:
df1.column1 with:
Yes - if word1 from this row occurs in df2.word2 (in any row),
No - otherwise,
df2.column2 with:
Yes - if word2 from this row occurs in df1.word1 (in any row),
No - otherwise.
To do it, you can run:
import numpy as np
df1['column1'] = np.where(df1.word1.isin(df2.word2), 'Yes', 'No')
df2['column2'] = np.where(df2.word2.isin(df1.word1), 'Yes', 'No')
To test my code I used the following DataFrames:
df1: df2:
word1 word2
0 apple 0 yellow
1 orange 1 orange
2 pear 2 green
3 strawberry 3 strawberry
4 plum
The result of my code is:
df1: df2:
word1 column1 word2 column2
0 apple No 0 yellow No
1 orange Yes 1 orange Yes
2 pear No 2 green No
3 strawberry Yes 3 strawberry Yes
4 plum No
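To see where the error in the question comes from, the sketch below rebuilds the two test frames, reproduces the failing comparison, and then applies the isin-based fill. Comparing two Series with == works element by element and requires matching labels, which two columns of different length do not have. The variable `error_message` is only there for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"word1": ["apple", "orange", "pear", "strawberry", "plum"]})
df2 = pd.DataFrame({"word2": ["yellow", "orange", "green", "strawberry"]})

# Element-wise comparison of Series with different indexes raises ValueError.
try:
    df1["word1"] == df2["word2"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Membership test instead: is each word present anywhere in the other column?
df1["column1"] = np.where(df1["word1"].isin(df2["word2"]), "Yes", "No")
df2["column2"] = np.where(df2["word2"].isin(df1["word1"]), "Yes", "No")
print(df1)
print(df2)
```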
Another option is to build a set of words from each column and then do the lookup against those sets, which may also be faster. Something like this:
words_df1 = set(df1['word1'].tolist())
words_df2 = set(df2['word2'].tolist())
Then do
df1['has_word2'] = df1['word1'].isin(words_df2)
df2['has_word1'] = df2['word2'].isin(words_df1)
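The lookup above produces True/False columns rather than the "Yes"/"No" strings the question asks for. A minimal sketch of the conversion, assuming the same sample words as in the other answer and using Series.map with a two-entry dictionary:

```python
import pandas as pd

df1 = pd.DataFrame({"word1": ["apple", "orange", "pear", "strawberry", "plum"]})
df2 = pd.DataFrame({"word2": ["yellow", "orange", "green", "strawberry"]})

words_df2 = set(df2["word2"])

# Boolean membership flag, then mapped to the labels from the question.
df1["has_word2"] = df1["word1"].isin(words_df2)
df1["column1"] = df1["has_word2"].map({True: "Yes", False: "No"})
print(df1)
```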

How to get the first group under certain level of a groupby of multiple columns?

I am interested in the first group in level 2 and want to get all the rows related to it.
Take a look at the example below:
col1 col2 col3 col4
1 34 green 10
yellow 20
orange 30
89 green 40
yellow 50
orange 60
2 89 green 15
yellow 25
orange 35
90 green 45
yellow 55
orange 65
Please note that the number of rows in each level-2 group is not necessarily 3.
Now I want to get the first col2 group under each col1 value, so the result should be:
col1 col2 col3 col4
1 34 green 10
yellow 20
orange 30
2 89 green 15
yellow 25
orange 35
The example and problem are modified from the question: How to get the first group in a groupby of multiple columns?
I have tried the get_group method, but it does not seem able to address this specific question.
Is there a one-liner that could solve this kind of problem? Thanks!
There's a quick stack/unstack solution:
df.unstack('col3').groupby(level=0).head(1).stack('col3')
Output:
col4
col1 col2 col3
1 34 g 10
o 30
y 20
2 89 g 15
o 35
y 25
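We can also group by levels 0 and 2 and take the first row of each group (this relies on every col3 value appearing in the first col2 group):
df.groupby(level=[0,2]).head(1)
Output: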
col4
col1 col2 col3
1 34 green 10
yellow 20
orange 30
2 89 green 15
yellow 25
orange 35
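Since the question notes that the col2 groups can differ in size, here is a sketch of a variant that does not depend on group sizes. It is not taken from either answer: it looks up the first col2 value seen under each col1 with transform("first") and keeps only the rows that match. The sample data is a shortened, deliberately uneven version of the question's table, and "first" is assumed to mean first in row order.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        (1, 34, "green"), (1, 34, "yellow"),
        (1, 89, "green"), (1, 89, "yellow"), (1, 89, "orange"),
        (2, 89, "green"), (2, 89, "yellow"), (2, 89, "orange"),
        (2, 90, "green"),
    ],
    names=["col1", "col2", "col3"],
)
df = pd.DataFrame({"col4": [10, 20, 40, 50, 60, 15, 25, 35, 45]}, index=index)

flat = df.reset_index()
# First col2 value encountered within each col1 group, broadcast to every row.
first_col2 = flat.groupby("col1")["col2"].transform("first")
result = flat[flat["col2"] == first_col2].set_index(["col1", "col2", "col3"])
print(result)

# For comparison: grouping by levels 0 and 2 also picks up (1, 89, orange) here,
# because "orange" is missing from the first col2 group under col1 == 1.
naive = df.groupby(level=[0, 2]).head(1)
print(naive)
```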

SQL query to select records based on existence of required or lack of excluded values

I'm hoping for some assistance in building a simple query that will return a list of names from a given table where an entry containing a required color exists and no entry containing an excluded color exists.
id name color
--- -------- --------
1 james red
2 james blue
3 james green
4 jim red
5 jim purple
6 bob white
7 bob red
8 bob pink
9 charlie white
10 charlie green
11 charlie black
12 kate violet
13 kate pink
14 kate red
I want to select all names where:
there must be a 'red' entry, i.e. excluding charlie
there must not be a 'pink' entry, i.e. excluding kate and bob
i.e.
james - included, has red, does not have pink
jim - included, has red, does not have pink
bob - excluded, has red but also has pink, which is excluded
charlie - excluded, does not have red
kate - excluded, has red, but also has pink, which is excluded
Ideally the output would include the list of distinct names (i.e. james, jim) and the query would allow me to use lists of colors for the required or excluded colors.
Thanks for your help!
You can use aggregation:
select name
from t
where color in ('pink', 'red')
group by name
having min(color) = 'red' and min(color) = max(color);
This version just limits the colors to 'pink' and 'red'. The having clause checks that only one color is present for a name, and that that color is 'red'.
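Yes, you can use the IN and NOT IN operators in the WHERE clause. Because the exclusion applies to the name as a whole and not to a single row, the NOT IN test needs a subquery on name. Example:
SELECT DISTINCT name
FROM t
WHERE color IN ('red')
AND name NOT IN (SELECT name FROM t WHERE color IN ('pink'))
If the lists of inclusions and exclusions are static, then you can use the query above.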
If the list is dynamic, such as a table that stores the inclusion and exclusion lists, then you can replace the static values with a SELECT statement.
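The question also asks for lists of required and excluded colors. One way to sketch that is conditional aggregation, tried out here against an in-memory SQLite database from Python. The helper `names_matching` is hypothetical, both lists are assumed to be non-empty, and "required" is interpreted as "at least one of the listed colors is present", which the question does not spell out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, color TEXT)")
rows = [
    (1, "james", "red"), (2, "james", "blue"), (3, "james", "green"),
    (4, "jim", "red"), (5, "jim", "purple"),
    (6, "bob", "white"), (7, "bob", "red"), (8, "bob", "pink"),
    (9, "charlie", "white"), (10, "charlie", "green"), (11, "charlie", "black"),
    (12, "kate", "violet"), (13, "kate", "pink"), (14, "kate", "red"),
]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

def names_matching(conn, required, excluded):
    # One "?" placeholder per color so the values are bound, not interpolated.
    req_marks = ", ".join("?" for _ in required)
    exc_marks = ", ".join("?" for _ in excluded)
    query = f"""
        SELECT name
        FROM t
        GROUP BY name
        HAVING SUM(CASE WHEN color IN ({req_marks}) THEN 1 ELSE 0 END) > 0
           AND SUM(CASE WHEN color IN ({exc_marks}) THEN 1 ELSE 0 END) = 0
        ORDER BY name
    """
    return [row[0] for row in conn.execute(query, [*required, *excluded])]

names = names_matching(conn, ["red"], ["pink"])
print(names)
```

Grouping by name already yields distinct names, so no DISTINCT is needed.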

Increment and Change counter based on change in a column value

My query returns a result like this:
COLOR_NAME
RED
RED
RED
GREEN
GREEN
BLUE
WHITE
WHITE
WHITE
WHITE
WHITE
WHITE
I need to show a number next to each row of the result above, restarting at 10 whenever the color changes. So the desired result is like this:
COLOR_NAME SORT_NO
RED 10
RED 11
RED 12
GREEN 10
GREEN 11
BLUE 10
WHITE 10
WHITE 11
WHITE 12
WHITE 13
WHITE 14
WHITE 15
How could I achieve this in MS SQL Server?
You can use the ROW_NUMBER() function, partitioned by color:
select COLOR_NAME
, 9 + ROW_NUMBER() OVER (PARTITION BY COLOR_NAME ORDER BY ID) AS Sort_No
from TB_COLOR
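The same window-function query can be tried out from Python against an in-memory SQLite database, which is a stand-in here for SQL Server. This assumes SQLite 3.25 or newer (the first version with window functions) and an `id` column that reflects the original row order; the table layout is a guess based on the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_color (id INTEGER PRIMARY KEY, color_name TEXT)")
colors = ["RED"] * 3 + ["GREEN"] * 2 + ["BLUE"] + ["WHITE"] * 6
conn.executemany(
    "INSERT INTO tb_color (color_name) VALUES (?)", [(c,) for c in colors]
)

query = """
    SELECT color_name,
           9 + ROW_NUMBER() OVER (PARTITION BY color_name ORDER BY id) AS sort_no
    FROM tb_color
    ORDER BY id
"""
rows = conn.execute(query).fetchall()
for color_name, sort_no in rows:
    print(color_name, sort_no)
```

ROW_NUMBER() starts at 1 in every partition, so adding 9 makes each color start at 10.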