python pandas - Add a new column by merging 2 columns based on a condition

Here is the data frame I am working with:
df = igan[["SUBJID", "LBSPCCND", "LBSPCCND_OTHER"]]
df.head(12)
I need to merge LBSPCCND and LBSPCCND_OTHER into a new column called LBSPCCND_ALL. I want to keep all values from LBSPCCND except where it equals "Other", and take the values from LBSPCCND_OTHER wherever it is not blank, merging those into the new column (a blank there means LBSPCCND already has a valid value). I can't have "Other" in my data set. SUBJID is the unique identifier I use to merge this data back into my main data frame, which you don't see here.
I put together these conditions, but I'm unsure how to get the new column based on these conditions.
condition1 = df["LBSPCCND"] != "Other"
condition2 = df["LBSPCCND_OTHER"] != ""
df["LBSPCCND_ALL"] = df[df[condition1 & condition2]]
#This is not working I get: Expected a 1D array, got an array with shape (13, 3)

I would do it this way:
df["LBSPCCND_ALL"] = df["LBSPCCND_OTHER"].replace("", np.nan).fillna(df["LBSPCCND"])
Another variant:
df["LBSPCCND_ALL"] = df["LBSPCCND"].replace("Other", np.nan).fillna(df["LBSPCCND_OTHER"])
Note the np.nan rather than None: in older pandas versions, passing value=None to Series.replace falls back to pad-fill behaviour instead of inserting missing values, so fillna would have nothing to fill.
Output:
print(df["LBSPCCND_ALL"])
0 Adequate specimen
1 Adequate specimen
2 Adequate specimen
3 Limited Sample
4 Paraffin block; paraffin-embedded specimen
5 Adequate specimen
6 Adequate specimen
7 Adequate specimen
8 Pathology Report
9 Adequate specimen
10 Adequate specimen
11 Unacceptable
Name: LBSPCCND_ALL, dtype: object
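An equivalent one-liner, sketched here on a small stand-in frame (the rows below are illustrative, not the asker's real data), selects between the two columns with np.where:

```python
import numpy as np
import pandas as pd

# Stand-in data mirroring the structure described above
df = pd.DataFrame({
    "SUBJID": ["0292-104", "1749-101", "0587-102"],
    "LBSPCCND": ["Adequate specimen", "Other", "Other"],
    "LBSPCCND_OTHER": ["", "Limited Sample", "Pathology Report"],
})

# Wherever LBSPCCND is "Other", take the value from LBSPCCND_OTHER instead
df["LBSPCCND_ALL"] = np.where(
    df["LBSPCCND"] == "Other", df["LBSPCCND_OTHER"], df["LBSPCCND"]
)
print(df["LBSPCCND_ALL"].tolist())
# ['Adequate specimen', 'Limited Sample', 'Pathology Report']
```

This avoids the replace/fillna round-trip entirely, at the cost of assuming "Other" is the only marker value to overwrite.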

Related

apply function causing SettingWithCopyWarning [duplicate]

A dataframe called condition produces the below output:
SUBJID LBSPCCND LBSPCCND_OTHER
0 0292-104 Adequate specimen
1 1749-101 Other Limited Sample
2 1733-104 Paraffin block; paraffin-embedded specimen
3 0587-102 Other Pathology Report
4 0130-101 Adequate specimen
5 0587-101 Adequate specimen
6 0609-102 Other Unacceptable
When I run the below code, I'm getting a SettingWithCopyWarning:
condition["LBSPCCND"] = condition["LBSPCCND"].apply(convert_condition)
condition
SUBJID LBSPCCND LBSPCCND_OTHER
0 0292-104 ADEQUATE
1 1749-101 Other Limited Sample
2 1733-104 PARAFFIN-EMBEDDED
3 0587-102 Other Pathology Report
4 0130-101 ADEQUATE
5 0587-101 ADEQUATE
6 0609-102 Other Unacceptable
This generates this warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Calling .copy() on my dataframe got rid of the warning:
columns = ["SUBJID", "LBSPCCND", "LBSPCCND_OTHER"]
condition = igan[columns].copy()
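Putting the fix together, here is a minimal runnable sketch (convert_condition is not shown in the question, so the version below is a hypothetical stand-in):

```python
import pandas as pd

# Stand-in for the igan frame described above
igan = pd.DataFrame({
    "SUBJID": ["0292-104", "1749-101"],
    "LBSPCCND": ["Adequate specimen", "Other"],
    "LBSPCCND_OTHER": ["", "Limited Sample"],
})

# Hypothetical stand-in for the convert_condition function not shown above
def convert_condition(value):
    return value.upper() if value != "Other" else value

# .copy() makes `condition` an independent frame, so assigning to one of
# its columns no longer writes to a view of `igan`
columns = ["SUBJID", "LBSPCCND", "LBSPCCND_OTHER"]
condition = igan[columns].copy()
condition["LBSPCCND"] = condition["LBSPCCND"].apply(convert_condition)
print(condition["LBSPCCND"].tolist())
# ['ADEQUATE SPECIMEN', 'Other']
```

Without the .copy(), `igan[columns]` may return a view, and the subsequent column assignment is exactly the ambiguous write the warning is about.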

How to find the frequency of list elements in a data frame using pandas?

I have a list and a data frame. I want to find the number of occurrences of each word in the list (some entries in the list are pairs) for each of the "emotions" in the data frame.
Here is my list:
[(frozenset({'know'}), 16528),
(frozenset({'im'}), 39047),
(frozenset({'feeling'}), 99455),
(frozenset({'like'}), 49332),
(frozenset({'feel', 'im'}), 16602),
(frozenset({'feeling', 'im'}), 23488),
(frozenset({'feel'}), 202985),
(frozenset({'feel', 'like'}), 42162),
(frozenset({'time'}), 17203),
(frozenset({'really'}), 17247)]
and this is my data frame:
Unnamed: 0 id text emotions
0 0 27383 [feel, awful, job, get, position, succeed, hap... sadness
1 1 110083 [im, alone, feel, awful] sadness
2 2 140764 [ive, probably, mentioned, really, feel, proud... joy
3 3 100071 [feeling, little, low, day, back] sadness
4 4 2837 [beleive, much, sensitive, people, feeling, te... love
Here is the expected output:
Six columns for the six existing emotions, plus a last column for the total count.
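No answer is shown above, but one way to sketch the counting, assuming each frozenset should be counted for a row when all of its words appear in that row's token list (the tiny list and frame below are stand-ins for the real data):

```python
import pandas as pd

# Tiny stand-ins for the real list and frame
pairs = [(frozenset({'feel'}), 202985), (frozenset({'feel', 'im'}), 16602)]
df = pd.DataFrame({
    "text": [['feel', 'awful'], ['im', 'alone', 'feel'], ['feeling', 'low']],
    "emotions": ['sadness', 'sadness', 'joy'],
})

rows = []
for words, _ in pairs:
    # A row matches when every word of the frozenset is in its token list
    matches = df["text"].apply(lambda tokens: words <= set(tokens))
    counts = df.loc[matches, "emotions"].value_counts()
    rows.append(counts.rename("/".join(sorted(words))))

# One column per emotion, plus a total column
result = (pd.DataFrame(rows)
          .reindex(columns=sorted(df["emotions"].unique()))
          .fillna(0).astype(int))
result["total"] = result.sum(axis=1)
print(result)
```

With the full six-emotion data, the reindex step produces the six emotion columns plus the total column the asker describes.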

Creating a new column with an incrementing sentence count whenever two row values are simultaneously null (indicating a new sentence is found)

I have a dataframe with words and entities and would like to create a third column which keeps a sentence count for every new sentence found as shown in the link example of desired output.
The condition based on which I would recognize the start of a new sentence is when both the word and entity columns have null values like at index 4.
0 word entity
1 It O
2 was O
3 fun O
4 NaN NaN
5 from O
6 vodka B-product
So far I have managed to fill the null values with a new_sentence string, and have figured out how to make a new column where I can enter a value whenever a new sentence is found, using:
df.fillna("new_sentence", inplace=True)
df['Sentence #'] = np.where(df['word']=='new_sentence', 'S', False)
In the above code instead of S I would like to fill Sentence: {count} as in the example. What would be easiest/quickest way to do this? Also, is there a better way to keep a count of sentences in a separate column like in the example instead of the method I am trying?
So far I am able to get an output like this
0 word entity Sentence #
1 It O False
2 was O False
3 fun O False
4 new_sentence new_sentence S
5 from O False
6 vodka B-product False
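A common way to get the running count directly, sketched on a stand-in frame, is to cumsum a boolean marker of the separator rows and format it into the Sentence: {count} strings:

```python
import numpy as np
import pandas as pd

# Stand-in for the word/entity frame shown above
df = pd.DataFrame({
    "word": ["It", "was", "fun", np.nan, "from", "vodka"],
    "entity": ["O", "O", "O", np.nan, "O", "B-product"],
})

# Rows where both columns are null mark the start of a new sentence;
# cumsum turns those markers into a running sentence number
new_sent = df["word"].isna() & df["entity"].isna()
df["Sentence #"] = "Sentence: " + (new_sent.cumsum() + 1).astype(str)
print(df["Sentence #"].tolist())
```

This keeps the count in a separate column without the fillna("new_sentence") step, and every row carries the number of the sentence it belongs to.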

pandas/python Merge/concatenate related data of duplicate rows and add a new column to existing data frame

I am new to Pandas, and wanted your help with data slicing.
I have a dump of 10 million rows with duplicates. Please refer to this image for a sample of the rows with the steps I am looking to perform.
As you see in the image, the column for criteria "ABC" from Source 'UK' has 2 duplicate entries in the Trg column. I need help with:
Adding a concatenated new column "All Targets" as shown in image
Removing duplicates from above table so that only unique values without duplicates appear, as shown in step 2 in the image
Any help with this regard will be highly appreciated.
I would do it like this:
PART 1:
First define a function that does what you want, then use the apply method:
def my_func(grouped):
    all_target = grouped["Trg"].unique()
    grouped["target"] = ", ".join(all_target)
    return grouped

df1 = df.groupby("Criteria").apply(my_func)
# output: example with first 4 rows
Criteria Trg target
0 ABC DE DE, FR
1 ABC FR DE, FR
2 DEF UK UK, FR
3 DEF FR UK, FR
PART 2:
df2 = df1.drop_duplicates(subset=["Criteria"])
I tried it only on first 4 rows so let me know if it works.
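A shorter variant of the same idea, sketched below on stand-in data, uses groupby().transform, which broadcasts the joined unique targets back to every row without mutating the group inside apply:

```python
import pandas as pd

# Stand-in for the first 4 rows described above
df = pd.DataFrame({
    "Criteria": ["ABC", "ABC", "DEF", "DEF"],
    "Trg": ["DE", "FR", "UK", "FR"],
})

# transform broadcasts the per-group scalar back to every row of the group
df["target"] = df.groupby("Criteria")["Trg"].transform(
    lambda s: ", ".join(s.unique())
)
df2 = df.drop_duplicates(subset=["Criteria"])
print(df2[["Criteria", "target"]].to_string(index=False))
```

On 10 million rows, transform on a single column is typically cheaper than applying a function to whole group frames.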

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is for tenant_id=2. I would like a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general the count should be much less than the total number of user_ids; in this case, say, one less than the total user_ids that belong to a tenant.
I first tried to figure out how to select a random subset of varying length with df.sample:
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kinda lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
The .tolist() stores each sample as a list in the cell; without it the assignment tries to align the sampled Series by index and produces NaNs. It doesn't really matter which column you use for the apply, since the lambda argument is not used in the computation.
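To match the desired per-tenant output, here is a hedged sketch that samples within each tenant via groupby, so each row only draws ids from its own tenant (the frame and helper name below are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in subset of the frame shown above
df = pd.DataFrame({
    'tenant_id': [1, 1, 1, 2, 2, 2],
    'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh'],
})

def sample_ids(ids):
    # ids is the user_id Series for one tenant; each row gets a fresh
    # random-length sample (at most len(ids) - 1 ids, capped at ten)
    n_max = min(10, max(1, len(ids) - 1))
    return ids.apply(
        lambda _: ids.sample(n=np.random.randint(1, n_max + 1)).tolist()
    )

# group_keys=False keeps the original index so the result assigns cleanly
df["new_column"] = df.groupby("tenant_id", group_keys=False)["user_id"].apply(sample_ids)
print(df)
```

Each cell is then a list of user_ids drawn only from that row's tenant, with a length between 1 and one less than the tenant's total.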