how to write a list comprehension for selecting cells based on a substring [duplicate] - pandas

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 3 years ago.
I am trying to rewrite the following in one line using a list comprehension. I want to select only the cells that contain the substring '[edit]'. ut is my dataframe and the column I want to select from is 'col1'. Thanks!
for u in ut['col1']:
    if '[edit]' in u:
        print(u)
I expect the following output:
Alabama[edit]
Alaska[edit]
Arizona[edit]
...

If a pandas Series as output is acceptable, then you can just use .str.contains, without a loop. Since '[edit]' contains regex metacharacters (square brackets denote a character class), pass regex=False to match the literal substring:
s = ut[ut["col1"].str.contains("[edit]", regex=False)]
If you need to print each element of the Series separately, then loop over the Series using
for i in s:
    print(i)
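And if you want the one-line list comprehension the question literally asks for, here is a minimal sketch (it assumes every cell in 'col1' is a string):

matches = [u for u in ut['col1'] if '[edit]' in u]
print(*matches, sep='\n')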

How to replace character into multiIndex pandas [duplicate]

This question already has an answer here:
Pandas dataframe replace string in multiple columns by finding substring
(1 answer)
Closed 11 months ago.
I have a dataset with several columns containing numbers, and I need to remove the ',' thousands separator.
Here is an example: 123,456.15 -> 123456.15.
I tried to get it done with multi-indexes the following way:
toProcess = ['col1','col2','col3']
df[toProcess] = df[toProcess].str.replace(',','')
Unfortunately, the error is: 'DataFrame' object has no attribute 'str'. A DataFrame doesn't have the .str accessor, but a Series does.
How can I achieve this task efficiently?
Here is a working way, iterating over the columns:
toProcess = ['col1', 'col2', 'col3']
for col in toProcess:
    df[col] = df[col].str.replace(',', '')
Or do it in one step, without a loop, since DataFrame.replace accepts regex patterns:
df[toProcess] = df[toProcess].replace(',', '', regex=True)
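For reference, a minimal runnable sketch of the one-step approach; the data here is made up for illustration:

import pandas as pd

df = pd.DataFrame({'col1': ['123,456.15', '1,000.00'],
                   'col2': ['9,999.99', '2,500.50'],
                   'col3': ['7,000.01', '100.00']})
toProcess = ['col1', 'col2', 'col3']

# Strip the thousands separator from all three columns at once
df[toProcess] = df[toProcess].replace(',', '', regex=True)
# Optionally convert the cleaned strings to actual numbers
df[toProcess] = df[toProcess].astype(float)
print(df)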

Pandas - list of unique strings in a column [duplicate]

This question already has answers here:
Find the unique values in a column and then sort them
(8 answers)
Closed 1 year ago.
I have a dataframe column which contains these values:
A
A
A
F
R
R
B
B
A
...
I would like to make a list of the distinct strings, like [A, B, F, ...].
I've used groupby with nunique(), but I don't need the counts.
How can I make the list?
Thanks
unique() is enough:
df['col'].unique().tolist()
pandas.Series.nunique() only returns the number of unique items, not the items themselves.
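A quick check on the sample values from the question; note that unique() preserves order of first appearance, so wrap it in sorted() if you want alphabetical order:

import pandas as pd

s = pd.Series(['A', 'A', 'A', 'F', 'R', 'R', 'B', 'B', 'A'])
print(s.unique().tolist())  # ['A', 'F', 'R', 'B'] (order of appearance)
print(sorted(s.unique()))   # ['A', 'B', 'F', 'R']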

Using Anti Join in R [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
How to sort data by column in descending order in R
(2 answers)
Closed 1 year ago.
I am new to R, and I have been trying to compare two data frames derived using text mining. Each has two columns: one with words and the other with a count.
Assume they are dataframe1 and dataframe2.
I am trying to find out how to write code that will select those words that are present in dataframe2 but not present in dataframe1.
If we had to do it in Excel, we would use the words in dataframe2 as the reference, VLOOKUP the list of words from dataframe1, keep the #N/A results, and then sort those #N/A rows by the highest count.
Pictures of dataframe1, dataframe2, and the expected output were attached to illustrate this (not reproduced here). As they show, the words C and F are in dataframe1 and also in dataframe2, so they have to be excluded from the result.
Can someone help me? I've been trying for hours now. Thanks in advance.
There's a dplyr function to do this called anti_join:
library(dplyr)
anti_join(df1, df2, by = c('Check'))
To sort it in descending order of Count (thanks to Ben Bolker for pointing out that part of the question) you can use arrange.
library(dplyr)
df1 %>%
  anti_join(df2, by = c('Check')) %>%
  arrange(desc(Count))

filter multiple separate rows in a DataFrame that meet the condition in another DataFrame with pandas? [duplicate]

This question already has answers here:
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 2 years ago.
This is my DataFrame
df = pd.DataFrame({'uid': [109200005, 108200056, 109200060, 108200085, 108200022],
                   'grades': [69.233627, 70.130900, 83.357011, 88.206387, 74.342212]})
This is my condition list which comes from another DataFrame
condition_list = [109200005, 108200085]
I use this code to filter records that meet the condition
idx_list = []
for i in condition_list:
    idx_list.append(df[df['uid'] == i].index.values[0])
and get what I need
>>> df.iloc[idx_list]
         uid     grades
0  109200005  69.233627
3  108200085  88.206387
The job is done. I'd just like to know: is there a simpler way to do this?
Yes, use isin:
df[df['uid'].isin(condition_list)]
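As a quick sanity check on the question's own data (the ~ operator inverts the mask if you ever need "not in"):

import pandas as pd

df = pd.DataFrame({'uid': [109200005, 108200056, 109200060, 108200085, 108200022],
                   'grades': [69.233627, 70.130900, 83.357011, 88.206387, 74.342212]})
condition_list = [109200005, 108200085]

print(df[df['uid'].isin(condition_list)])   # rows whose uid is in the list
print(df[~df['uid'].isin(condition_list)])  # rows whose uid is NOT in the list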

Extracting information from Pandas dataframe column headers [duplicate]

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 2 years ago.
I have a pandas dataframe whose column headers contain information. I want to loop through the headers and use logical operations on each one to extract the columns holding the information I am interested in.
My df.columns command gives something like this:
['(param1:x)-(param2:y)-(param3:z1)',
'(param1:x)-(param2:y)-(param3:z2)',
'(param1:x)-(param2:y)-(param3:z3)']
I want to select only the columns, which contain (param3:z1) and (param3:z3).
Is this possible?
You can use filter, which selects columns whose names match a regex:
df = df.filter(regex='z1|z3')
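If 'z1' or 'z3' might also occur elsewhere in a header, it is safer to anchor the pattern to the param3 field; a sketch using the headers from the question:

import pandas as pd

cols = ['(param1:x)-(param2:y)-(param3:z1)',
        '(param1:x)-(param2:y)-(param3:z2)',
        '(param1:x)-(param2:y)-(param3:z3)']
df = pd.DataFrame([[1, 2, 3]], columns=cols)

# Match 'z1' or 'z3' only when they follow 'param3:'
selected = df.filter(regex=r'param3:(z1|z3)')
print(selected.columns.tolist())
# ['(param1:x)-(param2:y)-(param3:z1)', '(param1:x)-(param2:y)-(param3:z3)']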