Using Anti Join in R [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
How to sort data by column in descending order in R
(2 answers)
Closed 1 year ago.
I am a noob in R, and I have been trying to compare two data frames derived using text mining. Each has two columns: one with words and the other with a count.
Assume they are dataframe1 and dataframe2.
I am trying to find out how to write code that selects the words that are present in dataframe2 but not present in dataframe1.
If we had to do it in Excel, we would use the words in dataframe2 as the lookup reference, VLOOKUP the same list of words from dataframe1, select the rows that come back as #N/A, and then sort those #N/A rows by the highest count.
Below is the picture to explain in detail:
dataframe1
dataframe2:
As you can see, the words C & F are in dataframe1 and also in dataframe2, so we have to exclude them, and the result should look like this.
Expected Output:
Can someone help me? I have been trying for hours now. Thanks in advance.

There's a dplyr function to do this called anti_join:
library(dplyr)
anti_join(df2, df1, by = 'Check')
To sort it in descending order of Count (thanks to Ben Bolker for pointing out that part of the question) you can use arrange.
library(dplyr)
df2 %>%
  anti_join(df1, by = 'Check') %>%
  arrange(desc(Count))
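Since the neighboring questions here use pandas, the same anti-join idea (rows of dataframe2 whose word does not appear in dataframe1, sorted by count) can be sketched in pandas using merge with its indicator parameter. The Check and Count column names follow the question; the data below is made up for illustration:

```python
import pandas as pd

# Hypothetical word/count frames mirroring the question
df1 = pd.DataFrame({'Check': ['A', 'B', 'C', 'F'], 'Count': [10, 7, 5, 3]})
df2 = pd.DataFrame({'Check': ['C', 'D', 'E', 'F'], 'Count': [4, 9, 2, 1]})

# Anti-join: keep rows of df2 whose 'Check' value is absent from df1,
# then sort by Count in descending order
out = (df2.merge(df1[['Check']], on='Check', how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns='_merge')
          .sort_values('Count', ascending=False))
```

Here out keeps only the words D and E, ordered by their counts.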


When to use pandas ‘loc’ for dataframe slicing [duplicate]

This question already has answers here:
Python: Pandas Series - Why use loc?
(3 answers)
Closed 1 year ago.
In pandas, if I have a dataframe, I can subset it like:
df[df.col == some_condition]
Also, I can do:
df.loc[df.col == some_condition]
What is the difference between the two? The ‘loc’ approach seems more verbose?
In simple words:
There are three primary indexers for pandas. We have the indexing operator itself (the brackets []), .loc, and .iloc. Let's summarize them:
[] - Primarily selects subsets of columns, but can select rows as well. Can't simultaneously select rows and columns.
.loc - selects subsets of rows and columns by label only
.iloc - selects subsets of rows and columns by integer location only
For a more detailed explanation, you can check this question.
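The three indexers in the summary above can be sketched side by side; the frame and its labels here are made up for illustration:

```python
import pandas as pd

# Small illustrative frame with string row labels
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'y', 'z'])

cols = df[['a']]             # [] selects a subset of columns by label
rows = df[df['a'] > 1]       # ...or filters rows with a boolean mask
by_label = df.loc['y', 'b']  # .loc: row and column selected by label
by_pos = df.iloc[1, 1]       # .iloc: same cell, by integer position
```

Both by_label and by_pos pick out the same cell (the value 5), but only .loc and .iloc can address rows and columns simultaneously.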

Pandas - list of unique strings in a column [duplicate]

This question already has answers here:
Find the unique values in a column and then sort them
(8 answers)
Closed 1 year ago.
I have a dataframe column that contains these values:
A
A
A
F
R
R
B
B
A
...
I would like to make a list of the distinct strings, like [A,B,F,...].
I've used groupby with nunique(), but I don't need the counts.
How can I make the list?
Thanks
unique() is enough:
df['col'].unique().tolist()
pandas.Series.nunique() returns the number of unique items, not the items themselves.
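A minimal sketch using the values from the question (the column is built as a standalone Series here for illustration):

```python
import pandas as pd

# Toy column matching the question's values
s = pd.Series(['A', 'A', 'A', 'F', 'R', 'R', 'B', 'B', 'A'])

values = s.unique().tolist()  # distinct values, in order of first appearance
n = s.nunique()               # just the count of distinct values
```

values comes back as ['A', 'F', 'R', 'B'], while nunique() only reports that there are 4 of them.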

Extracting information from Pandas dataframe column headers [duplicate]

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 2 years ago.
I have a pandas dataframe whose column headers contain information. I want to loop through the column headers and use logical operations on each header to extract the columns that carry the information I am after.
my df.columns command gives something like this:
['(param1:x)-(param2:y)-(param3:z1)',
'(param1:x)-(param2:y)-(param3:z2)',
'(param1:x)-(param2:y)-(param3:z3)']
I want to select only the columns, which contain (param3:z1) and (param3:z3).
Is this possible?
You can use filter:
df = df.filter(regex='z1|z3')
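A minimal sketch with the headers from the question (an empty frame is enough, since filter only looks at the column names):

```python
import pandas as pd

# Columns shaped like the question's headers; no rows needed
df = pd.DataFrame(columns=['(param1:x)-(param2:y)-(param3:z1)',
                           '(param1:x)-(param2:y)-(param3:z2)',
                           '(param1:x)-(param2:y)-(param3:z3)'])

# filter(regex=...) keeps columns whose name matches the pattern
kept = df.filter(regex='z1|z3')
```

kept retains only the z1 and z3 columns; the z2 column is dropped.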

All column names not listed by df.columns [duplicate]

This question already has answers here:
pandas groupby without turning grouped by column into index
(3 answers)
Closed 2 years ago.
I wanted to perform groupby and the agg function on my dataframe, so I ran the code below:
basic_df = df.groupby(['S2PName','S2PName-Category'], sort=False)['S2PGTotal'].agg([('totSale','sum'), ('count','size')])
basic_df.head(2)
My O/P:
totSale count
S2PName S2PName-Category
IDLY Food 598771.47 19749
DOSA Food 567431.03 14611
Now I try to print the columns using basic_df.columns
My O/P:
Index(['totSale', 'count'], dtype='object')
Why are the other two columns, "S2PName" and "S2PName-Category", not being displayed? What do I need to do to display them as well?
Thanks!
Add as_index=False, or append reset_index() at the end:
basic_df = df.groupby(['S2PName','S2PName-Category'], sort=False,as_index=False)['S2PGTotal'].agg([('totSale','sum'), ('count','size')])
#basic_df = df.groupby(['S2PName','S2PName-Category'], sort=False)['S2PGTotal'].agg([('totSale','sum'), ('count','size')]).reset_index()
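A runnable sketch of the reset_index() variant, using made-up sales numbers under the question's column names (note it uses pandas' named-aggregation keyword syntax rather than the tuple-list form above, which behaves the same here):

```python
import pandas as pd

# Made-up data with the question's column names
df = pd.DataFrame({'S2PName': ['IDLY', 'IDLY', 'DOSA'],
                   'S2PName-Category': ['Food', 'Food', 'Food'],
                   'S2PGTotal': [10.0, 20.0, 15.0]})

# reset_index() turns the group keys back into regular columns
grouped = (df.groupby(['S2PName', 'S2PName-Category'], sort=False)['S2PGTotal']
             .agg(totSale='sum', count='size')
             .reset_index())
```

grouped.columns now lists all four names, group keys included, instead of only ['totSale', 'count'].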

how to write list comprehension for selecting cells base on a substring [duplicate]

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 3 years ago.
I am trying to rewrite the following in one line using a list comprehension. I want to select only the cells that contain the substring '[edit]'. ut is my dataframe and the column I want to select from is 'col1'. Thanks!
for u in ut['col1']:
    if '[edit]' in u:
        print(u)
I expect the following output:
Alabama[edit]
Alaska[edit]
Arizona[edit]
...
If the output of a Pandas Series is acceptable, then you can just use .str.contains, without a loop (note regex=False, so the brackets in '[edit]' are matched literally rather than treated as a regex character class):
s = ut[ut["col1"].str.contains("[edit]", regex=False)]
If you need to print each element of the Series separately, then loop over the Series:
for i in s:
    print(i)
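Since the question explicitly asked for a list comprehension, the original loop condenses to one line; the ut and col1 names follow the question, with toy values for illustration:

```python
import pandas as pd

# Toy column shaped like the expected output
ut = pd.DataFrame({'col1': ['Alabama[edit]', 'Auburn', 'Alaska[edit]']})

# One-line list comprehension: plain substring test, no regex involved
matches = [u for u in ut['col1'] if '[edit]' in u]
```

matches keeps only the '[edit]' entries, as a plain Python list rather than a Series.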