How to add pandas text column one below other - pandas

I have a pandas dataframe with 3 text columns, pros, cons and sentiment:
pros | cons | sentiment
-----------------------
text | text | positive
text | text | negative
I want the columns to look like this:
review | sentiment
------------------
pros   | positive
pros   | positive
pros   | positive
pros   | positive
cons   | negative
cons   | negative
cons   | negative
I would like a new column review that holds the pros text with its sentiment value, followed below by the cons text with its sentiment value.
How do I rearrange the initial dataframe into the review and sentiment columns in pandas?

IIUC, use melt with sort_values.
Assuming df is your dataframe:
out = (
    df.melt(id_vars="sentiment",
            value_vars=["pros", "cons"],
            value_name="review")
      .sort_values(by="sentiment", ascending=False)
      [["review", "sentiment"]]
)
Output:
print(out)
  review sentiment
0   text  positive
2   text  positive
1   text  negative
3   text  negative
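An equivalent, perhaps more explicit, formulation uses pd.concat (a sketch against the same df; it keeps all pros rows above all cons rows, as in the desired layout):
import pandas as pd

pros = df[['pros', 'sentiment']].rename(columns={'pros': 'review'})
cons = df[['cons', 'sentiment']].rename(columns={'cons': 'review'})
# Stack the two pieces vertically; ignore_index renumbers the rows.
out = pd.concat([pros, cons], ignore_index=True)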

Related

Extracting portions of the entries of Pandas dataframe

I have a Pandas dataframe with several columns wherein the entries of each column are a combination of numbers, upper and lower case letters and some special characters, i.e. "=A-Za-z0-9_|". Each entry of the column is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the digits 0-9 appearing between | | and strip out all other characters. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping out anything. Is there an error in my code?
Yes: lstrip treats its argument as a literal set of characters to strip (so 'A-Za-z' means just the characters 'A', '-', 'Z', 'a', 'z'), not as a regular expression; since each entry starts with 'x', nothing is stripped. Instead of working with hard-to-read regex expressions, though, you might want to consider a simple split. For example:
import pandas as pd

d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
# Split on "|" and keep the piece between the first two pipes.
output = df["col"].str.split("|").str[1]
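If you do prefer a regex, str.extract can capture the digits directly; a minimal sketch (the pattern assumes the digits are always enclosed by two pipes, as in the example entries):
# Capture one run of digits between two literal pipes.
output = df["col"].str.extract(r"\|(\d+)\|", expand=False)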

How to compute the similarity between two text columns in dataframes with pyspark?

I have 2 data frames with different numbers of rows. Both have a text column. My aim is to compare them, find similarities, compute a similarity ratio, and add this score to a final data set. The comparison is between title from df1 and headline from df2. The positions of these text rows differ between the frames.
df1
duration  title                 publish_start_date
129.33    Smuggler's Run fr...  2021-10-29T10:21:...
49.342    anchises. Founded...  2021-10-29T06:00:...
69.939    by Diego Angel in...  2021-10-29T00:33:...
102.60    Orange County sch...  2021-10-28T10:24:...
df2
DataSource  Post Id   headline
Linkedin    L1904055  in English versi...
Linkedin    F6955268  in other language...
Facebook    F1948698  Its combined edit...
Twitter     T7954991  Emma Raducanu: 10...
Basically, I am trying to find the similarities between the 2 data sets row by row (on text). Is there any way to do this?
number of rows in the final data set = number of rows in the first data set × number of rows in the second data set
What you are looking for is a cross join. This way each row in df1 gets joined with every row in df2, after which you can apply a function to compare the similarity between them.
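A minimal PySpark sketch (assuming an active SparkSession and that df1/df2 are Spark DataFrames; SequenceMatcher's ratio is just one illustrative similarity metric):
from difflib import SequenceMatcher
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def similarity(a, b):
    # Character-level similarity in [0.0, 1.0]; treat missing text as 0.
    if a is None or b is None:
        return 0.0
    return float(SequenceMatcher(None, a, b).ratio())

scores = (
    df1.select("title")
       .crossJoin(df2.select("headline"))   # len(df1) * len(df2) rows
       .withColumn("score", similarity(F.col("title"), F.col("headline")))
)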

Iterate two dataframes, compare and change a value in pandas or pyspark

I am trying to do an exercise in pandas.
I have two dataframes. I need to compare a few columns between them and change the value of one column in the first dataframe if the comparison succeeds.
Dataframe 1:
Article  Country  Colour  Buy
Pants    Germany  Red     0
Pull     Poland   Blue    0
Initially all my articles have the 'Buy' flag set to zero.
Dataframe 2 looks like this:
Article  Origin  Colour
Pull     Poland  Blue
Dress    Italy   Red
I want to check whether the article, country/origin and colour columns match (i.e. whether each article from dataframe 1 can be found in dataframe 2) and, if so, set the 'Buy' flag to 1.
I tried to iterate through both dataframes with pyspark, but pyspark dataframes are not iterable.
I thought about doing it in pandas, but apparently it is bad practice to change values during iteration.
Which code in pyspark or pandas would do what I need?
Thanks!
Use merge with an indicator, then map the values. Make sure to drop_duplicates on the merge keys in the right frame so the merge result is always the same length as the original, and rename Origin to Country so the same information is not repeated after the merge. There is no need for a pre-defined column of 0s.
df1 = df1.drop(columns='Buy')
# indicator='Buy' marks each row 'both' (matched) or 'left_only' (not matched).
df1 = df1.merge(df2.drop_duplicates().rename(columns={'Origin': 'Country'}),
                indicator='Buy', how='left')
df1['Buy'] = df1['Buy'].map({'left_only': 0, 'both': 1}).astype(int)
  Article  Country Colour  Buy
0   Pants  Germany    Red    0
1    Pull   Poland   Blue    1
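Since the question also asks about PySpark, here is a rough equivalent (a sketch, assuming df1 and df2 are Spark DataFrames): a left join on the three key columns, then a flag for the rows that found a match.
from pyspark.sql import functions as F

# Align the key column names and mark every right-side row as a match.
df2_keys = (df2.withColumnRenamed("Origin", "Country")
               .dropDuplicates()
               .withColumn("matched", F.lit(1)))

# Rows with no match get a null 'matched', which coalesce turns into 0.
result = (df1.drop("Buy")
             .join(df2_keys, on=["Article", "Country", "Colour"], how="left")
             .withColumn("Buy", F.coalesce(F.col("matched"), F.lit(0)))
             .drop("matched"))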

Pandas df to lists and sublists

I have a pandas data frame with one column and 100 rows (each cell is a paragraph). I would like to create a list of sublists to perform LDA and get the topics.
Ex:
S.No  Text
0     abc
1     def
2     ghi
3     jkl
4     mno
I want the result to be a list of sublists
"[[abc]
[def]
[ghi]
[jkl]
[mno]]"
So that I can tokenize the sentences into words and perform LDA
Any ideas?
I don't think you need a list of sublists to convert your sentences into tokens. You can do it this way (below); from there, you can reshape the output however you want:
import pandas as pd
from nltk.tokenize import word_tokenize

# example
df = pd.DataFrame({'text': ['how are you', 'paris is good',
                            'fish is in water', 'we play tomorrow']})
# tokenize each sentence into a list of words
df['token_text'] = df.text.apply(word_tokenize)
print(df)
               text             token_text
0       how are you        [how, are, you]
1     paris is good      [paris, is, good]
2  fish is in water  [fish, is, in, water]
3  we play tomorrow    [we, play, tomorrow]
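Note that word_tokenize relies on NLTK's punkt models; if they are not installed yet, download them once:
import nltk
nltk.download('punkt')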
YOLO's answer is very good and may be what you're looking for. Alternatively, if you are trying to use LDA and want the "list of sublists", it may be better to use arrays, which work with any numpy function. To do so you can just use:
df.values
Or, if you only want specific columns:
df.loc[:, ["col1", "col2"]].values
If you must have them as a list of lists, you can do:
[list(x) for x in df.values]
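To close the loop on the LDA goal, a minimal gensim sketch built on the token_text column from the answer above (using gensim is an assumption; num_topics=2 is illustrative):
from gensim import corpora, models

docs = df['token_text'].tolist()                    # list of token lists
dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words per doc
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())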

Change pandas crosstab dataframe into plain table format

I've got an aggregate dataframe through the following pandas crosstab. However, I'd like the columns to be formatted like this:
id  ymdh        A11  A12  A15  A16
----------------------------------
How do I change the original dataframe into my desired format?
Original output dataframe:
df = pd.crosstab(df.ymdh, df_data.id, margins=False,
                 values=df.duration, dropna=False,
                 normalize='columns',
                 aggfunc=[np.sum]).reset_index().fillna(0)
          ymdh       sum
id                A11       A12       A15       A16
0   2016040100  0.000000  0.002222  0.049398  0.018077
1   2016040101  0.003354  0.004141  0.078531  0.015131
2   2016040102  0.001397  0.002424  0.000633  0.001473
I think you need crosstab with a scalar aggfunc: passing aggfunc as a list ([np.sum]) is what creates the extra 'sum' level in the column header, so pass a single function instead:
df = pd.crosstab(df.ymdh, df_data.id, margins=False,
                 values=df.duration, dropna=False,
                 normalize='columns',
                 aggfunc='sum').reset_index().fillna(0)
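If you would rather keep aggfunc=[np.sum], you can also flatten the extra header level afterwards with droplevel (a sketch; the variable name flat is illustrative):
flat = pd.crosstab(df.ymdh, df_data.id, values=df.duration,
                   dropna=False, normalize='columns',
                   aggfunc=[np.sum])
flat.columns = flat.columns.droplevel(0)  # drop the outer 'sum' level
flat = flat.reset_index().fillna(0)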
To me this question's title (as of 2019-08-20: change pandas crosstab dataframe into plain table format) sounds rather misleading; given that 600+ people have viewed the question, some of them may have been looking for something else.
In case you are looking for converting a crosstab into a stacked dataframe, please check out this discussion: Converting a pandas crosstab into a stacked dataframe.
An example of a crosstab-to-stacked-dataframe conversion would be a regular table with two columns:
col-1: consists of the row labels of the crosstab,
col-2: consists of the column labels of the crosstab.
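That un-pivot can be done with stack; a minimal sketch (reusing the ymdh/id columns from the question, plus a third column for the cell values):
ct = pd.crosstab(df.ymdh, df_data.id)
# stack() moves the column labels into the index; reset_index(name=...)
# then yields a plain long table: one row per (ymdh, id) pair.
stacked = ct.stack().reset_index(name='value')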