Split column values into several columns in a pandas DataFrame

I am trying to do sentiment analysis on tweets using SentimentIntensityAnalyzer() from nltk.sentiment.vader:
sid = SentimentIntensityAnalyzer()
listy = []
for index, row in data.iterrows():
    ss = sid.polarity_scores(row["Tweets"])
    listy.append(ss)
se = pd.Series(listy)
data['polarity'] = se.values
display(data.head(100))
This is the resulting DataFrame:
Tweets polarity
0 RT #spectatorindex: Facebook controls:\n\n- Wh... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
1 RT #YAATeamWest: Today we're at #BradfordUniSU... {'neg': 0.0, 'neu': 0.902, 'pos': 0.098, 'comp...
2 #SachinTendulkar launches India’s first Multip... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
3 How To Create a 360 R​ender (And How to Improv... {'neg': 0.0, 'neu': 0.722, 'pos': 0.278, 'comp...
4 The Most Disturbing Virtual Reality You Will E... {'neg': 0.174, 'neu': 0.826, 'pos': 0.0, 'comp...
5 VR Training for Troops 🎮\n\n... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
6 RT #DefenceHQ: The #BritishArmy has awarded a ... {'neg': 0.0, 'neu': 0.847, 'pos': 0.153, 'comp...
7 RT #UofGHumanities: #UofGCSPE Humanities Lectu... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
8 RT #OyezServices: Ever wanted a tour of Machu ... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
9 RT #ProjectDastaan: We are an Oxford Universit... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
10 RT #Paula_Piccard: Virtual reality will change... {'neg': 0.0, 'neu': 0.878, 'pos': 0.122, 'comp...
In order to do statistical analysis on the 'neg', 'pos', 'neu' and 'compound' entries in the polarity column, I wanted to split them into four separate columns. To achieve this I used:
list_pos= []
list_neg = []
list_comp = []
list_neu = []
for index, row in data.iterrows():
    list_pos.append(row['polarity']['pos'])
    list_neg.append(row['polarity']['neg'])
    list_comp.append(row['polarity']['compound'])
    list_neu.append(row['polarity']['neu'])
se_pos = pd.Series(list_pos)
se_neg = pd.Series(list_neg)
se_comp = pd.Series(list_comp)
se_neu = pd.Series(list_neu)
data['positive'] = se_pos.values
data['negative'] = se_neg.values
data['compound'] = se_comp.values
data['neutral'] = se_neu.values
The resulting DataFrame:
Tweets polarity positive negative compound neutral
0 RT #spectatorindex: Facebook controls:\n\n- Wh... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... 0.000 0.000 0.0000 1.000
1 RT #YAATeamWest: Today we're at #BradfordUniSU... {'neg': 0.0, 'neu': 0.902, 'pos': 0.098, 'comp... 0.098 0.000 0.3612 0.902
2 #SachinTendulkar launches India’s first Multip... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... 0.000 0.000 0.0000 1.000
Is there a more concise way of achieving a similar DataFrame? Using a lambda function perhaps? Thanks for the help!
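A more concise route (a minimal sketch, assuming data and sid are defined as in the question) is to skip iterrows() entirely and let pandas expand the score dicts itself:
import pandas as pd

# Score every tweet in one pass instead of looping with iterrows().
data['polarity'] = data['Tweets'].apply(sid.polarity_scores)

# Expand each score dict into its own column, aligned on the original
# index, then rename to the column names used above and attach them.
scores = pd.DataFrame(data['polarity'].tolist(), index=data.index)
scores = scores.rename(columns={'pos': 'positive', 'neg': 'negative', 'neu': 'neutral'})
data = data.join(scores)
pd.json_normalize(data['polarity'].tolist()) would build the same intermediate frame if you prefer that spelling.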

Related

How to Stratify pandas DataFrame based on two columns?

I have the following pandas DataFrame:
account_num = [
1726905620833, 1727875510892, 1727925550921, 1727925575731, 1727345507414,
1713565531401, 1725735509119, 1727925546516, 1727925523656, 1727875509665,
1727875504742, 1727345504314, 1725475539855, 1791725523833, 1727925583805,
1727925544791, 1727925518810, 1727925606986, 1727925618602, 1727605517337,
1727605517354, 1727925583101, 1727925583201, 1727925583335, 1727025517810,
1727935718602]
total_due = [
1662.87, 3233.73, 3992.05, 10469.28, 799.01, 2292.98, 297.07, 5699.06, 1309.82,
1109.67, 4830.57, 3170.12, 45329.73, 46.71, 11981.58, 3246.31, 3214.25, 2056.82,
1611.73, 5386.16, 2622.02, 5011.02, 6222.10, 16340.90, 1239.23, 1198.98]
net_returned = [
0.0, 0.0, 0.0, 2762.64, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12008.27,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2762.69, 0.0, 0.0, 0.0, 9254.66, 0.0, 0.0]
total_fees = [
0.0, 0.0, 0.0, 607.78, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2161.49, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 536.51, 0.0, 0.0, 0.0, 1712.11, 0.0, 0.0]
year = [2021, 2022, 2022, 2021, 2021, 2020, 2020, 2022, 2019, 2019, 2020, 2022, 2019,
2018, 2018, 2022, 2021, 2022, 2022, 2020, 2019, 2019, 2022, 2019, 2021, 2022]
flipped = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
proba = [
0.960085, 0.022535, 0.013746, 0.025833, 0.076159, 0.788912, 0.052489, 0.035279,
0.019701, 0.552127, 0.063949, 0.061279, 0.024398, 0.902681, 0.009441, 0.015342,
0.006832, 0.032988, 0.031879, 0.026412, 0.025159, 0.023195, 0.022104, 0.021285,
0.026480, 0.025837]
d = {
"account_num" : account_num,
"total_due" : total_due,
"net_returned" : net_returned,
"total_fees" : total_fees,
"year" : year,
"flipped" : flipped,
"proba" : proba
}
df = pd.DataFrame(data=d)
I want to sample the DataFrame by the "year" column according to a specific ratio for each year, which I have successfully done with the following code:
df_fractions = pd.DataFrame({"2018": [0.5], "2019": [0.5], "2020": [1.0], "2021": [0.8],
"2022": [0.7]})
df.year = df.year.astype(str)
grouped = df.groupby("year")
df_training = grouped.apply(lambda x: x.sample(frac=df_fractions[x.name]))
df_training = df_training.reset_index(drop=True)
However, when I invoke sample(), I also want to ensure the samples from each year are stratified according to the number of flipped accounts in that year. So, I want to stratify the per-year samples based on the flipped column. With this small, toy DataFrame, the per-year ratios of flipped accounts after sampling come out pretty close to the original proportions, but this is not true for a really large DataFrame with close to 300K accounts.
So, that's really my question to all you Python experts: is there a better way to solve this problem than the solution I came up with?
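One way to get that, sketched here on the toy data (whether the per-stratum rounding behaviour suits your real data is an assumption), is to sample each (year, flipped) stratum separately at its year's fraction, so the flipped ratio within every year is preserved by construction:
fractions = {"2018": 0.5, "2019": 0.5, "2020": 1.0, "2021": 0.8, "2022": 0.7}

df.year = df.year.astype(str)
# Each (year, flipped) group is sampled at that year's fraction, so the
# flipped proportion inside every year survives the sampling step.
df_training = (
    df.groupby(["year", "flipped"], group_keys=False)
      .apply(lambda g: g.sample(frac=fractions[g.name[0]]))
      .reset_index(drop=True)
)
Because sample(frac=...) rounds within each stratum, a year with a single flipped account can still round down to zero rows; with close to 300K accounts that edge case should matter far less.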

Calculate median of column with multiple values per cell (ranges)

I have this code:
df = pd.DataFrame({
    'R': ['1', '2', '3', '4', '5', '6', '7'],
    'a': [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 3.0],
    'nv1': [[-1.0], [-1.0], [], [], [-2.0],
            [-2.0, -1.0, -3.0, -1.0], [-2.0, -1.0, -2.0, -1.0]],
})
yielding the following dataframe:
R a nv1
0 1 1.0 [-1.0]
1 2 1.0 [-1.0]
2 3 2.0 []
3 4 3.0 []
4 5 3.0 [-2.0]
5 6 2.0 [-2.0, -1.0, -3.0, -1.0]
6 7 3.0 [-2.0, -1.0, -2.0, -1.0]
I need to calculate the median of each list in df['nv1'], i.e.
df['med'] = median of df['nv1']
Desired output as follows
R a nv1 med
1 1.0 [-1.0] -1
2 1.0 [-1.0] -1
3 2.0 []
4 3.0 []
5 3.0 [-2.0] -2
6 2.0 [-2.0, -1.0, -3.0, -1.0] -1.5
7 3.0 [-2.0, -1.0, -2.0, -1.0] -1.5
I tried both lines of code below independently, but I ran into errors:
df['nv1'] = pd.to_numeric(df['nv1'],errors = 'coerce')
df['med'] = df['nv1'].median()
Use np.median, applied to each list (empty lists yield NaN):
df['med'] = df['nv1'].apply(np.median)
Output:
>>> df
R a nv1 med
0 1 1.0 [-1.0] -1.0
1 2 1.0 [-1.0] -1.0
2 3 2.0 [] NaN
3 4 3.0 [] NaN
4 5 3.0 [-2.0] -2.0
5 6 2.0 [-2.0, -1.0, -3.0, -1.0] -1.5
6 7 3.0 [-2.0, -1.0, -2.0, -1.0] -1.5
Or:
df['med'] = df['nv1'].explode().dropna().groupby(level=0).median()
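In the second approach, explode repeats each original row label once per list element, dropna discards the placeholder rows produced by the empty lists, and groupby(level=0) collapses everything back to one median per original row label; rows 2 and 3 end up with no group at all, so the assignment simply leaves their med as NaN.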

How to create a pandas dataframe from csv where one column contains nested dictionary?

I have a CSV file, and in one column there is a nested dictionary with the values of a classification report, in a format like this one:
{'A': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 60},
'B': {'precision': 0.42, 'recall': 0.09, 'f1-score': 0.14, 'support': 150},
'micro avg': {'precision': 0.31, 'recall': 0.31, 'f1-score': 0.31, 'support': 1710},
'macro avg': {'precision': 0.13, 'recall': 0.08, 'f1-score': 0.071, 'support': 1710},
'weighted avg': {'precision': 0.29, 'recall': 0.31, 'f1-score': 0.26, 'support': 1710}}
I would like each outer key, joined with its first-level keys, to become a column in a DataFrame. So, is it possible to get the following result?
A_precision A_recall ... weighted_avg_precision weighted_avg_recall weighted_avg_f1-score weighted_avg_support
0.0 0.0 ... 0.29 0.31 0.26 1710
Thank you
You can use pd.json_normalize on that dictionary:
dct = {
"A": {"precision": 0.0, "recall": 0.0, "f1-score": 0.0, "support": 60},
"B": {"precision": 0.42, "recall": 0.09, "f1-score": 0.14, "support": 150},
"micro avg": {
"precision": 0.31,
"recall": 0.31,
"f1-score": 0.31,
"support": 1710,
},
"macro avg": {
"precision": 0.13,
"recall": 0.08,
"f1-score": 0.071,
"support": 1710,
},
"weighted avg": {
"precision": 0.29,
"recall": 0.31,
"f1-score": 0.26,
"support": 1710,
},
}
df = pd.json_normalize(dct, sep="_")
print(df)
Prints:
A_precision A_recall A_f1-score A_support B_precision B_recall B_f1-score B_support micro avg_precision micro avg_recall micro avg_f1-score micro avg_support macro avg_precision macro avg_recall macro avg_f1-score macro avg_support weighted avg_precision weighted avg_recall weighted avg_f1-score weighted avg_support
0 0.0 0.0 0.0 60 0.42 0.09 0.14 150 0.31 0.31 0.31 1710 0.13 0.08 0.071 1710 0.29 0.31 0.26 1710
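To do this for a whole CSV column rather than a single dictionary (a sketch: the file name results.csv and column name report are hypothetical, and the dicts are assumed to be stored as their Python string representation), ast.literal_eval can rebuild each dict before normalizing:
import ast
import pandas as pd

raw = pd.read_csv("results.csv")
# Parse each stored string back into a dict, flatten every dict into one
# row of "outerkey_innerkey" columns, then reattach the other columns.
flat = pd.json_normalize(raw["report"].apply(ast.literal_eval).tolist(), sep="_")
out = raw.drop(columns=["report"]).join(flat)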

Assign values to multicolumn dataframe using another dataframe

I am trying to assign values to a DataFrame with MultiIndex columns from another, ordinary DataFrame. The two DataFrames share the same index; however, when attempting to assign the values of all columns of the ordinary DataFrame to a slice of the MultiIndex DataFrame, NaN values appear.
MWE
import pandas as pd
df = pd.DataFrame.from_dict(
{
("old", "mean"): {"high": 0.0, "med": 0.0, "low": 0.0},
("old", "std"): {"high": 0.0, "med": 0.0, "low": 0.0},
("new", "mean"): {"high": 0.0, "med": 0.0, "low": 0.0},
("new", "std"): {"high": 0.0, "med": 0.0, "low": 0.0},
}
)
temp = pd.DataFrame.from_dict(
{
"old": {
"high": 2.6798302797288174,
"med": 10.546654056177656,
"low": 16.46382603916123,
},
"new": {
"high": 15.91881231611413,
"med": 16.671967271277495,
"low": 26.17872356316402,
},
}
)
df.loc[:, (slice(None), "mean")] = temp
print(df)
Output:
old new
mean std mean std
high NaN 0.0 NaN 0.0
med NaN 0.0 NaN 0.0
low NaN 0.0 NaN 0.0
Is this expected behaviour, or am I doing something horrible that I am not supposed to?
Create a MultiIndex in temp so the data aligns, and then you can set the new values with DataFrame.update:
temp.columns = pd.MultiIndex.from_product([temp.columns, ['mean']])
print (temp)
old new
mean mean
high 2.679830 15.918812
med 10.546654 16.671967
low 16.463826 26.178724
df.update(temp)
print(df)
old new
mean std mean std
high 2.679830 0.0 15.918812 0.0
med 10.546654 0.0 16.671967 0.0
low 16.463826 0.0 26.178724 0.0
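The NaNs in the question come from exactly that alignment step: pandas tries to match temp's flat columns old and new against the target's MultiIndex columns, finds no match, and fills with NaN. If you would rather keep the original .loc slice, a small sketch of an alternative is to hand over a plain array so no alignment is attempted (this relies on temp's columns being in the same left-to-right order as the mean slots they fill):
# Positional assignment: to_numpy() strips labels, so nothing is aligned.
df.loc[:, (slice(None), "mean")] = temp.to_numpy()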

Extract histogram values from TensorBoard and plot them with matplotlib

I want to plot histograms from TensorBoard on my own, to publish them. I wrote this extraction function to get the histogram values:
from tensorboard.backend.event_processing import event_accumulator

def _load_hist_from_tfboard(path):
    event_acc = event_accumulator.EventAccumulator(path)
    event_acc.Reload()
    vec_dict = {}
    for tag in sorted(event_acc.Tags()["distributions"]):
        hist_dict = {}
        for hist_event in event_acc.Histograms(tag):
            hist_dict.update({hist_event.step: (hist_event.histogram_value.bucket_limit,
                                                hist_event.histogram_value.bucket)})
        vec_dict[tag] = hist_dict
    return vec_dict
The function collects all histograms of an event file. The output of one bucket_limit and bucket pair is as follows:
[0.0, 1e-12, 0.0005418219168477906, 0.0005960041085325697, 0.0020575678275470133, 0.0022633246103017147, 0.004009617609950718, 0.00441057937094579, 0.005336801038844407, 0.005870481142728848, 0.007813610400972098, 0.008594971441069308, 0.022293142370048362, 0.0245224566070532, 0.026974702267758523, 0.035903328718386605, 0.03949366159022527, 0.043443027749247805, 0.04778733052417259, 0.052566063576589855, 0.057822669934248845, 0.06360493692767373, 0.06996543062044111, 0.07696197368248522, 0.24153964213356663, 0.2656936063469233, 0.29226296698161564, 0.3214892636797772, 0.35363819004775493, 0.38900200905253046, 0.42790220995778355, 0.47069243095356195, 0.5177616740489182, 0.56953784145381, 0.6264916255991911, 0.6891407881591103, 0.7580548669750213, 0.8338603536725235, 0.917246389039776, 1.0089710279437536]
[0.0, 3999936.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 4.0, 0.0, 8.0, 8.0, 0.0, 4.0, 4.0, 0.0, 8.0, 4.0, 0.0, 9.0, 45.0, 50.0, 48.0, 85.0, 100.0, 109.0, 114.0, 15908.0, 74.0, 15856.0, 11908.0, 3973.0, 42.0, 7951679.0]
Can someone help me interpret these numbers as a histogram?
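In TensorBoard's HistogramProto, bucket_limit[i] is the upper edge of bucket i and bucket[i] is the count of values that fell into it, with the first bucket unbounded below. Under that reading, a minimal matplotlib sketch (the tiny width given to the first bucket is purely for display):
import numpy as np
import matplotlib.pyplot as plt

def plot_tb_histogram(bucket_limit, bucket):
    limits = np.asarray(bucket_limit, dtype=float)
    counts = np.asarray(bucket, dtype=float)
    # Each bucket starts where the previous one ends; fake a small left
    # edge for the first, unbounded bucket so it can be drawn at all.
    lefts = np.concatenate(([limits[0] - 1e-12], limits[:-1]))
    plt.bar(lefts, counts, width=limits - lefts, align="edge", edgecolor="black")
    plt.yscale("log")  # the counts here span seven orders of magnitude
    plt.xlabel("value")
    plt.ylabel("count")
    plt.show()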