Standardizing company names - pandas

I have a list of company names, but they contain misspellings and variations. How can I best fix this so that every company has a consistent name (for later groupby, sort_values, etc.)?
pd.DataFrame({'Company': ['Disney','Dinsey', 'Walt Disney','General Motors','General Motor','GM','GE','General Electric','J.P. Morgan','JP Morgan']})

One good hint is the FuzzyWuzzy library: "Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package."
Example:
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
str2Match = "apple inc"
strOptions = ["Apple Inc.","apple park","apple incorporated"]
Ratios = process.extract(str2Match,strOptions)
print(Ratios)
# You can also select the string with the highest matching percentage
highest = process.extractOne(str2Match,strOptions)
print(highest)
output:
[('Apple Inc.', 100), ('apple incorporated', 90), ('apple park', 67)]
('Apple Inc.', 100)
Now you just have to create a list with the "right" names, run each variation against it, pick the match with the best ratio, and replace the value in your dataset.
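For example, a minimal sketch of that last step (the canonical list, the standardize helper and the min_score threshold are illustrative assumptions, and df is the frame from the question):

import pandas as pd
from fuzzywuzzy import process

df = pd.DataFrame({'Company': ['Disney', 'Dinsey', 'Walt Disney', 'General Motors',
                               'General Motor', 'GM', 'GE', 'General Electric',
                               'J.P. Morgan', 'JP Morgan']})

# Canonical names to map every variation onto (an assumed "right names" list)
canonical = ['Walt Disney', 'General Motors', 'General Electric', 'JP Morgan']

def standardize(name, choices, min_score=80):
    # Pick the canonical name with the highest ratio; keep the original if
    # nothing scores above the (tunable) threshold.
    best, score = process.extractOne(name, choices)
    return best if score >= min_score else name

df['Company_std'] = df['Company'].apply(lambda x: standardize(x, canonical))
print(df)

Note that pure abbreviations such as 'GM' or 'GE' score poorly against their full names, so they may still need a small manual mapping on top of the fuzzy step.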

Related

How can I add two columns of data to the 'hue' section on Geoplot-Geopandas for a cartogram map?

I'm having trouble selecting 'male' and 'female' for hue when creating a cartogram with geoplot (using geopandas).
So far I have only managed to select the total population, but I would like to compare the male and female state populations.
I understand that the 'hue' assignment is probably what I need to modify, but so far I can only display the total population ("Tot_P_P") rather than both "Tot_P_M" and "Tot_P_F".
I've spent the past two weeks looking through tutorials and websites, but there isn't much information on cartograms. This is the code that works for the total population:
import geoplot as gplt

ax = gplt.polyplot(gda2020,
                   projection=gplt.crs.AlbersEqualArea(),
                   figsize=(25, 15))
gplt.cartogram(gda2020,
               scale="Tot_P_P", limits=(0.2, 1), scale_func=None,
               hue="Tot_P_P",
               cmap='inferno',
               norm=None,
               scheme=None,
               legend=True,
               legend_values=None,
               legend_labels=None,
               legend_kwargs=None,
               legend_var='hue',
               extent=None,
               ax=ax)
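One hedged starting point, since hue takes a single column, is to derive a ratio column (for example the male share of the total population) and colour the cartogram by that. This is only a sketch: the male_share column and the coolwarm colormap are assumptions for illustration, not part of the original data or a verified answer.

# Derive a single column that encodes the male/female balance per state
gda2020['male_share'] = gda2020['Tot_P_M'] / gda2020['Tot_P_P']

ax = gplt.polyplot(gda2020,
                   projection=gplt.crs.AlbersEqualArea(),
                   figsize=(25, 15))
gplt.cartogram(gda2020,
               scale="Tot_P_P", limits=(0.2, 1),
               hue="male_share",   # colour now compares male vs. female rather than totals
               cmap='coolwarm',
               legend=True,
               legend_var='hue',
               ax=ax)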

T-test on the means - pandas

I'm working with the MovieLens dataset and I would like to do a t-test on the mean rating values of male and female users.
import pandas as pd
from scipy.stats import ttest_ind
users_table_names= ['user_id','age','gender','occupation','zip_code']
users= pd.read_csv('ml-100k/u.user', sep='|', names= users_table_names)
ratings_table_names= ['user_id', 'item_id','rating','timestamp']
ratings= pd.read_csv('ml-100k/u.data', sep='\t', names=ratings_table_names)
rating_df= pd.merge(users, ratings)
males = rating_df[rating_df['gender']=='M']
females = rating_df[rating_df['gender']=='F']
ttest_ind(males.rating, females.rating)
And I get the following result:
Ttest_indResult(statistic=-0.27246234775012407, pvalue=0.7852671011802962)
Is this the correct way to do this operation? The results seem a bit odd.
Thank you in advance!
With your code you are performing a two-sided t-test under the assumption that the populations have identical variances, since you haven't specified the parameter equal_var and it defaults to True in SciPy's ttest_ind().
So you can state your statistical test as:
Null hypothesis (H0): there is no difference between the values recorded for males and females; in other words, the means are equal (µMale == µFemale).
Alternative hypothesis (H1): there is a difference between the values recorded for males and females; in other words, the means are not equal (covering both µMale > µFemale and µMale < µFemale, or simply µMale != µFemale).
The significance level is an arbitrary choice for your test, such as 0.05. If you had obtained a p-value smaller than your significance level, you could reject the null hypothesis (H0) in favour of the alternative hypothesis (H1).
In your results the p-value is ~0.78, so you cannot reject H0: there is no evidence that the means differ.
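As a tiny sketch of that decision rule (alpha = 0.05 is just the conventional choice, not something dictated by the data):

from scipy.stats import ttest_ind

alpha = 0.05  # chosen significance level
stat, p_value = ttest_ind(males.rating, females.rating)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0, the means differ")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")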
Considering the standard deviations of the samples shown below, you could alternatively run the test with equal_var=False (Welch's t-test):
>> males.rating.std()
1.1095557786889139
>> females.rating.std()
1.1709514829100405
>> ttest_ind(males.rating, females.rating, equal_var = False)
Ttest_indResult(statistic=-0.2654398046364026, pvalue=0.7906719538136853)
which also fails to reject the null hypothesis (H0).
If you use the statsmodels ttest_ind(), you also get the degrees of freedom used in the t-test:
>> import statsmodels.api as sm
>> sm.stats.ttest_ind(males.rating, females.rating, alternative='two-sided', usevar='unequal')
(-0.2654398046364028, 0.790671953813685, 42815.86745494558)
What exactly did you find odd about your results?

How to calculate tfidf score from a column of dataframe and extract words with a minimum score threshold

I have taken a column of a dataset which contains a text description for each row. I am trying to find words with a tf-idf score greater than some value n, but the code gives a matrix of scores. How do I sort and filter the scores and see the corresponding words?
from sklearn.feature_extraction.text import TfidfVectorizer

tempDataFrame = wineData.loc[wineData.variety == 'Shiraz', 'description'].reset_index()
tempDataFrame['description'] = tempDataFrame['description'].apply(lambda x: str.lower(x))

tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
score = tfidf.fit_transform(tempDataFrame['description'])
Sample Data:
description
This tremendous 100% varietal wine hails from Oakville and was aged over
three years in oak. Juicy red-cherry fruit and a compelling hint of caramel
greet the palate, framed by elegant, fine tannins and a subtle minty tone in
the background. Balanced and rewarding from start to finish, it has years
ahead of it to develop further nuance. Enjoy 2022–2030.
In the absence of a full data frame column of wine descriptions, the sample data you provided is split into three sentences in order to create a data frame with one column named 'Description' and three rows. The column is then passed to the tf-idf vectorizer, and a new data frame containing the features and their scores is created. The results are subsequently filtered using pandas.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
doc = ['This tremendous 100% varietal wine hails from Oakville and was aged over \
three years in oak.', 'Juicy red-cherry fruit and a compelling hint of caramel \
greet the palate, framed by elegant, fine tannins and a subtle minty tone in \
the background.', 'Balanced and rewarding from start to finish, it has years \
ahead of it to develop further nuance. Enjoy 2022–2030.']
df_1 = pd.DataFrame({'Description': doc})
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')
score = tfidf.fit_transform(df_1['Description'])
# New data frame containing the tfidf features and their scores
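# (note: on scikit-learn >= 1.2 get_feature_names() has been removed; use get_feature_names_out() instead)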
df = pd.DataFrame(score.toarray(), columns=tfidf.get_feature_names())
# Filter the tokens with tfidf score greater than 0.3
tokens_above_threshold = df.max()[df.max() > 0.3].sort_values(ascending=False)
tokens_above_threshold
Out[29]:
wine 0.341426
oak 0.341426
aged 0.341426
varietal 0.341426
hails 0.341426
100 0.341426
oakville 0.341426
tremendous 0.341426
nuance 0.307461
rewarding 0.307461
start 0.307461
enjoy 0.307461
develop 0.307461
balanced 0.307461
ahead 0.307461
2030 0.307461
2022 0.307461
finish 0.307461
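If you only need the words themselves rather than their scores, the index of that Series can be taken directly (a small follow-on to the code above):

# List of tokens whose best tf-idf score exceeds the threshold
words_above_threshold = tokens_above_threshold.index.tolist()
print(words_above_threshold)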

Comparing compressed distributions per cohort

How can I easily compare the distributions of multiple cohorts?
Usually, https://seaborn.pydata.org/generated/seaborn.distplot.html would be a great tool to visually compare distributions. However, due to the size of my dataset, I needed to compress it and only keep the counts.
It was created as:
SELECT age, gender, compress_distributionUDF(collect_list(struct(target_y_n, count, distribution_value))) GROUP BY age, gender
where compress_distributionUDF simply takes a list of tuples and returns the counts per group.
This leaves me with a list of
Row(distribution_value=60.0, count=314251, target_y_n=0)
nested inside a pandas.Series, one per cohort.
Basically, it is similar to:
pd.DataFrame({'foo':[1,2], 'bar':['first', 'second'], 'baz':[{'target_y_n': 0, 'value': 0.5, 'count':1000},{'target_y_n': 1, 'value': 1, 'count':10000}]})
and I wonder how to compare the distributions:
within a cohort, 0 vs. 1 of target_y_n
over multiple cohorts
in a way that is still visually understandable and not just a mess.
Edit:
For a single cohort, "Plotting pre aggregated data in python" could be the answer, but how can multiple cohorts be compared (not just in a loop), as this leads to too many plots to compare?
I am still quite confused, but we can start from this and see where it goes. From your example, I am focusing on baz, as it is not clear to me what foo and bar are (I assume cohorts).
So let's focus on baz, flatten it into a plain data frame (sketched below), and plot the different distributions according to target_y_n.
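A rough sketch of that flattening step (the column names follow your example; df is the flat frame used in the plots below, and expanding the dicts via tolist()/json_normalize is just one possible approach):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

raw = pd.DataFrame({'foo': [1, 2],
                    'bar': ['first', 'second'],
                    'baz': [{'target_y_n': 0, 'value': 0.5, 'count': 1000},
                            {'target_y_n': 1, 'value': 1, 'count': 10000}]})

# Expand the nested dicts in 'baz' into their own columns, keeping the cohort keys
df = pd.concat([raw[['foo', 'bar']], pd.json_normalize(raw['baz'].tolist())], axis=1)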
sns.catplot(data=df, x='value', y='count', kind='bar', hue='target_y_n', dodge=False, ci=None)
sns.catplot(data=df, x='value', y='count', kind='box', hue='target_y_n', dodge=False)
plt.bar(df[df['target_y_n'] == 0]['value'], df[df['target_y_n'] == 0]['count'], width=1)
plt.bar(df[df['target_y_n'] == 1]['value'], df[df['target_y_n'] == 1]['count'], width=1)
plt.legend(['Target=0', 'Target=1'])
sns.barplot(data=df, x='value', y='count', hue='target_y_n', dodge=False, ci=None)
Finally, have a look at the FacetGrid class to extend your comparison (see the seaborn documentation).
g=sns.FacetGrid(df,col='target_y_n',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
In your case you would have something like:
g=sns.FacetGrid(df,col='target_y_n',row='cohort',hue = 'target_y_n')
g=g.map(sns.barplot,'value','count',ci=None)
And a qqplot option:
from scipy import stats
def qqplot(x, y, **kwargs):
    _, xr = stats.probplot(x, fit=False)
    _, yr = stats.probplot(y, fit=False)
    plt.scatter(xr, yr, **kwargs)
g=sns.FacetGrid(df,col='cohort',hue = 'target_y_n')
g=g.map(qqplot,'value','count')

Data Selection - Finding relations between dataframe attributes

Let's say I have a dataframe with 80 columns and 1 target column; for example, a bank account table with 80 attributes for each record (account) and 1 target column which indicates whether the client stays or leaves. What steps and algorithms should I follow to select the most effective columns, i.e. those with the highest impact on the target column?
There are a number of steps you can take; I'll give some examples to get you started (a short Python sketch of the first and third options follows the list):
A correlation coefficient, such as Pearson's Rho (for parametric data) or Spearman's R (for ordinal data).
Feature importances. I like XGBoost for this, as it includes the handy xgb.ggplot.importance / xgb.plot_importance methods.
One of the many feature selection options, such as Python's sklearn.feature_selection methods.
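A minimal Python sketch of those two routes (df, the 'target' column name, its 0/1 encoding and the choice of k=10 are assumptions for illustration; both approaches expect numeric feature columns):

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# df: your dataframe with the 80 feature columns plus a 'target' column (assumed names)
X = df.drop(columns=['target'])
y = df['target']  # assumed to be encoded as 0/1 (stays/leaves)

# 1) Correlation of every feature with the target (Pearson by default)
correlations = X.corrwith(y).abs().sort_values(ascending=False)
print(correlations.head(10))

# 3) Univariate feature selection: keep the k features with the best ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=10)
selector.fit(X, y)
print(X.columns[selector.get_support()])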
This is one way to do it using the Pearson correlation coefficient in RStudio. I used it once when exploring the red_wine dataset: my target variable (column) was quality, and I wanted to know the effect of the rest of the columns on it.
The figure produced by the code below shows the correlations: blue represents positive relations, red represents negative relations, and the closer the value is to 1 or -1, the darker the color.
library(dplyr)
library(corrplot)

c <- cor(
  red_wine %>%
    # first we remove unwanted columns
    dplyr::select(-X) %>%
    dplyr::select(-rating) %>%
    # now we translate quality to a number
    mutate(quality = as.numeric(quality))
)

corrplot(c, method = "color", type = "lower", addCoef.col = "gray",
         title = "Red Wine Variables Correlations", mar = c(0, 0, 1, 0),
         tl.cex = 0.7, tl.col = "black", number.cex = 0.9)