Why is there a train inside train in lines 2 & 3? - pandas

I am a beginner in Python. This is an excerpt of code from https://github.com/minsuk-heo/kaggle-titanic/blob/master/titanic-solution.ipynb (line no. 12). I was trying to understand a bar chart with it:
def bar_chart(feature):
    survived = train[train['Survived']==1][feature].value_counts()
    dead = train[train['Survived']==0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, figsize=(10, 5))

@Pranjal, try to learn a Python module first (here, pandas) before jumping into a challenge (say, Kaggle's Titanic).
To answer your question, consider the lines you asked about:
Line 2: survived = train[train['Survived']==1][feature].value_counts()
Line 3: dead = train[train['Survived']==0][feature].value_counts()
The expression train['Survived']==1 results in a boolean (True/False) pandas Series: it is True wherever the Survived column equals 1 and False otherwise. Once that Series is generated it is fed to the outer train, and only the rows mapped to True are kept while the others are dropped. Next, you select only the feature column from the resulting DataFrame, and value_counts() returns an object containing the counts of its unique values. Line 3 works the same way.
Tip: none of this permanently changes the train DataFrame.
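To see the masking step by step, here is a minimal sketch with a toy DataFrame (the column names follow the Titanic data, but the values are made up):
import pandas as pd

# toy stand-in for the Titanic train DataFrame (values are made up)
train = pd.DataFrame({'Survived': [1, 0, 1, 0],
                      'Sex': ['female', 'male', 'female', 'male']})

mask = train['Survived'] == 1              # boolean Series: True, False, True, False
survived_rows = train[mask]                # keeps only the rows where the mask is True
print(survived_rows['Sex'].value_counts()) # counts of unique values, here: female 2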

Related

Obtaining the index of a word between two columns in pandas

I am checking which words the spaCy Spanish lemmatizer works on, using the .has_vector method. One column of the dataframe holds the output of the function indicating which words can be lemmatized, and the other holds the corresponding phrase.
I would like to know how I can extract all the words with a False output so I can correct them and then lemmatize.
So I created the function:
def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])
And applied it to the sentence column of the DataFrame:
df["Vectors"] = df.reviews.apply(lemmatizer)
And put it in another DataFrame:
df2= pd.DataFrame(df[['Vectors', 'reviews']])
The output is
index Vectors reviews
1 True True True False 'La pelicula es aburridora'
Two ways to do this:
import pandas
import spacy
nlp = spacy.load('en_core_web_lg')
df = pandas.DataFrame({'reviews': ["aaabbbcccc some example words xxxxyyyz"]})
If you want to use has_vector:
def get_oov1(text):
    return [word.text for word in nlp(text) if not word.has_vector]
Alternatively you can use the is_oov attribute:
def get_oov2(text):
    return [word.text for word in nlp(text) if word.is_oov]
Then as you already did:
df["oov_words1"] = df.reviews.apply(get_oov1)
df["oov_words2"] = df.reviews.apply(get_oov2)
Which will return:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [aaabbbcccc, xxxxyyyz] [aaabbbcccc, xxxxyyyz]
Note:
When working with either of these attributes it is important to know that the result is model dependent: smaller models usually have no data backing them and will simply return a default value!
That means when you run the exact same code but e.g. with en_core_web_sm you get this:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [] [aaabbbcccc, some, example, words, xxxxyyyz]
This is because the small model does not provide real word vectors: the has_vector check ends up treating every token as known, while is_oov defaults to True and is never set by the model, so it flags every token as unknown. In other words, with en_core_web_sm the has_vector approach wrongly reports all words as known and the is_oov approach wrongly reports all words as unknown.
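Since the title also asks for the index of each word, a small variation (just a sketch, not part of the original answer) could return the token positions alongside the text using spaCy's Token.i attribute:
def get_oov_with_index(text):
    # returns (position, word) pairs for tokens the model treats as out-of-vocabulary
    return [(word.i, word.text) for word in nlp(text) if word.is_oov]

df["oov_indexed"] = df.reviews.apply(get_oov_with_index)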

DataFrame to DataFrameRow conversion (Julia)

I'm using Pingouin.jl to test normality.
In their docs, we have
dataset = Pingouin.read_dataset("mediation")
Pingouin.normality(dataset, method="jarque_bera")
Which should return a DataFrame with normality true or false for each name in the dataset.
Currently, this broadcasting is deprecated, and I'm unable to concatenate the results into one DataFrame; running it on a single column works and outputs a DataFrame.
So, what I have so far.
function var_norm(df)
    norm = DataFrame([])
    for i in 1:1:length(names(df))
        push!(norm, Pingouin.normality(df[!, names(df)[i]], method="jarque_bera"))
    end
    return norm
end
The error I get:
julia> push!(norm, Pingouin.normality(df[!,names(df)[1]], method="jarque_bera"))
ERROR: ArgumentError: `push!` does not allow passing collections of type DataFrame to be pushed into a DataFrame. Only `Tuple`, `AbstractArray`, `AbstractDict`, `DataFrameRow` and `NamedTuple` are allowed.
Stacktrace:
 [1] push!(df::DataFrame, row::DataFrame; promote::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1603
 [2] push!(df::DataFrame, row::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1601
 [3] top-level scope
   @ REPL[163]:1
EDIT: the push! call was not written properly in the first version of this post, but the error persists after the change. How can I reformat the DataFrame output from Pingouin into a DataFrameRow?
As Pingouin.normality returns a DataFrame, you will have to iterate over its rows and push them one by one:
df = Pingouin.normality(…)
for row in eachrow(df)
    push!(norm, row)
end
If you are sure Pingouin.normality returns a DataFrame with exactly one row, you can simply write
push!(norm, only(Pingouin.normality(…)))

How to handle unknown number of values for a categorical feature?

I have a pandas dataframe that looks like this
Text | Label
Some text | 0
hellow bye what | 1
...
Each row is a data point. Label is 0/1 binary. The only feature is Text which contains a set of words. I want to use the presence or absence of each word as features. For example, the features could be contains_some contains_what contains_hello contains_bye, etc. This is typical one hot encoding.
However, I don't want to manually create so many features, one for every single word in the vocabulary (the vocabulary is not huge, so I am not worried about the feature set exploding). I just want to supply the words as a single column to TensorFlow and have it create a binary feature for each word in the vocabulary.
Does tensorflow/keras have an API to do this?
You can use sklearn for this; try the following:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(old_df['Text'])
new_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
new_df['Label'] = old_df['Label']
and this should give you:
bye hellow some text what Label
0 0 1 1 0 0
1 1 0 0 1 1
CountVectorizer converts a collection of text documents to a matrix of token counts.
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix, and if binary=True then all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
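As a quick illustration of what binary=True changes (a small sketch using a made-up document):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["bye bye what"]                                  # made-up example text
counts = CountVectorizer().fit_transform(docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(counts)   # [[2 1]] -> raw occurrence counts for 'bye' and 'what'
print(binary)   # [[1 1]] -> presence/absence only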
What you're looking for is a (binary) bag of words, which you can get from scikit-learn using their CountVectorizer.
You can do something like:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range=(1, 1), binary=True)
X_train = bow.fit_transform(df_train['text'].values)
This will create an array of binary values indicating the presence of a word in each text. Use binary=True to output a 1 or 0 depending on whether the word is present. Without this flag you will get counts of occurrences per word; either method works fine.
In order to inspect the counts you could use the below:
# Create sample dataframe of BoW outputs
count_vect_df = pd.DataFrame(X_train[:1].todense(),columns=bow.get_feature_names())
# Re-order counts in descending order. Keep top 10 counts for demo purposes
count_vect_df = count_vect_df[count_vect_df.iloc[-1,:].sort_values(ascending=False).index[:10]]
# print combination of original train dataframe with BoW counts
pd.concat([df_train['text'][:1].reset_index(drop=True), count_vect_df], axis=1)
Update
If your features include categorical data you could try using to_categorical from tf.keras. See the docs for more information.
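For example, a small sketch of to_categorical on an integer-encoded label array (the values here are made up):
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 2, 1, 2])              # made-up integer-encoded categories
one_hot = to_categorical(labels, num_classes=3)
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]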

pandas apply for performance

I have a pandas apply function that runs inference over a 10k-row CSV of strings:
account messages
0 th_account Forgot to tell you Evan went to sleep a little...
1 th_account Hey I heard your buying a house I m getting ri...
2 th_account They re releasing a 16 MacBook
3 th_account 5 cups of coffee today I may break the record
4 th_account Apple Store Items in order W544414717 were del...
The function takes about 17 seconds to run.
I'm working on a text classifier and was wondering if there is a quicker way to write it.
def _predict(messages):
    results = []
    for message in messages:
        message = vectorizer.transform([message])
        message = message.toarray()
        results.append(model.predict(message))
    return results
df["pred"] = _predict(df.messages.values)
the vectorizer is a TfidfVectorizer and model is a GaussianNB model from sklearn.
I need to loop through every message in the CSV and perform a prediction to be shown in a new column.
You can try pandas' built-in apply function, though it still processes one row at a time, so it will still be slow.
def _predict(row):
    """row is a single row of the dataframe;
    each row produces one prediction."""
    message = vectorizer.transform([row["messages"]])
    message = message.toarray()
    return model.predict(message)[0]  # take the single prediction for this row
df["pred"] = df.apply(_predict, axis=1)
You can run the following code to evaluate the time.
df.head().apply(_predict, axis=1)
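If apply is still too slow, a different option (not part of the answer above, just a sketch) is to drop the per-row loop entirely and let sklearn work on the whole column at once, since both TfidfVectorizer.transform and GaussianNB.predict accept batches:
# vectorize and predict the whole column in one call instead of row by row
X = vectorizer.transform(df["messages"])   # sparse matrix, one row per message
df["pred"] = model.predict(X.toarray())    # GaussianNB needs a dense array
Note that the dense conversion can be memory-heavy if the vocabulary is large.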

Using Dask Delayed on Small/Partitioned Dataframes

I am working with time series data that is formatted as each row is a single instance of a ID/time/data. This means that the rows don't correspond 1 to 1 for each ID. Each ID has many rows across time.
I am trying to use dask delayed to have a function run on an entire ID sequence (it makes sense that the operation should be able to run on each individual ID at the same time since they don't affect each other). To do this I am first looping through each of the ID tags, pulling/locating all the data from that ID (with .loc in pandas, so it is a separate "mini" df), then delaying the function call on the mini df, adding a column with the delayed values and adding it to a list of all mini dfs. At the end of the for loop I want to call dask.compute() on all the mini-dfs at once but for some reason the mini df's values are still delayed. Below I will post some pseudocode about what I just tried to explain.
I have a feeling that this may not be the best way to go about this but it's what made sense at the time and I can't understand whats wrong so any help would be very much appreciated.
Here is what I am trying to do:
list_of_mini_dfs = []
for id in big_df:
    curr_df = big_df.loc[big_df['id'] == id]
    curr_df['new value 1'] = dask.delayed(myfunc)(args1)
    curr_df['new value 2'] = dask.delayed(myfunc)(args2)  # same func as previous line
    list_of_mini_dfs.append(curr_df)

list_of_mini_dfs = dask.delayed(list_of_mini_dfs).compute()
Concat all mini dfs into new big df.
As you can see by the code I have to reach into my big/overall dataframe to pull out each ID's sequence of data since it is interspersed throughout the rows. I want to be able to call a delayed function on that single ID's data and then return the values from the function call into the big/overall dataframe.
Currently this method is not working: when I concat all the mini dataframes back together, the two values I delayed are still delayed objects. This leads me to think it is due to the way I am delaying a function inside a dataframe and then trying to compute the list of dataframes. I just can't see how to fix it.
Hopefully this was relatively clear and thank you for the help.
IIUC you are trying to do a sort of transform using dask.
import pandas as pd
import dask.dataframe as dd
import numpy as np

# generate big_df
dates = pd.date_range(start='2019-01-01',
                      end='2020-01-01')
l = len(dates)
out = []
for i in range(1000):
    df = pd.DataFrame({"ID": [i]*l,
                       "date": dates,
                       "data0": np.random.randn(l),
                       "data1": np.random.randn(l)})
    out.append(df)
big_df = pd.concat(out, ignore_index=True)\
           .sample(frac=1)\
           .reset_index(drop=True)
Now you want to apply your function fun on columns data0 and data1
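The snippets below assume fun collapses each ID's rows to a single row; for concreteness, a hypothetical placeholder (not from the original answer) could be the per-group mean:
def fun(group):
    # placeholder: reduce each ID's data0/data1 columns to one value per column
    return group.mean()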
Pandas
out = big_df.groupby("ID")[["data0", "data1"]]\
            .apply(fun)\
            .reset_index()
df_pd = pd.merge(big_df, out, how="left", on="ID")
Dask
df = dd.from_pandas(big_df, npartitions=4)
out = df.groupby("ID")[["data0", "data1"]]\
        .apply(fun, meta={'data0': 'f8',
                          'data1': 'f8'})\
        .rename(columns={'data0': 'new_values0',
                         'data1': 'new_values1'})\
        .compute()  # here you need to compute, otherwise you'll get NaNs
df_dask = dd.merge(df, out,
                   how="left",
                   left_on=["ID"],
                   right_index=True)
The dask version is not necessarily faster than the pandas one, particularly if your df fits in RAM.
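If you would rather keep the dask.delayed pattern from the question, a rough sketch (reusing the question's placeholder names myfunc, args1, args2) is to delay a function that builds each finished mini df and compute the list of delayed frames before concatenating; assigning delayed objects directly into pandas columns is what leaves the values unevaluated:
import dask
import pandas as pd

def process_id(mini_df):
    # everything that needs myfunc happens inside the delayed call
    mini_df = mini_df.copy()
    mini_df['new value 1'] = myfunc(args1)
    mini_df['new value 2'] = myfunc(args2)
    return mini_df

delayed_dfs = [dask.delayed(process_id)(big_df.loc[big_df['id'] == i])
               for i in big_df['id'].unique()]
mini_dfs = dask.compute(*delayed_dfs)   # returns a tuple of real DataFrames
result = pd.concat(mini_dfs)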