How to construct the data frame such that the "species" column has the actual species names (setosa etc) - pandas

import sklearn.datasets as data
iris = data.load_iris()
import pandas as pd
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris_df['species'] = iris['filename']
iris_df
In the species column it shows "iris.csv", but I need the actual species names (setosa etc.).

You can solve this with a dict comprehension.
Replace the last two lines with:
iris_df['species'] = iris['target']
# the species column is now filled with 0, 1, 2 (class indices)
iris_df['species'] = iris_df['species'].replace({i: k for i, k in enumerate(iris['target_names'])})
# replace the indices with the label strings:
# 0 becomes 'setosa', 1 'versicolor', 2 'virginica'
The output would be:
sepal length (cm) sepal width (cm) ... petal width (cm) species
0 5.1 3.5 ... 0.2 setosa
1 4.9 3.0 ... 0.2 setosa
2 4.7 3.2 ... 0.2 setosa
3 4.6 3.1 ... 0.2 setosa
4 5.0 3.6 ... 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 ... 2.3 virginica
146 6.3 2.5 ... 1.9 virginica
147 6.5 3.0 ... 2.0 virginica
148 6.2 3.4 ... 2.3 virginica
149 5.9 3.0 ... 1.8 virginica
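An alternative, in case it is useful: pd.Categorical.from_codes can map the integer codes straight onto target_names in one step, skipping the intermediate integer column. A minimal sketch of the same idea:
import sklearn.datasets as data
import pandas as pd

iris = data.load_iris()
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
# from_codes maps each 0/1/2 code to the matching entry of target_names
iris_df['species'] = pd.Categorical.from_codes(iris['target'], iris['target_names'])
This also makes species a categorical column, which is usually what you want for a class label.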

Creating an "ID" variable "On the Spot"

I have a table on a server that looks something like this (simulated here with an in-memory SQLite database):
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)
I want to create a column that contains IDs, a unique ID for each row (something like id = 1:nrow(iris)). Since the table is on a server, I would choose some suitable column, one that I hope has many unique values, and use the row_number() function to create this ID:
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by `Petal.Length`) as rnum from iris)a limit 5;")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species rnum
1 4.6 3.6 1.0 0.2 setosa 1
2 4.3 3.0 1.1 0.1 setosa 2
3 5.8 4.0 1.2 0.2 setosa 3
4 5.0 3.2 1.2 0.2 setosa 4
5 4.7 3.2 1.3 0.2 setosa 5
But what happens if I don't have such a "suitable" column? Is it possible to create this ID column "on the spot", without basing it on a specific column?
Thank you!
Why not just use over()?
DBI::dbGetQuery(con, "select *, row_number() over() rnum from iris limit 5;")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species rnum
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 3
4 4.6 3.1 1.5 0.2 setosa 4
5 5.0 3.6 1.4 0.2 setosa 5
You can also order over select 1 (see "What does it mean by select 1 from table?"):
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over
(order by (select 1)) as rnum from iris)a limit 5;")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species rnum
#1 5.1 3.5 1.4 0.2 setosa 1
#2 4.9 3.0 1.4 0.2 setosa 2
#3 4.7 3.2 1.3 0.2 setosa 3
#4 4.6 3.1 1.5 0.2 setosa 4
#5 5.0 3.6 1.4 0.2 setosa 5

Ifelse leaving other observations as is

In R and the tidyverse, I can use ifelse() to change several of the observations in a variable while leaving the ones I don't want changed as they are, by setting the else branch to the column itself (so in the example below, "virginica" and "versicolor" would remain the same). I can't figure out how to do that in pandas.
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
Minimal reproducible example:
iris\
.assign(new_species = iris['species'].apply(lambda x: "set" if x=="setosa" else species))
This raises an error, and if I put species in quotes, every row gets the literal string "species" instead.
Thanks much!
James
Use replace:
iris['new_species'] = iris['species'].replace('setosa', 'set')
Output:
sepal_length sepal_width petal_length petal_width species new_species
0 5.1 3.5 1.4 0.2 setosa set
1 4.9 3.0 1.4 0.2 setosa set
2 4.7 3.2 1.3 0.2 setosa set
3 4.6 3.1 1.5 0.2 setosa set
4 5.0 3.6 1.4 0.2 setosa set
.. ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica virginica
146 6.3 2.5 5.0 1.9 virginica virginica
147 6.5 3.0 5.2 2.0 virginica virginica
148 6.2 3.4 5.4 2.3 virginica virginica
149 5.9 3.0 5.1 1.8 virginica virginica
[150 rows x 6 columns]
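If you want something closer to a literal ifelse(), numpy.where takes a condition, a value for the True case, and a fallback value (here the original column). A minimal sketch producing the same result:
import numpy as np

# analogue of R's ifelse(species == "setosa", "set", species)
iris['new_species'] = np.where(iris['species'] == 'setosa', 'set', iris['species'])
iris['species'].mask(iris['species'].eq('setosa'), 'set') does the same while staying within pandas.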

Make Pandas groupby apply behave like transform?

Is there a way to apply a function that returns iterables of the same length as the groups, and "transform" the results into a new column?
I have some time-series data from several groups that are interleaved.
For example:
import pandas as pd
import numpy as np
import seaborn as sns
iris_data = sns.load_dataset('iris')
df = iris_data.head(5).copy()
df = pd.concat([df, iris_data.iloc[50:55]])
df = pd.concat([df, iris_data.iloc[100:105]])
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
df = df.drop(columns=["sepal_width", "petal_length", "petal_width"])
df.index = pd.date_range("2020/12/16", periods=15)
df
sepal_length species
2020-12-16 4.9 setosa
2020-12-17 6.4 versicolor
2020-12-18 5.5 versicolor
2020-12-19 6.5 versicolor
2020-12-20 6.5 virginica
2020-12-21 5.0 setosa
2020-12-22 4.7 setosa
2020-12-23 6.3 virginica
2020-12-24 6.3 virginica
2020-12-25 6.9 versicolor
2020-12-26 5.8 virginica
2020-12-27 4.6 setosa
2020-12-28 5.1 setosa
2020-12-29 7.0 versicolor
2020-12-30 7.1 virginica
I want to apply some time-dependent logic to each group (in this case, each species), and append the results as a new column.
For example, say I want to calculate an "is new high" indicator in each series (again, THIS IS NOT THE POINT OF THE QUESTION, just an example of time-dependent calculations):
def is_new_high(group):
    """
    This function simulates some order-dependent calculations as an illustration.
    It labels True when the current sepal_length is longer than every previous one
    in the same species. This probably can be achieved without groupby and apply,
    but that is not the point here.
    """
    group_res = [False] * len(group)
    running_max = 0
    for i, (row_idx, row) in enumerate(group.iterrows()):
        if row.sepal_length > running_max:
            running_max = row.sepal_length
            group_res[i] = True
    return group_res
which when applied to the df
grouped_results = df.groupby(df.species).apply(is_new_high)
grouped_results
yields a Series that indicates whether an observation is a new-high in that species:
species
setosa [True, True, False, False, True]
versicolor [True, False, True, True, True]
virginica [True, False, False, False, True]
dtype: object
With a little bit of (inefficient) Pandas trickery (see below), this can be merged back into the original df:
sepal_length species new_high
2020-12-16 4.9 setosa True
2020-12-17 6.4 versicolor True
2020-12-18 5.5 versicolor False
2020-12-19 6.5 versicolor True
2020-12-20 6.5 virginica True
2020-12-21 5.0 setosa True
2020-12-22 4.7 setosa False
2020-12-23 6.3 virginica False
2020-12-24 6.3 virginica False
2020-12-25 6.9 versicolor True
2020-12-26 5.8 virginica False
2020-12-27 4.6 setosa False
2020-12-28 5.1 setosa True
2020-12-29 7.0 versicolor True
2020-12-30 7.1 virginica True
Since the function applied is time- or order-dependent, I cannot think of a way to turn it into a pandas "transform" function that acts on columns rather than on a group (i.e., a DataFrame). So it seems I'm stuck with an apply that returns a vector for each group.
My question is the following: is there a way to apply a function that returns iterables of the same length as the groups, and "transform" the results into a new column, to achieve the same result as in my example?
Here's the inefficient and ad-hoc way I did it:
def is_new_high(group):
    # same logic as above, but each result is paired with its row index
    # so it can be matched back to the original frame
    group_res = [(idx, False) for idx in group.index]
    running_max = 0
    for i, (row_idx, row) in enumerate(group.iterrows()):
        if row.sepal_length > running_max:
            running_max = row.sepal_length
            group_res[i] = (row_idx, True)
    return group_res

grouped_results = df.groupby(df.species).apply(is_new_high)  # re-apply with the new function
grouped_results_exploded = grouped_results.explode()
res = pd.DataFrame(grouped_results_exploded.tolist(), index=grouped_results_exploded.index)
res.columns = ['timestamp', 'new_high']
res = res.reset_index().set_index("timestamp")
df.join(res, rsuffix="_groupby")
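For what it's worth, one standard pattern for this (a sketch, not taken from the original thread): have the grouped function return a boolean Series indexed by the group's own index. apply concatenates those Series, and assigning the result back to df aligns on the original index, which gives exactly the transform-like behavior asked for. Assuming the same df as above:
def is_new_high_aligned(s):
    # running max of all earlier rows; the first row compares against 0,
    # matching the running_max = 0 initialization above
    return s > s.cummax().shift(fill_value=0)

df['new_high'] = df.groupby('species', group_keys=False)['sepal_length'].apply(is_new_high_aligned)
The cummax trick is incidental; the point is the index: any order-dependent loop that returns pd.Series(result_list, index=group.index) can be assigned back to a column the same way.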

Using pandas loc in hy

I want to do the following in hy:
from io import StringIO
import pandas as pd
s = """sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa"""
df = pd.read_table(StringIO(s), sep=r"\s+")
df.loc[df.sepal_length > 4.5]
How do I write that last line?
I have tried (.loc df (> df.sepal_length 4.5)), but that just returns a LocIndexer instead of the filtered frame.
There are two ways to do this:
Using the . macro:
(. df loc [(> df.sepal-length 4.5)])
Using get:
(get df.loc (> df.sepal-length 4.5))
Protip: always try running hy2py on your Hy files. It shows you what the resulting Python looks like. The output isn't always valid syntax, but it shows you what gets compiled into what. Both of these get compiled down to df.loc[(df.sepal_length > 4.5)].
One more thing: notice I used sepal-length. Hy converts dashes in identifiers to underscores, so that's the same thing as sepal_length, but it's considered better style.

strange error with RCassandra trying to store the "iris" dataset

I am new to Apache Cassandra. I installed the R extension RCassandra and created the following table:
CREATE TABLE iris (
id varchar,
"Sepal.Length" double,
"Sepal.Width" double,
"Petal.Length" double,
"Petal.Width" double,
Species text,
PRIMARY KEY(id)
);
After successful connection and keyspace selection I try the following:
> RC.write.table(c,'iris',iris)
The iris dataset has the following form:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
...
But I get the following error:
Error in RC.write.table(c, "iris", iris) :
Cassandra exception: Not enough bytes to read value of component 0
Maybe the id field is troublesome, so I created the following dataset by adding an id column to the dataframe:
iris2 = data.frame(id=row.names(iris),iris)
iris2
id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
...
but I still get the same error:
> RC.write.table(c,'iris',iris2)
Error in RC.write.table(c, "iris", iris2) :
Cassandra exception: Not enough bytes to read value of component 0
Any ideas on how to solve this?
Are you creating the column family ("table") in Cassandra using CQLSH? I was having the same problem myself. I then switched to cassandra-cli, created the column family that way, and it worked.
Most likely RCassandra uses the cli interface underneath, and because column families created in CQLSH are not visible in cassandra-cli, it can't see CQL tables.