import sklearn.datasets as data
iris = data.load_iris()
import pandas as pd
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris_df['species'] = iris['filename']
iris_df
In the species column the value iris.csv shows up, but I need the actual species names (setosa etc.).
You can solve this with replace and a dict comprehension. Replace the last two lines with:
iris_df['species'] = iris['target']
# the species column is now filled with 0, 1, 2 (class indexes)
iris_df['species'] = iris_df['species'].replace({i: k for i, k in enumerate(iris['target_names'])})
# replace each index with its name: 0 becomes 'setosa', 1 'versicolor', 2 'virginica'
output would be:
sepal length (cm) sepal width (cm) ... petal width (cm) species
0 5.1 3.5 ... 0.2 setosa
1 4.9 3.0 ... 0.2 setosa
2 4.7 3.2 ... 0.2 setosa
3 4.6 3.1 ... 0.2 setosa
4 5.0 3.6 ... 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 ... 2.3 virginica
146 6.3 2.5 ... 1.9 virginica
147 6.5 3.0 ... 2.0 virginica
148 6.2 3.4 ... 2.3 virginica
149 5.9 3.0 ... 1.8 virginica
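As an aside (not part of the original answer), Series.map with a dict does the same code-to-name conversion in one step. A minimal sketch, using a small hypothetical stand-in for iris['target'] so it runs without scikit-learn:

```python
import pandas as pd

# toy stand-in for iris['target']: integer class codes (hypothetical data)
codes = pd.Series([0, 0, 1, 2, 1])
target_names = ['setosa', 'versicolor', 'virginica']

# Series.map with a dict performs the same index-to-name substitution
# as the replace() call above
species = codes.map({i: name for i, name in enumerate(target_names)})
print(species.tolist())  # ['setosa', 'setosa', 'versicolor', 'virginica', 'versicolor']
```

map replaces every value via the dict (unmatched values become NaN), which is exactly what you want for a full remapping like this.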
In R and the tidyverse, I can use ifelse() to change several observations in a variable while leaving the others as they are, simply by setting the "else" branch to the column itself (so in the example below, "virginica" and "versicolor" would remain the same). I can't figure out how to do that in pandas.
Minimal reproducible example:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris\
    .assign(new_species = iris['species'].apply(lambda x: "set" if x == "setosa" else species))
This raises an error, and if I put species in quotes, the literal string "species" becomes the value for every observation.
Thanks much!
James
Use replace:
iris['new_species'] = iris['species'].replace('setosa', 'set')
Output:
sepal_length sepal_width petal_length petal_width species new_species
0 5.1 3.5 1.4 0.2 setosa set
1 4.9 3.0 1.4 0.2 setosa set
2 4.7 3.2 1.3 0.2 setosa set
3 4.6 3.1 1.5 0.2 setosa set
4 5.0 3.6 1.4 0.2 setosa set
.. ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica virginica
146 6.3 2.5 5.0 1.9 virginica virginica
147 6.5 3.0 5.2 2.0 virginica virginica
148 6.2 3.4 5.4 2.3 virginica virginica
149 5.9 3.0 5.1 1.8 virginica virginica
[150 rows x 6 columns]
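If you want something even closer to R's ifelse() specifically, numpy.where takes a condition, a value for the true branch, and a fallback for the false branch; passing the original column as the fallback leaves the other observations untouched. A small sketch (a separate technique, not from the answer above):

```python
import pandas as pd
import numpy as np

species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'])

# np.where(condition, value_if_true, value_if_false) mirrors R's ifelse();
# using the original column as the "else" keeps unmatched rows as they are
new_species = pd.Series(np.where(species == 'setosa', 'set', species))
print(new_species.tolist())  # ['set', 'versicolor', 'virginica', 'set']
```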
I'm using Python 2.7. The title provides context; I phrased it this way so people can find this Stack Exchange question in the future. There is a plethora of documentation for this workflow in MATLAB, but it is severely lacking for SciPy, NumPy, pandas, matplotlib, etc.
Essentially, I have the following dataframe:
time amplitude
0 1.0 0.1
1 2.0 -0.3
2 3.0 1.4
3 4.0 4.2
4 5.0 -5.7
5 6.0 2.3
6 7.0 -0.2
7 8.0 -0.3
8 9.0 1.0
9 10.0 0.1
Now what I want to do is the following:
in 5 second intervals, look for the max and min value
record the max and min values with the corresponding time value (i.e. for the above case, in the first 5 seconds, the max is 4.2 at 4 seconds and the min is -5.7 at 5 seconds)
append values in appropriate place into the data frame i.e.
time amplitude upper lower
0 1.0 0.1
1 2.0 -0.3
2 3.0 1.4
3 4.0 4.2 4.2
4 5.0 -5.7 -5.7
5 6.0 2.3 2.3
6 7.0 -0.2
7 8.0 -0.3 -0.3
8 9.0 1.0
9 10.0 0.1
interpolate between max values and min values to flush out dataframe
plot amplitude column, upper column and lower column
I'm familiar enough with Python/pandas to imagine the code looking something like the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy

time = [1,2,3,4,5,6,7,8,9,10]
amplitude = [0.1,-0.3,1.4,4.2,-5.7,2.3,-0.2,-0.3,1.0,0.1]
df = pd.DataFrame({'time': time, 'amplitude': amplitude})
plt.plot(df['time'], df['amplitude'])
for seconds in time:
    if <interval == 5>:
        max_vals = []
        time_max = []
        min_vals = []
        time_min = []
        max_vals.append(df['amplitude'].max())
        min_vals.append(df['amplitude'].min())
        time_max.append(<time value in interval>)
        time_min.append(<time value in interval>)
<build another dataframe>
<concat to existing dataframe df>
<interpolate between values in column 'upper'>
<interpolate between values in column 'lower'>
any help is appreciated.
thank you.
~devin
Pandas resample() and interpolate() will help here. To get seconds as a DatetimeIndex, start from an arbitrary datetime - you can always chop the year off when you're done:
df.set_index(pd.to_datetime("2017") + pd.to_timedelta(df.time, unit="s"), inplace=True)
print(df)
time amplitude
time
2017-01-01 00:00:01 1.0 0.1
2017-01-01 00:00:02 2.0 -0.3
2017-01-01 00:00:03 3.0 1.4
2017-01-01 00:00:04 4.0 4.2
2017-01-01 00:00:05 5.0 -5.7
2017-01-01 00:00:06 6.0 2.3
2017-01-01 00:00:07 7.0 -0.2
2017-01-01 00:00:08 8.0 -0.3
2017-01-01 00:00:09 9.0 1.0
2017-01-01 00:00:10 10.0 0.1
Resample every 5 seconds, and get summary statistics min and max:
summary = (df.resample('5s', label='right', closed='right')['amplitude']
             .agg(upper='max', lower='min'))
print(summary)
upper lower
time
2017-01-01 00:00:05 4.2 -5.7
2017-01-01 00:00:10 2.3 -0.3
Merge with original df and interpolate missing values. (Note that interpolation is only possible between two values, so the first few entries will be NaN.)
df2 = df.merge(summary, how='left', left_index=True, right_index=True)
df2['lower'] = df2['lower'].interpolate()
df2['upper'] = df2['upper'].interpolate()
print(df2)
time amplitude upper lower
time
2017-01-01 00:00:01 1.0 0.1 NaN NaN
2017-01-01 00:00:02 2.0 -0.3 NaN NaN
2017-01-01 00:00:03 3.0 1.4 NaN NaN
2017-01-01 00:00:04 4.0 4.2 NaN NaN
2017-01-01 00:00:05 5.0 -5.7 4.20 -5.70
2017-01-01 00:00:06 6.0 2.3 3.82 -4.62
2017-01-01 00:00:07 7.0 -0.2 3.44 -3.54
2017-01-01 00:00:08 8.0 -0.3 3.06 -2.46
2017-01-01 00:00:09 9.0 1.0 2.68 -1.38
2017-01-01 00:00:10 10.0 0.1 2.30 -0.30
Finally, plot the output:
plot_cols = ['amplitude','lower','upper']
df2[plot_cols].plot()
Note: If you want the index to only display seconds, just use:
df2.index = df2.index.second
I pretty much adapted this answer: "How to get high and low envelope of a signal?" (but in a more pandas/DataFrame-oriented way).
I also used: "Subsetting Data Frame into Multiple Data Frames in Pandas".
I hope this helps people create arbitrary envelopes for noisy signals / time series data like it helped me!
import pandas as pd
import matplotlib.pyplot as plt

time_array = [0,1,2,3,4,5,6,7,8,9]
value_array = [0.1,-0.3,1.4,4.2,-5.7,2.3,-0.2,-0.3,1.0,0.1]
upper_time = []
upper_value = []
lower_time = []
lower_value = []
df = pd.DataFrame({'time': time_array, 'value': value_array})

# group every 2 rows by integer-dividing the positional index
# (use x // 5 instead for 5-second windows)
for element, df_k in df.groupby(lambda x: x // 2):
    df_temp = df_k.reset_index(drop=True)
    upper_time.append(df_temp['time'].loc[df_temp['value'].idxmax()])
    upper_value.append(round(df_temp['value'].max(), 1))
    lower_time.append(df_temp['time'].loc[df_temp['value'].idxmin()])
    lower_value.append(round(df_temp['value'].min(), 1))

plt.plot(df['time'], df['value'])
plt.plot(upper_time, upper_value)
plt.plot(lower_time, lower_value)
plt.show()
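As a side note (an alternative technique, not from either answer above): a centered rolling window also traces out an upper and lower envelope, without committing to fixed bins. A minimal sketch on the same amplitude values:

```python
import pandas as pd

amplitude = pd.Series([0.1, -0.3, 1.4, 4.2, -5.7, 2.3, -0.2, -0.3, 1.0, 0.1])

# rolling(5, center=True) takes the max/min over a sliding 5-sample window
# centered on each point; min_periods=1 keeps the edges from going NaN
upper = amplitude.rolling(5, center=True, min_periods=1).max()
lower = amplitude.rolling(5, center=True, min_periods=1).min()
print(upper.tolist())
print(lower.tolist())
```

Unlike resample-based binning, this produces a value at every sample, so no separate interpolation step is needed.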
I have two pandas DataFrames, with one of them having index and columns that are subsets of the other. For example:
DF1 =
date a b c
20170101 1.0 2.2 3
20170102 2.1 5.2 -3.0
20170103 4.2 1.8 10.0
...
20170331 9.8 5.1 4.5
DF2 =
date a c
20170101 NaN 2.1
20170103 4 NaN
What I want is element-wise multiplication that matches on both index and column, i.e. DF1.loc[20170101, 'c'] is multiplied by DF2.loc[20170101, 'c'], and so on.
The result should have the same dimensions as the bigger frame (DF1), with cells that are missing (or NaN) in DF2 keeping the original DF1 value:
result DF =
date a b c
20170101 1.0 2.2 6.3
20170102 2.1 5.2 -3.0
20170103 16.8 1.8 10.0
...
20170331 9.8 5.1 4.5
What's the best/fastest way to do this? The real-life matrices are huge, and DF2 is relatively sparse.
Use the vectorized method mul with fill_value=1, which aligns on both index and columns:
df = DF1.mul(DF2, fill_value=1)
print (df)
a b c
date
20170101 1.0 2.2 6.3
20170102 2.1 5.2 -3.0
20170103 16.8 1.8 10.0
20170331 9.8 5.1 4.5
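A self-contained sketch of the same mul call, on small stand-in frames built from the example values above:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'a': [1.0, 2.1, 4.2], 'b': [2.2, 5.2, 1.8], 'c': [3.0, -3.0, 10.0]},
                   index=[20170101, 20170102, 20170103])
df2 = pd.DataFrame({'a': [np.nan, 4.0], 'c': [2.1, np.nan]},
                   index=[20170101, 20170103])

# mul() aligns on both index and columns; fill_value=1 substitutes 1 for
# any cell missing on one side (absent rows/columns, or NaN), so every
# DF1 cell without a DF2 partner passes through unchanged
result = df1.mul(df2, fill_value=1)
print(result)
```

Because DF2 is sparse, most cells fall back to the fill value; only the overlapping non-NaN cells (here 'c' at 20170101 and 'a' at 20170103) are actually scaled.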
I want to do the following in hy:
from io import StringIO
import pandas as pd
s = """sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa"""
df = pd.read_table(StringIO(s), sep=r"\s+")
df.loc[df.sepal_length > 4.5]
How do I write the last statement?
I have tried (.loc df (> df.sepal_length 4.5))
but it just returns the loc indexer object.
There are two ways to do this:
Using the . macro:
(. df loc [(> df.sepal-length 4.5)])
Using get:
(get df.loc (> df.sepal-length 4.5))
Protip: always try running hy2py on your Hy files. It shows you what the resulting Python looks like. The output isn't always valid syntax, but it shows you what gets compiled into what. Both of these get compiled down to df.loc[(df.sepal_length > 4.5)].
One more thing: notice I used sepal-length. Hy converts dashes in identifiers to underscores, so it's the same thing as sepal_length, and it's considered better style.
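For reference, a small runnable Python version of what both Hy forms compile to, using a trimmed stand-in for the data above:

```python
import pandas as pd
from io import StringIO

s = """sepal_length sepal_width
5.1 3.5
4.4 2.9
4.7 3.2"""

df = pd.read_csv(StringIO(s), sep=r"\s+")

# both Hy forms compile to this boolean-mask indexing
filtered = df.loc[df.sepal_length > 4.5]
print(filtered)
```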