Condition in Pandas

Condition in Pandas - pandas

I have a very peculiar problem in Pandas: one condition works but the other does not. You may download the linked file to test my code. Thanks!
I have a file (stars.txt) that I read in with Pandas. I would like to create two groups: (1) with Log_g < 4.0 and (2) Log_g > 4.0. In my code (see below) I can successfully get rows for group (1):
Kepler_ID RA Dec Teff Log_G g H
3 2305372 19 27 57.679 +37 40 21.90 5664 3.974 14.341 12.201
14 2708156 19 21 08.906 +37 56 11.44 11061 3.717 10.672 10.525
19 2997455 19 32 31.296 +38 07 40.04 4795 3.167 14.694 11.500
34 3352751 19 36 17.249 +38 25 36.91 7909 3.791 13.541 12.304
36 3440230 19 21 53.100 +38 31 42.82 7869 3.657 13.706 12.486
But for some reason I cannot get (2). The code returns the following for error:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 108
Data columns (total 7 columns):
Kepler_ID 90 non-null values
RA 90 non-null values
Dec 90 non-null values
Teff 90 non-null values
Log_G 90 non-null values
g 90 non-null values
H 90 non-null values
dtypes: float64(4), int64(1), object(2)
Here's my code:
#------------------------------------------
# IMPORT STATEMENTS
#------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#------------------------------------------
# READ FILE AND ASSOCIATE COMPONENTS
#------------------------------------------
star_file = 'stars.txt'
header_row = ['Kepler_ID', 'RA','Dec','Teff', 'Log_G', 'g', 'H']
df = pd.read_csv(star_file, names=header_row, skiprows=2)
#------------------------------------------
# ASSOCIATE VARIABLES
#------------------------------------------
Kepler_ID = df['Kepler_ID']
#RA = df['RA']
#Dec = df['Dec']
Teff = df['Teff']
Log_G = df['Log_G']
g = df['g']
H = df['H']
#------------------------------------------
# SUBSTITUTE MISSING DATA WITH NAN
#------------------------------------------
df = df.replace('', np.nan)
#------------------------------------------
# CHANGE DATA TYPE OF THE REST OF DATA TO FLOAT
#------------------------------------------
df[['Teff', 'Log_G', 'g', 'H']] = df[['Teff', 'Log_G', 'g', 'H']].astype(float)
#------------------------------------------
# SORTING SPECTRA TYPES FOR GIANTS
#------------------------------------------
# FIND GIANTS IN THE SAMPLE
giants = df[(df['Log_G'] < 4.)]
#print giants
# FIND GIANTS IN THE SAMPLE
dwarfs = df[(df['Log_G'] > 4.)]
print dwarfs

This is not an error. You are seeing a summarized view of the DataFrame:
In [11]: df = pd.DataFrame([[2, 1], [3, 4]])
In [12]: df
Out[12]:
0 1
0 2 1
1 3 4
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
0 2 non-null values
1 2 non-null values
dtypes: int64(2)
Which is displayed is decided by several display package options, for example, max_rows:
In [14]: pd.options.display.max_rows
Out[14]: 60
In [15]: pd.options.display.max_rows = 120
In 0.13, this behaviour changed, so you will see the first max_rows followed by ....

Related

Convert and replace a string value in a pandas df with its float type

I have a value in pandas df which is accidentally put as a string as follows:
df.iloc[5329]['values']
'72,5'
I want to convert this value to float and replace it in the df. I have tried the following ways:
df.iloc[5329]['values'] = float(72.5)
also,
df.iloc[5329]['values'] = 72.5
and,
df.iloc[5329]['values'] = df.iloc[5329]['values'].replace(',', '.')
It runs successfully with a warning but when I check the df, its still stored as '72,5'.
The entire df at that index is as follows:
df.iloc[5329]
value 36.25
values 72,5
values1 72.5
currency MYR
Receipt Kuching, Malaysia
Delivery Male, Maldives
How can I solve that?

iloc needs specific row, col positioning.
import pandas as pd
df = pd.DataFrame(
{
'A': np.random.choice(100, 3),
'B': [15.2,'72,5',3.7]
})
print(df)
df.info()
Output:
A B
0 84 15.2
1 92 72,5
2 56 3.7
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null object
Update to value:
df.iloc[1,1] = 72.5
print(df)
Output:
A B
0 84 15.2
1 92 72.5
2 56 3.7

Make sure you don't have recurring indexing (i.e. [][]) when doing assignment, since df.iloc[5329] will make a copy of data and further assignment will happen to the copy not original df. Instead just do:
df.iloc[5329, 'values'] = 72.5

How to make a graph plotting monthly data over many years in pandas

I have 11 years worth of hourly ozone concentration data.
There are 11 csv files containing ozone concentrations at every hour of every day.
I was able to read all of the files in and convert the index from date to datetime.
For my graph:
I calculated the maximum daily 8-hour average and then averaged those values over each month.
My new dataframe (df3) has:
a datetime index, which consists of the last day of the month for each month of the year over the 12 years.
It also has a column including the average MDA8 values.
I want make 3 separate scatter plots for the months of April, May, and June. (x axis = year, y axis = average MDA8 for the month)
However, I am getting stuck on how to call these individual months and plot the yearly data.
Minimal sample
site,date,start_hour,value,variable,units,quality,prelim,name
3135,2010-01-01,0,13.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,1,5.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,2,11.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,3,17.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,5,16.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
Here's a link to find similar CSV data https://www.arb.ca.gov/aqmis2/aqdselect.php?tab=hourly
I've attached some code below:
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
path = "C:/Users/blah"
for f in glob.glob(os.path.join(path, "*.csv")):
df = pd.read_csv(f, header = 0, index_col='date')
df2 = df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'], inplace = True)
df = df.iloc[0:]
df.index = pd.to_datetime(df.index) #converting date to datetime
df['start_hour'] = pd.to_timedelta(df['start_hour'], unit = 'h')
df['datetime'] = df.index + df['start_hour']
df.set_index('datetime', inplace = True)
df2 = df.value.rolling('8H', min_periods = 6).mean()
df2.index -= pd.DateOffset(hours=3)
df2 = df4.resample('D').max()
df2.index.name = 'timestamp'
The problem occurs below:
df3 = df2.groupby(pd.Grouper(freq = 'M')).mean()
df4 = df3[df3.index.month.isin([4,5,6])]
if df4 == True:
plt.plot(df3.index, df3.values)
print(df4)
whenever I do this, I get a message saying "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
When I try this code with df4.any() == True:, it plots all of the months except April-June and it plots all values in the same plot. I want different plots for each month.
I've also tried adding the the following and removing the previous if statement:
df5 = df4.index.year.isin([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])
if df5.all() == True:
plt.plot(df4.index, df4.values)
However, this gives me an image like:
Again, I want to make a separate scatterplot for each month, although this is closer to what I want. Any help would be appreciated, thanks.
EDIT
In addition, I have 2020 data, which only extends to the month of July. I don't think this is going to affect my graph, but I just wanted to mention it.
Ideally, I want it to look something like this, but a different point for each year and for the individual month of April

df.index -= pd.DateOffset(hours=3) has been removed for being potentially problematic
The first hours of each month would be in the previous month
The first hours of each day would be in the previous day
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import date
from pandas.tseries.offsets import MonthEnd
# set the path to the files
p = Path('/PythonProjects/stack_overflow/data/ozone/')
# list of files
files = list(p.glob('OZONE*.csv'))
# create a dataframe from the files - all years all data
df = pd.concat([pd.read_csv(file) for file in files])
# format the dataframe
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df = df[df.month.isin([4, 5, 6])].copy() # filter the dataframe - only April, May, June
df.set_index('datetime', inplace = True)
# calculate the 8-hour rolling mean
df['r_mean'] = df.value.rolling('8H', min_periods=6).mean()
# determine max value per day
r_mean_daily_max = df.groupby(['year', 'month', 'day'], as_index=False)['r_mean'].max()
# calculate the mean from the daily max
mda8 = r_mean_daily_max.groupby(['year', 'month'], as_index=False)['r_mean'].mean()
# add a new datetime column with the date as the end of the month
mda8['datetime'] = pd.to_datetime(mda8.year.astype(str) + mda8.month.astype(str), format='%Y%m') + MonthEnd(1)
df.info() & .head() before any processing
<class 'pandas.core.frame.DataFrame'>
Int64Index: 78204 entries, 0 to 4663
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 78204 non-null int64
1 date 78204 non-null object
2 start_hour 78204 non-null int64
3 value 78204 non-null float64
4 variable 78204 non-null object
5 units 78204 non-null object
6 quality 4664 non-null float64
7 prelim 4664 non-null object
8 name 78204 non-null object
dtypes: float64(2), int64(2), object(5)
memory usage: 6.0+ MB
site date start_hour value variable units quality prelim name
0 3135 2011-01-01 0 14.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
1 3135 2011-01-01 1 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
2 3135 2011-01-01 2 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
3 3135 2011-01-01 3 25.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
4 3135 2011-01-01 5 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
df.info & .head() after processing
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20708 entries, 2011-04-01 00:00:00 to 2020-06-30 23:00:00
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 20708 non-null int64
1 value 20708 non-null float64
2 variable 20708 non-null object
3 units 20708 non-null object
4 quality 2086 non-null float64
5 prelim 2086 non-null object
6 name 20708 non-null object
7 month 20708 non-null int64
8 day 20708 non-null int64
9 year 20708 non-null int64
10 r_mean 20475 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 1.9+ MB
site value variable units quality prelim name month day year r_mean
datetime
2011-04-01 00:00:00 3135 13.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 01:00:00 3135 29.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 02:00:00 3135 31.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 03:00:00 3135 28.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 05:00:00 3135 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
r_mean_daily_max.info() and .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 910 entries, 0 to 909
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 910 non-null int64
1 month 910 non-null int64
2 day 910 non-null int64
3 r_mean 910 non-null float64
dtypes: float64(1), int64(3)
memory usage: 35.5 KB
year month day r_mean
0 2011 4 1 44.125
1 2011 4 2 43.500
2 2011 4 3 42.000
3 2011 4 4 49.625
4 2011 4 5 45.500
mda8.info() & .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 30 non-null int64
1 month 30 non-null int64
2 r_mean 30 non-null float64
3 datetime 30 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 1.2 KB
year month r_mean datetime
0 2011 4 49.808135 2011-04-30
1 2011 5 55.225806 2011-05-31
2 2011 6 58.162302 2011-06-30
3 2012 4 45.865278 2012-04-30
4 2012 5 61.061828 2012-05-31
mda8
plot 1
sns.lineplot(mda8.datetime, mda8.r_mean, marker='o')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plot 2
# create color mapping based on all unique values of year
years = mda8.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for g, d in mda8.groupby('year'):
sns.lineplot(d.datetime, d.r_mean, marker='o', hue=g, palette=cmap)
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plot 3
sns.barplot(x='month', y='r_mean', data=mda8, hue='year')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.title('MDA8: April - June')
plt.ylabel('mda8 (ppb)')
plt.show()
plot 4
for month in mda8.month.unique():
data = mda8[mda8.month == month] # filter and plot the data for a specific month
plt.figure() # create a new figure for each month
sns.lineplot(data.datetime, data.r_mean, marker='o')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.title(f'Month: {month}')
plt.ylabel('MDA8: PPB')
plt.xlabel('Year')
There will be one plot per month
plot 5
for month in mda8.month.unique():
data = mda8[mda8.month == month]
sns.lineplot(data.datetime, data.r_mean, marker='o', label=month)
plt.legend(title='Month')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.ylabel('MDA8: PPB')
plt.xlabel('Year')
Addressing I want make 3 separate scatter plots for the months of April, May, and June.
The main issue is, the data can't be plotted with a datetime axis.
The objective is to plot each day on the axis, with each figure as a different month.
Lineplot
It's kind of busy
A custom color map has been used because there aren't enough colors in the standard palette to give each year a unique color
# create color mapping based on all unique values of year
years = df.index.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for k, v in df.groupby('month'): # group the dateframe by month
plt.figure(figsize=(16, 10))
for year in v.index.year.unique(): # withing the month plot each year
data = v[v.index.year == year]
sns.lineplot(data.index.day, data.r_mean, err_style=None, hue=year, palette=cmap)
plt.xlim(0, 33)
plt.xticks(range(1, 32))
plt.title(f'Month: {k}')
plt.xlabel('Day of Month')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.show()
Here's April, the other two figures look similar to this
Barplot
for k, v in df.groupby('month'): # group the dateframe by month
plt.figure(figsize=(10, 20))
sns.barplot(x=v.r_mean, y=v.day, ci=None, orient='h', hue=v.index.year)
plt.title(f'Month: {k}')
plt.ylabel('Day of Month')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.show()

Selecting specific rows by date in a multi-dimensional df Python

image of df
I would like to select a specific date e.g 2020-07-07 and get the Adj Cls and ExMA for each of the symbols. I'm new in Python and I tried using df.loc['xy'], (xy being a specific date on the datetime) and keep getting a KeyError. Any insight is greatly appreciated.
Info on the df MultiIndex: 30 entries, (SNAP, 2020-07-06 00:00:00) to (YUM, 2020-07-10 00:00:00)
Data columns (total 2 columns):
dtypes: float64(2)

You can use pandas.DataFrame.xs for this.
import pandas as pd
import numpy as np
df = pd.DataFrame(
np.arange(8).reshape(4, 2), index=[[0, 0, 1, 1], [2, 3, 2, 3]], columns=list("ab")
)
print(df)
# a b
# 0 2 0 1
# 3 2 3
# 1 2 4 5
# 3 6 7
print(df.xs(3, level=1).filter(["a"]))
# a
# 0 2
# 1 6

Finding greatest fall and rise in a dynamic rolling window based on index

Have a df of readings as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1000, size=100), index=range(100), columns = ['reading'])
Want to find the greatest rise and the greatest fall for each row based on its index, which theoretically may be achieved using the formula...
How can this be coded?
Tried:
df.assign(gr8Rise=df.rolling(df.index).apply(lambda x: x[-1]-x[0], raw=True).max())
...and failed with ValueError: window must be an integer
UPDATE: Based on #jezrael dataset the output for gr8Rise is expected as follows:

Use:
np.random.seed(2019)
df = pd.DataFrame(np.random.randint(100, size=10), index=range(10), columns = ['reading'])
df['gr8Rise'] = [df['reading'].rolling(x).apply(lambda x: x[0]-x[-1], raw=True).max()
for x in range(1, len(df)+1)]
df.loc[0, 'gr8Rise']= np.nan
print (df)
reading gr8Rise
0 72 NaN
1 31 41.0
2 37 64.0
3 88 59.0
4 62 73.0
5 24 76.0
6 29 72.0
7 15 57.0
8 12 60.0
9 16 56.0

convert csv import via pandas to separate columns

I have a csv file that came into pandas like this:
csv file:
Date,Numbers,Extra, NaN
05/17/2002,15 18 25 33 47,30,
Pandas input:
df = pd.read_csv('/Users/owner/Downloads/file.csv’)e
#s = Series('05/17/2002', '15 18 25 33 47')
#s.str.partition(' ‘)
Output
Date Numbers. Extra
<bound method NDFrame.head of Draw Date Winning Numbers Extra NaN.
05/17/2002 15 18 25 33 47 30 NaN.
<class 'pandas.core.frame.DataFrame’>
RangeIndex: 1718 entries, 0 to 1717
Data columns (total 4 columns):
Date 1718 non-null object
Numbers 1718 non-null object.
Extra 1718 non-null int64
NaN 815 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.8+ KB
How do I convert the non-null objects into two columns:
1 is a date
1 is a list
It doesn’t seem to recognize split or to.str. or headings
Thanks

I think you want this. It specifies column 0 as a date column, and a converter for column 1:
>>> df = pd.read_csv('file.csv',parse_dates=[0],converters={1:str.split})
>>> df
Date Numbers Extra NaN
0 2002-05-17 [15, 18, 25, 33, 47] 30

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Condition in Pandas - pandas

Related

Convert and replace a string value in a pandas df with its float type

How to make a graph plotting monthly data over many years in pandas

Selecting specific rows by date in a multi-dimensional df Python

Finding greatest fall and rise in a dynamic rolling window based on index

convert csv import via pandas to separate columns

Categories

Resources