convert csv import via pandas to separate columns - pandas

I have a csv file that came into pandas like this:
csv file:
Date,Numbers,Extra, NaN
05/17/2002,15 18 25 33 47,30,
Pandas input:
df = pd.read_csv('/Users/owner/Downloads/file.csv')
#s = Series('05/17/2002', '15 18 25 33 47')
#s.str.partition(' ')
Output:
<bound method NDFrame.head of      Draw Date  Winning Numbers  Extra  NaN
0   05/17/2002   15 18 25 33 47     30  NaN>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1718 entries, 0 to 1717
Data columns (total 4 columns):
Date 1718 non-null object
Numbers 1718 non-null object
Extra 1718 non-null int64
NaN 815 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 53.8+ KB
How do I convert the non-null object columns into two usable columns:
one that is a date,
one that is a list?
It doesn't seem to recognize split, the .str accessor, or the headings.
Thanks

I think you want this. It specifies column 0 as a date column, and a converter for column 1:
>>> df = pd.read_csv('file.csv', parse_dates=[0], converters={1: str.split})
>>> df
        Date               Numbers  Extra  NaN
0 2002-05-17  [15, 18, 25, 33, 47]     30  NaN
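If the file has already been read in without converters, a similar result can be produced after the fact; a minimal sketch, assuming the columns are named Date and Numbers as in the sample above:

import pandas as pd

df = pd.read_csv('file.csv')               # plain read; both columns come in as object
df['Date'] = pd.to_datetime(df['Date'])    # parse the date strings
df['Numbers'] = df['Numbers'].str.split()  # '15 18 25 33 47' -> ['15', '18', '25', '33', '47']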


KeyError raised when trying to delete an existing column

RangeIndex: 381732 entries, 0 to 381731
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 381732 non-null int64
1 tweet_id 378731 non-null float64
2 time 378731 non-null object
3 tweet 378731 non-null object
4 retweet_count 336647 non-null float64
5 Unnamed: 0.1 336647 non-null float64
6 User 3001 non-null object
7 Date_Created 3001 non-null object
8 Source of Tweet 3001 non-null object
9 Tweet 3001 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 29.1+ MB
df = df.drop(['Unnamed: 0','Unnamed: 0.1','User','Date_Created','Source of Tweet'],axis =1)
df.head()
I wrote this code to drop unwanted columns from my dataframe, but I am encountering a KeyError: not found in axis.
KeyError: "['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet'] not found in axis"
For debugging purposes, try:
cols_to_drop = ['Unnamed: 0', 'Unnamed: 0.1', 'User', 'Date_Created', 'Source of Tweet']
df = df[[col for col in df.columns if col not in cols_to_drop]]
and check the remaining columns using df.info()
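The KeyError means at least one of the listed labels is not actually a column name (for example because of a spelling mismatch, or because it was already dropped in an earlier run). As a sketch, drop can also be told to skip missing labels via its errors parameter:

# drops whatever matches and silently skips labels that are absent
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'User',
                      'Date_Created', 'Source of Tweet'],
             errors='ignore')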

Convert and replace a string value in a pandas df with its float type

I have a value in a pandas df which accidentally ended up stored as a string:
df.iloc[5329]['values']
'72,5'
I want to convert this value to float and replace it in the df. I have tried the following ways:
df.iloc[5329]['values'] = float(72.5)
also,
df.iloc[5329]['values'] = 72.5
and,
df.iloc[5329]['values'] = df.iloc[5329]['values'].replace(',', '.')
It runs successfully with a warning, but when I check the df, it's still stored as '72,5'.
The entire df at that index is as follows:
df.iloc[5329]
value 36.25
values 72,5
values1 72.5
currency MYR
Receipt Kuching, Malaysia
Delivery Male, Maldives
How can I solve that?
iloc needs specific row and column positions.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': np.random.choice(100, 3),
        'B': [15.2, '72,5', 3.7]
    })
print(df)
df.info()
Output:
A B
0 84 15.2
1 92 72,5
2 56 3.7
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null object
Update the value:
df.iloc[1,1] = 72.5
print(df)
Output:
A B
0 84 15.2
1 92 72.5
2 56 3.7
Make sure you don't use chained indexing (i.e. [][]) when doing the assignment, since df.iloc[5329] makes a copy of the data and the further assignment happens to the copy, not the original df. Also note that iloc takes only integer positions, not labels, so use loc for a label-based assignment:
df.loc[5329, 'values'] = 72.5
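If the decimal comma appears in more than one row, it may be simpler to repair the whole column at once; a minimal sketch, assuming every entry of 'values' should end up as a float:

# cast to str first so rows that are already numeric survive the .str accessor,
# then swap the decimal comma for a point and convert
df['values'] = pd.to_numeric(df['values'].astype(str).str.replace(',', '.'))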

How to make a graph plotting monthly data over many years in pandas

I have 11 years worth of hourly ozone concentration data.
There are 11 csv files containing ozone concentrations at every hour of every day.
I was able to read all of the files in and convert the index from date to datetime.
For my graph:
I calculated the maximum daily 8-hour average and then averaged those values over each month.
My new dataframe (df3) has:
a datetime index, which consists of the last day of the month for each month of the year over the 12 years.
It also has a column including the average MDA8 values.
I want to make 3 separate scatter plots for the months of April, May, and June (x axis = year, y axis = average MDA8 for the month).
However, I am getting stuck on how to call these individual months and plot the yearly data.
Minimal sample
site,date,start_hour,value,variable,units,quality,prelim,name
3135,2010-01-01,0,13.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,1,5.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,2,11.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,3,17.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
3135,2010-01-01,5,16.0,OZONE,Parts Per Billion ( ppb ),,,Calexico-Ethel Street
Here's a link to find similar CSV data https://www.arb.ca.gov/aqmis2/aqdselect.php?tab=hourly
I've attached some code below:
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
path = "C:/Users/blah"
for f in glob.glob(os.path.join(path, "*.csv")):
    df = pd.read_csv(f, header=0, index_col='date')
    df2 = df.dropna(axis=0, how="all", subset=['start_hour', 'variable'], inplace=True)
    df = df.iloc[0:]
    df.index = pd.to_datetime(df.index)  # converting date to datetime
    df['start_hour'] = pd.to_timedelta(df['start_hour'], unit='h')
    df['datetime'] = df.index + df['start_hour']
    df.set_index('datetime', inplace=True)
    df2 = df.value.rolling('8H', min_periods=6).mean()
    df2.index -= pd.DateOffset(hours=3)
    df2 = df4.resample('D').max()
    df2.index.name = 'timestamp'
The problem occurs below:
df3 = df2.groupby(pd.Grouper(freq='M')).mean()
df4 = df3[df3.index.month.isin([4, 5, 6])]
if df4 == True:
    plt.plot(df3.index, df3.values)
    print(df4)
whenever I do this, I get a message saying "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
When I try this code with df4.any() == True:, it plots all of the months except April-June, and it plots all values in the same plot. I want a different plot for each month.
I've also tried adding the following and removing the previous if statement:
df5 = df4.index.year.isin([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019])
if df5.all() == True:
    plt.plot(df4.index, df4.values)
However, this gives me an image like: [figure omitted]
Again, I want to make a separate scatter plot for each month, although this is closer to what I want. Any help would be appreciated, thanks.
EDIT
In addition, I have 2020 data, which only extends to the month of July. I don't think this is going to affect my graph, but I just wanted to mention it.
Ideally, I want it to look something like this [figure omitted], but with a different point for each year, for the individual month of April.
df.index -= pd.DateOffset(hours=3) has been removed for being potentially problematic:
the first hours of each month would be shifted into the previous month, and
the first hours of each day would be shifted into the previous day.
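A quick illustration of that shift (hypothetical timestamps):

import pandas as pd

idx = pd.date_range('2011-05-01 00:00', periods=3, freq='H')
shifted = idx - pd.DateOffset(hours=3)
print(shifted)  # the first hours of 1 May now fall on 30 April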
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import date
from pandas.tseries.offsets import MonthEnd
# set the path to the files
p = Path('/PythonProjects/stack_overflow/data/ozone/')
# list of files
files = list(p.glob('OZONE*.csv'))
# create a dataframe from the files - all years all data
df = pd.concat([pd.read_csv(file) for file in files])
# format the dataframe
df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df = df[df.month.isin([4, 5, 6])].copy() # filter the dataframe - only April, May, June
df.set_index('datetime', inplace = True)
# calculate the 8-hour rolling mean
df['r_mean'] = df.value.rolling('8H', min_periods=6).mean()
# determine max value per day
r_mean_daily_max = df.groupby(['year', 'month', 'day'], as_index=False)['r_mean'].max()
# calculate the mean from the daily max
mda8 = r_mean_daily_max.groupby(['year', 'month'], as_index=False)['r_mean'].mean()
# add a new datetime column with the date as the end of the month
mda8['datetime'] = pd.to_datetime(mda8.year.astype(str) + mda8.month.astype(str), format='%Y%m') + MonthEnd(1)
df.info() & .head() before any processing
<class 'pandas.core.frame.DataFrame'>
Int64Index: 78204 entries, 0 to 4663
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 78204 non-null int64
1 date 78204 non-null object
2 start_hour 78204 non-null int64
3 value 78204 non-null float64
4 variable 78204 non-null object
5 units 78204 non-null object
6 quality 4664 non-null float64
7 prelim 4664 non-null object
8 name 78204 non-null object
dtypes: float64(2), int64(2), object(5)
memory usage: 6.0+ MB
site date start_hour value variable units quality prelim name
0 3135 2011-01-01 0 14.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
1 3135 2011-01-01 1 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
2 3135 2011-01-01 2 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
3 3135 2011-01-01 3 25.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
4 3135 2011-01-01 5 22.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street
df.info() & .head() after processing
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 20708 entries, 2011-04-01 00:00:00 to 2020-06-30 23:00:00
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 site 20708 non-null int64
1 value 20708 non-null float64
2 variable 20708 non-null object
3 units 20708 non-null object
4 quality 2086 non-null float64
5 prelim 2086 non-null object
6 name 20708 non-null object
7 month 20708 non-null int64
8 day 20708 non-null int64
9 year 20708 non-null int64
10 r_mean 20475 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 1.9+ MB
site value variable units quality prelim name month day year r_mean
datetime
2011-04-01 00:00:00 3135 13.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 01:00:00 3135 29.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 02:00:00 3135 31.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 03:00:00 3135 28.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
2011-04-01 05:00:00 3135 11.0 OZONE Parts Per Billion ( ppb ) NaN NaN Calexico-Ethel Street 4 1 2011 NaN
r_mean_daily_max.info() and .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 910 entries, 0 to 909
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 910 non-null int64
1 month 910 non-null int64
2 day 910 non-null int64
3 r_mean 910 non-null float64
dtypes: float64(1), int64(3)
memory usage: 35.5 KB
year month day r_mean
0 2011 4 1 44.125
1 2011 4 2 43.500
2 2011 4 3 42.000
3 2011 4 4 49.625
4 2011 4 5 45.500
mda8.info() & .head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 30 non-null int64
1 month 30 non-null int64
2 r_mean 30 non-null float64
3 datetime 30 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 1.2 KB
year month r_mean datetime
0 2011 4 49.808135 2011-04-30
1 2011 5 55.225806 2011-05-31
2 2011 6 58.162302 2011-06-30
3 2012 4 45.865278 2012-04-30
4 2012 5 61.061828 2012-05-31
mda8
plot 1
sns.lineplot(x=mda8.datetime, y=mda8.r_mean, marker='o')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plot 2
# create color mapping based on all unique values of year
years = mda8.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for g, d in mda8.groupby('year'):
    sns.lineplot(x=d.datetime, y=d.r_mean, marker='o', hue=g, palette=cmap)
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
plot 3
sns.barplot(x='month', y='r_mean', data=mda8, hue='year')
plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.title('MDA8: April - June')
plt.ylabel('mda8 (ppb)')
plt.show()
plot 4
for month in mda8.month.unique():
    data = mda8[mda8.month == month]  # filter the data for a specific month
    plt.figure()  # create a new figure for each month
    sns.lineplot(x=data.datetime, y=data.r_mean, marker='o')
    plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
    plt.title(f'Month: {month}')
    plt.ylabel('MDA8: PPB')
    plt.xlabel('Year')
There will be one plot per month
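Since the question asks for scatter plots specifically, the same loop works with a scatter plot in place of the line plot; a minimal sketch using the mda8 frame from above:

for month in mda8.month.unique():
    data = mda8[mda8.month == month]
    plt.figure()  # one figure per month
    sns.scatterplot(x='datetime', y='r_mean', data=data)
    plt.title(f'Month: {month}')
    plt.ylabel('MDA8 (ppb)')
    plt.xlabel('Year')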
plot 5
for month in mda8.month.unique():
    data = mda8[mda8.month == month]
    sns.lineplot(x=data.datetime, y=data.r_mean, marker='o', label=month)
plt.legend(title='Month')
plt.xlim(date(2011, 1, 1), date(2021, 1, 1))
plt.ylabel('MDA8: PPB')
plt.xlabel('Year')
Addressing "I want to make 3 separate scatter plots for the months of April, May, and June":
The main issue is that the data can't simply be plotted against a datetime axis.
The objective is to plot each day on the axis, with each figure showing a different month.
Lineplot
It's kind of busy
A custom color map has been used because there aren't enough colors in the standard palette to give each year a unique color
# create color mapping based on all unique values of year
years = df.index.year.unique()
colors = sns.color_palette('husl', n_colors=len(years)) # get a number of colors
cmap = dict(zip(years, colors)) # zip values to colors
for k, v in df.groupby('month'):  # group the dataframe by month
    plt.figure(figsize=(16, 10))
    for year in v.index.year.unique():  # within the month, plot each year
        data = v[v.index.year == year]
        sns.lineplot(x=data.index.day, y=data.r_mean, err_style=None, hue=year, palette=cmap)
    plt.xlim(0, 33)
    plt.xticks(range(1, 32))
    plt.title(f'Month: {k}')
    plt.xlabel('Day of Month')
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
    plt.show()
Here's April, the other two figures look similar to this
Barplot
for k, v in df.groupby('month'):  # group the dataframe by month
    plt.figure(figsize=(10, 20))
    sns.barplot(x=v.r_mean, y=v.day, ci=None, orient='h', hue=v.index.year)
    plt.title(f'Month: {k}')
    plt.ylabel('Day of Month')
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left", borderaxespad=0)
    plt.show()

Error using astype when NaN exists in a dataframe

df
A B
0 a=10 b=20.10
1 a=20 NaN
2 NaN b=30.10
3 a=40 b=40.10
I tried :
df['A'] = df['A'].str.extract('(\d+)').astype(int)
df['B'] = df['B'].str.extract('(\d+)').astype(float)
But I get the following error:
ValueError: cannot convert float NaN to integer
And:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
How do I fix this?
If some values in a column are missing (NaN), then after conversion to numeric the dtype is always float. You cannot convert the values to int, only to float, because the type of NaN is itself float.
print (type(np.nan))
<class 'float'>
See the docs for how values are cast when at least one NaN is present:
integer -> cast to float64
If you need int values, you need to replace NaN with some int, e.g. 0 via fillna, and then it works perfectly:
df['A'] = df['A'].str.extract('(\d+)', expand=False)
df['B'] = df['B'].str.extract('(\d+)', expand=False)
print (df)
A B
0 10 20
1 20 NaN
2 NaN 30
3 40 40
df1 = df.fillna(0).astype(int)
print (df1)
A B
0 10 20
1 20 0
2 0 30
3 40 40
print (df1.dtypes)
A int32
B int32
dtype: object
From pandas >= 0.24 there is a built-in pandas nullable integer dtype.
This dtype does allow integer NaNs, so you don't need to fill the NAs.
Notice the capital 'I' in 'Int64' in the code below.
This is the pandas integer, instead of the numpy integer.
You need to use: .astype('Int64')
So, do this:
df['A'] = df['A'].str.extract('(\d+)', expand=False).astype('float').astype('Int64')
df['B'] = df['B'].str.extract('(\d+)', expand=False).astype('float').astype('Int64')
More info on pandas integer na values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions

Condition in Pandas

I have a very peculiar problem in Pandas: one condition works but the other does not. You may download the linked file to test my code. Thanks!
I have a file (stars.txt) that I read in with Pandas. I would like to create two groups: (1) with Log_G < 4.0 and (2) with Log_G > 4.0. In my code (see below) I can successfully get rows for group (1):
Kepler_ID RA Dec Teff Log_G g H
3 2305372 19 27 57.679 +37 40 21.90 5664 3.974 14.341 12.201
14 2708156 19 21 08.906 +37 56 11.44 11061 3.717 10.672 10.525
19 2997455 19 32 31.296 +38 07 40.04 4795 3.167 14.694 11.500
34 3352751 19 36 17.249 +38 25 36.91 7909 3.791 13.541 12.304
36 3440230 19 21 53.100 +38 31 42.82 7869 3.657 13.706 12.486
But for some reason I cannot get (2). The code returns the following as an error:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 108
Data columns (total 7 columns):
Kepler_ID 90 non-null values
RA 90 non-null values
Dec 90 non-null values
Teff 90 non-null values
Log_G 90 non-null values
g 90 non-null values
H 90 non-null values
dtypes: float64(4), int64(1), object(2)
Here's my code:
#------------------------------------------
# IMPORT STATEMENTS
#------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#------------------------------------------
# READ FILE AND ASSOCIATE COMPONENTS
#------------------------------------------
star_file = 'stars.txt'
header_row = ['Kepler_ID', 'RA','Dec','Teff', 'Log_G', 'g', 'H']
df = pd.read_csv(star_file, names=header_row, skiprows=2)
#------------------------------------------
# ASSOCIATE VARIABLES
#------------------------------------------
Kepler_ID = df['Kepler_ID']
#RA = df['RA']
#Dec = df['Dec']
Teff = df['Teff']
Log_G = df['Log_G']
g = df['g']
H = df['H']
#------------------------------------------
# SUBSTITUTE MISSING DATA WITH NAN
#------------------------------------------
df = df.replace('', np.nan)
#------------------------------------------
# CHANGE DATA TYPE OF THE REST OF DATA TO FLOAT
#------------------------------------------
df[['Teff', 'Log_G', 'g', 'H']] = df[['Teff', 'Log_G', 'g', 'H']].astype(float)
#------------------------------------------
# SORTING SPECTRA TYPES FOR GIANTS
#------------------------------------------
# FIND GIANTS IN THE SAMPLE
giants = df[(df['Log_G'] < 4.)]
#print(giants)
# FIND DWARFS IN THE SAMPLE
dwarfs = df[(df['Log_G'] > 4.)]
print(dwarfs)
This is not an error. You are seeing a summarized view of the DataFrame:
In [11]: df = pd.DataFrame([[2, 1], [3, 4]])
In [12]: df
Out[12]:
0 1
0 2 1
1 3 4
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
0 2 non-null values
1 2 non-null values
dtypes: int64(2)
What is displayed is decided by several display options, for example, max_rows:
In [14]: pd.options.display.max_rows
Out[14]: 60
In [15]: pd.options.display.max_rows = 120
In 0.13, this behaviour changed, so you will see the first max_rows rows followed by ....
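As a small sketch of that truncated view (behaviour from pandas 0.13 onward; the exact repr varies by version):

import pandas as pd

pd.set_option('display.max_rows', 6)
df = pd.DataFrame({'x': range(100)})
print(df)  # truncated: a few rows from the top and bottom, with '...' in between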