What would be the best way to parse the above Excel file into a Pandas DataFrame? The idea is to be able to update the data easily: adding columns, dropping rows. For example, for every origin I would like to keep only output3, and then divide every year column (2000, ..., 2013) by 2 given a condition (say value > 6000).
Below is what I tried: first parsing, then dropping the unnecessary rows. It is not satisfactory, as I had to rename the columns manually, so this doesn't look like an optimal solution. Any better idea?
df = pd.read_excel("myExcel.xlsx", skiprows=2, sheet_name='1')
cols1 = list(df.columns)
cols1 = [str(x)[:4] for x in cols1]
cols2 = list(df.iloc[0,:])
cols2 = [str(x) for x in cols2]
cols = [x + "_" + y for x,y in zip(cols1,cols2)]
df.columns = cols
df = df.drop(["Unna_nan"], axis=1).rename(
    columns={'Time_Origine': 'Country', 'Unna_Output': 'Series', 'Unna_Ccy': 'Unit',
             **{f'{year}_nan': str(year) for year in range(2000, 2018)}})
df.drop(0,inplace=True)
df.drop(df.tail(1).index, inplace=True)
idx = ['Country', 'Series', 'Unit']
df = df.set_index(idx)
df = df.query('Series == "Output3"')
Without having an Excel file like this, I think something like the following might work.
To keep only the output3 rows, you can use:
df = pd.read_excel("myExcel.xlsx", skiprows=2, sheet_name='1')
df = df.loc[df['Output'] == 'output3']
Now, to divide every cell by 2 if its value is greater than 6000, using pandas apply:
def foo(bar):
    if bar > 6000:
        return bar / 2
    return bar

for col in df.columns:
    try:
        int(col)  # check whether this column is a year
        df[col] = df[col].apply(foo)
    except ValueError:
        pass
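The per-cell apply above can also be done in one vectorized step with a boolean mask, which is usually faster on wide frames. A minimal sketch on a made-up frame standing in for the parsed sheet (the `Output` column and values are assumptions):

```python
import pandas as pd

# hypothetical stand-in for the parsed sheet: one label column plus year columns
df = pd.DataFrame({"Output": ["output3"], "2000": [7000], "2001": [5000]})

# columns whose names look like years
year_cols = [c for c in df.columns if str(c).isdigit()]

# halve every value above 6000 in a single vectorized operation
df[year_cols] = df[year_cols].mask(df[year_cols] > 6000, df[year_cols] / 2)
```

`mask` replaces the cells where the condition holds and leaves the rest untouched, so no Python-level loop is needed.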
#read the first 2 header rows into a MultiIndex and skip the last row
import pandas as pd
import numpy as np

df = pd.read_excel("Excel1.xlsx", skiprows=2, header=[0,1], skipfooter=1)
print (df)
#create helper DataFrame
cols = df.columns.to_frame().reset_index(drop=True)
cols.columns=['a','b']
cols['a'] = pd.to_numeric(cols['a'], errors='ignore')
cols['b'] = cols['b'].replace('Unit.1','tmp', regex=False)
#create new column by condition
cols['c'] = np.where(cols['b'].str.startswith('Unnamed'), cols['a'], cols['b'])
print (cols)
a b c
0 Time Country Country
1 Time Series Series
2 Time Unit Unit
3 Time tmp tmp
4 2000 Unnamed: 4_level_1 2000
5 2001 Unnamed: 5_level_1 2001
6 2002 Unnamed: 6_level_1 2002
7 2003 Unnamed: 7_level_1 2003
8 2004 Unnamed: 8_level_1 2004
9 2005 Unnamed: 9_level_1 2005
10 2006 Unnamed: 10_level_1 2006
11 2007 Unnamed: 11_level_1 2007
12 2008 Unnamed: 12_level_1 2008
13 2009 Unnamed: 13_level_1 2009
14 2010 Unnamed: 14_level_1 2010
15 2011 Unnamed: 15_level_1 2011
16 2012 Unnamed: 16_level_1 2012
17 2013 Unnamed: 17_level_1 2013
18 2014 Unnamed: 18_level_1 2014
19 2015 Unnamed: 19_level_1 2015
20 2016 Unnamed: 20_level_1 2016
21 2017 Unnamed: 21_level_1 2017
#overwrite columns by column c
df.columns = cols['c'].tolist()
#forward-fill the missing Country values
df['Country'] = df['Country'].ffill()
df = df.drop('tmp', axis=1).set_index(['Country','Series','Unit'])
print (df)
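With the years as the only remaining columns, the asker's second step (halving values above 6000) no longer needs a loop at all. A sketch on a stand-in frame, since the real Excel file isn't available (country codes and values are made up):

```python
import pandas as pd

# stand-in for the cleaned frame whose remaining columns are all years
df = pd.DataFrame({2000: [6500, 1200], 2001: [300, 9000]},
                  index=pd.Index(["FR", "DE"], name="Country"))

# halve every cell greater than 6000
df = df.mask(df > 6000, df / 2)
```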
Related
Say I have a DataFrame:
foo bar
0 1998 abc
1 1999 xyz
2 2000 123
What is the best way to combine the first two (or any n) characters of foo with the last two of bar to make a third column containing 19bc, 19yz, 2023? Normally, to combine columns I'd simply do
df['foobar'] = df['foo'] + df['bar']
but I believe I can't do slicing on these objects.
If you convert your column to string with astype, you can use the str accessor and slice the values:
df['foobar'] = df['foo'].astype(str).str[:2] + df['bar'].str[-2:]
print(df)
# Output
foo bar foobar
0 1998 abc 19bc
1 1999 xyz 19yz
2 2000 123 2023
Using astype(str) is unnecessary if your column already has object dtype, as the bar column does.
You can read: Working with text data
Example:
import pandas as pd
df = pd.DataFrame({"foo": [1998, 1999, 2000], "bar": ["abc", "xyz", "123"]})
df["foobar"] = df.apply(lambda x: str(x["foo"])[:2] + x["bar"][-2:], axis=1)
print(df)
# foo bar foobar
# 0 1998 abc 19bc
# 1 1999 xyz 19yz
# 2 2000 123 2023
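The same slicing generalizes to any n and m via the str accessor; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"foo": [1998, 1999, 2000], "bar": ["abc", "xyz", "123"]})

n, m = 2, 2  # first n characters of foo, last m characters of bar
df["foobar"] = df["foo"].astype(str).str[:n] + df["bar"].str[-m:]
```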
I am trying to subset a large dataframe (5000+ rows and 15 columns) based on unique values from two columns (both are dtype = object). I want to exclude rows of data that meet the following criteria:
A column called 'Record' equals "MO" AND a column called 'Year' equals "2017" or "2018".
Here is an example of the dataframe:
df = pd.DataFrame({'A': [1001,2002,3003,4004,5005,6006,7007,8008,9009], 'Record' : ['MO','MO','I','I','MO','I','MO','I','I'], 'Year':[2017,2019,2018,2020,2018,2018,2020,2019,2017]})
print(df)
A Record Year
0 1001 MO 2017
1 2002 MO 2019
2 3003 I 2018
3 4004 I 2020
4 5005 MO 2018
5 6006 I 2018
6 7007 MO 2020
7 8008 I 2019
8 9009 I 2017
I would like any row with both "MO" and "2017", as well as both "MO" and "2018" taken out of the dataframe.
Example where the right rows (0 and 4 in dataframe above) are deleted:
df = pd.DataFrame({'A': [2002,3003,4004,6006,7007,8008,9009], 'Record' : ['MO','I','I','I','MO','I','I'], 'Year':[2019,2018,2020,2018,2020,2019,2017]})
print(df)
A Record Year
0 2002 MO 2019
1 3003 I 2018
2 4004 I 2020
3 6006 I 2018
4 7007 MO 2020
5 8008 I 2019
6 9009 I 2017
I have tried the following code, but it does not work (I first tried with just one year):
df = df[(df['Record'] != "MO" & df['Year'] != "2017")]
I believe you're just missing some parentheses.
df = df[(df['Record'] != "MO") & (df['Year'] != "2017")]
Edit:
After some clarification:
df = df[~((df['Record'] == 'MO') &
          ((df['Year'] == 2017) | (df['Year'] == 2018)))]
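Since & binds tighter than |, the year test needs its own parentheses, otherwise the filter drops every 2018 row, not just the MO ones. A self-contained version using isin, with Year as integers as in the sample frame (the three rows here are a trimmed-down subset):

```python
import pandas as pd

df = pd.DataFrame({'A': [1001, 2002, 5005],
                   'Record': ['MO', 'MO', 'MO'],
                   'Year': [2017, 2019, 2018]})

# drop rows where Record is "MO" AND Year is 2017 or 2018
out = df[~((df['Record'] == 'MO') & (df['Year'].isin([2017, 2018])))]
```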
I have the following DataFrame:
df = pd.DataFrame({'Idenitiy': ['Haus1', 'Haus2', 'Haus1','Haus2'],
'kind': ['Gas', 'Gas', 'Strom','Strom'],
'2005':[2,3,5,6],
'2006':[2,3.5,5.5,7]})
Now I would like to have the following DataFrame as output, containing the product over the entities:
Year  Product(Gas)  Product(Strom)
2005             6            30.0
2006             7            38.5
Thank you.
Here's a way to do:
# multiply column values
from functools import reduce
def mult(f):
    v = [reduce(lambda a, b: a * b, f['2005']),
         reduce(lambda a, b: a * b, f['2006'])]
    return pd.Series(v, index=['2005', '2006'])
# groupby and multiply column values
df1 = df.groupby('kind')[['2005','2006']].apply(mult).unstack().reset_index()
df1.columns = ['Year','Kind','vals']
print(df1)
Year Kind vals
0 2005 Gas 6.0
1 2005 Strom 30.0
2 2006 Gas 7.0
3 2006 Strom 38.5
# reshape the table
df1 = (df1
.pivot_table(index='Year', columns=['Kind'], values='vals'))
# fix column names
df1 = df1.add_prefix('Product_')
df1.columns.name = None
df1 = df1.reset_index()
Year Product_Gas Product_Strom
0 2005 6.0 30.0
1 2006 7.0 38.5
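The reduce helper can be replaced by pandas' built-in product aggregation, which gives the same table in fewer steps; a sketch with the same frame:

```python
import pandas as pd

df = pd.DataFrame({'Idenitiy': ['Haus1', 'Haus2', 'Haus1', 'Haus2'],
                   'kind': ['Gas', 'Gas', 'Strom', 'Strom'],
                   '2005': [2, 3, 5, 6],
                   '2006': [2, 3.5, 5.5, 7]})

# product of each year's values per kind, then flip years into rows
out = (df.groupby('kind')[['2005', '2006']].prod()
         .T.add_prefix('Product_')
         .rename_axis('Year').reset_index())
```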
I extracted the following data from a dataframe .
https://i.imgur.com/rCLfV83.jpg
The question is: how do I plot a graph, probably a histogram type, where the horizontal axis is the hours as bins [16:00, 17:00, 18:00, ..., 24:00] and the bars are the average rainfall during each of those hours?
I just don't know enough pandas yet to get this off the ground, so I need some help. Sample data below, as requested.
Date        Hours  Precip
1996-07-30 21 1
1996-08-17 16 1
18 1
1996-08-30 16 1
17 1
19 5
22 1
1996-09-30 19 5
20 5
1996-10-06 20 1
21 1
1996-10-19 18 4
1996-10-30 19 1
1996-11-05 20 3
1996-11-16 16 1
19 1
1996-11-17 16 1
1996-11-29 16 1
1996-12-04 16 9
17 27
19 1
1996-12-12 19 1
1996-12-30 19 10
22 1
1997-01-18 20 1
It seems df is a multi-index DataFrame after a groupby.
Transform the index to a DatetimeIndex
date_hour_idx = df.reset_index()[['Date', 'Hours']] \
.apply(lambda x: '{} {}:00'.format(x['Date'], x['Hours']), axis=1)
precip_series = df.reset_index()['Precip']
precip_series.index = pd.to_datetime(date_hour_idx)
Resample to hours using 'H'
# This will show NaN for hours without an entry
resampled_nan = precip_series.resample('H').asfreq()
# This will fill NaN with 0s
resampled_fillna = precip_series.resample('H').asfreq().fillna(0)
If you want this to be the mean per hour, change your groupby(...).sum() to groupby(...).mean()
You can resample to other intervals too -> pandas resample documentation
More about resampling the DatetimeIndex -> https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
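If what's wanted is the average per hour of day rather than a continuous hourly series, grouping on the Hours level alone is enough. A sketch on a few made-up rows mirroring the sample layout (Date omitted for brevity):

```python
import pandas as pd

# made-up rows in the Hours/Precip shape of the sample data
df = pd.DataFrame({'Hours': [16, 16, 17, 19, 19],
                   'Precip': [1, 9, 27, 5, 1]})

# mean rainfall for each hour of the day
hourly_mean = df.groupby('Hours')['Precip'].mean()

# hourly_mean.plot.bar()  # bar chart: hours on the x axis (needs matplotlib)
```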
It seems easy once you have data.
I generate artificial data with Pandas for this example:
import pandas as pd
import radar
import random
'''>>> date'''
r2 =()
for a in range(1, 51):
    t = (str(radar.random_datetime(start='1985-05-01', stop='1985-05-04')),)
    r2 = r2 + t
r3 =list(r2)
r3.sort()
#print(r3)
'''>>> variable'''
x = [random.randint(0,16) for x in range(50)]
df= pd.DataFrame({'date': r3, 'measurement': x})
print(df)
'''order'''
col1 = df.join(df['date'].str.partition(' ')[[0,2]]).rename({0: 'daty', 2: 'godziny'}, axis=1)
col2 = df['measurement'].rename('pomiary')
p3 = pd.concat([col1, col2], axis=1, sort=False)
p3 = p3.drop(['measurement'], axis=1)
p3 = p3.drop(['date'], axis=1)
Time for the mean and plot:
dx = p3.groupby(['daty']).mean()
print(dx)
import matplotlib.pyplot as plt
dx.plot.bar()
plt.show()
Plot of the mean measurements
With Python 2.7 and notebook, I am trying to display a simple Series that looks like:
year
2014 1
2015 3
2016 2
Name: mySeries, dtype: int64
I would like to:
Name the second column; I can't seem to succeed with s.columns = ['a','b']. How do we do this?
Plot the result where the years are written as such. When I run s.plot() I get the years as x, which is good but I get weird values:
Thanks for the help!
If it helps, this series comes from the following code:
df = pd.read_csv(file, usecols=['dates'], parse_dates=True)
df['dates'] = pd.to_datetime(df['dates'])
df['year'] = pd.DatetimeIndex(df['dates']).year
df
which gives me:
dates year
0 2015-05-05 14:21:00 2015
1 2014-06-06 14:22:00 2014
2 2015-05-05 14:14:00 2015
On which I do:
s = pd.Series(df.groupby('year').size())
For me, casting the index to string with astype works:
print s
y
2014 1
2015 3
2016 2
Name: mySeries, dtype: int64
s.index = s.index.astype(str)
s.plot()
Just cast your index first:
df.set_index(df.index.astype(str),inplace=True)
then you will have what you expect.
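For the unanswered part of the question (naming the second column): a Series has a single name, not columns, which is why s.columns fails; converting it to a DataFrame gives the counts a label. A sketch (the column name 'count' is an arbitrary choice):

```python
import pandas as pd

s = pd.Series([1, 3, 2], index=['2014', '2015', '2016'], name='mySeries')
s.index.name = 'year'

# turn the Series into a two-column DataFrame with a named value column
df = s.reset_index(name='count')
```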