Pandas doesn't split EIA API data into two different columns for easy access

I am importing EIA data which contains weekly storage data. The first column is the reported week and the second is storage.
When I import the data it shows two columns: the first column has no title, and the second has the title "Weekly Lower 48 States Natural Gas Working Underground Storage, Weekly (Billion Cubic Feet)".
I would like to plot the data using matplotlib, but I need to separate the columns first. I used df.iloc[100:,:0], which gives the first column (the week), but I somehow cannot separate the second column.
import eia
import pandas as pd
import os
api_key = "mykey"
api = eia.API(api_key)
series_search = api.data_by_series(series='NG.NW2_EPG0_SWO_R48_BCF.W')
df = pd.DataFrame(series_search)
df1 = df.iloc[100:,:0]
Code Output
This output is a sample of all 486 rows. When I use the df.shape command it shows (486, 1) when it should show (486, 2).
2010 0101 01 3117
2010 0108 08 2850
2010 0115 15 2607
2010 0122 22 2521
...
2019 0322 22 1107
2019 0329 29 1130
2019 0405 05 1155
2019 0412 12 1247
2019 0419 19 1339

You can first cut the last 3 characters of the string and then convert it to datetime:
df['Date'] = pd.to_datetime(df['Date'].str[:-3], format='%Y %m%d')
print(df)
Date Value
0 2010-01-01 3117
1 2010-01-08 2850
2 2010-01-15 2607
3 2010-01-22 2521
4 2019-03-22 1107
5 2019-03-29 1130
6 2019-04-05 1155
7 2019-04-12 1247
8 2019-04-19 1339
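Note that with the eia package the dates arrive as the DataFrame index rather than a 'Date' column, so you may need to move them out of the index first. A fuller sketch of the clean-up plus the matplotlib plot the question asked for (the column names 'Date' and 'Storage' are my own, not from the thread):
import matplotlib.pyplot as plt

# Move the date strings out of the index into a regular column.
df = df.reset_index()
df.columns = ['Date', 'Storage']

# Drop the trailing duplicated day digits, then parse the remainder.
df['Date'] = pd.to_datetime(df['Date'].str[:-3], format='%Y %m%d')

plt.plot(df['Date'], df['Storage'])
plt.ylabel('Working gas in storage (Bcf)')
plt.show()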

Related

How to plot pie charts separately according to their rows using a pandas dataframe

I would like to create pie charts according to their respective rows, such that each pie chart contains the 3 different columns for its respective year.
I managed to create the pie charts, but they are all squeezed together in one graph. How can I separate them?
this is my dataset:
sector year Total in Practice (OT) Total in Practice (SLP) Total in Practice (SLP)
0 2014 123 400 123
1 2015 234 456 123
2 2016 345 484 345
3 2017 345 539 566
4 2018 453 565 123
5 2019 454 598 234
6 2020 453 626 243
7 2021 755 682 243
this is my code:
df_all.T.plot.pie(df_all,subplots=True, figsize=(10, 3))
and this is how my plot ends up (all the pies squeezed into one figure).
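One way to get one pie per row (a sketch, not from the original thread): make year the index and transpose, so each year becomes a column, then subplots=True draws one pie per column. The third column name below is a guess, since the question lists "Total in Practice (SLP)" twice.
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data; '(AUD)' is a hypothetical name for the third column.
df_all = pd.DataFrame({
    'year': [2014, 2015, 2016],
    'Total in Practice (OT)': [123, 234, 345],
    'Total in Practice (SLP)': [400, 456, 484],
    'Total in Practice (AUD)': [123, 123, 345],
})

# Transpose so each year is a column; subplots=True then draws
# one pie per column, i.e. one pie per original row.
df_all.set_index('year').T.plot.pie(subplots=True, figsize=(12, 4), legend=False)
plt.tight_layout()
plt.show()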

How to calculate the slope of a dataframe, up to a specific row number?

I have this data frame that looks like this:
PE CE time
0 362.30 304.70 09:42
1 365.30 303.60 09:43
2 367.20 302.30 09:44
3 360.30 309.80 09:45
4 356.70 310.25 09:46
5 355.30 311.70 09:47
6 354.40 312.98 09:48
7 350.80 316.70 09:49
8 349.10 318.95 09:50
9 350.05 317.45 09:51
10 352.05 315.95 09:52
11 350.25 316.65 09:53
12 348.63 318.35 09:54
13 349.05 315.95 09:55
14 345.65 320.15 09:56
15 346.85 319.95 09:57
16 348.55 317.20 09:58
17 349.55 316.26 09:59
18 348.25 317.10 10:00
19 347.30 318.50 10:01
In this data frame, I would like to calculate the slope of both the first and second columns separately, over the time period starting from the first entry (09:42 in this case, but the start is not fixed and can vary) up to the time 12:00.
Please help me write it.
Computing the slope can be accomplished by use of the equation:
Slope = Rise/Run
Given you want to compute the slope between two time entries, all you need to do is find:
the Run = the timedelta between the start and end times
the Rise = the difference between the cell entries at the start and end
The tricky part of these calculations is making sure you properly handle the time functions:
import pandas as pd
from datetime import datetime
Thus you can define a function:
def computeSelectedSlope(df: pd.DataFrame, start: str, end: str, timecol: str, datacol: str) -> float:
    assert timecol in df.columns   # prove timecol exists
    assert datacol in df.columns   # prove datacol exists
    t_start = datetime.strptime(start, '%H:%M:%S').time()
    t_end = datetime.strptime(end, '%H:%M:%S').time()
    # Rise: difference between the data values at the end and start times
    rise = (df[datacol][df[timecol] == t_end].values[0] -
            df[datacol][df[timecol] == t_start].values[0])
    # Run: difference between the row positions of the end and start times
    run = (int(df.index[df[timecol] == t_end].values[0]) -
           int(df.index[df[timecol] == t_start].values[0]))
    return rise / run
Now given a dataframe df of the form:
A B T
0 2.632 231.229 00:00:00
1 2.732 239.026 00:01:00
2 2.748 251.310 00:02:00
3 3.018 285.330 00:03:00
4 3.090 308.925 00:04:00
5 3.366 312.702 00:05:00
6 3.369 326.912 00:06:00
7 3.562 330.703 00:07:00
8 3.590 379.575 00:08:00
9 3.867 422.262 00:09:00
10 4.030 428.148 00:10:00
11 4.210 442.521 00:11:00
12 4.266 443.631 00:12:00
13 4.335 444.991 00:13:00
14 4.380 453.531 00:14:00
15 4.402 462.531 00:15:00
16 4.499 464.170 00:16:00
17 4.553 471.770 00:17:00
18 4.572 495.285 00:18:00
19 4.665 513.009 00:19:00
You can find the slope for any time difference by:
computeSelectedSlope(df, '00:01:00', '00:15:00', 'T', 'B')
Which yields 15.964642857142858
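For the asker's frame, where the times look like 09:42, the only change needed is the format string; an untested adaptation:
# Assumes the 'time' column holds datetime.time values and the format
# string inside the function is changed from '%H:%M:%S' to '%H:%M'.
slope_pe = computeSelectedSlope(df, '09:42', '12:00', 'time', 'PE')
slope_ce = computeSelectedSlope(df, '09:42', '12:00', 'time', 'CE')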

Pandas Merge DataFrame based on Two Columns

I have two DataFrames that I am trying to merge to create a choropleth plot. A small subsection of each data frame is shown below:
DataFrame 1:
COUNTYFP TRACTCE
7 023 960100
8 023 960200
9 023 960300
52 024 960300
5 024 960402
4 031 960403
3 031 960404
6 031 960405
DataFrame 2:
county tract percent
1640 23 960100 16.3562
1643 23 960200 15.6140
1646 23 960300 25.7558
1649 24 960300 40.3279
1652 24 960402 37.9966
1655 31 960403 34.1127
1658 31 960404 26.5466
1661 31 960405 29.2962
What I am trying to do here is merge these two DataFrames so that the percent column from DF2 is added to the end of DF1 for the corresponding rows.
Two things to note, however:
I need to merge the df by two columns. There is a duplicate value for tract (960300), so the frames need to be merged on both the correct county and the correct tract.
The county is in a different numerical format across the two data frames (023 in one and 23 in the other).
The desired output:
COUNTYFP TRACTCE percent
7 023 960100 16.3562
8 023 960200 15.6140
9 023 960300 ...
52 024 960300 ...
5 024 960402 ...
4 031 960403 ...
3 031 960404 ...
6 031 960405 ...
I cannot just merge on tract because 960300 appears twice. Similarly, I cannot just merge on county, as 23 appears multiple times. Therefore, I need to combine the two by using two different columns, and I am a bit unsure how to do this.
My thoughts are along the lines of:
merged_df = df1.set_index(['COUNTYFP', 'TRACTCE']).join(df2.set_index(['county', 'tract']))
I am not sure if this will work, though. Is this the correct approach? Also, how do I deal with the different numerical representations of the county value (023 vs 23) across both dfs?
Any thoughts, code, or links to examples/docs that you find helpful would be greatly appreciated.
Thanks!
Convert df1.COUNTYFP to an integer to make the representations the same; 023 suggests that the column has a string type.
df1.COUNTYFP = df1.COUNTYFP.astype('int')
Then use df1.merge(df2, ...), specifying a list of columns in the left_on and right_on arguments.
df1.merge(df2, left_on=['COUNTYFP', 'TRACTCE'], right_on=['county', 'tract'], how='left')
# outputs:
   COUNTYFP TRACTCE  county   tract  percent
0        23  960100      23  960100  16.3562
1        23  960200      23  960200  15.6140
2        23  960300      23  960300  25.7558
3        24  960300      24  960300  40.3279
4        24  960402      24  960402  37.9966
5        31  960403      31  960403  34.1127
6        31  960404      31  960404  26.5466
7        31  960405      31  960405  29.2962
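Alternatively, if you would rather keep the zero-padded strings (FIPS codes are usually best left as text), you could pad df2 to match instead. A sketch, assuming TRACTCE is also stored as a string:
# Pad df2.county to three digits instead of stripping zeros from df1:
df2['county'] = df2['county'].astype(str).str.zfill(3)
df2['tract'] = df2['tract'].astype(str)  # assumption: TRACTCE is a string too

merged = df1.merge(df2, left_on=['COUNTYFP', 'TRACTCE'],
                   right_on=['county', 'tract'], how='left')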

Pandas adding row to categorical index

I have a scenario where I would like to group my dataset by personally defined week indexes, average each group, and then aggregate the averages into a "Total" row. I am able to achieve the first half of this, but when I try to append/insert a new "Total" row that sums these rows I receive error messages.
I attempted to create this row via two different methods:
Method 1:
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()
TypeError: cannot append a non-category item to a CategoricalIndex
Method 2:
week_index_avg_unit.index.insert(['Total'], week_index_avg_unit.sum())
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I have used the first approach in this scenario multiple times, but this is the first time I'm cutting the data into multiple categories, and I can clearly see that the CategoricalIndex type is the problem.
Here is the format of my data:
date organic ppc oa other content_partnership total \
0 2018-01-01 379 251 197 51 0 878
1 2018-01-02 880 527 405 217 0 2029
2 2018-01-03 859 589 403 323 0 2174
3 2018-01-04 835 533 409 335 0 2112
4 2018-01-05 760 449 355 272 0 1836
year_month day weekday weekday_name week_index
0 2018-01 1 0 Monday Week 1
1 2018-01 2 1 Tuesday Week 1
2 2018-01 3 2 Wednesday Week 1
3 2018-01 4 3 Thursday Week 1
4 2018-01 5 4 Friday Week 1
Here is the code:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
historicals = pd.read_csv("2018-2019_plants.csv")
# Capture dates for additional date columns
date_col = pd.to_datetime(historicals['date'])
historicals['year_month'] = date_col.dt.strftime("%Y-%m")
historicals['day'] = date_col.dt.day
historicals['weekday'] = date_col.dt.dayofweek
historicals['weekday_name'] = date_col.dt.day_name()
# create week ranges segment (7 day range)
historicals['week_index'] = pd.cut(historicals['day'],[0,7,14,21,28,32], labels=['Week 1','Week 2','Week 3','Week 4','Week 5'])
# Week Index Average (Units)
# (df_monthly_average is a list of the numeric column names, defined elsewhere)
week_index_avg_unit = historicals[df_monthly_average].groupby(['week_index']).mean().astype(int)
type(week_index_avg_unit.index)
pandas.core.indexes.category.CategoricalIndex
Here is the week_index_avg_unit table:
organic ppc oa other content_partnership total day weekday
week_index
Week 1 755 361 505 405 22 2027 4 3
Week 2 787 360 473 337 19 1959 11 3
Week 3 781 382 490 352 18 2006 18 3
...
pd.CategoricalIndex is a special animal. It is immutable, so to do the trick you may need to use something like pd.CategoricalIndex.set_categories to add a new category.
See pandas docs: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.CategoricalIndex.html
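For example, a minimal sketch using CategoricalIndex.add_categories (a close relative of set_categories) to make Method 1 work:
# Register 'Total' as a valid category first; then the .loc
# assignment from Method 1 no longer raises a TypeError.
week_index_avg_unit.index = week_index_avg_unit.index.add_categories(['Total'])
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()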

groupby pandas dataframe, take difference between value of latest and earliest date

I have a Cumulative column and I want to groupby index and take the values corresponding to the latest date minus the values corresponding to the earliest date.
Very similar to this: group by pandas dataframe and select latest in each group
But take the difference between latest and earliest in each group.
I'm a python rookie, and here is my solution:
import pandas as pd
from io import StringIO
csv = StringIO("""index id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01""")
df = pd.read_table(csv, sep=r'\s+', index_col='index')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_sort = df.sort_values('date')
df_sort.drop(['product'], axis=1, inplace=True)
# latest row minus earliest row within each id
df_sort.groupby('id').tail(1).set_index('id') - df_sort.groupby('id').head(1).set_index('id')
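For what it's worth, the same result can be had in a single groupby pass (a sketch; substitute the actual Cumulative column for 'date' if that is the value of interest):
# Last value minus first value per id, after sorting by date:
out = (df.sort_values('date')
         .groupby('id')['date']
         .agg(lambda s: s.iloc[-1] - s.iloc[0]))
print(out)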