Update a Pandas MultiIndex DataFrame - pandas

The dataframe "data" has a MultiIndex.
data.head()
                     Close     High      Low     Open    Volume
Symbol Date
A      1999-11-18  28.6358  33.5207  27.3725  30.6572  59753154
       1999-11-19  27.2040  28.9727  26.8253  28.9323  16172993
       1999-11-22  29.3517  29.3517  26.9935  27.8357   5435127
       1999-11-23  27.1198  28.8885  27.1198  28.6358   5035889
       1999-11-24  27.6676  28.2571  26.9513  27.0389   5141708
The dictionary f has a key 'AAPL' which is a regular DataFrame.
f['AAPL'].head()
Open High Low Close Volume
Date
2018-06-11 191.350 191.970 190.21 191.23 18308460
2018-06-12 191.385 192.611 191.15 192.28 16911141
2018-06-13 192.420 192.880 190.44 190.70 21638393
2018-06-14 191.550 191.570 190.22 190.80 21610074
2018-06-15 190.030 190.160 188.26 188.84 61719160
I'd like to append to data['AAPL'] so that it has the data from f['AAPL']. This works, but is not inplace:
data.loc['AAPL'].append(f['AAPL'], verify_integrity=True).tail()
Close High Low Open Volume
Date
2018-07-30 189.91 192.20 189.0700 191.90 21029535
2018-07-31 190.29 192.14 189.3400 190.30 39373038
2018-08-01 201.50 201.76 197.3100 199.13 67935716
2018-08-02 207.39 208.38 200.3500 200.58 62404012
2018-08-03 207.99 208.74 205.4803 207.03 33447396
When I try to update data, I get all NaNs.
data.loc['AAPL'] = data.loc['AAPL'].append(f['AAPL'], verify_integrity=True).tail()
Close High Low Open Volume
Date
2018-06-04 NaN NaN NaN NaN NaN
2018-06-05 NaN NaN NaN NaN NaN
2018-06-06 NaN NaN NaN NaN NaN
2018-06-07 NaN NaN NaN NaN NaN
2018-06-08 NaN NaN NaN NaN NaN
Edit:
The "data" DataFrame was created with pandas-datareader:
import pandas_datareader.data as web
data = web.DataReader(['A','AAPL','F'], 'morningstar', start, end)
"f" was created the same way, but using 'iex' as the source instead of 'morningstar' (at the moment the morningstar source is returning 404s, so I switched to iex).

I still don't know why assigning to data.loc['AAPL'] doesn't work, but the following does:
# Converts dict with keys as tickers, DataFrame as values, to a DataFrame with a MultiIndex
new_data = pd.concat(f)
# Just append, and sort index to align the dates
data = data.append(new_data).sort_index()

Personal preference: I would first create a temp df with the data to append as a multi index dataframe.
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
And then create a new dataframe by appending the data and new data.
newdata = data.append(toappend, verify_integrity=True)
or if you prefer to do it in one line:
newdata = data.append(pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol']), verify_integrity=True)
My full test code is:
import pandas as pd
import numpy as np
symbols = ['AAA', 'BBB', 'CCC']
dates = ['2018-06-11', '2018-06-12', '2018-06-13']
cols = ['Close', 'High', 'Low']
midx = pd.MultiIndex.from_product([symbols, dates], names=['Symbol', 'Date'])
data= pd.DataFrame(10, midx, cols)
aapldf = pd.DataFrame(15, dates, cols)
aapldf.index.name = 'Date'
f = {'AAPL': aapldf}
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
newdata = data.append(toappend, verify_integrity=True)
print(newdata)
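DataFrame.append was removed in pandas 2.0, so the snippets above only run on older versions. Below is a minimal sketch of the same idea using pd.concat, assuming the same toy data as the test code (the variable name combined is made up for illustration, and the dates are parsed to timestamps here, which the original test code does not do):

```python
import pandas as pd

symbols = ['AAA', 'BBB', 'CCC']
dates = pd.to_datetime(['2018-06-11', '2018-06-12', '2018-06-13'])
cols = ['Close', 'High', 'Low']

# Existing MultiIndex frame, indexed by (Symbol, Date)
midx = pd.MultiIndex.from_product([symbols, dates], names=['Symbol', 'Date'])
data = pd.DataFrame(10, index=midx, columns=cols)

# New single-symbol frame, indexed by Date only
aapldf = pd.DataFrame(15, index=dates, columns=cols)
aapldf.index.name = 'Date'
f = {'AAPL': aapldf}

# Lift the plain frame to the same (Symbol, Date) index, then concatenate.
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
combined = pd.concat([data, toappend], verify_integrity=True).sort_index()
print(combined)
```

verify_integrity=True makes the concatenation raise if a (Symbol, Date) pair would appear twice, which is the same guard the append-based version used.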

Related

ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'

I get this error when I try to load a CSV file as a GeoDataFrame. According to the answers to other questions on this site, this method should do the trick.
Here is the code I am using to load the file as a gdf and then produce a subset dataframe with only some of the columns:
cp_union = gpd.read_file(r'C:\Users\User\Desktop\CPAWS\terrestrial_outputs\cp_union.csv')
cp_union.crs = 'epsg:3005'
cp_trimmed = cp_union[['COSEWIC_status','reason_for_designation','cnm_eng','iucn_cat','mgmt_e','status_e','classification','sq_m']]
As stated in the title, the error that I am receiving is: ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'. Is there some part of the process of saving a gdf as a CSV and then reloading it as a gdf that would cause the creation of an additional geometry column?
EDIT
In another script, I loaded the same csv file as a pd dataframe. Here is the first row of data within that pd dataframe.
Unnamed: 0 0
fid_critic 0
scntfc_nm Castilleja victoriae
cnm_eng Victoria's Owl-clover
cnm_fren Castilléjie de Victoria
cswc_pop NaN
ch_stat Final
cb_site_nm Cattle Point
ch_detail Detailed Polygon
ch_variant NaN
ch_method NaN
hectares 0.8559
utm_zone 10
utm_east 478383
utm_north 5365043
latitude 48.438164
longitude -123.29226
shape_1 0.0
objectid 10251681.0
area_sqm 8478.6733
feat_len 326.5008
fid_protec -1
name_e NaN
name_f NaN
aichi_t11 NaN
iucn_cat NaN
oecm NaN
o_area 0.0
loc_e NaN
loc_f NaN
type_e NaN
mgmt_e NaN
gov_type NaN
legisl_e NaN
status_e NaN
protdate 0
delisdate 0
owner_e NaN
owner_f NaN
subs_right NaN
comments NaN
url NaN
shape_leng 0.0
protected 0
shape_le_1 320.859687
shape_area 6499.790343
geometry POLYGON ((1200735.4438 384059.0133999996, 1200...
COSEWIC_status Endangered
reason_for_designation This small annual herb is confined to a very s...
sq_m 6499.790343
classification c
Name: 0, dtype: object
So my only theory is that when you save a gdf as a CSV, the CSV contains a column called geometry, and when you load that CSV as a gdf, geopandas tries to create a new geometry column on top of the one that was already in the CSV. I could be completely wrong about this. Even if this is the case, I'm not sure how to go about resolving the issue.
Thanks for the help!
I used your sample data to create a CSV. I had to replace the geometry, as the sample is not a valid WKT string.
I reproduced your error.
I solved it by loading the file with pandas and then converting to geopandas.
Solution:
df = pd.read_csv(f)
cp_union = gpd.GeoDataFrame(
    df.loc[:, [c for c in df.columns if c != "geometry"]],
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="epsg:3005",
)
full code
import pandas as pd
import geopandas as gpd
import io
from pathlib import Path
# fmt: off
df_q = pd.read_csv(io.StringIO("""Unnamed: 0 0
fid_critic 0
scntfc_nm Castilleja victoriae
cnm_eng Victoria's Owl-clover
cnm_fren Castilléjie de Victoria
cswc_pop NaN
ch_stat Final
cb_site_nm Cattle Point
ch_detail Detailed Polygon
ch_variant NaN
ch_method NaN
hectares 0.8559
utm_zone 10
utm_east 478383
utm_north 5365043
latitude 48.438164
longitude -123.29226
shape_1 0.0
objectid 10251681.0
area_sqm 8478.6733
feat_len 326.5008
fid_protec -1
name_e NaN
name_f NaN
aichi_t11 NaN
iucn_cat NaN
oecm NaN
o_area 0.0
loc_e NaN
loc_f NaN
type_e NaN
mgmt_e NaN
gov_type NaN
legisl_e NaN
status_e NaN
protdate 0
delisdate 0
owner_e NaN
owner_f NaN
subs_right NaN
comments NaN
url NaN
shape_leng 0.0
protected 0
shape_le_1 320.859687
shape_area 6499.790343
geometry POLYGON ((5769135.557632876 7083849.386658552, 5843426.213336911 7098018.122146672, 5852821.812968816 7081377.7312996285, 5914814.478616157 7091734.620966213, 5883751.009067913 7017032.330573363, 5902031.719573214 6983898.953064103, 5864452.659165712 6922039.030140929, 5829585.402576889 6878872.269967912, 5835906.522449658 6846685.714836724, 5800391.382286092 6827305.509709548, 5765261.646424723 6876008.057438379, 5765261.402301509 6876010.894933639, 5765264.431247815 6876008.786040769, 5760553.056402712 6927522.42488809, 5720896.599172597 6983360.181762057, 5755349.303491102 7039380.015177476, 5769135.557632876 7083849.386658552))
COSEWIC_status Endangered
reason_for_designation This small annual herb is confined to a very s...
sq_m 6499.790343
classification c"""), sep=r"\s\s+", engine="python", header=None).set_index(0).T
# fmt: on
# generate a CSV file from sample data
f = Path.cwd().joinpath("SO_q.csv")
df_q.to_csv(f, index=False)
# replicate issue...
try:
    gpd.read_file(f)
except ValueError as e:
    print(e)
# now the actual solution
df = pd.read_csv(f)
cp_union = gpd.GeoDataFrame(
    df.loc[:, [c for c in df.columns if c != "geometry"]],
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="epsg:3005",
)
GeoPandas automatically adds a geometry field when you read in your CSV.
As you already have a field called "geometry", GeoPandas raises an exception. GeoPandas doesn't read the WKT strings in the "geometry" field as geometry.
Some workarounds:
The GDAL/OGR CSV driver (which geopandas uses) supports the open option GEOM_POSSIBLE_NAMES, which lets you specify a different field name to look for geometry in.
gdf = gpd.read_file('/path/to/gdf.csv',
                    GEOM_POSSIBLE_NAMES="geometry",
                    KEEP_GEOM_COLUMNS="NO")
The KEEP_GEOM_COLUMNS option is also required; otherwise GDAL/OGR will return a "geometry" column as well as the geometry, and you still get the original ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'.
Alternatively, if you have control over the original writing of the CSV, the GDAL/OGR CSV driver documentation notes that if a field is named "WKT" it will be read as geometry:
When reading a field named “WKT” is assumed to contain WKT geometry, but also is treated as a regular field.
So one option is to write out the CSV using "WKT" as the geometry column name and then it can be read straight back in:
gdf = gdf.rename_geometry('WKT')
gdf.to_csv('/path/to/gdf.csv', index=False)
# Then in later scripts you can just read it straight back in
gdf = gpd.read_file('/path/to/gdf.csv')
A final option is to read the CSV in as a Pandas DataFrame (per @RobRaymond's answer), then manually create the geometry from the WKT and convert to a GeoDataFrame. Here is an alternative way of doing so:
import geopandas as gpd
df = gpd.read_file('/path/to/gdf.csv', ignore_geometry=True)
# Create geometry objects from WKT strings
df['geometry'] = gpd.GeoSeries.from_wkt(df['geometry'])
# Convert to GDF
gdf = gpd.GeoDataFrame(df)

Pandas: Merging multiple dataframes efficiently

I need to merge multiple dataframes, which I can do easily using the code below:
# Merge all the datasets together
df_prep1 = df_prep.merge(df1,on='e_id',how='left')
df_prep2 = df_prep1.merge(df2,on='e_id',how='left')
df_prep3 = df_prep2.merge(df3,on='e_id',how='left')
df_prep4 = df_prep3.merge(df_4,on='e_id',how='left')
df_prep5 = df_prep4.merge(df_5,on='e_id',how='left')
df_prep6 = df_prep5.merge(df_6,on='e_id',how='left')
But is there a more efficient way to perform this merge, maybe using a helper function? If so, how could I achieve that?
You can use reduce from the functools module to merge multiple dataframes:
from functools import reduce
# start from df_prep so that every left merge keeps its e_id values
dfs = [df_prep, df1, df2, df3, df_4, df_5, df_6]
out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
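A minimal, self-contained sketch of the reduce approach on toy data. The frames df_a and df_b and their columns are made up for illustration; only the e_id key and the left-merge pattern come from the question:

```python
from functools import reduce

import pandas as pd

# Base frame that defines which e_id values are kept
df_prep = pd.DataFrame({'e_id': [1, 2, 3]})
# Hypothetical frames to merge onto it
df_a = pd.DataFrame({'e_id': [1, 2], 'a': [10, 20]})
df_b = pd.DataFrame({'e_id': [2, 3], 'b': [0.5, 0.7]})

frames = [df_prep, df_a, df_b]

# reduce applies the merge pairwise, left to right:
# ((df_prep merge df_a) merge df_b)
merged = reduce(
    lambda left, right: left.merge(right, on='e_id', how='left'),
    frames,
)
print(merged)
```

This performs the same sequence of merges as the chained version, so it is mainly tidier rather than faster; the number of merge operations is unchanged.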
You can put all your dfs into a list, or produce them from a function, a loop, etc., and then have one main df that you merge everything onto.
You can start with an empty df and iterate through. In your case, since you are doing a left merge, it looks like your df_prep should already have all of the e_id values that you want. You'll need to decide what to do with any additional columns, e.g., you can have pandas add _x and _y after conflicting column names that you don't merge on, or rename them, etc. See this toy example:
main_df = pd.DataFrame({'e_id': [0, 1, 2, 3, 4]})
for x in range(3):
    dfx = pd.DataFrame({'e_id': [x], 'another_col' + str(x): [x * 10]})
    main_df = main_df.merge(dfx, on='e_id', how='left')
to get:
   e_id  another_col0  another_col1  another_col2
0     0           0.0           NaN           NaN
1     1           NaN          10.0           NaN
2     2           NaN           NaN          20.0
3     3           NaN           NaN           NaN
4     4           NaN           NaN           NaN

pandas: to_csv append mode with preserved columns order

I am using:
df.to_csv('file.csv', header=False, mode='a')
to write multiple pandas dataframes one by one to a CSV file.
I make sure that these dataframes have the same set of column names.
However, the columns seem to be written in a random order, so I end up with a chaotic CSV file.
How can I make sure that each new dataframe is written with the column order of the previous data?
Many thanks
I think you can sort each DataFrame by its columns, if the column names are the same in each one:
df.sort_index(axis=1).to_csv('file.csv', header=None, mode='a')
If the column names can differ, you can create a helper variable c and add the new columns to it, removing duplicates:
df1 = pd.DataFrame({'C':list('as'),
                    'B':[4,5],
                    'A':[7,8]})
df2 = pd.DataFrame({'D':list('as'),
                    'A':[4,5],
                    'C':[7,8]})
df3 = pd.DataFrame({'C':list('as'),
                    'B':[4,5],
                    'E':[7,8]})
c = df1.columns
# the first df is written the same way as the others, just without mode='a'
df1.to_csv('file.csv', header=None, index=False)
c = c.append(df2.columns).drop_duplicates()
df2.reindex(columns=c).to_csv('file.csv', header=None, mode='a', index=False)
c = c.append(df3.columns).drop_duplicates()
df3.reindex(columns=c).to_csv('file.csv', header=None, mode='a', index=False)
df = pd.read_csv('file.csv', header=None, names=c)
print (df)
   C    B    A    D    E
0  a  4.0  7.0  NaN  NaN
1  s  5.0  8.0  NaN  NaN
2  7  NaN  4.0    a  NaN
3  8  NaN  5.0    s  NaN
4  a  4.0  NaN  NaN  7.0
5  s  5.0  NaN  NaN  8.0
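When every dataframe has the same set of columns, another option is to fix the order explicitly with the columns argument of to_csv. A minimal sketch, assuming a reference list column_order and a temporary file path (both made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Assumed reference order; in practice this could be the first frame's columns
column_order = ['C', 'B', 'A']

df1 = pd.DataFrame({'C': ['a', 's'], 'B': [4, 5], 'A': [7, 8]})
# Same columns as df1, but in a different order
df2 = pd.DataFrame({'A': [1, 2], 'C': ['x', 'y'], 'B': [3, 4]})

path = os.path.join(tempfile.mkdtemp(), 'file.csv')

# First write creates the file and the header
df1.to_csv(path, columns=column_order, index=False)
# Later writes append rows in the same column order, without a header
df2.to_csv(path, columns=column_order, index=False, header=False, mode='a')

result = pd.read_csv(path)
print(result)
```

Unlike sort_index(axis=1), this keeps whatever order the reference list defines rather than forcing alphabetical order.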

What is the functionality of the filling method when reindexing?

When reindexing, say, 1-minute data to daily data (e.g. an index for daily prices at 16:00), if there is no 1-minute data for the 16:00 timestamp on a given day, we would want to forward fill from the last non-null 1-minute value. In the following case, there is no 1-minute data before 16:00 on the 13th, and the last 1-minute data comes from the 10th.
When using reindex with method='ffill', wouldn't one expect the following code to fill in the value on the 13th at 16:00? Inspecting daily1 shows that it is missing, however.
import pandas as pd
import numpy as np
hf_index = pd.date_range(start='2013-05-09 9:00', end='2013-05-13 23:59', freq='1min')
hf_prices = np.random.rand(len(hf_index))
hf = pd.DataFrame(hf_prices, index=hf_index)
hf.loc['2013-05-10 18:00':'2013-05-13 18:00', :] = np.nan  # .ix was removed in pandas 1.0
hf.plot()
ind_daily = pd.date_range(start='2013-05-09 16:00', end='2013-05-13 16:00', freq='B')
print(ind_daily.values)
daily1 = hf.reindex(index=ind_daily, method='ffill')
To fill as one (or rather I) would expect, I need to do this:
daily2 = daily1.ffill()
If this is the case, what is the fill method in reindex actually doing? It is not clear to me just from the pandas documentation. It seems to me I should not have to do the above line.
I am posting my comment from the GitHub issue here as well:
The current behavior, in my opinion, makes more sense. NaN values can be valid "actual" values in some scenarios. An actual NaN value should be treated differently from a NaN value that appears because the index changed. If I have a dataframe like this:
       A      B      C
1  1.242    NaN  0.110
3    NaN -0.185 -0.209
5 -0.581  1.483    NaN
and I want to keep all NaN as NaN, it makes much more sense to have:
df.reindex( [2, 4, 6], method='ffill' )
       A      B      C
2  1.242    NaN  0.110
4    NaN -0.185 -0.209
6 -0.581  1.483    NaN
That is, forward fill just takes whatever value there is (NaN or not) and fills forward until the next available index. Reindexing should not enforce a mandatory fillna on the data.
This is completely different from
df.reindex( [2, 4, 6], method=None )
which produces
    A   B   C
2 NaN NaN NaN
4 NaN NaN NaN
6 NaN NaN NaN
Here is an example:
np.nan can just mean "not applicable". Say I have hourly data, and on weekends some calculations are just not applicable, so I fill those columns with NaN during the weekends. If I then reindex to a finer index, say every minute, a reindex that skipped over NaN would pick the last value from Friday and fill it in for the whole weekend. This is wrong.
A NaN value can be an actual, valid observation that you want to keep as is.
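The difference between the two steps can be checked on the small frame from this answer. A minimal sketch (the variable names reindexed, filled and no_method are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': [1.242, np.nan, -0.581],
        'B': [np.nan, -0.185, 1.483],
        'C': [0.110, -0.209, np.nan],
    },
    index=[1, 3, 5],
)

# method='ffill' picks, for each new label, the row at the closest earlier
# existing label and copies it as-is, NaN included.
reindexed = df.reindex([2, 4, 6], method='ffill')

# A separate forward fill over the values then replaces the remaining NaN
# with the last non-null value in each column.
filled = reindexed.ffill()

# Without a method, no existing label matches, so every value is NaN.
no_method = df.reindex([2, 4, 6])

print(reindexed)
print(filled)
```

In other words, the method argument of reindex decides which existing row each new label maps to, while ffill on the result fills missing values; the question's daily2 line is doing the second step.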

issue with pandas and semilog for boxplot

I have a pandas dataframe with columns 'video' and 'link' holding click values,
and a datetime index.
For some reason, when I use semilogy and boxplot with the video series, I get the error
ValueError: Data has no positive values, and therefore can not be log-scaled.
but when I do it on the 'link' series I can draw the boxplot correctly.
I have verified that both the 'video' and 'link' series have NaN values and positive values.
Any thoughts on why this is occurring? Below is what I've done to verify that this is the case.
Sample code:
# get all the non-null values of video to show that there are positive values
temp = a.types_pivot[a.types_pivot['video'].notnull()]
print(temp)
# count the NaN values to show that both 'video' and 'link' have NaN
# (float('nan').is_integer() is False, so NaN entries are counted here)
count = 0
for item in a.types_pivot['video']:
    if not item.is_integer():
        count += 1
print("there is %s nan values in video" % count)
# try to draw the plots
fig = plt.figure(figsize=(6, 6), dpi=50)
ax = fig.add_subplot(111)
ax.semilogy()
plt.boxplot(a.types_pivot['video'].values)
Here is relevant output from the code for video series
type link video
created_time
2011-02-10 15:00:51+00:00 NaN 5
2011-02-17 17:50:38+00:00 NaN 5
2011-03-22 14:04:56+00:00 NaN 5
there is 5463 nan values in video
I run the same exact code except I do
a.types_pivot['link']
and I am able to draw the boxplot.
Below is the relevant output from the link series
Index: 5269 entries, 2011-01-24 20:03:58+00:00 to 2012-06-22 16:56:30+00:00
Data columns:
link 5269 non-null values
photo 0 non-null values
question 0 non-null values
status 0 non-null values
swf 0 non-null values
video 0 non-null values
dtypes: float64(6)
there is 216 nan values in link
Using the describe function
a.types_pivot['video'].describe()
count    22.000000
mean     16.227273
std      15.275040
min       1.000000
25%       5.250000
50%       9.500000
75%      23.000000
max      58.000000
Note: I'm unable to upload images due to some issue with imgur. I'll try again later.
Take advantage of the pandas matplotlib wrappers by calling pd.DataFrame.boxplot(). I believe this will take care of the NaN values for you. It will also put both Series in the same plot so you can easily compare the data.
Example
Create a dataframe with some NaN values and negative values
In [7]: df = pd.DataFrame(np.random.rand(10, 5))
In [8]: df.loc[2:4, 3] = np.nan   # .ix was removed in pandas 1.0; .loc slices by label
In [9]: df.loc[2:3, 4] = -0.45
In [10]: df
Out[10]:
0 1 2 3 4
0 0.391882 0.776331 0.875009 0.350585 0.154517
1 0.772635 0.657556 0.745614 0.725191 0.483967
2 0.057269 0.417439 0.861274 NaN -0.450000
3 0.997749 0.736229 0.084077 NaN -0.450000
4 0.886303 0.596473 0.943397 NaN 0.816650
5 0.018724 0.459743 0.472822 0.598056 0.273341
6 0.894243 0.097513 0.691781 0.802758 0.785258
7 0.222901 0.292646 0.558909 0.220400 0.622068
8 0.458428 0.039280 0.670378 0.457238 0.912308
9 0.516554 0.445004 0.356060 0.861035 0.433503
Note that I can count the number of NaN values like so:
In [14]: df[3].isnull().sum() # Count NaNs in the 4th column
Out[14]: 3
A box plot is simply:
In [16]: df.boxplot()
You could create a semi-log boxplot, for example, by:
In [23]: np.log(df).boxplot()
Or, more generally, modify / transform to your heart's content, and then boxplot.
In [24]: df_mod = np.log(df).dropna()
In [25]: df_mod.boxplot()
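If you want to stay with plain matplotlib and a log-scaled axis, one option is to drop the NaN values yourself before plotting. A minimal sketch, assuming a made-up 'video' series and the non-interactive Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical click counts with missing values, standing in for the 'video' column
video = pd.Series([5, np.nan, 1, 58, np.nan, 23, 9], name='video')

# Drop NaN explicitly so the plot does not depend on how matplotlib treats them
clean = video.dropna()

fig, ax = plt.subplots(figsize=(6, 6))
ax.boxplot(clean.to_numpy())
ax.set_yscale('log')  # log-scaled y axis, same effect as ax.semilogy()
```

A log scale still needs strictly positive data, so any zero or negative values would have to be filtered out as well.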