Using pandas and scipy regression line slope to identify growth - pandas

My goal is to be able to identify price growth in a table of records.
I know this is probably far off from what is possible with data tools, so I appreciate any help or suggestions for improvement.
The immediate trouble I'm having is that scipy.stats.linregress does not return a result when some of the values in a pandas row are missing. I think some kind of masking or filling will be necessary to get the slope for rows that contain nulls. An exception is thrown, but the code still runs.
Also, am I using the best solution to find the growth?
I've observed that if I filter for the records that have a positive slope, higher rvalue (correlation) and lower stderr (standard error) the trendline for these rows is upward and consistent.
The reason I tried quantifying the price growth with the slope and other numeric values is because if I plot the lines from all the data in an excel chart, it's overwhelming to select the lines that show consistent upward movement because there is so much noise. Can it be done in a better way?
Here is the working sample:
# credit jezrael
import pandas as pd
import numpy as np
import scipy
from scipy import stats
def calc_slope(row):
    a = scipy.stats.linregress(row, y=axisvalues)
    return pd.Series(a._asdict())
table=pd.DataFrame({'Category':['A','A','A','B','C','C','C','B','B','A','A','A','B','B','D','A','B','B'],
'Quarter':['2016-Q1','2017-Q2','2017-Q3','2017-Q4','2017-Q2','2016-Q2','2017-Q2','2016-Q3','2016-Q4','2016-Q2','2016-Q3','2017-Q4','2016-Q1','2016-Q2','2016-Q4','2016-Q4','2017-Q2','2017-Q3'],
'Value':[100,200,500,800,700,900,300,400,600,200,300,400,200,300,100,300,500,600]})
db=(table.groupby(['Category','Quarter']).filter(lambda group: len(group) >= 1)).groupby(['Category','Quarter'])["Value"].mean()
db=db.unstack()
axisvalues=list(range(1,len(db.columns)+1)) #used in calc_slope function
db = db.join(db.apply(calc_slope,axis=1))

You can use:
# np.arange instead of range, so the axis can be indexed with a boolean mask
axisvalues = np.arange(1, len(db.columns) + 1)
def calc_slope(row):
    # mask NaNs out
    mask = row.notnull()
    a = scipy.stats.linregress(row[mask.values], y=axisvalues[mask])
    return pd.Series(a._asdict())
db = db.join(db.apply(calc_slope,axis=1))
print (db)
2016-Q1 2016-Q2 2016-Q3 2016-Q4 2017-Q2 2017-Q3 2017-Q4 \
Category
A 100.0 200.0 300.0 300.0 200.0 500.0 400.0
B 200.0 300.0 400.0 600.0 500.0 600.0 800.0
C NaN 900.0 NaN NaN 500.0 NaN NaN
D NaN NaN NaN 100.0 NaN NaN NaN
slope intercept rvalue pvalue stderr
Category
A 0.012895 0.315789 0.802955 0.029677 0.004281
B 0.010057 -0.885057 0.947623 0.001172 0.001516
C -0.007500 8.750000 -1.000000 0.000000 0.000000
D NaN NaN 0.000000 NaN NaN
For the last row you get RuntimeWarnings, because it has only one value (in 2016-Q4).
To suppress the warnings you can use warnings.filterwarnings (thanks, Kdog):
import warnings
warnings.filterwarnings("ignore")
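As an alternative to silencing every warning, the guard can live inside the function: a row with fewer than two non-null values cannot define a line, so it can return NaN directly. A minimal sketch, assuming the same argument order as the answer above (row values first, quarter positions second). The names safe_slope and min_points, the toy prices frame, and the 0.8 rvalue cutoff are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd
from scipy import stats

FIELDS = ["slope", "intercept", "rvalue", "pvalue", "stderr"]

def safe_slope(row, axis_values, min_points=2):
    # Keep only the quarters that actually have a value.
    mask = row.notna().to_numpy()
    if mask.sum() < min_points:
        # Too few points for a line: return NaN instead of calling linregress.
        return pd.Series(np.nan, index=FIELDS)
    res = stats.linregress(row.to_numpy(dtype=float)[mask], axis_values[mask])
    return pd.Series(
        [res.slope, res.intercept, res.rvalue, res.pvalue, res.stderr],
        index=FIELDS,
    )

# Toy data (hypothetical): one complete row, one with two points, one with one.
prices = pd.DataFrame(
    {
        "Q1": [100.0, np.nan, np.nan],
        "Q2": [200.0, 900.0, np.nan],
        "Q3": [300.0, np.nan, 100.0],
        "Q4": [400.0, 500.0, np.nan],
    },
    index=["A", "C", "D"],
)
axis_values = np.arange(1, len(prices.columns) + 1)
result = prices.apply(safe_slope, axis=1, axis_values=axis_values)

# Filter for consistent upward movement; the thresholds are arbitrary examples.
growing = result[(result["slope"] > 0) & (result["rvalue"] > 0.8)]
print(growing.index.tolist())
```

Rows that are skipped come back as all-NaN, so they drop out of any later comparison such as slope > 0 without special handling.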


Nan columns not dropping

I have a dataset that contains some NaN values. I tried this to drop them, but they are still showing:
df['string_tweet'].dropna(inplace=True)
df['string_tweet']
This is the output:
113 apc started let ’ finish started
235 upon vote katsina , apc government left state ...
1796 two people contesting office , one person win ...
1798 deji said peter obi jumping church church.na d...
1850 amnesia set , lem say deleting incriminating p...
...
378726 nan
378727 nan
378728 nan
378729 nan
378730 nan
Name: string_tweet, Length: 63664, dtype: object
Please check the length and the row numbers; they do not match.
If you have proper NaN values, use the subset argument to work on the whole dataframe:
df.dropna(subset=['string_tweet'], inplace=True)
If your dataframe includes "nan" strings as suggested by @99_m4n, you may filter them out using:
df = df[df['string_tweet']!='nan']
My guess is that the nan values shown are of type numpy.ndarray; try converting your column before dropping the NaN:
df['string_tweet']=df['string_tweet'].astype(float)
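To tell the two cases apart, the sketch below builds a small made-up column that holds both a real NaN and the literal string "nan", then applies the dropna(subset=...) and the string filter suggested above in turn. The sample values and the names without_missing and cleaned are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"string_tweet": ["apc started", np.nan, "nan", "two people contesting"]}
)

# Real missing values: drop rows of the whole frame based on one column.
without_missing = df.dropna(subset=["string_tweet"])

# Literal "nan" strings are ordinary text to pandas, so filter them explicitly.
cleaned = without_missing[without_missing["string_tweet"] != "nan"]

print(len(df), len(without_missing), len(cleaned))
```

If dropna leaves rows that still print as nan, checking a value with type(...) or comparing it to the string "nan" is a quick way to see which case applies.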

ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'

I am receiving this error when I try to load a CSV file as a GeoDataFrame. According to the answers to other questions on this site, this method should do the trick.
Here is the code I am using to load the file as a gdf and then produce a subset dataframe with only some of the columns:
cp_union = gpd.read_file(r'C:\Users\User\Desktop\CPAWS\terrestrial_outputs\cp_union.csv')
cp_union.crs = 'epsg:3005'
cp_trimmed = cp_union[['COSEWIC_status','reason_for_designation','cnm_eng','iucn_cat','mgmt_e','status_e','classification','sq_m']]
As stated in the title, the error I am receiving is: ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'. Is there some part of the process of saving a gdf as a CSV and then reloading it as a gdf that would cause an additional geometry column to be created?
EDIT
In another script, I loaded the same csv file as a pd dataframe. Here is the first row of data within that pd dataframe.
Unnamed: 0 0
fid_critic 0
scntfc_nm Castilleja victoriae
cnm_eng Victoria's Owl-clover
cnm_fren Castilléjie de Victoria
cswc_pop NaN
ch_stat Final
cb_site_nm Cattle Point
ch_detail Detailed Polygon
ch_variant NaN
ch_method NaN
hectares 0.8559
utm_zone 10
utm_east 478383
utm_north 5365043
latitude 48.438164
longitude -123.29226
shape_1 0.0
objectid 10251681.0
area_sqm 8478.6733
feat_len 326.5008
fid_protec -1
name_e NaN
name_f NaN
aichi_t11 NaN
iucn_cat NaN
oecm NaN
o_area 0.0
loc_e NaN
loc_f NaN
type_e NaN
mgmt_e NaN
gov_type NaN
legisl_e NaN
status_e NaN
protdate 0
delisdate 0
owner_e NaN
owner_f NaN
subs_right NaN
comments NaN
url NaN
shape_leng 0.0
protected 0
shape_le_1 320.859687
shape_area 6499.790343
geometry POLYGON ((1200735.4438 384059.0133999996, 1200...
COSEWIC_status Endangered
reason_for_designation This small annual herb is confined to a very s...
sq_m 6499.790343
classification c
Name: 0, dtype: object
So my only theory is that when you save a gdf as a CSV, the CSV contains a column called geometry. Then, when you load that CSV as a gdf, geopandas tries to create a new geometry column on top of the one that was already in the CSV. I could be completely wrong about this. Even if this is the case, I'm not sure how to go about resolving the issue.
Thanks for the help!
I used your sample data to create a CSV. I had to replace the geometry, as the sample is not a valid WKT string.
I reproduced your error.
I solved it by loading with pandas and then converting to geopandas.
Solution:
df = pd.read_csv(f)
cp_union = gpd.GeoDataFrame(
    df.loc[:, [c for c in df.columns if c != "geometry"]],
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="epsg:3005",
)
full code
import pandas as pd
import geopandas as gpd
import io
from pathlib import Path
# fmt: off
df_q = pd.read_csv(io.StringIO("""Unnamed: 0    0
fid_critic    0
scntfc_nm    Castilleja victoriae
cnm_eng    Victoria's Owl-clover
cnm_fren    Castilléjie de Victoria
cswc_pop    NaN
ch_stat    Final
cb_site_nm    Cattle Point
ch_detail    Detailed Polygon
ch_variant    NaN
ch_method    NaN
hectares    0.8559
utm_zone    10
utm_east    478383
utm_north    5365043
latitude    48.438164
longitude    -123.29226
shape_1    0.0
objectid    10251681.0
area_sqm    8478.6733
feat_len    326.5008
fid_protec    -1
name_e    NaN
name_f    NaN
aichi_t11    NaN
iucn_cat    NaN
oecm    NaN
o_area    0.0
loc_e    NaN
loc_f    NaN
type_e    NaN
mgmt_e    NaN
gov_type    NaN
legisl_e    NaN
status_e    NaN
protdate    0
delisdate    0
owner_e    NaN
owner_f    NaN
subs_right    NaN
comments    NaN
url    NaN
shape_leng    0.0
protected    0
shape_le_1    320.859687
shape_area    6499.790343
geometry    POLYGON ((5769135.557632876 7083849.386658552, 5843426.213336911 7098018.122146672, 5852821.812968816 7081377.7312996285, 5914814.478616157 7091734.620966213, 5883751.009067913 7017032.330573363, 5902031.719573214 6983898.953064103, 5864452.659165712 6922039.030140929, 5829585.402576889 6878872.269967912, 5835906.522449658 6846685.714836724, 5800391.382286092 6827305.509709548, 5765261.646424723 6876008.057438379, 5765261.402301509 6876010.894933639, 5765264.431247815 6876008.786040769, 5760553.056402712 6927522.42488809, 5720896.599172597 6983360.181762057, 5755349.303491102 7039380.015177476, 5769135.557632876 7083849.386658552))
COSEWIC_status    Endangered
reason_for_designation    This small annual herb is confined to a very s...
sq_m    6499.790343
classification    c"""), sep=r"\s\s+", engine="python", header=None).set_index(0).T
# fmt: on
# generate a CSV file from sample data
f = Path.cwd().joinpath("SO_q.csv")
df_q.to_csv(f, index=False)
# replicate issue...
try:
    gpd.read_file(f)
except ValueError as e:
    print(e)
# now the actual solution
df = pd.read_csv(f)
cp_union = gpd.GeoDataFrame(
    df.loc[:, [c for c in df.columns if c != "geometry"]],
    geometry=gpd.GeoSeries.from_wkt(df["geometry"]),
    crs="epsg:3005",
)
GeoPandas automatically adds a geometry field when you read in your CSV.
As you already have a field called "geometry", GeoPandas raises an exception. GeoPandas doesn't read the WKT strings in the "geometry" field as geometry.
Some workarounds:
The GDAL/OGR CSV driver (which geopandas uses) supports the open option GEOM_POSSIBLE_NAMES, which lets you specify a different field name to look for geometry in.
gdf = gpd.read_file('/path/to/gdf.csv',
                    GEOM_POSSIBLE_NAMES="geometry",
                    KEEP_GEOM_COLUMNS="NO")
The KEEP_GEOM_COLUMNS option is also required; otherwise GDAL/OGR will return a "geometry" column as well as the geometry, and you still get the original ValueError: GeoDataFrame does not support multiple columns using the geometry column name 'geometry'.
Alternatively, if you have control over the original writing of the CSV, the GDAL/OGR CSV driver documentation notes that if a field is named "WKT" it will be read as geometry:
When reading a field named “WKT” is assumed to contain WKT geometry, but also is treated as a regular field.
So one option is to write out the CSV using "WKT" as the geometry column name and then it can be read straight back in:
gdf = gdf.rename_geometry('WKT')
gdf.to_csv('/path/to/gdf.csv', index=False)
# Then in later scripts you can just read it straight back in
gdf = gpd.read_file('/path/to/gdf.csv')
A final option is to read the CSV in as a Pandas DataFrame (per @RobRaymond's answer), then manually create the geometry from the WKT and convert to a GeoDataFrame. Here is an alternative way of doing so:
import geopandas as gpd
df = gpd.read_file('/path/to/gdf.csv', ignore_geometry=True)
# Create geometry objects from WKT strings
df['geometry'] = gpd.GeoSeries.from_wkt(df['geometry'])
# Convert to GDF
gdf = gpd.GeoDataFrame(df)

Matplotlib plots of all dataframe columns with FOR operator

I want to plot the graphs one by one from the dataframe using a for loop.
names_list = df.columns.tolist()
for name in names_list:
    df[name].plot(figsize=(25, 5))
This code doesn't do what I want: the graphs are all drawn in one figure, but each should be in its own.
How can I get multiple charts instead of one?
Try the following:
import matplotlib.pyplot as plt
names_list = df.columns.tolist()
for name in names_list:
    # a new figure and axes for every column
    fig, ax = plt.subplots(figsize=(25, 5))
    df[name].plot(ax=ax)
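For a self-contained version of the loop above, here is a small sketch with a made-up three-column frame. The Agg backend is an assumption so that it runs without a display, and the figures dictionary is an illustrative name. It also shows pandas' own subplots=True option, which stacks one axes per column inside a single figure rather than creating separate figures.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])

# One separate figure per column, kept in a dict so each can be saved or shown.
figures = {}
for name in df.columns:
    fig, ax = plt.subplots(figsize=(8, 3))
    df[name].plot(ax=ax, title=name)
    figures[name] = fig

# Alternative: a single figure with one stacked axes per column.
axes = df.plot(subplots=True, figsize=(8, 6))

# Release the figures once they are no longer needed.
plt.close("all")
```

With many columns, closing figures after saving them keeps memory use down, since matplotlib holds on to every open figure.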
If you're able to use seaborn here's an example using a FacetGrid:
import seaborn as sns, matplotlib.pyplot as plt
In [102]: df.head(3)
Out[102]:
Date Consumption Wind Solar Wind+Solar name
0 2006-01-01 1069.184 NaN NaN NaN mid
1 2006-01-02 1380.521 NaN NaN NaN mid
2 2006-01-03 1442.533 NaN NaN NaN high
g = sns.FacetGrid(data=df,col='name',col_wrap=1,hue='name')
g.fig.set_size_inches(6,3) # compressed just to show example
g.map(sns.lineplot,'Date','Consumption')
plt.show()

Update a Pandas MultiIndex DataFrame

The dataframe "data" has a MultiIndex.
data.head()
                     Close     High      Low     Open    Volume
Symbol Date
A      1999-11-18  28.6358  33.5207  27.3725  30.6572  59753154
       1999-11-19  27.2040  28.9727  26.8253  28.9323  16172993
       1999-11-22  29.3517  29.3517  26.9935  27.8357   5435127
       1999-11-23  27.1198  28.8885  27.1198  28.6358   5035889
       1999-11-24  27.6676  28.2571  26.9513  27.0389   5141708
The dictionary f has a key 'AAPL' which is a regular DataFrame.
f['AAPL'].head()
Open High Low Close Volume
Date
2018-06-11 191.350 191.970 190.21 191.23 18308460
2018-06-12 191.385 192.611 191.15 192.28 16911141
2018-06-13 192.420 192.880 190.44 190.70 21638393
2018-06-14 191.550 191.570 190.22 190.80 21610074
2018-06-15 190.030 190.160 188.26 188.84 61719160
I'd like to append to data['AAPL'] so that it has the data from f['AAPL']. This works, but is not inplace:
data.loc['AAPL'].append(f['AAPL'], verify_integrity=True).tail()
Close High Low Open Volume
Date
2018-07-30 189.91 192.20 189.0700 191.90 21029535
2018-07-31 190.29 192.14 189.3400 190.30 39373038
2018-08-01 201.50 201.76 197.3100 199.13 67935716
2018-08-02 207.39 208.38 200.3500 200.58 62404012
2018-08-03 207.99 208.74 205.4803 207.03 33447396
When I try to update data, I get all NaNs.
data.loc['AAPL'] = data.loc['AAPL'].append(f['AAPL'], verify_integrity=True).tail()
Close High Low Open Volume
Date
2018-06-04 NaN NaN NaN NaN NaN
2018-06-05 NaN NaN NaN NaN NaN
2018-06-06 NaN NaN NaN NaN NaN
2018-06-07 NaN NaN NaN NaN NaN
2018-06-08 NaN NaN NaN NaN NaN
Edit:
The "data" DataFrame was created with pandas data_reader:
import pandas_datareader.data as web
data = web.DataReader(['A','AAPL','F'], 'morningstar', start, end)
"f" was created the same way, but using 'iex' as the source instead of 'morningstar' (at the moment the morningstar source is returning 404s, so I switched to iex).
I still don't know why assigning to data.loc['AAPL'] doesn't work, but the following does:
# Converts dict with keys as tickers, DataFrame as values, to a DataFrame with a MultiIndex
new_data = pd.concat(f)
# Just append, and sort index to align the dates
data = data.append(new_data).sort_index()
Personal preference: I would first create a temp df with the data to append as a multi index dataframe.
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
And then create a new dataframe by appending the data and new data.
newdata = data.append(toappend, verify_integrity=True)
or if you prefer to do it in one line:
newdata = data.append(pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol']), verify_integrity=True)
My full test code is:
import pandas as pd
import numpy as np
symbols = ['AAA', 'BBB', 'CCC']
dates = ['2018-06-11', '2018-06-12', '2018-06-13']
cols = ['Close', 'High', 'Low']
midx = pd.MultiIndex.from_product([symbols, dates], names=['Symbol', 'Date'])
data= pd.DataFrame(10, midx, cols)
aapldf = pd.DataFrame(15, dates, cols)
aapldf.index.name = 'Date'
f = {'AAPL': aapldf}
toappend = pd.concat([f['AAPL']], keys=['AAPL'], names=['Symbol'])
newdata = data.append(toappend, verify_integrity=True)
print(newdata)
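Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the append calls above raise AttributeError. A sketch of the same idea with pd.concat, reusing the shape of the test data above (the names to_append and new_data are illustrative):

```python
import pandas as pd

symbols = ["AAA", "BBB"]
dates = ["2018-06-11", "2018-06-12"]
cols = ["Close", "High", "Low"]
midx = pd.MultiIndex.from_product([symbols, dates], names=["Symbol", "Date"])
data = pd.DataFrame(10, index=midx, columns=cols)

# A regular (single-index) frame, standing in for f['AAPL'].
aapl = pd.DataFrame(15, index=pd.Index(dates, name="Date"), columns=cols)
f = {"AAPL": aapl}

# Lift the new frame to the same (Symbol, Date) MultiIndex, then concatenate.
to_append = pd.concat([f["AAPL"]], keys=["AAPL"], names=["Symbol"])
new_data = pd.concat([data, to_append], verify_integrity=True).sort_index()

print(new_data)
```

verify_integrity=True makes pd.concat raise if the combined index contains duplicates, which mirrors the check used with append above.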

sum vs np.nansum weirdness while summing columns with same name on a pandas dataframe - python

Taking inspiration from this discussion here on SO (Merge Columns within a DataFrame that have the Same Name), I tried the suggested method. It works with the function sum(), but not with np.nansum:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100,4), columns=['a', 'a','b','b'], index=pd.date_range('2011-1-1', periods=100))
print(df.head(3))
sum() case:
print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))
a b
2011-01-01 1.328933 1.678469
2011-01-02 1.878389 1.343327
2011-01-03 0.964278 1.302857
np.nansum() case:
print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))
a [1.32893299939, 1.87838886222, 0.964278430632,...
b [1.67846885234, 1.34332662587, 1.30285727348, ...
dtype: object
any idea why?
The issue is that np.nansum converts its input to a numpy array, so it effectively loses the column information (sum doesn't do this). As a result, the groupby doesn't get back any column information when constructing the output, so the output is just a Series of numpy arrays.
Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks if the input is an array, and converts it to one if it's not.
All hope isn't lost though. You can easily replicate np.nansum with Pandas functions. Specifically use sum followed by fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
The sum should ignore NaNs and just sum the non-null values. The only case where you get back a NaN is when all the values being summed are NaN, which is why fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....
If you really want to use np.nansum, you can recast the result as a pd.Series. This will likely impact performance, as constructing a Series can be relatively expensive, and you'll be doing it multiple times:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
Example Computations
For some example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):
df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))
a a a b b
0 1.0 2.0 2.0 NaN 4.0
1 NaN NaN NaN 3.0 3.0
2 NaN NaN -1.0 2.0 NaN
Using sum without fillna:
df.groupby(df.columns, axis=1).sum()
a b
0 5.0 4.0
1 NaN 6.0
2 -1.0 2.0
Using sum and fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
a b
0 5.0 4.0
1 0.0 6.0
2 -1.0 2.0
Comparing to the fixed np.nansum method:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
a b
0 5.0 4.0
1 0.0 6.0
2 -1.0 2.0
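Two version-related caveats, offered as a hedged aside: since pandas 0.22, sum returns 0 rather than NaN for an all-NaN group unless min_count is set, and groupby(..., axis=1) is deprecated in recent pandas releases. The sketch below avoids axis=1 by grouping the transposed frame on its labels, using the same example frame as above; summed and strict are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1, 2, 2, np.nan, 4],
        [np.nan, np.nan, np.nan, 3, 3],
        [np.nan, np.nan, -1, 2, np.nan],
    ],
    columns=list("aaabb"),
)

# Group the transposed frame by its (duplicate) labels, then transpose back.
# With the default min_count=0 this behaves like np.nansum: all-NaN sums to 0.
summed = df.T.groupby(level=0).sum().T

# min_count=1 keeps NaN where every value in the group is missing.
strict = df.T.groupby(level=0).sum(min_count=1).T

print(summed)
print(strict)
```

Which of the two is wanted depends on whether an all-missing group should count as zero or stay visibly missing.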