How do I create a shapefile from a GeoPandas DataFrame?

I'm having issues writing a GeoPandas DataFrame to a shapefile using the GeoDataFrame.to_file() function. When I run the code below, sometimes I get an empty shapefile, and sometimes it runs but returns nothing at all.
import arcpy
import pandas as pd
import glob
import geopandas as gpd
from shapely.geometry import Point

arcpy.env.overwriteOutput = True
arcpy.env.workspace = r'F:\GY539_Programming\project_data'  # raw string, so single backslashes
ws = arcpy.env.workspace

files = glob.glob(ws + '/*.csv')
for filename in files:
    df = pd.read_csv(filename, sep=',')
    geometry = [Point(xy) for xy in zip(df['Longitude'], df['Latitude'])]
    gdf = gpd.GeoDataFrame(df, crs='EPSG:4326', geometry=geometry)
    gdf.to_file('file.shp', driver='ESRI Shapefile')  # relative path; overwritten on every iteration
Any advice? My data comes from a csv file that contains one column with longitude coordinates and another with latitude coordinates. Here is a snippet of it:
            Api  Permit  ...  Latitude  Longitude
0  5.000000e+13  163019  ...     61.14    -149.98
1  5.000000e+13  100001  ...     61.21    -149.77
2  5.000000e+13  163015  ...     61.33    -149.91
3  5.000000e+13  165037  ...     61.30    -149.99
4  5.000000e+13  100002  ...     61.42    -149.81
Thanks so much!

Thanks for adding the sample file.
The thing is that if you provide a relative filepath, like here:
gdf.to_file('file.shp', driver='ESRI Shapefile')
then the file is saved in your current working directory, which may not be the directory where you expect the file to appear, so you may be looking for it in the wrong place. If you want to save the shapefile to a specific directory, supply an absolute filepath instead, like in this example:
filepath = "C:/users/your/file/path"
gdf.to_file(f"{filepath}/file.shp", driver='ESRI Shapefile')
This worked for me with your sample data. I could also verify this by loading the file again as a GeoDataFrame like this:
gpd_df = gpd.read_file(f"{filepath}/file.shp")
Hope that solves it for you as well.
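If in doubt, you can first print the current working directory to see where a relative path would end up. A minimal check, using only the standard library:
import os
# relative paths like 'file.shp' resolve against this directory
print(os.getcwd())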

Related

GeoViews saving inline HTML file is very large

I have created a geo-dataframe using a combination of geopandas and geoviews. The libraries I'm using are below:
import pandas as pd
import numpy as np
import geopandas as gpd
import holoviews as hv
import geoviews as gv
import matplotlib.pyplot as plt
import matplotlib
import panel as pn
from cartopy import crs
gv.extension('bokeh')
I have concatenated 3 shapefiles to build a polygon picture of UK healthcare boundaries (links to the files are provided if needed). Unfortunately, from what I have found, the UK doesn't produce one file that combines all of those, so I have had to merge the shapefiles from the 3 individual countries I'm interested in. The 3 shapefiles have sizes of:
shape file 1 = 5mb (https://www.opendatani.gov.uk/dataset/department-of-health-trust-boundaries)
shape file 2 = 204kb (https://geoportal.statistics.gov.uk/datasets/5252644ec26e4bffadf9d3661eef4826_4)
shape file 3 = 22kb (https://data.gov.uk/dataset/31ab16a2-22da-40d5-b5f0-625bafd76389/local-health-boards-december-2016-ultra-generalised-clipped-boundaries-in-wales)
I have merged them all successfully to build the picture I am looking for, using:
Test = gv.Polygons(Merged_Shapes, vdims=[('Data'), ('CCG_Name')], crs=crs.OSGB()).options(tools=['hover'], width=550, height=700)
Test_2 = gv.Polygons(Merged_Shapes, vdims=[('Data'), ('CCG_Name')], crs=crs.OSGB()).options(tools=['hover'], width=550, height=700)
However, I would like to include these charts in a shareable HTML file. The issue I'm running into is that when I save the HTML using:
from bokeh.resources import INLINE
layout = hv.Layout(Test + Test_2)
Final_report = pn.Tabs(('Test',layout)).save('Map_test.html', resources=INLINE)
I generate an HTML file that displays the charts, but the size is 80mb, which is far too large, especially if I want to include more polygon charts and other charts in the same HTML.
Does anyone know of a more efficient way, from a memory perspective, to store my polygon charts within an HTML file for sharing?
You can make the file smaller by rasterizing or by decimating the shapes. For rasterizing you can call hv.operation.datashader.rasterize(obj), and for simplifying the shapes GeoPandas offers GeoSeries.simplify() (which wraps Shapely's simplify).
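A minimal sketch of both options, assuming Merged_Shapes is the GeoDataFrame behind the plots in the question and a datashader version that can rasterize polygons; the tolerance value is only a placeholder to tune (it is in the CRS's units, metres for OSGB):
from holoviews.operation.datashader import rasterize
# Option 1: rasterize, so the HTML embeds an image instead of every polygon vertex
Test_raster = rasterize(Test)
# Option 2: decimate the geometry before plotting (simplify is a GeoPandas/Shapely method)
Merged_Shapes['geometry'] = Merged_Shapes.geometry.simplify(tolerance=100, preserve_topology=True)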

How to convert the outcome from np.mean to csv?

So I wrote a script to get the average grey value of each image in a folder. When I execute print(np.mean(img)) I get all the values in the terminal, but I don't know how to get the values into a CSV file.
import glob
import cv2
import numpy as np
import csv
import pandas as pd

files = glob.glob("/media/rene/Windows8_OS/PROMON/Recorded Sequences/6gParticles/650rpm/*.png")
for file in files:
    img = cv2.imread(file)
    finalArray = np.mean(img)
    print(finalArray)
So far it works, but I need to have the values in a CSV file. I tried csv.writer and pandas but did not manage to get a file containing the grey scale values.
Is this what you're looking for?
files = glob.glob("/media/rene/Windows8_OS/PROMON/Recorded Sequences/6gParticles/650rpm/*.png")
mean_lst = []
for file in files:
    img = cv2.imread(file)
    mean_lst.append(np.mean(img))

pd.DataFrame({"mean": mean_lst}).to_csv("path/to/file.csv", index=False)
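If you also want to know which mean belongs to which image, you could store the filename alongside each value - a small variation on the above (the column names are just suggestions):
rows = [{"file": f, "mean": np.mean(cv2.imread(f))} for f in files]
pd.DataFrame(rows).to_csv("path/to/file.csv", index=False)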

How can I import a data set (AD_Data.xlsx) with an xlsx extension in a Jupyter notebook?

I tried all the possible options, like:
import pandas as pd
df = pd.read_csv('AD_Data')
data = pd.ExcelFile("AD_Data")
xl_file = pd.ExcelFile(AD_Data)
dfs = {sheet_name: xl_file.parse(AD_Data) for sheet_name in xl_file.AD_Data}
dfs = pd.read_excel(AD_Data, sheetname=None)
None of them helped.
The error I am getting is:
FileNotFoundError: File b'adData' does not exist
The notebook and the data are in the same folder. I tried keeping them in a different folder too; it did not help.
I can import any other file, like a text file, convert it to a DataFrame, and work on it in the same notebook and from the same data folder.
pd.read_excel (Python 3.6.4) works fine with xlsx on Windows.
Add the file extension .xlsx, or make sure the file is in the same folder as the script:
dfs = pd.read_excel(r'C:\users\ilja\Desktop\Mappe1.xlsx', sheet_name=None)
print(dfs)
# OrderedDict([('Tabelle1', 1 5
# 0 2 6
# 1 3 7)])
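Applied to the file from the question, that would look like this, assuming AD_Data.xlsx sits in the notebook's working directory:
import pandas as pd
# sheet_name=None returns a dict with one DataFrame per sheet
dfs = pd.read_excel('AD_Data.xlsx', sheet_name=None)
print(list(dfs))  # the sheet names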

Accessing carray of pointcloud using pytables

I am having a hard time understanding how to access the data in a carray.
http://carray.pytables.org/docs/manual/index.html
I have a carray that I can view in a group structure using vitables - but how to open it and retrieve the data is beyond me.
The data are a point cloud, 3 levels down in the hierarchy, that I want to make a scatter plot of and extract as a .obj file.
I then have to loop through (many) clouds and do the same thing.
Can anyone give me a simple example of how to do this?
This was my attempt:
import carray as ca
fileName = 'hdf5_example_db.h5'
a = ca.open(rootdir=fileName)
print a
I managed to solve my issue. I wasn't treating the carray differently from the rest of the hierarchy. I needed to first load the entire db, then refer to the data I needed. I ended up not having to use carray, and just stuck to h5py:
from __future__ import print_function
import h5py
import numpy as np

# read the hdf5 format file
fileName = 'hdf5_example_db.h5'
f = h5py.File(fileName, 'r')

# full path of the carray-type data (which is in ply format)
dataspace = '/objects/object_000/object_model'

# view the data
print(f[dataspace])

# print to ply file
with open('object_000.ply', 'w') as fo:
    for line in f[dataspace]:
        fo.write(line + '\n')
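One caveat worth adding: under Python 3, h5py typically returns bytes rather than str for string-like datasets, so the write loop may need a decode step - an adjustment that may or may not be necessary, depending on how the dataset was stored:
with open('object_000.ply', 'w') as fo:
    for line in f[dataspace]:
        if isinstance(line, bytes):  # Python 3: decode bytes before writing text
            line = line.decode('utf-8')
        fo.write(line + '\n')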

Stuck importing NetCDF file into Pandas DataFrame

I've been working on this as a beginner for a while. Overall, I want to read in a NetCDF file and import multiple (~50) columns (and 17520 cases) into a Pandas DataFrame. At the moment I have set it up for a list of 4 variables, but I want to be able to expand that somehow. I made a start, but any help on how to loop through to make this happen with 50 variables would be great. It does work using the code below for 4 variables. I know it's not pretty - still learning!
Another question I have is that when I try to read the numpy arrays directly into a Pandas DataFrame it doesn't work and instead creates a DataFrame that is 17520 columns wide. It should be the other way around (transposed). If I create a Series, it works fine. So I have had to use the following lines to get around this. Not even sure why it works. Any suggestions for a better way (especially when it comes to 50 variables)?
d = {vnames[0]: vartemp[0], vnames[1]: vartemp[1], vnames[2]: vartemp[2], vnames[3]: vartemp[3]}
hs = pd.DataFrame(d, index=times)
The whole code is pasted below:
import pandas as pd
import datetime as dt
import xlrd
import numpy as np
import netCDF4

def excel_to_pydate(exceldate):
    datemode = 0  # datemode: 0 for 1900-based, 1 for 1904-based
    pyear, pmonth, pday, phour, pminute, psecond = xlrd.xldate_as_tuple(exceldate, datemode)
    py_date = dt.datetime(pyear, pmonth, pday, phour, pminute, psecond)
    return py_date

def main():
    filename = 'HowardSprings_2010_L4.nc'
    # Define a list of variable names we want from the netcdf file
    vnames = ['xlDateTime', 'Fa', 'Fh', 'Fg']
    # Open the NetCDF file
    nc = netCDF4.Dataset(filename)
    # Create some lists of size equal to the length of the vnames list.
    temp = list(xrange(len(vnames)))     # xrange is Python 2; use range in Python 3
    vartemp = list(xrange(len(vnames)))
    # Enumerate the list and assign each NetCDF variable to an element in the lists:
    # first get the netcdf variable object (temp),
    # then strip the data from it into a temporary variable (vartemp).
    for index, variable in enumerate(vnames):
        temp[index] = nc.variables[variable]
        vartemp[index] = temp[index][:]
    # Now call the function to convert to datetime from excel. Assume datemode: 0
    times = [excel_to_pydate(elem) for elem in vartemp[0]]
    # Don't know why I can't just pass a list of variables, i.e. [vartemp[0], vartemp[1], vartemp[2]],
    # but this is the only thing that worked.
    # Create Pandas dataframe using times as index
    d = {vnames[0]: vartemp[0], vnames[1]: vartemp[1], vnames[2]: vartemp[2], vnames[3]: vartemp[3]}
    theDataFrame = pd.DataFrame(d, index=times)
    # Define missing data value and apply to DataFrame (np.nan, not the string 'NaN')
    missing = -9999
    theDataFrame1 = theDataFrame.replace({vnames[0]: missing, vnames[1]: missing,
                                          vnames[2]: missing, vnames[3]: missing}, np.nan)

main()
You could replace:
d = {vnames[0]: vartemp[0], ..., vnames[3]: vartemp[3]}
hs = pd.DataFrame(d, index=times)
with:
hs = pd.DataFrame(np.column_stack(vartemp[0:4]), columns=vnames[0:4], index=times)
(np.column_stack stacks the four 17520-long arrays as columns, which is the transpose you were missing - passing the list directly treats each array as a row, which is why you got a 17520-column frame.)
Saying that, pandas can read HDF5 directly, so perhaps the same is true for netCDF (which is based on HDF5)...
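To address the scaling question directly: a dict comprehension avoids writing the mapping out by hand, however many variables there are. A minimal sketch, reusing excel_to_pydate and the file from the question (extend vnames to the full ~50 names):
import netCDF4
import numpy as np
import pandas as pd

nc = netCDF4.Dataset('HowardSprings_2010_L4.nc')
vnames = ['xlDateTime', 'Fa', 'Fh', 'Fg']  # extend to ~50 names
# one entry per variable; [:] strips the data out of the netCDF variable object
data = {name: nc.variables[name][:] for name in vnames}
times = [excel_to_pydate(t) for t in data.pop('xlDateTime')]
hs = pd.DataFrame(data, index=times).replace(-9999, np.nan)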