Generate graphs for all the columns in an Excel file with Pandas in Google Colab - pandas

I'm trying to generate graphs for all the columns of an Excel file, but I'm getting an error. My goal is to save these graphs as PNG files and then download them. Some context: I'm reading the Excel file and, for each column, I use a for loop to call .value_counts() and create a graph. Once the graph is generated, I want to save it to a PNG file named after the column's index. This is my code:
import pandas as pd
from google.colab import files
from matplotlib import pyplot as plt
df = pd.read_excel('columns.xlsx')
for i in df.columns:
    print(i)
    #i.value_counts().plot(kind="bar", figsize=(15,7), color="#61d199")
    df[i].value_counts().plot(kind="bar", figsize=(15,7), color="#61d199")
    #plt.savefig('viz_movies.png')
Error in this code line:
df[i].value_counts().plot(kind="bar", figsize=(15,7), color="#61d199")
Error:
index 0 is out of bounds for axis 0 with size 0
I also want the loop to name each file after its index, something like this:
for index, number in enumerate(numbers):
    plt.savefig('index.png') #use the index as a name
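One plausible cause of the "index 0 is out of bounds" error is a column with no countable values (for example, all cells empty), so that .value_counts() returns an empty Series and the bar plot has nothing to draw. The sketch below, which assumes that cause, skips such columns and names each file after the column's position. The function name save_value_count_charts and the out_dir parameter are made up for illustration.

```python
import os

import pandas as pd
from matplotlib import pyplot as plt


def save_value_count_charts(df, out_dir="."):
    """Save one bar chart per column and return the paths that were written."""
    saved = []
    for index, column in enumerate(df.columns):
        counts = df[column].value_counts()
        if counts.empty:
            # Nothing to plot (e.g. the column holds only missing values)
            continue
        # A fresh figure per column keeps the charts from piling up on one axes
        fig, ax = plt.subplots(figsize=(15, 7))
        counts.plot(kind="bar", ax=ax, color="#61d199")
        ax.set_title(str(column))
        path = os.path.join(out_dir, f"{index}.png")  # index used as the file name
        fig.savefig(path)
        plt.close(fig)  # release the figure's memory before the next column
        saved.append(path)
    return saved
```

In Colab, each returned path could then be passed to files.download(path) from google.colab to download the image.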

Related

Loading a csv file with no header on my Colab with Pandas read_csv and NumPy loadtxt gave me different results

This is the image of the error on my Colab when I used pd.dtype on pd_data
This is the image of the error on my Colab when I used np.dtype on pd_data
I have loaded one csv file into my Colab notebook in two different ways: with pd.read_csv() and with np.loadtxt(). I assigned the results to pd_data and nd_data, respectively. After that I printed the shape of each. At this point I got two different shapes even though I loaded the same csv file.
My question is why I got two different shapes when loading the same data.
This is the link to the ThoraricSurgery.csv file which I used.
'''
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
pd_data = pd.read_csv('/content/drive/MyDrive/딥러닝과실습1/ThoraricSurgery.csv')
print(pd_data.shape)
print(type(pd_data))
import numpy as np
nd_data = np.loadtxt('/content/drive/MyDrive/딥러닝과실습1/ThoraricSurgery.csv', delimiter=",")
print(nd_data.shape)
print(type(nd_data))
'''
This is the mentioned result.
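A likely explanation, assuming the file really has no header row, is that pd.read_csv() treats the first line as column names by default, while np.loadtxt() reads every line as data, so the pandas result ends up one row shorter. The sketch below reproduces this with a small made-up CSV string instead of the real file.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for a header-less CSV file with 3 rows and 3 columns
csv_text = "1,2,3\n4,5,6\n7,8,9\n"

# Default behaviour: the first line is consumed as the header
pd_default = pd.read_csv(io.StringIO(csv_text))

# header=None tells pandas that every line is data
pd_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# loadtxt has no notion of a header unless skiprows is given
nd_data = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(pd_default.shape)    # one row fewer than the file has
print(pd_no_header.shape)  # matches the NumPy shape
print(nd_data.shape)
```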

can't create a graph with matplotlib from a csv file / data type issue

I'm hoping to get some help here. I'm trying to create some simple bar/line graphs from a csv file. However, it gives me an empty graph until I open the csv file manually in Excel and change the data type to numeric. I've tried changing the data type with pd.to_numeric but it still gives an empty graph.
The csv that I'm trying to visualise is web data that I scraped using Beautiful Soup. I used the .text attribute to get rid of all of the HTML tags, so maybe that's causing the issue?
Would really appreciate some help. thanks!
Data file: https://dropmefiles.com/AYTUT
import numpy
import matplotlib
from matplotlib import pyplot as plt
import pandas as pd
import csv
my_data = pd.read_csv('my_data.csv')
my_data_n = my_data.apply(pd.to_numeric)
plt.bar(x=my_data_n['Company'], height=my_data_n['Market_Cap'])
plt.show()
Your csv file is corrupt. There are commas at the end of each line. Remove them and your code should work. pd.to_numeric is not required for this sample dataset.
Test code:
from matplotlib import pyplot as plt
import pandas as pd
my_data = pd.read_csv('/mnt/ramdisk/my_data.csv')
fig = plt.bar(x=my_data['Company'], height=my_data['Market_Cap'])
plt.tick_params(axis='x', rotation=90)
plt.title("Title")
plt.tight_layout()
plt.show()
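If editing the file by hand is not practical, pandas can also be told to cope with the trailing commas. The sketch below assumes that only the data rows end with a comma (the header does not); in that case pandas by default treats the first field as the index and the columns shift by one. Passing index_col=False is the documented way to prevent this. The sample rows are made up.

```python
import io

import pandas as pd

# Made-up sample: the header has 2 fields, each data row has a trailing comma
csv_text = "Company,Market_Cap\nAcme,100,\nGlobex,250,\n"

# Default parsing: the extra field makes pandas use the first column as the index
shifted = pd.read_csv(io.StringIO(csv_text))

# index_col=False keeps 'Company' as a regular column and drops the empty trailing field
fixed = pd.read_csv(io.StringIO(csv_text), index_col=False)

print(shifted)
print(fixed)
```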

How to convert PDF to excel using tabula-py into dataframe of several tables?

I have a PDF file that contains several tables. For example:
Table from PDF File
I learned that I have to use tabula-py, which depends on Java (note: I'm working in a Jupyter Notebook).
So I wrote this:
import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
pdf_path = r"..\PDFs\pobreza2.pdf"  # path to the file (raw string so the backslashes are kept as-is)
df = tabula.read_pdf(pdf_path, pages="all", stream=True, guess=False, multiple_tables=True)  # the PDF has many pages with several tables
And I get this:
Output of the code
The result looks like a list rather than a DataFrame.
So, how can I get these tables into a DataFrame? The tables contain string and float values.
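With multiple_tables=True, tabula.read_pdf returns a list of DataFrames, one per detected table, so the list-like output is expected. A minimal sketch of working with such a list follows; the two small frames are made-up stand-ins for the parsed tables, and it assumes the tables share the same columns when they are concatenated.

```python
import pandas as pd

# Made-up stand-ins for the list that read_pdf returns (one DataFrame per table)
tables = [
    pd.DataFrame({"region": ["A", "B"], "rate": [0.12, 0.34]}),
    pd.DataFrame({"region": ["C"], "rate": [0.56]}),
]

# Each element is already a DataFrame and can be used on its own
first_table = tables[0]

# If the tables share the same columns, they can be stacked into one DataFrame
combined = pd.concat(tables, ignore_index=True)
print(combined)
```

From there, combined.to_excel("tables.xlsx", index=False) would write the result to an Excel file, assuming an Excel writer such as openpyxl is installed; the file name is only an example.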

GeoViews saving inline HTML file is very large

I have created a geo-dataframe using a combination of geopandas and geoviews. The libraries I'm using are below:
import pandas as pd
import numpy as np
import geopandas as gpd
import holoviews as hv
import geoviews as gv
import matplotlib.pyplot as plt
import matplotlib
import panel as pn
from cartopy import crs
gv.extension('bokeh')
I have concatenated 3 shapefiles to build a polygon picture of UK healthcare boundaries (links to files provided if needed). Unfortunately, from what I have found, the UK doesn't publish a single file that combines all of them, so I have had to merge the shapefiles from the 3 individual countries I'm interested in. The 3 shapefiles have the following sizes:
shape file 1 = 5mb (https://www.opendatani.gov.uk/dataset/department-of-health-trust-boundaries)
shape file 2 = 204kb (https://geoportal.statistics.gov.uk/datasets/5252644ec26e4bffadf9d3661eef4826_4)
shape file 3 = 22kb (https://data.gov.uk/dataset/31ab16a2-22da-40d5-b5f0-625bafd76389/local-health-boards-december-2016-ultra-generalised-clipped-boundaries-in-wales)
I have merged them all successfully to build the picture I am looking for using:
Test = gv.Polygons(Merged_Shapes, vdims=[('Data'), ('CCG_Name')], crs=crs.OSGB()).options(tools=['hover'], width=550, height=700)
Test_2 = gv.Polygons(Merged_Shapes, vdims=[('Data'), ('CCG_Name')], crs=crs.OSGB()).options(tools=['hover'], width=550, height=700)
However, I would like to include these charts in a shareable html file. The issue I'm running into, is that when I save the HTML using:
from bokeh.resources import INLINE
layout = hv.Layout(Test + Test_2)
Final_report = pn.Tabs(('Test',layout)).save('Map_test.html', resources=INLINE)
I generate an HTML file that displays the charts, but its size is 80 MB, which is far too large, especially if I want to include more polygon charts and other charts in the same HTML file.
Does anyone know of a more efficient way, from a file-size perspective, to store my polygon charts within an HTML file for sharing?
You can make the file smaller by rasterizing or by decimating the shapes. For rasterizing you can call hv.operation.datashader.rasterize(obj), and I think there is something in Shapely or GeoPandas for simplifying the shapes.
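For the decimation route, GeoPandas exposes GeoSeries.simplify, which reduces the number of vertices per shape. The sketch below is only an illustration: the single buffered point is a made-up stand-in for Merged_Shapes, and the tolerance of 50 (in the units of the coordinates) is an arbitrary value that would need tuning against how much detail the map can lose.

```python
import geopandas as gpd
from shapely.geometry import Point

# Made-up stand-in for Merged_Shapes: one densely sampled, roughly circular polygon
# (the second positional argument is the number of segments per quarter circle)
dense_polygon = Point(0, 0).buffer(1000, 64)
merged = gpd.GeoDataFrame({"CCG_Name": ["example"]}, geometry=[dense_polygon])

# Simplify every geometry; a larger tolerance removes more vertices
simplified = merged.copy()
simplified["geometry"] = merged.geometry.simplify(tolerance=50, preserve_topology=True)


def vertex_count(gdf):
    """Count exterior vertices across all polygons (assumes plain Polygon geometries)."""
    return sum(len(geom.exterior.coords) for geom in gdf.geometry)


print(vertex_count(merged), vertex_count(simplified))
```

Fewer vertices means less geometry embedded in the saved HTML, at the cost of coarser boundaries.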

reading arrays from netCDF, why do I get a size of (1,1,n)

I am trying to read, and later plot, data from a netCDF file. Some of the arrays in the .nc file that I am trying to store as variables are created with shape (1,1,n). When printing them I see [[[ numbers, numbers,....]]]. Why are these three brackets [[[ created? How can I read these variables as a simple (n,1) array?
Here is my code:
import pandas as pd
import netCDF4 as nc
import matplotlib.pyplot as plt
from tkinter import filedialog
import numpy as np
file_path=filedialog.askopenfilename(title = "Select files", filetypes = (("all files","*.*"),("txt files","*.txt")))
file=nc.Dataset(file_path)
print(file.variables.keys()) # get all variable names
read_alt=file.variables['altitude'][:]
alt=np.array(read_alt)
read_b355=file.variables['backscatter'][:]
read_error_b355=file.variables['error_backscatter'][:]
b355=np.array(read_b355)
error_b355=np.array(read_error_b355)
The variable alt is fine; for the other two I have the aforementioned problem.
Is it possible that your variables - altitude, backscatter and error_backscatter - have more than one dimension? When you load that kind of data, the netCDF library keeps the number of dimensions.
Nevertheless, what I usually do is remove the dimensions I do not need by squeezing the arrays:
read_alt = np.squeeze(file.variables['altitude'][:])
read_b355 = np.squeeze(file.variables['backscatter'][:])
read_error_b355 = np.squeeze(file.variables['error_backscatter'][:])
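To see what squeezing does without a netCDF file at hand, here is a small sketch with a made-up array of shape (1, 1, n). np.squeeze drops the length-1 axes and yields shape (n,); if an (n, 1) column is needed, as asked in the question, a reshape can follow.

```python
import numpy as np

n = 5
# Made-up stand-in for a variable stored with two length-1 leading dimensions
b355 = np.arange(n, dtype=float).reshape(1, 1, n)
print(b355.shape)    # (1, 1, 5)

# squeeze removes every axis of length 1
flat = np.squeeze(b355)
print(flat.shape)    # (5,)

# reshape(-1, 1) turns the flat array into a single column
column = flat.reshape(-1, 1)
print(column.shape)  # (5, 1)
```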