How to convert PDF to excel using tabula-py into dataframe of several tables? - pandas

I have a PDF file that contains several tables. For example:
Table from PDF File
I learned that I need tabula-py, which relies on Java. (Note: I'm working in a Jupyter Notebook.)
So I wrote this:
import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
pdf_path = r"..\PDFs\pobreza2.pdf"  # Path to the file (raw string keeps the backslashes literal)
# The PDF has many pages with several tables, so read_pdf returns a list of DataFrames
df = tabula.read_pdf(pdf_path, pages="all", stream=True, guess=False, multiple_tables=True)
And I get this:
Output of the code
The result is a list, not a DataFrame.
So how can I get these tables into a DataFrame? The tables contain string and float values.
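One possible way to handle the list, as a minimal sketch: with multiple_tables=True, tabula-py returns one DataFrame per detected table, so the list can be stacked with pd.concat. The sample tables below are made-up stand-ins for the read_pdf output, and combine_tables is a hypothetical helper name, not part of tabula-py. It assumes the tables share the same columns.

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of DataFrames into one, recording which table each row came from."""
    keys = list(range(len(tables)))
    stacked = pd.concat(tables, keys=keys, names=["table", "row"])
    # Move the "table" index level into an ordinary column.
    return stacked.reset_index(level="table")


# Made-up stand-ins for the list returned by tabula.read_pdf(..., multiple_tables=True).
tables = [
    pd.DataFrame({"region": ["A", "B"], "rate": [10.5, 12.0]}),
    pd.DataFrame({"region": ["C"], "rate": [8.25]}),
]

combined = combine_tables(tables)
print(combined)
```

From there, combined.to_excel("output.xlsx", index=False) would write the result to Excel, provided an Excel writer such as openpyxl is installed. If the tables have different columns, it may be better to keep them separate and pick one with df[0], df[1], and so on.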

Related

How to access a dataframe from a Python dataframe list through a cell from a date column in the dataframe

I have created a list (df) that contains several dataframes imported from csv files. Instead of accessing these dataframes with df[0], df[1], etc., I would like to access them more easily, with something like df[20/04/22] or df[date=='20/04/22']. I am really new to Python and programming; thank you very much in advance. I attach simplified code (only 2 items in the list) to keep things simple.
I came up with two ways of achieving that, but each time I have trouble implementing them.
Through my file names. Each csv (dataframe) file name includes the date, something like: "5f05d5d83a442d4f78db0a19_2022-04-01.csv"
Through the date column. Each csv (dataframe) includes a date column (object type), which I have converted to datetime64 so I can work with plots. So I thought that what I ask might be possible through this column.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime
from datetime import date
from datetime import time
from pandas.tseries.offsets import DateOffset
import glob
import os
path = "C:/Users/dsdadsdsaa/"
all_files = glob.glob(path + '*.csv')
df = []
for filename in all_files:
    dataframe = pd.read_csv(filename, index_col=None, header=0)
    df.append(dataframe)
for i in range(0,2):
    df[i]['date'] = pd.to_datetime(df[i]['date'])
    df[i]['time'] = pd.to_datetime(df[i]['time'])
df[0]
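The file-name idea could be sketched with a dict keyed by date instead of a list. This is only an illustration: it assumes every file name ends in _YYYY-MM-DD.csv as in the example above, and date_from_filename and load_by_date are hypothetical helper names. The demo writes two small made-up csv files to a temporary folder so it can run on its own.

```python
import os
import re
import tempfile

import pandas as pd

# Assumption: file names end with a date such as "_2022-04-01.csv".
DATE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})\.csv$")


def date_from_filename(path):
    """Return the YYYY-MM-DD part of a file name, or raise if there is none."""
    match = DATE_PATTERN.search(os.path.basename(path))
    if match is None:
        raise ValueError(f"no date found in {path!r}")
    return match.group(1)


def load_by_date(paths):
    """Read each csv and store it in a dict under the date taken from its name."""
    return {date_from_filename(p): pd.read_csv(p, parse_dates=["date"]) for p in paths}


# Demo with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for day in ("2022-04-01", "2022-04-02"):
        p = os.path.join(tmp, f"5f05d5d83a442d4f78db0a19_{day}.csv")
        pd.DataFrame({"date": [day], "value": [1]}).to_csv(p, index=False)
        paths.append(p)
    frames = load_by_date(paths)

print(frames["2022-04-01"])
```

With the real files, load_by_date(glob.glob(path + '*.csv')) would give the same kind of dict, so a dataframe is reached with frames["2022-04-01"] rather than df[0].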

Generate graphs for all the columns in a excel file with Pandas in Google Colab

I'm trying to generate graphs for all the columns of an Excel file, but I'm getting an error. My goal is to save these graphs as png files and then download them. Some context: I read the file and, for each column, use a for loop to call .value_counts() and create a graph. Once the graph is generated, I want to save it as a png file named after the index. This is my code:
import pandas as pd
from google.colab import files
from matplotlib import pyplot as plt
df = pd.read_excel('columns.xlsx')
for i in df.columns:
    print(i)
    #i.value_counts().plot(kind="bar", figsize=(15,7), color="#61d199")
    df[i].value_counts().plot(kind="bar", figsize=(15,7), color="#61d199")
    #plt.savefig('viz_movies.png')
Error in this code line:
df[i].value_counts().plot(kind="bar", figsize=(15,7), color="#61d199")
Error:
index 0 is out of bounds for axis 0 with size 0
I also want to name the files inside the for loop, something like this:
for index, number in enumerate(numbers):
    plt.savefig('index.png') #use the index as a name
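A possible sketch of the loop, under one assumption about the error: "index 0 is out of bounds for axis 0 with size 0" can appear when a bar plot is requested for an empty Series, which is what value_counts() returns for a column that holds only missing values. The version below skips such columns and uses the column position as the file name. save_value_count_plots and the demo data are made up for illustration.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
from matplotlib import pyplot as plt
import pandas as pd


def save_value_count_plots(df, out_dir):
    """Save one bar chart per non-empty column, named by the column's position."""
    saved = []
    for index, column in enumerate(df.columns):
        counts = df[column].value_counts()
        if counts.empty:
            # Nothing to plot (for example, a column with only missing values).
            continue
        fig, ax = plt.subplots(figsize=(15, 7))
        counts.plot(kind="bar", ax=ax, color="#61d199")
        ax.set_title(str(column))
        path = os.path.join(out_dir, f"{index}.png")
        fig.savefig(path)
        plt.close(fig)  # free the figure before the next column
        saved.append(path)
    return saved


# Made-up data: one usable column and one column with no values at all.
demo = pd.DataFrame({"genre": ["a", "b", "a"], "empty": [None, None, None]})
out_dir = tempfile.mkdtemp()
saved = save_value_count_plots(demo, out_dir)
print(saved)
```

In Colab, each saved path could then be passed to files.download to fetch the png.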

can't create a graph with matplotlib from a csv file / data type issue

I'm hoping to get some help here. I'm trying to create some simple bar/line graphs from a csv file. However, I get an empty graph until I open the csv file manually in Excel and change the data type to numeric. I've tried changing the data type with pd.to_numeric, but it still gives an empty graph.
The csv that I'm trying to visualise is web data that I scraped using Beautiful Soup. I used the .text attribute to get rid of all the HTML tags, so maybe that's causing the issue?
I would really appreciate some help. Thanks!
Data file: https://dropmefiles.com/AYTUT
import numpy
import matplotlib
from matplotlib import pyplot as plt
import pandas as pd
import csv
my_data = pd.read_csv('my_data.csv')
my_data_n = my_data.apply(pd.to_numeric)
plt.bar(x=my_data_n['Company'], height=my_data_n['Market_Cap'])
plt.show()
Your csv file is corrupt. There are commas at the end of each line. Remove them and your code should work. pd.to_numeric is not required for this sample dataset.
Test code:
from matplotlib import pyplot as plt
import pandas as pd
my_data = pd.read_csv('/mnt/ramdisk/my_data.csv')
bars = plt.bar(x=my_data['Company'], height=my_data['Market_Cap'])  # plt.bar returns the bars, not a figure
plt.tick_params(axis='x', rotation=90)
plt.title("Title")
plt.tight_layout()
plt.show()
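If editing the file by hand is not practical, pandas can also be told to tolerate the trailing commas. The sketch below assumes the header line has no trailing comma while the data lines do; in that case read_csv would otherwise treat the first column as the index, and index_col=False prevents that. The inline data is made up and stands in for my_data.csv.

```python
import io

import pandas as pd

# Made-up sample: data lines end with a comma, the header line does not.
raw = "Company,Market_Cap\nAlpha,100,\nBeta,250,\n"

# index_col=False keeps "Company" as a normal column despite the extra delimiter.
my_data = pd.read_csv(io.StringIO(raw), index_col=False)

# Convert only the numeric column; applying pd.to_numeric to the whole frame
# would fail on the company names.
my_data["Market_Cap"] = pd.to_numeric(my_data["Market_Cap"], errors="coerce")
print(my_data)
```

With errors="coerce", any value that cannot be parsed becomes NaN instead of raising, which makes leftover text from scraping easier to spot.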

How to convert the outcome from np.mean to csv?

I wrote a script to get the average grey value of each image in a folder. When I execute print(np.mean(img)), I get all the values in the terminal, but I don't know how to write the values to a csv file.
import glob
import cv2
import numpy as np
import csv
import pandas as pd
files = glob.glob("/media/rene/Windows8_OS/PROMON/Recorded Sequences/6gParticles/650rpm/*.png")
for file in files:
    img = cv2.imread(file)
    finalArray = np.mean(img)
    print(finalArray)
So far it works, but I need the values in a csv file. I tried csvwriter and pandas but did not manage to get a file containing the grey scale values.
Is this what you're looking for?
files = glob.glob("/media/rene/Windows8_OS/PROMON/Recorded Sequences/6gParticles/650rpm/*.png")
mean_lst = []
for file in files:
    img = cv2.imread(file)
    mean_lst.append(np.mean(img))
pd.DataFrame({"mean": mean_lst}).to_csv("path/to/file.csv", index=False)
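A small variation that may be useful, as a sketch only: keep the file name next to each mean, and skip images that failed to load (cv2.imread returns None for a file it cannot read). The arrays below are made-up stand-ins for the cv2.imread results so the example runs without the image folder, and mean_grey_table is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def mean_grey_table(images):
    """Build a table with one row per readable image: file name and mean value."""
    rows = []
    for name, img in images.items():
        if img is None:
            # Unreadable image; skip it rather than fail on np.mean(None).
            continue
        rows.append({"file": name, "mean": float(np.mean(img))})
    return pd.DataFrame(rows, columns=["file", "mean"])


# Made-up stand-ins for {file: cv2.imread(file) for file in files}.
images = {
    "a.png": np.full((2, 2), 10, dtype=np.uint8),
    "b.png": np.array([[0, 255]], dtype=np.uint8),
    "broken.png": None,
}

table = mean_grey_table(images)
csv_text = table.to_csv(index=False)  # pass a path instead to write a file
print(csv_text)
```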

Generating a NetCDF from a text file

Using Python can I open a text file, read it into an array, then save the file as a NetCDF?
The following script I wrote was not successful.
import os
import pandas as pd
import numpy as np
import PIL.Image as im
path = r'C:\path\to\data'  # raw string, so \t is not read as a tab
grb = []
for fn in os.listdir(path):
    file = os.path.join(path, fn)
    if os.path.isfile(file):
        df = pd.read_table(file, skiprows=6)
        grb.append(df)
df2 = np.array(grb)  # pd.np is deprecated; use numpy directly
#imarray = im.fromarray(df2) ##cannot handle this data type
#imarray.save('Save_Array_as_TIFF.tif')
I once used xray, now renamed xarray, to get a NetCDF file into an ASCII dataframe. I just googled, and apparently it has a to_netcdf function.
Import xarray; it lets you work with labelled data much like pandas. Note that to_netcdf belongs to xarray objects, not to pandas DataFrames, so convert the DataFrame first.
So give this a try:
df.to_xarray().to_netcdf(file_path)
xarray slow to save netCDF