can't create a graph with matplotlib from a csv file / data type issue - pandas

I'm hoping to get some help here. I'm trying to create some simple bar/line graphs from a csv file. However, it gives me an empty graph until I open the csv file manually in Excel and change the data type to numeric. I've tried changing the data type with pd.to_numeric, but it still gives an empty graph.
The csv that I'm trying to visualise is web data that I scraped using Beautiful Soup. I used the .text attribute to get rid of all the HTML tags, so maybe that's causing the issue?
Would really appreciate some help. Thanks!
Data file: https://dropmefiles.com/AYTUT
import numpy
import matplotlib
from matplotlib import pyplot as plt
import pandas as pd
import csv
my_data = pd.read_csv('my_data.csv')
my_data_n = my_data.apply(pd.to_numeric)
plt.bar(x=my_data_n['Company'], height=my_data_n['Market_Cap'])
plt.show()

Your csv file is corrupt. There are commas at the end of each line. Remove them and your code should work. pd.to_numeric is not required for this sample dataset.
Test code:
from matplotlib import pyplot as plt
import pandas as pd
my_data = pd.read_csv('/mnt/ramdisk/my_data.csv')
fig = plt.bar(x=my_data['Company'], height=my_data['Market_Cap'])
plt.tick_params(axis='x', rotation=90)
plt.title("Title")
plt.tight_layout()
plt.show()
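If editing the file by hand is not practical, the trailing commas can also be stripped in code before parsing. This is only a minimal sketch: the sample text is made up (it is not the linked data file), strip_trailing_commas is a hypothetical helper, and it assumes every line ends with exactly one stray comma.

```python
import csv
import io


def strip_trailing_commas(text):
    """Remove a single trailing comma from every line of CSV text."""
    cleaned = []
    for line in text.splitlines():
        line = line.rstrip()
        if line.endswith(","):
            line = line[:-1]
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Hypothetical sample with a stray comma at the end of each line
sample = "Company,Market_Cap,\nAcme,120.5,\nGlobex,98.1,\n"
cleaned = strip_trailing_commas(sample)

# Parse the cleaned text to confirm the columns line up with the header
rows = list(csv.DictReader(io.StringIO(cleaned)))
```

The cleaned text can then be handed to pandas with pd.read_csv(io.StringIO(cleaned)). Note that this would also drop a genuinely empty last field, so it only suits files where the trailing comma is known to be stray.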

Related

How to prevent pandas from reading multiple sheets?

I'm trying to make and save graphs based on each sheet in an excel file. Not sure what is wrong with my code but instead of reading the sheets one by one it seems to read them in a cumulative manner. Any help would be much appreciated.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="darkgrid", color_codes=True)
for i in range(3):
    ihdata = pd.read_excel(r"C:\Users\wwwgo\PycharmProjects\plots\ihdata.xlsx", sheet_name=i)
    sns_ih = sns.lineplot(x="Temperature", y="Derivative", hue="Well Position", data=ihdata)
    plt.savefig(r'C:\Users\wwwgo\PycharmProjects\plots\sns_ih_{0}.jpg'.format(i), dpi = 1600)
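A likely cause of the cumulative look is that sns.lineplot draws onto the current axes, and nothing in the loop starts a new figure, so each saved image also contains the lines from the earlier sheets. Below is a minimal sketch of the usual remedy. It uses plain matplotlib and made-up data in place of the Excel sheets (an assumption, so that it runs without the workbook); with seaborn, the same axes can be passed as ax=ax.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical stand-in for the three sheets: (x values, y values) per sheet
sheets = [
    ([1, 2, 3], [1, 4, 9]),
    ([1, 2, 3], [2, 3, 5]),
    ([1, 2, 3], [9, 4, 1]),
]

lines_per_figure = []
for i, (x, y) in enumerate(sheets):
    fig, ax = plt.subplots()  # fresh figure and axes for this sheet
    ax.plot(x, y)
    lines_per_figure.append(len(ax.lines))  # lines drawn on this sheet's axes
    # fig.savefig(...) for this sheet would go here
    plt.close(fig)  # release the figure so nothing carries over
```

Each figure ends up with only its own line, which is the behaviour the loop in the question was presumably meant to have.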

pd.read_csv, when changing separator data type changes?

My dataframe is originally a text file, where the columns are separated by a tab.
I first changed these tabs to spaces by hand (sep=" "), loaded and plotted the data, and my plot looked the way it should.
Since I have multiple files to plot, it's not really handy to change the separator in each file. That's why I changed the separator to sep="\s+".
Suddenly the x-axis of my new plot shows every single position value, and they overlap.
Does anyone know why this is happening and how to prevent it?
My first code looked like:
import pandas as pd
import numpy as np
from functools import reduce
data1500 = pd.read_csv('V026-15.000-0.1.txt', sep = " ", index_col='Position')
plt.plot(data_merged1.ts1500, label="ts 15.00")
and the second:
import pandas as pd
import numpy as np
from functools import reduce
from matplotlib import pyplot as plt
data1500 = pd.read_csv('V025-15.000-0.5.txt', sep = "\s+", index_col='Position')
plt.plot(data_merged2.ts1500, label="ts 15.00")
You could do this to import a tab-delimited file:
import re
with open('V026-15.000-0.1.txt') as f:
    # split each line on tabs
    data = [re.split('\t', x) for x in f.read().splitlines()]
Or do this:
import csv
with open('data.txt', newline = '') as mytext:
    # read all rows while the file is still open
    data = list(csv.reader(mytext, delimiter='\t'))
Then, to plot your data, proceed as follows:
Read each line in the file using a for loop.
Append the required columns to lists.
After reading the whole file, plot the required data.
Something like this:
from matplotlib import pyplot as plt
x, y = [], []
for row in data:
    x.append(row[0])
    y.append(row[1])
plt.plot(x, y)
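One thing the loop above leaves open is that csv.reader yields strings, and matplotlib treats a list of strings as categories, drawing one tick per value. That matches the crowded x-axis described in the question. Below is a minimal sketch of converting two columns to floats first. The sample text, the column names, and the helper read_xy are assumptions for illustration and are not taken from the original files.

```python
import csv
import io


def read_xy(handle, x_col, y_col):
    """Return two lists of floats from a tab-delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    x, y = [], []
    for row in reader:
        x.append(float(row[x_col]))
        y.append(float(row[y_col]))
    return x, y


# Hypothetical sample standing in for one of the data files
sample = "Position\tts1500\n0.0\t1.5\n0.5\t1.7\n1.0\t2.1\n"
x, y = read_xy(io.StringIO(sample), "Position", "ts1500")
```

With a real file, the handle would come from open(path, newline=''), and plt.plot(x, y) would then draw a numeric axis rather than one label per row.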

Plot from multiple files imported with glob

I need to process hundreds of data files and I want to plot the results in a single graph. I'm using glob with a for loop to read and store the data, but I have no idea how to plot them with plotly.
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import glob
pio.renderers.default = 'browser'
files = glob.glob('GIRS12_L_8V_0.95bar.*')
traces = []
for file in files:
    dat = pd.read_csv(file, sep=' ')
    dat.columns = ['time','v(t)']
    fig = go.Figure()
    traces.append(go.Scatter(x = dat['time'], y = dat['v(t)']))
px.scatter(data_frame = traces)
Is it right to call px.scatter(...)? I was using fig.show() at the end but I don't know why it does not show anything in the graph.
I have generated hundreds of CSVs to demonstrate.
pathlib is a more Pythonic / OO approach to interacting with the file system, and it provides glob().
The simplest approach with plotly is to use Plotly Express to generate all of the traces. I have prepared all of the data in a single pandas data frame to make this super simple.
Per the comments, a figure with so many traces, and hence such a long legend, may not be the best visualisation for what you are trying to achieve. Consider what you need to visualise and tune the solution accordingly.
from pathlib import Path
import pandas as pd
import numpy as np
import plotly.express as px
# location where files exist
p = Path.cwd().joinpath("SO_csv")
if not p.is_dir():
    p.mkdir()
# generate 100s of files
for i in range(400):
    pd.DataFrame(
        {
            "time": pd.date_range("00:00", freq="30min", periods=47),
            "v(t)": pd.Series(np.random.uniform(1, 5, 47)).sort_values(),
        }
    ).to_csv(p.joinpath(f"GIRS12_L_8V_0.95bar.{i}"), index=False)
# read and concat all the CSVs into one dataframe, creating additional column that is the filename
# scatter this dataframe, a scatter / color per CSV
px.scatter(
    pd.concat(
        [pd.read_csv(f).assign(name=f.name) for f in p.glob("GIRS12_L_8V_0.95bar.*")]
    ),
    x="time",
    y="v(t)",
    color="name",
)
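The pathlib point above can be tried in isolation. This is a minimal sketch, assuming a throwaway temporary directory and made-up file names that follow the question's pattern; it only lists the files and does not plot anything.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create a few empty stand-in files (names are hypothetical)
    for i in range(3):
        (folder / f"GIRS12_L_8V_0.95bar.{i}").touch()
    (folder / "unrelated.txt").touch()

    # Path.glob yields Path objects, so .name and .suffix are available directly
    matches = sorted(folder.glob("GIRS12_L_8V_0.95bar.*"))
    names = [f.name for f in matches]
    suffixes = [f.suffix for f in matches]
```

Because each match is a Path, the file name used for the colour column in the answer comes from f.name without any string slicing.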

Saving Matplotlib Output to DBFS on Databricks

I'm writing Python code on Databricks to process some data and output graphs. I want to be able to save these graphs as a picture file (.png or something, the format doesn't really matter) to DBFS.
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'fruits':['apple','banana'], 'count': [1,2]})
plt.close()
df.set_index('fruits',inplace = True)
df.plot.bar()
# plt.show()
Things that I tried:
plt.savefig("/FileStore/my-file.png")
[Errno 2] No such file or directory: '/FileStore/my-file.png'
fig = plt.gcf()
dbutils.fs.put("/dbfs/FileStore/my-file.png", fig)
TypeError: has the wrong type - (,) is expected.
After some research, I think the fs.put only works if you want to save text files.
Running the above code with plt.show() gets you a bar graph. I want to be able to save that bar graph as an image to DBFS. Any help is appreciated, thanks in advance!
Easier way, just with matplotlib.pyplot. Fix the dbfs path:
Example
import matplotlib.pyplot as plt
plt.scatter(x=[1,2,3], y=[2,4,3])
plt.savefig('/dbfs/FileStore/figure.png')
You can do this by saving the figure to memory and then using the Python local file APIs to write to the Databricks filesystem (DBFS).
Example:
import matplotlib.pyplot as plt
from io import BytesIO
# Create a plt or fig, then:
buf = BytesIO()
plt.savefig(buf, format='png')
path = '/dbfs/databricks/path/to/file.png'
# Make sure to open the file in bytes mode
with open(path, 'wb') as f:
    # You can also use buf.seek(0) then buf.read()
    f.write(buf.getvalue())
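Both answers depend on the same detail: Python's local file APIs (open, plt.savefig) reach DBFS through the /dbfs mount, while dbutils.fs works with dbfs:/ style paths. Below is a minimal sketch of a converter between the two forms. to_local_dbfs_path is a hypothetical helper written for illustration, not part of Databricks.

```python
import posixpath


def to_local_dbfs_path(path):
    """Map 'dbfs:/a/b' or '/a/b' to '/dbfs/a/b' for use with open() or savefig()."""
    if path.startswith("dbfs:"):
        # Explicit DBFS URI: always place it under the /dbfs mount
        return posixpath.join("/dbfs", path[len("dbfs:"):].lstrip("/"))
    if path == "/dbfs" or path.startswith("/dbfs/"):
        return path  # assumed to be a local mount path already
    return posixpath.join("/dbfs", path.lstrip("/"))


local_path = to_local_dbfs_path("/FileStore/my-file.png")
```

Here local_path is the form that plt.savefig or open(..., 'wb') expects, which is why the bare '/FileStore/my-file.png' in the question raised "No such file or directory".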

Generating a NetCDF from a text file

Using Python can I open a text file, read it into an array, then save the file as a NetCDF?
The following script I wrote was not successful.
import os
import pandas as pd
import numpy as np
import PIL.Image as im
path = 'C:\path\to\data'
grb = [[]]
for fn in os.listdir(path):
    file = os.path.join(path,fn)
    if os.path.isfile(file):
        df = pd.read_table(file,skiprows=6)
        grb.append(df)
df2 = pd.np.array(grb)
#imarray = im.fromarray(df2) ##cannot handle this data type
#imarray.save('Save_Array_as_TIFF.tif')
I once used xarray (formerly named xray) to get a NetCDF file into an ASCII dataframe. I just googled, and apparently it has a to_netcdf function.
Import xarray; it lets you work with labelled data much like pandas does.
So give this a try (a pandas DataFrame has no to_netcdf of its own, so convert it to an xarray Dataset first):
df.to_xarray().to_netcdf(file_path)
xarray slow to save netCDF