Graph CSV data that is represented horizontally rather than vertically - Python pandas CSV

Context: I have combined numerous CSVs into one representing use case vs. usage over a period of time.
The way the data is currently represented is attached.
What I am trying to do is, for each use case, graph across row A (1, 1.1, 1.9, 4.0.11435, 4.1.11436, and so on), creating a line plot to show progression over time.
What I have so far:
import pandas as pd
import matplotlib.pyplot as plt
plot_df = pd.read_csv("results.csv")
milestones = plot_df.columns[1:]  # the version columns: 1, 1.1, 1.9, ...
row = plot_df.iloc[0]  # the first use case
row.plot(kind='line')
plt.show()
Any help is appreciated.
Thank you
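One way to plot every use case at once is to make the use case column the index and transpose, so the milestones become the x-axis and each use case becomes its own line. A minimal sketch, assuming the first column of results.csv holds the use case names:
import pandas as pd
import matplotlib.pyplot as plt
plot_df = pd.read_csv("results.csv")
# Index by use case, then transpose: milestones become the row index,
# and each use case becomes its own column (one line per use case).
transposed = plot_df.set_index(plot_df.columns[0]).T
transposed.plot(kind='line')
plt.xlabel("milestone")
plt.ylabel("usage")
plt.show()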

Related

Cannot plot a histogram from a Pandas dataframe

I've used pandas.read_csv to generate a 1000-row dataframe with 32 columns. I'm looking to plot a histogram or bar chart (depending on data type) of each column. For columns of type 'int64', I've tried doing matplotlib.pyplot.hist(df['column']) and df.hist(column='column'), as well as calling matplotlib.pyplot.hist on df['column'].values and df['column'].to_numpy(). Weirdly, they all take a really long time (>30s), and when I've allowed them to complete, I get unit-height bars in multiple colors, as if there's some sort of implicit grouping and they're all being separated into different groups. Any ideas about what I can do to get a normal histogram? Unfortunately I closed the charts, so I can't show you an example right now.
Edit - this seems to be a much bigger problem with Int columns, and casting them to float fixes the problem.
Follow these two steps:
import matplotlib's pyplot module
call its hist function, which accepts a pandas Series as its first argument
import matplotlib.pyplot as plt
plt.hist(df['column'], color='blue', edgecolor='black', bins=45)
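Per the question's edit, the slowdown seems tied to integer columns; casting to float first sidesteps it. A minimal sketch of that cast, reusing the same hypothetical 'column' name:
# cast the integer column to float before binning to avoid the slow path
plt.hist(df['column'].astype(float), color='blue', edgecolor='black', bins=45)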

How to average data for a variable over a number of timesteps

I was wondering if anyone could shed some light on how I can average this data:
I have a .nc file with data (dimensions: 2029, 64, 32) relating to time, latitude, and longitude. Using these commands I can plot individual timesteps:
timestep = data.variables['precip'][0]
plt.imshow(timestep)
plt.colorbar()
plt.show()
Giving a graph in this format for the 0th timestep:
I was wondering if there was any way to average this first dimension (the snapshots in time).
If you are looking to take a mean over all times, try np.mean, using the axis keyword to say which axis you want to average over.
time_averaged = np.mean(data.variables['precip'], axis=0)
If you have NaN values then np.mean will give NaN for that lon/lat point. If you'd rather ignore them, use np.nanmean.
If you want to use specific times only, e.g. the first 1000 time steps, then you could do
time_averaged = np.mean(data.variables['precip'][:1000, :, :], axis=0)
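For completeness, the NaN-ignoring variant mentioned above looks the same with np.nanmean swapped in. A sketch, reusing the data handle from the question:
import numpy as np
# nanmean skips NaN cells instead of propagating them into the average
time_averaged = np.nanmean(data.variables['precip'][:1000, :, :], axis=0)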
I think if you're using pandas and numpy, this may help you:
import pandas as pd
import numpy as np
data = np.array([10, 5, 8, 9, 15, 22, 26, 11, 15, 16, 18, 7])
d = pd.Series(data)
# rolling(4).mean() averages each window of 4 consecutive values;
# the first 3 entries are NaN because the window is not yet full
print(d.rolling(4).mean())

Enforcing Incoming X-Axis Data to map with Static X-Axis - Plotly

I am trying to plot a multi-axes line graph in Plotly, and my data is percentage (y-axis) vs. date (x-axis).
The x and y values come from the database via pandas.
Since Plotly doesn't understand the ordering of string dates on the x-axis, it reorders them automatically.
I am looking for a way to keep the x-axis static and in date order, with each trace mapped onto it by matching dates.
static_x_axis = ['02-11-2021', '03-11-2021', '04-11-2021', '05-11-2021', '06-11-2021', '07-11-2021', '08-11-2021', '09-11-2021', '10-11-2021', '11-11-2021', '12-11-2021', '13-11-2021', '14-11-2021', '15-11-2021', '16-11-2021', '17-11-2021', '18-11-2021', '19-11-2021', '20-11-2021', '21-11-2021', '22-11-2021', '23-11-2021']
and the above list determines the x-axis mapping.
I tried using range, but it seems that does not support static mapping, or it maps all graphs from the 0th point.
Overall I am looking for a way that either follows a static date range or at least does not break the current order of dates, as happened in the graph above.
Thanks in advance for your help.
From your question, your data is:
x: dates as string representations (i.e. categorical)
y: a number between 0 and 1 (a percentage)
three traces
You describe x as unordered at the source but require it to be sorted on the x-axis.
The code below simulates a figure in this way, then applies categorical axis sorting.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
# 40 consecutive dates formatted as dd-mm-YYYY strings (categorical x values)
s = pd.Series(pd.date_range("2-nov-2021", periods=40).strftime("%d-%m-%Y"))
# three traces, each drawn from a random subset of the dates
fig = go.Figure(
    [
        go.Scatter(
            x=s.sample(10).sort_index().values,
            y=np.linspace(n / 4, n / 3, 10),
            mode="lines+markers+text",
        )
        for n in range(1, 4)
    ]
).update_traces(texttemplate="%{y:.2f}", textposition="top center")
fig.show()
# force the categorical x-axis into the full, ordered date sequence
fig.update_layout(xaxis={"categoryorder": "array", "categoryarray": s.values})
fig.show()
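To pin the axis to the question's own static_x_axis list rather than the simulated series, the same pattern should apply. A sketch, assuming fig is your existing figure and static_x_axis is the list defined above:
# any dates missing from a trace simply leave a gap; the axis order stays fixed
fig.update_layout(xaxis={"categoryorder": "array", "categoryarray": static_x_axis})
fig.show()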

Histogram in Bokeh charts takes a looong time

I am trying to move from matplotlib to Bokeh. However, I am finding some annoying features. The latest I encountered was that it took several minutes to make a histogram of about 1.5M entries; it would have taken a fraction of a second with matplotlib. Is that normal? And if so, what's the reason?
from bokeh.charts import Histogram, show
from bokeh.io import output_notebook
import pandas as pd
output_notebook()
jd1 = pd.read_csv("somefile.csv")
p = Histogram(jd1['QTY'], bins=50)
show(p)
I'm not sure offhand what might be going on with Histogram in your case. Without the data file it's impossible to try to reproduce or debug. But in any case, bokeh.charts does not really have a maintainer at the moment, so I would actually just recommend using bokeh.plotting to create your histogram. The bokeh.plotting API is stable (for several years now) and extensively documented. It's a few more lines of code, but not many:
import numpy as np
from bokeh.plotting import figure, show, output_notebook
output_notebook()
# synthesize example data: 10 million samples, matching the timing quoted below
measured = np.random.normal(0, 0.5, 10_000_000)
hist, edges = np.histogram(measured, density=True, bins=50)
p = figure(title="Normal Distribution (μ=0, σ=0.5)")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color=None)
show(p)
As you can see, that takes (on my laptop) about half a second for a 10 million point histogram, including generating the synthetic data and binning it.
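If you want to verify that timing on your own machine, wrapping the data generation and binning in a simple timer is enough. A sketch (the plotting itself is excluded):
import time
import numpy as np
start = time.perf_counter()
measured = np.random.normal(0, 0.5, 10_000_000)
hist, edges = np.histogram(measured, density=True, bins=50)
print(f"synthesize + bin: {time.perf_counter() - start:.2f}s")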

IPython / pandas: Is there a canonical way to detect rapid changes in a timeseries?

Noob data analyst here, analyzing some gas concentrations over a timeseries of a couple of thousand points (so, small). I graphed it with matplotlib, and there are some easy-to-see points where things change rapidly.
What is the canonical / easiest way to home in on those points?
import pandas as pd
from numpy import diff, concatenate
ff = pd.DataFrame(  # acquire data here
    columns=('Year', 'Recon'))
fd = diff(ff['Recon'], axis=-1)  # first difference of the signal
ff['diff'] = concatenate([[0], fd], axis=0)  # pad with a leading 0 to keep the length
# pd.rolling_mean has since been removed; use the .rolling() accessor instead
ff['rolling10'] = ff['diff'].rolling(10).mean()
ff['rolling5'] = ff['diff'].rolling(5).mean()
ff.plot('Year', ['rolling5', 'rolling10'], subplots=False)
But note! My test data was evenly sampled. At the time, the rolling_* functions didn't apply to irregular time series, though there are some workarounds: Pandas: rolling mean by time interval
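Newer pandas versions do handle this case directly: with a datetime index, .rolling() accepts a time-offset window, which works for irregularly sampled data. A minimal sketch with made-up values:
import pandas as pd
ts = pd.Series(
    [1.0, 2.0, 4.0, 7.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02',
                          '2024-01-05', '2024-01-06']))
# '3D' defines the window by time span rather than by row count,
# so gaps in the sampling are handled naturally
print(ts.rolling('3D').mean())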