I'm fairly new to Python, so the code could probably be improved.
I have a txt file with x and y values, with some NaN entries separating blocks of data.
The data goes from -x to x and then comes back (x to -x), but with somewhat different values of y, say:
x = np.array([-0.02, -0.01, 0, 0.01, 0.02, np.nan, 1, np.nan, 0.02, 0.01, 0, -0.01, -0.02])
I would like to plot (with matplotlib) the data up to the first NaN with one format, the x=1 point with another format, and the last set of data with a third format (color, marker, linewidth...).
Of course the data I have is a bit more complex, but I think this is a useful simplification.
Any ideas?
I'm using pandas as my data manipulation tool.
You can create a group label taking the cumsum of where x is null. Then you can define a dictionary keyed by the label with values being a dictionary containing all of the plotting parameters. Use groupby to plot each group separately, unpacking all the parameters to set the arguments for that group.
Sample Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x = np.array([-0.02, -0.01, 0, 0.01, 0.02, np.nan, 1, np.nan, 0.02, 0.01, 0, -0.01, -0.02])
df = pd.DataFrame({'x': x})
Code
df['label'] = df.x.isnull().cumsum().where(df.x.notnull())
plot_params = {0: {'lw': 2, 'color': 'red', 'marker': 'o'},
               1: {'lw': 6, 'color': 'black', 'marker': 's'},
               2: {'lw': 9, 'color': 'blue', 'marker': 'x'}}
fig, ax = plt.subplots(figsize=(3,3))
for label, gp in df.groupby('label'):
    gp.plot(y='x', **plot_params[label], ax=ax, legend=None)
plt.show()
This is what df looks like for reference after defining the group label
print(df)
       x  label
0  -0.02    0.0
1  -0.01    0.0
2   0.00    0.0
3   0.01    0.0
4   0.02    0.0
5    NaN    NaN
6   1.00    1.0
7    NaN    NaN
8   0.02    2.0
9   0.01    2.0
10  0.00    2.0
11 -0.01    2.0
12 -0.02    2.0
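Since the question has both x and y, here is a minimal sketch of the same labelling idea applied to a two-column frame, without any plotting. The y values and the names `clean`, `segments` and `segment_sizes` are made up for illustration; only the NaN/cumsum labelling comes from the answer above.

```python
import numpy as np
import pandas as pd

x = np.array([-0.02, -0.01, 0, 0.01, 0.02, np.nan, 1, np.nan,
              0.02, 0.01, 0, -0.01, -0.02])
y = x ** 2  # hypothetical y values, only here so there is something to plot

df = pd.DataFrame({"x": x, "y": y})

# Each NaN in x bumps the running count, so every block gets its own integer label
df["label"] = df["x"].isna().cumsum()

# Drop the separator rows, then split the remaining rows by label
clean = df.dropna(subset=["x"])
segments = {int(label): group[["x", "y"]] for label, group in clean.groupby("label")}

segment_sizes = {label: len(group) for label, group in segments.items()}
print(segment_sizes)
```

Each entry of `segments` could then be drawn with something like `ax.plot(seg["x"], seg["y"], **plot_params[label])`, reusing the per-label style dictionary.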
I'm dealing with incomplete data and would like to assign a score to each row.
For example:
Bluetooth and WLAN are non-numeric, but I would like to assign them a value of 1 if data is available and 0 if there is no data (NaN).
Samsung's score would be 1 + 1 + 4 = 6
Nokia's score would be 0 + 0 + 5 = 5
         Bluetooth  WLAN   Rating  Score
Apple    Class-A    USB-A  NaN
Samsung  Class-B    USB-B  4
Nokia    NaN        NaN    5
I'm using Pandas at the moment but I'm not sure if Pandas alone is capable without Numpy.
Thanks a lot!
import pandas as pd
import numpy as np
data = {'Bluetooth': ['class-A', 'class-B', np.nan], 'WLAN': ['usb-A', 'usb-B', np.nan],'Rating': [np.nan, 4, 5]}
df = pd.DataFrame(data)
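# Missing values become 0
df = df.replace(np.nan, 0)
# Any remaining non-numeric value (data is present) is coerced to NaN, then set to 1
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(1)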
df['score'] = df.sum(axis=1)
print(df.head())
Output:
   Bluetooth  WLAN  Rating  score
0        1.0   1.0     0.0    2.0
1        1.0   1.0     4.0    6.0
2        0.0   0.0     5.0    5.0
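As an alternative sketch that leaves the original columns untouched, presence can be tested directly with `notna()`. This assumes, as in the question, that Bluetooth and WLAN each contribute 1 when present and that a missing Rating counts as 0; the name `presence` is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Bluetooth": ["class-A", "class-B", np.nan],
    "WLAN": ["usb-A", "usb-B", np.nan],
    "Rating": [np.nan, 4, 5],
})

# 1 where a value is present, 0 where it is NaN
presence = df[["Bluetooth", "WLAN"]].notna().astype(int)

# Presence flags plus the numeric rating (missing rating counts as 0)
df["score"] = presence.sum(axis=1) + df["Rating"].fillna(0)
print(df)
```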
Try this:
import pandas as pd
import numpy as np
# Count missing values only in the presence columns, not in Rating
df['Nan_count'] = df[['Bluetooth', 'WLAN']].isnull().sum(axis=1)
df['score'] = 2 - df['Nan_count'] + df['Rating'].replace(np.nan, 0)
With this solution we do not need to change the NaN values in our dataframe, and the computation is cheap as well.
I have a dataframe that has, let's say 5 entries.
   moment  stress  strain
0    0.12      13    0.11
1    0.23      14    0.12
2    0.56      15    0.56
I would like to get a 1D list of floats in the order [moment, stress, strain], obtained by linear interpolation at strain = 0.45.
I have read a couple of threads about the interpolate() method in pandas, but it is meant for filling in NaN entries.
How do I accomplish a similar task in my case?
Thank you
One method is to add a new row to your dataframe with NaN values and sort it (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df = pd.concat(
    [df, pd.DataFrame([{"moment": np.nan, "stress": np.nan, "strain": 0.45}])],
    ignore_index=True,
)
df = df.sort_values(by="strain").set_index("strain")
df = df.interpolate(method="index")
print(df)
Prints:
        moment  stress
strain
0.11    0.1200   13.00
0.12    0.2300   14.00
0.45    0.4775   14.75
0.56    0.5600   15.00
To get the values back:
df = df.reset_index()
print(
    df.loc[df.strain == 0.45, ["moment", "stress", "strain"]]
    .to_numpy()
    .tolist()[0]
)
Prints:
[0.47750000000000004, 14.75, 0.45]
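If adding a row to the dataframe is not desirable, the same linear interpolation can be sketched with `numpy.interp`, which needs the x-coordinates (here strain) to be increasing. The names `target` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "moment": [0.12, 0.23, 0.56],
    "stress": [13, 14, 15],
    "strain": [0.11, 0.12, 0.56],
})

target = 0.45

# np.interp expects increasing x-coordinates, so sort by strain first
df_sorted = df.sort_values("strain")
strain = df_sorted["strain"].to_numpy()

# Interpolate each column at the target strain, then append the strain itself
result = [
    float(np.interp(target, strain, df_sorted[col].to_numpy()))
    for col in ["moment", "stress"]
] + [target]
print(result)
```

Note that `np.interp` clamps to the end values outside the data range rather than extrapolating.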
I have a dataframe with 20k rows and 100 columns. I am trying to normalize my data across rows. Scikit's MinMaxScaler doesn't allow me to do this by rows. It has something called minmax_scale that allows row normalization but I cannot denormalize it later. At least, I don't see how to do it. How would you guys do it?
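The formula used by sklearn.preprocessing.minmax_scale can be applied by hand, storing the min and max so the scaling can be inverted later: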
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 5],
'B': [88, 300, 200]})
# Find and store min and max vectors
min_values = df.min()
max_values = df.max()
normalized_df = (df - min_values) / (max_values - min_values)
denormalized_df= normalized_df * (max_values - min_values) + min_values
A B
1 88
2 300
5 200
A B
0.00 0.000000
0.25 1.000000
1.00 0.528302
A B
1.0 88.0
2.0 300.0
5.0 200.0
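The snippet above scales each column. Since the question asks for normalization across rows, here is a hedged sketch of the row-wise variant: the min and max are taken with `axis=1` and aligned back on the index with `axis=0`. Column C is a made-up addition so that each row has an intermediate value; rows whose values are all equal would divide by zero and need separate handling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 5],
                   "B": [88, 300, 200],
                   "C": [10, 20, 30]})  # C is hypothetical, added for illustration

# Store one min and one max per row so the scaling can be undone later
row_min = df.min(axis=1)
row_max = df.max(axis=1)
row_range = row_max - row_min

# axis=0 aligns the per-row Series with the DataFrame index
normalized = df.sub(row_min, axis=0).div(row_range, axis=0)
restored = normalized.mul(row_range, axis=0).add(row_min, axis=0)

print(normalized)
print(restored)
```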
I try to make a barchart in pandas, with two data series coming from a groupby:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().plot(kind='bar', layout=(2,2))
The x axis is not continuous, and only shows values that are in the dataset. In this example, it jumps from 11 to 13.
How can I make it continuous?
**EDIT 2: **
I tried JohnE's data-centric approach, and it works. It creates a new index with no missing values (note the + 1, since np.arange excludes the stop value):
temp = data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose()
temp.reindex(np.arange(temp.index.min(), temp.index.max() + 1)).plot(kind='bar', layout=(2,2))
However, I assume there should be a better approach with histogram instead of bar plot. The best I could do with histograms is:
data.groupby(['popup','UID']).size().groupby(level=0).plot(kind='hist', bins=30, alpha=0.5, layout=(2,2), legend=True)
But I didn't find any option in the hist plot to get the same rendering as the bar plot, without overlapping bars.
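To see what the reindexing step does on its own, here is a small sketch with a made-up count table whose index skips the value 3. The numbers and the names `counts`, `full_index` and `filled` are assumptions for illustration; `fill_value=0` is an optional choice so the gap appears as an empty bar rather than NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical counts per popup group; the index jumps from 2 to 4
counts = pd.DataFrame({0: [46.0, 21.0, 22.0], 1: [38.0, 35.0, 19.0]},
                      index=[1, 2, 4])

# np.arange excludes its stop value, hence the + 1
full_index = np.arange(counts.index.min(), counts.index.max() + 1)

# Missing index values are inserted as rows of zeros
filled = counts.reindex(full_index, fill_value=0)
print(filled)
```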
**EDIT: ** Here is some information to answer the comments.
Data sample:
INSEE C1 popup C3 date \
0 75101.0 0.0 0 NaN 2017-05-17T13:20:16Z
0 75101.0 0.0 0 NaN 2017-05-17T14:23:51Z
1 31557.0 0.0 1 NaN 2017-05-17T14:58:27Z
UID
0 ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
0 ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
1 bafe9715-3a07-4d9b-b85c-0bbf658a9115
First groupby result (sample):
data.groupby(['popup','UID']).size().head(3)
popup UID
0 016d3e7e-1901-4f84-be0e-117988ec57a8 6
01c15455-29cc-4d1e-8743-638fd0f51602 6
03fc9eb0-c5fb-4205-91f0-4b74f78a8b96 3
dtype: int64
Second groupby result (sample):
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().head(3)
popup
0 1 46
3 23
4 22
dtype: int64
After unstack and transpose:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().head(3)
popup 0 1
1 46.0 38.0
2 21.0 35.0
3 23.0 22.0
There is a solution with histogram plot from matplotlib.axes.Axes.hist. It is better to use histograms than bar plots for this purpose, as we can choose the number of bins.
# Separate groups by 'popup' and count number of records for each 'UID'
popup_values = data['popup'].unique()
count_by_popup = [data[data['popup'] == popup_value].groupby(['UID']).size() for popup_value in popup_values]
# Create histogram
fig, ax = plt.subplots()
ax.hist(count_by_popup, 20, histtype='bar', label=[str(x) for x in popup_values])
ax.legend()
plt.show()
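Because the values being histogrammed are integer counts, one option (an assumption on my part, not something the answer above does) is to build bin edges centred on the integers, so every count gets its own bin and gaps show up as empty bins. The sample `counts` array is made up.

```python
import numpy as np

# Hypothetical per-UID record counts for one popup group
counts = np.array([1, 1, 3, 4, 4, 4, 13])

# Edges at k - 0.5 and k + 0.5 put each integer k in the middle of its own bin
edges = np.arange(counts.min() - 0.5, counts.max() + 1.5, 1.0)

hist, _ = np.histogram(counts, bins=edges)
print(hist)
```

The same `edges` array can be passed as the `bins` argument of `ax.hist` in place of the fixed number 20.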
I have a dataframe with one column of data. I'd like to visualize the data such that all the bars above the horizontal axis are blue, and those below it are red.
How can I accomplish this?
You can use where for selecting values above and below 0 to new columns b and c:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
data = np.random.randn(10)
df = pd.DataFrame({'a':data})
df['b'] = df.a.where(df.a >= 0)
df['c'] = df.a.where(df.a < 0)
print (df)
a b c
0 1.624345 1.624345 NaN
1 -0.611756 NaN -0.611756
2 -0.528172 NaN -0.528172
3 -1.072969 NaN -1.072969
4 0.865408 0.865408 NaN
5 -2.301539 NaN -2.301539
6 1.744812 1.744812 NaN
7 -0.761207 NaN -0.761207
8 0.319039 0.319039 NaN
9 -0.249370 NaN -0.249370
#plot to same figure
ax = df.b.plot.bar(color='b')
df.c.plot.bar(ax=ax, color='r')
plt.show()
Using numpy.where you can get the indices at which the data is below 0, np.where(x < 0), and at or above 0, np.where(x >= 0). This gives two non-overlapping arrays, which you can visualize using different colors.
Actually, pandas has its own equivalent of numpy.where; please look at this question: pandas equivalent of np.where
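A related sketch using the three-argument form of `numpy.where`: instead of splitting the data into two columns, it builds one color per value. The sample values and the name `colors` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.6, -0.6, -0.5, 0.9, -2.3])

# 'b' where the value is non-negative, 'r' otherwise
colors = np.where(data >= 0, "b", "r").tolist()
print(colors)
```

The resulting list can be handed to matplotlib's `ax.bar(data.index, data.to_numpy(), color=colors)`, which accepts one color per bar.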