boxplot ValueError: X must have 2 or fewer dimensions - matplotlib

I'm converting some Python code to PySpark as a learning exercise.
[My goal]
month = [[],[],[],[],[],[],[],[],[],[],[],[]]
for row in data:
    if row[-1] != '':
        month[int(row[0].split('-')[1])-1].append(float(row[-1]))
plt.boxplot(month)
plt.show()
[my code]
box_col_list = ['date', 'high_cel']
month = [[],[],[],[],[],[],[],[],[],[],[],[]]
for i in range(1, 13):
    month[int(i)-1].append(df2.select(*box_col_list)\
        .filter(F.month(F.col('date'))==i)\
        .select('high_cel').dropna().toPandas()['high_cel'])
When I run my code, a value error occurs.
ValueError: X must have 2 or fewer dimensions
please help me!
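A likely cause, sketched below without Spark: append adds the whole result of toPandas()['high_cel'] as a single element, so each month ends up as a list containing one sequence, and the nested structure has one dimension too many for boxplot. The dictionary monthly_values is a made-up stand-in for the per-month Series, and the names nested and flat are only for illustration.

```python
# Hypothetical stand-in for the per-month Series returned by toPandas()['high_cel']
monthly_values = {1: [1.5, 2.0, 0.5], 2: [3.0, 4.5]}

nested = [[] for _ in range(12)]
flat = [[] for _ in range(12)]

for i in range(1, 13):
    values = monthly_values.get(i, [])
    # append stores the whole sequence as ONE element: [[1.5, 2.0, 0.5]]
    nested[i - 1].append(values)
    # extend stores each value as its own element: [1.5, 2.0, 0.5]
    flat[i - 1].extend(values)

print(nested[0])
print(flat[0])
```

Under that assumption, replacing append with extend in the loop (or assigning the Series converted with .tolist() to month[i-1]) should give plt.boxplot a list of twelve flat lists, matching the original Python version.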

Related

How to convert an object containing 3 numbers into three separate columns in pandas?

I ran a sentiment analysis model on my dataset of tweets and created a new column with the output called 'scores'. The output was a set of 3 probabilities: the first indicates the probability that the tweet is negative, the second indicates the probability that the tweet is neutral, the third indicates the probability that the tweet is positive.
For example:
[0.013780469, 0.94494355, 0.041276094]
Here is a screenshot of a few observations of the 'score' column
Using this code: df.scores.dtype I found the data type to be an object.
I want to create three separate columns, 'Negative', 'Neutral', 'Positive', one for each probability. Therefore, I would like to split the 'scores' column. How might I go about doing this?
I already tried this:
df[['Negative', 'Neutral', 'Positive']] = pd.DataFrame(df.scores.tolist(), index=df.index)
But I got an error saying:
ValueError: Columns must be same length as key
I also tried this:
df[['Negative', 'Neutral', 'Positive']] = pd.DataFrame([ x.split('~') for x in df['scores'].tolist() ])
But I got an error saying:
AttributeError: 'float' object has no attribute 'split'
When using str(x).split() instead of x.split(), I got this error:
ValueError: Columns must be same length as key
Here is the output when I do print(df['scores']) :
0 [0.07552529 0.7626313 0.16184345]
1 [0.0552146 0.7753107 0.16947475]
2 [0.06891786 0.6625086 0.26857358]
3 [0.10522033 0.7078265 0.18695314]
4 [0.04945428 0.78878057 0.16176508]
...
4976 [0.0196455 0.9556966 0.02465796]
4977 [0.02270025 0.94873595 0.02856365]
4978 [0.01378047 0.94494355 0.04127609]
4979 [0.05239033 0.9061995 0.04141007]
4980 [0.0651902 0.9061197 0.02869013]
Name: scores, Length: 4981, dtype: object
Here is the output when I do df.loc[0:5, "scores"].to_dict():
{0: '[0.07552529 0.7626313 0.16184345]',
1: '[0.0552146 0.7753107 0.16947475]',
2: '[0.06891786 0.6625086 0.26857358]',
3: '[0.10522033 0.7078265 0.18695314]',
4: '[0.04945428 0.78878057 0.16176508]',
5: '[0.02224329 0.87228 0.10547666]'}
You can try this method:
import pandas as pd
# Create some sample data
df = pd.DataFrame(columns=["scores"], data=["[0.013780469, 0.94494355, 0.041276094]",
                                            "[0.013780469, 0.94494355, 0.941276094]",
                                            "[0.513780469, 0.74494355, 0.041276094]",
                                            "[0.813780469, 0.14494355, 0.541276094]"])
# First strip the surrounding brackets, then split on ", "
# (str.strip avoids passing "[" as a regex, which is not a valid pattern on its own)
df[['Negative', 'Neutral', 'Positive']] = df.scores.str.strip("[]").str.split(", ", expand=True)
# Drop the original scores column
df.drop("scores", axis=1, inplace=True)
print(df)
Output:
      Negative     Neutral     Positive
0  0.013780469  0.94494355  0.041276094
1  0.013780469  0.94494355  0.941276094
2  0.513780469  0.74494355  0.041276094
3  0.813780469  0.14494355  0.541276094
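The to_dict() output in the question shows the values separated by spaces rather than ", ", so the same idea may need a whitespace split. The following is a minimal sketch under that assumption, using two rows copied from the question; the variable name parts is only for illustration. It also converts the pieces to floats so they can be used as numbers.

```python
import pandas as pd

# Two rows in the space-separated string format shown by to_dict() in the question
df = pd.DataFrame({"scores": ["[0.07552529 0.7626313 0.16184345]",
                              "[0.0552146 0.7753107 0.16947475]"]})

# Strip the brackets, split on any run of whitespace, and convert to float
parts = df["scores"].str.strip("[]").str.split(expand=True).astype(float)
parts.columns = ["Negative", "Neutral", "Positive"]

# Attach the three new columns next to the original one
df = df.join(parts)
print(df[["Negative", "Neutral", "Positive"]])
```

The .str accessor passes missing values through as NaN, which may be relevant here because the "'float' object has no attribute 'split'" error suggests that some rows in the real data are NaN.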

How to calculate rolling.agg('max') utilising a dataframe column as input to my function

I'm working with a kline dataframe. I'm adding a Swing_High and Swing_Low column to my df.
I've picked up an error where, during low-volatility periods, my Close == Swing_Low price. This gives me an inf error in another function where I compute close / Swing_Low.
To fix this, I need to calculate the max/min value based on whether Close == Swing_Low or not. The default rolling period is 10, but if the above is true, the rolling period should increase to 15.
Below is how I calculated Swing_High and Swing_Low up to the point where I encountered the inf error.
import pandas as pd
df = pd.read_csv('Data/bybit_BTCUSD_15m.csv')
df["Date"] = df["Date"].astype('datetime64[ns]')
# Calculate the swing high and low for a given length
df['Swing_High'] = df['High'].rolling(10).agg('max')
df['Swing_Low'] = df['Low'].rolling(10).agg('min')
I tried the below function but it gives me a ValueError: The truth value of a Series is ambiguous
def swing_high(close, high, period1, period2):
    a = high.rolling(period1).agg('max')
    b = high.rolling(period2).agg('max')
    if a != close:
        return a
    else:
        return b
df['Swing_High'] = swing_high(df['Close'], df['High'], 10, 15)
How do I fix this or is there a better way to achieve my desired outcome?
A simple solution for what you're trying to achieve is the pandas where() function.
Here's the basic syntax; where() keeps the caller's value where the condition is True and takes the other value where it is False:
df['col'] = (value_if_true).where(condition, value_if_false)
Applied to your case:
df['Swing_High_10'] = df['High'].rolling(10).agg('max')
df['Swing_High_15'] = df['High'].rolling(15).agg('max')
df['Swing_High'] = df['Swing_High_10'].where(df['Swing_High_10'] != df['Close'], df['Swing_High_15'])
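To see how where() picks between the two rolling windows, here is a small sketch with made-up prices. The windows 2 and 3 stand in for 10 and 15 so that the effect is visible on five rows; high_short and high_long are illustrative names. Series.where keeps the caller's value where the condition is True and uses the second argument elsewhere.

```python
import pandas as pd

# Made-up prices, only to show the mechanics
df = pd.DataFrame({
    "High":  [5.0, 4.0, 3.0, 3.0, 2.0],
    "Close": [4.5, 3.5, 2.5, 3.0, 1.5],
})

# Short and long rolling maxima (2 and 3 stand in for 10 and 15)
high_short = df["High"].rolling(2).max()
high_long = df["High"].rolling(3).max()

# Keep the short-window value unless it equals Close; then use the long window
df["Swing_High"] = high_short.where(high_short != df["Close"], high_long)
print(df)
```

In row 3 the short-window maximum equals Close (3.0), so the long-window maximum (4.0) is used instead; every other row keeps the short-window value.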

ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

I'm trying to plot k-means clusters, but I'm stuck because one column contains dates, and it's causing a lot of problems (the data is shown in the screenshot).
I've already used to_datetime, so what should I do now?
How can I get past this problem and plot the data?
Thank you in advance!
from sklearn.cluster import KMeans
AAPL= pd.read_csv('AAPL.csv', header=0, squeeze=True)
#sd=store_data.head(100)
x = pd.to_datetime(AAPL.iloc[:, [0,1]],dayfirst=True)
print(x)
kmeans4 = KMeans(n_clusters=4)
y_kmeans4 = kmeans4.fit_predict(x)
print(y_kmeans4)
print(kmeans4.cluster_centers_)
plt.scatter(x[:,0],x[:,1],c=y_kmeans4,cmap='rainbow')
plt.scatter(kmeans4.cluster_centers_[:,0] ,kmeans4.cluster_centers_[:,1],color='black')
You need to select only the first column:
x = pd.to_datetime(AAPL.iloc[:, 0],dayfirst=True)
If you use:
x = pd.to_datetime(AAPL.iloc[:, [0,1]],dayfirst=True)
it selects the first and second columns and raises the error, because pd.to_datetime accepts a DataFrame only when it has columns named year, month and day, as in the linked solution.
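The difference between the two calls can be sketched on a tiny frame. The column names Date and Open and the date strings are assumptions for illustration; the real AAPL.csv may look different.

```python
import pandas as pd

# Hypothetical date strings in day-first format
dates = pd.Series(["03-01-2020", "04-01-2020"])

# A single column (Series) of date strings parses fine
single = pd.to_datetime(dates, dayfirst=True)

# A DataFrame is accepted only when it has year/month/day columns to assemble
parts = pd.DataFrame({"year": [2020, 2020], "month": [1, 1], "day": [3, 4]})
assembled = pd.to_datetime(parts)

# A DataFrame with other column names raises the ValueError from the question
frame = pd.DataFrame({"Date": dates, "Open": [1.0, 2.0]})
try:
    pd.to_datetime(frame, dayfirst=True)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

Note that k-means still needs numeric input, so the parsed dates would have to be converted to numbers before being passed to fit_predict.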

Item Wrong Length 1 Instead of 50 Pandas

I'm dealing with a CSV file consisting of 2 columns and 51 rows in total.
data = pd.read_csv("data.csv", sep = ',')
data.columns=['x_column', 'y_column']
Then I perform linear regression:
X = data.iloc[:, 0].values.reshape(-1, 1)
y = data.iloc[:, 1].values.reshape(-1, 1)
lr = LinearRegression()
The next thing I need to apply is the Tukey method.
X = data.iloc[[0], :].values
y = data.iloc[[1], :].values
Then I plotted the boxes and found that my range is between -40 and 10.
data.boxplot(return_type='dict')
plt.plot()
I need to assign my outliers to a value in order to remove them before training my dataset again. And this is where I have a problem.
y_column = X[:, 1]
data_outliers = (y_column > 0.0)
data[data_outliers]
When I run this last part, I get the error "Item wrong length 1 instead of 50" and I don't know how to solve it. Any help is appreciated.
Try:
data_outliers = (y_column > 0.0).ravel()
The problem was that your data_outliers was a two-dimensional NumPy array (shape: (1, 50)), and a DataFrame cannot be masked with an array of that shape. ravel simply flattens it to one dimension.
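A small sketch of the shape issue, using a made-up four-row frame in place of data.csv. It shows what ravel does to a (1, n) mask, and also an alternative that builds the mask from the DataFrame column directly, so it always has one entry per row.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for data.csv
data = pd.DataFrame({"x_column": [1.0, 2.0, 3.0, 4.0],
                     "y_column": [-5.0, 2.0, -1.0, 7.0]})

# A mask with shape (1, 4): one row holding all the flags
mask_2d = data["y_column"].values.reshape(1, -1) > 0.0

# ravel flattens it to shape (4,), one flag per DataFrame row
mask_1d = mask_2d.ravel()
outliers = data[mask_1d]

# Alternative: compare the column itself, which is already one-dimensional
same = data[data["y_column"] > 0.0]
print(outliers)
```

If the mask still has the wrong length after ravel, it may be worth checking how X was built, since data.iloc[[0], :] selects the first row rather than the first column.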

Error upon DataFrame.set_index : only integer scalar arrays can be converted to a scalar index

When I display my DataFrame 'df_totals', I get the following output:
       Value        Month
0  -63585.86  Grand Total
Next, I'm trying to set the column Month as the index, so I do:
df_totals.set_index(['Month'], append=False, inplace=True)
In my code, this crashes.
However, when I try to reproduce the problem outside my code with a reduced test case, I get no error:
df = pandas.DataFrame({"Value": [-63585.86], "Month" : ["Grand Total"]})
df.set_index(['Month'], append=False, inplace=True)
display(df)
What is the best way to move forward with this analysis?
I imagine the DataFrame is defined differently at some point.
Also, what does this error mean in this context?
It is possible to dump the DataFrame into a string:
print( "df = pd.DataFrame( %s )" % (str(df_totals.to_dict())) )
Thanks to this, I was able to see that my df was built as follows:
df = pandas.DataFrame( {('Value',): {0: -63585.85999999999}, ('Month',): {0: 'Grand Total'}} )
And therefore the line
df[("Month",)]
leads to a crash.
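One possible way to work around this, sketched under the assumption that the columns really are one-element tuples (a single-level MultiIndex): flatten the column labels to plain strings before calling set_index. The frame below is rebuilt by hand to mimic the dump above; flat and indexed are illustrative names.

```python
import pandas as pd

# Rebuild a frame whose column labels are one-element tuples, as in the dump
columns = pd.MultiIndex.from_tuples([("Value",), ("Month",)])
df = pd.DataFrame([[-63585.86, "Grand Total"]], columns=columns)

# The printed frame looks normal, but the columns are a MultiIndex
print(type(df.columns).__name__, df.columns.nlevels)

# Flatten the labels to plain strings by taking the only level
flat = df.copy()
flat.columns = flat.columns.get_level_values(0)

# set_index now behaves as in the reduced test case
indexed = flat.set_index("Month")
print(indexed)
```

Checking type(df.columns) early is a quick way to spot this kind of difference between a crashing frame and a hand-built test case that prints identically.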