Pandas dataframe - multiplying DF's elementwise on same dates - something wrong? - pandas

I've been banging my head over this, I just cannot seem to get it right and I don't understand what is the problem... So I tried to do the following:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import quandl
btc_usd_price_kraken = quandl.get('BCHARTS/KRAKENUSD', returns="pandas")
btc_usd_price_kraken.replace(0, np.nan, inplace=True)
plt.plot(btc_usd_price_kraken.index, btc_usd_price_kraken['Weighted Price'])
plt.grid(True)
plt.title("btc_usd_price_kraken")
plt.show()
eur_usd_price = quandl.get('BUNDESBANK/BBEX3_D_USD_EUR_BB_AC_000', returns="pandas")
eur_dkk_price = quandl.get('ECB/EURDKK', returns="pandas")
usd_dkk_price = eur_dkk_price / eur_usd_price
btc_dkk = btc_usd_price_kraken['Weighted Price'] * usd_dkk_price
plt.plot(btc_dkk.index, btc_dkk) # WHY IS THIS [4785 rows x 1340 columns] ???
plt.grid(True)
plt.title("Historic value of 1 BTC converted to DKK")
plt.show()
As you can see in the comment, I don't understand why I get a result (which I'm trying to plot) that has size: [4785 rows x 1340 columns] ?
Anyway, the code results in a lot of error messages, something like e.g.
> Traceback (most recent call last): File
> "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_qt5agg.py",
> line 197, in __draw_idle_agg
> FigureCanvasAgg.draw(self) File "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_agg.py",
...
> return _from_ordinalf(x, tz) File "/usr/lib/python3.6/site-packages/matplotlib/dates.py", line 254, in
> _from_ordinalf
> dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC) ValueError: ordinal must be >= 1
I have read that pandas aligns on the index when multiplying, so the elementwise multiplication is only done on data pairs where the date is the same (if one DF has a time series for e.g. 1999-2017 and the other only has e.g. 2012-2015, then only the common dates between 2012-2015 are multiplied, i.e. the intersection of the two data sets). So my problem is understanding the error message(s) and finding the solution. The whole problem is related to calculating the btc_dkk variable and plotting it (which is the price of Bitcoin in the currency DKK).

This should work:
usd_dkk_price.multiply(btc_usd_price_kraken['Weighted Price'], axis='index').dropna()
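You are multiplying on columns, not on the index. This happens because you are multiplying a DataFrame by a Series: by default pandas aligns the index of the Series with the columns of the DataFrame (if you had selected the column in usd_dkk_price, so that both operands were Series, this would not have happened). Passing axis='index' aligns on the dates instead. Afterwards, just drop the rows with NaN.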
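To make the alignment behaviour concrete, here is a minimal sketch with made-up numbers. The names rates and btc_usd, the column name "Value", and all values are assumptions for illustration only; they stand in for usd_dkk_price and the 'Weighted Price' column from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the Quandl data (values are made up)
dates = pd.date_range("2017-01-01", periods=3, freq="D")
rates = pd.DataFrame({"Value": [7.0, 7.5, 8.0]}, index=dates)
btc_usd = pd.Series([1000.0, 1010.0], index=dates[:2], name="Weighted Price")

# Series * DataFrame: the Series labels (dates) are matched against the
# DataFrame columns ("Value"), so nothing lines up and every cell is NaN.
wrong = btc_usd * rates
print(wrong.shape)

# Align on the row index instead, then drop dates missing on either side.
right = rates.multiply(btc_usd, axis="index").dropna()
print(right)

# Selecting the column first gives Series * Series, which aligns on dates.
also_right = (rates["Value"] * btc_usd).dropna()
print(also_right)
```

The wide all-NaN frame in the first case is the same effect as the unexpected [4785 rows x 1340 columns] result in the question.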

Related

Iterating and ploting five columns per iteration pandas

I am trying to plot five columns per iteration, but my current code plots everything five times. How can I make it plot five columns per iteration without repeating them?
n=4
for tag_1,tag_2,tag_3,tag_4,tag_5 in zip(df.columns[n:], df.columns[n+1:], df.columns[n+2:], df.columns[n+3:], df.columns[n+4:]):
    fig,ax=plt.subplots(ncols=5, tight_layout=True, sharey=True, figsize=(20,3))
    sns.scatterplot(df, x=tag_1, y='variable', ax=ax[0])
    sns.scatterplot(df, x=tag_2, y='variable', ax=ax[1])
    sns.scatterplot(df, x=tag_3, y='variable', ax=ax[2])
    sns.scatterplot(df, x=tag_4, y='variable', ax=ax[3])
    sns.scatterplot(df, x=tag_5, y='variable', ax=ax[4])
    plt.show()
You are using list slicing in the wrong way. When you use df.columns[n:], you are getting all the column names from the one with index n to the last one. The same is valid for n+1, n+2, n+3 and n+4. This causes the repetition that you are referring to. In addition to that, the fact that the plot is shown five times is due to the behavior of the zip function: when used on iterables with different sizes, the iterable returned by zip has the size of the smaller one (in this case df.columns[n+4:]).
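A small sketch of why the slices overlap, using a plain list of letters as a hypothetical stand-in for df.columns and only three slices instead of five to keep the output short:

```python
# Hypothetical column names standing in for df.columns
columns = list("abcdefghij")
n = 4

# Each slice starts one position later, so zip yields overlapping windows
# and stops as soon as the shortest slice is exhausted.
windows = list(zip(columns[n:], columns[n + 1:], columns[n + 2:]))
for window in windows:
    print(window)
```

Each tuple shares all but one name with the previous tuple, which is why the same columns get plotted again and again.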
You can achieve what you want by adapting your code as follows:
# Imports to create sample data
import string
import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Create some sample data and a sample dataframe
data = { string.ascii_lowercase[i]: [random.randint(0, 100) for _ in range(100)] for i in range(15) }
df = pd.DataFrame(data)
# Iterate in groups of five indexes
for start in range(0, len(df.columns), 5):
    # Get the next five columns. Pay attention to the case in which the number of columns is not a multiple of 5
    cols = [df.columns[idx] for idx in range(start, min(start+5, len(df.columns)))]
    # Adapt your plot and take into account that the last group can be smaller than 5
    # squeeze=False keeps ax a 2-D array even when the last group has a single column
    fig,ax=plt.subplots(ncols=len(cols), tight_layout=True, sharey=True, figsize=(20,3), squeeze=False)
    for idx in range(len(cols)):
        #sns.scatterplot(data=df, x=cols[idx], y='variable', ax=ax[0, idx])
        sns.scatterplot(data=df, x=cols[idx], y=df[cols[idx]], ax=ax[0, idx]) # In the example the values of the column are plotted
    plt.show()
In this case, the code performs the following steps:
Iterate over groups of at most five indexes ([0->4], [5->9]...)
Retrieve the columns at those indexes. The last group may have fewer than five columns (e.g., with 18 columns, the last group consists of the columns at indexes 15, 16 and 17)
Create the plot, taking into account the corner case of fewer than five columns
With Seaborn's objects interface, available from v0.12, you could do it like this:
from numpy import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import seaborn.objects as so
sns.set_theme()
First, let's create a sample dataset, just like trolloldem's answer.
random.seed(0) # To produce the same random values across multiple runs
columns = list("abcdefghij")
sample_size = 20
df_orig = pd.DataFrame(
{c: random.randint(100, size=sample_size) for c in columns},
index=pd.Series(range(sample_size), name="variable")
)
Then transform the data frame into a long-form for easier processing.
df = (df_orig
.melt(value_vars=columns, var_name="tag", ignore_index=False)
.reset_index()
)
Finally, render the plots, five facets per row.
(
so.Plot(df, x="value", y="variable") # Or you might do x="variable", y="value" instead
.facet(col="tag", wrap=5)
.add(so.Dot())
)

'pandas' has no attribute 'to_float'

I'm testing my data with an SVM classifier. My dataset is in text form, and I'm trying to convert it to float.
I have data that looks like this:
[image: dataset]
Transform as float:
df.columns = df('columns').str.rstrip('%').astype('float') / 100.0
TypeError Traceback (most recent call last)
<ipython-input-66-74921537411d> in <module>
1 # Transform as float
----> 2 df.columns = df('columns').str.rstrip('%').astype('float') / 100.0
3
TypeError: 'DataFrame' object is not callable
In general, arbitrary text cannot be converted to float. In your dataset, it seems that all the columns hold text values, and I am not sure whether the values become numbers after rstrip('%') (the values are too long, so they are truncated in the image).
If the values of a column are numeric once rstrip('%') has removed the percent sign, then you can convert them. In addition, you are using (), not [], on the DataFrame. Because you wrote df('columns'), Python treats it as a function call, which raises the TypeError. If the values of the column are numeric, you can do what you want as follows:
df['columns'] = df['columns'].str.rstrip('%').astype('float') / 100.0
Here is a full code sample:
import pandas as pd
df = pd.DataFrame({
'column_name': ['111%', '222%'],
})
# df looks like:
#   column_name
#0 111%
#1 222%
df['column_name'] = df['column_name'].str.rstrip('%').astype('float') / 100.0
print(df)
#   column_name
#0 1.11
#1 2.22
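If some values might not be numeric even after stripping the percent sign, one option is pd.to_numeric with errors="coerce", which turns anything it cannot parse into NaN instead of raising. This is a sketch on made-up data; the "n/a" entry and the new column name as_float are assumptions added for illustration.

```python
import pandas as pd

# Made-up data: the last entry is deliberately not a number
df = pd.DataFrame({"column_name": ["111%", "222%", "n/a"]})

# Strip the trailing percent sign, then convert; unparseable text becomes NaN
stripped = df["column_name"].str.rstrip("%")
df["as_float"] = pd.to_numeric(stripped, errors="coerce") / 100.0
print(df)
```

Rows that end up as NaN can then be inspected or dropped before the data is passed to the classifier.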

Is there a better way of finding summary statistics in Python?

The following is my code for finding the 5 point summary statistics. I keep getting this error:
list indices must be integers or slices, not str
It seems that I'm using the describe function I created the wrong way.
from statistics import stdev,median,mean
def describe(key):
    a=[]
    for i in scripts:
        a.append(i[key])
    a=scripts[key]
    total = sum(script[key] for script in scripts)
    avg = total/len(a)
    avg=mean(a)
    s = stdev(a)
    q25 = min(a)+(max(a)-min(a))*25
    med = min(a)+(max(a)-min(a))*50
    med=median(a)
    q75 = min(a)+(max(a)-min(a))*75
    return (total, avg, s, q25, med, q75)
summary = [('items', describe('items')),
('quantity', describe('quantity')),
('nic', describe('nic')),
('act_cost', describe('act_cost'))]
I keep getting this error:
TypeError Traceback (most recent call last)
<ipython-input-8-ba78d5218ead> in <module>()
----> 1 summary = [('items', describe('items')),
2 ('quantity', describe('quantity')),
3 ('nic', describe('nic')),
4 ('act_cost', describe('act_cost'))]
<ipython-input-1-bcf37f98eb7d> in describe(key)
4 for i in scripts:
5 a.append(i[key])
----> 6 a=scripts[key]
7 total = sum(script[key] for script in scripts)
8 avg = total/len(a)
TypeError: list indices must be integers or slices, not str
It is hard to understand your problem, since we don't know what scripts looks like. It is a global variable that is not defined in the code you posted. The error says that scripts is a list, but your code indexes it with a string key as if it were a dataframe. So please check the type of scripts.
Also, did you know that there is an easy way to calculate a Five-number summary with numpy like this:
import numpy as np
# NumPy >= 1.22 uses method=...; older versions use interpolation='midpoint' instead
minimum, q25, med, q75, maximum = np.percentile(a, [0, 25, 50, 75, 100], method='midpoint')
For description, see:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
Judging by your question, scripts is a list of dictionaries.
Indexing the list directly with a key does not work here.
Instead, collect the values for a given key like this:
getValues = lambda key,inputData: [subVal[key] for subVal in inputData if key in subVal]
In this case,
getValues('key', scripts) gives the corresponding list, and it is then easy to compute the statistics of that list.
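Putting this together, a minimal sketch of describe might look like the following. It assumes scripts is a list of dictionaries with numeric values; the sample records are made up, and statistics.quantiles requires Python 3.8 or later.

```python
from statistics import mean, quantiles, stdev

# Made-up sample records standing in for the real scripts list
scripts = [
    {"items": 1, "quantity": 10},
    {"items": 2, "quantity": 20},
    {"items": 3, "quantity": 30},
    {"items": 4, "quantity": 40},
    {"items": 5, "quantity": 50},
]

def describe(key, records):
    """Return (total, mean, stdev, q25, median, q75) for one key."""
    values = [record[key] for record in records if key in record]
    # n=4 splits the data into quartiles; the middle cut point is the median
    q25, med, q75 = quantiles(values, n=4, method="inclusive")
    return sum(values), mean(values), stdev(values), q25, med, q75

summary = [(key, describe(key, scripts)) for key in ("items", "quantity")]
print(summary)
```

Passing the records in as an argument, rather than reading a global, makes it easier to check what type the function actually receives.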

how to add the following feature to a tfidf matrix?

Hello, I have a list called list_cluster that looks as follows:
list_cluster=["hello,this","this is a test","the car is red",...]
I am using TfidfVectorizer to produce a model as follows:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
with open('vectorizerTFIDF.pickle', 'rb') as infile:
    tdf = pickle.load(infile)
tfidf2 = tdf.transform(list_cluster)
Then I would like to add new features to this matrix, tfidf2. I have a list as follows:
dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]
This list has the same length as list_cluster. Each entry represents a date with 12 positions, where the position of the 1 is the corresponding month of the year;
for instance, '010000000000' represents February.
In order to use it as a feature, I first tried:
import numpy as np
dates=np.array(listMonth)
dates=np.transpose(dates)
to get a numpy array and then transpose it, so that I can concatenate it with the first matrix, tfidf2:
print("shape tfidf2: "+str(tfidf2.shape),"shape dates: "+str(dates.shape))
To concatenate my vector and matrix I tried:
tfidf2=np.hstack((tfidf2,dates[:,None]))
However this is the output:
shape tfidf2: (11159, 1927) shape dates: (11159,)
Traceback (most recent call last):
File "Main.py", line 230, in <module>
tfidf2=np.hstack((tfidf2,dates[:,None]))
File "/usr/local/lib/python3.5/dist-packages/numpy/core/shape_base.py", line 278, in hstack
return _nx.concatenate(arrs, 0)
ValueError: all the input arrays must have same number of dimensions
The shapes seem fine, but I am not sure what is failing. I would appreciate help concatenating this feature to my tfidf2 matrix. Thanks in advance for your attention.
You need to convert all strings to numerics for sklearn. One way to do this is to use the LabelBinarizer class in the preprocessing module of sklearn. This creates a new binary column for each unique value in your original column.
If dates has the same number of rows as tfidf2, then I think this will work.
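import scipy.sparse as sp
from sklearn.preprocessing import LabelBinarizer
# create tfidf2 (TfidfVectorizer.transform returns a SciPy sparse matrix)
tfidf2 = tdf.transform(list_cluster)
# create dates
dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]
# binarize dates
lb = LabelBinarizer()
b_dates = lb.fit_transform(dates)
# tfidf2 is sparse, so stack with scipy.sparse.hstack; NumPy's concatenate/hstack cannot combine it with a dense array
new_tfidf = sp.hstack((tfidf2, b_dates)).tocsr()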
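As an alternative sketch, since each date string is already a one-hot encoding of the month, it can be split into 12 numeric columns directly and stacked onto the sparse matrix. The small csr_matrix below is a made-up stand-in for the real TF-IDF output, and the three date strings are sample values only.

```python
import numpy as np
from scipy import sparse

# Stand-in for the TF-IDF output: 3 documents, 4 terms (made-up values)
tfidf2 = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.5],
    [0.7, 0.0, 0.3, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]))

dates = ["010000000000", "001000000000", "000000000001"]

# Turn each 12-character string into a row of 12 integers
month_features = np.array([[int(ch) for ch in d] for d in dates])

# Stack column-wise while keeping the result sparse
combined = sparse.hstack([tfidf2, sparse.csr_matrix(month_features)]).tocsr()
print(combined.shape)
```

The result keeps one row per document, with the 12 month columns appended after the TF-IDF term columns.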

numpy change elements matching conditions

For two numpy arrays a and b,
a=[1,2,3]
b=[4,5,6]
I want to replace the elements of a that are less than 2.5 with the corresponding elements of b. So I tried
a[a<2.5]=b
hoping a would become a=[4,5,3],
but this raises an error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
a[a<2.5]=b
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 2 output values where the mask is true
What is the problem?
The issue you're seeing is a result of how masks work on numpy arrays.
When you write
a[a < 2.5]
you get back the elements of a which match the mask a < 2.5. In this case, that will be the first two elements only.
Attempting to do
a[a < 2.5] = b
is an error because b has three elements, but a[a < 2.5] has only two.
An easy way to achieve the result you're after in numpy is to use np.where.
The syntax of this is np.where(condition, valuesWhereTrue, valuesWhereFalse).
In your case, you could write
newArray = np.where(a < 2.5, b, a)
Alternatively, if you don't want the overhead of a new array, you could perform the replacement in-place (as you're trying to do in the question). To achieve this, you can write:
idxs = a < 2.5
a[idxs] = b[idxs]
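Both approaches can be checked on the arrays from the question. This is a short sketch; the variable names new_array and mask are chosen here for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Option 1: build a new array, taking b where the condition holds and a elsewhere
new_array = np.where(a < 2.5, b, a)
print(new_array)  # [4 5 3]

# Option 2: replace in place, applying the same mask to both sides
mask = a < 2.5
a[mask] = b[mask]
print(a)  # [4 5 3]
```

Applying the mask to b as well is what makes the number of assigned values match the number of selected positions.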