Error in tk_xts(.) : could not find function "tk_xts" - Error when trying to convert a dataframe of stock returns into a time series object - error-handling

I have a dataset of five stocks and their returns, which I'm trying to reshape so that I can eventually create a covariance matrix. I've been trying the code below to convert the dataset into an "xts" time series object:
log_ret_xts <- log_ret_tidy %>%
spread(symbol, value = ret) %>%
tk_xts()
I get the error message Error in tk_xts(.) : could not find function "tk_xts"
I should have all needed packages installed, any help?

Related

DataFrame to DataFrameRow conversion (Julia)

I'm using Pingouin.jl to test normality.
In their docs, we have
dataset = Pingouin.read_dataset("mediation")
Pingouin.normality(dataset, method="jarque_bera")
This should return a DataFrame with normality true or false for each name in the dataset.
Currently, this broadcasting over a whole DataFrame is deprecated. Calling the function on a single column works and outputs a DataFrame, but I'm unable to concatenate those per-column results into one DataFrame.
Here is what I have so far:
function var_norm(df)
    norm = DataFrame([])
    for i in 1:1:length(names(df))
        push!(norm, Pingouin.normality(df[!,names(df)[i]], method="jarque_bera"))
    end
    return norm
end
The error I get:
julia> push!(norm, Pingouin.normality(df[!,names(df)[1]], method="jarque_bera"))
ERROR: ArgumentError: `push!` does not allow passing collections of type DataFrame to be pushed into a DataFrame. Only `Tuple`, `AbstractArray`, `AbstractDict`, `DataFrameRow` and `NamedTuple` are allowed.
Stacktrace:
 [1] push!(df::DataFrame, row::DataFrame; promote::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1603
 [2] push!(df::DataFrame, row::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:1601
 [3] top-level scope
   @ REPL[163]:1
EDIT: push! function was not properly written at my first version of the post. But, the error persists after the change. How can I reformat the output of type DataFrame from Pingouin into DataFrameRow?
As Pingouin.normality returns a DataFrame, you will have to iterate over its rows and push them one by one:
df = Pingouin.normality(…)
for row in eachrow(df)
    push!(norms, row)
end
If you are sure Pingouin.normality returns a DataFrame with exactly one row, you can simply write
push!(norms, only(Pingouin.normality(…)))

Pandas rolling apply multiplication

I would have thought this would be a basic application of pd.DataFrame().rolling() or pd.Series().rolling(), but it appears that the pandas rolling function cannot handle scalar multiplication being applied to a rolling window; I am hoping I am wrong and someone can spot the error.
I am trying to take a rolling window of a series (or dataframe) and multiply each row of that series/dataframe by a series/dataframe of weights (these weights have been precomputed). The code I thought should work is:
data.rolling(5).apply( lambda x: x*weights )
with
data = pd.Series( np.random.randint(1,101,2000) )
weights = pd.Series([ 0.10650, 0.1405310, 0.1854318, 0.2446788, 0.3228556 ])
I thought data.rolling(5).apply( lambda x: x*weights ) would produce a new rolling series, but the following error is returned every time: "TypeError: cannot convert the series to <class 'float'>".
I should note that the only reason I am trying to multiply by the weights is to apply a corr/cov/mean statistic to the new rolling series/dataframe afterward, something like
rolling_weighted_corr = data.rolling(5).apply( lambda x: x*weights ).corr()
Does anyone know how to multiply (scalar) a series with a rolling series to produce a new rolling series?
Scipy can do this with signal.convolve. For example, mode="same" returns an array of the same size as data, in which a centered window around each row is multiplied by weights and summed (np.sum(weights) is used to normalize). Note that convolution applies the weights in reversed order across the window. Example:
from scipy import signal
data_smoothed = signal.convolve(data, weights, mode="same")/np.sum(weights)
np.var(data) # 836.0666897499997
np.var(data_smoothed) # 185.52213418878418
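As for the TypeError in the question: rolling(...).apply expects the function to return a single number per window, so a lambda that returns x*weights (a whole Series) fails. Below is a minimal pandas-only sketch, assuming the goal is one weighted sum per trailing window. The seed and the names rolling_weighted_sum and check are illustrative and not from the original posts.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the fixed seed is only for reproducibility
rng = np.random.default_rng(0)
data = pd.Series(rng.integers(1, 101, 2000))
weights = np.array([0.10650, 0.1405310, 0.1854318, 0.2446788, 0.3228556])

# The applied function must reduce each window to one scalar.
# raw=True passes each window as a plain NumPy array.
rolling_weighted_sum = data.rolling(5).apply(lambda x: np.dot(x, weights), raw=True)

# Cross-check with a convolution: the weights are reversed because
# convolution flips its kernel, and mode="valid" keeps only full windows.
check = np.convolve(data.to_numpy(), weights[::-1], mode="valid")
```

The first four entries of rolling_weighted_sum are NaN because their windows are incomplete; from the fifth entry on, the values line up with check. Dividing by weights.sum() turns the weighted sum into a weighted mean.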

Cannot write dataframe into hive table after using UDF in Pyspark

I am trying to extract the first element of the probability column (vector data type) using a UDF in Pyspark. I was able to get the new dataframe with the extracted values in the probability column, and I checked that the data type of the column has changed from vector to float. However, I am not able to write the dataframe into a hive table: I get a "No module named numpy" error. Deploy mode is client. Is there a workaround other than installing numpy on all the worker nodes?
Code -
spark = (SparkSession
.builder
.appName("Model_Scoring")
.master('yarn')
.enableHiveSupport()
.getOrCreate()
)
hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("hc360_models")
final_ads = spark.read.parquet("hdfs://DATAHUB/datahube/feature_engineering/final_ads.parquet")
model = PipelineModel.load("/tmp/fitted_model_new/")
first_element=udf(lambda v:float(v[0]),FloatType())
out = model.transform(final_ads)
out = out.withColumn("probability",first_element("probability")).drop('features').drop('rawPrediction')
out.show(10)
out.write.mode("append").format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).option("table","test_hypertension_table_final3").save()
spark.stop()
Error -
ImportError: ('No module named numpy', <function _parse_datatype_json_string at 0x7f83370922a8>, (u'{"type":"struct","fields":[{"name":"_0","type":{"type":"udt","class":"org.apache.spark.ml.linalg.VectorUDT","pyClass":"pyspark.ml.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"type","type":"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","nullable":true,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"integer","containsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}}]}},"nullable":true,"metadata":{}},{"name":"_1","type":"integer","nullable":true,"metadata":{}}]}',))
Sample data -
Schema -
StructType(List(StructField(patient_id,IntegerType,true),
StructField(carrier_operational_id,IntegerType,false),
StructField(gender_cde,StringType,true),
StructField(pre_fixed_mpr_qty,DecimalType(38,8),true),
StructField(idx_days_in_gap,DecimalType(11,1),true),
StructField(age,DecimalType(6,1),true),
StructField(post_fixed_mpr_adh_ind,DecimalType(2,1),true),
StructField(probability,FloatType,true),
StructField(prediction,DoubleType,false),
StructField(run_date,TimestampType,false)))

How to Box Plot Panda Timestamp series ? (Errors with Timestamp type)

I'm using:
Pandas version 0.23.0
Python version 3.6.5
Seaborn version 0.81.1
I'd like a Box Plot of a column of Timestamp data. My dataframe is not a time series, the index is just an integer but I have created a column of Timestamp data using:
# create a new column of time stamps corresponding to EVENT_DTM
data['EVENT_DTM_TS'] =pd.to_datetime(data.EVENT_DTM, errors='coerce')
I filter out all NaT values resulting from coerce.
dt_filtered_time = data[~data.EVENT_DTM_TS.isnull()]
At this point my data looks good and I can confirm the type of the EVENT_DTM_TS column is Timestamp with no invalid values.
Finally to generate the single variable box plot I invoke:
ax = sns.boxplot(x=dt_filtered_time.EVENT_DTM_TS)
and get the error:
TypeError: ufunc add cannot use operands with types dtype('M8[ns]') and dtype( 'M8[ns]')
I've Googled and found:
https://github.com/pandas-dev/pandas/issues/13844
https://github.com/matplotlib/matplotlib/issues/9610
which seemingly indicate issues with data type representations.
I've also seen references to issues with pandas version 0.21.0.
Does anyone have an easy fix to suggest, or do I need to use a different data type for the box plot? I'd like to get a single picture of the distribution of the timestamp data.
This is the code I ended up with:
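import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Tick formatter: turn seconds since the epoch into a 'YYYY-MM' label
def convert_to_date_string(x, pos):
    return time.strftime('%Y-%m', time.localtime(x))

plt.figure(figsize=(15,4))
sns.set(style='whitegrid')
# Box plots need numbers, so convert the timestamps (ns) to seconds since the epoch
temp = dt_filtered_time.EVENT_DTM_TS.astype(np.int64)/1E9
ax = sns.boxplot(x=temp)
ax.xaxis.set_major_formatter(plt.FuncFormatter(convert_to_date_string))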
Here is the result:
Credit goes to ImportanceOfBeingErnest whose comment pointed me towards this solution.

Vectorizing text from data frame column using pandas

I have a Data Frame which looks like this:
I am trying to vectorize every row, but only from the text column. I wrote this code:
vectorizerCount = CountVectorizer(stop_words='english')
# tokenize and build vocab
allDataVectorized = allData.apply(vectorizerCount.fit_transform(allData.iloc[:]['headline_text']), axis=1)
The error says:
TypeError: ("'csr_matrix' object is not callable", 'occurred at index 0')
Doing some research and trying changes I found out the fit_transform function returns a scipy.sparse.csr.csr_matrix and that is not callable.
Is there another way to do this?
Thanks!
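There are a number of problems with your code. DataFrame.apply expects a function as its first argument, but you are passing it the result of fit_transform, which is why you get the "not callable" error. You probably need something like
allDataVectorized = pd.DataFrame(vectorizerCount.fit_transform(allData['headline_text']).toarray())
allData['headline_text'] (with single brackets) is a Series of strings, which is the iterable of documents CountVectorizer expects. The double-bracket form allData[['headline_text']] is a DataFrame, and iterating over a DataFrame yields its column names rather than its rows.
fit_transform returns a csr matrix.
.toarray() converts it to a dense numpy 2d array, from which pd.DataFrame(...) creates a DataFrame.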
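A minimal runnable sketch of this approach follows. The original data frame was not shown, so the three headlines and the publish_date column are made up for illustration; only headline_text, allData, vectorizerCount and allDataVectorized come from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the data frame in the question
allData = pd.DataFrame({
    "publish_date": [20030219, 20030219, 20030220],
    "headline_text": [
        "council backs water plan",
        "water plan delayed",
        "council election results announced",
    ],
})

vectorizerCount = CountVectorizer(stop_words="english")

# Fit on the text column only: one sparse row per headline, one column per term
counts = vectorizerCount.fit_transform(allData["headline_text"])

# Densify and keep the original index so rows still line up with allData
allDataVectorized = pd.DataFrame(counts.toarray(), index=allData.index)
```

For a large corpus the dense array can be very big, so it may be preferable to keep working with the sparse counts matrix directly.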