Preparing stock data for k-means clustering with unique value in column - pandas

I have Dhaka Stock Exchange data combining 359 stocks.
I want to preprocess this for k-means clustering, but because of the non-uniqueness of the symbol I can't prepare the data.

To make use of the data points for clustering, you can ignore the Symbol; the Date isn't required either.
You can select the feature columns by indexing with iloc[row_index, col_index]. To make the data usable for k-means clustering, extract the values from the dataframe using values. This returns a NumPy array, which can be used for the subsequent clustering.
# Sample data
>>> data
        Open  High  Low  Close  Volume
Symbol
a          0     0    0      0       0
b         10     1    1      1      10
c         20     2    2      2      20

# Selecting features and extracting values
# '1:' skips the first column (Open)
>>> data.iloc[:, 1:].values
array([[ 0,  0,  0,  0],
       [ 1,  1,  1, 10],
       [ 2,  2,  2, 20]])

You'll likely want to pivot the data to have one row per ticker.
But I doubt it makes much sense to use k-means on this data. If you are serious about results, you'd need an approach that can deal with missing values, series of different length, and that can use the trading volume as weighting instead of an attribute. If you just naively feed your data into k-means, you'll trivially group stocks by trading volume.
First decide your mathematical objective function. Make sure it's solving your problem. Then decide how to represent your data such that an algorithm can optimize this.
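A minimal sketch of the pivot mentioned above, using a hypothetical long-format frame with only a Close column (the column names and values here are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical long-format data: one row per (Symbol, Date) pair
df = pd.DataFrame({
    'Symbol': ['a', 'a', 'b', 'b'],
    'Date': ['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02'],
    'Close': [1.0, 1.1, 2.0, 2.2],
})

# One row per ticker, one column per date
wide = df.pivot(index='Symbol', columns='Date', values='Close')
X = wide.values  # shape (n_tickers, n_dates), ready for a clustering algorithm
```

In a real dataset you would still have to decide how to handle missing dates per ticker (NaNs after the pivot), which is exactly the missing-values problem raised above.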


Trouble understanding how the indices of a series are determined
So I have a huge data frame that I am reading a single column from, and I need to choose 100 unique values from this column. I think what I did resulted in 100 unique values, but I'm confused about the indexing of the resulting series. I looked at the indices of the data frame and they did not correspond to the values at the same indices of the series. I would like the indices of the resulting series to be the same as the indices of the data frame I am reading the column from. Could someone explain how the resulting indices were determined here?
The indices of the sample do not correspond to the indices that exist in the DataFrame. This is due to the following fact:
When doing CSsq.unique() you in fact get back a np.ndarray (check the docs here). An array does not have any indices. But you are passing this to the pd.Series constructor, and as a result a new Series is created, which has a fresh index (starting from 0 up to n-1, where n is the size of the Series). This, of course, has nothing to do with the DataFrame indices, because you first isolated the unique values.
See the example below for a hypothetical Series called s:
s
0    100
1    100
2    100
3    200
4    250
5    300
6    300
Let's isolate the unique occurrences:
s.unique()
# [100, 200, 250, 300]
And now let's feed this to the pd.Series constructor:
pd.Series(s.unique())
0    100
1    200
2    250
3    300
As you can see this Series was generated from an array and its indices have nothing to do with the initial indices!
Now, if you take a random sample out of this Series, you'll get values with indices that correspond to this new Series object!
If you'd like to get a sample with indices that are derived from the DataFrame try something like this:
CSsq.drop_duplicates().sample(100)
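A small sketch contrasting the two approaches (a toy series stands in for the CSsq column from the question):

```python
import pandas as pd

s = pd.Series([100, 100, 100, 200, 250, 300, 300])

# Round-tripping through unique() discards the index: a fresh RangeIndex
reset = pd.Series(s.unique())

# drop_duplicates() keeps the original index of the first occurrences
kept = s.drop_duplicates()
```

Here `kept` holds the values 100, 200, 250, 300 at their original indices 0, 3, 4, 5, which is exactly what sampling from it preserves.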

What is the most efficient method for calculating per-row historical values in a large pandas dataframe?

Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:
Was_Sold (Y/N)
Color
Size_Group
Shape
Date_Made
Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).
Then for every row in df_a, I want to:
Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
The df_b toys must also be made before the given df_a toy
Then find the ratio of those sold (So count sold / count all matched)
What is the most efficient means to make those per-row calculations above?
The best I've come up with so far is something like the below.
(Note the code might have an error or two, as I'm rough-typing it from a different use case.)
cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')

    # Empty list to build up the calculation in
    ratio_list = []

    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made

        # df to keep the overall prior toy matches
        prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
        prior_count = len(prior_toys)

        # Now find the ones that were sold
        prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])

        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)

    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
Using .itertuples() is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?
EDIT
Added the below script, which emulates data for the above scenario:
import numpy as np
import pandas as pd

colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']

size_df_a = 200
size_df_b = 2000

date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')

def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

df_a = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_a),
        'Size_Group': np.random.choice(sizes, size_df_a),
        'Shape': np.random.choice(shapes, size_df_a),
        'Was_Sold': np.random.choice(sold, size_df_a),
        'Date_Made': random_dates(date_start, date_end, n=size_df_a)
    }
)

df_b = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_b),
        'Size_Group': np.random.choice(sizes, size_df_b),
        'Shape': np.random.choice(shapes, size_df_b),
        'Was_Sold': np.random.choice(sold, size_df_b),
        'Date_Made': random_dates(date_start, date_end, n=size_df_b)
    }
)
First of all, I think your computation would be much more efficient using a relational database and an SQL query. Indeed, the filters can be done by indexing columns, performing a database join, applying some advanced filtering, and counting the result. An optimized relational database can generate an efficient algorithm from a simple SQL query (hash-based row grouping, binary search, fast intersection of sets, etc.). Pandas is sadly not very good at performing advanced queries like this efficiently. It is also very slow at iterating over a dataframe, although I am not sure this can be alleviated here using only pandas. Hopefully you can use some NumPy and Python tricks to (partially) implement what fast relational database engines would do.
Additionally, pure-Python object types are slow, especially (unicode) strings. Thus, converting column types to efficient ones in the first place can save a lot of time (and memory). For example, there is no need for the Was_Sold column to contain "Y"/"N" string objects: a boolean can be used instead. Thus let us convert it:
df_b.Was_Sold = df_b.Was_Sold == "Y"
Finally, the current algorithm has a bad complexity: O(Na * Nb), where Na is the number of rows in df_a and Nb is the number of rows in df_b. This is not easy to improve, though, due to the non-trivial conditions. A first solution is to group df_b by the col column ahead of time, so as to avoid an expensive complete scan of df_b (previously done with df_b[col] == relevant_val). Then, the dates within each precomputed group can be sorted, enabling a fast binary search later. Finally, you can use NumPy to count boolean values efficiently (using np.sum).
Note that doing prior_toys['Was_Sold'] is a bit faster than prior_toys.Was_Sold.
Here is the resulting code:
cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')

    # Empty list to build up the calculation in
    ratio_list = []

    # Split df_b by col and sort each (indexed) group by date
    colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}

    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made

        # df to keep the overall prior toy matches
        curColGroup = colGroups[relevant_val]
        prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
        prior_toys = curColGroup[:prior_count]

        # Now find the ones that were sold
        prior_sold_count = prior_toys['Was_Sold'].values.sum()

        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)

    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
This is 5.5 times faster on my machine.
The iteration over the pandas dataframe is still a major source of slowdown. Indeed, prior_toys['Was_Sold'] takes half the computation time because of the huge overhead of pandas internal function calls repeated Na times. Using Numba may help reduce the cost of this slow iteration. Note that the complexity can be improved to O(Na log Nb) by splitting colGroups into sub-groups ahead of time. This should help completely remove the overhead of computing prior_sold_count. The resulting program should be about 10 times faster than the original one.
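A toy sketch of that last idea: precomputing a cumulative sold count per date-sorted group makes each per-row lookup one binary search plus one array read (the variable names here are illustrative, not from the code above):

```python
import numpy as np
import pandas as pd

# One pre-sorted group: manufacture dates and whether each toy was sold
dates = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01'])
sold = np.array([True, False, True])

order = np.argsort(dates.values)
sorted_dates = dates.values[order]
cum_sold = np.cumsum(sold[order])   # sold count among the first k toys

# For one df_a row: count and ratio of prior sold toys in O(log Nb)
query = np.datetime64('2020-02-15')
prior_count = np.searchsorted(sorted_dates, query)
prior_sold = cum_sold[prior_count - 1] if prior_count else 0
ratio = prior_sold / prior_count if prior_count else 0
```

For the query date above, two toys were made earlier and one of them was sold, so the ratio comes out as 0.5 without slicing any dataframe.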

Creating a CategoricalDtype from an int column in Dask

dask.__version__ = 2.5.0
I have a table with columns containing many uint16 codes in the range 0,...,n, and a bunch of lookup tables containing the mappings from these 'codes' to their 'categories'.
My question: is there a way to make these integer columns 'categorical' without parsing the data or first replacing the codes with the categories?
Ideally, I want Dask to keep the values as-is and accept them as category codes, together with the categories I tell Dask belong to these codes.
dfp = pd.DataFrame({'c01': np.random.choice(np.arange(3),size=10), 'v02': np.random.randn(10)})
dfd = dd.from_pandas(dfp, npartitions=2)
mdt = pd.CategoricalDtype(list('abc'), ordered=True)
dfd.c01 = dfd.c01.map_partitions(lambda s: pd.Categorical.from_codes(s, dtype=mdt), meta='category')
dfd.dtypes
The above does not work: the dtype is 'O' (it seems to have replaced the ints with strings). I can subsequently do the following (which seems to do the trick):
dfd.c01 = dfd.c01.astype('category')
But that seems inefficient for big data sets.
Any pointers are much appreciated.
Some context: I have a big dataset (>500M rows) with many columns containing a limited number of strings: the perfect use case for the categorical dtype. The data gets extracted from a Teradata DW using Parallel Transporter, meaning it produces a delimited UTF-8 file. To make this process faster, I categorize the data on the Teradata side, and I just need to create the category dtype from the codes on the Dask side of the fence.
As long as you have an upper bound on largest integer, which you call n (equal to 3), then the following will work.
In [33]: dfd.c01.astype('category').cat.set_categories(np.arange(len(mdt.categories))).cat.rename_categories(list(mdt.categories))
Out[33]:
Dask Series Structure:
npartitions=2
0 category[known]
5 ...
9 ...
Name: c01, dtype: category
Dask Name: cat, 10 tasks
Which is the following when computed
Out[34]:
0 b
1 b
2 c
3 c
4 a
5 c
6 a
7 a
8 a
9 a
Name: c01, dtype: category
Categories (3, object): [a, b, c]
The basic idea is to make an intermediate Categorical whose categories are the codes (0, 1, ... n) and then move from those numerical categories to the actual one (a, b, c).
We have an open issue for making this nicer: https://github.com/dask/dask/issues/2829
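For reference, in plain pandas (outside Dask) the code-to-category mapping the question asks for is a single from_codes call, which involves no parsing or string replacement; a minimal sketch:

```python
import numpy as np
import pandas as pd

codes = np.array([0, 1, 2, 1], dtype='uint16')
mdt = pd.CategoricalDtype(list('abc'), ordered=True)

# Build the categorical directly from the integer codes; the codes are
# accepted as-is and mapped to the declared categories
cat = pd.Categorical.from_codes(codes, dtype=mdt)
```

The intermediate-categories trick in the answer above exists because applying this per-partition in Dask (via map_partitions) loses the dtype metadata, as the question demonstrates.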

How to plot outliers with regard to unique ids

I have an item_code column in my data and another column, sales, which represents the sales quantity for the particular item.
The data can contain a particular item id many times; other columns tell these entries apart.
I want to plot only the outlier sales for each item (because the data has thousands of different item ids, plotting every entry would be difficult).
Since I'm very new to this, what is the right way and tool to do this?
You can use pandas. You should choose a method for detecting outliers; here is an example:
If you want to find outliers across all sales (not per group), you can use apply with a function (here, a lambda) to get the outlier indexes.
import numpy as np
import pandas as pd
%matplotlib inline

df = pd.DataFrame({'item_id': [1, 1, 2, 1, 2, 1, 2],
                   'sales': [0, 2, 30, 3, 30, 30, 55]})

# Keep rows whose sales lie more than one standard deviation from the mean
df[df.apply(lambda x: np.abs(x.sales - df.sales.mean()) / df.sales.std() > 1, 1)
   ].set_index('item_id').plot(style='.', color='red')
In this example we generated a data sample and searched for the indexes of points that lie more than one standard deviation from the mean (you can try another method). Then we just plot them, where y is the sales quantity and x is the item id. This method detected the points 0 and 55. If you want to search for outliers within groups, you can group the data first.
df.groupby('item_id').apply(lambda data: data.loc[
    data.apply(lambda x: np.abs(x.sales - data.sales.mean()) / data.sales.std() > 1, 1)
]).set_index('item_id').plot(style='.', color='red')
In this example we get the points 30 and 55, because 0 isn't an outlier within the group where item_id = 1, but 30 is.
Is this what you want to do? I hope it helps you get started.

Conditional join result count in a large dataframe

I have a data set of about 100m rows, 4gb, containing two lists like these:
Seed
a
r
apple
hair
brush
tree
Phrase
apple tree
hair brush
I want to get the count of unique matched 'Phrase's for each unique 'Seed'. For example, the seed 'a' is contained in both 'apple tree' and 'hair brush', so its 'Phrases_matched_count' should be 2. Matches are just partial matches (i.e. a 'string contains' check; it does not need to be a regex or anything complex).
Seed Phrases_matched_count
a 2
r 2
apple 1
hair 1
brush 1
tree 1
I have been trying to find a way to do this using Apache Pig (on a small Amazon EMR cluster) and Python pandas (the data set just about fits in memory), but I can't find a way to do it without either looping through every row for each unique 'seed', which would take very long, or computing a cross product of the tables, which would use too much memory.
Any ideas?
This can be done using the built-in str.contains, but I'm not sure how well it scales to such a large amount of data.
# Test data
seed = pd.Series(['a', 'r', 'apple', 'hair', 'brush', 'tree'])
phrase = pd.Series(['apple tree', 'hair brush'])

# Creating a DataFrame with seeds as index and phrases as columns
df = pd.DataFrame(index=seed, columns=phrase)

# Checking if each seed is contained in each phrase
df = df.apply(lambda x: x.index.str.contains(x.name), axis=1)

# Getting the result
df.sum(axis=1)
# The result
a        2
r        2
apple    1
hair     1
brush    1
tree     1
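An alternative sketch with one vectorized str.contains call per seed (plain substring matching, regex disabled), which avoids materializing the full seed-by-phrase frame; memory then scales with the number of phrases rather than with their product:

```python
import pandas as pd

seed = pd.Series(['a', 'r', 'apple', 'hair', 'brush', 'tree'])
phrase = pd.Series(['apple tree', 'hair brush'])

# For each seed, count the phrases that contain it as a plain substring
counts = pd.Series({s: int(phrase.str.contains(s, regex=False).sum()) for s in seed})
```

This still scans every phrase once per seed, so it remains O(seeds * phrases) in time, just not in memory.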