I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now instead of using pandas get_dummmies() command I would like to use CountVectorizer to create the same output. Because get_dummies takes too much time.
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0.,max_df=1.0)
X = vect.fit_transform(df_x)
count_vect_df = pd.DataFrame(X.todense(), columns = vect.get_feature_names())
When I now output the respective data frame "count_vect_df" then the data frame contains a lot of columns which are empty/ contains only zero values. How can I avoid this?
Cheers,
Andi
From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
The CountVectorizer returns a sparse-matrix, which contains most of zero values, where non-zero values represent the number of times that specific term has appeared in the particular document.
I need to extract dataframes from json data stored in every row of initial dataframe and concat them all together. Currently it works for me over iteration and takes ages.
Input data is dataframe, containing JSON dictionaries:
print(json_table)
json_responce timestamp request
27487 {'explore_tabs.. 2019-07-02 02:05:25 Lisboa, Portugal
27488 {'explore_tabs.. 2019-07-02 02:05:27 Ribeira, Portugal
The json_responce field is being unwraped to dataframe:
from pandas.io.json import json_normalize
from ast import literal_eval
json = literal_eval(json_table.loc[0,'json_responce'])
df_normalized = json_normalize(json['explore_tabs'][0]['sections'][0]
['listings'])
which gives a nice unwrapped dataframe for each row of the initial df
Having 27000 rows of json containing df, I iterate over initial df, which creates new df at every step and concat's to the final_df, to concat all the data together:
def unwrap_json_and_concat(json_table):
final_df = pd.DataFrame()
for i in json_table.index:
row = literal_eval(json_table.loc[i,'json_responce'])
df = json_normalize(row['explore_tabs'][0]['sections']
[0]['listings'])
final_df = pd.concat([final_df,df])
return final_df
As expected, that takes ages to iterate over, with significant slowing towards the end of calculation due to the increasing size of the final_df.
I know how to create functions for apply, but I believe it will not give much perfomance either, due to the fact, that new dataframe is being created every row anyways.
How to vectorize this calculation?
Thank you!
I am trying to prepare data for supervised learning. I have my Tfidf data, which was generated from a column in my dataframe called "merged"
vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print X.shape
print type(X)
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TFIDF matrix, I have a list of additional numeric features. Each list is length 40 and it's comprised of floats.
So for clarify, I have 57,629 lists of length 40 which I'd like to append on to my TDIDF result.
Currently, I have this in a DataFrame, example data: merged["other_data"]. Below is an example row from the merged["other_data"]
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
This will do the work.
`df1 = pd.DataFrame(X.toarray()) //Convert sparse matrix to array
df2 = YOUR_DF of size 57k x 40
newDf = pd.concat([df1, df2], axis = 1)`//newDf is the required dataframe
I figured it out:
First: iterate over my pandas column and create a list of lists
for_np = []
for x in merged['other_data']:
row = x.split(",")
row2 = map(float, row)
for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original tfidf sparse matrix and my new matrix. I'll probably end-up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the anwers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.
The following data has been parsed from a stock API. The dataframe has the headers of each column in the Dataset respectively. Is there anyway I can link the data to the dataframe effectively creating a labeled data array/table?
DataFrame
df = pd.DataFrame(columns=['Date','Close','High','Low','Open','Volume'])
DataSet
20140502,36.8700,37.1200,36.2100,36.5900,22454100
20140505,36.9100,37.0500,36.3000,36.6800,13129100
20140506,36.4900,37.1700,36.4800,36.9400,19156000
20140507,34.0700,35.9900,33.6700,35.9900,66062700
20140508,33.9200,34.5700,33.6100,33.8800,30407700
20140509,33.7600,34.1000,33.4100,34.0100,20303400
20140512,34.4500,34.6000,33.8700,33.9900,22520600
20140513,34.4000,34.6900,34.1700,34.4300,12477100
20140514,34.1700,34.6500,33.9800,34.4800,17039000
20140515,33.8000,34.1900,33.4000,34.1800,18879800
20140516,33.4100,33.6600,33.1000,33.6600,18847100
20140519,33.8900,33.9900,33.2800,33.4100,14845700
20140520,33.8700,34.4700,33.6700,33.9900,18596700
20140521,34.3600,34.3900,33.8900,34.0000,13804500
20140522,34.7000,34.8600,34.2600,34.6000,17522800
20140523,35.0200,35.0800,34.5100,34.8500,16294400
20140527,35.1200,35.1300,34.7300,35.0000,13057000
20140528,34.7800,35.1700,34.4200,35.1500,16960500
20140529,34.9000,35.1000,34.6700,34.9000,9780800
20140530,34.6500,34.9300,34.1300,34.9200,13153000
20140602,34.8700,34.9500,34.2800,34.6900,9178900
20140603,34.6500,34.9700,34.5800,34.8000,6557500
20140604,34.7300,34.8300,34.2600,34.4800,9434100
I'm assuming that you are receiving the data as a list of lists. So something like -
vals = [[20140502,36.8700,37.1200,36.2100,36.5900,22454100], [20140505,36.9100,37.0500,36.3000,36.6800,13129100], ...]
In that case, you can populate your dataframe with loc -
for index, val in enumerate(vals):
df.loc[index] = val
Which will give you -
In [6]: df
Out[6]:
Date Close High Low Open Volume
0 20140502 36.87 37.12 36.21 36.59 22454100
1 20140505 36.91 37.05 36.3 36.68 13129100
...
Here, enumerate gives us the index of the row, so we can use that to populate the dataframe index.
If somehow the data was saved as csv, then you can simply use read_csv -
df = pd.read_csv('data.csv', names=['Date','Close','High','Low','Open','Volume'])