I got stuck converting a specific dictionary to a pandas dataframe - pandas

I have a dictionary and I got stuck while trying to convert it to a pandas dataframe.
It's the result of scoring an IBM ML model. The result comes in this format, and I would like to transform this dictionary into a pandas dataframe so that I can later merge it with the original dataframe that was scored.
Dictionary:
{'predictions': [{'fields': ['prediction', 'probability'], 'values': [['Creditworthy', [0.5522992460276774, 0.4477007539723226]]]}]}
I would like a pandas dataframe like this:
index predictions prediction probability
0 Creditworthy 0.552299 0.447701

Assume that the source dictionary is stored in a variable named dct.
Start by reading the column names:
cols = dct['predictions'][0]['fields']
Then create a DataFrame in the form that can be read directly from this dictionary:
df = pd.DataFrame(dct['predictions'][0]['values'],
columns=['predictions', 'val'])
For the time being, the values are in the val column, as a list:
predictions val
0 Creditworthy [0.5522992460276774, 0.4477007539723226]
Then split the val column into separate columns, assigning at the same time
the proper column names (read earlier into cols):
df[cols] = pd.DataFrame(df.val.values.tolist())
The only thing left to do is to drop the val column:
df.drop(columns=['val'], inplace=True)
The result is:
predictions prediction probability
0 Creditworthy 0.552299 0.447701
Just as it should be.
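As a more compact alternative, the same frame can be built in one pass by flattening each row before calling the constructor. This is only a sketch: it assumes every entry under 'values' has the shape [label, [p0, p1]] seen in the question, and it keeps the column name predictions used in the steps above.

```python
import pandas as pd

# The scoring result from the question.
dct = {'predictions': [{'fields': ['prediction', 'probability'],
                        'values': [['Creditworthy',
                                    [0.5522992460276774, 0.4477007539723226]]]}]}

block = dct['predictions'][0]
cols = block['fields']  # ['prediction', 'probability']

# Flatten each [label, [p0, p1]] row into [label, p0, p1].
rows = [[label, *probs] for label, probs in block['values']]

# 'predictions' holds the label; the two probabilities take the names in cols.
df = pd.DataFrame(rows, columns=['predictions', *cols])
print(df)
```

Because the frame has a default 0..n-1 index, it can be joined back to the scored dataframe by position, provided both are in the same row order.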

Related

Python CountVectorizer for Pandas DataFrame

I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now, instead of using the pandas get_dummies() command, I would like to use CountVectorizer to create the same output, because get_dummies takes too much time.
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0.,max_df=1.0)
X = vect.fit_transform(df_x)
count_vect_df = pd.DataFrame(X.todense(), columns = vect.get_feature_names())
When I output the resulting data frame "count_vect_df", it contains a lot of columns that are empty or contain only zero values. How can I avoid this?
From the scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
CountVectorizer returns a sparse matrix in which most entries are zero; each non-zero value is the number of times that term appears in that document.

vectorise pandas: extract multiple dataframes and concat together

I need to extract dataframes from JSON data stored in every row of an initial dataframe and concatenate them all together. Currently I do this by iteration, and it takes ages.
The input data is a dataframe containing JSON dictionaries:
print(json_table)
json_responce timestamp request
27487 {'explore_tabs.. 2019-07-02 02:05:25 Lisboa, Portugal
27488 {'explore_tabs.. 2019-07-02 02:05:27 Ribeira, Portugal
The json_responce field is unwrapped into a dataframe:
from pandas.io.json import json_normalize
from ast import literal_eval
json = literal_eval(json_table.loc[0,'json_responce'])
df_normalized = json_normalize(json['explore_tabs'][0]['sections'][0]
['listings'])
which gives a nice unwrapped dataframe for each row of the initial df.
With 27,000 rows of JSON in the df, I iterate over the initial df, create a new df at every step, and concatenate it onto final_df to gather all the data:
def unwrap_json_and_concat(json_table):
    final_df = pd.DataFrame()
    for i in json_table.index:
        row = literal_eval(json_table.loc[i,'json_responce'])
        df = json_normalize(row['explore_tabs'][0]['sections']
                            [0]['listings'])
        final_df = pd.concat([final_df,df])
    return final_df
As expected, this takes ages, and it slows down significantly towards the end because final_df keeps growing.
I know how to write functions for apply, but I doubt it would improve performance much, since a new dataframe is still created for every row.
How can I vectorize this calculation?
Thank you!
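One common way to remove the slowdown described above is to collect the per-row frames in a list and call pd.concat once at the end, so the growing result is not copied on every iteration. The sketch below is not a true vectorization, and it rests on assumptions: pandas 1.0 or later (for pd.json_normalize), the same nested key path as in the question, and a hypothetical two-row toy table in place of the real data. It also resets the index with ignore_index=True, which the original loop did not do.

```python
from ast import literal_eval

import pandas as pd


def unwrap_json_and_concat(json_table):
    """Normalize the listings of every row, then concatenate once."""
    frames = []
    for raw in json_table['json_responce']:
        row = literal_eval(raw)
        listings = row['explore_tabs'][0]['sections'][0]['listings']
        frames.append(pd.json_normalize(listings))
    # A single concat copies the data once instead of once per row.
    return pd.concat(frames, ignore_index=True)


# Hypothetical toy input with the same nesting as in the question.
json_table = pd.DataFrame({'json_responce': [
    "{'explore_tabs': [{'sections': [{'listings': "
    "[{'id': 1, 'price': {'amount': 50}}]}]}]}",
    "{'explore_tabs': [{'sections': [{'listings': "
    "[{'id': 2, 'price': {'amount': 70}}, {'id': 3, 'price': {'amount': 90}}]}]}]}",
]})

final_df = unwrap_json_and_concat(json_table)
print(final_df)
```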

How to create a DataFrame with index names different from `row` and write data into (`index`, `column`) pairs in Julia?

How can I create a DataFrame with Julia with index names that are different from Row and write values into a (index,column) pair?
I do the following in Python with pandas:
import pandas as pd
df = pd.DataFrame(index = ['Maria', 'John'], columns = ['consumption','age'])
df.loc['Maria', 'age'] = 52
I would like to do the same in Julia. How can I do this? The documentation shows a DataFrame similar to the one I would like to construct but I cannot figure out how.

Combine Sklearn TFIDF with Additional Data

I am trying to prepare data for supervised learning. I have my Tfidf data, which was generated from a column in my dataframe called "merged"
vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print(X.shape)
print(type(X))
(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TFIDF matrix, I have a list of additional numeric features. Each list is length 40 and it's comprised of floats.
To clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.
Currently, I have this in a DataFrame, example data: merged["other_data"]. Below is an example row from the merged["other_data"]
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.
This will do the job:
df1 = pd.DataFrame(X.toarray())  # convert the sparse matrix to a dense array
df2 = YOUR_DF  # placeholder: your DataFrame of size 57k x 40
newDf = pd.concat([df1, df2], axis=1)  # newDf is the required dataframe
I figured it out:
First: iterate over my pandas column and create a list of lists
for_np = []
for x in merged['other_data']:
    row = x.split(",")
    row2 = list(map(float, row))  # list() is needed on Python 3, where map is lazy
    for_np.append(row2)
Then create a np array:
n = np.array(for_np)
Then use scipy.sparse.hstack on X (my original tfidf sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
import scipy.sparse
X = scipy.sparse.hstack([X, n])
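The steps above can be sketched end to end on toy data. Everything here is a stand-in: a tiny hand-made sparse matrix plays the role of the TF-IDF output, and the hypothetical other_data strings hold 2 values instead of 40. The string column is split with pandas instead of an explicit loop, and the result is converted to CSR because hstack returns COO format by default.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-in for the TF-IDF matrix (2 documents x 3 terms).
X = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                [0.7, 0.0, 0.0]]))

# Stand-in for merged['other_data']: comma-separated floats, one string per row.
merged = pd.DataFrame({'other_data': ['0.1,0.2', '0.3,0.4']})

# Split each string into columns and convert to a float array (2 x 2 here).
extra = merged['other_data'].str.split(',', expand=True).astype(float).to_numpy()

# Append the dense features as extra columns of the sparse matrix.
combined = sparse.hstack([X, sparse.csr_matrix(extra)]).tocsr()
print(combined.shape)
```

Keeping the combined matrix sparse avoids materialising the full dense array, which matters at 57,629 x 11,947.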
You could have a look at the answer to this question:
use Featureunion in scikit-learn to combine two pandas columns for tfidf
Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you will want to work with pipelines and feature unions.

Python 3: Creating DataFrame with parsed data

The following data has been parsed from a stock API. The dataframe has the headers of each column in the dataset. Is there any way I can link the data to the dataframe, effectively creating a labeled data array/table?
DataFrame
df = pd.DataFrame(columns=['Date','Close','High','Low','Open','Volume'])
DataSet
20140502,36.8700,37.1200,36.2100,36.5900,22454100
20140505,36.9100,37.0500,36.3000,36.6800,13129100
20140506,36.4900,37.1700,36.4800,36.9400,19156000
20140507,34.0700,35.9900,33.6700,35.9900,66062700
20140508,33.9200,34.5700,33.6100,33.8800,30407700
20140509,33.7600,34.1000,33.4100,34.0100,20303400
20140512,34.4500,34.6000,33.8700,33.9900,22520600
20140513,34.4000,34.6900,34.1700,34.4300,12477100
20140514,34.1700,34.6500,33.9800,34.4800,17039000
20140515,33.8000,34.1900,33.4000,34.1800,18879800
20140516,33.4100,33.6600,33.1000,33.6600,18847100
20140519,33.8900,33.9900,33.2800,33.4100,14845700
20140520,33.8700,34.4700,33.6700,33.9900,18596700
20140521,34.3600,34.3900,33.8900,34.0000,13804500
20140522,34.7000,34.8600,34.2600,34.6000,17522800
20140523,35.0200,35.0800,34.5100,34.8500,16294400
20140527,35.1200,35.1300,34.7300,35.0000,13057000
20140528,34.7800,35.1700,34.4200,35.1500,16960500
20140529,34.9000,35.1000,34.6700,34.9000,9780800
20140530,34.6500,34.9300,34.1300,34.9200,13153000
20140602,34.8700,34.9500,34.2800,34.6900,9178900
20140603,34.6500,34.9700,34.5800,34.8000,6557500
20140604,34.7300,34.8300,34.2600,34.4800,9434100
I'm assuming that you are receiving the data as a list of lists. So something like -
vals = [[20140502,36.8700,37.1200,36.2100,36.5900,22454100], [20140505,36.9100,37.0500,36.3000,36.6800,13129100], ...]
In that case, you can populate your dataframe with loc -
for index, val in enumerate(vals):
    df.loc[index] = val
Which will give you -
In [6]: df
Out[6]:
Date Close High Low Open Volume
0 20140502 36.87 37.12 36.21 36.59 22454100
1 20140505 36.91 37.05 36.3 36.68 13129100
...
Here, enumerate gives us the index of the row, so we can use that to populate the dataframe index.
If somehow the data was saved as csv, then you can simply use read_csv -
df = pd.read_csv('data.csv', names=['Date','Close','High','Low','Open','Volume'])
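If the rows are already available as a list of lists, a possible shortcut is to pass them straight to the DataFrame constructor rather than filling the frame row by row. The sketch below uses the first two rows from the question. The date conversion is an optional extra, and the in-memory io.StringIO buffer is only a stand-in for a real CSV file.

```python
import io

import pandas as pd

columns = ['Date', 'Close', 'High', 'Low', 'Open', 'Volume']

# First two rows from the question, as a list of lists.
vals = [[20140502, 36.87, 37.12, 36.21, 36.59, 22454100],
        [20140505, 36.91, 37.05, 36.30, 36.68, 13129100]]

# Build the labeled table in one call.
df = pd.DataFrame(vals, columns=columns)

# Optional: turn the YYYYMMDD integers into real timestamps.
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')

# The same data read as CSV text; a file path would work the same way.
raw = ("20140502,36.8700,37.1200,36.2100,36.5900,22454100\n"
       "20140505,36.9100,37.0500,36.3000,36.6800,13129100\n")
df_csv = pd.read_csv(io.StringIO(raw), names=columns)
print(df)
```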