vectorise pandas: extract multiple dataframes and concat together - pandas

I need to extract dataframes from JSON data stored in every row of an initial dataframe and concat them all together. Currently I do this by iterating over the rows, and it takes ages.
The input data is a dataframe containing JSON dictionaries:
print(json_table)
         json_responce            timestamp            request
27487  {'explore_tabs..  2019-07-02 02:05:25   Lisboa, Portugal
27488  {'explore_tabs..  2019-07-02 02:05:27  Ribeira, Portugal
The json_responce field is unwrapped into a dataframe:
from pandas.io.json import json_normalize
from ast import literal_eval

json = literal_eval(json_table.loc[0, 'json_responce'])
df_normalized = json_normalize(
    json['explore_tabs'][0]['sections'][0]['listings'])
which gives a nice unwrapped dataframe for each row of the initial df
Having 27000 rows of dataframe-containing JSON, I iterate over the initial df, creating a new df at every step and concatenating it to final_df, to gather all the data together:
def unwrap_json_and_concat(json_table):
    final_df = pd.DataFrame()
    for i in json_table.index:
        row = literal_eval(json_table.loc[i, 'json_responce'])
        df = json_normalize(
            row['explore_tabs'][0]['sections'][0]['listings'])
        final_df = pd.concat([final_df, df])
    return final_df
As expected, that takes ages to iterate over, with a significant slowdown towards the end of the calculation due to the increasing size of final_df.
I know how to write functions for apply, but I don't believe that will give much of a performance gain either, since a new dataframe is created for every row anyway.
How to vectorize this calculation?
Thank you!
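For reference, a common remedy (not true vectorization, but it removes the quadratic copying) is to collect the per-row frames in a list and call pd.concat only once at the end. A minimal sketch, reusing the names from the question:
import pandas as pd
from ast import literal_eval
from pandas.io.json import json_normalize

def unwrap_json_and_concat(json_table):
    # Parse and normalize every row first, keeping the resulting frames in a list
    frames = [
        json_normalize(
            literal_eval(raw)['explore_tabs'][0]['sections'][0]['listings'])
        for raw in json_table['json_responce']
    ]
    # Concatenate once, so the growing final_df is never copied repeatedly
    return pd.concat(frames, ignore_index=True)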

Related

Python CountVectorizer for Pandas DataFrame

I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now, instead of using the pandas get_dummies() command, I would like to use CountVectorizer to create the same output, because get_dummies takes too much time.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(df_x)
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names())
When I now output the respective data frame "count_vect_df", it contains a lot of columns which are empty / contain only zero values. How can I avoid this?
Cheers,
Andi
From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
The CountVectorizer returns a sparse matrix, which contains mostly zero values; the non-zero values represent the number of times that a specific term appeared in the particular document.
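If the goal is simply to avoid materialising all those zero columns, one option is to keep the sparse representation instead of calling .todense(). A minimal sketch, assuming pandas >= 0.25 (for the sparse accessor) and the same CountVectorizer call as in the question:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(df_x)  # scipy.sparse.csr_matrix, zeros are not stored

# Wrap the sparse matrix in a sparse-backed DataFrame instead of densifying it
count_vect_df = pd.DataFrame.sparse.from_spmatrix(X, columns=vect.get_feature_names())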

I got stuck converting a specific dictionary to a pandas dataframe

I have a dictionary and I got stuck while trying to convert it to a pandas dataframe.
It's the result of scoring an IBM ML model. The result comes in this format, and I would like to transform this dictionary into a pandas dataframe in order to merge it later with the original dataframe that was scored.
Dictionary:
{'predictions': [{'fields': ['prediction', 'probability'], 'values': [['Creditworthy', [0.5522992460276774, 0.4477007539723226]]]}]}
I would like a pandas dataframe like this:
index   predictions  prediction  probability
0      Creditworthy    0.552299     0.447701
Assume that the source dictionary is in a variable named dct.
Start by reading the column names:
cols = dct['predictions'][0]['fields']
Then create a DataFrame in a form that can be read from this dictionary:
df = pd.DataFrame(dct['predictions'][0]['values'],
                  columns=['predictions', 'val'])
For the time being, the values are in the val column, as a list:
    predictions                                        val
0  Creditworthy  [0.5522992460276774, 0.4477007539723226]
Then break the val column into separate columns, assigning the proper column names (read earlier) at the same time:
df[cols] = pd.DataFrame(df.val.values.tolist())
And the only thing left to do is drop the val column:
df.drop(columns=['val'], inplace=True)
The result is:
    predictions  prediction  probability
0  Creditworthy    0.552299     0.447701
Just as it should be.
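For convenience, the steps above as one self-contained snippet, using the dictionary from the question:
import pandas as pd

dct = {'predictions': [{'fields': ['prediction', 'probability'],
                        'values': [['Creditworthy',
                                    [0.5522992460276774, 0.4477007539723226]]]}]}

cols = dct['predictions'][0]['fields']
df = pd.DataFrame(dct['predictions'][0]['values'],
                  columns=['predictions', 'val'])
df[cols] = pd.DataFrame(df.val.values.tolist())  # split the list into two columns
df.drop(columns=['val'], inplace=True)
print(df)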

Pandas - Appending data from one Dataframe to another

I have a DataFrame (called df) that holds the list of tickets worked on a given date. A script runs each day and generates this df, and I would like a master dataframe (say df_master) to which the values from df are appended each day, so that whenever I view df_master I can see all tickets worked across multiple days. I would also like a new column in df_master that shows the date when each row was inserted.
Given below is how df looks:
1001
1002
1003
1004
I tried to perform concat but it threw an error
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "Series"
Update
df_ticket = tickets['ticket']
df_master = df_ticket
df_master['Date'] = pd.Timestamp('now').normalize()
L = [df_master,tickets]
master_df = pd.concat(L)
master_df.to_csv('file.csv', mode='a', header=False, index=False)
I think you need to pass a sequence to concat; obviously a list is used:
objs : a sequence or mapping of Series, DataFrame, or Panel objects
If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised
L = [s1,s2]
df = pd.concat(L)
And it seems you passed only a Series, so the error was raised:
df = pd.concat(s)
To insert the Date column it is possible to set pd.Timestamp('now').normalize(); for the master df I suggest creating one file and appending each day's DataFrame to it:
df_ticket = tickets[['ticket']]
df_ticket['Date'] = pd.Timestamp('now').normalize()
df_ticket.to_csv('file.csv', mode='a', header=False, index=False)
df_master = pd.read_csv('file.csv', header=None)
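Since file.csv is written without a header, a small follow-up: the column names can be assigned when reading the master file back (the names 'ticket' and 'Date' follow from the code above):
df_master = pd.read_csv('file.csv', header=None, names=['ticket', 'Date'])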

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns, and which constantly has more rows appended to it on a weekly basis.
What I'm trying to do is find the arithmetic mean of the last column over the last 1000 rows, every week. (So when new rows are added weekly, it will just take the average of the most recent 1000 rows.)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
# How should I write the next line of code to get the average of the most recent 1000 rows?
I'm on a different machine than what my pandas is installed on so I'm going on memory, but I think what you'll want to do is...
df = pd.read_csv("fds.csv", index_col=False, header=0)
# Let's pretend your 5th column has a name (header) of `Stuff`
last_thousand = df.tail(1000)
np.mean(last_thousand.Stuff)
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
results will contain the mean for each column within the last 1000 rows. If you want more statistics, you can also use describe():
results = df.tail(1000).describe().unstack()
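If only the last column is wanted (here Results), selecting it before mean() returns a single number instead of a per-column Series:
last_mean = df.tail(1000)['Results'].mean()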
So basically I needed to use the pandas tail function. My code below works.
import numpy
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
numpy.average(df_1.tail(1000))

Python 3: Creating DataFrame with parsed data

The following data has been parsed from a stock API. The dataframe has the headers of each column in the dataset, respectively. Is there any way I can link the data to the dataframe, effectively creating a labeled data array/table?
DataFrame
df = pd.DataFrame(columns=['Date','Close','High','Low','Open','Volume'])
DataSet
20140502,36.8700,37.1200,36.2100,36.5900,22454100
20140505,36.9100,37.0500,36.3000,36.6800,13129100
20140506,36.4900,37.1700,36.4800,36.9400,19156000
20140507,34.0700,35.9900,33.6700,35.9900,66062700
20140508,33.9200,34.5700,33.6100,33.8800,30407700
20140509,33.7600,34.1000,33.4100,34.0100,20303400
20140512,34.4500,34.6000,33.8700,33.9900,22520600
20140513,34.4000,34.6900,34.1700,34.4300,12477100
20140514,34.1700,34.6500,33.9800,34.4800,17039000
20140515,33.8000,34.1900,33.4000,34.1800,18879800
20140516,33.4100,33.6600,33.1000,33.6600,18847100
20140519,33.8900,33.9900,33.2800,33.4100,14845700
20140520,33.8700,34.4700,33.6700,33.9900,18596700
20140521,34.3600,34.3900,33.8900,34.0000,13804500
20140522,34.7000,34.8600,34.2600,34.6000,17522800
20140523,35.0200,35.0800,34.5100,34.8500,16294400
20140527,35.1200,35.1300,34.7300,35.0000,13057000
20140528,34.7800,35.1700,34.4200,35.1500,16960500
20140529,34.9000,35.1000,34.6700,34.9000,9780800
20140530,34.6500,34.9300,34.1300,34.9200,13153000
20140602,34.8700,34.9500,34.2800,34.6900,9178900
20140603,34.6500,34.9700,34.5800,34.8000,6557500
20140604,34.7300,34.8300,34.2600,34.4800,9434100
I'm assuming that you are receiving the data as a list of lists. So something like -
vals = [[20140502,36.8700,37.1200,36.2100,36.5900,22454100], [20140505,36.9100,37.0500,36.3000,36.6800,13129100], ...]
In that case, you can populate your dataframe with loc -
for index, val in enumerate(vals):
    df.loc[index] = val
Which will give you -
In [6]: df
Out[6]:
       Date  Close   High    Low   Open    Volume
0  20140502  36.87  37.12  36.21  36.59  22454100
1  20140505  36.91  37.05   36.3  36.68  13129100
...
Here, enumerate gives us the index of the row, so we can use that to populate the dataframe index.
If somehow the data was saved as csv, then you can simply use read_csv -
df = pd.read_csv('data.csv', names=['Date','Close','High','Low','Open','Volume'])
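If the parsed data arrives as a multi-line string rather than a file, the same read_csv call works by wrapping the string in io.StringIO. A small sketch, where the variable raw is hypothetical and holds the lines shown above:
import io
import pandas as pd

raw = """20140502,36.8700,37.1200,36.2100,36.5900,22454100
20140505,36.9100,37.0500,36.3000,36.6800,13129100"""  # ...and so on

df = pd.read_csv(io.StringIO(raw),
                 names=['Date', 'Close', 'High', 'Low', 'Open', 'Volume'])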