pandas: is it good to have a list as an element, or better to just flatten it?

Suppose I want to analyze item-purchase records.
My analysis function expects user_id, item_ids:
def analyze(user_id, item_ids):
    ...
Is it a good idea to prepare the data as
user_id item_ids
1, [3,4,5]
vs
user_id, item_ids
1, 3
1, 4
1, 5
(With the second format, I could do a groupby and generate the format I need.)
I just find the ([1, [3,4,5]]) format harder to work with in intermediate steps than ([1,3],[1,4],[1,5]).
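A minimal sketch of the groupby route mentioned in the parenthetical, using hypothetical sample data:

import pandas as pd

# Flat layout: one row per (user_id, item_id) pair.
flat = pd.DataFrame({'user_id': [1, 1, 1, 2], 'item_id': [3, 4, 5, 7]})

# Nested layout: one row per user, with the item ids collected into a list.
nested = flat.groupby('user_id')['item_id'].apply(list).reset_index(name='item_ids')

#    user_id   item_ids
# 0        1  [3, 4, 5]
# 1        2        [7]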


Pandas - Break nested json into multiple rows

I have my DataFrame in the structure below. I would like to break it up based on the nested values within the details column:
cust_id, name, details
101, Kevin, [{"id":1001,"country":"US","state":"OH"}, {"id":1002,"country":"US","state":"GA"}]
102, Scott, [{"id":2001,"country":"US","state":"OH"}, {"id":2002,"country":"US","state":"GA"}]
Expected output
cust_id, name, id, country, state
101, Kevin, 1001, US, OH
101, Kevin, 1002, US, GA
102, Scott, 2001, US, OH
102, Scott, 2002, US, GA
df = df.explode('details').reset_index(drop=True)
df = df.merge(pd.json_normalize(df['details']), left_index=True, right_index=True).drop('details', axis=1)
df.explode("details") basically duplicates each row N times, where N is the number of items in that row's details array (if any).
Since explode duplicates the rows, the original rows' indices (0 and 1) are copied to the new rows, so their indices are 0, 0, 1, 1, which messes up later processing. reset_index() creates a fresh index starting at 0. drop=True is used because by default pandas would insert the old index as a new column; this discards it instead.
pd.json_normalize(df['details']) converts the column (where each row contains a JSON object) into a new dataframe in which each unique key across all the JSON objects becomes a column.
df.merge() merges the new dataframe into the original one.
left_index=True and right_index=True tell pandas to join on the row indices of both dataframes rather than on a column, so row i of the normalized details lines up with row i of the exploded dataframe.
.drop('details', axis=1) gets rid of the old details column containing the raw objects.
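Putting the steps together, a self-contained sketch that reproduces the answer on the sample data from the question:

import pandas as pd

df = pd.DataFrame({
    'cust_id': [101, 102],
    'name': ['Kevin', 'Scott'],
    'details': [
        [{"id": 1001, "country": "US", "state": "OH"},
         {"id": 1002, "country": "US", "state": "GA"}],
        [{"id": 2001, "country": "US", "state": "OH"},
         {"id": 2002, "country": "US", "state": "GA"}],
    ],
})

# One row per dict in details, with a fresh 0-based index.
df = df.explode('details').reset_index(drop=True)

# Turn each dict into its own columns and join back on the row index.
df = df.merge(pd.json_normalize(df['details']),
              left_index=True, right_index=True).drop('details', axis=1)

print(df)
#    cust_id   name    id country state
# 0      101  Kevin  1001      US    OH
# 1      101  Kevin  1002      US    GA
# 2      102  Scott  2001      US    OH
# 3      102  Scott  2002      US    GA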

Preparing stock data for k-means clustering with unique value in column

I have Dhaka Stock Exchange data combining 359 stocks.
I want to preprocess this for k-means clustering, but because the symbol column is not unique I can't prepare the data.
To make use of the data points for clustering, you can ignore the symbol as well as the Date; neither is required.
You can select the columns (features) by indexing with iloc[row_index, col_index]. To make the data usable for k-means clustering, you can extract the values from the dataframe with .values. This returns a numpy array, which can be used for the clustering step.
# Sample data
>>> data
        Open  High  Low  Close  Volume
Symbol
a          0     0    0      0       0
b         10     1    1      1      10
c         20     2    2      2      20
# Selecting features and extracting values
# '1:' skips the first column (Open)
>>> data.iloc[:, 1:].values
array([[ 0,  0,  0,  0],
       [ 1,  1,  1, 10],
       [ 2,  2,  2, 20]])
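If the goal is to then run the clustering itself, here is a minimal sketch with scikit-learn, assuming the same hypothetical data dataframe; n_clusters=3 is just a placeholder, and the features are standardized so Volume does not dominate the Euclidean distance:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same feature extraction as above: drop the first column, keep the values.
X = data.iloc[:, 1:].values

# Scale each feature to zero mean / unit variance so no single column
# (e.g. Volume) dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means; the cluster label of each stock is in kmeans.labels_.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)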
You'll likely want to pivot the data to have one row per ticker.
But I doubt it makes much sense to use k-means on this data. If you are serious about results, you'd need an approach that can deal with missing values and series of different lengths, and that can use the trading volume as a weighting rather than an attribute. If you just naively feed your data into k-means, you'll trivially group stocks by trading volume.
First decide your mathematical objective function. Make sure it's solving your problem. Then decide how to represent your data such that an algorithm can optimize this.

Find records having at least one element of a given array in an array column

I use PG's array type to store some integers in an Order table:
Order
id: 1
array_column: [1, 2, 3]
id: 2
array_column: [3, 4, 5]
And I'd like to have a query returning all Orders having at least one element of a given array (let's say [3]) in array_column.
So for [3], it should return both orders since they both have 3 in array_column. For [4, 5], it should only return the second order since the first one doesn't have any element in common, and for [9, 10, 49], it shouldn't return anything.
How can I achieve this with ActiveRecord? If it's not feasible, how can I do this using a plain SQL query?

PySpark Create new column from transformations in another dataframe

Looking for a more functional and computationally efficient approach in PySpark.
I have a master table (containing billions of rows); the columns of interest are:
id - (String)
tokens - (Array(String)) - e.g., ['alpha', 'beta', 'gamma']
-- (calling it dataframe df1)
I have another summary table which contains the top 25 tokens, like:
-- (calling it dataframe df2)
Ex:
Token
Alpha
Beta
Zi
Mu
Now to this second table (or dataframe), I wish to add a column containing, for each token, the list of ids from the first table, so that the result looks like:
Token Ids
Alpha [1, 2, 3]
Beta [3, 5, 6, 8, 9]
Zi [2, 8, 12]
Mu [1, 15, 16, 17]
Present Approach:
From df2, figure out the distinct tokens and store them as a list (say l1).
For every token in the list l1:
Filter df1 to extract the unique ids as a list, call it l2
Add this new list (l2) as a new column (Ids) to the dataframe (df2) to create a new dataframe (df3)
persist df3 to a table
I agree this is a terrible approach, and for any l1 with 100k records it will run forever. Can anyone help me rewrite the code (for PySpark)?
You can alternatively join the two tables on a new column which essentially contains only the tokens exploded to individual rows. That helps with computational efficiency, allocated resources and the required processing time.
Additionally, Spark ships with several out-of-the-box join strategies, including the map-side (broadcast) join, which would further help here.
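A sketch of what that could look like, assuming df1(id, tokens) and df2(Token) named as in the question; broadcasting the small 25-row token table is what triggers the map-side join (note this is an inner join, so tokens with no matching ids are dropped):

from pyspark.sql import functions as f

# One row per (id, token) pair instead of an array of tokens per id.
pairs = df1.select('id', f.explode('tokens').alias('tok'))

# df2 holds only ~25 tokens, so broadcasting it lets Spark join map-side
# instead of shuffling the billions of exploded rows.
result = (pairs
          .join(f.broadcast(df2), f.lower(pairs['tok']) == f.lower(df2['Token']))
          .groupBy(df2['Token'])
          .agg(f.collect_set('id').alias('Ids')))

result.show(truncate=False)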
Explode the tokens array column of df1, then left-join it with df2 on the lowercased tokens and token columns, then group by token and collect the ids as a set:
from pyspark.sql import functions as f
# explode the tokens column for joining with df2
df1 = df1.withColumn('tokens', f.explode('tokens'))
# left join with case-insensitive matching, collecting ids as a set for each token
df2.join(df1, f.lower(df1.tokens) == f.lower(df2.token), 'left')\
   .groupBy('token')\
   .agg(f.collect_set('id').alias('ids'))\
   .show(truncate=False)
I hope the answer is helpful

isin pandas doesn't show all values in dataframe

I am using the Amazon database for my research, where I want to select the 100 most-rated items. So first I counted the occurrences of the item IDs (asin):
data = amazon_data_parse('data/reviews_Movies_and_TV_5.json.gz')
unique, counts = np.unique(data['asin'], return_counts=True)
test = np.asarray((unique, counts)).T
test.sort(axis=1)
which gives:
array([[5, '0005019281'],
       [5, '0005119367'],
       [5, '0307141985'],
       ...,
       [1974, 'B00LG7VVPO'],
       [2110, 'B00LH9ROKM'],
       [2213, 'B00LT1JHLW']], dtype=object)
It is clear to see that at least 6,000 rows should be selected. But if I run:
a = test[49952:50054, 1]
a = a.tolist()
test2 = data[data.asin.isin(a)]
It only selects 2000 rows from the dataset. I have already tried multiple things, like filtering on only one asin, but it just doesn't seem to work. Can someone please help? If there is a better option to get a dataframe with the rows of the 100 most frequent values in the asin column, I would be glad too.
I found the solution; I had to change the sorting line to:
test = test[test[:,1].argsort()]
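For the "better option" asked about at the end, a shorter pandas route to the rows of the 100 most frequent asin values could look like this (assuming the same data dataframe):

# Count how often each asin occurs and keep the 100 most frequent ones.
top_100 = data['asin'].value_counts().nlargest(100).index

# Select every row whose asin is among those top 100.
test2 = data[data['asin'].isin(top_100)]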