PySpark Create new column from transformations in another dataframe

Looking for a more functional and computationally efficient approach in PySpark.
I have a master table (containing billions of rows); the columns of interest are:
id - (String),
tokens - (Array(String)) - e.g., ['alpha', 'beta', 'gamma']
-- (Calling it dataframe df1)
I have another summary table which contains the top 25 tokens, like:
-- (Calling it dataframe df2)
Ex:
Token
Alpha
Beta
Zi
Mu
Now, to this second table (or dataframe), I wish to append a column which contains, for each token, the list of ids for that token from the first table, so that the result looks like:
Token Ids
Alpha [1, 2, 3]
Beta [3, 5, 6, 8, 9]
Zi [2, 8, 12]
Mu [1, 15, 16, 17]
Present Approach:
From df2, figure out the distinct tokens and store them as a list (say l1).
(For every token in list l1):
Filter df1 to extract the unique ids as a list, call it l2
Add this new list (l2) as a new column (Ids) to the dataframe (df2) to create a new dataframe (df3)
Persist df3 to a table
I agree this is a terrible approach, and for any l1 with 100k records it will run forever. Can anyone help me rewrite the code (for PySpark)?

You can alternatively join the two tables on a new column which essentially contains the tokens exploded into individual rows. That helps with computational efficiency, allocated resources and the required processing time.
Additionally, Spark offers several out-of-the-box join optimizations, including the map-side (broadcast) join, which would further help here.
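A rough sketch of that idea (assuming df1 has columns id and tokens and df2 has a column token, as in the question) might look like the following; broadcasting the small df2 is what turns the shuffle join into a map-side join:
from pyspark.sql import functions as f

#one row per (id, token) pair instead of one array per id
exploded = df1.select('id', f.explode('tokens').alias('token'))

#df2 holds only the top 25 tokens, so broadcasting it keeps the join map-side
result = exploded.join(f.broadcast(df2), f.lower(exploded['token']) == f.lower(df2['token']), 'inner')\
.groupBy(df2['token'])\
.agg(f.collect_set('id').alias('ids'))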

Explode the tokens array column of df1, then left-join with df2 on the lower-cased tokens and token columns, then group by token and collect the ids as a set.
from pyspark.sql import functions as f
#explode tokens column for joining with df2
df1 = df1.withColumn('tokens', f.explode('tokens'))
#left join with case insensitive and collecting ids as set for each token
df2.join(df1, f.lower(df1.tokens) == f.lower(df2.token), 'left')\
.groupBy('token')\
.agg(f.collect_set('id').alias('ids'))\
.show(truncate=False)
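If the result also needs to be persisted to a table (as in the original approach), one possible follow-up is to assign the grouped frame and write it out; the table name token_ids below is just a placeholder:
result = df2.join(df1, f.lower(df1.tokens) == f.lower(df2.token), 'left')\
.groupBy('token')\
.agg(f.collect_set('id').alias('ids'))
#write the grouped result to a (placeholder) table instead of only showing it
result.write.mode('overwrite').saveAsTable('token_ids')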
I hope the answer is helpful

Related

How do I combine multiple dataframes using a repeating index system

I have multiple dataframes that I want to combine and only want to use the indexing system of the first dataframe. The problem is the indices I want to use are repeating and I want to keep it that way.
df = pd.concat([df1, df2, df3], axis=1, join='inner')
This gives me InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Just so it's clear, df1 has repeating indices (0-9, and then it repeats again multiple times), whereas df2 and df3 are single-column dataframes and have non-repeating indices. The number of rows does match, though.
From what I understand, your index repeats itself on df1. That is what is causing the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects: since the index cycles between 0 and 9, pandas can never identify which row to align with which row, because the indexes are repeated and therefore not unique. My approach would be to just use join, but if you want to use concat for your own reasons, there is a workaround below.
A few ways to do this:
Just using the join function
df1.join([df2,df3])
But if you insist on using concat, I would do:
x = df1.index
df1 = df1.reset_index(drop=True)  # reset_index returns a new frame, so reassign it
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x  # restore the original repeating index
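A small, made-up run of that concat workaround (toy frames purely for illustration):
import pandas as pd

# df1 has a repeating index; df2 and df3 are single-column frames with unique indices
df1 = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[0, 1, 0, 1])
df2 = pd.DataFrame({'b': [10, 20, 30, 40]})
df3 = pd.DataFrame({'c': [5, 6, 7, 8]})

x = df1.index
df1 = df1.reset_index(drop=True)
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x  # back to the original repeating index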

Pandas - Break nested json into multiple rows

I have my DataFrame in the structure below. I would like to break it into multiple rows based on the nested values within the details column.
cust_id, name, details
101, Kevin, [{"id":1001,"country":"US","state":"OH"}, {"id":1002,"country":"US","state":"GA"}]
102, Scott, [{"id":2001,"country":"US","state":"OH"}, {"id":2002,"country":"US","state":"GA"}]
Expected output
cust_id, name, id, country, state
101, Kevin, 1001, US, OH
101, Kevin, 1002, US, GA
102, Scott, 2001, US, OH
102, Scott, 2002, US, GA
df = df.explode('details').reset_index(drop=True)
df = df.merge(pd.json_normalize(df['details']), left_index=True, right_index=True).drop('details', axis=1)
df.explode("details") basically duplicates each row in the details N times, where N is the number of items in the array (if any) of details of that row
Since explode duplicates the rows, the original rows' indices (0 and 1) are copied to the new rows, so their indices are 0, 0, 1, 1, which messes up later processing. reset_index() creates a fresh new column for the index, starting at 0. drop=True is used because by default pandas will keep the old index column; this removes it.
pd.json_normalize(df['details']) converts the column (where each row contains a JSON object) into a new dataframe where each unique key across all the JSON objects becomes a new column.
df.merge() merges the new dataframe into the original one
left_index=True and right_index=True tell pandas to merge the two dataframes on their row indices, so the first row of the normalized dataframe lines up with the first row of the original, and so on.
.drop('details', axis=1) gets rid of the old details column containing the old objects
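Putting the pieces together on the sample data from the question, a minimal self-contained sketch would be:
import pandas as pd

df = pd.DataFrame({
    'cust_id': [101, 102],
    'name': ['Kevin', 'Scott'],
    'details': [
        [{"id": 1001, "country": "US", "state": "OH"},
         {"id": 1002, "country": "US", "state": "GA"}],
        [{"id": 2001, "country": "US", "state": "OH"},
         {"id": 2002, "country": "US", "state": "GA"}],
    ],
})

# one row per element of each details list, with a fresh 0..n-1 index
df = df.explode('details').reset_index(drop=True)
# flatten each dict into columns, align by row index, then drop the old column
df = df.merge(pd.json_normalize(df['details']),
              left_index=True, right_index=True).drop('details', axis=1)
print(df)  # cust_id, name, id, country, state -- as in the expected output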

Looping through a dictionary of dataframes and counting a column

I am wondering if anyone can help. I have a number of dataframes stored in a dictionary. I simply want to access each of these dataframes and count the values in a column. In that column I have 10 letters; in the first dataframe there are 5 'b's and 5 'a's, so the output I would expect from the count is a = 5 and b = 5. However, for each dataframe this count will be different, hence I would like to store the output of these counts either in another dictionary or in a separate variable.
The dictionary is called Dict and the column name in all the dataframes is called letters. I have tried to do this by accessing the keys in the dictionary but cannot get it to work. A section of what I have tried is shown below.
import pandas as pd
for key in Dict:
    Count = pd.value_counts(key['letters'])
Ideally, Count here would change with each iteration so that each count output gets stored in a new variable.
A simplified example (the actual dataframes are at most 5000 x 63) of one of the 14 dataframes in the dictionary would be
d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'letters': ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'a', 'b', 'b']}
df = pd.DataFrame(data=d)
The other dataframes are named df2, df3, df4, etc.
I hope that makes sense. Any help would be much appreciated.
Thanks
If you want to access both key and values when iterating over a dictionary, you should use the items function.
You could use another dictionary to store the results:
letter_counts = {}
for key, value in Dict.items():
    letter_counts[key] = value["letters"].value_counts()
You could also use dictionary comprehension to do this in 1 line:
letter_counts = {key: value["letters"].value_counts() for key, value in Dict.items()}
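For instance, with the example frame from the question (the dictionary key df1 below is just a placeholder name), letter_counts would hold the a = 5 / b = 5 counts the question asks for:
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'letters': ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'a', 'b', 'b']}
Dict = {'df1': pd.DataFrame(data=d)}

letter_counts = {key: value['letters'].value_counts() for key, value in Dict.items()}
print(letter_counts['df1'])  # counts: a -> 5, b -> 5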
The easiest thing is probably dictionary comprehension:
d = {'col1': [1, 2,3,4,5,6,7,8,9,10], 'letters': ['a','a','a','b','b','a','b','a','b','b']}
d2 = {'col1': [1, 2,3,4,5,6,7,8,9,10,11], 'letters': ['a','a','a','b','b','a','b','a','b','b','a']}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(d2)
df_dict = {'d': df, 'd2': df2}
new_dict = {k: v['letters'].count() for k,v in df_dict.items()}
# out
{'d': 10, 'd2': 11}

Remove rows from multiple dataframe that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding the rows that contain "bad" values in a given dataframe is done by, e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, dropping these bad rows in the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (or n) dataframes' columns to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date; then dropping the "bad" values is automatic, since the entire merged row gets dropped. But what happens if a date is missing from one dataframe and not the other [and they still happen to be the same length]?
First, do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then concatenate the frames with an inner join and drop the bad rows:
newdf=pd.concat([df1,df2],axis=1,keys=[1,2], join='inner').dropna()
And split it back into two dfs; here we use combine_first with dropna of the original df:
df1,df2=[s[1].loc[:,s[0]].combine_first(x.dropna()) for x,s in zip([df1,df2],newdf.groupby(level=0,axis=1))]
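Another way to look at the same problem, not what the lines above do but sometimes simpler, is to build one shared boolean mask of good rows and apply it to every frame; the toy frames below are made up purely for illustration:
import numpy as np
import pandas as pd

# toy frames sharing the same index (values are made up)
df1 = pd.DataFrame({'a': [1.0, np.inf, 3.0, np.nan]})
df2 = pd.DataFrame({'b': [10.0, 20.0, 30.0, 40.0]})

# normalise inf to NaN, then mark rows that are clean in every frame
clean = [df.replace([np.inf, -np.inf], np.nan) for df in (df1, df2)]
good = pd.concat([df.notna().all(axis=1) for df in clean], axis=1).all(axis=1)

# apply the same mask everywhere so the frames stay row-aligned
df1, df2 = (df[good] for df in clean)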

Looking for built-in, invertible, list-of-list-accepting constructor/deconstructor pair for pandas dataframes

Are there built-in ways to construct/deconstruct a dataframe from/to a Python list-of-Python-lists?
As far as the constructor (let's call it make_df for now) that I'm looking for goes, I want to be able to write the initialization of a dataframe from literal values, including columns of arbitrary types, in an easily-readable form, like this:
df = make_df([[9.75, 1],
[6.375, 2],
[9., 3],
[0.25, 1],
[1.875, 2],
[3.75, 3],
[8.625, 1]],
['d', 'i'])
For the deconstructor, I want to essentially recover from a dataframe df the arguments one would need to pass to such make_df to re-create df.
AFAIK,
officially at least, the pandas.DataFrame constructor accepts only a numpy ndarray, a dict, or another DataFrame (and not a simple Python list-of-lists) as its first argument;
the pandas.DataFrame.values property does not preserve the original data types.
I can roll my own functions to do this (e.g., see below), but I would prefer to stick to built-in methods, if available. (The Pandas API is pretty big, and some of its names not what I would expect, so it is quite possible that I have missed one or both of these functions.)
FWIW, below is a hand-rolled version of what I described above, minimally tested. (I doubt that it would be able to handle every possible corner-case.)
import pandas as pd
import collections as co
import pandas.util.testing as pdt
def make_df(values, columns):
    return pd.DataFrame(co.OrderedDict([(columns[i],
                                         [row[i] for row in values])
                                        for i in range(len(columns))]))

def unmake_df(dataframe):
    columns = list(dataframe.columns)
    return ([[dataframe[c][i] for c in columns] for i in dataframe.index],
            columns)
values = [[9.75, 1],
[6.375, 2],
[9., 3],
[0.25, 1],
[1.875, 2],
[3.75, 3],
[8.625, 1]]
columns = ['d', 'i']
df = make_df(values, columns)
Here's what the output of the call to make_df above produced:
>>> df
d i
0 9.750 1
1 6.375 2
2 9.000 3
3 0.250 1
4 1.875 2
5 3.750 3
6 8.625 1
A simple check of the round-trip [1]:
>>> df == make_df(*unmake_df(df))
True
>>> (values, columns) == unmake_df(make_df(*(values, columns)))
True
BTW, this is an example of the loss of the original values' types:
>>> df.values
array([[ 9.75 , 1. ],
[ 6.375, 2. ],
[ 9. , 3. ],
[ 0.25 , 1. ],
[ 1.875, 2. ],
[ 3.75 , 3. ],
[ 8.625, 1. ]])
Notice how the values in the second column are no longer integers, as they were originally.
Hence,
>>> df == make_df(df.values, columns)
False
[1] In order to be able to use == to test for equality between dataframes above, I resorted to a little monkey-patching:
def pd_DataFrame___eq__(self, other):
    try:
        pdt.assert_frame_equal(self, other,
                               check_index_type=True,
                               check_column_type=True,
                               check_frame_type=True)
    except:
        return False
    else:
        return True

pd.DataFrame.__eq__ = pd_DataFrame___eq__
Without this hack, expressions of the form dataframe_0 == dataframe_1 would have evaluated to dataframe objects, not simple boolean values.
I'm not sure what documentation you are reading, because the link you give explicitly says that the default constructor accepts other list-like objects (one of which is a list of lists).
In [6]: pandas.DataFrame([['a', 1], ['b', 2]])
Out[6]:
0 1
0 a 1
1 b 2
[2 rows x 2 columns]
In [7]: t = pandas.DataFrame([['a', 1], ['b', 2]])
In [8]: t.to_dict()
Out[8]: {0: {0: 'a', 1: 'b'}, 1: {0: 1, 1: 2}}
Notice that I use to_dict at the end, rather than trying to get back the original list of lists. This is because it is an ill-posed problem to get the list arguments back (unless you make an overkill decorator or something to actually store the ordered arguments that the constructor was called with).
The reason is that a pandas DataFrame, by default, is not an ordered data structure, at least in the column dimension. You could have permuted the order of the column data at construction time, and you would get the "same" DataFrame.
Since there can be many differing notions of equality between two DataFrames (e.g. same columns including type, or just same named columns, or same columns in the same order, or just the same columns in mixed order, etc.), pandas defaults to trying to be the least specific about it (Python's principle of least astonishment).
So it would not be good design for the default or built-in constructors to choose an overly specific idea of equality for the purposes of returning the DataFrame back down to its arguments.
For that reason, using to_dict is better since the resulting keys will encode the column information, and you can choose to check for column types or ordering however you want to for your own application. You can even discard the keys by iterating the dict and simply pumping the contents into a list of lists if you really want to.
In other words, because order might not matter among the columns, the "inverse" of the list-of-list constructor maps backwards into a bigger set, namely all the permutations of the same column data. So the inverse you're looking for is not well-defined without assuming more structure -- and casual users of a DataFrame might not want or need to make those extra assumptions to get the invertibility.
As mentioned elsewhere, you should use DataFrame.equals (or pandas.util.testing.assert_frame_equal, as in the monkey-patch above) to do equality checking among DataFrames. The latter has many options that allow you to specify the specific kind of equality testing that makes sense for your application, while the defaults are a reasonably generic set of options.
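A minimal sketch of that round trip, reusing the small frame from above:
import pandas as pd

t = pd.DataFrame([['a', 1], ['b', 2]])

# rebuild the frame from its dict form and compare with the built-in equals()
rebuilt = pd.DataFrame(t.to_dict())
print(t.equals(rebuilt))  # True

# if a plain list of lists is really wanted, discard the keys
print([list(row) for row in t.itertuples(index=False)])  # [['a', 1], ['b', 2]]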