Multiple calculations over a Spark DataFrame in one pass

I need to make several computations over a data frame: the min and max of column A, and the distinct values of column B. What is the most efficient way to do that? Is it possible to do it in one pass?
val df = sc.parallelize(Seq(
  (1, "John"),
  (2, "John"),
  (3, "Dave")
)).toDF("A", "B")

If by one pass you mean inside a single statement, you can do it as below (PySpark shown):
from pyspark.sql import functions as F

sparkDF = sc.parallelize([
    (1, "John"),
    (2, "John"),
    (3, "Dave")
]).toDF(["A", "B"])

sparkDF.select(F.max(F.col('A')), F.min(F.col('A')), F.countDistinct(F.col('B'))).show()
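For reference, a minimal sketch of the same one-pass aggregation written with agg, assuming the sparkDF defined above (the alias names are just illustrative):

# A single action computing all three aggregates together.
result = sparkDF.agg(
    F.max("A").alias("max_A"),
    F.min("A").alias("min_A"),
    F.countDistinct("B").alias("distinct_B"),
)
result.show()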
Additionally, providing sample data and the expected output you are looking for would make the question easier to gauge.

Why Spark uses ordered schema for dataframe?

I wondered why Spark uses an ordered schema for DataFrames rather than a name-based schema, where two schemas are considered the same if, for each column name, they have the same type.
My first question is: what is the advantage of Spark ordering the columns in the schema? Does it make some operations on DataFrames faster when we have this assumption?
My second question is whether I can tell Spark that the order of columns does not matter to me, and to consider two schemas the same if the unordered set of columns and their types is the same.
Spark DataFrames are not relational databases. The ordering saves time for certain types of processing; e.g. union, which matches columns by position and in fact keeps the column names of the DataFrame it is called on. So, it's an implementation detail.
You therefore cannot state that ordering does not matter to Spark. See the union of the below:
// Same tuples, but different column orders:
val df2 = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")

val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "talk", "animal")

// union matches by position, so df2's "animal" values end up under df3's "talk" column.
val df3 = df.union(df2)
Note that with JSON schema inference the columns come out in alphabetical order. To me that is very handy.
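For what it's worth, the same positional behaviour is easy to reproduce in PySpark; a minimal sketch, assuming a running SparkSession named spark (the variable names are just for illustration):

# union keeps the left DataFrame's column names and matches the right one purely by position.
left = spark.createDataFrame([(1, "done", "bat")], ["id", "talk", "animal"])
right = spark.createDataFrame([(2, "mouse", "mone")], ["id", "animal", "talk"])
left.union(right).show()
# right's "animal" value ("mouse") lands in the "talk" column of the result.
# On newer Spark versions, unionByName matches columns by name instead.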

Get row and column index of value in Pandas df

Currently I'm trying to automate scheduling.
I'll get the requirements as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So, I want to put the value '*' as a marker meaning the end of the table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns a (list of) index(es) (the column and row names, or the index numbers).
Is there any way I can find the index (or a list of indices) of a certain value, like a coordinate?
for example, when the data frame is like below,
  | column_1 | column_2
--+----------+---------
1 | 'a'      | 'b'
2 | 'c'      | 'd'
how can I get 'column_2' and '2' by the value 'd'? It's something like the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still, I'd be surprised if there isn't a less clunky way.
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
> [('column_2', [1])]
Note that it returns the numeric (0-based) position, not the custom (1-based) index you have. If you have a fixed offset you could just add +1 (or whatever) to the output.
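If you want the original index labels rather than positions, a minimal variant of the same idea (using the df above) is to map the positions back through df.index:

# Same comprehension, but returning index labels instead of 0-based positions.
[(col, df.index[np.where(df[col] == 'd')[0]].tolist()) for col in df.columns if (df[col] == 'd').any()]
> [('column_2', [2])]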
If I understand what you are looking for: find the (index value, column location) for a value in a dataframe. You can use a list comprehension in a loop. It probably won't be the fastest if your dataframe is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution should contain the (0-based) row and column positions.
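To turn those positions into the actual index label and column name, a small follow-up using the df_tmp above:

rows, cols = np.where(df_tmp == 'd')
coords = list(zip(df_tmp.index[rows], df_tmp.columns[cols]))
# coords -> [(2, 'column_2')]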
Hope this helps!
To search for a single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
Also works when value occurs at multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
column_1 column_2
1 a b
2 NaN d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1 column_1 a
column_2 b
2 column_2 d
Now the result's index is a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
(1, 'column_2'),
(2, 'column_2')],
)
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values is searched, the returned result does not preserve the order of the input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Had a similar need and this worked perfectly
# deals with case sensitivity concern
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# get the positional row index of the first match
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# get the positional column index of the first match
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# Do whatever you want to do e.g Replace the value above that cell
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'
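If you need every positional match rather than just the first one, a hedged alternative (same df as above) is np.argwhere over the boolean mask:

import numpy as np
# Each row of `positions` is a (row_position, column_position) pair for a matching cell.
positions = np.argwhere(df.isin(['VALUE']).to_numpy())  # use .values on older pandas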

Baffled by numpy's transpose

Let's take a very simple case: an array with shape (2,3,4), ignoring the values.
>>> a.shape
(2, 3, 4)
When we transpose it and print the dimensions:
>>> a.transpose([1,2,0]).shape
(3, 4, 2)
So I'm saying: take axis index 2 and make it the first, then take axis index 0 and make it the second and finally take axis index 1 and make it the third. I should get (4,2,3), right?
Well, I thought perhaps I don't understand the logic fully. So I read the documentation, and it says:
Use transpose(a, argsort(axes)) to invert the transposition of tensors
when using the axes keyword argument.
So I tried
>>> c = np.transpose(a, [1,2,0])
>>> c.shape
(3, 4, 2)
>>> np.transpose(a, np.argsort([1,2,0])).shape
(4, 2, 3)
and got yet a completely different shape!
Could someone please explain this? Thanks.
In [259]: a = np.zeros((2,3,4))
In [260]: idx = [1,2,0]
In [261]: a.transpose(idx).shape
Out[261]: (3, 4, 2)
What this has done is take a's axis 1 and put it first; a's axis 2 is second, and a's axis 0 third:
In [262]: np.array(a.shape)[idx]
Out[262]: array([3, 4, 2])
transpose without parameter is a complete reversal of the axis order. It's an extension of the familiar 2d transpose (rows become columns, columns become rows):
In [263]: a.transpose().shape
Out[263]: (4, 3, 2)
In [264]: a.transpose(2,1,0).shape
Out[264]: (4, 3, 2)
And the do-nothing transpose:
In [265]: a.transpose(0,1,2).shape
Out[265]: (2, 3, 4)
You have an initial axes order and a final one; describing the swap can be hard to visualize if you don't regularly work with lists of size 3 or larger.
Some people find it easier to use swapaxes, which swaps just two axes. rollaxis is yet another way.
I prefer to use transpose since it can do anything the others can; that way I only have to develop an intuition for one tool.
The argsort comment operates this way:
In [278]: a.transpose(idx).transpose(np.argsort(idx)).shape
Out[278]: (2, 3, 4)
That is, apply it to the result of one transpose to get back the original order.
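To make the permutation concrete, here is a minimal sketch of what transpose(idx) means element-wise (re-creating a with distinct values so the identity is visible):

import numpy as np
a = np.arange(24).reshape(2, 3, 4)
c = a.transpose([1, 2, 0])       # c's axis 0 is a's axis 1, axis 1 is a's axis 2, axis 2 is a's axis 0
assert c[1, 2, 0] == a[0, 1, 2]  # in general, c[j, k, i] == a[i, j, k]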
np.argsort([1,2,0]) returns the array [2, 0, 1].
So
np.transpose(a, np.argsort([1,2,0])).shape
acts like
np.transpose(a, [2,0,1]).shape
not
np.transpose(a, [1,2,0]).shape
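A quick sketch confirms the point:

import numpy as np
a = np.zeros((2, 3, 4))
np.argsort([1, 2, 0])                          # array([2, 0, 1])
np.transpose(a, np.argsort([1, 2, 0])).shape   # (4, 2, 3), i.e. the same as transposing with [2, 0, 1]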

Applying a user defined function to a PySpark dataframe and return a dictionary

Suppose I have a pandas dataframe called df
id value1 value2
1 2 1
2 2 1
3 4 5
In plain Python, I wrote a function to process this dataframe and return a dictionary:
d = dict()
for row in df.itertuples():
    x = do_something(row)
    d[x[0]] = x[1:]
I am trying to reimplement this function using Spark.
d = dict()  # define a global var

def do_something(id, value1, value2):
    # business logic
    d[x0] = [x1, x2, x3]
    return 0

udf_do = udf(do_something)
then:
df_spark.select(udf_do('id', 'value1', 'value2'))
My idea is that by calling df_spark.select, the function do_something will be called over the dataframe and will update the global variable d. I don't really care about the return value of udf_do, so I return 0.
Indeed, my solution does not work.
Could you suggest some ways to iterate through the dataframe (I know it is not the Spark way), or somehow process a Spark dataframe and update an external dictionary?
Note that the dataframe is quite large. I tried to convert it to pandas by calling toPandas(), but I ran into an OOM problem.
A UDF cannot update any global state on the driver. But you can do some business logic inside the UDF and then use toLocalIterator to get all the data to the driver in a memory-efficient way (partition by partition). For example:
df = spark.createDataFrame([(10, 'b'), (20, 'b'), (30, 'c'),
                            (40, 'c'), (50, 'c'), (60, 'a')], ['col1', 'col2'])
df = df.withColumn('udf_result', ......)
df.cache()
df.count()  # force cache fill
for row in df.toLocalIterator():
    print(row)
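A minimal sketch of how the external dictionary could then be built on the driver from that loop; do_something_pure is a hypothetical rewrite of do_something that just returns a (key, value) pair for one row:

d = dict()
for row in df.toLocalIterator():
    key, value = do_something_pure(row)  # hypothetical helper: derives (key, value) from one Row
    d[key] = value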

numpy recarray indexing based on intersection with external array

I'm trying to subset the records in a numpy.recarray based on the common values between one of the recarray's fields and an external array. For example,
a = np.array([(10, 'Bob', 145.7), (20, 'Sue', 112.3), (10, 'Jim', 130.5)],
dtype=[('id', 'i4'), ('name', 'S10'), ('weight', 'f8')])
a = a.view(np.recarray)
b = np.array([10,30])
I want to take the intersection of a.id and b to determine what records to pull from the recarray, so that I get back:
(10, 'Bob', 145.7)
(10, 'Jim', 130.5)
Naively, I tried:
common = np.intersect1d(a.id, b)
subset = a[common]
but of course that doesn't work because there is no a[10]. I also tried to do this by creating a reverse dict between the id field and the index and subsetting from there, e.g.
id_x_index = {}
ids = a.id
indexes = np.arange(a.size)
for (id, index) in zip(ids, indexes):
    id_x_index[id] = index

subset_indexes = np.sort([id_x_index[x] for x in ids if x in b])
print(a[subset_indexes])
but then I'm overwriting dict values in id_x_index when a.id has duplicates, and in this case I get
(10, 'Jim', 130.5)
(10, 'Jim', 130.5)
I know I'm overlooking some simple way to get the appropriate indices into the recarray. Thanks for the help.
The most concise way to do this in Numpy is
subset = a[np.in1d(a.id, b)]
And for those who have an older version of numpy, you can also do it this way:
subset = a[np.array([i in b for i in a.id])]
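For completeness, a quick sketch applying this to the example data from the question (np.isin is the newer spelling of np.in1d, if your NumPy has it):

import numpy as np

a = np.array([(10, 'Bob', 145.7), (20, 'Sue', 112.3), (10, 'Jim', 130.5)],
             dtype=[('id', 'i4'), ('name', 'S10'), ('weight', 'f8')]).view(np.recarray)
b = np.array([10, 30])

subset = a[np.in1d(a.id, b)]  # boolean mask keeps every row whose id appears in b
print(subset)                 # both id == 10 rows: ('Bob', 145.7) and ('Jim', 130.5)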