Why does Spark use an ordered schema for DataFrames?

I wondered why Spark uses an ordered schema for DataFrames rather than a name-based schema, where two schemas are considered equal if every column name maps to the same type in both.
My first question is: what is the advantage of ordering the columns in the schema? Does it make some operations on a DataFrame faster when this assumption holds?
My second question is whether I can tell Spark that the order of columns does not matter to me, and to consider two schemas equal if the unordered set of columns and their types is the same.

Spark DataFrames are not relational databases. A positional schema saves time for certain types of processing, e.g. union, which resolves columns by position rather than by name (the result takes its column names from the first DataFrame). So it is largely an implementation detail.
You therefore cannot state that ordering does not matter to Spark. See the union of the two DataFrames below, which have the same column names in a different order:
val df2 = Seq(
(1, "bat", "done"),
(2, "mouse", "mone"),
(3, "horse", "gun"),
(4, "horse", "some")
).toDF("id", "animal", "talk")
val df = Seq(
(1, "bat", "done"),
(2, "mouse", "mone"),
(3, "horse", "gun"),
(4, "horse", "some")
).toDF("id", "talk", "animal")
val df3 = df.union(df2)
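To see the effect (a quick check, not output from the original post):
// df3 keeps df's column names (id, talk, animal); df2's rows are
// appended purely by position, so its animal values land under talk
df3.printSchema()
df3.show()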
Note that with JSON schema inference the columns come out in alphabetical order, which I find very handy.
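On the second question, one hedged pointer: newer Spark versions (2.3+) provide unionByName, which resolves columns by name instead of position, so the differing column order above stops mattering for that particular operation:
// Name-based union (available since Spark 2.3)
val df4 = df.unionByName(df2)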

Related

Multiple calculations over spark dataframe in one pass

I need to make several computations over a data frame: min and max over column A, and distinct values over column B. What is the most efficient way to do that? Is it possible to do it in one pass?
val df = sc.parallelize(Seq(
  (1, "John"),
  (2, "John"),
  (3, "Dave")
)).toDF("A", "B")
If by one pass you mean inside a single statement, you can do it as below (PySpark):
from pyspark.sql import functions as F

sparkDF = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Dave")], ["A", "B"])
sparkDF.select(
    F.max("A").alias("max_A"),
    F.min("A").alias("min_A"),
    F.countDistinct("B").alias("distinct_B")
).show()
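Equivalently (just a sketch of the same single-pass idea, not part of the original answer), you can call agg directly on the DataFrame:
sparkDF.agg(
    F.max("A").alias("max_A"),
    F.min("A").alias("min_A"),
    F.countDistinct("B").alias("distinct_B")
).show()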
Additionally, you can provide sample data and the expected output you are looking for, so the question is easier to gauge.

Splitting Dataframe with hierarchical index [duplicate]

This question already has answers here:
Splitting dataframe into multiple dataframes
(13 answers)
Closed 3 years ago.
I have a large dataframe with hierarchical indexing (a simplified example is provided in the code below). I would like to set up a loop/automated way of splitting the dataframe into subsets, one per unique index value (i.e. dfa, dfb, dfc etc. in the example below), and store them in a list.
I have tried the following, but unfortunately with no success. Any help appreciated!
import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b',
    'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 1, 2, 2, 3]])
split = []
for value in data.index.unique():
    split.append(data[data.index == value])
I am not exactly sure if this is what you are looking for, but have you checked pandas' groupby function? The crucial part is that you can apply it across a MultiIndex, specifying which level of indexing (or which subset of levels) to group by, e.g.:
split = {}
for value, split_group in data.groupby(level=0):
    split[value] = split_group
print(split)
As @jezrael points out, a simpler way to do it is:
dict(tuple(data.groupby(level=0)))
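Since the question asks for the subsets in a list, a minimal variant of the same idea (a sketch, reusing the data Series from above):
# one entry per unique outer index value ('a', 'b', 'c', 'd')
split_list = [group for _, group in data.groupby(level=0)]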

PySpark Create new column from transformations in another dataframe

Looking for a more functional and computationally efficient approach in PySpark.
I have a master table (containing billions of rows); the columns of interest are:
id - (String),
tokens - (Array(String)) - e.g. ['alpha', 'beta', 'gamma']
-- (calling this dataframe df1)
I have another summary table which contains the top 25 tokens, like:
-- (calling this dataframe df2)
Ex:
Token
Alpha
Beta
Zi
Mu
Now, to this second table (or dataframe), I wish to append a column which contains, for each token, the list of ids from the first table, so that the result looks like:
Token Ids
Alpha [1, 2, 3]
Beta [3, 5, 6, 8, 9]
Zi [2, 8, 12]
Mu [1, 15, 16, 17]
Present approach:
From df2, figure out the distinct tokens and store them as a list (say l1).
(For every token in list l1):
Filter df1 to extract the unique ids as a list, call it l2
Add this new list (l2) as a new column (Ids) to the dataframe (df2) to create a new dataframe (df3)
Persist df3 to a table
I agree this is a terrible approach, and for any l1 with 100k records it will run forever. Can anyone help me rewrite the code (for PySpark)?
You can alternatively try joining the two tables on a new column which essentially contains only the tokens, exploded into individual rows. That helps with computational efficiency, allocated resources and the required processing time.
Additionally, there are several out-of-the-box join optimizations, including the 'map-side join' (broadcast join), which would further help your cause.
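A rough sketch of that suggestion (an illustration under assumptions, not the answerer's code: it assumes df2 is small enough to broadcast and that its token column is called token, as in the answer below):
from pyspark.sql import functions as F

# one row per (id, token), then broadcast the small df2 to get a map-side join
exploded = df1.select('id', F.explode('tokens').alias('tok'))
result = (exploded
          .join(F.broadcast(df2), F.lower(F.col('tok')) == F.lower(F.col('token')))
          .groupBy('token')
          .agg(F.collect_set('id').alias('Ids')))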
Explode the tokens array column of df1, then join it with df2 (left join) on the lower-cased tokens and token columns, then groupBy token and collect the ids as a set:
from pyspark.sql import functions as f

# explode the tokens column for joining with df2
df1 = df1.withColumn('tokens', f.explode('tokens'))

# left join (case-insensitive) and collect the ids as a set for each token
df2.join(df1, f.lower(df1.tokens) == f.lower(df2.token), 'left')\
    .groupBy('token')\
    .agg(f.collect_set('id').alias('ids'))\
    .show(truncate=False)
I hope the answer is helpful.

Spark dataframes groupby into list

I am trying to do some analysis on sets. I have a sample data set that looks like this:
orders.json
{"items":[1,2,3,4,5]}
{"items":[1,2,5]}
{"items":[1,3,5]}
{"items":[3,4,5]}
All it is, is a single field that is a list of numbers that represent IDs.
Here is the Spark script I am trying to run:
val sparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("Dataframe Test")
val sc = new SparkContext(sparkConf)
val sql = new SQLContext(sc)
val dataframe = sql.read.json("orders.json")
val expanded = dataframe
.explode[::[Long], Long]("items", "item1")(row => row)
.explode[::[Long], Long]("items", "item2")(row => row)
val grouped = expanded
.where(expanded("item1") !== expanded("item2"))
.groupBy("item1", "item2")
.count()
val recs = grouped
.groupBy("item1")
Creating expanded and grouped is fine; in a nutshell, expanded is a list of all the possible pairs of IDs that appeared together in the same original set, and grouped filters out IDs that were matched with themselves, then groups all the unique pairs of IDs and produces a count for each. The schema and a data sample of grouped are:
root
|-- item1: long (nullable = true)
|-- item2: long (nullable = true)
|-- count: long (nullable = false)
[1,2,2]
[1,3,2]
[1,4,1]
[1,5,3]
[2,1,2]
[2,3,1]
[2,4,1]
[2,5,2]
...
So, my question is: how do I now group on the first item in each result so that I have a list of tuples? For the example data above, I would expect something similar to this:
[1, [(2, 2), (3, 2), (4, 1), (5, 3)]]
[2, [(1, 2), (3, 1), (4, 1), (5, 2)]]
As you can see in my script with recs, I thought you would start by doing a groupBy on 'item1', which is the first item in each row. But after that you are left with a GroupedData object that has very limited operations available on it; really, you are only left with aggregations like sum, avg, etc. I just want to list the tuples from each result.
I could easily use RDD functions at this point, but that departs from using DataFrames. Is there a way to do this with the DataFrame functions?
You can build that with org.apache.spark.sql.functions (collect_list and struct), available since 1.6:
import org.apache.spark.sql.functions.{collect_list, struct}

val recs = grouped.groupBy('item1).agg(collect_list(struct('item2, 'count)).as("set"))
+-----+----------------------------+
|item1|set |
+-----+----------------------------+
|1 |[[5,3], [4,1], [3,2], [2,2]]|
|2 |[[4,1], [1,2], [5,2], [3,1]]|
+-----+----------------------------+
You can also use collect_set.
Edit: for information, tuples don't exist in DataFrames. The closest structure is a struct, since structs are the equivalent of case classes in the untyped Dataset API.
Edit 2: also be warned that collect_set comes with a caveat: the result is actually not a set (there is no data type with set properties in the SQL types). That means you can end up with distinct "sets" which differ only by their order (in version 2.1.0 at least). Sorting them with sort_array is then necessary.
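A quick sketch of that last point (untested, reusing grouped from the question):
import org.apache.spark.sql.functions.{collect_set, sort_array, struct}

// sort_array gives the collected "set" a deterministic order
val recsSorted = grouped
  .groupBy('item1)
  .agg(sort_array(collect_set(struct('item2, 'count))).as("set"))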

numpy recarray indexing based on intersection with external array

I'm trying to subset the records in a numpy.recarray based on the common values between one of the recarray's fields and an external array. For example,
import numpy as np

a = np.array([(10, 'Bob', 145.7), (20, 'Sue', 112.3), (10, 'Jim', 130.5)],
             dtype=[('id', 'i4'), ('name', 'S10'), ('weight', 'f8')])
a = a.view(np.recarray)
b = np.array([10, 30])
I want to take the intersection of a.id and b to determine what records to pull from the recarray, so that I get back:
(10, 'Bob', 145.7)
(10, 'Jim', 130.5)
Naively, I tried:
common = np.intersect1d(a.id, b)
subset = a[common]
but of course that doesn't work because there is no a[10]. I also tried to do this by creating a reverse dict between the id field and the index and subsetting from there, e.g.
id_x_index = {}
ids = a.id
indexes = np.arange(a.size)
for (id, index) in zip(ids, indexes):
    id_x_index[id] = index
subset_indexes = np.sort([id_x_index[x] for x in ids if x in b])
print(a[subset_indexes])
but then I'm overwriting dict values in id_x_index if a.id has duplicates, and in this case I get
(10, 'Jim', 130.5)
(10, 'Jim', 130.5)
I know I'm overlooking some simple way to get the appropriate indices into the recarray. Thanks for the help.
The most concise way to do this in NumPy is:
subset = a[np.in1d(a.id, b)]
And for those who have an older version of NumPy, you can also do it this way:
subset = a[np.array([i in b for i in a.id])]
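On newer NumPy versions (1.13+), np.isin is the recommended equivalent of in1d and should work the same way here (a side note, not part of the original answer):
subset = a[np.isin(a.id, b)]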