Generate a random number within a range for each row - dataframe

I am new to Spark; I am trying to generate a random number in the range 1 to 30 for each row in Spark, something like:
import scala.util.Random
df.withColumn("random_num", lit(Random.nextInt(30)))
Expected output:
+----+----------+
|col |random_num|
+----+----------+
|col1|1         |
|col2|2         |
|abc |1         |
|bcd |3         |
|xyz |6         |
|hg  |2         |
+----+----------+
How can I do that?
NOTE: I have 7 billion rows to process, so I want to assign a random number to each row, so that I can later use that column to repartition as needed.
monotonically_increasing_id is not applicable in my case, as it generates IDs that are not within a bounded range and are distributed very unevenly. I want to assign numbers to the 7B rows in a way that is roughly even, so that when I use that column for repartitioning I avoid any skew.

You can use Spark's rand function rather than a JVM-side Random: wrapped in lit(), Random.nextInt(30) is evaluated once on the driver, so every row gets the same value. For example, if this were your data:
+----+
|col |
+----+
|col1|
|col2|
|abc |
|bcd |
|xyz |
|hg |
+----+
Then, this statement:
data = data.withColumn("random_num", round(rand().multiply(30)))
generates this output:
+----+----------+
|col |random_num|
+----+----------+
|col1|24.0      |
|col2|16.0      |
|abc |20.0      |
|bcd |26.0      |
|xyz |8.0       |
|hg  |22.0      |
+----+----------+
Good luck!
EDIT
If your data is skewed, assigning a unique ID per row and partitioning by it can crash your application; you can consider adding a salt instead, or read more about it here.
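As a side note on the range: round(rand() * 30) yields doubles from 0 to 30, with the two endpoint values only half as likely as the others. If you want uniform integers in 1 to 30 to use as repartition buckets (effectively a salt column), a minimal Scala sketch could look like this (df and the bucket count of 30 are taken from the question):

import org.apache.spark.sql.functions.{col, floor, rand}

// floor(rand() * 30) is uniform over 0..29; + 1 shifts it to 1..30.
// Pass a seed, e.g. rand(42), if you need reproducible buckets.
val withBucket = df.withColumn("random_num", (floor(rand() * 30) + 1).cast("int"))

// 30 roughly equal partitions, one per bucket value.
val repartitioned = withBucket.repartition(30, col("random_num"))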

Related

Dropping duplicates in an Apache Spark DataFrame and keeping rows with values that have not been dropped already?

Let's say I have a DataFrame as the following:
+-------+-------+
|column1|column2|
+-------+-------+
|   1   |   A   |
|   1   |   B   |
|   2   |   A   |
|   2   |   B   |
|   3   |   B   |
+-------+-------+
I want to be able to find the pairs where each unique element from column1 and each unique element from column2 appears in exactly one pair. Therefore, I would hope the outcome would be:
+-------+-------+
|column1|column2|
+-------+-------+
|   1   |   A   |
|   2   |   B   |
+-------+-------+
Notice that the pair (2, A) was removed because A was already paired up with 1. Also 3 was removed because B was already paired up with 2.
Is there a way to do this with Spark?
So far the only solution I have come up with is to run .collect(), then iterate over the rows, adding the values from column1 and column2 to a set each; when I meet a row where either value is already in its set, I remove that row.
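Roughly, that greedy pass looks like the sketch below (column types assumed from the example; .collect() only works while the data fits in driver memory):

val seen1 = scala.collection.mutable.Set.empty[Int]
val seen2 = scala.collection.mutable.Set.empty[String]
val kept = df.collect().filter { row =>
  val a = row.getInt(0)     // column1
  val b = row.getString(1)  // column2
  val isNew = !seen1(a) && !seen2(b)
  if (isNew) { seen1 += a; seen2 += b }
  isNew
}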
Thanks for reading.

Let pandas use 0-based row number as index when reading Excel files

I am trying to use pandas to process a series of XLS files. The code I am currently using looks like:
with pandas.ExcelFile(data_file) as xls:
    data_frame = pandas.read_excel(xls, header=[0, 1], skiprows=2, index_col=None)
And the format of the XLS file looks like
+---------------------------------------------------------------------------+
|                                   REPORT                                  |
+---------------------------------------------------------------------------+
| Unit: 1000000 USD                                                         |
+---------------------------------------------------------------------------+
|        |         |             |               |          Balance         |
+   ID   + Branch  + Customer ID + Customer Name +--------------------------+
|        |         |             |               | Daily | Monthly | Yearly |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 |           1 | Company A     |    10 |       5 |      2 |
+--------+---------+-------------+---------------+-------+---------+--------+
| 222222 | Branch2 |           2 | Company B     |    20 |      25 |     20 |
+--------+---------+-------------+---------------+-------+---------+--------+
| 111111 | Branch1 |           3 | Company C     |    30 |      35 |     40 |
+--------+---------+-------------+---------------+-------+---------+--------+
Even though I explicitly pass index_col=None, pandas still takes the ID column as the index. I am wondering what the right way is to make the 0-based row numbers the index.
pandas currently doesn't support parsing MultiIndex columns without also parsing a row index. There is a related issue here; it probably could be supported, but it gets tricky to define in a non-ambiguous way.
It's a hack, but the easiest way to work around this right now is to add a blank column on the left side of the data, then read it in like this:
pd.read_excel('file.xlsx', header=[0,1], skiprows=2).reset_index(drop=True)
Edit:
If you can't / don't want to modify the files, a couple options are:
If the data has a known/common header, use pd.read_excel(..., skiprows=4, header=None) and assign the columns yourself, as suggested by @ayhan.
If you need to parse the header, use pd.read_excel(..., skiprows=2, header=0), then munge the second level of labels into a MultiIndex, as sketched below. This will probably mess up the dtypes, so you may also need to do some typecasting with pd.to_numeric.
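A rough sketch of that second option, assuming the headers look like the REPORT sample above (the file name and the 'Balance' label are illustrative):

import pandas as pd

# Read with a single header row; the second-level labels (Daily/Monthly/Yearly)
# then show up as the first data row, with NaN under the unmerged columns.
df = pd.read_excel('file.xlsx', skiprows=2, header=0)

# Rebuild the two header levels: pandas names the cells under a merged header
# 'Unnamed: ...', so blank those out and forward-fill the top level.
top = pd.Series(df.columns.astype(str))
top = top.mask(top.str.startswith('Unnamed')).ffill()
sub = df.iloc[0].fillna('')

df = df.iloc[1:].reset_index(drop=True)   # 0-based row numbers as the index
df.columns = pd.MultiIndex.from_arrays([top, sub])

# The label row forced object dtype, so re-coerce the numeric columns.
bal = [c for c in df.columns if c[0] == 'Balance']
df[bal] = df[bal].apply(pd.to_numeric)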

Implicit Flattening in BigQuery

When does BigQuery flatten an intermediate result set? I was under the impression that it was only when FLATTEN was invoked, but I've encountered an example where the result is flattened without a FLATTEN.
This is the case. This base query returns one record:
select count(*) from publicdata:samples.trigrams
where ngram = 'der Griindung im'
+-----+
| f0_ |
+-----+
|   1 |
+-----+
When queried, you can see that the record has a repeated field with two values.
select * from publicdata:samples.trigrams
where ngram = 'der Griindung im'
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
| ngram | first | second | third | fourth | fifth | cell_value | cell_volume_count | cell_volume_fraction | cell_page_count | cell_match_count | cell_sample_id | cell_sample_text | cell_sample_title | cell_sample_subtitle | cell_sample_authors | cell_sample_url |
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
| der Griindung im | der   | Griindung | im    | NULL   | NULL  | 2007       | 54                | 0.008746355685131196 | 54              | 54               | NULL           | NULL             | NULL              | NULL                 | NULL                | NULL            |
| der Griindung im | der   | Griindung | im    | NULL   | NULL  | 2008       | 47                | 0.007612568837058633 | 47              | 47               | NULL           | NULL             | NULL              | NULL                 | NULL                | NULL            |
+------------------+-------+-----------+-------+--------+-------+------------+-------------------+----------------------+-----------------+------------------+----------------+------------------+-------------------+----------------------+---------------------+-----------------+
When I add a filter on cell.value, I get two records instead of one - but I never flattened so I'm not sure about the behavior here. My expectation is that this would return the same output as the previous COUNT above. It doesn't:
select count(*) from publicdata:samples.trigrams
where ngram = 'der Griindung im' and cell.value in ('2007', '2008')
+-----+
| f0_ |
+-----+
|   2 |
+-----+
What this means is that while I expect select * from publicdata:samples.trigrams where ngram = 'der Griindung im' and select * from publicdata:samples.trigrams where ngram = 'der Griindung im' and cell.value in ('2007', '2008') to return the same output, they don't, because one is implicitly flattened and the other is not. While this may not seem like a huge issue, it could matter significantly as part of a nested query that expects an intermediate result to be flattened or repeated.
Under what conditions does BigQuery flatten results without an explicit FLATTEN?
Let me first answer how to get the correct count in this case.
So instead of
SELECT COUNT(*)
FROM [publicdata:samples.trigrams]
WHERE ngram = 'der Griindung im'
AND cell.value IN ('2007', '2008')
with result of
+-----+
| f0_ |
+-----+
|   2 |
+-----+
you should do
SELECT COUNT(*)
FROM [publicdata:samples.trigrams]
WHERE ngram = 'der Griindung im'
OMIT RECORD IF EVERY(cell.value NOT IN ('2007', '2008'))
with result of
+-----+
| f0_ |
+-----+
|   1 |
+-----+
which is, I think, what you expected. OMIT RECORD IF EVERY(...) drops a record only when the condition holds for every value of the repeated field, so records are counted whole rather than once per repeated value.
Secondly: under what conditions does BigQuery flatten results without an explicit FLATTEN?
I think (just my guess, based on observing BigQuery's behavior) that every time you explicitly reference a record's field within clauses like SELECT or WHERE, it gets automatically flattened for you. Using the FLATTEN operator helps "control" this process.
Short story: use count(0) instead of count(*). (You get 1 instead of 2.)
count(*) behaves strangely with repeated fields. It looks like the results are flattened, but if that were really the case, it should also affect count(0). I've asked about this here, but so far I haven't received a full explanation.
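Spelled out against the same filter as the queries above, that would be:

SELECT COUNT(0)
FROM [publicdata:samples.trigrams]
WHERE ngram = 'der Griindung im'
  AND cell.value IN ('2007', '2008')

which returns 1 rather than 2.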

Primary key auto-increment manipulation

Is there any way to have a primary key with a feature that increments it but fills in gaps? Assuming I have the following table:
____________________
| ID | Value       |
| 1  | A           |
| 2  | B           |
| 3  | C           |
^^^^^^^^^^^^^^^^^^^^
Note that the values are only examples; their order has nothing to do with the question.
Once I remove the row with the ID of 2 (the table will look like this):
____________________
| ID | Value       |
| 1  | A           |
| 3  | C           |
^^^^^^^^^^^^^^^^^^^^
And I add another row, with regular auto-increment feature it will look like this:
____________________
| ID | Value       |
| 1  | A           |
| 3  | C           |
| 4  | D           |
^^^^^^^^^^^^^^^^^^^^
As expected.
The output I'd want would be:
____________________
| ID | Value       |
| 1  | A           |
| 2  | D           |
| 3  | C           |
^^^^^^^^^^^^^^^^^^^^
Where the gap is filled by the new row. Note that the in-memory layout might look different; the point is that the primary key fills the gaps.
When the existing primary keys are (for instance) 1, 2, 3, 6, 7, 10, 11, then 4 should be filled in first, then 5, then 8, and so on. When the table is empty (even if it held a million rows before), it should start over from 1.
How do I accomplish that? Is there any built-in feature similar to that? Can I implement it?
EDIT: If it's not possible, why not?
No, you don't want to do that, as juergen-d said. It's unlikely to do what you think it is doing, and even less so in a multi-user environment.
In a multi-user environment you are likely to get gaps even when there are no deletes, just from aborted inserts.
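For instance (PostgreSQL syntax, purely illustrative), a rolled-back insert still consumes the auto-increment value, so a gap appears without any DELETE:

CREATE TABLE t (id SERIAL PRIMARY KEY, value TEXT);
BEGIN;
INSERT INTO t (value) VALUES ('A');  -- consumes id = 1
ROLLBACK;                            -- the row is gone, but the sequence does not rewind
INSERT INTO t (value) VALUES ('B');  -- gets id = 2, leaving a permanent gap at id = 1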

LINQ OrderBy. Does it always return the same ordered list?

I was trying out a simple OrderBy statement.
The target data to order is something like below:
[
{"id":40, "description":"aaa", "rate":1},
{"id":1, "description":"bbb", "rate":1},
{"id":4, "description":"ccc", "rate":2},
{"id":19, "description":"aaa", "rate":1}
]
Then I order items by the rate property.
The odd thing appears when I 'order' them, then 'skip' a given offset and 'take' only a portion of the data.
For example,
var result = items.OrderBy(i => i.rate)
                  .Skip(2)
                  .Take(2);
The result looks fine for the most part, but the 'edge case' item is not returned at all.
For example,
if the first result came back as
[{"id":40, "description":"aaa", "rate":1}, {"id":1, "description":"bbb", "rate":1}]
the second result comes back like
[{"id":1, "description":"bbb", "rate":1}, {"id":4, "description":"ccc", "rate":2}]
Item "id: 19" has not been returned with the second query call. Instead item "id: 1" has returned twice.
My guess is that the SQL ORDER BY doesn't produce the same ordered list every single time: it orders by the given property, but the exact order within a group that shares the same property value can change.
What is the exact mechanism under the hood?
Short answer: LINQ to Objects uses a stable sort algorithm, so we can say it is deterministic, while LINQ to SQL depends on the database's implementation of ORDER BY, which is usually nondeterministic.
A deterministic sort algorithm is one that always behaves the same way across runs.
In your example, you have duplicates in your OrderBy clause. For a guaranteed, predictable sort, one of the order clauses or the combination of order clauses must be unique.
In LINQ, you can achieve this by adding a secondary ordering on your unique property, as in
items.OrderBy(i => i.Rate).ThenBy(i => i.ID).
Long answer:
LINQ to Objects uses a stable sort, as documented in this link: MSDN.
In LINQ to SQL, it depends on the sort algorithm of the underlying database, and it is usually an unstable sort, as in MS SQL Server (MSDN).
In a stable sort, if the keys of two elements are equal, the order of the elements is preserved. In contrast, an unstable sort does not preserve the order of elements that have the same key.
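For instance, here is a quick LINQ to Objects check built from the question's data (anonymous types used for brevity; a sketch, not your exact code):

using System;
using System.Linq;

class StableSortDemo
{
    static void Main()
    {
        var items = new[]
        {
            new { id = 40, description = "aaa", rate = 1 },
            new { id = 1,  description = "bbb", rate = 1 },
            new { id = 4,  description = "ccc", rate = 2 },
            new { id = 19, description = "aaa", rate = 1 },
        };

        // OrderBy is stable in LINQ to Objects: the three rate == 1 items
        // keep their original relative order (40, 1, 19) on every run.
        foreach (var item in items.OrderBy(x => x.rate))
            Console.WriteLine(item.id);   // prints 40, 1, 19, 4
    }
}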
So, for LINQ to SQL, the sorting is usually nondeterministic because the RDBMS (Relational Database Management System, e.g. MS SQL Server) may use an unstable sort algorithm with random pivot selection, or the randomness can come from which row the database happens to access first in the file system.
For example, imagine that the size of a page in the file system can hold up to 4 rows.
The page will be full if you insert the following data:
Page 1
| Name | Value |
|------|-------|
| A    | 1     |
| B    | 2     |
| C    | 3     |
| D    | 4     |
If you need to insert a new row, the RDBMS has two options:
Create a new page to allocate the new row.
Split the current page into two pages, so that the first page holds the Names A and B and the second page holds C and D.
Suppose that the RDBMS chooses option 1 (to reduce index fragmentation). If you insert a new row with Name C and Value 9, you will get:
Page 1                    Page 2
| Name | Value |          | Name | Value |
|------|-------|          |------|-------|
| A    | 1     |          | C    | 9     |
| B    | 2     |          |      |       |
| C    | 3     |          |      |       |
| D    | 4     |          |      |       |
Probably, the OrderBy clause in column Name will return the following:
| Name | Value |
|------|-------|
| A    | 1     |
| B    | 2     |
| C    | 3     |
| C    | 9     | -- Value 9 appears after because it was on another page
| D    | 4     |
Now, suppose that the RDBMS chooses option 2 (to increase insert performance on a storage system with many spindles). If you insert a new row with Name C and Value 9, you will get:
Page 1                    Page 2
| Name | Value |          | Name | Value |
|------|-------|          |------|-------|
| A    | 1     |          | C    | 3     |
| B    | 2     |          | D    | 4     |
| C    | 9     |          |      |       |
|      |       |          |      |       |
Probably, the OrderBy clause in column Name will return the following:
| Name | Value |
|------|-------|
| A    | 1     |
| B    | 2     |
| C    | 9     | -- Value 9 appears before because it was on the first page
| C    | 3     |
| D    | 4     |
Regarding your example:
I believe you have mistyped something in your question, because you used items.OrderBy(i => i.rate).Skip(2).Take(2); and the first result does not show a row with Rate = 2. That is not possible: Skip ignores the first two rows, which have Rate = 1, so your output must include the row with Rate = 2.
You've tagged your question with database, so I believe that you are using LINQ to SQL. In this case, results can be nondeterministic and you could get the following:
Result 1:
[{"id":40, "description":"aaa", "rate":1},
{"id":4, "description":"ccc", "rate":2}]
Result 2:
[{"id":1, "description":"bbb", "rate":1},
{"id":4, "description":"ccc", "rate":2}]
If you had used items.OrderBy(i => i.rate).ThenBy(i => i.ID).Skip(2).Take(2); then the only possible result would be:
[{"id":40, "description":"aaa", "rate":1},
{"id":4, "description":"ccc", "rate":2}]