Dropping duplicates in Apache Spark DataFrame and keep row with value that has not been dropped already?

Let's say I have a DataFrame as the following:
+-------+-------+
|column1|column2|
+-------+-------+
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 3 | B |
+-------+-------+
I want to find pairs such that each unique element of column1 and each unique element of column2 appears in
exactly one pair. The outcome I am hoping for is:
+-------+-------+
|column1|column2|
+-------+-------+
| 1 | A |
| 2 | B |
+-------+-------+
Notice that the pair (1, B) was removed because 1 was already paired up with A, and the pair (2, A) was removed because A was already paired up with 1. Also, 3 was removed because B was already paired up with 2.
Is there a way to do this with Spark?
So far the only solution I have come up with is to run .collect(), then iterate over the rows while adding each value from column1 and column2 to a set. When I reach a row in which either value is already in the set, I remove that row.
Thanks for reading.
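
The collect-based approach described in the question can be sketched in plain Python as follows. This is only a minimal illustration, not a Spark-native solution: the function name is hypothetical, the rows are assumed to be already collected on the driver as (column1, column2) tuples, and two separate sets are used on the assumption that a value in column1 should never clash with an equal value in column2. The result depends on the order in which rows are visited.

```python
def greedy_unique_pairs(rows):
    """Keep a row only if neither of its values has been kept before."""
    seen_first = set()   # column1 values that are already paired
    seen_second = set()  # column2 values that are already paired
    kept = []
    for first, second in rows:
        if first in seen_first or second in seen_second:
            continue  # one side is already used, so drop this row
        seen_first.add(first)
        seen_second.add(second)
        kept.append((first, second))
    return kept


# Rows from the example table, in the order shown (hypothetical driver-side copy).
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "B"), (3, "B")]
print(greedy_unique_pairs(rows))
```

With a real DataFrame the list would presumably come from something like `df.collect()`, which only works when the data fits in driver memory.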

Related

PySpark - drop rows with duplicate values with no column order

I have a PySpark DataFrame with two columns like so:
+-------------------+--------------------------+
| right | left |
+-------------------+--------------------------+
| 1 | 2 |
| 2 | 3 |
| 2 | 1 |
| 3 | 2 |
| 1 | 1 |
+-------------------+--------------------------+
I want to drop duplicates but with no respect to the order of the columns.
For example, a row that contains (1,2) and a row that contains (2,1) are duplicates.
The resultant Dataframe would look like this:
+-------------------+--------------------------+
| right | left |
+-------------------+--------------------------+
| 1 | 2 |
| 2 | 3 |
| 1 | 1 |
+-------------------+--------------------------+
The regular drop_duplicates method doesn't work in this case. Does anyone have an idea how to do this cleanly and efficiently?

from pyspark.sql.functions import array, array_sort, col

(df1
 .withColumn('x', array_sort(array(col('left'), col('right'))))  # order-independent key: sorted [left, right]
 .dropDuplicates(['x'])  # keep one row per key
 .drop('x')  # remove the helper column
).show()
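
The idea behind the answer above (build a key that ignores column order, then deduplicate on it) can be checked without a Spark session. The sketch below is a plain-Python analogue, not the answer's code; the function name is hypothetical, and it keeps the first occurrence of each unordered pair, whereas Spark's dropDuplicates, as far as I know, does not promise which of the duplicate rows survives.

```python
def drop_unordered_duplicates(pairs):
    """Drop rows that repeat an earlier row when column order is ignored."""
    seen = set()
    kept = []
    for right, left in pairs:
        key = tuple(sorted((right, left)))  # (1, 2) and (2, 1) share one key
        if key in seen:
            continue
        seen.add(key)
        kept.append((right, left))
    return kept


# Rows from the question, as (right, left) tuples.
pairs = [(1, 2), (2, 3), (2, 1), (3, 2), (1, 1)]
print(drop_unordered_duplicates(pairs))
```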

How to see what rows are missing between two select statements in SQLite?

I have a single table view that has a group column and a data column (among other columns). In a particular group, there should be n rows of the same set of text in the same order. However, I'm finding that in some groups, some rows are missing. I'd like to query the view so that I can see what rows are missing.
Concrete example:
+--------+-------+
| Group | Data |
+--------+-------+
| 1 | row 1 |
| 1 | row 2 |
| 1 | row 3 |
| 2 | row 1 |
| 2 | row 3 |
+--------+-------+
Group 2 has "row 2" missing, and I'd like that output. Something like:
+-------+
| Data |
+-------+
| row 2 |
+-------+
Is this possible?

Count how many times each Data value occurs, then keep the values whose count is less than the number of distinct groups.
You can achieve this with the query below (the table is named tab and the group column Grp):
SELECT Data, COUNT(*)
FROM tab
GROUP BY Data
HAVING COUNT(*) < (SELECT COUNT(DISTINCT Grp) FROM tab);
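
To try the query without a database file, it can be run against an in-memory SQLite database from Python's standard sqlite3 module. This is a small sketch under assumptions: the table name tab and column name Grp follow the answer's query, and the column types are guesses since the question does not state them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are an assumption; the question only shows the values.
conn.execute("CREATE TABLE tab (Grp INTEGER, Data TEXT)")
conn.executemany(
    "INSERT INTO tab (Grp, Data) VALUES (?, ?)",
    [(1, "row 1"), (1, "row 2"), (1, "row 3"), (2, "row 1"), (2, "row 3")],
)

# Data values that occur in fewer groups than exist overall.
missing = conn.execute(
    """
    SELECT Data, COUNT(*)
    FROM tab
    GROUP BY Data
    HAVING COUNT(*) < (SELECT COUNT(DISTINCT Grp) FROM tab)
    """
).fetchall()
print(missing)
conn.close()
```

Note that the query reports which Data values are missing from at least one group, not which group lacks them.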

Excel VBA to transpose set of rows if value exists in another column

I'm trying to find a way, via a VBA script, to transpose rows from column A into a new sheet, but only for rows where column A contains a number and column B has a value. I have a sheet with ~75K rows that I need to do this for. I tried creating pivot tables, which got the data into its current format, but I need the data to be in columns.
The tricky part of this is that in column A, I only need to look at the rows that are all numbers and not the other rows that have text.
I created a sample sheet to view, where the sample data is in the SOURCE tab and what I want the data to look like in the TRANSPOSED tab.
https://docs.google.com/spreadsheets/d/1ujbaouZFqiPw0DbO78PCnz25OY2ugF1HtUqMg_J7KeI/edit?usp=sharing
Any help would be appreciated.

UPDATE and Answer:
I modified my approach and went back to the original source data which was not part of a pivot table and was able to use a simple match formula between the 2 data sources. So, my original data looked like this:
+----------------+---------+--------+--------------+
| Gtin | Brand | Name | TaxonomyText |
+----------------+---------+--------+--------------+
| 00030085075605 | brand 1 | name 1 | cat1 |
| 00041100015112 | brand 2 | name 2 | cat2 |
| 00041100015099 | brand 3 | name 3 | cat3 |
| 00030085075608 | brand 4 | name 4 | cat4 |
+----------------+---------+--------+--------------+
I had another sheet containing the data I needed to match to in this format:
+----------------+---------+
| Gtin | Brand |
+----------------+---------+
| 00030085075605 | brand 1 |
| 00041100015112 | brand 2 |
| 00041100015098 | brand 3 |
| 00030085075608 | brand 4 |
+----------------+---------+
I created a new column in my source sheet and used an IFERROR/MATCH formula:
=IFERROR(IF(MATCH(A14,data_to_match!$A:$A,0),"yes",),"no")
Then I copied this formula down every row (about 75K rows), which very quickly filled in a yes or a no.
+----------------+---------+---------+--------+--------------+
| Gtin | matched | Brand | Name | TaxonomyText |
+----------------+---------+---------+--------+--------------+
| 00030085075605 | yes | brand 1 | name 1 | cat1 |
| 00041100015112 | yes | brand 2 | name 2 | cat2 |
| 00041100015099 | no | brand 3 | name 3 | cat3 |
| 00030085075608 | yes | brand 4 | name 4 | cat4 |
+----------------+---------+---------+--------+--------------+
The final step was to just filter for Yes values and I had all the data that I needed.
My mistake was going to a pivot table first which put the data in a very funky format causing me to have to do a transpose, which wasn't really necessary. Hopefully this can help others....
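
For anyone doing the same match outside Excel, the yes/no flag and the final filter can be sketched in Python. This is only an illustration using the sample rows above; the variable names are hypothetical, and the GTINs are kept as strings on the assumption that the leading zeros matter.

```python
# Source sheet rows: (Gtin, Brand, Name, TaxonomyText).
source_rows = [
    ("00030085075605", "brand 1", "name 1", "cat1"),
    ("00041100015112", "brand 2", "name 2", "cat2"),
    ("00041100015099", "brand 3", "name 3", "cat3"),
    ("00030085075608", "brand 4", "name 4", "cat4"),
]

# Gtin column of the sheet to match against; a set gives fast lookups.
gtins_to_match = {
    "00030085075605",
    "00041100015112",
    "00041100015098",
    "00030085075608",
}

# Equivalent of the IFERROR/MATCH column: "yes" when the Gtin is found.
flagged = [
    (gtin, "yes" if gtin in gtins_to_match else "no", brand, name, taxonomy)
    for gtin, brand, name, taxonomy in source_rows
]

# Equivalent of filtering the sheet for "yes".
matched_only = [row for row in flagged if row[1] == "yes"]
print(matched_only)
```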

Combine column x to n in OpenRefine

I have a table with an unknown number of columns, and I need to combine all columns after a certain point. Consider the following:
| A | B | C | D | E |
|----|----|---|---|---|
| 24 | 25 | 7 | | |
| 12 | 3 | 4 | | |
| 5 | 5 | 5 | 5 | |
Columns A-C are known, and the information in them correct. But column D to N (an unknown number of columns starting with D) needs to be combined as they are all parts of the same string. How can I combine an unknown number of columns in OpenRefine?
As some columns may have empty cells (the string may be of various lengths) I also need to disregard empty cells.

There is a two step approach to this that should work for you.
From the first column you want to merge (Col D in this case) choose Transpose->Transpose cells across columns into rows
You will be asked to set some options. You'll want to choose 'From Column' D and 'To Column' N. Then choose to transpose into One Column, assign a name to that column, and make sure the option to 'Ignore Blank Cells' is checked (it should be checked by default). Then click Transpose.
You'll get the values that were previously in cols D-N appearing in rows. e.g.
| A | B | C | D | E | F |
|----|----|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 |
Transposes to:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 4 |
| | | | 5 |
| | | | 6 |
You can then use the dropdown menu from the head of the 'new' column to choose
Edit cells->Join multi-value cells
You'll be asked what character you want to use to separate the values in the joined cell. In your use case you can probably delete the separator and combine the cells without any joining character. This will give you:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 456 |
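
The net effect of the two steps above (transpose, then join) on a single row can be sketched in Python. This is not OpenRefine code, just an illustration of the expected result; the function name and the list-of-strings row format are assumptions, and empty cells are taken to be empty strings or None.

```python
def combine_trailing_columns(row, first_merged_index, separator=""):
    """Join all non-empty cells from first_merged_index onward into one cell."""
    head = list(row[:first_merged_index])
    tail = [cell for cell in row[first_merged_index:] if cell not in ("", None)]
    return head + [separator.join(str(cell) for cell in tail)]


# Columns A-C are kept; columns D onward (index 3) are combined.
print(combine_trailing_columns(["1", "2", "3", "4", "5", "6"], 3))
print(combine_trailing_columns(["5", "5", "5", "5", ""], 3))
```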

Primary key auto-increment manipulation

Is there any way to have a primary key with a feature that increments it but fills in gaps? Assuming I have the following table:
____________________
| ID | Value |
| 1 | A |
| 2 | B |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
Notice that the value is only an example, the order has nothing to do with the question.
Once I remove the row with the ID of 2 (the table will look like this):
____________________
| ID | Value |
| 1 | A |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
And I add another row, with regular auto-increment feature it will look like this:
____________________
| ID | Value |
| 1 | A |
| 3 | C |
| 4 | D |
^^^^^^^^^^^^^^^^^^^^^
As expected.
The output I'd want would be:
____________________
| ID | Value |
| 1 | A |
| 2 | D |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
Where the gap is filled with the new row. Also note that maybe, in memory, it would look different. But the point is that the primary key would fill the gaps.
For instance, with the existing primary keys 1, 2, 3, 6, 7, 10 and 11, the next insert should get 4, then 5, then 8, and so on. When the table is empty (even if it previously held a million rows), numbering should start over from 1.
How do I accomplish that? Is there any built-in feature similar to that? Can I implement it?
EDIT: If it's not possible, why not?

No, you don't want to do that, as juergen-d said. It is unlikely to do what you expect, and even less so in a multi-user environment.
In a multi-user environment you are likely to get gaps even when there are no deletes, just from aborted inserts.
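
For reference, the rule the question asks for (use the smallest unused positive ID) is easy to state in code, which may help show why it is fragile. The sketch below is a hypothetical helper, not a database feature: it assumes a single writer that already knows every existing ID, so two concurrent sessions running it would compute the same "next" ID, which is the multi-user problem described above.

```python
def next_gap_filling_id(existing_ids):
    """Return the smallest positive integer not present in existing_ids."""
    used = set(existing_ids)
    candidate = 1
    while candidate in used:
        candidate += 1
    return candidate


print(next_gap_filling_id([1, 2, 3, 6, 7, 10, 11]))
print(next_gap_filling_id([]))
```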
