Spark SQL: how to resolve data dependency issues - apache-spark-sql

scala> sql(""" select * from demo1""").show(false)
+-----+---+
|from1|to1|
+-----+---+
|c |d |
|b |c |
|a |b |
+-----+---+
demo1 is my input table.
From the table we can see the chain a to b, b to c, c to d, so every element on the chain should ultimately point to d.
So I need a result like this:
+-----+---+
|from1|to1|
+-----+---+
|c |d |
|b |d |
|a |d |
+-----+---+
Note: a, b, c are not sorted alphabetically or by size; the links between them are arbitrary.
How do I write this in Spark SQL?
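Spark SQL has traditionally had no recursive CTEs, so one common approach is to iterate a self-join in the DataFrame API until every from1 resolves to a terminal node (a to1 that never appears as a from1). A minimal sketch, assuming each from1 has exactly one outgoing edge and the chains are acyclic (a cycle would make the loop run forever):

import org.apache.spark.sql.functions._

var edges = spark.table("demo1")
var changed = true
while (changed) {
  // Follow one more hop: if to1 itself appears as a from1, jump past it.
  val next = edges.as("e")
    .join(edges.as("n"), col("e.to1") === col("n.from1"), "left")
    .select(col("e.from1"), coalesce(col("n.to1"), col("e.to1")).as("to1"))
  changed = next.except(edges).count() > 0
  edges = next
}
edges.show(false)

Each pass doubles the hop length (pointer jumping), so the loop converges in O(log n) iterations for a chain of length n; on the demo1 data it yields (c,d), (b,d), (a,d).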

Related

Transform table with duplicates

I'm trying to transform a table containing duplicates into a new, duplicate-free table according to the model below, but I don't see how to do it. Thank you in advance for your help.
Original table:
IDu| ID | Information
1 |A |1
2 |A |2
3 |A |3
4 |A |4
5 |A |5
6 |B |1
7 |B |2
8 |B |3
9 |B |4
10 |C |1
11 |D |1
12 |D |2
13 |D |3
Target table:
ID | Resultat/table2 | greatest value
A |(1,2,3,4,5) |5
B |(1,2,3,4) |4
C |(1) |1
D |(1,2,3) |3
You can use GROUP_CONCAT
(https://www.w3resource.com/mysql/aggregate-functions-and-grouping/aggregate-functions-and-grouping-group_concat.php):
SELECT
ID, GROUP_CONCAT(INFORMATION), COUNT(INFORMATION)
FROM
TABLE
GROUP BY
ID
A huge thank you, quick and perfect response.
On the other hand, how can I filter to get the greatest value? This query lists the values from smallest to largest, but how do I keep only the largest one?
ID | Resultat/table2 | greatest value
A |(1,2,3,4,5) |5
B |(1,2,3,4) |4
C |(1) |1
D |(1,2,3) |3
I tried, but without success
SELECT ID, GROUP_CONCAT(ID1)
FROM tournee_reduite
GROUP BY ID
ORDER BY MAX(ID1) DESC;
Another huge thank you.
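In MySQL the fix is simply to add MAX(INFORMATION) to the select list next to GROUP_CONCAT; ORDER BY MAX(ID1) only sorts the groups, it doesn't keep the largest value. Since this page is tagged apache-spark-sql, here is a hedged sketch of the equivalent in the Spark DataFrame API (Spark has no GROUP_CONCAT; collect_list plus concat_ws plays that role), assuming a table tournee_reduite with columns ID and INFORMATION:

import org.apache.spark.sql.functions._

spark.table("tournee_reduite")
  .groupBy("ID")
  .agg(
    // GROUP_CONCAT equivalent: collect the values, sort, join with commas
    concat_ws(",", sort_array(collect_list(col("INFORMATION"))).cast("array<string>")).as("Resultat"),
    // keep only the largest value per group
    max(col("INFORMATION")).as("plus_grand_valeur")
  )
  .show(false)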

Add two elements in a dataframe (based on the index)

I have a dataframe in which some rows are useless except for one variable. I want to add that variable's value in those rows to the previous row and then delete the useless rows, so the information they carry is preserved.
More precisely, my dataframe looks something like:
|cat1| cat2|var1|var2|
|A |x |1 |2 |
|A |x |1 |0 |
|A |x |. |5 |
|A |y |1 |2 |
|A |y |1 |2 |
|A |y |1 |3 |
|A |y |. |6 |
|B |x |1 |2 |
|B |x |1 |4 |
|B |x |1 |2 |
|B |x |1 |1 |
|B |x |. |3 |
and I want to get:
|cat1| cat2|var1|var2|
|A |x |1 |2 |
|A |x |1 |5(5+0)|
|A |y |1 |2 |
|A |y |1 |2 |
|A |y |1 |9(6+3)|
|B |x |1 |2 |
|B |x |1 |4 |
|B |x |1 |2 |
|B |x |1 |4(3+1)|
I've tried code like
test = df[df['var1'] == '.'].index
for num in test:
    df['var2'][num - 1] = df['var2'][num - 1] + df['var2'][num]
but it doesn't work.
Any help would be appreciated.
For a very readable solution, combine np.where with shift(-1): shifting var1 by -1 looks at the next row, and where that next row contains '.', add the next row's var2 to the current one; otherwise keep the original value. Afterwards, just drop all the rows whose var1 is '.':
import numpy as np

df['var2_new'] = np.where(df['var1'].shift(-1) == '.',
                          df['var2'] + df['var2'].shift(-1),
                          df['var2'])
df[df['var1'] != '.']
# cat1 cat2 var1 var2 var2_new
#0 A x 1 2 2.0
#1 A x 1 0 5.0
#3 A y 1 2 2.0
#4 A y 1 2 2.0
#5 A y 1 3 9.0
#7 B x 1 2 2.0
#8 B x 1 4 4.0
#9 B x 1 2 2.0
#10 B x 1 1 4.0

Distinct count on multiple unrelated columns

I have a dataset from which I want the count of each distinct value in more than one column, and I want the result in one single select. How do I go about it?
Example:
Table:
|Col_A|Col_B|
|a |c |
|a |d |
|b |c |
|b |d |
|b |c |
I want like this (with the use of a single select query) -
|Col_A|Count_of_A|Col_B|Count_of_B|
|a |2 |c |3 |
|b |3 |d |2 |
How to do this? Given that the data is unknown each time, we cannot hard-code WHERE or CASE statements for a specific use case.
Ideally this is a Spark Streaming problem: I want to do this operation on a streaming dataframe every time new data comes in from Kafka.
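One way to sketch this in Spark: aggregate each column independently, number the rows of each result, and stitch them together with a full outer join on the row number (the pairing of a with c is as arbitrary as in the example). A batch sketch assuming the example table is loaded as df; a streaming version would additionally have to cope with Structured Streaming's restrictions on multiple aggregations:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Note: Window.orderBy without partitionBy funnels all rows through one
// partition, which is acceptable for small sets of distinct values.
val countsA = df.groupBy("Col_A").agg(count("*").as("Count_of_A"))
  .withColumn("rn", row_number().over(Window.orderBy("Col_A")))
val countsB = df.groupBy("Col_B").agg(count("*").as("Count_of_B"))
  .withColumn("rn", row_number().over(Window.orderBy("Col_B")))

countsA.join(countsB, Seq("rn"), "full_outer")
  .select("Col_A", "Count_of_A", "Col_B", "Count_of_B")
  .show()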

Is there a word or term for a PIVOT without data loss?

Is there a word/phrase that describes the following action?
Where data in the form:
ID |Group |Type |Data
------------------------
1 |A |a |10
2 |A |b |11
3 |A |c |12
4 |B |a |20
5 |B |d |40
6 |C |b |31
Is transformed to this form:
Type |A |B |C (etc.)
-------------------------
a |10 |20 |NULL
b |11 |NULL |31
c |12 |NULL |NULL
d |NULL |40 |NULL
This is a kind of pivot, but with no summarising, so the data could (in theory) be updated via the transformed table.
I would have thought that this is needed quite widely for allocation of resources/stock to multiple projects. In the example above 'Group' would be project, 'Type' would be the resource and 'Data' would be the quantity needed or allocated.
I really want to ask a question about how this is normally approached in database design, but I need to know the terminology before I can do that!
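For what it's worth, this reshape is commonly called a cross-tabulation (or unstacking/spreading in dataframe libraries); it stays lossless whenever each (Type, Group) pair holds at most one Data value. A sketch in the Spark DataFrame API under that uniqueness assumption, with the source table loaded as df:

import org.apache.spark.sql.functions.first

// first() discards nothing here because each (Type, Group) pair
// contributes at most one Data value.
df.groupBy("Type").pivot("Group").agg(first("Data")).show()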

Select rows with different values on different columns

I'm new to SQL, so this took me a long time without being able to figure it out.
My table looks like this:
+------+------+------+
|ID |2016 | 2017 |
+------+------+------+
|1 |A |A |
+------+------+------+
|2 |A |B |
+------+------+------+
|3 |B |B |
+------+------+------+
|4 |B |C |
+------+------+------+
I would like to have only the rows which have changed from 2016 to 2017:
+------+------+------+
|ID |2016 | 2017 |
+------+------+------+
|2 |A |B |
+------+------+------+
|4 |B |C |
+------+------+------+
Could you please help?
select * from mytable where column_2016<>column_2017
assuming your column labels are column_2016 and column_2017
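One caveat: columns literally named 2016 and 2017 must be quoted in most dialects. A sketch in Spark SQL, where the quoting character is the backtick (assuming the table is registered as mytable):

spark.sql("SELECT * FROM mytable WHERE `2016` <> `2017`").show()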