I need to show the sum of a column, but counting each value only once per unique combination of three other columns.
Right now my expression looks like this: Num(Sum([myColumn]) , '# ##0,00')
I need to add a condition along the lines of (where Distinct col1, col2, col3).
Can you tell me how to do it?
I tried to write it like this:
Num(Sum(distinct {[col1] + [col2] + [col3]} [mycolumn]), '# ##0.00')
but it didn't help.
id| col1| col2| col3| mycolumn|
1| 0001| 810| yes| 10.00|
2| 0001| 810| yes| 10.00|
3| 0001| 840| no| 25.11|
4| 0001| 840| yes| 25.11|
5| 0001| 392| yes| 15.01|
6| 0001| 756| yes| 15.01|
Total: 90.24
In the records with id 1 and id 2, col1, col2, and col3 are all equal, so only one of those rows goes into the sum. In records id 3 and id 4, col3 differs, so both rows are added. In records id 5 and id 6, col2 differs, so both of those rows are added as well.
Total: 10.00 + 25.11 + 25.11 + 15.01 + 15.01 = 90.24
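For reference, the deduplication logic described above, sketched in plain Python purely to illustrate the intended result (this is not the Qlik expression being asked for):

# Illustration only: sum mycolumn once per distinct (col1, col2, col3).
rows = [
    (1, "0001", 810, "yes", 10.00),
    (2, "0001", 810, "yes", 10.00),
    (3, "0001", 840, "no", 25.11),
    (4, "0001", 840, "yes", 25.11),
    (5, "0001", 392, "yes", 15.01),
    (6, "0001", 756, "yes", 15.01),
]

seen = set()
total = 0.0
for _id, col1, col2, col3, value in rows:
    key = (col1, col2, col3)
    if key not in seen:  # only the first row per distinct combination
        seen.add(key)
        total += value

print(round(total, 2))  # 90.24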
I have a data frame that needs some transformation. Col_x and Col_y are the columns to work on: their suffixes (X and Y) should go into a new column Col_D, and the values of Col_x and Col_y should be split into separate rows. I went through the pivot table option, but it doesn't seem to work. Is there a way I can transform the data efficiently in Spark Scala?
ColA ColB Col_x Col_y
a 1 10 20
b 2 30 40
Table required:
ColA ColB ColC Col_D
a 1 10 X
a 1 20 Y
b 2 30 X
b 2 40 Y
You can use the stack function:
val df = // input
df.selectExpr("ColA", "ColB", "stack(2, 'X', Col_x, 'Y', Col_y) as (ColD, ColC)")
.show()
+----+----+----+----+
|ColA|ColB|ColD|ColC|
+----+----+----+----+
| a| 1| X| 10|
| a| 1| Y| 20|
| b| 2| X| 30|
| b| 2| Y| 40|
+----+----+----+----+
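For context, stack(2, 'X', Col_x, 'Y', Col_y) emits two output rows per input row, pairing each literal suffix with the matching column value. The same SQL expression can also be used from PySpark via selectExpr; a minimal sketch, assuming a running SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10, 20), ("b", 2, 30, 40)],
    ["ColA", "ColB", "Col_x", "Col_y"],
)
# Same unpivot as the Scala answer above.
df.selectExpr(
    "ColA", "ColB",
    "stack(2, 'X', Col_x, 'Y', Col_y) as (ColD, ColC)",
).show()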
I have a table (df) which has multiple columns: col1, col2, col3, and so on.
col1|col2|col3|....|coln|
----|----|----|----|----|
   1| abc|   1| ...| qwe|
   1| xyz|    | ...|    |
   2|    |   3| ...|    |
   3| abc|   6| ...| qwe|
I want my final table (df) to have the following columns:
attribute_name: contains the name of columns from previous table
count: contains total count of the table
distinct_count: contains distinct count of each column from previous table
null_count: contains count of null values of each column from previous table
The final table should look like this:
attribute_name|count|distinct_count|null_count|
--------------|-----|--------------|----------|
col1          |    4|             3|         0|
col2          |    4|             2|         1|
col3          |    4|             3|         1|
coln          |    4|             1|         2|
Could someone help me with how I can implement this in PySpark?
I didn't test it or check whether it is correct, but something like this should work:
from functools import reduce

attr_df_list = []
for column_name in df.columns:
    attr_df_list.append(
        df.selectExpr(
            # the column name itself, as a string literal
            f"'{column_name}' AS attribute_name",
            "COUNT(*) AS count",
            f"COUNT(DISTINCT {column_name}) AS distinct_count",
            f"COUNT_IF({column_name} IS NULL) AS null_count"
        )
    )

result_df = reduce(lambda df1, df2: df1.union(df2), attr_df_list)
Here's a solution:
df = spark.createDataFrame([("apple",1,1),("mango",2,2),("apple",None,3),("mango",None,4)], ["col1","col2","col3"])
df.show()
# Out:
# +-----+----+----+
# | col1|col2|col3|
# +-----+----+----+
# |apple|   1|   1|
# |mango|   2|   2|
# |apple|null|   3|
# |mango|null|   4|
# +-----+----+----+
from pyspark.sql.functions import col

data = [
    (
        c,
        df.filter(col(c).isNotNull()).count(),  # non-null count
        df[[c]].distinct().count(),             # distinct count (null counts as a value)
        df.filter(col(c).isNull()).count(),     # null count
    )
    for c in df.columns
]
cols = ['attribute_name', 'count', 'distinct_count', 'null_count']
spark.createDataFrame(data, cols).show()
# Out:
# +--------------+-----+--------------+----------+
# |attribute_name|count|distinct_count|null_count|
# +--------------+-----+--------------+----------+
# |          col1|    4|             2|         0|
# |          col2|    2|             3|         2|
# |          col3|    4|             4|         0|
# +--------------+-----+--------------+----------+
The idea is to loop through the columns of the original dataframe and for each column create a new row with the aggregated data.
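Note that this runs three separate Spark jobs per column (two filtered counts plus a distinct count). If that becomes slow on wide tables, one possible alternative, shown here only as an untested sketch and not part of the original answers, is to compute all the metrics in a single aggregation pass:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("apple", 1, 1), ("mango", 2, 2), ("apple", None, 3), ("mango", None, 4)],
    ["col1", "col2", "col3"],
)

# Build count / distinct count / null count expressions for every column.
aggs = []
for c in df.columns:
    aggs += [
        F.count(F.lit(1)).alias(f"{c}__count"),  # total row count, as in the question
        F.countDistinct(F.col(c)).alias(f"{c}__distinct_count"),
        F.count(F.when(F.col(c).isNull(), 1)).alias(f"{c}__null_count"),
    ]
row = df.agg(*aggs).first()  # single pass over the data

data = [
    (c, row[f"{c}__count"], row[f"{c}__distinct_count"], row[f"{c}__null_count"])
    for c in df.columns
]
spark.createDataFrame(
    data, ["attribute_name", "count", "distinct_count", "null_count"]
).show()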
Suppose I have a table1:
column1|column2|state|
-------|-------|-----|
test1 | 2| 0|
test1 | 3| 0|
test1 | 1| 1|
test2 | 2| 1|
test2 | 1| 2|
I want to select (actually delete, but I use select for testing) all rows whose column1 value is not unique, while not selecting (actually retaining) only the row that has (the rule is also spelled out in a small sketch after the tables below):
1. state = 0 and the smallest value in column2,
2. if no row with state = 0 exists, then the row with just the smallest value in column2.
So the result of the select should be:
column1|column2|state|
-------|-------|-----|
test1 | 3| 0|
test1 | 1| 1|
test2 | 2| 1|
and the retained rows (in case of delete) should be:
column1|column2|state|
-------|-------|-----|
test1 | 2| 0|
test2 | 1| 2|
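For reference, here is the keep/delete rule spelled out against the sample rows in plain Python (an illustration of the intended result only, not the SQL):

rows = [
    ("test1", 2, 0),
    ("test1", 3, 0),
    ("test1", 1, 1),
    ("test2", 2, 1),
    ("test2", 1, 2),
]

def keeper(group):
    """Row to retain: smallest column2 among state = 0 rows if any exist,
    otherwise the smallest column2 overall."""
    state0 = [r for r in group if r[2] == 0]
    return min(state0 or group, key=lambda r: r[1])

groups = {}
for row in rows:
    groups.setdefault(row[0], []).append(row)

kept = {keeper(group) for group in groups.values()}
deleted = [r for r in rows if r not in kept]
print(sorted(kept))  # [('test1', 2, 0), ('test2', 1, 2)]
print(deleted)       # [('test1', 3, 0), ('test1', 1, 1), ('test2', 2, 1)]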
I tried to achieve it with the following (which does not work):
SELECT * FROM table1 AS result1
WHERE
result1.column1 IN
(SELECT
result2.column1
FROM
table1 AS result2
WHERE /*part that works*/)
AND
result1.column2 >
(SELECT
min(result3.column2)
FROM
table1 AS result3
WHERE (COALESCE(
result3.column1 = result1.column1
AND
result3.state = 0,
WHERE
result3.column1 = result1.column1
)))
The part that I can't figure out is the part after result1.column2 >.
I want to compare result1.column2 with:
1. the smallest value from the result set where result3.state = 0,
2. if 1. does not exist, then the smallest value from a similar result set without the result3.state = 0 condition.
That is my problem; I hope it makes sense. Maybe the whole thing can be rewritten in a more efficient/neater way.
Can you help me to fix that query?
Is this what you want? COALESCE returns its first non-NULL argument, so the minimum over state = 0 rows is used when such a row exists, and the overall minimum otherwise:
SELECT
*
FROM
table1 AS result1
WHERE
result1.column1 IN (SELECT result2.column1
FROM table1 AS result2
WHERE /*part that works*/)
AND result1.column2 > COALESCE( ( SELECT min(result3.column2)
FROM table1 AS result3
WHERE result3.column1 = result1.column1
AND result3.state = 0 )
,( SELECT min(result3.column2)
FROM table1 AS result3
WHERE result3.column1 = result1.column1 )
)
;
I have a production table in Hive which gets incremental (changed/new records) data from an external source on a daily basis. Values for a row may arrive spread across different dates. For example, this is how the records in the table look on the first day:
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  a1|  b1|
|  2|  a2|    |
|  3|    |  b3|
+---+----+----+
On the second day, we get the following:
+---+----+----+
| id|col1|col2|
+---+----+----+
|  4|  a4|    |
|  2|    |  b2|
|  3|  a3|    |
+---+----+----+
which has a new record as well as changed records.
The result I want to achieve is a merge of rows based on the primary key (id in this case), producing an output like this:
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  a1|  b1|
|  2|  a2|  b2|
|  3|  a3|  b3|
|  4|  a4|  b4|
+---+----+----+
The number of columns is pretty large, typically in the range of 100-150. The aim is to provide the latest full view of all the data received so far. How can I do this within Hive itself?
(PS: it doesn't have to be sorted.)
This can be achieved using COALESCE and a FULL OUTER JOIN.
SELECT COALESCE(a.id, b.id)     AS id,
       COALESCE(a.col1, b.col1) AS col1,
       COALESCE(a.col2, b.col2) AS col2
FROM tbl1 a
FULL OUTER JOIN table2 b
  ON a.id = b.id
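With 100-150 columns you probably don't want to write each COALESCE by hand; the select list can be generated. A minimal sketch in Python, where the id column name and the data column list are placeholders to be replaced with the real schema:

# Generate the COALESCE select list for a wide merge; column and table
# names below are placeholders, not from the original post.
id_col = "id"
data_cols = ["col1", "col2", "col3"]  # extend to the real 100-150 columns

select_list = ",\n       ".join(
    [f"COALESCE(a.{id_col}, b.{id_col}) AS {id_col}"]
    + [f"COALESCE(a.{c}, b.{c}) AS {c}" for c in data_cols]
)

query = (
    f"SELECT {select_list}\n"
    "FROM tbl1 a\n"
    "FULL OUTER JOIN table2 b ON a.id = b.id"
)
print(query)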
Let's say I have a table with 3 columns with the following values
| A | B | C |
| A1| B1| C |
| A | B | C1|
_____________
I'd like to make a query to get
A | B | C, C1|
A1| B1| C|
so as to get distinct values for the first and second columns. Any help would be greatly appreciated.
You need to use LISTAGG:
SELECT
cola,
colb,
LISTAGG(colc, ', ') WITHIN GROUP(ORDER BY colc) AS colc
FROM mytable
GROUP BY
cola,
colb
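For reference, the grouping and concatenation LISTAGG performs here, sketched in plain Python against the sample rows (illustration only):

from collections import defaultdict

rows = [("A", "B", "C"), ("A1", "B1", "C"), ("A", "B", "C1")]

grouped = defaultdict(list)
for cola, colb, colc in rows:
    grouped[(cola, colb)].append(colc)

for (cola, colb), values in grouped.items():
    print(cola, colb, ", ".join(sorted(values)))
# A B C, C1
# A1 B1 C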