Distinct Sum and Group By - SQL

I have a dataset (example below) and I want to create 2 tables out of it:
+------+------------+-------+-------+-------+--------+
| corp | product | data | Group | sales | market |
+------+------------+-------+-------+-------+--------+
| A | Eli | 43831 | A | 100 | I |
| A | Eli | 43831 | B | 100 | I |
| B | Sut | 43831 | A | 80 | I |
| A | Api | 43831 | C | 50 | C or D |
| A | Api | 43831 | D | 50 | C or D |
| B | Konkurent2 | 43831 | C | 40 | C or D |
+------+------------+-------+-------+-------+--------+
1st - sum(sales) by market, excluding duplicated rows, so that I end up with sales for each market in a specific date range (the data column). The duplicates are there because one product can belong to more than one group.
So the first table, for example for MRCC I, would look like:
+--------+-------+-------+
| market | sales | data |
+--------+-------+-------+
| I | 180 | 43831 |
+--------+-------+-------+
Then I would like the second table to look like the one above, but with an additional 'dictionary' column holding the unique product names within each market and date, so for MRCC I it would look like:
+--------+-------+-------+----------------+
| market | sales | data | unique product |
+--------+-------+-------+----------------+
| I | 180 | 43831 | eli |
| I | 180 | 43831 | Sut |
+--------+-------+-------+----------------+
The thing is, I'm not that experienced in SQL and I'm fairly new to data processing. The system I am working in lets me do some of the processing either with "visual" recipes or with SQL code, which I'm not that familiar with. Even more confusing, I can choose between 3 SQL engines: Impala, Hive and Spark SQL. For example, to create the market column I used Impala and the script looks like this (I'm not sure if this is "pure" Impala syntax):
SELECT *
FROM (
    -- mrc I --
    SELECT *,
           CASE
               WHEN (`product` = "Eli")
                 OR (`product` = "Sut")
               THEN "MRCC I"
           END AS market
    FROM x.`y`
) a
WHERE market IS NOT NULL
Could you give me some tips on how to structure the code, and whether this is even possible?
Thanks,
eM

import spark.implicits._
import org.apache.spark.sql.functions._

case class Sale(
    corp: String,
    product: String,
    data: Long,
    group: String,
    sales: Long,
    market: String
)

val df = Seq(
  Sale("A", "Eli", 43831, "A", 100, "I"),
  Sale("A", "Eli", 43831, "B", 100, "I"),
  Sale("B", "Sut", 43831, "A", 80, "I"),
  Sale("A", "Api", 43831, "C", 50, "C or D"),
  Sale("A", "Api", 43831, "D", 50, "C or D"),
  Sale("B", "Konkurent2", 43831, "C", 40, "C or D")
).toDF()

// Table 2: drop the duplicates created by a product sitting in several groups,
// then sum sales per market/product/date and keep the product as a "dictionary" column
val t2 = df.dropDuplicates(Seq("corp", "product", "data", "market"))
  .groupBy("market", "product", "data").sum("sales")
  .select(
    'market,
    col("sum(sales)").alias("sales"),
    'data,
    'product.alias("unique product")
  )
t2.show(false)
// +------+-----+-----+--------------+
// |market|sales|data |unique product|
// +------+-----+-----+--------------+
// |I |80 |43831|Sut |
// |I |100 |43831|Eli |
// |C or D|40 |43831|Konkurent2 |
// |C or D|50 |43831|Api |
// +------+-----+-----+--------------+
// Table 1: roll table 2 up to one row per market and date
val t1 = t2.drop("unique product")
  .groupBy("market", "data").sum("sales")
  .select(
    'market,
    col("sum(sales)").alias("sales"),
    'data)
t1.show(false)
// +------+-----+-----+
// |market|sales|data |
// +------+-----+-----+
// |I |180 |43831|
// |C or D|90 |43831|
// +------+-----+-----+
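If you would rather stay in plain SQL, the same logic can be written with a DISTINCT subquery. This is a minimal sketch (the table name x.y and the column names are taken from the question, adjust to your schema); the same shape should be accepted by Impala, Hive and Spark SQL:
-- Table 1: de-duplicate first (a product repeated across groups carries the same sales value),
-- then sum per market and date
SELECT market, SUM(sales) AS sales, data
FROM (
    SELECT DISTINCT market, data, corp, product, sales
    FROM x.y
) dedup
GROUP BY market, data;

-- Table 2: the same totals joined back to the distinct product list per market and date
SELECT t.market, t.sales, t.data, p.product AS unique_product
FROM (
    SELECT market, SUM(sales) AS sales, data
    FROM (SELECT DISTINCT market, data, corp, product, sales FROM x.y) dedup
    GROUP BY market, data
) t
JOIN (SELECT DISTINCT market, data, product FROM x.y) p
    ON p.market = t.market AND p.data = t.data;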

Related

How to extract elements from Presto ARRAY(MAP(VARCHAR, VARCHAR))

I have an array of maps and the data format is ARRAY(MAP(VARCHAR, VARCHAR)); I'd like to extract the "product" and "amount" entries from this "Item_Details" column:
+---------+---------+------------------------------------------------------------------------+
| Company | Country | Item_Details                                                           |
+---------+---------+------------------------------------------------------------------------+
| Apple   | US      | [{"created":"2019-09-15","product":"apple watch", "amount": "$7,900"}, {"created":"2022-09-19","product":"iPhone", "amount": "$78,300"}, {"created":"2021-01-13","product":"Macbook Pro", "amount": "$163,980"}] |
| Google  | US      | [{"created":"2020-07-15","product":"Nest", "amount": "$78,300"}, {"created":"2021-07-15","product":"Google phone", "amount": "$178,900"}] |
+---------+---------+------------------------------------------------------------------------+
My expected output would be either:
+---------+---------+------------------------------------------------------------------------+
| Company | Country | Item_Details                                                           |
+---------+---------+------------------------------------------------------------------------+
| Apple   | US      | {"product": ["apple watch", "iPhone", "Macbook Pro"], "amount": ["$7,900", "$78,300", "$163,980"]} |
| Google  | US      | {"product": ["Nest", "Google phone"], "amount": ["$78,300", "$178,900"]} |
+---------+---------+------------------------------------------------------------------------+
or
+---------+---------+-------------+----------+
| Company | Country | Product     | Amount   |
+---------+---------+-------------+----------+
| Apple   | US      | apple watch | $7,900   |
| Apple   | US      | iPhone      | $78,300  |
| Apple   | US      | Macbook Pro | $163,980 |
...
+---------+---------+-------------+----------+
I tried element_at(Item_Details, 'product') and json_extract_scalar(Item_Details, '$.product') but received error "Unexpected parameters (array(map(varchar,varchar)), varchar(23)) for function element_at. "
Any suggestions are much appreciated! Thank you in advance.
For the second output shape you can unnest the array and access the elements of each map:
-- sample data
WITH dataset(Company, Country, Item_Details) AS (
    values ('Google', 'US', array[
        map(array['created', 'product', 'amount'], array['2019-09-15', 'Nest', '$78,300']),
        map(array['created', 'product', 'amount'], array['2019-09-16', 'Nest1', '$79,300'])
    ])
)
-- query
select Company,
       Country,
       m['product'] product,
       m['amount'] amount
from dataset d,
     unnest(Item_Details) as t(m);
Output:
+---------+---------+---------+---------+
| Company | Country | product | amount  |
+---------+---------+---------+---------+
| Google  | US      | Nest    | $78,300 |
| Google  | US      | Nest1   | $79,300 |
+---------+---------+---------+---------+
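For the first output shape (one row per company, with the products and amounts collected into arrays), Presto's transform() lambda over the array is one option. A sketch against the same sample data (verify the exact behaviour on your engine version):
with dataset(Company, Country, Item_Details) as (
    values ('Google', 'US', array[
        map(array['created', 'product', 'amount'], array['2019-09-15', 'Nest', '$78,300']),
        map(array['created', 'product', 'amount'], array['2019-09-16', 'Nest1', '$79,300'])
    ])
)
select Company,
       Country,
       transform(Item_Details, m -> m['product']) as products, -- array of product names
       transform(Item_Details, m -> m['amount']) as amounts    -- array of amounts, in the same order
from dataset;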

PowerBI / SQL Query to verify records

I am working on a PowerBI report that pulls information from SQL, and I cannot find a way to solve my problem in PowerBI or figure out how to write the required code. My first table, Certifications, includes a list of certifications and the required trainings that must be obtained in order to have an active certification.
My second table, UserCertifications, includes a list of UserIDs, certifications, and the trainings associated with a certification.
How can I write a SQL query or PowerBI measure to tell whether a user has all the required trainings for a certification? I.e., if UserID 1 has the A certification, how can I verify that they have the TrainingIDs 1, 10, and 150 associated with it?
Certifications: [screenshot of the Certifications table]
UserCertifications: [screenshot of the UserCertifications table]
This is a DAX pattern to test whether one list of values contains at least a given set of values.
Certifications:
+---------------+------------+
| Certification | TrainingID |
+---------------+------------+
| A             | 1          |
| A             | 10         |
| A             | 150        |
| B             | 7          |
| B             | 9          |
+---------------+------------+
UserCertifications:
+--------+---------------+----------+
| UserID | Certification | Training |
+--------+---------------+----------+
| 1      | A             | 1        |
| 1      | A             | 10       |
| 1      | A             | 300      |
| 2      | A             | 150      |
| 2      | B             | 9        |
| 2      | B             | 90       |
| 3      | A             | 7        |
| 4      | A             | 1        |
| 4      | A             | 10       |
| 4      | A             | 150      |
| 4      | A             | 1000     |
+--------+---------------+----------+
In the above scenario, DAX needs to find out whether the mandatory trainings (Certifications[TrainingID]) for each Certifications[Certification] have been completed within each UserCertifications[UserID] && UserCertifications[Certification] partition.
In the above scenario, DAX should only return TRUE for UserCertifications[UserID] = 4, as that is the only user who completed all of the mandatory trainings.
The way to achieve this is with the following measure:
areAllMandatoryTrainingCompleted =
VAR _alreadyCompleted =
    CONCATENATEX (
        UserCertifications,
        UserCertifications[Training],
        "-",
        UserCertifications[Training]
    ) // what has been completed in the fact table; the fourth argument matters because it decides the sort order
VAR _0 =
    MAX ( UserCertifications[Certification] )
VAR _supposedToComplete =
    CONCATENATEX (
        FILTER ( Certifications, Certifications[Certification] = _0 ),
        Certifications[TrainingID],
        "-",
        Certifications[TrainingID]
    ) // what is supposed to be completed, from the Certifications table; the fourth argument matters because it decides the sort order
VAR _isMandatoryTrainingCompleted =
    CONTAINSSTRING ( _alreadyCompleted, _supposedToComplete ) // CONTAINSSTRING ( <within_text>, <search_text> ); returns TRUE/FALSE
RETURN
    _isMandatoryTrainingCompleted
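If you prefer to do the check in SQL before the data reaches PowerBI, a relational-division sketch along these lines should work (table and column names are assumed to match the samples above; adjust to your schema):
-- a user/certification pair qualifies when no required training is missing
SELECT uc.UserID, uc.Certification
FROM (SELECT DISTINCT UserID, Certification FROM UserCertifications) uc
WHERE NOT EXISTS (
    SELECT 1
    FROM Certifications c
    WHERE c.Certification = uc.Certification
      AND NOT EXISTS (
          SELECT 1
          FROM UserCertifications u
          WHERE u.UserID = uc.UserID
            AND u.Certification = uc.Certification
            AND u.Training = c.TrainingID
      )
);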

Snowflake - using parse_json and select distinct to un-nest a column and compare it with another column

I have 2 columns: one is a nested column named custom_field and the other is sales_id. I want to compare the sales_id_2 values inside custom_field with the sales_id column.
I've tried this but it didn't work:
select distinct parse_json(custom_fields) as CUSTOM_FIELDS
from my_table where custom_fields:sales_id_2 = sales_id;
but I get the error:
SQL compilation error: error line 1 at position 111 Invalid argument
types for function 'GET': (VARCHAR(16777216), VARCHAR(2)).
+------------------------------+-----------+
| custom_field                 | sales_id  |
+------------------------------+-----------+
| {                            | 235324115 |
|   "sales_id_2": 235324115,   | 1234351   |
|   "g": 12,                   |           |
|   "r": 255                   |           |
| }                            |           |
| {                            | 678322341 |
|   "sales_id_2": 1234351,     | 5648561   |
|   "g": 13,                   |           |
|   "r": 254                   |           |
| }                            |           |
+------------------------------+-----------+
I'm hoping to see empty results, because I believe sales_id_2 is the same as sales_id
:: is for casting; also, you are trying a JSON operation on a VARCHAR column. Try this:
select distinct parse_json(custom_fields) as CUSTOM_FIELDS
from my_table
where parse_json(custom_fields):sales_id_2 = sales_id;
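If sales_id is numeric, it may also help to cast the extracted value explicitly; a sketch (table and column names assumed from the question):
select distinct parse_json(custom_fields) as CUSTOM_FIELDS
from my_table
-- ::number casts the extracted VARIANT value so the comparison is done number-to-number
where parse_json(custom_fields):sales_id_2::number = sales_id;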

In Hive, what is the difference between explode() and lateral view explode()

Assume there is a table employee:
+-----------+------------------+
| col_name | data_type |
+-----------+------------------+
| id | string |
| perf | map<string,int> |
+-----------+------------------+
and the data inside this table:
+-----+------------------------------------+--+
| id | perf |
+-----+------------------------------------+--+
| 1 | {"job":80,"person":70,"team":60} |
| 2 | {"job":60,"team":80} |
| 3 | {"job":90,"person":100,"team":70} |
+-----+------------------------------------+--+
I tried the following two queries, but they both return the same result:
1. select explode(perf) from employee;
2. select key,value from employee lateral view explode(perf) as key,value;
The result:
+---------+--------+--+
| key | value |
+---------+--------+--+
| job | 80 |
| team | 60 |
| person | 70 |
| job | 60 |
| team | 80 |
| job | 90 |
| team | 70 |
| person | 100 |
+---------+--------+--+
So, what is the difference between them? I did not find suitable examples. Any help is appreciated.
For your particular case both queries are OK. But you can't use multiple explode() functions without lateral view. So, the query below will fail:
select explode(array(1,2)), explode(array(3, 4))
You'll need to write something like:
select
    a_exp.a,
    b_exp.b
from (select array(1, 2) as a, array(3, 4) as b) t
lateral view explode(t.a) a_exp as a
lateral view explode(t.b) b_exp as b
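A related limitation (worth verifying on your Hive version): a bare explode() cannot be mixed with other columns in the SELECT list either, so keeping the id from the employee table also requires a lateral view, e.g.:
-- select id, explode(perf) from employee;   -- fails: a UDTF must be the only SELECT expression
select id, k, v
from employee
lateral view explode(perf) p as k, v;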

SparkSQL: conditional sum on range of dates

I have a dataframe like this:
| id | prodId | date       | value |
|----|--------|------------|-------|
| 1  | a      | 2015-01-01 | 100   |
| 2  | a      | 2015-01-02 | 150   |
| 3  | a      | 2015-01-03 | 120   |
| 4  | b      | 2015-01-01 | 100   |
and I would love to do a groupBy prodId and aggregate 'value' summing it for ranges of dates. In other words, I need to build a table with the following columns:
prodId
val_1: sum value if date is between date1 and date2
val_2: sum value if date is between date2 and date3
val_3: same as before
etc.
| prodId | val_1            | val_2            |
|        | (01-01 to 01-02) | (01-03 to 01-04) |
|--------|------------------|------------------|
| a      | 250              | 120              |
| b      | 100              | 0                |
Is there any predefined aggregate function in Spark that allows doing conditional sums? Would you recommend developing an aggregate UDF (if so, any suggestions)?
Thanks a lot!
First let's recreate the example dataset:
import spark.implicits._   // needed for toDF and the $ column syntax outside the shell
import org.apache.spark.sql.functions.to_date

val df = sc.parallelize(Seq(
  (1, "a", "2015-01-01", 100), (2, "a", "2015-01-02", 150),
  (3, "a", "2015-01-03", 120), (4, "b", "2015-01-01", 100)
)).toDF("id", "prodId", "date", "value").withColumn("date", to_date($"date"))
val dates = List(("2015-01-01", "2015-01-02"), ("2015-01-03", "2015-01-04"))
All you have to do is something like this:
import org.apache.spark.sql.functions.{when, lit, sum}

val exprs = dates.map {
  case (x, y) => {
    // Create a label for the column name
    val alias = s"${x}_${y}".replace("-", "_")
    // Convert strings to dates
    val xd = to_date(lit(x))
    val yd = to_date(lit(y))
    // Generate an expression equivalent to
    //   SUM(
    //     CASE
    //       WHEN date BETWEEN ... AND ... THEN value
    //       ELSE 0
    //     END
    //   ) AS ...
    // for each pair of dates.
    sum(when($"date".between(xd, yd), $"value").otherwise(0)).alias(alias)
  }
}
df.groupBy($"prodId").agg(exprs.head, exprs.tail: _*).show
// +------+---------------------+---------------------+
// |prodId|2015_01_01_2015_01_02|2015_01_03_2015_01_04|
// +------+---------------------+---------------------+
// | a| 250| 120|
// | b| 100| 0|
// +------+---------------------+---------------------+
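The same aggregation can also be written in plain Spark SQL once the DataFrame is registered as a temporary view (df.createOrReplaceTempView("sales_df")); a sketch, with the view name sales_df assumed:
-- run via spark.sql("...") after registering the temp view
SELECT
    prodId,
    SUM(CASE WHEN date BETWEEN '2015-01-01' AND '2015-01-02' THEN value ELSE 0 END) AS `2015_01_01_2015_01_02`,
    SUM(CASE WHEN date BETWEEN '2015-01-03' AND '2015-01-04' THEN value ELSE 0 END) AS `2015_01_03_2015_01_04`
FROM sales_df
GROUP BY prodId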