Executing a join while avoiding duplicate metrics in the first table's rows - sql

There are two tables to join for an in-depth Excel report, and I am trying to avoid creating duplicate metrics. I have already scraped the competitor data separately using a Python script.
The first table looks like this:
name     | occurances | hits | actions | avg $ | Key
---------+------------+------+---------+-------+-------
balls    | 53432      | 5001 | 5       | 2$    | Hgdy24
bats     | 5389       | 4672 | 3       | 4$    | dhfg12
The competitor data is as follows:
Key      | Ad Copie
---------+-------------
Hgdy24   | Click here!
Hgdy24   | Free Trial!
Hgdy24   | Sign Up now
dhfg12   | Check it out
dhfg12   | World known
dhfg12   | Sign up
I have already tried joins to the following effect (duplicate metric rows are created here):
name |occurances | hits | actions | avg$|Key |Ad Copie
---------+------------+--------+-------------+-----+------+---------
Balls |53432 | 5001 | 5| 2$ |Hgdy24|Click here!
Balls |53432 | 5001 | 5| 2$ |Hgdy24|Free Trial!
Balls |53432 | 5001 | 5| 2$ |Hgdy24|Sign Up now
Bats |5389 | 4672 | 3| 4$ |dhfg12|Check it out
Bats |5389 | 4672 | 3| 4$ |dhfg12|World known
Bats |5389 | 4672 | 3| 4$ |dhfg12|Sign up
Here is the desired output:
name |occurances | hits | actions | avg$|Key |Ad Copie
---------+------------+--------+-------------+-----+------+---------
Balls |53432 | 5001 | 5| 2$ |Hgdy24|Click here!
Balls | | | | |Hgdy24|Free Trial!
Balls | | | | |Hgdy24|Sign Up now
Bats |5389 | 4672 | 3| 4$ |dhfg12|Check it out
Bats | | | | |dhfg12|World known
Bats | | | | |dhfg12|Sign up
Does anyone have a suggestion for a good course of action here? A LAG function, perhaps?

Your desired output is not a proper use case for SQL. SQL is designed to create views of data with all the fields filled in. When you want to visualize that data, you should do so in your application code and suppress the "duplicate" values there, not in SQL.
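That said, if you really must produce this shape in SQL (for example, to paste straight into the Excel report), a window-function sketch along these lines is one option. The table names (metrics, ads) and snake_case column names here are placeholders, the avg column is assumed to already be text (the values look like '2$'), and the numeric metrics have to be cast to text so they can be blanked:

SELECT name,
       -- rn = 1 marks the first ad copy per key, so only that row keeps its metrics
       CASE WHEN rn = 1 THEN CAST(occurances AS VARCHAR(20)) ELSE '' END AS occurances,
       CASE WHEN rn = 1 THEN CAST(hits AS VARCHAR(20))       ELSE '' END AS hits,
       CASE WHEN rn = 1 THEN CAST(actions AS VARCHAR(20))    ELSE '' END AS actions,
       CASE WHEN rn = 1 THEN avg_dollars                     ELSE '' END AS avg_dollars,
       key_col,
       ad_copie
FROM (
    SELECT m.name, m.occurances, m.hits, m.actions, m.avg_dollars,
           m.key_col, a.ad_copie,
           ROW_NUMBER() OVER (PARTITION BY m.key_col ORDER BY a.ad_copie) AS rn
    FROM metrics m
    JOIN ads a ON a.key_col = m.key_col
) x
ORDER BY name, rn;

Every row after the first within each key gets empty strings for the metric columns, which matches the desired output above.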

Related

SQL table transformation. How to pivot a certain table?

How would I do the pivot below?
I have a table like this:
+------+---+----+
| round| id| kpi|
+------+---+----+
| 0 | 1 | 0.1|
| 1 | 1 | 0.2|
| 0 | 2 | 0.5|
| 1 | 2 | 0.4|
+------+---+----+
I want to convert the id column into multiple columns (one per distinct id), with the kpi values as their values, and keep the round column in the new table as in the first one.
+------+----+----+
| round| id1| id2|
+------+----+----+
| 0 | 0.1| 0.5|
| 1 | 0.2| 0.4|
+------+----+----+
Is it possible to do this in SQL? If so, how?
You are looking for a pivot function. You can find details on how to do this here and here. The first link also provides guidance on how to do this if you have an unknown number of column names.
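For a known, fixed set of ids like the example above, a plain conditional-aggregation sketch also works in virtually every dialect (the table name my_table is assumed):

SELECT round,
       MAX(CASE WHEN id = 1 THEN kpi END) AS id1,
       MAX(CASE WHEN id = 2 THEN kpi END) AS id2
FROM my_table
GROUP BY round
ORDER BY round;

Each CASE picks out the kpi for one id, and MAX collapses the group down to the single non-null value per round.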

Hive beginner, got FAILED: SemanticException error

Suppose I have two tables, actv_user and play_video:
actv_user:
| p_date   | user_id | country_name |
| -------- | ------- | ------------ |
| 20210125 | 1       | Brazil       |
| 20210124 | 2       | ENG          |
| 20210125 | 3       | India        |
| 20210125 | 4       | Indonesia    |
| 20210125 | 5       | Indonesia    |
| 20210125 | 6       | Brazil       |
| 20210125 | 7       | Brazil       |
| 20210125 | 8       | Indonesia    |
user_id is unique, but country_name can be null.
play_video:
| user_id | video_id |
| ------- | -------- |
| 1       | 1001     |
| 1       | 1002     |
| 2       | 2001     |
| 3       | 1001     |
| 3       | 1002     |
| 3       | 3003     |
| 4       | 4004     |
| 5       | 1001     |
| 5       | 5005     |
| 6       | 1001     |
| 6       | 1002     |
| 7       | 1001     |
| 7       | 1002     |
| 8       | 3003     |
| 8       | 4004     |
What I want to do is find which videos new users (p_date = 20210125) in Brazil, Indonesia, and India played most on their first day.
Therefore, the new users in Brazil are user_ids 1, 6, and 7; the new user in India is 3; and the new users in Indonesia are 4, 5, and 8.
The outcome is something like this:
In Brazil the top videos played by new users are 1001, 1002.
In India the top videos played by new users are 1001, 1002, 3003.
In Indonesia the top videos played by new users are 4004, 3003, 5005.
Desired outcome:
| country_name | video_id | count |
| ------------ | -------- | ----- |
| Brazil       | 1001     | 3     |
| Brazil       | 1002     | 3     |
| India        | 1001     | 1     |
| India        | 1002     | 1     |
| India        | 3003     | 1     |
| Indonesia    | 4004     | 2     |
| Indonesia    | 3003     | 1     |
| Indonesia    | 5005     | 1     |
The error message I got (roughly) is: FAILED: SemanticException error condition: user_id is not null. Table play_photo is missing partition restrictions in the SQL! If there is a partition condition, please check whether there is an abnormal OR usage; if so, please add brackets around the OR condition!
Any ideas?
I tried:
select actv_user.country_name ,play_video.video_id, count(play_video.video_id) count_num
from actv_user join play_photo on actv_user.user_id = play_video.user_id
where p_date = 20210125 and (country_name = 'Brazil' or country_name = 'India ' or country_name = 'Indonesia ')
group by actv_user.country_name ;
Please try this instead of the chain of ORs:
country_name IN ('Brazil', 'India', 'Indonesia')
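Putting it together, a sketch of the full query (assuming the table is play_video rather than play_photo, and that video_id belongs in the GROUP BY so each video gets its own count):

SELECT a.country_name,
       p.video_id,
       COUNT(p.video_id) AS count_num
FROM actv_user a
JOIN play_video p ON a.user_id = p.user_id
WHERE a.p_date = 20210125
  AND a.country_name IN ('Brazil', 'India', 'Indonesia')
GROUP BY a.country_name, p.video_id;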

Using pyspark to create a segment array from a flat record

I have a sparsely populated table with values for various segments for unique user ids. I need to create an array with the user_id and the relevant segment headers only.
Please note that this is just an indicative dataset. I have several hundreds of segments like these.
------------------------------------------------
| user_id | seg1 | seg2 | seg3 | seg4 | seg5 |
------------------------------------------------
| 100 | M | null| 25 | null| 30 |
| 200 | null| null| 43 | null| 250 |
| 300 | F | 3000| null| 74 | null|
------------------------------------------------
I am expecting the output to be:
-------------------------------
| user_id| segment_array |
-------------------------------
| 100 | [seg1, seg3, seg5] |
| 200 | [seg3, seg5] |
| 300 | [seg1, seg2, seg4] |
-------------------------------
Is there any function available in pyspark or pyspark-sql to accomplish this?
Thanks for your help!
I cannot find a direct way, but you can do this:
from pyspark.sql.functions import array, array_remove, col, lit, when

cols = df.columns[1:]
# Replace each non-null segment value with its column name and each null with
# a sentinel, then strip the sentinel out of the resulting array.
r = df.withColumn('array', array(*[when(col(c).isNotNull(), lit(c)).otherwise('notmatch') for c in cols])) \
    .withColumn('array', array_remove('array', 'notmatch'))
r.show()
+-------+----+----+----+----+----+------------------+
|user_id|seg1|seg2|seg3|seg4|seg5| array|
+-------+----+----+----+----+----+------------------+
| 100| M|null| 25|null| 30|[seg1, seg3, seg5]|
| 200|null|null| 43|null| 250| [seg3, seg5]|
| 300| F|3000|null| 74|null|[seg1, seg2, seg4]|
+-------+----+----+----+----+----+------------------+
Not sure this is the best way, but I'd attack it this way:
There's the collect_set function, which will always give you unique values across the list of values you aggregate over.
Do a union over a per-segment projection, as sketched below:
import pyspark.sql.functions as fn
from pyspark.sql.functions import col, lit

# One row per user, holding the segment name when it is populated, null otherwise.
df_seg_1 = df.select(
    'user_id',
    fn.when(
        col('seg1').isNotNull(),
        lit('seg1')
    ).alias('segment')
)
# repeat for all segments
df = df_seg_1.union(df_seg_2).union(...)
df.groupBy('user_id').agg(fn.collect_set('segment'))

SQL to retrieve cancelled services without new stamps

I have vehicles inside a services table that first get a stamp and then an essay; stamp and essay are linked steps inside the services table.
First comes a stamp, and second comes an essay.
After all that, they get a certificate linked to the essay from the services table.
I need to retrieve only canceled certificates (status = 0) that do not have any new service with new stamps after the canceled essay.
My query is:
SELECT
    cert.essay_id,
    cert.status,
    ser.essay_id,
    ser.stamp_id,
    ser.created,
    v.ds_plate
FROM vehicles v
JOIN services ser ON ser.vei_id = v.vei_id
LEFT JOIN certificates cert ON cert.essay_id = ser.essay_id AND cert.status = 0
WHERE v.ds_plate = 'ABC1234'
ORDER BY ser.created DESC;
And the results I have are:
--------------------------------------------------------------------------------
  | ESSAY_ID | STATUS | ESSAY_ID | STAMP_ID | CREATED             | DS_PLATE
1 |          |        |          | 4303333  | 10/07/2017 15:54:05 | ABC1234
2 | 4254441  | 0      | 4254441  |          | 20/12/2016 16:52:05 | ABC1234
3 |          |        |          | 4263152  | 20/12/2016 16:42:51 | ABC1234
In a minimal form, these are the tables involved:
vehicles
vei_id
ds_plate
services
vei_id
essay_id
stamp_id
created
certificates
id
essay_id
status
created
I would like to find only situations like the one below, where no new stamp was added after an essay_id that is linked to a certificate with status = 0 in the services table:
  | ESSAY_ID | STATUS | ESSAY_ID | STAMP_ID | CREATED             | DS_PLATE
1 | 4254441  | 0      | 4254441  |          | 20/12/2016 16:52:05 | ABC1234
2 |          |        |          | 4263152  | 20/12/2016 16:42:51 | ABC1234
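One way to express the "no newer stamp" condition is with NOT EXISTS. This is a sketch, assuming a "new stamp" means any services row for the same vehicle with a non-null stamp_id created after the canceled essay's row; it returns the canceled essays themselves, and the surrounding stamp rows can be re-joined as in the original query:

SELECT cert.essay_id,
       cert.status,
       ser.stamp_id,
       ser.created,
       v.ds_plate
FROM vehicles v
JOIN services ser ON ser.vei_id = v.vei_id
JOIN certificates cert ON cert.essay_id = ser.essay_id AND cert.status = 0
WHERE v.ds_plate = 'ABC1234'
  AND NOT EXISTS (
      -- no later service row for this vehicle carries a stamp
      SELECT 1
      FROM services s2
      WHERE s2.vei_id = v.vei_id
        AND s2.stamp_id IS NOT NULL
        AND s2.created > ser.created
  );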

Translation of a SQL Query Into DAX to create a Calculated Column in PowerPivot

Hi, I am building a PowerPivot data model using a "Person" table, which has the columns "Name" and "Amount".
Table - Person
| Name | Amount |
| Red  | 10     |
| Blue | 10     |
| Red  | 16     |
| Blue | 82     |
| Red  | 82     |
| Red  | 54     |
| Red  | 61     |
| Blue | 82     |
| Blue | 82     |
The expected output is:
| Name | Amount | Count(Specific_Amount) |
| Red  | 10     | 2                      |
| Blue | 10     | 1                      |
| Red  | 16     | 1                      |
| Blue | 82     | 3                      |
| Red  | 82     | 1                      |
| Red  | 54     | 1                      |
| Red  | 61     | 1                      |
What I have tried so far is:
SELECT Name, Amount, COUNT(Amount) AS CountOfAmountRepeated
FROM Person
GROUP BY Name, Amount
ORDER BY Amount;
I have imported my table "Person" into PowerPivot in Excel.
I want to create a calculated column in PowerPivot to hold the count of repeated Amount values. I was able to do this in SQL with the query above, but I want an equivalent DAX expression for creating a new column in PowerPivot.
Can someone translate this query into DAX, or suggest a tool to translate SQL into DAX, so that I can create a calculated column and use Power View to prepare a histogram of this data?
I tried googling, but without much help. Thanks in advance.
There are a lot of facets of your question that need to be addressed, but very simply (without consideration of any other requirements) the calculation is:
Count(Specific_Amount):=COUNTROWS('Person')
*All you seem to be looking to do here is count the unique instances of each combination.
If you then created a pivot table, dragging [Name] and [Amount] into the rows and [Count(Specific_Amount)] into the values, you would have the answer you are looking for. To get the layout you want, you could change the report layout to tabular form and remove the subtotals.
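If you specifically need a calculated column rather than a measure, a common DAX pattern (a sketch, not tested against your model) is context transition with ALLEXCEPT:

CountOfAmountRepeated =
CALCULATE (
    COUNTROWS ( Person ),
    ALLEXCEPT ( Person, Person[Name], Person[Amount] )
)

CALCULATE turns the current row into a filter context, and ALLEXCEPT keeps only the Name and Amount filters, so each row gets the count of rows sharing its Name/Amount combination.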