I combined two .csv files with 5 rows each and converted them into a DataFrame in Databricks Community Edition. But one column in the resulting DataFrame has all null values:

+----------+---------+-------------+----------+-------+--------+-------+------+------------+------+
|Invoice ID| City|Customer type|Unit price| Tax 5%| Total|Payment| cogs|gross income|Rating|
+----------+---------+-------------+----------+-------+--------+-------+------+------------+------+
| 750678428| Yangon| Member| 74.69|26.1415|548.9715| null|522.83| 26.1415| 9.1|
| 226313081|Naypyitaw| Normal| 15.28| 3.82| 80.22| null| 76.4| 3.82| 9.6|
| 631413108| Yangon| Normal| 46.33|16.2155|340.5255| null|324.31| 16.2155| 7.4|
| 123191176| Yangon| Member| 58.22| 23.288| 489.048| null|465.76| 23.288| 8.4|
| 373737910| Yangon| Normal| 86.31|30.2085|634.3785| null|604.17| 30.2085| 5.3|
| 750678428| Yangon| Member| 74.69|26.1415|548.9715| null|522.83| 26.1415| 9.1|
| 226313081|Naypyitaw| Normal| 15.28| 3.82| 80.22| null| 76.4| 3.82| 9.6|
| 631413108| Yangon| Normal| 46.33|16.2155|340.5255| null|324.31| 16.2155| 7.4|
| 123191176| Yangon| Member| 58.22| 23.288| 489.048| null|465.76| 23.288| 8.4|
| 373737910| Yangon| Normal| 86.31|30.2085|634.3785| null|604.17| 30.2085| 5.3|
+----------+---------+-------------+----------+-------+--------+-------+------+------------+------+
The Payment column is completely null, yet the input CSV files do contain data for it. How do I rectify this?
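A common cause of this symptom is a header mismatch between the two files (stray whitespace or a case difference in one header), which makes a positional union or schema merge land the column's data elsewhere and leave it null. A minimal PySpark sketch for diagnosing this, with placeholder file paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each file separately; the paths below are placeholders.
df1 = spark.read.option("header", True).option("inferSchema", True).csv("/FileStore/tables/file1.csv")
df2 = spark.read.option("header", True).option("inferSchema", True).csv("/FileStore/tables/file2.csv")

# Compare the schemas: a trailing space or case difference in one header
# is enough for Spark to treat "Payment" and "Payment " as different columns.
df1.printSchema()
df2.printSchema()

# Union by column name rather than by position, so a reordered column
# cannot silently shift data out of its field.
combined = df1.unionByName(df2)
combined.select("Payment").show()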

Related

How to get the occurrence rate of specific values with Apache Spark

I have a raw data DataFrame like this:
+-----------+--------------------+------+
|device | timestamp | value|
+-----------+--------------------+------+
| device_A|2022-01-01 18:00:01 | 100|
| device_A|2022-01-01 18:00:02 | 99|
| device_A|2022-01-01 18:00:03 | 100|
| device_A|2022-01-01 18:00:04 | 102|
| device_A|2022-01-01 18:00:05 | 100|
| device_A|2022-01-01 18:00:06 | 99|
| device_A|2022-01-01 18:00:11 | 98|
| device_A|2022-01-01 18:00:12 | 100|
| device_A|2022-01-01 18:00:13 | 100|
| device_A|2022-01-01 18:00:15 | 101|
| device_A|2022-01-01 18:00:17 | 101|
+-----------+--------------------+------+
I'd like to aggregate them into 10-second windows like this:
+-----------+--------------------+------------+-------+
|device | windowtime | values| counts|
+-----------+--------------------+------------+-------+
| device_A|2022-01-01 18:00:00 |[99,100,102]|[1,3,1]|
| device_A|2022-01-01 18:00:10 |[98,100,101]|[1,2,2]|
+-----------+--------------------+------------+-------+
The goal is to plot a heat-map of the values later. I have succeeded in getting the values column, but it's not clear how to calculate the corresponding counts:
.withColumn("values",collect_list(col("value")).over(Window.partitionBy($"device").orderBy($"timestamp".desc)))
How can I do the weighted list aggregation in Apache Spark?
Group by a time window (the window function with a 10-second duration) to get counts by value and device, then group by device + window_time and collect a list of structs:
import org.apache.spark.sql.functions._
import spark.implicits._

val result = (
  df.groupBy(
      $"device",
      window($"timestamp", "10 second")("start").as("window_time"),
      $"value"
    )
    .count()
    .groupBy("device", "window_time")
    .agg(collect_list(struct($"value", $"count")).as("values"))
    .withColumn("count", col("values.count"))   // array of per-value counts
    .withColumn("values", col("values.value"))  // array of distinct values
)
result.show()
//+--------+-------------------+--------------+---------+
//| device| window_time| values| count|
//+--------+-------------------+--------------+---------+
//|device_A|2022-01-01 18:00:00|[102, 99, 100]|[1, 2, 3]|
//|device_A|2022-01-01 18:00:10|[100, 101, 98]|[2, 2, 1]|
//+--------+-------------------+--------------+---------+
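For reference, a PySpark translation of the same approach (my sketch, not part of the original answer; the final column is named counts as in the question's expected output):

from pyspark.sql import functions as F

result = (
    df.groupBy(
        "device",
        F.window("timestamp", "10 seconds")["start"].alias("window_time"),
        "value",
    )
    .count()
    .groupBy("device", "window_time")
    .agg(F.collect_list(F.struct("value", "count")).alias("values"))
    .withColumn("counts", F.col("values.count"))   # array of per-value counts
    .withColumn("values", F.col("values.value"))   # array of distinct values
)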

SQL table transformation. How to pivot a certain table?

How would I do the pivot below?
I have a table like this:
+------+---+----+
| round| id| kpi|
+------+---+----+
| 0 | 1 | 0.1|
| 1 | 1 | 0.2|
| 0 | 2 | 0.5|
| 1 | 2 | 0.4|
+------+---+----+
I want to convert the id column into multiple columns (one per distinct id), with the kpi values as their values, while keeping the round column as in the first table.
+------+----+----+
| round| id1| id2|
+------+----+----+
| 0 | 0.1| 0.5|
| 1 | 0.2| 0.4|
+------+----+----+
Is it possible to do this in SQL? How would I do it?
You are looking for a pivot function. You can find details on how to do this here and here. The first link also covers the case where you have an unknown number of column names.
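As an illustration, the same reshaping in Spark looks like this (a sketch assuming the table is loaded as a DataFrame named df; the linked answers cover the plain-SQL side):

from pyspark.sql import functions as F

# One output column per distinct id, holding that id's kpi per round.
result = df.groupBy("round").pivot("id").agg(F.first("kpi"))

# pivot() names the new columns "1", "2", ...; rename them to id1, id2, ...
for c in result.columns:
    if c != "round":
        result = result.withColumnRenamed(c, "id" + c)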

Using pyspark to create a segment array from a flat record

I have a sparsely populated table with values for various segments for unique user ids. I need to create an array with the user_id and the relevant segment headers only.
Please note that this is just an indicative dataset; I have several hundred segments like these.
------------------------------------------------
| user_id | seg1 | seg2 | seg3 | seg4 | seg5 |
------------------------------------------------
| 100 | M | null| 25 | null| 30 |
| 200 | null| null| 43 | null| 250 |
| 300 | F | 3000| null| 74 | null|
------------------------------------------------
I am expecting the output to be:
-------------------------------
| user_id| segment_array |
-------------------------------
| 100 | [seg1, seg3, seg5] |
| 200 | [seg3, seg5] |
| 300 | [seg1, seg2, seg4] |
-------------------------------
Is there any function available in pyspark or pyspark-sql to accomplish this?
Thanks for your help!
I cannot find a direct way, but you can do this:
from pyspark.sql.functions import array, array_remove, col, lit, when

cols = df.columns[1:]  # every segment column except user_id

# Build an array of segment names, using a sentinel for nulls, then drop the sentinel.
r = df.withColumn('array', array(*[when(col(c).isNotNull(), lit(c)).otherwise('notmatch') for c in cols])) \
    .withColumn('array', array_remove('array', 'notmatch'))
r.show()
+-------+----+----+----+----+----+------------------+
|user_id|seg1|seg2|seg3|seg4|seg5| array|
+-------+----+----+----+----+----+------------------+
| 100| M|null| 25|null| 30|[seg1, seg3, seg5]|
| 200|null|null| 43|null| 250| [seg3, seg5]|
| 300| F|3000|null| 74|null|[seg1, seg2, seg4]|
+-------+----+----+----+----+----+------------------+
Not sure this is the best way, but I'd attack it this way: there's the collect_set function, which will always give you unique values across the list of values you aggregate over. Do a union for each segment:
import pyspark.sql.functions as fn
from pyspark.sql.functions import col, lit

df_seg_1 = df.select(
    'user_id',
    fn.when(
        col('seg1').isNotNull(),
        lit('seg1')
    ).alias('segment')
)
# repeat for all segments
df = df_seg_1.union(df_seg_2).union(...)
df.groupBy('user_id').agg(fn.collect_list('segment'))
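The "repeat for all segments" step can also be generated in a loop instead of written out by hand; a sketch of the same union approach (my addition, assuming flat is the original wide DataFrame with user_id as its first column):

from functools import reduce
import pyspark.sql.functions as fn

seg_cols = flat.columns[1:]  # every segment column

# One small DataFrame per segment, keeping only the rows where it is set.
parts = [
    flat.select(
        'user_id',
        fn.when(fn.col(c).isNotNull(), fn.lit(c)).alias('segment'),
    ).where(fn.col('segment').isNotNull())
    for c in seg_cols
]

unioned = reduce(lambda a, b: a.union(b), parts)
result = unioned.groupBy('user_id').agg(fn.collect_list('segment').alias('segment_array'))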

Executing a join while avoiding creating duplicate metrics in first table rows

There are two tables to join for an in-depth Excel report, and I am trying to avoid creating duplicate metrics. I have already scraped the competitor data separately using a Python script.
The first table looks like this:
name |occurrences|hits | actions |avg $|Key
---------+------------+--------+-------------+-----+----
balls |53432 | 5001 | 5| 2$ |Hgdy24
bats |5389 | 4672 | 3| 4$ |dhfg12
The competitor data is as follows:
Key | Ad Copy    |
---------+------------+
Hgdy24 |Click here! |
Hgdy24 |Free Trial! |
Hgdy24 |Sign Up now |
dhfg12 |Check it out|
dhfg12 |World known |
dhfg12 |Sign up |
I have already tried joins, to the following effect (duplicate metric rows are created here):
name |occurrences| hits | actions | avg$|Key |Ad Copy
---------+------------+--------+-------------+-----+------+---------
Balls |53432 | 5001 | 5| 2$ |Hgdy24|Click here!
Balls |53432 | 5001 | 5| 2$ |Hgdy24|Free Trial!
Balls |53432 | 5001 | 5| 2$ |Hgdy24|Sign Up now
Bats |5389 | 4672 | 3| 4$ |dhfg12|Check it out
Bats |5389 | 4672 | 3| 4$ |dhfg12|World known
Bats |5389 | 4672 | 3| 4$ |dhfg12|Sign up
Here is the desired output:
name |occurrences| hits | actions | avg$|Key |Ad Copy
---------+------------+--------+-------------+-----+------+---------
Balls |53432 | 5001 | 5| 2$ |Hgdy24|Click here!
Balls | | | | |Hgdy24|Free Trial!
Balls | | | | |Hgdy24|Sign Up now
Bats |5389 | 4672 | 3| 4$ |dhfg12|Check it out
Bats | | | | |dhfg12|World known
Bats | | | | |dhfg12|Sign up
Does anyone have a clue about a good course of action for this? A lag function, perhaps?
Your desired output is not a proper use case for SQL. SQL is designed to create views of data with all the fields filled in. When you want to visualize that data, you should suppress the "duplicate" values in your application code, not in SQL.
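For instance, if the joined result passes through pandas on its way to Excel, the repeats can be blanked out there; a sketch assuming the join result is a pandas DataFrame named joined with the column names shown above (my illustration, not from the answer):

import pandas as pd

metric_cols = ["occurrences", "hits", "actions", "avg $"]  # names as in the first table

# Blank out the metrics on every row after the first within each Key group,
# so each metric appears only once in the Excel output.
repeats = joined.duplicated(subset="Key", keep="first")
joined.loc[repeats, metric_cols] = ""
joined.to_excel("report.xlsx", index=False)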

SQL Query Microsoft Access - change horizontal field to vertical field

I need help with an Access SQL query.
I created a view in Access using four tables. My problem shows up when I want to turn some fields vertical: I can manage it for a two-way matrix, but not for more than that.
This is what my data looks like before the change:
|DataKioskID | KioskName | YearFiscal | MonthReport | ProductID | ProductName | Sales | Stock |
|AB0101061501| Sarana Tani | 2015 | 6 | P15 | Advanta | 56| 12|
|AB0101061501| Sarana Tani | 2015 | 6 | P16 | Advanta | 23| 15|
|AB0101061501| Sarana Tani | 2015 | 6 | P02 | Advanta | 14| 12|
|AB0102061501| TaniLestari | 2015 | 6 | P02 | Advanta | 15| 14|
|AB0102061501| TaniLestari | 2015 | 6 | P15 | Advanta | 12| 15|
|AB0102061501| TaniLestari | 2015 | 6 | P16 | Advanta | 14| 23|
Code:
SELECT Data_Kiosk_Header.DataKioskID, Master_Kiosk.KioskName, Data_Kiosk_Header.YearFiscal
, Max(Data_Kiosk_Header.MonthReport) AS monthReport
, Max(IIf(Data_Kiosk_Detail.ProductID='P15',Data_Kiosk_Detail.Sales,0)) AS Advanta_Sales
, Max(IIf(Data_Kiosk_Detail.ProductID='P16',Data_Kiosk_Detail.Sales,0)) AS Agro_Sales
, Max(IIf(Data_Kiosk_Detail.ProductID='P02',Data_Kiosk_Detail.Sales,0)) AS P12_Sales
, Max(IIf(Data_Kiosk_Detail.ProductID='P15',Data_Kiosk_Detail.Stocks,0)) AS Advanta_Stocks
, Max(IIf(Data_Kiosk_Detail.ProductID='P16',Data_Kiosk_Detail.Stocks,0)) AS Agro_Stocks
, Max(IIf(Data_Kiosk_Detail.ProductID='P02',Data_Kiosk_Detail.Stocks,0)) AS P12_Stocks
FROM Master_Kiosk
INNER JOIN (Data_Kiosk_Header INNER JOIN (Data_Kiosk_Detail
INNER JOIN Master_Product ON Data_Kiosk_Detail.ProductID = Master_Product.ProductID) ON Data_Kiosk_Header.DataKioskID = Data_Kiosk_Detail.DataKioskID) ON Master_Kiosk.kioskid = Data_Kiosk_Header.KioskName
GROUP BY Data_Kiosk_Header.DataKioskID, Master_Kiosk.KioskName, Data_Kiosk_Header.YearFiscal;
After running the code, it looks like this:
DataKioskID | KioskName |YearFiscal |monthReport |Advanta_Sales |Agro_Sales |P12_Sales |Advanta_Stocks |Agro_Stocks |P12_Stocks |
AB0101061501| Sarana Tani |2015 |6 |56 |23 |14 |12 |15 |12 |
AB0102061501| Tani Lestari|2015 |6 |12 |14 |15 |15 |23 |14 |
Can anybody help me? I want it to look like this:
|DataKioskID | KioskName | YearFiscal | MonthReport | Sales | Stock |
| | | | | Advanta | Agro | P12 | Advanta | Agro | P12 |
|AB0101061501| Sarana Tani | 2015 | 6 | 56 | 23| 14| 12 | 15| 12|
|AB0102061501| LestariTani | 2015 | 6 | 15 | 12| 14| 14 | 15| 16|
Here is the DB so you can try what I mean:
DB Source
Exactly what you want is not possible, at least at the query level, because you have two-level grouping; a report is the answer.
Furthermore, in order to get the info as a "single" query, you need the following.
First, a crosstab query for sales:
TRANSFORM Max(Data_Kiosk_Detail.Sales) AS MaxOfSales
SELECT Data_Kiosk_Header.DataKioskID
,Master_Kiosk.KioskName
,Data_Kiosk_Header.YearFiscal
,Data_Kiosk_Header.MonthReport AS monthReport
,"Sales" AS Info
FROM Master_Kiosk
INNER JOIN (
Data_Kiosk_Header INNER JOIN (
Data_Kiosk_Detail INNER JOIN Master_Product ON Data_Kiosk_Detail.ProductID = Master_Product.ProductID
) ON Data_Kiosk_Header.DataKioskID = Data_Kiosk_Detail.DataKioskID
) ON Master_Kiosk.kioskid = Data_Kiosk_Header.KioskName
GROUP BY Data_Kiosk_Header.DataKioskID
,Master_Kiosk.KioskName
,Data_Kiosk_Header.YearFiscal
,Data_Kiosk_Header.MonthReport
,"Sales"
PIVOT Data_Kiosk_Detail.ProductID;
Second, a crosstab query for stocks:
TRANSFORM Max(Data_Kiosk_Detail.Stocks) AS MaxOfStocks
SELECT Data_Kiosk_Header.DataKioskID
,Master_Kiosk.KioskName
,Data_Kiosk_Header.YearFiscal
,Data_Kiosk_Header.MonthReport AS monthReport
,"Stocks" AS Info
FROM Master_Kiosk
INNER JOIN (
Data_Kiosk_Header INNER JOIN (
Data_Kiosk_Detail INNER JOIN Master_Product ON Data_Kiosk_Detail.ProductID = Master_Product.ProductID
) ON Data_Kiosk_Header.DataKioskID = Data_Kiosk_Detail.DataKioskID
) ON Master_Kiosk.kioskid = Data_Kiosk_Header.KioskName
GROUP BY Data_Kiosk_Header.DataKioskID
,Master_Kiosk.KioskName
,Data_Kiosk_Header.YearFiscal
,Data_Kiosk_Header.MonthReport
,"Stocks"
PIVOT Data_Kiosk_Detail.ProductID;
Then you join them together with a union query:
select * from MaxOfSales
UNION select * from MaxOfStocks;
Then you could use the above query to create a report that shows what you need.