Create a column of Seq[String] values containing all distinct strings from a Seq[String] column - dataframe

I have this input dataframe:
inputDF =
+---------------------+
|days (Seq[String])   |
+---------------------+
|[sat, sun]           |
|[mon, wed]           |
|[fri]                |
|[fri, sat]           |
|[mon, sun, sat]      |
+---------------------+
I would like to obtain this outputDF, containing all existing strings from the days column:
outputDF =
+---------------------+-------------------------+
|days (Seq[String])   |all days (Seq[String])   |
+---------------------+-------------------------+
|[sat, sun]           |[sat, sun, mon, wed, fri]|
|[mon, wed]           |[sat, sun, mon, wed, fri]|
|[fri]                |[sat, sun, mon, wed, fri]|
|[fri, sat]           |[sat, sun, mon, wed, fri]|
|[mon, sun, sat]      |[sat, sun, mon, wed, fri]|
+---------------------+-------------------------+
How can I do that in Scala/Spark?

Assuming this is our input, and it is called dataset:
+---------------+
|days |
+---------------+
|[sat, sun] |
|[mon, wed] |
|[fri] |
|[fri, sat] |
|[mon, sun, sat]|
+---------------+
We can get to this output:
+---------------+-------------------------+
|days |all_days |
+---------------+-------------------------+
|[sat, sun] |[fri, sat, sun, mon, wed]|
|[mon, wed] |[fri, sat, sun, mon, wed]|
|[fri] |[fri, sat, sun, mon, wed]|
|[fri, sat] |[fri, sat, sun, mon, wed]|
|[mon, sun, sat]|[fri, sat, sun, mon, wed]|
+---------------+-------------------------+
Through the following code:
import org.apache.spark.sql.functions._

// First we add a constant ID so that every row falls into a single group
// (skip this if you already have a suitable column; dataset must be a var)
dataset = dataset.withColumn("id", lit(1))
// Group by the id, collect all the arrays, flatten them, and de-duplicate
val collected = dataset
  .groupBy("id")
  .agg(array_distinct(flatten(collect_set("days"))).as("all_days"))
// Join the collected data back onto the main table
dataset = dataset
  .join(collected, Seq("id"), "left")
  .drop("id")
Good luck!

You can create another dataset that contains the unique days values, then join it back to your initial dataset:
import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  Seq("sat", "sun"),
  Seq("mon", "wed"),
  Seq("fri"),
  Seq("fri", "sat"),
  Seq("mon", "sun", "sat")
)
val df = spark.sparkContext.parallelize(data).toDF("days")

// Explode the arrays, collect the distinct values into a single row,
// and add a constant column to join on
val allDf = df.select(explode(col("days")).as("days"))
  .agg(collect_set("days").as("all_days"))
  .withColumn("join_column", lit(1))

df.withColumn("join_column", lit(1))
  .join(broadcast(allDf), Seq("join_column"), "left")
  .drop("join_column")
  .show(false)
+---------------+-------------------------+
|days |all_days |
+---------------+-------------------------+
|[sat, sun] |[fri, sun, wed, mon, sat]|
|[mon, wed] |[fri, sun, wed, mon, sat]|
|[fri] |[fri, sun, wed, mon, sat]|
|[fri, sat] |[fri, sun, wed, mon, sat]|
|[mon, sun, sat]|[fri, sun, wed, mon, sat]|
+---------------+-------------------------+

You can collect the distinct values on the driver and then add them as a literal column using withColumn:
import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  Seq("sun", "sun", "mon"),
  Seq("sun", "tue", "mon"),
  Seq("fri", "tue")
)
val days = data.toDF("days")
// Collect the arrays, flatten, de-duplicate, and pull the result to the driver
val uniqueDays = days.agg(array_distinct(flatten(collect_set("days")))).head()(0)
days.withColumn("all days", lit(uniqueDays)).show(false)
Output is:
+---------------+--------------------+
|days |all days |
+---------------+--------------------+
|[sun, sun, mon]|[fri, tue, sun, mon]|
|[sun, tue, mon]|[fri, tue, sun, mon]|
|[fri, tue] |[fri, tue, sun, mon]|
+---------------+--------------------+

Related

How do you extract a variable that appears multiple times in a table only once?

I'm trying to extract the names of space organisations from a table, but the closest I can get is the number of times each name appears, shown next to the name of the organisation. I just want the name of the organisation, not the number of times it is named in the table.
If you can help me, please leave a comment on my Google Colab:
https://colab.research.google.com/drive/1m4zI4YGguQ5aWdDVyc7Bdpr-78KHdxhR?usp=sharing
What I get:
number | organisation | time of launch
0      | SpaceX       | Fri Aug 07, 2020 05:12 UTC
1      | CASC         | Thu Aug 06, 2020 04:01 UTC
2      | SpaceX       | Tue Aug 04, 2020 23:57 UTC
3      | Roscosmos    | Thu Jul 30, 2020 21:25 UTC
4      | ULA          | Thu Jul 30, 2020 11:50 UTC
...    | ...          | ...
4319   | US Navy      | Wed Feb 05, 1958 07:33 UTC
4320   | AMBA         | Sat Feb 01, 1958 03:48 UTC
4321   | US Navy      | Fri Dec 06, 1957 16:44 UTC
4322   | RVSN USSR    | Sun Nov 03, 1957 02:30 UTC
4323   | RVSN USSR    | Fri Oct 04, 1957 19:28 UTC
What I want:
organisation
RVSN USSR
Arianespace
CASC
General Dynamics
NASA
VKS RF
US Air Force
ULA
Boeing
Martin Marietta
etc
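Since the question is about a pandas table in Colab, here is a minimal sketch of one way to do it. The column name "organisation" and the sample rows are taken from the excerpt above; the real frame would come from the Colab notebook:

```python
import pandas as pd

# A few rows from the question's table (the real frame comes from the Colab)
df = pd.DataFrame({
    "organisation": ["SpaceX", "CASC", "SpaceX", "Roscosmos", "ULA", "US Navy"],
    "time of launch": [
        "Fri Aug 07, 2020 05:12 UTC", "Thu Aug 06, 2020 04:01 UTC",
        "Tue Aug 04, 2020 23:57 UTC", "Thu Jul 30, 2020 21:25 UTC",
        "Thu Jul 30, 2020 11:50 UTC", "Wed Feb 05, 1958 07:33 UTC",
    ],
})

# drop_duplicates keeps each organisation name exactly once,
# in order of first appearance
unique_orgs = df["organisation"].drop_duplicates().reset_index(drop=True)
print(unique_orgs.tolist())  # → ['SpaceX', 'CASC', 'Roscosmos', 'ULA', 'US Navy']
```

`df["organisation"].unique()` would give the same names as a NumPy array, if you don't need a Series back.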

How to find missing dates AND missing period in sql table within a given range?

Suppose there exists a table called RandomPriceSummary, which has dates ranging from Wed Oct 01 2022 00:00:00 GMT+0100 to Wed Oct 03 2022 00:00:00 GMT+0100, periods ranging from 1-3, and a cost, as shown below:
date                                                    | period | cost
Wed Oct 01 2022 00:00:00 GMT+0100 (British Summer Time) | 1      | 10
Wed Oct 01 2022 00:00:00 GMT+0100 (British Summer Time) | 2      | 20
Wed Oct 01 2022 00:00:00 GMT+0100 (British Summer Time) | 3      | 10
Wed Oct 03 2022 00:00:00 GMT+0100 (British Summer Time) | 1      | 20
Wed Oct 03 2022 00:00:00 GMT+0100 (British Summer Time) | 2      | 20
In the above table, how can we check all of the missing dates and missing periods?
For example, we need a query WHERE SETTLEMENT_DATE BETWEEN TIMESTAMP '10-01-2022' AND TIMESTAMP '10-03-2022' which has a missing period ranging from 1-3.
So the expected answer should return something along the lines of :
missing_date                                            | missing_period
Wed Oct 02 2022 00:00:00 GMT+0100 (British Summer Time) | 1
Wed Oct 02 2022 00:00:00 GMT+0100 (British Summer Time) | 2
Wed Oct 02 2022 00:00:00 GMT+0100 (British Summer Time) | 3
Wed Oct 03 2022 00:00:00 GMT+0100 (British Summer Time) | 3
We can use the following calendar table left anti-join approach:
SELECT d.dt, p.period
FROM (SELECT date_trunc('day', dd)::date AS dt
      FROM generate_series(
             '2022-01-01'::timestamp,
             '2022-12-31'::timestamp,
             '1 day'::interval) dd
     ) d
CROSS JOIN (SELECT 1 AS period UNION ALL SELECT 2 UNION ALL SELECT 3) p
LEFT JOIN RandomPriceSummary t
       ON t.date::date = d.dt AND t.period = p.period
WHERE d.dt BETWEEN '2022-10-01'::date AND '2022-10-03'::date AND
      t.date IS NULL
ORDER BY d.dt, p.period;
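The same anti-join idea can be sketched outside SQL. Here is a hedged Python version using plain sets, with the table rows hard-coded from the question: build every (date, period) combination in the range, then subtract the combinations that exist.

```python
from datetime import date, timedelta
from itertools import product

# (date, period) rows present in the table, taken from the question
present = {
    (date(2022, 10, 1), 1), (date(2022, 10, 1), 2), (date(2022, 10, 1), 3),
    (date(2022, 10, 3), 1), (date(2022, 10, 3), 2),
}

start, end = date(2022, 10, 1), date(2022, 10, 3)
# Calendar: every date in the range (the generate_series step)
all_dates = [start + timedelta(days=i) for i in range((end - start).days + 1)]
# Cross join with periods 1-3, then anti-join against what exists
missing = sorted(set(product(all_dates, [1, 2, 3])) - present)
print(missing)  # Oct 02 periods 1-3 and Oct 03 period 3
```

This mirrors the SQL exactly: the list comprehension is the calendar table, `product` is the CROSS JOIN, and the set difference plays the role of the LEFT JOIN ... IS NULL filter.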

Compare first column in two files, if match: update last column variable, else: append line to second file

I want to take col1 of file1 and, if there is a match in col1 of file2, update the "date updated" in the last column. If there is no match, I want to append the entire line of file1 to file2, together with a "date updated" value for that line.
I am currently using awk 'NR==FNR{c[$1]++;next};c[$1] > 0' file2 file1 as a baseline comparison, but that only prints the whole line when there is a match, and I cannot figure out how to add another condition for updating the date column. I am trying to do this in a shell script.
file 1
userName | cpu% | command | date created
user1, 101.6, plasma-de+, Thu Aug 8 09:30:17 MDT 2019
user2, 100.0, plasma-de+, Thu Aug 8 09:30:17 MDT 2019
user3, 102.0, plasma-de+, Thu Aug 8 09:30:17 MDT 2019
file 2
userName | cpu% | command | date created | date updated
user1, 101.6, plasma-de+, Mon Aug 5 06:35:39 MDT 2019, Mon Aug 5 06:35:39 MDT 2019
user2, 100.0, plasma-de+, Mon Aug 5 06:35:39 MDT 2019, Mon Aug 5 06:35:39 MDT 2019
file 2 after command is run
userName | cpu% | command | date created | date updated
user1, 101.6, plasma-de+, Mon Aug 5 06:35:39 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
user2, 100.0, plasma-de+, Mon Aug 5 06:35:39 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
user3, 102.0, plasma-de+, Thu Aug 8 09:30:17 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
One non-awk way that assumes your files are sorted:
$ (join -t, -j1 -o 0,2.2,2.3,2.4,1.4 file1 file2; \
join -t, -j1 -v1 -o 0,1.2,1.3,1.4,1.4 file1 file2)
user1, 101.6, plasma-de+, Mon Aug 5 06:35:39 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
user2, 100.0, plasma-de+, Mon Aug 5 06:35:39 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
user3, 102.0, plasma-de+, Thu Aug 8 09:30:17 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
or using awk without that restriction:
$ awk 'BEGIN { FS = OFS = "," }
       NR == FNR { a[$1] = $0; b[$1] = $4; next }
       $1 in a { $5 = b[$1]; delete a[$1]; print }
       END { for (u in a) print a[u], b[u] }' file1 file2
user1, 101.6, plasma-de+, Mon Aug 5 06:35:39 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
user2, 100.0, plasma-de+, Mon Aug 5 06:35:39 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
user3, 102.0, plasma-de+, Thu Aug 8 09:30:17 MDT 2019, Thu Aug 8 09:30:17 MDT 2019
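For reference, here is a rough Python sketch of the same update-or-append logic. The field layout (comma-separated, "date created" as the 4th field) is assumed from the samples above; the function name is just illustrative:

```python
def merge_files(file1_lines, file2_lines):
    # Index file1 by userName, keeping the parsed fields
    created = {}
    for line in file1_lines:
        fields = [f.strip() for f in line.split(",")]
        created[fields[0]] = fields

    merged = []
    for line in file2_lines:
        fields = [f.strip() for f in line.split(",")]
        user = fields[0]
        if user in created:
            # Match: overwrite "date updated" with file1's "date created"
            fields[4] = created[user][3]
            del created[user]
        merged.append(", ".join(fields))

    # No match: append the file1 line, duplicating its date as "date updated"
    for fields in created.values():
        merged.append(", ".join(fields + [fields[3]]))
    return merged
```

This is the same structure as the awk: the first loop is the NR == FNR pass, the second is the `$1 in a` rule, and the trailing loop is the END block.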

Convert or cast varchar rows like (Mon Jul 18 19:28:36 EDT 2018) To DateTime

I have a column varchar type with dates like:
Fri Mar 3 12:55:17 EST 2017
Thu Jul 27 10:12:07 EDT 2017
Fri Jul 21 12:11:35 EDT 2017
Wed Jan 31 13:15:34 EST 2018
And I would like to return just the date and time something like:
03/03/2017 12:55:17
07/27/2017 10:12:07
07/21/2017 12:11:35
01/31/2018 13:15:34
I tried several ways with substring and convert statements but nothing worked.
Any assistance in this regard will be greatly appreciated.
Perhaps something like this
Example
Declare @YourTable table (SomeCol varchar(50))
Insert Into @YourTable values
 ('Fri Mar 3 12:55:17 EST 2017'),
 ('Thu Jul 27 10:12:07 EDT 2017'),
 ('Fri Jul 21 12:11:35 EDT 2017'),
 ('Wed Jan 31 13:15:34 EST 2018')

Select *
      ,AsDateTime = try_convert(datetime, substring(SomeCol, 4, len(SomeCol)-11) + right(SomeCol, 4))
 From @YourTable
Returns
SomeCol AsDateTime
Fri Mar 3 12:55:17 EST 2017 2017-03-03 12:55:17.000
Thu Jul 27 10:12:07 EDT 2017 2017-07-27 10:12:07.000
Fri Jul 21 12:11:35 EDT 2017 2017-07-21 12:11:35.000
Wed Jan 31 13:15:34 EST 2018 2018-01-31 13:15:34.000
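The same parse can be sketched in Python for reference: drop the leading day name and the timezone abbreviation (token positions assumed fixed, as in the samples above), then hand the rest to strptime. The function name is just illustrative:

```python
from datetime import datetime

def parse_loose_date(s):
    # ['Fri', 'Mar', '3', '12:55:17', 'EST', '2017']
    parts = s.split()
    # Keep month, day, time, and year; skip the day name and timezone token
    return datetime.strptime(" ".join(parts[1:4] + parts[5:]), "%b %d %H:%M:%S %Y")

print(parse_loose_date("Fri Mar 3 12:55:17 EST 2017"))  # → 2017-03-03 12:55:17
```

Note that the timezone abbreviation is discarded, just as in the T-SQL above; if you need the instant in UTC you would have to map EST/EDT to offsets yourself.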

Clean up excessive words in a date field

I have a text field called "TempD" in an Access table, MissingTF469Temp. This field is updated with a date extracted from a long string after the specific words "Effective Date". But due to a recent change in the actual string, it sometimes picks up an extra letter "I" at the end of the date. I'd like SQL code to remove this excess character.
Any help please.
Tue, Mar 29, 2016
Wed, Mar 9, 2016I
Fri, Apr 22, 2016
Fri, Apr 1, 2016
Mon, Apr 4, 2016
Mon, Apr 25, 2016
Mon, Mar 21, 2016
Wed, May 11, 2016
Fri, Apr 1, 2016
Mon, Apr 4, 2016
Mon, Apr 4, 2016I
Mon, Apr 4, 2016I
Mon, Apr 4, 2016I
Fri, Mar 11, 2016
Fri, Mar 11, 2016
Fri, Mar 11, 2016
Fri, Mar 11, 2016
Fri, Mar 11, 2016
Fri, Mar 11, 2016
Fri, Mar 11, 2016
Fri, Mar 11, 2016
Fri, Mar 11, 2016
Fri, Mar 18, 2016
Fri, Mar 18, 2016
Mon, Mar 21, 2016
Mon, Mar 21, 2016
Mon, Mar 21, 2016
Mon, Mar 21, 2016
Mon, Mar 28, 2016
Fri, Apr 1, 2016
Fri, Apr 1, 2016
Fri, Mar 4, 2016I
Tue, Mar 8, 2016I
Tue, Mar 8, 2016I
How did you end up in that situation? Please use date fields for dates, not characters, and insert them in ISO 8601 format, like 'YYYY-MM-DD'.
Human-readable formats like yours should be kept for presentation only.
For your question, this should do the trick:
UPDATE MissingTF469Temp
SET TempD = LEFT(TempD, Len(TempD)-1)
WHERE TempD LIKE '*I';
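The same trim, sketched in Python for clarity: drop the last character only for values that end in "I", which is exactly what the UPDATE's WHERE clause guards against.

```python
def clean_date(d):
    # Remove a stray trailing "I" only when it is present
    return d[:-1] if d.endswith("I") else d

print(clean_date("Mon, Apr 4, 2016I"))   # → Mon, Apr 4, 2016
print(clean_date("Fri, Apr 22, 2016"))   # → Fri, Apr 22, 2016
```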