Convert string value to Timestamp in PySpark SQL

I need to convert a date string e.g. 2022-04-12T14:22:34Z to timestamp in PySpark/SparkSQL before loading it to a Postgres table.
I have tried SELECT TO_TIMESTAMP(REGEXP_REPLACE('2022-04-12T14:22:34Z', '[TZ]', ''), 'YYYY-MM-DDHH:MM:SS') AS result
It works somewhat, but just wondering if there is an elegant way to accomplish the same in Spark 2.4.
Thanks.

You can try this:
from pyspark.sql.functions import col, to_timestamp
df.withColumn("date", to_timestamp(col("date"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
Input:
+--------------------+
| date|
+--------------------+
|2022-04-12T14:22:34Z|
+--------------------+
Output:
+-------------------+
|date |
+-------------------+
|2022-04-12 14:22:34|
+-------------------+
Good luck!
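The key point of the pattern above is that T and Z are quoted, so they are matched as literal characters instead of being read as pattern letters. As a rough illustration outside Spark, the same idea can be sketched with Python's standard datetime.strptime. The variable names are illustrative only, and the trailing Z is treated as a plain literal here, so the result carries no time zone information.

```python
from datetime import datetime

raw = "2022-04-12T14:22:34Z"

# 'T' and 'Z' appear in the format as plain literal characters,
# mirroring the quoted 'T' and 'Z' in the Spark pattern.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")

print(parsed)  # 2022-04-12 14:22:34
```

If the input can carry offsets other than Z, a literal match like this will fail to parse, which at least makes the mismatch visible.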

Related

How to FILTER for string in numeric column

I am trying to filter out my Spark DF to only show text values in a numeric field - as the data is unstructured.
Not quite sure how to work the code below for the scenario above:
sparkdf = sparkdf.filter(col("colToFilter") <evaluation>)
If I were to try something similar in SQL, I would perform the following:
SELECT * FROM tbl
WHERE col NOT LIKE '%[0-9]%'
An example of my current table would look like this:
|RefId|
|0|
|1|
|1|
|1|
|RefNum2|
|1|
I would like to show only "RefNum2" as an output.
I would really appreciate any assistance.
Thank you.
The simplest query:
select * from tbl1 where col regexp('[a-z]');
You can use an rlike filter as below:
df.filter("RefId NOT RLIKE '^[0-9]+$'").show()
+-------+
| RefId|
+-------+
|RefNum2|
+-------+
Or
import pyspark.sql.functions as F
df.filter(~F.col("RefId").rlike("^[0-9]+$")).show()
+-------+
| RefId|
+-------+
|RefNum2|
+-------+
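The anchored pattern ^[0-9]+$ matches only values made up entirely of digits, so negating it keeps every value that has at least one non-digit character. That is why "RefNum2" survives even though it contains a digit. A minimal sketch of the same logic in plain Python, assuming the values are available as a list (ref_ids is a hypothetical name that mirrors the example table):

```python
import re

# Values copied from the example table in the question.
ref_ids = ["0", "1", "1", "1", "RefNum2", "1"]

# Anchored pattern: the whole value must consist of digits.
all_digits = re.compile(r"^[0-9]+$")

# Keep only the values that are NOT purely numeric.
non_numeric = [v for v in ref_ids if not all_digits.search(v)]

print(non_numeric)  # ['RefNum2']
```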

Where/filtering in pyspark

I am using SQL with PySpark, but when I filter with WHERE the result is an empty table. That is wrong, because I do have data matching this filter.
"LESIVIDAD" is a string column:
|-- LESIVIDAD: string (nullable = true)
t_acc = spark.sql("""SELECT LESIVIDAD, COUNT(LESIVIDAD) AS COUNT FROM acc_table
WHERE LESIVIDAD = 'IL' GROUP BY LESIVIDAD""")
t_acc.show()
+---------+-----+
|LESIVIDAD|COUNT|
+---------+-----+
+---------+-----+
The distinct values in the "LESIVIDAD" column are:
spark.sql("""SELECT LESIVIDAD FROM acc_table GROUP BY
LESIVIDAD""").show()

+--------------------+
| LESIVIDAD|
+--------------------+
| NO ASIGNADA|
|IL ...|
|MT ...|
|HG ...|
|HL ...|
+--------------------+
Your code is fine. I presume the problem is with the data you are trying to match, i.e. LESIVIDAD = 'IL'.
Please note that in PySpark, column names are case-insensitive, whereas the data inside the table is case-sensitive. So if your table contains 'il', 'Il' or 'iL' but no 'IL', the query will return an empty table.
Make sure the value you search for matches the stored data exactly.
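One way to check whether the stored values really equal 'IL' is to normalise them before comparing. The sketch below is an assumption-laden illustration, not a diagnosis: it uses Python's built-in sqlite3 as a stand-in for the Spark session, and the rows are hypothetical. It also assumes the values may carry trailing padding, which the "..." in the show() output hints at, since show() by default truncates strings longer than 20 characters. TRIM and UPPER are also available in Spark SQL.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the Spark table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acc_table (LESIVIDAD TEXT)")

# Hypothetical rows: values that differ from 'IL' by padding or by case.
conn.executemany(
    "INSERT INTO acc_table VALUES (?)",
    [("IL" + " " * 30,), ("il",), ("MT" + " " * 30,)],
)

# Exact comparison: neither the padded nor the lowercase value matches.
exact = conn.execute(
    "SELECT COUNT(*) FROM acc_table WHERE LESIVIDAD = 'IL'"
).fetchone()[0]

# Normalised comparison: strip surrounding spaces and ignore case.
normalised = conn.execute(
    "SELECT COUNT(*) FROM acc_table WHERE UPPER(TRIM(LESIVIDAD)) = 'IL'"
).fetchone()[0]

print(exact, normalised)  # 0 2
conn.close()
```

If the normalised count is non-zero while the exact count is zero, the filter value and the stored data differ in whitespace or case.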

SQL - Structuring code for Group by and Sum in query

I would appreciate any ideas on how to approach this problem/function in SQL code. Please see the attached image.
I need to group by AbsenceCause, sum numDays for each AbsenceCause group, and display the result under each AbsenceEmployeeID.
The goal is to achieve a new table like:
|AbsenceEmployeeID|AbsenceCause|numDays|
| 081014002722|Children | 9|
| 081014002722|Travel | 2|
Thanks,
Joergen Mathiesen
Something like this?
SELECT AbsenceEmployeeID,
AbsenceCause,
SUM(numDays)
FROM que_ABSENCE_Days
GROUP BY AbsenceEmployeeID,
AbsenceCause
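To see the query in action, here is a small sketch that runs it against an in-memory SQLite database from Python. The input rows are hypothetical (the original image is not available) and are chosen only so that the totals match the desired output in the question. The ORDER BY clause and the numDays alias are additions to make the output stable and labelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE que_ABSENCE_Days ("
    "AbsenceEmployeeID TEXT, AbsenceCause TEXT, numDays INTEGER)"
)

# Hypothetical detail rows for one employee.
conn.executemany(
    "INSERT INTO que_ABSENCE_Days VALUES (?, ?, ?)",
    [
        ("081014002722", "Children", 4),
        ("081014002722", "Travel", 2),
        ("081014002722", "Children", 5),
    ],
)

# One output row per (employee, cause) pair, with the days summed.
rows = conn.execute(
    """
    SELECT AbsenceEmployeeID, AbsenceCause, SUM(numDays) AS numDays
    FROM que_ABSENCE_Days
    GROUP BY AbsenceEmployeeID, AbsenceCause
    ORDER BY AbsenceEmployeeID, AbsenceCause
    """
).fetchall()

for row in rows:
    print(row)
# ('081014002722', 'Children', 9)
# ('081014002722', 'Travel', 2)
conn.close()
```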

Get list of MySQL databases, and server version?

My connection string for MySQL is:
"Server=localhost;User ID=root;Password=123;pooling=yes;charset=utf8;DataBase=.;"
My questions are:
What query should I write to get the names of the existing databases?
What query should I write to get the server version?
I get an error because my connection string ends with DataBase=.
What should I write instead of the dot?
SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
SELECT VARIABLE_NAME, VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES WHERE VARIABLE_NAME = 'VERSION'
Use INFORMATION_SCHEMA as the database.
To get the list of databases, you can use SHOW DATABASES:
SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| test |
+--------------------+
3 rows in set (0.01 sec)
To get the version number of your MySQL Server, you can use SELECT VERSION():
SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.1.45 |
+-----------+
1 row in set (0.01 sec)
As for the question about the connection string, you'd want to put a database name instead of the dot, such as Database=test.
show databases;
will return all the registered databases.
And
show variables;
will return a set of name/value pairs, one of which is the version number.

format integer to string

I have an integer field in a table and I want to write a query that formats the integer value of this field as a char or double field with a specific format.
For example, if the value in the table is 123456, I want to format it as "###.###", which means the result should look like this: 123.456
I've done this using the CONCAT function, but the result is not very elegant. I would like to use another function specific to this purpose.
I would suggest doing this in your presentation layer rather than the DB.
This is pretty easy in C#:
// Assuming value is an int
value.ToString("N");
For more details on formatting an int in various ways, see the Microsoft documentation.
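The same presentation-layer idea can be sketched in Python instead of C#. This is only an illustration under the assumption that the value arrives as a plain int. The format specifier "," inserts comma thousands separators, and the dot separator asked for in the question is obtained here by a simple replacement, not by a locale setting.

```python
# Assuming value is an int read from the table.
value = 123456

# Built-in thousands grouping uses commas.
with_commas = f"{value:,}"

# Swap the separator to get the "###.###" style from the question.
with_dots = with_commas.replace(",", ".")

print(with_commas)  # 123,456
print(with_dots)    # 123.456
```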
Maybe you would like to use a format like '###,###.###'?
Here is an example.
mysql> select FORMAT( 123446, 4 );
+---------------------+
| FORMAT( 123446, 4 ) |
+---------------------+
| 123,446.0000 |
+---------------------+
1 row in set (0.02 sec)
mysql> select FORMAT( 123446, 0 );
+---------------------+
| FORMAT( 123446, 0 ) |
+---------------------+
| 123,446 |
+---------------------+
1 row in set (0.00 sec)