Using LIKE operator for multiple words in PySpark

Using LIKE operator for multiple words in PySpark - dataframe

I have a DataFrame df in PySpark, like a one shown below -
+-----+--------------------+-------+
| ID| customers|country|
+-----+--------------------+-------+
|56 |xyz Limited |U.K. |
|66 |ABC Limited |U.K. |
|16 |Sons & Sons |U.K. |
|51 |TÜV GmbH |Germany|
|23 |Mueller GmbH |Germany|
|97 |Schneider AG |Germany|
|69 |Sahm UG |Austria|
+-----+--------------------+-------+
I would like to keep only those rows where ID starts from either 5 or 6. So, I want my final dataframe to look like this -
+-----+--------------------+-------+
| ID| customers|country|
+-----+--------------------+-------+
|56 |xyz Limited |U.K. |
|66 |ABC Limited |U.K. |
|51 |TÜV GmbH |Germany|
|69 |Sahm UG |Austria|
+-----+--------------------+-------+
This can be achieved in many ways and it's not a problem. But, I am interested in learning how this can be done using LIKE statement.
Had I only been interested in those rows where ID starts from 5, it could have been done easily like this -
df=df.where("ID like ('5%')")
My Question: How can I add the second statement like "ID like ('6%')" with OR - | boolean inside where clause? I want to do something like the one shown below, but this code gives an error. So, in nutshell, how can I use multiple boolean statement using LIKE and .where here -
df=df.where("(ID like ('5%')) | (ID like ('6%'))")

This works for me
from pyspark.sql import functions as F
df.where(F.col("ID").like('5%') | F.col("ID").like('6%'))

You can try
df = df.where('ID like "5%" or ID like "6%"')

In pyspark, SparkSql syntax:
where column_n like 'xyz%' OR column_n like 'abc%'
might not work.
Use:
where column_n RLIKE '^xyz|abc'
Explanation: It will filter all words either starting with abc or xyz.
This works perfectly fine.

For me this worked:
from pyspark.sql.functions import col
df.filter((col("ID").like("5%")) | (col("ID").like("6%")))

Related

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using pyspark dataframe APIs to filter out the 'NOC's that has never won a gold medal and here's the code I write
First part of my code
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It will generate the output
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This display the output
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are more than 3 characters? Those are supposed to be the values of the column 'Team' but I'm not sure why those values are getting displayed in 'NOC' column. It's hard to figure out why this is happening i.e illegal values in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same happens as illegal values appear in the final dataframe column.
Any help would be appericiated. Thanks.

You should specify a delimiter for your CSV file. By default Spark is using comma separators (,)
This can be done, for example, with :
.option("delimiter",";")

Joining two pyspark dataframes by unique values in a column

Let's say, I have two pyspark dataframes, users and shops. A few sample rows for both the dataframes are shown below.
users dataframe:
+---------+-------------+---------+
| idvalue | day-of-week | geohash |
+---------+-------------+---------+
| id-1 | 2 | gcutjjn |
| id-1 | 3 | gcutjjn |
| id-1 | 5 | gcutjht |
+---------+-------------+---------+
shops dataframe
+---------+-----------+---------+
| shop-id | shop-name | geohash |
+---------+-----------+---------+
| sid-1 | kfc | gcutjjn |
| sid-2 | mcd | gcutjhq |
| sid-3 | starbucks | gcutjht |
+---------+-----------+---------+
I need to join both of these dataframes on the geohash column. I can do a naive equi-join for sure, but the users dataframe is huge, containing billions of rows, and geohashes are likely to repeat, within and across idvalues. So, I was wondering if there's a way to perform joins on unique geohashes in the users dataframe and geohashes in the shops dataframe. If we can do that, then it's easy to replicate the shops entries for matching geohashes in resultant dataframe.
Probably it can be achieved with a pandas udf, where I would perform a groupby on users.idvalue, do a join with shops within the udf by only taking the first row from the group (because all ids are same anyway within the group), and creating a one row dataframe. Logically it feels like this should work, but not sure sure on the performance aspect as udf(s) are usually slower than spark native transformations. Any ideas are welcome.

You said that your Users dataframe is huge and that "geohashes are likely to repeat, within and across idvalues". You didn't referred however if there might be duplicated geohashes in your shops dataframe.
If there are no repeated hashes in the latter, I think that a simple join would solve your problem:
val userDf = Seq(("id-1",2,"gcutjjn"),("id-2",2,"gcutjjn"),("id-1",3,"gcutjjn"),("id-1",5,"gcutjht")).toDF("idvalue","day_of_week","geohash")
val shopDf = Seq(("sid-1","kfc","gcutjjn"),("sid-2","mcd","gcutjhq"),("sid-3","starbucks","gcutjht")).toDF("shop_id","shop_name","geohash")
userDf.show
+-------+-----------+-------+
|idvalue|day_of_week|geohash|
+-------+-----------+-------+
| id-1| 2|gcutjjn|
| id-2| 2|gcutjjn|
| id-1| 3|gcutjjn|
| id-1| 5|gcutjht|
+-------+-----------+-------+
shopDf.show
+-------+---------+-------+
|shop_id|shop_name|geohash|
+-------+---------+-------+
| sid-1| kfc|gcutjjn|
| sid-2| mcd|gcutjhq|
| sid-3|starbucks|gcutjht|
+-------+---------+-------+
shopDf
.join(userDf,Seq("geohash"),"inner")
.groupBy($"geohash",$"shop_id",$"idvalue")
.agg(collect_list($"day_of_week").alias("days"))
.show
+-------+-------+-------+------+
|geohash|shop_id|idvalue| days|
+-------+-------+-------+------+
|gcutjjn| sid-1| id-1|[2, 3]|
|gcutjht| sid-3| id-1| [5]|
|gcutjjn| sid-1| id-2| [2]|
+-------+-------+-------+------+
If you have repeated hash values in your shops dataframe, a possible approach would be to remove those repeated hashes from your shops dataframe (if your requirements allow this), and then perform the same join operation.
val userDf = Seq(("id-1",2,"gcutjjn"),("id-2",2,"gcutjjn"),("id-1",3,"gcutjjn"),("id-1",5,"gcutjht")).toDF("idvalue","day_of_week","geohash")
val shopDf = Seq(("sid-1","kfc","gcutjjn"),("sid-2","mcd","gcutjhq"),("sid-3","starbucks","gcutjht"),("sid-4","burguer king","gcutjjn")).toDF("shop_id","shop_name","geohash")
userDf.show
+-------+-----------+-------+
|idvalue|day_of_week|geohash|
+-------+-----------+-------+
| id-1| 2|gcutjjn|
| id-2| 2|gcutjjn|
| id-1| 3|gcutjjn|
| id-1| 5|gcutjht|
+-------+-----------+-------+
shopDf.show
+-------+------------+-------+
|shop_id| shop_name|geohash|
+-------+------------+-------+
| sid-1| kfc|gcutjjn| << Duplicated geohash
| sid-2| mcd|gcutjhq|
| sid-3| starbucks|gcutjht|
| sid-4|burguer king|gcutjjn| << Duplicated geohash
+-------+------------+-------+
//Dataframe with hashes to exclude:
val excludedHashes = shopDf.groupBy("geohash").count.filter("count > 1")
excludedHashes.show
+-------+-----+
|geohash|count|
+-------+-----+
|gcutjjn| 2|
+-------+-----+
//Create a dataframe of shops without the ones with duplicated hashes
val cleanShopDf = shopDf.join(excludedHashes,Seq("geohash"),"left_anti")
cleanShopDf.show
+-------+-------+---------+
|geohash|shop_id|shop_name|
+-------+-------+---------+
|gcutjhq| sid-2| mcd|
|gcutjht| sid-3|starbucks|
+-------+-------+---------+
//Perform the same join operation
cleanShopDf.join(userDf,Seq("geohash"),"inner")
.groupBy($"geohash",$"shop_id",$"idvalue")
.agg(collect_list($"day_of_week").alias("days"))
.show
+-------+-------+-------+----+
|geohash|shop_id|idvalue|days|
+-------+-------+-------+----+
|gcutjht| sid-3| id-1| [5]|
+-------+-------+-------+----+
The code provided was written in Scala but it can be easily converted to Python.
Hope this helps!

This is an idea if it possible you used pyspark SQL to select distinct geohash and create to the tempory table. Then join from this table instead of dataframes.

SQL Output Rows as columns

I have a table that tests an item and stores any faliures similar to:
Item|Test|FailureValue
1 |1a |"ZZZZZZ"
1 |1b | 123456
2 |1a |"MMMMMM"
2 |1c | 111111
1 |1d |"AAAAAA"
Is there a way in SQL to essential pivot these and have the failure values be output to individual columns? I know that I can already use STUFF to achieve what I want for the Test field but I would like the results as individual columns if possible.
I'm hoping to achieve something like:
Item|Tests |FailureValue1|FailureValue2|FailureValue3|Failure......
1 |1a,1b |"ZZZZZZ" |123456 |NULL |NULL ......
2 |1a,1b |"MMMMMM" |111111 |"AAAAAA" |NULL ......
Kind regards
Matt

Split a column in multiple columns using Spark SQL

I have a column col1 that represents a GPS coordinate format:
25 4.1866N 55 8.3824E
I would like to split it in multiple columns based on white-space as separator, as in the output example table_example below:
| 1st_split | 2nd_split | 3rd_split | 4th_split |
|:-----------|------------:|:------------:|:------------:|
| 25 | 4.1866N | 55 | 8.3824E |
Considering the fact that there is the split() function, I have tried in this way:
SELECT explode(split(`col1`, ' ')) AS `col` FROM table_example;
But, instead of splitting per multiple columns, it splits per multiple rows, like in the output below:
Can someone clarify me which would be the worth approach for get the expected result?

If you have a dataframe as
+---------------------+
|col |
+---------------------+
|25 4.1866N 55 8.3824E|
+---------------------+
Using Scala API
You can simply use split inbuilt function and select appropriately as
import org.apache.spark.sql.functions._
df.withColumn("split", split(col("col"), " "))
.select(col("split")(0).as("1st_split"), col("split")(1).as("2nd_split"),col("split")(2).as("3rd_split"),col("split")(3).as("4th_split"))
.show(false)
which would give you
+---------+---------+---------+---------+
|1st_split|2nd_split|3rd_split|4th_split|
+---------+---------+---------+---------+
|25 |4.1866N |55 |8.3824E |
+---------+---------+---------+---------+
Using SQL way
Sql is much easier and similar to the api way
df.createOrReplaceTempView("table_example")
val splitted = sqlContext.sql("SELECT split(`col`, ' ') AS `col` FROM table_example")
splitted.createOrReplaceTempView("splitted_table")
val result = sqlContext.sql("SELECT `col`[0] AS `1st_split`, `col`[1] AS `2nd_split`, `col`[2] AS `3rd_split`, `col`[3] AS `4th_split` FROM splitted_table")
result.show(false)
I hope the answer is helpful

Using data from fitnesse table as a variable

I'm just getting started with Fitnesse.
I am trying to use a result of a query (number?) and with that result use it in another test fixture. so e.g.
!|Create Account |
|Contractor No |Account State |Account Name |Account Type |Invoice Template Name|Number?|Result?|
|${ContractorNo}|${AccountState}|${AccountName}|${AccountType}|${InvoiceTempName} | |TRUE |
!|Check Account |
|AccountNo |Account Exists?|
|(result from number?)|TRUE |
is there anyway of doing this? ive tried this: SymbolTestTables, but this seems to be for one entire record instead of just the result from one function.
As i said im new to fitnesse so i apologize if this is easy.

I think the page you are referring to is the Fit style symbols.
I recommend that folks either use Slim, or move to FitLibrary; as both are better supported than just FIT.
If you are doing things the Slim way, you would want to look at this page: Symbols in tables
And if you were doing the Slim way, your table would look like this:
!|Create Account |
|Contractor No |Account State |Account Name |Account Type |Invoice Template Name|Number? |Result?|
|${ContractorNo}|${AccountState}|${AccountName}|${AccountType}|${InvoiceTempName} |$accountNumber= |TRUE |
!|Check Account |
|AccountNo |Account Exists?|
|$accountNumber|TRUE |
I'm not as familiar with the FitLibrary style, so I will refer you to their documentation: http://sourceforge.net/apps/mediawiki/fitlibrary/index.php?title=Main_Page

With .NET/Fit, you can use << and >> to store and recall values, but I don't know if this is supported with Java:
!|Create Account |
|Contractor No |Account State |Account Name |Account Type |Invoice Template Name|Number? |Result?|
|${ContractorNo}|${AccountState}|${AccountName}|${AccountType}|${InvoiceTempName} |>>AccountNumber|TRUE |
!|Check Account |
|AccountNo |Account Exists?|
|<<AccountNumber |TRUE |

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Using LIKE operator for multiple words in PySpark - dataframe

This works for me from pyspark.sql import functions as F df.where(F.col("ID").like('5%') | F.col("ID").like('6%'))

You can try df = df.where('ID like "5%" or ID like "6%"')

In pyspark, SparkSql syntax: where column_n like 'xyz%' OR column_n like 'abc%' might not work. Use: where column_n RLIKE '^xyz|abc' Explanation: It will filter all words either starting with abc or xyz. This works perfectly fine.

For me this worked: from pyspark.sql.functions import col df.filter((col("ID").like("5%")) | (col("ID").like("6%")))

Related

Pyspark dataframe - Illegal values appearing in the column?

Joining two pyspark dataframes by unique values in a column

SQL Output Rows as columns

Split a column in multiple columns using Spark SQL

Using data from fitnesse table as a variable

Categories

Resources