Add column with the first IP address of the subnet - dataframe

I have PySpark dataframe with column named "subnet". I want to add a column which is the first IP of that subnet. I've tried many solutions including
def get_first_ip(prefix):
n = ipaddress.IPv4Network(prefix)
first, last = n[0], n[-1]
return first
df.withColumn("first_ip", get_first_ip(F.col("subnet")))
But getting error:
-> 1161 raise AddressValueError("Expected 4 octets in %r" % ip_str)
1162
1163 try:
AddressValueError: Expected 4 octets in "Column<'subnet'>"
I do understand that is the Column value and can no use it as a simple string here, but how to solve my problem with PySpark?
I could do the same in pandas and then convert to PySpark, but I'm wondering if there's any other more elegant way?

It's hard to tell what's the issue when we don't know how the input dataframe looks like. But something is wrong with the column values as #samkart suggested.
Here's an example that I tested:
import ipaddress
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
def get_first_ip(x):
n = ipaddress.IPv4Network(x)
return str(n[0])
def get_last_ip(x):
n = ipaddress.IPv4Network(x)
return str(n[-1])
first_ip_udf = F.udf(lambda x: get_first_ip(x), StringType())
last_ip_udf = F.udf(lambda x: get_last_ip(x), StringType())
spark = SparkSession.builder.getOrCreate()
data = [
{"IP": "10.10.128.123"},
{"IP": "10.10.128.0/17"},
]
df = spark.createDataFrame(data=data)
df = df.withColumn("first_ip", first_ip_udf(F.col("IP")))
df = df.withColumn("last_ip", last_ip_udf(F.col("IP")))
Outputs:
+--------------+-------------+-------------+
|IP |first_ip |last_ip |
+--------------+-------------+-------------+
|10.10.128.123 |10.10.128.123|10.10.128.123|
|10.10.128.0/17|10.10.128.0 |10.10.255.255|
+--------------+-------------+-------------+

You cannot directly apply python native function to a Spark dataframe column. As demonstrated in this answer, you could create a udf from your function.
Since udf is slow for big dataframes, you could use pandas_udf which is a lot faster.
Input:
import ipaddress
import pandas as pd
from pyspark.sql import functions as F
df = spark.createDataFrame([("10.10.128.123",), ("10.10.128.0/17",)], ["subnet"])
Script:
#F.pandas_udf('string')
def get_first_ip(prefix: pd.Series) -> pd.Series:
return prefix.apply(lambda s: str(ipaddress.IPv4Network(s)[0]))
df = df.withColumn("first_ip", get_first_ip("subnet"))
df.show()
# +--------------+-------------+
# | subnet| first_ip|
# +--------------+-------------+
# | 10.10.128.123|10.10.128.123|
# |10.10.128.0/17| 10.10.128.0|
# +--------------+-------------+

Related

pandas_udf with pd.Series and other object as arguments

I am having trouble with creating a Pandas UDF that performs a calculation on a pd Series based on a value in the same row of the underlying Spark Dataframe.
However, the most straight forward solution doesn't seem to be supported by the Pandas on Spark API:
A very simple example like below
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
import pandas as pd
#F.pandas_udf(IntegerType())
def addition(arr: pd.Series, addition: int) -> pd.Series:
return arr.add(addition)
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show()
df.withColumn("added", addition(F.col("array"),F.col("addition")))
throws the following exception on the udf definition line
NotImplementedError: Unsupported signature: (arr: pandas.core.series.Series, addition: int) -> pandas.core.series.Series.
Am i tackling this problem in a wrong way? I could reimplement the whole "addition" function in native PySpark, but the real function I am talking about is terribly complex and would mean an enormous amount of rework.
Loading the example, adding import array
from pyspark.sql.types as T
import pyspark.sql.functions as F
import pandas as pd
from array import array
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show(truncate=False)
print(df.schema.fields)
The response is,
+---------+--------+
| array|addition|
+---------+--------+
|[1, 2, 3]| 10|
|[4, 5, 6]| 20|
+---------+--------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True)]
If you must use a Pandas function to complete your task here is an option for a solution that uses a Pandas function within a PySpark UDF,
The Spark DF arr column is ArrayType, convert it into a Pandas Series
Apply the Pandas function
Then, convert the Pandas Series back to an array
#F.udf(T.ArrayType(T.LongType()))
def addition_pd(arr, addition):
pd_arr = pd.Series(arr)
added = pd_arr.add(addition)
return array("l", added)
df = df.withColumn("added", addition_pd(F.col("array"),F.col("addition")))
df.show(truncate=False)
print(df.schema.fields)
Returns
+---------+--------+------------+
|array |addition|added |
+---------+--------+------------+
|[1, 2, 3]|10 |[11, 12, 13]|
|[4, 5, 6]|20 |[24, 25, 26]|
+---------+--------+------------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True), StructField('added', ArrayType(LongType(), True), True)]
However, it is worth stating that when possible it is recommended to use PySpark Functions over the use of PySpark UDF (see here)

Creating PySpark UDFs from python method with numpy array input, to calculate and return a single float value

As input I have a csv file with int values in it.
spark_df = spark.read.option("header", "false").csv("../int_values.csv")
df = spark_df.selectExpr("_c0 as something")
_df = df.withColumn("values", df.something.cast(FloatType())).select("values")
I also have some python functions designed for numpy array inputs, that I need to apply on the Spark DataFrame.
The example one:
def calc_sum(float_array):
return np.sum(float_array)
Real function:
def calc_rms(float_array):
return np.sqrt(np.mean(np.diff(float_array)**2))
For the 1. example you can use SQL sum like:
_df.groupBy().sum().collect()
But, what I need is a standard solution to transform these functions into Spark UDFs
I tried many ways, like:
udf_sum = udf(lambda x : calc_sum(x), FloatType())
_df.rdd.flatMap(udf_sum).collect()
but it always failed with:
TypeError: Invalid argument, not a string or column:
Row(values=1114.0) of type <class 'pyspark.sql.types.Row'>. For column
literals, use 'lit', 'array', 'struct' or 'create_map' function.
Is it possible to transform the data in a way that works with these functions?
DataFrame sample:
In [6]: spark_df.show()
+----+
| _c0|
+----+
|1114|
|1113|
|1066|
|1119|
|1062|
|1089|
|1093|
| 975|
|1099|
|1062|
|1062|
|1162|
|1057|
|1123|
|1141|
|1089|
|1172|
|1096|
|1164|
|1146|
+----+
only showing top 20 rows
Expected output:
A Float value returned from the UDF.
For the Sum function it should be clear.
What you want is groupby and use collect_list to get all integer values into an array column then apply your UDF on that column. Also, you need to explicitly return float from calc_rms:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
def calc_rms(float_array):
return float(np.sqrt(np.mean(np.diff(float_array) ** 2)))
calc_rms_udf = F.udf(calc_rms, FloatType())
df.groupby().agg(F.collect_list("_c0").alias("_c0")) \
.select(calc_rms_udf(F.col("_c0")).alias("rms")) \
.show()
#+--------+
#| rms|
#+--------+
#|67.16202|
#+--------+

Dask .loc only the first result (iloc[0])

Sample dask dataframe:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to only get first (based on the index) result back - like this in pandas:
df.loc[df.col_1 >3].iloc[0]
col_1 col_2
2 4 d
I know there is no positional row indexing in dask using iloc, but I wonder if it would be possible to limit the query to 1 result like in SQL?
Got it - But not sure about the efficiency here:
tmp = df.loc[df.col_1 >3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()

Parse list and create DataFrame

I have been given a list called data which has the following content
data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
I wanted to have a pandas dataframe which looks like the link
Expected Dataframe
How can I achieve this.
You can try this:
data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
processed_data = [x.split(',') for x in data[0].decode().replace('\r', '').strip().split('\n')]
df = pd.DataFrame(columns=processed_data[0], data=processed_data[1:])
Hope it helps.
I would recommend you to convert this list to string as there is only one index in this list
str1 = ''.join(data)
Then use solution provided here
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
TESTDATA = StringIO(str1)
df = pd.read_csv(TESTDATA, sep=",")

IPython Notebook cell multiple outputs

I am running this cell in IPython Notebook:
# salaries and teams are Pandas dataframe
salaries.head()
teams.head()
The result is that I am only getting the output of teams data-frame rather than of both salaries and teams. If I just run salaries.head() I get the result for salaries data-frame but on running both the statement I just see the output of teams.head(). How can I correct this?
have you tried the display command?
from IPython.display import display
display(salaries.head())
display(teams.head())
An easier way:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
It saves you having to repeatedly type "Display"
Say the cell contains this:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
a = 1
b = 2
a
b
Then the output will be:
Out[1]: 1
Out[1]: 2
If we use IPython.display.display:
from IPython.display import display
a = 1
b = 2
display(a)
display(b)
The output is:
1
2
So the same thing, but without the Out[n] part.
IPython Notebook shows only the last return value in a cell. The easiest solution for your case is to use two cells.
If you really need only one cell you could do a hack like this:
class A:
def _repr_html_(self):
return salaries.head()._repr_html_() + '</br>' + teams.head()._repr_html_()
A()
If you need this often, make it a function:
def show_two_heads(df1, df2, n=5):
class A:
def _repr_html_(self):
return df1.head(n)._repr_html_() + '</br>' + df2.head(n)._repr_html_()
return A()
Usage:
show_two_heads(salaries, teams)
A version for more than two heads:
def show_many_heads(*dfs, n=5):
class A:
def _repr_html_(self):
return '</br>'.join(df.head(n)._repr_html_() for df in dfs)
return A()
Usage:
show_many_heads(salaries, teams, df1, df2)
Enumerating all the solutions:
sys.displayhook(value), which IPython/jupyter hooks into. Note this behaves slightly differently from calling display, as it includes the Out[n] text. This works fine in regular python too!
display(value), as in this answer
get_ipython().ast_node_interactivity = 'all'. This is similar to but better than the approach taken by this answer.
Comparing these in an interactive session:
In [1]: import sys
In [2]: display(1) # appears without Out
...: sys.displayhook(2) # appears with Out
...: 3 # missing
...: 4 # appears with Out
1
Out[2]: 2
Out[2]: 4
In [3]: get_ipython().ast_node_interactivity = 'all'
In [2]: display(1) # appears without Out
...: sys.displayhook(2) # appears with Out
...: 3 # appears with Out (different to above)
...: 4 # appears with Out
1
Out[4]: 2
Out[4]: 3
Out[4]: 4
Note that the behavior in Jupyter is exactly the same as it is in ipython.
Provide,
print salaries.head()
teams.head()
This works if you use the print function since giving direct commands only returns the output of last command.
For instance,
salaries.head()
teams.head()
outputs only for teams.head()
while,
print(salaries.head())
print(teams.head())
outputs for both the commands.
So, basically, use the print() function