How to count the number of rows containing only special characters in each column of a PySpark DataFrame?

Assume that I have a PySpark DataFrame. Some of the cells contain only special characters.
Sample dataset:
import pandas as pd

data = {'ID': [1, 2, 3, 4, 5, 6],
        'col_1': ['A', '?', '<', ' ?', None, 'A?'],
        'col_2': ['B', ' ', '', '?>', 'B', '\B']}
pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)
I want to count, for each column, the number of cells that contain only special characters. Values like 'A?' and '\B', and blank cells, should not be counted.
The expected output will be:
{'ID': 0, 'col_1': 3, 'col_2': 1}
Is there any way to do that?

Taking your sample dataset, this should do it:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

data = {'ID': [1, 2, 3, 4, 5, 6],
        'col_1': ['A', '?', '<', ' ?', None, 'A?'],
        'col_2': ['B', ' ', '', '?>', 'B', '\B']}
pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)

res = {}
for col_name in df.columns:
    # matched: at least one special character and no alphanumeric characters
    matched = (col(col_name).rlike(r'[^A-Za-z0-9\s]')
               & ~col(col_name).rlike(r'[A-Za-z0-9]'))
    res[col_name] = df.where(matched).count()
print(res)  # {'ID': 0, 'col_1': 3, 'col_2': 1}
The trick is to combine two regular-expression conditions: the cell must contain at least one special character and must not contain any alphanumeric character.
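The two-pattern logic can be sanity-checked without a Spark session, using Python's re module on the sample values (a plain-Python sketch of the same check, not part of the Spark answer):

```python
import re

def only_special(value):
    """True if the cell contains a special character and no alphanumerics."""
    if value is None:
        return False
    has_special = re.search(r'[^A-Za-z0-9\s]', value) is not None
    has_alnum = re.search(r'[A-Za-z0-9]', value) is not None
    return has_special and not has_alnum

col_1 = ['A', '?', '<', ' ?', None, 'A?']
col_2 = ['B', ' ', '', '?>', 'B', '\\B']
print(sum(only_special(v) for v in col_1))  # 3
print(sum(only_special(v) for v in col_2))  # 1
```

Note that ' ' and '' fail the first pattern (whitespace is excluded by \s), 'A?' and '\B' fail the second, matching the expected counts.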

Related

How to iterate over a pandas DataFrame

I'm trying to iterate over my DataFrame and create a column with the result:
import pandas as pd

data = {'Name': ['Mary', 'Jose', 'John', 'Marc', 'Ruth', 'Rachel'],
        'Grades': [10, 8, 8, 5, 7, 4],
        'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Female']}
df = pd.DataFrame(data)

values = []
for x in df.iteritems():
    values.append('Passed' if x.Grades < 7 else 'Failed')
df['Final_result'] = values
df
I'm getting "'tuple' object has no attribute 'Grades'". Can you help me?
If you want to be good at pandas, avoid loops and start thinking "vectorized operations":
import numpy as np
df["Final_result"] = np.where(df["Grades"] < 7, "Passed", "Failed")
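If you do need a row loop (for per-row logic that can't be vectorized), itertuples is the method that yields namedtuples with attribute access; iteritems iterates over columns as (name, Series) tuples, which is why x.Grades failed. A sketch that keeps the question's original condition as written:

```python
import pandas as pd

data = {'Name': ['Mary', 'Jose', 'John', 'Marc', 'Ruth', 'Rachel'],
        'Grades': [10, 8, 8, 5, 7, 4],
        'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Female']}
df = pd.DataFrame(data)

# itertuples yields one namedtuple per row, so row.Grades works
df['Final_result'] = ['Passed' if row.Grades < 7 else 'Failed'
                      for row in df.itertuples()]
```

Even so, the vectorized np.where version above is the idiomatic (and much faster) choice.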

Set value of specific cell in pandas dataframe to sum of two other cells

I have a dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={'X': [1.5, 6.777, 2.444, np.nan],
          'Y': [1.111, np.nan, 8.77, np.nan],
          'Z': [5.0, 2.333, 10, 6.6666]})
I thought this should work, but I get the following error:
df.at[1, 'Z'] = (df.loc[[2], 'X'] + df.loc[[0], 'Y'])
ValueError: setting an array element with a sequence.
How can I achieve this?
This should work:
df.loc[1, 'Z'] = df.loc[2, 'X'] + df.loc[0, 'Y']
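A quick look at the types involved shows why the original version raised the error: list indexers like df.loc[[2], 'X'] return one-element Series, and assigning a Series into a single cell fails, while scalar indexers return plain numbers (a minimal check on the question's data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={'X': [1.5, 6.777, 2.444, np.nan],
          'Y': [1.111, np.nan, 8.77, np.nan],
          'Z': [5.0, 2.333, 10, 6.6666]})

print(type(df.loc[[2], 'X']))  # Series (list indexer)
print(type(df.loc[2, 'X']))    # plain scalar (scalar indexer)

df.loc[1, 'Z'] = df.loc[2, 'X'] + df.loc[0, 'Y']
print(df.loc[1, 'Z'])  # 2.444 + 1.111 = 3.555
```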

How to change pandas display font of index column

import pandas as pd

data = {'X': [3, 2, 0, 1],
        'Y': [0, 3, 7, 2]}
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
df.style.set_properties(**{'font-family': 'Courier New'})
df
The index column is displayed in bold; is it possible to change the font of the index column?
You need to use set_table_styles(). In this example I set "font-weight": "normal" for the index and column headers.
Let's define some test data:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 4, 3, 1]})
We define the style customization to use:
styles = [
    dict(selector="th", props=[("font-weight", "normal"),
                               ("text-align", "center")])]
We pass the styles list as the argument to set_table_styles():
html = df.style.set_table_styles(styles)
html
The index and column headers are now rendered in normal weight. See the pandas Styling documentation for more details.

pandas dividing by condition

I have a pandas dataframe df with the contents below:
df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [15, 10, 5]})
I would like to add a third column that shows the result of dividing each value of y by the value of y in the row where x == 'c'.
I tried this, but it did not work:
df['z'] = df['y']/df.loc[df['x']== 'c', 'y']
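The assignment fails because df.loc[df['x'] == 'c', 'y'] is a Series indexed at row 2, so the division aligns on index and produces NaN everywhere else. One possible fix, assuming the goal is to divide every y by the y of the row where x == 'c', is to extract the scalar first:

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [15, 10, 5]})

# take the scalar value of y where x == 'c' (here: 5)
divisor = df.loc[df['x'] == 'c', 'y'].iloc[0]
df['z'] = df['y'] / divisor
print(df['z'].tolist())  # [3.0, 2.0, 1.0]
```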

Getting standard deviation on a specific number of dates

In this dataframe...
import pandas as pd
import numpy as np
import datetime

tf = 365
dt = datetime.datetime.now() - datetime.timedelta(days=365)
df = pd.DataFrame({
    'Cat': np.repeat(['a', 'b', 'c'], tf),
    'Date': np.tile(pd.date_range(dt, periods=tf), 3),
    'Val': np.random.rand(3 * tf)
})
How can I get a dictionary of the standard deviation of 'Val' for each 'Cat', over a specific number of days counted back from the last day, for a large dataset?
This code gives the standard deviation for 10 days (with today = datetime.datetime.now()):
{s: np.std(df[(df.Cat == s) &
              (df.Date > today - datetime.timedelta(days=10))].Val)
 for s in df.Cat.unique()}
...but it looks clunky. Is there a better way?
First filter with boolean indexing and then aggregate std; because the pandas default is ddof=1 while np.std uses ddof=0, set it to 0:
d1 = df[df.Date > dt - datetime.timedelta(days=10)].groupby('Cat')['Val'].std(ddof=0).to_dict()
print(d1)
{'a': 0.28435695432581953, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
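The ddof point can be checked on a tiny Series, independent of the data above: pandas defaults to the sample standard deviation (ddof=1), while np.std computes the population standard deviation (ddof=0):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(np.std(s.values))  # population std (ddof=0): sqrt(1.25) ~ 1.118
print(s.std())           # pandas default (ddof=1): sqrt(5/3) ~ 1.291
print(s.std(ddof=0))     # matches np.std
```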
Another solution is to use a custom function:
f = lambda x: np.std(x.loc[(x.Date > dt-datetime.timedelta(days=10)), 'Val'])
d2 = df.groupby('Cat').apply(f).to_dict()
The difference between the solutions: if no values in a group match the condition, the first solution drops the group entirely, while the second assigns NaN:
d1 = {'b': 0.2908486860242955, 'c': 0.2995981283031974}
d2 = {'a': nan, 'b': 0.2908486860242955, 'c': 0.2995981283031974}