I have a pyspark dataframe which looks like below
df
num11 num21
10 10
20 30
5 25
I am filtering the above dataframe on all the columns present, selecting rows where every value is greater than or equal to 10 (the number of columns can be more than two):
from pyspark.sql.functions import col
col_list = df.schema.names
df_filtered = df.where(col(c) >= 10 for c in col_list)
The desired output is:
num11 num21
10 10
20 30
How can we achieve filtering on multiple columns by iterating over the column list as above? [All efforts are appreciated.]
[The error I receive is: condition should be string or Column]
As an alternative, if you are not averse to some SQL-like snippets of code, the following should work:
df.where(" AND ".join(["(%s >= 10)" % c for c in col_list]))
You can use functools.reduce to combine the column conditions and simulate an all condition; for instance, reduce(lambda x, y: x & y, ...):
import pyspark.sql.functions as F
from functools import reduce
df.where(reduce(lambda x, y: x & y, (F.col(x) >= 10 for x in df.columns))).show()
+-----+-----+
|num11|num21|
+-----+-----+
| 10| 10|
| 20| 30|
+-----+-----+
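If you prefer not to spell out the lambda, operator.and_ performs the same bitwise-and reduction; this is only an equivalent idiom, not a different approach:
import operator
from functools import reduce
import pyspark.sql.functions as F

# same filter as above, with the lambda replaced by operator.and_
df.where(reduce(operator.and_, (F.col(c) >= 10 for c in df.columns))).show()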
I have a huge dataset spanning different years. As a subsample for local tests, I need to pull out a small dataframe that contains only a few samples from each year. Does anyone have any idea how to do that?
After grouping by the 'year' column, the count of instances in each year is something like:
| year | A    |
| ---- | ---- |
| 1838 | 1000 |
| 1839 | 2600 |
| 1840 | 8900 |
| 1841 | 9900 |
I want to select a subset which after groupby looks like:
| year| A |
| ----| --|
| 1838| 10|
| 1839| 10|
| 1840| 10|
| 1841| 10|
Try groupby().sample().
Here's example usage with dummy data.
import numpy as np
import pandas as pd

# create 200 random 'years' between 1800 and 1804 (high is exclusive) with random values
years = np.random.randint(low=1800, high=1805, size=200)
values = np.random.randint(low=1, high=200, size=200)
df = pd.DataFrame({'Years': years, 'Values': values})

# draw a fixed number of rows per year
number_per_year = 10
sample_df = df.groupby("Years").sample(n=number_per_year, random_state=1)
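To sanity-check the result, you can count the group sizes of the sample; every year should come back with number_per_year rows:
# each year in the sample should appear exactly number_per_year times
print(sample_df.groupby("Years").size())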
I want to delete the last two characters from values in a column.
The values of the PySpark dataframe look like this:
1000.0
1250.0
3000.0
...
and they should look like this:
1000
1250
3000
...
You can use substring to get the string until the index length - 2:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'col',
    F.expr("substring(col, 1, length(col) - 2)")
)
You can use regexp_replace:
from pyspark.sql import functions as F
df1 = df.withColumn("value", F.regexp_replace("value", "(.*).{2}", "$1"))
df1.show()
#+-----+
#|value|
#+-----+
#| 1000|
#| 1250|
#| 3000|
#+-----+
Or regexp_extract:
df1 = df.withColumn("value", F.regexp_extract("value", "(.*).{2}", 1))
You can use the function substring_index to extract the part before the period:
import pyspark.sql.functions as F

df = spark.createDataFrame([['1000.0'], ['2000.0']], ['col'])
df.withColumn('new_col', F.substring_index(F.col('col'), '.', 1)).show()
Result:
+------+-------+
| col|new_col|
+------+-------+
|1000.0| 1000|
|2000.0| 2000|
+------+-------+
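If the characters to drop are always a literal '.0' (i.e. the values are numeric strings, as in the example), a cast-based sketch is another option; this is an alternative not shown in the answers above and assumes that format holds for every row:
import pyspark.sql.functions as F

# assumes every value looks like '1000.0': cast to a number, truncate, then back to string
df2 = df.withColumn('col', F.col('col').cast('double').cast('int').cast('string'))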
I have a data frame that looks like:
Region, 2000Q1, 2000Q2, 2000Q3, ...
A, 1,2,3,...
I want to transpose this wide table to a long table by 'Region'. So the final product will look like:
Region, Time, Value
A, 2000Q1,1
A, 2000Q2, 2
A, 2000Q3, 3
A, 2000Q4, 4
....
The original table has a very wide array of columns, but the aggregation level is always Region and the remaining columns are to be transposed.
Do you know an easy way or function to do this?
Try the arrays_zip function, then explode the resulting array.
Example:
df = spark.createDataFrame([('A',1,2,3)],['Region','2000q1','2000q2','2000q3'])
from pyspark.sql.functions import *

# columns to transpose, plus their names joined into a single string for split()
cols = [c for c in df.columns if c != 'Region']
col_name = "|".join(cols)

df.withColumn("cc", explode(arrays_zip(array(cols), split(lit(col_name), "\\|")))).\
    select("Region", "cc.*").\
    toDF(*['Region', 'Value', 'Time']).\
    show()
#+------+-----+------+
#|Region|Value| Time|
#+------+-----+------+
#| A| 1|2000q1|
#| A| 2|2000q2|
#| A| 3|2000q3|
#+------+-----+------+
Similar but improved for the column calculation.
cols = df.columns
cols.remove('Region')
import pyspark.sql.functions as f
df.withColumn('array', f.explode(f.arrays_zip(f.array(*map(lambda x: f.lit(x), cols)), f.array(*cols)))) \
    .select('Region', 'array.*') \
    .toDF('Region', 'Time', 'Value') \
    .show(30, False)
+------+------+-----+
|Region|Time |Value|
+------+------+-----+
|A |2000Q1|1 |
|A |2000Q2|2 |
|A |2000Q3|3 |
|A |2000Q4|4 |
|A |2000Q5|5 |
+------+------+-----+
p.s. Don't accept this as an answer :)
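For completeness, Spark's stack() SQL function is another common way to unpivot; the sketch below is my own alternative (not part of either answer above) and assumes the same df and cols as in the previous snippet:
import pyspark.sql.functions as F

# build "stack(3, '2000q1', `2000q1`, ...) as (Time, Value)" from the non-Region columns
cols = [c for c in df.columns if c != 'Region']
stack_expr = "stack({n}, {pairs}) as (Time, Value)".format(
    n=len(cols),
    pairs=", ".join("'{0}', `{0}`".format(c) for c in cols))
df.select('Region', F.expr(stack_expr)).show()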
I have a dataframe which consists of 4 rows and more than 20 columns (dates). It comes from a table that I read and convert into a dataframe. The SUM row contains the sum of the values per date.
+----+-----+-----+
|PR |date1|date2|......
+----+-----+-----+
| a | 30 | 17 |......
| b | 30 | 12 |......
| SUM| 60 | 29 |......
+----+-----+-----+
I created this dataframe after submitting a question here. Since the table is constantly being populated with new data, I want the new data to be added to that dataframe.
I am coding in PySpark and the script is the following:
from pyspark.sql import functions as F
if df.filter(df.PR.like('SUM')):
    print("**********")
    print("SUM FOUND")
    df = df.union(df.select(df.where(df.index == 'SUM').select('PR'), *[F.sum(F.col(c)).alias(c) for c in df.columns if c != 'PR']))
else:
    df = df.union(df.select(F.lit("SUM").alias("PR"), *[F.sum(F.col(c)).alias(c) for c in df.columns if c != 'PR']))
What I want to achieve is that, for any new date, a new column is created and the SUM is filled in, without adding new rows. Unfortunately I am getting the error AttributeError: 'DataFrame' object has no attribute 'index'.
Any help/hint? Should I follow a different approach?
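As a hint on the traceback itself: Spark DataFrames have no .index attribute (that is a pandas concept), which is exactly what raises the AttributeError. A hedged sketch of checking for an existing SUM row using only the PR column could look like this:
from pyspark.sql import functions as F

# test for an existing SUM row without touching a (non-existent) index
has_sum_row = df.filter(F.col("PR") == "SUM").count() > 0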
Some illustrative data in a DataFrame (MultiIndex) format:
|entity| year |value|
+------+------+-----+
| a | 1999 | 2 |
| | 2004 | 5 |
| b | 2003 | 3 |
| | 2007 | 2 |
| | 2014 | 7 |
I would like to calculate the slope using scipy.stats.linregress for each entity a and b in the above example. I tried using groupby on the first column, following the split-apply-combine advice, but it seems problematic since it's expecting one Series of values (a and b), whereas I need to operate on the two columns on the right.
This is easily done in R via plyr, but I'm not sure how to approach it in pandas.
A function can be applied to a groupby with the apply method; the passed function in this case is linregress. Please see below:
In [4]: x = pd.DataFrame({'entity':['a','a','b','b','b'],
                          'year':[1999,2004,2003,2007,2014],
                          'value':[2,5,3,2,7]})
In [5]: x
Out[5]:
entity value year
0 a 2 1999
1 a 5 2004
2 b 3 2003
3 b 2 2007
4 b 7 2014
In [6]: from scipy.stats import linregress
In [7]: x.groupby('entity').apply(lambda v: linregress(v.year, v.value)[0])
Out[7]:
entity
a 0.600000
b 0.403226
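If you need more than just the slope, one small variation (my own sketch, assuming the same x as above) is to have the applied function return a Series, so several regression statistics come back as columns:
from scipy.stats import linregress
import pandas as pd

def fit(v):
    # linregress returns a result with named fields such as slope and intercept
    res = linregress(v.year, v.value)
    return pd.Series({'slope': res.slope, 'intercept': res.intercept})

x.groupby('entity').apply(fit)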
You can do this via the iterator ability of the groupby object. It seems easier to do it by dropping the current index and then grouping by 'entity'.
A list comprehension is then an easy way to quickly work through all the groups in the iterator. Or use a dict comprehension to get the labels in the same place (you can then stick the dict into a pd.DataFrame easily).
import pandas as pd
import scipy.stats
#This is your data
test = pd.DataFrame({'entity':['a','a','b','b','b'],'year':[1999,2004,2003,2007,2014],'value':[2,5,3,2,7]}).set_index(['entity','year'])
#This creates the groups
groupby = test.reset_index().groupby(['entity'])
#Process groups by list comprehension
slopes = [scipy.stats.linregress(group.year, group.value)[0] for name, group in groupby]
#Process groups by dict comprehension
slopes = {name:[scipy.stats.linregress(group.year, group.value)[0]] for name, group in groupby}
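As mentioned above, the dict variant drops straight into a DataFrame; a quick illustration, assuming the slopes dict from the last line:
# columns 'a' and 'b', one row labelled 'slope' (values match Out[7] above: 0.6 and 0.403226)
slopes_df = pd.DataFrame(slopes, index=['slope'])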