I want to find the max value comparing 100 columns in a data frame - apache-spark-sql

I have a dataframe
| syr | P1  | P2  |
|-----|-----|-----|
| 1   | 200 | 300 |
| 2   | 500 | 700 |
| 3   | 900 | 400 |
I want to create another DataFrame that has the max value between columns P1 and P2. The expected output is:
| syr | P1  | P2  | max |
|-----|-----|-----|-----|
| 1   | 200 | 300 | 300 |
| 2   | 500 | 700 | 700 |
| 3   | 900 | 400 | 900 |

You could define a UDF that returns the max value of two columns, like:
import org.apache.spark.sql.functions.udf

def maxDef(p1: Int, p2: Int): Int = if (p1 > p2) p1 else p2
val max = udf[Int, Int, Int](maxDef)
And then apply the UDF in a withColumn() to define a new Column, like:
val df1 = df.withColumn("max", max(df.col("P1"), df.col("P2")))
+---+---+---+---+
|syr| P1| P2|max|
+---+---+---+---+
| 1|200|300|300|
| 2|500|700|700|
| 3|900|400|900|
+---+---+---+---+
EDIT: Iterate through columns
First initialize the max Column:
df = df.withColumn("max", lit(0)) // df must be declared as a var here; lit comes from org.apache.spark.sql.functions
Then, for each column you want to include (selected here with a filter on the column names), compare it against the running max column:
// max here is the UDF defined above, applied pairwise against the running max column
df.columns.filter(_.startsWith("P")).foreach { colName =>
  df = df.withColumn("max", max(df.col("max"), df.col(colName)))
}
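As an alternative to looping column by column, Spark also ships a built-in greatest function (in both the Scala and Python APIs) that takes the row-wise maximum over any number of columns in one expression. A minimal PySpark sketch, assuming a DataFrame df with the same P-prefixed columns:

from pyspark.sql import functions as F

# Pick every column whose name starts with "P" (adjust the filter as needed).
p_cols = [F.col(c) for c in df.columns if c.startswith("P")]

# greatest() returns the row-wise maximum across all of the given columns.
df = df.withColumn("max", F.greatest(*p_cols))
df.show()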

Related

Pandas apply to a range of columns

Given the following dataframe, I would like to add a fifth column that contains a list of column headers when a certain condition is met on a row, but only for a range of dynamically selected columns (i.e. a subset of the dataframe):
| North | South | East | West |
|-------|-------|------|------|
| 8 | 1 | 8 | 6 |
| 4 | 4 | 8 | 4 |
| 1 | 1 | 1 | 2 |
| 7 | 3 | 7 | 8 |
For instance, given that the inner two columns ('South', 'East') are selected and that column headers are to be returned when the row contains the value of one (1), the expected output would look like this:
| Headers       |
|---------------|
| [South]       |
|               |
| [South, East] |
|               |
The following one-liner manages to return column headers for the entire dataframe:
df['Headers'] = df.apply(lambda x: df.columns[x==1].tolist(),axis=1)
I tried adding the dynamic column range condition by using iloc but to no avail. What am I missing?
For reference, these are my two failed attempts (N1 and N2 being column range variables here)
df['Headers'] = df.iloc[N1:N2].apply(lambda x: df.columns[x==1].tolist(),axis=1)
df['Headers'] = df.apply(lambda x: df.iloc[N1:N2].columns[x==1].tolist(),axis=1)
This works:
import pandas as pd

df = pd.DataFrame({'North': [8, 4, 1, 7], 'South': [1, 4, 1, 3],
                   'East': [8, 8, 1, 7], 'West': [6, 4, 2, 8]})
df1 = df.melt(ignore_index=False)              # long format, keeping the original index
condition1 = df1['variable'] == 'South'
condition2 = df1['variable'] == 'East'
condition3 = df1['value'] == 1
df1 = df1.loc[(condition1 | condition2) & condition3]
df1 = df1.groupby(df1.index)['variable'].apply(list)
df = df.join(df1)                              # the joined column is named 'variable'
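As for why the iloc attempts fail: df.iloc[N1:N2] slices rows, not columns, and df.columns inside the lambda still spans every column. A minimal sketch of the positional approach, assuming N1 and N2 are integer column positions (here chosen to pick 'South' and 'East'):

import pandas as pd

df = pd.DataFrame({'North': [8, 4, 1, 7], 'South': [1, 4, 1, 3],
                   'East': [8, 8, 1, 7], 'West': [6, 4, 2, 8]})
N1, N2 = 1, 3  # hypothetical bounds selecting the 'South' and 'East' columns

# Slice the columns positionally, then collect the labels where the row value is 1.
df['Headers'] = df.iloc[:, N1:N2].apply(lambda x: x.index[x == 1].tolist(), axis=1)
print(df)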

Create new column in Pyspark Dataframe by filling existing Column

I am trying to create new column in an existing Pyspark DataFrame. Currently the DataFrame looks as follows:
+----+----+---+----+----+----+----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|
+----+----+---+----+----+----+----+
| B| 10|200|null|null| 20|null|
| C|1000|100| 10|null|null|null|
| A| 100|200| 200| 200| 300| 10|
+----+----+---+----+----+----+----+
I want to fill null values in column M2C with 0 and create a new column Ratio. My expected output would be as follows:
+------+------+-----+------+------+------+------+-------+
| Acct | M1D | M1C | M2D | M2C | M3D | M3C | Ratio |
+------+------+-----+------+------+------+------+-------+
| B | 10 | 200 | null | null | 20 | null | 0 |
| C | 1000 | 100 | 10 | null | null | null | 0 |
| A | 100 | 200 | 200 | 200 | 300 | 10 | 200 |
+------+------+-----+------+------+------+------+-------+
I was trying to achieve my desired result with the following line of code.
df = df.withColumn('Ratio', df.select('M2C').na.fill(0))
The above line of code resulted in an assertion error as shown below.
AssertionError: col should be Column
The possible solution that I found using this link was to use the lit function.
I changed my code to
df = df.withColumn('Ratio', lit(df.select('M2C').na.fill(0)))
The above code led to AttributeError: 'DataFrame' object has no attribute '_get_object_id'
How can I achieve my desired output?
You're doing two things wrong here.
df.select returns a DataFrame, not a Column.
na.fill replaces null values in all columns, not just in specific ones.
The following code snippet will solve your use case:
from pyspark.sql.functions import col
df = df.withColumn('Ratio', col('M2C')).fillna(0, subset=['Ratio'])
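For a quick end-to-end check, here is a minimal sketch that rebuilds the sample data and applies the same two steps (assuming an active SparkSession bound to the name spark):

from pyspark.sql.functions import col

# Rebuild the sample DataFrame from the question.
data = [("B", 10, 200, None, None, 20, None),
        ("C", 1000, 100, 10, None, None, None),
        ("A", 100, 200, 200, 200, 300, 10)]
df = spark.createDataFrame(data, ["Acct", "M1D", "M1C", "M2D", "M2C", "M3D", "M3C"])

# Copy M2C into Ratio, then fill nulls with 0 in the new column only.
df = df.withColumn("Ratio", col("M2C")).fillna(0, subset=["Ratio"])
df.show()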

How to add headers to selected data from a bigger data frame?

I'm learning pandas and I have a DataFrame (from CSV) that I need to filter. The original DataFrame looks like this:
+----------+-----------+-------------+
| Header1 | Header2 | Header3 |
| Value 1 | A | B |
| Value 1 | A | B |
| Value 2 | C | D |
| Value 1 | A | B |
| Value 3 | B | E |
| Value 3 | B | E |
| Value 2 | C | D |
+----------+-----------+-------------+
Then, I select the new data with this code:
dataframe.header1.value_counts()
output:
Value 1 -- 3
Value 2 -- 2
Value 3 -- 2
dtype: int64
So, I need to add headers to this selection and output something like this
Values Count
Value 1 -- 3
Value 2 -- 2
Value 3 -- 2
pd.Series.value_counts returns a Series whose index holds the unique values of the Series calling the method. reset_index is what you want to turn it into a DataFrame, and the rename methods get the column labels right.
(df.Header1.value_counts()
.rename('Count') # Series name becomes column label for counts
.rename_axis('Values') # Index name becomes column label for unique values.
.reset_index() # Series -> DataFrame
)
#     Values  Count
# 0  Value 1      3
# 1  Value 2      2
# 2  Value 3      2
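A slightly shorter variant of the same chain uses the name argument of Series.reset_index to label the counts column directly; a sketch against data rebuilt from the question:

import pandas as pd

df = pd.DataFrame({'Header1': ['Value 1', 'Value 1', 'Value 2', 'Value 1',
                               'Value 3', 'Value 3', 'Value 2']})

# rename_axis names the index (the unique values); reset_index(name=...) names the counts.
out = df.Header1.value_counts().rename_axis('Values').reset_index(name='Count')
print(out)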

MSAccess query to search against a list of criteria in a data table

I am trying to build some discount list for products. I have a main table that contains conditions:
| ID | Connection | Class          | Discount |
|----|------------|----------------|----------|
| 1  | (B OR F)   | 150            | 0.1      |
| 2  | B          | (600 OR 900)   | 0.2      |
| 3  | F          | (1500 OR 2500) | 0.3      |
| 4  | (C OR F)   | 4500           | 0.25     |
The query I am trying to write is something like this:
SELECT Constraints.Discount
FROM Constraints
WHERE ((('600')=[Class]));
The example above should return row 2. How can this be done? Do I need to format my conditions in a different way? I have tried this example and could not get the result I want. The idea is to build multiple columns of constraints and, depending on which configuration is selected, narrow down to the applicable discount.
Please let me know if there is an easier way to solve this problem.
Thanks!
As #Minty said in a comment - your data isn't normalised. If you were to split Connection and Class so they only contained a single value you could easily pull the data back.
| ID | Connection | Class | Discount |
|----|------------|-------|----------|
| 1 | B | 150 | 0.1 |
| 1 | F | 150 | 0.1 |
| 2 | B | 600 | 0.2 |
| 2 | B | 900 | 0.2 |
| 3 | F | 1500 | 0.3 |
| 3 | F | 2500 | 0.3 |
| 4 | C | 4500 | 0.25 |
| 4 | F | 4500 | 0.25 |
This SQL would return 0.2:
SELECT Discount
FROM Constraints
WHERE Class = 600
I expect you'd have to bring in Connection as well, since the class on its own would bring back duplicate records (unless you group by Discount, based on the sample data).
So either:
SELECT Discount
FROM Table2
WHERE Connection = 'B' AND Class = 600
Or
SELECT Discount
FROM Table2
WHERE Class = 150
GROUP BY Discount
Edit: < ID, Connection, Class > can make up the composite Primary Key in the table.
What data type is the Class column? If it is a number, try the following query:
SELECT Constraints.Discount
FROM Constraints
WHERE [Class] = 600
'600' is a string constant that contains the 3 characters 6, 0, and 0 -- this is different from the numeric value 600.

SparkSQL: conditional sum on range of dates

I have a dataframe like this:
| id | prodId | date       | value |
|----|--------|------------|-------|
| 1  | a      | 2015-01-01 | 100   |
| 2  | a      | 2015-01-02 | 150   |
| 3  | a      | 2015-01-03 | 120   |
| 4  | b      | 2015-01-01 | 100   |
and I would love to do a groupBy prodId and aggregate 'value' summing it for ranges of dates. In other words, I need to build a table with the following columns:
prodId
val_1: sum value if date is between date1 and date2
val_2: sum value if date is between date2 and date3
val_3: same as before
etc.
| prodId | val_1 (01-01 to 01-02) | val_2 (01-03 to 01-04) |
|--------|------------------------|------------------------|
| a      | 250                    | 120                    |
| b      | 100                    | 0                      |
Is there any predefined aggregate function in Spark that allows doing conditional sums? Do you recommend developing an aggregation UDF (if so, any suggestions)?
Thanks a lot!
First, let's recreate the example dataset:
import org.apache.spark.sql.functions.to_date

val df = sc.parallelize(Seq(
  (1, "a", "2015-01-01", 100), (2, "a", "2015-01-02", 150),
  (3, "a", "2015-01-03", 120), (4, "b", "2015-01-01", 100)
)).toDF("id", "prodId", "date", "value").withColumn("date", to_date($"date"))

val dates = List(("2015-01-01", "2015-01-02"), ("2015-01-03", "2015-01-04"))
All you have to do is something like this:
import org.apache.spark.sql.functions.{when, lit, sum}

val exprs = dates.map {
  case (x, y) => {
    // Create a label for the column name
    val alias = s"${x}_${y}".replace("-", "_")
    // Convert strings to dates
    val xd = to_date(lit(x))
    val yd = to_date(lit(y))
    // Generate an expression equivalent to
    //   SUM(
    //     CASE
    //       WHEN date BETWEEN ... AND ... THEN value
    //       ELSE 0
    //     END
    //   ) AS ...
    // for each pair of dates.
    sum(when($"date".between(xd, yd), $"value").otherwise(0)).alias(alias)
  }
}
df.groupBy($"prodId").agg(exprs.head, exprs.tail: _*).show
// +------+---------------------+---------------------+
// |prodId|2015_01_01_2015_01_02|2015_01_03_2015_01_04|
// +------+---------------------+---------------------+
// | a| 250| 120|
// | b| 100| 0|
// +------+---------------------+---------------------+