Need a TRUE and FALSE column in Spark-SQL

I'm trying to write a multi-value filter for a Spark SQL DataFrame.
I have:
val df: DataFrame // my data
val field: String // The field of interest
val values: Array[Any] // The allowed possible values
and I'm trying to come up with the filter specification.
At the moment, I have:
val filter = values.map(value => df(field) === value).reduce(_ || _)
But this isn't robust in the case where I get passed an empty list of values. To cover that case, I would like:
val filter = values.map(value => df(field) === value).fold(falseColumn)(_ || _)
but I don't know how to specify falseColumn.
Anyone know how to do so?
And is there a better way of writing this filter? (If so, I still need the answer for how to get a falseColumn - I need a trueColumn for a separate piece).

A column that is always true:
val trueColumn = lit(true)
A column that is always false:
val falseColumn = lit(false)
Using lit(...) means these will always be valid columns, regardless of what columns the DataFrame contains.
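Putting it together, a minimal sketch using the df, field and values from the question (only the lit import is added):
import org.apache.spark.sql.functions.lit
// Fold over the allowed values, starting from a column that is always false,
// so an empty values array yields a filter that matches no rows.
val filter = values.map(value => df(field) === value).fold(lit(false))(_ || _)
val result = df.filter(filter)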

Related

Convert a spark dataframe to a column

I have an org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:
val filled_column2 = x.select(first("col1", ignoreNulls = true).over(window))
and I want to convert it into a Spark SQL column. Could anyone help with that?
Thank you,
@Jaime Caffarel: this is exactly what I am trying to do; this will give you more visibility. You may also check the error message in the second screenshot.
From the documentation of the class org.apache.spark.sql.Column:
A column that will be computed based on the data in a DataFrame. A new column is constructed based on the input columns present in a dataframe:
df("columnName")          // On a specific DataFrame.
col("columnName")         // A generic column not yet associated with a DataFrame.
col("columnName.field")   // Extracting a struct field.
col("a.column.with.dots") // Escape `.` in column names.
$"columnName"             // Scala short hand for a named column.
expr("a + 1")             // A column that is constructed from a parsed SQL Expression.
lit("abc")                // A column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that the product_id is a unique key per each row, I would do something like this:
val filled_column = df.select(df("product_id"), last("last_prev_week_nopromo", ignoreNulls = true).over(window))
This way, you are also selecting the product_id that you will use as key. Then, you can do the following
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
.join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("product_id"), "inner")
// orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?

How to print lists in a specific order in Kotlin?

I'm working on a project and I have a list in Kotlin like:
val list = listOf("banana", "1","apple","3","banana","2")
and I want to print it like
Output:
banana = 1
banana = 2
apple = 3
So every word with its number should act like one value, and I need to print in a specific order (the order is too random for any sort command), so I'm planning on just copying the whole Xinhua dictionary here (since all Chinese characters have a specific Unicode code point), and make the code replace like:
val list = listOf("banana丨", "1","apple丩","3","banana丨","2")
but how to print them in the order?
P.S. Even me, as a Chinese speaker, I don't know most of the characters in the Xinhua dictionary lol, so there are more than enough.
Assuming that you have the following input list, as shown in your question, where the order of occurrence is always one word followed by its order number:
val list = listOf("banana", "1","apple","3","banana","2")
You could do the following:
1. Create a data class that defines one entry in your raw input list
data class WordEntry(val word: String, val order: Int)
2. Map over your raw input list by using the windowed and map methods
val dictionary = list.windowed(2, 2).map { WordEntry(it.first(), it.last().toInt()) }
Here, the windowed(2, 2) method creates windows of size 2 with step 2, meaning that we iterate over the raw input list and work with two entries at a time, advancing by two each step. Assuming that the order in the raw input list is always the word followed by its number, this works. Otherwise it would not, so the order is very important here!
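For example, the windowing step alone produces:
listOf("banana", "1", "apple", "3", "banana", "2").windowed(2, 2)
// => [[banana, 1], [apple, 3], [banana, 2]]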
3. Sort the transformed dictionary by the order property
val sortedDictionary = dictionary.sortedBy { it.order }
Edit: You can also sort by any other property. Just pass another property to the lambda expression of sortedBy (e.g. sortedBy { it.word } if you want to sort it by the word property)
4. Finally, you can print out your sorted dictionary
val outputStr = sortedDictionary.joinToString("\n") { "${it.word} = ${it.order}" }
print(outputStr)
Output:
banana = 1
banana = 2
apple = 3
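Putting the four steps together, a self-contained sketch (same data as in the question):
data class WordEntry(val word: String, val order: Int)

fun main() {
    val list = listOf("banana", "1", "apple", "3", "banana", "2")
    // Pair up each word with its number, parse the number, then sort by it.
    val sortedDictionary = list.windowed(2, 2)
        .map { WordEntry(it.first(), it.last().toInt()) }
        .sortedBy { it.order }
    println(sortedDictionary.joinToString("\n") { "${it.word} = ${it.order}" })
}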

How to return ONLY top 5% of responses in a column PANDAS

I am looking to return the top 5% of responses in a column using pandas. So, for col_1, basically, I want a list of the responses that make up at least 5% of the responses in that column.
The following returns the list of ALL responses in the col_1 that meet the condition, as well as those that do not (returns boolean True and False):
df['col_1'].value_counts(normalize = True) >= .05
While this is somewhat helpful, I would like to return ONLY those that evaluate to true. Should I use a dictionary and loop? If so, how do I signal that I am using value_counts(normalize = True) >= .05 to append to that dictionary?
Thank you for your help!
If you need to filter by boolean indexing:
s = df['col_1'].value_counts(normalize = True)
L = s.index[s >= .05].tolist()
print(L)
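For instance, with a small made-up column where 'c' accounts for only 4% of the responses:
import pandas as pd

df = pd.DataFrame({'col_1': ['a'] * 20 + ['b'] * 4 + ['c']})  # hypothetical data
s = df['col_1'].value_counts(normalize=True)  # a: 0.80, b: 0.16, c: 0.04
L = s.index[s >= .05].tolist()
print(L)  # ['a', 'b']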

negative of "contains" in openrefine

I would like to add a column based on another column and fill it with all the values that do NOT contain "jpg"
so the negation of this:
filter(value.split(","), v, v.contains("jpg")).join("|")
How can I write "does not contain"?
contains gives a boolean output, i.e. true or false. So we have:
v = "picture.jpg" -> v.contains("jpg") = TRUE
v = "picture.gif" -> v.contains("jpg") = FALSE
filter finds all values in an array which return TRUE for whatever condition you use in the filter. There are a couple of ways to find the values that don't contain a string, but the simplest is probably to keep contains and use not to reverse the result of your condition:
filter(value.split(","), v, not(v.contains("jpg"))).join("|")
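For example, on the hypothetical cell value "photo.jpg,diagram.gif,scan.png", the expression
filter("photo.jpg,diagram.gif,scan.png".split(","), v, not(v.contains("jpg"))).join("|")
returns "diagram.gif|scan.png".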

convert Int64Index to Int

I'm iterating through a dataframe (called hdf) and applying changes on a row-by-row basis. hdf is sorted by group_id and assigned a 1-through-n rank on some criteria.
# Groupby function creates subset dataframes (a dataframe per distinct group_id).
grouped = hdf.groupby('group_id')
# Iterate through each subdataframe.
for name, group in grouped:
    # This grabs the top index for each subdataframe
    index1 = group[group['group_rank'] == 1].index
    # If criteria1 == 0, flag all rows for removal
    if max(group['criteria1']) == 0:
        for x in range(rank1, rank1 + max(group['group_rank'])):
            hdf.loc[x, 'remove_row'] = 1
I'm getting the following error:
TypeError: int() argument must be a string or a number, not 'Int64Index'
I get the same error when I try to cast rank1 explicitly:
rank1 = int(group[group['auction_rank']==1].index)
Can someone explain what is happening and provide an alternative?
The answer to your specific question is that index1 is an Int64Index (basically a list), even if it has one element. To get that one element, you can use index1[0].
But there are better ways of accomplishing your goal. If you want to remove all of the rows in the "bad" groups, you can use filter:
hdf = hdf.groupby('group_id').filter(lambda group: group['criteria1'].max() != 0)
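For instance, with a hypothetical two-group frame, filter drops every row of a group whose criteria1 never exceeds 0:
import pandas as pd

hdf = pd.DataFrame({'group_id': [1, 1, 2, 2],      # made-up data
                    'criteria1': [0, 3, 0, 0]})
kept = hdf.groupby('group_id').filter(lambda group: group['criteria1'].max() != 0)
print(kept)  # only the two rows of group_id 1 remain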
If you only want to remove certain rows within matching groups, you can write a function and then use apply:
def filter_group(group):
    if group['criteria1'].max() != 0:
        return group
    else:
        return group.loc[other criteria here]
hdf = hdf.groupby('group_id').apply(filter_group)
(If you really like your current way of doing things, you should know that loc will accept an index, not just an integer, so you could also do hdf.loc[group.index, 'remove_row'] = 1).
Call tolist() on the Int64Index object. Then the list can be iterated as int values.
Simply add [0] to ensure getting the first value from the index:
rank1 = int(group[group['auction_rank']==1].index[0])
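A quick illustration with made-up index values:
idx = group[group['auction_rank'] == 1].index  # e.g. Int64Index([42], dtype='int64')
rank1 = int(idx[0])   # 42, a plain Python int
ranks = idx.tolist()  # [42], a list of plain ints you can iterate over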