string row1
string row2
Is it possible to reduce rows to 1 row?
Rows should be joined with a comma.
As a result I expect
string row1, string row2
One of the workaround could able to solve the above issue,
To concatenate we can use this for e.g | extend New_Column = strcat(tagname,",", tagvalue) with comma between two string.
For example we have tested in our environment with tag name and tag value
resourcecontainers
| where type =~ 'microsoft.resources/subscriptions'
| extend tagname = tostring(bag_keys(tags)[0])
| extend tagvalue = tostring(tags[tagname])
| extend New_Column = strcat(tagname,",", tagvalue) // for concate two rows into one with comma between two string
Here is the sample output for reference:
For more information please refer this SO THREAD
UPDATE: To cancat the rows we tried with the example of code as stated in the given SO THREAD which is suggested by #Yoni L.
| summarize result = strcat_array(make_list(word), ",")
Sample output for reference:
thanks for the tips. in your links I found:
| summarize result = strcat_array(make_list(name_s), ",")
Related
I'm trying to find a column (I do know the name of the column) base on a value. For example in this dataframe below, I'd like to know which row that has a column contains yellow for Category = A . The thing is I don't know the column name (colour) in advance so I couldn't do select * where Category = 'A' and colour = 'yellow' How can I scan the columns and achieve this? Many thanks for your help.
+--------+-----------+-------------+
|Category|colour |. name. |
+--------+-----------+-------------+
|A. | blue.| Elmo|
|A | yellow | Alex|
|B | desc | Erin|
+--------+-----------+-------------+
You can loop that check through the list of column names. You also can wrap this loop in a function for the readable purpose. Please note that this check per column would happen in sequence.
from pyspark.sql import functions as F
cols = df.columns
for c in cols:
cnt = df.where((F.col('Category') == 'A') & (F.col(c) == 'yellow')).count()
if cnt > 0:
print(c)
So, I'm looking for an efficient way to set up values within an existing column and setting values for a new column based on some conditions. If I have 10 conditions in a big data set, do I have to write 10 lines? Or can I combine them somehow...haven't figured it out yet.
Can you guys suggest something?
For example:
data_frame.loc[data_frame.col1 > 50 ,["col1","new_col"]] = "Cool"
data_frame.loc[data_frame.col2 < 100 ,["col1","new_col"]] = "Cool"
Can it be written in a single expression? "&" or "and" don't work...
Thanks!
yes you can do it,
here is an example:
data_frame.loc[(data_frame["col1"]>100) & (data_frame["col2"]<10000) | (data_frame["col3"]<500),"test"] = 0
explanation:
the filter I used is (with "and" and "or" conditions): (data_frame["col1"]>100) & (data_frame["col2"]<10000) | (data_frame["col3"]<500)
the column that will be changed is "test" and the value will be 0
You can try:
all_conditions = [condition_1, condition_2]
fill_with = [fill_condition_1_with, fill_condition_2_with]
df[["col1","new_col"]] = np.select(all_conditions, fill_with, default=default_value_here)
I have problems getting the data in separate rows. At the moment all my data per column is in one cell. I really would appreciate your support!
the column header is "Dealer" and it is showing one value below like this:
|Dealer|
|:---- |
|['Automobiles', 'Garage Benz', 'Cencini SA']|
I would like to get three rows out of this:
Row
Dealer
1
'Automobiles'
2
'Garage Benz'
3
'Cencini SA'
4
....
5
....
...
...
what would be the easiest way to achieve this?
Thanks for your support, as I am totally new to pandas!
The easiest way is to convert your data into a dict like data:
x = {'Dealer':['Automobiles', 'Garage Benz', 'Cencini SA']}
Then
x = pd.DataFrame(x)
I'm trying to run a for loop in PySpark that needs a to filter a variable for an algorithm.
Here's an example of my dataframe df_prods:
+----------+--------------------+--------------------+
|ID | NAME | TYPE |
+----------+--------------------+--------------------+
| 7983 |SNEAKERS 01 | Sneakers|
| 7034 |SHIRT 13 | Shirt|
| 3360 |SHORTS 15 | Short|
I want to iterate over a list of ID's, get the match from the algorithm and then filter the product's type.
I created a function that gets the type:
def get_type(ID_PROD):
return [row[0] for row in df_prods.filter(df_prods.ID == ID_PROD).select("TYPE").collect()]
And wanted it to return:
print(get_type(7983))
Sneakers
But I find two issues:
1- it takes a long time to do that (longer than I got doing a similar thing on Python)
2- It returns an string array type: ['Sneakers'] and when I try to filter the products, this happens:
type = get_type(7983)
df_prods.filter(df_prods.type == type)
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [Sneakers]
Does anyone know a better way to approach this on PySpark?
Thank you very much in advance. I'm having a very hard time learning PySpark.
A little adjustment on your function. This returns the actual string of the target column from the first record found after filtering.
from pyspark.sql.functions import col
def get_type(ID_PROD):
return df.filter(col("ID") == ID_PROD).select("TYPE").collect()[0]["TYPE"]
type = get_type(7983)
df_prods.filter(col("TYPE") == type) # works
I find using col("colname") to be much more readable.
About the performance issue you've mentioned, I really cannot say without more details (e.g. inspecting the data and the rest of your application). Try this syntax and tell me if the performance improves.
I have to call a function func_test(spark,a,b) which accepts two string values and create a df out of it. spark is a SparkSession variable
These two string values are two columns of another dataframe and would be different for different rows of that dataframe.
I am unable to achieve this.
Things tried so far:
1.
ctry_df = func_test(spark, df.select("CTRY").first()["CTRY"],df.select("CITY").first()["CITY"])
Gives CTRY and CITY of only the first record of the df.
2.
ctry_df = func_test(spark, df['CTRY'],df['CITY'])
Gives Column<b'CTRY'> and Column<b'CITY'> as values.
Example:
df is:
+----------+----------+-----------+
| CTRY | CITY | XYZ |
+----------+----------+-----------+
| US | LA | HELLO|
| UK | LN | WORLD|
| SN | SN | SPARK|
+----------+----------+-----------+
So, I want first call to fetch func_test(spark,US,LA); second call to go func_test(spark,UK,LN); third call to be func_test(spark,SN,SN) and so on.
Pyspark - 3.7
Spark - 2.2
Edit 1:
Issue in detail:
func_test(spark,string1,string2) is a function which accepts two string values. Inside this function is a set of various dataframe operations done. For example:- First spark sql in the func_test is a normal select and these two variables string1 and string2 are used in the where clause. The result of this spark sql which generates a df is a temp table of next spark sql and so on. Finally, it creates a df which this function func_test(spark,string1,string2) returns.
Now, In the main class, I have to call this func_test and the two parameters string1 and string2 will be fetched from records of dataframe. So that, first func_test call generates query as select * from dummy where CTRY='US' and CITY='LA'. And the subsequent operations happen which results in df. Second call to func_test becomes select * from dummy where CTRY='UK' and CITY='LN'. Third call becomes select * from dummy where CTRY='SN' and CITY='SN' and so on.
instead of first() use collect() and iterate through the loop
collect_vals = df.select('CTRY','CITY').distinct().collect()
for row_col in collect_vals:
func_test(spark, row_col['CTRY'],row_col['CITY'])
hope this helps !!