Apache Spark: asc not working as expected - apache-spark-3.0

I have the following code:
df.orderBy(expr("COUNTRY_NAME").desc, expr("count").asc).show()
I expect the count column to be arranged in ascending order for a given COUNTRY_NAME, but in the output (screenshot attached) the last value of 12 is not as expected.
Why is that?

If you output df.printSchema(), you'll see that your "count" column is of string type, resulting in an undesired lexicographic (alphanumeric) sort.
In PySpark, you can cast the column and sort on both keys in a single orderBy (chaining two orderBy calls would discard the first sort):
df = df.withColumn('count', df['count'].cast('int'))
df.orderBy(['COUNTRY_NAME', 'count'], ascending=[False, True]).show()
You should create and apply your schema when the data is read in - if possible.
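Spark isn't needed to see the underlying problem; plain Python strings sort the same lexicographic way, and casting to int restores numeric order:

```python
# Strings sort lexicographically, so "12" sorts before "2":
counts_as_strings = ["2", "10", "12", "1"]
print(sorted(counts_as_strings))                    # ['1', '10', '12', '2']

# Casting to int gives the expected numeric order:
print(sorted(int(c) for c in counts_as_strings))    # [1, 2, 10, 12]
```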

Related

How can I change the format of a result from cursor.execute?

I am using a Jupyter notebook to run some SQL queries. I have ordered my SQL table by ascending time and I want the first time (i.e. the first entry). The SQL query I have is:
cur.execute("SELECT s_time FROM table_1 ORDER BY s_time ASC fetch FIRST 1 ROWS ONLY")
I get the result with the statement
start_time=cur.fetchall()
I need the result to be an int, but this gives me a list (the attached screenshot shows the fetchall() result).
I only want the number, so I am guessing I need to strip out the brackets, parentheses and comma, but I don't know how. How can I convert this?
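fetchall() returns a list of row tuples, so the value sits at [0][0]; fetchone()[0] is more direct. A minimal sketch using the stdlib sqlite3 module as a stand-in database (any DB-API cursor behaves the same way; sqlite uses LIMIT 1 in place of FETCH FIRST 1 ROWS ONLY):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table_1 (s_time INTEGER)")
cur.executemany("INSERT INTO table_1 VALUES (?)", [(30,), (10,), (20,)])

# fetchall() returns a list of tuples, e.g. [(10,)]
cur.execute("SELECT s_time FROM table_1 ORDER BY s_time ASC LIMIT 1")
rows = cur.fetchall()
start_time = rows[0][0]          # unwrap the list, then the tuple
print(start_time)                # 10

# fetchone() is more direct: one tuple (or None if there are no rows)
cur.execute("SELECT s_time FROM table_1 ORDER BY s_time ASC LIMIT 1")
start_time = cur.fetchone()[0]
print(start_time)                # 10
```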

How to get the data from a column based on name, not the index number

I have a dataframe with column abc having values like below
[{note=Part 3 of 4; Total = $11,000, cost=2750, startDate=2021-11-01T05:00:00Z+0000}]
Now I want to extract data based on the name; for example, I want to extract cost and startDate and create new columns.
I am asking for extraction by name because the order of these values might change.
I have tried the lines of code below, but because the data order changes, I get the wrong values.
df_mod = df_mod.withColumn('cost', split(df_mod['costs'], ',').getItem(1)) \
.withColumn('costStartdate', split(df_mod['costs'], ',').getItem(2))
That's because your data is not simply comma-separated; it just looks like it (the note field itself contains a comma in "$11,000"). You'll want to use regexp_extract to find the correct content by key name.
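The pattern logic can be checked with Python's re module before handing it to Spark's regexp_extract (the patterns below are illustrative sketches, keyed on the field names rather than positions):

```python
import re

s = "[{note=Part 3 of 4; Total = $11,000, cost=2750, startDate=2021-11-01T05:00:00Z+0000}]"

# Match by key name, not position, so reordered fields still work
cost = re.search(r"cost=(\d+)", s).group(1)
start = re.search(r"startDate=([^,}\]]+)", s).group(1)
print(cost)    # 2750
print(start)   # 2021-11-01T05:00:00Z+0000
```

In Spark the same patterns would go into regexp_extract(df_mod['costs'], pattern, 1) instead of split/getItem.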

Pandas Pivot table get max with column name

I have the following pivot table
I want to get the max value from each row, but also, I need to get the column it came from.
So far I know how to get the max of every row using this:
dff['State'] = stateRace.max(axis=1)
dff
I get this (screenshot): it returns the correct max value, but not the column it came from.
It is harder to help because you have supplied images rather than text and the question is not entirely clear. Happy to help further if the answer below doesn't work.
stateRace = stateRace.assign(
    max_value=stateRace.select_dtypes(exclude='object').max(axis=1),
    max_column=stateRace.select_dtypes(exclude='object').idxmax(axis=1))
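If the pandas terminology is the sticking point: max(axis=1) takes the largest value in each row, and idxmax(axis=1) returns the column label that value came from. In plain Python terms, for one pivot-table row stored as a dict (labels and numbers below are purely illustrative):

```python
# One pivot-table row: column label -> value (illustrative data)
row = {"White": 320, "Black": 150, "Hispanic": 210}

max_value = max(row.values())          # what .max(axis=1) yields for this row
max_column = max(row, key=row.get)     # what .idxmax(axis=1) yields for this row
print(max_value, max_column)           # 320 White
```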

SQL LIKE operator not working for comma-separated lists

Here is my data:
Column:
8
7,8
8,9,18
6,8,9
10,18
27,28
I only want rows that have an 8 in them. When I do:
Select *
from table
where column like '%8%'
I get all of the above since they contain an 8. When I do:
Select *
from table
where column like '%8%'
and column not like '%_8%'
I get:
8
8,9,18
I don't get 6,8,9, but I need it, since it has an 8 in it.
Can anyone help get the right results?
I would suggest the following:
SELECT *
FROM TABLE
WHERE column LIKE '%,8,%' OR column LIKE '%,8' OR column LIKE '8,%' OR Column='8';
But I must say, storing data like this is highly inefficient; indexing won't help here, for example. You should consider altering the way you store your data unless you have a really good reason to keep it this way.
Edit:
I highly recommend taking a look at @Bill Karwin's link in the question's comments:
Is storing a delimited list in a database column really that bad?
You could use:
WHERE ','+col+',' LIKE '%,8,%'
And the obligatory admonishment: avoid storing lists, bad bad, etc.
How about:
where
col like '8,%'
or col like '%,8,%'
or col like '%,8'
or col = '8'
But ideally, as bluefeet suggests, normalizing this data instead of storing as delimited text will save you all kinds of headaches.
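The delimiter-wrapping trick from the answers above can be verified directly. A sketch using the stdlib sqlite3 module (where string concatenation is || rather than +, so the wrapped test becomes ',' || col || ','):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (col TEXT)")
cur.executemany("INSERT INTO t VALUES (?)",
                [("8",), ("7,8",), ("8,9,18",), ("6,8,9",), ("10,18",), ("27,28",)])

# Wrap the column in commas so every element is delimiter-bounded,
# then a single LIKE '%,8,%' matches exactly the element "8"
cur.execute("SELECT col FROM t WHERE ',' || col || ',' LIKE '%,8,%'")
matches = [r[0] for r in cur.fetchall()]
print(matches)   # ['8', '7,8', '8,9,18', '6,8,9']
```

Note that 10,18 and 27,28 are correctly excluded: their 8s only appear inside 18 and 28.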

How to retrieve a part of a value in a column

Correction - I only need to pick the WORK value. Every result set in the column will contain comma-separated values like below:
"SICK 0.08, WORK 0.08" or "SICK 0.08,WORK 0.08"
I only need to pick WORK 0.08 from this.
I am quite new to SQL. I am using the following script to get some results:
select Work.Work_summary, Work.emp_id
from Work
All works fine, but the first column has values like the following:
WORK 08.57, SICK 08.56 (some rows)
SICK 07.80, WORK 06.80, OT 02.00 (some rows)
How can I retrieve only the WORK value? If there is no WORK value, the result should be empty.
select Work_summary, emp_id
from Work
where Work_summary like '%WORK%'
This will return the rows in the Work table where Work_summary column contains the word WORK. See the documentation for more details.
CONTAINS is faster than LIKE (note that it requires a full-text index, e.g. in SQL Server).
SELECT Work_summary, emp_id FROM Work WHERE CONTAINS(Work_summary, 'WORK');
Then use the following; it will return only the rows where Work_summary contains the word work:
select Work.Work_summary, Work.emp_id
from Work where contains(work.Work_summary ,'work');
select replace(Work_summary, ',', '') as work_summary
from Work
where upper(Work_summary) like '%WORK%'
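Note that the LIKE answers above only filter rows; to pull out just the WORK token itself, a regular expression is the simplest dialect-neutral tool (many databases expose similar regex functions, but the names vary by vendor). A sketch of the extraction logic in Python:

```python
import re

def work_part(summary):
    """Return the 'WORK <hours>' token from a summary, or '' if absent."""
    m = re.search(r"WORK\s+[\d.]+", summary)
    return m.group(0) if m else ""

print(work_part("WORK 08.57, SICK 08.56"))             # WORK 08.57
print(work_part("SICK 07.80, WORK 06.80 , OT 02.00"))  # WORK 06.80
print(work_part("SICK 08.56"))                         # (empty string)
```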