Is it possible to select distinct and non-distinct columns in PySpark? [duplicate]

This question already has answers here:
How to get distinct rows in dataframe using pyspark?
(2 answers)
Closed 2 years ago.
I need to select 2 columns from a fact table (attached below). The problem I find is that for one of the columns I need unique values, while for the other one I'm happy to have them duplicated, as they belong to a specific ticket id.
Fact table used:
df = (
    spark.table(f'nn_table_{country}.fact_table')
    .filter(f.col('date_key').between(start_date, end_date))
    .filter(f.col('is_client_plus') == 1)
    .filter(f.col('source') == 'tickets')
    .filter(f.col('subtype') == 'item_pm')
    .filter(f.col('external_id').isin('DISC0000077144', 'DISC0000076895'))
    .filter(f.col('external_id').isNotNull())
    .select('customer_id', 'external_id').distinct()
    # .join(dim_promotions, 'external_id', 'left')
)
display(df)
As you can see, the select statement returns both a customer_id and an external_id column, but I'm only interested in getting unique customer_id values.
.select('customer_id','external_id').distinct()
Desired output:
customer_id external_id
77000000505097070 DISC0000077144
77000002294023644 DISC0000077144
77000000385346302 DISC0000076895
77000000291101490 DISC0000076895
Any idea how to do that, or whether it's possible at all?
Thanks in advance!

Use dropDuplicates, passing the column(s) to deduplicate on:
df.select('customer_id', 'external_id').dropDuplicates(['customer_id'])
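The difference between distinct() and dropDuplicates(['customer_id']) is that the latter keeps the first row encountered for each distinct customer_id, regardless of the other columns. A minimal pure-Python sketch of that semantics (plain tuples stand in for DataFrame rows, and the sample values are taken from the question; Spark itself is not required to see the idea):

```python
def drop_duplicates_by_key(rows, key_index):
    """Keep the first row seen for each distinct key value,
    mimicking DataFrame.dropDuplicates([key])."""
    seen = set()
    out = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    ("77000000505097070", "DISC0000077144"),
    ("77000000505097070", "DISC0000077144"),  # duplicate customer_id, dropped
    ("77000002294023644", "DISC0000077144"),
    ("77000000385346302", "DISC0000076895"),
]

deduped = drop_duplicates_by_key(rows, key_index=0)
```

Note that in Spark, "first row" is non-deterministic unless the DataFrame is ordered first, so which external_id survives for a duplicated customer_id is not guaranteed.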


I want to collect duplicated values in SQL and convert them to a number [duplicate]

This question already has answers here:
How to use count and group by at the same select statement
(11 answers)
Closed 6 months ago.
I want to count a certain value in a column and output it as a number.
Here is an example:
id  job
1   police
2   police
3   ambulance
Now I want to count how many times a value appears in the "job" column and get the result as a number: since there are two entries with "police", the output for "police" should be 2, and since there is only one entry with "ambulance", its result should be 1.
Can anyone tell me how to write this as code?
I have searched a lot on the internet and tried things myself, but I have found nothing that worked.
You're saying you want to count how many of each type of job there is, right?
SELECT COUNT(*), job
FROM tablename
GROUP BY job
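The query above can be sanity-checked against an in-memory SQLite database (the table name `jobs` and the Python harness are illustrative; the SQL itself is portable):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER, job TEXT)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)",
                 [(1, "police"), (2, "police"), (3, "ambulance")])

# COUNT(*) with GROUP BY yields one row per distinct job value
counts = dict(conn.execute(
    "SELECT job, COUNT(*) FROM jobs GROUP BY job").fetchall())
```

Here `counts` maps each job to its row count, i.e. police → 2 and ambulance → 1.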

Concatenate distinct values in a group by [duplicate]

This question already has answers here:
How to concatenate strings of a string field in a PostgreSQL 'group by' query?
(14 answers)
Closed last year.
I have data like this:
Group Provider
A ABC
A DEF
B DEF
B HIJ
And I want to transform the data like this:
Group ProviderList
A ABC, DEF
B DEF, HIJ
I was trying something like this, using CONCAT with a SELECT DISTINCT, but I'm not sure it is the best approach:
SELECT distinct
group,
CONCAT(select distinct provider from data)
FROM data
GROUP BY 1
What Laurenz meant with string_agg() is the following (note that group is a reserved word and has to be quoted):
SELECT
    "group",
    STRING_AGG(provider, ',') AS ProviderList
FROM data
GROUP BY 1
Optionally, you can also fix the order of the concatenated values:
STRING_AGG(provider, ',' ORDER BY provider)
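SQLite's GROUP_CONCAT behaves much like PostgreSQL's STRING_AGG, so the grouping logic can be checked without a Postgres server (table and column names follow the question; the Python wrapper is just a harness):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "group" is a reserved word, so it is quoted in the DDL and the query
conn.execute('CREATE TABLE data ("group" TEXT, provider TEXT)')
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [("A", "ABC"), ("A", "DEF"), ("B", "DEF"), ("B", "HIJ")])

rows = conn.execute("""
    SELECT "group", GROUP_CONCAT(provider, ',')
    FROM data
    GROUP BY "group"
    ORDER BY "group"
""").fetchall()
```

One caveat: unlike STRING_AGG in Postgres, SQLite's GROUP_CONCAT does not accept an ORDER BY inside the aggregate, so the order of the concatenated values is not guaranteed.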

How to select multiple rows with multiple values in pandas [duplicate]

This question already has answers here:
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 3 years ago.
I have a dataframe and a list as follows.
id title description
0 17810732 "nn nn." "nnnn nnnn"
1 17810731 "mm mm." "mmmm mmmm"
2 17810739 "ll ll." "llll llll"
3 17810738 "jj jj." "jjjj jjjj"
ids = [17810738, 17810731]
I want to get a dataframe that only has rows corresponding to the ids list.
So my output should be as follows.
id title description
0 17810738 "jj jj." "jjjj jjjj"
1 17810731 "mm mm." "mmmm mmmm"
I have been using this code.
for id in ids:
print(df.loc[df["id"] == id])
However, it only returns a separate dataframe for each id, which is not what I need.
I am happy to provide more details if needed.
The solution is the isin method, so
df[df['id'].isin(ids)]
will do the trick.
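A self-contained version of that approach, using the ids from the question (the title values are shortened and the description column is omitted for brevity; pandas is assumed to be installed):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [17810732, 17810731, 17810739, 17810738],
    "title": ["nn nn.", "mm mm.", "ll ll.", "jj jj."],
})
ids = [17810738, 17810731]

# isin builds a boolean mask: True where id is a member of ids,
# so indexing with it keeps exactly the matching rows in one dataframe
subset = df[df["id"].isin(ids)]
```

Note that the result preserves the original row order of df, not the order of ids; if you need the output sorted by the ids list, reindex afterwards.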

How to combine multiple rows into one line [duplicate]

This question already has answers here:
SQL Query to concatenate column values from multiple rows in Oracle
(10 answers)
Closed 4 years ago.
I have a very simple query:
select date,route,employee
from information
where date=Trunc(Sysdate)
however, since more than one employee can be assigned to a route, the query returns multiple rows for those routes.
I want one row per route instead, with the employee names combined in the same row and separated by "|". How can I achieve this in PL/SQL?
You can use the LISTAGG function, but you have to group by date and route as well, so that each route keeps its own row:
SELECT date "Date",
       route "Route",
       LISTAGG(employee, ' | ') WITHIN GROUP (ORDER BY employee) "Emp"
FROM information
WHERE date = Trunc(Sysdate)
GROUP BY date, route;
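LISTAGG is Oracle-specific, but the same group-and-concatenate logic can be checked in SQLite, whose GROUP_CONCAT plays the analogous role (the sample rows and employee names below are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE information ("date" TEXT, route TEXT, employee TEXT)')
conn.executemany("INSERT INTO information VALUES (?, ?, ?)",
                 [("2024-01-01", "R1", "Alice"),
                  ("2024-01-01", "R1", "Bob"),
                  ("2024-01-01", "R2", "Carol")])

# One row per (date, route); employees on the same route are joined with ' | '
rows = conn.execute("""
    SELECT "date", route, GROUP_CONCAT(employee, ' | ')
    FROM information
    GROUP BY "date", route
    ORDER BY route
""").fetchall()
```

Route R1 collapses to a single row containing both of its employees, while R2 keeps its lone employee, which is exactly the shape the question asks for.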

Select a range of numbers not in a table in SQL [duplicate]

This question already has answers here:
What is the best way to create and populate a numbers table?
(12 answers)
Closed 4 years ago.
I am wondering if it is possible to write a query that takes a range of numbers (in this case 8 to 17), compares it against a column in a table, removes the numbers that do appear in the table, and returns the rest.
I assume the pseudocode would look something like
Select nums from range(8-17) where nums not in (select column from table)
Is this possible at all?
Edit
To clarify my question.
In table I might have the following:
Intnumber
9
10
16
I would like the numbers between 8 and 17 that do not appear in this table, i.e. 8, 11, 12, 13, 14, 15, 17.
Kind regards
Matt
The table alone cannot produce numbers it does not contain, so you need a row source for the range first. In PostgreSQL, for example:
select n as nums
from generate_series(8, 17) as n
where n not in (select intnumber from your_table);
(In Oracle you can generate the range with CONNECT BY LEVEL instead.)
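What the question asks for, generate 8–17 and subtract whatever the table contains, can also be expressed with a recursive CTE and verified in SQLite (the table name `t` is illustrative; the data matches the question's example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (intnumber INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(9,), (10,), (16,)])

# nums(n) recursively generates 8..17; NOT IN filters out table values
missing = [row[0] for row in conn.execute("""
    WITH RECURSIVE nums(n) AS (
        SELECT 8 UNION ALL SELECT n + 1 FROM nums WHERE n < 17
    )
    SELECT n FROM nums
    WHERE n NOT IN (SELECT intnumber FROM t)
    ORDER BY n
""")]
```

With 9, 10, and 16 present in the table, this yields 8, 11, 12, 13, 14, 15, 17, exactly the desired output.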