Concatenate arbitrarily long list of matches in SQL subquery - sql

Imagine 2 tables (a rather contrived example, but for the sake of simplicity, here you go):
words
  word_id
letters
  letter
  word_id
How can I select all words while selecting all letters that belong to a word and concatenating them into said word? It is important that the letters are returned in the order they appear in the table: the letters of different words may be interleaved, but their relative order within each word is correct.
words:
|word_id|
+-------+
|      1|
|      2|

letters:
|word_id|letter|
+-------+------+
|      1|     H|
|      2|     B|
|      2|     Y|
|      1|     I|
|      2|     E|
should return
|word_id|word|
+-------+----+
|      1|  HI|
|      2| BYE|
Any way to accomplish this in pure SQL?

Try this:
SELECT word_id, group_concat(letter, '') FROM letters GROUP BY word_id;
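Note that GROUP_CONCAT on its own gives no guarantee about the order in which the letters are concatenated; in MySQL you could write GROUP_CONCAT(letter ORDER BY ... SEPARATOR ''), but that needs an explicit ordering column, which the schema above doesn't have. As a quick, hedged sanity check, here is a minimal Python sketch using the built-in sqlite3 module (SQLite accepts group_concat(X, Y) with a custom separator); it relies on insertion order for illustration only:
import sqlite3

# Rebuild the sample "letters" table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE letters (word_id INTEGER, letter TEXT);
    INSERT INTO letters (word_id, letter) VALUES
        (1, 'H'), (2, 'B'), (2, 'Y'), (1, 'I'), (2, 'E');
""")

# Same query as in the answer; the concatenation order is not guaranteed in general.
for word_id, word in conn.execute(
        "SELECT word_id, group_concat(letter, '') FROM letters GROUP BY word_id"):
    print(word_id, word)   # expected: 1 HI / 2 BYE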

Related

Number of foods that scored "true" in being good, grouped by culture SQL

Okay, so I've been driving myself crazy trying to get this to display in SQL. I have a table that stores types of food, the culture they come from, a score, and a boolean value about whether or not they are good. I want to display a record of how many "goods" each culture racks up. Here's the table (don't ask about the database name):
So I've tried:
SELECT count(good = 1), culture FROM animals_db.foods group by culture;
Or
SELECT count(good = true), culture FROM animals_db.foods group by culture;
But it doesn't give the correct results; it seems to count every row that has any "good" value at all, whether 1 or 0.
How do I get the data I want?
Instead of count, use sum:
SELECT sum(good), culture FROM animals_db.foods group by culture; -- assumes the good column has an integer type, with good represented as 1 and otherwise 0
Or, the other way is to use count with a conditional expression:
select count(case when good=1 then 1 end), culture from animals_db.foods group by culture;
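Both variants can be sanity-checked quickly with Python's built-in sqlite3 module. The table definition and rows below are made-up sample data (the original table isn't shown in the question), so treat this only as a sketch:
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; the real foods table is not shown in the question.
conn.executescript("""
    CREATE TABLE foods (food TEXT, culture TEXT, score INTEGER, good INTEGER);
    INSERT INTO foods VALUES
        ('pizza',  'Italian',  9, 1),
        ('pasta',  'Italian',  8, 1),
        ('burger', 'American', 5, 0),
        ('bbq',    'American', 7, 1);
""")

# sum works because good is stored as 0/1.
print(conn.execute(
    "SELECT culture, SUM(good) FROM foods GROUP BY culture").fetchall())

# The conditional count gives the same result: the CASE yields NULL for
# good=0 rows, and COUNT ignores NULLs.
print(conn.execute(
    "SELECT culture, COUNT(CASE WHEN good = 1 THEN 1 END) "
    "FROM foods GROUP BY culture").fetchall())
# Expected for this sample data (order may vary): [('American', 1), ('Italian', 2)]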
If the purpose is to count the number of good=1 for each culture, this works:
select culture,
       count(*)
from foods
where good = 1
group by 1
order by 1;
Result:
culture |count(*)|
--------+--------+
        |       1|
American|       1|
Chinese |       1|
European|       1|
Italian |       2|
The reason your first query doesn't return the expected result can be explained as follows:
select culture,
       good=1 as is_good
from foods
order by 1;
You actually get:
culture |is_good|
--------+-------+
        |      1|
American|      0|
American|      1|
Chinese |      1|
European|      1|
French  |      0|
French  |      0|
German  |      0|
Italian |      1|
Italian |      1|
After applying group by culture and count(good=1), you're actually counting the number of NOT NULL values of the expression good=1. For example:
select culture,
       count(good=0) as c0,
       count(good=1) as c1,
       count(good=2) as c2,
       count(good)   as c3,
       count(null)   as c4
from foods
group by culture
order by culture;
Outcome:
culture |c0|c1|c2|c3|c4|
--------+--+--+--+--+--+
        | 1| 1| 1| 1| 0|
American| 2| 2| 2| 2| 0|
Chinese | 1| 1| 1| 1| 0|
European| 1| 1| 1| 1| 0|
French  | 2| 2| 2| 2| 0|
German  | 1| 1| 1| 1| 0|
Italian | 2| 2| 2| 2| 0|
Update: This is similar to your question: Is it possible to specify condition in Count()?

Compare columns from two different dataframes based on id

I have two dataframes to compare; the order of records is different, and the column names might be different. I have to compare columns (more than one) based on the unique key (id).
Example: consider dataframes df1 and df2:
df1:
+---+-------+-----+
| id|student|marks|
+---+-------+-----+
| 1| Vijay| 23|
| 4| Vithal| 24|
| 2| Ram| 21|
| 3| Rahul| 25|
+---+-------+-----+
df2:
+-----+--------+------+
|newId|student1|marks1|
+-----+--------+------+
| 3| Rahul| 25|
| 2| Ram| 23|
| 1| Vijay| 23|
| 4| Vithal| 24|
+-----+--------+------+
Here, based on id and newId, I need to compare the student and marks values, and check whether the student with the same id has the same name and marks.
In this example, the student with id 2 has 21 marks in df1 but 23 marks in df2.
df1.exceptAll(df2).show()
// +---+-------+-----+
// | id|student|marks|
// +---+-------+-----+
// | 2| Ram| 21|
// +---+-------+-----+
I think diff will give the result you are looking for.
scala> df1.diff(df2)
res0: Seq[org.apache.spark.sql.Row] = List([2,Ram,21])
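As a complementary approach (a hedged PySpark sketch, not taken from the answers above), you can also join the two dataframes on the key and flag the columns whose values differ, which tells you which column mismatched:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same sample data as in the question.
df1 = spark.createDataFrame(
    [(1, "Vijay", 23), (4, "Vithal", 24), (2, "Ram", 21), (3, "Rahul", 25)],
    ["id", "student", "marks"])
df2 = spark.createDataFrame(
    [(3, "Rahul", 25), (2, "Ram", 23), (1, "Vijay", 23), (4, "Vithal", 24)],
    ["newId", "student1", "marks1"])

# Join on the key, compare each pair of columns, and keep only the mismatches.
(df1.join(df2, df1.id == df2.newId, "inner")
    .withColumn("name_matches", F.col("student") == F.col("student1"))
    .withColumn("marks_match", F.col("marks") == F.col("marks1"))
    .filter(~F.col("name_matches") | ~F.col("marks_match"))
    .show())
# Expected: only the row for id 2, where marks 21 != 23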

PySpark Dataframe: Unify Certain Rows

I'm having some trouble figuring this one out
Here's a simple example:
+---+----+-----+
| Id|Rank|State|
+---+----+-----+
|  a|   5|   NJ|
|  a|   7|   GA|
|  b|   8|   CA|
|  b|   1|   CA|
+---+----+-----+
I'd like to format this dataframe so that if the same Id appears in multiple states, only one state is kept. In this example, any row with Id "a" should have state "NJ" instead of "NJ" and "GA".
The result should be something like:
+---+----+-----+
| Id|Rank|State|
+---+----+-----+
|  a|   5|   NJ|
|  a|   7|   NJ|
|  b|   8|   CA|
|  b|   1|   CA|
+---+----+-----+
How can this be accomplished? Thanks!!
Try the first() window function, like:
from pyspark.sql import Window
from pyspark.sql.functions import col, first
w = Window.partitionBy("Id").orderBy("Rank")
df.select(col("Id"), col("Rank"), first("State", True).over(w).alias("NewState"))
This will put into the "NewState" column the first state according to the rank within the Id group.
The same thing can easily be expressed in pure SQL, if you want to use it.
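For reference, here is a hedged sketch of that pure-SQL version, assuming the dataframe has been registered as a temp view (the view name t is made up for this example):
# Hypothetical pure-SQL version of the window above.
df.createOrReplaceTempView("t")
spark.sql("""
    SELECT Id, `Rank`,
           first_value(State) OVER (PARTITION BY Id ORDER BY `Rank`) AS NewState
    FROM t
""").show()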
BTW, welcome to the StackOverflow community!

Spark SQL: Is there a way to distinguish columns with same name?

I have a CSV with a header that contains columns with the same name.
I want to process it with Spark using only SQL and be able to refer to these columns unambiguously.
Ex.:
id name age height name
1 Alex 23 1.70
2 Joseph 24 1.89
I want to get only the first name column, using only Spark SQL.
As mentioned in the comments, I think that the less error-prone method would be to change the schema of the input data.
Yet, in case you are looking for a quick workaround, you can simply index the duplicated column names.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
  .select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's assume I know that only id is duplicated. If we don't know, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
  if (n == "id") {
    i += 1
    s"id_$i"
  } else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+
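In case a PySpark version of the same rename-by-index trick is useful, here is a hedged sketch (it assumes a PySpark dataframe df with duplicated column names, like the one above):
from collections import Counter

counts = Counter()
new_names = []
for n in df.columns:
    if df.columns.count(n) > 1:               # this name occurs more than once
        new_names.append(f"{n}_{counts[n]}")  # e.g. id -> id_0, id_1, id_2
        counts[n] += 1
    else:
        new_names.append(n)

new_df = df.toDF(*new_names)
new_df.show()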

Add aggregated columns to pivot without join

Considering the table:
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df.show()
+---+-----+---------+
| id|error|timestamp|
+---+-----+---------+
| 1| 1| 1|
| 5| 0| 2|
| 27| 1| 1|
| 1| 0| 3|
| 5| 1| 1|
| 1| 0| 2|
+---+-----+---------+
I would like to make a pivot on the timestamp column while keeping some other aggregated information from the original table. The result I am interested in can be achieved by
import pyspark.sql.functions as sf

df1=df.groupBy('id').agg(sf.sum('error').alias('Ne'),sf.count('*').alias('cnt'))
df2=df.groupBy('id').pivot('timestamp').agg(sf.count('*')).fillna(0)
df1.join(df2, on='id').filter(sf.col('cnt')>1).show()
with the resulting table:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
However, there are at least two issues with this solution:
I am filtering by cnt at the end of the script. If I were able to do this at the beginning, I could avoid almost all of the processing, because a large portion of the data is removed by this filter. Is there any way to do this, other than the collect and isin methods?
I am doing groupBy on id twice: first to aggregate the columns I need in the results, and a second time to get the pivot columns. Finally, I need a join to merge these columns. I feel that I am surely missing a solution, because it should be possible to do this with just one groupBy and without a join, but I cannot figure out how.
I think you cannot get around the join, because the pivot needs the timestamp values and the first grouping should not consider them. So to create the Ne and cnt values you have to group the dataframe only by id, which loses the timestamp; if you want to preserve those values in columns, you have to do the pivot separately, as you did, and join it back.
The only improvement that can be made is to move the filter to the df1 creation. As you said, this could already improve performance, since df1 should be much smaller after filtering on your real data.
from pyspark.sql.functions import *
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df1=df.groupBy('id').agg(sum('error').alias('Ne'),count('*').alias('cnt')).filter(col('cnt')>1)
df2=df.groupBy('id').pivot('timestamp').agg(count('*')).fillna(0)
df1.join(df2, on='id').show()
Output:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
Actually, it is indeed possible to avoid the join by using Window, as follows:
from pyspark.sql import Window
import pyspark.sql.functions as sf

w1 = Window.partitionBy('id')
w2 = Window.partitionBy('id', 'timestamp')
df.select('id', 'timestamp',
          sf.sum('error').over(w1).alias('Ne'),
          sf.count('*').over(w1).alias('cnt'),
          sf.count('*').over(w2).alias('cnt_2')
  ).filter(sf.col('cnt')>1) \
  .groupBy('id', 'Ne', 'cnt').pivot('timestamp').agg(sf.first('cnt_2')).fillna(0).show()
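With the sample dataframe from the question, this should reproduce the same two rows as the join-based version above (row order may differ):
+---+---+---+---+---+---+
| id| Ne|cnt|  1|  2|  3|
+---+---+---+---+---+---+
|  1|  1|  3|  1|  1|  1|
|  5|  1|  2|  1|  1|  0|
+---+---+---+---+---+---+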