Suppose I have a dataframe like below:
df:
+---+---+-----+
| a | b | c   |
+---+---+-----+
| 1 | 1 | 0.2 |
| 1 | 2 | 0.3 |
| 1 | 3 | 0.4 |
| 1 | 4 | 0.5 |
| 1 | 5 | 0.2 |
+---+---+-----+
How do I get the value of c where a = 1 and b = 2?
val resDF = df.filter(col("a").equalTo(1)).filter(col("b").equalTo(2))           // rows matching both conditions
val col_c = df.filter(col("a").equalTo(1).and(col("b").equalTo(2))).select("c")  // same filter, keeping only column c
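The snippets above are Scala. If you also want to pull the value of c out as a scalar rather than keep a one-row DataFrame, here is a minimal PySpark sketch of the same idea (only an illustration; it assumes a SparkSession and rebuilds the sample data):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, 0.2), (1, 2, 0.3), (1, 3, 0.4), (1, 4, 0.5), (1, 5, 0.2)],
    ["a", "b", "c"],
)

# Combine both conditions in one filter, keep only column c,
# then take the first Row and index into it to get the scalar.
row = df.filter((F.col("a") == 1) & (F.col("b") == 2)).select("c").first()
c_value = row["c"] if row is not None else None
print(c_value)  # 0.3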
The goal of my question is to understand why this happens and whether it is defined behaviour. I need to know this to design my unit tests in a predictable way. I do not want or need to change that behaviour or work around it.
Here is the initial data: on the left the complete frame, and on the right just the rows where ID.eq(1). The order is unchanged, as you can see from the index and the val column.
| | ID | val | | | ID | val |
|---:|-----:|:------| |---:|-----:|:------|
| 0 | 1 | A | | 0 | 1 | A |
| 1 | 2 | B | | 3 | 1 | x |
| 2 | 9 | C | | 4 | 1 | R |
| 3 | 1 | x | | 6 | 1 | G |
| 4 | 1 | R | | 9 | 1 | a |
| 5 | 4 | F | | 12 | 1 | d |
| 6 | 1 | G | | 13 | 1 | e |
| 7 | 9 | H |
| 8 | 4 | I |
| 9 | 1 | a |
| 10 | 2 | b |
| 11 | 9 | c |
| 12 | 1 | d |
| 13 | 1 | e |
| 14 | 4 | f |
| 15 | 2 | g |
| 16 | 9 | h |
| 17 | 9 | i |
| 18 | 4 | X |
| 19 | 5 | Y |
This right table is also the result I would have expected after sorting. But when I sort by ID, the order of the rows inside the subgroups (e.g. ID.eq(1)) is modified. Why is that?
This is the unexpected result:
| | ID | val |
|---:|-----:|:------|
| 0 | 1 | A |
| 13 | 1 | e |
| 12 | 1 | d |
| 6 | 1 | G |
| 9 | 1 | a |
| 3 | 1 | x |
| 4 | 1 | R |
Here is a full MWE:
#!/usr/bin/env python3
import pandas as pd
# initial data
df = pd.DataFrame(
    {
        'ID': [1, 2, 9, 1, 1, 4, 1, 9, 4, 1,
               2, 9, 1, 1, 4, 2, 9, 9, 4, 5],
        'val': list('ABCxRFGHIabcdefghiXY')
    }
)
print(df.to_markdown())
# only the group "1"
print(df.loc[df.ID.eq(1)].to_markdown())
# sort by 'ID'
df = df.sort_values('ID')
# only the group "1" (after sorting)
print(df.loc[df.ID.eq(1)].to_markdown())
As explained in the sort_values documentation, the stability of the sort is not always guaranteed depending on the chosen algorithm:
kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'
Choice of sorting algorithm. See also :func:`numpy.sort` for more
information. `mergesort` and `stable` are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
If you want to ensure a stable sort:
df.sort_values('ID', kind='stable')
output:
ID val
0 1 A
3 1 x
4 1 R
6 1 G
9 1 a
...
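As a self-contained check, re-running the question's MWE with kind='stable' keeps the original within-group order (index 0, 3, 4, 6, 9, 12, 13 for ID == 1):
import pandas as pd

df = pd.DataFrame(
    {
        'ID': [1, 2, 9, 1, 1, 4, 1, 9, 4, 1,
               2, 9, 1, 1, 4, 2, 9, 9, 4, 5],
        'val': list('ABCxRFGHIabcdefghiXY')
    }
)

# A stable sort preserves the relative order of rows that compare equal,
# so the ID == 1 rows keep their original index order after sorting.
df_stable = df.sort_values('ID', kind='stable')
print(df_stable.loc[df_stable.ID.eq(1)].to_markdown())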
While going through the rows, each time we find the text 'NEW' in the Calc column, the running count kept in the Results column should increment, starting from 1.
The output should look like this:
The following uses an id column to resolve the ordering issue; replace that with your corresponding expression. It also addresses the requirement to start the display sequence at 1 and to show 0 for the 'NEW' rows.
The SQL (updated):
SELECT logs.*
     , CASE WHEN text = 'NEW' THEN 0
            ELSE COALESCE(SUM(CASE WHEN text = 'NEW' THEN 1 END)
                              OVER (PARTITION BY xrank ORDER BY id) + 1, 1)
       END AS display
FROM logs
ORDER BY id
The result:
+----+-------+------+---------+
| id | xrank | text | display |
+----+-------+------+---------+
| 1 | 1 | A | 1 |
| 2 | 1 | B | 1 |
| 3 | 1 | C | 1 |
| 4 | 1 | NEW | 0 |
| 5 | 1 | D | 2 |
| 6 | 1 | Q | 2 |
| 7 | 1 | B | 2 |
| 8 | 1 | NEW | 0 |
| 9 | 1 | D | 3 |
| 10 | 1 | Z | 3 |
| 11 | 2 | A | 1 |
| 12 | 2 | B | 1 |
| 13 | 2 | C | 1 |
| 14 | 2 | NEW | 0 |
| 15 | 2 | D | 2 |
| 16 | 2 | Q | 2 |
| 17 | 2 | B | 2 |
| 18 | 2 | NEW | 0 |
| 19 | 2 | D | 3 |
| 20 | 2 | Z | 3 |
+----+-------+------+---------+
You need a column that specifies the ordering for the table. With that, just use a cumulative sum:
select t.*,
       1 + sum(case when Calc = 'NEW' then 1 else 0 end) over (partition by Rank_Id order by Seq) as display
from t;
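For comparison, the same cumulative-sum idea can be sketched in pandas; this is only an illustration and assumes a frame shaped like the result table above (id, xrank, text):
import pandas as pd

# Hypothetical frame mirroring the result table above (two xrank partitions).
df = pd.DataFrame({
    'id': range(1, 21),
    'xrank': [1] * 10 + [2] * 10,
    'text': ['A', 'B', 'C', 'NEW', 'D', 'Q', 'B', 'NEW', 'D', 'Z'] * 2,
})

df = df.sort_values(['xrank', 'id'])
is_new = df['text'].eq('NEW')

# Running count of 'NEW' markers per xrank, shifted so counting starts at 1,
# then zeroed out on the 'NEW' rows themselves.
df['display'] = is_new.astype(int).groupby(df['xrank']).cumsum() + 1
df.loc[is_new, 'display'] = 0
print(df)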
I have two Python pandas dataframes; in simplified form they look like this:
DF1
+---------+------+
| Date 1 | Item |
+---------+------+
| 1991-08 | A |
| 1992-08 | A |
| 1997-02 | B |
| 1998-03 | C |
| 1999-02 | D |
| 1999-02 | D |
+---------+------+
DF2
+---------+------+
| Date 2 | Item |
+---------+------+
| 1993-08 | A |
| 1993-09 | B |
| 1997-01 | C |
| 1999-03 | D |
| 2000-02 | E |
| 2001-03 | F |
+---------+------+
I want to count, for each Item in DF2, how many times it appears in DF1 with a DF1 date that is earlier than the DF2 date.
Desired Output
+---------+------+-------+
| Date 2 | Item | Count |
+---------+------+-------+
| 1993-08 | A | 2 |
| 1993-09 | B | 0 |
| 1997-01 | C | 0 |
| 1999-03 | D | 2 |
| 2000-02 | E | 0 |
| 2001-03 | F | 0 |
+---------+------+-------+
I'd appreciate any comments and feedback; thanks in advance.
Let's merge the two frames on Item (which pairs every DF1 date with every DF2 date for that item), filter to the rows where the DF1 date is earlier, then use value_counts and map the counts back to your dataframe:
df_c = df1.merge(df2, on='Item')
df_c = df_c[df_c['Date 1'] < df_c['Date 2']]
df2['Count'] = df2['Item'].map(df_c['Item'].value_counts()).fillna(0)
print(df2)
Output:
Date 2 Item Count
0 1993-08 A 2.0
1 1993-09 B 0.0 # Note, I get no counts for B
2 1997-01 C 0.0
3 1999-03 D 2.0
4 2000-02 E 0.0
5 2001-03 F 0.0
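If you want the integer counts shown in the desired output rather than the floats above, you can cast after filling the missing values (a small tweak to the last line of the snippet):
df2['Count'] = (
    df2['Item']
    .map(df_c['Item'].value_counts())
    .fillna(0)
    .astype(int)   # the desired output shows whole numbers, not floats
)
print(df2)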
I want to calculate the temperature difference at the same time between two cities. The data structure looks as follows:
dic = {'city':['a','a','a','a','a','b','b','b','b','b'],'week':[1,2,3,4,5,3,4,5,6,7],'temp':[20,21,23,21,25,20,21,24,21,22]}
df = pd.DataFrame(dic)
df
+------+------+------+
| city | week | temp |
+------+------+------+
| a | 1 | 20 |
| a | 2 | 21 |
| a | 3 | 23 |
| a | 4 | 21 |
| a | 5 | 25 |
| b | 3 | 20 |
| b | 4 | 21 |
| b | 5 | 24 |
| b | 6 | 21 |
| b | 7 | 22 |
+------+------+------+
I would like to calculate the difference in temperature between city a and b at week 3, 4, and 5. The final data structure should look as follows:
+--------+--------+------+------+
| city_1 | city_2 | week | diff |
+--------+--------+------+------+
| a      | b      | 3    | 3    |
| a      | b      | 4    | 0    |
| a      | b      | 5    | 1    |
+--------+--------+------+------+
I would pivot your data, drop the NA values, and do the subtraction directly. This way you can keep the source temperatures associated with each city.
result = (
df.pivot(index='week', columns='city', values='temp')
.dropna(how='any', axis='index')
.assign(diff=lambda df: df['a'] - df['b'])
)
print(result)
city a b diff
week
3 23.0 20.0 3.0
4 21.0 21.0 0.0
5 25.0 24.0 1.0
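If you specifically want the layout from the question, with the two city labels spelled out as columns, the pivoted result can be reshaped; the city_1/city_2 values here are just the fixed pair from this example:
out = (
    result.reset_index()                    # bring 'week' back as a column
          .assign(city_1='a', city_2='b')   # the fixed pair of cities in this example
          [['city_1', 'city_2', 'week', 'diff']]
)
print(out)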
How can I add two columns together after grouping by a key from another column? For example, I have the following table:
+------+------+------+
| Col1 | Val1 | Val2 |
+------+------+------+
| 1 | 3 | 3 |
| 1 | 4 | 2 |
| 1 | 2 | 1 |
| 2 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 2 | 9 |
| 3 | 2 | 8 |
| 4 | 2 | 1 |
| 5 | 1 | 1 |
+------+------+------+
What I want to achieve is:
+------+----------------------+
| Col1 | Sum of Val1 and Val2 |
+------+----------------------+
| 1 | 15 |
| 2 | 5 |
| 3 | 21 |
| 4 | 3 |
| 5 | 2 |
+------+----------------------+
I can get the sum of each column grouped by Col1 and then add their results, but I am creating multiple columns in the process.
import pandas as pd
data = [[1, 3, 3], [1, 4, 2], [1, 2, 1], [2, 2, 0], [2, 3, 0], [3, 2, 9], [3, 2, 8],
        [4, 2, 1], [5, 1, 1]]
mydf = pd.DataFrame(data, columns=['Col1', 'Val1', 'Val2'])
print(mydf)
mydf['total1'] = mydf.groupby('Col1')['Val1'].transform('sum')
mydf['total2'] = mydf.groupby('Col1')['Val2'].transform('sum')
mydf['Sum of Val1 and Val2'] = mydf['total1'] + mydf['total2']
mydf = mydf.drop_duplicates('Col1')
print(mydf[['Col1', 'Sum of Val1 and Val2' ]])
Is there a shorter way to deal with this?
mydf.groupby('Col1').sum().sum(axis=1)
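That one-liner returns a Series indexed by Col1; building on the mydf frame from the question, you can name the values and reset the index to get the exact two-column frame asked for:
out = (
    mydf.groupby('Col1').sum().sum(axis=1)
        .reset_index(name='Sum of Val1 and Val2')
)
print(out)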
Use the following:
mydf['Sum of Val1 and Val2'] = mydf['Val1'] + mydf['Val2']
df = mydf.groupby('Col1')['Sum of Val1 and Val2'].sum().reset_index()
print(df)
Col1 Sum of Val1 and Val2
0 1 15
1 2 5
2 3 21
3 4 3
4 5 2