I have an array of tuples in a PySpark dataframe column, like below.
+-----------------------------------------------------------------------------------+
| targeting_values                                                                    |
+-----------------------------------------------------------------------------------+
| [('123', '123', '123'), ('abc', 'def', 'ghi'), ('jkl', 'mno', 'pqr'), (0, 1, 2)]   |
+-----------------------------------------------------------------------------------+
I want 4 different columns, with one tuple in each column, like below.
+----------------------+----------------------+-----------------------+--------------------+
| value1               | value2               | value3                | value4             |
+----------------------+----------------------+-----------------------+--------------------+
| ('123', '123', '123')| ('abc', 'def', 'ghi')| ('jkl', 'mno', 'pqr') | (0, 1, 2)          |
+----------------------+----------------------+-----------------------+--------------------+
I was trying to achieve this by using split(), but had no luck.
I did not find another way to solve this issue.
Is there a good way to do this?
You can do it by exploding the array, then pivoting it:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.sql.functions._

// first create the data:
val arrayStructData = Seq(
Row(List(Row("123", "123", "123"), Row("abc", "def", "ghi"), Row("jkl", "mno", "pqr"), Row("0", "1", "2"))),
Row(List(Row("456", "456", "456"), Row("qsd", "fgh", "hjk"), Row("aze", "rty", "uio"), Row("4", "5", "6")))
)
val arrayStructSchema = new StructType()
.add("targeting_values", ArrayType(new StructType()
.add("_1", StringType)
.add("_2", StringType)
.add("_3", StringType)))
val df = spark.createDataFrame(spark.sparkContext
.parallelize(arrayStructData), arrayStructSchema)
df.show(false)
+--------------------------------------------------------------+
|targeting_values |
+--------------------------------------------------------------+
|[{123, 123, 123}, {abc, def, ghi}, {jkl, mno, pqr}, {0, 1, 2}]|
|[{456, 456, 456}, {qsd, fgh, hjk}, {aze, rty, uio}, {4, 5, 6}]|
+--------------------------------------------------------------+
// Then a combination of explode, creating an id, then pivoting it like this:
df.withColumn("id2", monotonically_increasing_id())
.select(col("id2"), posexplode(col("targeting_values"))).withColumn("id", concat(lit("value"), col("pos") + 1))
.groupBy("id2").pivot("id").agg(first("col")).drop("id2")
.show(false)
+---------------+---------------+---------------+---------+
|value1 |value2 |value3 |value4 |
+---------------+---------------+---------------+---------+
|{123, 123, 123}|{abc, def, ghi}|{jkl, mno, pqr}|{0, 1, 2}|
|{456, 456, 456}|{qsd, fgh, hjk}|{aze, rty, uio}|{4, 5, 6}|
+---------------+---------------+---------------+---------+
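Since the question is about a PySpark dataframe, here is a rough Python equivalent of the same explode-then-pivot idea (a sketch, assuming the same targeting_values column as above):

from pyspark.sql import functions as F

(df.withColumn("id2", F.monotonically_increasing_id())
   .select("id2", F.posexplode("targeting_values"))                     # posexplode yields pos and col columns
   .withColumn("id", F.concat(F.lit("value"), (F.col("pos") + 1).cast("string")))
   .groupBy("id2")
   .pivot("id")
   .agg(F.first("col"))
   .drop("id2")
   .show(truncate=False))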
You can try this:
df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)])
This is my dataframe:
df = pd.DataFrame(
{
"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
"score": [3, 5, 6, 2, 4, 1],
}
)
I want to compare the score of bob_x with bob_y, and retain the row with the lowest, and do the same for jay_x and jay_y. No change is required for mad and joe.
You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
  name  score
0  bob      2
1  jay      4
2  joe      1
3  mad      5
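If you need to retain the entire row with the lowest score (as the question asks), rather than just the minimum value, one way to sketch it with the same split-on-underscore idea is to group on the base name and pick each group's idxmin:

import pandas as pd

df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
                   "score": [3, 5, 6, 2, 4, 1]})

# group on the base name, but keep the original 'name' of the winning row
base = df['name'].str.split('_').str[0]
out = df.loc[df.groupby(base)['score'].idxmin()]
print(out)
# roughly:
#     name  score
# 3  bob_y      2
# 4  jay_y      4
# 5    joe      1
# 1    mad      5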
I feel like there is a better way than this:
import pandas as pd
df = pd.DataFrame(
columns=" index c1 c2 v1 ".split(),
data= [
[ 0, "A", "X", 3, ],
[ 1, "A", "X", 5, ],
[ 2, "A", "Y", 7, ],
[ 3, "A", "Y", 1, ],
[ 4, "B", "X", 3, ],
[ 5, "B", "X", 1, ],
[ 6, "B", "X", 3, ],
[ 7, "B", "Y", 1, ],
[ 8, "C", "X", 7, ],
[ 9, "C", "Y", 4, ],
[ 10, "C", "Y", 1, ],
[ 11, "C", "Y", 6, ],]).set_index("index", drop=True)
def callback(x):
x['seq'] = range(1, x.shape[0] + 1)
return x
df = df.groupby(['c1', 'c2']).apply(callback)
print(df)
To achieve this:
c1 c2 v1 seq
0 A X 3 1
1 A X 5 2
2 A Y 7 1
3 A Y 1 2
4 B X 3 1
5 B X 1 2
6 B X 3 3
7 B Y 1 1
8 C X 7 1
9 C Y 4 1
10 C Y 1 2
11 C Y 6 3
Is there a way to do it that avoids the callback?
use cumcount(); see the GroupBy.cumcount docs
In [4]: df.groupby(['c1', 'c2']).cumcount()
Out[4]:
0 0
1 1
2 0
3 1
4 0
5 1
6 2
7 0
8 0
9 0
10 1
11 2
dtype: int64
If you want orderings starting at 1
In [5]: df.groupby(['c1', 'c2']).cumcount()+1
Out[5]:
0 1
1 2
2 1
3 2
4 1
5 2
6 3
7 1
8 1
9 1
10 2
11 3
dtype: int64
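To attach this as a column on the question's dataframe (matching the desired seq that starts at 1), you can assign it back directly; a minimal sketch, reusing the df built in the question above:

# df is the dataframe constructed in the question
df['seq'] = df.groupby(['c1', 'c2']).cumcount() + 1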
This might be useful
df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
it will create, for each userID, a sequence of its ItemID values joined by '->' in date order.
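A minimal, self-contained sketch with made-up data (the userID/date/ItemID column names come from this answer, not from the question's dataframe):

import pandas as pd

df = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02"],
    "ItemID": ["a", "b", "c", "x", "y"],
})

df = df.sort_values(['userID', 'date'])
grp = df.groupby('userID')['ItemID'].aggregate(lambda x: '->'.join(tuple(x))).reset_index()
print(grp)
# prints something like:
#    userID   ItemID
# 0       1  a->b->c
# 1       2     x->y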
If you have a dataframe similar to the one below and you want to add a seq column by building it from c1 or c2, i.e. keep a running count of similar values (or until a flag comes up) in other column(s), read on.
df = pd.DataFrame(
columns=" c1 c2 seq".split(),
data= [
[ "A", 1, 1 ],
[ "A1", 0, 2 ],
[ "A11", 0, 3 ],
[ "A111", 0, 4 ],
[ "B", 1, 1 ],
[ "B1", 0, 2 ],
[ "B111", 0, 3 ],
[ "C", 1, 1 ],
[ "C11", 0, 2 ] ])
then first find the group starters (str.contains() and eq() are used below, but any method that creates a boolean Series, such as lt(), ne(), isna(), etc., can be used) and call cumsum() on it to create a Series where each group has a unique identifying value. Then use that Series as the grouper in a groupby().cumcount() operation.
In summary, use code similar to the one below.
# build a grouper Series for similar values
groups = df['c1'].str.contains("A$|B$|C$").cumsum()
# or build a grouper Series from flags (1s)
groups = df['c2'].eq(1).cumsum()
# groupby using the above grouper
df['seq'] = df.groupby(groups).cumcount().add(1)
The cleanliness of Jeff's answer is nice, but I prefer to sort explicitly... though generally without overwriting my df for these types of use cases (e.g. Shaina Raza's answer).
So, to create a new column sequenced by 'v1' within each ('c1', 'c2') group:
df["seq"] = df.sort_values(by=['c1','c2','v1']).groupby(['c1','c2']).cumcount()
you can check with:
df.sort_values(by=['c1','c2','seq'])
or, if you want to overwrite the df, then:
df = df.sort_values(by=['c1','c2','seq']).reset_index()
I have an objects table and a lookup table. In the objects table, I'm looking to add the smallest value from the lookup table that is greater than the object's number.
I found this similar question but it's about finding a value greater than a constant, rather than changing for each row.
In code:
import pandas as pd
objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30}])
lookup = pd.DataFrame([{"number": 3}, {"number": 12}, {"number": 40}])
expected = pd.DataFrame(
[
{"id": 1, "number": 10, "smallest_greater": 12},
{"id": 2, "number": 30, "smallest_greater": 40},
]
)
First compare each value of lookup['number'] against objects['number'] broadcast into a 2D boolean mask, then take the cumulative sum along each row and compare it with 1 to keep only the first match per row, and use numpy.argmax to get that position so the matching value can be taken from lookup['number'].
The output is generated with numpy.where, which overwrites all non-matched rows with NaN.
objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30},
{"id": 3, "number": 100},{"id": 4, "number": 1}])
print (objects)
id number
0 1 10
1 2 30
2 3 100
3 4 1
import numpy as np

m1 = lookup['number'].values >= objects['number'].values[:, None]
m2 = np.cumsum(m1, axis=1) == 1
m3 = np.any(m1, axis=1)
out = lookup['number'].values[m2.argmax(axis=1)]
objects['smallest_greater'] = np.where(m3, out, np.nan)
print (objects)
id number smallest_greater
0 1 10 12.0
1 2 30 40.0
2 3 100 NaN
3 4 1 3.0
smallest_greater = []
for i in objects['number']:
    smallest_greater.append(lookup['number'][lookup[lookup['number'] > i].sort_values(by='number').index[0]])
objects['smallest_greater'] = smallest_greater
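For comparison, the same "smallest greater value" lookup can also be sketched with pd.merge_asof, which finds the nearest key in a sorted lookup table (allow_exact_matches=False makes the match strictly greater):

import pandas as pd

objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30}])
lookup = pd.DataFrame([{"number": 3}, {"number": 12}, {"number": 40}])

out = pd.merge_asof(
    objects.sort_values("number"),
    lookup.rename(columns={"number": "smallest_greater"}).sort_values("smallest_greater"),
    left_on="number",
    right_on="smallest_greater",
    direction="forward",        # nearest value at or above 'number'
    allow_exact_matches=False,  # require strictly greater
)
print(out)
# roughly:
#    id  number  smallest_greater
# 0   1      10                12
# 1   2      30                40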
I have a Postgres 9.6 table with a JSONB column:
> SELECT id, data FROM my_table ORDER BY id LIMIT 4;
id | data
----+---------------------------------------
1 | {"a": [1, 7], "b": null, "c": [8]}
2 | {"a": [2, 9], "b": [1], "c": null}
3 | {"a": [8, 9], "b": null, "c": [3, 4]}
4 | {}
As you can see, some JSON keys have null values.
I'd like to exclude these - is there an easy way to SELECT only the non-null key-value pairs to produce:
id | data
----+---------------------------------------
1 | {"a": [1, 7], "c": [8]}
2 | {"a": [2, 9], "b": [1]}
3 | {"a": [8, 9], "c": [3, 4]}
4 | {}
Thanks!
You can use jsonb_strip_nulls()
select id, jsonb_strip_nulls(data) as data
from my_table;
Online example: http://rextester.com/GGJRW83576
Note that this function would not remove null values inside the arrays.