Loop to iterate a join over columns in pandas

I have a dataframe:
import pandas as pd
data = [('s1', 's2'),
        ('s1', 's3'),
        ('s2', 's4'),
        ('s3', 's5'),
        ('s5', 's6')]
df = pd.DataFrame(data, columns=['start', 'end'])
+-----+---+
|start|end|
+-----+---+
| s1| s2|
| s1| s3|
| s2| s4|
| s3| s5|
| s5| s6|
+-----+---+
I want to check whether each value in the end column also appears in the start column and, if so, write the corresponding end value into a new end2 column:
new_df = df
df = df.join(new_df, df.start == new_df.end)
The result is something like this:
+-----+---+----+
|start|end|end2|
+-----+---+----+
| s1| s2| s4|
| s1| s3| s5|
| s2| s4|null|
| s3| s5| s6|
| s5| s6|null|
+-----+---+----+
Then I want to join again: check whether end2 has values in start and write the corresponding end values into a new end3 column, and keep joining until the newest column contains nothing but None values.
In other words, this is an iterative join along the columns (my real dataframe has many more rows, so writing out each join by hand is not practical), but I don't understand how to do it. I imagine something like the pseudocode below, and the result should look like the table that follows:
while df.iloc[:, -1].notnull().any():
    df = df.join(...)   # join the end columns
    df['end_n'] = ...   # add the new column
+-----+---+----+----+----+
|start|end|end2|end3|end4|
+-----+---+----+----+----+
| s1| s2| s4|None|None|
| s1| s3| s5| s6|None|
| s2| s4|None|None|None|
| s3| s5| s6|None|None|
| s5| s6|None|None|None|
+-----+---+----+----+----+

Try building a dictionary from the original dataframe, with start as the keys and end as the values, and mapping it onto the end column:
df.assign(end2 = df['end'].map(dict(df.to_records(index=False))))
Output:
start end end2
0 s1 s2 s4
1 s1 s3 s5
2 s2 s4 NaN
3 s3 s5 s6
4 s5 s6 NaN
To create all possible columns, we can use a while loop:
i = 2
m = dict(df.to_records(index=False))
while df.iloc[:, -1].count() != 0:
    df['end{}'.format(i)] = df.iloc[:, -1].map(m)
    i += 1
Output:
start end end2 end3 end4
0 s1 s2 s4 NaN NaN
1 s1 s3 s5 s6 NaN
2 s2 s4 NaN NaN NaN
3 s3 s5 s6 NaN NaN
4 s5 s6 NaN NaN NaN
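Putting the two steps together, here is a minimal self-contained sketch of the approach above, using the question's data (it assumes the start/end pairs contain no cycles; with a cycle the loop would never terminate):
import pandas as pd
data = [('s1', 's2'), ('s1', 's3'), ('s2', 's4'), ('s3', 's5'), ('s5', 's6')]
df = pd.DataFrame(data, columns=['start', 'end'])
# Build the start -> end lookup once, then keep mapping the newest column
# until the most recently added column is entirely empty.
m = dict(df.to_records(index=False))
i = 2
while df.iloc[:, -1].count() != 0:
    df['end{}'.format(i)] = df.iloc[:, -1].map(m)
    i += 1
print(df)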

Related

Joining 2 dataframes pyspark

I am new to PySpark.
I have data in the 2 tables below, stored as data frames.
Table1:
+---+------+----------+
| Id|Amount|      Date|
+---+------+----------+
|  1|  £100|01/04/2021|
|  1|   £50|08/04/2021|
|  2|   £60|02/04/2021|
|  2|   £20|06/05/2021|
+---+------+----------+
Table2:
+---+------+----------+
| Id|Status|      Date|
+---+------+----------+
|  1|    S1|01/04/2021|
|  1|    S2|05/04/2021|
|  1|    S3|10/04/2021|
|  2|    S1|02/04/2021|
|  2|    S2|10/04/2021|
+---+------+----------+
I need to join those 2 data frames to produce the output below.
For every record in Table1, we need to pick up the Table2 record that is valid as of that Date, and vice versa. For example, Table1 has £50 for Id=1 on 08/04/2021, and Table2's latest record for Id=1 on or before that date is 05/04/2021, where the status changed to S2; so for 08/04/2021 the status is S2. That is the part I am not sure how to express in the join condition.
What is an efficient way of achieving this?
Expected Output:
+---+------+----------+------+
| Id|Status|      Date|Amount|
+---+------+----------+------+
|  1|    S1|01/04/2021|  £100|
|  1|    S2|05/04/2021|  £100|
|  1|    S2|08/04/2021|   £50|
|  1|    S3|10/04/2021|   £50|
|  2|    S1|02/04/2021|   £60|
|  2|    S2|10/04/2021|   £60|
|  2|    S2|06/05/2021|   £20|
+---+------+----------+------+
Use a full join on Id and Date, then the lag window function to fill Status and Amount, when they are null, from the closest preceding row by Date:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy("Id").orderBy(F.to_date("Date", "dd/MM/yyyy"))
joined_df = df1.join(df2, ["Id", "Date"], "full").withColumn(
    "Status",
    F.coalesce(F.col("Status"), F.lag("Status").over(w))
).withColumn(
    "Amount",
    F.coalesce(F.col("Amount"), F.lag("Amount").over(w))
)
joined_df.show()
#+---+----------+------+------+
#| Id| Date|Amount|Status|
#+---+----------+------+------+
#| 1|01/04/2021| £100| S1|
#| 1|05/04/2021| £100| S2|
#| 1|08/04/2021| £50| S2|
#| 1|10/04/2021| £50| S3|
#| 2|02/04/2021| £60| S1|
#| 2|10/04/2021| £60| S2|
#| 2|06/05/2021| £20| S2|
#+---+----------+------+------+
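For reference, df1 and df2 could be built from the question's tables like this, so that the snippet above runs as-is (a sketch that assumes an active SparkSession named spark and keeps Amount and Date as plain strings):
df1 = spark.createDataFrame(
    [(1, "£100", "01/04/2021"), (1, "£50", "08/04/2021"),
     (2, "£60", "02/04/2021"), (2, "£20", "06/05/2021")],
    ["Id", "Amount", "Date"])
df2 = spark.createDataFrame(
    [(1, "S1", "01/04/2021"), (1, "S2", "05/04/2021"), (1, "S3", "10/04/2021"),
     (2, "S1", "02/04/2021"), (2, "S2", "10/04/2021")],
    ["Id", "Status", "Date"])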

Store count as variable and use it for calculations in PySpark

I have dataframe df1:
+------+-----------+----------+----------+-----+
| sid|acc_term_id|first_name| last_name|major|
+------+-----------+----------+----------+-----+
|106454| 2014B| Doris| Marshall| BIO|
|106685| 2015A| Sara|Richardson| CHM|
|106971| 2015B| Rose| Butler| CHM|
|107298| 2015B| Kayla| Barnes| CSC|
|107555| 2016A| Carolyn| Ford| PHY|
|107624| 2016B| Marie| Webb| BIO|
I want to store the count of sid from this dataframe
c_value = current.agg({"sid": "count"}).collect()[0][0]
and use it for creating a prop column as shown in code below:
c_value = current.agg({"sid": "count"}).collect()[0][0]
stud_major = (
    current
    .groupBy('major')
    .agg(
        expr('COUNT(*) AS n_students')
    )
    .select('major', 'n_students', expr('ROUND(n_students/c_value, 4) AS prop'))
)
stud_major.show(16)
When I run the code, I get this error:
cannot resolve '`c_value`' given input columns: [major, n_students]; line 1 pos 17;
If I put the numeric value 2055 in place of c_value, everything works, as below:
+-----+----------+------+
|major|n_students|  prop|
+-----+----------+------+
|  MTH|       320|0.1557|
|  CHM|       405|0.1971|
|  CSC|       508|0.2472|
|  BIO|       615|0.2993|
|  PHY|       207|0.1007|
+-----+----------+------+
There are probably other ways to calculate this, but I need to do it by storing the count in a variable.
Any ideas?
In Jupyter, with pandas, use agg:
j = df['sid'].count()
df.groupby("major")['sid'].agg(n_students=(lambda x: x.count()), prop=(lambda x: x.count() / j))
major n_students prop
0 BIO 2 0.333333
1 CHM 2 0.333333
2 CSC 1 0.166667
3 PHY 1 0.166667
And in PySpark:
from pyspark.sql.functions import *
df.groupby('major').agg(count('sid').alias('n_students')) \
  .withColumn('prop', round(col('n_students') / c_value, 2)).show()
Alternatively, since c_value is a plain Python variable rather than a column Spark can resolve, you could interpolate its value into the SQL expression with an f-string:
c_value = df.agg({"sid": "count"}).collect()[0][0]
df.groupBy('major').agg(expr('COUNT(*) AS n_students')) \
  .selectExpr('major', 'n_students', f"ROUND(n_students/{c_value}, 2) AS prop").show()
+-----+----------+----+
|major|n_students|prop|
+-----+----------+----+
| BIO| 2|0.33|
| CHM| 2|0.33|
| CSC| 1|0.17|
| PHY| 1|0.17|
+-----+----------+----+
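One more possible variant (a sketch, not part of the original answer): because the error comes from Spark trying to resolve c_value as a column name, you can also pass the Python value in explicitly with F.lit and stay entirely in the DataFrame API (same df and column names as above; rounded to 2 decimals to match the outputs shown):
import pyspark.sql.functions as F
c_value = df.agg({"sid": "count"}).collect()[0][0]
(df.groupBy("major")
   .agg(F.count("*").alias("n_students"))
   .withColumn("prop", F.round(F.col("n_students") / F.lit(c_value), 2))  # pass the Python value as a literal column
   .show())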

Visits during the last 2 years

I have a list of users and their visit dates. For every visit, I want to know how many times the user visited over the previous 2 years.
# Create toy example
import pandas as pd
import numpy as np
date_range = pd.date_range(pd.to_datetime('2010-01-01'),
                           pd.to_datetime('2016-01-01'), freq='D')
date_range = np.random.choice(date_range, 8)
visits = {'user': list(np.repeat(1, 4)) + list(np.repeat(2, 4)),
          'time': list(date_range)}
df = pd.DataFrame(visits)
df.sort_values(by=['user', 'time'], axis=0)
df = spark.createDataFrame(df).repartition(1).cache()
df.show()
What I am looking for is something like this:
time user nr_visits_during_2_previous_years
0 2010-02-27 1 0
2 2012-02-21 1 1
3 2013-04-30 1 1
1 2013-06-20 1 2
6 2010-06-23 2 0
4 2011-10-19 2 1
5 2011-11-10 2 2
7 2014-02-06 2 0
Suppose you create a dataframe with these values and you need to check for visits after 2015-01-01.
import pyspark.sql.functions as f
import pyspark.sql.types as t
df = spark.createDataFrame([("2014-02-01", "1"), ("2015-03-01", "2"), ("2017-12-01", "3"),
                            ("2014-05-01", "2"), ("2016-10-12", "1"), ("2016-08-21", "1"),
                            ("2017-07-01", "3"), ("2015-09-11", "1"), ("2016-08-24", "1"),
                            ("2016-04-05", "2"), ("2014-11-19", "3"), ("2016-03-11", "3")],
                           ["date", "id"])
Now, you need to change your date column from StringType to DateType, and then filter the rows where the user visited on or after 2015-01-01.
df2 = df.withColumn("date",f.to_date('date', 'yyyy-MM-dd'))
df3 = df2.where(df2.date >= f.lit('2015-01-01'))
For the last part, just group by the id column and count to get the number of visits per user after 2015-01-01:
df3.groupby('id').count().show()
+---+-----+
| id|count|
+---+-----+
| 3| 3|
| 1| 4|
| 2| 2|
+---+-----+
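Note that the snippet above counts visits after a fixed cutoff date. For the rolling "visits during the previous 2 years" count in the question's expected output, one possible sketch (an addition here, not part of the original answer, and using an approximate 365-day year) partitions a window by id, orders it by the date in epoch seconds, and restricts its range to the two years before each row:
from pyspark.sql import Window
import pyspark.sql.functions as f
two_years = 2 * 365 * 86400  # approximate two-year span in seconds
w = (Window.partitionBy("id")
     .orderBy(f.col("date").cast("timestamp").cast("long"))
     .rangeBetween(-two_years, -1))  # only earlier rows, up to roughly two years back
df2.withColumn("nr_visits_during_2_previous_years", f.count("*").over(w)).show()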

Pandas: how to groupby on concatenated dataframes with same column names?

How do I properly concat (or maybe this calls for .merge()?) N dataframes with the same column names, so that I can group them with the source dataframe distinguished by a key? For example:
dfs = {
    'A': df1,  # columns are C1, C2, C3
    'B': df2,  # same columns C1, C2, C3
}
gathered_df = pd.concat(dfs.values()).groupby(['C2'])['C3']\
    .count()\
    .sort_values(ascending=False)\
    .reset_index()
I want to get something like
|---------|------------|------------|
|         |     A      |     B      |
| C2_val1 | count_perA | count_perB |
| C2_val2 | count_perA | count_perB |
| C2_val3 | count_perA | count_perB |
I think you need reset_index to create columns from the MultiIndex, and then add that column to the groupby to distinguish the dataframes. Last, reshape with unstack:
gathered_df = pd.concat(dfs).reset_index().groupby(['C2','level_0'])['C3'].count().unstack()
See also: What is the difference between size and count in pandas?
Sample:
df1 = pd.DataFrame({'C1': [1, 2, 3],
                    'C2': [4, 5, 5],
                    'C3': [7, 8, np.nan]})
df2 = df1.mul(10).fillna(1)
df2.C2 = df1.C2
print (df1)
C1 C2 C3
0 1 4 7.0
1 2 5 8.0
2 3 5 NaN
print (df2)
C1 C2 C3
0 10 4 70.0
1 20 5 80.0
2 30 5 1.0
dfs = {
'A': df1,
'B': df2
}
gathered_df = pd.concat(dfs).reset_index().groupby(['C2','level_0'])['C3'].count().unstack()
gathered_df.index.name = None
gathered_df.columns.name = None
print (gathered_df)
A B
4 1 1
5 1 2
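A small variant of the same idea (a sketch using the same dfs dict; the level name source is just an illustrative choice): passing names= to pd.concat labels the key level, so the groupby can refer to it by name instead of the auto-generated level_0 column:
gathered_df = (pd.concat(dfs, names=['source'])
                 .reset_index()
                 .groupby(['C2', 'source'])['C3']
                 .count()
                 .unstack())
print(gathered_df)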

Resampling a DataFrame to hourly 15min and 5min periods in Julia

I'm quite new to Julia but I'm giving it a try since the benchmarks claim it to be much faster than Python.
I'm trying to use some stock tick data in the format ["unixtime", "price", "amount"]
I managed to load the data and convert the unixtime to a date in Julia, but now I need to resample the data to get ohlc (open, high, low, close) for the price and the sum for the amount, over a specific period (hourly, 15 min, 5 min, etc.):
julia> head(btc_raw_data)
6x3 DataFrame:
date price amount
[1,] 2011-09-13T13:53:36 UTC 5.8 1.0
[2,] 2011-09-13T13:53:44 UTC 5.83 3.0
[3,] 2011-09-13T13:53:49 UTC 5.9 1.0
[4,] 2011-09-13T13:53:54 UTC 6.0 20.0
[5,] 2011-09-13T14:32:53 UTC 5.95 12.4521
[6,] 2011-09-13T14:35:04 UTC 5.88 7.458
I see there is a package called Resampling, but it doesn't seem to accept a time period, only the number of rows I want the output data to have.
Any other alternatives?
You can convert a DataFrame (from DataFrames.jl) to a TimeArray (from TimeSeries.jl) using https://github.com/femtotrader/TimeSeriesIO.jl:
using TimeSeriesIO: TimeArray
ta = TimeArray(df, colnames=[:price], timestamp=:date)
You can resample a timeseries (a TimeArray from TimeSeries.jl) using TimeSeriesResampler (https://github.com/femtotrader/TimeSeriesResampler.jl)
and TimeFrames (https://github.com/femtotrader/TimeFrames.jl):
using TimeSeriesResampler: resample, mean, ohlc, sum, TimeFrame
# Define a sample timeseries (prices for example)
idx = DateTime(2010,1,1):Dates.Minute(1):DateTime(2011,1,1)
idx = idx[1:end-1]
N = length(idx)
y = rand(-1.0:0.01:1.0, N)
y = 1000 + cumsum(y)
#df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx), y, ["y"])
println("ta=")
println(ta)
# Define how datetime should be grouped (timeframe)
tf = TimeFrame(dt -> floor(dt, Dates.Minute(15)))
# resample using OHLC values
ta_ohlc = ohlc(resample(ta, tf))
println("ta_ohlc=")
println(ta_ohlc)
# resample using mean values
ta_mean = mean(resample(ta, tf))
println("ta_mean=")
println(ta_mean)
# Define an other sample timeseries (volume for example)
vol = rand(0:0.01:1.0, N)
ta_vol = TimeArray(collect(idx), vol, ["vol"])
println("ta_vol=")
println(ta_vol)
# resample using sum values
ta_vol_sum = sum(resample(ta_vol, tf))
println("ta_vol_sum=")
println(ta_vol_sum)
You should get:
julia> ta
525600x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:59:00
y
2010-01-01T00:00:00 | 1000.16
2010-01-01T00:01:00 | 1000.1
2010-01-01T00:02:00 | 1000.98
2010-01-01T00:03:00 | 1001.38
⋮
2010-12-31T23:56:00 | 972.3
2010-12-31T23:57:00 | 972.85
2010-12-31T23:58:00 | 973.74
2010-12-31T23:59:00 | 972.8
julia> ta_ohlc
35040x4 TimeSeries.TimeArray{Float64,2,DateTime,Array{Float64,2}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
Open High Low Close
2010-01-01T00:00:00 | 1000.16 1002.5 1000.1 1001.54
2010-01-01T00:15:00 | 1001.57 1002.64 999.38 999.38
2010-01-01T00:30:00 | 999.13 1000.91 998.91 1000.91
2010-01-01T00:45:00 | 1001.0 1006.42 1001.0 1006.42
⋮
2010-12-31T23:00:00 | 980.84 981.56 976.53 976.53
2010-12-31T23:15:00 | 975.74 977.46 974.71 975.31
2010-12-31T23:30:00 | 974.72 974.9 971.73 972.07
2010-12-31T23:45:00 | 972.33 973.74 971.49 972.8
julia> ta_mean
35040x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
y
2010-01-01T00:00:00 | 1001.1047
2010-01-01T00:15:00 | 1001.686
2010-01-01T00:30:00 | 999.628
2010-01-01T00:45:00 | 1003.5267
⋮
2010-12-31T23:00:00 | 979.1773
2010-12-31T23:15:00 | 975.746
2010-12-31T23:30:00 | 973.482
2010-12-31T23:45:00 | 972.3427
julia> ta_vol
525600x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:59:00
vol
2010-01-01T00:00:00 | 0.37
2010-01-01T00:01:00 | 0.67
2010-01-01T00:02:00 | 0.29
2010-01-01T00:03:00 | 0.28
⋮
2010-12-31T23:56:00 | 0.74
2010-12-31T23:57:00 | 0.66
2010-12-31T23:58:00 | 0.22
2010-12-31T23:59:00 | 0.47
julia> ta_vol_sum
35040x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
vol
2010-01-01T00:00:00 | 7.13
2010-01-01T00:15:00 | 6.99
2010-01-01T00:30:00 | 8.73
2010-01-01T00:45:00 | 8.27
⋮
2010-12-31T23:00:00 | 6.11
2010-12-31T23:15:00 | 7.49
2010-12-31T23:30:00 | 5.75
2010-12-31T23:45:00 | 8.36