How to pivot values in a dataframe (Spark Scala)

I am having a really tough time trying to do this. I have a dataframe with different currencies and their exchange rate type. The type can be 01 or 07, each with an associated value. For example:
curr  type  value
EUR   01    0.2345
EUR   07    0.1564
DOL   01    0.4566
DOL   07    0.1233
I think you get it. I want to pivot the 01 and 07 values into columns, renaming them, and group by currency, so my final dataframe is something like this:
curr  t_01  t_07
EUR   0.23  0.15
DOL   0.45  0.12
I have tried this (balanceConvert is the original dataframe, and dfBalance the final one I want):
var dfBalance = balanceConvert.select(g_currency_id).distinct()
balanceConvert.show()

dfBalance = dfBalance.withColumn("gf_em_cns_exr_lc_eu_amount",
  when(balanceConvert.col(g_currency_id) === dfBalance.col(g_currency_id) &&
       balanceConvert.col(gf_exchange_rate_applied_type) === "01",
    balanceConvert.col(gf_exchange_rate_amount)).otherwise("TBD"))

dfBalance = dfBalance.withColumn("gf_avg_cns_exr_lc_eu_amount",
  when(balanceConvert.col(g_currency_id) === dfBalance.col(g_currency_id) &&
       balanceConvert.col(gf_exchange_rate_applied_type) === "07",
    balanceConvert.col(gf_exchange_rate_amount)).otherwise("TBD"))
But I realized that I cannot access another dataframe from inside a withColumn.

There is a pivot function that does exactly what you want:
import org.apache.spark.sql.functions.{avg, concat, lit}
import spark.implicits._  // for toDF and the 'symbol column syntax (assumes a SparkSession named spark, as in spark-shell)

val balanceConvert = Seq(
  ("EUR", "01", 0.2345),
  ("EUR", "07", 0.1564),
  ("DOL", "01", 0.4566),
  ("DOL", "07", 0.1233)
).toDF("curr", "type", "value")

balanceConvert
  // prefix the type so the pivoted columns come out as t_01 and t_07
  .withColumn("type", concat(lit("t_"), 'type))
  .groupBy("curr")
  .pivot("type")
  // If you have just one row per (curr, type) tuple, the aggregation you
  // use does not matter much. You could as well use first.
  .agg(avg('value))
  .show()
+----+------+------+
|curr| t_01| t_07|
+----+------+------+
| DOL|0.4566|0.1233|
| EUR|0.2345|0.1564|
+----+------+------+
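If you are doing the same thing from Python, here is an equivalent PySpark sketch (assuming an active SparkSession named spark). Passing the expected pivot values explicitly also spares Spark an extra pass over the data to discover them:

import pyspark.sql.functions as F

balanceConvert = spark.createDataFrame(
    [("EUR", "01", 0.2345), ("EUR", "07", 0.1564),
     ("DOL", "01", 0.4566), ("DOL", "07", 0.1233)],
    ["curr", "type", "value"],
)

(balanceConvert
    # prefix the type so the pivoted columns come out as t_01 / t_07
    .withColumn("type", F.concat(F.lit("t_"), F.col("type")))
    .groupBy("curr")
    # listing the pivot values up front avoids an extra pass to discover them
    .pivot("type", ["t_01", "t_07"])
    .agg(F.avg("value"))
    .show())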

Related

Pandas - Take value from n months before

I am working with datetimes. Is there any way to get the value from n months before?
For example, the data looks like:
import numpy as np
import pandas as pd

dft = pd.DataFrame(
    np.random.randn(100, 1),
    columns=["A"],
    index=pd.date_range("20130101", periods=100, freq="M"),
)
dft
Then:
For every July of each year, we take the value of December of the previous year and apply it through June of the next year
For the other months (from August this year to June next year), we take the value of the previous month
For example: the value from Jul-2000 to Jun-2001 will be the same, equal to the value of Dec-1999.
What I've been trying to do is:
dft['B'] = np.where(dft.index.month == 7,
                    dft['A'].shift(7, freq='M'),
                    dft['A'].shift(1, freq='M'))
However, the result is simply a copy of column A, and I don't know why. But when I tried a single line of code:
dft['C'] = dft['A'].shift(7, freq='M')
then everything is shifted as expected. I don't know what the issue is here.
The issue is index alignment. The shift you performed acts on the index, but numpy.where converts everything to arrays and loses the index.
Use pandas' where or mask instead: everything remains a Series and the index is preserved:
dft['B'] = (dft['A'].shift(1, freq='M')
            .mask(dft.index.month == 7, dft['A'].shift(7, freq='M'))
           )
output:
A B
2013-01-31 -2.202668 NaN
2013-02-28 0.878792 -2.202668
2013-03-31 -0.982540 0.878792
2013-04-30 0.119029 -0.982540
2013-05-31 -0.119644 0.119029
2013-06-30 -1.038124 -0.119644
2013-07-31 0.177794 -1.038124
2013-08-31 0.206593 -2.202668 <- correct
2013-09-30 0.188426 0.206593
2013-10-31 0.764086 0.188426
... ... ...
2020-12-31 1.382249 -1.413214
2021-01-31 -0.303696 1.382249
2021-02-28 -1.622287 -0.303696
2021-03-31 -0.763898 -1.622287
2021-04-30 0.420844 -0.763898
[100 rows x 2 columns]
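For completeness, Series.where expresses the same thing with the condition inverted (keep the 1-month shift everywhere except July):

# keep the 1-month shift where the month is not July,
# otherwise fall back to the 7-month shift (December of the previous year)
dft['B'] = dft['A'].shift(1, freq='M').where(
    dft.index.month != 7,
    dft['A'].shift(7, freq='M'),
)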

Make a for loop for a dataframe to subtract dates and put it in a variable

I have a dataframe with a lot of products that looks like this:
Product  Start Date
00001    2021/08/10
00002    2021/01/10
I want to make a loop that goes product by product, subtracting three months from the date and then putting it in a variable, something like this:
date[] = ''
for i in dataframe:
    date['3monthsbefore'] = i['start date'] - 3 months
    date['3monthsafter'] = i['start date'] + 3 months
    date['product'] = i['product']
    "Another process with those variables"
And then concat all this data into a dataframe. I'm a little bit lost.
I want to do this because I need to use those variables in another process, so is it possible to do this?
Using pandas, you usually don't need to loop over your DataFrame. In this case, you can get the 3 months before/after for all rows pretty simply using pd.DateOffset:
df["Start Date"] = pd.to_datetime(df["Start Date"])
df["3monthsbefore"] = df["Start Date"] - pd.DateOffset(months=3)
df["3monthsafter"] = df["Start Date"] + pd.DateOffset(months=3)
This gives:
Product Start Date 3monthsbefore 3monthsafter
0 00001 2021-08-10 2021-05-10 2021-11-10
1 00002 2021-01-10 2020-10-10 2021-04-10
Data:
df = pd.DataFrame({"Product": ["00001", "00002"], "Start Date": ["2021/08/10", "2021/01/10"]})
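If the downstream step really does need per-product variables, you can keep the vectorized offsets above and just iterate the rows; process() below is a hypothetical stand-in for "Another process with those variables":

def process(product, before, after):
    # hypothetical downstream step; replace with your real processing
    print(product, before.date(), after.date())

for _, row in df.iterrows():
    process(row["Product"], row["3monthsbefore"], row["3monthsafter"])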

Pandas: how to get a row number from a datetime index and back again?

I have great difficulties. I have read a csv file and set the index to the "Timestamp" column like this:
df = pd.read_csv(csv_file, quotechar="'", decimal=".", delimiter=";", parse_dates=True, index_col="Timestamp")
df
XYZ PRICE position nrLots posText
Timestamp
2014-10-14 10:00:29 30 140 -1.0 -1.0 buy
2014-10-14 10:00:30 21 90 -1.0 -5.0 buy
2014-10-14 10:00:31 3 110 1.0 2.0 sell
2014-10-14 10:00:32 31 120 1.0 1.0 sell
2014-10-14 10:00:33 4 70 -1.0 -5.0 buy
So if I want to get the price in the 2nd row, I want to do this:
df.loc[2, "PRICE"]
But that does not work. If I want to use the df.loc[] operator, I need to pass a Timestamp, like this:
df.loc["2014-10-14 10:00:31", "PRICE"]
If I want to use row numbers, I need to do this instead:
df["PRICE"].iloc[2]
which sucks. The syntax is ugly. However, it works: I can get the value and I can set the value, which is what I want.
If I want to find the Timestamp of a row, I can do this:
df.index[row]
Question) Is there a more elegant syntax to get and set the value when you always work with row numbers? I always iterate over row numbers, never over Timestamps. I never use the Timestamp to access values; I always use row numbers.
Bonus question) If I have a Timestamp, how can I find the corresponding row number?
There is a way to do this.
First use df = df.reset_index().
"Timestamp" will be a new column added to df, and you now get a plain integer index.
Then you can access any row element with df.loc[] or df.iat[], and you can find any row with a specific element.
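If you would rather keep the DatetimeIndex, here is a small sketch of positional get/set plus the reverse lookup the bonus question asks about (df.index.get_loc maps a Timestamp back to its row number):

import pandas as pd

col = df.columns.get_loc("PRICE")   # column number for "PRICE"

price = df.iat[2, col]              # get by row number
df.iat[2, col] = 115                # set by row number

# bonus: row number for a given Timestamp
row = df.index.get_loc(pd.Timestamp("2014-10-14 10:00:31"))  # -> 2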

How to systematically drop a Pandas row given a particular condition in a column?

I have a simple dataframe consisting of 3 columns: Year, Period (Month) and Inflation
I want to remove even months (e.g., February, April, etc.) and the code I came up with is the following:
i = df[(df["Periodo"] == "Febrero") | (df["Periodo"] == "Abril") |
       (df["Periodo"] == "Junio") | (df["Periodo"] == "Agosto") |
       (df["Periodo"] == "Octubre") | (df["Periodo"] == "Diciembre")].index
df.drop(i, inplace=True)
Is there a quicker way, rather than typing those tedious OR conditions, to build the index? Like a for loop or something.
Thanks
You can use .isin() (note the ~ used as negation):
to_drop = ["Febrero", "Abril", "Junio", "Agosto", "Octubre", "Diciembre"]
print(df[~df.Periodo.isin(to_drop)])
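If you prefer to keep the original drop(..., inplace=True) pattern, the same isin condition plugs straight in:

# same condition, used to build the index to drop (reusing to_drop from above)
df.drop(df[df["Periodo"].isin(to_drop)].index, inplace=True)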

pandas dataframe insert to MongoDB as embedded document (list of dictionaries)

My dataframe before inserting into MongoDB looks like this:
code trade_date open high low close pre_close change pct_chg vol
0 600111 20210308 58.99 59.16 56.58 56.66 58.52 -1.86 -3.1784 299902.37
1 600111 20210305 58.00 58.91 57.71 58.52 59.31 -0.79 -1.3320 281584.57
2 600111 20210304 60.31 61.60 58.96 59.31 58.67 0.64 1.0908 621415.96
3 600111 20210303 58.21 58.80 57.80 58.67 58.49 0.18 0.3077 235677.52
I have extracted Year & Month from trade_date and appended a new column "trade_month" as below,
code trade_date open high low close pre_close change pct_chg vol trade_month
0 600111 20210308 58.99 59.16 56.58 56.66 58.52 -1.86 -3.1784 299902.37 202103
1 600111 20210305 58.00 58.91 57.71 58.52 59.31 -0.79 -1.3320 281584.57 202103
How can I insert this table into MongoDB in the format below (an embedded MongoDB document including the whole month's trade data as a list of dictionaries) via PyMongo?
{
    'code': '600111',
    'trade_month': '202103',
    'month_records': [
        {'trade_date': '20210301', 'open': '123', 'high': '123', ...},
        {'trade_date': '20210302', 'open': '123', 'high': '123', ...},
        {'trade_date': '20210303', 'open': '123', 'high': '123', ...},
        ...
    ]
}
Having a month's records as a single MongoDB document aligns with my schema design, because I normally retrieve one month of data via code and trade_date, which would be indexed in MongoDB.
I did some research on my own for nearly 2 days and did not find any clues on how to achieve this. I'd appreciate any help.
Use groupby at the code and trade_month level and aggregate to a dict:
df.groupby(['code', 'trade_month']).apply(
    lambda x: x.to_dict(orient='records')
).rename('month_records').reset_index().to_dict('records')
Output:
[
{
"code":600111,
"trade_month":202103,
"month_records":[
{
"code":600111,
"trade_date":20210308,
"open":58.99,
"high":59.16,
"low":56.58,
"close":56.66,
"pre_close":58.52,
"change":-1.86,
"pct_chg":-3.1784,
"vol":299902.37,
"trade_month":202103
},
{
"code":600111,
"trade_date":20210305,
"open":58.0,
"high":58.91,
"low":57.71,
"close":58.52,
"pre_close":59.31,
"change":-0.79,
"pct_chg":-1.332,
"vol":281584.57,
"trade_month":202103
}
]
}
]
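To actually write these documents with PyMongo, here is a minimal sketch; the connection URI and the database/collection names ("market", "monthly_trades") are hypothetical placeholders:

from pymongo import MongoClient

# hypothetical connection details; adjust to your deployment
client = MongoClient("mongodb://localhost:27017")
collection = client["market"]["monthly_trades"]

# build the per-month documents exactly as above
records = (
    df.groupby(['code', 'trade_month'])
      .apply(lambda x: x.to_dict(orient='records'))
      .rename('month_records')
      .reset_index()
      .to_dict('records')
)
collection.insert_many(records)

# index supporting the usual one-month lookup by code and trade_month
collection.create_index([("code", 1), ("trade_month", 1)])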