Dealing with multiple values in a pandas DataFrame cell

Columns describe the data and the rows hold the values. However, some columns contain multiple values (they come from a table on a website). The rows of those embedded tables get merged into one cell, separated by hashtags. Since they are only part of that table, they refer to other columns whose cell values are also separated by hashtags.
Column Name: solution_id | type labour | labour_unit | est_labour_quantity | est_labour_costs | est_labour_total_costs
10 | WorkA#WorkB | Person#Person | 2.0#2.0 | 300.0#300.0. | 600.0#600.0
11 | WorkC#WorkD | Person#Person | 3.0#2.0 | 300.0#300.0. | 900.0#600.0
My questions are twofold:
What would be a good way to transform the data so I can work with it more efficiently, e.g. create as many new columns as there are entries in one cell and separate it like this:
Column Name: solution_id | type labour_1 | labour_unit_1 | est_labour_quantity_1 | est_labour_costs_1 | est_labour_total_costs_1 | type labour_2 | labour_unit_2 | est_labour_quantity_2 | est_labour_costs_2 | est_labour_total_costs_2
10 | WorkA | Person. | 2.0. | 300.0. | 600.0. | WorkB | Person | 2.0 | 300.0 | 600.0
11 | WorkC | Person. | 3.0. | 300.0. | 900.0. | WorkD | Person | 2.0 | 300.0 | 600.0
This makes it more readable, but it doubles the number of columns, and I have some cells with up to 5 entries, so it could mean 5x more columns. What I also don't like about this idea is that the new column names are not really meaningful and will be hard to interpret.
How can I make this separation in pandas, so that I have WorkA and its associated values, then WorkB, and so on?
Or is there another, better way to work with this tabular form, maybe keeping it all in one cell? Please let me know!

Use:
# unpivot to long format with melt
df = df.melt('solution_id')
# split the hash-separated strings into lists
df['value'] = df['value'].str.split('#')
# repeat rows, one per list element
df = df.explode('value')
# counter for the new column names
df['g'] = df.groupby(['solution_id', 'variable']).cumcount().add(1)
# pivot back to wide and sort the MultiIndex columns
df = (df.pivot(index='solution_id', columns=['variable', 'g'], values='value')
        .sort_index(level=1, axis=1, sort_remaining=False))
# flatten the MultiIndex columns
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df)
type_labour_1 labour_unit_1 est_labour_quantity_1 \
solution_id
10 WorkA Person 2.0
11 WorkC Person 3.0
est_labour_costs_1 est_labour_total_costs_1 type_labour_2 \
solution_id
10 300.0 600.0 WorkB
11 300.0 900.0 WorkD
labour_unit_2 est_labour_quantity_2 est_labour_costs_2 \
solution_id
10 Person 2.0 300.0.
11 Person 2.0 300.0.
est_labour_total_costs_2
solution_id
10 600.0
11 600.0

You can split your strings, explode and reshape:
df2 = (df
   .set_index('solution_id')
   # split every hash-separated column into lists
   .apply(lambda c: c.str.split('#'))
   # explode all value columns together, one row per entry
   .explode(list(df.columns[1:]))
   # number the entries per solution_id for the new column names
   .assign(idx=lambda d: d.groupby(level=0).cumcount().add(1))
   .set_index('idx', append=True)
   # reshape to wide and keep columns grouped by entry number
   .unstack('idx')
   .sort_index(axis=1, level='idx', sort_remaining=False)
)
df2.columns = [f'{a}_{b}' for a, b in df2.columns]
output:
type labour_1 labour_unit_1 est_labour_quantity_1 est_labour_costs_1 est_labour_total_costs_1 type labour_2 labour_unit_2 est_labour_quantity_2 est_labour_costs_2 est_labour_total_costs_2
solution_id
10 WorkA Person 2.0 300.0 600.0 WorkB Person 2.0 300.0. 600.0
11 WorkC Person 3.0 300.0 900.0 WorkD Person 2.0 300.0. 600.0
Or, shorter code using the same initial split followed by slicing and concatenation:
df2 = (df
   .set_index('solution_id')
   .apply(lambda c: c.str.split('#'))
)
pd.concat([df2.apply(lambda c: c.str[i]).add_suffix(f'_{i+1}')
           for i in range(len(df2.iat[0, 0]))], axis=1)
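The asker also wondered whether there is a better way than multiplying the columns; a long ("tidy") layout is one common option, where each labour entry gets its own row and the original column names stay intact. A minimal sketch, assuming the sample frame from the question (stray trailing dots removed):
import pandas as pd

df = pd.DataFrame({
    'solution_id': [10, 11],
    'type labour': ['WorkA#WorkB', 'WorkC#WorkD'],
    'labour_unit': ['Person#Person', 'Person#Person'],
    'est_labour_quantity': ['2.0#2.0', '3.0#2.0'],
    'est_labour_costs': ['300.0#300.0', '300.0#300.0'],
    'est_labour_total_costs': ['600.0#600.0', '900.0#600.0'],
})

# split every hash-separated column into lists and explode them together,
# so each labour entry becomes its own row instead of its own set of columns
long_df = (df
    .set_index('solution_id')
    .apply(lambda c: c.str.split('#'))
    .explode(list(df.columns[1:]))
    .reset_index()
)
print(long_df)
Any of the wide layouts above can still be produced from this long form when needed.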

Related

Trying to iterate through a column to populate another column

I am trying to populate the column num_crimes. Since the zipcode repeats in the houses data frame, I just want to add the number of crimes related to that zipcode from the dictionary containing all the crimes per zipcode.
The houses dataframe contains 5000 entries and the dictionary contains only 67, so I cannot just merge them.
This is the houses dataframe:
sold_price | zipcode | fireplaces | num_crimes
5300000 | 85637 | 6 | NaN
4200000 | 85646 | 5 | NaN
4200000 | 85646 | 5 | NaN
4500000 | 85646 | 6 | NaN
3411450 | 85750 | 4 | NaN
and this is the dictionary:
{85141: 1,85601: 2, 85607: 1, 85614: 4, 85622: 2, 85629: 4, 85634: 1....}
Problem: this is the code I used for that, but it is not changing the values in num_crimes:
def populate(df1):
    for row, rows in df1.iterrows():
        if rows[1] in my_dict:
            rows[3] = my_dict[rows[1]]
        else:
            rows[3] = 0
You can just do something like:
df["num_crimes"] = df["zipcode"].apply(lambda z: my_dict[z])
If you have zipcode in df that are not in my_dict, you need to handle for that as well:
df["num_crimes"] = df["zipcode"].apply(lambda z: my_dict[z] if z in my_dict else -1)
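A variant not shown in the answers (just a common pandas idiom) is Series.map, which looks each zipcode up in the dict and leaves NaN where the zipcode is missing, so the fallback becomes a simple fillna. A small sketch with made-up numbers:
# hypothetical mini example mirroring the question's structure
import pandas as pd

my_dict = {85646: 3, 85750: 1}
houses = pd.DataFrame({'zipcode': [85637, 85646, 85646, 85750]})

# map each zipcode to its crime count; zipcodes missing from the dict become NaN, then 0
houses['num_crimes'] = houses['zipcode'].map(my_dict).fillna(0).astype(int)
print(houses)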
It's a lot easier to answer your questions if you post your data as text rather than images. Anyway, you could turn the dict into a dataframe and then join it with the original dataframe. So something like this:
houses.set_index("zipcode").join(pd.DataFrame.from_dict(my_dict, orient='index', columns=["Crimes from dict"]))
Would that work?

Merge two DataFrames on multiple columns

Hope you can help me. I have two pretty big datasets.
DF1 Example:
id | A_Workflow_Type_ID | B_Workflow_Type_ID | ...
1  | 123                | 456                | ...
2  | 789                | 222                | ...
3  | 333                | NULL               | ...
DF2 Example:
Workflow | Operation | Profile | Type      | Name | ...
123      | 1         | 2       | Low_Cost  | xyz  | ...
456      | 2         | 5       | High_Cost | z    | ...
I need to merge the two datasets without creating many NaNs and duplicate columns. So I want to merge A_Workflow_Type_ID and B_Workflow_Type_ID from DF1 onto Workflow from DF2.
I tried several join operations in pandas as well as the merge option, but it fails.
My last try:
all_Data = pd.merge(left=DF1, right=DF2, how='inner', left_on=['A_Workflow_Type_ID', 'B_Workflow_Type_ID'], right_on=['Workflow'])
But that returns an error saying both sides have to be of equal length.
Thanks for the help!
You need to reshape first with melt and then merge:
# select every column whose name does not contain 'Workflow'
cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
print(cols)
Index(['id'], dtype='object')
df = DF1.melt(cols, value_name='Workflow', var_name='type')
print(df)
id type Workflow
0 1 A_Workflow_Type_ID 123.0
1 2 A_Workflow_Type_ID 789.0
2 3 A_Workflow_Type_ID 333.0
3 1 B_Workflow_Type_ID 456.0
4 2 B_Workflow_Type_ID 222.0
5 3 B_Workflow_Type_ID NaN
all_Data = pd.merge(left=df, right=DF2, on='Workflow')
print(all_Data)
id type Workflow Operation Profile Type Name
0 1 A_Workflow_Type_ID 123 1 2 Low_Cost xyz
1 1 B_Workflow_Type_ID 456 2 5 High_Cost z
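For reference, a minimal reproducible sketch of the same melt-then-merge idea, assuming the sample frames from the question:
import pandas as pd

DF1 = pd.DataFrame({'id': [1, 2, 3],
                    'A_Workflow_Type_ID': [123, 789, 333],
                    'B_Workflow_Type_ID': [456, 222, None]})
DF2 = pd.DataFrame({'Workflow': [123, 456],
                    'Operation': [1, 2],
                    'Profile': [2, 5],
                    'Type': ['Low_Cost', 'High_Cost'],
                    'Name': ['xyz', 'z']})

# melt both workflow columns into a single 'Workflow' key, then merge on it;
# the inner merge drops rows whose workflow id (e.g. the NULL) has no match in DF2
cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
all_Data = (DF1.melt(cols, value_name='Workflow', var_name='type')
               .merge(DF2, on='Workflow'))
print(all_Data)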

pandas pivot_table with dates as values

let's say I have the following table of customer data
df = pd.DataFrame.from_dict({"Customer": [0, 0, 1],
                             "Date": ['01.01.2016', '01.02.2016', '01.01.2016'],
                             "Type": ["First Buy", "Second Buy", "First Buy"],
                             "Value": [10, 20, 10]})
which looks like this:
Customer | Date | Type | Value
-----------------------------------------
0 |01.01.2016|First Buy | 10
-----------------------------------------
0 |01.02.2016|Second Buy| 20
-----------------------------------------
1 |01.01.2016|First Buy | 10
I want to pivot the table by the Type column.
However, the pivoting only gives the numeric Value columns as a result.
I'd desire a structure like:
Customer | First Buy Date | First Buy Value | Second Buy Date | Second Buy Value
---------------------------------------------------------------------------------
where the missing values are NaN or NaT.
Is this possible using pivot_table? If not, I can imagine some workarounds, but they are quite lengthy. Any other suggestions?
Use unstack:
df1 = df.set_index(['Customer', 'Type']).unstack()
df1.columns = ['_'.join(cols) for cols in df1.columns]
print (df1)
Date_First Buy Date_Second Buy Value_First Buy Value_Second Buy
Customer
0 01.01.2016 01.02.2016 10.0 20.0
1 01.01.2016 None 10.0 NaN
If you need a different column order, use swaplevel and sort_index:
df1 = df.set_index(['Customer', 'Type']).unstack()
df1.columns = ['_'.join(cols) for cols in df1.columns.swaplevel(0,1)]
df1.sort_index(axis=1, inplace=True)
print (df1)
First Buy_Date First Buy_Value Second Buy_Date Second Buy_Value
Customer
0 01.01.2016 10.0 01.02.2016 20.0
1 01.01.2016 10.0 None NaN
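Since the question explicitly asks about pivot_table, that route is possible as well if you pass aggfunc='first' (the default mean aggregation cannot handle the non-numeric Date column). A sketch along the same lines as the unstack version:
df1 = df.pivot_table(index='Customer', columns='Type',
                     values=['Date', 'Value'], aggfunc='first')
df1.columns = ['_'.join(cols) for cols in df1.columns.swaplevel(0, 1)]
print(df1.sort_index(axis=1))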

Using Pandas groupby to calculate many slopes

Some illustrative data in a DataFrame (MultiIndex) format:
|entity| year |value|
+------+------+-----+
| a | 1999 | 2 |
| | 2004 | 5 |
| b | 2003 | 3 |
| | 2007 | 2 |
| | 2014 | 7 |
I would like to calculate the slope using scipy.stats.linregress for each entity a and b in the above example. I tried using groupby on the first column, following the split-apply-combine advice, but it seems problematic since it's expecting one Series of values (a and b), whereas I need to operate on the two columns on the right.
This is easily done in R via plyr, not sure how to approach it in pandas.
A function can be applied to a groupby with the apply method; the passed function in this case is linregress. Please see below:
In [4]: x = pd.DataFrame({'entity': ['a', 'a', 'b', 'b', 'b'],
   ...:                    'year': [1999, 2004, 2003, 2007, 2014],
   ...:                    'value': [2, 5, 3, 2, 7]})
In [5]: x
Out[5]:
entity value year
0 a 2 1999
1 a 5 2004
2 b 3 2003
3 b 2 2007
4 b 7 2014
In [6]: from scipy.stats import linregress
In [7]: x.groupby('entity').apply(lambda v: linregress(v.year, v.value)[0])
Out[7]:
entity
a 0.600000
b 0.403226
You can do this via the iterator ability of the groupby object. It seems easier to do it by resetting the current index and then grouping by 'entity'.
A list comprehension is then an easy way to quickly work through all the groups in the iterator. Or use a dict comprehension to get the labels in the same place (you can then stick the dict into a pd.DataFrame easily, as sketched below).
import pandas as pd
import scipy.stats

# This is your data
test = pd.DataFrame({'entity': ['a', 'a', 'b', 'b', 'b'],
                     'year': [1999, 2004, 2003, 2007, 2014],
                     'value': [2, 5, 3, 2, 7]}).set_index(['entity', 'year'])

# This creates the groups
groupby = test.reset_index().groupby(['entity'])

# Process groups by list comprehension
slopes = [scipy.stats.linregress(group.year, group.value)[0] for name, group in groupby]

# Process groups by dict comprehension
slopes = {name: [scipy.stats.linregress(group.year, group.value)[0]] for name, group in groupby}
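As mentioned, the dict version drops straight into a pd.DataFrame; a small sketch continuing the snippet above:
# turn the {entity: [slope]} dict into a one-column DataFrame indexed by entity
slopes_df = pd.DataFrame(slopes, index=['slope']).T
print(slopes_df)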

Resampling a DataFrame to hourly 15min and 5min periods in Julia

I'm quite new to Julia but I'm giving it a try since the benchmarks claim it to be much faster than Python.
I'm trying to use some stock tick data in the format ["unixtime", "price", "amount"]
I managed to load the data and convert the unixtime to a date in Julia, but now I need to resample the data to OHLC (open, high, low, close) for the price and a sum for the amount, over a specific period (hourly, 15 min, 5 min, etc.):
julia> head(btc_raw_data)
6x3 DataFrame:
date price amount
[1,] 2011-09-13T13:53:36 UTC 5.8 1.0
[2,] 2011-09-13T13:53:44 UTC 5.83 3.0
[3,] 2011-09-13T13:53:49 UTC 5.9 1.0
[4,] 2011-09-13T13:53:54 UTC 6.0 20.0
[5,] 2011-09-13T14:32:53 UTC 5.95 12.4521
[6,] 2011-09-13T14:35:04 UTC 5.88 7.458
I see there is a package called Resampling, but it doesn't seem to accept a time period, only the number of rows I want the output data to have.
Any other alternatives?
You can convert DataFrame (from DataFrames.jl) to TimeArray (from TimeSeries.jl) using https://github.com/femtotrader/TimeSeriesIO.jl
using TimeSeriesIO: TimeArray
ta = TimeArray(df, colnames=[:price], timestamp=:date)
You can resample timeseries (TimeArray from TimeSeries.jl) using TimeSeriesResampler https://github.com/femtotrader/TimeSeriesResampler.jl
and TimeFrames https://github.com/femtotrader/TimeFrames.jl
using TimeSeriesResampler: resample, mean, ohlc, sum, TimeFrame
# Define a sample timeseries (prices for example)
idx = DateTime(2010,1,1):Dates.Minute(1):DateTime(2011,1,1)
idx = idx[1:end-1]
N = length(idx)
y = rand(-1.0:0.01:1.0, N)
y = 1000 + cumsum(y)
#df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx), y, ["y"])
println("ta=")
println(ta)
# Define how datetime should be grouped (timeframe)
tf = TimeFrame(dt -> floor(dt, Dates.Minute(15)))
# resample using OHLC values
ta_ohlc = ohlc(resample(ta, tf))
println("ta_ohlc=")
println(ta_ohlc)
# resample using mean values
ta_mean = mean(resample(ta, tf))
println("ta_mean=")
println(ta_mean)
# Define another sample timeseries (volume for example)
vol = rand(0:0.01:1.0, N)
ta_vol = TimeArray(collect(idx), vol, ["vol"])
println("ta_vol=")
println(ta_vol)
# resample using sum values
ta_vol_sum = sum(resample(ta_vol, tf))
println("ta_vol_sum=")
println(ta_vol_sum)
You should get:
julia> ta
525600x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:59:00
y
2010-01-01T00:00:00 | 1000.16
2010-01-01T00:01:00 | 1000.1
2010-01-01T00:02:00 | 1000.98
2010-01-01T00:03:00 | 1001.38
⋮
2010-12-31T23:56:00 | 972.3
2010-12-31T23:57:00 | 972.85
2010-12-31T23:58:00 | 973.74
2010-12-31T23:59:00 | 972.8
julia> ta_ohlc
35040x4 TimeSeries.TimeArray{Float64,2,DateTime,Array{Float64,2}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
Open High Low Close
2010-01-01T00:00:00 | 1000.16 1002.5 1000.1 1001.54
2010-01-01T00:15:00 | 1001.57 1002.64 999.38 999.38
2010-01-01T00:30:00 | 999.13 1000.91 998.91 1000.91
2010-01-01T00:45:00 | 1001.0 1006.42 1001.0 1006.42
⋮
2010-12-31T23:00:00 | 980.84 981.56 976.53 976.53
2010-12-31T23:15:00 | 975.74 977.46 974.71 975.31
2010-12-31T23:30:00 | 974.72 974.9 971.73 972.07
2010-12-31T23:45:00 | 972.33 973.74 971.49 972.8
julia> ta_mean
35040x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
y
2010-01-01T00:00:00 | 1001.1047
2010-01-01T00:15:00 | 1001.686
2010-01-01T00:30:00 | 999.628
2010-01-01T00:45:00 | 1003.5267
⋮
2010-12-31T23:00:00 | 979.1773
2010-12-31T23:15:00 | 975.746
2010-12-31T23:30:00 | 973.482
2010-12-31T23:45:00 | 972.3427
julia> ta_vol
525600x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:59:00
vol
2010-01-01T00:00:00 | 0.37
2010-01-01T00:01:00 | 0.67
2010-01-01T00:02:00 | 0.29
2010-01-01T00:03:00 | 0.28
⋮
2010-12-31T23:56:00 | 0.74
2010-12-31T23:57:00 | 0.66
2010-12-31T23:58:00 | 0.22
2010-12-31T23:59:00 | 0.47
julia> ta_vol_sum
35040x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
vol
2010-01-01T00:00:00 | 7.13
2010-01-01T00:15:00 | 6.99
2010-01-01T00:30:00 | 8.73
2010-01-01T00:45:00 | 8.27
⋮
2010-12-31T23:00:00 | 6.11
2010-12-31T23:15:00 | 7.49
2010-12-31T23:30:00 | 5.75
2010-12-31T23:45:00 | 8.36