Combine three dataframes' columns into a single dataframe - apache-spark-sql

In PySpark, I have created three dataframes: B1, P1, and C1.
Dataframe B1 has five columns (B_Num, B_Tin, B_Light, B_Dark, and B_White).
Dataframe P1 has three columns (P_Prov, P_Tip, and P_Bye).
Dataframe C1 has three columns (C_Cust, C_Addr1, and C_Addr2).
I tried a union of the three dataframes; it runs fine, but a union is not what I want. Instead I joined them on a generated id:
from pyspark.sql.functions import monotonically_increasing_id

B1 = B1.withColumn("id", monotonically_increasing_id())
P1 = P1.withColumn("id", monotonically_increasing_id())
C1 = C1.withColumn("id", monotonically_increasing_id())
combined = B1.join(P1, "id", "outer").join(C1, "id", "outer").drop("id")
display(combined)
Below is the column order of combined:
B_Num, B_Tin, B_Light, B_Dark, B_White, P_Prov, P_Tip, P_Bye, C_Cust, C_Addr1, and C_Addr2
I expect the output like this:
B_Num,P_Prov,B_Tin,C_Addr2,B_Light,P_Tip,C_Cust,B_Dark,B_White,P_Bye,C_Addr1

Since your problem is only the ordering of the columns (as mentioned in your comment), you can select them in the right order:
B1 = B1.withColumn("id", monotonically_increasing_id())
P1 = P1.withColumn("id", monotonically_increasing_id())
C1 = C1.withColumn("id", monotonically_increasing_id())
combined = B1.join(P1, "id", "outer").join(C1, "id", "outer").drop("id")
good_ordering = combined.select("B_Num", "P_Prov", "B_Tin", "C_Addr2", "B_Light", "P_Tip", "C_Cust", "B_Dark", "B_White", "P_Bye", "C_Addr1")
display(good_ordering)
>>> B_Num,P_Prov,B_Tin,C_Addr2,B_Light,P_Tip,C_Cust,B_Dark,B_White,P_Bye,C_Addr1
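One caveat: monotonically_increasing_id() only guarantees unique, increasing ids within each dataframe, and the generated values depend on partitioning, so the rows of B1, P1, and C1 are not guaranteed to receive matching ids. Below is a sketch of a safer sequential id via zipWithIndex (an addition of mine, not part of the original answer, assuming the data is small enough that the RDD round-trip is acceptable):

from pyspark.sql import Row

def with_seq_id(df):
    # zipWithIndex assigns consecutive 0-based indices in row order,
    # so the nth row of every dataframe gets the same id
    return df.rdd.zipWithIndex().map(
        lambda pair: Row(**pair[0].asDict(), id=pair[1])
    ).toDF()

B1 = with_seq_id(B1)
P1 = with_seq_id(P1)
C1 = with_seq_id(C1)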

Related

Dataframe to multiIndex for sktime format

I have multivariate time series data in this format (a pd.DataFrame with its index on Time).
I am trying to use sktime, which requires the data to be in multi-index format: a pd.DataFrame with a multi-index on (instance, time). I want to apply a rolling window of 3 to the above data to produce that format.
Is it possible to transform the data into this new format?
Edit: here's a more straightforward and probably faster solution using row indexing:
import pandas as pd

df = pd.DataFrame({
    'time': range(5),
    'a': [f'a{i}' for i in range(5)],
    'b': [f'b{i}' for i in range(5)],
})
w = 3  # window size
w_starts = range(0, len(df) - (w - 1))  # start positions of each window
# iterate through the overlapping windows to create the 'instance' col and concat
roll_df = pd.concat(
    df[s:s+w].assign(instance=i) for (i, s) in enumerate(w_starts)
).set_index(['instance', 'time'])
print(roll_df)
Output
               a   b
instance time
0        0    a0  b0
         1    a1  b1
         2    a2  b2
1        1    a1  b1
         2    a2  b2
         3    a3  b3
2        2    a2  b2
         3    a3  b3
         4    a4  b4
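A vectorized variant (my addition, not part of the original answers) that avoids the Python-level loop by indexing all windows at once; it assumes numpy >= 1.20 for sliding_window_view:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# positional index of every window, shape (n_windows, w)
idx = sliding_window_view(np.arange(len(df)), w)
roll_df = (
    df.iloc[idx.ravel()]
      .assign(instance=np.repeat(np.arange(len(idx)), w))
      .set_index(['instance', 'time'])
)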
Here's one way to achieve the desired result:
import numpy as np
import pandas as pd

# Create the instance column (one id per window, repeated 3 times)
instance = np.repeat(range(len(df) - 2), 3)
# Repeat the time column for each rolling window
time = np.concatenate([df.time[i:i+3].values for i in range(len(df) - 2)])
# Repeat the a column for each value in the rolling window
a = np.concatenate([df.a[i:i+3].values for i in range(len(df) - 2)])
# Repeat the b column for each value in the rolling window
b = np.concatenate([df.b[i:i+3].values for i in range(len(df) - 2)])
# Create a new DataFrame with the desired format
new_df = pd.DataFrame({'instance': instance, 'time': time, 'a': a, 'b': b})
# Set the MultiIndex on the new DataFrame
new_df.set_index(['instance', 'time'], inplace=True)
new_df

Pandas Table Conversion

How can I convert the table below to a table with columns ["ID", "PC1_0.1", "PC1_0.2", "PC1_0.3", ..., "PC10_111.2"] and only 24 rows? Rows may share the same wafer ID (the same wafer is measured repeatedly), and data for some wafers is not recorded.
I hope this code works for you :)
import numpy as np
import pandas as pd

d = {
    "ID": ["W-01"]*4 + ["W-02"]*2,
    "Time": ["t1", "t2"]*3,
    "PC1": ["00", "10", "20", "30", "40", "50"],
    "PC2": ["01", "11", "21", "31", "41", "51"],
}
df = pd.DataFrame(d)
# melt so we can group on Time-PC1-PC2, then pivot
melt = df.melt(id_vars=["ID", "Time"], value_vars=["PC1", "PC2"])
melt["no"] = np.arange(0, melt.shape[0])
pivot = melt.pivot(index=["no", "ID"], columns=["Time", "variable"], values="value")
# Combine the non-NaN columns, because the melt/pivot step produces NaN padding.
con = pd.DataFrame()
for col in range(pivot.columns.size):
    part = pivot.iloc[:, [col]].dropna()
    part = part.reset_index().drop("no", axis=1).set_index("ID")
    con = pd.concat([con, part], axis=1)
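To get single-level column names in the "PC..._..." style the question asks for, the MultiIndex columns of con can be flattened afterwards (a sketch of mine, assuming a "variable_time" naming is acceptable):

# flatten (Time, variable) column pairs into names such as "PC1_t1"
con.columns = [f"{var}_{t}" for (t, var) in con.columns]
print(con)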

pandas: expand columns with lists into multiple columns

I want to expand / cast a column that contains lists into multiple columns:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
# I want:
pd.DataFrame({'a':[1,2], 'b1':[11,33], 'b2':[22,44]})
Convert the column with .tolist(), create a DataFrame from it, then join back to the other column(s).
df = pd.concat([df.drop(columns='b'),
                pd.DataFrame(df['b'].tolist(), index=df.index).add_prefix('b')],
               axis=1)

   a  b0  b1
0  1  11  22
1  2  33  44
df = pd.DataFrame({'a': [1, 2], 'b': [[11, 22], [33, 44]]})
# pull each list element out into its own column
df["b1"] = df["b"].apply(lambda cell: cell[0])
df["b2"] = df["b"].apply(lambda cell: cell[1])
df[["a", "b1", "b2"]]
You can use .tolist() on your "b" column to expand it out, then just assign it back to the dataframe and get rid of your original "b" column:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
df[["b1", "b2"]] = df["b"].tolist()
df = df.drop("b", axis=1) # alternatively: del df["b"]
print(df)
   a  b1  b2
0  1  11  22
1  2  33  44
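If the lists can vary in length, the .tolist() approach still works; a quick sketch of mine (not from the original answers), assuming missing entries should become NaN:

df = pd.DataFrame({'a': [1, 2], 'b': [[11, 22], [33]]})
# shorter lists are padded with NaN by the DataFrame constructor
expanded = pd.DataFrame(df['b'].tolist(), index=df.index).add_prefix('b')
print(df.drop(columns='b').join(expanded))
#    a  b0    b1
# 0  1  11  22.0
# 1  2  33   NaN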

Create pandas dataframe from series with ordered dict as rows

I am trying to extract lmfit parameter results as dataframes. I pass one column x and one column data through fit_func with parameters pars; the params of the result returned by lmfit's minimize can be read out as an OrderedDict:
out = minimize(fit_func, pars, method = 'leastsq', args=(x, data))
res = out.params.valuesdict()
res
Output:
OrderedDict([('a1', 12.850309404600393),
             ('c1', 1346.833513206811),
             ('s1', 44.22337472274829),
             ('f1', 1.1275639898142586),
             ('a2', 77.15732669480884),
             ('c2', 1580.5712512351947),
             ('s2', 16.239969775527275),
             ('f2', 0.8684363668111492)])
For a single fit I got the DataFrame output I want with pd.DataFrame(res, index=[0]).
Now I have 3 data columns that I want to fit quickly:
x = d.iloc[:, 0]
fit_odict = pd.DataFrame(
    d.iloc[:, 1:4].apply(
        lambda y: minimize(fit_func, pars, method='leastsq', args=(x, y))
                  .params.valuesdict()
    ),
    index=[1],
)
But I get a series of OrderedDicts as rows in the DataFrame.
How do I get the output I want, with the three parameter results as rows? Is there a better way to apply the function?
UPDATE:
Adapted M Newville's answer into my solution. It might be helpful for those who want to quickly extract lmfit parameter results from multiple data columns d1.iloc[:,1:]:
def fff(cols):
    out = minimize(fit_func, pars, method='leastsq', args=(x, cols))
    return {key: par.value for key, par in out.params.items()}

results = d1.iloc[:, 1:].apply(fff, result_type='expand').transpose()
Output:
For a single fit, this would probably be what you are looking for:
out = minimize(fit_func, pars, method = 'leastsq', args=(x, data))
fit_odict = pd.DataFrame({key: [par.value] for key, par in out.params.items()})
I think you probably are looking for something like this:
results = {key: [] for key in pars}
for data in datasets:
    out = minimize(fit_func, pars, method='leastsq', args=(x, data))
    for par_name, val_list in results.items():
        val_list.append(out.params[par_name].value)
results = pd.DataFrame(results)
You could probably stuff that all into a single long line, but I wouldn't recommend it -- someone may want to read that code ;).
This is a quick workaround that you can do. The code is not efficient, but you can optimize it. Note that the index starts at 1; you are welcome to re-index using pandas.
import pandas as pd

# Your output is a list of tuples ("pairs" here, to avoid shadowing the
# OrderedDict class name)
pairs = [('a1', 12.850309404600393), ('c1', 1346.833513206811),
         ('s1', 44.22337472274829), ('f1', 1.1275639898142586),
         ('a2', 77.15732669480884), ('c2', 1580.5712512351947),
         ('s2', 16.239969775527275), ('f2', 0.8684363668111492)]
# Create a dataframe from the list of tuples and transpose it
df = pd.DataFrame(pairs).T
# Use the first row as the dataframe's column names
df.columns = df.loc[0].values.tolist()
output = df.drop(df.index[0])
output
        a1       c1       s1       f1       a2       c2     s2        f2
1  12.8503  1346.83  44.2234  1.12756  77.1573  1580.57  16.24  0.868436

Replacing column names in a pandas dataframe based on a lookup

Hi, I have several dataframes with column headings that vary slightly. An example of a dataframe header would be:
A1 B1 C1
In other dataframes the first heading is A2 or A3, etc. A1, B1, and C1 here stand for multi-character words/labels rather than the literal column names. I want to replace the column headings based on a mapping that I have between A1, A2, Ax, B1, B2, Bx, C1, C2, Cx, etc. and A, B, and C.
What is the best way of doing this?
Thanks in advance.
I think it is possible here to use str indexing to replace each name with its first letter:
df.columns = df.columns.str[0]
Another possible solution is to create a dictionary for the rename, e.g.:
d = {x:x[0] for x in df.columns}
df = df.rename(columns=d)
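Since the question notes the real headings are multi-character labels rather than the literal A1/B1 names, an explicit lookup dictionary may fit better than taking the first character. A sketch, where the lookup entries are hypothetical placeholders for the mapping the question mentions:

# hypothetical lookup from actual headings to canonical names
lookup = {"A1": "A", "A2": "A", "B1": "B", "B2": "B",
          "C1": "C", "C2": "C"}
df = df.rename(columns=lookup)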