Combine three dataframes' columns into a single dataframe - apache-spark-sql

In PySpark, I have created three dataframes: B1, P1, and C1.
Dataframe B1 has five columns (B_Num, B_Tin, B_Light, B_Dark, and B_White).
Dataframe P1 has three columns (P_Prov, P_Tip, and P_Bye).
Dataframe C1 has three columns (C_Cust, C_Addr1, and C_Addr2).
I tried a union of the three dataframes; it runs fine, but a union is not what I want. Instead I joined them on a generated id:
from pyspark.sql.functions import monotonically_increasing_id

B1 = B1.withColumn("id", monotonically_increasing_id())
P1 = P1.withColumn("id", monotonically_increasing_id())
C1 = C1.withColumn("id", monotonically_increasing_id())
combined = B1.join(P1, "id", "outer").join(C1, "id", "outer").drop("id")
display(combined)
Below is the column order of combined:
B_Num, B_Tin, B_Light, B_Dark, B_White, P_Prov, P_Tip, P_Bye, C_Cust, C_Addr1, and C_Addr2
I expect the output like this:
B_Num,P_Prov,B_Tin,C_Addr2,B_Light,P_Tip,C_Cust,B_Dark,B_White,P_Bye,C_Addr1

Since your problem is only the ordering of the columns (as mentioned in your comment), you can select them in the right order:
B1 = B1.withColumn("id", monotonically_increasing_id())
P1 = P1.withColumn("id", monotonically_increasing_id())
C1 = C1.withColumn("id", monotonically_increasing_id())
combined = B1.join(P1, "id", "outer").join(C1, "id", "outer").drop("id")
good_ordering = combined.select("B_Num", "P_Prov", "B_Tin", "C_Addr2", "B_Light", "P_Tip", "C_Cust", "B_Dark", "B_White", "P_Bye", "C_Addr1")
display(good_ordering)
>>> B_Num,P_Prov,B_Tin,C_Addr2,B_Light,P_Tip,C_Cust,B_Dark,B_White,P_Bye,C_Addr1
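One caveat: monotonically_increasing_id() only guarantees unique, increasing ids within each dataframe, and the generated values depend on partitioning, so the rows of B1, P1, and C1 are not guaranteed to receive matching ids. Below is a sketch of a safer sequential id via zipWithIndex (an addition of mine, not part of the original answer, assuming the data is small enough that the RDD round-trip is acceptable):

from pyspark.sql import Row

def with_seq_id(df):
    # zipWithIndex assigns consecutive 0-based indices in row order,
    # so the nth row of every dataframe gets the same id
    return df.rdd.zipWithIndex().map(
        lambda pair: Row(**pair[0].asDict(), id=pair[1])
    ).toDF()

B1 = with_seq_id(B1)
P1 = with_seq_id(P1)
C1 = with_seq_id(C1)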

Related

Dataframe to multiIndex for sktime format

I have multivariate time series data in this format (a pd.DataFrame with its index on Time).
I am trying to use sktime, which requires the data to be in multi-index format: a pd.DataFrame with a multi-index on (instance, time). I want to apply a rolling window of 3 to the above data to produce that format.
Is it possible to transform the data into this new format?
Edit: here's a more straightforward and probably faster solution using row indexing:
import pandas as pd

df = pd.DataFrame({
    'time': range(5),
    'a': [f'a{i}' for i in range(5)],
    'b': [f'b{i}' for i in range(5)],
})
w = 3  # window size
w_starts = range(0, len(df) - (w - 1))  # start positions of each window
# iterate through the overlapping windows to create the 'instance' col and concat
roll_df = pd.concat(
    df[s:s+w].assign(instance=i) for (i, s) in enumerate(w_starts)
).set_index(['instance', 'time'])
print(roll_df)
Output
               a   b
instance time
0        0    a0  b0
         1    a1  b1
         2    a2  b2
1        1    a1  b1
         2    a2  b2
         3    a3  b3
2        2    a2  b2
         3    a3  b3
         4    a4  b4
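A vectorized variant (my addition, not part of the original answers) that avoids the Python-level loop by indexing all windows at once; it assumes numpy >= 1.20 for sliding_window_view:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# positional index of every window, shape (n_windows, w)
idx = sliding_window_view(np.arange(len(df)), w)
roll_df = (
    df.iloc[idx.ravel()]
      .assign(instance=np.repeat(np.arange(len(idx)), w))
      .set_index(['instance', 'time'])
)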
Here's one way to achieve the desired result:
import numpy as np
import pandas as pd

# Create the instance column (one id per window, repeated 3 times)
instance = np.repeat(range(len(df) - 2), 3)
# Repeat the time column for each rolling window
time = np.concatenate([df.time[i:i+3].values for i in range(len(df) - 2)])
# Repeat the a column for each value in the rolling window
a = np.concatenate([df.a[i:i+3].values for i in range(len(df) - 2)])
# Repeat the b column for each value in the rolling window
b = np.concatenate([df.b[i:i+3].values for i in range(len(df) - 2)])
# Create a new DataFrame with the desired format
new_df = pd.DataFrame({'instance': instance, 'time': time, 'a': a, 'b': b})
# Set the MultiIndex on the new DataFrame
new_df.set_index(['instance', 'time'], inplace=True)
new_df

Pandas Table Conversion

How can I convert the table below to a table with columns ["ID", "PC1_0.1", "PC1_0.2", "PC1_0.3", ..., "PC10_111.2"] and only 24 rows? Rows may share the same wafer ID (the same wafer is measured repeatedly), and data for some wafers is not recorded.
I hope this code works for you :)
import numpy as np
import pandas as pd

d = {
    "ID": ["W-01"]*4 + ["W-02"]*2,
    "Time": ["t1", "t2"]*3,
    "PC1": ["00", "10", "20", "30", "40", "50"],
    "PC2": ["01", "11", "21", "31", "41", "51"],
}
df = pd.DataFrame(d)
# melt so we can group on Time-PC1-PC2, then pivot
melt = df.melt(id_vars=["ID", "Time"], value_vars=["PC1", "PC2"])
melt["no"] = np.arange(0, melt.shape[0])
pivot = melt.pivot(index=["no", "ID"], columns=["Time", "variable"], values="value")
# Combine the non-NaN columns, because the melt/pivot step produces NaN padding.
con = pd.DataFrame()
for col in range(pivot.columns.size):
    part = pivot.iloc[:, [col]].dropna()
    part = part.reset_index().drop("no", axis=1).set_index("ID")
    con = pd.concat([con, part], axis=1)
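To get single-level column names in the "PC..._..." style the question asks for, the MultiIndex columns of con can be flattened afterwards (a sketch of mine, assuming a "variable_time" naming is acceptable):

# flatten (Time, variable) column pairs into names such as "PC1_t1"
con.columns = [f"{var}_{t}" for (t, var) in con.columns]
print(con)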

pandas: expand columns with lists into multiple columns

I want to expand / cast a column that contains lists into multiple columns:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
# I want:
pd.DataFrame({'a':[1,2], 'b1':[11,33], 'b2':[22,44]})
Convert the column with .tolist(), create a DataFrame from it, then join back to the other column(s).
df = pd.concat([df.drop(columns='b'),
                pd.DataFrame(df['b'].tolist(), index=df.index).add_prefix('b')],
               axis=1)

   a  b0  b1
0  1  11  22
1  2  33  44
df = pd.DataFrame({'a': [1, 2], 'b': [[11, 22], [33, 44]]})
# pull each list element out into its own column
df["b1"] = df["b"].apply(lambda cell: cell[0])
df["b2"] = df["b"].apply(lambda cell: cell[1])
df[["a", "b1", "b2"]]
You can use .tolist() on your "b" column to expand it out, then just assign it back to the dataframe and get rid of your original "b" column:
df = pd.DataFrame({'a':[1,2], 'b':[[11,22],[33,44]]})
df[["b1", "b2"]] = df["b"].tolist()
df = df.drop("b", axis=1) # alternatively: del df["b"]
print(df)
   a  b1  b2
0  1  11  22
1  2  33  44
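If the lists can vary in length, the .tolist() approach still works; a quick sketch of mine (not from the original answers), assuming missing entries should become NaN:

df = pd.DataFrame({'a': [1, 2], 'b': [[11, 22], [33]]})
# shorter lists are padded with NaN by the DataFrame constructor
expanded = pd.DataFrame(df['b'].tolist(), index=df.index).add_prefix('b')
print(df.drop(columns='b').join(expanded))
#    a  b0    b1
# 0  1  11  22.0
# 1  2  33   NaN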

Create pandas dataframe from series with ordered dict as rows

I am trying to extract lmfit parameter results as dataframes. I pass one column x and one column data through fit_func with parameters pars; the params of the result returned by lmfit's minimize can be read out as an OrderedDict:
out = minimize(fit_func, pars, method = 'leastsq', args=(x, data))
res = out.params.valuesdict()
res
Output:
OrderedDict([('a1', 12.850309404600393),
             ('c1', 1346.833513206811),
             ('s1', 44.22337472274829),
             ('f1', 1.1275639898142586),
             ('a2', 77.15732669480884),
             ('c2', 1580.5712512351947),
             ('s2', 16.239969775527275),
             ('f2', 0.8684363668111492)])
For a single fit I got the DataFrame output I want with pd.DataFrame(res, index=[0]).
Now I have 3 data columns that I want to fit quickly:
x = d.iloc[:, 0]
fit_odict = pd.DataFrame(
    d.iloc[:, 1:4].apply(
        lambda y: minimize(fit_func, pars, method='leastsq', args=(x, y))
                  .params.valuesdict()
    ),
    index=[1],
)
But I get a series of OrderedDicts as rows in the DataFrame.
How do I get the output I want, with the three parameter results as rows? Is there a better way to apply the function?
UPDATE:
Adapted M Newville's answer into my solution. It might be helpful for those who want to quickly extract lmfit parameter results from multiple data columns d1.iloc[:,1:]:
def fff(cols):
    out = minimize(fit_func, pars, method='leastsq', args=(x, cols))
    return {key: par.value for key, par in out.params.items()}

results = d1.iloc[:, 1:].apply(fff, result_type='expand').transpose()
Output:
For a single fit, this would probably be what you are looking for:
out = minimize(fit_func, pars, method = 'leastsq', args=(x, data))
fit_odict = pd.DataFrame({key: [par.value] for key, par in out.params.items()})
I think you probably are looking for something like this:
results = {key: [] for key in pars}
for data in datasets:
    out = minimize(fit_func, pars, method='leastsq', args=(x, data))
    for par_name, val_list in results.items():
        val_list.append(out.params[par_name].value)
results = pd.DataFrame(results)
You could probably stuff that all into a single long line, but I wouldn't recommend it -- someone may want to read that code ;).
This is a quick workaround that you can do. The code is not efficient, but you can optimize it. Note that the index starts at 1; you are welcome to re-index using pandas.
import pandas as pd

# Your output is a list of tuples ("pairs" here, to avoid shadowing the
# OrderedDict class name)
pairs = [('a1', 12.850309404600393), ('c1', 1346.833513206811),
         ('s1', 44.22337472274829), ('f1', 1.1275639898142586),
         ('a2', 77.15732669480884), ('c2', 1580.5712512351947),
         ('s2', 16.239969775527275), ('f2', 0.8684363668111492)]
# Create a dataframe from the list of tuples and transpose it
df = pd.DataFrame(pairs).T
# Use the first row as the dataframe's column names
df.columns = df.loc[0].values.tolist()
output = df.drop(df.index[0])
output
        a1       c1       s1       f1       a2       c2     s2        f2
1  12.8503  1346.83  44.2234  1.12756  77.1573  1580.57  16.24  0.868436

Replacing column names in a pandas dataframe based on a lookup

Hi, I have several dataframes with column headings that vary slightly. An example of a dataframe header would be:
A1 B1 C1
In other dataframes the first heading is A2 or A3, etc. A1, B1, and C1 here stand for multi-character words/labels rather than the literal column names. I want to replace the column headings based on a mapping that I have between A1, A2, Ax, B1, B2, Bx, C1, C2, Cx, etc. and A, B, and C.
What is the best way of doing this?
Thanks in advance.
I think it is possible here to use str indexing to replace each name with its first letter:
df.columns = df.columns.str[0]
Another possible solution is to create a dictionary for the rename, e.g.:
d = {x:x[0] for x in df.columns}
df = df.rename(columns=d)
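Since the question notes the real headings are multi-character labels rather than the literal A1/B1 names, an explicit lookup dictionary may fit better than taking the first character. A sketch, where the lookup entries are hypothetical placeholders for the mapping the question mentions:

# hypothetical lookup from actual headings to canonical names
lookup = {"A1": "A", "A2": "A", "B1": "B", "B2": "B",
          "C1": "C", "C2": "C"}
df = df.rename(columns=lookup)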