Create columns which correspond to the number of each character contained in the strings of a dataframe column

I have a dataframe whose first column contains strings (e.g. 'AABCD'). For each string I have to count the occurrences of each character, and the result of each count must be stored in its own column (one column per character: A, B, C, D). In other words, I want to create columns A, B, C, D containing, for each row, the number of times that character appears in the row's string.

Assuming the A–D columns already exist in the dataframe, and the column containing the strings really is a column and not the index:
Set up the dataframe:
import pandas as pd

df = pd.DataFrame({
    "string": ["AABCD", "ACCB", "AB", "AC"],
    "A": [float("nan")] * 4,
    "B": [float("nan")] * 4,
    "C": [float("nan")] * 4,
    "D": [float("nan")] * 4,
})
Loop through the columns and apply a lambda function to each row.
for col_name in df.columns:
    if col_name == "string":
        continue
    df[col_name] = df.apply(lambda row: row["string"].count(col_name), axis=1)
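If the A–D columns don't need to exist beforehand, a shorter vectorized sketch of the same idea uses Series.str.count per letter:

# Count each letter directly in the "string" column, creating the columns as we go.
for letter in ["A", "B", "C", "D"]:
    df[letter] = df["string"].str.count(letter)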

Related

PySpark dynamic column selection from dataframe

I have a dataframe with multiple columns such as t_orno, t_pono, t_sqnb, t_pric, ... and so on (it's a table with many columns).
The second dataframe contains the names of certain columns from the first dataframe, e.g.:
columnname
t_pono
t_pric
:
:
I need to select from the first dataframe only those columns whose names are present in the second one; in the example above, t_pono and t_pric.
How can this be done?
Let's say you have the following columns (which can be obtained using df.columns, which returns a list):
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]
To get only those columns from the first dataframe that are present in the second one, take the set intersection (cast to a list so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
And we get the result:
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
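If, as in the question, the column names are stored as values in a column of the second dataframe (rather than being its column labels), one way is to collect them into a Python list first; a sketch assuming that column is called columnname:

# Collect the names out of df2, then intersect with df1's actual columns.
df2_cols = [row["columnname"] for row in df2.select("columnname").collect()]
select_columns = list(set(df1.columns).intersection(df2_cols))
new_df = df1.select(*select_columns)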

Joining all elements in an array in a dataframe column with another dataframe

Let's say pcPartsInfoDf has the columns
pcPartCode:integer
pcPartName:string
And df has the array column
pcPartCodeList:array
|-- element:integer
The pcPartCodeList in df has a list of codes for each row that match with pcPartCode values in pcPartsInfoDf, but only pcPartsInfoDf has the names of the parts.
I'm trying to join the two dataframes so that we get a new column that is an array of strings for all the pc part names for a row, corresponding to the array of ints, pcPartCodeList. I tried doing this with the code below, but this only adds at most 1 part since pcPartName is typed as a string and only holds 1 value.
df
  .join(pcPartsInfoDf, expr("array_contains(pcPartCodeList, pcPartCode)"))
  .select(df("*"), pcPartsInfoDf("pcPartName"))
How could I collect all the pcPartName values corresponding to a pcPartCodeList for a row, and put them in an array of strings in that row?
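One possible approach (a sketch, not a drop-in answer): explode pcPartCodeList into one row per code, join on pcPartCode, then group back and collect the matching names into an array. It assumes df has a unique key column, hypothetically called rowId, to regroup the exploded rows:

from pyspark.sql.functions import explode, collect_list, col

# One row per (rowId, pcPartCode) pair; "rowId" is a hypothetical unique key.
exploded = df.withColumn("pcPartCode", explode(col("pcPartCodeList")))

# Attach the part names, then gather them back into an array per row.
names = (
    exploded.join(pcPartsInfoDf, on="pcPartCode", how="left")
            .groupBy("rowId")
            .agg(collect_list("pcPartName").alias("pcPartNames"))
)

result = df.join(names, on="rowId", how="left")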

pandas df: replace values with np.NaN if character counts do not match across multiple columns

I'm currently stuck with something I hope to find an answer for in this forum:
I have a df with multiple columns containing URLs; my index consists of URLs as well.
AIM: I'd like to replace df values across all columns with np.NaN wherever the number of "/" in the index does not equal the number of "/" in the value of each of the other columns.
First, you need one column to compare to.
counts = df['id_url'].str.count('/')
Then you evaluate all the rows at once.
mask = df.apply(lambda col: col.str.count('/')).eq(counts, axis=0)
Then we want to find the rows where all the values match.
mask = mask.all(axis=1)
Now that we have a mask marking the rows where every count matches, we can invert it with ~ to select the rows where at least one column does not match.
df.loc[~mask, :] = np.nan # replaces every value in the row with np.nan
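Put together (a minimal sketch, assuming every column holds URL strings and 'id_url' is the reference column):

import numpy as np

# Count "/" in the reference column, compare every column's counts to it row by
# row, and blank out the rows where any column disagrees.
counts = df['id_url'].str.count('/')
mask = df.apply(lambda col: col.str.count('/')).eq(counts, axis=0).all(axis=1)
df.loc[~mask, :] = np.nan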

Convert Series to Dataframe where series index is Dataframe column names

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
as a result I am getting a Series object where row.index.values contains names of df columns.
But instead I wanted a dataframe with only one row that keeps the dataframe's columns in place.
When I do row.to_frame(), instead of a 1x85 dataframe (1 row, 85 cols) I get an 85x1 dataframe whose index contains the column names and whose .columns outputs Int64Index([0], dtype='int64').
But all I want is just original data-frame columns with only one row. How do I do it?
Or, how do I convert the row.index values to column values and change the 85x1 shape to 1x85?
You just need to add .T (transpose):
row.to_frame().T
Alternatively, change your for loop to select with [], so iloc returns a one-row DataFrame directly:
for i in range(num_rows):
    row = df.iloc[[i]]
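Both approaches give a 1xN DataFrame whose columns match the original dataframe's columns; a quick check, assuming an 85-column df:

print(df.iloc[0].to_frame().T.shape)  # (1, 85)
print(df.iloc[[0]].shape)             # (1, 85)

One practical difference: .to_frame().T can upcast the values to a common dtype (often object) when the row mixes types, while df.iloc[[i]] keeps the original column dtypes.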

PySpark Aggregation on a Comma-Separated Column

I have a huge DataFrame; two of its many columns are "NAME" and "VALUE". One of the row values in the "NAME" column is "X,Y,V,A".
I want to transpose my DataFrame so the "NAME" values are columns and the average of the "VALUE" are the row values.
I used the pivot function:
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')
All NAME values except "X,Y,V,A" work well with the above. I am not sure how to separate the four values of "X,Y,V,A" and aggregate on each individual value.
IIUC, you need to split and explode the string first:
from pyspark.sql.functions import split, explode
df = df.withColumn("NAME", explode(split("NAME", ",")))
Now you can group and pivot:
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')
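Put together, the whole pipeline looks roughly like this (a sketch reusing the column names from the question):

from pyspark.sql.functions import split, explode

df1 = (
    df.withColumn("NAME", explode(split("NAME", ",")))  # one row per NAME token
      .groupby("DEVICE", "DATE")
      .pivot("NAME")
      .avg("VALUE")
)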