Convert Series to Dataframe where series index is Dataframe column names - pandas

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
As a result I get a Series object where row.index.values contains the names of the df columns.
But what I want instead is a DataFrame with only one row, keeping the original DataFrame's columns as columns.
When I do row.to_frame(), instead of a 1x85 DataFrame (1 row, 85 cols) I get an 85x1 DataFrame whose index contains the column names, and row.to_frame().columns
outputs
Int64Index([0], dtype='int64').
But all I want is the original DataFrame's columns with only one row. How do I do it?
Or how do I convert the row.index values into column names and change the 85x1 shape to 1x85?

You just need to add .T (transpose):
row.to_frame().T
Alternatively, change your for loop to pass a list to iloc, so each row comes back as a one-row DataFrame:
for i in range(num_rows):
    row = df.iloc[[i]]
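As a minimal sketch (using a small made-up DataFrame named df), both approaches give a 1xN DataFrame whose columns match df:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Transpose the 2x1 frame produced by to_frame() into a 1x2 frame
row_t = df.iloc[0].to_frame().T    # shape (1, 2), columns ['a', 'b']

# Or pass a list to iloc so it returns a one-row DataFrame directly
row_df = df.iloc[[0]]              # shape (1, 2), columns ['a', 'b']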

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two DataFrames: one containing my data read in from a CSV file, and another with the same data grouped by all of the columns but the last, reindexed to contain a column with the size of each group.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
    for group in k1_grouped.itertuples():
        if row[1:-1] == group[1:-1]:
            size_by_row.append(group[-1])
            break
group_size = pd.Series(size_by_row)
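A likely faster alternative (a sketch, assuming the same df_k1 and columns_for_groups as above) is to let pandas broadcast each group's size back onto the rows with a grouped transform, avoiding the nested Python loops:
# Size of the group each row belongs to, aligned with df_k1's index
group_size = df_k1.groupby(columns_for_groups)[df_k1.columns[-1]].transform('size')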

Pyspark dynamic column selection from dataframe

I have a dataframe with multiple columns such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns).
The 2nd dataframe contains the names of certain columns from the 1st dataframe, e.g.:
columnname
t_pono
t_pric
:
:
I need to select only those columns from the 1st dataframe whose name is present in the 2nd. In the above example: t_pono, t_pric.
How can this be done?
Let's say you have the following columns (which can be obtained using df.columns, which returns a list):
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]
To get only those columns from the first dataframe that are present in the second one, you can do set intersection (and I cast it to a list, so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
And we get the result:
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
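If the wanted names actually live as rows of the 2nd dataframe (under a column called columnname, as in the example above), a hedged sketch of pulling them into a Python list first could look like this (df2 is assumed to be that second dataframe):
# Collect the values of the 'columnname' column into a plain Python list
df2_cols = [row['columnname'] for row in df2.select('columnname').collect()]

select_columns = list(set(df1.columns).intersection(df2_cols))
new_df = df1.select(*select_columns)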

How to broadcast a list of data into dataframe (Or multiIndex )

I have a big dataframe, about 200k rows and 3 columns (x, y, z). Some rows don't have y and z values and have only an x value. I want to make a new column where the first set of data gets the value 1, the second gets 2, then 3, etc. Or make a MultiIndex in the same format.
An image in the original post illustrates the desired result.
I made a new column called "NO." and set zero as its initial value. Then
I tried to record the indexes at which I want the new column to take a new value, with the following code:
df = pd.read_fwf(path, header=None, names=['x','y','z'])
df['NO.']=0
index_NO_changed = df.index[df['z'].isnull()]
Then I loop through it and change the number:
for i in range(len(index_NO_changed)-1):
    df['NO.'].iloc[index_NO_changed[i]:index_NO_changed[i+1]] = i+1
df['NO.'].iloc[index_NO_changed[-1]:] = len(index_NO_changed)
But the problem is that I get the warning:
A value is trying to be set on a copy of a slice from a DataFrame
I was wondering:
Is there any better way? Is creating a MultiIndex instead of adding another column easier, considering the size of the dataframe?
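One vectorized alternative (a sketch, assuming each block starts at a row where z is null, as in the loop above) numbers the blocks with a cumulative sum of the null mask, which also sidesteps the chained-assignment warning:
# Rows where z is null start a new block; cumsum turns the mask into 1, 2, 3, ...
df['NO.'] = df['z'].isnull().cumsum()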

Joining all elements in an array in a dataframe column with another dataframe

Let's say pcPartsInfoDf has the columns
pcPartCode:integer
pcPartName:string
And df has the array column
pcPartCodeList:array
|-- element:integer
The pcPartCodeList in df has a list of codes for each row that match with pcPartCode values in pcPartsInfoDf, but only pcPartsInfoDf has the names of the parts.
I'm trying to join the two dataframes so that we get a new column that is an array of strings for all the pc part names for a row, corresponding to the array of ints, pcPartCodeList. I tried doing this with the code below, but this only adds at most 1 part since pcPartName is typed as a string and only holds 1 value.
df
.join(pcPartsInfoDf, expr("array_contains(pcPartCodeList, pcPartCode)"))
.select(computerDf("*"), pcPartsInfoDf("pcPartName"))
How could I collect all the pcPartName values corresponding to a pcPartCodeList for a row, and put them in an array of strings in that row?
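One common pattern (a hedged PySpark sketch, assuming the original dataframe has some unique per-row key, here a hypothetical column rowId) keeps the array_contains join and then re-aggregates the matched names into an array per row with collect_list:
from pyspark.sql import functions as F

# Join each row to all matching parts, then gather the names back into one array per row
result = (
    df.join(pcPartsInfoDf, F.expr("array_contains(pcPartCodeList, pcPartCode)"), "left")
      .groupBy("rowId")                               # rowId is a hypothetical per-row key
      .agg(F.first("pcPartCodeList").alias("pcPartCodeList"),
           F.collect_list("pcPartName").alias("pcPartNames"))
)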

Python and Pandas: apply per multiple columns

I am new to Python, and I successfully used apply on a dataframe to create a new column.
X['Geohash'] = X[['Lat','Long']].apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)
This calls the geohash function with the latitude and longitude of each row.
Now I have two new dataframes, one for Latitude and one for Longitude.
Each dataframe has twenty columns, and I want the
.apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)
call to run twenty times:
-the first time with the first Latitude column and the first Longitude column,
-the second time with the second Latitude column and the second Longitude column, and so on.
How can I do this iteration per column, calling
.apply(lambda column: geohash.encode(column[0], column[1], precision=8), axis=1)
at each iteration?
What I want is a new dataframe with twenty columns, each column being the result of the geohash function.
Ideas will be appreciated.
You can do this by creating an "empty" dataframe with 20 columns, and then using df.columns[i] to loop through your other dataframes - something like this:
output = pd.DataFrame({i:[] for i in range(20)})
This creates an empty dataframe with all the columns you wanted (numbered).
Now, let's say the longitude and latitude dataframes are called 'lon' and 'lat'. We need to join them into one dataframe. Then:
lonlat = lat.join(lon)
for i in range(len(output.columns)):
    output[output.columns[i]] = lonlat.apply(lambda column: geohash.encode(column[lat.columns[i]],
                                                                           column[lon.columns[i]],
                                                                           precision=8), axis=1)
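A slightly more direct variant (a sketch, assuming lat and lon have the same number of columns and aligned rows, and the same geohash module as above) pairs the columns positionally and skips the intermediate join:
import pandas as pd

# Pair the i-th latitude column with the i-th longitude column and
# geohash them row by row; the result has one column per pair.
output = pd.DataFrame({
    i: [geohash.encode(la, lo, precision=8) for la, lo in zip(lat[lat_col], lon[lon_col])]
    for i, (lat_col, lon_col) in enumerate(zip(lat.columns, lon.columns))
})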