I would like to create a ((25520*43), 3) pandas DataFrame in a for loop.
I created the DataFrame like this:
lst=['Region', 'GeneID', 'DistanceValue']
df=pd.DataFrame(index=lst).T
And now I want to fill 'Region' 43 times with 25520 values each, and likewise GeneID and DistanceValue.
This is my for loop for that:
for i in range(43):
    df.DistanceValue = np.sort(distance[i, :])
    df.Region = np.ones(25520) * i
    args = np.argsort(distance[i, :])
    df.GeneID = ids[int(args[i])]
But then my df ends up with shape (25520, 3), so only the last of the 43 iterations is kept.
How can I concatenate all 43 iterations into my df?
I can't reproduce your example, but there are a couple of corrections you can make:
lst = ['Region', 'GeneID', 'DistanceValue']
df = pd.DataFrame(index=lst).T

region = []
for i in range(43):
    region.append(np.ones(25520) * i)

flat_list = [item for sublist in region for item in sublist]
df.Region = flat_list
First create a new list outside the loop, then append values to it inside the loop.
flat_list flattens all 43 arrays into a single list, which you can then assign to the DataFrame column. It is always easier to fill DataFrame values outside of the loop.
Similarly you can update all 3 columns.
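Putting the pattern together for all three columns, a minimal runnable sketch (small stand-in sizes instead of 43 and 25520, and made-up `distance`/`ids` data, since the originals aren't shown): collect per-iteration arrays in lists, then build the DataFrame once after the loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_regions, n_genes = 3, 5              # stand-ins for 43 and 25520
distance = rng.random((n_regions, n_genes))
ids = np.array(['g0', 'g1', 'g2', 'g3', 'g4'])

region, gene_id, dist_val = [], [], []
for i in range(n_regions):
    order = np.argsort(distance[i, :])
    region.append(np.ones(n_genes, dtype=int) * i)
    gene_id.append(ids[order])             # IDs reordered by distance
    dist_val.append(distance[i, :][order])  # sorted distances

# One DataFrame built after the loop keeps every iteration.
df = pd.DataFrame({
    'Region': np.concatenate(region),
    'GeneID': np.concatenate(gene_id),
    'DistanceValue': np.concatenate(dist_val),
})
print(df.shape)  # (15, 3) here; (25520*43, 3) with the real sizes
```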
I have a dataframe with 3 columns. Now I added one more column, which I am filling with unique values generated with the random module.
I created a list variable and, using a for loop, appended random strings to that list.
After that, I wrote another loop in which I take values from the list and assign them to the column.
But the same value is being written to every row.
import random
import string

import pandas as pd

df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
#Output.......
A B C randColumn
0 1 2 3 WHI11NJBNI8BOTMA9RKA
1 4 5 6 WHI11NJBNI8BOTMA9RKA
Could you please help me fix this? Why does each row have the same value from the list?
Updated to work correctly with any type of column in df.
If I understood your question correctly, you can use the zip method of RDDs to achieve your goal.
import random
import string

from pyspark.sql import SparkSession, Row

lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits)
                          for i in range(20))
    # Add each random string to the list as a Row
    lst.append(Row(random=rand_column))

# Make an RDD from the list of random-string Rows
random_rdd = sparkSession.sparkContext.parallelize(lst)

# Zip the original DataFrame's rows with the random Rows and merge their fields
res = df.rdd.zip(random_rdd).map(
    lambda rows: Row(**rows[0].asDict(), **rows[1].asDict())
).toDF()
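For a plain pandas DataFrame (as in the question's code), the same-value symptom has a simpler cause: `df['randColumn'] = j` assigns a single string, which pandas broadcasts to every row, and each loop pass overwrites the previous one. Assigning the whole list at once fixes it. A minimal sketch with a toy two-row frame (the seed is only there to make the sketch reproducible):

```python
import random
import string

import pandas as pd

random.seed(0)  # reproducibility for this sketch only

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

# Build one random 20-character string per row, outside any assignment loop.
lst = [''.join(random.choice(string.ascii_uppercase + string.digits)
               for _ in range(20))
       for _ in range(len(df))]

# Assigning the list maps one value per row; assigning a single string
# in a loop (df['randColumn'] = j) broadcasts it to every row instead.
df['randColumn'] = lst
print(df)
```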
for row in range(1, len(df)):
    try:
        df_out, orthogroup, len_group = HOG_get_group_stats(df.loc[row, "HOG"])
        temp_df = pd.DataFrame()
        for id in range(len(df_out)):
            print(" ")
            temp_df = pd.concat([df, pd.DataFrame(df_out.iloc[id, :]).T], axis=1)
            temp_df["HOG"] = orthogroup
            temp_df["len_group"] = len_group
            print(temp_df)
    except:
        print(row, "no")
Here I have a script that does the following:
Iterate over df and apply the HOG_get_group_stats function to the HOG column of df, getting 3 outputs. (Basically, the function creates some stats as a data frame called df_out, and extracts some information as two more columns called orthogroup and len_group.)
Create an empty template called temp_df.
Transpose the df_out data frame into a single row and concatenate it column-wise with the df we used in the beginning.
Add the orthogroup and len_group columns to the end of temp_df.
Problem:
It prints out the data; however, when I look at temp_df as a data frame it shows only a single row (probably the last one), which means my concatenation of several data frames doesn't work.
Questions:
How can I iterate and then append a data frame as columns?
Is there an easier way to iterate over a data frame (e.g. iterrows)?
Is there a better way to transpose rows to columns in a data frame (e.g. pivot, melt)?
Any help would be appreciated!
You can find the sample files to df, df_out,temp_df and expected output_sample table here :
Sample_files
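One common pattern for a loop like the one above is to collect each iteration's piece in a list and call `pd.concat` once after the loop, so no iteration overwrites the previous one. A minimal sketch, with `HOG_get_group_stats` stubbed out since the real function and data aren't shown:

```python
import pandas as pd

df = pd.DataFrame({'HOG': ['OG1', 'OG2', 'OG3']})

def HOG_get_group_stats(hog):
    # Hypothetical stand-in for the question's function:
    # one row of stats plus the two scalar outputs.
    stats = pd.DataFrame({'mean_len': [len(hog) * 10.0]})
    return stats, hog, len(hog)

pieces = []
for row in df.itertuples(index=False):
    df_out, orthogroup, len_group = HOG_get_group_stats(row.HOG)
    # Attach the two scalars as columns on this iteration's piece.
    pieces.append(df_out.assign(HOG=orthogroup, len_group=len_group))

# One concat after the loop keeps every iteration instead of only the last.
temp_df = pd.concat(pieces, ignore_index=True)
print(temp_df)
```

`itertuples` is generally faster than `iterrows` and answers the "easier way to iterate" question as well.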
I am trying to append dataframe subsets to a separate dataframe.
So far I have tried appending to an empty dataframe, but it returns an empty dataframe.
search_ID = ['36962G3P7', 'B3V3W13', 'XS1903485800', 'EXLS']
Search_Results = pd.DataFrame()
for i in search_ID:
    current_table = df.loc[df.isin([i]).any(axis=1)]
    Search_Results.append(current_table)
# Returns an empty dataframe
When I print every iteration it shows that it is creating new dataframes for every list item.
Search_ID = ['36962G3P7', 'B3V3W13', 'XS1903485800', 'EXLS']
Search_Results = pd.DataFrame()
for i in Search_ID:
    current_table = df.loc[df.isin([i]).any(axis=1)]
    print(current_table)
# Returns 4 printed dataframes
When I append outside the loop, the table does append to the empty dataframe.
current_table = df.loc[df.isin(['36962G3P7']).any(axis=1)]
Search_Results.append(current_table)
#Returns a filled dataframe
The append method does not operate in place, so you need to reassign on each iteration:
Search_Results = Search_Results.append(current_table)
Depending on how many times you append, this can be very slow (and DataFrame.append was deprecated and removed in pandas 2.0), so you might instead consider:
Search_ID = ['36962G3P7', 'B3V3W13', 'XS1903485800', 'EXLS']
Search_Results = pd.concat([
    df.loc[df.isin([i]).any(axis=1)]
    for i in Search_ID
], axis=0)
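A self-contained demo of that pattern (the toy `df` and its values are made up for illustration; the question's real data lives in `df`):

```python
import pandas as pd

# Toy stand-in for the question's df.
df = pd.DataFrame({
    'id_a': ['36962G3P7', 'B3V3W13', 'AAA'],
    'id_b': ['X', 'EXLS', 'Y'],
})
Search_ID = ['36962G3P7', 'B3V3W13', 'XS1903485800', 'EXLS']

# Each list element is the subset matching one ID; concat joins them all.
Search_Results = pd.concat([
    df.loc[df.isin([i]).any(axis=1)]
    for i in Search_ID
], axis=0)
print(Search_Results)
```

Note that a row matching more than one search ID appears once per match (here the 'B3V3W13'/'EXLS' row shows up twice); `drop_duplicates()` removes the repeats if that is unwanted.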
I tried to run my self-written function in a for loop, but it does not work as expected.
Some remarks in advance:
ma_strategy is my function and requires three inputs.
ticker_list is a list of strings.
result is a pandas DataFrame with 7 columns; I can access the column 'return_cum' with result['return_cum']. The rows of this column contain floating-point numbers.
These for loops don't work:
for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    x = result['return_cum']
    sample_returns = pd.DataFrame
    y = pd.merge(x.to_frame(), sample_returns, left_index=True)

for i in ticker_list:
    result = ma_strategy(i, 20, 5)
    x = result[['return_cum']]
    sample_returns = pd.DataFrame
    y = pd.concat([sample_returns, x], axis=1)
My intention is the following:
The for loop should iterate over the items in my ticker_list and save the 'return_cum' column in x on each pass. The 'return_cum' columns should then be collected in y, so that at the end I get a DataFrame with the 'return_cum' columns of all tickers in my list.
How can I achieve that goal? I tried pd.concat and merge, but nothing works.
Thanks for your help!
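Two things go wrong in the loops above: `sample_returns = pd.DataFrame` assigns the class itself (it needs `pd.DataFrame()` to create an instance), and because it sits inside the loop, the accumulator is reset on every pass. One way that works is to collect each ticker's column in a list and concatenate once after the loop. A minimal sketch, with `ma_strategy` stubbed out since the real function isn't shown:

```python
import pandas as pd

def ma_strategy(ticker, fast, slow):
    # Hypothetical stand-in for the question's function: in reality it
    # returns a 7-column DataFrame that includes 'return_cum'.
    base = {'AAPL': 1.0, 'MSFT': 2.0}[ticker]
    return pd.DataFrame({'return_cum': [base, base + 0.1, base + 0.2]})

ticker_list = ['AAPL', 'MSFT']

columns = []
for ticker in ticker_list:
    result = ma_strategy(ticker, 20, 5)
    # Rename the Series so each ticker gets its own column after concat.
    columns.append(result['return_cum'].rename(ticker))

# Concatenate once, after the loop; each iteration contributes one column.
sample_returns = pd.concat(columns, axis=1)
print(sample_returns)
```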
I have two dataframes: one contains screen names/display names, and the other contains individuals. I am trying to create a third dataframe that contains all the data from both dataframes in a new row each time a last name appears in a screen name or display name. Functionally, this creates a list of possible matching names. My current code, which works correctly but very slowly, looks like this:
# Original Social Media Screen Names
# cols = 'userid','screen_name','real_name'
usernames = pd.read_csv('social_media_accounts.csv')

# List Of Individuals To Match To Accounts
# cols = 'first_name','last_name'
individuals = pd.read_csv('individuals_list.csv')

userid, screen_name, real_name, last_name, first_name = [], [], [], [], []
for index1, row1 in individuals.iterrows():
    for index2, row2 in usernames.iterrows():
        if (row2['Screen_Name'].lower().find(row1['Last_Name'].lower()) != -1) | \
           (row2['Real_Name'].lower().find(row1['Last_Name'].lower()) != -1):
            userid.append(row2['UserID'])
            screen_name.append(row2['Screen_Name'])
            real_name.append(row2['Real_Name'])
            last_name.append(row1['Last_Name'])
            first_name.append(row1['First_Name'])

cols = ['UserID', 'Screen_Name', 'Real_Name', 'Last_Name', 'First_Name']
index = range(0, len(userid))
match_list = pd.DataFrame(index=index, columns=cols)
match_list = match_list.fillna('')
match_list['UserID'] = userid
match_list['Screen_Name'] = screen_name
match_list['Real_Name'] = real_name
match_list['Last_Name'] = last_name
match_list['First_Name'] = first_name
Because I need the whole row, not just one column, the list-comprehension methods I have tried do not seem to work.
The first thing you want is to iterate through a dataframe faster. Doing that with a list comprehension still means taking data out of a pandas dataframe, handling it with plain-Python operations, and putting it back into a pandas dataframe; the faster way (at this data size) is to stay inside pandas' own operations.
The next thing you want is to work with 2 dataframes, and pandas has a tool for that: merge (a join). An equality merge can't catch substring matches, so pair up the frames first and filter afterwards:
result = pd.merge(usernames, individuals, how='cross')
After the merge you can do your filtering.
Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html
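Putting that together, a small runnable sketch (toy data standing in for the CSVs; `how='cross'` requires pandas >= 1.2): cross-join the two frames, then keep the pairs where the last name occurs in either the screen name or the real name.

```python
import pandas as pd

# Toy stand-ins for the two CSV files.
usernames = pd.DataFrame({
    'UserID': [1, 2],
    'Screen_Name': ['jdoe99', 'catlover'],
    'Real_Name': ['John Doe', 'Alice'],
})
individuals = pd.DataFrame({
    'First_Name': ['John', 'Bob'],
    'Last_Name': ['Doe', 'Smith'],
})

# Pair every account with every individual, then keep the pairs where the
# last name appears in either the screen name or the real name.
pairs = usernames.merge(individuals, how='cross')
mask = pairs.apply(
    lambda r: r['Last_Name'].lower() in r['Screen_Name'].lower()
    or r['Last_Name'].lower() in r['Real_Name'].lower(),
    axis=1,
)
match_list = pairs[mask].reset_index(drop=True)
print(match_list)
```

The cross join materializes len(usernames) * len(individuals) rows, so this trades memory for speed; for very large inputs the nested loop (or chunked processing) may still be necessary.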