Extract info from each row of a dataframe without a loop - pandas

I have a large dataframe (~500,000 rows). Processing each row gives me a Counter object (a dictionary with object counts). The output I want is a new dataframe whose column headers are the objects being counted (the keys in the dictionaries). I am looping over the rows, but it takes very long. I know that loops should be avoided in pandas; any suggestions?
out_df = pd.DataFrame()
for row in input_df['text']:
    tokens = nltk.word_tokenize(row)
    pos = nltk.pos_tag(tokens)
    count = Counter(elem[1] for elem in pos)
    out_df = out_df.append(count, ignore_index=True)
As an indication, Counter(elem[1] for elem in pos) looks like Counter({'NN': 8, 'VBZ': 2, 'DT': 3, 'IN': 4}).

Using append on a dataframe is quite inefficient, I believe (it has to reallocate memory for the entire dataframe on each call).
DataFrames were meant for analyzing data and easily adding columns—but not rows.
So I think a better approach would be to create a list first (lists are mutable) and convert it to a dataframe at the end.
I'm not familiar with nltk so I can't actually test this but something along the following lines should work:
out_data = []
for row in input_df['text']:
    tokens = nltk.word_tokenize(row)
    pos = nltk.pos_tag(tokens)
    count = Counter(elem[1] for elem in pos)
    out_data.append(count)
out_df = pd.DataFrame(out_data)
You might want to add the following to remove any NaNs and convert the final counts to integers:
out_df = out_df.fillna(0).astype(int)
And you can delete the list afterwards to free up memory:
del out_data

I think you should use a vectorized solution: "Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided (using) a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing." From https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac
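For reference, a sketch of how the same idea looks with Series.apply (untested against real nltk output; apply is still a row-wise loop under the hood, but the frame is built in a single call):
import nltk
import pandas as pd
from collections import Counter

# Count POS tags per row, collect the Counters, then build the frame once.
counts = input_df['text'].apply(
    lambda text: Counter(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))
)
out_df = pd.DataFrame(counts.tolist()).fillna(0).astype(int)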

Related

A for loop outputs one list after each iteration. How to append each of them in its own row in a 3-column dataframe?

This seemingly simple operation at point d) still eludes me after numerous attempts to do it by myself.
The for loop I use:
a) cycles through an unknown number of excel files,
b) selects 3 columns from each file,
c) performs some string manipulations on their headers using conditions, then
d) outputs the 1-row extraction of the headers achieved so far to an individual list.
After n (here 3) iterations of a), b), and c), the for loop outputs lists such as:
['Col1','Col1a','Col1b']
['Col2','Col2a','Col2b']
['Col3','Col3a','Col3b']
I am looking to append/concatenate/merge these lists, each in its own row, into one dataframe that I can further manipulate.
Expected final dataframe with index=True and header=None:
0, 'Col1','Col1a','Col1b'
1, 'Col2','Col2a','Col2b'
2, 'Col3','Col3a','Col3b'
I have tried many examples found in SO such as:
df = pd.DataFrame()
for lst in [list1, list2, list3]:
    df_temp = pd.DataFrame(lst)
    df = df.append(df_temp)
print(df)
Thanks for the time you take reviewing this request.
You can create the dataframe from the lists directly and write it out:
pd.DataFrame([list1, list2, list3]).to_csv('file.csv', header=None, index=True)
file.csv:
0,Col1,Col1a,Col1b
1,Col2,Col2a,Col2b
2,Col3,Col3a,Col3b
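If the number of files is not known in advance, the same idea still works: accumulate each per-file list in a plain Python list and build the dataframe once after the loop. A sketch (excel_files and extract_headers are placeholders for steps a)-d) above):
import pandas as pd

rows = []
for path in excel_files:               # a) unknown number of excel files (placeholder name)
    headers = extract_headers(path)    # b)-d): placeholder returning one 3-element list per file
    rows.append(headers)               # grow a plain list, not a DataFrame

df = pd.DataFrame(rows)                # one row per file, built in a single call
df.to_csv('file.csv', header=None, index=True)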

How to compare one row in Pandas Dataframe to all other rows in the same Dataframe

I have a csv file in which I want to compare each row with all other rows. I want to do a linear regression and get the r^2 value for the linear regression line and put it into a new matrix. I'm having trouble finding a way to iterate over all the other rows (it's fine to compare the primary row to itself).
I've tried using .iterrows but I can't think of a way to define the other rows once I have my primary row using this function.
UPDATE: Here is a solution I came up with. Please let me know if there is a more efficient way of doing this.
from itertools import combinations
from sklearn.metrics import r2_score

def bad_pairs(df, limit):
    list_fluor = list(combinations(df.index.values, 2))
    final = {}
    for fluor in list_fluor:
        final[fluor] = r2_score(df.xs(fluor[0]), df.xs(fluor[1]))
    bad_final = {}
    for i in final:
        if final[i] > limit:
            bad_final[i] = final[i]
    return bad_final
My data is a pandas DataFrame where the index is the name of the color and there is a number between 0 and 1 for each detector (220 columns).
I'm still working on a way to make a new pandas Dataframe from a dictionary with all the values (final in the code above), not just those over the limit.
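On that last point, one option (a sketch only; the column names here are made up) is to build a frame straight from the pair-keyed dictionary:
import pandas as pd

# final maps (row_a, row_b) index pairs to r^2 scores, as computed above
pairs_df = pd.DataFrame(
    [(a, b, r2) for (a, b), r2 in final.items()],
    columns=['fluor_1', 'fluor_2', 'r2'],   # hypothetical column names
)
# equivalent, keeping the pair as a MultiIndex:
pairs_series = pd.Series(final).rename_axis(['fluor_1', 'fluor_2'])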

Alternatives to iloc for searching dataframes

I have a simple piece of code that iterates through a list of id's. If an id is in a particular dataframe column (in this case, the column is called uniqueid), it uses iloc to get the value from another column on the matching row and then adds it as a value in a dictionary with the id as the key:
union_cols = ['uniqueid', 'FLD_ZONE', 'FLD_ZONE_1', 'ST_FIPS', 'CO_FIPS', 'CID']
union_df = gpd.GeoDataFrame.from_features(records(union_gdb, union_cols))
pop_df = pd.read_csv(pop_csv, low_memory=False)  # Example dataframe
uniqueid_inin = ['', 'FL1234', 'F54323', ....]  # Just an example
inin_dict = dict()
for id in uniqueid_inin:
    if (id != '') and (id in pop_df.uniqueid.values):
        v = pop_df.loc[pop_df['uniqueid'] == id, 'Pop_By_Area'].iloc[0]
        inin_dict.setdefault(id, v)
This works, but it is very slow. Is there a quicker way to do this?
To resolve this issue (and make the process more efficient) I had to think about the process in a different way that took advantage of Pandas and didn't rely on a generic Python solution. I first had to get a list of only the uniqueids from my union_df that were absolutely in pop_df. If they were not, applying the .isin() method would throw an indexing error.
#Get list of uniqueids in pop_df
pop_uniqueids = pop_df['uniqueid'].unique()
#Get only the union_df rows where the uniqueid matches pop_uniqueid
union_df = union_df.loc[(union_df['uniqueid'].isin(pop_uniqueids))]
union_df = union_df.reset_index()
union_df = union_df.drop(columns='index')
When the uniqueid_inin list is created from union_df (by just getting the uniqueid's from rows where my zone_status column is equal to 'in-in'), it will only contain a subset of items that are definitely in pop_df and empty values are no longer an issue. Then, I simply create a subset dataframe using the list and zip the desired column values together in a dictionary:
inin_subset = pop_df[pop_df['uniqueid'].isin(uniqueid_inin)]
inin_pop_dict = dict(zip(inin_subset.uniqueid, inin_subset.Pop_By_Area))
I hope this technique helps.
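For what it's worth, an equivalent way to build the same dictionary (a sketch using only the columns shown above) is to index by uniqueid and convert directly; note that, unlike .iloc[0], to_dict() keeps the last value if a uniqueid appears more than once:
inin_pop_dict = (
    pop_df[pop_df['uniqueid'].isin(uniqueid_inin)]
    .set_index('uniqueid')['Pop_By_Area']
    .to_dict()
)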

Repeat elements in pandas dataframe so equal number of each unique element

I have a pandas dataframe with multiple different feature columns. I have one particular column which can take on a variety of integer values. I want to manipulate the dataframe in such a way that there is an equal number of each of these integer values.
Before:
df['key'] = [1,1,1,3,4,5,5]
After:
df['key'] = [1,1,1,3,3,3,4,4,4,5,5,5]
I want this to be applied to every key in the dataframe.
So here's an ugly way that I've coded up a solution, but I feel like it goes against the entire reason to use pandas dataframes.
counts = data['key'].value_counts()
for key, count in counts.items():
    if count == counts.max():
        pass
    else:
        scaling = (counts.max() // count) - 1
        data2 = pd.concat([data[data['key'] == key]] * scaling, ignore_index=True)
        data = pd.concat([data, data2], ignore_index=True)
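For comparison, a sketch of a more pandas-native alternative: upsample every key group to the size of the largest group by sampling with replacement (this balances the counts, but does not reproduce the exact integer-multiple repetition of the loop above):
import pandas as pd

max_count = data['key'].value_counts().max()
balanced = (
    data.groupby('key', group_keys=False)
        .apply(lambda g: g.sample(max_count, replace=True, random_state=0))
        .reset_index(drop=True)
)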

Generate programmed data & create data frame out of it (generated data in to single column)

The initial context: I was using a for loop to generate some random data (using the logic shown below) and then writing that data to a key ('server_avg_response_time') in a dictionary ('data_dict'). Finally, that gives a list ('data_rows') of dictionaries, and the whole thing is written to CSV.
Code snippet for generating random data:
server_avg_response_time_alert = "low"
for i in range(0, no_of_rows):
    if random.randint(0, 10) != 7 and server_avg_response_time_alert != "high":
        data_dict['server_avg_response_time'] = random.randint(1, 800000)
    else:
        if server_avg_response_time_alert == "low":
            print("***ALERT***")
            server_avg_response_time_alert = "high"
        data_dict['server_avg_response_time'] = random.randint(600000, 800000)
        server_avg_response_time_period = random.randint(1, 1000)
        if server_avg_response_time_period > 980:
            print("***ALERT OFF***")
            server_avg_response_time_alert = "low"
    data_rows.insert(i, data_dict.copy())
This takes a lot of time (to generate some 300,000 rows of data), so I was asked to look at Pandas (to generate the data faster). Now I am trying to use the same logic with a pandas dataframe.
Question: If I put the above code in a function, can't I use that function to mint data into a column of a dataframe? What is the best way to write this data into a column of a dataframe? I believe I don't need a dictionary (key) either if I put the data directly into the dataframe after generating it randomly, but I don't know the syntax to do it.
Try wrapping your logic (everything inside the for loop) in a function, then pass that to an empty pandas df with one column called 'server_avg_response_time' (with 300,000 rows) using the apply method, like this:
def randomLogic(value):
    random_value = 0  # logic goes here
    return random_value

df = pd.DataFrame(np.zeros(300000), columns=['server_avg_response_time'])
df['server_avg_response_time'] = df.server_avg_response_time.apply(randomLogic)
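If the alert state machine is not strictly required, the column can also be generated fully vectorized with NumPy (a sketch; it reproduces only the plain uniform part of the logic above, not the high/low alert switching):
import numpy as np
import pandas as pd

# Draw all 300,000 values in one vectorized call instead of one per loop iteration
n = 300000
df = pd.DataFrame({'server_avg_response_time': np.random.randint(1, 800001, size=n)})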