How to preallocate memory for a large pandas dataframe? - pandas

I need to create a large dataframe to save my data. It has 30001 columns, 1000 rows. For the data type, 30000 columns are int64, and the last columns is a hash values.
So I first create an empty dataframe:
df = pd.DataFrame(columns=columnNames, data=np.empty(shape=(1000, 30001)))
And then I create a Series based on dataframe's columns:
record = pd.Series(index=df.columns)
Then in a loop I'll populate the record and assign them to dataframe:
loop:
record[0:30000] = values #fill record with values
record['hash']= hash_value
df.loc[index] = record <==== this is slow
index += 1
When I debug on my code, I found the above step which assign record to a row is horribly slow.
My guess is that if I could create a dataframe with exact the size preallocated, then assigning the record to each row will be much faster.
So can I create the dataframe with full size preallocated?
(note: my original dataframe does not have the 'hash' column, it runs without any performance issue. Recently I found I need this additional hash column, which is a string value. And this performance issue occurred right after this new column added)

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
for group in k1_grouped.itertuples():
if row[1:-1] == group[1:-1]:
size_by_row.append(group[-1])
break
group_size = pd.Series(size_by_row)

How to broadcast a list of data into dataframe (Or multiIndex )

I have a big dataframe its about 200k of rows and 3 columns (x, y, z). Some rows doesn't have y,z values and just have x value. I want to make a new column that first set of data with z value be 1,second one be 2,then 3, etc. Or make a multiIndex same format.
Following image shows what I mean
Like this image
I made a new column called "NO." and put zero as initial value. Then
I tried to record the index of where I want the new column get a new value. with following code
df = pd.read_fwf(path, header=None, names=['x','y','z'])
df['NO.']=0
index_NO_changed = df.index[df['z'].isnull()]
Then I loop through it and change the number:
for i in range(len(index_NO_changed)-1):
df['NO.'].iloc[index_NO_changed[i]:index_NO_changed[i+1]]=i+1
df['NO.'].iloc[index_NO_changed[-1]:]=len(index_NO_changed)
But the problem is I get a warning that "
A value is trying to be set on a copy of a slice from a DataFrame
I was wondering
Is there any better way? Is creating multiIndex instead of adding another column easier considering size of dataframe?

How to select different row in df as the column or delete the first few rows including the column?

I'm using read_csv to make a df, but the csv includes some garbage rows before the actual columns, the actual columns are located say in the 5th rows in the csv.
Here's the thing, I don't know how many garbage rows are there in advance and I can only read_csv once, so I can't use "head" or "skiprows" in read_csv.
So my question is how to select a different row as the columns in the df or just delete the first n rows including the columns? If I were to use "df.iloc[3:0]" the columns are still there.
Thanks for your help.
EDIT: Updated so that it also resets the index and does not include an index name:
df.columns = df.iloc[4].values
df = df.iloc[5:].reset_index(drop=True)
If you know your column names start in row 5 as in your example, you can do:
df.columns = df.iloc[4]
df = df.iloc[5:]
If the number of garbage rows is determined, then you can use 'iloc', example the number of garbage rows is 3 firs rows (index 0,1,2), then you can use the following code to get all remaining actual data rows:
df=df.iloc[3:]
If the number of garbage rows is not determined, then you must search the index of first actual data rows from the garbage rows. so you can find the first index of actual data rows and can be used to get all remaining data rows.
df=df.iloc[n:]
n=fisrt index of actual data

pandas : Indexing for thousands of rows in dataframe

I initially had 100k rows in my dataset. I read the csv using pandas into a dataframe called data. I tried to do a subset selection of 51 rows using .loc. My index labels are numeric values 0, 1, 2, 3 etc. I tried using this command -
data = data.loc['0':'50']
But the results were weird, it took all the rows from 0 to 49999, looks like it is taking rows till the index value starts with 50.
Similarly, I tried with this command - new_data = data.loc['0':'19']
and the result was all the rows, starting from 0 till 18999.
Could this be a bug in pandas?
You want to use .iloc in place of .loc, since you are selecting data from the dataframe via numeric indices.
For example:
data.iloc[:50,:]
Keep in mind that your indices are of numeric-type, not string-type, so querying with a string (as you have done in your OP) attempts to match string-wise comparisons.

Apply function for counting length of a DataFrame filter

What's the best way to create a new pandas column with the length of filtering of another df based on a value from the first df?
df_account has account numbers
df_retention has rows for each date an account numbers was active
I am trying to create a new column on df_account that has the total number of days the account was active. Using .apply seems extremely slow.
def retention_count(x):
return len(df_retention[df_retention['account'] == x])
df_account['retention_total'] = df_account['account'].apply(retention_count)
On a small number of rows, this works, but when my df_account has over 750k rows it is really slow. What can I do to make this faster? Thanks.
You could use groupby and count the rows in the df_retention dataframe. Assuming account is your index on df_account
df_account.set_index('account',inplace=True)
df_account['retention_total'] = df_retention.groupby('account').count()