pandas : Indexing for thousands of rows in dataframe - pandas

I initially had 100k rows in my dataset. I read the csv using pandas into a dataframe called data. I tried to do a subset selection of 51 rows using .loc. My index labels are numeric values 0, 1, 2, 3 etc. I tried using this command -
data = data.loc['0':'50']
But the results were weird, it took all the rows from 0 to 49999, looks like it is taking rows till the index value starts with 50.
Similarly, I tried with this command - new_data = data.loc['0':'19']
and the result was all the rows, starting from 0 till 18999.
Could this be a bug in pandas?

You want to use .iloc in place of .loc, since you are selecting data from the dataframe via numeric indices.
For example:
data.iloc[:50,:]
Keep in mind that your indices are of numeric-type, not string-type, so querying with a string (as you have done in your OP) attempts to match string-wise comparisons.

Related

How to compare one row in Pandas Dataframe to all other rows in the same Dataframe

I have a csv file with in which I want to compare each row with all other rows. I want to do a linear regression and get the r^2 value for the linear regression line and put it into a new matrix. I'm having trouble finding a way to iterate over all the other rows (it's fine to compare the primary row to itself).
I've tried using .iterrows but I can't think of a way to define the other rows once I have my primary row using this function.
UPDATE: Here is a solution I came up with. Please let me know if there is a more efficient way of doing this.
def bad_pairs(df, limit):
list_fluor = list(combinations(df.index.values, 2))
final = {}
for fluor in list_fluor:
final[fluor] = (r2_score(df.xs(fluor[0]),
df.xs(fluor[1])))
bad_final = {}
for i in final:
if final[i] > limit:
bad_final[i] = final[i]
return(bad_final)
My data is a pandas DataFrame where the index is the name of the color and there is a number between 0-1 for each detector (220 columns).
I'm still working on a way to make a new pandas Dataframe from a dictionary with all the values (final in the code above), not just those over the limit.

How to select different row in df as the column or delete the first few rows including the column?

I'm using read_csv to make a df, but the csv includes some garbage rows before the actual columns, the actual columns are located say in the 5th rows in the csv.
Here's the thing, I don't know how many garbage rows are there in advance and I can only read_csv once, so I can't use "head" or "skiprows" in read_csv.
So my question is how to select a different row as the columns in the df or just delete the first n rows including the columns? If I were to use "df.iloc[3:0]" the columns are still there.
Thanks for your help.
EDIT: Updated so that it also resets the index and does not include an index name:
df.columns = df.iloc[4].values
df = df.iloc[5:].reset_index(drop=True)
If you know your column names start in row 5 as in your example, you can do:
df.columns = df.iloc[4]
df = df.iloc[5:]
If the number of garbage rows is determined, then you can use 'iloc', example the number of garbage rows is 3 firs rows (index 0,1,2), then you can use the following code to get all remaining actual data rows:
df=df.iloc[3:]
If the number of garbage rows is not determined, then you must search the index of first actual data rows from the garbage rows. so you can find the first index of actual data rows and can be used to get all remaining data rows.
df=df.iloc[n:]
n=fisrt index of actual data

Write pandas data to a CSV file if column sums are greater than a specified value

I have a CSV file whose columns are frequency counts of words, and whose rows are time periods. I want to sum for each column the total frequencies. Then I want to write to a CSV file for sums greater than or equal to 30, the column and row values, thus dropping columns whose sums are less than 30.
Just learning python and pandas. I know it is a simple question, but my knowledge is at that level. Your help is most appreciated.
I can read in the CSV file and compute the column sums.
df = pd.read_csv('data.csv')
Except of data file containing 3,874 columns and 100 rows
df.sum(axis = 0, skipna = True)
Excerpt of sums for columns
I am stuck on how to create the output file so that it looks like the original file but no longer has columns whose sums were less than 30.
I am stuck on how to write to a CSV file each row for each column whose sums are greater than or equal to 30. The layout of the output file would be the same as for the input file. The sums would not be included in the output.
Thanks very much for your help.
So, here is a link showing an excerpt of a file containing 100 rows and 3,857 columns:
It's easiest to do this in two steps:
1. Filter the DataFrame to just the columns you want to save
df_to_save = df.loc[:, (df.sum(axis=0, skipna=True) >= 30)]
.loc is for picking rows/columns based either on labels or conditions; the syntax is .loc[rows, columns], so : means "take all the rows", and then the second part is the condition on our columns - I've taken the sum you'd given in your question and set it greater than or equal to 30.
2. Save the filtered DataFrame to CSV
df_to_save.to_csv('path/to/write_file.csv', header=True, index=False)
Just put your filepath in as the first argument. header=True means the header labels from the table will be written back out to the file, and index=False means the numbered row labels Pandas automatically created when you read in the CSV won't be included in the export.
See this answer here: How to delete a column in pandas dataframe based on a condition? . Note, the solution for your question doesn't need isnull() before the sum(), as that is specific to their question for counting NaN values.

How to preallocate memory for a large pandas dataframe?

I need to create a large dataframe to save my data. It has 30001 columns, 1000 rows. For the data type, 30000 columns are int64, and the last columns is a hash values.
So I first create an empty dataframe:
df = pd.DataFrame(columns=columnNames, data=np.empty(shape=(1000, 30001)))
And then I create a Series based on dataframe's columns:
record = pd.Series(index=df.columns)
Then in a loop I'll populate the record and assign them to dataframe:
loop:
record[0:30000] = values #fill record with values
record['hash']= hash_value
df.loc[index] = record <==== this is slow
index += 1
When I debug on my code, I found the above step which assign record to a row is horribly slow.
My guess is that if I could create a dataframe with exact the size preallocated, then assigning the record to each row will be much faster.
So can I create the dataframe with full size preallocated?
(note: my original dataframe does not have the 'hash' column, it runs without any performance issue. Recently I found I need this additional hash column, which is a string value. And this performance issue occurred right after this new column added)

How do I preset the dimensions of my dataframe in pandas?

I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift it down by one row (eg: I started with 23 rows and it remains at 23 rows despite the fact that I shifted down by one and should have 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
##set index to very high number to accommodate shifting row down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis =1)
Thanks!
There might be a more efficient way to achieve what you are looking to do, but to directly answer your question you could do something like this to init the DataFrame with certain dimensions
pd.DataFrame(columns=range(300),index=range(500))
You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex. It mimics np.arange and range in syntax. You can also pass a name parameter to name it.
pd.DataFrame
pd.Index
df = pd.DataFrame(
index=pd.RangeIndex(500),
columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)