Excel xlsx file modified into a dataframe is not recognized by an R package that uses dataframes

I imported an Excel xlsx file and then created a dataframe by converting the numeric variables into categories. When I run an R package that uses the dataframe, the output shows the following error:
> library(DiallelAnalysisR)
> Griffing(Yield, Rep, Cross1, Cross2, GriffingData41, 4, 1)
Error in `$<-.data.frame`(`*tmp*`, "Trt", value = character(0)) :
replacement has 0 rows, data has 20
When I call the str() function, it shows the numeric columns converted into categories, as below.
> str(GriffingData41)
'data.frame': 20 obs. of 4 variables:
$ CROSS1: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 3 3 4 ...
$ CROSS2: Factor w/ 4 levels "2","3","4","5": 1 2 3 4 2 3 4 3 4 4 ...
$ REP : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 2 ...
$ YIELD : num 11.9 14.5 9 13.5 20.5 9.8 16.5 22.1 18.4 19.4 ...
Is this a problem with how I created the dataframe?
I would appreciate any help with this error. By the way, I am running this in RStudio.
Thank you.
Note: This is not really a solution to my problem, but I managed to move forward by saving my Excel data in CSV format, changing the data type of the specific columns to character, and importing the file into RStudio. From there, creating the dataframe and running the R package went smoothly. Still, I am curious why it did not work with the xlsx file.

Related

Subtract a specific row from a CSV using Python

I have two CSV files: one containing data, the other containing a single row with the same columns as the first file. I am trying to subtract the single row in the second file from all the rows in the first file using pandas.
I have tried the following, but to no avail.
df = df.subtract(row, axis=1)
You're looking for the "drop" method. From pandas docs:
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
drop by index:
df.drop([0, 1])
A B C D
2 8 9 10 11
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
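For the element-wise subtraction described in the question, a minimal sketch may also help. It assumes the single-row file has already been read into a one-row DataFrame called row with the same columns as df; the values below are placeholders:
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 8], 'B': [1, 5, 9], 'C': [2, 6, 10], 'D': [3, 7, 11]})
row = pd.DataFrame({'A': [1], 'B': [1], 'C': [1], 'D': [1]})

# turn the one-row DataFrame into a Series so it aligns against df's columns,
# then subtract it from every row
result = df.subtract(row.iloc[0], axis=1)
print(result)
Passing the one-row DataFrame directly (as in the question) aligns on both index and columns, so every row except the first comes back as NaN, which is why the original attempt was "to no avail".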

Labeling rows in pandas using multiple boolean conditions without chaining

I'm trying to label data in the original dataframe based on multiple boolean conditions. This is easy enough when labeling based on one or two conditions, but as I require more conditions the code becomes difficult to manage. The obvious solution is to break the code down into copies, but that causes chained-assignment errors. Here is one example of the issue...
This is a simplified version of what my data looks like:
df=pd.DataFrame(np.array([['ABC',1,3,3,4], ['std',0,0,2,4],['std',2,1,2,4],['std',4,4,2,4],['std',2,6,2,6]]), columns=['Note', 'Na','Mg','Si','S'])
df
Note Na Mg Si S
0 ABC 1 3 3 4
1 std 0 0 2 4
2 std 2 1 2 4
3 std 4 4 2 4
4 std 2 6 2 6
Standards (std) are located throughout the dataframe. I would like to create a label for rows where the instrument fails. This occurs in the data when:
String condition met (Note = standard/std)
Na>0 & Mg>0
Values fall outside of a calculated range for more than 2 elements.
For requirement 3 - Here is an example of a range:
maxMin=pd.DataFrame(np.array([['Max',3,3,3,7], ['Min',1,1,2,2]]), columns=['Note', 'Na','Mg','Si','S'])
maxMin
Note Na Mg Si S
0 Max 3 3 3 7
1 Min 1 1 2 2
Calculating the out-of-bounds standards:
elements=['Na','Mg','Si','S']
std=df[(df['Note'].str.contains('std|standard'))&(df['Na']>0)&(df['Mg']>0)]
df.loc[(std[elements].lt(maxMin.loc[1, :])|std[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>2]
Note Na Mg Si S
3 std 4 4 2 4
Now, I would like to label this datapoint within the original dataframe. Desired result:
Note Na Mg Si S Error
0 ABC 1 3 3 4 False
1 std 0 0 2 4 False
2 std 2 1 2 4 False
3 std 4 4 2 4 True
4 std 2 6 2 6 False
I've tried things like:
df['Error'].loc[std.loc[(std[elements].lt(maxMin.loc[1, :])|std[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>5].index.values.copy()]=True
That unfortunately causes a chained assignment error.
How would you accomplish this without causing a chained assignment error? Most books/tutorials revolve around creating one long expression, but as I dive deeper, I feel there might be a simpler solution. Any input would be appreciated.
I figured out a solution that works for me.
The solution was to use .index.values to create an array of the indices that pass the boolean conditions. That array can then be used to edit the original dataframe.
##These two conditions can probably be combined
condition1=df[(df['Note'].str.contains('std|standard'))&(df['Na']>.01)&(df['Mg']>.01)]
##rows where condition1 is greater/less than the bounds of the known values
##provides an array of the indices where the condition is true
OutofBounds=condition1.loc[(condition1[elements].lt(maxMin.loc[1, :])|condition1[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>5].index.values
OutofBounds
out:array([ 3], dtype=int64)
Now I can pass the array into the original dataframe:
df.loc[OutofBounds, 'Error']=True
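For reference, here is a minimal self-contained sketch of the same idea: build the boolean conditions, collect the offending index, and write the label back with a single unchained .loc assignment. The numbers reproduce the question's sample data, and the threshold of two or more out-of-bounds elements is an assumption chosen so that the desired output (only row 3 flagged) comes out:
import pandas as pd

elements = ['Na', 'Mg', 'Si', 'S']

df = pd.DataFrame({'Note': ['ABC', 'std', 'std', 'std', 'std'],
                   'Na': [1, 0, 2, 4, 2],
                   'Mg': [3, 0, 1, 4, 6],
                   'Si': [3, 2, 2, 2, 2],
                   'S':  [4, 4, 4, 4, 6]})
bounds = pd.DataFrame({'Na': [3, 1], 'Mg': [3, 1], 'Si': [3, 2], 'S': [7, 2]},
                      index=['Max', 'Min'])

# standards with both Na and Mg above zero
std = df[df['Note'].str.contains('std|standard') & (df['Na'] > 0) & (df['Mg'] > 0)]

# per row, count how many elements fall outside the Min/Max range
outside = std[elements].lt(bounds.loc['Min']) | std[elements].gt(bounds.loc['Max'])
bad_index = std.index[outside.sum(axis=1) >= 2]   # assumed threshold, see note above

# single, unchained assignment back into the original dataframe
df['Error'] = False
df.loc[bad_index, 'Error'] = True
print(df)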

pandas read_csv is returning an extra unknown column

I am creating a CSV file from a pandas dataframe by combining two lists using:
df = pd.DataFrame(list(zip(patients_full, labels)),
                  columns=['id', 'cancer'])
df.to_csv("labels.csv")
but when I read the CSV back, an unknown "Unnamed" column shows up. How do I remove it?
Unnamed: 0 id cancer
0 0 HF0953.npy 1
1 1 HF1058.npy 3
2 2 HF1071.npy 3
3 3 HF1122.npy 3
4 4 HF1235.npy 1
5 5 HF1280.npy 2
6 6 HF1344.npy 1
7 7 HF1463.npy 1
8 8 HF1489.npy 1
9 9 HF1490.npy 2
10 10 HF1587.npy 2
11 11 HF1613.npy 2
This is happening because of the index column that is saved by default when you call to_csv("labels.csv"). Since the index of the dataframe you saved did not have a name, when you read the file back with read_csv("labels.csv") it is treated like any other column, but with a blank name that becomes Unnamed: 0. To avoid this you have two options:
Option 1 - read the first column back in as the index:
pd.read_csv("labels.csv", index_col=0)
Option 2 - not save the index in the first place:
df.to_csv("labels.csv", index=False)
That extra column in your output is the index of the dataframe. To not include it in the output, use df.to_csv('labels.csv', index=False). More information on that method is available in the pandas docs.
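A minimal round trip illustrating both options, using a couple of the id/cancer values from the question as placeholder data:
import pandas as pd

df = pd.DataFrame({'id': ['HF0953.npy', 'HF1058.npy'], 'cancer': [1, 3]})

# Option 2: don't write the index at all
df.to_csv('labels.csv', index=False)
print(pd.read_csv('labels.csv'))                      # only id and cancer come back

# Option 1: the index was written, so read it back in as the index
df.to_csv('labels_with_index.csv')
print(pd.read_csv('labels_with_index.csv', index_col=0))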

How to get the mode of a column in pandas when several values are tied for the mode

I have a dataframe and I'd like to get the mode of a specific column.
I'm using:
freq_mode = df.mode()['my_col'][0]
However I get the error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index my_col')
I'm guessing it's because a few values are tied for the mode.
Any one of the modes will do; it doesn't matter which. How can I use any() to get any one of the existing modes?
For me your code works fine with sample data.
If you need to select the first value of the Series returned by mode, use:
freq_mode = df['my_col'].mode().iat[0]
We can see this with one column:
df=pd.DataFrame({"A":[14,4,5,4,1,5],
"B":[5,2,54,3,2,7],
"C":[20,20,7,3,8,7],
"train_label":[7,7,6,6,6,7]})
X=df['train_label'].mode()
print(X)
DataFrame
A B C train_label
0 14 5 20 7
1 4 2 20 7
2 5 54 7 6
3 4 3 3 6
4 1 2 8 6
5 5 7 7 7
Output
0 6
1 7
dtype: int64
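Combining this with the first answer: on the sample above, where 6 and 7 are tied, picking a single value is then just:
freq_mode = df['train_label'].mode().iat[0]
print(freq_mode)   # 6 - the first of the two tied modes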

pandas: applying a function to dataframe columns is very slow

os hour day
0 13 14 0
1 19 14 0
2 13 14 0
3 13 14 0
4 13 14 0
Here is my dataframe, and I just want to get a new column that is str(os)+'_'+str(hour)+'_'+str(day). I use the apply function to process the dataframe, but it is very slow.
Is there a high-performance method to achieve this?
I also tried converting the df to an array and processing every row, but that seems slow too.
The dataframe has nearly two hundred million rows.
Not sure what code you are using, but you can try:
df.astype(str).apply('_'.join, axis = 1)
0 13_14_0
1 19_14_0
2 13_14_0
3 13_14_0
4 13_14_0
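If the row-wise apply is still too slow at two hundred million rows, a vectorized alternative is plain column concatenation, which avoids calling Python code per row. A minimal sketch (the new column name key is just an example):
import pandas as pd

df = pd.DataFrame({'os': [13, 19, 13, 13, 13],
                   'hour': [14, 14, 14, 14, 14],
                   'day': [0, 0, 0, 0, 0]})

# vectorized string concatenation of the three columns
df['key'] = df['os'].astype(str) + '_' + df['hour'].astype(str) + '_' + df['day'].astype(str)
print(df['key'])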