Is there a way to merge multiple pairs of columns in a dataframe in R? - data-manipulation

I am very new to R and the related forums (this includes asking questions, so please bear with me). I have a dataframe with a large number of variables. Most of the variables are observations of the abundance of a chemical element in a sample, with one column holding the "value" of the abundance and another the "error" in the measurement of that element's abundance.
There are 40+ elements in my dataframe, and each has a "value" and "error" column corresponding to it. I need to combine each element's value with its error into a single column using the "±" symbol (e.g., "1000 ± 150"). That is, I need to take the "value" and "error" columns for each element and merge them using: df_new <- paste(df$value, "±", df$error)
I have been able to merge a single element's "value" and "error" columns, but because my dataframe contains multiple types of data and not just these columns, I am confused about how to specifically select the "value" and "error" columns of each element and merge them accordingly.
### Example dataframe
SampleID <- c("0-3cm", "3-6cm", "6-9cm", "9-12cm", "12-15cm")
Mg_value <- c(100, 150, 175, 170, 125)
Mg_error <- c(20, 23, 21, 19, 12)
Fe_value <- c(300, 315, 290, 400, 450)
Fe_error <- c(12, 25, 20, 20, 15)
K_value <- c(120, 125, 130, 150, 190)
K_error <- c(15, 15, 20, 18, 12)
ProjectID <- c("example_core", "example_core", "example_core", "example_core", "example_core")
df <- data.frame(SampleID, Mg_value, Mg_error, Fe_value, Fe_error, K_value, K_error, ProjectID)
Can I write a function which scans for each individual element's "value" and "error" columns and merges them like I would for a single element (like below)? Or would I have to do this individually and build a new dataframe from the merged column pairs?
df_new <- paste(df$Mg_value, "±", df$Mg_error)
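Since every pair follows the "<element>_value" / "<element>_error" naming pattern shown above, one approach is to find the value columns by name and loop over the element prefixes. A minimal sketch (the function name and the output layout are my choices, not prescribed by the question):

merge_value_error <- function(df) {
  value_cols <- grep("_value$", names(df), value = TRUE)  # e.g. "Mg_value"
  elements <- sub("_value$", "", value_cols)              # e.g. "Mg"
  error_cols <- paste0(elements, "_error")
  # keep the non-measurement columns (SampleID, ProjectID, ...)
  out <- df[, !(names(df) %in% c(value_cols, error_cols)), drop = FALSE]
  for (el in elements) {
    out[[el]] <- paste(df[[paste0(el, "_value")]], "±", df[[paste0(el, "_error")]])
  }
  out
}

df_new <- merge_value_error(df)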

Related

comparing and removing rows in pandas

I am trying to create a new object by comparing two lists. If a row matches, it should be removed from the splitted row_list; otherwise it should be appended to a new list containing only the differences between both lists.
results = []
for row in splitted_row_list:
    print(row)
    for row1 in all_rows:
        if row1 == row:
            splitted_row_list.remove(row)
        else:
            results.append(row)
print(results)
However, this code just returns all the rows. Does anyone have a suggestion?
Sample data
all_rows[0]:'1390', '139080', '13980', '1380', '139080', '13080'
splitted_row_list[0]:'35335','53527','353529','242424','5222','444'
As I understand it, you want to compare two lists by index and keep the differences, and you want to do it with pandas (because of the tag). (As an aside, the loop above returns everything because it appends row once for every non-matching row1, and mutating splitted_row_list while iterating over it skips elements.)
So here are two lists for example:
ls1=[0,10,20,30,40,50,60,70,80,90]
ls2=[0,15,20,35,40,55,60,75,80,95]
I make a pandas dataframe from these lists and build a mask to filter out the matching values:
import pandas as pd

df = pd.DataFrame(data={'ls1': ls1, 'ls2': ls2})
mask = df['ls1'] != df['ls2']
I can then call the different values for each list using the mask:
# list 1
df[mask]['ls1'].values
out: array([10, 30, 50, 70, 90])
and
# list 2
df[mask]['ls2'].values
out: array([15, 35, 55, 75, 95])

AttributeError for df.apply() when trying to subtract the column mean and divide by the column standard deviation for each column in a dataframe

I have a data frame with roughly 26 columns. For 14 of these columns (all data are floats) I want to determine the mean and standard deviation for each column, then for each value in each column I want to subtract the column mean and divide by the column standard deviation (only for the column to which the value belongs).
I can do this separately for each column like so:
import numpy as np

chla_array = df['Chla'].to_numpy()
mean_chla = np.nanmean(chla_array)
std_chla = np.nanstd(chla_array)
df['Chla_standardized'] = (df['Chla'] - mean_chla) / std_chla
Because I have 14 columns to do this for, I am looking for a more concise way of coding this, rather than copying and pasting the above code thirteen more times and changing the column headers. I was thinking of using df.apply() but I can't get it to work. Here is what I have:
df = df.iloc[:, [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]]
df_standardized = df.apply((df - df.mean(skipna=True)) / df.std(skipna=True, ddof=0))
The error I encounter is this:
AttributeError: 'Canyon_dist' is not a valid function for 'Series' object
Where 'Canyon_dist' is the header for the first column the code encounters.
I'm not sure that df.apply is appropriate for what I am trying to achieve, so if there is a more appropriate way of doing this please let me know (perhaps using a for loop?).
I am open to all suggestions and thank you.
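For what it's worth, df.apply() expects a function; the call above passes it an already-computed DataFrame, so pandas tries to interpret its column names ('Canyon_dist' first) as function names, which produces that AttributeError. A minimal sketch of a vectorized alternative, assuming the same positional column selection as in the question (no apply needed):

import pandas as pd

# df is the full data frame from the question; pick the 14 float columns by position.
cols = df.iloc[:, 11:25]
# DataFrame arithmetic broadcasts column-wise; mean()/std() skip NaN by default,
# and ddof=0 matches np.nanstd used above.
df_standardized = (cols - cols.mean()) / cols.std(ddof=0)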

Pandas append function adds new columns

I want to append one row to my dataframe.
Here's the code
import pandas as pd
citiesDataFrame = pd.read_csv('cities.csv')
citiesDataFrame = citiesDataFrame.append({
    'LatD': 50,
    '"LatM"': 70,
    '"LatS"': 40,
    '"NS"': '"S"',
    '"LonD"': 200,
    '"LonM"': 15,
    '"LonS"': 40,
    '"EW"': "E",
    '"City"': '"Kentucky"',
    '"State"': "KY"}, ignore_index=True)
citiesDataFrame
But when I run it, append doesn't work properly. My dataframe has 10 columns and 128 rows; when I run the code, it appends 9 new columns and 1 row to the dataframe (here is the modified dataframe).
Notice it works for LatD. The reason is that your column names aren't identical to the existing ones; it looks like a quoting issue, and it's not clear why you have double quotes inside the single quotes. Make the column names match and the append will work.
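A sketch of the corrected call, assuming the CSV's columns are actually named LatD, LatM, LatS, NS, LonD, LonM, LonS, EW, City and State (no embedded quotes); note that DataFrame.append() was later deprecated in favor of pd.concat():

import pandas as pd

citiesDataFrame = pd.read_csv('cities.csv')
# Dict keys now match the existing column names exactly, so no new columns appear.
citiesDataFrame = citiesDataFrame.append({
    'LatD': 50,
    'LatM': 70,
    'LatS': 40,
    'NS': 'S',
    'LonD': 200,
    'LonM': 15,
    'LonS': 40,
    'EW': 'E',
    'City': 'Kentucky',
    'State': 'KY'}, ignore_index=True)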

reindex group to add missing rows

I am trying to reindex groups to extend dataframes with missing values. Similar to how resample works for time indexes, I am trying to achieve this for ordinary integer values.
So, for each group belonging to a certain group key (proID in my case), the maximum existing integer value shall be determined (it specifies the end point of the resampling process), and the group shall be extended (I was trying to achieve this with reindex) by the missing values of this integer range.
I have a dataframe with many rows per proID, an integer bin value which can range from 0 to 100, and some meaningless columns. Basically, the bin values shall be filled in where data are missing, similar to what resample would do for time indexes.
def rsmpint(df):
    mx = df.bin.max()  # identify maximal existing bin value in dataframe (group)
    no = (mx * 20 / 100).astype(np.int64) + 1  # calculate number of bin values
    idx = pd.Index(np.linspace(0, mx, no), name='bin')  # define full bin index for df (group)
    df.set_index('bin').reindex(idx).ffill().reset_index(drop=True, inplace=True)
    return df
DF.groupby('proID').apply(rsmpint)
Let's assume that for a specific proID there are currently 5 bin values [0, 15, 20, 40, 65] (i.e., 5 rows of the original proID group). The output shall be an extended proID group with bin values [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65], with the content of the "meaningless" columns filled using ffill().
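For what it's worth, the function above returns df unchanged: chaining .reset_index(drop=True, inplace=True) onto set_index('bin').reindex(idx).ffill() returns None and discards the reindexed frame (and drop=True would also throw the bin values away). A sketch that assigns the chained result instead, assuming bin values are multiples of 5 as in the example:

import numpy as np
import pandas as pd

def rsmpint(df):
    mx = int(df['bin'].max())  # end point of the resampling for this group
    idx = pd.Index(np.arange(0, mx + 5, 5), name='bin')  # full grid 0, 5, ..., mx
    # Keep 'bin' as a column by resetting the index without drop=True.
    return df.set_index('bin').reindex(idx).ffill().reset_index()

result = DF.groupby('proID').apply(rsmpint).reset_index(drop=True)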

Pandas: update information based on bins / cutting

I'm working on a dataset which has a large amount of missing information.
I understand I could use fillna, but I'd like to base my updates on the binned values of another column.
Selection of missing data:
missing = train[train['field'].isnull()]
Bin the data (this works correctly):
filter_values = [0, 42, 63, 96, 118, 160]
labels = [1,2,3,4,5]
out = pd.cut(missing['field2'], bins = filter_values, labels=labels)
counts = pd.value_counts(out)
print(counts)
Now, based on the bin assignments, I would like to set the corresponding bin label on train['field'] for all of the missing rows assigned to each bin.
IIUC:
You just need fillna. Because out was computed from the rows of missing, it shares their index, so it aligns with train['field'] and fills only those rows:
train['field'] = train['field'].fillna(out)
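A toy demonstration of the index alignment that makes this work (the data here is invented for illustration):

import numpy as np
import pandas as pd

train = pd.DataFrame({'field': [1.0, np.nan, 3.0, np.nan],
                      'field2': [10, 50, 70, 100]})
missing = train[train['field'].isnull()]
out = pd.cut(missing['field2'], bins=[0, 42, 63, 96, 118, 160],
             labels=[1, 2, 3, 4, 5])
train['field'] = train['field'].fillna(out)
# Rows 1 and 3 receive bin labels 2 and 4; rows 0 and 2 keep their values.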