How to locate columns in data frame with index? - pandas

        Asset     Liab       Net Worth
Date
1/1     8.99 K    -19.65 K   -10.66 K
1/2     8.99 K    -19.66 K   -10.67 K
The data looked something like that.
Below is what I want to achieve
> df['Asset'] = df['Asset'].str.rstrip('K')
> df['Liab'] = df['Liab'].str.rstrip('K')
> df['Net Worth'] = df['Net Worth'].str.rstrip('K')
I want to make a loop for it to process every column, but
> df.columns[] #only return the column's name not the whole list
> df.iloc[] #return the value based on vertical index
> df.loc[] #shows invalid syntax
I ended up doing this
> def removeSuffix(df, suffix):
>     df = df.T
>     i = 0
>     while i < len(df):
>         df.iloc[i] = df.iloc[i].str.rstrip(suffix)
>         i += 1
>     return df
One weird thing: this function works in VS Code's interactive window, but shows a syntax error in the terminal.
Sorry if this question is dumb, I'm new to this. I'm so clueless on how to get the entire column.

Use apply to apply the function to each column
df = df.apply(lambda col: col.str.rstrip('K'))
Note that the values after stripping are still strings. If you want them as floats you can do
df = df.apply(lambda col: col.str.rstrip('K').astype(float))
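For comparison, here is a minimal sketch of the explicit loop the question was aiming for, next to the apply version. The sample frame is rebuilt by hand from the values shown in the question. Stripping both the space and the K, and converting with pd.to_numeric, are choices made for this illustration rather than part of the original answer.

```python
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption about its exact shape)
df = pd.DataFrame(
    {
        "Asset": ["8.99 K", "8.99 K"],
        "Liab": ["-19.65 K", "-19.66 K"],
        "Net Worth": ["-10.66 K", "-10.67 K"],
    },
    index=pd.Index(["1/1", "1/2"], name="Date"),
)

# Explicit loop: df.columns yields each column name, df[name] is the whole column
looped = df.copy()
for name in looped.columns:
    # rstrip(" K") removes any trailing spaces and K characters
    looped[name] = pd.to_numeric(looped[name].str.rstrip(" K"))

# Same result with apply, which passes each column to the function in turn
applied = df.apply(lambda col: pd.to_numeric(col.str.rstrip(" K")))

print(applied)
print(looped.equals(applied))
```

Iterating over df.columns gives the names, and indexing the frame with a name returns the entire column, which is the piece the question was missing.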

Related

In Pandas how to replace dataframe column values with new values that are a function of old values

I need your guidance on an issue I am facing. I want to do something like this:
df[x], where x is a variable that holds one of the column names.
Now I want the entire column to go through this manipulation:
each value of column x should be processed with the equation below:
(1 / (x * 22)), where x is an individual value of the column and can be a huge number.
Since a huge number is being reciprocated (1 over x), the result may need exponential notation (e+/-01).
The resulting number should replace the original number in the dataframe.
If the number is 100, the new number to put into the dataframe is (1/((100)*22)) = 4.54e-4.
Please let me know how to do it rounded to 2 decimal points.
Thanks.
df[x] = df[x].apply(lambda x: (1/((x * 22))))
This is resulting in 0 in the dataframe
df[x] = df[x].apply(lambda x: (1/((x * 22))))
looks okay, but will probably round to whole numbers. This may depend on what datatype your column x has and/or what x is.
This may work:
df[x] = df[x].apply(lambda x: (1.0/((x * 22.0))))
If your lambda function gets more complicated, for example if you want to use if-else clauses, you should write a helper function and call that inside of your apply:
def helper(x):
    if x == 100:
        return (1.0/((100)*22.0))
    else:
        return (1.0/((x)*22.0))

df[x] = df[x].apply(lambda x: helper(x))
Use round() to round the result to two decimals:
df[x] = df[x].apply(lambda x: round((1.0/((x * 22.0))), 2))
Your formula is good, but the numbers are too small to show up in the output (rounding 0.000455 to 2 decimals gives 0.00).
xd = pd.DataFrame({'x':[100,101,102],'y':[1,2,3]})
xd['x'] = xd['x'].apply(lambda x: (1/(x * 22)))
>>> xd
          x  y
0  0.000455  1
1  0.000450  2
2  0.000446  3
Try this to format the numbers to exponential format with 2 decimals.
>>> pd.set_option('display.float_format', lambda x: '%.2E' % x)
>>> xd
         x  y
0 4.55E-04  1
1 4.50E-04  2
2 4.46E-04  3
Source for formatting: https://stackoverflow.com/a/6913576/14593183
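As a rough sketch of the same idea without apply or a global display option: the arithmetic can be written directly on the column, and the two-significant-decimal exponential form can be produced as text for display only. The variable name col and the choice to keep the formatted strings separate from the numeric column are assumptions made for illustration.

```python
import pandas as pd

xd = pd.DataFrame({"x": [100, 101, 102], "y": [1, 2, 3]})
col = "x"  # stands in for the variable holding the column name

# Vectorized arithmetic: the whole column is divided at once
xd[col] = 1 / (xd[col] * 22)

# Rounding to 2 decimal places collapses values this small to zero
rounded = xd[col].round(2)
print(rounded.tolist())  # [0.0, 0.0, 0.0]

# Formatting as text keeps two decimals of the mantissa instead
as_text = xd[col].map("{:.2e}".format)
print(as_text.tolist())  # ['4.55e-04', '4.50e-04', '4.46e-04']
```

Keeping the column numeric and formatting only when displaying avoids losing precision in later calculations.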

Getting same value from list in dataframe column using Python

I have a dataframe with 3 columns. I added one more column, which I want to fill with unique values generated by a random function.
I created a list and, in a for loop, appended random strings to it.
After that, I created another loop that takes each value from the list and assigns it to the new column.
But the same value ends up in every row.
df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
#Output.......
   A  B  C            randColumn
0  1  2  3  WHI11NJBNI8BOTMA9RKA
1  4  5  6  WHI11NJBNI8BOTMA9RKA
Could you please help me fix this and explain why each row gets the same value from the list?
Updated to work correctly with any type of column in df.
If I understood your question correctly, you can use the zip method of the RDD to achieve your goal.
import random
import string
from pyspark.sql import SparkSession, Row
import pyspark.sql.types as t
# Assumes sparkSession is an existing SparkSession and df is a Spark DataFrame with 2 rows
lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(20))
    # Add each random string to the list as a Row
    lst.append(Row(random=rand_column))
# Build an RDD from the list of random strings
random_rdd = sparkSession.sparkContext.parallelize(lst)
res = df.rdd.zip(random_rdd).map(lambda rows: Row(**(rows[0].asDict()), **(rows[1].asDict()))).toDF()
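For a pandas DataFrame, as in the question, each pass of the loop assigns a single string to the whole column, so every row is overwritten and only the last string in the list survives. A minimal sketch of one way around that, assuming one random string per row is wanted, is to build a list with exactly len(df) entries and assign it in one step. The small frame below stands in for test.csv.

```python
import random
import string

import pandas as pd

# Stand-in for the contents of test.csv (an assumption for illustration)
df = pd.DataFrame({"A": [1, 4], "B": [2, 5], "C": [3, 6]})

alphabet = string.ascii_uppercase + string.digits

# One 20-character random string per row; the list length must equal len(df)
df["randColumn"] = [
    "".join(random.choices(alphabet, k=20)) for _ in range(len(df))
]

print(df)
```

Assigning a list gives each row its own element, whereas assigning a scalar repeats that scalar down the column.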

Item Wrong Length 1 Instead of 50 Pandas

I'm dealing with a CSV file consisting of 2 columns and 51 rows in total.
data = pd.read_csv("data.csv", sep = ',')
data.columns=['x_column', 'y_column']
Then I perform linear regression
X = data.iloc[:, 0].values.reshape(-1, 1)
y = data.iloc[:, 1].values.reshape(-1, 1)
lr = LinearRegression()
The next thing I need to perform is Tukey's method.
X = data.iloc[[0], :].values
y = data.iloc[[1], :].values
Then I plotted the boxes and found out my range is between -40 to 10.
data.boxplot(return_type='dict')
plt.plot()
I need to assign my outliers to a value in order to remove them before training my dataset again. And this is where I have a problem.
y_column = X[:, 1]
data_outliers = (y_column > 0.0)
data[data_outliers]
When I run this last part I get the error "Item wrong length 1 instead of 50" and I don't know how to solve it. Any help is appreciated.
Try:
data_outliers = (y_column > 0.0).ravel()
The problem was that your data_outliers was a two-dimensional NumPy array (shape: (1, 50)), which cannot be used to mask the DataFrame. ravel simply flattens it to one dimension.
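With the code exactly as posted in the question, data.iloc[[0], :] selects the first row rather than a column, so a mask derived from it has a single element. The sketch below illustrates that and shows a mask built from the whole y column instead. The synthetic 50-row frame and the threshold of 0.0 are assumptions standing in for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data.csv: 50 rows, y values spread over -40 to 10
data = pd.DataFrame(
    {"x_column": np.arange(50.0), "y_column": np.linspace(-40.0, 10.0, 50)}
)

# iloc[[0], :] picks the first ROW, so the resulting mask has one element
row_mask = data.iloc[[0], :].values[:, 1] > 0.0
print(row_mask.shape)  # (1,)

# A mask built from the full column has one boolean per row
data_outliers = data["y_column"] > 0.0
outliers = data[data_outliers]
cleaned = data[~data_outliers]

print(len(data_outliers), len(outliers), len(cleaned))
```

A boolean mask can only filter the DataFrame when it has one entry per row, which is what the length in the error message refers to.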

how to find missing number between minimum and maximum

I want to make a NumPy array with the following properties:
Random values: 0 to 9 (0 <= value <= 9)
Random 1D size: 5 to 9 (5 <= size <= 9)
I also want to find the missing numbers between the minimum and maximum, so I wrote this code:
import numpy as np
min_val = 0
max_val = 10
min_val_len = 5
max_val_len = 10
arr1 = [4,3,2,7,8,2,3]
a = list(arr1)
print(a)
diff = np.setdiff1d(range(min_val, max_val), arr1)
arr = np.arange(min_val_len, max_val_len)
if diff in arr:
    print(diff)
else:
    print("no missing")
For this input, the intended output is [5, 6].
And if the input is [1, 2, 3, 4, 5], the result should be 'no missing'.
But the code does not work as I expect.
I think you expect 'in' to work in a way it does not. You want to check every single element, so try:
b = [d in arr for d in diff]
Now b contains a boolean value for each value d of diff. If you want to find the actual numbers that are missing, you can do it using a condition:
b = np.intersect1d(np.setdiff1d(range(min_val, max_val), arr1), arr)
Now b contains all numbers of diff that are in arr. Also note that Python has built-in set types, so you do not actually need to use NumPy.
You can do it in an even simpler way, as you're already using the notion of sets:
print(np.setdiff1d(rang
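The snippet above is cut off in the source and is left as posted. As a separate minimal sketch, assuming "missing" means the integers between the array's own minimum and maximum that do not appear in it, the helper name missing_between below is hypothetical and not from the original answer.

```python
import numpy as np


def missing_between(values):
    """Return the integers between min(values) and max(values) absent from values."""
    values = np.asarray(values)
    # Every integer from the smallest to the largest value, inclusive
    full_range = np.arange(values.min(), values.max() + 1)
    # setdiff1d returns the sorted elements of full_range not present in values
    return np.setdiff1d(full_range, values)


result = missing_between([4, 3, 2, 7, 8, 2, 3])
print(result)  # [5 6]

empty = missing_between([1, 2, 3, 4, 5])
print(empty if empty.size else "no missing")  # no missing
```

Checking the size of the result, rather than using 'in' on two arrays, gives the "no missing" branch the question was after.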

I have a dataframe and I want to find the standard deviation for some specific cells

I'm trying to use pandas to find the standard deviation of the entries in some specific cells.
I have tried using NumPy's std like so:
numpy.std(df[columnName][j:i])
I have also tried using this:
df.std(axis=0)[columnName][j:i]
This is just pseudocode because my actual code is more confusing than necessary for this question:
df = loadIris()
for feat in df.columns:
    i = 0
    j = 0
    flower = df['flower'][i]
    while i < df.index.max():
        if df['flower'][i] == flower:
            i+=1
        else:
            j = i
            stand = df.std(axis=0)[feat][j:i]
            flower = df['flower'][i]
I ended up just appending all of the values to a list and then calculating the standard deviation with statistics.stdev, which is available after importing the statistics module.
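For reference, a minimal sketch of two pandas ways to get the same numbers, assuming the goal is the standard deviation of one feature over a row range or per flower. The tiny frame and its column names are made up for illustration. Note that pandas std and statistics.stdev both compute the sample standard deviation (ddof=1), while numpy.std defaults to the population version (ddof=0).

```python
import statistics

import pandas as pd

# Made-up stand-in for the iris data
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9],
        "flower": ["setosa"] * 3 + ["versicolor"] * 3,
    }
)

# Slice the cells first, then take the standard deviation of just those cells
slice_std = df["sepal_length"].iloc[0:3].std()

# Or let groupby find the row ranges for each flower
per_flower = df.groupby("flower")["sepal_length"].std()

print(slice_std)
print(per_flower)
print(statistics.stdev([5.1, 4.9, 4.7]))
```

The order matters: df.std(axis=0) reduces each column to a single number, so slicing after it no longer refers to rows.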