I have two columns in a df. Each column has multiple comma-separated values in one row. I want to split each value into a new row in another table and generate a sequence number. The given data is:
x y
76.25, 345.65 87.12,96.45
78.12,35.1,98.27 85.23,65.2,56.63
new df should be like this
x 76.25
y 87.12
sequence number 1
x 345.65
y 96.45
sequence number 2
x 78.12
y 85.23
sequence number 1
x 35.1
y 65.2
sequence number 2
x 98.27
y 56.63
sequence number 3
All values are strings. I have no idea how I should do it. Should I write a function, or is there a command in DataFrame for this? Any help is appreciated.
You can do it using iterrows() + concat():
import pandas as pd

df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63')
})

def get_parts():
    for _, row in df.iterrows():
        x = row['x'].split(',')
        y = row['y'].split(',')
        # len(x) must equal len(y)...
        for i, _ in enumerate(x):
            yield 'x', x[i]
            yield 'y', y[i]
            # generate the sequence number after each split pair
            yield 'sequence number', i + 1

# generate a Series from each part and concatenate into the new dataframe
new_df = pd.concat([
    pd.Series([value], [label])
    for label, value in get_parts()
])
Hope this helps.
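If your pandas version is 1.3 or newer, a shorter sketch uses str.split plus a multi-column explode (same column names and sequence logic as above; this is an alternative, not the answer's original method):

```python
import pandas as pd

df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63')
})

# split each cell into a list, then explode both columns in parallel
out = df.apply(lambda col: col.str.split(',')).explode(['x', 'y'])

# the sequence number restarts at 1 within each original row
out['sequence number'] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)
print(out)
```

Unlike the generator approach, this keeps x, y, and the sequence number as three columns of one frame, which is usually easier to work with downstream.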
I have a dataframe with several numeric columns whose ranges go either from 1 to 5 or from 1 to 10.
I want to create two lists of these columns names this way:
names_1to5 = list of all columns in df with numbers ranging from 1 to 5
names_1to10 = list of all columns in df with numbers from 1 to 10
Example:
IP track batch size type
1 2 3 5 A
9 1 2 8 B
10 5 5 10 C
from the dataframe above:
names_1to5 = ['track', 'batch']
names_1to10 = ['IP', 'size']
I want to use a function that gets a dataframe and perform the above transformation only on columns with numbers within those ranges.
I know that if the column's max() is 5 then it's 1to5; the same idea applies when max() is 10.
What I already did:
def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df:
        if df[col].max() == 5:
            list_1to5.append(col)
        else:
            list_1to10.append(col)
    return list_1to5, list_1to10
I tried the above, but it returns the following error message:
'>=' not supported between instances of 'float' and 'str'
The type of the columns is 'object'; maybe this is the reason. If so, how can I fix the function without casting these columns to float? There are several, sometimes hundreds, of these columns, and if I run df['column'].max() I get 10 or 5.
What's the best way to create this function?
Use:
string = """alpha IP track batch size
A 1 2 3 5
B 9 1 2 8
C 10 5 5 10"""

temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]

def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df.columns:
        if df[col].dtype != 'O':  # skip non-numeric (object) columns
            if df[col].max() == 5:
                list_1to5.append(col)
            else:
                list_1to10.append(col)
    return list_1to5, list_1to10

df = pd.DataFrame(data, columns=cols, dtype=float)
Output:
(['track', 'batch'], ['IP', 'size'])
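Once the columns are numeric, the same split can be sketched without an explicit if/else, e.g. with list comprehensions over max() (hypothetical data matching the example above; select_dtypes safely skips any object columns):

```python
import pandas as pd

df = pd.DataFrame({'IP': [1, 9, 10], 'track': [2, 1, 5],
                   'batch': [3, 2, 5], 'size': [5, 8, 10]}, dtype=float)

numeric = df.select_dtypes('number')  # drop non-numeric columns up front
names_1to5 = [c for c in numeric if numeric[c].max() == 5]
names_1to10 = [c for c in numeric if numeric[c].max() == 10]
print(names_1to5, names_1to10)
```

Testing max() against 10 explicitly (instead of an else branch) also guards against columns with some other range slipping into the 1to10 list.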
I have a CSV sheet with a lot of columns (the column names are long questions). I am writing code to run value_counts() on each question (the number of "Yes" and "No" answers) in each column.
My code is:
import pandas as pd
df = pd.read_csv("fixed_site.csv", encoding= 'unicode_escape')
q1 = df['Any shortage of supplies? (Vaccines/Syringes)'].value_counts()
You can see the column name, i.e. "Any shortage of supplies? (Vaccines/Syringes)". There are many other questions even longer than this one.
I have two questions:
1. How to avoid writing the long questions manually?
2. After running value_counts(), I want to create a CSV in which the first column is "Question" and the next two columns hold the counts of "Yes" and "No", like below:
q1_name = "Any shortage of supplies? (Vaccines/Syringes)"
q1_analysis = [q1_name, q1["Yes"], q1["No"]]
fixedsite_analysis = pd.DataFrame(data=[q1_analysis], columns=["QUESTION", "YES", "NO"])
fixedsite_analysis.to_csv("fixedsite_analysis.csv", index = False)
How can I do this simply, with less code (without copying or writing out the name of every column)?
Thank you for your help
Suppose the following dataframe:
df = pd.DataFrame({'Q1': list('YYYN'), 'Q2': list('NNYY'), 'Q3': list('YNNN')})
print(df)
# Output
Q1 Q2 Q3
0 Y N Y
1 Y N N
2 Y Y N
3 N Y N
Use melt and pivot_table to reshape your dataframe:
out = (df.melt(var_name='question', value_name='answer').assign(dummy=1)
.pivot_table('dummy', 'question', 'answer', aggfunc='count')
.rename_axis(columns=None).reset_index())
print(out)
# Output
question N Y
0 Q1 1 3
1 Q2 2 2
2 Q3 3 1
The dummy column is there so that pivot_table has a value to count for each answer; aggfunc='count' then yields the number of 'Y'/'N' answers per question.
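If the dummy column feels indirect, an equivalent sketch applies value_counts to every column at once and transposes (assuming every column contains only the two answer values):

```python
import pandas as pd

df = pd.DataFrame({'Q1': list('YYYN'), 'Q2': list('NNYY'), 'Q3': list('YNNN')})

out = (df.apply(pd.Series.value_counts)  # one row of counts per answer value
         .T                              # questions back onto the rows
         .rename_axis('question')
         .reset_index())
print(out)
```

A final out.to_csv('fixedsite_analysis.csv', index=False) then produces the single summary file the question asks for.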
for column in df.columns:
    yes, no = df[column].value_counts()['YES'], df[column].value_counts()['NO']
    result = pd.DataFrame({'question': column,
                           'YES': yes,
                           'NO': no},
                          index=[0])
    result.to_csv(f'{column}.csv', index=False)
I have a pandas dataframe with thousands of columns and I would like to perform the following operations for each column of the dataframe:
check whether the i-th and (i-1)-th values are in range (between x and y);
if #1 is satisfied, compute log(value_i / value_(i-1)) ** 2 for the column;
if #1 is not satisfied, assume 0;
find the total of #2 for each column.
Here is a dataframe with a single column:
d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data = d)
df
x = 10 and y = 20
Here is what I can do for this single column:
df["IsIn"] = "NA"
for i in range(1, len(df.col1)):
    if (x < df.col1[i] < y) & (x < df.col1[i - 1] < y):
        df.IsIn[i] = 1
    else:
        df.IsIn[i] = 0

df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()
Total
Ideally, I would have a (1 by n-cols) dataframe of Totals for each column. How can I best achieve this? I would also appreciate if you can supplement your answer with detailed explanation.
Yes, this is an instance where apply works. You only need to wrap your logic in a function. Also, consider between and shift on the condition to eliminate the first loop:
import numpy as np

def func(s, x=10, y=20):
    '''Compute the total for a single column passed in as a Series.'''
    # mask where values are between x and y
    valid = s.between(x, y)
    # a row counts only if the previous row was also in range
    valid = valid & valid.shift(fill_value=False)
    # squared log return, masked with `valid`, then summed
    return (np.log(s / s.shift())**2 * valid).sum()

# apply `func` to each column
df.apply(func, x=10, y=20)
Output:
col1 0.222561
dtype: float64
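With more than one column nothing changes: apply calls func once per column and returns one total per column. A sketch with a second, entirely out-of-range column (hypothetical data added for illustration):

```python
import numpy as np
import pandas as pd

def func(s, x=10, y=20):
    # rows count only when this value and the previous one are both in [x, y]
    valid = s.between(x, y)
    valid = valid & valid.shift(fill_value=False)
    return (np.log(s / s.shift())**2 * valid).sum()

df = pd.DataFrame({'col1': [10, 15, 23, 16, 5, 14, 11, 4],
                   'col2': [50, 60, 70, 80, 90, 100, 110, 120]})

totals = df.apply(func)  # one Total per column
print(totals)            # col1 ≈ 0.222561; col2 is 0 (never in range)
```

The result is a Series indexed by column name, which is effectively the "1 by n-cols" table of Totals asked for.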
I would like to apply a function to every column of my grouped multiindex pandas dataframe.
If I had a function my_function() that returns a scalar, I would use
data_grouped = data.groupby(['type'])
data_transf = data_grouped.apply(lambda x: my_function(x))
However, consider another function, my_function_array(), that takes an array (all n rows within one group) as input and returns an n x 2 array as output.
How can I apply this to every column of my grouped dataframe data_grouped? That is, I want to take every column of my grouped data of m rows and replace it by the n x 2 output of my_function_array().
Here's some sample data. There are other groups (types) but I only show one
type frame x y
F1675 1 77.369027 108.013249
2 107.784096 22.177883
3 22.385162 65.024619
4 65.152003 77.74970
def my_function_array(data_vec, D=2, T=2):
    N = len(data_vec) - (D - 1) * T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D // 2, D // 2):
        embed_data[:, di] = data_vec[np.arange((D // 2 + di) * T, N + (D // 2 + di) * T)]
    return embed_data
Applying the function to the second column y:
my_function_array(np.array([108.013249, 22.177883, 65.024619, 77.74970]))
I get
array([[ 65.024619, 108.013249],
[ 77.7497 , 22.177883]])
So, the expected output is
type frame x_1 x_2 y_1 y_2
F1675 1 22.385162 77.369027 65.024619 108.013249
2 65.152003 107.784096 77.7497 22.177883
where x_1 and x_2 are the two columns resulting from x (the naming is not important, can be anything). Note that the groups have become shorter and wider.
I think you need to return a pd.DataFrame:
def my_function_array(data_vec, D=2, T=2):
    # print(data_vec.name)
    N = len(data_vec) - (D - 1) * T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D // 2, D // 2):
        embed_data[:, di] = data_vec[np.arange((D // 2 + di) * T, N + (D // 2 + di) * T)]
    return pd.DataFrame(embed_data).add_prefix(data_vec.name)
f = lambda x: pd.concat([my_function_array(x[y]) for y in x], axis=1)
data_transf = data.groupby(['type']).apply(f)
print (data_transf)
x0 x1 y0 y1
type
F1675 0 22.385162 77.369027 65.024619 108.013249
1 65.152003 107.784096 77.749700 22.177883
I am working with Python pandas, and I have the following table, which consists of many rows:
     Date X      X     Date Y     Y
0 2014-03-31  0.390 2014-04-24  1.80
1 2014-04-01  0.385 2014-04-25  1.75
What I want to do is, for every index (row), take each date/value pair from the columns and turn it into its own row, to get something like this:
0 2014-03-31 0.390
1 2014-04-24 1.80
The reason I am trying to do this is that I want to interpolate between those two dates.
I tried different ways of merging, re-merging, and playing with the dataframe, but it didn't really help.
You can try the following:
frames = []
for index, row in df.iterrows():
    first_date = row['Date X']
    second_date = row['Date Y']
    x = row['X']
    y = row['Y']
    frames.append(pd.DataFrame(data={'Dates': [first_date, second_date],
                                     'params': [x, y]}))

# DataFrame.append was removed in pandas 2.0; collect the pieces and concat once
new_df = pd.concat(frames, ignore_index=True)
print(new_df)
Here new_df is the new dataframe with all the data, and df is the original one.
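The row loop can also be avoided entirely. A sketch that stacks the two date/value pairs with pd.concat (assuming the four columns are named exactly 'Date X', 'X', 'Date Y', 'Y', as in the table above):

```python
import pandas as pd

df = pd.DataFrame({'Date X': ['2014-03-31', '2014-04-01'],
                   'X': [0.390, 0.385],
                   'Date Y': ['2014-04-24', '2014-04-25'],
                   'Y': [1.80, 1.75]})

# give both pairs the same column names so they stack cleanly
pairs = [df[['Date X', 'X']].set_axis(['Dates', 'params'], axis=1),
         df[['Date Y', 'Y']].set_axis(['Dates', 'params'], axis=1)]

# a stable sort on the original index interleaves each row's X pair before its Y pair
new_df = pd.concat(pairs).sort_index(kind='mergesort').reset_index(drop=True)
print(new_df)
```

For large frames this avoids the per-row DataFrame construction of the iterrows version, which grows quadratically expensive when built via repeated appends.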