I have two columns in a df. Each column has multiple comma-separated values in one row. I want to split each value into a new row in another table and generate a sequence number. The given data is:
x y
76.25, 345.65 87.12,96.45
78.12,35.1,98.27 85.23,65.2,56.63
new df should be like this
x 76.25
y 87.12
sequence number 1
x 345.65
y 96.45
sequence number 2
x 78.12
y 85.23
sequence number 1
x 35.1
y 65.2
sequence number 2
x 98.27
y 56.63
sequence number 3
All values are strings. I have no idea how I should do it. Should I write a function, or is there a command in DataFrame for this? Any help is appreciated.
You can do it using iterrows() + concat():
import pandas as pd

df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63')
})

def get_parts():
    for _, row in df.iterrows():
        x = row['x'].split(',')
        y = row['y'].split(',')
        # len(x) must equal len(y)...
        for i, _ in enumerate(x):
            yield 'x', x[i]
            yield 'y', y[i]
            # generate the sequence number after each split pair
            yield 'sequence number', i + 1

# generate a Series from each part and concatenate into the new dataframe
new_df = pd.concat([
    pd.Series([value], [label])
    for label, value in get_parts()
])
Hope this helps.
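If your pandas version is 1.3 or newer, a shorter sketch uses str.split plus a multi-column explode (same column names and sequence logic as above; this is an alternative, not the answer's original method):

```python
import pandas as pd

df = pd.DataFrame({
    'x': ('76.25,345.65', '78.12,35.1,98.27'),
    'y': ('87.12,96.45', '85.23,65.2,56.63')
})

# split each cell into a list, then explode both columns in parallel
out = df.apply(lambda col: col.str.split(',')).explode(['x', 'y'])

# the sequence number restarts at 1 within each original row
out['sequence number'] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)
print(out)
```

Unlike the generator approach, this keeps x, y, and the sequence number as three columns of one frame, which is usually easier to work with downstream.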
I have a dataframe with several numeric columns whose ranges go either from 1 to 5 or from 1 to 10.
I want to create two lists of these columns names this way:
names_1to5 = list of all columns in df with numbers ranging from 1 to 5
names_1to10 = list of all columns in df with numbers from 1 to 10
Example:
IP track batch size type
1 2 3 5 A
9 1 2 8 B
10 5 5 10 C
from the dataframe above:
names_1to5 = ['track', 'batch']
names_1to10 = ['IP', 'size']
I want to use a function that gets a dataframe and perform the above transformation only on columns with numbers within those ranges.
I know that if the column's max() is 5 then it's 1to5; the same idea applies when max() is 10.
What I already did:
def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df:
        if df[col].max() == 5:
            list_1to5.append(col)
        else:
            list_1to10.append(col)
    return list_1to5, list_1to10
I tried the above, but it returns the following error message:
'>=' not supported between instances of 'float' and 'str'
The type of the columns is 'object'; maybe this is the reason. If so, how can I fix the function without casting these columns to float? There are several, sometimes hundreds, of these columns, and if I run df['column'].max() I get 10 or 5.
What's the best way to create this function?
Use:
string = """alpha IP track batch size
A 1 2 3 5
B 9 1 2 8
C 10 5 5 10"""

temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]

def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df.columns:
        if df[col].dtype != 'O':  # skip non-numeric (object) columns
            if df[col].max() == 5:
                list_1to5.append(col)
            else:
                list_1to10.append(col)
    return list_1to5, list_1to10

df = pd.DataFrame(data, columns=cols, dtype=float)
Output:
(['track', 'batch'], ['IP', 'size'])
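Once the columns are numeric, the same split can be sketched without an explicit if/else, e.g. with list comprehensions over max() (hypothetical data matching the example above; select_dtypes safely skips any object columns):

```python
import pandas as pd

df = pd.DataFrame({'IP': [1, 9, 10], 'track': [2, 1, 5],
                   'batch': [3, 2, 5], 'size': [5, 8, 10]}, dtype=float)

numeric = df.select_dtypes('number')  # drop non-numeric columns up front
names_1to5 = [c for c in numeric if numeric[c].max() == 5]
names_1to10 = [c for c in numeric if numeric[c].max() == 10]
print(names_1to5, names_1to10)
```

Testing max() against 10 explicitly (instead of an else branch) also guards against columns with some other range slipping into the 1to10 list.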
I have a CSV sheet with a lot of columns (the column names are long questions). I am writing code to run value_counts() on each question (the number of "Yes" and "No" answers) in each column.
My code is:
import pandas as pd
df = pd.read_csv("fixed_site.csv", encoding= 'unicode_escape')
q1 = df['Any shortage of supplies? (Vaccines/Syringes)'].value_counts()
You can see the column name, i.e. "Any shortage of supplies? (Vaccines/Syringes)". There are many other questions even longer than this one.
I have two questions:
1. How to avoid writing the long questions manually?
2. After running value_counts(), I want to create a CSV in which the first column is "Question" and the next two columns hold the counts of "Yes" and "No", like below:
q1_name = "Any shortage of supplies? (Vaccines/Syringes)"
q1_analysis = [q1_name, q1["Yes"], q1["No"]]
fixedsite_analysis = pd.DataFrame(data=[q1_analysis], columns=["QUESTION", "YES", "NO"])
fixedsite_analysis.to_csv("fixedsite_analysis.csv", index = False)
How can I do this simply, with less code (without copying or writing out the name of every column)?
Thank you for your help
Suppose the following dataframe:
df = pd.DataFrame({'Q1': list('YYYN'), 'Q2': list('NNYY'), 'Q3': list('YNNN')})
print(df)
# Output
Q1 Q2 Q3
0 Y N Y
1 Y N N
2 Y Y N
3 N Y N
Use melt and pivot_table to reshape your dataframe:
out = (df.melt(var_name='question', value_name='answer').assign(dummy=1)
.pivot_table('dummy', 'question', 'answer', aggfunc='count')
.rename_axis(columns=None).reset_index())
print(out)
# Output
question N Y
0 Q1 1 3
1 Q2 2 2
2 Q3 3 1
The dummy column is there so that pivot_table has a value to count for each answer; aggfunc='count' then yields the number of 'Y'/'N' answers per question.
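If the dummy column feels indirect, an equivalent sketch applies value_counts to every column at once and transposes (assuming every column contains only the two answer values):

```python
import pandas as pd

df = pd.DataFrame({'Q1': list('YYYN'), 'Q2': list('NNYY'), 'Q3': list('YNNN')})

out = (df.apply(pd.Series.value_counts)  # one row of counts per answer value
         .T                              # questions back onto the rows
         .rename_axis('question')
         .reset_index())
print(out)
```

A final out.to_csv('fixedsite_analysis.csv', index=False) then produces the single summary file the question asks for.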
for column in df.columns:
    yes, no = df[column].value_counts()['YES'], df[column].value_counts()['NO']
    result = pd.DataFrame({'question': column,
                           'YES': yes,
                           'NO': no},
                          index=[0])
    result.to_csv(f'{column}.csv', index=False)
I have a pandas dataframe with thousands of columns and I would like to perform the following operations for each column of the dataframe:
check whether the i-th and (i-1)-th values are in range (between x and y);
if #1 is satisfied, compute log(value_i / value_(i-1)) ** 2 for the column;
if #1 is not satisfied, assume 0;
find the total of #2 for each column.
Here is a dataframe with a single column:
d = {'col1': [10, 15, 23, 16, 5, 14, 11, 4]}
df = pd.DataFrame(data = d)
df
x = 10 and y = 20
Here is what I can do for this single column:
df["IsIn"] = "NA"
for i in range(1, len(df.col1)):
    if (x < df.col1[i] < y) & (x < df.col1[i - 1] < y):
        df.IsIn[i] = 1
    else:
        df.IsIn[i] = 0

df["rets"] = np.log(df["col1"] / df["col1"].shift(1))
df["var"] = df["IsIn"] * df["rets"]**2
Total = df["var"].sum()
Total
Ideally, I would have a (1 by n-cols) dataframe of Totals for each column. How can I best achieve this? I would also appreciate if you can supplement your answer with detailed explanation.
Yes, this is an instance where apply works. You only need to wrap your logic in a function. Also, consider between and shift on the condition to eliminate the first loop:
import numpy as np

def func(s, x=10, y=20):
    '''Compute the total for a single column passed in as a Series.'''
    # mask where values are between x and y
    valid = s.between(x, y)
    # a row counts only if the previous row was also in range
    valid = valid & valid.shift(fill_value=False)
    # squared log return, masked with `valid`, then summed
    return (np.log(s / s.shift())**2 * valid).sum()

# apply `func` to each column
df.apply(func, x=10, y=20)
Output:
col1 0.222561
dtype: float64
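With more than one column nothing changes: apply calls func once per column and returns one total per column. A sketch with a second, entirely out-of-range column (hypothetical data added for illustration):

```python
import numpy as np
import pandas as pd

def func(s, x=10, y=20):
    # rows count only when this value and the previous one are both in [x, y]
    valid = s.between(x, y)
    valid = valid & valid.shift(fill_value=False)
    return (np.log(s / s.shift())**2 * valid).sum()

df = pd.DataFrame({'col1': [10, 15, 23, 16, 5, 14, 11, 4],
                   'col2': [50, 60, 70, 80, 90, 100, 110, 120]})

totals = df.apply(func)  # one Total per column
print(totals)            # col1 ≈ 0.222561; col2 is 0 (never in range)
```

The result is a Series indexed by column name, which is effectively the "1 by n-cols" table of Totals asked for.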
I would like to apply a function to every column of my grouped multiindex pandas dataframe.
If I had a function my_function() that returns a scalar, I would use
data_grouped = data.groupby(['type'])
data_transf = data_grouped.apply(lambda x: my_function(x))
However, consider another function, my_function_array(), that takes an array (all n rows within one group) as input and returns an n x 2 array as output.
How can I apply this to every column of my grouped dataframe data_grouped? That is, I want to take every column of my grouped data of m rows and replace it by the n x 2 output of my_function_array().
Here's some sample data. There are other groups (types) but I only show one
type frame x y
F1675 1 77.369027 108.013249
2 107.784096 22.177883
3 22.385162 65.024619
4 65.152003 77.74970
def my_function_array(data_vec, D=2, T=2):
    N = len(data_vec) - (D - 1) * T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D // 2, D // 2):
        embed_data[:, di] = data_vec[np.arange((D // 2 + di) * T, N + (D // 2 + di) * T)]
    return embed_data
Applying the function to the second column y:
my_function_array(np.array([108.013249, 22.177883, 65.024619, 77.74970]))
I get
array([[ 65.024619, 108.013249],
[ 77.7497 , 22.177883]])
So, the expected output is
type frame x_1 x_2 y_1 y_2
F1675 1 22.385162 77.369027 65.024619 108.013249
2 65.152003 107.784096 77.7497 22.177883
where x_1 and x_2 are the two columns resulting from x (the naming is not important, can be anything). Note that the groups have become shorter and wider.
I think you need to return a pd.DataFrame:
def my_function_array(data_vec, D=2, T=2):
    # print(data_vec.name)
    N = len(data_vec) - (D - 1) * T  # length of embedded signal
    embed_data = np.zeros([N, D])
    for di in range(-D // 2, D // 2):
        embed_data[:, di] = data_vec[np.arange((D // 2 + di) * T, N + (D // 2 + di) * T)]
    return pd.DataFrame(embed_data).add_prefix(data_vec.name)
f = lambda x: pd.concat([my_function_array(x[y]) for y in x], axis=1)
data_transf = data.groupby(['type']).apply(f)
print (data_transf)
x0 x1 y0 y1
type
F1675 0 22.385162 77.369027 65.024619 108.013249
1 65.152003 107.784096 77.749700 22.177883
I am working with Python pandas, and I have the following table, which consists of many rows:
     Date X      X     Date Y     Y
0 2014-03-31  0.390 2014-04-24  1.80
1 2014-04-01  0.385 2014-04-25  1.75
What I want to do is, for every index (row), take each date/value pair from the columns and turn it into its own row, to get something like this:
0 2014-03-31 0.390
1 2014-04-24 1.80
The reason I am trying to do this is that I want to interpolate between those two dates.
I tried different ways of merging, re-merging, and playing with the dataframe, but it didn't really help.
You can try the following:
frames = []
for index, row in df.iterrows():
    first_date = row['Date X']
    second_date = row['Date Y']
    x = row['X']
    y = row['Y']
    frames.append(pd.DataFrame(data={'Dates': [first_date, second_date],
                                     'params': [x, y]}))

# DataFrame.append was removed in pandas 2.0; collect the pieces and concat once
new_df = pd.concat(frames, ignore_index=True)
print(new_df)
Here new_df is the new dataframe with all the data, and df is the original one.
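The row loop can also be avoided entirely. A sketch that stacks the two date/value pairs with pd.concat (assuming the four columns are named exactly 'Date X', 'X', 'Date Y', 'Y', as in the table above):

```python
import pandas as pd

df = pd.DataFrame({'Date X': ['2014-03-31', '2014-04-01'],
                   'X': [0.390, 0.385],
                   'Date Y': ['2014-04-24', '2014-04-25'],
                   'Y': [1.80, 1.75]})

# give both pairs the same column names so they stack cleanly
pairs = [df[['Date X', 'X']].set_axis(['Dates', 'params'], axis=1),
         df[['Date Y', 'Y']].set_axis(['Dates', 'params'], axis=1)]

# a stable sort on the original index interleaves each row's X pair before its Y pair
new_df = pd.concat(pairs).sort_index(kind='mergesort').reset_index(drop=True)
print(new_df)
```

For large frames this avoids the per-row DataFrame construction of the iterrows version, which grows quadratically expensive when built via repeated appends.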