Create a Combined CSV File - pandas

I have two CSV files, reviews_positive.csv and reviews_negative.csv. How can I combine them into one CSV file, but with the following condition:
Have odd rows filled with reviews from reviews_positive.csv and even rows filled with reviews from reviews_negative.csv.
I am using Pandas.
I need this specific order because I want to build a balanced dataset for training neural networks.

Here is a working example:
from io import StringIO
import pandas as pd
pos = """rev
a
b
c"""
neg = """rev
e
f
g
h
i"""
pos_df = pd.read_csv(StringIO(pos))
neg_df = pd.read_csv(StringIO(neg))
Solution
Use pd.concat with the keys parameter to label the source dataframes and to preserve the desired order of positive first. Then sort_index with sort_remaining=False:
pd.concat(
[pos_df, neg_df],
keys=['pos', 'neg']
).sort_index(level=1, sort_remaining=False)
       rev
pos 0    a
neg 0    e
pos 1    b
neg 1    f
pos 2    c
neg 2    g
    3    h
    4    i
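Applied to the actual files, a minimal sketch might look like the following (assuming both CSVs have a single review column, as in the toy data above; adjust the paths and column names to match your files):
import pandas as pd

pos_df = pd.read_csv('reviews_positive.csv')
neg_df = pd.read_csv('reviews_negative.csv')

# interleave: positive reviews land on odd rows, negative reviews on even rows
combined = (
    pd.concat([pos_df, neg_df], keys=['pos', 'neg'])
      .sort_index(level=1, sort_remaining=False)
      .reset_index(drop=True)
)
combined.to_csv('reviews_combined.csv', index=False)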
That said, you don't have to interweave them to take balanced samples. You can use groupby with sample
pd.concat(
[pos_df, neg_df],
keys=['pos', 'neg']
).groupby(level=0).apply(pd.DataFrame.sample, n=3)
          rev
pos pos 1   b
        2   c
        0   a
neg neg 1   f
        4   i
        3   h

Related

saving dataframe groupby rows to exactly two lines

I have a dataframe and I want to group the rows based on a specific column. The number of rows in each group will be at least 4 and at most 50. I want to save one column from each group into exactly two lines. If the group size is even, say 2n, then n rows go in one line and the remaining n in the second line. If it is odd, n+1 and n (or n and n+1) will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to groupby id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
    columns='separate_column')
This is a bit of a convoluted approach, but it does the job:
def func(s: pd.Series):
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"] = df_new["name"].apply(lambda x: x[0])
df_new["name2"] = df_new["name"].apply(lambda x: x[1])
df = (df_new.drop(labels="name", axis=1)
            .stack()
            .reset_index()
            .drop(labels=["level_1"], axis=1)
            .rename(columns={0: "name"})
            .set_index("id"))
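A more compact variant (my own sketch, not taken from the answers above; it starts again from the original df read from the CSV) lets np.array_split do the halving inside a groupby-apply. np.array_split splits an odd-sized group into n+1 and n elements, which matches the requirement:
import numpy as np
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data))  # start from the original frame again

def two_lines(s: pd.Series) -> pd.Series:
    # split the group's names into two (roughly) equal halves and join each half
    halves = np.array_split(s.to_numpy(), 2)
    return pd.Series([' '.join(str(x) for x in h) for h in halves])

out = (df.groupby('id', sort=False)['name']
         .apply(two_lines)
         .reset_index(level=1, drop=True)
         .rename('name')
         .reset_index())
print(out)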

How to return multiple columns using apply in Pandas dataframe

I am trying to apply a function to a column of a Pandas dataframe; the function returns a list of tuples. This is my function:
def myfunc(text):
    values = []
    sections = api_call(text)
    for (part1, part2, part3) in sections:
        value = (part1, part2, part3)
        values.append(value)
    return values
For example,
sections=myfunc("History: Had a fever\n Allergies: No")
print(sections)
output:
[('past_medical_history', 'History:', 'History: Had a fever\n '), ('allergies', 'Allergies:', 'Allergies: No')]
For each tuple, I would like to create a new column. For example:
the original dataframe looks like this:
id text
0 History: Had a fever\n Allergies: No
1 text2
and after applying the function, I want the dataframe to look like this (where xxx is various text content):
id text part1 part2 part3
0 History: Had... past_... History: History: ...
0 Allergies: No allergies Allergies: Allergies: No
1 text2 xxx xxx xxx
1 text2 xxx xxx xxx
1 text2 xxx xxx xxx
...
I could loop through the dataframe and generate a new dataframe, but it would be really slow. I tried the following code but received a ValueError. Any suggestions?
df.apply(lambda x: pd.Series(myfunc(x['col']), index=['part1', 'part2', 'part3']), axis=1)
I did a little bit more research, and my question actually boils down to how to unnest a column containing lists of tuples. I found that the answer from this link, Split a list of tuples in a column of dataframe to columns of a dataframe, helps. Here is what I did:
# step 1: sectionizing
df["sections"] = df["text"].apply(myfunc)
# step 2: unnest the sections
part1s = []
part2s = []
part3s = []
ids = []
def create_lists(row):
    tuples = row['sections']
    id = row['id']
    for t in tuples:
        part1s.append(t[0])
        part2s.append(t[1])
        part3s.append(t[2])
        ids.append(id)
df.apply(create_lists, axis=1)
new_df = pd.DataFrame({"part1": part1s, "part2": part2s, "part3": part3s,
                       "id": ids})[["part1", "part2", "part3", "id"]]
But the performance is not so good. I wonder if there is a better way.
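One option that is usually faster than growing Python lists row by row is to explode the tuples and build the part columns in one pass. This is a sketch of my own (not from the linked answer); it assumes myfunc always returns at least one 3-tuple per row and that you are on pandas >= 0.25, where explode is available:
exploded = (df.assign(sections=df["text"].apply(myfunc))
              .explode("sections")
              .reset_index(drop=True))
# each value in 'sections' is now a single 3-tuple; turn them into columns
parts = pd.DataFrame(exploded["sections"].tolist(),
                     columns=["part1", "part2", "part3"])
new_df = pd.concat([exploded[["id", "text"]], parts], axis=1)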
The idea here is to set up some data and a function that can operate on that data to generate three items that we can return. Splitting comma-separated values is quick and mirrors the function you are after.
import pandas as pd
data = { 'names' : ['x,a,c','y,er,rt','z,1,ere']}
df = pd.DataFrame(data)
gives
names
0 x,a,c
1 y,er,rt
2 z,1,ere
now
def myfunc(text):
    sections = text.split(',')
    return sections
df[['part1', 'part2', 'part3']] = df['names'].apply(myfunc)
will give
names part1 part2 part3
0 x,a,c x y z
1 y,er,rt a er 1
2 z,1,ere c rt ere
which is probably not what you want. However,
df['part1'], df['part2'], df['part3'] = zip(*df['names'].apply(myfunc))
gives
names part1 part2 part3
0 x,a,c x a c
1 y,er,rt y er rt
2 z,1,ere z 1 ere
which is probably what you want.
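Another common variant (my own addition, not part of the answer above) is to let apply expand the returned list straight into columns via result_type='expand', available since pandas 0.23:
df[['part1', 'part2', 'part3']] = df.apply(
    lambda row: myfunc(row['names']), axis=1, result_type='expand')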
Converting the tuple to new columns:
To convert the tuple column value to new columns, you can do the following:
df[['part1', 'part2', 'part3']] = pd.DataFrame(df['text'].tolist())
print (df)
The output of this will be:
text part1 \
0 (past_medical_history, History:, History: Had ... past_medical_history
1 (allergies, Allergies:, Allergies: No) allergies
part2 part3
0 History: History: Had a fever\n
1 Allergies: Allergies: No
If the tuples in df['text'] vary in length (i.e. not always 3 items), then you can concat as follows:
df = pd.concat([df[['text']],pd.DataFrame(df['text'].tolist()).add_prefix('part')],axis=1)
This will give you the same result as earlier. Column names will differ slightly.
Converting comma separated values in a column to separate columns
You don't need to have a function to do this. You already have a pd.Series. All you have to do is split and expand.
df[['part1', 'part2', 'part3']] = df['names'].str.split(',',expand=True)
Output of this will be:
names part1 part2 part3
0 a,b,c a b c
1 e,f,g e f g
2 x,y,z x y z
In case you have an odd number of values in the names column and you want to split them into 3 parts, you can do it as follows:
Within split you can limit how many splits are made: n sets the maximum number of splits, so the result has at most n + 1 columns (if you need 3 columns, use n=2).
import pandas as pd
data = { 'names' : ['a,b,c','d,e,f','p,q,r,s','x,y,z']}
df = pd.DataFrame(data)
df = pd.concat([df[['names']],df['names'].str.split(',',n=2,expand=True).add_prefix('part')],axis=1)
print (df)
The output will be:
names part0 part1 part2
0 a,b,c a b c
1 d,e,f d e f
2 p,q,r,s p q r,s
3 x,y,z x y z
Or you can also do it as follows:
df[['part1', 'part2', 'part3']] = df['names'].str.split(',',n=2,expand=True)
This gives the same result, just with different column names:
names part1 part2 part3
0 a,b,c a b c
1 d,e,f d e f
2 p,q,r,s p q r,s
3 x,y,z x y z
And in case you want to get all the values split into each column, then you can do this:
df = pd.concat([df[['names']],df['names'].str.split(',',expand=True).add_prefix('part').fillna('')],axis=1)
The output of this will be:
names part0 part1 part2 part3
0 a,b,c a b c
1 d,e,f d e f
2 p,q,r,s p q r s
3 x,y,z x y z
You can use np.nan instead if you want to keep NaN values.
In case you have multiple delimiters to consider when splitting the column, use this:
import pandas as pd
data = { 'names' : ['a,b,c','d,e,f','p;q,r,s','x,y\nz,w']}
df = pd.DataFrame(data)
df = pd.concat([df[['names']],df['names'].str.split(',|\n|;',expand=True).add_prefix('part').fillna('')],axis=1)
print (df)
The output will be as follows:
names part0 part1 part2 part3
0 a,b,c a b c
1 d,e,f d e f
2 p;q,r,s p q r s
3 x,y\nz,w x y z w
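A small side note (my addition, not part of the original answer): in recent pandas versions (1.4+) str.split accepts a regex parameter, so it can be clearer to state explicitly that the pattern is a regular expression, and a raw string avoids escaping surprises:
df = pd.concat(
    [df[['names']],
     df['names'].str.split(r',|\n|;', expand=True, regex=True)
                .add_prefix('part')
                .fillna('')],
    axis=1)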

how to make a range of a column negative in numpy?

I'm just a newbie using numpy. I have long data, but simplified it looks like this:
a 3
b 2
c 1
d 0
e 1
f 2
g 3
I want to have output:
a -3
b -2
c -1
d 0
e 1
f 2
g 3
I tried to use numpy to negate the data above the row where column 2 is 0, but I always get an error.
Can anyone help me, please?
If the values are really ascending indices, then you can build the columns directly:
import numpy as np
a = np.arange(-3, 4)
print(a)
b = np.zeros((7, 2))
print(b)
b[:, 1] = a
To make this slightly more general: given an array arr whose sign you want to flip up to (and excluding) the first occurrence of a value v, you could do:
import numpy as np
arr = np.array([3, 2, 1, 0, 1, 2, 3])
v = 0
# find the index of the first occurrence of v:
idx = np.argmax(arr == v)
# change the sign up to index idx -
# since argmax returns 0 if v is not found, we have to check that:
if arr[idx] == v:
    arr[:idx] = arr[:idx] * -1
    # to not touch the original array, e.g.
    # arr_new = np.concatenate([arr[:idx] * -1, arr[idx:]])
# could put an else condition here, raise ValueError or something like that
print(arr)
# [-3 -2 -1  0  1  2  3]
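A fully vectorised variant (my own sketch, assuming the goal is to flip everything before the first zero) builds a boolean mask with a cumulative sum and negates only the masked positions:
import numpy as np

arr = np.array([3, 2, 1, 0, 1, 2, 3])
mask = np.cumsum(arr == 0) == 0   # True for every position before the first 0
arr = np.where(mask, -arr, arr)   # negate only where the mask is True
print(arr)
# [-3 -2 -1  0  1  2  3]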

Pandas dataframe with column of type set: How to initialize and update values

Assuming I have a pandas dataframe with existing content:
import pandas as pd
df = pd.DataFrame(columns=['content'])
df['content'] = pd.Series(["a","b","c","d","e"])
df.head()
content
0 a
1 b
2 c
3 d
4 e
How can I add a column with empty sets?
How can I update a subset of records' sets (e.g. set.add(value))?
Both questions turned out to be more difficult than expected.
To initialize a new column with empty sets:
df['Sets'] = [set() for _ in range(len(df))]
df.head()
content Sets
0 a {}
1 b {}
2 c {}
3 d {}
4 e {}
To update a subset of records' sets with a unique string:
row_ids_to_update = [1,3,4]
column_id_set = df.columns.get_loc("Sets")
update_string = "uid12345"
df.iloc[row_ids_to_update, column_id_set].apply(
lambda x: x.add(update_string))
df.head()
content Sets
0 a {}
1 b {uid12345}
2 c {}
3 d {uid12345}
4 e {uid12345}
Perhaps there's a faster way for large quantities of updates, e.g. avoiding lambda?
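One option that avoids the per-row lambda (my own sketch; whether it is actually faster for your data is an assumption worth benchmarking) is to iterate over the selected set objects directly and mutate them in place, since the sets stored in the frame are the same objects:
for s in df.iloc[row_ids_to_update, column_id_set]:
    s.add(update_string)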

pandas faster series of lists unrolling for one-hot encoding?

I'm reading from a database that has many array-type columns; pd.read_sql gives me a dataframe with columns that are dtype=object, containing lists.
I'd like an efficient way to find which rows have arrays containing some element:
s = pd.Series(
    [[1, 2, 3], [1, 2], [99], None, [88, 2]]
)
print(s)
0 [1, 2, 3]
1 [1, 2]
2 [99]
3 None
4 [88, 2]
I'm building 1-hot-encoded feature tables for an ML application and I'd like to end up with tables like:
contains_1 contains_2 contains_3 contains_88
0 1 ...
1 1
2 0
3 nan
4 0
...
I can unroll a series of arrays like so:
s2 = s.apply(pd.Series).stack()
0 0 1.0
1 2.0
2 3.0
1 0 1.0
1 2.0
2 0 99.0
4 0 88.0
1 2.0
which gets me to the point of being able to find the elements meeting some test:
>>> print(s2[(s2 == 2)].index.get_level_values(0))
Int64Index([0, 1, 4], dtype='int64')
Woot! This step:
s.apply(pd.Series).stack()
produces a great intermediate data structure (s2) that's fast to iterate over for each category. However, the apply step is jaw-droppingly slow (many tens of seconds for a single column of 500k rows with lists of tens of items), and I have many columns.
Update: It seems likely that having the data in a series of lists to begin with is quite slow. Performing the unroll on the SQL side seems tricky (I have many columns that I want to unroll). Is there a way to pull the array data into a better structure?
import numpy as np
import pandas as pd
import cytoolz

s0 = s.dropna()
v = s0.values.tolist()               # list of lists, with the None rows removed
i = s0.index.values
l = [len(x) for x in v]              # length of each inner list
c = list(cytoolz.concat(v))          # flatten all inner lists into one sequence
n = np.append(0, np.array(l[:-1])).cumsum().repeat(l)
k = np.arange(len(c)) - n            # position of each element within its own list
s1 = pd.Series(c, [i.repeat(l), k])  # MultiIndex: (original row, position in list)
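For the toy series above, s1 carries the same elements and (row, position) index that the apply(pd.Series).stack() approach produced, just without the float upcast:
print(s1)
# 0  0     1
#    1     2
#    2     3
# 1  0     1
#    1     2
# 2  0    99
# 4  0    88
#    1     2
# dtype: int64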
UPDATE: What worked for me...
def unroll(s):
    s = s.dropna()
    v = s.values.tolist()
    c = pd.Series(x for x in cytoolz.concat(v))  # 16 seconds!
    i = s.index
    lens = np.array([len(x) for x in v])  # s.apply(len) is slower
    n = np.append(0, lens[:-1]).cumsum().repeat(lens)
    k = np.arange(sum(lens)) - n
    s = pd.Series(c)
    s.index = [i.repeat(lens), k]
    s = s.dropna()
    return s
It should be possible to replace:
s = pd.Series(c)
s.index = [i.repeat(lens), k]
with:
s = pd.Series(c, index=[i.repeat(lens), k])
But this doesn't work. (It says it is OK here.)
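For the one-hot encoding itself, on pandas >= 0.25 you can skip the slow apply(pd.Series) step entirely. A sketch of my own (not from the answers above) using explode and get_dummies:
s2 = s.dropna().explode()                            # one row per list element
onehot = (pd.get_dummies(s2, prefix='contains', prefix_sep='_')
            .groupby(level=0).max()                  # collapse back to one row per original row
            .reindex(s.index))                       # rows that were None come back as NaN
print(onehot)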