splitting the file and rearrangement - pandas

My text file contains temperature-variation data:
#
1 2
2 4
3 4
#
6 1
3 2
1 7
I want the column values to be split at each # and to generate new files by appending the split pieces side by side.
expected output1
1 6
2 3
3 1
expected output2
2 1
4 2
4 7

A more complex problem than it seems at first glance. I had to look carefully at the example output to fully see what was going on.
Simulated text file:
import io
import pandas as pd

sim_txt = io.StringIO('''
#
1 2
2 4
3 4
#
6 1
3 2
1 7
''')

# read both whitespace-separated columns as strings so the '#' marker rows survive
df = pd.read_csv(sim_txt, sep=r'\s+', header=None, names=[0, 1], dtype=str)

# number each '#'-delimited block, pivot the blocks side by side, then shift
# the second block up so the two halves align row-wise
df_out = df.assign(out=df[0].str.contains('#').cumsum()) \
           .pivot(columns=['out']) \
           .apply(lambda x: x.shift(-len(x)//2) if x.name[1] == 2 else x) \
           .dropna().astype(int)
print(df_out)
     0     1
out  1  2  1  2
1    1  6  2  1
2    2  3  4  2
3    3  1  4  7
Then save to individual files:
for c in df_out.columns.get_level_values(0).unique():
    df_out.loc[1:, c].to_csv(fr'd:\jchtempnew\SO\output{c+1}.txt',
                             index=None, header=None, sep=' ')
output1
1 6
2 3
3 1
output2
2 1
4 2
4 7
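For comparison, the same rearrangement can be done without pandas. This is a minimal plain-Python sketch, assuming every '#'-delimited block has the same number of rows: split the lines into blocks, then lay column k of every block side by side to form output k.

```python
# Minimal sketch, assuming each '#'-delimited block has the same row count.
text = '''#
1 2
2 4
3 4
#
6 1
3 2
1 7'''

blocks, current = [], None
for line in text.splitlines():
    if line.startswith('#'):
        current = []            # a '#' line starts a new block
        blocks.append(current)
    elif line.strip():
        current.append(line.split())

# output k = column k of every block, laid side by side
outputs = []
for col in range(len(blocks[0][0])):
    rows = zip(*[[row[col] for row in block] for block in blocks])
    outputs.append('\n'.join(' '.join(r) for r in rows))

print(outputs[0])  # 1 6 / 2 3 / 3 1
print(outputs[1])  # 2 1 / 4 2 / 4 7
```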


Allotting unique identifier to a group of groups in pandas dataframe

Given a frame like this
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 3, 7, 3, 2, 11, 13, 10, 1, 5],
                   'B': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4],
                   'C': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3]})
I want to allot a unique identifier to multiple groups in column B. For example, going from the top, every two groups get one unique identifier, as shown in the red boxes in the image below. The end result would look like below:
Currently I am doing it like this, but it seems to be overkill. It's taking too much time to update even 70,000 rows:
b_unique_cnt = df['B'].nunique()
the_list = list(range(1, b_unique_cnt + 1))
slice_size = 2
list_of_slices = zip(*(iter(the_list),) * slice_size)

counter = 1
df['D'] = -1
for i in list_of_slices:
    df.loc[df['B'].isin(i), 'D'] = counter
    counter = counter + 1
df.head(15)
You could do
df['new'] = df.B.factorize()[0]//2+1
#(df.groupby(['B'], sort=False).ngroup()//2).add(1)
df
Out[153]:
     A  B  C  new
0    1  1  1    1
1    2  1  1    1
2    3  1  1    1
3    4  2  1    1
4    6  2  1    1
5    3  2  1    1
6    7  2  1    1
7    3  3  2    2
8    2  3  2    2
9   11  3  2    2
10  13  3  2    2
11  10  3  2    2
12   1  4  3    2
13   5  4  3    2
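As a quick sanity check (a sketch on the question's own frame), the factorize version and the commented-out ngroup alternative produce the same labels:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 3, 7, 3, 2, 11, 13, 10, 1, 5],
                   'B': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4],
                   'C': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3]})

# pair up consecutive groups of B: codes 0,1,2,3 floor-divided by 2 give 0,0,1,1
via_factorize = df.B.factorize()[0] // 2 + 1
via_ngroup = (df.groupby(['B'], sort=False).ngroup() // 2).add(1)

print(list(via_factorize))          # [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
print((via_factorize == via_ngroup).all())  # True
```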

Insert a level 0 in the existing data frame such that 4 columns are grouped as one

I want to do multiindexing for my data frame such that MAE, MSE, RMSE, MPE are grouped together under a new index level. Similarly, each remaining set of four columns should be grouped at the same level but under a different name.
mux3 = pd.MultiIndex.from_product([list('ABCD'), list('1234')],
                                  names=['one', 'two'])  # dummy data
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)  # dummy data frame
print(df3)  # intended output required for the data frame in the picture given below
Assuming the column groups are already in the appropriate order, we can simply create an np.arange over the length of the columns, floor-divide by 4 to get the groups, and build a simple MultiIndex.from_arrays.
Sample Input and Output:
import numpy as np
import pandas as pd
initial_index = [1, 2, 3, 4] * 3
np.random.seed(5)
df3 = pd.DataFrame(
    np.random.choice(10, (3, len(initial_index))), columns=initial_index
)

   1  2  3  4  1  2  3  4  1  2  3  4   # column headers are in repeating order
0  3  6  6  0  9  8  4  7  0  0  7  1
1  5  7  0  1  4  6  2  9  9  9  9  1
2  2  7  0  5  0  0  4  4  9  3  2  4
# Create New Columns
df3.columns = pd.MultiIndex.from_arrays([
    np.arange(len(df3.columns)) // 4,  # group each set of 4 columns together
    df3.columns                        # keep level 1 the same as the current columns
], names=['one', 'two'])               # set names (optional)
df3

one  0           1           2
two  1  2  3  4  1  2  3  4  1  2  3  4
0    3  6  6  0  9  8  4  7  0  0  7  1
1    5  7  0  1  4  6  2  9  9  9  9  1
2    2  7  0  5  0  0  4  4  9  3  2  4
If columns are in mixed order:
np.random.seed(5)
df3 = pd.DataFrame(
    np.random.choice(10, (3, 8)), columns=[1, 1, 3, 2, 4, 3, 2, 4]
)
df3

   1  1  3  2  4  3  2  4   # cannot select groups positionally
0  3  6  6  0  9  8  4  7
1  0  0  7  1  5  7  0  1
2  4  6  2  9  9  9  9  1
We can convert the Index with to_series, enumerate the columns using groupby + cumcount, then sort_index if needed to get them in order:
df3.columns = pd.MultiIndex.from_arrays([
    # enumerate groups to create the new level 0 index
    df3.columns.to_series().groupby(df3.columns).cumcount(),
    df3.columns
], names=['one', 'two'])  # set names (optional)

# sort to order correctly
# (do not sort before setting columns -- it would break alignment with the data)
df3 = df3.sort_index(axis=1)
df3

one  0           1
two  1  2  3  4  1  2  3  4   # notice the data has moved with the headers
0    3  0  6  9  6  4  8  7
1    0  1  7  5  0  0  7  1
2    4  9  2  9  6  9  9  1
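One payoff of the new level worth noting: once the MultiIndex is in place, a single level-0 key selects a whole group of four columns at once. A small sketch using the answer's own setup:

```python
import numpy as np
import pandas as pd

initial_index = [1, 2, 3, 4] * 3
np.random.seed(5)
df3 = pd.DataFrame(np.random.choice(10, (3, len(initial_index))),
                   columns=initial_index)
df3.columns = pd.MultiIndex.from_arrays([
    np.arange(len(df3.columns)) // 4,  # group each set of 4 columns
    df3.columns
], names=['one', 'two'])

group0 = df3[0]               # all four columns of the first group at once
print(group0.shape)           # (3, 4)
print(list(group0.columns))   # [1, 2, 3, 4]
```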

which rows are duplicates to each other

I have a database with a lot of columns. Some of the rows are duplicates (on a certain subset of columns).
Now I want to find out which row duplicates which row, and put them next to each other.
For instance, let's suppose that the data frame is
   id  A  B  C
0   0  1  2  0
1   1  2  3  4
2   2  1  4  8
3   3  1  2  3
4   4  2  3  5
5   5  5  6  2
and subset is
['A','B']
I expect something like this:
   id  A  B  C
0   0  1  2  0
1   3  1  2  3
2   1  2  3  4
3   4  2  3  5
4   2  1  4  8
5   5  5  6  2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False to get a mask of all duplicates, then filter by boolean indexing, sort by DataFrame.sort_values, and join the pieces together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print(df)
   id  A  B  C
0   0  1  2  0
1   3  1  2  3
2   1  2  3  4
3   4  2  3  5
4   2  1  4  8
5   5  5  6  2
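If you also want an explicit label saying which rows duplicate which, a small extension of the same idea (a sketch; `dup_group` is a made-up column name) tags each duplicate cluster with a shared id via groupby().ngroup(), leaving -1 on non-duplicated rows:

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3, 4, 5],
                   'A':  [1, 2, 1, 1, 2, 5],
                   'B':  [2, 3, 4, 2, 3, 6],
                   'C':  [0, 4, 8, 3, 5, 2]})
L = ['A', 'B']
m = df.duplicated(L, keep=False)
# rows sharing the same (A, B) pair get the same id; -1 marks singletons
df['dup_group'] = df.groupby(L, sort=False).ngroup().where(m, -1)
print(df['dup_group'].tolist())  # [0, 1, -1, 0, 1, -1]
```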

pandas: bin data into specific number of bins of specific size

I would like to bin a dataframe by the values in a single column into bins of a specific size and number.
Here is an example df:
df= pd.DataFrame(np.random.randint(0,10000,size=(10000, 4)), columns=list('ABCD'))
Say I want to bin by column D; I will first sort the data:
df.sort_values('D')
I now wish to bin so that, if the bin size is 50 and the bin number is 100, the first 50 values go into bin 1, the next 50 into bin 2, and so on. Any remaining values after the last full bin should all go into the final bin. Is there any way of doing this?
EDIT:
Here is a sample input:
x = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
And here is the expected output:
   A  B  C  D  bin
0  6  8  6  5    3
1  5  4  9  1    1
2  5  1  7  4    3
3  6  3  3  3    2
4  2  5  9  3    2
5  2  5  1  3    2
6  0  1  1  0    1
7  3  9  5  8    3
8  2  4  0  1    1
9  6  4  5  6    3
As an extra aside, is it also possible to put any equal values into the same bin? For example, say bin 1 contains the values 0, 1, 1 and bin 2 contains 1, 1, 2. Is there any way of moving the two 1 values in bin 2 into bin 1? This will create very uneven bin sizes, but that is not an issue.
It seems you need to floor-divide an np.arange and assign it to a new column:
idx = df['D'].sort_values().index
df['b'] = pd.Series(np.arange(len(df)) // 3 + 1, index=idx)
print(df)

   A  B  C  D  bin  b
0  6  8  6  5    3  3
1  5  4  9  1    1  1
2  5  1  7  4    3  3
3  6  3  3  3    2  2
4  2  5  9  3    2  2
5  2  5  1  3    2  2
6  0  1  1  0    1  1
7  3  9  5  8    3  4
8  2  4  0  1    1  1
9  6  4  5  6    3  3
Detail:
print(np.arange(len(df)) // 3 + 1)
[1 1 1 2 2 2 3 3 3 4]
EDIT:
I created another question about the problem with the last values; one possible solution (thanks, Divakar):
N = 3

def replace_irregular_groupings(a, N):
    # merge an incomplete trailing group into the previous bin
    n = len(a)
    m = N * (n // N)
    if m != n:
        a[m:] = a[m - 1]
    return a

idx = df['D'].sort_values().index
arr = replace_irregular_groupings(np.arange(len(df)) // N + 1, N)
df['b'] = pd.Series(arr, index=idx)
print(df)
print (df)
   A  B  C  D  bin  b
0  6  8  6  5    3  3
1  5  4  9  1    1  1
2  5  1  7  4    3  3
3  6  3  3  3    2  2
4  2  5  9  3    2  2
5  2  5  1  3    2  2
6  0  1  1  0    1  1
7  3  9  5  8    3  3
8  2  4  0  1    1  1
9  6  4  5  6    3  3
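For the asker's extra aside (equal D values landing in the same bin), one possible sketch: after assigning positional bins as above, let every repeated D value take the smallest bin any of its occurrences received. Bin sizes become uneven, which the question explicitly allows.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))
N = 3
idx = df['D'].sort_values().index
df['b'] = pd.Series(np.arange(len(df)) // N + 1, index=idx)

# pull every duplicate D value into the first bin that value appeared in
df['b'] = df.groupby('D')['b'].transform('min')
print(df.groupby('D')['b'].nunique().max())  # 1 -- equal values now share a bin
```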

Pandas count values inside dataframe

I have a dataframe that looks like this:
   A  B  C
1  1  8  3
2  5  4  3
3  5  8  1
and I want to count how often each value occurs anywhere in the frame, to make a df like this:
   total
1      2
3      2
4      1
5      2
8      2
is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
   A  B  C
1  1  8  3
2  5  4  3
3  5  8  1

In [333]: ids, c = np.unique(df.values.ravel(), return_counts=True)

In [334]: pd.DataFrame({'total': c}, index=ids)
Out[334]:
   total
1      2
3      2
4      1
5      2
8      2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby():
df = pd.DataFrame({'A': [1, 8, 3], 'B': [5, 4, 3], 'C': [5, 8, 1]})
print(df)
   A  B  C
0  1  5  5
1  8  4  8
2  3  3  1

df1 = df.stack().reset_index(1)
df1.groupby(0).count()
   level_1
0
1        2
3        2
4        1
5        2
8        2
Another alternative is stack, followed by value_counts, then converting the result to a frame and finally sorting the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
count_df
Result:
   total
1      2
3      2
4      1
5      2
8      2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
   0  1
0  1  2
1  3  2
2  4  1
3  5  2
4  8  2
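One more sketch for completeness, using only the standard library's Counter on the flattened values (assuming all cells are hashable scalars):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'A': [1, 8, 3], 'B': [5, 4, 3], 'C': [5, 8, 1]})

# count every cell in the frame, then lift the result back into pandas
counts = Counter(df.to_numpy().ravel())
total = pd.Series(counts).sort_index().rename('total')
print(total.to_dict())  # {1: 2, 3: 2, 4: 1, 5: 2, 8: 2}
```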