I have a data set like this: {'IT': [1,20,35,44,51,....,1000]}
I want to convert this into a pandas DataFrame and see the output in the format below. How can I achieve this?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way shown below, but this is not efficient for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])

No need for a list comprehension since pandas will automatically fill IT in for every row.
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})

Use a list comprehension to build (key, value) tuples and pass them to the DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000

You can use melt:
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
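As a quick sanity check, the three approaches above can be run side by side on the same small dictionary. This is only a minimal sketch: the column names 'Dept' and 'Count' follow the question, and the variable names are illustrative.

```python
import pandas as pd

d = {'IT': [1, 20, 35, 44, 51, 1000]}

# Approach 1: pandas broadcasts the scalar 'IT' to every row
df_broadcast = pd.DataFrame({'Dept': 'IT', 'Count': d['IT']})

# Approach 2: flatten the dict into (key, value) tuples
rows = [(k, x) for k, v in d.items() for x in v]
df_tuples = pd.DataFrame(rows, columns=['Dept', 'Count'])

# Approach 3: build a wide frame, then melt it into long form
df_melt = pd.melt(pd.DataFrame(d), var_name='Dept', value_name='Count')

# All three should hold the same rows in the same order
records = [df.values.tolist() for df in (df_broadcast, df_tuples, df_melt)]
print(records[0] == records[1] == records[2])
```

Note that the melt variant builds a wide frame first, so with several keys it needs all lists to have the same length; the tuple variant does not have that restriction.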


Merge value_counts of different pandas dataframes

I have a list of pandas dataframes for which I compute the value_counts of a column and finally append all the results to another dataframe.
df_AB = pd.read_pickle('df_AB.pkl')
df_AC = pd.read_pickle('df_AC.pkl')
df_AD = pd.read_pickle('df_AD.pkl')
df_AE = pd.read_pickle('df_AE.pkl')
df_AF = pd.read_pickle('df_AF.pkl')
df_AG = pd.read_pickle('df_AG.pkl')
The format of the above dataframes is as below (Example: df_AB):
id is_valid
121 True
122 False
123 True
For every dataframe, I need to get the value_counts of the is_valid column and store the results in df_result. I tried the code below, but it doesn't work as expected.
df_AB_VC = df_AB['is_valid'].value_counts()
df_AB_VC['group'] = "AB"
df_AC_VC = df_AC['is_valid'].value_counts()
df_AC_VC['group'] = "AC"
Result dataframe (df_result):
Group is_valid_True_Count is_Valid_False_Count
AB 2 1
Any leads would be appreciated.
I think you just need to work on the dataframes a bit more systematically:
groups = ['AB', 'AC', 'AD',...]
out = pd.DataFrame({
    g: pd.read_pickle(f'df_{g}.pkl')['is_valid'].value_counts()
    for g in groups
})
Do not use separate variables; that makes your code much more complicated. Use a container:
files = ['df_AB.pkl', 'df_AC.pkl', 'df_AD.pkl', 'df_AE.pkl', 'df_AF.pkl']
# using the XX part in "df_XX.pkl", you need to adapt to your real use-case
dataframes = {f[3:5]: pd.read_pickle(f) for f in files}
# compute counts
counts = (pd.DataFrame({k: d['is_valid'].value_counts()
                        for k,d in dataframes.items()})
            .T.add_prefix('is_valid_').add_suffix('_Count')
         )
example output:
is_valid_True_Count is_valid_False_Count
AB 2 1
AC 2 1
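The container idea can be tried without any pickle files. The sketch below uses two small, made-up frames in place of the loaded ones (the ids and flags are hypothetical) and builds one row per group. It also sidesteps the problem in the question's attempt: value_counts() returns a Series, so assigning to ['group'] sets an entry of that Series rather than adding a column.

```python
import pandas as pd

# Hypothetical in-memory stand-ins for the frames loaded from df_XX.pkl
dataframes = {
    'AB': pd.DataFrame({'id': [121, 122, 123], 'is_valid': [True, False, True]}),
    'AC': pd.DataFrame({'id': [124, 125, 126], 'is_valid': [False, False, True]}),
}

counts = (
    pd.DataFrame({k: d['is_valid'].value_counts() for k, d in dataframes.items()})
    .T                        # one row per group, one column per outcome
    .fillna(0).astype(int)    # an outcome that never occurs becomes 0
    .add_prefix('is_valid_')
    .add_suffix('_Count')
)
print(counts)
```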
Use pathlib to extract the group name, then collect the data into a dictionary before concatenating all entries:
import pandas as pd
import pathlib
data = {}
for pkl in pathlib.Path().glob('df_*.pkl'):
    group = pkl.stem.split('_')[1]
    df = pd.read_pickle(pkl)
    data[group] = df['is_valid'].value_counts() \
                                .add_prefix('is_valid_') \
                                .add_suffix('_Count')
df = pd.concat(data, axis=1).T
>>> df
is_valid_True_Count is_valid_False_Count
AD 2 1
AB 4 2
AC 0 3
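A group in which one outcome never occurs (like AC above) has no entry for it in value_counts, so concatenation leaves a missing value there. One possible way to get an explicit 0 is to reindex each count to both outcomes. The sketch below is self-contained: it writes two hypothetical pickles into a temporary directory first, and the sample flags are made up.

```python
import pathlib
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = pathlib.Path(tmp)

    # Hypothetical sample data, saved as df_<group>.pkl
    samples = {'AB': [True, False, True], 'AC': [False, False, False]}
    for group, flags in samples.items():
        pd.DataFrame({'is_valid': flags}).to_pickle(tmp_path / f'df_{group}.pkl')

    data = {}
    for pkl in sorted(tmp_path.glob('df_*.pkl')):
        group = pkl.stem.split('_')[1]
        vc = pd.read_pickle(pkl)['is_valid'].value_counts()
        # Force both outcomes to be present, filling absent ones with 0
        data[group] = vc.reindex([True, False], fill_value=0)

result = pd.concat(data, axis=1).T
result.columns = [f'is_valid_{c}_Count' for c in result.columns]
print(result)
```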

saving dataframe groupby rows to exactly two lines

I have a dataframe and I want to group the rows by a specific column. Each group will have at least 4 and at most 50 rows. I want to save one column from each group as exactly two lines. If the group size is even, say 2n, then n rows go in the first line and the remaining n in the second. If it is odd, either n+1 and n or n and n+1 will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1, A
1, B
1, C
1, D
2, E
2, F
2, ds
2, G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to group by id and then get a dataframe like this:
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop('separate_column', axis=1)
This is a somewhat convoluted approach, but it does the job:
def func(s: pd.Series):
    mid = max(s.shape[0]//2 ,1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"]= df_new["name"].apply(lambda x: x[0])
df_new["name2"]= df_new["name"].apply(lambda x: x[1])
df = df_new.drop(labels="name", axis=1).stack().reset_index().drop(labels = ["level_1"], axis=1).rename(columns={0:"name"}).set_index("id")
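The same split-in-half idea can also be sketched more compactly with apply and explode. This is not the code from either answer, just one possible variant; the sample frame is rebuilt from the desired output in the question, and the helper name two_lines is hypothetical.

```python
import pandas as pd

# Sample data reconstructed from the desired output in the question
df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'name': ['A', 'B', 'C', 'D', 'E', 'F', 'ds', 'G', 'dsds'],
})

def two_lines(names: pd.Series) -> list:
    """Split one group's names into two space-joined lines (n+1 and n if odd)."""
    values = names.tolist()
    mid = (len(values) + 1) // 2
    return [' '.join(values[:mid]), ' '.join(values[mid:])]

out = (
    df.groupby('id')['name']
      .apply(two_lines)    # one list of two strings per id
      .rename('name')
      .explode()           # one row per line
      .reset_index()
)
print(out)
```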

Multiply many columns by one column in dask

I want to multiply roughly 50,000 columns by one other column in a large dask dataframe (6_500_000 x 50_002). The solution using a for loop works but is painfully slow. Below I tried two other approaches that failed. Any advice is appreciated.
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df[['a','b']].multiply(df['c'], axis="index")
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=1)
# works but very slow for large datasets:
for column in ['a', 'b']:
ddf[column] = ddf[column] * ddf['c']
# don't work:
ddf[['a','b']].multiply(ddf['c'], axis="index")
ddf[['a', 'b']].map_partitions(pd.DataFrame.mul, other=ddf['c'] ).compute()
Use .mul for dask:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
ddf = dd.from_pandas(df, npartitions=1)
ddf[['a','b']] = ddf[['a','b']].mul(ddf['c'], axis='index') # or axis=0
a b c
0 7 28 7
1 16 40 8
2 27 54 9
You basically had it for pandas; multiply() just isn't in-place, so the result has to be assigned back. I also switched to .loc to select every column except one, so you don't have to type 50,000 column names :)
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df.loc[:, df.columns != 'c']=df.loc[:, df.columns != 'c'].multiply(df['c'], axis="index")
a b c
0 7 28 7
1 16 40 8
2 27 54 9
NOTE: I am not familiar with Dask, but I imagine that it is the same issue for that attempt.
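For the pandas side, an equivalent way to pick "every column except c" is Index.difference, which some may find easier to read than a boolean mask. This is a minimal pandas-only sketch; it makes no claim about how the same selection performs in dask.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# All column labels except the multiplier column
cols = df.columns.difference(['c'])

# mul() returns a new frame, so assign the result back
df[cols] = df[cols].mul(df['c'], axis=0)
print(df)
```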

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not numeric, what I do is as follows (i.e. remove C first, normalise the data, then add the column back).
df_new = df.drop('concept', axis=1)
df_concept = df[['concept']]
from sklearn import preprocessing
x = df_new.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['concept'] = df_concept
However, I am sure there is an easier way of doing this in pandas (given the names of the columns that I do not need to normalise, do the normalisation directly).
I am happy to provide more details if needed.
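Use DataFrame.select_dtypes to select the numeric columns, apply min-max normalisation (subtract the minimum, then divide by the difference between the maximum and the minimum), and assign only the normalised columns back:
import numpy as np
df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print (df)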
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other functions on the data frame, you can use df[columns] = df[columns].apply(func).
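The df[columns] = df[columns].apply(func) pattern could look like the sketch below. The column names A and B and the helper min_max are assumptions made for illustration; only 'concept' and the values come from the question.

```python
import numpy as np
import pandas as pd

# Column names 'A' and 'B' are hypothetical; the values follow the question
df = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 300, 500], 'concept': list('abc')})

def min_max(col: pd.Series) -> pd.Series:
    """Scale one numeric column to the range [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

# Pick the numeric columns automatically, leaving 'concept' untouched
columns = df.select_dtypes(np.number).columns
df[columns] = df[columns].apply(min_max)
print(df)
```

Any other column-wise function can be swapped in for min_max, as long as it returns a Series of the same length.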

pandas series to multi-column html

I have a pandas Series (a slice of a larger DataFrame corresponding to a single record) that I would like to display as HTML. While I can convert the Series to a DataFrame and use the .to_html() function, this results in a two-column HTML output. To save space and get a better aspect ratio, I would like to return a four- or six-column HTML table where the Series has been wrapped into two or three sets of index-value columns, like so.
import pandas as pd
s = pd.Series( [12,34,56,78,54,77], index=['a', 'b', 'c', 'd', 'e','f'])
# will yield two-column HTML output
# I would like a format like this:
a 12 d 78
b 34 e 54
c 56 f 77
This is what I have currently:
x = s.reshape((3,2))
y = s.index.reshape((3,2))
new = [ ]
for i in range(len(x)):
z = pd.DataFrame(new)
z.T.to_html(header=False, index=False)
Is there a better way that I might have missed?
This is simpler. The ordering is slightly different (read across rows, not down columns), but I'm not sure if that will matter to you.
In [17]: DataFrame(np.array([s.index.values, s.values]).T.reshape(3, 4))
0 1 2 3
0 a 12 b 34
1 c 56 d 78
2 e 54 f 77
As in your example, you'll need to omit the "header" and the index from the HTML.
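If the down-the-columns ordering from the question matters, one possible variant is to cut the index-value pairs into equal blocks and place the blocks side by side. This is a sketch under the assumption that the Series length divides evenly by the number of blocks; n_blocks and the other variable names are illustrative.

```python
import pandas as pd

s = pd.Series([12, 34, 56, 78, 54, 77], index=['a', 'b', 'c', 'd', 'e', 'f'])

n_blocks = 2                  # number of index-value column pairs
n_rows = len(s) // n_blocks   # assumes len(s) is divisible by n_blocks

pairs = s.reset_index()       # two columns: index labels and values
blocks = [
    pairs.iloc[i * n_rows:(i + 1) * n_rows].reset_index(drop=True)
    for i in range(n_blocks)
]

wide = pd.concat(blocks, axis=1)       # blocks side by side
wide.columns = range(wide.shape[1])    # avoid duplicate column labels

html = wide.to_html(header=False, index=False)
print(wide)
```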