Pandas DataFrame creation using static data

I have a data set like this: {'IT': [1, 20, 35, 44, 51, ..., 1000]}
I want to convert it into a pandas DataFrame and see output in the format below. How can I achieve this?
Dept Count
IT 1
IT 20
IT 35
IT 44
IT 51
.. .
.. .
.. .
IT 1000
I can write it the way below, but this is not an efficient way for huge data.
data = [['IT',1],['IT',2],['IT',3]]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)

No need for a list comprehension, since pandas will automatically broadcast 'IT' to every row:
import pandas as pd
d = {'IT':[1,20,35,44,51,1000]}
df = pd.DataFrame({'dept': 'IT', 'count': d['IT']})
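For reference, printing df for this truncated sample should give (a sketch of the output):
print(df)
  dept  count
0   IT      1
1   IT     20
2   IT     35
3   IT     44
4   IT     51
5   IT   1000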

Use a list comprehension to build tuples and pass them to the DataFrame constructor:
d = {'IT':[1,20,35,44,51], 'NEW':[1000]}
data = [(k, x) for k, v in d.items() for x in v]
df = pd.DataFrame(data,columns=['Dept','Count'])
print(df)
Dept Count
0 IT 1
1 IT 20
2 IT 35
3 IT 44
4 IT 51
5 NEW 1000
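If the input dict is very large, a vectorised build avoids creating one tuple per row; a sketch of the same idea with numpy (np.repeat and np.concatenate are standard numpy calls):
import numpy as np

lens = [len(v) for v in d.values()]
df = pd.DataFrame({'Dept': np.repeat(list(d.keys()), lens),
                   'Count': np.concatenate(list(d.values()))})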

You can use melt:
import pandas as pd
d = {'IT': [10]*100000}
df = pd.DataFrame(d)
df = pd.melt(df, var_name='Dept', value_name='Count')
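melt also handles several departments at once, as long as the lists have equal length (pd.DataFrame requires that); a minimal sketch with a hypothetical second department 'HR':
d = {'IT': [1, 20, 35], 'HR': [2, 4, 6]}  # 'HR' is made up for illustration
df = pd.melt(pd.DataFrame(d), var_name='Dept', value_name='Count')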

Related

Merge value_counts of different pandas dataframes

I have a list of pandas DataFrames; for each one I take the value_counts of a column and finally append all the results to another DataFrame.
df_AB = pd.read_pickle('df_AB.pkl')
df_AC = pd.read_pickle('df_AC.pkl')
df_AD = pd.read_pickle('df_AD.pkl')
df_AE = pd.read_pickle('df_AE.pkl')
df_AF = pd.read_pickle('df_AF.pkl')
df_AG = pd.read_pickle('df_AG.pkl')
The format of the above dataframes is as below (Example: df_AB):
df_AB:
id is_valid
121 True
122 False
123 True
For every DataFrame, I need to get the value_counts of the is_valid column and store the results in df_result. I tried the code below, but it doesn't work as expected.
df_AB_VC = df_AB['is_valid'].value_counts()
df_AB_VC['group'] = "AB"  # appends 'AB' as a new Series entry, not a column
df_AC_VC = df_AC['is_valid'].value_counts()
df_AC_VC['group'] = "AC"
Result dataframe (df_result):
Group is_valid_True_Count is_valid_False_Count
AB 2 1
AC
AD
.
.
.
Any leads would be appreciated.
I think you just need to work on the dataframes a bit more systematically:
groups = ['AB', 'AC', 'AD', ...]
out = pd.DataFrame({
    g: pd.read_pickle(f'df_{g}.pkl')['is_valid'].value_counts()
    for g in groups
}).T
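The resulting columns are named by the raw True/False values; getting the question's headers is one extra renaming step (a sketch):
out.columns = [f'is_valid_{c}_Count' for c in out.columns]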
Do not use separate variables; that makes your code much more complicated. Use a container:
files = ['df_AB.pkl', 'df_AC.pkl', 'df_AD.pkl', 'df_AE.pkl', 'df_AF.pkl']
# using the XX part of "df_XX.pkl"; adapt this to your real use case
dataframes = {f[3:5]: pd.read_pickle(f) for f in files}
# compute counts
counts = (pd.DataFrame({k: d['is_valid'].value_counts()
                        for k, d in dataframes.items()})
          .T.add_prefix('is_valid_').add_suffix('_Count')
          )
example output:
is_valid_True_Count is_valid_False_Count
AB 2 1
AC 2 1
Use pathlib to extract the group name, then collect the data into a dictionary before concatenating all the entries:
import pandas as pd
import pathlib
data = {}
for pkl in pathlib.Path().glob('df_*.pkl'):
    group = pkl.stem.split('_')[1]
    df = pd.read_pickle(pkl)
    data[group] = df['is_valid'].value_counts() \
                    .add_prefix('is_valid_') \
                    .add_suffix('_Count')
df = pd.concat(data, axis=1).T
>>> df
is_valid_True_Count is_valid_False_Count
AD 2 1
AB 4 2
AC 0 3
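If you want the group as a regular Group column, as in the question's sketch, a follow-up step like this should work:
df = df.rename_axis('Group').reset_index()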

saving dataframe groupby rows to exactly two lines

I have a dataframe and I want to group the rows based on a specific column. The number of rows in each group will be at least 4 and at most 50. I want to save one column from each group into exactly two lines. If the group size is even, say 2n, then n rows go in one line and the remaining n in the second; if it is odd, n+1 and n (or n and n+1) will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to group by id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
df = df.sort_values('id')
# next 3 lines: for each group, find the separation point
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform('mean')
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# group by id plus the helper column; sort=False keeps each first half ahead
# of the second (the default key sort would order False before True)
df.groupby(['id', 'separate_column'], as_index=False, sort=False)['name'] \
  .agg(' '.join).drop(columns='separate_column')
This is a bit of a convoluted approach, but it does the job:
def func(s: pd.Series):
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]

df_new = df.groupby('id').agg(func)
df_new["name1"] = df_new["name"].apply(lambda x: x[0])
df_new["name2"] = df_new["name"].apply(lambda x: x[1])
df = (df_new.drop(labels="name", axis=1)
            .stack()
            .reset_index()
            .drop(labels=["level_1"], axis=1)
            .rename(columns={0: "name"})
            .set_index("id"))
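A more direct variant of the same split-in-half idea (not from the answers above) leans on numpy.array_split, which handles odd group sizes by itself; a minimal sketch, starting again from the question's original df:
import numpy as np

rows = [(gid, ' '.join(half))
        for gid, grp in df.groupby('id', sort=False)['name']
        for half in np.array_split(grp.to_numpy(), 2)]
out = pd.DataFrame(rows, columns=['id', 'name'])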

Multiply many columns by one column in dask

I want to multiply roughly 50,000 columns by one other column in a large dask dataframe (6_500_000 x 50_002). A solution using a for loop works but is painfully slow. Below I tried two other approaches that failed. Any advice is appreciated.
Pandas
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df[['a','b']].multiply(df['c'], axis="index")
Dask
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=1)
# works, but very slow for large datasets:
for column in ['a', 'b']:
    ddf[column] = ddf[column] * ddf['c']

# these don't work:
ddf[['a', 'b']].multiply(ddf['c'], axis="index")
ddf[['a', 'b']].map_partitions(pd.DataFrame.mul, other=ddf['c']).compute()
Use .mul for dask:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
ddf = dd.from_pandas(df, npartitions=1)
ddf[['a','b']] = ddf[['a','b']].mul(ddf['c'], axis='index') # or axis=0
ddf.compute()
Out[1]:
a b c
0 7 28 7
1 16 40 8
2 27 54 9
You basically had it for pandas; multiply() just isn't in place. I also changed it to use .loc for all but one column, so you don't have to type 50,000 column names :)
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6], 'c':[7,8,9]})
df.loc[:, df.columns != 'c'] = df.loc[:, df.columns != 'c'].multiply(df['c'], axis="index")
Output:
a b c
0 7 28 7
1 16 40 8
2 27 54 9
NOTE: I am not familiar with Dask, but I imagine that it is the same issue for that attempt.
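An untested sketch carrying the same "all columns except one" idea over to dask, reusing the .mul assignment that the first answer shows working (assumes a dask version that allows assigning to a list of columns, as used above):
cols = [c for c in ddf.columns if c != 'c']
ddf[cols] = ddf[cols].mul(ddf['c'], axis=0)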

How to quickly normalise data in pandas dataframe?

I have a pandas dataframe as follows.
import pandas as pd
df = pd.DataFrame({
'A':[1,2,3],
'B':[100,300,500],
'C':list('abc')
})
print(df)
A B C
0 1 100 a
1 2 300 b
2 3 500 c
I want to normalise the entire dataframe. Since column C is not a numeric column, what I do is as follows (i.e. remove C first, normalise the data, then add the column back).
df_new = df.drop('C', axis=1)
df_C = df[['C']]

from sklearn import preprocessing

x = df_new.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_new = pd.DataFrame(x_scaled)
df_new['C'] = df_C
However, I am sure there is an easier way of doing this in pandas (given the column names I do not need to normalise, do the normalisation directly).
I am happy to provide more details if needed.
Use DataFrame.select_dtypes to pick the numeric columns, normalise them by subtracting the minimum and dividing by the range, and assign back only the normalised columns:
import numpy as np

df1 = df.select_dtypes(np.number)
df[df1.columns] = (df1 - df1.min()) / (df1.max() - df1.min())
print(df)
A B C
0 0.0 0.0 a
1 0.5 0.5 b
2 1.0 1.0 c
In case you want to apply any other function to the data frame, you can use df[columns] = df[columns].apply(func).
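For instance, a z-score standardisation using the same column selection (a sketch; the lambda plays the role of func):
num_cols = df.select_dtypes(np.number).columns
df[num_cols] = df[num_cols].apply(lambda s: (s - s.mean()) / s.std())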

pandas series to multi-column html

I have a Pandas Series (a slice of a larger DataFrame corresponding to a single record) that I would like to display as HTML. While I can convert the Series to a DataFrame and use the .to_html() function, this results in a two-column HTML output. To save space and give a better aspect ratio, I would like to return a four- or six-column HTML table where the Series has been wrapped into two or three sets of index-value columns, like so:
import pandas as pd
s = pd.Series( [12,34,56,78,54,77], index=['a', 'b', 'c', 'd', 'e','f'])
#pd.DataFrame(s).to_html()
#will yield 2 column hmtl-ed output
# I would like a format like this:
a 12 d 78
b 34 e 54
c 56 f 77
This is what I have currently:
x = s.values.reshape((3, 2))        # reshape the underlying arrays (Series has no .reshape)
y = s.index.values.reshape((3, 2))
new = []
for i in range(len(x)):
    new.append(y[i])
    new.append(x[i])
z = pd.DataFrame(new)
z.T.to_html(header=False, index=False)
Is there a better way that I might have missed?
This is simpler. The ordering is slightly different (read across rows, not down columns), but I'm not sure if that will matter to you.
In [17]: DataFrame(np.array([s.index.values, s.values]).T.reshape(3, 4))
Out[17]:
0 1 2 3
0 a 12 b 34
1 c 56 d 78
2 e 54 f 77
As in your example, you'll need to omit the "header" and the index from the HTML.
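A generalisation of that reshape to two or three index/value pairs could look like the sketch below (to_wide_html is a hypothetical helper name; it assumes len(s) divides evenly by the number of pairs and does no padding):
import numpy as np
import pandas as pd

def to_wide_html(s: pd.Series, pairs: int = 2) -> str:
    # assumes len(s) % pairs == 0; no padding is done
    nrows = len(s) // pairs
    arr = np.array([s.index.values, s.values]).T.reshape(nrows, 2 * pairs)
    return pd.DataFrame(arr).to_html(header=False, index=False)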