I want to create a pandas DataFrame with 5000 columns (n=5000) and one row (row G). Each value in row G should be 1 (in 10% of samples) or 0 (in 90% of samples).
import numpy as np
import pandas as pd
df = pd.DataFrame({"G": np.random.choice([1,0], p=[0.1, 0.9], size=5000)}).T
I also want the column names to be "Cell" followed by 1..5000:
   Cell1  Cell2  Cell3  ...  Cell5000
G      0      0      1  ...         0
The columns will default to a RangeIndex from 0-4999. You can add 1 to the column values, and then use DataFrame.add_prefix to add the string "Cell" before all of the column names.
df.columns += 1
df = df.add_prefix("Cell")
print(df)
Cell1 Cell2 Cell3 ... Cell5000
G 0 0 0 ... 0
As a one-liner, you can also add 1 and prefix with "Cell" by converting the column index to strings manually.
df.columns = "Cell" + (df.columns + 1).astype(str)
To make a single row DataFrame, I would construct my data with numpy in the correct shape instead of transposing a DataFrame. You can also pass in the columns as you want them numbered and the index labelled.
import numpy as np
import pandas as pd
size = 5000
df = pd.DataFrame(
    np.random.choice([1,0], p=[.1, .9], size=(1, size)),
    columns=np.arange(1, size+1),
    index=["G"]
).add_prefix("Cell")
print(df)
Cell1 Cell2 Cell3 ... Cell4999 Cell5000
G 0 0 0 ... 0 0
Another method could be:
size = 5000
pd.DataFrame.from_dict(
    {"G": np.random.choice([1,0], p=[0.1, 0.9], size=size)},
    columns=[f'Cell{x}' for x in range(1, size+1)],
    orient='index'
)
Output:
Cell1 Cell2 Cell3 Cell4 Cell5 Cell6 Cell7 Cell8 Cell9 ... Cell4992 Cell4993 Cell4994 Cell4995 Cell4996 Cell4997 Cell4998 Cell4999 Cell5000
G 0 0 0 0 0 1 0 1 0 ... 0 0 0 0 0 0 0 0 0
[1 rows x 5000 columns]
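Note that np.random.choice with p=[0.1, 0.9] gives 1 in roughly 10% of the columns, not exactly 10%. The following is a minimal sketch, assuming a seeded np.random.default_rng generator (the seed and the variable names are illustrative, not from the answers above), that builds the row both ways: with independent draws and with an exact 500/4500 split that is then shuffled.

```python
import numpy as np
import pandas as pd

size = 5000
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

columns = [f"Cell{i}" for i in range(1, size + 1)]

# Independent draws: each cell is 1 with probability 0.1, so the share of
# ones is close to 10% but usually not exactly 10%.
values = rng.choice([1, 0], p=[0.1, 0.9], size=(1, size))
df = pd.DataFrame(values, index=["G"], columns=columns)
fraction_of_ones = df.loc["G"].mean()

# Exact split: 10% ones and 90% zeros, placed in random order.
exact = np.zeros(size, dtype=int)
exact[: size // 10] = 1
df_exact = pd.DataFrame([rng.permutation(exact)], index=["G"], columns=columns)

print(df.shape, round(fraction_of_ones, 3), int(df_exact.loc["G"].sum()))
```

Which variant is appropriate depends on whether "10% of samples" is meant as a probability per cell or as an exact count.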
Related
I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create an output like the one below:
Basically, if 25% is 1, then a new column based on the month should be created.
Please help me, thank you.
This should work; do ask if something doesn't make sense.
for i in range(len(df)):
    # For each flag that is set in row i, create a month-specific column
    # holding the collection amount in row i and 0 everywhere else.
    if df['25%_coll'][i] == 1:
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
index="month", columns="new_cols", values="collection_amount",
fill_value=0, aggfunc="sum"
).reset_index(drop=True)
.melt() the dataframe with index columns month and collection_amount.
Set the appropriate collection_amount values to 0.
Build the new column names in column new_cols.
month collection_amount variable value new_cols
0 4 100 25%_coll 1 month_4_25%_coll
1 5 0 25%_coll 0 month_5_25%_coll
2 6 300 25%_coll 1 month_6_25%_coll
3 4 0 75%_coll 0 month_4_75%_coll
4 5 200 75%_coll 1 month_5_75%_coll
5 6 300 75%_coll 1 month_6_75%_coll
Use .pivot_table() on this dataframe to build the new columns.
The rest isn't completely clear: Either use df = pd.concat([df, df2], axis=1), or df.merge(df2, ...) to merge on month (with .reset_index() without drop=True).
Result for the sample dataframe
df = pd.DataFrame({
"collection_amount": [100, 200, 300],
"25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
"month": [4, 5, 6]
})
is
new_cols month_4_25%_coll month_4_75%_coll month_5_25%_coll \
0 100 0 0
1 0 0 0
2 0 0 0
new_cols month_5_75%_coll month_6_25%_coll month_6_75%_coll
0 0 0 0
1 200 0 0
2 0 300 300
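The final combination step mentioned above can be sketched as follows. This is only one possible reading of the desired output: it assumes the new columns should sit next to the original ones and that month identifies each row. The names wide and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6],
})

df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]

# Keep month as a regular column (no drop=True) so it can serve as merge key.
wide = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index()
wide.columns.name = None  # drop the leftover "new_cols" label

# Attach the month-specific columns to the original rows.
result = df.merge(wide, on="month", how="left")
print(result)
```

If a month can occur in several rows, the pivot sums the amounts per month, so every row of that month would receive the same totals after the merge.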
Can you please help with the problem below?
Given two dataframes df1 and df2, I need to get something like the result dataframe.
import pandas as pd
import numpy as np
feature_list = [ str(i) for i in range(6)]
df1 = pd.DataFrame( {'value' : [0,3,0,4,2,5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)
Expected DataFrame:
The result should be driven by comparing the values in df1 with the column names (features) in df2: where they match, put a 1 in resultDf.
Here's the expected output (or resultsDf):
I think you need:
(pd.get_dummies(df1['value'], dtype=int)  # dtype=int keeps 0/1 instead of booleans (pandas >= 2.0)
   .rename(columns = str)
   .reindex(columns = df2.columns,
            index = df2.index,
            fill_value = 0))
0 1 2 3 4 5
0 1 0 0 0 0 0
1 0 0 0 1 0 0
2 1 0 0 0 0 0
3 0 0 0 0 1 0
4 0 0 1 0 0 0
5 0 0 0 0 0 1
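The same comparison can also be written directly with NumPy broadcasting. This is a sketch of an alternative, not the method from the answer above; it assumes df1 and df2 have the same number of rows in the same order and that the column labels of df2 are the string forms of the values.

```python
import numpy as np
import pandas as pd

feature_list = [str(i) for i in range(6)]
df1 = pd.DataFrame({"value": [0, 3, 0, 4, 2, 5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)

# Values as a column vector of strings, labels as a row vector of strings.
values = df1["value"].astype(str).to_numpy()[:, None]
labels = df2.columns.to_numpy()[None, :]

# Broadcasting compares every value with every label; True becomes 1.
result = pd.DataFrame((values == labels).astype(int),
                      index=df2.index, columns=df2.columns)
print(result)
```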
I am wondering if anybody has a quick fix for a memory error that appears when doing the same thing as in the example below, but on larger data?
Example:
import pandas as pd
import numpy as np
nRows = 2
nCols = 3
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)
df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())
The memory error occurs when creating df3.
The DF's in the example:
df
0
0 NaN
1 NaN
df2
0 [[0.6704675101784022, 0.41730480236712697, 0.5...
1 [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object
df3
0 1 2
0 0.670468 0.417305 0.558690
0 0.140387 0.198101 0.800745
First, I think working with lists in pandas is not a good idea; avoid it if possible.
So I believe you can simplify your code a lot:
nRows = 2
nCols = 3
np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print (df3)
0 1 2
0 0.903482 0.393081 0.623970
1 0.637877 0.880499 0.299172
Here's an example with a solution to the problem. Note that in this example the column holds arrays instead of lists. I cannot avoid this, since my original problem comes with lists or arrays in a column.
import pandas as pd
import numpy as np
import time
np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)
chunkSize = int(round(nRows / float(numberOfChunks)))
# Convert the array-valued Series chunk by chunk to limit peak memory use.
for start, stop in zip(np.arange(0, nRows, chunkSize),
                       np.arange(chunkSize, nRows + chunkSize, chunkSize)):
    df2tmp = df2.iloc[start:stop]
    if start == 0:
        df3 = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
        continue
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    df3 = pd.concat([df3, df3tmp])
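A possible alternative to growing df3 with repeated pd.concat calls is to allocate the final float16 block once and fill it row by row. The sketch below assumes every cell holds a 1-D array of the same length; the sizes are reduced so it runs quickly, and the names s and out are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
nRows, nCols = 1000, 50  # small sizes for illustration only

# A Series whose cells each hold a 1-D array, as in the question.
s = pd.Series([np.random.rand(nCols) for _ in range(nRows)])

# Allocate the target once in the compact dtype, then copy each row in.
out = np.empty((nRows, nCols), dtype="float16")
for pos, row in enumerate(s):
    out[pos] = row  # NumPy casts float64 to float16 on assignment

df3 = pd.DataFrame(out, index=s.index)
print(df3.shape, df3.dtypes.iloc[0])
```

This avoids holding intermediate float64 DataFrames for each chunk, at the cost of a Python-level loop over the rows.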
I want to create a row number series, but without overriding my date index.
I can do it with a loop, but I think there must be an easier way?
_cnt = []
for i in range(len(df)):
    _cnt.append(i)
df['row'] = _cnt
Thanks.
Probably the easiest way:
df['row'] = range(len(df))
>>> df
0 1
0 0.444965 0.993382
1 0.001578 0.174628
2 0.663239 0.072992
3 0.664612 0.291361
4 0.486449 0.528354
>>> df['row'] = range(len(df))
>>> df
0 1 row
0 0.444965 0.993382 0
1 0.001578 0.174628 1
2 0.663239 0.072992 2
3 0.664612 0.291361 3
4 0.486449 0.528354 4
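Since the question mentions a date index, here is a small sketch showing that the index is left untouched. The dates and the value column are made up for illustration, and np.arange is used in place of range, which behaves the same way here.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=5, freq="D")  # hypothetical dates
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=idx)

# Positional counter assigned as a new column; the DatetimeIndex stays as is.
df["row"] = np.arange(len(df))
print(df)
```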
I have the following code:
import numpy
t = 12
s = numpy.array(df.Array.tolist())
s[s<t] = 0
thresh = numpy.where(s>0, s-t, 0)
df['NewArray'] = list(thresh)
While it works, surely there must be a more pandas-like way of doing it.
EDIT:
df.Array.head() looks like this:
0 [0.771511552006, 0.771515476223, 0.77143569165...
1 [3.66720695274, 3.66722560562, 3.66684636758, ...
2 [2.3047433839, 2.30475510675, 2.30451676559, 2...
3 [0.999991522708, 0.999996609066, 0.99989319662...
4 [1.11132718786, 1.11133284052, 0.999679589875,...
Name: Array, dtype: object
IIUC, you can simply subtract and use clip(lower=0) (the older clip_lower method was deprecated in pandas 0.24 and removed in 1.0):
In [29]: df["NewArray"] = (df["Array"] - 12).clip(lower=0)
In [30]: df
Out[30]:
Array NewArray
0 10 0
1 11 0
2 12 0
3 13 1
4 14 2
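The example above uses a column of scalars. If each cell of Array holds an array, as in the question's EDIT, the same subtract-and-floor idea can be applied cell by cell. This is a sketch with made-up data; it matches the question's code only under the assumption that the threshold t is positive.

```python
import numpy as np
import pandas as pd

t = 12
# Hypothetical data: each cell of "Array" holds a 1-D array.
df = pd.DataFrame({"Array": [np.array([10.0, 12.5, 15.0]),
                             np.array([3.0, 12.0, 20.0])]})

# Subtract the threshold and floor the result at zero, one cell at a time.
df["NewArray"] = [np.clip(np.asarray(a) - t, 0, None) for a in df["Array"]]
print(df["NewArray"].tolist())
```

Unlike the numpy.array(df.Array.tolist()) approach, this does not require all arrays to have the same length.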