I have the following table as a data frame.
0
8
990
15
70
85
36
2
43
5
68
61
62
624
65
82
523
98
I want to create a new column after every third row. So the data should look like this.
Thanks in advance.
It looks like your column can be converted into a Python list. If so, you can slice the list into three interleaved sub-lists, build a list of lists, and create a dataframe from that.
The code might look something like this:
import pandas as pd

listofitems = [...]
## build the dataframe from three strided slices (index jump of 3)
newdf = pd.DataFrame([listofitems[i::3] for i in range(3)])
## transpose so the dataframe has 3 columns
newdf = newdf.T
For the example given above, 4139055 rows is not big data. If you do have big and complex data, take a look at PySpark, specifically Spark DataFrames. It is one of the big-data frameworks that helps optimize transformations over large dataframes.
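For completeness, here is a runnable version of the slicing approach above, filled in with the sample values from the question:

```python
import pandas as pd

# sample values from the question
listofitems = [0, 8, 990, 15, 70, 85, 36, 2, 43,
               5, 68, 61, 62, 624, 65, 82, 523, 98]

# every third element starting at offset i gives one interleaved slice;
# transposing turns the three slices into three columns
newdf = pd.DataFrame([listofitems[i::3] for i in range(3)]).T
print(newdf)
```

The first row comes out as `0, 8, 990`, i.e. each original run of three consecutive values becomes one row.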
import pandas as pd
import numpy as np

numbers = [0, 8, 990, 15, 70, 85, 36, 2, 43,
           5, 68, 61, 62, 624, 65, 82, 523, 98]
pd.DataFrame(np.reshape(numbers, (6, 3)))
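A small variant of the answer above: if you'd rather not hard-code the row count, NumPy can infer it from the column count when you pass -1 for that dimension:

```python
import pandas as pd
import numpy as np

numbers = [0, 8, 990, 15, 70, 85, 36, 2, 43,
           5, 68, 61, 62, 624, 65, 82, 523, 98]
# -1 lets NumPy infer the number of rows from the 3-column constraint
df = pd.DataFrame(np.reshape(numbers, (-1, 3)))
print(df.shape)
```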
I want to create 2 subsets with the columns whose names start with radius_ and area_. Let me provide fake data; sorry, I modified it a bit below.
import pandas as pd

data = {'radius_mean': [18, 21, 20, 11, 20],
        'radius_se': [1, 0.5, 0.7, 0.4, 0.8],
        'area_mean': [1001, 1326, 1203, 386, 1200],
        'area_se': [153, 75, 94, 27, 95]}
df = pd.DataFrame(data)
df1 = pd.DataFrame()
df2 = pd.DataFrame()
subsets = [df1, df2]
features = ['radius', 'area']
for subset, feature in zip(subsets, features):
    subcol = [col for col in df.columns if col.startswith(feature + '_')]
    print(subcol)
    subset = df[subcol]
    print(subset.head())
I expect df1:
['radius_mean', 'radius_se']
   radius_mean  radius_se
0           18        1.0
1           21        0.5
2           20        0.7
3           11        0.4
4           20        0.8
I expect df2 as shown below. However, df1 and df2 remain empty even though subset is created inside the loop:
['area_mean', 'area_se']
   area_mean  area_se
0       1001      153
1       1326       75
2       1203       94
3        386       27
4       1200       95
You're running into an issue with how Python binds names to dataframes. Your logic makes sense, but assigning to the loop variable (subset = df[subcol]) only rebinds that name to a new dataframe; it never updates the original dataframes or the entries of your list. You can side-step this issue by creating data1 and data2 AFTER your loop, like I show at the end of the code.
import pandas as pd
import io #you don't need this, it's just for me to read in the cancer table
#again you don't need this, this just lets me get the cancer table
cancer = pd.read_csv(io.StringIO("""
radius_mean radius_se radius_worst area_mean area_se area_worst
17.99 1.0950 25.38 1001.0 153.40 2019.0
20.57 0.5435 24.99 1326.0 74.08 1956.0
19.69 0.7456 23.57 1203.0 94.03 1709.0
11.42 0.4956 14.91 386.1 27.23 567.7
20.29 0.7572 22.54 1297.0 94.44 1575.0
"""),delim_whitespace=True)
data1 = pd.DataFrame()
data2 = pd.DataFrame()
dsets = [data1, data2]  # the list holds references to data1 and data2
# assigning to a list slot replaces that reference; it does not modify data1 itself
dsets[0] = pd.DataFrame({'a': [1, 2, 3]})  # rebinds index 0, data1 is untouched
print(dsets[0])  # changed
print(data1)     # not changed
# in your loop the same rebinding happens, so data1 and data2 never get updated
features = ['radius', 'area']
for dset, feature in zip(dsets, features):
    subcol = [col for col in cancer.columns if col.startswith(feature + '_')]
    dset = cancer[subcol]  # rebinds the loop variable only
print(data1)  # still not updated
SOLUTION: create data1 and data2 for the first time in the loop instead
dsets = []
features = ['radius', 'area']
for feature in features:
    subcol = [col for col in cancer.columns if col.startswith(feature + '_')]
    dsets.append(cancer[subcol])
data1, data2 = dsets
print(data1)
print(data2)
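A variant that side-steps the issue above entirely is to keep the subsets in a dict keyed by feature, using DataFrame.filter. Sketched here with a small stand-in dataframe instead of the full cancer table:

```python
import pandas as pd

# small stand-in for the cancer dataframe used in the answer
cancer = pd.DataFrame({'radius_mean': [18, 21], 'radius_se': [1.0, 0.5],
                       'area_mean': [1001, 1326], 'area_se': [153, 75]})

# one subset per feature prefix, keyed by feature name
subsets = {f: cancer.filter(regex=f'^{f}_') for f in ['radius', 'area']}
print(subsets['radius'].columns.tolist())
```

Looking a subset up by name (subsets['radius']) avoids juggling positional unpacking when the number of features grows.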
I have a function that works as expected and outputs 4 values
targets(0.0034) #calling the function
(1014.0, 260, 176, 84)
All I want to do is put these values into a data frame so that it looks like this
value 1 value 2 value 3 value 4
0 1014 260 176 84
I have tried to make a new dataframe:
new = pd.DataFrame(columns=['value1','value2','value3','value4'])
then tried to append to it in different ways, but I keep getting stuck. I have also tried to reassign the values, but everything I have tried seems to be a dead end. What am I missing? Thanks!
There are a lot of different ways to do it, but a single tuple passed as the data param of DataFrame gives us 4 rows and one column. So we can use .T to transpose the data and get one row with four columns, and then rename the columns.
import pandas as pd

def targets(x):
    # stub standing in for the question's function
    return (1014.0, 260, 176, 84)

df = pd.DataFrame(targets(0.0034)).T.rename(
    {0: 'value 1', 1: 'value 2', 2: 'value 3', 3: 'value 4'}, axis='columns')
print(df)
value 1 value 2 value 3 value 4
0 1014.0 260.0 176.0 84.0
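An alternative, if you'd rather skip the transpose: wrapping the tuple in a list makes it a single row directly, and it also keeps each column's original dtype (note the all-float output above). targets here is again a stub for the question's function:

```python
import pandas as pd

def targets(x):
    # stub standing in for the question's function
    return (1014.0, 260, 176, 84)

# a list of one tuple is one row; columns keep their own dtypes
df = pd.DataFrame([targets(0.0034)],
                  columns=['value 1', 'value 2', 'value 3', 'value 4'])
print(df)
```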
The following is a snippet of dataframe I have:
     node  group
0      28      1
167    28      2
I want to create a dictionary-like structure from the above dataframe, something like
{28:1}
{28:2}
I tried to create it via
groupDict=groupTest.to_dict(orient='index')
which generates
{0: {'node': 28, 'group': 1}, 167: {'node': 28, 'group': 2}}
which is the standard pandas behaviour. But how would I generate
{28:1}
{28:2}
Potential solution, as advised in the comments below by anky_91:
df.groupby('node')['group'].agg(list).to_dict()
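Sketching that suggestion end to end with the two rows from the question: duplicate keys can't coexist in one dict, so the groups for node 28 are collected into a list instead:

```python
import pandas as pd

# the two rows from the question, with their original index labels
groupTest = pd.DataFrame({'node': [28, 28], 'group': [1, 2]}, index=[0, 167])
result = groupTest.groupby('node')['group'].agg(list).to_dict()
print(result)
```

This yields one key per node with all of its groups gathered together.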
Somehow I ended up with a list that looks like this: [ 1 36 2 72 37 74] instead of [1, 36, 2, 72, 37, 74]. How can I convert it so that I can use these values to select rows of matrix A, which is a 5266 x 441 matrix in my case? The output should be a 6 x 441 matrix.
The missing commas suggest that your "list" is actually a NumPy array (NumPy prints arrays without commas), which is why the two look different. Either way, you can use the tf.gather function to end up with the matrix you want: https://www.tensorflow.org/api_docs/python/tf/gather
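If the object is a plain NumPy array rather than a TensorFlow tensor, integer-array indexing achieves the same row selection; a minimal sketch, with a zero matrix standing in for the real A:

```python
import numpy as np

idx = np.array([1, 36, 2, 72, 37, 74])  # prints without commas, hence the confusion
A = np.zeros((5266, 441))               # stand-in for the real matrix
rows = A[idx]                           # fancy indexing picks the six rows
print(rows.shape)
```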
Consider my series as below: First column is article_id and the second column is frequency count.
article_id
1 39
2 49
3 187
4 159
5 158
...
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
I got this series from a dataframe with the following command:
logs.loc[logs['article_id'] <= 17029].groupby('article_id')['article_id'].count()
logs is the dataframe here and article_id is one of the columns in it.
How do I plot a bar chart (using Matplotlib) such that the article_id is on the X-axis and the frequency count on the Y-axis?
My natural instinct was to convert it into a list using .tolist(), but that doesn't preserve the article_id.
IIUC you need Series.plot.bar:
#pandas 0.17.0 and above
s.plot.bar()
#pandas below 0.17.0
s.plot('bar')
Sample:
import pandas as pd
import matplotlib.pyplot as plt
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
print (s)
1 39
2 49
3 187
4 159
5 158
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
s.plot.bar()
plt.show()
The new pandas API suggests the following way:
import pandas as pd
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
s.plot(kind="bar", figsize=(20,10))
If you are working in Jupyter, you don't need to import matplotlib or call plt.show() yourself. Just pass 'bar' as the kind parameter of plot.
Example
import pandas as pd

# parser is your own date-parsing function
series = pd.read_csv('BwsCount.csv', header=0, parse_dates=[0], index_col=0,
                     squeeze=True, date_parser=parser)
series.plot(kind='bar')
The default value of kind is 'line' (i.e. series.plot() plots a line graph).
For your reference:
kind : str
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot