Related
How can I separate this data column by 'A','B' ...?
The first column as an index must be retained.
df = pd.DataFrame(data)
df = df[['seconds', 'marker', 'data1', 'data2', 'data3']]
seconds,marker,data1,data2,data3
00001,A,3,3,0,42,0
00002,B,3,3,0,34556,0
00003,C,3,3,0,42,0
00004,A,3,3,0,1833,0
00004,B,3,3,0,6569,0
00005,C,3,3,0,2454,0
00006,C,3,3,0,3256,0
00007,C,3,3,0,5423,0
00008,A,3,3,0,569,0
You can just get the unique values in the letter column (that's what I called it). And then filter the DataFrame containing all values using these unique values.
I am storing the newly created DataFrames in a dictionary here, but you could also store them in a list or whatever. I've used the input you have provided but have given the first 2 columns the names index and letter as they were unnamed in your .csv.
import pandas as pd
df = pd.DataFrame({
'index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8},
'letter': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'B', 5: 'C', 6: 'C', 7: 'C', 8: 'A'},
'seconds': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3},
'marker': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3},
'data1': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'data2': {0: 42, 1: 34556, 2: 42, 3: 1833, 4: 6569, 5: 2454, 6: 3256, 7: 5423, 8: 569},
'data3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0}
})
# get unique values
unique_values = df["letter"].unique()
# filter "big" dataframe using one of the unique value at a time
split_dfs = {value: df[df["letter"] == value] for value in unique_values}
print(split_dfs["A"])
print(split_dfs["B"])
print(split_dfs["C"])
Expected output:
index letter seconds marker data1 data2 data3
0 1 A 3 3 0 42 0
3 4 A 3 3 0 1833 0
8 8 A 3 3 0 569 0
index letter seconds marker data1 data2 data3
1 2 B 3 3 0 34556 0
4 4 B 3 3 0 6569 0
index letter seconds marker data1 data2 data3
2 3 C 3 3 0 42 0
5 5 C 3 3 0 2454 0
6 6 C 3 3 0 3256 0
7 7 C 3 3 0 5423 0
As you can see the index is preserved.
This is my dataframe:
df = pd.DataFrame(
{
"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],
"score": [3, 5, 6, 2, 4, 1],
}
)
I want to compare the score of bob_x with 'bob_y, and retain the row with the lowest, and do the same for jay_xandjay_y. No change is required for madandjoe`.
You can first split the names by _ and keep the first part, then groupby and keep the lowest value:
import pandas as pd
df = pd.DataFrame({"name": ["bob_x", "mad", "jay_x", "bob_y", "jay_y", "joe"],"score": [3, 5, 6, 2, 4, 1]})
df['name'] = df['name'].str.split('_').str[0]
df.groupby('name')['score'].min().reset_index()
Result:
name
score
0
bob
2
1
jay
4
2
joe
1
3
mad
5
I have the following dataframe with timeseries data:
df = pd.DataFrame(columns = ['id', 'value'])
df['value'] =[9, 16, 10, 12, 11, 14]
df['id'] = [1, 1, 1, 2, 2, 2]
For each timeseries (defined by column 'id' I want to calculate the variance to find timeseries that do not change at all or only very little.
The final dataframe should look like this:
df_end = pd.DataFrame(columns = ['id','value', 'var'])
df_end['value'] =[9, 16, 10, 12, 11, 14]
df_end['id'] = [1, 1, 1, 2, 2, 2]
df_end['var'] = [21, 21, 21, 2.3, 2.3, 2.3]
I tried:
df.groupby(df['id']).var()
which gives me the values, but I couldn't put it into the df in the right form. I am sure, there is a handy function for this that I don't know about yet!
Thanks for helping out!
Use GroupBy.transform with specify column value:
df['var'] = df.groupby('id')['value'].transform('var')
print (df)
id value var
0 1 9 14.333333
1 1 16 14.333333
2 1 10 14.333333
3 2 12 2.333333
4 2 11 2.333333
5 2 14 2.333333
I am trying to do a conditional assignation to the rows of a specific column: target. I have done some research, and it seemed that the answer was given here: "How to do row processing and item assignment in dask".
I will reproduce my necessity. Mock data set:
x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))
The look of mock is:
In [4]: mock.head(7)
Out [4]:
speed target
0 200 3
1 300 0
2 400 3
3 215 4
4 219 0
5 360 0
6 280 0
Having this Pandas DataFrame, I convert it into a Dask DataFrame:
mock_dask = dd.from_pandas(mock, npartitions = 2)
I apply my conditional rule: all values in target above 0, must be 1, all others 0 (binaryze target). Following the mentioned thread above, it should be:
result = mock_dask.target.where(mock_dask.target > 0, 1)
I have a look at the result dataset and it is not working as expected:
In [7]: result.head(7)
Out [7]:
0 3
1 1
2 3
3 4
4 1
5 1
6 1
Name: target, dtype: object
As we can see, the column target in mock and result are not the expected results. It seems that my code is converting all 0 original values to 1, instead of the values that are greater than 0 into 1 (the conditional rule).
Dask newbie here, Thanks in advance for your help.
OK, the documentation in Dask DataFrame API is pretty clear. Thanks to #MRocklin feedback I have realized my mistake. In the documentation, where function (the last one in the list) is used with the following syntax:
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Thus, the correct code line would be:
result = mock_dask.target.where(mock_dask.target <= 0, 1)
This will output:
In [7]: result.head(7)
Out [7]:
0 1
1 0
2 1
3 1
4 0
5 0
6 0
Name: target, dtype: int64
Which is the expected output.
They seem to be the same to me
In [1]: import pandas as pd
In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
...: mock = pd.DataFrame(dict(target = x, speed = y))
...:
In [3]: import dask.dataframe as dd
In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)
In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
Lets say I have a list:
lits = [1, 1, 1, 2, 0, 0, 0, 0, 3, 3, 1, 4, 5, 2, 2, 2, 0, 0, 0]
and i need this to become [1, 1, 2, 0, 0, 3, 3, 1, 4, 5, 2, 2, 0, 0]
(Delete duplicates, but only in a chain of duplicates. Going to do this on a huge HDF5 file, with pandas, numpy. Would rather not use a for loop iterating through all elements.
table = table.drop_duplicates(cols='[SPEED OVER GROUND [kts]]', take_last=True)
Is there a modification I can do to this code?
In pandas you can do a boolean mask, selecting a row only if it is differs from either the preceding or succeeding value:
>>> df=pd.DataFrame({ 'lits':lits })
>>> df[ (df.lits != df.lits.shift(1)) | (df.lits != df.lits.shift(-1)) ]
lits
0 1
2 1
3 2
4 0
7 0
8 3
9 3
10 1
11 4
12 5
13 2
15 2
16 0
18 0