How to shift all the values from a certain point of the dataframe to the right? - dataframe

Example:
I have this dataset
A B C D E
0 0.1 0.2 0.3 0.4 0.5
1 1.1 1.2 1.3 1.4 1.5
2 2.1 2.2 2.4 2.5 2.6
3 3.1 3.2 3.4 3.5 3.6
4 4.1 4.2 4.4 4.5 4.6
5 5.1 5.2 5.3 5.4 5.5
What I would like to have is:
A B C D E
0 0.1 0.2 0.3 0.4 0.5
1 1.1 1.2 1.3 1.4 1.5
2 2.1 2.2 2.4 2.5 2.6
3 3.1 3.2 3.4 3.5 3.6
4 4.1 4.2 4.4 4.5 4.6
5 5.1 5.2 5.3 5.4 5.5
So I need to shift only certain rows and only certain columns to the right.
Not all the lines and columns have to be affected by that shift. I hope it's clear, thank you.

Pandas handles this well. Use .loc to select the rows and columns you want, then call .shift() with axis=1 to move that block to the right and assign the result back.
import pandas as pd
df.loc[2:4, ['C','D']] = df.loc[2:4, ['C','D']].shift(1, axis=1)
If you share your dataframe code to define df, I can fully test the loc/shift solution.
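Here is a minimal sketch of the loc/shift approach that can be run directly. The frame is rebuilt from the numbers in the question; the choice of rows 2 to 4 and columns C to E is an assumption about which block should move, since the question does not spell it out.

```python
import pandas as pd

# DataFrame rebuilt from the values shown in the question.
df = pd.DataFrame(
    {
        "A": [0.1, 1.1, 2.1, 3.1, 4.1, 5.1],
        "B": [0.2, 1.2, 2.2, 3.2, 4.2, 5.2],
        "C": [0.3, 1.3, 2.4, 3.4, 4.4, 5.3],
        "D": [0.4, 1.4, 2.5, 3.5, 4.5, 5.4],
        "E": [0.5, 1.5, 2.6, 3.6, 4.6, 5.5],
    }
)

# Assumed selection: rows 2..4 (label slices in .loc are inclusive)
# and every column from C onwards.
rows = slice(2, 4)
cols = ["C", "D", "E"]

# Shift the selected block one column to the right. The first selected
# column becomes NaN and the old values of the last column are dropped.
df.loc[rows, cols] = df.loc[rows, cols].shift(1, axis=1)

print(df)
```

Note that values pushed past the last selected column are lost. If they must be kept, one option is to add an empty extra column to the frame first and include it in cols.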

Related

Average of certain values in pandas dataframe with if condition

index  column 1  column 2
1      1         1.2
2      1.2       1.5
3      2.2       2.5
4      3         3.1
5      3.3       3.5
6      3.6       3.8
7      3.9       4.0
8      4.0       4.0
9      4.0       4.1
10     4.1       4.0
I created a moving average with df.rolling(). But I only want the average of the "constant" part (here around 4), that is, the values that no longer change by more than 10%.
My first approach was an if condition, but my attempts to average only certain values in the column failed.
Does anyone have ideas?
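One possible reading of the question, sketched below: treat the "constant" part as everything from the last row whose value changed by more than 10% relative to the previous row, and average from there to the end. The helper name plateau_mean and the step-to-step interpretation of "10%" are assumptions, not something stated in the question.

```python
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame(
    {
        "column 1": [1, 1.2, 2.2, 3, 3.3, 3.6, 3.9, 4.0, 4.0, 4.1],
        "column 2": [1.2, 1.5, 2.5, 3.1, 3.5, 3.8, 4.0, 4.0, 4.1, 4.0],
    },
    index=range(1, 11),
)


def plateau_mean(series, threshold=0.10):
    """Mean of the trailing run whose row-to-row change stays within threshold.

    Hypothetical helper: the run starts at the last row that jumped by more
    than `threshold` relative to its predecessor.
    """
    change = series.pct_change().abs()      # relative change per row
    jumps = change[change > threshold]      # rows that still moved too much
    tail = series if jumps.empty else series.loc[jumps.index[-1]:]
    return tail.mean()


means = {col: plateau_mean(df[col]) for col in df.columns}
print(means)
```

With a 10% row-to-row threshold the tail still contains some of the ramp-up values (3.5, 3.8 and so on), so the result sits a little below 4. A smaller threshold, or comparing each value against the final value instead of the previous row, would narrow the window.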

Ifelse leaving other observations as is

In R and the tidyverse, there is a way to use ifelse() so that I can change some of the observations in a variable and leave the ones I don't want changed as they are, just by setting the else branch to that column (so in the example below, "virginica" and "versicolor" would remain the same). I can't figure out how to do that in pandas.
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
Minimal reproducible example:
iris\
.assign(new_species = iris['species'].apply(lambda x: "set" if x=="setosa" else species))
This comes up with an error and if I put species in quotes, "species" becomes the actual observation.
Thanks much!
James
Use replace:
iris['new_species'] = iris['species'].replace('setosa', 'set')
Output:
sepal_length sepal_width petal_length petal_width species new_species
0 5.1 3.5 1.4 0.2 setosa set
1 4.9 3.0 1.4 0.2 setosa set
2 4.7 3.2 1.3 0.2 setosa set
3 4.6 3.1 1.5 0.2 setosa set
4 5.0 3.6 1.4 0.2 setosa set
.. ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica virginica
146 6.3 2.5 5.0 1.9 virginica virginica
147 6.5 3.0 5.2 2.0 virginica virginica
148 6.2 3.4 5.4 2.3 virginica virginica
149 5.9 3.0 5.1 1.8 virginica virginica
[150 rows x 6 columns]
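For the general ifelse() pattern, where the condition is more than a simple value swap, np.where and Series.mask are the usual pandas counterparts. The sketch below uses a tiny stand-in frame instead of downloading the CSV, so the four rows are made up for illustration. As a side note, the lambda in the question fails because species is not a defined name inside it; returning x in the else branch would keep the original value.

```python
import numpy as np
import pandas as pd

# Small stand-in for the iris data (illustrative rows only).
iris = pd.DataFrame(
    {"species": ["setosa", "versicolor", "virginica", "setosa"]}
)

iris = iris.assign(
    # np.where(condition, value_if_true, value_if_false), like R's ifelse()
    new_species=lambda d: np.where(d["species"] == "setosa", "set", d["species"]),
    # Series.mask replaces values where the condition is True, keeps the rest
    new_species_mask=lambda d: d["species"].mask(d["species"] == "setosa", "set"),
)

print(iris)
```

Both columns end up with "set" for the setosa rows and the original species everywhere else.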

How to add a dictionary keys and values to specific columns in an empty dataframe

I have a dict with 3 keys, and each value is a list of numpy arrays.
I'd like to append this dictionary to an empty dataframe so that the first element of each numpy array goes into column 'x', the second element goes into column 'y', and the key goes into the final column 'z', like so:
my_dict = {0: [array([5.4, 3.9, 1.3, 0.4]), array([4.9, 3. , 1.4, 0.2]),array([4.6, 3.6, 1. , 0.2]), array([4.6, 3.2, 1.4, 0.2]), array([4.7, 3.2, 1.6, 0.2])],
1: [array([6.1, 2.9, 4.7, 1.4]), array([5.9, 3. , 4.2, 1.5]), array([7.4, 2.8, 6.1, 1.9])],
2: [array([7. , 3.2, 4.7, 1.4]), array([5.6, 2.7, 4.2, 1.3])]}
I'd like to get the below df:
x y z
0 5.4 3.9 0
1 4.9 3. 0
2 4.6 3.6 0
3 4.6 3.2 0
4 4.7 3.2 0
5 6.1 2.9 1
6 5.9 3. 1
7 7.4 2.8 1
8 7. 3.2 2
9 5.6 2.7 2
It's a bit tricky. How can I do it?
This will do it:
data = [j[:2].tolist() + [k] for k, v in my_dict.items() for j in v]
df = pd.DataFrame(data, columns=list('xyz'))
df
x y z
0 5.4 3.9 0
1 4.9 3.0 0
2 4.6 3.6 0
3 4.6 3.2 0
4 4.7 3.2 0
5 6.1 2.9 1
6 5.9 3.0 1
7 7.4 2.8 1
8 7.0 3.2 2
9 5.6 2.7 2
Try this:
target_df = pd.DataFrame(columns=['x', 'y', 'z'])  # empty dataframe
for k, v in my_dict.items():
    for val in v:
        # one-row frame: first two array elements plus the dict key
        d = {'x': [val[0]], 'y': [val[1]], 'z': [k]}
        target_df = pd.concat([target_df, pd.DataFrame(d)], ignore_index=True)
print(target_df) will give the desired dataframe:
x y z
0 5.4 3.9 0
1 4.9 3.0 0
2 4.6 3.6 0
3 4.6 3.2 0
4 4.7 3.2 0
5 6.1 2.9 1
6 5.9 3.0 1
7 7.4 2.8 1
8 7.0 3.2 2
9 5.6 2.7 2
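A variant of the same idea, in case the lists are long: stack each key's arrays once with np.vstack and concatenate one frame per key, instead of growing the dataframe row by row. This is only a sketch; my_dict here is a shortened copy of the one in the question, and the names frames and arrays are made up for the example.

```python
import numpy as np
import pandas as pd

# Shortened version of the dictionary from the question.
my_dict = {
    0: [np.array([5.4, 3.9, 1.3, 0.4]), np.array([4.9, 3.0, 1.4, 0.2])],
    1: [np.array([6.1, 2.9, 4.7, 1.4])],
}

frames = [
    # Stack the arrays of one key into a 2-D array, keep the first two
    # columns as x and y, and tag every row with the key as z.
    pd.DataFrame(np.vstack(arrays)[:, :2], columns=["x", "y"]).assign(z=key)
    for key, arrays in my_dict.items()
]

df = pd.concat(frames, ignore_index=True)
print(df)
```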

Using pandas loc in hy

I want to do the following in hy:
from io import StringIO
import pandas as pd
s = """sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa"""
df = pd.read_table(StringIO(s), sep=r"\s+")
df.loc[df.sepal_length > 4.5]
How do I write the last line in Hy?
I have tried (.loc df (> df.sepal_length 4.5))
but it just returns a locindexer.
There are two ways to do this:
Using the . macro:
(. df loc [(> df.sepal-length 4.5)])
Using get:
(get df.loc (> df.sepal-length 4.5))
Protip: always try running hy2py on your Hy files. It shows you what the resulting Python looks like. The output isn't always valid syntax, but it shows you what gets compiled into what. Both of these get compiled down to df.loc[(df.sepal_length > 4.5)].
One more thing: notice I used sepal-length. Hy converts dashes in identifiers to underscores, so that's the same thing as sepal_length, but it's considered better style.
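To see why the first attempt only returned the indexer, it can help to look at the two steps separately on the Python side: df.loc is plain attribute access, and the square brackets are a second, separate subscript operation. The sketch below uses a three-row stand-in frame (made-up subset of the data above) and operator.getitem as the function form of the subscript.

```python
import operator

import pandas as pd

# Tiny stand-in frame with the same column name as in the question.
df = pd.DataFrame(
    {"sepal_length": [5.1, 4.4, 4.9], "species": ["setosa"] * 3}
)

mask = df.sepal_length > 4.5

indexer = df.loc                              # attribute access only: no rows yet
by_brackets = df.loc[mask]                    # attribute access + subscript
by_getitem = operator.getitem(df.loc, mask)   # same subscript, written as a call

print(type(indexer).__name__)
print(by_brackets)
```

Both by_brackets and by_getitem hold the rows where sepal_length is above 4.5, while indexer is just the object that still needs to be subscripted.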

strange error with RCassandra trying to store the "iris" dataset

I am new to Apache Cassandra. I installed the R extension RCassandra and created the following table:
CREATE TABLE iris (
id varchar,
"Sepal.Length" double,
"Sepal.Width" double,
"Petal.Length" double,
"Petal.Width" double,
Species text,
PRIMARY KEY(id)
);
After successful connection and keyspace selection I try the following:
> RC.write.table(c,'iris',iris)
The iris dataset has the following form:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
...
But I get the following error:
Error in RC.write.table(c, "iris", iris) :
Cassandra exception: Not enough bytes to read value of component 0
Maybe the id field is troublesome, so I created the following dataset by adding an id column to the dataframe:
iris2 = data.frame(id=row.names(iris),iris)
iris2
id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
...
but I still get the same error:
> RC.write.table(c,'iris',iris2)
Error in RC.write.table(c, "iris", iris2) :
Cassandra exception: Not enough bytes to read value of component 0
Any ideas on how to solve this?
Are you creating the column family ("table") in Cassandra using CQLSH? I had the same problem myself. I then switched to cassandra-cli, created the column family that way, and it worked.
Most likely RCassandra uses the CLI interface underneath, and because column families created in CQLSH are not visible in cassandra-cli, it can't see CQL tables.