I want to determine which values occur in a column of a pandas DataFrame:
import pandas as pd
d = {'col1': [1,2,3,4,5,6,7], 'col2': [3, 4, 3, 5, 7,22,3]}
df = pd.DataFrame(data=d)
col2 has the unique values 3, 4, 5, 7, 22 (its domain). Each value that exists should be listed, but only once.
Is there a fast way to extract the domain of a pandas DataFrame column?
Use df.max() and df.min() to find the range.
print(df["col2"].unique())
by Andrej Kesely is the solution. Perfect!
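As a minimal sketch of the accepted approach, the snippet below rebuilds the example frame and collects the distinct values of col2. Sorting and value_counts() are optional extras not mentioned in the thread, shown on the assumption that an ordered domain or per-value counts are also useful.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5, 6, 7], 'col2': [3, 4, 3, 5, 7, 22, 3]}
df = pd.DataFrame(data=d)

# unique() returns the distinct values in order of first appearance
domain = df['col2'].unique()
print(domain)

# Optional: a sorted Python list of the domain
sorted_domain = sorted(domain.tolist())
print(sorted_domain)

# Optional: how often each value occurs
counts = df['col2'].value_counts()
print(counts)
```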
I am trying to plot some results obtained after optimisation using Gurobi.
I have converted the dictionary to a pandas DataFrame.
It is 96×1.
But how do I now use this DataFrame to plot the values row by row (1st row value, 2nd row value, and so on)? I am attaching a snapshot of it.
Can anyone help me with this?
x = {}
for t in time1:
    x[t] = [price_energy[t-1] * EnergyResource[174, t].X]
df = pd.DataFrame.from_dict(x, orient='index')
df
You can try pandas.DataFrame(data=x.values()) to properly create a pandas DataFrame while using row numbers as indices.
In the example below, I have generated a (pseudo) random dictionary with 10 values, and stored it as a data frame using pandas.DataFrame giving a name to the only column as xyz. To understand how indexing works, please see Indexing and selecting data.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Create a dictionary 'x'
rng = np.random.default_rng(121)
x = dict(zip(np.arange(10), rng.random((1, 10))[0]))
# Create a dataframe from 'x'
df = pd.DataFrame(x.values(), index=x.keys(), columns=["xyz"])
print(df)
print(df.index)
# Plot the dataframe
plt.plot(df.index, df.xyz)
plt.show()
This prints df as:
xyz
0 0.632816
1 0.297902
2 0.824260
3 0.580722
4 0.593562
5 0.793063
6 0.444513
7 0.386832
8 0.214222
9 0.029993
and gives df.index as:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
and also plots xyz against the index as a line chart.
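For a dictionary shaped like the one in the question, where every key holds a one-element list, one option is to flatten it into a pandas Series before plotting. The sketch below uses made-up numbers: time1 and the stored values are stand-ins for the Gurobi results, and the name cost is purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: 96 time steps,
# each mapped to a one-element list, as in x[t] = [value]
time1 = range(1, 97)
x = {t: [0.5 * t] for t in time1}

# Take the single element of each list so the Series holds plain numbers
series = pd.Series({t: values[0] for t, values in x.items()}, name='cost')
series.index.name = 't'

print(series.head())
print(series.shape)
```

With matplotlib installed, calling series.plot() would then draw the values against t.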
I'd like to ask for help fixing the missing values in a pandas DataFrame (Python).
Here is the dataset:
In this dataset I found missing values in the ['Item_Weight'] column.
I don't want to drop the missing values, because after sorting the data I found that they look like data-entry mistakes made by whoever encoded it.
Here is the sorted dataset:
I then created a lookup dataset so I can merge the two and fill the missing values.
How can I merge or join them so that only the missing values (NaN) are filled from the lookup table I made? Or is there another way that doesn't need a lookup table?
Looking at this, you will probably want to use map rather than join/merge. Here is an example of how to use map with your data.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Column1': ['A', 'B', 'C'],
    'Column2': [1, np.nan, 3]
})
df
df_map = pd.DataFrame({
    'Column1': ['A', 'B', 'C'],
    'Column2': [1, 2, 3]
})
df_map
# Where Column2 is null, look up the row's Column1 value in df_map and take its Column2 value; otherwise keep the existing value
df['Column2'] = np.where(df['Column2'].isna(), df['Column1'].map(df_map.set_index('Column1')['Column2']), df['Column2'])
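The same fill can also be written with fillna, which only touches the missing entries. This is a sketch on the toy frames from the answer above; the variable name lookup is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column1': ['A', 'B', 'C'], 'Column2': [1, np.nan, 3]})
df_map = pd.DataFrame({'Column1': ['A', 'B', 'C'], 'Column2': [1, 2, 3]})

# Series indexed by Column1, so map() can translate keys into values
lookup = df_map.set_index('Column1')['Column2']

# fillna() keeps existing values and fills only the NaN rows
df['Column2'] = df['Column2'].fillna(df['Column1'].map(lookup))
print(df)
```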
I had to create my own DataFrames since you used screenshots. In future, please post data as text rather than screenshots; it makes it much easier for people to help.
This will probably work:
df = df.sort_values(['Item_Identifier', 'Item_Weight']).ffill()
But I can't test it since you didn't give us anything to work with.
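Since the ffill above runs over every column and can carry a value across from one item to the next, a per-item fill may be safer. The sketch below assumes each Item_Identifier has at least one recorded Item_Weight; the sample rows are invented for illustration, and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Invented sample rows; only the column names come from the question
df = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDA15', 'DRC01', 'FDA15'],
    'Item_Weight': [9.3, np.nan, np.nan, 5.92, 9.3],
})

# First non-null weight recorded for each item, aligned to the original rows
known_weight = df.groupby('Item_Identifier')['Item_Weight'].transform('first')

# Fill only the missing weights; existing values are left alone
df['Item_Weight'] = df['Item_Weight'].fillna(known_weight)
print(df)
```

This needs no separate lookup table, because the group-wise first value plays that role.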
I have the following issue:
when I use the .loc function, it returns a Series rather than a single value without an index.
I need to do some math operations with the selected cells. The code I am using is:
import pandas as pd
data = [[82,1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns = ['Ah-Step', 'State'])
df['Ah-Step'].loc[df['State']==2]+ df['Ah-Step'].loc[df['State']==3]
.values[0] will do what OP wants.
Assuming one wants to obtain the value 30, the following will do the work
value = df.loc[df['State'] == 2, 'Ah-Step'].values[0]
print(value)
[Out]: 30.0
So, in OP's specific case, the operation 30+3.7 could be done as follows
df.loc[df['State'] == 2, 'Ah-Step'].values[0] + df['Ah-Step'].loc[df['State']==3].values[0]
[Out]: 33.7
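The original expression is worth a second look: adding the two filtered Series aligns them by index label, and since the rows for State 2 and State 3 carry different labels, the sum is NaN everywhere. The sketch below shows that, plus .iloc[0] and .item() as alternatives to .values[0]; these two are not from the answer above, just other standard pandas accessors.

```python
import pandas as pd

data = [[82, 1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns=['Ah-Step', 'State'])

s2 = df.loc[df['State'] == 2, 'Ah-Step']  # Series with index label 1
s3 = df.loc[df['State'] == 3, 'Ah-Step']  # Series with index label 2

# Series addition aligns on index labels; no label matches, so all NaN
aligned_sum = s2 + s3
print(aligned_sum)

# Extract scalars first, then add
total_iloc = s2.iloc[0] + s3.iloc[0]  # first element by position
total_item = s2.item() + s3.item()    # item() raises unless exactly one element
print(total_iloc, total_item)
```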
I have a DataFrame in Python called df. It contains two variables, Name and Age. I want to write a loop that generates 10 new columns, called Age_1, Age_2, Age_3, ..., Age_10, each containing the values of Age.
So far I have tried:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
for i in range(1,11):
    df[Age_'i'] = df['Age']
Just use this for loop (range(1, 11) yields 1 through 10, matching Age_1 to Age_10):
for x in range(1, 11):
    df['Age_' + str(x)] = df['Age']
OR
for x in range(1, 11):
    df['Age_{}'.format(x)] = df['Age']
OR
for x in range(1, 11):
    df['Age_%s' % (x)] = df['Age']
Now if you print df, you will see the new columns Age_1 through Age_10 next to Name and Age.
You can also use .assign with ** unpacking. Note that assign returns a new DataFrame, so assign the result back:
df = df.assign(**{f'Age_{i}': df['Age'] for i in range(1, 11)})
I want to replicate the between_time function of Pandas in PySpark.
Is it possible since in Spark the dataframe is distributed and there is no indexing based on datetime?
i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
ts.between_time('0:45', '0:15')
Is something similar possible in PySpark?
pandas.between_time - API
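If you have a timestamp column, say ts, in a Spark DataFrame, you can filter on its hour and minute. The filter below keeps rows whose time falls between 00:15 and 00:45, which corresponds to between_time('0:15', '0:45'). For the wrap-around call in your example ('0:45', '0:15'), you would instead keep times at or after 00:45 or at or before 00:15, combining the two conditions with |.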
import pyspark.sql.functions as F
df2 = df.filter(F.hour(F.col('ts')).between(0,0) & F.minute(F.col('ts')).between(15,45))
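To check what a Spark filter has to reproduce, it helps to spell out the pandas semantics first. The sketch below runs the question's example and rebuilds the same selection from hour and minute parts; the minutes variable is introduced here for illustration, and the comparison ignores seconds, an assumption that holds for this data.

```python
import pandas as pd

i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)

# Start later than end: keeps times at or after 00:45 or at or before 00:15
wrapped = ts.between_time('0:45', '0:15')
print(wrapped)

# Same selection built from hour and minute parts (minutes since midnight)
minutes = ts.index.hour * 60 + ts.index.minute
manual = ts[(minutes >= 45) | (minutes <= 15)]
print(manual)
```

The same condition could presumably be expressed in PySpark from F.hour and F.minute on the ts column, with the two sides joined by |.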