Replace values in a dataframe column with the average value

I am doing data analysis in Python. I want to replace all values < 120 in one column with average_steam (average_steam = 123).
To access the column in the data frame I write "data.steam", which gives me all of the values.
The code I tried is:
average_steam=data.steamin.mean()
print(average_steam)
data.steamin.replace(data.steamin<=120,average_steam, inplace=True)
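Series.replace() matches specific values rather than a boolean condition, so passing a mask like data.steamin <= 120 will not overwrite the low values as intended. A minimal sketch of one alternative using Series.mask(), assuming data is the DataFrame and the column is named steamin as in the attempt above:
# `data` is assumed to be the existing pandas DataFrame with a numeric "steamin" column
average_steam = data["steamin"].mean()
# mask() replaces values where the condition is True; .loc with the same boolean
# condition would work equally well
data["steamin"] = data["steamin"].mask(data["steamin"] <= 120, average_steam)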

How can I query a column of a dataframe on a specific value and get the values of two other columns corresponding to that value

I have a data frame where the first column contains various countries' ISO codes, while the other 2 columns contain dataset numbers and Linkedin profile links.
Please refer to the image.
I need to query the data frame's first "FBC" column on the "IND" value and get the corresponding values of the "no" and "Linkedin" columns.
Can somebody please suggest a solution?
Using query():
If you want just the no and Linkedin values:
df = df.query("FBC.eq('IND')")[["no", "Linkedin"]]
If you want all 3:
df = df.query("FBC.eq('IND')")
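Equivalently, plain boolean indexing with .loc works without query(); a small sketch using the same column names from the question:
rows = df.loc[df["FBC"] == "IND", ["no", "Linkedin"]]  # just the two columns
rows_all = df.loc[df["FBC"] == "IND"]                  # all three columns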

Adding The Results Of Two Queries In Pandas Dataframes

I am trying to do data analysis for the first time using Pandas in a Jupyter notebook and was wondering what I am doing wrong.
I have created a data frame for the results of a query to store a table that represents the total population I am comparing to.
ds           count
2022-28-9    100
2022-27-9    98
2022-26-9    99
2022-25-9    98
This data frame is called total_count
I have created a data frame for the results of a query to store a table that represents the count of items that are out of SLA to be divided by the total.
ds           oo_sla
2022-28-9    60
2022-27-9    38
2022-26-9    25
2022-25-9    24
This data frame is called out_of_sla
These two data sets are created by Presto queries from Hive tables if that matters.
I am now trying to divide those results to get a % out of SLA but I am getting errors.
data = {"total_count"[], "out_of_sla"[]}
df = pd.DataFrame(data)
df["result"] = [out_of_sla]/[total_count]
print(df)
I am getting an invalid syntax error on line 3. My goal is to create a trend of in/out-of-SLA status and a widget for the most recent datestamp's SLA. Any insight is appreciated.
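The dictionary literal in the attempt mixes up keys and values, which is what triggers the syntax error, and the division needs to operate on aligned columns rather than on names in brackets. A minimal sketch of one way to combine the two frames, assuming both share the ds column as shown in the tables above:
# merge the two frames on the shared date column, then divide the counts
df = total_count.merge(out_of_sla, on="ds")
df["result"] = df["oo_sla"] / df["count"]
print(df)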

How to use previous row value in next row in pyspark dataframe

I have a PySpark dataframe and I want to perform a calculation like this:
for i in range(1, length):
    x[i] = (x[i-1] - y[i-1]) * np.exp(-(t[i] - t[i-1]) / v[i-1]) + y[i-1]
where x, y, t and v are lists built from float-type columns using
x = df.select('col_x').rdd.flatMap(lambda x: x).collect()
And similarly y, t and v for the respective columns.
This method works, but not efficiently for bulk data.
I want to perform this calculation in a PySpark dataframe column: update the x column after every row and then use that updated value when calculating the next row.
I have created a column to get the previous row's value using lag:
df = df.withColumn('prev_val_x', F.lag(df.x, 1).over(my_window))
and then calculating and updating x as:
df = df.withColumn('x', col('prev_val_x') - col('prev_val_y'))
but it does not update x using the previous row's updated value.
Creating lists for the 4 columns using collect() takes a lot of time and eventually gives a memory error, so I want to do the calculation within the dataframe columns themselves. Column x has the values 4.38, 0, 0, 0, … till the end; that is, x only has a value in its first row and is 0 in all remaining rows. y, t and v contain float values.
How do I proceed with this?
Any help would be appreciated!
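For reference, a minimal sketch of the lag-based pattern from the question with the syntax corrected (the window ordered by t is an assumption). Note that lag only ever sees the original x column, so this by itself still does not carry an updated value forward row by row; it just shows the withColumn/lag usage in runnable form:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# assumed ordering column; replace with the real time/ordering column
my_window = Window.orderBy("t")

# previous-row values for each input column (null in the first row)
df = df.withColumn("prev_val_x", F.lag("x", 1).over(my_window))
df = df.withColumn("prev_val_y", F.lag("y", 1).over(my_window))
df = df.withColumn("prev_t", F.lag("t", 1).over(my_window))
df = df.withColumn("prev_v", F.lag("v", 1).over(my_window))

# the recurrence from the question expressed with Spark column functions
df = df.withColumn(
    "x_new",
    (F.col("prev_val_x") - F.col("prev_val_y"))
    * F.exp(-(F.col("t") - F.col("prev_t")) / F.col("prev_v"))
    + F.col("prev_val_y"),
)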

dataframe not reading float values

I have a dataframe with 27 columns that contains times in float format, for example 12.0, 12.25, 12.75. I have an if statement which checks whether a user-given time is in the dataframe, but it only recognizes the 12.0-formatted times. I am checking in the dataframe df4, in the "Timestamp" column, whether the given time (return_time) is present, and getting the corresponding index so that I can change its value in each column and then write it to a CSV file.
if return_time in df4["Timestamp"]:
    idx = df4[df4["Timestamp"] == return_time].index.values
    df4.loc[idx, i] = "CHARGING"
df4.to_csv("test.csv")
I should have gotten 27 different times as a result, one to store in each column of the CSV, but out of the 27 I only get the few that end in .0. It doesn't recognize the other endings like .25, .50 or .75.
You don't need the .values of the index to select with .loc[...]. Just use idx = df4[df4["Timestamp"] == return_time].index
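Also worth noting (an observation beyond the answer above): `value in series` tests the index labels rather than the column values, which can make the if skip rows even when the time is present in the column. A small sketch of a mask-based version, assuming df4, return_time and the column variable i from the question:
mask = df4["Timestamp"] == return_time  # compare against the values, not the index
if mask.any():
    df4.loc[mask, i] = "CHARGING"       # i is the target column from the surrounding loop
df4.to_csv("test.csv")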

Write pandas data to a CSV file if column sums are greater than a specified value

I have a CSV file whose columns are frequency counts of words and whose rows are time periods. I want to sum the total frequencies for each column, then write to a CSV file the column and row values for columns whose sums are greater than or equal to 30, thus dropping columns whose sums are less than 30.
Just learning python and pandas. I know it is a simple question, but my knowledge is at that level. Your help is most appreciated.
I can read in the CSV file and compute the column sums.
df = pd.read_csv('data.csv')
Excerpt of data file containing 3,874 columns and 100 rows
df.sum(axis = 0, skipna = True)
Excerpt of sums for columns
I am stuck on how to create the output file so that it looks like the original file but no longer has columns whose sums are less than 30. The layout of the output file would be the same as the input file, and the sums themselves would not be included in the output.
Thanks very much for your help.
So, here is a link showing an excerpt of a file containing 100 rows and 3,857 columns:
It's easiest to do this in two steps:
1. Filter the DataFrame to just the columns you want to save
df_to_save = df.loc[:, (df.sum(axis=0, skipna=True) >= 30)]
.loc is for picking rows/columns based either on labels or conditions; the syntax is .loc[rows, columns], so : means "take all the rows", and then the second part is the condition on our columns - I've taken the sum you'd given in your question and set it greater than or equal to 30.
2. Save the filtered DataFrame to CSV
df_to_save.to_csv('path/to/write_file.csv', header=True, index=False)
Just put your filepath in as the first argument. header=True means the header labels from the table will be written back out to the file, and index=False means the numbered row labels Pandas automatically created when you read in the CSV won't be included in the export.
See this related answer: How to delete a column in pandas dataframe based on a condition? Note that the solution for your question doesn't need isnull() before sum(), as that is specific to their question about counting NaN values.
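Putting the two steps together, a minimal end-to-end sketch using the 'data.csv' filename from the question (the output filename is just a placeholder):
import pandas as pd

df = pd.read_csv('data.csv')

# keep only columns whose totals are at least 30, then write out without the sums
df_to_save = df.loc[:, df.sum(axis=0, skipna=True) >= 30]
df_to_save.to_csv('filtered_data.csv', header=True, index=False)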