What is the cleanest way to create a new column based on a conditional of an existing column? - pandas

In pandas I currently have a data frame containing a column of strings: {Urban, Suburban, Rural}. The column I would like to create is conditional of the first column (i.e. Urban, Suburban, Rural are associated with the corresponding colors) {Coral, Skyblue, Gold}
I tried copying the first column and then using .replace but my new column seems to return NaN values now instead of the colors.
new_column = merge_table["type"]
merge_table["color"] = new_column
color_df = merge_table["color"].replace({'Urban': 'Coral', 'Suburban': 'Skyblue', 'Rural': 'Gold'})
data = pd.DataFrame({'City Type': type,
'Bubble Color': color_df
})
data.head()

You can do
merge_table['New col']=merge_table["color"].replace({'Urban': 'Coral', 'Suburban': 'Skyblue', 'Rural': 'Gold'})

Okay. in the future, its worth typing the codes using 'Code Samples' so that we can view your code easier.
Lots of areas can improve your code. Firstly you do the entire thing in one line:
merge_table["color"] = merge_table["type"].map(mapping_dictionary)
Series.map() is around 4 times faster than Series.replace() for your information.
also other tips:
never use type as a variable name, use something more specific like city_type. type is already a standard built-in method
data = pd.DataFrame({'City Type': city_type, 'Bubble Color': color_df})
if make a copy of a column, use:
a_series = df['column_name'].copy()

Related

Convert a spark dataframe to a column

I have a org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:
val filled_column2 = x.select(first(("col1"),ignoreNulls = true).over(window)) that I want to convert, it into an sql spark column. Anyone could help on that ?
Thank you,
#Jaime Caffarel: this is exactly what i am trying to do, this will give you more visibility. You may also check the error msg in the 2d screenshot
From the documentation of the class org.apache.spark.sql.Column
A column that will be computed based on the data in a DataFrame. A new
column is constructed based on the input columns present in a
dataframe:
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated
with a DataFrame. col("columnName.field") // Extracting a
struct field col("a.column.with.dots") // Escape . in column
names. $"columnName" // Scala short hand for a named
column. expr("a + 1") // A column that is constructed
from a parsed SQL Expression. lit("abc") // A
column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that the product_id is a unique key per each row, I would do something like this:
val filled_column = df.select(df("product_id"), last(("last_prev_week_nopromo"), ignoreNulls = true) over window)
This way, you are also selecting the product_id that you will use as key. Then, you can do the following
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
.join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("driver_id"), "inner")
// orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?

Pandas Replace_ column values

Hello,
I am analyzing the next dataset with this information .
The column ['program_number'] is an object but I want to change it to a integer colum.
I have tried to replace some values but it doesn´t work.
as you can see, some values like 6 is duplicate. like '6 ' and 6.
How can I resolve it? Many thanks
UPDATE
Didn't see 1X and 3X at first.
If you need those numbers and just want to remove the X then:
df["Program"] = df["Program"].str.strip(" X").astype(int)
If there is data in the column which aren't numbers or which shouldn't be converted, you can use pd.to_numeric with errors='corece'. If there are cells which can't be converted, you'll get NaN. Be aware that this will result in floating numbers.
df["Program"] = pd.to_numeric(df["Program"], errors="coerce")
old
You want to use str.strip() here, rather than replace.
Try this:
df1['program_number'] = df1['program_number'].str.strip().astype(int)

HighRadius Ivoice prediction Challenges

Generate a new column "avgdelay" from the existing columns
Note - You are expected to make a new column "avgdelay" by grouping "name_customer" column with reapect to mean of the "Delay" column.
This new column "avg_delay" is meant to store "customer_name" wise delay
groupby('name_customer')['Delay'].mean(numeric_only=False)
Display the new "avg_delay" column
Can Anyone guide me
Let df be your main data frame
avgdelay = df.groupby('name_customer')['Delay'].mean(numeric_only=False)
You need to add the "avg_delay" column with the maindata, mapped with "name_customer" column
Note - You need to use map function to map the avgdelay with respect to "name_customer" column
df['avg_delay'] = df['name_customer'].map(avgdelay)

Pandas.dropna method can't delete Nan value rows(or columns)

I now have some data, it's may contain null values
I want to delete it's null value (a whole row or a whole column)
How can I deal with the comparison?
Here is my data
https://reurl.cc/5lONv6
it will have some null values ​​in the time series data
following is my code
c=pd.read_csv('./in/historical_01A190.txt',error_bad_lines=False)
c.dropna(axis=0,how='any',inplace=True)
c.dropna(axis=1,how='any',inplace=True)
c.to_csv('./out/historical_01A190.txt',index=False)
but it's didn't work
anyone can help me?
Okay, first of all, your data isn't saved as a csv. It's saved as a tab-separated file.
So you need to open it using pd.read_table
>>> c=pd.read_table('./data.txt',error_bad_lines=False,sep='\t')
Second, your data is full of nans -- if you use dropna on either rows or columns, you end up with just one row or column (dates) left. But using the correct opener on your file, the dropna and to_csv functions work.
If you don't assing the variable then it will only create a view which is not stored in memory.
c = c.dropna(axis=0,how='any',inplace=True)
c = c.dropna(axis=1,how='any',inplace=True)
c = c.to_csv('./out/historical_01A190.txt',index=False)
Try this.

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe looks something like this:
and it contains 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do, is calculate something like mag='(U-V)/('R-I)' (but ignoring any values that are -999), put that in a new column, and then z_pred=10**((mag-c)m) in a new column (mag, c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
current = qso[:]
mag = (U-V)/(R-I)
name = current['NED']
z_pred = 10**((mag - c)/m)
z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding calculated columns row wise are usually performed with numpy's np.where;
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(1), (df.U - df.V) / (df.R - df.I), -999)
Note; assuming here that when any of the columns contain '-999' it will not be calculated and a '-999' is returned.