I am trying to select only columns from a dataframe that start with a p or that contain an s. I am using the following:
df2 = (df.filter(regex ='(^p)' or '(s)'))
df2
But that only selects columns that start with a p; it ignores the second part and doesn't select columns that have an s in the name. Does anyone know how I can filter so that both conditions apply and the result includes both the columns starting with p and the columns containing an s?
Use the pipe character |, which is the OR operator inside a regex.
df2 = df.filter(regex='^p|s')
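For context, '(^p)' or '(s)' is evaluated by Python before pandas ever sees it: a non-empty string is truthy, so the expression collapses to just '(^p)'. A minimal sketch with made-up column names:
import pandas as pd

# hypothetical columns: 'price' starts with p, 'sales' contains s,
# 'amount' matches neither
df = pd.DataFrame({'price': [1], 'sales': [2], 'amount': [3]})
df2 = df.filter(regex='^p|s')
print(df2.columns.tolist())  # ['price', 'sales']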
I want to add a new column (name: conc) to a dataframe "table"; it should use the values in the columns (plate, ab) to look up the numeric value in the dataframe "concs".
Below is what I mean, with the dataframe "exp" used to show what I expect the data to look like.
What is the proper way to do this? Is it some multiple-condition lookup, or do I need to reshape the concs dataframe somehow?
Use DataFrame.melt followed by a left join to create the new column concs; where there is no match, NaN is produced:
exp = (concs.melt('plate', var_name='ab', value_name='concs')
            .merge(table, on=['plate', 'ab'], how='left'))
The solution can be simplified: if both DataFrames share the column names 'plate' and 'ab' and the merge should use both, the on parameter can be omitted:
exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, how='left')
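A minimal sketch with invented data (plate, abX, abY, signal are made-up names), showing an unmatched combination turning into NaN:
import pandas as pd

# hypothetical wide table of concentrations: one column per antibody
concs = pd.DataFrame({'plate': ['p1', 'p2'],
                      'abX': [1.0, 2.0],
                      'abY': [3.0, 4.0]})
# hypothetical measurements keyed by plate and antibody
table = pd.DataFrame({'plate': ['p1', 'p1', 'p2'],
                      'ab': ['abX', 'abY', 'abY'],
                      'signal': [0.5, 0.7, 0.9]})

exp = concs.melt('plate', var_name='ab', value_name='concs').merge(table, how='left')
print(exp)
#   plate   ab  concs  signal
# 0    p1  abX    1.0     0.5
# 1    p2  abX    2.0     NaN
# 2    p1  abY    3.0     0.7
# 3    p2  abY    4.0     0.9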
First melt the concs dataframe and then merge with table:
out = (concs.melt(id_vars=['plate'],
                  value_vars=concs.columns.drop('plate').tolist(),
                  var_name='ab')
            .merge(table, on=['plate', 'ab'])
            .rename(columns={'value': 'concs'}))
or just make good use of the parameters of melt, as in jezrael's answer:
out = (concs.melt(id_vars=['plate'],
                  value_name='concs',
                  var_name='ab')
            .merge(table, on=['plate', 'ab']))
I have a dataframe with many column names containing _paid (e.g. A_paid, B_paid, etc.). I need to fill missing values in any column that has _paid as part of its name. (Note: I am not allowed to replace missing values in the other columns, those without _paid in the name.)
I tried to use .fillna(), but I am not sure how to make it match a substring of the column names.
If you want to select any column that has _paid in it:
paid_cols = df.filter(like="_paid").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid", regex=False)]
and then
df[paid_cols] = df[paid_cols].fillna(...)
If you need _paid to appear only at the end, anchor the regex with $:
paid_cols = df.filter(regex="_paid$").columns
or
paid_cols = df.columns[df.columns.str.contains("_paid$")]
then apply the same fillna as above.
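Putting it together on a made-up frame (the column names and the fill value 0 are just placeholders):
import numpy as np
import pandas as pd

# hypothetical frame: only the *_paid columns should be filled
df = pd.DataFrame({'A_paid': [1.0, np.nan],
                   'B_paid': [np.nan, 2.0],
                   'other': [np.nan, 3.0]})

paid_cols = df.filter(like='_paid').columns
df[paid_cols] = df[paid_cols].fillna(0)  # 0 stands in for your fill value
print(df)
#    A_paid  B_paid  other
# 0     1.0     0.0    NaN
# 1     0.0     2.0    3.0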
Imagine I have the following DataFrame:
10 rows x 26 columns named A to Z
What I would like to do is to make a multiple subset of the columns by their name (not the index). For instance, assume that I want columns A to D and P to Z in a new DataFrame named df2.
I tried something like this, but it doesn't seem to work:
df2=df[:,[:A,:D ; :P,:Z]]
syntax: unexpected semicolon in array expression
top-level scope at Slicing.jl:1
Any idea of the way to do it? Thanks for any help.
df2 = select(df, Between(:A,:D), Between(:P,:Z))
or
df2 = df[:, All(Between(:A,:D), Between(:P,:Z))]
If you are sure your columns run only from :A to :Z, you can also write:
df2 = select(df, Not(Between(:E, :O)))
or
df2 = df[:, Not(Between(:E, :O))]
Finally, you can easily find the index of a column using the columnindex function, e.g.:
columnindex(df, :A)
and later use column numbers, if that is something you would prefer.
In Julia you can also build ranges of Chars, so when your columns are named with single letters, yet another option is:
df[:, Symbol.(vcat('A':'D', 'P':'Z'))]
I would like to concatenate all the columns, comma-delimited, in pandas.
But as you can see, it is a very laborious task, since I manually typed all the column indices.
de = data[3]+","+data[4]+","+data[5]+....+","+data[1511]
Do you have any idea how to avoid the above procedure in pandas in Python 3?
First convert all columns to strings with DataFrame.astype, then join per row:
df = data.astype(str).apply(','.join, axis=1)
Or, after converting to strings, append ',', then sum per row and finally remove the trailing ',' with Series.str.rstrip:
df = data.astype(str).add(',').sum(axis=1).str.rstrip(',')
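Both variants on a tiny made-up frame (the real data would have columns 3 through 1511):
import pandas as pd

# hypothetical three-column stand-in for the wide frame
data = pd.DataFrame({'c1': [1, 4], 'c2': [2, 5], 'c3': [3, 6]})

joined = data.astype(str).apply(','.join, axis=1)
print(joined.tolist())  # ['1,2,3', '4,5,6']

joined2 = data.astype(str).add(',').sum(axis=1).str.rstrip(',')
print(joined2.tolist())  # ['1,2,3', '4,5,6']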
Is it possible to only merge some columns? I have a DataFrame df1 with columns x, y, z, and df2 with columns x, a, b, c, d, e, f, etc.
I want to merge the two DataFrames on x, but I only want to merge columns df2.a, df2.b - not the entire DataFrame.
The result would be a DataFrame with x, y, z, a, b.
I could merge then delete the unwanted columns, but it seems like there is a better method.
You want to use TWO pairs of brackets, so if you are doing a VLOOKUP-style action:
df = pd.merge(df, df2[['Key_Column', 'Target_Column']], on='Key_Column', how='left')
This will give you everything in the original df plus that one corresponding column from df2 that you want to join.
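A quick sketch with invented frames, showing that the unselected columns stay out:
import pandas as pd

# made-up frames: df2 carries an extra column we do not want to pull in
df = pd.DataFrame({'Key_Column': [1, 2], 'y': ['a', 'b']})
df2 = pd.DataFrame({'Key_Column': [1, 2],
                    'Target_Column': [10, 20],
                    'noise': ['x', 'y']})

df = pd.merge(df, df2[['Key_Column', 'Target_Column']], on='Key_Column', how='left')
print(df.columns.tolist())  # ['Key_Column', 'y', 'Target_Column']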
You could merge the sub-DataFrame (with just those columns):
df2[list('xab')] # df2 but only with columns x, a, and b
df1.merge(df2[list('xab')])
If you want to drop column(s) from the target data frame, but the column(s) are required for the join, you can do the following:
df1 = df1.merge(df2[['a', 'b', 'key1']], how='left',
                left_on='key2', right_on='key1').drop(columns=['key1'])
The .drop(columns=['key1']) part will prevent 'key1' from being kept in the resulting data frame, even though it was required for the join in the first place.
You can use .iloc to select specific columns across all rows and merge just that selection. An example is below:
pandas.merge(dataframe1, dataframe2.iloc[:, 0:5], how='left', on='key')
In this example, you are merging dataframe1 and dataframe2. You have chosen to do a left join on 'key'. For dataframe2 you have used .iloc, which lets you pick rows and columns by position: the : selects all rows, and 0:5 selects the first five columns. You could use .loc to select by name instead, but if you're dealing with long column names, .iloc may be easier.
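A minimal illustration of the positional slice (names invented); note that the slice must include the join key, or the merge will fail:
import pandas as pd

# hypothetical frame: iloc[:, 0:3] keeps the first three columns by position
dataframe2 = pd.DataFrame({'key': [1, 2], 'a': [3, 4],
                           'b': [5, 6], 'c': [7, 8]})
print(dataframe2.iloc[:, 0:3].columns.tolist())  # ['key', 'a', 'b']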
This is to merge selected columns from two tables.
If table_1 contains columns t1_a, t1_b, t1_c, ..., id, ..., t1_z,
and table_2 contains columns t2_a, t2_b, t2_c, ..., id, ..., t2_z,
and only t1_a, id, t2_a are required in the final table, then
mergedCSV = table_1[['t1_a', 'id']].merge(table_2[['t2_a', 'id']], on='id', how='left')
# save resulting output file
mergedCSV.to_csv('output.csv', index=False)
A slight extension of the accepted answer for multi-character column names, using merge's default inner join:
df1 = df1.merge(df2[["Key_Column", "Target_Column1", "Target_Column2"]])
This assumes that Key_Column is the only column both dataframes have in common.
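A small sketch of that default behaviour with invented data; with no on argument, merge joins on the shared Key_Column and keeps only the matching rows:
import pandas as pd

# invented frames where Key_Column is the only shared column name
df1 = pd.DataFrame({'Key_Column': [1, 2, 3], 'y': ['a', 'b', 'c']})
df2 = pd.DataFrame({'Key_Column': [2, 3, 4],
                    'Target_Column1': [20, 30, 40],
                    'Target_Column2': [0.2, 0.3, 0.4]})

out = df1.merge(df2[['Key_Column', 'Target_Column1', 'Target_Column2']])
print(out)
#    Key_Column  y  Target_Column1  Target_Column2
# 0           2  b              20             0.2
# 1           3  c              30             0.3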