How to drop multiple column names given in a list from Spark DataFrame?

I have a dynamic list that is created based on the value of n.
n = 3
drop_lst = ['a' + str(i) for i in range(n)]
df.drop(drop_lst)
But the above is not working.
Note:
My use case requires a dynamic list.
If I pass the names directly, without a list, it works:
df.drop('a0','a1','a2')
How do I make the drop function work with a list?
Spark 2.2 doesn't seem to have this capability. Is there a way to make it work without using select()?

You can use the * operator to pass the contents of your list as arguments to drop():
df.drop(*drop_lst)
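To see why the `*` matters, here is a minimal plain-Python sketch. The `drop` function below is a hypothetical stand-in that only mimics a variadic `drop(*cols)` signature; it is not Spark's implementation.

```python
def drop(*cols):
    """Hypothetical stand-in with a variadic signature; returns what it received."""
    return cols

n = 3
drop_lst = ['a' + str(i) for i in range(n)]

# Without the star, the function receives ONE argument: the list itself.
as_single_arg = drop(drop_lst)

# With the star, the list is unpacked into three separate string arguments.
unpacked = drop(*drop_lst)
```

A function that expects column names as separate arguments only sees the individual names in the second call.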

You can give the column names as comma-separated arguments, e.g.
df.drop("col1","col11","col21")

This is how to drop a specified number of consecutive columns in Scala:
val ll = dfwide.schema.names.slice(1,5)
dfwide.drop(ll:_*).show
slice takes two parameters: the start index (inclusive) and the end index (exclusive).

Use a simple loop:
for c in drop_lst:
    df = df.drop(c)
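The reassignment in the loop matters because drop returns a new DataFrame rather than changing the existing one. A rough sketch of that behaviour follows, using a hypothetical `FakeFrame` class as a stand-in (it is not a Spark class), together with an equivalent `functools.reduce` form:

```python
from functools import reduce

class FakeFrame:
    """Hypothetical immutable stand-in for a DataFrame; not part of Spark."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, col):
        # Return a new object and leave self unchanged.
        return FakeFrame(c for c in self.columns if c != col)

drop_lst = ['a0', 'a1', 'a2']
df = FakeFrame(['id', 'a0', 'a1', 'a2', 'value'])

# Loop form: the result must be reassigned on every iteration.
looped = df
for c in drop_lst:
    looped = looped.drop(c)

# Fold form: the same chain of drop calls written with reduce.
folded = reduce(lambda frame, c: frame.drop(c), drop_lst, df)
```

Both forms issue one drop call per column, so they are alternatives to unpacking the list rather than improvements on it.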

You can call drop(*cols) in two ways, with a column name or with a column object:
df.drop('age').collect()
df.drop(df.age).collect()
Check the official documentation for DataFrame.drop.

Related

Compare two comma separated columns

I want to compare two columns, actual_data and pipeline_data, based on the source column, because every source has a different format.
I am trying to derive the result column from a comparison between actual_data and pipeline_data.
I am new to pandas and looking for a way to implement this.
df['result'] = np.where(df['pipeline_data'].str.len() == df['actual_data'].str.len(), 'Match', np.where(df['pipeline_data'].str.len() > df['actual_data'].str.len(), 'Length greater than actual_data', 'Length shorter than actual_data'))
The code above should do what you want.
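For readers unfamiliar with nested np.where, the same decision can be written as an ordinary function. This is only a plain-Python sketch of the logic; the function name `compare_lengths` is made up for illustration.

```python
def compare_lengths(pipeline_value, actual_value):
    """Mirror the nested np.where: compare string lengths only, not contents."""
    if len(pipeline_value) == len(actual_value):
        return 'Match'
    if len(pipeline_value) > len(actual_value):
        return 'Length greater than actual_data'
    return 'Length shorter than actual_data'

same_length = compare_lengths('a,b', 'x,y')    # different contents, equal length
longer = compare_lengths('a,b,c', 'a,b')
shorter = compare_lengths('a', 'a,b')
```

Note that two values with different contents but the same number of characters are still reported as 'Match', because only the lengths are compared.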

Using to_datetime several columns names

I am working with several CSVs in which the first N columns hold general information and the next M columns (M is big) hold information for a date.
This is the dataframe picture
I need to convert just the names of columns N+1 to N+M-1 to date format.
I tried the following. In this case N+1 = 5, whatever M is, and I suppose I can use -1 so that the last column name is not affected.
ContDiarios.columns[5:-1] = pd.to_datetime(ContDiarios.columns[5:-1])
but I get the following error:
TypeError: Index does not support mutable operations
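The way you are doing it is not feasible, because an Index cannot be modified in place. Please try it this way:
def convert(x):
    try:
        return pd.to_datetime(x)
    except (ValueError, TypeError):
        # leave labels that are not dates unchanged
        return x
ContDiarios.columns = [convert(c) for c in ContDiarios.columns]
Or you can also use the df.rename method to convert them.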
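If only the labels in positions 5:-1 should be converted, one option is to build a complete new list of labels and assign it in one go. The sketch below works on a plain Python list; the sample labels, the `%Y-%m-%d` format and the name `n_info` are assumptions for illustration, not taken from the question.

```python
from datetime import datetime

def convert(label, fmt="%Y-%m-%d"):
    """Parse a label as a date; return it unchanged if it does not parse."""
    try:
        return datetime.strptime(label, fmt)
    except (ValueError, TypeError):
        return label

# Hypothetical column labels: two info columns, two date columns, one trailing column.
columns = ["id", "name", "2020-01-01", "2020-01-02", "total"]
n_info = 2  # number of leading info columns (5 in the question)

# Rebuild the whole list instead of assigning into a slice.
new_columns = (
    columns[:n_info]
    + [convert(c) for c in columns[n_info:-1]]
    + columns[-1:]
)
```

With pandas, a list built this way could then be assigned back as a whole, e.g. `ContDiarios.columns = new_columns`, which avoids mutating the existing Index.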

Dataframe Row(sum(fld)) to a discrete value

I have this:
df = sqlContext.sql(qry)
df2 = df.withColumn("ext", df.lvl * df.cnt)
ttl = df2.agg(F.sum("ext")).collect()
which returns this:
[Row(sum(ext)=1285430)]
How do I reduce this to just the discrete value 1285430, without it being a list of Row(sum())?
I've researched and tried so many things that I'm totally stymied.
No need for collect:
n = ...your transformation logic and agg... .first().getInt(0)
Access the first row and then get the first element as an int (Scala syntax):
df2.agg(F.sum("ext")).collect()(0).getInt(0)
Take a look at the documentation: Spark ScalaDoc.
In Python you can also use df.collect()[0][0] or df.collect()[0]['sum(ext)'].
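The Python indexing works because collect() returns a list and a PySpark Row is a subclass of tuple. A tiny sketch with a plain tuple as a stand-in for the Row (the `collected` value below is hand-written, not produced by Spark):

```python
# Hypothetical stand-in for the result of collect(): a list holding one row.
collected = [(1285430,)]

first_row = collected[0]   # first (and only) row
total = first_row[0]       # first field of that row
```

In PySpark the same positional indexing applies to the Row returned by first().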

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe looks something like this:
and it contains 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do is calculate something like mag = (U-V)/(R-I) (but ignoring any values that are -999), put that in a new column, and then put z_pred = 10**((mag-c)/m) in another new column (mag, c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
    current = qso[:]
    mag = (U-V)/(R-I)
    name = current['NED']
    z_pred = 10**((mag - c)/m)
    z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding calculated columns row-wise is usually done with numpy's np.where:
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1), (df.U - df.V) / (df.R - df.I), -999)
Note: this assumes that when any of the columns contains -999, nothing is calculated and -999 is returned.
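The same rule, extended to the second column z_pred, can be sketched in plain Python on a single row. The values of `c` and `m` and the sample rows are made-up placeholders, and propagating -999 into z_pred is an assumption consistent with the note above.

```python
SENTINEL = -999
c, m = 0.5, 2.0  # hypothetical hard-coded constants

def compute_mag(row):
    """Return (U-V)/(R-I), or the sentinel if any input is missing."""
    if SENTINEL in (row["U"], row["V"], row["R"], row["I"]):
        return SENTINEL
    return (row["U"] - row["V"]) / (row["R"] - row["I"])

def compute_z_pred(mag):
    """Return 10**((mag-c)/m), propagating the sentinel unchanged."""
    if mag == SENTINEL:
        return SENTINEL
    return 10 ** ((mag - c) / m)

good_row = {"U": 3.0, "V": 1.0, "R": 2.0, "I": 1.0}
bad_row = {"U": 3.0, "V": SENTINEL, "R": 2.0, "I": 1.0}

good_mag = compute_mag(good_row)
good_z = compute_z_pred(good_mag)
bad_z = compute_z_pred(compute_mag(bad_row))
```

Each further derived column follows the same pattern: check for the sentinel first, then apply the formula.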

Finding the count of a set of substrings in pandas dataframe

I am given a set of substrings. I need to count the occurrences of each of those substrings in a particular column of a dataframe. The relevant dataframe looks like this:
training['concat']
0 svAxu$paxArWAn
1 xvAxaSa$varRANi
2 AxAna$xurbale
3 go$BakwAH
4 viXi$Bexena
5 nIwi$kuSalaM
6 lafkA$upamam
7 yaSas$lipsoH
8 kaSa$AGAwam
9 hewumaw$uwwaram
10 varRa$pUgAn
My set of substrings is a dictionary, where the keys are the substrings and values are the probabilities with which they occur
reg = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
#The length of the dictionary is 2000
In particular, I need to find those substrings which occur more than twice.
I have written the following code, which performs the task. Is there a more elegant, Pythonic or pandas-specific way to achieve the same? The current implementation takes quite some time to execute.
elites = dict()
for reg_pat in reg_:
    count = 0
    eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
    if eliter >=3:
        elites[reg_pat] = reg_[reg_pat]
You can use apply instead of str.contains; it is faster:
reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}
elites = dict()
for reg_pat in reg_:
    if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
        elites[reg_pat] = reg_[reg_pat]
print (elites)
{'a': 0.0005}
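The result above can be reproduced without pandas, which also makes one difference between the two approaches visible: str.contains treats its pattern as a regular expression by default, so a key such as 'a$piwra' is not matched literally unless regex=False is passed or the pattern is escaped. A small sketch using the sample strings from the question (the names `strings`, `regex_hit` and `literal_hit` are illustrative):

```python
import re

strings = [
    'svAxu$paxArWAn', 'xvAxaSa$varRANi', 'AxAna$xurbale', 'go$BakwAH',
    'viXi$Bexena', 'nIwi$kuSalaM', 'lafkA$upamam', 'yaSas$lipsoH',
    'kaSa$AGAwam', 'hewumaw$uwwaram', 'varRa$pUgAn',
]
reg_ = {'anuBavAn': 0.35, 'a$piwra': 0.2, 'piwra': 0.7, 'pa': 0.03, 'a': 0.0005}

# Keep substrings found (literally) in at least three strings.
elites = {
    pat: prob
    for pat, prob in reg_.items()
    if sum(pat in s for s in strings) >= 3
}

# As a regex, the unescaped "$" is an end-of-string anchor, not a literal character.
regex_hit = re.search('a$piwra', 'xa$piwra') is not None
literal_hit = re.search(re.escape('a$piwra'), 'xa$piwra') is not None
```

So the `in` check and an unescaped regex search can disagree for any key that contains regex metacharacters.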
Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in check, but I could be wrong on that.
for substr in reg:
    # assumes training is the Series of strings (training['concat'] in the question)
    totalStringAppearances = training.apply(lambda string: substr in string)
    totalStringAppearances = totalStringAppearances.sum()
    if totalStringAppearances > 2:
        reg[substr] = totalStringAppearances / len(training)
    else:
        # do what you want to with the very rare substrings
        pass
Some gotchas:
If you wanted something like a substring 'a' in 'abcdefa' to return 2, then this will not work. It merely checks for existence of the substring in each string.
Inside the apply(), I am using a potentially unreliable exploitation of booleans. See this question for more details.
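Both gotchas can be illustrated in a few lines of plain Python (the variable names are only for illustration):

```python
text = 'abcdefa'

# "in" only reports whether the substring exists at all.
present = 'a' in text

# str.count reports the number of non-overlapping occurrences.
occurrences = text.count('a')

# Summing booleans works because True counts as 1 and False as 0.
flags = [True, False, True]
total_true = sum(flags)
```

If per-string occurrence counts are needed, `str.count` would replace the `in` check inside the lambda.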
Post-edit: Jezrael's answer is more complete as it uses the same variable names. But, in a simple case, regarding regex vs. apply and in, I validate his claim, and my presumption: