I have a pandas DataFrame with the following columns:
code nozzle_no nozzle_var nozzle_1 nozzle_2 nozzle_3 nozzle_4
I want to get the column names nozzle_1, nozzle_2, nozzle_3, nozzle_4 from this DataFrame.
I am doing the following in pandas:
colnames = sir_df_subset.columns[sir_df_subset.columns.str.contains(pat='nozzle_')]
But it also includes nozzle_no and nozzle_var, which I do not want. How can I do this in pandas?
You can use the regex parameter of df.filter here:
df.filter(regex=r'nozzle_\d+')
The .str.contains method has a regex flag that is True by default, so you can pass a regex:
colnames = sir_df_subset.columns[sir_df_subset.columns.str.contains(pat=r'nozzle_\d+$')]
but the answer by @anky_91 using df.filter is much better.
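To illustrate both approaches, here is a minimal runnable sketch; the frame is made up, only the column names come from the question:

import pandas as pd

# Hypothetical DataFrame with the column layout from the question
sir_df_subset = pd.DataFrame(columns=['code', 'nozzle_no', 'nozzle_var',
                                      'nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4'])

# df.filter keeps only the columns whose names match the regex
print(sir_df_subset.filter(regex=r'nozzle_\d+').columns.tolist())
# ['nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4']

# str.contains with an anchored pattern yields the same column names
print(sir_df_subset.columns[sir_df_subset.columns.str.contains(r'nozzle_\d+$')].tolist())
# ['nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4']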
I want to create a new DataFrame from another one, keeping only the rows that meet a condition, such as:
uk_cities_df['location'] = cities_df['El Tarter'].where(cities_df['AD'] == 'GB')
uk_cities_df[:5]
but the resulting uk_cities_df is full of NaN values.
The CSV file I am extracting from has no header row, so pandas used the first row's values as column names. I need uk_cities_df to include only the rows with the ISO code "GB"; "El Tarter" is the column holding the location values and "AD" is the column holding the ISO codes.
Could you please provide a visual of what uk_cities_df and cities_df look like?
From what I can gather, I think you might be looking for the .loc operator; you could try, for example:
uk_cities_df['location'] = cities_df.loc[cities_df['AD'] == 'GB', 'location']
Also, I did not really get what role 'El Tarter' plays here; maybe you could give more details?
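If the goal is a new DataFrame holding only the GB rows, here is a minimal sketch, assuming (as the question describes) that 'El Tarter' holds the locations and 'AD' the ISO codes; the data values are invented:

import pandas as pd

# Hypothetical data mirroring the question's headerless CSV
cities_df = pd.DataFrame({'El Tarter': ['London', 'Paris', 'Leeds'],
                          'AD': ['GB', 'FR', 'GB']})

# Boolean indexing keeps only the rows whose ISO code is 'GB'
uk_cities_df = cities_df.loc[cities_df['AD'] == 'GB', ['El Tarter']]
uk_cities_df = uk_cities_df.rename(columns={'El Tarter': 'location'})
print(uk_cities_df)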
This is an incredibly basic question, but after reading through Stack Overflow and the documentation for str.replace, I'm not finding the answer. I'm trying to drop all of the punctuation in a column. What would be the proper syntax for this?
This works but it's absurd:
igog['text']=igog['text'].str.replace(",","").str.replace("/","").str.replace(".","").str.replace("\"","").str.replace("?","").str.replace(";","")
This doesn't:
igog['text'] = igog['text'].str.replace({","," ","/","",".","","\"","","?","",";",""}).
Because I keep getting "replace() missing 1 required positional argument: 'repl'".
Thanks in advance!
You can make a simple loop like this:
t = ',/."?;'
for ch in t:
    igog['text'] = igog['text'].str.replace(ch, '', regex=False)
or you can use regex:
igog['text'] = igog['text'].str.replace(r'[,/."?;]', '', regex=True)
or you can use re.sub() (applied per string, since re.sub does not accept a Series):
import re
igog['text'] = igog['text'].apply(lambda s: re.sub(r'[,/."?;]', '', s))
or you can define a translation table (mapping to None deletes the character):
igog['text'] = igog['text'].str.translate({ord(ch): None for ch in ',/."?;'})
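If the goal is to strip all punctuation rather than a hand-picked set, here is a minimal sketch that builds the regex from Python's string.punctuation; the sample frame is invented for illustration:

import re
import string
import pandas as pd

igog = pd.DataFrame({'text': ['Hello, world! "Really?"; yes/no.']})

# re.escape makes every punctuation character literal inside the class
pattern = '[' + re.escape(string.punctuation) + ']'
igog['text'] = igog['text'].str.replace(pattern, '', regex=True)
print(igog['text'].iloc[0])  # Hello world Really yesno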
I am trying to create a new field during indexing; however, the fields become columns instead of values when I try to concatenate. What am I doing wrong? I have looked in the docs and my configuration seems to follow them.
Would appreciate some help on this.
e.g.
.csv file:
Header1,Header2
Value1,121244
transforms.conf
[test_transformstanza]
SOURCE_KEY = fields:Header1,Header2
REGEX = ^(\w+\s+)(\d+)
FORMAT = testresult::$1.$2
WRITE_META = true
fields.conf
[testresult]
INDEXED = True
The regex is good and creates two groups from the data, but why is it creating a new field instead of assigning the value to testresult? If I use testresult::$1 or testresult::$2 alone, it works fine, but when concatenating, it creates multiple headers with the value as the header name. Is there an easier way to concatenate fields? For example, if you have a CSV file with header names, can you not just refer to the header names? (I know how to do this using calculated fields, but I want to do it during indexing.)
Thanks
I'm using spark-core, spark-sql, spark-hive_2.10 (1.6.1), and scala-reflect 2.11.2. I'm trying to filter a DataFrame created through a Hive context...
df = hiveCtx.createDataFrame(someRDDRow, someDF.schema());
One of the columns that I'm trying to filter on has multiple single quotes in it. My filter query will be something similar to
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply'"));
In my Java class where this filter occurs, I tried to escape the single quotes in the String variable (e.g. commentValueToFilterOut, which contains the value "That's Dany's Reply") with
commentValueToFilterOut = commentValueToFilterOut.replaceAll("'", "\\\\'");
But when I apply the filter to the DataFrame, I get the error below...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier
s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the DataFrame and copy the values from the actual column (which contains special characters that may cause issues, like single quotes) into the new column, with the special characters removed.
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Strip the same special characters from the value used in the condition and apply the filter.
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'", "");
df = df.filter("(someOtherColumn = 'someOtherValue' and comment_new = '" + commentToFilter + "')");
Now that the filter has been applied, you can drop the new column that you created for the sole purpose of filtering and restore the original DataFrame.
df = df.drop("comment_new");
If you don't want to create a new column in the DataFrame, you can also replace the special character with some "never occurs" string literal in the same column, for example
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same with the string literal that you want to filter against:
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Once filtering is done, restore the actual values by reverse-applying the string literal. Note that regexp_replace treats the pattern as a regex, so the carets must be escaped:
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"), "\\^\\^\\^\\^", "'"));
Though this does not address the actual issue, anyone with the same problem can try it as a workaround. The actual solution could be to use sqlContext (instead of hiveContext) and/or Dataset (instead of DataFrame), and/or to upgrade to spark-hive 2.12; I'll leave that for the experts to debate and answer. A sketch of a quoting-free alternative using the Column API follows below.
PS: Thanks to KP, my lead
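As an aside, one way to sidestep the quoting problem entirely is to build the predicate with the Column API instead of a SQL string, so no escaping is needed at all. A minimal sketch in recent PySpark; the column names follow the question, the data is invented, and this is an alternative approach rather than the original poster's setup:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("quote-filter").getOrCreate()
df = spark.createDataFrame(
    [("someOtherValue", "That's Dany's Reply"), ("x", "other comment")],
    ["someOtherColumn", "comment"])

# Column expressions pass values directly, so single quotes need no escaping
filtered = df.filter(~((col("someOtherColumn") == "someOtherValue") &
                       (col("comment") == "That's Dany's Reply")))
filtered.show()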
Is there a way to select only a few columns while importing data using readtable?
Something like the usecols parameter of pandas' read_csv:
movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_col_names, usecols=range(5))
According to this issue, https://github.com/JuliaStats/DataFrames.jl/issues/568, as @DSM pointed out, the current implementation of DataFrames does not support this.