Selecting columns while importing data with DataFrames.readtable

Is there a way to select only a few columns while importing the data using readtable?
Something like the usecols argument of pandas read_csv:
movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_col_names, usecols=range(5))

According to this issue https://github.com/JuliaStats/DataFrames.jl/issues/568, as @DSM pointed out, the current implementation of DataFrames does not support this.

Related

Bokeh Not Assigning A Date?

I'm having an issue plotting a timeline using Bokeh. The documentation says that you should be able to reassign the default using parse_dates, but when I create the plot:
example = pd.read_csv(
    "exampledataframe.csv",
    parse_dates=["Date"],
    infer_datetime_format=True,
    index_col=0,
)
display_timeline(example)
I get this error:
KeyError: "None of [Index(['TimeGenerated'], dtype='object')] are in the [columns]"
Why isn't Bokeh reassigning the index to the "Date" column in my dataframe? I can't find anything about the issue.
Also, please forgive mistakes in terminology. I am a novice in data analysis.
You can use the names and header arguments of pandas.read_csv to rename the Date column:
example = pd.read_csv(
    "exampledataframe.csv",
    parse_dates=["TimeGenerated"],  # refer to the renamed column, not "Date"
    infer_datetime_format=True,
    index_col=0,
    names=["TimeGenerated", "...."],
    header=None,
)
display_timeline(example)
Make sure the names argument lists, in order, all of the column names of your .csv.
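Alternatively (a sketch, not part of the answer above), you could keep the file's own header row and simply rename the parsed Date column to the name the error message is looking for:
example = pd.read_csv("exampledataframe.csv", parse_dates=["Date"], infer_datetime_format=True)
example = example.rename(columns={"Date": "TimeGenerated"})  # the column name the error asks for
display_timeline(example)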

Group Pandas: How to concat or merge/join/append two csv files with same index but different extensions in grouped data?

I'd like to concat or merge or append/join two csv files that have the same ID index but different extensions on that ID. The data are also grouped by ID. The 1st file looks like this:
ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2003,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25
and the 2nd file looks like this:
ID,year,from1,to1
810006862,2002,15341,15705
810006862,2003,15706,16070
810006862,2004,16071,16436
810006862,2005,,
810023112,2000,14610,14975
810023112,2001,14976,15340
810023112,2003,15825,16523
810033622,2000,13211,14876
810033622,2001,14761,14987
I set ID as the index for both files after reading them into dataframes and then concatenated them, but I get the error "ValueError: Shape of passed values is (25, 2914), indices imply (25, 251)".
I've tried the following code:
sp = pd.read_csv('sp1.csv')
sp = sp.set_index('ID')
op = pd.read_csv('op1.csv')
op = op.set_index('ID')
ff = pd.concat([sp, op], join = 'outer', sort = False, axis = 1)
I've also tried concatenating the two files without setting an index; the result seemed to have the correct rows, but the values were not aligned correctly across the columns.
I've also tried merge, but it produced many unnecessary duplicate rows within each group. Since each group has different year and age values, I found it quite difficult to delete those newly generated rows with this method.
full = pd.merge(sp, op, on = 'ID', how = 'outer', sort = False)
Maybe somebody can suggest a way to easily delete these duplicates; that would also work for me, because the merged file became huge! Thanks in advance!
The expected result would include all the values from both csv files, somewhat like this:
ID,year,age,from1,to1
810006862,2000,49,,
810006862,2001,,,
810006862,2002,,15341,15705
810006862,2003,52,15706,16070
810006862,2004,,16071,16436
810006862,2005,,,
810023112,2000,,14610,14975
810023112,2001,,14976,15340
810023112,2003,27,15825,16523
810023112,2004,28,,
810023112,2005,29,,
810023112,2006,30,,
810033622,2000,24,13211,14876
810033622,2001,25,14761,14987
I've searched online for similar posts for quite some time, but I have been unable to solve my problem. Can anybody offer a clue how to do this? Thanks a lot!
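A minimal sketch (using the sp1.csv and op1.csv files read above): the duplicate rows come from merging on ID alone, which pairs every year of an ID in one file with every year of the same ID in the other; merging on both ID and year keeps a single row per (ID, year) combination and gives the expected layout:
import pandas as pd
sp = pd.read_csv('sp1.csv')
op = pd.read_csv('op1.csv')
# merge on both keys so each (ID, year) pair appears only once;
# how='outer' keeps rows that exist in just one of the files
full = pd.merge(sp, op, on=['ID', 'year'], how='outer')
full = full.sort_values(['ID', 'year'])
print(full)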

Not able to drop multiple columns from a .csv file in Pandas

I'm reading a csv file that has 7 columns
df = pd.read_csv('DataSet.csv',delimiter=',',usecols=['Wheel','Date','1ex','2ex','3ex','4ex','5ex'])
The problem is that the model I want to train with it is complaining about the first 2 columns being strings, so I want to drop them.
I first tried not to read them in from the start with:
df = pd.read_csv('DataSet.csv',delimiter=',',usecols=['1ex','2ex','3ex','4ex','5ex'])
but that only shifted the values over by two columns, so I decided to drop them instead.
The problem is that I'm only able to drop the first column, 'Date', with
train_df.drop(columns=['Date'], inplace=True)
where train_df is a portion of df used for testing. How do I also drop the 'Wheel' column?
I tried
train_df.drop(labels=[["Date","Wheel"]], inplace=True)
but I get KeyError: "[('Date', 'Wheel')] not found in axis"
so I tried
train_df.drop(columns=[["Date","Wheel"]], index=1, inplace=True)
but I still get the same error.
I'm so new to Python that I'm out of resources to solve this.
As always, many thanks.
Try:
train_df.drop(columns=["Date","Wheel"], index=1, inplace=True)
See the examples in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
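For reference, a short sketch of why the earlier attempts failed (column names taken from the question): the extra brackets in drop(labels=[["Date","Wheel"]]) and columns=[["Date","Wheel"]] make pandas look for a single tuple label ('Date', 'Wheel'), which is exactly what the KeyError reports, whereas a flat list passed to columns= drops both columns. Note that index=1 additionally drops the row labeled 1, so it can be left out if only the columns should go:
# a flat list of column names drops both string columns and nothing else
train_df.drop(columns=["Date", "Wheel"], inplace=True)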

how to find columns whose names contain a specific string

I have the following pandas dataframe with these columns:
code nozzle_no nozzle_var nozzle_1 nozzle_2 nozzle_3 nozzle_4
I want to get the column names nozzle_1, nozzle_2, nozzle_3, nozzle_4 from the above dataframe.
I am doing the following in pandas:
colnames= sir_df_subset.columns[sir_df_subset.columns.str.contains(pat = 'nozzle_')]
But it also includes nozzle_no and nozzle_var, which I do not want. How do I do this in pandas?
You can use the df.filter regex param here:
df.filter(regex=r'nozzle_\d+')
.str.contains has a regex flag that is True by default, so you can pass a regex:
colnames = sir_df_subset.columns[sir_df_subset.columns.str.contains(pat=r'nozzle_\d+$')]
but the answer of @anky_91 with df.filter is MUCH better.
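A minimal sketch of both approaches, using an empty frame with the column names from the question (the values don't matter for column selection):
import pandas as pd
df = pd.DataFrame(columns=['code', 'nozzle_no', 'nozzle_var',
                           'nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4'])
print(df.filter(regex=r'nozzle_\d+').columns.tolist())
# ['nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4']
print(df.columns[df.columns.str.contains(r'nozzle_\d+$')].tolist())
# ['nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4']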

Spark Dataframe sql in java - How to escape single quote

I'm using spark-core, spark-sql, spark-hive 2.10 (1.6.1), and scala-reflect 2.11.2. I'm trying to filter a dataframe created through the hive context...
df = hiveCtx.createDataFrame(someRDDRow, someDF.schema());
One of the columns I'm trying to filter on has multiple single quotes in it. My filter query will be something similar to
df = df.filter("not (someOtherColumn = 'someOtherValue' and comment = 'That's Dany's Reply')");
In the Java class where this filter occurs, I tried to escape the single quotes in the String variable (e.g. commentValueToFilterOut, which contains the value "That's Dany's Reply") with
commentValueToFilterOut= commentValueToFilterOut.replaceAll("'","\\\\'");
But when I apply the filter to the dataframe I get the error below...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the dataframe and copy the values from the actual column (which contains special characters, such as the single quote, that may cause issues) into the new column with those special characters stripped out.
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Strip the same special characters from the value used in the condition and apply the filter.
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Now that the filter has been applied, you can drop the new column that you created for the sole purpose of filtering, restoring the original dataframe.
df = df.drop("comment_new");
If you don't want to create a new column in the dataframe, you can instead replace the special character with some "never-happen" string literal in the same column, e.g.
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same with the string literal that you want to filter against (here the comparison is against the comment column itself, since it was modified in place):
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment= '"+commentToFilter+"')");
Once filtering is done, restore the actual value by applying the replacement in reverse (the ^ characters need escaping here, because regexp_replace treats the pattern as a regular expression):
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"), "\\^\\^\\^\\^", "'"));
Though this doesn't answer the actual issue, anyone having the same problem can try it out as a workaround.
The actual solution could be to use sqlContext (instead of hiveContext) and/or Dataset (instead of dataframe) and/or to upgrade to spark-hive 2.12. I'll leave that for the experts to debate and answer.
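As a further illustration (not part of the answer above), the quoting problem can also be sidestepped entirely by building the filter with the Column API instead of a SQL string; a minimal PySpark sketch of the same condition, with the column names taken from the question (the equivalent Column methods exist in the Java API as well):
from pyspark.sql import functions as F
commentToFilter = "That's Dany's Reply"  # no escaping needed, the value is never parsed as SQL text
df = df.filter(~((F.col("someOtherColumn") == "someOtherValue") & (F.col("comment") == commentToFilter)))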
PS: Thanks to KP, my lead