I'm reading a CSV file that has 7 columns:
df = pd.read_csv('DataSet.csv',delimiter=',',usecols=['Wheel','Date','1ex','2ex','3ex','4ex','5ex'])
The problem is that the model I want to train on it complains about the first two columns being strings, so I want to drop them.
I first tried not reading them in the first place:
df = pd.read_csv('DataSet.csv',delimiter=',',usecols=['1ex','2ex','3ex','4ex','5ex'])
but that only shifted the values of two columns, so I decided to drop them after loading instead.
The problem is that I'm only able to drop the first column 'Date' with
train_df.drop(columns=['Date'], inplace=True)
where train_df is a portion of df used for testing. How do I also drop the 'Wheel' column?
I tried
train_df.drop(labels=[["Date","Wheel"]], inplace=True)
but I get KeyError: "[('Date', 'Wheel')] not found in axis"
so I tried
train_df.drop(columns=[["Date","Wheel"]], index=1, inplace=True)
but I still get the same error.
I'm so new to Python I'm out of resources to solve this.
As always many thanks.
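The nested list [["Date","Wheel"]] is read as a single label, which is why the lookup fails. Pass a flat list instead, and leave out index=1, which would additionally drop the row labelled 1:
train_df.drop(columns=["Date","Wheel"], inplace=True)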
See the examples in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
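A minimal sketch of the difference between the flat and the nested list. The small train_df built here is a made-up stand-in for the real data, and drop_text_columns is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for train_df; the values are invented for illustration.
train_df = pd.DataFrame({
    "Wheel": ["A", "B", "C"],
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "1ex": [1, 2, 3],
    "2ex": [4, 5, 6],
})


def drop_text_columns(frame):
    """Return a copy of frame without the 'Date' and 'Wheel' columns."""
    # A flat list names two separate columns.
    return frame.drop(columns=["Date", "Wheel"])


numeric_df = drop_text_columns(train_df)
print(list(numeric_df.columns))

# A nested list is treated as one label, which does not exist in the frame.
try:
    train_df.drop(columns=[["Date", "Wheel"]])
except KeyError as err:
    print("KeyError:", err)
```

Returning a new frame instead of using inplace=True leaves train_df untouched, which makes it easier to rerun the step while experimenting.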
I have an Excel file with two columns that I want to combine, but one of them is in datetime form and the other is object (actually a time). What I want to do is convert the object column to datetime format.
I've tried everything I can think of but I keep getting an error.
Edit:
import pandas as pd
dataFrame = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/data.xlsx')
dataFrame.head()
output: [image not preserved]
and my error: [image not preserved]
If I understand correctly, you want to split the "Time" column on whitespace and keep the first element. Then .pop the old columns, join them as strings with .str.cat, and wrap the result in to_datetime.
df["Time"] = df["Time"].str.split(r"\s+").str[0]
df["Datetime"] = pd.to_datetime(df.pop("Date").astype(str).str.cat(df.pop("Time"), sep=" "))
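The same steps can be sketched on a small frame. The sample values are assumptions, since the original data was only posted as an image, and combine_date_time is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical sample: 'Date' is already datetime, 'Time' is stored as text.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-01-04"]),
    "Time": ["09:15:00 extra", "17:40:30"],
})


def combine_date_time(frame):
    """Merge 'Date' and 'Time' into a single 'Datetime' column."""
    out = frame.copy()
    # Keep only the first whitespace-separated token of 'Time'.
    time_part = out.pop("Time").str.split().str[0]
    # Render the date as YYYY-MM-DD text so it can be joined with the time.
    date_part = out.pop("Date").dt.strftime("%Y-%m-%d")
    out["Datetime"] = pd.to_datetime(date_part.str.cat(time_part, sep=" "))
    return out


result = combine_date_time(df)
print(result)
```

Using strftime instead of astype(str) makes the text form of the date explicit, so the joined string does not depend on how pandas happens to print the column.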
I'm having an issue plotting a timeline using Bokeh. The documentation says that you should be able to reassign the default using parse_dates, but when I create the plot:
example = pd.read_csv(
"exampledataframe.csv",
parse_dates=["Date"],
infer_datetime_format=True,
index_col=0
);
display_timeline(example)
I get this error
KeyError: "None of [Index(['TimeGenerated'], dtype='object')] are in the [columns]"
Why isn't Bokeh reassigning the index to the "Date" column in my dataframe? I can't find anything about the issue.
Also, please forgive mistakes in terminology. I am a novice in data analysis.
You can use the names and header arguments of pandas.read_csv to rename the Date column:
example = pd.read_csv(
"exampledataframe.csv",
parse_dates=["Date"],
infer_datetime_format=True,
index_col=0,
names=["TimeGenerated", "...."],
header=None
);
display_timeline(example)
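Make sure the names argument lists every column name of your .csv, in order. With header=None the file's first row is read as data, so if the file already has a header row, pass header=0 instead to have names replace it. Also note that parse_dates has to refer to a name from that list (here "TimeGenerated" rather than "Date").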
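An alternative sketch is to read the file with its own header and rename the column afterwards. It assumes, based only on the KeyError in the question, that the plotting function looks for a column literally named TimeGenerated. The CSV text below is invented, and load_with_time_generated is a hypothetical helper.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for exampledataframe.csv.
csv_text = (
    "idx,Date,Event\n"
    "0,2022-07-27 10:00:00,login\n"
    "1,2022-07-27 10:05:00,logout\n"
)


def load_with_time_generated(source):
    """Read the CSV, parse 'Date', and expose it as 'TimeGenerated'."""
    frame = pd.read_csv(source, parse_dates=["Date"], index_col=0)
    # Rename after parsing so parse_dates can still use the original name.
    return frame.rename(columns={"Date": "TimeGenerated"})


example = load_with_time_generated(io.StringIO(csv_text))
print(example.dtypes)
```

This keeps "Date" as a regular column. If "Date" were the first column of the file, index_col=0 would turn it into the index and the rename would have nothing to act on.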
Is there a way to create a subset dataframe from a dataframe and be sure that its values will be used afterward?
I have a huge PySpark Dataframe like this (simplified example):
id  timestamp   value
1   1658919602  5
1   1658919604  9
2   1658919632  2
Now I want to take a sample from it to test something, before running on the entire Dataframe. I get a sample by:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
df_sample.show() shows some values.
Then I run this command, and sometimes it returns values that are present in df_sample and sometimes it returns values that are not present in df_sample but in df.
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
As if it's not using df_sample but picking in a non deterministic way 10 rows from df.
Interestingly, if I run df_sample.show() afterwards, it shows the same values as when it was first called.
Why is this happening?
Here's full code:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
# shows some values
df_sample.show()
# run query
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
# df_temp sometimes shows values that are present in df_sample, but sometimes shows values that aren't present in df_sample but in df
df_temp.show()
# Shows the exact same values as when it was first called
df_sample.show()
Edit1: I understand that Spark is lazy, but is there any way to force it to not be lazy in this scenario?
We can use the sample function provided by Spark to achieve this. Every time you run sample() without a seed it returns a different set of records. To regenerate the same sample on every run, so that you can compare against the results of a previous run, pass the same seed value each time.
df=spark.range(100)
# Execute first time
print(df.sample(0.1,123).collect())
# Execute Second time with same seed-123
print(df.sample(0.1,123).collect())
# Execute with different seed-456
print(df.sample(0.1,456).collect())
See the Spark docs and "Stratum sampling in Spark".
What worked was using df_sample = df.limit(10).cache() or df_sample = df.limit(10).persist(). Samkart's comment pointed me in this direction.
I'm using spark-core, spark-sql, Spark-hive 2.10(1.6.1), scala-reflect 2.11.2. I'm trying to filter a dataframe created through hive context...
df = hiveCtx.createDataFrame(someRDDRow,
someDF.schema());
One of the columns that I'm trying to filter on has multiple single quotes in its values. My filter query will be something similar to
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply')");
In the Java class where this filter occurs, I tried to escape the quotes in the String variable, e.g. commentValueToFilterOut, which contains the value "That's Dany's Reply", with
commentValueToFilterOut= commentValueToFilterOut.replaceAll("'","\\\\'");
But when I apply the filter to the dataframe I get the error below...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the dataframe and copy the values from the actual column (which contains special characters that may cause issues, such as single quotes) to the new column with those special characters removed.
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Trim out the special characters from the condition and apply the filter.
String commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Now that the filter has been applied, you can drop the new column that you created for the sole purpose of filtering, which restores the original dataframe layout.
df = df.drop("comment_new");
If you don't want to create a new column in the dataframe, you can also replace the special character with some "never-happen" string literal in the same column, e.g.
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same with the string literal that you want to compare against:
String commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment= '"+commentToFilter+"')");
Once filtering is done, restore the actual value by reversing the replacement (the carets must be escaped here, because ^ is an anchor in a regex pattern):
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"\\^\\^\\^\\^", "'"));
Though this doesn't answer the actual issue, someone having the same issue can try it as a workaround.
The actual solution could be to use sqlContext (instead of hiveContext) and/or Dataset (instead of DataFrame) and/or upgrade to spark-hive 2.12.
I'll leave that to the experts to debate and answer.
PS: Thanks to KP, my lead
Is there a way to select only a few columns while importing data with readtable?
Something like the "usecols" argument of pandas read_csv:
movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_col_names, usecols=range(5))
According to this issue, https://github.com/JuliaStats/DataFrames.jl/issues/568, and as @DSM pointed out, the current implementation of DataFrames does not support this.