I have a .csv file with many columns whose values are formulas like this:
="010",
="011"
By default, when I load it into a pandas DataFrame, the formulas are stored as they are. How can I convert these values across the whole DataFrame to plain values:
010,
011
You can use Series.apply() with a custom function.
Here is a very basic example that should work with the data you have shown.
def to_float(item):
    # drop the leading '=' and the surrounding double quotes, then convert
    return float(item[1:].strip('"'))
df['COUNTRY CODE'] = df['COUNTRY CODE'].apply(to_float)
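The function to_float() is just an example; you decide how to implement the custom function that does the transformation. Note that converting to float turns "010" into 10.0, so return the stripped string instead if the leading zero must be kept.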
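If the values should stay as text across the whole DataFrame (so that 010 remains 010), one option is to strip only the ="..." wrapper from every cell. This is a minimal sketch under that assumption; the helper name strip_formula and the sample frame are made up for illustration.

```python
import pandas as pd

def strip_formula(value):
    """Turn '="010"' into '010'; leave anything else untouched."""
    if isinstance(value, str) and value.startswith('="') and value.endswith('"'):
        return value[2:-1]
    return value

# Hypothetical sample data mimicking what the CSV produces.
df = pd.DataFrame({
    "COUNTRY CODE": ['="010"', '="011"'],
    "NAME": ["Alpha", "Beta"],
})

# Apply the helper to every cell, column by column.
cleaned = df.apply(lambda col: col.map(strip_formula))
print(cleaned["COUNTRY CODE"].tolist())
```

Cells that do not match the pattern are returned unchanged, so the same call can be run over mixed columns.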
I read in the excel file like so:
data = sc.read_excel('/Users/user/Desktop/CSVB.xlsx', sheet='Sheet1', dtype=object)
There are 3 columns in this data set that I need to work with as .obs but it looks like everything is in the .X data matrix.
Has anyone successfully subset the data after reading in the file, or is there something I need to do beforehand?
Okay, so assuming sc stands for the scanpy package, read_excel simply takes the first row as .var and the first column as .obs of the AnnData object.
The data returned by read_excel can be tweaked a bit to get what you want.
Let's say the indices of the three columns you want in .obs are stored in a variable idx.
idx = [1,2,4]
Now, .obs is just a Pandas DataFrame, and data.X is just a Numpy matrix (see here). Thus, the job is simple.
# assign some names to the new columns
new_col_names = ['C1', 'C2', 'C3']
# add the columns to data.obs
data.obs[new_col_names] = data.X[:,idx]
If you also want to remove the idx columns from data.X, I suggest creating a new AnnData object for this.
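The column split itself can be sketched with plain NumPy and pandas, without depending on scanpy. Here X stands in for data.X and obs for data.obs; the sample values, the row labels, and the name X_rest are assumptions for illustration. Building the new AnnData object from X_rest and obs is left out.

```python
import numpy as np
import pandas as pd

# Stand-in for data.X: 3 rows, 5 columns, read with dtype=object.
X = np.array([
    [1.0, "a", "x", 2.0, "p"],
    [3.0, "b", "y", 4.0, "q"],
    [5.0, "c", "z", 6.0, "r"],
], dtype=object)

idx = [1, 2, 4]                     # columns that should become annotations
new_col_names = ["C1", "C2", "C3"]

# Stand-in for data.obs: one row per observation.
obs = pd.DataFrame(index=["row0", "row1", "row2"])
for name, col in zip(new_col_names, idx):
    obs[name] = X[:, col]

# Everything that was not moved stays in the data matrix.
X_rest = np.delete(X, idx, axis=1).astype(float)
print(obs)
print(X_rest)
```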
Coming from Python, I started using Julia for its speed in a big-data project. When reading data from .xlsx files, the datatype in each column is "any", despite most of the data being integers or floats.
Is there any Julia-way of inferring the datatypes in a DataFrame (like df = infertype.(df))?
This may be difficult in Julia, given the reduced flexibility on datatypes, but any tips on how to accomplish it would be appreciated. Assume that, ex ante, I do not know which column is which, but the types can only be int, float, string or date.
using DataFrames
using XLSX
df = DataFrame(XLSX.readtable("MyFile.xlsx", "Sheet1")...)
You can just do:
df = DataFrame(XLSX.readtable("MyFile.xlsx", "Sheet1"; infer_eltypes=true)...)
Additionally, it is worth knowing that typing ? in the Julia REPL before a command shows its help, which can contain this kind of information:
help?> XLSX.readtable
readtable(filepath, sheet, [columns]; [first_row], [column_labels], [header], [infer_eltypes], [stop_in_empty_row], [stop_in_row_function]) -> data, column_labels
Returns tabular data from a spreadsheet as a tuple (data, column_labels). (...)
(...)
Use infer_eltypes=true to get data as a Vector{Any} of typed vectors. The default value is infer_eltypes=false.
(...)
I need to save a bunch of PySpark DataFrames as csv tables. The tables should also have the same names as the DataFrames.
The code should be something like this:
for table in ['ag01','a5bg','h68chl', 'df42', 'gh63', 'ur55', 'ui99']:
    ppath='hdfs://hadoopcentralprodlab01/..../data/'+table+'.csv'
    table.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)
The problem is that in the command "table.repartition(1)..." I need the actual DataFrame objects, not their names as strings, so in this form the code doesn't work. But if I write "for table in [ag01,a5bg,...]", without quotes in the list, I cannot build the path, because I cannot concatenate a DataFrame and a string. How can I resolve this dilemma?
Thanks in advance!
Having a bunch of separate variable names is not considered good coding practice. You should have used a list or a dictionary in the first place. But if you're already stuck with this, you can use eval to get the DataFrame stored in each variable.
for table in ['ag01', 'a5bg', 'h68chl', 'df42', 'gh63', 'ur55', 'ui99']:
    ppath = 'hdfs://hadoopcentralprodlab01/..../data/'+table+'.csv'
    df = eval(table)  # look up the DataFrame bound to this variable name
    df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").save(ppath)
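The dictionary alternative mentioned above could look roughly like this. It is only a sketch: the helper name csv_targets and the base directory /tmp/data are hypothetical, and plain object() placeholders stand in for the PySpark DataFrames so that the snippet runs without a Spark session.

```python
def csv_targets(frames, base_dir):
    """Pair each DataFrame with an output path derived from its dictionary key."""
    return [(df, f"{base_dir}/{name}.csv") for name, df in frames.items()]

# Placeholders; in real code the values would be the PySpark DataFrames.
frames = {"ag01": object(), "a5bg": object(), "h68chl": object()}

for df, ppath in csv_targets(frames, "/tmp/data"):
    # With real DataFrames, the write chain from the answer would go here,
    # called on df and ending in .save(ppath).
    print(ppath)
```

Because the name and the DataFrame travel together as key and value, no eval is needed to get from one to the other.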
I have a checkout column in a DataFrame, of type 'object', with values in the format '2017-08-04T23:31:19.000+02:00'.
But I want it in the format shown in the image.
Can anyone help me please.
Thank you:)
You should be able to convert the object column to a date time column, then use the built in date and time functions.
# create an intermediate column that we won't store on the DataFrame
checkout_as_datetime = pd.to_datetime(df['checkout'])
# Add the desired columns to the dataframe
df['checkout_date'] = checkout_as_datetime.dt.date
df['checkout_time'] = checkout_as_datetime.dt.time
Though, if your goal isn't to write these specific new columns out somewhere, but to use them for other calculations, it may be simpler to just overwrite your original column and use the datetime methods from there.
df['checkout'] = pd.to_datetime(df['checkout'])
df['checkout'].dt.date # to access the date
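As a quick check of the approach, here is a small self-contained run on the sample value from the question. The one-row frame is made up for illustration; the column names follow the snippet above.

```python
import datetime
import pandas as pd

# One-row frame holding the timestamp string from the question.
df = pd.DataFrame({"checkout": ["2017-08-04T23:31:19.000+02:00"]})

# Parse the ISO 8601 strings; the +02:00 offset is kept on the result.
checkout_as_datetime = pd.to_datetime(df["checkout"])

# Split into separate date and time columns.
df["checkout_date"] = checkout_as_datetime.dt.date
df["checkout_time"] = checkout_as_datetime.dt.time

print(df[["checkout_date", "checkout_time"]])
```

The date and time here are the local wall-clock values in the +02:00 offset, not values converted to UTC.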
I haven't tested this, but something along the lines of:
checkout = pd.to_datetime(df['checkout'])
df['CheckOut_date'] = checkout.dt.strftime('%Y-%m-%d')
df['CheckOut_time'] = checkout.dt.strftime('%H:%M:%S')
Is there a way to read date ("2000-01") variables from text files into a Julia DataFrame directly, as dates? There's no documentation on this from what I have seen.
df = readtable("path/dates.txt", eltypes = [Date, Date])
This doesn't work, even though it seems like it should. My usual process is to read the dates in as strings and then loop over each row to create a new date variable. This has now become a bottleneck in some of my processes, due to the size of the DataFrames.
My usual flow is to do something like this:
full_df[:real_date] = Date(full_df[:temp_dte_string], "m/d/y")
Thank you
I don't think there's currently any way to do the loading in a single step like your first suggested code. However you can speed up the second method somewhat by making a DateFormat object and calling Date with that instead of with a string.
(This is mentioned briefly here.)
dfmt = Dates.DateFormat("m/d/y")
full_df[:real_date] = Date(full_df[:temp_dte_string], dfmt)
(For some reason I thought Date was not vectorized and have been doing this inside a for loop in all my code. Whoops.)
By "delete a variable", do you mean delete a column or a row? If you mean the former, then there are a few other ways to do this, including things like
function vectorin(a, b) #IMHO this should be in base
    bset = Set(b)
    [i in bset for i in a]
end
df = DataFrame(A1="", A2="", A3="", D="", E="", F="") #Some long list of columns
badCols = [:D, :F] #Some long list of columns you want to remove
df = df[names(df)[!vectorin(names(df), badCols)]]
Sometimes I read in csv files with a lot of extra columns, then just do something like
df = readtable("data.csv")
df = df[[:Only, :the, :cols, :I, :want]]