I have a dataframe in spark which has missing values.
I am trying to delete the columns which have more than 50% missing values
See below code:
frac = fil_data.count() * .50
print(frac)
t_data = fil_data.dropna(thresh=390951)
print(t_data.count())
The print count gives me 0
why is this happening?
How can we resolve this
I researched a bit before posting, everyone uses the plain simple dropna(how = any or all) instead of threshold
Note that dropna will always drop rows, not columns.
For the correct use of the thresh option see the docs:
thresh – int, default None If specified, drop rows that have less than thresh non-null values. This overwrites the how parameter.
So you drop all rows having less than 390951 non-null values, which is probably all since you don't have 400k cols I assume
Related
I would like to slice my dataframe using iloc (rather than loc) + some condition based on one of the dataframe's columns and assign a value to all the items in this slice (which is effectively a subset of the main dataframe).
My simplified attempt:
df.iloc[:, 1:21][df['column1'] == 'some_value'] = 1
This is meant to take a slice of the dataframe:
All rows;
Columns 2 to 20;
Then slice it again:
Only the rows where column1 = some_value.
The slicing works fine, but equalling this to 1 doesn't work. Nothing changes in df and I get this warning
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
I really need to use iloc rather than loc if possible. It feels like there should be a way of doing this?
You can search for the error on SO. In short, you should update on one single loc/iloc:
df.loc[df['column1']=='some_value', df.columns[1:21]] = 1
I have a pandas dateframe of two columns ['company'] which is a string and ['publication_datetime'] which is a datetime.
I want to group by company and the publication_date , adding a new column with the maximum publication_datetime for each record.
so far i have tried:
issuers = news[['company','publication_datetime']]
issuers['publication_date'] = issuers['publication_datetime'].dt.date
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max()
my group by does not appear to work.
i get the following error
ValueError: Wrong number of items passed 3, placement implies 1
You need the transform() method to cast the result in the original dimension of the dataframe.
issuers['max'] = issuers.groupby(['company', 'publication_date'])['publication_datetime'].transform('max')
The result of your groupby() before was returning a multi-indexed group object, which is why it's complaining about 3 values (first group, second group, and then values). But even if you just returned the values, it's combining like groups together, so you'll have fewer values than needed.
The transform() method returns the group results for each row of the dataframe in a way that makes it easy to create a new column. The returned values are an indexed Series with the indices being the original ones from the issuers dataframe.
Hope this helps! Documentation for transform here
The thing is by doing what you are doing you are trying to set a DataFrame to a column value.
Doing the following will get extract only the values without the two indexe columns:
issuers['publication_datetime_max'] = issuers.groupby(['company','publication_date'], as_index=False)['publication_datetime'].max().tolist()
Hope this help !
I'm using read_csv to make a df, but the csv includes some garbage rows before the actual columns, the actual columns are located say in the 5th rows in the csv.
Here's the thing, I don't know how many garbage rows are there in advance and I can only read_csv once, so I can't use "head" or "skiprows" in read_csv.
So my question is how to select a different row as the columns in the df or just delete the first n rows including the columns? If I were to use "df.iloc[3:0]" the columns are still there.
Thanks for your help.
EDIT: Updated so that it also resets the index and does not include an index name:
df.columns = df.iloc[4].values
df = df.iloc[5:].reset_index(drop=True)
If you know your column names start in row 5 as in your example, you can do:
df.columns = df.iloc[4]
df = df.iloc[5:]
If the number of garbage rows is determined, then you can use 'iloc', example the number of garbage rows is 3 firs rows (index 0,1,2), then you can use the following code to get all remaining actual data rows:
df=df.iloc[3:]
If the number of garbage rows is not determined, then you must search the index of first actual data rows from the garbage rows. so you can find the first index of actual data rows and can be used to get all remaining data rows.
df=df.iloc[n:]
n=fisrt index of actual data
I have a column of datetimes and need to change several of these values to new datetimes. When I set the values using df.loc[indices, 'col'] = new_datetimes, the unaffected values are coerced to int while the new set values are in datetime. If I set the values one at a time, no type coercion occurs.
For illustration I created a sample df with just one column.
df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[[1,3,4]] = [dt.datetime(2019,1,2)]*3
df
This produces the following:
output
If I change indices 1,3,4 individually:
df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[1] = dt.datetime(2019,1,2)
df.loc[3] = dt.datetime(2019,1,2)
df.loc[4] = dt.datetime(2019,1,2)
df
I get the correct output:
output
A suggestion was to turn the list to a numpy array before setting, which does resolve the issue. However, if you try to set multiple columns (some of which are not datetime) using a numpy array, The issue arises again.
In this example the dataframe has two columns and I try to set both columns.
df = pd.DataFrame({'dt':[dt.datetime(2019,1,1)]*5, 'value':[1,1,1,1,1]})
df.loc[[1,3,4]] = np.array([[dt.datetime(2019,1,2)]*3, [2,2,2]]).T
df
This gives the following output:
output
Can someone please explain what is causing the coercion and how to prevent it from doing so? The code I wrote that uses this was written over a month ago and used to work just fine, could it be one of those warnings about future version of pandas deprecating certain functionalities?
An explanation of what is going on would be greatly appreciated because I wrote a other codes that likely employ similar functionality want to make sure everything works as intended.
The solution proposed by w-m has such an "awkward detail" than
the result column has also the time part (it didn't have it
before).
I have also such a remark, that DataFrames are tables not Series,
so they have columns, each with its name and it is a bad habit to
rely on default column names (consecutive numbers).
So I propose another solution, addressing both above issues:
To create the source DataFrame I executed:
df = pd.DataFrame([dt.datetime(2019, 1, 1)]*5, columns=['c1'])
Note that I provided a name for the only column.
Then I created another DataFrame:
df2 = pd.DataFrame([dt.datetime(2019,1,2)]*3, columns=['c1'], index=[1,3,4])
It contains your "new" dates and the numbers which you used in loc
I set as the index (again with the same column name).
Then, to update df, use (not surprisingly) df.update:
df.update(df2)
This function performs in-place update, so if you print(df), you will get:
c1
0 2019-01-01
1 2019-01-02
2 2019-01-01
3 2019-01-02
4 2019-01-02
As you can see, under indices 1, 3 and 4 you have new dates
and there is no time part, just like before.
[dt.datetime(2019,1,2)]*3 is a Python list of objects. This particular list happens to contain only datetimes, but Pandas does not seem to recognize that, and treats it as it is - a list of any kind of objects.
If you convert it into a typed array, then Pandas will keep the original dtype of the column intact:
df.loc[[1,3,4]] = np.asarray([dt.datetime(2019,1,2)]*3)
I hope this workaround helps you, but you may still want to file a bug with Pandas. I don't have an explanation as to why the datetime objects should be coerced to ints in the first output example.
I am trying to preset the dimensions of my data frame in pandas so that I can have 500 rows by 300 columns. I want to set it before I enter data into the dataframe.
I am working on a project where I need to take a column of data, copy it, shift it one to the right and shift it down by one row.
I am having trouble with the last row being cut off when I shift it down by one row (eg: I started with 23 rows and it remains at 23 rows despite the fact that I shifted down by one and should have 24 rows).
Here is what I have done so far:
bolusCI = pd.DataFrame()
##set index to very high number to accommodate shifting row down by 1
bolusCI = bolus_raw[["Activity (mCi)"]].copy()
activity_copy = bolusCI.shift(1)
activity_copy
pd.concat([bolusCI, activity_copy], axis =1)
Thanks!
There might be a more efficient way to achieve what you are looking to do, but to directly answer your question you could do something like this to init the DataFrame with certain dimensions
pd.DataFrame(columns=range(300),index=range(500))
You just need to define the index and columns in the constructor. The simplest way is to use pandas.RangeIndex. It mimics np.arange and range in syntax. You can also pass a name parameter to name it.
pd.DataFrame
pd.Index
df = pd.DataFrame(
index=pd.RangeIndex(500),
columns=pd.RangeIndex(300)
)
print(df.shape)
(500, 300)