Datetime column coerced to int when setting with .loc and slice - pandas

I have a column of datetimes and need to change several of these values to new datetimes. When I set the values using df.loc[indices, 'col'] = new_datetimes, the unaffected values are coerced to int, while the newly set values remain datetimes. If I set the values one at a time, no type coercion occurs.
For illustration I created a sample df with just one column.
import datetime as dt
import pandas as pd

df = pd.DataFrame([dt.datetime(2019, 1, 1)] * 5)
df.loc[[1, 3, 4]] = [dt.datetime(2019, 1, 2)] * 3
df
This produces the following:
[output: the unchanged rows (0 and 2) are shown as large integers, while rows 1, 3 and 4 show the new datetimes]
If I change indices 1,3,4 individually:
df = pd.DataFrame([dt.datetime(2019, 1, 1)] * 5)
df.loc[1] = dt.datetime(2019, 1, 2)
df.loc[3] = dt.datetime(2019, 1, 2)
df.loc[4] = dt.datetime(2019, 1, 2)
df
I get the correct output:
[output: all five rows shown as datetimes, with rows 1, 3 and 4 updated to 2019-01-02]
A suggestion was to convert the list to a numpy array before setting, which does resolve the issue. However, if you try to set multiple columns (some of which are not datetime) using a numpy array, the issue arises again.
In this example the dataframe has two columns and I try to set both columns.
import numpy as np

df = pd.DataFrame({'dt': [dt.datetime(2019, 1, 1)] * 5, 'value': [1, 1, 1, 1, 1]})
df.loc[[1, 3, 4]] = np.array([[dt.datetime(2019, 1, 2)] * 3, [2, 2, 2]]).T
df
This gives the following output:
[output: the unchanged 'dt' entries are again coerced to int]
Can someone please explain what is causing the coercion and how to prevent it? The code of mine that uses this was written over a month ago and used to work just fine; could this be one of those cases where a future version of pandas deprecates certain functionality?
An explanation of what is going on would be greatly appreciated, because I have written other code that likely relies on similar functionality, and I want to make sure everything works as intended.

The solution proposed by w-m has an awkward detail: the result column also carries a time part (it didn't have one before).
I would also remark that DataFrames are tables, not Series, so they have columns, each with its own name, and it is a bad habit to rely on default column names (consecutive numbers).
So I propose another solution, addressing both of the above issues:
To create the source DataFrame I executed:
df = pd.DataFrame([dt.datetime(2019, 1, 1)]*5, columns=['c1'])
Note that I provided a name for the only column.
Then I created another DataFrame:
df2 = pd.DataFrame([dt.datetime(2019, 1, 2)] * 3, columns=['c1'], index=[1, 3, 4])
It contains your "new" dates, and I set the numbers you used in loc as the index (again with the same column name).
Then, to update df, use (not surprisingly) df.update:
df.update(df2)
This function performs in-place update, so if you print(df), you will get:
c1
0 2019-01-01
1 2019-01-02
2 2019-01-01
3 2019-01-02
4 2019-01-02
As you can see, under indices 1, 3 and 4 you have new dates
and there is no time part, just like before.
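As a side note (my own addition, not part of the original answer): update aligns on index and column labels and skips NaN entries in the frame you pass, so only the positions you actually supply are overwritten. A minimal sketch:
import numpy as np
import pandas as pd

s = pd.DataFrame({'c1': [1, 2, 3]})
patch = pd.DataFrame({'c1': [10, np.nan, 30]})
s.update(patch)
# s['c1'] now holds 10, 2, 30 -- the NaN in `patch` did not overwrite the 2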

[dt.datetime(2019,1,2)]*3 is a Python list of objects. This particular list happens to contain only datetimes, but pandas does not seem to recognize that, and treats it as what it is: a list of arbitrary objects.
If you convert it into a typed array, then Pandas will keep the original dtype of the column intact:
df.loc[[1, 3, 4]] = np.asarray([dt.datetime(2019, 1, 2)] * 3)
I hope this workaround helps you, but you may still want to file a bug with Pandas. I don't have an explanation as to why the datetime objects should be coerced to ints in the first output example.
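For the mixed-column case from the question, one possible workaround (my own sketch, not from the original answer) is to assign each column separately, so that every assignment stays within a single dtype; this reuses the imports from the question above:
# set the datetime column and the int column in two separate assignments
df = pd.DataFrame({'dt': [dt.datetime(2019, 1, 1)] * 5, 'value': [1, 1, 1, 1, 1]})
df.loc[[1, 3, 4], 'dt'] = pd.to_datetime([dt.datetime(2019, 1, 2)] * 3)
df.loc[[1, 3, 4], 'value'] = [2, 2, 2]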

SQL fill null values with another column

My problem is that I have a dataframe with null values in one column, but those nulls can be filled from another column of the same dataframe. I would like to know how to take that other column and use its values to fill in the missing data. I'm using Deepnote (https://deepnote.com).
For example:
Column A    Column B
Cell 1      Cell 2
NULL        Cell 4
My desired output:
Column A
Cell 1
Cell 4
I think it should be done with subqueries and some WHERE clause, any ideas?
Thanks for the question and welcome to StackOverflow.
It is not 100% clear which direction you need your solution to go, so I am offering two alternatives which I think should get you going.
Pandas way
You seem to be working with Pandas dataframes. The usual way to work with Pandas dataframes is to use the Pandas builtin functions. In this case, there is literally a function for filling null values: it's called fillna. We can use it to fill values from another column like this:
import pandas as pd

df_raw = pd.DataFrame(data={'Column A': ['Cell 1', None], 'Column B': ['Cell 2', 'Cell 4']})
# copy the original dataframe to a clean one
df_clean = df_raw.copy()
# apply fillna to fill null values from another column
df_clean['Column A'] = df_clean['Column A'].fillna(df_clean['Column B'])
This will make your df_clean look like what you need:
Column A
Cell 1
Cell 4
Dataframe SQL way
You mentioned "queries" and "WHERE" in your question, which suggests you might be combining the Python and SQL worlds. Enter the DuckDB world, which supports exactly this; in Deepnote we call these Dataframe SQLs.
You can query e.g. CSV files directly from these Dataframe SQL blocks, but you can also use a previously defined Dataframe.
select * from df_raw
In order to fill the null values as you are requesting, we can use standard SQL querying and a function called coalesce, as Paul correctly pointed out.
select coalesce("Column A", "Column B") as "Column A" from df_raw
This will also create what you need in SQL world. In Deepnote, specifically, this will also give you a Dataframe.
Column A
Cell 1
Cell 4
Feel free to check out my project in Deepnote with these examples, and go ahead and duplicate it if you want to iterate on the code a bit. There are also plenty of alternatives: if you're in a real SQL database and want to update existing columns, you would use an UPDATE statement, and if you are in pure Python, this is of course also possible with a loop or lambda functions, as sketched below.
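A minimal sketch of that pure-Python/lambda variant (my own illustration, reusing the df_raw frame defined above):
# row-wise fill: keep Column A where present, otherwise take Column B
df_raw['Column A'] = df_raw.apply(
    lambda row: row['Column A'] if pd.notna(row['Column A']) else row['Column B'],
    axis=1,
)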

Pandas dataframe mixed dtypes when reading csv

I am reading in a large dataframe that is throwing a DtypeWarning: Columns have mixed types (I understand this warning), but am struggling to prevent it (I don't want to set low_memory to False, as I would like to specify the correct dtypes).
For every column, the majority of rows are float values and the last 3 rows are strings (metadata, basically: information about each column). I understand that I can set the dtype per column when reading in the csv, however I do not know how to make rows 1:n float32, for example, and the last 3 rows strings. I would like to avoid reading in two separate CSVs. The resulting dtype of all columns after reading in the dataframe is 'object'. Below is a reproducible example. The dtype warning is not thrown when reading it in, I am guessing because of the size of the dataframe; however, the result is exactly the same as the problem I am facing: I would like to make the first 3 rows float32 and the last 3 strings so that they have the correct dtypes. Thank you!
Reproducible example:
import pandas as pd

df = pd.DataFrame([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3], [0.1, 0.2, 0.3],
                   ['info1', 'info2', 'info3'], ['info1', 'info2', 'info3'], ['info1', 'info2', 'info3']],
                  index=['index1', 'index2', 'index3', 'info1', 'info2', 'info3'],
                  columns=['column1', 'column2', 'column3'])
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv', index_col=0)
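One possible approach (my own sketch, assuming the metadata always occupies the last 3 rows) is to read the file once, then split it and cast the numeric block:
import pandas as pd

df_all = pd.read_csv('test.csv', index_col=0)  # mixed columns come in as object
values = df_all.iloc[:-3].astype('float32')    # numeric block, cast to float32
meta = df_all.iloc[-3:]                        # metadata rows, kept as strings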

Selecting columns from a dataframe

I have a dataframe of monthly returns for 1,000 stocks with ids as column names.
[screenshot: monthly returns dataframe]
I need to select only the columns that match the values in another dataframe which includes the ids I want.
[screenshot: permno list]
I'm sure this is really quite simple, but I have been struggling for 2 days and if someone has an easy solution it would be so very much appreciated. Thank you.
You could convert the single-column permno list dataframe (osr_curr_permnos) into a list, and then use that list to select certain columns from your main dataframe (all_rets).
To convert the osr_curr_permnos column "0" into a list, you can use .to_list().
Then, you can use that list to slice all_rets and .copy() to make a fresh copy of it into a new dataframe.
The Python code might look something like this:
keep = osr_curr_permnos['0'].to_list()
selected_rets = all_rets[keep].copy()
"keep" would be a list, and "selected_rets" would be your new dataframe.
If there's a chance that osr_curr_permnos would have duplicates, you'll want to filter those out:
keep = osr_curr_permnos['0'].drop_duplicates().to_list()
selected_rets = all_rets[keep].copy()
As I expected, the answer was simpler than I was making it. Basically, I needed to take the integer values in my permnos list and recast them as strings.
osr_curr_permnos['0'] = osr_curr_permnos['0'].apply(str)
keep = osr_curr_permnos['0'].values
Then I can use that to select columns from my returns dataframe which had string values as column headers.
all_rets[keep]
It was all just a mismatch of int vs. string.
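As a possible hardening (my own suggestion, not part of the original answers), Index.intersection avoids a KeyError when some ids are missing from the returns frame:
# keep only the ids that actually appear as columns in all_rets
keep = osr_curr_permnos['0'].astype(str)
selected_rets = all_rets[all_rets.columns.intersection(keep)].copy()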

pandas merge produce duplicate columns

from pandas import DataFrame

n1 = DataFrame({'zhanghui': [1, 2, 3, 4], 'wudi': [17, 'gx', 356, 23], 'sas': [234, 51, 354, 123]})
n2 = DataFrame({'zhanghui_x': [1, 2, 3, 5], 'wudi': [17, 23, 'sd', 23], 'wudi_x': [17, 23, 'x356', 23], 'wudi_y': [17, 23, 'y356', 23], 'ddd': [234, 51, 354, 123]})
The code above defines two DataFrame objects. I want to use the 'zhanghui' field from n1 and the 'zhanghui_x' field from n2 as the "on" fields to merge n1 and n2, so my code looks like this:
n1.merge(n2, how='inner', left_on='zhanghui', right_on='zhanghui_x')
and the resulting columns are given like this:
sas wudi_x zhanghui ddd wudi_y wudi_x wudi_y zhanghui_x
Some duplicate columns appear, such as 'wudi_x' and 'wudi_y'.
So is this an internal pandas problem, or am I using pd.merge incorrectly?
From the pandas documentation, the merge() function has the following signature:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
where suffixes denotes the suffix strings to be attached to overlapping columns, with defaults '_x' and '_y'.
I'm not sure if I understood your follow-up question correctly, but;
#case1
if the first dataFrame has a column 'column_name_x' and the second dataFrame has a column 'column_name', then there are no overlapping columns and therefore no suffixes are attached.
#case2
if the first dataFrame has columns 'column_name' and 'column_name_x', and the second dataFrame also has a column 'column_name', the default suffixes attach to the overlapping columns; the first frame's 'column_name' therefore becomes 'column_name_x', resulting in a duplicate of an already existing column.
You can, however, pass a None value for one (not both) of the suffixes to ensure that the column names of that dataFrame remain as-is.
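A minimal sketch of that (my own example; it assumes a reasonably recent pandas version, where suffixes accepts None for one side):
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'val': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2], 'val': ['c', 'd']})
# leave the left-hand 'val' untouched, suffix only the right-hand one
merged = left.merge(right, on='key', suffixes=(None, '_right'))
# merged.columns: ['key', 'val', 'val_right']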
Your approach is right: when merging, pandas automatically appends a suffix (_x, _y, etc.) to columns that would otherwise be "duplicated" against the original headers.
You can first select which columns to merge and then proceed:
cols_to_use = n2.columns - n1.columns
n1.merge(n2[cols_to_use], how='inner', left_on='zhanghui', right_on='zhanghui_x')
result columns:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
When I tried to run cols_to_use = n2.columns - n1.columns, it gave me a TypeError like this:
cannot perform __sub__ with this index type: <class 'pandas.core.indexes.base.Index'>
then I tried to use code below:
cols_to_use = [i for i in list(n2.columns) if i not in list(n1.columns)]
It worked fine, and the resulting columns are given like this:
sas wudi zhanghui ddd wudi_x wudi_y zhanghui_x
So @S Ringne's method really resolved my problem.
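(As a side note, my own addition: in recent pandas versions the supported spelling of that set difference is Index.difference, which does the same thing without the TypeError:)
cols_to_use = n2.columns.difference(n1.columns)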
=============================================
Pandas simply adds a suffix such as '_x' to resolve the duplicate-column-name problem when merging two Frame objects.
But what happens if a name of the form 'a-column-name' + '_x' already appears in either Frame object? I used to think pandas would check whether such a name already exists, but apparently it doesn't have this check?
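A small demonstration of that situation (my own sketch; the exact behavior is version-dependent, so treat it as illustrative):
import pandas as pd

a = pd.DataFrame({'key': [1], 'wudi': ['a']})
b = pd.DataFrame({'key': [1], 'wudi': ['b'], 'wudi_x': ['c']})
# the overlapping 'wudi' gets suffixed to 'wudi_x'/'wudi_y', colliding with
# the pre-existing 'wudi_x'; older pandas silently produced duplicate column
# names here, while recent versions raise a MergeError for exactly this case
m = a.merge(b, on='key')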

matplotlib: value error x and y have different dimensions

Having some difficulty plotting out values grouped by a text/name field and along a range of dates. The issue is that while I can group by the name and generate plots for some of the date ranges, there are instances where the grouping contains missing date values (just the nature of the overall dataset).
That is to say, I may very well be able to plot a date_range('10/1/2013', '10/31/2013') for SOME of the grouped values, but there are instances where there is no '10/15/2013' within that range, which therefore throws the error mentioned in the title of this post.
Thanks for any input!
import numpy as np
import pandas as pd
from pandas import date_range
import matplotlib.pyplot as plt

plt.rcParams['legend.loc'] = 'best'
dtable = pd.io.parsers.read_table(str(datasource), sep=',')
unique_keys = np.unique(dtable['KEY'])
index = date_range(d1frmt, d2frmt)
for key in unique_keys:
    values = dtable[dtable['KEY'] == key]
    plt.figure()
    plt.plot(index, values['VAL'])  # <-- can fail if index is missing a date
    plt.xlim(xmin=d1frmt, xmax=d2frmt)
    plt.xticks(rotation=270)
    plt.xticks(size='small')
    plt.legend(['H20'])
    plt.ylabel('Head (ft)')
    plt.title('Well {0}'.format(key))
    fig = str('{0}.png'.format(key))
    out = str(outputloc) + "\\" + str(fig)
    plt.savefig(out)
    plt.close()
You must have a date column, or index, in your dtable; otherwise you don't know which entries of values['VAL'] belong to which date.
If you do, there are two ways.
Since you make a subset based on a key, you can either use the index of that subset (if it's a datetime!):
plt.plot(values.index.to_pydatetime(), values['VAL'])
or reindex the subset to your "target" range:
values = values.reindex(index)
plt.plot(index.to_pydatetime(), values['VAL'])
By default, reindex inserts NaN values as missing data.
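For completeness, a sketch of how such a datetime index might be set up when loading the file (my own illustration; the 'DATE' column name is a placeholder for whatever the real date column is called):
import pandas as pd

# parse the date column while reading and make it the index, so each
# per-key subset carries a DatetimeIndex
dtable = pd.read_csv(str(datasource), sep=',', parse_dates=['DATE'], index_col='DATE')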
It would be easier if you gave a working example; it's a bit hard to answer without knowing what your DataFrame looks like.