Convert pandas dataframe with missing values from object to int - pandas

I am trying to change the values in the data frame below to ints so I can convert these hrs/mins/secs times into a single number of hours (e.g. for row two, hrs_cor would equal 5.5).
      hrs   mins  secs
0     None  None  None
1     None  None  None
2     5     30    00
3     5     22    30
4     8     00    00
...   ...   ...   ...
1052  None  None  None
1053  None  None  None
1054  None  None  None
1055  None  None  None
1056  None  None  None
The issue I am running into is converting the data frame into numeric values, and I think it is due to the empty cells. So far I have tried variations of the code below:
MID_calc['hrs'] = MID_calc.to_numeric(MID_calc['hrs'], errors='coerce').astype('INT46')
And this error is returned:
AttributeError: 'DataFrame' object has no attribute 'to_numeric'
Currently, all values are objects
hrs object
mins object
secs object
dtype: object
I have looked through several posts, but nothing seems to be working. Any help would be greatly appreciated!

You need to use pd.to_numeric, which is a top-level pandas function, not a DataFrame method:
import pandas as pd
MID_calc['hrs'] = pd.to_numeric(MID_calc['hrs'], errors='coerce').astype('Int64')
Don't just copy the code directly from the website; understand what it does.
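For reference, here is a minimal sketch of the whole conversion, including the hrs_cor calculation the question mentions (the sample data and the loop over the three columns are assumptions for illustration):
import pandas as pd

# A minimal sketch, assuming object columns with numeric strings and None,
# like the frame shown in the question. The nullable 'Int64' dtype keeps
# integer values even though some rows are missing.
MID_calc = pd.DataFrame({
    'hrs':  [None, None, '5', '5', '8'],
    'mins': [None, None, '30', '22', '00'],
    'secs': [None, None, '00', '30', '00'],
})
for col in ['hrs', 'mins', 'secs']:
    MID_calc[col] = pd.to_numeric(MID_calc[col], errors='coerce').astype('Int64')

# Decimal hours, e.g. 5 h 30 min -> 5.5 (the hrs_cor value mentioned above)
MID_calc['hrs_cor'] = MID_calc['hrs'] + MID_calc['mins'] / 60 + MID_calc['secs'] / 3600
print(MID_calc)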

Related

How to add a character to every value in a dataframe without losing the 2D structure

Today my problem is this: I have a dataframe of 300 x 41. It's encoded with numbers. I want to append an 'a' to each value in the dataframe so that another downstream program will not fuss about these being 'continuous variables', which they aren't; they are factors. Simple, right?
Every way I can think of to do this, though, returns a dataframe or object that is not 300 x 41, but just one long list of altered values:
Please end this headache for me. How can I do this in a way that returns a 300 x 41 altered output?
> dim(x)
[1] 300 41
> x2 <- sub("^", "a", x)
> dim(x2)
[1] 12300 1
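The question above is about R, but since the rest of this page is pandas, here is a hedged sketch of the same shape-preserving idea in pandas (the frame and values are made up for illustration):
import pandas as pd

# Prefix every value with 'a' element-wise; the 2-D shape is preserved,
# unlike operations that flatten the frame into one long vector.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df2 = 'a' + df.astype(str)
print(df2.shape)   # (2, 3) -- still two dimensions
print(df2)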

Removing the .0 from a pandas column

After a simple merge of two dataframes, the following column becomes an object and a ".0" is added at the end for no apparent reason. I tried replacing the NaN values with an integer and then converting the whole column to an integer, hoping the .0 would be gone. The code runs but it doesn't actually change the dtype of that column. I also tried removing the .0 with rstrip, but that removes everything, and even values like 249123.0 become NaN, which doesn't make sense. I know this is a very basic issue, but I am not sure what else I could try at this point.
Input:
Age ID
22 23105.0
34 214541.0
51 0
8 62341.0
Desired output:
Age ID
22 23105
34 214541
51 0
8 62341
Any ideas would be much appreciated.
One of the ways to get rid of the trailing .0 in an object column is to use pandas.Series.replace with a regex:
import numpy as np
df['ID'] = df['ID'].replace(r'\.0$', '', regex=True).astype(np.int64)
# Output :
print(df)
Age ID
0 22 23105
1 34 214541
2 51 0
3 8 62341
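An alternative sketch, assuming the .0 appeared because the merge introduced NaN and forced the column to float: the nullable 'Int64' dtype avoids the float cast in the first place (the sample frame below is made up to match the example):
import pandas as pd

# Whole-number floats cast cleanly to the nullable integer dtype, and any
# NaN left over from the merge becomes <NA> instead of forcing floats.
df = pd.DataFrame({'Age': [22, 34, 51, 8],
                   'ID': [23105.0, 214541.0, 0, 62341.0]})
df['ID'] = df['ID'].astype('Int64')
print(df)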

DateTime Between DateTime Range in Separate DataFrames

*Edited for clarity
I need to find values from ['DateA'], ['DateB'], etc. that fall between a range of dates ['Start'] and ['End'], and return a running count to ['A'], ['B'], etc. for every time ['DateA'], ['DateB'], etc. falls within the ['Start']/['End'] range.
The start and end datetimes are overlapping 53-week increments. If any of the dates (A, B, C, or D) are within the 53-week window, the script should count those datetimes and return an integer. To add to the confusion, DateD should only be counted if its datetime coincides with a value of "Fail" in the Pass_Fail element. As an example:
Expected Output:
Start End A B C D
10/30/19 11/04/20 0 0 4 3
11/06/19 11/11/20 1 0 3 3
11/13/19 11/18/20 1 0 3 3
Dates DataFrame (simplified)
- the coinciding fail dates below are 02/13/20, 06/21/20, and 07/15/20 (hence the "3" in column D of the Expected Output above).
DateA DateB DateC DateD Pass_Fail
0 11/07/20 None 06/21/20 02/09/20 Pass
1 None None 06/11/20 12/14/19 Pass
2 None None 09/21/19 03/26/20 Pass
3 None None 03/20/20 02/13/20 Fail
4 None None 08/16/20 06/21/20 Fail
5 None None None 01/06/20 Pass
6 None None None 04/03/20 Pass
7 None None None 07/15/20 Fail
8 None None None 02/20/20 Pass
9 None None None 03/22/20 Pass
10 None None None 11/15/19 Pass
I'm certain this is simple, but I'm obviously just starting out and couldn't locate a direct answer or problem-solve this myself.
Many thanks!
-DD
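No answer is shown here, but a hedged sketch of one way to approach this in pandas (the frame names, the simplified data, and the interpretation of "running count" are all assumptions):
import pandas as pd

# For each Start/End window, count how many dates in a column fall inside it;
# DateD only counts rows whose Pass_Fail is 'Fail', as described above.
ranges = pd.DataFrame({'Start': ['10/30/19', '11/06/19'],
                       'End':   ['11/04/20', '11/11/20']})
dates = pd.DataFrame({'DateC': ['06/21/20', '09/21/19', '03/20/20', None],
                      'DateD': ['02/09/20', '02/13/20', '06/21/20', '07/15/20'],
                      'Pass_Fail': ['Pass', 'Fail', 'Fail', 'Fail']})

ranges['Start'] = pd.to_datetime(ranges['Start'], format='%m/%d/%y')
ranges['End'] = pd.to_datetime(ranges['End'], format='%m/%d/%y')
for col in ['DateC', 'DateD']:
    dates[col] = pd.to_datetime(dates[col], format='%m/%d/%y')

ranges['C'] = ranges.apply(lambda r: dates['DateC'].between(r['Start'], r['End']).sum(), axis=1)
failed = dates.loc[dates['Pass_Fail'] == 'Fail', 'DateD']
ranges['D'] = ranges.apply(lambda r: failed.between(r['Start'], r['End']).sum(), axis=1)
print(ranges)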

Unexpected groupby result: some rows are missing

I am facing an issue with transforming my data using pandas' groupby. I have a table (several million rows, 3 variables) that I am trying to group by the "Date" variable.
Snippet from a raw table:
Date V1 V2
07_19_2017_17_00_06 10 5
07_19_2017_17_00_06 20 6
07_19_2017_17_00_08 15 3
...
01_07_2019_14_06_59 30 1
01_07_2019_14_06_59 40 2
The goal is to group rows with the same value of "Date", applying a mean function over V1 and a sum function over V2, so that the expected result resembles:
Date V1 V2
07_19_2017_17_00_06 15 11 # This row has changed
07_19_2017_17_00_08 15 3
...
01_07_2019_14_06_59 35 3 # and this one too!
My code:
df = df.groupby(['Date'], as_index=False).agg({'V1': 'mean', 'V2': 'sum'})
The output I am getting, however, is totally unexpected and I can't find a reasonable explanation for why it happens. It seems like pandas is only processing data from 01_01_2018_00_00_01 to 12_31_2018_23_58_40, instead of 07_19_2017_17_00_06 to 01_07_2019_14_06_59.
Date V1 V2
01_01_2018_00_00_01 30 3
01_01_2018_00_00_02 20 4
...
12_31_2018_23_58_35 15 3
12_31_2018_23_58_40 16 11
If you have any clue, I would really appreciate your input. Thank you!
I suspect that the issue is that pandas does not recognize the date format I used, so the grouped output is ordered by the raw strings rather than by actual dates (the 2017 and 2019 rows are still there, just not at the ends). A solution turned out to be quite simple: convert all of the dates into UNIX timestamps, divide by 60, and then repeat the groupby procedure.
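A sketch of the same idea (parsing the underscore format explicitly is the key step; grouping on the parsed datetime instead of the raw string is an assumption about what the UNIX-time trick ultimately achieves):
import pandas as pd

# Parse the MM_DD_YYYY_HH_MM_SS strings into real datetimes before grouping,
# so the result is ordered chronologically instead of lexicographically.
df = pd.DataFrame({'Date': ['07_19_2017_17_00_06', '07_19_2017_17_00_06',
                            '07_19_2017_17_00_08'],
                   'V1': [10, 20, 15],
                   'V2': [5, 6, 3]})
df['Date'] = pd.to_datetime(df['Date'], format='%m_%d_%Y_%H_%M_%S')
df = df.groupby('Date', as_index=False).agg({'V1': 'mean', 'V2': 'sum'})
print(df)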

When I sort a DataFrame and then write it to a CSV file, the values get converted to float. How do I avoid this?

I have a df:
Id Reputation
1 50
3 20
2 10
If I sort by reputation I have
Id Reputation
2 10
3 20
1 50
After executing
df.to_csv("output.csv", index = False)
In my CSV file :
Id,Reputation
2,10.0
3,20.0
1,50.0
How to avoid this?
I doubt that sorting changed the dtype. You can check (before sorting):
print(df.dtypes)
I would guess Reputation was already a float.
Anyway, you can always change dtypes before to_csv:
df['Reputation'] = df['Reputation'].astype(int)
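A small sketch illustrating the point above (the sample frame is made up; the trailing .0 comes from the column already being float64, not from the sort):
import pandas as pd

# Reputation is float64 before any sorting happens; cast it back to int
# right before writing and the CSV keeps plain integers.
df = pd.DataFrame({'Id': [1, 3, 2], 'Reputation': [50.0, 20.0, 10.0]})
print(df.dtypes)
df = df.sort_values('Reputation')
df['Reputation'] = df['Reputation'].astype(int)
df.to_csv('output.csv', index=False)   # 10, 20, 50 -- no trailing .0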