how to convert a column of pandas series without the header - pandas

It is quite odd, as I hadn't experienced this issue with data-series conversion until now.
So I have wind speed data by date & hour at different heights, retrieved from NREL.
file09 = 'wind/wind_yr2009.txt'
wind09 = pd.read_csv(file09, encoding = "utf-8", names = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])
file10 = 'wind/wind_yr2010.txt'
wind10 = pd.read_csv(file10, encoding = "utf-8", names = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])
I merge the readings of the two .txt files below:
wind = pd.concat([wind09, wind10], join='inner')
Then I drop the duplicate headings:
wind = wind.reset_index().drop_duplicates(keep='first').set_index('index')
print(wind['HOUR-MST'])
Printing returns something like the following:
index
0 HOUR-MST
1 1
2 2
I wasn't sure at first, but apparently index 0 holds 'HOUR-MST', which is the column heading. pandas does recognize the column, since I can still access its data by that header. Yet when I try converting it to int,
temp = hcodebook.iloc[wind['HOUR-MST'].astype(int) - 1]
both of the errors below were raised (the second after I later tried converting to float instead):
ValueError: invalid literal for int() with base 10: 'HOUR-MST'
ValueError: could not convert string to float: 'HOUR-MST'
Using try/except in a for loop, I verified that only index 0 holds a string.
I think the reason is that I didn't use the sep parameter when reading these files, as that is the only difference from my previous attempts with other files, where the conversion did not trouble me.
Yet that doesn't really tell me how to address it.
Kindly advise.

MCVE:
from io import StringIO
import pandas as pd
cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s+')
Header included in data:
a b c d
0 A B C D
1 1 2 3 4
2 5 6 7 8
Use skiprows to avoid getting headers:
from io import StringIO
import pandas as pd

cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s+', skiprows=1)
No headers:
a b c d
0 1 2 3 4
1 5 6 7 8
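Applied back to the original wind files, the same fix would look roughly like the sketch below. This is only an illustration under the assumption that each .txt file carries its own header row; the paths and column names are reused from the question.
import pandas as pd

cols = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]',
        'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)']

# skiprows=1 skips the header line already present in each file,
# so the names passed here are not duplicated as a data row
wind09 = pd.read_csv('wind/wind_yr2009.txt', encoding='utf-8', names=cols, skiprows=1)
wind10 = pd.read_csv('wind/wind_yr2010.txt', encoding='utf-8', names=cols, skiprows=1)

wind = pd.concat([wind09, wind10], join='inner')
wind['HOUR-MST'] = wind['HOUR-MST'].astype(int)  # no stray 'HOUR-MST' string left to break the cast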

Related

ValueError: could not convert string to float: '1.598.248'

How do I convert a numeric dataframe column in millions to double or float?
Input:
0 1.598.248
1 1.323.373
2 1.628.781
3 1.551.707
4 1.790.930
5 1.877.985
6 1.484.103
Desired output:
0 15982480.0
1 13233730.0
2 16287810.0
3 15517070.0
4 17909300.0
5 18779850.0
6 14841030.0
You will need to remove the full stops. You can use pandas replace method then convert it into a float:
df['col'] = df['col'].replace(r'\.', '', regex=True).astype('float')
Example
>>> df = pd.DataFrame({'A': ['1.1.1', '2.1.2', '3.1.3', '4.1.4']})
>>> df
A
0 1.1.1
1 2.1.2
2 3.1.3
3 4.1.4
>>> df['A'] = df['A'].replace(r'\.', '', regex=True).astype('float')
>>> df['A']
0    111.0
1    212.0
2    313.0
3    414.0
Name: A, dtype: float64
>>> df['A'].dtype
dtype('float64')
I'm assuming that, because each value has two full stops, the data is of type string. However, this should work even if you have some integers or floats in that column as well.
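A quick way to sanity-check that claim is a small hypothetical mixed column with one dotted string, one int and one float (a sketch, not data from the question):
import pandas as pd

df = pd.DataFrame({'A': ['1.598.248', 1323373, 1.5]})
# replace() applies the regex only to string values; the int and the float pass through untouched
df['A'] = df['A'].replace(r'\.', '', regex=True).astype('float')
print(df['A'])
# 0    1598248.0
# 1    1323373.0
# 2          1.5
# Name: A, dtype: float64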
my_col_name
0 1.598.248
1 1.323.373
2 1.628.781
3 1.551.707
4 1.790.930
5 1.877.985
6 1.484.103
With the df above, you can try the code below, which works in 3 steps: (1) change the column type to string, (2) replace the full-stop character, (3) change the column type to float.
col = 'my_col_name'
df[col] = df[col].astype('str')
df[col] = df[col].str.replace('.','')
df[col] = df[col].astype('float')
print(df)
Please note the above will result in a warning: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
So you could instead pass an escaped pattern with regex=True; I've also combined it into one line:
df[col] = df[col].astype('str').str.replace(r'\.', '', regex=True).astype('float')
print(df)
Output
my_col_name
0 15982480.0
1 13233730.0
2 16287810.0
3 15517070.0
4 17909300.0
5 18779850.0
6 14841030.0
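If the values come straight from a file, another option worth knowing is the thousands parameter of read_csv, which strips the dots at parse time. A sketch, assuming a hypothetical data.csv that contains the my_col_name column shown above:
import pandas as pd

# '.' is treated as the thousands separator, so 1.598.248 is parsed directly as the number 1598248
df = pd.read_csv('data.csv', thousands='.')
print(df['my_col_name'].dtype)  # int64 (or float64 if the column contains missing values)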

Classifying pandas columns according to range limits

I have a dataframe with several numeric columns, and their range goes either from 1 to 5 or from 1 to 10.
I want to create two lists of these column names this way:
names_1to5 = list of all columns in df with numbers ranging from 1 to 5
names_1to10 = list of all columns in df with numbers from 1 to 10
Example:
IP track batch size type
1 2 3 5 A
9 1 2 8 B
10 5 5 10 C
from the dataframe above:
names_1to5 = ['track', 'batch']
names_1to10 = ['IP', 'size']
I want to use a function that gets a dataframe and perform the above transformation only on columns with numbers within those ranges.
I know that if the column's max() is 5 then it's 1to5; same idea when max() is 10.
What I already did:
def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df:
        if df[col].max() == 5:
            list_1to5.append(col)
        else:
            list_1to10.append(col)
    return list_1to5, list_1to10
I tried the above but it returns the following error message:
'>=' not supported between instances of 'float' and 'str'
The type of the columns is 'object', which may be the reason. If so, how can I fix the function without casting these columns to float? There are several, sometimes hundreds, of these columns, and if I run
df['column'].max()
I get 10 or 5.
What's the best way to create this this function?
Use:
import pandas as pd

string = """alpha IP track batch size
A 1 2 3 5
B 9 1 2 8
C 10 5 5 10"""

temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]

def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df.columns:
        if df[col].dtype != 'O':
            if df[col].max() == 5:
                list_1to5.append(col)
            else:
                list_1to10.append(col)
    return list_1to5, list_1to10

df = pd.DataFrame(data, columns=cols, dtype=float)
print(test(df))
Output:
(['track', 'batch'], ['IP', 'size'])
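If the real columns arrive as object dtype (which is what triggers the "'>=' not supported between instances of 'float' and 'str'" error), one workaround is to coerce each column with pd.to_numeric before comparing its max. A sketch, assuming columns that are not fully numeric should simply be skipped; classify_columns is just an illustrative name:
import pandas as pd

def classify_columns(df):  # hypothetical helper, a variant of test() above
    list_1to5 = []
    list_1to10 = []
    for col in df.columns:
        numeric = pd.to_numeric(df[col], errors='coerce')  # non-numeric values become NaN
        if numeric.notna().all():                          # skip columns that are not purely numeric
            if numeric.max() == 5:
                list_1to5.append(col)
            else:
                list_1to10.append(col)
    return list_1to5, list_1to10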

Saving a kdb table to a dataframe then saving the dataframe to a csv. null and string values outputting to csv incorrectly?

I am saving a kdb table to a dataframe, then saving the dataframe to a csv. This works; however, in the csv file (and if I print(dataframe)), null values are showing as b"" and all other string values are showing as b'STRING'.
Running Python 3.7, pandas 0.24.2 and qpython 2.0.0.
df = pandas.DataFrame(qpython query)
df.to_csv(path_or_buf="", sep=",", na_rep='', float_format=None, columns=None,
          header=True, index=False, index_label=None, mode='w+', compression=None,
          quoting=None, quotechar='"', line_terminator="\n", chunksize=50,
          tupleize_cols=None, date_format=None, doublequote=True, escapechar=None,
          decimal='.', encoding='utf-8')
I expected the KDB table to output to the csv correctly, with nulls being an empty column and strings just showing the string, without " b'STRING' ".
Any advice or help would be greatly appreciated. If anyone needs any more information, I'd be happy to provide.
Example in the csv:
Null cells show as: b""
Cells containing strings show as: b'Euro', when in fact they should just show Euro.
qPython has some functionality for converting a kdb table to a pandas dataframe. I begin by creating a table t in kdb that has 4 columns, where the third column is a column of symbols and the 4th is a column of characters. The entries in the first row are entirely nulls.
t:([] a: 0N, til 99; b: 0Nf, 99?1f; c: `, 99?`3; d: " ", 99?" ")
a b c d
-----------------
0 0.4123573 iee x
1 0.8397208 app l
2 0.3392927 ncm w
3 0.285506 pjn c
The table can then be read into Python using QConnection. If we convert the table to a dataframe after it is read in we can see that the symbols and chars are converted to bytes and the nulls are not converted correctly.
df=pandas.DataFrame(q('t'))
df.head()
a b c d
0 -9223372036854775808 NaN b'' b' '
1 0 0.412357 b'iee' b'x'
2 1 0.839721 b'app' b'l'
3 2 0.339293 b'ncm' b'w'
4 3 0.285506 b'pjn' b'c'
However if we use the pandas=True argument with our q query then most of the table is converted appropriately as desired:
df=q('t', pandas=True)
df.head()
a b c d
0 NaN NaN b''
1 0.0 0.412357 b'iee' x
2 1.0 0.839721 b'app' l
3 2.0 0.339293 b'ncm' w
4 3.0 0.285506 b'pjn' c
However notice that entries stored as symbols in kdb are not converted as desired. In this case the following code will manually decode any columns specified in string_cols from bytes into strings using a similar method to the one suggested by Callum.
string_cols = ['c']
df[string_cols] = df[string_cols].applymap(lambda s : s.decode('utf-8'))
giving an end result of:
df.head()
a b c d
0 NaN NaN
1 0.0 0.412357 iee x
2 1.0 0.839721 app l
3 2.0 0.339293 ncm w
4 3.0 0.285506 pjn c
Which can easily be converted to a csv file.
Hope this helps
I would have expected strings in kdb to be handled fine, as qPython should convert null strings to Python null strings. Null symbols, however, are converted to _QNULL_SYM. In this case, I think the b prefix indicates a byte literal. You can try to decode the byte objects before saving to a csv.
Normally in python I would do something along the following
df['STRINGCOL'] = df['STRINGCOL'].apply(lambda s: s.decode('utf-8'))
I don't have much experience with qPython, but I believe using qnull() will convert the null to a pythonic value.
df['STRINGCOL'] = df['STRINGCOL'].apply(lambda s: qnull(s))
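A more general version of the decoding step, sketched under the assumption that every bytes value in an object column should become text before writing the csv (out.csv is just a placeholder filename):
# decode bytes values in object columns, leave everything else untouched
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].apply(lambda v: v.decode('utf-8') if isinstance(v, bytes) else v)

df.to_csv('out.csv', index=False, encoding='utf-8')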

remove trailing * or # from data in pandas

I'm reading a csv using pandas and getting all datatypes as object.
NO is a column of numeric values, some of which have a trailing * or #.
I tried
import numpy as np
tai[np.isfinite(tai['NO'])]
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
How can I remove all rows which have a trailing * or # in the NO column?
Consider this dataframe,
No
0 1
1 2#
2 3
3 4*
4 #5
You can use this to remove ONLY trailing characters,
df['No'] = df['No'].str.replace(r'[#*]$', '', regex=True)
You get
No
0 1
1 2
2 3
3 4
4 #5
A more generalized solution, in case you want to remove these characters from the entire column and keep only the numbers:
df['No'] = df['No'].str.extract(r'(\d+)', expand=False)
You get
No
0 1
1 2
2 3
3 4
4 5
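Once the unwanted characters are stripped, the column is still of object dtype, so a follow-up cast is usually needed before numeric checks like np.isfinite work. A small sketch of that last step:
import pandas as pd

# coerce the cleaned strings to numbers; anything still non-numeric becomes NaN
df['No'] = pd.to_numeric(df['No'], errors='coerce')
df = df[df['No'].notna()]  # drop the rows that did not contain a usable number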

combine csv files, sort them by time and average the colums

I have many datasets in csv files; they look like the attached picture (raw data and output). The first column is always the time in minutes, but the time steps and the total number of rows differ between the raw data files. I'd like to have one output file (csv) in which all the raw files are combined and sorted by time, so that the time increases from the top to the bottom of the column. The concentration column should be averaged when more than one value exists for the same time.
I tried like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
d1.columns
d2.columns
merged_outer = pd.merge(d1,d2, on='time', how='outer')
print(merged_outer)
but it doesn't lead to the correct output. I'm a beginner in pandas, but I hope I explained the problem well enough. Thank you for any idea or suggestion!
Thank you for your idea. Unfortunately, when I run it I get an error message saying that dat1.txt doesn't exist. This seems strange to me, as I initially read the raw files with:
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
Sorry, here the data as raw text:
raw data 1
time column2 column3 concentration
1 2 4 3
2 2 4 6
4 2 4 2
7 2 4 5
raw data 2
time column2 column3 concentration
1 2 4 6
2 2 4 2
8 2 4 9
10 2 4 5
12 2 4 7
Something like this might work
filenames = ['dat1.txt', 'dat2.txt',...]
dataframes = {filename: pd.read_csv(filename, sep="\t") for filename in filenames}
merged_outer = pd.concat(dataframes).groupby('time').mean()
When you pass a dict to pd.concat, it creates a MultiIndex DataFrame with the dict keys as level 0 of the index.
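As an end-to-end check with the two sample tables above, read here from strings instead of files just for the sketch:
from io import StringIO
import pandas as pd

raw1 = "time\tcolumn2\tcolumn3\tconcentration\n1\t2\t4\t3\n2\t2\t4\t6\n4\t2\t4\t2\n7\t2\t4\t5"
raw2 = "time\tcolumn2\tcolumn3\tconcentration\n1\t2\t4\t6\n2\t2\t4\t2\n8\t2\t4\t9\n10\t2\t4\t5\n12\t2\t4\t7"

dataframes = {'dat1.txt': pd.read_csv(StringIO(raw1), sep="\t"),
              'dat2.txt': pd.read_csv(StringIO(raw2), sep="\t")}
merged_outer = pd.concat(dataframes).groupby('time').mean()
print(merged_outer)
# concentration at time 1 becomes (3 + 6) / 2 = 4.5 and at time 2 becomes (6 + 2) / 2 = 4.0;
# the result is automatically sorted by the 'time' index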