Python Pandas extract columns from .csv

I have a .csv file, which I can read in Pandas. The .csv file looks like the following.
a b c d e
1 4 3 2 5
6 7 8 3 6
...
What I need to achieve is to extract a and b as column vectors and
[c d e] as a matrix. I used Pandas with the following code to read the .csv file:
pd.read_csv('data.csv', sep=',',header=None)
But this will give me a vector like this: [[a,b,c,d,e],[1,4,3,2,5],...]
How can I extract the columns? I heard about df.iloc, but this cannot be used here, since after pd.read_csv there is only one column.

You should be able to do that with:
ds = pd.read_csv('data.csv', sep=',', header=0)
column_a = ds["a"]
matrix = ds[["c", "d", "e"]]
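If the file is actually whitespace-separated, as the sample suggests, reading it with sep=',' will collapse everything into one column, which would explain what you saw. A minimal sketch, assuming data.csv is whitespace-separated and you want NumPy arrays rather than pandas objects:
import pandas as pd

# header=0 uses the first row ("a b c d e") as column names;
# sep=r'\s+' splits on runs of whitespace instead of commas
ds = pd.read_csv('data.csv', sep=r'\s+', header=0)

column_a = ds["a"].to_numpy()             # column vector, shape (n,)
column_b = ds["b"].to_numpy()             # column vector, shape (n,)
matrix = ds[["c", "d", "e"]].to_numpy()   # matrix, shape (n, 3)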


Adding file name to column name pandas dataframe

I have a pandas dataframe created from several csv files. The csv files are all structured the same way, so I have the same column names over and over again. I want the column names to be prefixed with the names of the files they come from (which I have as a list).
From this I know how to add a count to columns with the same name, and I know how to rename columns. But I fail at bringing the right file name to the right column.
That should be the relevant part of the code:
for i in range(0, len(file_list)):
    data = pd.read_table(file_list[i], encoding='unicode_escape')
    df = pd.DataFrame(data)
    df = df.drop(droplist, axis=1)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)
You can use a dictionary in concat to generate a MultiIndex:
list_of_files = ['f1.csv', 'f2.csv']
pd.concat({f: pd.read_table(f, encoding='unicode_escape', sep=',')
           for f in list_of_files}, axis=1)
example:
# f1.csv
a,b
1,2
3,4
# f2.csv
a,b
5,6
7,8
output:
  f1.csv    f2.csv
       a  b      a  b
0      1  2      5  6
1      3  4      7  8
Alternative using add_prefix in a list comprehension:
pd.concat([pd.read_table(f, encoding='unicode_escape', sep=',')
             .add_prefix(f[:-3])  # add prefix without the "csv" extension
           for f in list_of_files], axis=1)
output:
   f1.a  f1.b  f2.a  f2.b
0     1     2     5     6
1     3     4     7     8
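One convenience of the MultiIndex version: a single file's columns can be pulled back out by their level-0 key. A short sketch, assuming the concatenated frame is bound to a name like combined:
combined = pd.concat({f: pd.read_table(f, encoding='unicode_escape', sep=',')
                      for f in list_of_files}, axis=1)
print(combined['f1.csv'])  # only the columns that came from f1.csv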

Read TXT or DAT file in Python

I need to read a .DAT or .TXT file, extract the column names, assign them new names, and write the data to a pandas dataframe.
I have an environment variable called 'filetype' and, based on its value (DAT or TXT), I need to read the file accordingly, extract the column names from it, and assign new column names.
My input .dat/.txt file has just 2 columns and it looks like as below:
LN_ID,LN_DT
1234,10/01/2020
4567,10/01/2020
8888,10/01/2020
9999,10/01/2020
Read the above file and create new columns new_ln_id = LN_ID and new_ln_dt = LN_DT, and write the result to a pandas dataframe.
I've tried something like the below with pandas, but it's giving an error. I also want to check first whether the file is .dat or .txt, based on the environment variable 'filetype', and proceed accordingly.
df = pd.read_csv('myfile.dat', sep=',')
new_cols = ['new_ln_id', 'new_ln_dt']
df.columns = new_cols
I think there could be a better and easier way. Appreciate it if anyone can help. Thanks!
It is unclear from your question whether you want two new empty columns or whether you want to replace the existing names. Either way, here is how, for a dataframe dte given by:
Add columns
LN_ID LN_DT
0 1234 10/01/2020
1 4567 10/01/2020
2 8888 10/01/2020
3 9999 10/01/2020
define the new columns
cols = ['new_ln_id','new_ln_dt']
and concatenate them as empty columns:
print(pd.concat([dte,pd.DataFrame(columns=cols)]))
which gives
LN_ID LN_DT new_ln_id new_ln_dt
0 1234.0 10/01/2020 NaN NaN
1 4567.0 10/01/2020 NaN NaN
2 8888.0 10/01/2020 NaN NaN
3 9999.0 10/01/2020 NaN NaN
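An equivalent way to add the empty columns, as a minimal sketch (assumes numpy is available; avoids the concat call):
import numpy as np

for c in cols:
    dte[c] = np.nan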
Replace column names
df = df.rename(columns={"LN_ID": "new_ln_id", "LN_DT": "new_ln_dt"})  # rename returns a copy, so assign it back
Thanks for your response, and sorry for the confusion. I want to rename the 2 columns, but first I want to check whether it's a .dat or .txt file, based on a unix environment variable called 'filetype'.
For example: if filetype='TXT' or 'DAT', then read the input file, say 'abc.dat' or 'abc.txt', into a new pandas dataframe and rename the 2 columns. I hope that's clear.
Here is what I did. I've created a function that checks whether the filetype is "dat" or "txt", reads the file into a pandas dataframe, and then renames the 2 columns. The function loads the data, but it does not rename the columns as required. Appreciate it if anyone can point out what I am missing.
filetype = os.environ['TYPE']
print(filetype)
# prints: DAT
def load(file_type):
    if file_type.lower() == "dat":
        df = pd.read_csv(input_file, sep=',', engine='python')
        if df.columns[0] == "LN_ID":
            df.columns[0] = "new_ln_id"
        if df.columns[1] == "LN_DT":
            df.columns[1] = "new_ln_dt"
        return df
    else:
        if file_type.lower() == "txt":
            df = pd.read_csv("infile", sep=",", engine='python')
            if df.columns[0] == "LN_ID":
                df.columns[0] = "new_ln_id"
            if df.columns[1] == "LN_DT":
                df.columns[1] = "new_ln_dt"
            return df

load(filetype)
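For what it's worth, the likely culprit: a pandas Index is immutable, so item assignment like df.columns[0] = "new_ln_id" raises TypeError: Index does not support mutable operations. A minimal sketch of the two usual fixes, usable in either branch:
# assign the whole list of names at once ...
df.columns = ['new_ln_id', 'new_ln_dt']
# ... or rename by mapping and assign the result back
df = df.rename(columns={"LN_ID": "new_ln_id", "LN_DT": "new_ln_dt"})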
Alternative
import os
from os import listdir
from os.path import isfile, join

onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
filename = os.path.join(path, onlyfiles[0])
if filename.endswith('.txt'):
    dte = pd.read_csv(filename, sep=",")
elif filename.endswith('.dat'):
    dte = pd.read_csv(filename, sep=",")
dte = dte.rename(columns={"LN_ID": "new_ln_id", "LN_DT": "new_ln_dt"})
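Since the question dispatches on an environment variable rather than the file extension, here is a minimal sketch of that variant (the variable name 'TYPE' and the file name 'abc.dat'/'abc.txt' are taken from the question; adjust as needed):
import os
import pandas as pd

filetype = os.environ['TYPE']  # e.g. 'DAT' or 'TXT'
if filetype.lower() in ('dat', 'txt'):
    df = pd.read_csv('abc.' + filetype.lower(), sep=',')
    df = df.rename(columns={"LN_ID": "new_ln_id", "LN_DT": "new_ln_dt"})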

Saving a kdb table to a dataframe then saving the dataframe to a csv. null and string values outputting to csv incorrectly?

I am saving a kdb table to a dataframe, then saving the dataframe to a csv. This works; however, in the csv file (and when I print(dataframe)), null values show as b'' and all other string values show as b'STRING'.
Running Python 3.7, pandas 0.24.2 and qpython 2.0.0.
df = pandas.DataFrame(qpython query)
df.to_csv(path_or_buf="", sep=",", na_rep='', float_format=None, columns=None,
          header=True, index=False, index_label=None, mode='w+',
          compression=None, quoting=None, quotechar='"', line_terminator="\n",
          chunksize=50, tupleize_cols=None, date_format=None, doublequote=True,
          escapechar=None, decimal='.', encoding='utf-8')
I expected the kdb table to output to the csv correctly, with nulls being empty cells and strings just showing the string, without the b'STRING' wrapper.
Any advice or help would be greatly appreciated. If anyone needs any more information, I'd be happy to provide.
Example in the csv:
Null cells show as: b''
Cells containing strings show as: b'Euro' when in fact they should just show Euro
qPython has some functionality for converting a kdb table to a pandas dataframe. I begin by creating a table t in kdb that has 4 columns, where the third column is a column of symbols and the 4th is a column of characters. The entries in the first row are entirely nulls.
t:([] a: 0N, til 99; b: 0Nf, 99?1f; c: `, 99?`3; d: " ", 99?" ")
a b         c   d
-----------------
0 0.4123573 iee x
1 0.8397208 app l
2 0.3392927 ncm w
3 0.285506  pjn c
The table can then be read into Python using QConnection. If we convert the table to a dataframe after it is read in, we can see that the symbols and chars are converted to bytes and the nulls are not converted correctly.
df=pandas.DataFrame(q('t'))
df.head()
   a                     b         c       d
0  -9223372036854775808  NaN       b''     b' '
1   0                    0.412357  b'iee'  b'x'
2   1                    0.839721  b'app'  b'l'
3   2                    0.339293  b'ncm'  b'w'
4   3                    0.285506  b'pjn'  b'c'
However, if we use the pandas=True argument with our q query, then most of the table is converted as desired:
df=q('t', pandas=True)
df.head()
   a    b         c       d
0  NaN  NaN       b''
1  0.0  0.412357  b'iee'  x
2  1.0  0.839721  b'app'  l
3  2.0  0.339293  b'ncm'  w
4  3.0  0.285506  b'pjn'  c
However, notice that entries stored as symbols in kdb are still not converted as desired. In this case, the following code will manually decode any columns specified in string_cols from bytes into strings, using a similar method to the one suggested by Callum.
string_cols = ['c']
df[string_cols] = df[string_cols].applymap(lambda s : s.decode('utf-8'))
giving an end result of:
df.head()
   a    b         c    d
0  NaN  NaN
1  0.0  0.412357  iee  x
2  1.0  0.839721  app  l
3  2.0  0.339293  ncm  w
4  3.0  0.285506  pjn  c
Which can easily be converted to a csv file.
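For example (the output path is illustrative):
df.to_csv('t.csv', index=False, encoding='utf-8')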
Hope this helps
I would have expected strings in kdb to be handled fine, as QPYTHON should convert null strings to python null strings. Null symbols, however, are converted to _QNULL_SYM. In this case, I think the 'b' prefix indicates a byte literal. You can try to decode the byte objects before saving to a csv
Normally in python I would do something along the following
df['STRINGCOL'] = df['STRINGCOL'].apply(lambda s: s.decode('utf-8'))
I don't have much experience with QPYTHON but I believe using qnull() will convert the null to a pythonic value.
df['STRINGCOL'] = df['STRINGCOL'].apply(lambda s: qnull(s))
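A more general sketch along the same lines, if the goal is simply a clean csv: decode every bytes entry in object-dtype columns and leave everything else untouched (the output file name is an assumption):
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].apply(lambda v: v.decode('utf-8') if isinstance(v, bytes) else v)
df.to_csv('out.csv', index=False, encoding='utf-8')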

pandas get columns without copy

I have a dataframe with multiple columns, and I want to keep some of them and drop the others, without copying into a new dataframe.
I suppose it should be
df = df['col_a','col_b']
but I'm not sure whether it copy a new one or not. Is there any better way to do this?
Your approach should work, apart from one minor issue:
df = df['col_a','col_b']
should be:
df = df[['col_a','col_b']]
Because you assign the subset df back to df, it's essentially equivalent to dropping the other columns.
If you would like to drop other columns in place, you can do:
df.drop(columns=df.columns.difference(['col_a', 'col_b']), inplace=True)
Let me know if this is what you want.
You have a dataframe df with multiple columns a, b, c, d and e. You want to select, say, a and b and store them back in df. To achieve this, you can do:
df = df[['a', 'b']]
Input dataframe df:
a b c d e
1 1 1 1 1
3 2 3 1 4
When you do:
df = df[['a', 'b']]
the output will be:
a b
1 1
3 2
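Worth noting: both forms materialise a new underlying frame, but after rebinding df the dropped columns are unreferenced and get garbage-collected, so the net memory effect is the same. A self-contained sketch:
import pandas as pd

df = pd.DataFrame({'a': [1, 3], 'b': [1, 2], 'c': [1, 3], 'd': [1, 1], 'e': [1, 4]})

df = df[['a', 'b']]   # rebinding releases the old frame
# or, in place:
# df.drop(columns=df.columns.difference(['a', 'b']), inplace=True)

print(df.columns.tolist())  # ['a', 'b']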

combine csv files, sort them by time and average the columns

I have many datasets in csv files; example raw data is shown further below.
In the first column is always the time in minutes, but the time steps and the total number of rows differ between the raw data files. I'd like to have one output file (csv file) in which all the raw files are combined and sorted by time, so that the time increases from the top to the bottom of the column.
The concentration column should be averaged when more than one number exists for a given time.
I tried like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
d1.columns
d2.columns
merged_outer = pd.merge(d1,d2, on='time', how='outer')
print(merged_outer)
but it doesn't lead to the correct output. I'm a beginner with Pandas, but I hope I explained the problem well enough. Thank you for any idea or suggestion!
Thank you for your idea. Unfortunately, when I run it I get an error message saying that dat1.txt doesn't exist. This seems strange to me as I read the raw files initially by:
d1 = pd.read_csv('dat1.txt', sep="\t")
d2 = pd.read_csv('dat2.txt', sep="\t")
Sorry, here the data as raw text:
raw data 1
time column2 column3 concentration
1 2 4 3
2 2 4 6
4 2 4 2
7 2 4 5
raw data 2
time column2 column3 concentration
1 2 4 6
2 2 4 2
8 2 4 9
10 2 4 5
12 2 4 7
Something like this might work
filenames = ['dat1.txt', 'dat2.txt',...]
dataframes = {filename: pd.read_csv(filename, sep="\t") for filename in filenames}
merged_outer = pd.concat(dataframes).groupby('time').mean()
When you pass a dict to pd.concat, it creates a MultiIndex DataFrame with the dict keys as level 0.
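With the two sample files above, that produces (a worked result computed from the sample data; groupby sorts its keys, so the rows come out ordered by time):
      column2  column3  concentration
time
1         2.0      4.0            4.5
2         2.0      4.0            4.0
4         2.0      4.0            2.0
7         2.0      4.0            5.0
8         2.0      4.0            9.0
10        2.0      4.0            5.0
12        2.0      4.0            7.0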