Pandas read .csv separated by whitespace but columns with names that contain spaces - pandas

I have a .csv file that I have to read. It is separated by whitespace, but the column names also contain spaces. Something like this:
column1 another column final column
value ONE valueTWO valueTHREE
I have been trying to read it, but the parser confuses the spaces inside the column names with separators. I tried using read_fwf and read_csv, but neither worked:
df_mccf = pd.read_fwf(r'C:\Users\MatíasGuerreroIrarrá\OneDrive - BIWISER\Orizon\MCCF\inputs\valores-MCCF (3).csv',
                      colspecs=[(0, 4), (5, 10), (11, 21), (22, 32), (33, 54), (55, 1000)])
and:
df_mccf = pd.read_fwf(r'C:\Users\MatíasGuerreroIrarrá\OneDrive - BIWISER\Orizon\MCCF\inputs\valores-MCCF (3).csv',
                      sep=' ')
and got the wrong result.
And with this line:
df_mccf = pd.read_csv(r'C:\Users\MatíasGuerreroIrarrá\OneDrive - BIWISER\Orizon\MCCF\inputs\valores-MCCF (3).csv',
                      encoding='UTF-16', delim_whitespace=True)
I still got the wrong result.
Any help would be really amazing.

I'd suggest you ignore the header altogether and instead pass the names argument. That way you can use the whitespace separator for the rest of the file:
import io
import pandas as pd
data = """column one column two column three
a 1 x
b 2 y
"""
with io.StringIO(data) as f:
    df = pd.read_csv(
        f,
        delim_whitespace=True,
        names=['one', 'two', 'three'],  # custom header names
        skiprows=1,                     # skip the initial row (header)
    )
Result:
  one  two three
0   a    1     x
1   b    2     y
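Applied to the file from the question, the call might look like this (the replacement column names are assumptions, since the real header isn't shown; the encoding is taken from the question):
import pandas as pd

df_mccf = pd.read_csv(
    r'C:\Users\MatíasGuerreroIrarrá\OneDrive - BIWISER\Orizon\MCCF\inputs\valores-MCCF (3).csv',
    encoding='UTF-16',          # taken from the attempt in the question
    delim_whitespace=True,      # data rows are whitespace-separated
    skiprows=1,                 # skip the header row that contains spaces in the names
    names=['column1', 'another_column', 'final_column'],  # assumed replacement names
)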

Related

How to read specific columns using dropna(thresh)

I have a dataframe which looks like this:
How do I drop the columns among Q1 - Q8 that have 3 missing values?
Following that, for the Q1 - Q8 columns with 2 or fewer missing values, I want to fill the missing entries with the default value "0".
I have tried various forms of dropna(thresh=N), but I am not sure whether it can be restricted to specific columns only.
One thing you could do is split your dataframe into two dataframes:
df_no_change = df[['A', 'B', 'C', 'D']]
df_change = df[['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8']]
Then apply the column removal to df_change:
df_change = df_change.dropna(thresh=len(df_change) - 2, axis=1)
And finally concatenate the dataframes back together:
df = pd.concat([df_change, df_no_change], axis=1)
This may work
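The question also asks to fill the remaining missing values in Q1 - Q8 with "0"; a minimal sketch of that last step, applied to the concatenated df above:
# Q-columns that survived the dropna step
q_cols = [c for c in df.columns if c.startswith('Q')]

# fill the 2-or-fewer remaining missing values with 0
df[q_cols] = df[q_cols].fillna(0)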

Pandas Combining header rows error: too many values to unpack

I'm trying to follow the answer for this StackOverflow question: Pandas: combining header rows of a multiIndex DataFrame because I have the same need.
I've put this data into foobar.txt:
first,bar,baz,foo
second,one,two,three
A,1,2,3
B,8,9,10
I want to create a dataframe that looks like this:
first-second bar-one baz-two foo-three
A 1 2 3
B 8 9 10
I'm following the first answer to the linked question, which uses list comprehension, so my entire code looks like this:
import pandas as pd
df = pd.read_csv(r'C:\Temp\foobar.txt')
df.columns = [f'{i}{j}' for i, j in df.columns]
However, I get a "too many values to unpack error":
Exception has occurred: ValueError
too many values to unpack (expected 2)
File ".\test.py", line 32, in <listcomp>
df.columns = [f'{i}{j}' for i, j in df.columns]
File ".\test.py", line 32, in <module>
df.columns = [f'{i}{j}' for i, j in df.columns]
I've looked at other examples where folks hit the same error and I'm certain it's due to the fact that I have more than 2 values from df.columns, but I'm not sure how to fix that, nor do I understand why the answer I linked to above doesn't hit this problem.
You have to read the CSV specifying both header rows to get a MultiIndex:
df = pd.read_csv(r'C:\Temp\foobar.txt', header=[0,1])
df.columns
MultiIndex([('first', 'second'),
            (  'bar',    'one'),
            (  'baz',    'two'),
            (  'foo',  'three')],
           )
df.columns = [f'{i}{j}' for i, j in df.columns]
df.columns
Index(['firstsecond', 'barone', 'baztwo', 'foothree'], dtype='object')
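If you want the hyphenated names shown in the desired output, a small tweak to the same comprehension should do it:
df.columns = [f'{i}-{j}' for i, j in df.columns]
# Index(['first-second', 'bar-one', 'baz-two', 'foo-three'], dtype='object')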

Finding Mismatch of a column having decimal value using Pandas

I have 2 CSV files with the same headers. I merged them on primary keys. Now, from the merged file, I need to create another file with the rows where all values match but there is a mismatch at the 7th decimal place for col1 and col2, which are float-valued columns. What is the best way to do that?
Generate some data that matches the shape you describe. It's then a simple case of comparing equality of the rounded numbers and writing the result with to_csv(). A sample of 5 rows is included.
import numpy as np
import pandas as pd
from pathlib import Path
b = np.random.randint(1,100, 100)
df1 = pd.DataFrame(b+np.random.uniform(10**-8, 10**-7, 100), columns=["col1"])
df2 = pd.DataFrame(b+np.random.uniform(10**-8, 10**-7, 100), columns=["col2"])
fn = Path.cwd().joinpath("SO_same.csv")
df1.join(df2).assign(eq7dp=lambda df: df.col1.round(7).eq(df.col2.round(7))).head(5).to_csv(fn)
with open(fn) as f: contents=f.read()
print(contents)
output
,col1,col2,eq7dp
0,37.00000005733964,37.00000002893621,False
1,46.00000001386966,46.00000008236663,False
2,99.00000007870301,99.00000007452154,True
3,42.00000001906606,42.00000001278533,True
4,79.00000007529009,79.00000007372863,True
Supplement
In the comments you note that you want to use a np.where() expression to select col1 if the values are equal, else False. You need to ensure that the 2nd and 3rd parameters to np.where() are compatible; note that False becomes zero when converted to an int/float.
df1.join(df2).assign(
    eq7dp=lambda df: df.col1.round(7).eq(df.col2.round(7)),
    col3=lambda df: np.where(df.col1.round(7).eq(df.col2.round(7)), df.col1, np.full(len(df), False)),
)
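If the goal is to pull out only the rows that agree up to 6 decimal places but differ at the 7th, a minimal sketch along the same lines (the column names col1/col2 and the joined dataframe follow the sample above; adjust to your merged file):
merged = df1.join(df2)

# rows that match when rounded to 6 decimals but differ when rounded to 7
mask = merged.col1.round(6).eq(merged.col2.round(6)) & ~merged.col1.round(7).eq(merged.col2.round(7))
merged[mask].to_csv("SO_mismatch_7dp.csv")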

Get names of dummy variables created by get_dummies

I have a dataframe with a very large number of columns of different types. I want to encode the categorical variables in my dataframe using get_dummies(). The question is: is there a way to get the column headers of the encoded categorical columns created by get_dummies()?
The hard way to do this would be to extract a list of all categorical variables in the dataframe, then append the different text labels associated with each categorical variable to the corresponding column headers. I wonder if there is an easier way to achieve the same end.
I think the way that should work with all the different uses of get_dummies would be:
#example data
import pandas as pd
df = pd.DataFrame({'P': ['p', 'q', 'p'], 'Q': ['q', 'p', 'r'],
                   'R': [2, 3, 4]})
dummies = pd.get_dummies(df)
#get column names that were not in the original dataframe
new_cols = dummies.columns[~dummies.columns.isin(df.columns)]
new_cols gives:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
In this example the non-categorical column 'R' is the only column preserved by get_dummies, and it comes first in the result, so you could also just take the column names after the first column:
dummies.columns[1:]
which on this test data gives the same result:
Index(['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r'], dtype='object')
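If you'd rather derive the names from the source columns (the "hard way" described in the question), a short sketch assuming the categoricals are of object or category dtype:
# columns that pd.get_dummies encodes by default
cat_cols = df.select_dtypes(include=['object', 'category']).columns

dummies = pd.get_dummies(df)

# dummy columns are named '<original column>_<label>'
new_cols = [c for c in dummies.columns
            if any(c.startswith(f'{col}_') for col in cat_cols)]
# ['P_p', 'P_q', 'Q_p', 'Q_q', 'Q_r']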

Get row and column index of value in Pandas df

Currently I'm trying to automate scheduling.
I'll get the requirements as a .csv file.
However, the number of days changes by month, and personnel also change occasionally, which means the number of columns and rows is not fixed.
So, I want to put the value '*' as a marker meaning the end of the table. Unfortunately, I can't find a function or method that takes a value as a parameter and returns a (list of) index (the column and row names, or index numbers).
Is there any way to find the index of a certain value (like a coordinate), or a list of such indices?
For example, when the dataframe is like below,
  | column_1 | column_2
--+----------+---------
1 | 'a'      | 'b'
2 | 'c'      | 'd'
how can I get 'column_2' and 2 from the value 'd'? It's something like the opposite of .loc or .iloc.
Interesting question. I also used a list comprehension, but with np.where. Still I'd be surprised if there isn't a less clunky way.
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
[(i, np.where(df[i] == 'd')[0].tolist()) for i in list(df) if len(np.where(df[i] == 'd')[0]) > 0]
[('column_2', [1])]
Note that it returns the numeric (0-based) index, not the custom (1-based) index you have. If you have a fixed offset you could just add a +1 or whatever to the output.
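If you want the dataframe's own index labels instead of the 0-based positions, a small variant of the same comprehension should work:
# map the 0-based row positions back to the dataframe's index labels
[(col, df.index[np.where(df[col] == 'd')[0]].tolist())
 for col in df.columns
 if (df[col] == 'd').any()]
# [('column_2', [2])]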
If I understand what you are looking for (find the (index value, column location) for a value in a dataframe), you can use a list comprehension. It probably won't be the fastest if your dataframe is large.
# assume this dataframe
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
# list comprehension
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 0), (3, 0), (1, 1)]
Change df.columns.get_loc(col) to col if you want the column name rather than its location:
[(df[col][df[col].eq('abc')].index[i], col) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
# [(0, 'col'), (3, 'col'), (1, 'col2')]
I might be misunderstanding something, but np.where should get the job done.
df_tmp = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
solution = np.where(df_tmp == 'd')
solution should contain row and column index.
Hope this helps!
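To map those positional arrays back to row and column labels, something like this should work:
import numpy as np
import pandas as pd

df_tmp = pd.DataFrame({'column_1': ['a', 'c'], 'column_2': ['b', 'd']}, index=[1, 2])

# np.where returns positional (0-based) row and column arrays
rows, cols = np.where(df_tmp == 'd')

# translate positions into the dataframe's index and column labels
coords = list(zip(df_tmp.index[rows], df_tmp.columns[cols]))
# [(2, 'column_2')]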
To search for a single value:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df == 'd'].stack().index.tolist()
[Out]:
[(2, 'column_2')]
To search for a list of values:
df = pd.DataFrame({'column_1':['a','c'], 'column_2':['b','d']}, index=[1,2])
df[df.isin(['a', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (2, 'column_2')]
It also works when the value occurs in multiple places:
df = pd.DataFrame({'column_1':['test','test'], 'column_2':['test','test']}, index=[1,2])
df[df == 'test'].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_1'), (2, 'column_2')]
Explanation
Select cells where the condition matches:
df[df.isin(['a', 'b', 'd'])]
[Out]:
  column_1 column_2
1        a        b
2      NaN        d
stack() reshapes the columns to index:
df[df.isin(['a', 'b', 'd'])].stack()
[Out]:
1  column_1    a
   column_2    b
2  column_2    d
Now the result has a MultiIndex:
df[df.isin(['a', 'b', 'd'])].stack().index
[Out]:
MultiIndex([(1, 'column_1'),
            (1, 'column_2'),
            (2, 'column_2')],
           )
Convert this multi-index to list:
df[df.isin(['a', 'b', 'd'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
Note
If a list of values is searched, the returned result does not preserve the order of the input values:
df[df.isin(['d', 'b', 'a'])].stack().index.tolist()
[Out]:
[(1, 'column_1'), (1, 'column_2'), (2, 'column_2')]
I had a similar need and this worked perfectly:
# deals with case sensitivity concern
df = raw_df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
# get the row index
value_row_location = df.isin(['VALUE']).any(axis=1).tolist().index(True)
# get the column index
value_column_location = df.isin(['VALUE']).any(axis=0).tolist().index(True)
# Do whatever you want to do e.g Replace the value above that cell
df.iloc[value_row_location - 1, value_column_location] = 'VALUE COLUMN'