Can pandas parse a csv file with an unknown number of comments, a header and line(s) to be skipped?

I have a file that has an unknown number of comments, followed by a header, followed by a second row that has to do with data types but is really just junk to me.
# Comment Line
# Another comment -- there could be lots
index value
not wanted
1 10
2 20
With a priori knowledge of the number of comments (which sort of violates the idea of comments), the file can be read with:
pd.read_csv(fname, header=0, comment='#', skiprows=[3])
In my case, though, the number 3 is unknown. I only know the header is index 0 not counting comments and I know that the unwanted row is index 1 not counting comments. header works the way I want but not skiprows. Is there a way to make use of this information to read the file easily? By "easily", I mean something short of the following which opens the file, counts the preliminary comments, then reads:
ncomment = 0
crows = []
fname = "sample.csv"
with open(fname, "r") as f:
    while f.readline().startswith("#"):
        crows.append(ncomment)
        ncomment += 1
crows = crows + [ncomment + 1]
data = pd.read_csv(fname, header=0, skiprows=crows, index_col=0, delim_whitespace=True)
print(data)

You can use header together with comment to get a MultiIndex, then drop the unwanted level. header is evaluated after the comment lines are removed, so it's always [0, 1]. (I'm using delim_whitespace=True because there aren't any ','s in your sample data.)
df = pd.read_csv('sample.csv', comment='#', header=[0, 1], delim_whitespace=True)
#  index  value
#    not wanted
#0     1     10
#1     2     20
We can drop the unwanted level in the same call chain:
df = (pd.read_csv('sample.csv', comment='#', header=[0, 1], delim_whitespace=True)
.droplevel(1, axis=1))
#   index  value
#0      1     10
#1      2     20
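One small caveat: delim_whitespace has been deprecated in newer pandas (2.2+) in favour of sep=r'\s+'. If you're on such a version, the same read looks like this (a minimal sketch, same behaviour assumed):
df = (pd.read_csv('sample.csv', comment='#', header=[0, 1], sep=r'\s+')
        .droplevel(1, axis=1))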

Related

How to make dataframe from different parts of an Excel sheet given specific keywords?

I have one Excel file where multiple tables are placed in the same sheet. My requirement is to read certain tables based on a keyword. I have read the tables using the skiprows and nrows method, which works as of now, but in future it won't work due to the dynamic table length.
Is there any other workaround, apart from the skiprows & nrows method, to read the tables shown in the picture?
I want to read data1 as one table and data2 as another table. In particular, I want the columns "RR", "FF" & "WW" from each as two different data frames.
I'd appreciate it if someone could help or guide me on this.
Method I have tried:
all_files = glob.glob(INPATH + "*sample*")
df1 = pd.read_excel(all_files[0], skiprows=11, nrows=3)
df2 = pd.read_excel(all_files[0], skiprows=23, nrows=3)
This works fine; the only problem is that the table length will vary every time.
With an Excel file identical to the one in your image, here is one way to do it:
import pandas as pd

df = pd.read_excel("file.xlsx").dropna(how="all").reset_index(drop=True)

# Setup
targets = ["Data1", "Data2"]
indices = [df.loc[df["Unnamed: 0"] == target].index.values[0] for target in targets]

dfs = []
for i in range(len(indices)):
    # Slice df from the current index up to the next one
    try:
        data = df.loc[indices[i] : indices[i + 1] - 1, :]
    except IndexError:
        data = df.loc[indices[i] :, :]
    # Within one slice, keep only the rows starting at the 'rr' header row
    r_idx = data.loc[df["Unnamed: 0"] == "rr"].index.values[0]
    data = data.loc[r_idx:, :].reset_index(drop=True).dropna(how="all", axis=1)
    # Cleanup
    data.columns = data.iloc[0]
    data.columns.name = ""
    dfs.append(data.loc[1:, :].iloc[:, 0:3])
And so:
for item in dfs:
    print(item)
# Output
     rr       ff          ww
1  car1  1000000     sellout
2  car2  1500000  to be sold
3  car3  1300000     sellout
     rr       ff          ww
1  car1  1000000     sellout
2  car2  1500000  to be sold
3  car3  1300000     sellout
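As a small usage note (not part of the original answer): the list built above can then be unpacked into the two tables the question asks for; the slices already keep only the first three columns, i.e. rr/ff/ww.
data1, data2 = dfs
print(data1.columns.tolist())  # ['rr', 'ff', 'ww']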

Dataframe Column into Multiple Columns by delimiter ',': expand=True, n=-1

My first question, thanks :) Sorry for the lengthy formulation.
Researched all related posts
What I have
My DataFrame column (please see screenshot) contains strings of car parameters separated by the delimiter ','.
My DataFrame:
Some rows come with mileage while others do not (screenshot), hence some rows have fewer delimiters.
The Task
I need to create 5 columns (the max number of delimiters) to store the CarParameters separately (Mileage, GearBox, HP, Body etc.).
If a row doesn't have Mileage, put 0 in the Mileage column.
What I know and works well
df["name"].str.split(" ", expand = True) by default n=-1 and splits into necessary columns
example:
The issue:
If I use the str.split(" ", expand=True) method, GearBox (ATM) is wrongly put under the newly created Mileage column, because that row is short one delimiter (screenshot).
Result:
You can try a lambda function combined with list concatenation, like below.
>>> import pandas as pd
>>> df = pd.DataFrame([['1,2,3,4,5'], ['2,3,4,5']], columns=["CarParameters"])
>>> print(pd.DataFrame(df.CarParameters.apply(
...     lambda x: str(x).split(',')).apply(
...     lambda x: [0]*(5-len(x)) + x).to_list(), columns=list("ABCDE")))
   A  B  C  D  E
0  1  2  3  4  5
1  0  2  3  4  5
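As an alternative sketch (not from the answer above, and assuming the missing field is always Mileage and rows are short by at most one field), you could pad short rows with a leading "0," and then use str.split(..., expand=True) as you already do:
import pandas as pd

# Hypothetical two-row frame mirroring the question; the second row lacks Mileage.
df = pd.DataFrame({"CarParameters": ["150000,ATM,250,Sedan", "ATM,250,Sedan"]})

n_fields = df["CarParameters"].str.count(",") + 1
padded = df["CarParameters"].mask(n_fields < n_fields.max(), "0," + df["CarParameters"])
print(padded.str.split(",", expand=True))
#         0    1    2      3
# 0  150000  ATM  250  Sedan
# 1       0  ATM  250  Sedan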

How to read a very messy .txt file using pd.read_csv() with multiple conditions

I have a very messy .txt file that I'm attempting to read in using pd.read_csv(). The file has multiple challenges to overcome:
1) The first 12 lines are not needed and therefore need to be skipped; the next 50 rows are needed, the next 14 rows need to be skipped, the next 50 rows are needed, the next 14 skipped, and so on.
2) Each logical row of data actually spans 2 rows in this report, meaning that we need to lift the 2nd row up to the 1st row and place it to the right in new columns. (This would halve the total number of rows and double the number of columns of the desired dataframe.)
3) The last challenge is that the first row of data has 8 spaces of separation between values, while the 2nd row of data has anywhere from 8 to 17 spaces of separation between values.
I thought the best way to approach this would be to first remove the rows that I don't need. I would then find a way to merge row 1 with row 2, row 3 with row 4, row 5 with row 6, and so on until all rows are correctly consolidated. I would then use the sep argument to separate the values of each row on anything that has 8 spaces or more. This would hopefully get to my desired output. Has anyone ever had a similar challenge that they have overcome?
First picture is an image of the raw data
Second picture is my ideal output
OK, so error_bad_lines=False combined with sep='\s+|\^+' worked a treat.
I then solved the problem of bad lines by removing them one by one.
I then solved the '1 row over 2 rows' problem by splitting the dataframe into two dfs (df8, df9) and recombining them on axis=1. Looks perfect now.
import pandas as pd  # importing Pandas package to wrangle data

boltcogs = 'ABAPlist.txt'
df = pd.read_csv(boltcogs, skiprows=12, error_bad_lines=False, header=None, sep=r'\s+|\^+')
df1 = df[df.iloc[:, 0] != 'Production']  ## removing verbose lines
df2 = df1[df1.iloc[:,0] != '----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------' ]
df3 = df2[df2.iloc[:, 0] != 'Kuala']      ## removing bad rows
df4 = df3[df3.iloc[:, 0] != 'Operating']  ## removing bad rows
df5 = df4[df4.iloc[:, 0] != 'Plant:']     ## removing bad rows
df6 = df5[df5.iloc[:, 0] != 'Costing']    ## removing bad rows
df7 = df6[df6.iloc[:, 0] != 'Currency:']  ## removing bad rows
df8 = df7.iloc[0::2, :].reset_index()  # every second row starting at 0: the first half of each record
df9 = df7.iloc[1::2, :].reset_index()  # the remainder: the second half of each record
df10 = pd.concat([df8, df9], axis=1, ignore_index=True)  # joining them together
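A side note, in case you're on a newer pandas: error_bad_lines was deprecated in 1.3 and removed in 2.0, and on_bad_lines='skip' is the equivalent there. A sketch of the same read under that assumption:
df = pd.read_csv(boltcogs, skiprows=12, on_bad_lines='skip', header=None, sep=r'\s+|\^+')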

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing and I want to find out which rows have a question in it, and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
                            QUERY  FREQ
0         hindi movies for adults   595
1              are panda dogs real   383
2         asuedraw winning numbers   478
3          sentry replacement keys   608
4   rebuilding nicad battery packs   541
After dropping empty rows, duplicates, and the FREQ column (not needed for this), I wrote a simple function to check the QUERY column to see if it contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)
def questions(row):
    questions_list = ["what","when","where","which","who","whom","whose","why","why don't",
                      "how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0
df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But once I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if my logic in the function is wrong. I've used something similar with dataframe columns that contain just one word: if it matches, it outputs a 1 or 0. However, that same logic doesn't seem to work when the column contains a phrase/sentence like in this use case. Any input is really appreciated!
If you wish to check whether a string from the dataframe contains any substring from questions_list, you should use the str.contains method:
questions_list = ["what","when","where","which","who","whom","whose","why",
"why don't", "how","how far","how long","how many",
"how much","how old","how come","?"]
pattern = "|".join(questions_list) # generate regex from your list
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
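Since the question asks for 1/0 rather than True/False, the boolean result can simply be cast (a small addition, not part of the original answer):
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern).astype(int)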
Simplified example:
df = pd.DataFrame({
    'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
    'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0     True
1     True
2    False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick Python re reference. For the symbol '|', the explanation is:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way
IIUC, you need to find whether the first word of the string is in the question list; if yes, return 1, else 0. In your function, rather than checking whether the entire string is in the question list, split the string and check if the first element is in the question list.
def questions(row):
    questions_list = ["are","what","when","where","which","who","whom","whose","why","why don't",
                      "how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0
df['QUESTIONS'] = df.apply(questions, axis=1)
You get
                            QUERY  FREQ  QUESTIONS
0         hindi movies for adults   595          0
1              are panda dogs real   383          1
2         asuedraw winning numbers   478          0
3          sentry replacement keys   608          0
4   rebuilding nicad battery packs   541          0
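For what it's worth, the same first-word check can be done without apply (a sketch, assuming questions_list is defined at module level as above):
df['QUESTIONS'] = df['QUERY'].str.split().str[0].isin(questions_list).astype(int)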

Read from the specific lines of a csv in pandas [duplicate]

I'm having trouble figuring out how to skip n rows in a csv file but keep the header, which is the first row.
What I want to do is iterate but keep the header from the first row. skiprows makes the header the first row after the skipped rows. What is the best way of doing this?
data = pd.read_csv('test.csv', sep='|', header=0, skiprows=10, nrows=10)
You can pass a list of row numbers to skiprows instead of an integer.
By giving the function the integer 10, you're just skipping the first 10 lines.
To keep the first row 0 (as the header) and then skip everything else up to row 10, you can write:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
Other ways to skip rows using read_csv
The two main ways to control which rows read_csv uses are the header or skiprows parameters.
Suppose we have the following CSV file with one column:
a
b
c
d
e
f
In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")).
Read all lines as values (no header, defaults to integers)
>>> pd.read_csv(f, header=None)
   0
0  a
1  b
2  c
3  d
4  e
5  f
Use a particular row as the header (skip all lines before that):
>>> pd.read_csv(f, header=3)
   d
0  e
1  f
Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):
>>> pd.read_csv(f, header=[2, 4])
   c
   e
0  f
Skip N rows from the start of the file (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=3)
   d
0  e
1  f
Skip one or more rows by giving the row indices (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=[2, 4])
   a
0  b
1  d
2  f
Great answers already. Consider this generalized scenario:
Say your xls/csv has junk rows in the top 2 rows (rows #0 and #1). Row #2 (the 3rd row) is the real header and you want to load 10 rows starting from row #50 (i.e. the 51st row).
Here's the snippet:
pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
To expand on @AlexRiley's answer, the skiprows argument takes a list of numbers that determines which rows to skip. So:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
is the same as:
pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])
The best way to go about ignoring specific rows would be to create your ignore list (either manually or with a function like range that returns a list of integers) and pass it to skiprows.
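For example, a quick sketch (the specific row numbers here are just hypothetical):
ignore = list(range(1, 10)) + [15, 20]  # skip rows 1-9 plus two arbitrary extra rows
pd.read_csv('test.csv', sep='|', skiprows=ignore)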
If you're iterating through a long csv file, you can use the chunksize argument. If for some reason you need to manually step through it, you can try the following as long as you know how many iterations you need to go through:
for i in range(num_iters):
    pd.read_csv('test.csv', sep='|', header=0,
                skiprows=range(i*10 + 1, (i+1)*10), nrows=10)
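And a minimal sketch of the chunksize alternative mentioned above (process is a hypothetical per-chunk handler; each chunk keeps the header's column names automatically):
for chunk in pd.read_csv('test.csv', sep='|', chunksize=10):
    process(chunk)  # do something with each 10-row DataFrame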
If you need to skip/drop specific rows, say the first 3 rows (i.e. 0, 1, 2) and then 2 more (i.e. 4, 5), you can use the following to retain the header row:
df = pd.read_csv(file_in, delimiter='\t', skiprows=[0,1,2,4,5], encoding='utf-16', usecols=cols)