Read specific lines from a CSV in pandas [duplicate]

I'm having trouble figuring out how to skip n rows in a CSV file but keep the header, which is the first row.
What I want to do is iterate over chunks but keep the header from the first row. skiprows makes the header the first row after the skipped rows. What is the best way of doing this?
data = pd.read_csv('test.csv', sep='|', header=0, skiprows=10, nrows=10)

You can pass a list of row numbers to skiprows instead of an integer.
By giving the function the integer 10, you're just skipping the first 10 lines.
To keep the first row (row 0, the header) and skip rows 1 through 9, you can write:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
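This generalizes to reading an arbitrary window of rows while keeping the header; a sketch, where start and stop are hypothetical values:
# Keep the header (row 0) and read only rows start..stop-1.
start, stop = 100, 120
df = pd.read_csv('test.csv', sep='|', header=0,
                 skiprows=range(1, start), nrows=stop - start)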
Other ways to skip rows using read_csv
The two main ways to control which rows read_csv uses are the header or skiprows parameters.
Suppose we have the following CSV file with one column:
a
b
c
d
e
f
In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")).
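For completeness, the setup for these examples is below; note that read_csv consumes the buffer, so recreate f before each call:
import io
import pandas as pd

f = io.StringIO("\n".join("abcdef"))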
Read all lines as values (no header; column names default to integers):
>>> pd.read_csv(f, header=None)
0
0 a
1 b
2 c
3 d
4 e
5 f
Use a particular row as the header (skip all lines before that):
>>> pd.read_csv(f, header=3)
d
0 e
1 f
Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):
>>> pd.read_csv(f, header=[2, 4])
c
e
0 f
Skip N rows from the start of the file (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=3)
d
0 e
1 f
Skip one or more rows by giving the row indices (the first row that's not skipped is the header):
>>> pd.read_csv(f, skiprows=[2, 4])
a
0 b
1 d
2 f
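skiprows also accepts a callable that is evaluated against each row index and returns True for rows to skip; a sketch that reproduces the example above:
>>> f = io.StringIO("\n".join("abcdef"))
>>> pd.read_csv(f, skiprows=lambda i: i in (2, 4))
a
0 b
1 d
2 f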

Great answers already. Consider this generalized scenario:
Say your xls/csv has junk rows in the top 2 rows (rows #0-1). Row #2 (the 3rd row) is the real header and you want to load 10 rows starting from row #50 (i.e. the 51st row).
Here's the snippet:
pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
Here skiprows drops rows 3-49, header=2 then uses row #2 as the header, and nrows=10 reads rows #50-59.

To expand on @AlexRiley's answer, the skiprows argument takes an iterable of row numbers that determines which rows to skip. So:
pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
is the same as:
pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])
The best way to ignore specific rows is to build your ignore list (either manually or with something like range, which produces a sequence of integers) and pass it to skiprows.

If you're iterating through a long CSV file, you can use the chunksize argument (see the sketch after this code). If for some reason you need to step through it manually, you can try the following, as long as you know how many iterations you need to go through; note that the skipped range must cover every data row before the current chunk:
for i in range(num_iters):
    chunk = pd.read_csv('test.csv', sep='|', header=0,
                        skiprows=range(1, i*10 + 1), nrows=10)
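The chunksize route mentioned above is usually simpler; a minimal sketch, where process is a hypothetical placeholder for your own logic:
for chunk in pd.read_csv('test.csv', sep='|', header=0, chunksize=10):
    process(chunk)  # each chunk is a DataFrame that reuses the parsed header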

If you need to skip/drop specific rows, say the first 3 rows (i.e. 0, 1, 2) and then 2 more (i.e. 4, 5), you can use the following to retain the header row:
df = pd.read_csv(file_in, delimiter='\t', skiprows=[0,1,2,4,5], encoding='utf-16', usecols=cols)  # file_in and cols defined elsewhere

Related

How to extract a character from a string in a column with 100k rows

I have a dataframe with a column x of 100k rows, as follows:
x
DIV.CDN
DIV.XYN
VIM.NGN
VIM.AHY
I need to extract the 3rd character to the right of the dot (.), for example:
N
N
N
Y
How do I do this with a pandas DataFrame?
Use the str accessor:
>>> df
x
0 DIV.CDN
1 DIV.XYN
2 VIM.NGN
3 VIM.AHY
>>> df['x'].str[-1]
0 N
1 N
2 N
3 Y
Name: x, dtype: object
Please read the documentation: Working with text data
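If the number of characters after the dot ever varies, a split-based variant makes "3rd character after the dot" explicit (a sketch against the frame above):
>>> df['x'].str.split('.').str[-1].str[2]
0 N
1 N
2 N
3 Y
Name: x, dtype: object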
Assuming that each column value is a string, you can grab the character with a simple index:
third_character = string[-1]
Here, it is essentially grabbing the last element of the string. To do this over all rows, one option is a for loop that does this row by row, appending to a list each time, as sketched below.
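A minimal sketch of that loop approach (the vectorized .str[-1] shown above is preferable at 100k rows):
third_chars = []
for value in df['x']:
    third_chars.append(value[-1])  # last character, i.e. the 3rd after the dot here
df['third_char'] = third_chars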

Getting the same value from a list in a dataframe column using Python

I have a dataframe with 3 columns. Now I have added one more column, in which I am adding unique values using the random module.
I created a list variable and, using a for loop, I am adding random strings to that list.
After that, I created another loop in which I extract each value from the list and add it to the column's values.
But the same value is added to every row each time.
import random
import string

import pandas as pd

df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
#Output.......
A B C randColumn
0 1 2 3 WHI11NJBNI8BOTMA9RKA
1 4 5 6 WHI11NJBNI8BOTMA9RKA
Could you please help me fix this? Why does each row have the same value from the list?
Updated to work correctly with any type of column in df.
If I understood your question correctly, you can use the zip method of an RDD to achieve your goal (note that this answer uses PySpark rather than plain pandas):
import random
import string

from pyspark.sql import SparkSession, Row

lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits)
                          for i in range(20))
    # Adding random strings as Row objects to the list
    lst.append(Row(random=rand_column))
# Making an RDD from the list of random strings
# (sparkSession and df are assumed to be defined elsewhere)
random_rdd = sparkSession.sparkContext.parallelize(lst)
res = (df.rdd.zip(random_rdd)
       .map(lambda rows: Row(**rows[0].asDict(), **rows[1].asDict()))
       .toDF())
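For reference, staying in plain pandas: the loop in the question assigns the scalar j to the whole column on every pass, so the last assignment wins in every row. Assigning one value per row in a single step avoids that; a sketch, assuming the list length matches the frame:
import random
import string

import pandas as pd

df = pd.read_csv("test.csv")
# One random 20-character string per row keeps the lengths aligned.
lst = [''.join(random.choices(string.ascii_uppercase + string.digits, k=20))
       for _ in range(len(df))]
df['randColumn'] = lst  # element-wise: each row gets its own value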

DataFrame column into multiple columns by delimiter ',': expand=True, n=-1

My first question, thanks! Sorry for the lengthy formulation.
Researched all related posts
What I have
My DataFrame column (please see the screenshot) contains car parameters as strings separated by the delimiter ','.
My DataFrame: (screenshot)
Some rows come with Mileage while others do not (screenshot), hence some rows have fewer delimiters.
The Task
Need to create 5 columns (the max number of delimiters) to store the CarParameters separately (Mileage, GearBox, HP, Body, etc.)
If a row doesn't have Mileage, put 0 in the Mileage column.
What I know and what works well
df["name"].str.split(" ", expand = True) by default n=-1 and splits into necessary columns
example:
The issue:
If I use the str.split(" ", expand = True) method - GearBox (ATM) is wrongly put under newly created Mileage column because that row is short of one delimiter (screenshot)
Result:-
-
You can try a lambda function combined with list concatenation, as below; the [0] * (5 - len(x)) term left-pads short rows so a missing Mileage becomes 0 in the first column.
>>> import pandas as pd
>>> df = pd.DataFrame([['1,2,3,4,5'],['2,3,4,5']], columns=["CarParameters"])
>>> print(pd.DataFrame(df.CarParameters.apply(
...     lambda x: str(x).split(',')).apply(
...     lambda x: [0] * (5 - len(x)) + x).to_list(), columns=list("ABCDE")))
A B C D E
0 1 2 3 4 5
1 0 2 3 4 5
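An equivalent idea done on the raw strings, as a sketch assuming every row has either 4 or 5 comma-separated fields: prepend '0,' where Mileage is missing, then split once with expand=True:
>>> fixed = df.CarParameters.where(df.CarParameters.str.count(',') == 4,  # 5 fields: keep
...                                '0,' + df.CarParameters)               # 4 fields: prepend 0
>>> out = fixed.str.split(',', expand=True)
>>> out.columns = list("ABCDE")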

Can pandas parse a CSV file with an unknown number of comments, a header, and line(s) to be skipped?

I have a file that has an unknown number of comments, followed by a header, followed by a second row that has to do with data types but is really just junk to me.
# Comment Line
# Another comment -- there could be lots
index value
not wanted
1 10
2 20
With a priori knowledge of the number of comments (which sort of violates the idea of comments), the file can be read with:
pd.read_csv(fname, header=0, comment='#', skiprows=[3])
In my case, though, the number 3 is unknown. I only know that the header is index 0, not counting comments, and that the unwanted row is index 1, not counting comments. header works the way I want, but skiprows does not. Is there a way to make use of this information to read the file easily? By "easily", I mean something short of the following, which opens the file, counts the preliminary comments, then reads:
ncomment = 0
crows = []
fname = "sample.csv"
with open(fname, "r") as f:
    while f.readline().startswith("#"):
        crows.append(ncomment)
        ncomment += 1
crows = crows + [ncomment + 1]
data = pd.read_csv(fname, header=0, skiprows=crows,
                   index_col=0, delim_whitespace=True)
print(data)
You can use header together with comment to get a MultiIndex, then drop the unwanted level. header is evaluated after the comments are removed, so it's always [0, 1]. (I'm using delim_whitespace=True because there aren't any ','s in your sample data.)
df = pd.read_csv('sample.csv', comment='#', header=[0, 1], delim_whitespace=True)
# index value
# not wanted
#0 1 10
#1 2 20
We can drop the extra level in the same chain:
df = (pd.read_csv('sample.csv', comment='#', header=[0, 1], delim_whitespace=True)
.droplevel(1, axis=1))
# index value
#0 1 10
#1 2 20
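One caveat: in recent pandas (2.2+), delim_whitespace=True is deprecated in favor of the equivalent sep=r'\s+', so the same read would be:
df = (pd.read_csv('sample.csv', comment='#', header=[0, 1], sep=r'\s+')
      .droplevel(1, axis=1))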

How to read a very messy .txt file using pd.read_csv() with multiple conditions

I have a very messy .txt file that I'm attempting to read in using pd.read_csv(). The file has multiple challenges to overcome:
1) The first 12 lines are not needed and therefore need to be skipped; the next 50 rows are needed, then the next 14 rows need to be skipped, then the next 50 rows are needed, the next 14 skipped, and so on.
2) Each logical row of data actually spans 2 physical rows in this report, meaning that we need to lift the 2nd row of data up next to the 1st row and place it to the right in new columns. (This action would halve the total number of rows and double the number of columns of the desired dataframe.)
3) The first row of data has 8 spaces of separation between values, while the 2nd row of data has anywhere from 8 to 17 spaces of separation between values.
I thought the best way to approach this would be to first remove the rows that I don't need. I would then find a way to merge row 1 with row 2, row 3 with row 4, row 5 with row 6, and so on until all rows are correctly consolidated. I would then use the sep parameter to separate the values of each row on anything that has 8 spaces or more. This would hopefully get to my desired output. Has anyone ever had a similar challenge that they have overcome?
First picture is an image of the raw data
Second picture is my ideal output
OK, so error_bad_lines=False combined with sep='\s+|\^+' worked a treat.
I then solved the problem of bad lines by removing them one by one.
I then solved the '1 row over 2 rows' problem by splitting the dataframe into two dfs (df8,df9) and recombined them on axis=1. Looks perfect now.
import pandas as pd  # importing pandas to wrangle the data

boltcogs = 'ABAPlist.txt'
df = pd.read_csv(boltcogs, skiprows=12, error_bad_lines=False,
                 header=None, sep='\s+|\^+')
df1 = df[df.iloc[:, 0] != 'Production']  ## removing verbose lines
df2 = df1[df1.iloc[:,0] != '----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------' ]
df3 = df2[df2.iloc[:, 0] != 'Kuala']  ## removing bad rows
df4 = df3[df3.iloc[:, 0] != 'Operating']  ## removing bad rows
df5 = df4[df4.iloc[:, 0] != 'Plant:']  ## removing bad rows
df6 = df5[df5.iloc[:, 0] != 'Costing']  ## removing bad rows
df7 = df6[df6.iloc[:, 0] != 'Currency:']  ## removing bad rows
df8 = df7.iloc[0::2, :].reset_index()  # every second row starting at 0: the first half of each logical row
df9 = df7.iloc[1::2, :].reset_index()  # the remaining rows: the second half of each logical row
df10 = pd.concat([df8, df9], axis=1, ignore_index=True)  # joining the halves side by side
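A caveat on the snippet above: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' is the modern equivalent, so the first read would become (a sketch under that assumption):
df = pd.read_csv(boltcogs, skiprows=12, on_bad_lines='skip',
                 header=None, sep=r'\s+|\^+')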