Pandas read_csv BZ2 File Always Includes File Name

Every time I load a .bz2 file into a Pandas dataframe, I get the name of the file as the first column of the first row. I'm using tar to compress the files. I have written the following something.txt file:
1 2 3 4 5
2 3 4 5 6
6 7 8 9 10
I compress it via tar -cvjf something.txt.bz2 something.txt. Then I decompress the data and move it to a separate file:
tar -xvjf something.txt.bz2
mv something.txt something.txt.2
Now I load the data in a python script in three different ways:
>>> data1 = pd.read_csv("something.txt")
>>> data2 = pd.read_csv("something.txt.2")
>>> data3 = pd.read_csv("something.txt.bz2")
and here's what I get when I read these data back:
>>> data1
1 2 3 4 5
0 2 3 4 5 6
1 6 7 8 9 10
>>> data2
1 2 3 4 5
0 2 3 4 5 6
1 6 7 8 9 10
>>> data3
something.txt 2 3 4 5
0 2.0 3.0 4.0 5.0 6.0
1 6.0 7.0 8.0 9.0 10.0
2 NaN NaN NaN NaN NaN
Does anybody know why this is happening?

This is how it works on my end. First, consider your data set something.txt:
"c0" "c1" "c2" "c3" "c4"
1 2 3 4 5
2 3 4 5 6
6 7 8 9 10
where I named the columns and used a single space as a separator for consistency. Then, compress it with bzip2 (not tar); tar creates an archive whose header records the member file name, and that header is what pandas was reading as data:
bzip2 -z9 something.txt
This will replace something.txt with something.txt.bz2 in your working directory. Finally, start a Python session and run the following:
import pandas as pd
df = pd.read_csv("something.txt.bz2", compression="bz2", sep=" ")
df
The last line will print:
c0 c1 c2 c3 c4
0 1 2 3 4 5
1 2 3 4 5 6
2 6 7 8 9 10
which shows the expected values.
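To see where the stray column name comes from, you can peek at what pandas receives once the bz2 layer is decompressed. With the tar-made file from the question, the tar header, which begins with the member file name, is still in the stream; a small check using the standard bz2 module:
import bz2

# the first bytes of the decompressed stream are the tar header,
# and a tar header starts with the member name "something.txt"
with bz2.open("something.txt.bz2", "rb") as f:
    print(f.read(64))
And if you do need to keep the tar wrapper, you can read the member directly with the standard tarfile module instead of decompressing by hand (member name taken from the question):
import tarfile
import pandas as pd

with tarfile.open("something.txt.bz2", "r:bz2") as archive:
    with archive.extractfile("something.txt") as member:
        df = pd.read_csv(member, sep=" ")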
I hope this helps!

Related

pandas function to find index of the first future instance where column is less than each row's value

I'm new to Stack Overflow, and I just have a question about solving a problem in pandas. I am looking to create a function that returns the index of the first future instance where a column is less than each row's value for that column.
For example, consider the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Val': [1, 2, 3, 4, 0, 1, -1, -2, -3]}, index = np.arange(0,9))
df
Index  Val
0      1
1      2
2      3
3      4
4      0
5      1
6      -1
7      -2
8      -3
I am looking for the output:
Index  F(Val)
0      4
1      4
2      4
3      4
4      6
5      6
6      7
7      8
8      NaN
Or the series/array equivalent of F(Val).
I've been able to solve this quite easily using for loops, but obviously this is extremely slow on the large dataset I am working with, and not a very elegant or optimal solution. My hope is that there is an efficient pandas function that employs vectorization.
Also, as a bonus question (if anyone can assist), how might the maximum value between each row's index and the F(Val) index be computed using vectorization? The output should look like:
Index  G(Val)
0      4
1      4
2      4
3      4
4      1
5      1
6      -1
7      -2
8      NaN
Thanks!
You can use:
# start a new group one row after each drop (a value below its predecessor)
grp = df['Val'].lt(df['Val'].shift()).shift(fill_value=0).cumsum()
# map each row to the last index of its group, then shift up one row
df['F(Val)'] = df.groupby(grp).transform(lambda x: x.index[-1]).shift(-1)
print(df)
# Output
Val F(Val)
0 1 4.0
1 2 4.0
2 3 4.0
3 4 4.0
4 0 6.0
5 1 6.0
6 -1 7.0
7 -2 8.0
8 -3 NaN
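For intuition, the grouper closes a group at each position where the value drops below its predecessor, and each row's answer is the last index of the run that the next row belongs to. Printing grp for the example data:
print(grp.tolist())
# [0, 0, 0, 0, 0, 1, 1, 2, 3]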
Using numpy broadcasting and the lower triangle (note that this materializes an n-by-n boolean matrix, so it is best suited to moderately sized frames):
a = df['Val'].to_numpy()
# m[i, j] is True when row i comes after row j and a[i] <= a[j]
m = np.tril(a[:, None] <= a, k=-1)
# the first True per column is the first future match; NaN if there is none
df['F(Val)'] = np.where(m.any(0), m.argmax(0), np.nan)
Same logic with expanding (the walrus operator requires Python 3.8+); the series is scanned from the end, and for each row the first future value less than or equal to it is picked:
df['F(Val)'] = (
    df.loc[::-1, 'Val'].expanding()
      .apply(lambda x: s.idxmax() if (s := (x.iloc[-2::-1] <= x.iloc[-1])).any()
             else np.nan)
)
Output (note that this run used a slightly different Val column, so it differs from the expected output in the question):
Val F(Val)
0 1 5.0 # here the next is 5
1 2 4.0
2 3 4.0
3 4 4.0
4 2 5.0
5 -2 7.0
6 -1 7.0
7 -2 8.0
8 -3 NaN

Assign column values from another dataframe with repeating key values

Please help me in pandas, I can't find a good solution.
I've tried map, assign, merge, join, and set_index.
Maybe I'm just too tired :)
df:
m_num A B
0 1 0 9
1 1 1 8
2 2 2 7
3 2 3 6
4 3 4 5
5 3 5 4
df1:
m_num C
0 2 99
1 2 88
df_final:
m_num A B C
0 1 0 9 NaN
1 1 1 8 NaN
2 2 2 7 99
3 2 3 6 88
4 3 4 5 NaN
5 3 5 4 NaN
Try:
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
df2 = pd.merge(df2, df1, on=[df1.index, 'm_num']).drop('key_0', axis=1)
df2 = pd.merge(df, df2, on=['m_num', 'A', 'B'], how='left')
print(df2)
Prints:
m_num A B C
0 1 0 9 NaN
1 1 1 8 NaN
2 2 2 7 99.0
3 2 3 6 88.0
4 3 4 5 NaN
5 3 5 4 NaN
Explanation:
There may be better solutions out there, but this was my thought process. The problem is slightly tricky in the sense that 'm_num' is the only common key and it has repeating values.
So first I created a dataframe matching df and df1 here so that I can use the index as another key for the subsequent merge.
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
This prints:
m_num A B
0 2 2 7
1 2 3 6
As you can see above, we now have the index values 0 and 1 in addition to m_num as keys, which we can use to match with df1.
df2 = pd.merge(df2, df1, on=[df1.index, 'm_num']).drop('key_0', axis=1)
This prints:
m_num A B C
0 2 2 7 99
1 2 3 6 88
Then tie the above resultant dataframe back to the original df with a left join to get the output.
df2 = pd.merge(df, df2, on=['m_num', 'A', 'B'], how='left')
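For what it's worth, a shorter route to the same result is to number the repeated keys with groupby().cumcount() on both frames, so the k-th occurrence of an m_num in df lines up with the k-th occurrence in df1, and then merge on both columns. A sketch using the frames from the question:
import pandas as pd

df = pd.DataFrame({'m_num': [1, 1, 2, 2, 3, 3],
                   'A': [0, 1, 2, 3, 4, 5],
                   'B': [9, 8, 7, 6, 5, 4]})
df1 = pd.DataFrame({'m_num': [2, 2], 'C': [99, 88]})

# occurrence counter per key, used as a secondary join key
df['occ'] = df.groupby('m_num').cumcount()
df1['occ'] = df1.groupby('m_num').cumcount()

df_final = df.merge(df1, on=['m_num', 'occ'], how='left').drop(columns='occ')
print(df_final)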

Compute lagged means per name and round in pandas

I need to compute lagged means per group in my dataframe. This is what my df looks like:
name value round
0 a 5 3
1 b 4 3
2 c 3 2
3 d 1 2
4 a 2 1
5 c 1 1
0 c 1 3
1 d 4 3
2 b 3 2
3 a 1 2
4 b 5 1
5 d 2 1
I would like to compute lagged means for column value per name and round. That is, for name a in round 3 I need value_mean = 1.5 (because (1+2)/2 over rounds 1 and 2). And of course, there will be NaN values when round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
name value round value_mean
0 a 5 3 NaN
1 b 4 3 5.0
2 c 3 2 3.5
3 d 1 2 NaN
4 a 2 1 4.0
5 c 1 1 3.5
0 c 1 3 NaN
1 d 4 3 3.0
2 b 3 2 2.0
3 a 1 2 NaN
4 b 5 1 1.0
5 d 2 1 2.5
Any idea, how can I do this, please? I found this, but it seems not relevant for my problem: Calculate the mean value using two columns in pandas
You can do that as follows:
import numpy as np
import pandas as pd

# sort the values as they need to be counted
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)
# create a grouper to calculate the running count
# and running sum as the basis of the average
grouper = df.groupby('name')
ser_sum = grouper['value'].cumsum()
ser_count = grouper['value'].cumcount() + 1
ser_mean = ser_sum.div(ser_count)
ser_same_name = df['name'] == df['name'].shift(1)
# finally you just have to set the first entry
# in each name-group to NaN (this usually would
# set the entries for each name and round=1 to NaN)
df['value_mean'] = ser_mean.shift(1).where(ser_same_name, np.nan)
# if you want to see the intermediate products,
# you can uncomment the following lines
#df['sum'] = ser_sum
#df['count'] = ser_count
df
Output:
name value round value_mean
0 a 2 1 NaN
1 a 1 2 2.0
2 a 5 3 1.5
3 b 5 1 NaN
4 b 3 2 5.0
5 b 4 3 4.0
6 c 1 1 NaN
7 c 3 2 1.0
8 c 1 3 2.0
9 d 2 1 NaN
10 d 1 2 2.0
11 d 4 3 1.5
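For comparison, the same lagged mean can be written more compactly with a per-group expanding mean; a sketch assuming df is already sorted by name and round as above:
df['value_mean'] = (df.groupby('name')['value']
                      .transform(lambda s: s.expanding().mean().shift()))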

How to append to a data frame from multiple loops

I have code that takes in files from csv and computes a price difference, but to make it simpler I made the reproducible example seen below. I want to append each result to the end of a specific column. For example, the first loop will go through size 1 and minute 1, so it should append to column 1;1 for file2, file3, and file4. So the output should be:
1;1 1;2 1;3 2;1 2;2 2;3
0 0 0   (the 2;* columns contain the same values as shown for 1;*)
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
5 5 5
0 0 0
0 0 0
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
6 6 6
6 6 6
0 0 0
0 0 0
0 0 0
2 2 2
2 2 2
4 4 4
4 4 4
6 6 6
7 7 7
I am using a loop to build the prefixed dataframe columns because, in my original code, the number of minutes, sizes, and files is supplied by the user.
import numpy as np
import pandas as pd
file =[1,2,3,4,5,6,6,2]
file2=[1,2,3,4,5,6,7,8]
file3=[1,2,3,4,5,6,7,9]
file4=[1,2,1,2,1,2,1,2]
size=[1,2]
minutes=[1,2,3]
list1=[file,file2,file3]
data=pd.DataFrame(file)
data2=pd.DataFrame(file2)
data3=pd.DataFrame(file3)
list1=(data,data2,data3)
datas=pd.DataFrame(file4)
col_names = [str(sizer)+';'+str(number) for sizer in size for number in minutes]
datanew=pd.DataFrame(columns=col_names)
for sizes in size:
    for minute in minutes:
        for files in list1:
            pricediff = files - data
            datanew[str(sizes)+';'+str(minute)] = datanew[str(sizes)+';'+str(minute)].append(pricediff, ignore_index=True)
print(datanew)
Edit: When trying this line: datanew=datanew.append({str(sizes)+';'+str(minute): df['pricediff']}, ignore_index=True) it appends the data, but the result isn't "clean".
The result from my original data gives me this:
111;5.0,1111;5.0
"0 4.5
1 0.5
2 8
3 8
4 8
...
704 3.5
705 0.5
706 11.5
707 0.5
708 9.0
Name: pricediff, Length: 709, dtype: object",
"price 0.0
0 0.0
Name: pricediff, dtype: float64",
"0 6.5
1 6.5
2 3.5
3 13.0
Name: pricediff, Length: 4, dtype: float64",
What you are looking for is:
datanew = datanew.append({str(sizes)+';'+str(minute): pricediff}, ignore_index=True)
This happens because you cannot change the length of a single column of a dataframe without modifying the length of the whole frame.
Now consider the below as an example:
import pandas as pd

df = pd.DataFrame({"a": list("xyzpqr"), "b": [1, 3, 5, 4, 2, 7], "c": list("pqrtuv")})
print(df)

# this will fail:
#df["c"] = df["c"].append("abc", ignore_index=True)
#print(df)

# what you can do instead:
df = df.append({"c": "abc"}, ignore_index=True)
print(df)

# you can even create a new column that way:
df = df.append({"x": "abc"}, ignore_index=True)
Edit
In order to append a pd.Series, do literally the same:
abc = pd.Series([-1, -2, -3], name="c")
df = df.append({"c": abc}, ignore_index=True)
print(df)
abc = pd.Series([-1, -2, -3], name="x")
df = df.append({"x": abc}, ignore_index=True)

Interpolate values in pandas using the nearest method?

How do I interpolate the nearest number to fill the gaps?
My pd.Series, named df1:
0 RK
1 1
2 2
3 3
4 4
5 NaN
6 6
7 7
8 8
9 NaN
10 10
And I would like to interpolate the nearest number to replace each NaN, like this:
0 RK
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
According to the official doc for pandas.Series.interpolate,
I tried
df1 = df1.interpolate(method='nearest', axis=0)
but nothing changes.
Need help, and thanks in advance. :~)
Just do:
df1.interpolate()
Don't bother with the method='nearest' option; the default method='linear' should do the trick. For the gap between 4 and 6, 'nearest' would copy one of the neighbouring values (4 or 6), whereas linear interpolation yields the 5 you expect.
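A minimal sketch (assuming the 'RK' entry is a stray header value; it forces the series to object dtype, which typically prevents interpolate from filling anything until the data is made numeric):
import numpy as np
import pandas as pd

# hypothetical reconstruction of the series from the question
df1 = pd.Series(['RK', 1, 2, 3, 4, np.nan, 6, 7, 8, np.nan, 10])
s = pd.to_numeric(df1, errors='coerce')  # 'RK' becomes NaN

# linear fill puts 5.0 and 9.0 in the gaps; the leading NaN
# (from 'RK') has no earlier value and therefore stays NaN
print(s.interpolate())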