Separate GeneIDs separated by "/" in one column into multiple columns - tidyverse

I have a data.frame with a single column of gene IDs separated by "/"; please see the data below.
  rows                                        x
1 row1 5788/3689/5230/8826/302/79026/203068/476
2 row2                  3312/6892/811/3123/3122
How can I obtain a data.frame with only one ID in each cell, like the data below? Thank you very much.
  rows   v1   v2   v3   v4   v5    v6     v7  v8
1 row1 5788 3689 5230 8826  302 79026 203068 476
2 row2 3312 6892  811 3123 3122

You can use read.table, treating the x column as a text input vector, and specify both fill=TRUE and sep="/":
read.table(text=dat$x, sep="/", fill =TRUE)
    V1   V2   V3   V4   V5    V6     V7  V8
1 5788 3689 5230 8826  302 79026 203068 476
2 3312 6892  811 3123 3122    NA     NA  NA
And cbind it to either the original dataframe or to just the rows column:
cbind(rows=dat$rows, read.table(text=dat$x, sep="/", fill =TRUE))
#------------
  rows   V1   V2   V3   V4   V5    V6     V7  V8
1 row1 5788 3689 5230 8826  302 79026 203068 476
2 row2 3312 6892  811 3123 3122    NA     NA  NA
The fill= parameter lets read.table accept input lines with varying numbers of items per line. The sep= parameter allows a different separation character than the usual whitespace or comma separators. NA values are used to pad out the shorter input lines. You are essentially treating a single column of data as an input file to be parsed.

Related

Filling Empty Rows with Dictionary Values via For Loop Pandas

I have a dictionary that looks like this:
my_dict = {2078:'T20',2153:'T20',2223:'T21',2219:'T21'}
My data frame:
Date Code Fleet KM
2021-20-03 2078 T20 20
2021-21-03 2078 NaN 22
2021-21-03 2153 T20 23
2021-21-03 2153 NaN 23
2021-22-03 2223 NaN 28
2021-22-03 2223 NaN 30
2021-22-03 2219 T21 23
2021-23-03 2219 NaN 23
I want to use the values of the dictionary to fill the empty rows in the Fleet column in my df.
So I wrote the code:
for index, row in df.iterrows():
    if (pd.isnull(row['Fleet'])):
        row['Fleet'] = my_dict.row['Fleet']
However, when I check df.info() I can see that the code did not apply even though it runs.
Could someone tell me what I am doing wrong?
Use Series.map with the dictionary and replace missing values with Series.fillna:
df['Fleet'] = df['Fleet'].fillna(df['Code'].map(my_dict))
Or Series.combine_first:
df['Fleet'] = df['Fleet'].combine_first(df['Code'].map(my_dict))
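For reference, a minimal runnable sketch of the first option, using the dictionary from the question and a cut-down version of the frame (the values are assumptions for illustration):
import pandas as pd
import numpy as np

my_dict = {2078: 'T20', 2153: 'T20', 2223: 'T21', 2219: 'T21'}
df = pd.DataFrame({'Code':  [2078, 2078, 2153, 2223],
                   'Fleet': ['T20', np.nan, np.nan, np.nan]})

# map Code -> Fleet via the dictionary, then use it only where Fleet is missing
df['Fleet'] = df['Fleet'].fillna(df['Code'].map(my_dict))
print(df)
#    Code Fleet
# 0  2078   T20
# 1  2078   T20
# 2  2153   T20
# 3  2223   T21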

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stocks data.
I have a list ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
For every stock in the list I get an output which contains trades and related info. It looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in the BUY and SELL columns
# each dataframe has a different number of rows
Now I tried to add an extra level of index to this dataframe like this:
symbol = name of the stock from the given list, e.g. for 3MINDIA.NS.csv the symbol is 3MINDIA
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
after this I tried to concatenate each trades dataframe that is generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am trying to do this so that whenever I call result_df[symbol]
# I should be able to see the trade dates for that particular symbol.
But I get a result_df that has a lot of NaN values, because each trades dataframe has a variable number of rows in it.
Is there any way I can combine the trades dataframes along the columns, with the stock symbol as the higher-level index, and not get all the NaN values in my result_df?
[screenshot of the result_df I got]
So I found a way to get what I wanted.
First I added this code in the loop:
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
After this I used concat again on result_df and trades:
# Desired Result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted
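For context, here is a small self-contained sketch of that pattern with made-up stand-in frames of different lengths (not the original data):
import pandas as pd

# two small stand-in "trades" frames of different lengths (values are made up)
t1 = pd.DataFrame({'BUY': ['2004-01-13', '2004-02-05'], 'profits': [-44.2, 18.0]})
t2 = pd.DataFrame({'BUY': ['2021-01-27'], 'profits': [500.15]})

result_df = pd.DataFrame()
for symbol, trades in [('3MINDIA', t1), ('AARTIDRUGS', t2)]:
    # add the symbol as an outer level of the row index
    trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
    # stack along the rows, so frames of different lengths don't pad each other with NaN
    result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)

print(result_df.loc['3MINDIA'])   # all trades for one symbol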

How to manipulate data in arrays using pandas

I have data in a dataframe and need to compare the current value of one column with the prior value of another column. The current time is row 5 in this dataframe, and here's the desired output:
The target data is streamed and captured into a DataFrame, then that column is multiplied by a constant to generate another column. However, I am unable to generate the third column, comp, which should compare the current value of prod with the prior value of comp.
df['temp'] = self.temp
df['prod'] = df['temp'].multiply(other=const1)
Another user suggested using this logic, but it generates errors because the routine's array doesn't match the size of the DataFrame:
for i in range(2, len(df['temp'])):
    df['comp'].append(max(df['prod'][i], df['comp'][i - 1]))
Let's try this, I think this will capture your intended logic:
df = pd.DataFrame({'col0': [1, 2, 3, 4, 5],
                   'col1': [5, 4.9, 5.5, 3.5, 6.3],
                   'col2': [2.5, 2.45, 2.75, 1.75, 3.15]})
df['col3'] = df['col2'].shift(-1).cummax().shift()
print(df)
Output:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 2.75
4 5 6.3 3.15 3.15
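If it helps to see what each step of that chain does, here is the same col2 with the intermediate results broken out into separate columns (illustration only):
import pandas as pd

df = pd.DataFrame({'col2': [2.5, 2.45, 2.75, 1.75, 3.15]})
df['step1'] = df['col2'].shift(-1)   # next row's value
df['step2'] = df['step1'].cummax()   # running maximum of the shifted values
df['col3'] = df['step2'].shift()     # shift everything back down one row
print(df)
#    col2  step1  step2  col3
# 0  2.50   2.45   2.45   NaN
# 1  2.45   2.75   2.75  2.45
# 2  2.75   1.75   2.75  2.75
# 3  1.75   3.15   3.15  2.75
# 4  3.15    NaN    NaN  3.15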

Split SparkR dataframe into list of dataframes

I am new to SparkR and am trying to split a SparkR dataframe into a list of dataframes based on columns.
The data has a billion records with the columns sls_d (date), mdse_item_i (item id), co_loc_i (location id), and traffic_ti_8_00, traffic_ti_9_00, traffic_ti_10_00, traffic_ti_11_00 (each holds the traffic count for that hour).
Data Snapshot:
sls_d co_loc_i mdse_item_i traffic_ti_8_00 traffic_ti_9_00 traffic_ti_10_00 traffic_ti_11_00
1 2016-10-21 1592 4694620 1 113 156 209
2 2016-10-21 1273 4694620 1 64 152 249
3 2016-10-21 1273 15281024 1 64 152 249
4 2016-10-21 1498 4694620 2 54 124 184
5 2016-10-21 1498 15281024 2 54 124 184
Desired Output:
sls_d co_loc_i mdse_item_i traffic_ti_8_00 traffic_ti_9_00 traffic_ti_10_00 traffic_ti_11_00
2016-10-21 4 4694620 3 67 145 283
A list of Dataframes.
d.2 = split(data.2.2,list(data.2.2$mdse_item_i,data.2.2$co_loc_i,data.2.2$sls_d))
Error in x[ind[[k]]] : Expressions other than filtering predicates
are not supported in the first parameter of extract operator [ or
subset() method.
Is there any way to do this in SparkR apart from converting the Spark dataframe to base R?
Converting the Spark dataframe to base R results in a memory error and defeats the purpose of parallel processing.
Any help is greatly appreciated.
Your question is somewhat unclear; if you mean to split the columns of a Spark dataframe, you should use select. Here is an example using the iris data in SparkR 2.2:
df <- as.DataFrame(iris) # Spark dataframe
df
# SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
# separate the length-related & width-related columns into 2 Spark dataframes:
df_length = select(df, 'Sepal_Length', 'Petal_Length')
df_width = select(df, 'Sepal_Width', 'Petal_Width')
head(collect(df_width)) # for demonstration purposes only
# Sepal_Width Petal_Width
# 1 3.5 0.2
# 2 3.0 0.2
# 3 3.2 0.2
# 4 3.1 0.2
# 5 3.6 0.2
# 6 3.9 0.4
Now, you can put these 2 Spark dataframes into an R list, but I'm not sure how useful this will be - any list operations that may make sense are not usable [EDIT after comment]:
my_list = c(df_length, df_width)
head(collect(my_list[[1]]))
# Sepal_Length Petal_Length
# 1 5.1 1.4
# 2 4.9 1.4
# 3 4.7 1.3
# 4 4.6 1.5
# 5 5.0 1.4
# 6 5.4 1.7

Can't seem to shorten decimal numbers of my Pandas column

So I have a df column which I created by taking the average of three other columns:
df['Avg_Grade'] = df.loc[:,'G1':'G3'].mean(axis =1)
The series looks like this (just a sample):
Avg_Grade
0 5.666667
1 5.333333
2 8.333333
3 14.666667
4 8.666667
I'm trying to truncate the output to show something like
0 5.67 (5.66 is also fine)
1 5.33
2 8.33
3 14.67
4 8.67
I've played around with the Decimal module with the following code, but I'm getting an error.
from decimal import *
getcontext().prec = 4
df['Avg_Grade'] = Decimal(df.loc[:,'G1':'G3'].mean(axis =1))
Traceback (most recent call last):
File "<pyshell#409>", line 1, in <module>
df['Avg_Grade'] = Decimal(df.loc[:,'G1':'G3'].mean(axis =1))
File "C:\Python27\lib\decimal.py", line 657, in __new__
raise TypeError("Cannot convert %r to Decimal" % value)
TypeError: Cannot convert 0 5.666667
There are a few ways you can do this, but they won't work in all situations.
Here's an example dataframe:
In [1]:
df = pd.DataFrame(10*np.random.rand(4,3), columns=['G1','G2','G3'])
df['Avg_Grade'] = df.loc[:,'G1':'G3'].mean(axis =1)
df
Out [1]:
G1 G2 G3 Avg_Grade
0 9.843159 4.155922 9.652694 7.883925
1 2.108822 9.347634 9.271351 6.909269
2 2.681108 3.071449 0.387151 2.046569
3 4.017461 9.464408 0.395482 4.625783
1. Use a global pandas setting
All floats will be displayed with two decimals. You can use either of the following:
pd.options.display.precision = 2
pd.set_option('display.precision', 2)
In [3]: df
Out[3]:
G1 G2 G3 Avg_Grade
0 9.84 4.16 9.65 7.88
1 2.11 9.35 9.27 6.91
2 2.68 3.07 0.39 2.05
3 4.02 9.46 0.40 4.63
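If you later want to return to the default precision, the option can be reset (a small aside, not part of the original answer):
pd.reset_option('display.precision')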
2. Use a global setting within a with statement.
All floats displayed within the with statement will be shown with two decimals, but afterwards the setting reverts to the regular value (default: 6).
In [4]: with pd.option_context('display.precision', 2):
   ...:     print(df)
Out[4]:
G1 G2 G3 Avg_Grade
0 9.84 4.16 9.65 7.88
1 2.11 9.35 9.27 6.91
2 2.68 3.07 0.39 2.05
3 4.02 9.46 0.40 4.63
Once you're outside of the with statement:
In [5]: print(df['Avg_Grade'])
0 7.883925
1 6.909269
2 2.046569
3 4.625783
Name: Avg_Grade, dtype: float64
3. Using an HTML styler.
This requires that you run your code in a Jupyter Notebook.
df.style.set_precision(3)
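Note that set_precision is deprecated on the styler in recent pandas versions; as far as I know, a format string achieves the same effect on an all-numeric frame:
df.style.format("{:.3f}")   # same idea as set_precision(3)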
4. Using round()
If you want to display something, you can also use something like:
df.round(2)
df['Avg_Grade'].round(2)
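Keep in mind that round returns a new object rather than modifying the frame in place, so assign the result back if you want to keep it:
df['Avg_Grade'] = df['Avg_Grade'].round(2)   # persist the rounded values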
5. Creating another dataframe or modifying in place
This way will let you customize the precision column by column, but the underlying data is changed, so you might want to do that on a copy.
# Create a copy so we don't mess up the original df
df_print = df.copy()
# Round down some numbers
df_print['Avg_Grade'] = df_print['Avg_Grade'].round(2)
df_print['G1'] = df_print['Avg_Grade'].round(4)
# Add more decimals: need to switch that to a string representation
df_print['G3'] = df_print['G3'].map(lambda x: "{:,.10f}".format(x))
# display
df_print
G1 G2 G3 Avg_Grade
0 7.88 4.155922 9.6526935480 7.88
1 6.91 9.347634 9.2713506079 6.91
2 2.05 3.071449 0.3871511232 2.05
3 4.63 9.464408 0.3954815519 4.63
If you don't want to round the values inside the columns, you can just change the global display settings:
pd.set_option('display.precision', 2)
It also works for the styler.