I have a table that looks like the one below:
Product 1 2 3 4 5 6 7 8 9
005778 110023 112623 117273 4371 4377 50563
070370 110023 112623 1930 40007 4216 4310 4318 4428 56257
010702 110023 112623 2392 40007
012702 110023 112623 2392 40007
017965 110023 112623 2392 40007
017966 110023 112623 2392 40007
034350 110023 112623 2622 40007 56257
024940 110023 112623 2622 40007 56257
071300 110023 112623 40007 4215 4216 4218 4321 56257
071330 110023 112623 40007 4215 4216 4218 4321 56257
I want it to look like this:
Product 1 2 3 4 5 6 7 8 9
005778 110023 112623 117273 4371 4377 50563
070370 110023 112623 1930 40007 4216 4310 4318 4428 56257
010702/012702/017965/017966 110023 112623 2392 40007
034350/024940 110023 112623 2622 40007 56257
071300/071330 110023 112623 40007 4215 4216 4218 4321 56257
I have tried Allen Browne's ConcatRelated() without success. I am trying to combine this data for use in an Access 2010 report. A VBA or SQL solution would be appreciated.
One way to do it would be to add a Calculated field to the table to concatenate fields [1] through [9] together as a string, i.e.,
Field Name: ValuesString
Expression: "|" & [1] & "|" & [2] & "|" & [3] & "|" & [4] & "|" & [5] & "|" & [6] & "|" & [7] & "|" & [8] & "|" & [9] & "|"
Result Type: Text
Then you can use ConcatRelated() like so:
SELECT
Max(ConcatRelated("Product","YourTable","ValuesString=""" & [ValuesString] & """","","/")) AS ProductList,
Max(YourTable.[1]) AS MaxOf1,
Max(YourTable.[2]) AS MaxOf2,
Max(YourTable.[3]) AS MaxOf3,
Max(YourTable.[4]) AS MaxOf4,
Max(YourTable.[5]) AS MaxOf5,
Max(YourTable.[6]) AS MaxOf6,
Max(YourTable.[7]) AS MaxOf7,
Max(YourTable.[8]) AS MaxOf8,
Max(YourTable.[9]) AS MaxOf9
FROM YourTable
GROUP BY YourTable.ValuesString;
returning
ProductList MaxOf1 MaxOf2 MaxOf3 MaxOf4 MaxOf5 MaxOf6 MaxOf7 MaxOf8 MaxOf9
--------------------------- ------ ------ ------ ------ ------ ------ ------ ------ ------
005778 110023 112623 117273 4371 4377 50563
070370 110023 112623 1930 40007 4216 4310 4318 4428 56257
010702/012702/017965/017966 110023 112623 2392 40007
034350/024940 110023 112623 2622 40007 56257
071300/071330 110023 112623 40007 4215 4216 4218 4321 56257
In SQL Server 2005 or later, you can try something like:
select products, t.*
from ( select distinct [1],[2],[3],[4],[5],[6],[7],[8],[9] from tbl ) t
cross apply (select products = stuff(
(select '/' + product from tbl z
where z.[1] = t.[1] AND z.[2] = t.[2] ... AND z.[9] = t.[9]
for xml path('')) --concat list
, 1,1,'') ) z --remove first '/'
You will need to use isnull(z.[1], '') = isnull(t.[1], '') etc. if the columns can be NULL.
I have a df like the following ...
import numpy as np
import pandas as pd

idx = pd.Index(['2019-06-17 00:00:00', '2019-06-17 00:01:01', '2019-06-17 00:02:00',
                '2019-06-17 00:03:04', '2020-06-17 00:04:00', '2020-06-17 00:05:00',
                '2020-06-17 00:06:00', '2020-06-17 00:07:00', '2020-06-17 00:08:00',
                '2020-06-17 00:09:00', '2020-06-17 00:10:00', '2020-06-17 00:11:00',
                '2020-06-17 00:12:00', '2020-06-17 00:13:00', '2020-06-17 00:14:00'],
               name='Time')
df = pd.DataFrame({"A": [8001, 7999, 7998, np.nan, 9900, 9342, 9324, 8534, 8358, 9457,
                         np.nan, 8999, 8492, np.nan, np.nan],
                   "B": [201, 209, 298, 300, np.nan, 342, 324, 854, 858, 457,
                         145, 189, 192, 134, 135],
                   "C": [11991, 15631, 47998, 38030, 19900, 29342, np.nan, 28534, 28358, 29457,
                         27245, 28999, 28492, 29334, 28234]}, index=idx)
Time A B C
'2019-06-17 00:00:00' 8001 201 11991
'2019-06-17 00:01:01' 7999 209 15631
'2019-06-17 00:02:00' 7998 298 47998
'2019-06-17 00:03:04' NaN 300 38030
'2020-06-17 00:04:00' 9900 NaN 19900
'2020-06-17 00:05:00' 9342 342 29342
'2020-06-17 00:06:00' 9324 324 NaN
'2020-06-17 00:07:00' 8534 854 28534
'2020-06-17 00:08:00' 8358 858 28358
'2020-06-17 00:09:00' 9457 457 29457
'2020-06-17 00:10:00' NaN 145 27245
'2020-06-17 00:11:00' 8999 189 28999
'2020-06-17 00:12:00' 8492 192 28492
'2020-06-17 00:13:00' NaN 134 29334
'2020-06-17 00:14:00' NaN 135 28234
... and I can sequentialize it. So far, no problem.
(I already have code for this; the outcome for sequence_length == 5 is:)
array([
[['2019-06-17 00:00:00' 8001 201 11991 ]
['2019-06-17 00:01:01' 7999 209 15631 ]
['2019-06-17 00:02:00' 7998 298 47998 ]
['2019-06-17 00:03:04' NaN 300 38030 ]
['2020-06-17 00:04:00' 9900 NaN 19900 ]]
[['2019-06-17 00:01:01' 7999 209 15631 ]
['2019-06-17 00:02:00' 7998 298 47998 ]
['2019-06-17 00:03:04' NaN 300 38030 ]
['2020-06-17 00:04:00' 9900 NaN 19900 ]
['2020-06-17 00:05:00' 9342 342 29342 ]]
[['2019-06-17 00:02:00' 7998 298 47998 ]
['2019-06-17 00:03:04' NaN 300 38030 ]
['2020-06-17 00:04:00' 9900 NaN 19900 ]
['2020-06-17 00:05:00' 9342 342 29342 ]
['2020-06-17 00:06:00' 9324 324 NaN ]]
[['2019-06-17 00:03:04' NaN 300 38030 ]
['2020-06-17 00:04:00' 9900 NaN 19900 ]
['2020-06-17 00:05:00' 9342 342 29342 ]
['2020-06-17 00:06:00' 9324 324 NaN ]
['2020-06-17 00:07:00' 8534 854 28534 ]]
[['2020-06-17 00:04:00' 9900 NaN 19900 ]
['2020-06-17 00:05:00' 9342 342 29342 ]
['2020-06-17 00:06:00' 9324 324 NaN ]
['2020-06-17 00:07:00' 8534 854 28534 ]
['2020-06-17 00:08:00' 8358 858 28358 ]]
[['2020-06-17 00:05:00' 9342 342 29342 ]
['2020-06-17 00:06:00' 9324 324 NaN ]
['2020-06-17 00:07:00' 8534 854 28534 ]
['2020-06-17 00:08:00' 8358 858 28358 ]
['2020-06-17 00:09:00' 9457 457 29457 ]]
[['2020-06-17 00:06:00' 9324 324 NaN ]
['2020-06-17 00:07:00' 8534 854 28534 ]
['2020-06-17 00:08:00' 8358 858 28358 ]
['2020-06-17 00:09:00' 9457 457 29457 ]
['2020-06-17 00:10:00' NaN 145 27245 ]]
[['2020-06-17 00:07:00' 8534 854 28534 ]
['2020-06-17 00:08:00' 8358 858 28358 ]
['2020-06-17 00:09:00' 9457 457 29457 ]
['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]]
[['2020-06-17 00:08:00' 8358 858 28358 ]
['2020-06-17 00:09:00' 9457 457 29457 ]
['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]]
[['2020-06-17 00:09:00' 9457 457 29457 ]
['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]
['2020-06-17 00:13:00' NaN 134 29334 ]]
[['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]
['2020-06-17 00:13:00' NaN 134 29334 ]
['2020-06-17 00:14:00' NaN 135 28234 ]]
])
# (I kept the time indices in there to clarify things)
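For reference, a simplified sketch of the kind of code that produces the windows above (not necessarily my exact implementation, using the df defined above):

import numpy as np

seq_len = 5
values = df.reset_index().to_numpy()        # timestamps plus A, B, C as one array
windows = np.array([values[i:i + seq_len]   # one window per starting row
                    for i in range(len(values) - seq_len + 1)])
print(windows.shape)                        # (11, 5, 4)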
I can't work with NaNs, but I am also not allowed to interpolate the data. So:
Question:
How do I shift the previous valid values up into the positions where the NaNs are?
(like in this example:)
# Before:
[['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]
['2020-06-17 00:13:00' NaN 134 29334 ]
['2020-06-17 00:14:00' NaN 135 28234 ]]
# Expected Result: # v-- the last 5 valid values in "A" till the current index (=='2020-06-17 00:14:00')
[['2020-06-17 00:10:00' 8534 145 27245 ]
['2020-06-17 00:11:00' 8358 189 28999 ]
['2020-06-17 00:12:00' 9457 192 28492 ]
['2020-06-17 00:13:00' 8999 134 29334 ]
['2020-06-17 00:14:00' 8492 135 28234 ]]
Probably a few early samples get lost when the sequence length gets too high, but that's okay (since there could be a very high number of NaNs in one column).
EDIT (how to do this?):
In steps it would be like this:
Step 1: (In the df) get the X last valid elements up to the current date (e.g. '00:14:00') (as a list)
Step 2: Replace the "column" (in the nested list) by the list from Step 1 (how do I select it?)
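One way to do both steps directly from the df (a minimal sketch, assuming the frame above is named df and that early windows without enough valid history are simply dropped, as allowed above):

import numpy as np

seq_len = 5

def window_at(frame, end_pos, seq_len):
    # the window keeps its own timestamps; each column is filled with its
    # last `seq_len` non-NaN values up to (and including) end_pos
    stamps = frame.index[end_pos - seq_len + 1 : end_pos + 1]
    cols = []
    for col in frame.columns:
        valid = frame[col].iloc[:end_pos + 1].dropna().iloc[-seq_len:]
        if len(valid) < seq_len:   # not enough valid history yet:
            return None            # this early window is dropped
        cols.append(valid.to_numpy())
    return np.column_stack([stamps.to_numpy(), *cols])

windows = [w for p in range(seq_len - 1, len(df))
           if (w := window_at(df, p, seq_len)) is not None]

The last entry of windows then matches the expected result above.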
I got a DataFrame:
date phone sensor pallet
126 2019-04-15 940203 C0382C391A4D 47
127 2019-04-15 940203 C0382D392A4D 47
133 2019-04-16 940203 C0382C391A4D 47
134 2019-04-16 940203 C0382D392A4D 47
138 2019-04-17 940203 C0382C391A4D 47
139 2019-04-17 940203 C0382D392A4D 47
144 2019-04-18 940203 C0382C391A4D 47
145 2019-04-18 940203 C0382D392A4D 47
156 2019-04-19 940203 C0382D392A4D 47
157 2019-04-19 940203 C0382C391A4D 47
277 2019-04-15 941557 C0392D362735 32
279 2019-04-15 941557 C03633364D50 32
286 2019-04-16 941557 C03633364D50 32
287 2019-04-16 941557 C0392D362735 32
296 2019-04-17 941557 C03633364D50 32
297 2019-04-17 941557 C0392D362735 32
305 2019-04-18 941557 C0392D362735 32
306 2019-04-18 941557 C03633364D50 32
317 2019-04-19 941557 C03633364D50 32
318 2019-04-19 941557 C0392D362735 32
561 2019-04-15 942316 C0384639224D 45
562 2019-04-15 942316 C03632364950 45
563 2019-04-15 942316 C03920363835 45
564 2019-04-15 942316 C0382939384D 45
573 2019-04-16 942316 C0382939384D 45
574 2019-04-16 942316 C0384639224D 45
575 2019-04-16 942316 C03632364950 45
I want to be able to make a subplot for each pallet showing the sensors that arrived on each date.
Example:
I have tried a few methods:
ax.plot_date
Looping through the opened axes and plotting on each one:
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
    ax.plot_date(grouped.get_group(key)['date'], grouped.get_group(key)['sensor'], 'o')
plt.show()
which gives weirdly formatted, repeating dates (indexing the df by date isn't solving the problem).
DataFrame plotting
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
    grouped.get_group(key).plot(x='date', y='sensor', ax=ax)
    ax.legend()
plt.show()
or
grouped = pallets_arrived.set_index('date').groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
    grouped.get_group(key).plot(grouped.get_group(key).index, y='sensor', ax=ax)
    ax.legend()
plt.show()
pyplot
grouped = pallets_arrived.groupby('pallet')
nrows = 2
ncols = 2
fig, axs = plt.subplots(nrows, ncols)
targets = zip(grouped.groups.keys(), axs.flatten())
for i, (key, ax) in enumerate(targets):
    plt.sca(ax)
    plt.plot(grouped.get_group(key)['date'], grouped.get_group(key)['sensor'])
    ax.legend()
plt.show()
which again gives the same problem.
Pivoting pallets to plot() on columns (pallets)
which doesn't work because there is more than one sensor per pallet on the same date, so there is a duplicate-value error...
I really don't know which method to use to get this right:
grouping similar dates on the x axis,
being able to plot each pallet in a different subplot.
I think I don't correctly understand the pandas wrapping of matplotlib.
I'd be glad for an explanation, because I'm reading guides and still can't tell the preferred method for this kind of thing.
Thanks a lot to the helpers.
You can use matplotlib to plot categorical data:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
%matplotlib inline
fig, ax = plt.subplots()
ax.scatter(df['date'], df['sensor'])
plt.show()
or if you want to color the groups:
fig, ax = plt.subplots()
for _, g in df.groupby('pallet'):
    ax.scatter(g['date'], g['sensor'])
plt.show()
you can also add a legend:
fig, ax = plt.subplots()
for _, g in df.groupby('pallet'):
    ax.scatter(g['date'], g['sensor'], label='Pallet_' + str(_))
ax.legend()
plt.show()
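And to get one subplot per pallet, as asked (a sketch along the same lines, assuming the frame is named df):

import matplotlib.pyplot as plt

groups = df.groupby('pallet')
fig, axs = plt.subplots(1, groups.ngroups,
                        figsize=(4 * groups.ngroups, 4), squeeze=False)
for ax, (pallet, g) in zip(axs.flatten(), groups):
    ax.scatter(g['date'], g['sensor'])          # sensors arriving per date
    ax.set_title('Pallet ' + str(pallet))
    ax.tick_params(axis='x', labelrotation=90)  # dates are long strings
fig.tight_layout()
plt.show()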
I want a scatter plot of the sum of the flight field per minute. My data is as follows:
http://python2018.byethost10.com/flights.csv
My code is as follows:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
df = pd.read_csv('flights.csv')
df["time_hour"] = pd.to_datetime(df['time_hour'])
grp = df.groupby(by=[df.time_hour.map(lambda x : (x.hour, x.minute))])
a=grp.sum()
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Produced the following error:
Traceback (most recent call last):
File "I:/PycharmProjects/1223/raise1/char3.py", line 10, in
Plt.scatter(a.index, a['flight'], c='b', marker='o')
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 3470, in scatter
Edgecolors=edgecolors, data=data, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib__init__.py", line 1855, in inner
Return func(ax, *args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes_axes.py", line 4320, in scatter
Alpha=alpha
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 927, in init
Collection.init(self, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 159, in init
Offsets = np.asanyarray(offsets, float)
File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 544, in asanyarray
Return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.
How can I produce the following results? Thank you.
http://python2018.byethost10.com/image.png
The problem is in the aggregation: your code returns tuples in the index.
The solution is to convert the time_hour column to HH:MM strings with Series.dt.strftime:
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
All together:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
#first column is index and second clumn is parsed to datetimes
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00 122793 37856 87445 11282.0 72838 366 1256
05:01 120780 44810 82113 11115.0 71168 435 1310
05:02 122793 52989 99975 11165.0 72068 515 1489
05:03 120780 57653 98323 10366.0 65137 561 1553
05:04 122793 67706 110230 10026.0 63118 661 1606
05:05 122793 75807 126426 9161.0 55371 742 1607
05:06 120780 82010 120753 10804.0 67827 799 2110
05:07 122793 90684 130339 8408.0 52945 890 1684
05:08 120780 93687 114415 10299.0 63271 922 1487
05:09 122793 101571 99526 11525.0 72915 1002 1371
05:10 122793 107252 107961 10383.0 70137 1056 1652
05:11 120780 111351 120261 10949.0 73350 1098 1551
05:12 122793 120575 135930 8661.0 57406 1190 1575
05:13 120780 118272 104763 7784.0 55886 1166 1672
05:14 122793 37289 109300 9838.0 63582 364 889
05:15 122793 42374 67193 11480.0 78183 409 1474
05:16 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
#rotate labels of x axis
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Another solution is to convert the datetimes to times:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = 'Noto Serif CJK TC'
matplotlib.rcParams['font.family']='sans-serif'
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.time]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00:00 122793 37856 87445 11282.0 72838 366 1256
05:01:00 120780 44810 82113 11115.0 71168 435 1310
05:02:00 122793 52989 99975 11165.0 72068 515 1489
05:03:00 120780 57653 98323 10366.0 65137 561 1553
05:04:00 122793 67706 110230 10026.0 63118 661 1606
05:05:00 122793 75807 126426 9161.0 55371 742 1607
05:06:00 120780 82010 120753 10804.0 67827 799 2110
05:07:00 122793 90684 130339 8408.0 52945 890 1684
05:08:00 120780 93687 114415 10299.0 63271 922 1487
05:09:00 122793 101571 99526 11525.0 72915 1002 1371
05:10:00 122793 107252 107961 10383.0 70137 1056 1652
05:11:00 120780 111351 120261 10949.0 73350 1098 1551
05:12:00 122793 120575 135930 8661.0 57406 1190 1575
05:13:00 120780 118272 104763 7784.0 55886 1166 1672
05:14:00 122793 37289 109300 9838.0 63582 364 889
05:15:00 122793 42374 67193 11480.0 78183 409 1474
05:16:00 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
This is what the head of my data frame looks like
> head(d19_1)
SMZ SIZ1_diff SIZ1_base SIZ2_diff SIZ2_base SIZ3_diff SIZ3_base SIZ4_diff SIZ4_base SIZ5_diff SIZ5_base
1 1 -620 4170 -189 1347 -35 2040 82 1437 244 1533
2 2 -219 831 -57 255 -4 392 8 282 14 297
3 3 -426 834 -162 294 -134 379 -81 241 -22 221
4 4 -481 676 -142 216 -114 267 -50 158 -43 166
5 5 -233 1711 -109 584 54 913 71 624 74 707
6 6 -322 1539 -79 512 -50 799 23 532 63 576
Total_og Total_base %_SIZ1 %_SIZ2 %_SIZ3 %_SIZ4 %_SIZ5 Total_og Total_base
1 11980 12648 14.86811 14.03118 1.715686 5.706333 15.916504 11980 12648
2 2156 2415 26.35379 22.35294 1.020408 2.836879 4.713805 2156 2415
3 1367 2314 51.07914 55.10204 35.356201 33.609959 9.954751 1367 2314
4 790 1736 71.15385 65.74074 42.696629 31.645570 25.903614 790 1736
5 5339 5496 13.61777 18.66438 5.914567 11.378205 10.466761 5339 5496
6 4362 4747 20.92268 15.42969 6.257822 4.323308 10.937500 4362 4747
The structure of the data frame, as given by str(d19_1), is below:
> str(d19_1)
'data.frame': 1588 obs. of 20 variables:
$ SMZ : int 1 2 3 4 5 6 7 8 9 10 ...
$ SIZ1_diff : int -620 -219 -426 -481 -233 -322 -176 -112 -34 -103 ...
$ SIZ1_base : int 4170 831 834 676 1711 1539 720 1396 998 1392 ...
$ SIZ2_diff : int -189 -57 -162 -142 -109 -79 -12 72 -36 -33 ...
$ SIZ2_base : int 1347 255 294 216 584 512 196 437 343 479 ...
$ SIZ3_diff : int -35 -4 -134 -114 54 -50 16 4 26 83 ...
$ SIZ3_base : int 2040 392 379 267 913 799 361 804 566 725 ...
$ SIZ4_diff : int 82 8 -81 -50 71 23 36 127 46 75 ...
$ SIZ4_base : int 1437 282 241 158 624 532 242 471 363 509 ...
$ SIZ5_diff : int 244 14 -22 -43 74 63 11 143 79 125 ...
$ SIZ5_base : int 1533 297 221 166 707 576 263 582 429 536 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
$ %_SIZ1 : num 14.9 26.4 51.1 71.2 13.6 ...
$ %_SIZ2 : num 14 22.4 55.1 65.7 18.7 ...
$ %_SIZ3 : num 1.72 1.02 35.36 42.7 5.91 ...
$ %_SIZ4 : num 5.71 2.84 33.61 31.65 11.38 ...
$ %_SIZ5 : num 15.92 4.71 9.95 25.9 10.47 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
When I run the query below, it returns the error below and I don't know why; I don't have any <NA> column in the table.
Query
d20_1 <- sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')
Error:
Error in rsqlite_send_query(conn@ptr, statement) :
  table d19_1 has no column named <NA>
Your code works correctly for me. One thing to note: your str(d19_1) output lists Total_og and Total_base twice, plus %_SIZ* names that are not syntactically valid; in the reproduction below, R has already de-duplicated them to Total_og.1 / Total_base.1 and converted % to X._ . If your actual frame still carries duplicated or special-character column names, that is a plausible source of the "no column named <NA>" error.
d19_1 <- structure(list(SMZ = 1:6, SIZ1_diff = c(-620L, -219L, -426L,
-481L, -233L, -322L), SIZ1_base = c(4170L, 831L, 834L, 676L,
1711L, 1539L), SIZ2_diff = c(-189L, -57L, -162L, -142L, -109L,
-79L), SIZ2_base = c(1347L, 255L, 294L, 216L, 584L, 512L), SIZ3_diff = c(-35L,
-4L, -134L, -114L, 54L, -50L), SIZ3_base = c(2040L, 392L, 379L,
267L, 913L, 799L), SIZ4_diff = c(82L, 8L, -81L, -50L, 71L, 23L
), SIZ4_base = c(1437L, 282L, 241L, 158L, 624L, 532L), SIZ5_diff = c(244L,
14L, -22L, -43L, 74L, 63L), SIZ5_base = c(1533L, 297L, 221L,
166L, 707L, 576L), Total_og = c(11980L, 2156L, 1367L, 790L, 5339L,
4362L), Total_base = c(12648L, 2415L, 2314L, 1736L, 5496L, 4747L
), X._SIZ1 = c(14.86811, 26.35379, 51.07914, 71.15385, 13.61777,
20.92268), X._SIZ2 = c(14.03118, 22.35294, 55.10204, 65.74074,
18.66438, 15.42969), X._SIZ3 = c(1.715686, 1.020408, 35.356201,
42.696629, 5.914567, 6.257822), X._SIZ4 = c(5.706333, 2.836879,
33.609959, 31.64557, 11.378205, 4.323308), X._SIZ5 = c(15.916504,
4.713805, 9.954751, 25.903614, 10.466761, 10.9375), Total_og.1 = c(11980L,
2156L, 1367L, 790L, 5339L, 4362L), Total_base.1 = c(12648L, 2415L,
2314L, 1736L, 5496L, 4747L)), .Names = c("SMZ", "SIZ1_diff",
"SIZ1_base", "SIZ2_diff", "SIZ2_base", "SIZ3_diff", "SIZ3_base",
"SIZ4_diff", "SIZ4_base", "SIZ5_diff", "SIZ5_base", "Total_og",
"Total_base", "X._SIZ1", "X._SIZ2", "X._SIZ3", "X._SIZ4", "X._SIZ5",
"Total_og.1", "Total_base.1"), row.names = c(NA, -6L), class = "data.frame")
library(sqldf)
sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')
I have a multi-FASTA file that needs to be parsed so the Glimmer multi-extract script can process it. It is composed of many contigs, each with its own header that starts with ">". What I need is to add each header as a new column; the problem is that I don't know very much about the Linux bash, or awk for that matter.
>contig-7
orf00002 1741 461
orf00003 3381 1747
>Wcontig-7000023
>Wcontig-11112
orf00001 426 2648
orf00002 2710 4581
orf00003 4569 5480
orf00004 6990 6133
orf00006 9180 7108
orf00007 10201 9209
orf00008 11663 10203
orf00009 12489 11680
orf00010 13153 12473
orf00011 14382 13225
orf00013 14715 15968
orf00014 19868 16410
>Wcontig-1674000002
orf00001 2995 637
orf00002 2497 1166
orf00003 2984 2529
I need to have each contig header added as a first column along with a tab delimiter.
>contig-7
>contig-7 orf00002 1741 461
>contig-7 orf00003 3381 1747
>Wcontig-7000023
>Wcontig-11112
>Wcontig-11112 orf00001 426 2648
>Wcontig-11112 orf00002 2710 4581
>Wcontig-11112 orf00003 4569 5480
>Wcontig-11112 orf00004 6990 6133
>Wcontig-11112 orf00006 9180 7108
>Wcontig-11112 orf00007 10201 9209
>Wcontig-11112 orf00008 11663 10203
>Wcontig-11112 orf00009 12489 11680
>Wcontig-11112 orf00010 13153 12473
>Wcontig-11112 orf00011 14382 13225
>Wcontig-11112 orf00013 14715 15968
>Wcontig-11112 orf00014 19868 16410
>Wcontig-1674000002
>Wcontig-1674000002 orf00001 2995 637
>Wcontig-1674000002 orf00002 2497 1166
>Wcontig-1674000002 orf00003 2984 2529
Also, after adding the new column I have to erase all the headers, so it would end up looking like this:
>contig-7 orf00002 1741 461
>contig-7 orf00003 3381 1747
>Wcontig-11112 orf00001 426 2648
>Wcontig-11112 orf00002 2710 4581
>Wcontig-11112 orf00003 4569 5480
>Wcontig-11112 orf00004 6990 6133
>Wcontig-11112 orf00006 9180 7108
>Wcontig-11112 orf00007 10201 9209
>Wcontig-11112 orf00008 11663 10203
>Wcontig-11112 orf00009 12489 11680
>Wcontig-11112 orf00010 13153 12473
>Wcontig-11112 orf00011 14382 13225
>Wcontig-11112 orf00013 14715 15968
>Wcontig-11112 orf00014 19868 16410
>Wcontig-1674000002 orf00001 2995 637
>Wcontig-1674000002 orf00002 2497 1166
>Wcontig-1674000002 orf00003 2984 2529
Awk can be really handy for this problem:
awk '{if($1 ~ /contig/){c=$1}else{print c"\t"$0}}' yourfile
Header lines (those whose first field matches contig) just update the saved header c; every other line is printed with that header and a tab prepended, which also drops the bare header lines from the output.
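Since you mention not knowing bash or awk well, here is the same logic as a Python sketch (the input filename orfs.txt is just a placeholder):

header = None
with open('orfs.txt') as fh:              # placeholder filename, adjust as needed
    for line in fh:
        line = line.rstrip('\n')
        if line.startswith('>'):          # contig header: remember it
            header = line
        elif line:                        # data line: prefix the saved header
            print(header + '\t' + line)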