Convert all colors in a PDF into one specific color

I'm working on a PHP project where I need to perform some PDF manipulation.
I need to "convert" all colors of a vector file (PDF) into one very specific color (a spot color, in my case).
Here is an illustrated example
The input file can vary, and it can contain any color (so I can't just convert all "red" or "green" to my target color).
I have a fair idea of how to do it on a raster image using ImageMagick's composite, but I'm unsure whether it's even possible with a vector image.
My first approach was to create a template pdf, with a filled rectangle in the desired color. My hope was then to use ghostscript to somehow apply the input file as a mask on said template. But I assume this wouldn't be possible as vector files are different from raster images.
My second approach was to use Ghostscript to convert all colors (regardless of colorspace) into the desired color. But after extensive googling, I've only found solutions that convert from one colorspace to another (e.g. sRGB to CMYK, CMYK to grayscale, etc.).
I'm not much of a designer, so perhaps I am simply lacking the proper "terms" for these "actions".
TL;DR
I am looking for a library/tool that can help me "convert" all colors of a vector file (PDF) into one very specific color.
The input file may vary (various shapes and colors), but it will always be a PDF file without any fonts.
The output must remain a vector file (read: no rasterisation).
I have root access on a VPS running Linux (CentOS 7; I assume that is irrelevant).

You could try rasterising at a high resolution and converting the colours with ImageMagick, then re-vectorising with potrace.
So, if you had a PDF, you would do:
convert -density 288 document.pdf ...
As you have provided a PNG, I will do:
convert image.png -fill black -fuzz 10% +opaque white pgm:- | potrace -b svg -o result.svg -
which gives this SVG:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="800.000000pt" height="450.000000pt" viewBox="0 0 800.000000 450.000000"
preserveAspectRatio="xMidYMid meet">
<metadata>
Created by potrace 1.13, written by Peter Selinger 2001-2015
</metadata>
<g transform="translate(0.000000,450.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M4800 4324 c0 -50 -2 -55 -17 -49 -84 35 -140 -17 -130 -119 7 -77
70 -120 122 -82 16 11 21 11 33 0 7 -8 18 -12 23 -9 5 4 9 76 9 161 0 147 -1
154 -20 154 -18 0 -20 -7 -20 -56z m-22 -90 c46 -32 18 -134 -38 -134 -25 0
-40 29 -40 79 0 39 19 71 43 71 7 0 23 -7 35 -16z"/>
<path d="M4926 4358 c-9 -12 -16 -35 -16 -50 0 -18 -5 -28 -15 -28 -8 0 -15
-7 -15 -15 0 -8 7 -15 15 -15 12 0 15 -17 15 -89 0 -89 6 -105 38 -94 8 3 12
31 12 94 0 88 0 89 25 89 16 0 25 6 25 15 0 9 -9 15 -25 15 -21 0 -25 5 -25
30 0 30 7 34 43 30 13 -1 18 4 15 17 -5 29 -72 30 -92 1z"/>
<path d="M3347 4364 c-4 -4 -7 -16 -7 -26 0 -14 6 -19 23 -16 14 2 22 10 22
23 0 20 -25 32 -38 19z"/>
<path d="M4170 4310 c0 -23 -4 -30 -20 -30 -11 0 -20 -7 -20 -15 0 -8 9 -15
20 -15 18 0 20 -7 20 -80 0 -74 2 -81 25 -96 32 -21 75 -12 75 17 0 16 -4 19
-21 14 -30 -10 -39 9 -39 83 l0 62 30 0 c20 0 30 5 30 15 0 10 -10 15 -30 15
-27 0 -30 3 -30 30 0 23 -4 30 -20 30 -16 0 -20 -7 -20 -30z"/>
<path d="M3345 4278 c-3 -8 -4 -59 -3 -114 2 -80 6 -99 18 -99 12 0 15 19 15
109 0 79 -4 111 -12 113 -7 3 -15 -2 -18 -9z"/>
<path d="M3453 4283 c-9 -3 -13 -34 -13 -108 0 -74 4 -105 13 -108 29 -10 37
6 37 78 0 57 4 75 18 88 46 42 72 10 72 -91 0 -54 4 -71 15 -76 22 -8 26 10
23 104 -3 77 -5 84 -31 104 -24 17 -32 19 -59 8 -18 -6 -38 -8 -47 -3 -9 5
-22 6 -28 4z"/>
<path d="M3687 4283 c-4 -3 -7 -71 -7 -150 l0 -143 25 0 c23 0 25 4 25 45 0
42 2 45 19 35 33 -17 61 -11 92 19 24 25 29 37 29 81 0 95 -51 141 -119 107
-25 -13 -31 -13 -35 -1 -6 15 -19 18 -29 7z m122 -47 c19 -22 23 -78 9 -106
-29 -55 -88 -26 -88 43 0 62 48 100 79 63z"/>
<path d="M3927 4284 c-4 -4 -7 -45 -7 -91 0 -76 2 -86 25 -108 27 -28 61 -32
92 -10 18 13 22 13 27 0 3 -8 12 -12 21 -9 13 5 15 24 13 113 -3 98 -4 106
-23 106 -18 0 -20 -8 -23 -75 -4 -94 -28 -128 -72 -100 -10 6 -16 34 -20 91
-5 75 -15 101 -33 83z"/>
<path d="M4432 4282 c-9 -7 -12 -43 -10 -148 3 -136 4 -139 26 -142 20 -3 22
1 22 41 l0 45 35 -11 c31 -9 39 -8 63 10 37 27 54 83 42 136 -15 68 -64 94
-120 63 -20 -12 -26 -12 -35 0 -6 8 -15 10 -23 6z m122 -54 c22 -31 20 -81 -3
-109 -19 -23 -21 -23 -48 -9 -24 13 -28 23 -31 62 -3 39 1 49 20 62 30 22 44
20 62 -6z"/>
<path d="M4310 4096 c0 -30 30 -43 47 -21 16 23 5 45 -23 45 -19 0 -24 -5 -24
-24z"/>
<path d="M4046 3795 l-67 -141 -227 -12 c-418 -22 -765 -74 -1127 -167 -612
-157 -1080 -387 -1387 -684 -214 -205 -323 -393 -359 -615 -16 -101 -6 -270
20 -361 136 -461 637 -856 1409 -1111 152 -51 434 -125 583 -154 l66 -13 -30
-169 c-16 -93 -27 -171 -24 -174 2 -3 124 58 271 135 l266 140 80 -9 c44 -5
197 -14 339 -21 259 -12 617 -3 844 21 l88 9 265 -140 c146 -77 268 -138 270
-136 5 4 -41 294 -52 328 -4 13 8 19 58 28 465 89 939 260 1278 461 626 370
880 871 686 1356 -69 174 -228 375 -415 526 -517 418 -1411 697 -2402 750
l-226 12 -71 141 -70 140 -66 -140z m-202 -407 c-31 -62 -119 -241 -196 -398
-76 -156 -140 -285 -142 -287 -3 -3 -799 -120 -1156 -170 -102 -14 -188 -29
-193 -32 -4 -4 102 -113 235 -242 133 -129 353 -344 489 -479 l248 -245 -45
-260 c-25 -143 -58 -332 -73 -420 l-27 -160 -41 2 c-61 2 -333 68 -515 124
-674 209 -1153 533 -1334 905 -59 121 -77 209 -71 349 5 137 35 235 109 359
58 97 206 261 311 344 463 366 1242 627 2097 701 69 6 141 13 160 15 19 1 72
4 118 4 l82 2 -56 -112z m906 86 c760 -79 1420 -283 1875 -581 864 -566 763
-1326 -245 -1840 -266 -136 -602 -253 -942 -328 -92 -21 -173 -35 -181 -32 -9
3 -20 44 -31 114 -10 59 -42 248 -72 419 l-54 311 213 210 c116 115 337 331
489 479 153 148 274 271 270 275 -4 3 -106 20 -227 37 -452 64 -1118 162
-1120 164 -6 6 -195 387 -291 587 l-104 214 137 -7 c76 -4 203 -14 283 -22z
m-424 -2761 c137 -73 200 -111 193 -118 -14 -14 -794 -14 -809 1 -7 7 49 41
192 117 112 58 207 107 212 107 5 0 100 -48 212 -107z"/>
<path d="M1815 3669 c-46 -47 -113 -80 -221 -111 -62 -17 -106 -22 -204 -22
-137 0 -185 12 -221 58 -48 61 -211 80 -449 53 -118 -14 -400 -63 -408 -72 -3
-3 28 -145 32 -145 1 0 55 11 120 25 181 37 365 58 481 53 98 -3 105 -5 125
-30 113 -144 579 -119 806 44 50 35 109 108 97 118 -5 4 -33 21 -63 38 l-55
31 -40 -40z"/>
<path d="M7647 575 c-66 -79 -247 -137 -432 -138 -134 0 -170 10 -221 61 -18
17 -53 37 -84 46 -70 21 -238 21 -395 0 -122 -15 -364 -60 -372 -68 -5 -5 17
-119 26 -133 4 -7 47 -2 121 13 181 37 358 56 477 52 l108 -3 37 -37 c120
-117 482 -110 720 13 75 40 168 123 168 151 0 10 -110 80 -122 77 -2 0 -16
-16 -31 -34z"/>
</g>
</svg>
which looks like this as a PNG (because StackOverflow doesn't allow SVG images AFAIK):
You can make all the PATHs your preferred shade of green by editing the SVG, like this:
sed 's/path /path fill="#7CBE89" /' black.svg > green.svg
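If you would rather do that last recolouring step in code instead of sed, a few lines of Python give the same result (a sketch; recolor_svg is a hypothetical helper, and the substitution assumes potrace-style output, where the fill sits on the enclosing <g> rather than on each path):

```python
import re

def recolor_svg(svg_text, color):
    # Give every <path> an explicit fill attribute, which overrides the
    # group-level fill="#000000" that potrace emits on the enclosing <g>.
    return re.sub(r'<path ', f'<path fill="{color}" ', svg_text)

svg = '<g fill="#000000"><path d="M0 0 L1 1z"/></g>'
print(recolor_svg(svg, '#7CBE89'))
```

Because an attribute set directly on an element wins over one inherited from an ancestor, every path renders in the new colour without touching the <g>.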

You could do this with Ghostscript, but you would need some PostScript programming experience.
Essentially you want to override all the setcolor/setcolorspace operations: look at each setcolor operation, check the colour space and values to see if it's your target colour and, if it is, set the colour space and values to your desired target.
The various PDF operations to set colour space and values are all defined in ghostpdl/Resource/Init/pdf_draw.ps. You'll need to modify the definitions of:
/G and /g (stroke and fill colours in DeviceGray)
/RG and /rg (stroke and fill colours in DeviceRGB)
/K and /k (stroke and fill colours in DeviceCMYK)
/SC and /sc (stroke and fill colours in Indexed, CalGray, CalRGB or Lab)
/SCN and /scn (stroke and fill colours in Pattern, Separation, DeviceN or ICCBased)
There are quite a few wrinkles in there:
You can probably ignore Pattern spaces and just deal with any colours that are set by the pattern itself.
For /SC and /sc, and /SCN and /scn, you need to figure out whether the colour specified is the target colour, assuming your target can be specified in these spaces. Note that /Indexed is particularly interesting as it can have a base space of any of the other spaces, so you need to look and see.
Finally note that images (bitmaps) are specified differently, and altering those would be much harder.
Depending on the exact nature of the requirement (ie what space/colours constitute valid targets) this could be quite a lengthy task, and it will require someone with PostScript programming ability to write it.
Oh, and on a final note, have you considered transparency? That can specify the blending colour space too, which might mean that after you had substituted the colour it would be blended in a different colour space, making your careful substitution disappear.
Lest you think this unlikely I should mention that a number of PDF producers create files with transparency groups in them, even when no actual transparency operations take place.

Related

React Native: why is my SVG icon not showing?

Why is my PayPal icon not showing?
I have SVG markup that I converted into a React component:
Paypal svg
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="100.000000pt" height="26.000000pt" viewBox="0 0 100.000000 26.000000"
preserveAspectRatio="xMidYMid meet">
<g transform="translate(0.000000,26.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M37 253 c-3 -5 -10 -42 -16 -83 -7 -41 -14 -87 -17 -103 -5 -25 -3
-27 24 -27 27 0 31 4 37 40 6 37 9 40 39 40 17 0 43 8 56 17 l25 17 -19 -22
c-11 -12 -26 -22 -35 -23 -11 0 -12 -2 -3 -6 7 -3 23 2 36 10 20 13 23 23 21
64 -2 26 -4 51 -4 54 -3 22 -133 41 -144 22z"/>
<path d="M319 198 c-1 -2 -7 -36 -13 -75 -11 -68 -11 -73 6 -73 11 0 18 7 18
20 0 21 5 24 51 34 32 7 54 49 39 77 -8 14 -21 19 -55 19 -25 0 -45 -1 -46 -2z
m64 -34 c11 -11 -5 -34 -24 -34 -21 0 -22 1 -13 24 6 16 25 21 37 10z"/>
<path d="M711 144 c0 -11 3 -14 6 -6 3 7 2 16 -1 19 -3 4 -6 -2 -5 -13z"/>
<path d="M436 131 c-19 -21 -21 -55 -3 -69 6 -5 31 -10 54 -9 39 0 42 2 48 31
4 17 9 38 11 48 7 24 -88 24 -110 -1z m64 -17 c11 -12 10 -18 -3 -32 -16 -15
-18 -15 -34 0 -13 14 -14 20 -3 32 16 20 24 20 40 0z"/>
<path d="M560 145 c0 -2 7 -23 16 -45 15 -38 15 -42 0 -59 -21 -23 -20 -31 2
-31 24 1 119 140 95 140 -9 0 -27 -12 -39 -27 l-22 -28 -7 28 c-5 17 -14 27
-26 27 -10 0 -19 -2 -19 -5z"/>
<path d="M80 90 c0 -5 7 -7 15 -4 8 4 15 8 15 10 0 2 -7 4 -15 4 -8 0 -15 -4
-15 -10z"/>
<path d="M66 33 c-6 -14 -5 -15 5 -6 7 7 10 15 7 18 -3 3 -9 -2 -12 -12z"/>
</g>
</svg>
I converted it with https://react-svgr.com/playground/?native=true and added it:
<Pressable style={s.btn}>
<Text>Paypal</Text>
<Svg
xmlns="http://www.w3.org/2000/svg"
width={133.333}
height={34.667}
viewBox="0 0 100 26"
>
<Path d="M3.7.7c-.3.5-1 4.2-1.6 8.3C1.4 13.1.7 17.7.4 19.3-.1 21.8.1 22 2.8 22c2.7 0 3.1-.4 3.7-4 .6-3.7.9-4 3.9-4 1.7 0 4.3-.8 5.6-1.7l2.5-1.7-1.9 2.2C15.5 14 14 15 13.1 15.1c-1.1 0-1.2.2-.3.6.7.3 2.3-.2 3.6-1 2-1.3 2.3-2.3 2.1-6.4-.2-2.6-.4-5.1-.4-5.4C17.8.7 4.8-1.2 3.7.7zM31.9 6.2c-.1.2-.7 3.6-1.3 7.5-1.1 6.8-1.1 7.3.6 7.3 1.1 0 1.8-.7 1.8-2 0-2.1.5-2.4 5.1-3.4 3.2-.7 5.4-4.9 3.9-7.7-.8-1.4-2.1-1.9-5.5-1.9-2.5 0-4.5.1-4.6.2zm6.4 3.4c1.1 1.1-.5 3.4-2.4 3.4-2.1 0-2.2-.1-1.3-2.4.6-1.6 2.5-2.1 3.7-1zM71.1 11.6c0 1.1.3 1.4.6.6.3-.7.2-1.6-.1-1.9-.3-.4-.6.2-.5 1.3zM43.6 12.9c-1.9 2.1-2.1 5.5-.3 6.9.6.5 3.1 1 5.4.9 3.9 0 4.2-.2 4.8-3.1.4-1.7.9-3.8 1.1-4.8.7-2.4-8.8-2.4-11 .1zm6.4 1.7c1.1 1.2 1 1.8-.3 3.2-1.6 1.5-1.8 1.5-3.4 0-1.3-1.4-1.4-2-.3-3.2 1.6-2 2.4-2 4 0zM56 11.5c0 .2.7 2.3 1.6 4.5 1.5 3.8 1.5 4.2 0 5.9-2.1 2.3-2 3.1.2 3.1 2.4-.1 11.9-14 9.5-14-.9 0-2.7 1.2-3.9 2.7l-2.2 2.8-.7-2.8C60 12 59.1 11 57.9 11c-1 0-1.9.2-1.9.5z" />
<Path d="M8 17c0 .5.7.7 1.5.4.8-.4 1.5-.8 1.5-1 0-.2-.7-.4-1.5-.4S8 16.4 8 17zM6.6 22.7c-.6 1.4-.5 1.5.5.6.7-.7 1-1.5.7-1.8-.3-.3-.9.2-1.2 1.2z" />
</Svg>
</Pressable>
But I see nothing. What am I doing wrong?
I believe you don't need the SVG component to display an SVG. You can require it and use it as the source of your Image component:
const paypal = require('path_to_file/paypal.svg');
Then you should add this to your return:
<Image source={paypal} style={{width: 50, height: 50}} />
Don't forget to set width and height, otherwise your image won't display.

pandas df add new column based on proportion of two other columns from another dataframe

I have df1, which has the columns loadgroup, cartons, and blocks, plus two derived percentage columns, like this:
loadgroup  cartons  blocks  cartonsPercent  blocksPercent
1          2269     14      26%             21%
2          1168     13      13%             19%
3          937      8       11%             12%
4          2753     24      31%             35%
5          1686     9       19%             13%
total      8813     68      100%            100%
The interpretation is: 26% of df1's cartons, which is also 21% of its blocks, are assigned to loadgroup 1, and so on. We can assume blocks are 1 to 68 and cartons are 1 to 8813.
I also have df2, which has cartons and blocks columns but no loadgroup column.
My goal is to assign a loadgroup (again 1-5) to df2 (100 blocks and 29608 cartons in total) while keeping the proportions: for example, 26% of the cartons and 21% of the blocks get loadgroup 1, 13% of the cartons and 19% of the blocks get loadgroup 2, etc.
df2 is like this:
block  cartons
0      533
1      257
2      96
3      104
4      130
5      71
6      68
7      87
8      99
9      51
10     291
11     119
12     274
13     316
14     87
15     149
16     120
17     222
18     100
19     148
20     192
21     188
22     293
23     120
24     224
25     449
26     385
27     395
28     418
29     423
30     244
31     327
32     337
33     249
34     528
35     528
36     494
37     540
38     368
39     533
40     614
41     462
42     350
43     618
44     463
45     552
46     397
47     401
48     397
49     365
50     475
51     379
52     541
53     488
54     383
55     354
56     760
57     327
58     211
59     356
60     552
61     401
62     320
63     368
64     311
65     421
66     458
67     278
68     504
69     385
70     242
71     413
72     246
73     465
74     386
75     231
76     154
77     294
78     275
79     169
80     398
81     227
82     273
83     319
84     177
85     272
86     204
87     139
88     187
89     263
90     90
91     134
92     67
93     115
94     45
95     65
96     40
97     108
98     60
99     102
total: 100 blocks, 29608 cartons
I want to add a loadgroup column to df2, keeping those proportions as close as possible. How can I do this? Thank you very much for the help.
I don't know how to choose the loadgroup based on both the cartons percentages and the blocks percentages, but generating a random loadgroup based on either one alone is easy.
Here is what I did: I generate 100,000 seeds. For each seed I add a column loadgroup1 based on the cartons percentages and a column loadgroup2 based on the blocks percentages, recompute both percentages, compare them with the df1 percentages, and record the total absolute difference. Across the 100,000 seeds I take the one with the minimum difference as my solution, which is sufficient for my job.
But this is not the optimal solution, and I am looking for a quick and easy way to do this. I hope somebody can help.
Here is my code.
import numpy as np
import pandas as pd

df = pd.DataFrame()
np.random.seed(10000)
seeds = np.random.randint(1, 1000000, size=100000)
for i in range(46530, 46537):
    print(seeds[i])
    np.random.seed(seeds[i])
    df2['loadGroup1'] = np.random.choice(df1.loadgroup, len(df2), p=df1.CartonsPercent)
    df2['loadGroup2'] = np.random.choice(df1.loadgroup, len(df2), p=df1.blocksPercent)
    df2.reset_index(inplace=True)
    # score loadGroup1: recompute both percentages and compare with df1's
    three = df2.groupby('loadGroup1').agg(Cartons=('cartons', 'sum'), blocks=('block', 'count'))
    three['CartonsPercent'] = three.Cartons / three.Cartons.sum()
    three['blocksPercent'] = three.blocks / three.blocks.sum()
    four = (df1[['CartonsPercent', 'blocksPercent']] - three[['CartonsPercent', 'blocksPercent']]).abs()
    subdf = pd.DataFrame({'i': [i], 'Seed': [seeds[i]], 'Percent': ['CartonsPercent'], 'AbsDiff': [four.sum().sum()]})
    df = pd.concat([df, subdf])
    # score loadGroup2 the same way
    three = df2.groupby('loadGroup2').agg(Cartons=('cartons', 'sum'), blocks=('block', 'count'))
    three['CartonsPercent'] = three.Cartons / three.Cartons.sum()
    three['blocksPercent'] = three.blocks / three.blocks.sum()
    four = (df1[['CartonsPercent', 'blocksPercent']] - three[['CartonsPercent', 'blocksPercent']]).abs()
    subdf = pd.DataFrame({'i': [i], 'Seed': [seeds[i]], 'Percent': ['blocksPercent'], 'AbsDiff': [four.sum().sum()]})
    df = pd.concat([df, subdf])
df.sort_values(by='AbsDiff', ascending=True, inplace=True)
df = df.head(10)
Actually, the first row of df tells me the seed I am looking for; I kept 10 rows just out of curiosity.
Here is my solution.
block  cartons  loadgroup
0      533      4
1      257      1
2      96       4
3      104      4
4      130      4
5      71       2
6      68       1
7      87       4
8      99       4
9      51       4
10     291      4
11     119      2
12     274      2
13     316      4
14     87       4
15     149      5
16     120      3
17     222      2
18     100      2
19     148      2
20     192      3
21     188      4
22     293      1
23     120      2
24     224      4
25     449      1
26     385      5
27     395      3
28     418      1
29     423      4
30     244      5
31     327      1
32     337      5
33     249      4
34     528      1
35     528      1
36     494      5
37     540      3
38     368      2
39     533      4
40     614      5
41     462      4
42     350      5
43     618      4
44     463      2
45     552      1
46     397      3
47     401      3
48     397      1
49     365      1
50     475      4
51     379      1
52     541      1
53     488      2
54     383      2
55     354      1
56     760      5
57     327      4
58     211      2
59     356      5
60     552      4
61     401      1
62     320      1
63     368      3
64     311      3
65     421      2
66     458      5
67     278      4
68     504      5
69     385      4
70     242      4
71     413      1
72     246      2
73     465      5
74     386      4
75     231      1
76     154      4
77     294      4
78     275      1
79     169      4
80     398      4
81     227      4
82     273      1
83     319      3
84     177      4
85     272      5
86     204      3
87     139      1
88     187      4
89     263      4
90     90       4
91     134      4
92     67       3
93     115      3
94     45       2
95     65       2
96     40       4
97     108      2
98     60       2
99     102      1
Here are the summaries.
loadgroup  cartons  blocks  cartonsPercent  blocksPercent
1          7610     22      26%             22%
2          3912     18      13%             18%
3          3429     12      12%             12%
4          9269     35      31%             35%
5          5388     13      18%             13%
It's very close to my target, though.
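For anyone wanting to reproduce the idea, here is a compact, self-contained sketch of the seed search on toy data (all numbers here are made up; only the mechanics mirror the approach above, and for simplicity it scores only the block proportions):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df1/df2: 3 load groups with target block proportions,
# and 20 blocks to assign.
df1 = pd.DataFrame({'loadgroup': [1, 2, 3],
                    'blocksPercent': [0.5, 0.3, 0.2]})
df2 = pd.DataFrame({'block': range(20)})
target = df1.set_index('loadgroup')['blocksPercent']

best_seed, best_diff = None, float('inf')
for seed in range(200):                       # small seed pool for the sketch
    rng = np.random.default_rng(seed)
    groups = rng.choice(df1.loadgroup, size=len(df2), p=df1.blocksPercent)
    got = (pd.Series(groups).value_counts(normalize=True)
           .reindex(target.index, fill_value=0))
    diff = (target - got).abs().sum()         # total absolute deviation
    if diff < best_diff:
        best_seed, best_diff = seed, diff

# Re-draw with the winning seed to get the final assignment.
df2['loadgroup'] = np.random.default_rng(best_seed).choice(
    df1.loadgroup, size=len(df2), p=df1.blocksPercent)
```

Scoring both the cartons and blocks percentages, as the answer above does, only changes how `diff` is computed.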

Pandas - cross columns reference

My data is a bit complicated, so I separate this into two sections: (A) explaining the data, (B) the desired output.
(A) Explaining the data:
My data is as follows:
comp date adj_date val
0 a 1999-12-31 NaT 50
1 a 2000-01-31 NaT 51
2 a 2000-02-29 NaT 52
3 a 2000-03-31 NaT 53
4 a 2000-04-30 NaT 54
5 a 2000-05-31 NaT 55
6 a 2000-06-30 NaT 56
----------------------------------
7 a 2000-07-31 2000-01-31 57
8 a 2000-08-31 2000-02-29 58
9 a 2000-09-30 2000-03-31 59
10 a 2000-10-31 2000-04-30 60
11 a 2000-11-30 2000-05-31 61
12 a 2000-12-31 2000-06-30 62
13 a 2001-01-31 2000-07-31 63
14 a 2001-02-28 2000-08-31 64
15 a 2001-03-31 2000-09-30 65
16 a 2001-04-30 2000-10-31 66
17 a 2001-05-31 2000-11-30 67
18 a 2001-06-30 2000-12-31 68
----------------------------------
19 a 2001-07-31 2001-01-31 69
20 a 2001-08-31 2001-02-28 70
21 a 2001-09-30 2001-03-31 71
22 a 2001-10-31 2001-04-30 72
23 a 2001-11-30 2001-05-31 73
24 a 2001-12-31 2001-06-30 74
25 a 2002-01-31 2001-07-31 75
26 a 2002-02-28 2001-08-31 76
27 a 2002-03-31 2001-09-30 77
28 a 2002-04-30 2001-10-31 78
29 a 2002-05-31 2001-11-30 79
30 a 2002-06-30 2001-12-31 80
----------------------------------
31 a 2002-07-31 2002-01-31 81
32 a 2002-08-31 2002-02-28 82
33 a 2002-09-30 2002-03-31 83
34 a 2002-10-31 2002-04-30 84
35 a 2002-11-30 2002-05-31 85
36 a 2002-12-31 2002-06-30 86
37 a 2003-01-31 2002-07-31 87
38 a 2003-02-28 2002-08-31 88
39 a 2003-03-31 2002-09-30 89
40 a 2003-04-30 2002-10-31 90
41 a 2003-05-31 2002-11-30 91
42 a 2003-06-30 2002-12-31 92
----------------------------------
date: the actual date, as end of month.
adj_date = date + MonthEnd(-6)
val: the given value.
I want to create a new column val_new where:
it references the val of the previous year's December;
val_new then applies from date.July through date.(year+1).June, or equivalently, in adj_date terms, from adj_date.Jan through adj_date.Dec.
(B) Desired output:
comp date adj_date val val_new
0 a 1999-12-31 NaT 50 NaN
1 a 2000-01-31 NaT 51 NaN
2 a 2000-02-29 NaT 52 NaN
3 a 2000-03-31 NaT 53 NaN
4 a 2000-04-30 NaT 54 NaN
5 a 2000-05-31 NaT 55 NaN
6 a 2000-06-30 NaT 56 NaN
-------------------------------------------
7 a 2000-07-31 2000-01-31 57 50.0
8 a 2000-08-31 2000-02-29 58 50.0
9 a 2000-09-30 2000-03-31 59 50.0
10 a 2000-10-31 2000-04-30 60 50.0
11 a 2000-11-30 2000-05-31 61 50.0
12 a 2000-12-31 2000-06-30 62 50.0
13 a 2001-01-31 2000-07-31 63 50.0
14 a 2001-02-28 2000-08-31 64 50.0
15 a 2001-03-31 2000-09-30 65 50.0
16 a 2001-04-30 2000-10-31 66 50.0
17 a 2001-05-31 2000-11-30 67 50.0
18 a 2001-06-30 2000-12-31 68 50.0
-------------------------------------------
19 a 2001-07-31 2001-01-31 69 62.0
20 a 2001-08-31 2001-02-28 70 62.0
21 a 2001-09-30 2001-03-31 71 62.0
22 a 2001-10-31 2001-04-30 72 62.0
23 a 2001-11-30 2001-05-31 73 62.0
24 a 2001-12-31 2001-06-30 74 62.0
25 a 2002-01-31 2001-07-31 75 62.0
26 a 2002-02-28 2001-08-31 76 62.0
27 a 2002-03-31 2001-09-30 77 62.0
28 a 2002-04-30 2001-10-31 78 62.0
29 a 2002-05-31 2001-11-30 79 62.0
30 a 2002-06-30 2001-12-31 80 62.0
-------------------------------------------
31 a 2002-07-31 2002-01-31 81 74.0
32 a 2002-08-31 2002-02-28 82 74.0
33 a 2002-09-30 2002-03-31 83 74.0
34 a 2002-10-31 2002-04-30 84 74.0
35 a 2002-11-30 2002-05-31 85 74.0
36 a 2002-12-31 2002-06-30 86 74.0
37 a 2003-01-31 2002-07-31 87 74.0
38 a 2003-02-28 2002-08-31 88 74.0
39 a 2003-03-31 2002-09-30 89 74.0
40 a 2003-04-30 2002-10-31 90 74.0
41 a 2003-05-31 2002-11-30 91 74.0
42 a 2003-06-30 2002-12-31 92 74.0
-------------------------------------------
I have two solutions, but both come at a cost:
Solution 1: create a sub_dec dataframe holding the Dec val of each year, then merge it back to the main data. This works fine, but I don't like it because our actual data would involve a lot of merges, and it is not easy or convenient to keep track of them all.
Solution 2: (1) create a lag with shift(7), (2) set val_new to None for every adj_date except January, (3) then use groupby with ffill. This works nicely, but if any rows are missing, or the dates are not continuous, the entire output is wrong.
(1) Create adj_year:
data['adj_year'] = data['adj_date'].dt.year
(2) Cross-reference by shift(7):
data['val_new'] = data.groupby('comp')['val'].shift(7)
(3) Set val_new to None for every adj_date except January:
data.loc[data['adj_date'].dt.month != 1, 'val_new'] = None
(4) Use ffill to fill in the Nones within each group of ['comp', 'adj_year']:
data['val_new'] = data.groupby(['comp', 'adj_year'])['val_new'].ffill()
Any suggestion to overcome the drawback of Solution 2, or any other solution, is appreciated.
Thank you.
You can use Timedelta with an appropriate conversion to months, according to your needs.
Check these two resources for more info:
https://docs.python.org/3/library/datetime.html
pandas: function equivalent to SQL's datediff()?
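To make the mapping concrete, here is a sketch that does not depend on the rows being continuous (toy data modelled on the question; the (comp, year)-keyed lookup dict is my own construction, not from the question): each row simply pulls the December val of the year before its adj_date.

```python
import pandas as pd

# Toy frame mirroring the question's layout: one comp 'a', month-end dates.
dates = pd.date_range('1999-12-31', periods=43, freq='M')
df = pd.DataFrame({'comp': 'a', 'date': dates, 'val': range(50, 93)})
df['adj_date'] = df['date'] - pd.offsets.MonthEnd(6)
df.loc[:6, 'adj_date'] = pd.NaT          # first half-year has no adj_date

# December value per (comp, year)...
dec = df[df['date'].dt.month == 12]
lookup = {(c, d.year): v for c, d, v in zip(dec['comp'], dec['date'], dec['val'])}

# ...then each row pulls December of the year before its adj_date.
df['val_new'] = [lookup.get((c, ad.year - 1)) if pd.notna(ad) else None
                 for c, ad in zip(df['comp'], df['adj_date'])]
```

Because the lookup is keyed by year rather than by row position, missing or non-contiguous rows only affect themselves, not the whole output.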

pandas how to filter and slice with multiple conditions

Using pandas, how do I return a dataframe filtered by the value 2 in the 'GEN' column and the value 20 in the 'AGE' column, excluding the columns named 'GEN' and 'BP'? Thanks in advance :)
AGE GEN BMI BP S1 S2 S3 S4 S5 S6 Y
59 2 32.1 101 157 93.2 38 4 4.8598 87 151
48 1 21.6 87 183 103.2 70 3 3.8918 69 75
72 2 30.5 93 156 93.6 41 4 4.6728 85 141
24 1 25.3 84 198 131.4 40 5 4.8903 89 206
50 1 23 101 192 125.4 52 4 4.2905 80 135
23 1 22.6 89 139 64.8 61 2 4.1897 68 97
20 2 22 90 160 99.6 50 3 3.9512 82 138
66 2 26.2 114 255 185 56 4.5 4.2485 92 63
60 2 32.1 83 179 119.4 42 4 4.4773 94 110
20 1 30 85 180 93.4 43 4 5.3845 88 310
You can do this -
cols = df.columns[~df.columns.isin(['GEN','BP'])]
out = df.loc[(df['GEN'] == 2) & (df['AGE'] == 20), cols]
OR
out = df.query("GEN == 2 and AGE == 20")[cols]
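A self-contained sketch of both forms on a hypothetical miniature of the data (note that inside query() the column names must be bare identifiers, not quoted strings):

```python
import pandas as pd

# A made-up miniature of the table in the question.
df = pd.DataFrame({'AGE': [59, 20, 20],
                   'GEN': [2, 2, 1],
                   'BMI': [32.1, 22.0, 30.0],
                   'BP': [101, 90, 85],
                   'Y': [151, 138, 310]})

# Columns to keep: everything except GEN and BP.
cols = df.columns[~df.columns.isin(['GEN', 'BP'])]

# Boolean-mask form.
out1 = df.loc[(df['GEN'] == 2) & (df['AGE'] == 20), cols]

# query() form; quoting the names (e.g. "'GEN' == 2") would compare
# string literals instead of columns and return nothing.
out2 = df.query("GEN == 2 and AGE == 20")[cols]
```

Both forms return the same single row, with GEN and BP dropped.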

To find avg in pig and sort it in ascending order

I have a schema with 9 fields and I want to take only two of them (fields 6 and 7, i.e. $5 and $6). I want to calculate the average of $5 and sort $6 in ascending order. How do I do this? Can someone help me?
Input Data:
N368SW 188 170 175 17 -1 MCO MHT 1142
N360SW 100 115 87 -10 5 MCO MSY 550
N626SW 114 115 90 13 14 MCO MSY 550
N252WN 107 115 84 -10 -2 MCO MSY 550
N355SW 104 115 85 -1 10 MCO MSY 550
N405WN 113 110 96 14 11 MCO ORF 655
N456WN 110 110 92 24 24 MCO ORF 655
N743SW 144 155 124 7 18 MCO PHL 861
N276WN 142 150 129 -2 6 MCO PHL 861
N369SW 153 145 134 30 22 MCO PHL 861
N363SW 151 145 137 5 -1 MCO PHL 861
N346SW 141 150 128 51 60 MCO PHL 861
N785SW 131 145 118 -15 -1 MCO PHL 861
N635SW 144 155 127 -6 5 MCO PHL 861
N242WN 298 300 276 68 70 MCO PHX 1848
N439WN 130 140 111 -4 6 MCO PIT 834
N348SW 140 135 124 7 2 MCO PIT 834
N672SW 136 135 122 9 8 MCO PIT 834
N493WN 151 160 136 -9 0 MCO PVD 1073
N380SW 170 155 155 13 -2 MCO PVD 1073
N705SW 164 160 147 6 2 MCO PVD 1073
N233LV 157 160 143 1 4 MCO PVD 1073
N786SW 156 160 139 6 10 MCO PVD 1073
N280WN 160 160 146 1 1 MCO PVD 1073
N282WN 104 95 81 10 1 MCO RDU 534
N694SW 89 100 77 3 14 MCO RDU 534
N266WN 94 95 82 9 10 MCO RDU 534
N218WN 98 100 77 12 14 MCO RDU 534
N355SW 47 50 35 15 18 MCO RSW 133
N388SW 44 45 30 37 38 MCO RSW 133
N786SW 46 50 31 4 8 MCO RSW 133
N707SA 52 50 33 10 8 MCO RSW 133
N795SW 176 185 153 -9 0 MCO SAT 1040
N402WN 176 185 161 4 13 MCO SAT 1040
N690SW 123 130 107 -1 6 MCO SDF 718
N457WN 135 130 105 20 15 MCO SDF 718
N720WN 144 155 131 13 24 MCO STL 880
N775SW 147 160 135 -6 7 MCO STL 880
N291WN 136 155 122 96 115 MCO STL 880
N247WN 144 155 127 43 54 MCO STL 880
N748SW 179 185 159 -4 2 MDW ABQ 1121
N709SW 176 190 158 21 35 MDW ABQ 1121
N325SW 110 105 97 36 31 MDW ALB 717
N305SW 116 110 90 107 101 MDW ALB 717
N403WN 145 165 128 -6 14 MDW AUS 972
N767SW 136 165 125 59 88 MDW AUS 972
N730SW 118 120 100 28 30 MDW BDL 777
I have written the code like this, but it is not working properly:
a = load '/path/to/file' using PigStorage('\t');
b = foreach a generate (int)$5 as field_a:int,(chararray)$6 as field_b:chararray;
c = group b all;
d = foreach c generate b.field_b,AVG(b.field_a);
e = order d by field_b ASC;
dump e;
I am facing an error at the order by:
grunt> a = load '/user/horton/sample_pig_data.txt' using PigStorage('\t');
grunt> b = foreach a generate (int)$5 as fielda:int,(chararray)$6 as fieldb:chararray;
grunt> describe b;
b: {fielda: int,fieldb: chararray}
grunt> c = group b all;
grunt> describe c;
c: {group: chararray,b: {(fielda: int,fieldb: chararray)}}
grunt> d = foreach c generate b.fieldb,AVG(b.fielda);
grunt> e = order d by fieldb ;
2017-01-05 15:51:29,623 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<line 6, column 15> Invalid field projection. Projected field [fieldb] does not exist in schema: :bag{:tuple(fieldb:chararray)},:double.
Details at logfile: /root/pig_1483631021021.log
I want the output to look like this (not related to the input data):
(({(Bharathi),(Komal),(Archana),(Trupthi),(Preethi),(Rajesh),(siddarth),(Rajiv) },
{ (72) , (83) , (87) , (75) , (93) , (90) , (78) , (89) }),83.375)
If you have found the answer, best practice is to post it so that others referring to this question can get a better understanding.
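For reference, a sketch of one likely fix (untested, assuming the schema shown above): the ORDER BY fails because the columns generated from the grouped relation lost their aliases, so give them explicit aliases with AS; and since GROUP ... ALL yields a single row, the sort of $6 has to happen inside a nested FOREACH rather than as a top-level ORDER:

```pig
a = LOAD '/path/to/file' USING PigStorage('\t');
b = FOREACH a GENERATE (int)$5 AS fielda:int, (chararray)$6 AS fieldb:chararray;
c = GROUP b ALL;
d = FOREACH c {
        -- sort the bag by fieldb, then alias both outputs so they
        -- can be projected by later operators
        sorted = ORDER b BY fieldb ASC;
        GENERATE sorted.fieldb AS fieldb, AVG(b.fielda) AS avg_fielda;
    };
DUMP d;
```

This produces a single tuple holding the sorted bag of $6 values and the overall average of $5, which matches the shape of the desired output above.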