Format numbers inside print output in pandas [duplicate] - pandas

How do I print an integer with commas as thousands separators?
1234567 ⟶ 1,234,567
It does not need to be locale-specific to decide between periods and commas.

Locale unaware
'{:,}'.format(value) # For Python ≥2.7
f'{value:,}' # For Python ≥3.6
Locale aware
import locale
locale.setlocale(locale.LC_ALL, '') # Use '' for auto, or force e.g. to 'en_US.UTF-8'
'{:n}'.format(value) # For Python ≥2.7
f'{value:n}' # For Python ≥3.6
Reference
Per Format Specification Mini-Language,
The ',' option signals the use of a comma for a thousands separator. For a locale aware separator, use the 'n' integer presentation type instead.

I'm surprised that no one has mentioned that you can do this with f-strings in Python 3.6+ as easy as this:
>>> num = 10000000
>>> print(f"{num:,}")
10,000,000
... where the part after the colon is the format specifier. The comma is the separator character you want, so f"{num:_}" uses underscores instead of a comma. Only "," and "_" are possible with this method.
This is equivalent to using format(num, ",") in older versions of Python 3.
This might look like magic when you see it the first time, but it's not. It's just part of the language, and something that's commonly needed enough to have a shortcut available. To read more about it, have a look at the group subcomponent.
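As a quick illustration of the two grouping options mentioned above, here is a minimal sketch. The variable names are only for illustration, and the float example assumes you also want a fixed precision:

```python
num = 10000000

# ',' and '_' are the two grouping options of the format mini-language
with_commas = f"{num:,}"
with_underscores = f"{num:_}"

# format() and str.format() accept the same specifier
same_via_format = format(num, ",")
same_via_str_format = "{:,}".format(num)

# Grouping can be combined with a precision for floats
price = f"{1234567.891:,.2f}"

print(with_commas, with_underscores, price)
```

All three spellings go through the same format specification, so which one you pick is mostly a matter of taste and Python version.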

I got this to work:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US')
'en_US'
>>> locale.format("%d", 1255000, grouping=True)
'1,255,000'
Sure, you don't need internationalization support, but it's clear, concise, and uses a built-in library.
P.S. That "%d" is the usual %-style formatter. You can have only one formatter, but it can be whatever you need in terms of field width and precision settings.
P.P.S. If you can't get locale to work, I'd suggest a modified version of Mark's answer:
def intWithCommas(x):
    if type(x) not in [type(0), type(0L)]:
        raise TypeError("Parameter must be an integer.")
    if x < 0:
        return '-' + intWithCommas(-x)
    result = ''
    while x >= 1000:
        x, r = divmod(x, 1000)
        result = ",%03d%s" % (r, result)
    return "%d%s" % (x, result)
Recursion is useful for the negative case, but one recursion per comma seems a bit excessive to me.
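The function above is Python 2 code (0L is a long literal). A rough Python 3 sketch of the same divmod loop might look like this; the name int_with_commas is just an illustrative choice, and it assumes plain int input only:

```python
def int_with_commas(x):
    """Group an int in threes using repeated divmod (Python 3 sketch)."""
    if not isinstance(x, int):
        raise TypeError("Parameter must be an integer.")
    if x < 0:
        # Handle the sign once, then format the magnitude
        return '-' + int_with_commas(-x)
    result = ''
    while x >= 1000:
        # Peel off the lowest three digits and prepend them, zero-padded
        x, r = divmod(x, 1000)
        result = ",%03d%s" % (r, result)
    return "%d%s" % (x, result)


print(int_with_commas(1255000))
```

In practice format(x, ',') does the same job, so this is mainly useful for seeing how the grouping works.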

For inefficiency and unreadability it's hard to beat:
>>> import itertools
>>> s = '-1234567'
>>> ','.join(["%s%s%s" % (x[0], x[1] or '', x[2] or '') for x in itertools.izip_longest(s[::-1][::3], s[::-1][1::3], s[::-1][2::3])])[::-1].replace('-,','-')

Here is the locale grouping code after removing irrelevant parts and cleaning it up a little:
(The following only works for integers)
def group(number):
    s = '%d' % number
    groups = []
    while s and s[-1].isdigit():
        groups.append(s[-3:])
        s = s[:-3]
    return s + ','.join(reversed(groups))
>>> group(-23432432434.34)
'-23,432,432,434'
There are already some good answers here. I just want to add this for future reference. Python 2.7 added a format specifier for the thousands separator. According to the Python docs, it works like this:
>>> '{:20,.2f}'.format(f)
'18,446,744,073,709,551,616.00'
In python3.1 you can do the same thing like this:
>>> format(1234567, ',d')
'1,234,567'

Here's a one-line regex replacement:
re.sub("(\d)(?=(\d{3})+(?!\d))", r"\1,", "%d" % val)
Works only for integral outputs:
import re
val = 1234567890
re.sub("(\d)(?=(\d{3})+(?!\d))", r"\1,", "%d" % val)
# Returns: '1,234,567,890'
val = 1234567890.1234567890
# Returns: '1,234,567,890'
Or for floats with fewer than 4 decimal digits, change the format specifier to %.3f:
re.sub("(\d)(?=(\d{3})+(?!\d))", r"\1,", "%.3f" % val)
# Returns: '1,234,567,890.123'
NB: Doesn't work correctly with more than three decimal digits as it will attempt to group the decimal part:
re.sub("(\d)(?=(\d{3})+(?!\d))", r"\1,", "%.5f" % val)
# Returns: '1,234,567,890.12,346'
How it works
Let's break it down:
re.sub(pattern, repl, string)
pattern = \
"(\d) # Find one digit...
(?= # that is followed by...
(\d{3})+ # one or more groups of three digits...
(?!\d) # which are not followed by any more digits.
)",
repl = \
r"\1,", # Replace that one digit by itself, followed by a comma,
# and continue looking for more matches later in the string.
# (re.sub() replaces all matches it finds in the input)
string = \
"%d" % val # Format the string as a decimal to begin with
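One possible way around the decimal-part problem noted above is to split the string at the decimal point and apply the same pattern to the integer part only. This is a sketch, not part of the original one-liner; the helper name group_integer_part is hypothetical, and it assumes '.' is the decimal point and sep is a plain character:

```python
import re

def group_integer_part(text, sep=","):
    """Apply the lookahead regex to the digits before the decimal point only."""
    whole, dot, frac = text.partition(".")
    # Same pattern as above: a digit followed by one or more complete
    # groups of three digits, with no further digit after them
    whole = re.sub(r"(\d)(?=(\d{3})+(?!\d))", r"\1" + sep, whole)
    return whole + dot + frac


print(group_integer_part("1234567890.12346"))
```

Because the fractional digits never reach re.sub(), any number of decimal places passes through unchanged.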

This is what I do for floats. Although, honestly, I'm not sure which versions it works for - I'm using 2.7:
my_number = 4385893.382939491
my_string = '{:0,.2f}'.format(my_number)
Returns: 4,385,893.38
Update: I recently had an issue with this format (couldn't tell you the exact reason), but was able to fix it by dropping the 0:
my_string = '{:,.2f}'.format(my_number)

You can also use '{:n}'.format(value) for a locale representation. I think this is the simplest way for a locale solution.
For more information, search for thousands in Python DOC.
For currency, you can use locale.currency, setting the flag grouping:
Code
import locale
locale.setlocale( locale.LC_ALL, '' )
locale.currency( 1234567.89, grouping = True )
Output
'Portuguese_Brazil.1252'
'R$ 1.234.567,89'

Slightly expanding the answer of Ian Schneider:
If you want to use a custom thousands separator, the simplest solution is:
'{:,}'.format(value).replace(',', your_custom_thousands_separator)
Examples
'{:,.2f}'.format(123456789.012345).replace(',', ' ')
If you want the German representation like this, it gets a bit more complicated:
('{:,.2f}'.format(123456789.012345)
.replace(',', ' ') # 'save' the thousands separators
.replace('.', ',') # dot to comma
.replace(' ', '.')) # thousand separators to dot
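The three-step replace can also be written as a single character swap with str.translate. This is only a sketch of an alternative; the function name format_german and the decimals parameter are assumptions for illustration:

```python
def format_german(value, decimals=2):
    """Format with ',' grouping, then swap ',' and '.' in one pass."""
    s = f"{value:,.{decimals}f}"
    # maketrans maps ',' -> '.' and '.' -> ',' simultaneously,
    # so no temporary placeholder character is needed
    return s.translate(str.maketrans(",.", ".,"))


print(format_german(123456789.012345))
```

The swap happens per character, so there is no risk of the second replacement undoing the first.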

Here are some ways to do it with formatting (compatible with floats and ints)
num = 2437.68
# Way 1: String Formatting
'{:,}'.format(num)
>>> '2,437.68'
# Way 2: F-Strings
f'{num:,}'
>>> '2,437.68'
# Way 3: Built-in Format Function
format(num, ',')
>>> '2,437.68'

I'm sure there must be a standard library function for this, but it was fun to try to write it myself using recursion so here's what I came up with:
def intToStringWithCommas(x):
    if type(x) is not int and type(x) is not long:
        raise TypeError("Not an integer!")
    if x < 0:
        return '-' + intToStringWithCommas(-x)
    elif x < 1000:
        return str(x)
    else:
        return intToStringWithCommas(x / 1000) + ',' + '%03d' % (x % 1000)
Having said that, if someone else does find a standard way to do it, you should use that instead.

The accepted answer is fine, but I actually prefer format(number,','). Easier for me to interpret and remember.
https://docs.python.org/3/library/functions.html#format

From the comments to activestate recipe 498181 I reworked this:
import re
def thous(x, sep=',', dot='.'):
    num, _, frac = str(x).partition(dot)
    num = re.sub(r'(\d{3})(?=\d)', r'\1'+sep, num[::-1])[::-1]
    if frac:
        num += dot + frac
    return num
It uses the regular expressions feature: lookahead i.e. (?=\d) to make sure only groups of three digits that have a digit 'after' them get a comma. I say 'after' because the string is reversed at this point.
[::-1] just reverses a string.
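For illustration, here is the same function with a few calls, including one that passes an already-formatted string with a different decimal mark. The sample inputs are my own, chosen to show the sep and dot parameters:

```python
import re

def thous(x, sep=',', dot='.'):
    num, _, frac = str(x).partition(dot)
    # Reverse, put sep after each full group of three digits that is
    # followed by another digit, then reverse back
    num = re.sub(r'(\d{3})(?=\d)', r'\1' + sep, num[::-1])[::-1]
    if frac:
        num += dot + frac
    return num


print(thous(1234567.89))                        # 1,234,567.89
print(thous(-1234567))                          # -1,234,567
print(thous('1234567,89', sep='.', dot=','))    # 1.234.567,89
```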

Python 3
--
Integers (without decimal):
"{:,d}".format(1234567)
--
Floats (with decimal):
"{:,.2f}".format(1234567)
where the number before f specifies the number of decimal places.
--
Bonus
Quick-and-dirty starter function for the Indian lakhs/crores numbering system (12,34,567):
https://stackoverflow.com/a/44832241/4928578
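To give an idea of what lakhs/crores grouping involves, here is a small sketch of my own. It is not the code from the linked answer; the name indian_group is hypothetical and it assumes integer input:

```python
def indian_group(n):
    """Group the last three digits, then every two digits before them."""
    sign = "-" if n < 0 else ""
    s = str(abs(n))
    if len(s) <= 3:
        return sign + s
    head, tail = s[:-3], s[-3:]
    parts = []
    # Walk the leading digits from the right in chunks of two
    while len(head) > 2:
        parts.insert(0, head[-2:])
        head = head[:-2]
    parts.insert(0, head)
    return sign + ",".join(parts + [tail])


print(indian_group(1234567))
```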

Simplest answer:
format (123456, ",")
Result:
'123,456'

From Python 2.7 you can do this:
def format_builtin(n):
    return format(n, ',')
For Python versions < 2.7, and just for your information, here are 2 manual solutions. They turn floats into ints, but negative numbers work correctly:
def format_number_using_lists(number):
    string = '%d' % number
    result_list = list(string)
    indexes = range(len(string))
    for index in indexes[::-3][1:]:
        if result_list[index] != '-':
            result_list.insert(index+1, ',')
    return ''.join(result_list)
A few things to notice here:
this line: string = '%d' % number converts a number to a string; it supports negatives and drops fractions from floats, making them ints;
this slice indexes[::-3] returns every third item starting from the end, so I used another slice [1:] to remove the very last item, because I don't need a comma after the last digit;
this conditional if result_list[index] != '-' supports negative numbers: do not insert a comma after the minus sign.
And a more hardcore version:
def format_number_using_generators_and_list_comprehensions(number):
    string = '%d' % number
    generator = reversed(
        [
            value+',' if (index!=0 and value!='-' and index%3==0) else value
            for index,value in enumerate(reversed(string))
        ]
    )
    return ''.join(generator)

I am a Python beginner, but an experienced programmer. I have Python 3.5, so I can just use the comma, but this is nonetheless an interesting programming exercise. Consider the case of an unsigned integer. The most readable Python program for adding thousands separators appears to be:
def add_commas(instr):
    out = [instr[0]]
    for i in range(1, len(instr)):
        if (len(instr) - i) % 3 == 0:
            out.append(',')
        out.append(instr[i])
    return ''.join(out)
It is also possible to use a list comprehension:
def add_commas(instr):
    rng = reversed(range(1, len(instr) + (len(instr) - 1)//3 + 1))
    out = [',' if j%4 == 0 else instr[-(j - j//4)] for j in rng]
    return ''.join(out)
This is shorter, and could be a one liner, but you will have to do some mental gymnastics to understand why it works. In both cases we get:
for i in range(1, 11):
instr = '1234567890'[:i]
print(instr, add_commas(instr))
1 1
12 12
123 123
1234 1,234
12345 12,345
123456 123,456
1234567 1,234,567
12345678 12,345,678
123456789 123,456,789
1234567890 1,234,567,890
The first version is the more sensible choice, if you want the program to be understood.

Universal solution
I have found some issues with the dot separator in the previous top voted answers. I have designed a universal solution where you can use whatever you want as a thousand separator without modifying the locale. I know it's not the most elegant solution, but it gets the job done. Feel free to improve it !
def format_integer(number, thousand_separator='.'):
    def reverse(string):
        string = "".join(reversed(string))
        return string
    s = reverse(str(number))
    count = 0
    result = ''
    for char in s:
        count = count + 1
        if count % 3 == 0:
            if len(s) == count:
                result = char + result
            else:
                result = thousand_separator + char + result
        else:
            result = char + result
    return result
print(format_integer(50))
# 50
print(format_integer(500))
# 500
print(format_integer(50000))
# 50.000
print(format_integer(50000000))
# 50.000.000
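Note that the function above counts the minus sign as a character, so a negative input such as -500 ends up with a separator right after the sign. One possible shorter variant, sketched here under the assumption that the input is an int, handles the sign separately and lets the built-in ',' grouping do the work. The name format_integer_signed is only for illustration:

```python
def format_integer_signed(number, thousand_separator="."):
    """Group abs(number) with ',' and then swap in the custom separator."""
    sign = "-" if number < 0 else ""
    grouped = f"{abs(number):,}"
    return sign + grouped.replace(",", thousand_separator)


print(format_integer_signed(50000000))
print(format_integer_signed(-500))
```

This still does not touch the locale, which was the point of the original solution.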

Use separators and decimals together in float numbers :
(In this example, two decimal places)
large_number = 4545454.26262666
print(f"Formatted: {large_number:,.2f}")
Result:
Formatted: 4,545,454.26

Here's one that works for floats too:
def float2comma(f):
    s = str(abs(f)) # Convert to a string
    decimalposition = s.find(".") # Look for decimal point
    if decimalposition == -1:
        decimalposition = len(s) # If no decimal, then just work from the end
    out = ""
    for i in range(decimalposition+1, len(s)): # do the decimal
        if not (i-decimalposition-1) % 3 and i-decimalposition-1: out = out+","
        out = out+s[i]
    if len(out):
        out = "."+out # add the decimal point if necessary
    for i in range(decimalposition-1,-1,-1): # working backwards from decimal point
        if not (decimalposition-i-1) % 3 and decimalposition-i-1: out = ","+out
        out = s[i]+out
    if f < 0:
        out = "-"+out
    return out
Usage Example:
>>> float2comma(10000.1111)
'10,000.111,1'
>>> float2comma(656565.122)
'656,565.122'
>>> float2comma(-656565.122)
'-656,565.122'

One liner for Python 2.5+ and Python 3 (positive int only):
''.join(reversed([x + (',' if i and not i % 3 else '') for i, x in enumerate(reversed(str(1234567)))]))

I'm using python 2.5 so I don't have access to the built-in formatting.
I looked at the Django code for intcomma (intcomma_recurs in the code below) and realized it's inefficient, because it's recursive, and compiling the regex on every run is not a good thing either. This is not necessarily an 'issue', as Django isn't really THAT focused on this kind of low-level performance. Also, I was expecting a factor of 10 difference in performance, but it's only 3 times slower.
Out of curiosity I implemented a few versions of intcomma to see what the performance advantages are when using regex. My test data concludes a slight advantage for this task, but surprisingly not much at all.
I also was pleased to see what I suspected: using the reverse xrange approach is unnecessary in the no-regex case, but it does make the code look slightly better at the cost of ~10% performance.
Also, I assume what you're passing in is a string and looks somewhat like a number. Results undetermined otherwise.
from __future__ import with_statement
from contextlib import contextmanager
import re,time
re_first_num = re.compile(r"\d")
def intcomma_noregex(value):
end_offset, start_digit, period = len(value),re_first_num.search(value).start(),value.rfind('.')
if period == -1:
period=end_offset
segments,_from_index,leftover = [],0,(period-start_digit) % 3
for _index in xrange(start_digit+3 if not leftover else start_digit+leftover,period,3):
segments.append(value[_from_index:_index])
_from_index=_index
if not segments:
return value
segments.append(value[_from_index:])
return ','.join(segments)
def intcomma_noregex_reversed(value):
end_offset, start_digit, period = len(value),re_first_num.search(value).start(),value.rfind('.')
if period == -1:
period=end_offset
_from_index,segments = end_offset,[]
for _index in xrange(period-3,start_digit,-3):
segments.append(value[_index:_from_index])
_from_index=_index
if not segments:
return value
segments.append(value[:_from_index])
return ','.join(reversed(segments))
re_3digits = re.compile(r'(?<=\d)\d{3}(?!\d)')
def intcomma(value):
segments,last_endoffset=[],len(value)
while last_endoffset > 3:
digit_group = re_3digits.search(value,0,last_endoffset)
if not digit_group:
break
segments.append(value[digit_group.start():last_endoffset])
last_endoffset=digit_group.start()
if not segments:
return value
if last_endoffset:
segments.append(value[:last_endoffset])
return ','.join(reversed(segments))
def intcomma_recurs(value):
"""
Converts an integer to a string containing commas every three digits.
For example, 3000 becomes '3,000' and 45000 becomes '45,000'.
"""
new = re.sub("^(-?\d+)(\d{3})", '\g<1>,\g<2>', str(value))
if value == new:
return new
else:
return intcomma(new)
@contextmanager
def timed(save_time_func):
begin=time.time()
try:
yield
finally:
save_time_func(time.time()-begin)
def testset_xsimple(func):
func('5')
def testset_simple(func):
func('567')
def testset_onecomma(func):
func('567890')
def testset_complex(func):
func('-1234567.024')
def testset_average(func):
func('-1234567.024')
func('567')
func('5674')
if __name__ == '__main__':
print 'Test results:'
for test_data in ('5','567','1234','1234.56','-253892.045'):
for func in (intcomma,intcomma_noregex,intcomma_noregex_reversed,intcomma_recurs):
print func.__name__,test_data,func(test_data)
times=[]
def overhead(x):
pass
for test_run in xrange(1,4):
for func in (intcomma,intcomma_noregex,intcomma_noregex_reversed,intcomma_recurs,overhead):
for testset in (testset_xsimple,testset_simple,testset_onecomma,testset_complex,testset_average):
for x in xrange(1000): # prime the test
testset(func)
with timed(lambda x:times.append(((test_run,func,testset),x))):
for x in xrange(50000):
testset(func)
for (test_run,func,testset),_delta in times:
print test_run,func.__name__,testset.__name__,_delta
And here are the test results:
intcomma 5 5
intcomma_noregex 5 5
intcomma_noregex_reversed 5 5
intcomma_recurs 5 5
intcomma 567 567
intcomma_noregex 567 567
intcomma_noregex_reversed 567 567
intcomma_recurs 567 567
intcomma 1234 1,234
intcomma_noregex 1234 1,234
intcomma_noregex_reversed 1234 1,234
intcomma_recurs 1234 1,234
intcomma 1234.56 1,234.56
intcomma_noregex 1234.56 1,234.56
intcomma_noregex_reversed 1234.56 1,234.56
intcomma_recurs 1234.56 1,234.56
intcomma -253892.045 -253,892.045
intcomma_noregex -253892.045 -253,892.045
intcomma_noregex_reversed -253892.045 -253,892.045
intcomma_recurs -253892.045 -253,892.045
1 intcomma testset_xsimple 0.0410001277924
1 intcomma testset_simple 0.0369999408722
1 intcomma testset_onecomma 0.213000059128
1 intcomma testset_complex 0.296000003815
1 intcomma testset_average 0.503000020981
1 intcomma_noregex testset_xsimple 0.134000062943
1 intcomma_noregex testset_simple 0.134999990463
1 intcomma_noregex testset_onecomma 0.190999984741
1 intcomma_noregex testset_complex 0.209000110626
1 intcomma_noregex testset_average 0.513000011444
1 intcomma_noregex_reversed testset_xsimple 0.124000072479
1 intcomma_noregex_reversed testset_simple 0.12700009346
1 intcomma_noregex_reversed testset_onecomma 0.230000019073
1 intcomma_noregex_reversed testset_complex 0.236999988556
1 intcomma_noregex_reversed testset_average 0.56299996376
1 intcomma_recurs testset_xsimple 0.348000049591
1 intcomma_recurs testset_simple 0.34600019455
1 intcomma_recurs testset_onecomma 0.625
1 intcomma_recurs testset_complex 0.773999929428
1 intcomma_recurs testset_average 1.6890001297
1 overhead testset_xsimple 0.0179998874664
1 overhead testset_simple 0.0190000534058
1 overhead testset_onecomma 0.0190000534058
1 overhead testset_complex 0.0190000534058
1 overhead testset_average 0.0309998989105
2 intcomma testset_xsimple 0.0360000133514
2 intcomma testset_simple 0.0369999408722
2 intcomma testset_onecomma 0.207999944687
2 intcomma testset_complex 0.302000045776
2 intcomma testset_average 0.523000001907
2 intcomma_noregex testset_xsimple 0.139999866486
2 intcomma_noregex testset_simple 0.141000032425
2 intcomma_noregex testset_onecomma 0.203999996185
2 intcomma_noregex testset_complex 0.200999975204
2 intcomma_noregex testset_average 0.523000001907
2 intcomma_noregex_reversed testset_xsimple 0.130000114441
2 intcomma_noregex_reversed testset_simple 0.129999876022
2 intcomma_noregex_reversed testset_onecomma 0.236000061035
2 intcomma_noregex_reversed testset_complex 0.241999864578
2 intcomma_noregex_reversed testset_average 0.582999944687
2 intcomma_recurs testset_xsimple 0.351000070572
2 intcomma_recurs testset_simple 0.352999925613
2 intcomma_recurs testset_onecomma 0.648999929428
2 intcomma_recurs testset_complex 0.808000087738
2 intcomma_recurs testset_average 1.81900000572
2 overhead testset_xsimple 0.0189998149872
2 overhead testset_simple 0.0189998149872
2 overhead testset_onecomma 0.0190000534058
2 overhead testset_complex 0.0179998874664
2 overhead testset_average 0.0299999713898
3 intcomma testset_xsimple 0.0360000133514
3 intcomma testset_simple 0.0360000133514
3 intcomma testset_onecomma 0.210000038147
3 intcomma testset_complex 0.305999994278
3 intcomma testset_average 0.493000030518
3 intcomma_noregex testset_xsimple 0.131999969482
3 intcomma_noregex testset_simple 0.136000156403
3 intcomma_noregex testset_onecomma 0.192999839783
3 intcomma_noregex testset_complex 0.202000141144
3 intcomma_noregex testset_average 0.509999990463
3 intcomma_noregex_reversed testset_xsimple 0.125999927521
3 intcomma_noregex_reversed testset_simple 0.126999855042
3 intcomma_noregex_reversed testset_onecomma 0.235999822617
3 intcomma_noregex_reversed testset_complex 0.243000030518
3 intcomma_noregex_reversed testset_average 0.56200003624
3 intcomma_recurs testset_xsimple 0.337000131607
3 intcomma_recurs testset_simple 0.342000007629
3 intcomma_recurs testset_onecomma 0.609999895096
3 intcomma_recurs testset_complex 0.75
3 intcomma_recurs testset_average 1.68300008774
3 overhead testset_xsimple 0.0189998149872
3 overhead testset_simple 0.018000125885
3 overhead testset_onecomma 0.018000125885
3 overhead testset_complex 0.0179998874664
3 overhead testset_average 0.0299999713898

this is baked into python per PEP -> https://www.python.org/dev/peps/pep-0378/
just use format(1000, ',d') to show an integer with thousands separator
there are more formats described in the PEP, have at it

The babel module can apply thousands separators according to the locale provided.
To install babel, run the command below.
pip install babel
Usage
from babel.numbers import format_currency
format_currency(1234567.89, 'USD', locale='en_US')
# Output: $1,234,567.89
format_currency(1234567.89, 'USD', locale='es_CO')
# Output: US$ 1.234.567,89 (raw output US$\xa01.234.567,89)
format_currency(1234567.89, 'INR', locale='en_IN')
# Output: ₹12,34,567.89

This does money along with the commas
def format_money(money, presym='$', postsym=''):
    fmt = '%0.2f' % money
    dot = fmt.find('.')  # str.find works in both Python 2 and 3
    ret = []
    if money < 0:
        ret.append('(')
        p0 = 1  # skip the minus sign
    else:
        p0 = 0
    ret.append(presym)
    p1 = (dot - p0) % 3 + p0
    if p1 == p0:
        p1 += 3  # digit count is a multiple of 3: avoid an empty first group
    while True:
        ret.append(fmt[p0:p1])
        if p1 == dot:
            break
        ret.append(',')
        p0 = p1
        p1 += 3
    ret.append(fmt[dot:])  # decimals
    ret.append(postsym)
    if money < 0:
        ret.append(')')
    return ''.join(ret)

I have a python 2 and python 3 version of this code. I know that the question was asked for python 2 but now (8 years later lol) people will probably be using python 3. Python 3 Code:
import random
number = str(random.randint(1, 10000000))
comma_placement = 4
print('The original number is: {}. '.format(number))
while True:
if len(number) % 3 == 0:
for i in range(0, len(number) // 3 - 1):
number = number[0:len(number) - comma_placement + 1] + ',' + number[len(number) - comma_placement + 1:]
comma_placement = comma_placement + 4
else:
for i in range(0, len(number) // 3):
number = number[0:len(number) - comma_placement + 1] + ',' + number[len(number) - comma_placement + 1:]
break
print('The new and improved number is: {}'.format(number))
Python 2 Code: (Edit. The python 2 code isn't working. I am thinking that the syntax is different).
import random
number = str(random.randint(1, 10000000))
comma_placement = 4
print 'The original number is: %s.' % (number)
while True:
if len(number) % 3 == 0:
for i in range(0, len(number) // 3 - 1):
number = number[0:len(number) - comma_placement + 1] + ',' + number[len(number) - comma_placement + 1:]
comma_placement = comma_placement + 4
else:
for i in range(0, len(number) // 3):
number = number[0:len(number) - comma_placement + 1] + ',' + number[len(number) - comma_placement + 1:]
break
print 'The new and improved number is: %s.' % (number)

Here is another variant using a generator function that works for integers:
def ncomma(num):
    def _helper(num):
        # assert isinstance(numstr, basestring)
        numstr = '%d' % num
        for ii, digit in enumerate(reversed(numstr)):
            if ii and ii % 3 == 0 and digit.isdigit():
                yield ','
            yield digit
    return ''.join(reversed([n for n in _helper(num)]))
And here's a test:
>>> for i in (0, 99, 999, 9999, 999999, 1000000, -1, -111, -1111, -111111, -1000000):
... print i, ncomma(i)
...
0 0
99 99
999 999
9999 9,999
999999 999,999
1000000 1,000,000
-1 -1
-111 -111
-1111 -1,111
-111111 -111,111
-1000000 -1,000,000

Italy:
>>> import locale
>>> locale.setlocale(locale.LC_ALL,"")
'Italian_Italy.1252'
>>> f"{1000:n}"
'1.000'

Just subclass long (or float, or whatever). This is highly practical, because this way you can still use your numbers in math ops (and therefore existing code), but they will all print nicely in your terminal.
>>> class number(long):
def __init__(self, value):
self = value
def __repr__(self):
s = str(self)
l = [x for x in s if x in '1234567890']
for x in reversed(range(len(s)-1)[::3]):
l.insert(-x, ',')
l = ''.join(l[1:])
return ('-'+l if self < 0 else l)
>>> number(-100000)
-100,000
>>> number(-100)
-100
>>> number(-12345)
-12,345
>>> number(928374)
928,374
>>> 345

For floats:
float(filter(lambda x: x!=',', '1,234.52'))
# returns 1234.52
For ints:
int(filter(lambda x: x!=',', '1,234'))
# returns 1234
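The filter() calls above rely on Python 2, where filter() on a string returns a string. In Python 3 filter() returns an iterator, so a sketch along the following lines may be closer to what you want. The helper name parse_grouped is hypothetical, and it assumes ',' is the grouping character and '.' the decimal point:

```python
def parse_grouped(text, sep=","):
    """Strip the grouping separator, then parse as float or int."""
    cleaned = text.replace(sep, "")
    # Treat anything with a decimal point as a float, otherwise as an int
    return float(cleaned) if "." in cleaned else int(cleaned)


print(parse_grouped('1,234.52'))
print(parse_grouped('1,234'))
```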

Related

Changing digits in numbers based on a conditions

In Norway we have something called D- and S-numbers. These are national identification numbers in which the day or month of birth is modified.
D-number
[d+4]dmmyy
S-number
dd[m+5]myy
I have a column with dates, some of them normal (ddmmyy) and some of them are formatted as D- or S-numbers. Leading zeroes are also missing.
df = pd.DataFrame({'dates': [241290, #24.12.90
710586, #31.05.86
105299, #10.02.99
56187] #05.11.87
})
dates
0 241290
1 710586
2 105299
3 56187
I've written this function to add leading zero and convert the dates, but this solution doesn't feel that great.
def func(s):
    s = s.astype(str)
    res = []
    for index, value in s.items():
        # Make sure all dates have 6 digits (add leading zero)
        if len(value) == 5:
            value = ('0' + value)
        # Convert S- and D-dates to regular dates
        if int(value[0]) > 3:
            # subtract 4 from the first digit
            res.append(str(int(value[0]) - 4) + value[1:])
        elif int(value[2]) > 1:
            # subtract 5 from the third digit
            res.append(value[:2] + str(int(value[2]) - 5) + value[3:])
        else:
            res.append(value)
    return pd.Series(res)
Is there a smoother and faster way of accomplishing the same result?
Normalize dates by padding with 0 then explode into 3 columns of two digits (day, month, year). Apply your rules and combine columns to a DateTimeIndex:
# Suggested by @HenryEcker
# Changed: .pad(6, fillchar='0') to .zfill(6)
dates = df['dates'].astype(str).str.zfill(6).str.findall(r'(\d{2})') \
                   .apply(pd.Series).astype(int) \
                   .rename(columns={0: 'day', 1: 'month', 2: 'year'}) \
                   .agg({'day': lambda d: d if d <= 31 else d - 40,
                         'month': lambda m: m if m <= 12 else m - 50,
                         'year': lambda y: 1900 + y})
df['dates2'] = pd.to_datetime(dates)
Output:
>>> df
dates dates2
0 241290 1990-12-24
1 710586 1986-05-31
2 105299 1999-02-10
3 56187 1987-11-05
>>> dates
day month year
0 24 12 1990
1 31 5 1986
2 10 2 1999
3 5 11 1987
You can keep the Series as integers until the final step. The disadvantage of the method below is that the offsets do not match what the specifications say and may take more mental power to comprehend:
import numpy as np
def func2(s):
    # In mathematical operations, digits are counted from right
    # so "first digit" becomes sixth and "third digit" becomes
    # fourth in a 6-digit number
    delta = np.select(
        [s // 10**5 % 10 > 3, s // 10**3 % 10 > 1],
        [4 * 10**5          , 5 * 10**3          ],
        0
    )
    return (s - delta).astype('str').str.pad(6, fillchar='0')
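To make the rule itself easier to check, here is a plain-Python sketch that normalizes a single value, using the four sample values from the question. The function name normalize_birth_date is hypothetical, and the thresholds (day above 40, month above 50) are my reading of the D- and S-number offsets described above:

```python
def normalize_birth_date(value):
    """Return a 6-character ddmmyy string with D-/S-number offsets removed."""
    digits = str(value).zfill(6)          # restore a missing leading zero
    day, month, year = int(digits[:2]), int(digits[2:4]), digits[4:]
    if day > 40:      # D-number: 4 was added to the first digit of the day
        day -= 40
    if month > 50:    # S-number: 5 was added to the first digit of the month
        month -= 50
    return f"{day:02d}{month:02d}{year}"


for raw in (241290, 710586, 105299, 56187):
    print(raw, normalize_birth_date(raw))
```

Something like this could be applied per element with Series.map if readability matters more than speed.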

Add 'document_id' column to pandas dataframe of word-id's and wordcounts

I have following dataset:
import pandas as pd
jsonDF = pd.DataFrame({'DOCUMENT_ID':[263403828328665088,264142543883739136], 'MESSAGE':['#Zuora wants to help #Network4Good with Hurric...','#ztrip please help spread the good word on hel...']})
DOCUMENT_ID MESSAGE
0 263403828328665088 #Zuora wants to help #Network4Good with Hurric...
1 264142543883739136 #ztrip please help spread the good word on hel...
I am trying to reshape my data in the form of
docID wordID count
0 1 118 1
1 1 285 1
2 1 1229 1
3 1 1688 1
4 1 2068 1
I used following
r=[]
for i in jsonDF['MESSAGE']:
for j in sortedValues(wordsplit(i)):
r.append(j)
IDCount_Re=pd.DataFrame(r)
IDCount_Re[:5]
gives me following result
0 17
1 help 2
2 wants 1
3 hurricane 1
4 relief 1
5 text 1
6 sandy 1
7 donate 1
8 6
9 please 1
I can get word counts
I have no idea how to append DOCUMENT_ID to the dataframe above.
Following functions were used to split words
from nltk.corpus import stopwords
from collections import Counter, OrderedDict
import itertools
import re
def wordsplit(wordlist):
    j=wordlist
    j=re.sub(r'\d+', '', j)
    j=re.sub('RT', '',j)
    j=re.sub('http', '', j)
    j = re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)
    j=j.lower()
    j=j.strip()
    if not j in stopwords.words('english'):
        yield j
def wordSplitCount(wordlist):
    '''merges a list into string, splits it, removes stop words and
    then counts the occurrences returning an ordered dictionary'''
    #stopwords=set(stopwords.words('english'))
    string1=''.join(list(itertools.chain(filter(None, wordlist))))
    cnt=Counter()
    j = []
    for i in string1.split(" "):
        i=re.sub(r'&', ' ', i.lower())
        if i not in stopwords.words('english'):
            cnt[i]+=1
    return OrderedDict(cnt)
def sortedValues(wordlist):
    '''creates a list of (word, count) pairs with counts descending'''
    d=wordSplitCount(wordlist)
    return sorted(d.items(), key=lambda t: t[1], reverse=True)
UPDATE: SOLUTION HERE:
string split and and assign unique ids to Pandas DataFrame
'DOCUMENT_ID' is one of the two fields in each row of jsonDF. Your current code doesn't access it because it directly works on jsonDF['MESSAGE'].
Here is some non-working pseudocode - something like:
for _, row in jsonDF.iterrows():
    doc_id, msg = row
    words = [word for word in wordsplit(msg)][0].split() # hack
    wordcounts = Counter(words).most_common() # sort by decr frequency
Then do a pd.concat(pd.DataFrame({'DOCUMENT_ID': doc_id, ...
and get the 'wordId' and 'count' fields from wordcounts.
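The idea in the pseudocode above can be sketched without pandas or nltk. This version assumes simple whitespace tokenization instead of the asker's wordsplit(), and the function name word_count_rows is made up for illustration:

```python
from collections import Counter

def word_count_rows(documents):
    """Turn (doc_id, message) pairs into (doc_id, word, count) rows."""
    rows = []
    for doc_id, message in documents:
        # Count words per document so each row keeps its document id
        counts = Counter(message.lower().split())
        for word, count in counts.most_common():
            rows.append((doc_id, word, count))
    return rows


rows = word_count_rows([
    (263403828328665088, "help wants help relief"),
    (264142543883739136, "please help"),
])
for row in rows:
    print(row)
```

The resulting list of tuples could then be handed to pd.DataFrame(rows, columns=['DOCUMENT_ID', 'word', 'count']); mapping each word to a numeric wordID would be a separate step.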

Adding numbers in a sequence

I have a basic problem.
If we are to input:
6
1 2 3 4 10 11
The desired outcome should be:
31
Here is the coding, you must simply finish the function and it should work:
#!/bin/python3
import sys
def simpleArraySum(n, ar):
# Complete this function
n = int(input().strip())
ar = list(map(int, input().strip().split(' ')))
result = simpleArraySum(n, ar)
print(result)
We want 1+2+3+4+10+11 = 31
Use sum().
def simpleArraySum(n, ar):
    return sum(ar[:n])
The [:n] truncates the array to n elements.

How to add vectors to the columns of some array in Julia?

I know that, with package DataFrames, it is possible by doing simply
julia> df = DataFrame();
julia> for i in 1:3
df[i] = [i, i+1, i*2]
end
julia> df
3x3 DataFrame
|-------|----|----|----|
| Row # | x1 | x2 | x3 |
| 1 | 1 | 2 | 3 |
| 2 | 2 | 3 | 4 |
| 3 | 2 | 4 | 6 |
... but are there any means to do the same on an empty Array{Int64,2} ?
If you know how many rows you have in your final Array, you can do it using hcat:
# The number of lines of your final array
numrows = 3
# Create an empty array of the same type that you want, with 3 rows and 0 columns:
a = Array(Int, numrows, 0)
# Concatenate 3x1 arrays in your empty array:
for i in 1:numrows
b = [i, i+1, i*2] # Create the array you want to concatenate with a
a = hcat(a, b)
end
Notice that, here you know that the arrays b have elements of the type Int. Therefore we can create the array a that have elements of the same type.
Loop over the rows of the matrix:
A = zeros(3,3)
for i = 1:3
A[i,:] = [i, i+1, 2i]
end
If at all possible, it is best to create your Array with the desired number of columns from the start. That way, you can just fill in those column values. Solutions using procedures like hcat() will suffer from inefficiency, since they require re-creating the Array each time.
If you do need to add columns to an already existing Array, you will be better off if you can add them all at once, rather than in a loop with hcat(). E.g. if you start with:
n = 10; m = 5;
A = rand(n,m);
then
A = [A rand(n, 3)]
will be faster and more memory efficient than:
for idx = 1:3
A = hcat(A, rand(n))
end
E.g. compare the difference in speed and memory allocations between these two:
n = 10^5; m = 10;
A = rand(n,m);
n_newcol = 10;
function t1(A::Array, n_newcol)
n = size(A,1)
for idx = 1:n_newcol
A = hcat(A, zeros(n))
end
return A
end
function t2(A::Array, n_newcol)
n = size(A,1)
[A zeros(n, n_newcol)]
end
# Stats after running each function once to compile
@time r1 = t1(A, n_newcol); ## 0.145138 seconds (124 allocations: 125.888 MB, 70.58% gc time)
@time r2 = t2(A, n_newcol); ## 0.011566 seconds (9 allocations: 22.889 MB, 39.08% gc time)

How to delete "1" followed by trailing zeros from Data Frame row values ?

From my "Id" Column I want to remove the one and zero's from the left.
That is
1000003 becomes 3
1000005 becomes 5
1000011 becomes 11 and so on
Ignore -1, 10 and 1000000; they will be handled as special cases. From the remaining rows I want to remove the "1" followed by zeros.
Well you can use modulus to get the end of the numbers (they will be the remainder). So just exclude the rows with ids of [-1,10,1000000] and then compute the modulus of 1000000:
print df
Id
0 -1
1 10
2 1000000
3 1000003
4 1000005
5 1000007
6 1000009
7 1000011
keep = df.Id.isin([-1,10,1000000])
df.Id[~keep] = df.Id[~keep] % 1000000
print df
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
Edit: Here is a fully vectorized string slice version as an alternative (Like Alex' method but takes advantage of pandas' vectorized string methods):
keep = df.Id.isin([-1,10,1000000])
df.Id[~keep] = df.Id[~keep].astype(str).str[1:].astype(int)
print df
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
Here is another way you could try to do it:
def f(x):
    """convert the value to a string, then select only the characters
    after the first one in the string, which is 1. For example,
    100005 would be 00005 and I believe it's returning 00005.0 from
    dataframe, which is why the float() is there. Then just convert
    it to an int, and you'll have 5, etc.
    """
    return int(float(str(x)[1:]))
# apply the function "f" to the dataframe and pass in the column 'Id'
df.apply(lambda row: f(row['Id']), axis=1)
I get that this question has been satisfactorily answered. But for future visitors, what I like about alex' answer is that it does not depend on there being exactly four zeros. The accepted answer will fail if you sometimes have 10005, sometimes 1000005 and so on.
However, to add something more to the way we think about it. If you know it's always going to be 10000, you can do
# back up all values
foo = df.id
# now, some will be negative or zero
df.id = df.id - 10000
# restore those that are negative or zero (here, first three rows)
df.id[df.id <= 0] = foo[df.id <= 0]
It gives you the same as Karl's answer, but I typically prefer these kinds of methods for their readability.