How do you control float formatting when using DataFrame.to_markdown in pandas?

I'm trying to use DataFrame.to_markdown with a dataframe that contains float values that I'd like to have rounded off. Without to_markdown() I can just set pd.options.display.float_format and everything works fine, but to_markdown doesn't seem to be respecting that option.
Repro:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [42.42, 99.11234123412341234, -23]])
pd.options.display.float_format = '{:,.0f}'.format
print(df)
print()
print(df.to_markdown())
outputs:
    0   1    2
0   1   2    3
1  42  99  -23
| | 0 | 1 | 2 |
|---:|------:|--------:|----:|
| 0 | 1 | 2 | 3 |
| 1 | 42.42 | 99.1123 | -23 |
(compare the 42.42 and 99.1123 in the to_markdown table to the 42 and 99 in the plain old df)
Is this a bug or am I missing something about how to use to_markdown?

It looks like pandas uses tabulate for this formatting. If it's installed, you can use something like:
df.to_markdown(floatfmt=".0f")
output:
| | 0 | 1 | 2 |
|---:|----:|----:|----:|
| 0 | 1 | 2 | 3 |
| 1 | 42 | 99 | -23 |
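tabulate's floatfmt also accepts a sequence of format strings, one per rendered column, so each column can get its own precision. A hedged sketch, since whether the index counts as the first entry may depend on the tabulate version:
# hedged example: one format per rendered column; the first entry is assumed to cover the index column
df.to_markdown(floatfmt=("g", ".0f", ".2f", ".0f"))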


How to print rows of data where the difference in columns is >1

The data table I'm using is:
| State | Sno | Center          | Mar-21 | Apr-21 |
|-------|-----|-----------------|--------|--------|
| AP    | 1   | Guntur          | 121    | 121.1  |
|       | 2   | Nellore         | 118.8  | 118.3  |
|       | 3   | Visakhapatnam   | 131.6  | 131.5  |
| ASM   | 4   | Biswanath-      | 123.7  | 124.5  |
|       | 5   | Doom-Dooma      | 127.8  | 128.2  |
|       | 6   | Guwahati        | 125.9  | 128.2  |
|       | 7   | Labac-Silchar   | 114.2  | 115.4  |
|       | 8   | Numaligarh-     | 114.2  | 115.1  |
|       | 9   | Sibsagar        | 117.7  | 117.3  |
|       | 10  | Munger-Jamalpur | 117.2  | 118.3  |
I want to find the difference between the columns Mar-21 and Apr-21 and print only those rows where the difference is > 1.
I tried the following:
from numpy import median
import pandas as pd
from pandas.core.tools.numeric import to_numeric
df=pd.read_csv('CPIIW_421a.csv')
mydiff=(df['Apr-21']-df['Mar-21'])
print(mydiff)
df['diff']=(df['Apr-21']-df['Mar-21'])
print(df['diff'])
This code displays only one column of differences, as shown below, instead of the full rows:
0    0.1
1   -0.5
2   -0.1
3    0.8
4    0.4
I need to display all rows where the difference is > 1. How should I proceed?
I also want to copy the required data into a new CSV file. Please advise; I am a beginner.
Thanks.
I think you just need this in your last print:
print(df.loc[df['diff'] > 1])
To save it to CSV:
df.loc[df['diff'] > 1].to_csv('yourcsv.csv')
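If you don't want the row index written to the file, to_csv also accepts index=False:
df.loc[df['diff'] > 1].to_csv('yourcsv.csv', index=False)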
You can try the following:
df_new = df[(df["Apr-21"]-df["Mar-21"]) > 1].copy()
For example:
df = pd.DataFrame(data={'State': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma",
                                       "Guwahati", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
                        'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
                        'Apr-21': [121.1, 118.3, 131.5, 124.5, 128.2, 128.2, 115.4, 115.1, 117.3, 118.3]})
df
   State      Sno Center  Mar-21  Apr-21
0      1          Guntur   121.0   121.1
1      2         Nellore   118.8   118.3
2      3   Visakhapatnam   131.6   131.5
3      4       Biswanath   123.7   124.5
4      5      Doom-Dooma   127.8   128.2
5      6        Guwahati   125.9   128.2
6      7   Labac-Silchar   114.2   115.4
7      8      Numaligarh   114.2   115.1
8      9        Sibsagar   117.7   117.3
9     10  Munger-Jamalpu   117.7   118.3
df_new = df[(df["Apr-21"]-df["Mar-21"]) > 1].copy()
The result:
df_new
   State      Sno Center  Mar-21  Apr-21
5      6        Guwahati   125.9   128.2
6      7   Labac-Silchar   114.2   115.4
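To also write the filtered rows to a new CSV file, as asked in the question, the same to_csv call applies (the file name here is just an example):
df_new.to_csv('rows_with_diff_gt_1.csv', index=False)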

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
        bar           foo
0  rejected  unidentified
1     clear       caution
2   caution           NaN
Note this is just a random made-up df; there are lots of other columns, say with data types other than text. bar and foo might also have lots of empty cells/values, which are NaNs.
The actual df looks like this (the above is just a sample):
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
    columns = list(df)
    for column in columns:
        df[column] = df[column].map(result_map)

# We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified': 0, "clear": 1, 'suspected': 2, "caution": 3, 'rejected': 4}
So the output might look like:
   bar  foo
0    4    0
1    1    3
2    3  NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
Map only the columns you actually want converted, rather than every column in the dataframe:
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df
   bar  foo
0    4    0
1    1    3
2    3    2
However, if you wish to be on the safe side (in case a key/value is not in your result_map and you don't want to see a NaN), do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
So the output for this df:
        bar           foo
0  rejected  unidentified
1     clear       caution
2   caution     suspected
3     sdgdg          0000
will result in:
         bar        foo
0          4          0
1          1          3
2          3          2
3  not found  not found
To apply the same mapping to several columns at once:
cols = ['foo', 'bar', 'other_columns']
for c in cols:
    df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Let's try stack, map the dict, and then unstack:
df.stack().to_frame()[0].map(result_map).unstack()
   bar  foo
0    4    0
1    1    3
2    3    2
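Another option, not from the answers above but worth noting: DataFrame.replace takes the same value-to-value dict and applies it across every column, leaving values that are not keys in result_map untouched (a sketch under that assumption):
# values not present in result_map (including NaN) are left as they are
df = df.replace(result_map)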

pandas cumcount in pyspark

I'm currently attempting to convert a script I made from pandas to PySpark. I have a dataframe that contains data in the form of:
index | letter
------|-------
0 | a
1 | a
2 | b
3 | c
4 | a
5 | a
6 | b
I want to create the following dataframe in which the occurrence count for each instance of a letter is stored, for example the first time we see "a" its occurrence count is 0, second time 1, third time 2:
index | letter | occurrence
------|--------|-----------
0 | a | 0
1 | a | 1
2 | b | 0
3 | c | 0
4 | a | 2
5 | a | 3
6 | b | 1
I can achieve this in pandas using:
df['occurrence'] = df.groupby('letter').cumcount()
How would I go about doing this in PySpark? I cannot find an existing method that is similar.
The feature you're looking for is called window functions:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# row_number() is 1-based, so subtract 1 to match pandas' 0-based cumcount
df.withColumn("occurrence", row_number().over(Window.partitionBy("letter").orderBy("index")) - 1)
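For reference, a minimal end-to-end sketch with the sample data from the question (it assumes an existing SparkSession named spark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [(0, "a"), (1, "a"), (2, "b"), (3, "c"), (4, "a"), (5, "a"), (6, "b")],
    ["index", "letter"],
)
w = Window.partitionBy("letter").orderBy("index")
# subtract 1 so the first occurrence of each letter is numbered 0, as in the expected output
df.withColumn("occurrence", F.row_number().over(w) - 1).orderBy("index").show()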

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
|------|------------|-------|
| group| event date | count |
| x123 | 2016-01-06 | 1 |
| | 2016-01-08 | 10 |
| | 2016-02-15 | 9 |
| | 2016-05-22 | 6 |
| | 2016-05-29 | 2 |
| | 2016-05-31 | 6 |
| | 2016-12-29 | 1 |
| x124 | 2016-01-01 | 1 |
...
and I also know t0, which is the beginning of time (let's say for x123 it's 2016-01-01), and tN, which is the end of the experiment, from another dataframe df_s (2017-05-25), then how can I create the dataframe df_new, which should look like this:
|------|------------|---------------|--------|
| group| obs. weekly| lifetime, week| status |
| x123 | 2016-01-01 | 1 | 1 |
| | 2016-01-08 | 0 | 0 |
| | 2016-01-15 | 0 | 0 |
| | 2016-01-22 | 1 | 1 |
| | 2016-01-29 | 2 | 1 |
...
| | 2017-05-18 | 1 | 1 |
| | 2017-05-25 | 1 | 1 |
...
| x124 | 2017-05-18 | 1 | 1 |
| x124 | 2017-05-25 | 1 | 1 |
Explanation: take t0 and generate rows up to tN, one per week. For each row R, check whether that group has an event date falling within R's week; if it does, count how long (in weeks) it lives there and set status = 1 (alive); otherwise set the lifetime and status columns for this R to 0 (dead).
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of such df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd

# 1. generate dataframes per group to get the boundary within `t0` and `tN` from the df_s dataframe,
#    where each dataframe has "group, obs, lifetime, status" columns x (tN - t0) / week rows filled with 0 values.
df_all = pd.concat([df_group1, df_group2])

def do_that(R):
    found_event_row = df_e.iloc[[R.group]]
    # check if found_event_row['date'] falls into R['obs'] week
    # if True, then find how long it's there

df_new = df_all.apply(do_that)
I'm not really sure I get you, but group one is not related to group two, right? If that's the case, I think what you want is something like this:
import pandas as pd

df_group1 = df_group1.set_index('event date')
df_group1.index = pd.to_datetime(df_group1.index)  # convert the index to datetime so you can 'resample'
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lambda x: yourfunction(x))
df_group1 = df_group1.reset_index()
df_group1['status'] = df_group1.apply(lambda x: 1 if x['lifetime, week'] > 0 else 0, axis=1)
# do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week' but all that's left is creating the function that generates it.
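As a starting point for question 1, a minimal sketch (my own assumption of how t0 and tN would be used, not part of the answer above) that builds the weekly grid for one group with pandas.date_range; the lifetime and status columns start at 0 and would be filled in from df_e afterwards:
import pandas as pd

t0, tN = pd.Timestamp('2016-01-01'), pd.Timestamp('2017-05-25')  # taken from the question's example
weeks = pd.date_range(t0, tN, freq='7D')                         # one row per week from t0 to tN
grid = pd.DataFrame({'group': 'x123', 'obs. weekly': weeks,
                     'lifetime, week': 0, 'status': 0})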

SQLite - Complex Query

This is what I want to get.
Art|CANTIDAD1|CANTIDAD2|CANTIDAD1CARGA1 |CANTIDAD2CARGA1 |CANTIDAD1CARGA2 | CANTIDAD2CARGA2
----------------------------------------------------------------------------------------------
001| 7 | 0 | 4 | 0 | 3 | 0
002| 0 | 2 | 0 | 1 | 0 | 1
003| 2 | 0 | 2 | 0 | 0 | 0
004| 3 | 0 | 1 | 0 | 2 | 0
005| 2 | 0 | 0 | 0 | 2 | 0
006| 0 | 1 | 0 | 0 | 0 | 1
I get CANTIDAD1 and CANTIDAD2 with this query; they are the sums of the amounts matching the WHERE clause:
SELECT
SUM(D.NCANTIDAD1) AS NTOTCANTIDAD1,
SUM(D.NCANTIDAD2) AS NTOTCANTIDAD2
FROM
CABPEDIDOS C,
DETPEDIDOS D,
ARTICULOS A
WHERE
C.DFECHAALBARAN IS NULL
AND C.CSERIE = D.CSERIE
AND C.NPEDIDO = D.NPEDIDO
AND D.NFABRICANTE = A.NFABRICANTE
AND D.CARTICULO = A.CARTICULO
GROUP BY
D.NFABRICANTE, D.CARTICULO, A.CNOMBRE
CANTIDAD1CARGA1 and CANTIDAD2CARGA1 are quantities that are in the database (d.cantidad1 and d.cantidad2 are the real names; I have to sum all of them to get CANTIDAD1 and CANTIDAD2), but I need to get the quantities corresponding to the respective C.CARGA:
(CANTIDAD1 = CANTIDAD1CARGA1 + CANTIDAD1CARGA2)
How can I get these values?
** C.NCARGA can have more than one value; I need to get all the CANTIDAD1CARGA'x' and CANTIDAD2CARGA'x' columns.
I don't mind using two queries:
- one for CANTIDAD1 and CANTIDAD2
- another for CANTIDAD1CARGA1, CANTIDAD2CARGA1, CANTIDAD1CARGA2, etc.
I have a feeling I'm not really understanding the question, but it seems like you just need:
SELECT CANTIDAD1CARGA1 + CANTIDAD1CARGA2 AS CANTIDAD1,
CANTIDAD2CARGA1 + CANTIDAD2CARGA2 AS CANTIDAD2,
CANTIDAD1CARGA1, CANTIDAD2CARGA1, CANTIDAD1CARGA2, CANTIDAD2CARGA2
FROM ...