Swap values within classes - pandas
How can I swap values within classes, please?
As shown in this table:
(Before/After table omitted: it showed the rows of each class before and after the intended within-class swap.)
I want to do this because the data is oversampled. It is very repetitive, and this causes machine learning tools to overfit.
OK, try this out:
import numpy as np
import pandas as pd

# Setup example dataframe
df = pd.DataFrame({"Class": [1,2,1,3,1,2,1,3,1,2,1,3,1,2,1,3],
                   1: [1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,1],
                   2: [0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0],
                   3: [0,0,1,1,1,0,1,1,0,0,1,1,1,0,1,1],
                   4: [1,0,1,1,1,0,1,1,1,0,1,1,1,0,1,1],
                   5: [0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1],
                   6: [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}).set_index("Class")
# Do a filter on class, and store the positions/index of matching contents
class_to_edit=3
swappable_indices = np.where(df.index==class_to_edit)[0]
# Extract the column to edit
column_to_edit=1
column_values = df[column_to_edit].values
# Decide how many values to swap, and randomly assign swaps
# No guarantee here that the swaps will not contain the same values i.e. you could
# end up swapping 1's for 1's and 0's for 0's here - it's entirely random.
number_of_swaps = 2
swap_pairs = np.random.choice(swappable_indices,number_of_swaps*2, replace=False)
# Using the swap pairs, build a map of substitutions,
# starting with a vanilla no-swap map, then updating it with the generated swaps
swap_map={e:e for e in range(0,len(column_values))}
swap_map.update({swap_pairs[e]: swap_pairs[e+1] for e in range(0, len(swap_pairs), 2)})
swap_map.update({swap_pairs[e+1]: swap_pairs[e] for e in range(0, len(swap_pairs), 2)})
# Having built the swap-map, apply it to the data in the column,
column_values = [column_values[swap_map[e]] for e in range(len(column_values))]
# and then plug the column back into the dataframe
df[column_to_edit]=column_values
It's a bit grubby, and I'm sure there's a cleaner way to build that substitution map, perhaps with a one-line dict comprehension, but that should do the trick.
Alternatively, there's np.random.permutation, which might bear some fruit in terms of adding some noise (albeit not by performing discrete swaps).
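As a rough sketch of that alternative (my own variant, not part of the original answer), you can shuffle the column's values among the rows of a single class, which adds noise without touching any other class; it reuses class_to_edit and column_to_edit from above.

# Shuffle column 1's values among the class-3 rows only, using
# np.random.permutation instead of explicit pairwise swaps.
mask = df.index == class_to_edit                    # boolean mask for the chosen class
shuffled = np.random.permutation(df.loc[mask, column_to_edit].to_numpy())
df.loc[mask, column_to_edit] = shuffled             # rows of other classes are untouched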
[edit] For testing, try a dataset with a bit less rigidity; here's an example of a more randomly generated one. Just swap out any columns where you want fixed values if you want to impose some order on the dataset.
df = pd.DataFrame({"Class" : [1,2,1,3,1,2,1,3,1,2,1,3,1,2,1,3],
1:np.random.choice([0,1],16),
2:np.random.choice([0,1],16),
3:np.random.choice([0,1],16),
4:np.random.choice([0,1],16),
5:np.random.choice([0,1],16),
6:np.random.choice([0,1],16)}).set_index("Class")
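Whichever test frame you use, a quick sanity check helps; the sketch below is my own addition (reusing class_to_edit and column_to_edit from above): it snapshots the column, expects you to run the swap code in between, and then confirms that only rows of the edited class changed while the column totals stayed the same.

# Snapshot the column, run the swap code from the answer, then verify the result.
before = df[column_to_edit].copy()
# ... run the swap code shown earlier here ...
after = df[column_to_edit]
changed = before.to_numpy() != after.to_numpy()
assert set(df.index[changed]) <= {class_to_edit}    # only the edited class was touched
assert before.sum() == after.sum()                  # same number of 1s as before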
Related
How to sort connection type into only 2 rows in Qlik sense
I have a column named Con_TYPE in which there are multiple types of connections, such as fiberoptic, satellite, 3g etc., and I want to sort them into only 2 rows:

fiberoptic  5
others      115

Can anybody help me? Thanks in advance.
You can use a Calculated dimension or a Mapping load.

Let's imagine that the data, in its raw form, looks like this:

dimension: Con_TYPE
measure: Sum(value)

Calculated dimension

You can add expressions inside the dimension. If we use a simple if statement as the expression, then the result is:

dimension: =if(Con_TYPE = 'fiberoptic', Con_TYPE, 'other')
measure: Sum(Value)

Mapping load

Mapping load is a script function, so we'll have to change the script a bit:

// Define the mapping. In our case we want to map only one value:
// fiberoptic -> fiberoptic
// we just want "fiberoptic" to be shown as the same "fiberoptic"
TypeMapping:
Mapping
Load * inline [
    Old, New
    fiberoptic, fiberoptic
];

RawData:
Load
    Con_TYPE,
    value,
    // --> this is where the mapping is applied
    // using the TypeMapping defined above, we are mapping the values
    // in the Con_TYPE field. The third parameter specifies what value
    // should be given if the field value is not found in the
    // mapping table. In our case we'll choose "Other"
    ApplyMap('TypeMapping', Con_TYPE, 'Other') as Con_TYPE_Mapped
;
Load * inline [
    Con_TYPE,   value
    fiberoptic, 10
    satellite,  1
    3g,         7
];

// No need to drop the "TypeMapping" table, since it's defined with the
// "Mapping" prefix and Qlik will auto-drop it at the end of the script

And we can use the new field Con_TYPE_Mapped in the UI. The result is:

dimension: Con_TYPE_Mapped
measure: Sum(Value)

Pros/cons

Calculated dimension:
+ easy to use
+ only a UI change
- leads to performance issues on mid/large datasets
- has to be defined (copy/pasted) per table/chart, which might lead to complications if it has to be changed across the whole app (it has to be changed in each object where it is defined)

Mapping load:
+ no performance issues (it's just another field)
+ the mapping table can be defined inline or loaded from an external source (Excel, CSV, DB etc)
+ the new field can be used across the whole app, and changing the values in the script will not require any table/chart changes
- requires a reload if the mapping is changed

P.S. In both cases, selecting "Other" in the tables will correctly filter the values and will show data only for 3g and satellite.
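For readers coming from the pandas question above, the same fiberoptic-versus-other rollup can be sketched in Python for comparison; this is my own illustration, with the column names and the small inline table taken from the Qlik script.

import pandas as pd

# Hypothetical raw data mirroring the inline table in the Qlik script
raw = pd.DataFrame({"Con_TYPE": ["fiberoptic", "satellite", "3g"],
                    "value": [10, 1, 7]})

# Keep "fiberoptic" as-is, map everything else to "Other", then aggregate
raw["Con_TYPE_Mapped"] = raw["Con_TYPE"].where(raw["Con_TYPE"] == "fiberoptic", "Other")
print(raw.groupby("Con_TYPE_Mapped")["value"].sum())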
How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?
I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts(). The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.

I have tried a couple of methods, unsuccessfully:

df = pd.DataFrame(...)  # assume df is a DataFrame with many columns and rows

# 1st method
df.col1.value_counts()

# 2nd method
print(df.col1.value_counts())

# 3rd method
vals = df.col1.value_counts()
vals  # print(vals) doesn't work either

All output something like this:

value1     100000
value2      10000
...
value1000       1

Currently this is what I'm using, but it's quite cumbersome:

print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.

Also, I have read this related Stack Overflow question, but haven't found it helpful. So how do I stop the output from being truncated?
If you want to print all rows:

pd.options.display.max_rows = 1000
print(vals)

If you want to print all rows only once:

with pd.option_context("display.max_rows", 1000):
    print(vals)

Relevant documentation here.
I think you need option_context, set to some large number, e.g. 999. The advantage of this solution is that the option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.

# temporarily display 999 rows
with pd.option_context('display.max_rows', 999):
    print(df.col1.value_counts())
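Two further options may help, sketched here on the assumption that df and col1 are as in the question: lift the row limit for the whole session, or sidestep the display entirely by writing the complete counts to a file.

import pandas as pd

# Option A: remove the truncation limit for the whole session
pd.set_option("display.max_rows", None)
print(df.col1.value_counts())

# Option B: leave display settings alone and save the full table instead
df.col1.value_counts().to_csv("col1_value_counts.csv", header=True)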
How to add a url suffix before performing a callback in scrapy
I have a crawler that works just fine in collecting the urls I am interested in. However, before retrieving the content of these urls (i.e. the ones that satisfy rule no 3), I would like to update them, i.e. add a suffix, say '/fullspecs', on the right-hand side. That means that, in fact, I would like to retrieve and further process, through the callback function, only the updated ones. How can I do that?

rules = (
    Rule(LinkExtractor(allow=('something1'))),
    Rule(LinkExtractor(allow=('something2'))),
    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')), callback='parse_archive'),
)
You can set the process_value parameter to lambda x: x + '/fullspecs', or to a function if you want to do something more complex. Note that process_value is a parameter of the link extractor, not of the Rule. You'd end up with:

Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                   process_value=lambda x: x + '/fullspecs'),
     callback='parse_archive')

See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor
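If the transformation grows beyond a one-liner, a named function keeps it readable; the sketch below is an illustration (add_fullspecs is a hypothetical helper name, and the import paths are those of current Scrapy versions).

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

def add_fullspecs(url):
    # append the suffix only once, in case a link already ends with it
    return url if url.endswith('/fullspecs') else url + '/fullspecs'

rules = (
    Rule(LinkExtractor(allow=('something1'))),
    Rule(LinkExtractor(allow=('something2'))),
    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                       process_value=add_fullspecs),
         callback='parse_archive'),
)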
Handling paginated SQL query results
For my dissertation data collection, one of the sources is an externally-managed system, which is based on a Web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data collection framework, where I simulate the above-mentioned form. Everything worked well while I was limiting the size of the resulting dataset. But when I tried to go over 100000 records (RQ_SIZE in the code below), the tandem "my code - their system" started being unresponsive ("hanging").

So, I have decided to use the SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping then to combine the paginated results into a target data frame. However, after changing my code accordingly, the output that I see is only one pagination progress character (*) and then no more output. I'd appreciate it if you could help me identify the probable cause of the unexpected behavior. I cannot provide a reproducible example, as it's very difficult to extract the functionality, not to mention the data, but I hope that the following code snippet would be enough to reveal the issue (or, at least, a direction toward the problem).

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where, DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData())  # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1

# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {
  # setup SQL pagination
  if (rq$where == '') rq$where <- '1=1'
  rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))

  # Submit data request
  srdaRequestData(queryURL, rq$select, rq$from, rq$where, DATA_SEP, ADD_SQL)
  assign(dataName, srdaGetData())  # retrieve result
  data <- get(dataName)

  # some code

  # add current data frame to the list
  dfList <- c(dfList, data)

  if (DEBUG) message("*", appendLF = FALSE)
}

# merge all the result pages' data frames
data <- do.call("rbind", dfList)

# save current data frame to RDS file
saveRDS(data, rdataFile)
This probably falls into the category where MySQL struggles with a high LIMIT offset: see "Why does MYSQL higher LIMIT offset slow the query down?". Overall, fetching large data sets over HTTP repeatedly is not very reliable.
Since this is for your dissertation, here is a hand:

## Folder where to save the results to disk.
## Ideally, use a new, empty folder; it is then easier to load from disk.
folder.out <- "~/mydissertation/sql_data_scrape/"

## Create the folder if it does not exist.
dir.create(folder.out, showWarnings=FALSE, recursive=TRUE)

## The larger this number, the more memory you will require.
## If you are renting a large box on, say, EC2, then you can make this 100, or so
NumberOfOffsetsBetweenSaves <- 10

## The limit size per request
RQ_SIZE <- 1000

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where, DATA_SEP, ADD_SQL)

## Get the total number of rows
TotalRows <- as.numeric(srdaGetData())

TotalNumberOfRequests <- TotalRows %/% RQ_SIZE
TotalNumberOfGroups   <- TotalNumberOfRequests %/% NumberOfOffsetsBetweenSaves + 1

## FYI: Total number of rows being requested is
## (NumberOfOffsetsBetweenSaves * RQ_SIZE * TotalNumberOfGroups)

for (g in seq(TotalNumberOfGroups)) {

  ret <- lapply(seq(NumberOfOffsetsBetweenSaves), function(i) {
    ## function(i) is the same code you have
    ## inside your for loop, but cleaned up.

    # setup SQL pagination; build the clause locally so that rq$where
    # is not modified from one iteration to the next
    base_where <- if (rq$where == '') '1=1' else rq$where
    where <- paste(base_where, 'LIMIT', RQ_SIZE,
                   'OFFSET', RQ_SIZE * ((g - 1) * NumberOfOffsetsBetweenSaves + (i - 1)))

    # Submit data request
    srdaRequestData(queryURL, rq$select, rq$from, where, DATA_SEP, ADD_SQL)

    # retrieve result
    data <- srdaGetData()

    # some code

    if (DEBUG) message("*", appendLF = FALSE)

    ### DON'T ASSIGN TO dfList, i.e. don't do: dfList <- c(dfList, data)
    ### INSTEAD, just return `data`
    data
  })

  ## save each iteration
  file.out <- sprintf("%s/data_scrape_%04i.RDS", folder.out, g)
  saveRDS(do.call(rbind, ret), file=file.out)

  ## OPTIONAL (this will be slower, but will keep your rams and goats in line)
  # rm(ret)
  # gc()
}

Then, once you are done scraping:

library(data.table)

folder.out <- "~/mydissertation/sql_data_scrape/"
files <- dir(folder.out, full=TRUE, pattern="\\.RDS$")

## Create an empty list
myData <- vector("list", length=length(files))

## Option 1, using data.frame
for (i in seq(myData))
  myData[[i]] <- readRDS(files[[i]])
DT <- do.call(rbind, myData)

## Option 2, using data.table
for (i in seq(myData))
  myData[[i]] <- as.data.table(readRDS(files[[i]]))
DT <- rbindlist(myData)
I'm answering my own question, as I have finally figured out what the real source of the problem was. My investigation revealed that the unexpected waiting state of the program was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.

The reason for that is pretty simple: I used rq$where both outside and inside the for loop, which made paste() concatenate the previous iteration's WHERE clause with the current one. I have fixed the code by processing the contents of the WHERE clause and saving it before the loop, and then using the saved value safely in each iteration of the loop, as it became independent from the value of the original WHERE clause.

This investigation also helped me to fix some other deficiencies in my code and make improvements (such as using sub-selects to properly handle SQL queries returning the number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. A big thank you to those nice people who helped with this question.
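The underlying pattern is language-agnostic; as a minimal illustration (written in Python purely for brevity, with hypothetical names such as base_where), keep the base WHERE clause immutable and derive each page's clause from it, instead of appending LIMIT/OFFSET to a variable that is mutated inside the loop.

# The bug pattern: appending to the same clause each iteration stacks up
# multiple LIMIT/OFFSET pairs. The fix: derive each page from a fixed base.
RQ_SIZE = 1000
num_requests = 5                   # however many pages are needed
base_where = "1=1"                 # saved once, before the loop, never modified

for page in range(num_requests):
    where = f"{base_where} LIMIT {RQ_SIZE} OFFSET {RQ_SIZE * page}"
    print(where)                   # in the real code, submit the request with this clause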
Circular Definitions in yaml
I'm trying to use yaml to represent a train network with stations and lines; a minimum working example might be 3 stations, connected linearly, so A<->B<->C. I represent the three stations as follows:

---
stations:
  - A
  - B
  - C

Now I want to store the different lines on the network, and where they start/end. To do this, I add a lines array and some anchors, as follows:

---
stations:
  - &S-A A
  - &S-B B
  - &S-C C
lines:
  - &L-A2C A to C:
      from: *S-A
      to: *S-C
  - &L-C2A C to A:
      from: *S-C
      to: *S-A

And here's the part I'm having trouble with: I want to store the next stop on each line at each station. Ideally something like this:

---
stations:
  - &S-A A:
      next:
        - *L-A2C: *S-B
  - &S-B B:
      next:
        - *L-A2C: *S-C
        - *L-C2A: *S-A
  - &S-C C:
      next:
        - *L-C2A: *S-B

(the lines array remains the same.) But this fails, at least in the Python yaml library, saying:

yaml.composer.ComposerError: found undefined alias 'L-A2C'

I know why this is: it's because I haven't defined the line yet. But I can't define the lines first, because they depend on the stations, and now the stations depend on the lines. Is there a better way to implement this?
Congratulations! You have found an issue in most (if not all) YAML implementations. I recently discovered this limitation too, and I am investigating how to work around it (in the Ruby world), but that's not going to help you.

What you are going to have to do is store the "next stops" as a separate set of data points:

next-stops:
  *S-A:
    - *L-A2C: *S-B
  *S-B:
    - *L-A2C: *S-C
    - *L-C2A: *S-A
  *S-C:
    - *L-C2A: *S-B

Does that help?