I want to delete 9 rows with duplicated IDs. I got their index labels, then tried the delete, but the length after deletion shows that 714 rows were deleted. Is the pandas index not unique? How can I do this correctly?
I searched the pandas DataFrame and identified the rows I wanted to delete, wrote down their index labels, and attempted the deletion, but something apparently went wrong.
I suspect the pandas index is not unique, so more rows than intended were deleted.
How can I create a unique index, or how do I use the index correctly?
len(CrimeClean)  # result: 690130
CrimeCleanV1 = CrimeClean.drop([5650, 3725, 6373, 2469, 7751, 7058, 3859, 3640, 3141])
# Validation: the duplicate at row 7751 should no longer appear
CrimeCleanV1[CrimeCleanV1.Crime_ID == "56882eb6d444d5677ac90c06a0582fe70fe1fd932fd5bd902a5aa4a2aa363bf3"]
# Only one row instead of two appeared, as intended
len(CrimeCleanV1)  # result: 689416
You can reset the index so that every row gets a unique label, then filter the unwanted rows out by position:
CrimeCleanV1 = CrimeClean.reset_index(drop=True)
rows_to_drop = [5650, 3725, 6373, 2469, 7751, 7058, 3859, 3640, 3141]  # your row numbers, now valid unique index labels
CrimeCleanV1 = CrimeCleanV1[~CrimeCleanV1.index.isin(rows_to_drop)]
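A quick sketch of why this matters - on toy data, not your real frame - showing that drop removes every row carrying a repeated label, and that reset_index fixes it:

import pandas as pd

# Toy frame whose index labels repeat, as can happen after concatenating frames.
df = pd.DataFrame({"Crime_ID": ["a", "b", "c", "d"]}, index=[0, 1, 0, 1])
# drop() removes EVERY row carrying a matching label, so this deletes two rows:
len(df.drop([0]))  # 2, not 3
# After reset_index, each row has a unique label and drop behaves as expected:
len(df.reset_index(drop=True).drop([0]))  # 3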
You wrote in your comment that the rows are not completely the same. So I assume the criterion for marking a row as a duplicate is a list of columns that should be unique.
If so, you can drop duplicates with drop_duplicates, passing just this list of columns as the subset parameter.
Another point to decide is which of the duplicated rows should actually be deleted (the keep parameter): keep the first occurrence, keep the last occurrence, or drop them all.
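A minimal sketch of those two parameters, assuming Crime_ID is one of the columns that should be unique (toy data, not your real frame):

import pandas as pd

df = pd.DataFrame({
    "Crime_ID": ["x", "x", "y"],         # the column that should be unique
    "Detail":   ["old", "new", "only"],  # rows differ here, so they are "not completely the same"
})
# keep="first" leaves the first occurrence, keep="last" the last,
# keep=False drops every row that has a duplicate:
df.drop_duplicates(subset=["Crime_ID"], keep="first")  # keeps rows x/old and y/only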
Another important test on your data is to run:
CrimeClean[CrimeClean.index.isin([...])]
substituting for [...] the same list you passed to CrimeClean.drop.
Then you will see how many rows exist with the passed index values.
Hey guys, maybe this is a basic SQL question. Say I have this very simple table; I need to run a simple SQL statement to return a result like this:
Basically, the idea is to dedupe Name based on each row's Value column; whichever value is larger should stay.
Thanks!
Framing the problem correctly will help you figure it out.
"Deduplication" suggests altering the table - starting in a state with duplicates and ending in a state without them. It is usually done in three steps (copy the rows without duplicates into a temp table, drop the original table, rename the temp table).
"Removing rows with duplicated column values" also suggests altering the data, and it derails the train of thought.
What you actually want is to read the entire table and, in cases where the columns you care about have multiple values attached, take the highest one. One could say... group by the columns you care about? And attach the highest value to them - a maximum value?
SELECT id, name, MAX(value) FROM table GROUP BY id, name;
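In case a runnable illustration helps, here is the same idea against an in-memory SQLite table with assumed column names (your actual table and types may differ); in this toy data the ids of the duplicated names differ, so the grouping is by name alone:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, value INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, "apple", 10), (2, "apple", 30), (3, "pear", 5)])
# Group by the column you care about and keep the highest value per group:
for row in conn.execute("SELECT name, MAX(value) FROM t GROUP BY name"):
    print(row)  # e.g. ('apple', 30) and ('pear', 5)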
I have two columns: ID and Transcript. The numbers in the ID column are randomly generated. The transcripts in the Transcript column come from an array that I cross-joined and unnested, which generated several repeats of the ID in column 1 wherever there are multiple transcripts per ID.
I can't use the usual Stack Overflow "if" statements. The numbers in the ID column are randomly generated (and there are thousands of them), so I can't just write "if column ID is 'apple', then ...".
I have to automatically identify when and where the column has a repeat, then take the second (and third, if there is one) row, isolate the unique data in those rows, and append it onto the first row in a new column. (I'm aware this would generate "null" data in the columns of rows that don't have the new data. I presume I'd have to create new columns pre-filled with '-'?)
I'm new to SQL, but does COUNT just tell the user the number of repeats for a particular input that the user specified?
Is there any way to write that number into a new column on a specific row (e.g., the original instance of the ID) so that I can then issue a command based on that number? Or does it only tell the user how often something has repeated, without being machine-readable for use in additional steps?
Otherwise, can I number the duplicated IDs, so I can then issue an order like: "if 1, insert into new column 1 on the row above if it is filled with '-'; else, insert the transcript into new column 1 on the row above if column 1 has data; else put it in column 2; and so on"?
I'm entirely lost as to how to even get it to recognise, in a meaningful way, that duplication is occurring, given the random nature of the ID system.
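To make what I'm after concrete, here is a rough sketch of the "number the duplicated IDs" idea using window functions via Python's sqlite3 (made-up table and data, not my actual schema; SQLite 3.25+ is needed for window functions):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, transcript TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(42, "first"), (42, "second"), (7, "only")])
# Number each repeat of an ID (1, 2, ...) and count how many rows share it:
rows = conn.execute("""
    SELECT id, transcript,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY transcript) AS repeat_no,
           COUNT(*)     OVER (PARTITION BY id)                     AS repeats
    FROM t
""").fetchall()
print(rows)  # e.g. (42, 'first', 1, 2), (42, 'second', 2, 2), (7, 'only', 1, 1)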
I have a data table with 3 columns. I am populating the values of each column separately in a loop, which causes the table structure to end up in the format shown in the image. How do I clear every column at the beginning of the loop so that all the data comes out in a proper tabular format? In short, I want a data table that allows a variable number of rows in each column.
To clear every column at the beginning of the loop you have two options:
loop over all rows and clear the values
remove the column, then add it back
Depending on the number of rows, the two options perform differently. For a large number of rows I prefer the second option.
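The question doesn't say which data-table implementation this is, so purely as an illustration, here are the two options sketched against a pandas DataFrame (an assumption, not necessarily the asker's actual library):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7.0, 8.0, 9.0]})

# Option 1: loop over all rows and blank the column out.
for i in df.index:
    df.at[i, "c"] = None

# Option 2: remove the column, then add it back with the new values.
df = df.drop(columns=["c"])
df["c"] = pd.Series([10.0, 20.0])  # shorter than the frame; the missing row becomes NaN
print(df)  # column c now effectively holds a variable number of rows, padded with NaN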
Thanks for the contributions so far. After more digging I am re-stating the question (and indeed the title of the question) as follows:
I am selecting just 2 columns from a view that contains several columns. The view returns 50,497 rows if I select all columns, but only 50,496 (i.e. 1 fewer) when I select just 2 columns, these being [Patient_ID] (which is a bigint column) and [Condition_Code] (a varchar(6) column).
Version 1:
SELECT * FROM [vw_Query1]
returns 50,497 rows.
But:
SELECT [Patient_ID], [Condition_Code] FROM [vw_Query1]
returns 50,496 rows.
I can post the code for [vw_Query1] if required, but an understanding at a fundamental level how this can happen when no GROUP BY clause has been used is the key question for me.
UPDATE:
It turns out that if I exclude one particular column, I get the lower row count of 50,496. This column is unique in having a case-sensitive collation. I still don't understand why it is dropping one particular row, but at least I am getting closer to an understanding.
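I can't see the definition of [vw_Query1], but one mechanism that produces exactly this symptom is a view built with UNION, which de-duplicates its result: two source rows that differ only in the case-sensitive column stay distinct while that column is selected, and collapse into one row once it is excluded. A toy sketch via Python's sqlite3 (where text comparison is case-sensitive by default):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (Patient_ID INTEGER, Condition_Code TEXT, Note TEXT);
    CREATE TABLE b (Patient_ID INTEGER, Condition_Code TEXT, Note TEXT);
    INSERT INTO a VALUES (1, 'X10', 'abc');
    INSERT INTO b VALUES (1, 'X10', 'ABC');  -- differs only in letter case
""")
# With the case-sensitive column included, the rows differ, so UNION keeps both:
print(conn.execute("SELECT Patient_ID, Condition_Code, Note FROM a "
                   "UNION SELECT Patient_ID, Condition_Code, Note FROM b").fetchall())  # 2 rows
# Without it, the rows become identical and UNION collapses them into one:
print(conn.execute("SELECT Patient_ID, Condition_Code FROM a "
                   "UNION SELECT Patient_ID, Condition_Code FROM b").fetchall())  # 1 row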
I have an array of 50+ elements that dictates how many hours were worked for a given week.
What is the proper way to store this information into a database table?
My initial idea was to use a delimiter, but the text is too large (280 characters) to fit.
Additionally, there seems to be something "wrong" with creating a table column for each element.
Ideas?
Array using delimiter (comma):
37.5,37.5,37.5,37.5,37.5,37.5,37.5,37.5,37.5,37.5, ...
The "proper" way is to store the array's contents as multiple rows in a whole other table, each with a foreign key referencing the record they belong to back in the first table. There may be other things that work for you, though.
[EDIT]: From the details you added, I'm guessing your array elements are the number of hours worked each week, and you have 50+ of them because a year has 52-ish weeks. If your current (main) table is called something like "employees", then each row there should have a unique identifier for the employee record. Your new table might be called "work_weeks" and consist of columns like employee_id (matching the employee's id in the current table), week_number, and hours_worked.
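A minimal sketch of that layout via Python's sqlite3, using the table and column names guessed above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE work_weeks (
        employee_id  INTEGER REFERENCES employees(employee_id),
        week_number  INTEGER,
        hours_worked REAL
    );
""")
conn.execute("INSERT INTO employees VALUES (1, 'Alice')")
# One row per week instead of one long delimited string:
conn.executemany("INSERT INTO work_weeks VALUES (?, ?, ?)",
                 [(1, week, 37.5) for week in range(1, 53)])
# Reassemble the 52 values for one employee when needed:
hours = conn.execute("SELECT hours_worked FROM work_weeks "
                     "WHERE employee_id = 1 ORDER BY week_number").fetchall()
print(len(hours))  # 52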
Seems like a one-to-many relationship. For this example, tableA is the "one" and tableBlammo is the "many".
tableA => column blammoId
tableBlammo => column blammoId, column data
One row in tableA joins to multiple rows in tableBlammo via the blammoId column.
Each row in tableBlammo has one element of the array in the data column.
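A runnable sketch of that join via Python's sqlite3, keeping the made-up tableA/tableBlammo names from above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA      (blammoId INTEGER PRIMARY KEY);
    CREATE TABLE tableBlammo (blammoId INTEGER, data REAL);
""")
conn.execute("INSERT INTO tableA VALUES (1)")
conn.executemany("INSERT INTO tableBlammo VALUES (1, ?)",
                 [(37.5,), (40.0,), (32.5,)])
# The single tableA row joins to all three tableBlammo rows:
print(conn.execute("SELECT a.blammoId, b.data FROM tableA a "
                   "JOIN tableBlammo b ON b.blammoId = a.blammoId").fetchall())
# [(1, 37.5), (1, 40.0), (1, 32.5)]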