I have a dataframe with details about employees, the site they are at, and the positions they have held. It has columns for Site, EmployeeID, and StartDate (plus a lot more fields). It is sorted by Site and EmployeeID ASC and then StartDate DESC (latest record first):
Site  EmployeeID  StartDate
1     123         2024-09-01
1     123         2024-08-01
1     123         2024-06-01
1     123         2024-05-01
2     100         2024-06-01
2     100         2024-03-01
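For reproducibility, the sample above could be built roughly like this. It is a minimal sketch that assumes StartDate is stored as a datetime column and that the real frame's extra fields can be ignored here:

```python
import pandas as pd

# Sample data only; the real frame has many more columns and ~300K rows.
df = pd.DataFrame({
    "Site": [1, 1, 1, 1, 2, 2],
    "EmployeeID": [123, 123, 123, 123, 100, 100],
    "StartDate": pd.to_datetime([
        "2024-09-01", "2024-08-01", "2024-06-01", "2024-05-01",
        "2024-06-01", "2024-03-01",
    ]),
})

# Site and EmployeeID ascending, latest StartDate first within each employee.
df = df.sort_values(
    ["Site", "EmployeeID", "StartDate"],
    ascending=[True, True, False],
).reset_index(drop=True)

print(df)
```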
I need to create a new column called EndDate, which is the StartDate of the previous row (the next newer record for the same Site and EmployeeID) minus 1 day. The latest record for each employee has no EndDate. We are moving to a new system, so we only care about records whose date range includes 7/1/24 or later. For my example df, the result would look like:
Site  EmployeeID  StartDate   EndDate     Import
1     123         2024-09-01              Y
1     123         2024-08-01  2024-08-31  Y
1     123         2024-06-01  2024-07-31  Y
1     123         2024-05-01  2024-05-31  N
2     100         2024-06-01              Y
2     100         2024-03-01  2024-05-31  N
And then filtering for df['Import'] == 'Y'.
My initial idea was to iterate over df.groupby(by=['Site','EmployeeID']) and use .iloc[] to get the next (newer) record's date, subtract 1 day, check whether the EndDate is on or after 7/1/24, and then set Import to Y or N accordingly. The problem is that this is a very large dataset (~300K rows), and this operation would take a very long time.
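One loop-free way to express the same rule is a grouped shift, which avoids iterating over the groups in Python. The following is only a sketch on the sample data, not a tested solution for the full dataset. It assumes StartDate is a datetime column, that rows with no EndDate (the latest record per employee) should be imported, and that the cutoff is 2024-07-01. The names cutoff, newer_start, keep, and to_import are illustrative.

```python
import pandas as pd

# Sample data only; the real frame has many more columns.
df = pd.DataFrame({
    "Site": [1, 1, 1, 1, 2, 2],
    "EmployeeID": [123, 123, 123, 123, 100, 100],
    "StartDate": pd.to_datetime([
        "2024-09-01", "2024-08-01", "2024-06-01", "2024-05-01",
        "2024-06-01", "2024-03-01",
    ]),
})

cutoff = pd.Timestamp("2024-07-01")  # assumed meaning of "7/1/24"

# The shift below relies on this ordering: latest record first per employee.
df = df.sort_values(
    ["Site", "EmployeeID", "StartDate"],
    ascending=[True, True, False],
).reset_index(drop=True)

# Within each Site/EmployeeID group, shift(1) pulls the StartDate from the
# row above (the next newer record); the first row of each group gets NaT.
newer_start = df.groupby(["Site", "EmployeeID"])["StartDate"].shift(1)
df["EndDate"] = newer_start - pd.Timedelta(days=1)

# Import open-ended rows (EndDate is NaT) and rows whose range reaches the cutoff.
keep = df["EndDate"].isna() | (df["EndDate"] >= cutoff)
df["Import"] = keep.map({True: "Y", False: "N"})

to_import = df[df["Import"] == "Y"]
print(to_import)
```

Because the shift and the comparison each run once over whole columns, the work is done by pandas instead of by a per-group Python loop, which is usually where the time goes at this row count.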