Channel: User Bijan - Stack Overflow

Pandas Groupby and Filter based on first record having date greater than specific date


I have a dataframe with details about employees, the site they are at, and the positions they have held. The dataframe has columns for Site, EmployeeID, and StartDate (plus a lot more fields). It is sorted by Site and EmployeeID ascending, then by StartDate descending (latest record first).

Site    EmployeeID    StartDate
1       123           2024-09-01
1       123           2024-08-01
1       123           2024-06-01
1       123           2024-05-01
2       100           2024-06-01
2       100           2024-03-01

I need to create a new column called EndDate, which is the StartDate of the previous record for the same Site and EmployeeID, minus 1 day. We are moving to a new system, so we only care about records whose date range includes 7/1/24 or falls after it. For my example df, the result would look like this:

Site    EmployeeID    StartDate     EndDate       Import
1       123           2024-09-01                  Y
1       123           2024-08-01    2024-08-31    Y
1       123           2024-06-01    2024-07-31    Y
1       123           2024-05-01    2024-05-31    N
2       100           2024-06-01                  Y
2       100           2024-03-01    2024-05-31    N

I would then filter for df['Import'] == 'Y'.

My initial idea was to iterate over df.groupby(by=['Site','EmployeeID']) and use .iloc[] to get the previous record's date, subtract 1 day, check whether the EndDate is on or after 7/1/24, and then set Import to Y or N accordingly. The problem is that this is a very large dataset (~300K rows), and this operation would take a very long time.
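
One possible way to avoid the Python-level loop is a grouped shift, sketched here on the example data only. It assumes that StartDate is already a datetime column, that the rows are sorted as described (StartDate descending within each Site/EmployeeID group), and that a blank EndDate means the record is still open and should be imported. The names `cutoff`, `prev_start`, `keep`, and `to_import` are made up for illustration.

```python
import pandas as pd

# Example data from the question, already sorted by Site, EmployeeID ASC
# and StartDate DESC (latest record first).
df = pd.DataFrame({
    "Site": [1, 1, 1, 1, 2, 2],
    "EmployeeID": [123, 123, 123, 123, 100, 100],
    "StartDate": pd.to_datetime([
        "2024-09-01", "2024-08-01", "2024-06-01", "2024-05-01",
        "2024-06-01", "2024-03-01",
    ]),
})

cutoff = pd.Timestamp("2024-07-01")  # assumed cutoff date (7/1/24)

# Within each group, shift(1) pulls the StartDate of the previous row,
# which is the next-later record because of the descending sort.
# The first row of each group has no previous row and gets NaT.
prev_start = df.groupby(["Site", "EmployeeID"])["StartDate"].shift(1)
df["EndDate"] = prev_start - pd.Timedelta(days=1)

# Assumption: an open-ended record (EndDate is NaT) is imported,
# as is any record whose EndDate is on or after the cutoff.
keep = df["EndDate"].isna() | (df["EndDate"] >= cutoff)
df["Import"] = keep.map({True: "Y", False: "N"})

to_import = df[df["Import"] == "Y"]
print(to_import)
```

Because the shift and the comparison run on whole columns rather than one group at a time, this should scale to a few hundred thousand rows far better than iterating over the groups, though that has not been measured on the real dataset.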

