I am working on a Python script that uses PRAW to fetch posts from a subreddit (my goal is to build a dataset about a specific sub), and I am running into an issue with pagination. The script fetches new posts and saves them to a CSV file. However, once it reaches Reddit's 1000-post listing limit, it simply stops and never gets to the older posts.
The problem is that I can't get the script to resume fetching from where it left off on the next run. I want a solution that does not rely on the Pushshift API, since I am not a moderator.
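My current idea is to persist the ID of the last post the script processed and pass it back as the after parameter on the next run. The two helpers below show roughly what load_last_fetched_post_id and save_last_fetched_post_id do (LAST_ID_FILE is just the file name I happen to use locally; this is a simplified sketch):

import os

LAST_ID_FILE = 'last_fetched_post_id.txt'

def load_last_fetched_post_id():
    # Return the saved ID, or None on the first run so the listing starts from the newest post
    if not os.path.exists(LAST_ID_FILE):
        return None
    with open(LAST_ID_FILE) as f:
        return f.read().strip() or None

def save_last_fetched_post_id(post_id):
    # Overwrite the file with the ID of the most recently processed post
    with open(LAST_ID_FILE, 'w') as f:
        f.write(post_id)

The main function then looks like this: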
def main():
    reddit = create_reddit_instance()
    subreddit = get_subreddit(reddit, 'sub_name')
    df, existing_ids = load_existing_data(FILENAME)

    logging.info('Starting to scrape posts')
    skipped_posts = 0
    loaded_posts = 0

    # Load the ID of the last post fetched on a previous run
    last_fetched_post_id = load_last_fetched_post_id()

    # Fetch posts after the last fetched post
    top_posts = list(subreddit.new(limit=None,
                                   params={'after': last_fetched_post_id}))

    for submission in top_posts:
        # Skip posts already saved, image posts, and low-scoring posts
        if (submission.id in existing_ids
                or submission.url.endswith(('.jpg', '.png', '.gif', '.jpeg'))
                or submission.score < 10):
            logging.info(f'Skipped post {submission.id}')
            skipped_posts += 1
            continue

        top_comments = get_top_comments(submission)
        new_row = get_new_row(submission, top_comments)
        # DataFrame.append() was removed in pandas 2.0, hence the private _append
        df = df._append(new_row, ignore_index=True)
        save_data(df, FILENAME)
        logging.info(f'Loaded post {submission.id}')
        loaded_posts += 1

        # Save the ID of the last fetched post
        save_last_fetched_post_id(submission.id)

        # Sleep for a while to avoid hitting the rate limit
        time.sleep(0.3)

    logging.info(f'Finished scraping posts. Total posts loaded: {loaded_posts}. '
                 f'Total posts skipped: {skipped_posts}.')
    print("Data saved to", FILENAME)
My code currently works only for the first 1000 posts in the listing. I've considered a few workarounds but ruled them out:

- Using the Pushshift API, which I'd like to avoid since I am not a moderator.
- Skipping the older posts and simply running the script daily for new ones, which isn't what I want.

I am open to alternative tools or methods to overcome this limitation. Your suggestions and insights are greatly appreciated. Thank you!