Scraping hourly tweets for a given time period with SNSscrape


I'm trying to work around snscrape's lack of support for gathering tweets evenly throughout the day, but I'm running into issues with the output I get. I want to collect tweets mentioning stock tickers from the S&P 500; for testing I'm currently using AAPL and MSFT.

This is my code:

from datetime import datetime, timedelta
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
Tweets = []
tickers = ['AAPL','MSFT']
timeperiod = (datetime.strptime('2022-09-01', '%Y-%m-%d') -
              datetime.strptime('2021-06-01', '%Y-%m-%d')).days * 24

startime = datetime.now()

start_time = 1622498400
end_time = 1622502000

for s in tickers:

    for t in range(240):
        try:
            for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
                    f'{s} since_time:{start_time} until_time:{end_time}').get_items()):

                if i > 60:
                    break
                Tweets.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': s})

        except RuntimeError:
            print('Error occurred')

        end_time = datetime.strptime(f'{datetime.fromtimestamp(end_time)}', '%Y-%m-%d %H:%M:%S') \
                   + timedelta(hours=t)
        start_time = end_time - timedelta(hours=1)
        start_time = start_time.timestamp()
        end_time = end_time.timestamp()

# Creating a dataframe to load the list
tweets_df = pd.DataFrame(Tweets, columns=['Date', 'Text', 'Ticker'])

tweets_df.to_csv('sampleTwitter.csv', encoding='UTF-8')
runtime = datetime.now() - startime
print(runtime)

The problem shows up when the code has finished and I look at the CSV: I only get tweets from the first hours of the starting day. The inner loop should break after collecting 60 tweets within the specified hour, then move on to the next hour, and so on. I want to run this over a longer time period, so for testing I currently use 10 days, which equals 240 hours to loop through.

Since since_time and until_time accept epoch time, my idea is to update the epoch values for each hour I want to scrape. My logic is that since_time always equals until_time - 1 hour, and until_time equals the initial end_time + t, where t is the number of hours from the initial end_time. As far as I can tell (my Python knowledge is limited), the code does not properly collect and store the tweets: I get roughly 30 tweets when I should be getting 60 x 24 x 10 = 14,400 tweets (given that there are enough tweets matching the query within each hour).

The expected output is something like this:

Date Text Ticker
2021-05-31 22:57:17+00:00 sample AAPL
2021-05-31 22:47:27+00:00 sample AAPL
2021-05-31 21:47:27+00:00 sample AAPL
2021-05-31 20:47:27+00:00 sample AAPL

continuing for 10 days.

But the current output is this:

Date Text Ticker
2021-05-31 22:57:17+00:00 sample AAPL
2021-05-31 22:47:27+00:00 sample AAPL
2021-05-31 22:45:27+00:00 sample AAPL
2021-05-31 22:44:27+00:00 sample AAPL

only the last hours of the first day.
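
One way to see what the posted loop does to the time window is to replay its update rule without calling the scraper at all. The sketch below is only an illustration: the function name replay_original_updates is hypothetical, and it uses UTC explicitly (the posted code uses local time) so the result does not depend on the machine's time zone. It suggests two things: the offset t is added to an end_time that has already been shifted, so the shifts accumulate instead of advancing one hour per iteration, and after the first iteration the values are floats such as 1622498400.0 rather than whole numbers. Whether Twitter search rejects or ignores a non-integer since_time is an assumption here, not something confirmed in this post.

```python
from datetime import datetime, timedelta, timezone


def replay_original_updates(start_time, end_time, iterations):
    """Return the (start, end) pair each iteration of the posted loop would query."""
    windows = []
    for t in range(iterations):
        # This is the window the scraper call would see in iteration t.
        windows.append((start_time, end_time))
        # Same update as the posted loop: t hours are added to the already
        # shifted end_time, so the shifts accumulate across iterations.
        end_dt = datetime.fromtimestamp(end_time, tz=timezone.utc) + timedelta(hours=t)
        start_time = (end_dt - timedelta(hours=1)).timestamp()  # float from here on
        end_time = end_dt.timestamp()
    return windows


windows = replay_original_updates(1622498400, 1622502000, 6)
for start, end in windows:
    hours_from_first = (start - 1622498400) / 3600
    print(f'since_time:{start} until_time:{end}  (+{hours_from_first:g} h)')
```

The first window is queried twice, and the start of each later window sits 1, 3, 6 and 10 hours after the first one instead of 1, 2, 3 and 4.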

EDIT: Fixed the fault by creating a function and removing parts of the code.

def convertTime(var_time, var):
    # Shift the epoch timestamp var_time forward by var hours and return it as
    # an int, so the query receives a whole number instead of a float
    shifted = datetime.fromtimestamp(var_time) + timedelta(hours=var)
    return int(shifted.timestamp())

If anyone is interested, this code does what I wanted to achieve.

Tweets = []

start_time = 1622498400
end_time = 1622502000

for t in range(72):
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
            f'your-query since_time:{convertTime(start_time, t)} until_time:{convertTime(end_time, t)}').get_items()):

        if i >= 60:  # stop once 60 tweets have been collected for this hour
            break
        Tweets.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': 'your-iterator'})

tweets_df = pd.DataFrame(Tweets, columns=['Date', 'Text', 'Ticker'])
tweets_df.to_csv('sampleTwitter.csv', encoding='UTF-8')
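
For the original goal of looping over several tickers, the same idea can be sketched with plain integer arithmetic on the epoch values and itertools.islice for the per-hour cap. This is a minimal sketch under stated assumptions, not a tested scraper: collect_hourly and its search parameter are hypothetical names introduced here, and search stands for any callable that takes a query string and returns an iterable of tweets with date and content attributes. Passing the scraper in as a parameter also means the loop can be tried out with a stand-in before running it against Twitter.

```python
from itertools import islice

HOUR = 3600  # seconds in one hour


def collect_hourly(tickers, first_start, hours, per_hour, search):
    """Collect up to per_hour tweets for each ticker and each one-hour window."""
    rows = []
    for ticker in tickers:
        for k in range(hours):
            # Window k is derived from the fixed first_start, so nothing
            # accumulates and the values stay integers.
            since = first_start + k * HOUR
            until = since + HOUR
            query = f'{ticker} since_time:{since} until_time:{until}'
            # islice stops after per_hour items without an index check.
            for tweet in islice(search(query), per_hour):
                rows.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': ticker})
    return rows
```

With snscrape, search could be lambda q: sntwitter.TwitterSearchScraper(q).get_items(), and the returned rows can go into pd.DataFrame exactly as in the code above.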