Scraping hourly tweets for a given time period with snscrape

1.1k views Asked by Begre At 09 September 2022 at 12:35

I'm trying to work around the fact that snscrape does not support gathering tweets evenly throughout the day, but I'm running into issues with the output I get. I want to collect tweets mentioning stock tickers from the S&P 500; for testing I'm currently using AAPL and MSFT.

This is my code:

from datetime import datetime, timedelta
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
Tweets = []
tickers = ['AAPL','MSFT']
timeperiod = (datetime.strptime('2022-09-01', '%Y-%m-%d') -
              datetime.strptime('2021-06-01', '%Y-%m-%d')).days * 24

startime = datetime.now()

start_time = 1622498400
end_time = 1622502000

for s in tickers:

    for t in range(240):
        try:
            for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
                    f'{s} since_time:{start_time} until_time:{end_time}').get_items()):

                if i > 60:
                    break
                Tweets.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': s})

        except RuntimeError:
            print('Error occurred')

        end_time = datetime.strptime(f'{datetime.fromtimestamp(end_time)}', '%Y-%m-%d %H:%M:%S') \
                   + timedelta(hours=t)
        start_time = end_time - timedelta(hours=1)
        start_time = start_time.timestamp()
        end_time = end_time.timestamp()

# Creating a dataframe to load the list
tweets_df = pd.DataFrame(Tweets, columns=['Date', 'Text', 'Ticker'])

tweets_df.to_csv('sampleTwitter.csv', encoding='UTF-8')
runtime = datetime.now() - startime
print(runtime)

The problem shows up once the code has finished and I look at the CSV: I only get tweets from the first hours of the starting day. The inner loop should break after collecting 60 tweets within the specified hour, then move on to the next hour, and so on. I want to run this over a longer time period, so for testing I currently use 10 days, which equals 240 hours to loop through.

Since since_time and until_time accept epoch time, my idea is to update the epoch values with the hours I want to scrape from. My logic is that since_time always equals until_time minus 1 hour, and until_time equals the initial end_time plus t, the number of hours elapsed since the initial end_time. As far as I can tell, with my limited Python knowledge, the code does not properly collect and store the tweets: I get roughly 30 tweets when I should be getting 60 x 24 x 10 = 14,400 tweets per ticker (given that there are enough tweets about the query within each hour).
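The window arithmetic described above can be sketched on its own, without touching the network. In the posted loop, `timedelta(hours=t)` is added to an `end_time` that has already been moved forward, so the offsets accumulate (0, 1, 3, 6, ... hours), and `.timestamp()` returns a float, so every query after the first contains values such as `1622502000.0`. Whether Twitter's search accepts a fractional epoch there is an assumption, not something confirmed in this post. The helper name `hourly_windows` is hypothetical; it simply keeps everything in integer epoch seconds.

```python
def hourly_windows(first_start, hours):
    """Yield (since_time, until_time) pairs, one per hour, as integer epoch seconds."""
    for t in range(hours):
        since = first_start + 3600 * t   # window t starts t hours after the first window
        yield since, since + 3600        # each window is exactly one hour long


# Build the first three queries for the start time used in the question.
for since, until in hourly_windows(1622498400, 3):
    print(f'AAPL since_time:{since} until_time:{until}')
```

Computing each window from the fixed starting value, rather than from the previous window, avoids both the accumulating offset and the float conversion.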

The expected output is something like this:

Date	Text	Ticker
2021-05-31 22:57:17+00:00	sample	AAPL
2021-05-31 22:47:27+00:00	sample	AAPL
2021-05-31 21:47:27+00:00	sample	AAPL
2021-05-31 20:47:27+00:00	sample	AAPL

continuing for 10 days.

But the current output is this:

Date	Text	Ticker
2021-05-31 22:57:17+00:00	sample	AAPL
2021-05-31 22:47:27+00:00	sample	AAPL
2021-05-31 22:45:27+00:00	sample	AAPL
2021-05-31 22:44:27+00:00	sample	AAPL

only the last hours of the first day.

EDIT: Fixed the fault by creating a function and removing parts of the code.

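def convertTime(var_time, var):
    # Shift an epoch timestamp (in seconds) forward by `var` hours and return it
    # as an int, so the query never contains a fractional value like 1622498400.0.
    time = datetime.fromtimestamp(var_time) + timedelta(hours=var)
    return int(time.timestamp())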

If anyone is interested, this code does what I wanted to achieve:

Tweets = []

start_time = 1622498400
end_time = 1622502000

for t in range(72):
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
            f'your-query since_time:{convertTime(start_time, t)} until_time:{convertTime(end_time, t)}').get_items()):

        if i >= 60:  # i is zero-based, so this stops after 60 tweets
            break
        Tweets.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': 'your-iterator'})

tweets_df = pd.DataFrame(Tweets, columns=['Date', 'Text', 'Ticker'])
tweets_df.to_csv('sampleTwitter.csv', encoding='UTF-8')
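For anyone who wants to combine this fix with the ticker loop from the original code, here is a rough sketch. It is not tested against Twitter: the function name `collect_hourly` and its `search` parameter are hypothetical, and `search` is assumed to be any callable that takes a query string and returns an iterable of objects with `.date` and `.content`, like the tweet objects used above. Passing the scraper in this way lets the loop be tried with a stand-in before running a long scrape.

```python
from itertools import islice


def collect_hourly(tickers, first_start, hours, search, per_hour=60):
    """Collect up to per_hour items per ticker for each hourly window."""
    rows = []
    for ticker in tickers:
        # Windows are computed from first_start for every ticker, so each
        # ticker covers the same period instead of continuing where the
        # previous ticker stopped.
        for t in range(hours):
            since = first_start + 3600 * t
            until = since + 3600
            query = f'{ticker} since_time:{since} until_time:{until}'
            # islice stops after per_hour items without a manual counter.
            for tweet in islice(search(query), per_hour):
                rows.append({'Date': tweet.date, 'Text': tweet.content, 'Ticker': ticker})
    return rows
```

With snscrape this would presumably be called as `collect_hourly(['AAPL', 'MSFT'], 1622498400, 240, lambda q: sntwitter.TwitterSearchScraper(q).get_items())`, and the returned list can be passed to `pd.DataFrame` exactly as above.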


There are 0 answers
