I recently wanted to start scraping data from Twitter and found that my task was made incredibly easy by Ahmet Taspinar’s Twitterscraper module. It is free of the restrictions of Twitter’s public APIs and, while not perfect, is super quick to get started with. Start a Python project, fire off a pip install twitterscraper, and you’re ready to go.
The simplest use of Twitterscraper is to harvest data from Twitter by hashtag, like so:
This returns a JSON file like the one below, which I neatly formatted using a Python command (see this gist):
The fields you see above are included in the tweets pulled down - pretty self-explanatory.
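For the formatting step, here is a minimal sketch using only the standard-library json module. It is not necessarily what the gist does. The function name pretty_format and the sample record are made up for illustration, and the two fields in the sample are placeholders rather than Twitterscraper's actual output fields.

```python
import json


def pretty_format(raw_json: str) -> str:
    """Re-serialize a JSON string with indentation and sorted keys."""
    parsed = json.loads(raw_json)
    # ensure_ascii=False keeps emoji and accented characters readable.
    return json.dumps(parsed, indent=4, sort_keys=True, ensure_ascii=False)


# Hypothetical sample data; real scraped tweets carry more fields.
sample = '[{"user": "example_user", "text": "hello world"}]'
print(pretty_format(sample))
```

To format a real output file, read its contents into a string first and pass that string to the function.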
If you want to query a person or a time range rather than a simple hashtag, this is also doable - it just involves more cutting and pasting.
You put together any query using Twitter Advanced Search and then copy-paste everything between the &q= and the next & sign.
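If you would rather not hunt for the &q= by eye, the extraction can be sketched with the standard library. The URL below is a hypothetical example in the shape of an advanced search link (the handle example_user and the dates are placeholders), and extract_query is an illustrative helper, not part of Twitterscraper. Note that it returns the decoded query, whereas copy-pasting from the address bar gives you the percent-encoded text.

```python
from urllib.parse import urlsplit, parse_qs


def extract_query(search_url: str) -> str:
    """Return the decoded value of the q parameter from a search URL."""
    params = parse_qs(urlsplit(search_url).query)
    if "q" not in params:
        raise ValueError("no q parameter found in URL")
    return params["q"][0]


# Hypothetical advanced search link; substitute your own.
example_url = (
    "https://twitter.com/search?l=&q=from%3Aexample_user"
    "%20since%3A2016-01-01%20until%3A2016-12-31&src=typd"
)
print(extract_query(example_url))
# prints: from:example_user since:2016-01-01 until:2016-12-31
```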
Here’s how I scraped all my own tweets to make an archive.
I got the following link from advanced search:
Then ran this command:
And, simple as that, I had all my tweets ever. Note that this doesn’t include retweets - only tweets that originated from that handle.
Don’t expect scraping to go completely seamlessly. While I was testing, I tried to scrape all of Donald Trump’s tweets for 2016 and hit some parsing errors. But given how much easier Twitterscraper makes things, it’s worth a first try at the very least.