- Python 3 with the following packages installed:
  - matplotlib==1.4.3
  - nltk==3.0.2
  - numpy==1.9.2
  - pandas==0.16.0
  - pymongo==3.0.1
  - python-dateutil==2.4.2
  - pytz==2015.2
  - six==1.9.0
- MongoDB (or port forwarding set up to a remote MongoDB instance)
- Either:
  - import the tweets into your local MongoDB instance using
    tweets2db.py
  - or open an ssh port-forwarding connection in a separate command-line window:
    ssh -L 27017:localhost:27017 username@server
- Start the IPython notebook by running
  ipython3 notebook --matplotlib=inline
- When done, save the notebook and stop the notebook server.
- Close the ssh connection (if you opened one) by typing
  exit
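
Before starting the notebook, it can help to confirm that something is actually listening on the MongoDB port (27017 in the ssh command above), whether that is a local MongoDB or the forwarded remote one. The following is a minimal sketch using only the standard library; the helper name `port_is_open` and its defaults are assumptions made for illustration and are not part of this project.

```python
import socket


def port_is_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only checks that the port accepts connections,
    not that the service behind it is really MongoDB.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Port 27017 is reachable")
    else:
        print("Nothing is listening on port 27017; "
              "start MongoDB or open the ssh tunnel first")
```

A successful check only means the port is open; the notebook itself still has to connect through pymongo.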
- Word statistics
  - tokenizing
  - word frequencies
  - removing stopwords
  - extracting usernames and hashtags
  - text concordance, common contexts, collocations
  - text dispersion
- Cooccurrences and ngrams
  - most common tweets (that are not technically RTs)
  - ngrams
- User stats
  - user activity (most active, ranking users by activity)
  - creating user-specific corpora
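
To give a rough idea of what several of the steps listed above compute, here is a small sketch that uses only the Python standard library. It is not the notebook's code, which relies on NLTK and reads the tweets from MongoDB. The sample tweets, the tiny stopword list, the token pattern and all function names are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample data; the notebook reads real tweets from MongoDB.
tweets = [
    {"user": "alice", "text": "Danke an alle für #aufschrei @bob"},
    {"user": "bob", "text": "RT danke an alle #aufschrei"},
    {"user": "alice", "text": "Noch ein Tweet zu #aufschrei und #sexismus"},
]

# Tiny illustrative stopword list, not the one used in the notebook.
STOPWORDS = {"an", "für", "zu", "und", "ein", "noch", "rt"}

# A word, optionally prefixed with @ (username) or # (hashtag).
TOKEN_RE = re.compile(r"[@#]?\w+")


def tokenize(text):
    """Lowercase the text and split it into words, @usernames and #hashtags."""
    return TOKEN_RE.findall(text.lower())


def content_words(tokens):
    """Drop usernames, hashtags and stopwords."""
    return [t for t in tokens if t[0] not in "@#" and t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


all_tokens = [tok for tw in tweets for tok in tokenize(tw["text"])]

word_freqs = Counter(content_words(all_tokens))
hashtags = Counter(t for t in all_tokens if t.startswith("#"))
usernames = Counter(t for t in all_tokens if t.startswith("@"))
# Bigrams are built per tweet so they never span two tweets.
bigrams = Counter(bg for tw in tweets for bg in ngrams(tokenize(tw["text"])))
user_activity = Counter(tw["user"] for tw in tweets)

print(word_freqs.most_common(3))
print(hashtags.most_common(3))
print(user_activity.most_common())
```

`Counter.most_common()` returns the entries sorted by count, which is enough for a simple ranking of words, hashtags or users. Concordances, collocations and dispersion plots are better left to NLTK.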
The output of some commands can be very long.
Enabling the notebook's Cell --> All Output --> Scroll Long setting makes it easier to read.
- Does applying machine learning algorithms aid the categorization of Tweets?
- Compute Tweet statistics (such as most retweeted and most favourited Tweets) and examine whether this kind of metadata can aid categorization
- Fancy graphs and timelines! (matplotlib or R/ggplot2)
- Steven Bird, Ewan Klein & Edward Loper: Natural Language Processing with Python. O'Reilly 2009
- Wes McKinney: Python for Data Analysis. O'Reilly 2013
- Matthew A. Russell: Mining the Social Web. O'Reilly 2014
- Axel Maireder, Stephan Schlögl: 24 hours of an #outcry: The networked publics of a socio-political debate. EJC 2014
- aufschreiStat - Aufschrei statistics in Java, which was never finished but served as an inspiration to this project
- aufschreib - A JavaScript webapp to classify #aufschrei Tweets (manually and through automatic categorization) and create statistics/timelines
- #aufschrei Timeline - Display of all #aufschrei Tweets
- 24 hours of an #outcry: The networked publics of a socio-political debate (EJC paper) - Analysis of the first 24 hours (timeline, content, user networks)