Data Science

Twitter Data Science Event

Maya Hari, Twitter Managing Director APAC

World’s largest collection of human thought

Indonesia’s flood relief effort empowered by geolocated tweets.

Half of the world’s millennials will be in Asia by 2020. Asia is a great canvas for solving and innovating on problems, making Asia an engine for growth for Twitter.

Transliteration and languages are the next frontier of opportunity.

Miguel Rios, Data Science lead at Twitter

Twitter #interactive visualizations The Data Science Venn Diagram — Drew Conway

Data Science Organization in Twitter

Foundational Data Science work
Growth Team
Habits Team
Health/Metrics Team

Product Data Science Lifecycle

Opportunity Sizing -> Testable Hypotheses -> Experiment & Design -> Instrumentation & Metrics -> Experiment Review -> Post-release Check-ins

Experiment review includes analyzing user behaviour, to come up with new ideas

Diana Macias, Client Engineering Manager

140 character limit story

2016: Timing wasn’t right for going beyond 140
Email sent by a Japanese engineer (Iku) previously iOS engineer, now on machine learning team:
- From personal research project, Japanese characters need 140 characters, but English users need 280-300 characters
- Japanese can contain about 1.9x more information than English in a character, visualized by separating tweets by language
- Japanese plain tweet length averages around 14 characters
- Tweet lengths fit a log-normal distribution: extending the log-normal distribution for english, 280 characters is the number found.
The 280 character project was born, customizing character limit based on the keyboard.
9% of tweets hit the 140 limit, 1% of the tweets hit the 280 limit
Post analysis of 280 project:
- Less abbreviations
- More kind words like “please”, “thank you”
- Same tweet length

Angad Singh, Data Science Team Lead, Twitter SG

Twitter SG Data Science Team

small team of 5 data scientists
focused on user behaviour and international growth
experimentation on Android/iOS/Web

3Vs of data at Twitter

Velocity: Rate at which data is created
- Hundreds of millions of Tweets are sent per day. TPS record: one-second peak of 143,199 Tweets per second
- Order of 100B interaction events per day
Volume: 100s of petabytes of data
Variety: Tweets, Users, LIkes, Retweets and many more

Production Systems

Batch (Hadoop, HDFS, MapReduce)
Real-time (Eventbus, Kafka Streams)

Analytics Tools

Batch: Scalding, Spark
Real-time: Heron
Lambda (Batch + Real-time): Summingbird, TSAR
Interactive: Presto, Vertica, R

Analytics Front-ends

Hadoop

Some of the largest Hadoop clusters in the world: > 10k nodes
Store 100s of peabytes of data
More than 100k daily jobs
twitter/ambrose for visualizing Hadoop jobs

Core Data Libraries

twitter/scalding: DSL on top of Cascading (Java library for MapReduce)
twitter/summingbird: Lambda architecture: real-time and batch

Interactive SQL

Interactive means that results of a query are available from the range of seconds to couple of minutes
SQL still lingua franca of ad-hoc data analysis:
- Presto
- HP Vertica
- Google BigQuery

Data Visualization:

Apache Zepplin
Tableu

Data Insights

Analytics - Basic Counting
A/B Testing
Data Science - Exploratory Analysis
Data Science - Machine Learning

Basic Counting

Daily/Monthly Acitve Users
Number of Tweets

Data Science - Custom Analytics

Cause of spikes and dips in main metrics

Machine Learning

Recommendations
- Users: Who to follow
- Tweets: Algorithmic timeline
Cortex, DL based on Torch framework (now Tensorflow)
- Identify NSFW images
- Recognize what is happening in live feeds

Ideal Talent Stack

Systems (Hadoop, Distributed Systems)
Programming (Scala, Scalding, SQL)
Math (Statistics, Linear Algebra)

Jethro's Braindump