Data Science
Twitter Data Science Event
Maya Hari, Twitter Managing Director APAC
World’s largest collection of human thought
Indonesia’s flood relief effort empowered by geolocated tweets.
Half of the world’s millennials will be in Asia by 2020. Asia is a great canvas for solving and innovating on problems, making Asia an engine for growth for Twitter.
Transliteration and languages are the next frontier of opportunity.
Miguel Rios, Data Science lead at Twitter
Twitter #interactive visualizations The Data Science Venn Diagram — Drew Conway
Data Science Organization in Twitter
- Foundational Data Science work
- Growth Team
- Habits Team
- Health/Metrics Team
Product Data Science Lifecycle
Opportunity Sizing -> Testable Hypotheses -> Experiment & Design -> Instrumentation & Metrics -> Experiment Review -> Post-release Check-ins
- Experiment review includes analyzing user behaviour, to come up with new ideas
Diana Macias, Client Engineering Manager
140 character limit story
- 2016: Timing wasn’t right for going beyond 140
- Email sent by a Japanese engineer (Iku) previously iOS engineer,
now on machine learning team:
- From personal research project, Japanese characters need 140 characters, but English users need 280-300 characters
- Japanese can contain about 1.9x more information than English in a character, visualized by separating tweets by language
- Japanese plain tweet length averages around 14 characters
- Tweet lengths fit a log-normal distribution: extending the log-normal distribution for english, 280 characters is the number found.
- The 280 character project was born, customizing character limit based on the keyboard.
- 9% of tweets hit the 140 limit, 1% of the tweets hit the 280 limit
- Post analysis of 280 project:
- Less abbreviations
- More kind words like “please”, “thank you”
- Same tweet length
Angad Singh, Data Science Team Lead, Twitter SG
Twitter SG Data Science Team
- small team of 5 data scientists
- focused on user behaviour and international growth
- experimentation on Android/iOS/Web
3Vs of data at Twitter
- Velocity: Rate at which data is created
- Hundreds of millions of Tweets are sent per day. TPS record: one-second peak of 143,199 Tweets per second
- Order of 100B interaction events per day
- Volume: 100s of petabytes of data
- Variety: Tweets, Users, LIkes, Retweets and many more
Production Systems
- Batch (Hadoop, HDFS, MapReduce)
- Real-time (Eventbus, Kafka Streams)
Analytics Tools
- Batch: Scalding, Spark
- Real-time: Heron
- Lambda (Batch + Real-time): Summingbird, TSAR
- Interactive: Presto, Vertica, R
Analytics Front-ends
Hadoop
- Some of the largest Hadoop clusters in the world: > 10k nodes
- Store 100s of peabytes of data
- More than 100k daily jobs
- twitter/ambrose for visualizing Hadoop jobs
Core Data Libraries
- twitter/scalding: DSL on top of Cascading (Java library for MapReduce)
- twitter/summingbird: Lambda architecture: real-time and batch
Interactive SQL
- Interactive means that results of a query are available from the range of seconds to couple of minutes
- SQL still lingua franca of ad-hoc data analysis:
- Presto
- HP Vertica
- Google BigQuery
Data Visualization:
- Apache Zepplin
- Tableu
Data Insights
- Analytics - Basic Counting
- A/B Testing
- Data Science - Exploratory Analysis
- Data Science - Machine Learning
Basic Counting
- Daily/Monthly Acitve Users
- Number of Tweets
Data Science - Custom Analytics
- Cause of spikes and dips in main metrics
Machine Learning
- Recommendations
- Users: Who to follow
- Tweets: Algorithmic timeline
- Cortex, DL based on Torch framework (now Tensorflow)
- Identify NSFW images
- Recognize what is happening in live feeds
Ideal Talent Stack
- Systems (Hadoop, Distributed Systems)
- Programming (Scala, Scalding, SQL)
- Math (Statistics, Linear Algebra)