Jethro's Braindump

Data Science

Twitter Data Science Event

Maya Hari, Twitter Managing Director APAC

World’s largest collection of human thought

Indonesia’s flood relief effort empowered by geolocated tweets.

Half of the world’s millennials will be in Asia by 2020. Asia is a great canvas for solving and innovating on problems, making Asia an engine for growth for Twitter.

Transliteration and languages are the next frontier of opportunity.

Miguel Rios, Data Science lead at Twitter

Twitter #interactive visualizations The Data Science Venn Diagram — Drew Conway

Data Science Organization in Twitter

  1. Foundational Data Science work
  2. Growth Team
  3. Habits Team
  4. Health/Metrics Team

Product Data Science Lifecycle

Opportunity Sizing -> Testable Hypotheses -> Experiment & Design -> Instrumentation & Metrics -> Experiment Review -> Post-release Check-ins

  • Experiment review includes analyzing user behaviour, to come up with new ideas

Diana Macias, Client Engineering Manager

140 character limit story

  • 2016: Timing wasn’t right for going beyond 140
  • Email sent by a Japanese engineer (Iku) previously iOS engineer, now on machine learning team:
    • From personal research project, Japanese characters need 140 characters, but English users need 280-300 characters
    • Japanese can contain about 1.9x more information than English in a character, visualized by separating tweets by language
    • Japanese plain tweet length averages around 14 characters
    • Tweet lengths fit a log-normal distribution: extending the log-normal distribution for english, 280 characters is the number found.
  • The 280 character project was born, customizing character limit based on the keyboard.
  • 9% of tweets hit the 140 limit, 1% of the tweets hit the 280 limit
  • Post analysis of 280 project:
    • Less abbreviations
    • More kind words like “please”, “thank you”
    • Same tweet length

Angad Singh, Data Science Team Lead, Twitter SG

Twitter SG Data Science Team

  • small team of 5 data scientists
  • focused on user behaviour and international growth
  • experimentation on Android/iOS/Web

3Vs of data at Twitter

  • Velocity: Rate at which data is created
    • Hundreds of millions of Tweets are sent per day. TPS record: one-second peak of 143,199 Tweets per second
    • Order of 100B interaction events per day
  • Volume: 100s of petabytes of data
  • Variety: Tweets, Users, LIkes, Retweets and many more

Production Systems

  • Batch (Hadoop, HDFS, MapReduce)
  • Real-time (Eventbus, Kafka Streams)

Analytics Tools

  • Batch: Scalding, Spark
  • Real-time: Heron
  • Lambda (Batch + Real-time): Summingbird, TSAR
  • Interactive: Presto, Vertica, R

Analytics Front-ends

Hadoop

  • Some of the largest Hadoop clusters in the world: > 10k nodes
  • Store 100s of peabytes of data
  • More than 100k daily jobs
  • twitter/ambrose for visualizing Hadoop jobs

Core Data Libraries

Interactive SQL

  • Interactive means that results of a query are available from the range of seconds to couple of minutes
  • SQL still lingua franca of ad-hoc data analysis:
    • Presto
    • HP Vertica
    • Google BigQuery

Data Visualization:

  • Apache Zepplin
  • Tableu

Data Insights

  • Analytics - Basic Counting
  • A/B Testing
  • Data Science - Exploratory Analysis
  • Data Science - Machine Learning

Basic Counting

  • Daily/Monthly Acitve Users
  • Number of Tweets

Data Science - Custom Analytics

  • Cause of spikes and dips in main metrics

Machine Learning

  • Recommendations
    • Users: Who to follow
    • Tweets: Algorithmic timeline
  • Cortex, DL based on Torch framework (now Tensorflow)
    • Identify NSFW images
    • Recognize what is happening in live feeds

Ideal Talent Stack

  • Systems (Hadoop, Distributed Systems)
  • Programming (Scala, Scalding, SQL)
  • Math (Statistics, Linear Algebra)