Joey Yoo
Data Science Projects + α

Sparkify Postgre ETL

Summary

Create a python pipeline that ingest json log files and load it onto the sparkify database in appropriate tables

SQL Advantages

  • Adheres to the ACID principle:

    • Atomicity: Each transaction is either complete or rolled back
    • Consistencty: Data written to a database must be valid accodring to all rules
    • Isolation: Concurrent transaction do not contend with one another and acts sequentially
    • Durability: Transaction commit is permanent
  • SQL database can scale vertically
  • Can use joins to aggregate multiple tables in a single query
  • Examples: MySQL, Microsoft SQL, PostgreSQL, Oracle

Sparkify Data

  • We apply the star schema in storing the sparkify data. Star schema splits the data into fact tables and dimension tables. Fact tables record measurements for a specific event that connects out to many dimension tables that records the specifics of each field of that events.
  • Star Schema is one of the simplest and widely used DB design. We can use single join to create the relationship between the fact table and any dimension table. However, the data is denormalized and might have high level of redundancy.
  • For sparkify, user playing a single song marks an event. This will form a fact table that records the information of user_id, timestamp of event, artists_id, and song_id.
  • The dimension table will be users, artists, time, songs and save the details of each category.

Result

The STAR schema is as below:

Midnight
Schema for database sparkifydb

Tools

  • Postgre SQL
  • Python (Psycopg2, Pandas, Numpy)

Link Github Repo