CSE 562

Database Systems

Oliver Kennedy

[email protected]

133A Ketter Hall

Why are databases awesome?

They're everywhere

They combine everything

Algorithms Systems Hardware Theory Databases

They're $$

Rank Organization Sales FY Market cap Headquarters
1 Microsoft $97.58 2015 $340.8 Redmond, WA, US
2 Oracle $38.8 2015 $187.6 Redwood City, CA, US
3 SAP $23.3 2015 $90.2 Walldorf, Germany
4 VMware $6 2015 $35.7 Palo Alto, CA, US
5 Symantec $6.6 2015 $16.3 Mountain View, CA, US
6 HCL Technologies $5.2 2015 $21.1 Noida, UP, India
7 Fiserv $5.1 2015 $19 Brookfield, WI, US
8 Intuit $4.6 2015 $27.1 Mountain View, CA, US
9 Amadeus IT Group $4.5 2013 $19.8 Madrid, Spain
10 CA Technologies $4.3 2015 $14.3 Islandia, NY, US
(https://en.wikipedia.org/wiki/List_of_the_largest_software_companies)

7 out of the largest 10 software companies ($698.8bn out of $771.9bn)
make their money in databases (and most of the others are data-adjacent)

Ok... so what's "databases"?

Databases

Answering questions about data

  • Ensuring Result Accuracy
  • Combining Multiple Sources
  • Making question-answering fast
  • Summarizing big data

Safely making changes to data

  • Preventing "buggy" data
  • Keeping copies consistent
  • Updating data in parallel

Techniques

  • Data Modeling
  • Cost-Based Optimization

Recipes

  • Join Algorithms
  • Index Data Structures

Knowledge

  • The Memory Hierarchy
  • Equivalence Rules

Which tools to use

... and when

This Course in Short

For X, the best, correct choice is Y, at least when Z

  • How do you define 'correct' and 'best'?
  • What correct alternatives are available?
  • How do you find the best available alternative?

What is 'Better'?

  • Declarative Queries: 'Easy to think about' vs 'Fast'
  • Data Layouts: Space vs Fast Updates vs Fast Queries
  • Parallel Updates: Reactive vs Proactive Concurrency

Today

  • Logistics: What you need to know
  • Project Outline: A relational query engine
  • Ways to Fail: What not to do and why
  • Intro: So what is a database anyway?
  • SQL: Recapping everyone's favorite language

People

Syllabus & Website

https://odin.cse.buffalo.edu/teaching/cse-562

Course Project

http://dubstep.odin.cse.buffalo.edu

Course Structure

  • Programming Assignment (50% of overall grade)
    3-person Groups Build a Relational Query Engine
  • Theoretical Content (50% of overall grade)
    • Midterm on March 4 (15% of overall grade)
    • Midterm on Apr 8 (15% of overall grade)
    • Comprehensive Final Exam (20% of overall grade)

Optional Textbook 1

Optional Textbook 2

HardcoverSoftcover
$160
$25
has Index, ToC no Index, ToC

Optional Textbook 3

Course Project

Data µBases

Embedded Databases

  • SQLite is in your browser, computer, phone, ...
  • Declarative data management is awesome!
  • Tech needs small data: Mobile, IoT, Web

Your team's goal:
build (part of) an embedded database

The Project

  • I give you data (CSV Files + A Schema)
  • I ask you questions about the data (SQL)
  • You give me an answer

Real world challenge: You start with...

  • ... an empty GIT repository
  • ... a few open-source libraries

Real world challenge: You get graded on...

... Correctness
Do you produce the right answer?
1/3 grade for computing the correct answer
... Speed
How fast did you produce the answer?
2/3 grade for meeting or beating the reference implementation

Submission

You write code You push to GIT Hit Submit DµBStep compiles π s Run You get emailed Try   again

Query Processor

SQL Query Parsed SQL Rel. Alg. Statistics Optimized RA Execution Plan Data Query Results JSQLParser Checkpoint 0

DµBStep

Checkpoint 0: "Hello World" due Feb 10
SELECT * FROM FOO
Checkpoint 1: "Expressions" due Mar 9
SELECT A*23 FROM FOO WHERE B > 10
Checkpoint 2: "Precomputation" due Apr 13
SELECT SUM(A) FROM FOO WHERE B > 10
Checkpoint 3: "Joins" due May 11
SELECT SUM(A) FROM FOO, BAR WHERE B > 10

Getting Started

A link to the submission system is on the syllabus.

There's a handy "create account" link.

Use your UBIT email address to create an account.

TODO By Feb 10

  1. Create a group of up to 3 people
  2. Register your group
  3. Access your group's GIT repository
  4. Download JSQLParser
  5. Look at the JSQLParser Javadoc
  6. Read the Checkpoint 0 Overview
  7. Create a SQL "Hello World" program
  8. Hit Submit

A SQL "Hello World"

  1. Read SELECT * FROM [tablename] from System.in
  2. Use JSQLParser to parse the query
    • You don't need to use JSQLParser... but it will make your life easier later.
  3. Get the Table Name
  4. Read in the file data/[tablename].csv
  5. Write the contents out to System.out

Submit as many times as you need to

Your grade will never ever decrease because you decided to submit just one more time

Ways to Fail

(please do not do these things)

  • Start your project the weekend before it's due.
  • Avoid your TAs and their office hours.
  • Stay quiet in class when your professor says something you don't understand.
  • Wait until the deadline to submit for the first time.
  • Cheat.

Academic Integrity

Cheating is submitting any work that you did not do yourself, as if you did

References (properly cited)
Wikipedia, Wikibooks: OK
Public Code
StackExchange, GitHub: NOT OK
Discussing concepts with classmates
"HashTables have O(1) lookups": OK (except exams)
Sharing code or answers with classmates
"Just look at how I did it": NOT OK

MOSS

MOSS

Academic Integrity

Zero Tolerance
If I catch you submitting someone else's code, you will fail the class.
Group Responsibility
If your teammate cheats on a group project, the entire group will be penalized.
Share Code, Share Blame
If someone else submits your code, you will be penalized as well.

Any Questions or Concerns?

What does a Database
Management System Do?

Data Management

Analysis: Answering user questions about a dataset
What kind of tools can we give users?
  • Declarative Languages
  • Organizational Datastructures (e.g., Indexes)
Transactions: Safely saving and sharing changing data
What kind of tools can we give users?
  • Consistency Primitives
  • Data Validation Primitives

Data

vs

Files

Databases

{
  "firstName": "John", "lastName": "Smith","age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York", "state": "NY",
    "postalCode": 10021
  },
  "phoneNumbers": [
    {   "type": "home",
        "number": "212 555-1234"
    },{ "type": "fax",
        "number": "646 555-4567"
    } 
  ]
}

Databases understand and exploit the data's structure

Types

Primitives
Integer, String, Floating Point Number, Date
Collections
List/Array, Set, Bag
Tuple
Struct, Dictionary/Object

Type Glossary

Primitive
Basic Building Blocks
Tuple (aka Record)
Several 'fields' of different types
(a schema defines names/types for each field)
Set
Collection of unique records all of the same type.
Bag
Unsorted collection of records of the same type.
List
Sorted collection of records of the same type.

Relational Database Glossary

Your data is currently an unsorted set
of 100-attribute Tuples.

Tomorrow, you will be repeatedly asked for the value of 1 specific attribute from 5 specific rows, each identified by a 1 specific attribute.