# Part 2: Zip Files
As you deal with bigger datasets, those datasets will often be
compressed. Compressed means that the format takes advantage of
patterns and redundancy in data to store a bigger file in less space.
For example, say you have a string like this: “HAHAHAHAHAHAHAHAHAHA”.
You can imagine inventing a notation for representing that string
with fewer characters (maybe something like “HA{x10}”).
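To make the idea concrete, here is a toy sketch of such a notation. The `toy_compress` function is a hypothetical helper invented for illustration only; real formats like zip use far more general techniques.

```python
def toy_compress(s):
    """Rewrite s as unit{xN} if it is one unit repeated N times (N >= 2)."""
    n = len(s)
    # Try the shortest candidate unit first.
    for size in range(1, n // 2 + 1):
        unit = s[:size]
        if n % size == 0 and unit * (n // size) == s:
            return f"{unit}{{x{n // size}}}"
    return s  # no repetition found, so leave the string alone

original = "HA" * 10
short = toy_compress(original)
print(short, len(original), len(short))
```

The repeated string shrinks from 20 characters to 7, while a string with no repetition comes back unchanged.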
Zip is one common compression format. In addition to compressing
files, .zips often bundle multiple files together. In the past, you
would have run `unzip` in the terminal before starting to write your
code. However, it is also possible to directly read the contents of a
`.zip` file in Python. Doing so is often more convenient, and the code
may also be faster.
## Generating a .zip
To create an `example.zip` file, run the following (don’t worry,
understanding this particular snippet isn’t expected for this lab):
```python
import pandas as pd
from zipfile import ZipFile, ZIP_DEFLATED
from io import TextIOWrapper

with open("hello.txt", "w") as f:
    f.write("hello world")

with ZipFile("example.zip", "w", compression=ZIP_DEFLATED) as zf:
    with zf.open("hello.txt", "w") as f:
        f.write(bytes("hello world", "utf-8"))
    with zf.open("ha.txt", "w") as f:
        f.write(bytes("ha"*10000, "utf-8"))
    with zf.open("bugs.csv", "w") as f:
        pd.DataFrame([["Mon",7], ["Tue",4], ["Wed",3], ["Thu",6], ["Fri",9]],
                     columns=["day", "bugs"]).to_csv(TextIOWrapper(f), index=False)
```
## ZipFile
We can access the file by using the `ZipFile` type, imported from the `zipfile` module:
```python
from zipfile import ZipFile
```
ZipFiles are context managers, much like file objects. Let’s try
creating one using `with`, then loop over info about the files inside
using [this
method](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.infolist):
```python
with ZipFile("example.zip") as zf:
    for info in zf.infolist():
        print(info)
```
Let’s print off the size and compression ratio (uncompressed size divided by compressed size) of each file:
```python
with ZipFile("example.zip") as zf:
    for info in zf.infolist():
        orig_mb = info.file_size / (1024**2) # there are 1024**2 bytes in a MB
        ratio = info.file_size / info.compress_size
        s = "file {name:s}, {mb:.3f} MB (uncompressed), {ratio:.1f} compression ratio"
        print(s.format(name=info.filename, mb=orig_mb, ratio=ratio))
```
Take a minute to look through the output. Which file is largest? What
is its compression ratio?
The compression ratio is the original size divided by the compressed
size, so bigger means more savings. `ha.txt` contains “hahahahaha…”
(repeated 10 thousand times), which is highly compressible.
As practice, compute the overall compression ratio (sum of all
uncompressed sizes divided by sum of all compressed sizes). It ought
to be about 216.
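If you get stuck, one possible approach is sketched here. The `overall_ratio` name is a hypothetical helper, not part of `zipfile`, and the sketch assumes the archive contains at least one non-empty file (otherwise the division fails).

```python
import os
from zipfile import ZipFile

def overall_ratio(zip_source):
    """Sum of uncompressed sizes divided by sum of compressed sizes."""
    with ZipFile(zip_source) as zf:
        infos = zf.infolist()
        total_orig = sum(info.file_size for info in infos)
        total_comp = sum(info.compress_size for info in infos)
    return total_orig / total_comp

# Only runs if example.zip from the earlier step is present.
if os.path.exists("example.zip"):
    print(round(overall_ratio("example.zip"), 1))
```

Note that this is not the average of the per-file ratios: large files such as `ha.txt` dominate both sums.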
## Binary Open
Ok, forget zips for a minute, and run the following:
```python
with open("hello.txt", "r") as f:
    data1 = f.read()
with open("hello.txt", "rb") as f:
    data2 = f.read()
print(type(data1), type(data2))
```
What type does `f.read()` return if we use “r” for the mode? What
about “rb”?
The “b” stands for “binary” or “bytes”, so we get back type `bytes`.
If we open in text mode (the default), as in the first open, the bytes
automatically get decoded into a string, using some encoding (like
“utf-8”) that maps characters to the numbers stored in the bytes.
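As a small illustration of that translation, the sketch below converts between `str` and `bytes` by hand with the built-in `encode` and `decode` methods. The sample text is made up for the example.

```python
text = "héllo"
raw = text.encode("utf-8")   # str -> bytes
print(raw)                   # b'h\xc3\xa9llo'
print(len(text), len(raw))   # 5 6: "é" takes two bytes in UTF-8
back = raw.decode("utf-8")   # bytes -> str
print(back == text)          # True
```

Text mode does the `decode` step for you each time you read.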
Run this:
```python
from io import TextIOWrapper
```
`TextIOWrapper` objects “wrap” binary file objects and convert bytes
to characters on the fly. For example, try the following:
```python
with open("hello.txt", "rb") as f:
    tio = TextIOWrapper(f)
    data3 = tio.read()
    print(type(data3))
```
Even though we open in binary mode, we get a string thanks to
`TextIOWrapper`! You can think of the example where we read into
`data1` as a shorthand for what we did to get `data3`.
## Reading Files
A ZipFile has a method named `open` that works a lot like the `open`
function you’re familiar with. A ZipFile is a context manager, and so
is the object returned by `ZipFile.open(…)`, so we’ll end up with
nested `with` statements to make sure everything gets closed up
properly. Let’s take a look at the compressed `hello.txt` file:
```python
with ZipFile("example.zip") as zf:
    with zf.open("hello.txt", "r") as f:
        print(f.read())
```
Woah, why do we get `b'hello world'`? For regular files, “r” mode
defaults to reading text, but for files inside a zip, it defaults to
binary mode, so we got back bytes.
TextIOWrapper saves the day:
```python
with ZipFile("example.zip") as zf:
    with zf.open("hello.txt", "r") as f:
        tio = TextIOWrapper(f)
        print(tio.read())
```
With regular files, TextIOWrapper is a bit useless (why not just open
with “r” instead of “rb”?), but for zips, it is crucial.
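The same wrapper also lets you loop over a zipped text file line by line. The sketch below is self-contained: it builds a small archive in memory first, and the `notes.txt` name and its contents are invented for the example. Passing `encoding="utf-8"` is an assumption about how the file was written.

```python
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

# Build a tiny zip in memory so the sketch runs on its own.
buf = BytesIO()
with ZipFile(buf, "w") as zf:
    zf.writestr("notes.txt", "first line\nsecond line\n")

lines = []
with ZipFile(buf) as zf:
    with zf.open("notes.txt") as f:
        # Naming the encoding avoids depending on the platform default.
        for line in TextIOWrapper(f, encoding="utf-8"):
            lines.append(line.rstrip("\n"))
print(lines)
```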
## Pandas
Pandas can read a DataFrame even from a binary stream. So you can do this:
```python
with ZipFile("example.zip") as zf:
    with zf.open("bugs.csv") as f:
        df = pd.read_csv(f)
df
```
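For comparison, here is a sketch of reading a zipped CSV without pandas, using the standard library `csv` module. Unlike `pd.read_csv`, `csv.DictReader` needs text, so the `TextIOWrapper` is back. The archive is built in memory with two rows borrowed from `bugs.csv` so the example runs on its own.

```python
import csv
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

# Build a tiny zip in memory so the sketch runs on its own.
buf = BytesIO()
with ZipFile(buf, "w") as zf:
    zf.writestr("bugs.csv", "day,bugs\nMon,7\nTue,4\n")

with ZipFile(buf) as zf:
    with zf.open("bugs.csv") as f:
        # newline="" lets the csv module handle line endings itself.
        reader = csv.DictReader(TextIOWrapper(f, encoding="utf-8", newline=""))
        rows = list(reader)
print(rows)
```

Every value comes back as a string here, whereas pandas infers numeric column types for you.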