Handle various file types in Python for Data Science | Pandas
If you have been using pandas for a while, then you might already know how to import simple CSV files. But in reality, data is stored in many different file formats. What would you do if your side project required you to access, say, MATLAB, JSON, Stata or even HDF5 files in Python?
This article will show you how easy it is to import those files using pandas.
Let’s start off with
Pickle files
pickle is a file format native to Python, which means the data cannot easily be used from other programming languages.
The idea behind pickle is to serialize a given Python object into a byte stream and store it in a file, conventionally with a ‘.pickle’ or ‘.pkl’ extension.
To retrieve the contents from the file, you have to unpickle (deserialize) the pickled file.
PS: Everything (lists, dictionaries, strings, …) in Python is an object.
First import the libraries (pickle comes built in, so there is nothing to install; pandas is imported here as well since it is used below)
import pickle
import pandas as pd
Then create some dummy data.
Both ‘l’ and ‘d’ contain the same data but with different datatypes
# nested list
l = [['col_1', 'col_2', 'col_3'],
     [27, 34, 'zahash'],
     [28, 30, 'is'],
     [30, 33, 'an'],
     [12, 23, 'awesome'],
     [20, 27, 'programmer']]

# dictionary
d = {
    'col_1' : [27, 28, 30, 12, 20],
    'col_2' : [34, 30, 33, 23, 27],
    'col_3' : ['zahash', 'is', 'an', 'awesome', 'programmer']
}
Writing pickle file:
# 'wb' means 'write binary'
# 'pickleFile_L.pickle' will be created if it doesn't exist
# 'dump' the data into the file
with open('pickleFile_L.pickle', 'wb') as pickFile_l:
    pickle.dump(l, pickFile_l)

with open('pickleFile_D.pickle', 'wb') as pickFile_d:
    pickle.dump(d, pickFile_d)
Reading pickle file:
# 'rb' means 'read binary'
# 'load' the contents of the pickle file into a variable
with open('pickleFile_L.pickle', 'rb') as readFile_l:
    data_l = pickle.load(readFile_l)

with open('pickleFile_D.pickle', 'rb') as readFile_d:
    data_d = pickle.load(readFile_d)
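Since pickle serializes to a byte stream, you don’t even need a file: pickle.dumps and pickle.loads do the same round trip in memory. A minimal sketch (the dictionary here is just sample data):

```python
import pickle

d = {'col_1': [27, 28, 30, 12, 20]}

# 'dumps' serializes the object to a bytes object instead of a file
blob = pickle.dumps(d)

# 'loads' deserializes the byte stream back into an equal object
restored = pickle.loads(blob)
```

This is handy when you want to send a Python object over a socket or store it in a database column rather than on disk.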
You can store multiple Python objects in a single pickle file by calling dump repeatedly, but that is not recommended; wrap them in a single list or dictionary instead.
After fetching the data, you can make a DataFrame out of it using pandas (note that for the nested list, the first row is the header, so it is passed as column names and excluded from the data)
df1 = pd.DataFrame(d)
df2 = pd.DataFrame(l[1:], columns = l[0])
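If what you are pickling is already a DataFrame, pandas wraps pickle for you, so you can skip the pickle module entirely. A minimal sketch (the file name is just an example):

```python
import pandas as pd

df = pd.DataFrame({'col_1': [27, 28], 'col_3': ['zahash', 'is']})

# write the DataFrame itself as a pickle file
df.to_pickle('frame.pickle')

# read it back; this round-trips dtypes and the index exactly
restored = pd.read_pickle('frame.pickle')
```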
Moving on to …
Excel files
You first have to install a few packages to read and write Excel files with pandas
pip3 install xlrd
pip3 install xlwt
pip3 install openpyxl
Reading Excel files:
sales_data = pd.read_excel('SampleData.xlsx', sheet_name = 'sales')
churn_data = pd.read_excel('SampleData.xlsx', sheet_name = 'churns')
Writing DataFrames to Excel files:
df.to_excel('new file.xlsx', sheet_name='d')
If you don’t want to include the index, you can specify the parameter index = False
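If you want to write several DataFrames into one workbook (like the ‘sales’ and ‘churns’ sheets read above), pandas provides ExcelWriter. A minimal sketch; the file name, sheet names and sample data are all illustrative:

```python
import pandas as pd

sales = pd.DataFrame({'month': ['Jan', 'Feb'], 'units': [10, 12]})
churn = pd.DataFrame({'month': ['Jan', 'Feb'], 'lost': [1, 0]})

# one writer, many sheets; index=False drops the row numbers
with pd.ExcelWriter('report.xlsx') as writer:
    sales.to_excel(writer, sheet_name='sales', index=False)
    churn.to_excel(writer, sheet_name='churns', index=False)
```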
SAS files
It’s easy to read SAS files but, unfortunately, pandas cannot write them. You can instead save the DataFrame as .csv and use the SAS software to convert it.
You can import both (.sas7bdat) and (.xport) files
Reading SAS files:
sas_df = pd.read_sas('airline.sas7bdat')
sas_df.head()
Output:
YEAR Y W R L K
0 1948.0 1.214 0.243 0.1454 1.415 0.612
1 1949.0 1.354 0.260 0.2181 1.384 0.559
2 1950.0 1.569 0.278 0.3157 1.388 0.573
3 1951.0 1.948 0.297 0.3940 1.550 0.564
4 1952.0 2.265 0.310 0.3559 1.802 0.574
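Since pandas can’t write SAS files, the usual workaround mentioned above is to export the DataFrame to CSV. A minimal sketch; the small DataFrame stands in for the one returned by pd.read_sas, and the file name is illustrative:

```python
import pandas as pd

# stand-in for the DataFrame returned by pd.read_sas
sas_df = pd.DataFrame({'YEAR': [1948.0, 1949.0], 'Y': [1.214, 1.354]})

# index=False keeps the row numbers out of the CSV
sas_df.to_csv('airline.csv', index=False)
```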
Stata files
Pandas also has support for reading and writing Stata (.dta) files
Reading Stata files:
stata_df = pd.read_stata('airline.dta')
stata_df.head()
Output:
YEAR Y W R L K
0 1948.0 1.214 0.243 0.1454 1.415 0.612
1 1949.0 1.354 0.260 0.2181 1.384 0.559
2 1950.0 1.569 0.278 0.3157 1.388 0.573
3 1951.0 1.948 0.297 0.3940 1.550 0.564
4 1952.0 2.265 0.310 0.3559 1.802 0.574
Writing Stata files:
df.to_stata('some_file_name.dta')
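As with Excel, to_stata writes the pandas index into the file by default; pass write_index=False to leave it out. A minimal round-trip sketch with illustrative data and file name:

```python
import pandas as pd

df = pd.DataFrame({'YEAR': [1948.0, 1949.0], 'Y': [1.214, 1.354]})

# write_index=False keeps pandas' row index out of the .dta file
df.to_stata('airline_copy.dta', write_index=False)

back = pd.read_stata('airline_copy.dta')
```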
HDF5 files
Hierarchical Data Format (HDF) is an open-source file format for storing huge amounts of numerical data, typically used in research.
Recently, LIGO used the HDF5 file format to store terabytes of gravitational-wave data. This shows how robust the file format is!
HDF5 files have a structure similar to nested dictionaries so working with them is quite easy.
You will need to install the h5py library first
pip3 install h5py
Reading HDF5 file:
import h5py

with h5py.File('alloc.h5', 'r') as file_obj:
    print('keys : {}'.format(file_obj.keys()))

    data1 = file_obj['DS1']
    print('dataset(DS1) type : {}'.format(type(data1)))
    print('\n printing the first dataset... \n')
    for ele in data1:
        print(ele)

    print('\n')

    data2 = file_obj['DS2']
    print('dataset(DS2) type : {}'.format(type(data2)))
    print('\n printing the second dataset... \n')
    for ele in data2:
        print(ele)
Output:
keys : <KeysViewHDF5 ['DS1', 'DS2']>
dataset(DS1) type : <class 'h5py._hl.dataset.Dataset'>

 printing the first dataset...

[ 0 -1 -2 -3 -4 -5 -6]
[0 0 0 0 0 0 0]
[0 1 2 3 4 5 6]
[ 0 2 4 6 8 10 12]

dataset(DS2) type : <class 'h5py._hl.dataset.Dataset'>

 printing the second dataset...

[ 0 -1 -2 -3 -4 -5 -6]
[0 0 0 0 0 0 0]
[0 1 2 3 4 5 6]
[ 0 2 4 6 8 10 12]
This shows what the file actually contains: two datasets, ‘DS1’ and ‘DS2’, each a 4×7 array.
This is a relatively simple file. Most HDF5 files are deeply nested.
Writing an HDF5 file:
To keep everything simple and clear, let’s write the same data that we read above.
dataset1 = [[0, -1, -2, -3, -4, -5, -6],
            [0, 0, 0, 0, 0, 0, 0],
            [0, 1, 2, 3, 4, 5, 6],
            [0, 2, 4, 6, 8, 10, 12]]

dataset2 = [[0, -1, -2, -3, -4, -5, -6],
            [0, 0, 0, 0, 0, 0, 0],
            [0, 1, 2, 3, 4, 5, 6],
            [0, 2, 4, 6, 8, 10, 12]]

with h5py.File('some_file_name.h5', 'w') as file_obj:
    file_obj.create_dataset('DS1', data=dataset1)
    file_obj.create_dataset('DS2', data=dataset2)
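pandas also has its own HDF5 interface, built on the PyTables library (pip3 install tables), which lets you store whole DataFrames without touching h5py. A minimal sketch; the file name and key are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [27, 28, 30], 'col_2': [34, 30, 33]})

# 'key' names the group the DataFrame is stored under inside the file
df.to_hdf('frames.h5', key='sales')

# read it back by the same key
restored = pd.read_hdf('frames.h5', key='sales')
```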
JSON files
JSON is one of the most common file formats you will encounter both as a Data Scientist and as a Web Developer.
Luckily, Python has a built-in json library that you can just import
import json
Reading JSON file:
Let’s say you have a JSON file named ‘sample.json’. To read it:
with open('sample.json', 'r') as file_obj:
data = json.load(file_obj)
The data will be stored in the form of a Python dictionary
Writing JSON file:
Use the argument indent = 4
to avoid writing all the data on a single line in the file. A single-line file with a lot of data is painful to open, because text editors have to read the entire contents at once.
sort_keys = True
will sort the keys in alphabetical order
data = {'a': [1, 2, 3], 'b': ['lol', 'what', 'if']}

with open("some_file_name.json", "w") as file_obj:
    json.dump(data, file_obj, indent=4, sort_keys=True)
The output file will look something like this
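One way to check the formatting without opening the file is json.dumps, which takes the same indent and sort_keys arguments but returns a string:

```python
import json

data = {'b': ['lol', 'what', 'if'], 'a': [1, 2, 3]}

# sort_keys=True puts 'a' before 'b'; indent=4 gives each element its own line
text = json.dumps(data, indent=4, sort_keys=True)
print(text)
```

The printed string matches exactly what json.dump writes to the file.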
You can also make a DataFrame out of a JSON file by first reading it into a Python dictionary and then running pd.DataFrame(data)
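pandas can also parse JSON directly with pd.read_json; for a column-oriented file like the dictionary above (top-level keys mapping to lists), each key becomes a column. A minimal sketch with an illustrative file name:

```python
import json
import pandas as pd

data = {'a': [1, 2, 3], 'b': ['lol', 'what', 'if']}
with open('sample_cols.json', 'w') as f:
    json.dump(data, f)

# each top-level key becomes a DataFrame column
df = pd.read_json('sample_cols.json')
```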